RAG has become the default reach in any "we want AI to know our stuff" conversation. It is also, in our experience, the wrong reach in roughly half the cases we are called into. The fix is rarely a better embedding. It is reaching for a different tool.
This is a short note on the four times we have stopped using RAG, and what we use instead.
01. When the answer is a number, not a paragraph
If the question the agent needs to answer is fundamentally a SQL query — how many of these did we ship last month, and what was the average margin? — RAG is the wrong shape. You are not retrieving prose; you are aggregating structured data. Putting a vector store in front of this is, almost always, a long road to a worse version of a SELECT statement.
The right tool is text-to-SQL, against the data warehouse you already have. Modern frontier models do this well enough that the surrounding work is the schema description, the access controls, and the eval set. The retrieval is happening — but it is happening against tables, not chunks.
02. When the corpus is small and stable
If the entire knowledge base fits comfortably in the model's context window — say, a hundred-page operating policy that does not change weekly — the answer is to not retrieve at all. Put the document in the prompt. Cache it. Done.
This sounds too simple to be a real recommendation. It is the recommendation. Vector stores have a fixed cost — operationally, in inference, in the engineering you have to do to keep them fresh — that you do not pay if you can simply hand the model the whole document. We have replaced more than one production RAG system with three lines of prompt-caching code.
Half of "let's add RAG" projects should have been "let's just put the document in the context window and cache it."
— working note, internal review
03. When the question requires reasoning, not lookup
If the agent's job is to think about something — to compare, to plan, to compose — adding retrieval often makes the answer worse. The model spends part of its budget summarising the chunks instead of doing the reasoning, and the chunks themselves are often beside the point.
The right move is to keep the retrieval scope tight — only the documents that really anchor the question — and to leave the model room to reason. Sometimes the right answer is to retrieve nothing and rely on the model's own knowledge, especially for questions about widely understood concepts. RAG is for grounding. Reasoning needs space, not more grounding.
04. When the data changes faster than the index
Retrieval works when the index is, broadly, current. When the underlying data changes faster than your indexing pipeline can keep up — pricing, inventory, real-time customer state — RAG produces confidently stale answers. The model has no way to know the chunk it just retrieved is two hours old.
The right move here is function calling: give the model a tool that queries the source of truth at request time. The agent asks the API, gets the current state, and answers. The boundary moves from "retrieve, then generate" to "look up, then generate," and the answers are correct in a way that an index could never make them.
When RAG is right
To be balanced: RAG is the correct answer when the corpus is large, the data is mostly textual, the questions are open-ended, and the answer is expected to cite or quote source material. Knowledge assistants over a documentation corpus, support agents over a help center, internal Q&A across years of policies — these are the canonical fits, and they are real.
The mistake is not using RAG. The mistake is reaching for it before asking which of the four conditions above might apply.
Decision rule: before adding a vector store, answer four questions. Is the answer a number? Does the corpus fit in context? Does the task require reasoning? Does the data change faster than the index? Any "yes" should make you reach for a different tool first.
A short closing
RAG is a useful technique. It is not the answer to "we want AI to know our stuff." It is the answer to one specific shape of that question — and there are at least three other shapes, each with a better-fit tool. Pick deliberately. The agent that results will be cheaper to run, faster to answer, and easier to defend when the question gets harder.
RAG je postao podrazumevani potez u svakom razgovoru u kome se kaže „želimo da AI poznaje naše stvari". Ali je, po našem iskustvu, pogrešan potez u otprilike polovini slučajeva u koje nas zovu. Rešenje retko je bolji embedding. Posezanje za drugim alatom jeste.
Ovo je kratka beleška o četiri situacije u kojima smo prestali da koristimo RAG, i šta koristimo umesto toga.
01. Kada je odgovor broj, ne pasus
Ako je pitanje na koje agent treba da odgovori suštinski SQL upit — koliko smo ovih isporučili prošlog meseca, i kakva je bila prosečna marža? — RAG je pogrešan oblik. Ne preuzimate prozu; agregirate strukturirane podatke. Stavljanje vektorske baze ispred ovoga je, skoro uvek, dug put do gore verzije SELECT izjave.
Pravi alat je text-to-SQL, na vašem postojećem skladištu podataka. Moderni frontijer modeli to rade dovoljno dobro da je okolni posao opis šeme, kontrola pristupa i evaluacioni skup. Preuzimanje se dešava — ali se dešava nad tabelama, ne nad delovima.
02. Kada je korpus mali i stabilan
Ako se cela baza znanja udobno smesti u kontekst modela — recimo, sto strana operativne politike koja se ne menja sedmično — odgovor je da se uopšte ne preuzima. Stavite dokument u prompt. Kešujte ga. Gotovo.
Ovo zvuči previše jednostavno da bi bila prava preporuka. To je prava preporuka. Vektorske baze imaju fiksni trošak — operativno, u inferenciji, u inženjeringu koji morate da uradite da bi bile sveže — koji ne plaćate ako modelu jednostavno možete predati ceo dokument. Zamenili smo više od jednog produkcionog RAG sistema sa tri linije koda za keširanje prompta.
Polovina projekata „dodajmo RAG" trebalo je da budu „samo stavimo dokument u kontekst i kešujmo ga".
— radna beleška, interni pregled
03. Kada pitanje zahteva zaključivanje, ne pretragu
Ako je posao agenta da misli o nečemu — da poredi, planira, sastavlja — dodavanje preuzimanja često čini odgovor lošijim. Model troši deo svog budžeta na sumiranje preuzetih delova umesto na zaključivanje, a sami delovi često nisu ni od značaja.
Pravi potez je da držite obim preuzimanja uskim — samo dokumenti koji zaista usidruju pitanje — i da ostavite modelu prostora da zaključuje. Ponekad je pravi odgovor da se ne preuzima ništa i da se oslonite na sopstveno znanje modela, posebno za pitanja o širokim, opštepoznatim pojmovima. RAG je za uzemljenje. Zaključivanje traži prostor, ne više uzemljenja.
04. Kada se podaci menjaju brže nego indeks
Preuzimanje radi kada je indeks, uglavnom, ažuran. Kada se osnovni podaci menjaju brže nego što vaš indeksacioni pipeline može da prati — cene, zalihe, stanje klijenta u realnom vremenu — RAG proizvodi samouvereno zastarele odgovore. Model nema načina da zna da je deo koji je upravo preuzeo star dva sata.
Pravi potez ovde je function calling: dajte modelu alat koji upituje izvor istine u trenutku zahteva. Agent pita API, dobija trenutno stanje, i odgovara. Granica se pomera sa „preuzmi, pa generiši" na „pretraži, pa generiši", i odgovori su tačni na način na koji indeks ne bi mogao da ih učini.
Kada jeste RAG pravi izbor
Da budemo uravnoteženi: RAG je tačan odgovor kada je korpus velik, podaci su uglavnom tekstualni, pitanja su otvorena, i odgovor treba da citira ili navodi izvorni materijal. Pomoćnici za znanje nad korpusom dokumentacije, agenti podrške nad bazom pomoći, interno pitanje-i-odgovor preko godina politika — to su kanonski slučajevi, i stvarni su.
Greška nije korišćenje RAG-a. Greška je posezanje za njim pre nego što se zapita koja od četiri prethodna uslova može da važi.
Pravilo odluke: pre nego što dodate vektorsku bazu, odgovorite na četiri pitanja. Da li je odgovor broj? Da li korpus staje u kontekst? Da li zadatak zahteva zaključivanje? Da li se podaci menjaju brže od indeksa? Bilo koje „da" treba da vas natera da prvo posegnete za drugim alatom.
Kratko zatvaranje
RAG je korisna tehnika. Nije odgovor na „želimo da AI poznaje naše stvari". Odgovor je na jedan specifičan oblik tog pitanja — a postoje bar tri druga oblika, svaki sa boljim alatom. Birajte promišljeno. Agent koji nastane biće jeftiniji za pokretanje, brži za odgovor, i lakše branjiv kada pitanje postane teže.
Filed under: RAG · METHOD First published: Nov 14, 2025