The first time you wire a vector database to a language model and watch it answer a question with text from your own documents, something that looks like magic happens. The second time, it gets the answer subtly wrong. The third time, you can't tell whether it's right or not — and neither can the model.
This is the part of retrieval-augmented generation nobody tells you about up front. It is not a smarter search engine. It is a system that makes confident statements based on whatever you happened to retrieve — including when what you retrieved was beside the point.
01. Search returns links. Retrieval returns answers.
A web search engine ranks documents and lets you decide which one to read. The ambiguity stays with you, the human. You scroll, you skim, you back out, you try a different query.
A retrieval system pulls a fixed number of chunks from your corpus and hands them to the model as if they were the truth. The model then writes a fluent paragraph as if those chunks were the only material that existed. There is no scrolling, no skimming, no back-button. Whatever you retrieved is the world.
This is the silent failure mode. If retrieval returns the wrong four paragraphs, the model writes a confident, well-cited answer that is wrong in a way that reads exactly like an answer that is right.
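To make the mechanism concrete, here is a minimal sketch of top-k retrieval. Everything in it is an assumption for illustration: the bag-of-words `embed` function stands in for a real embedding model, and the three-sentence corpus, the `retrieve` and `build_prompt` helpers, and the prompt wording are made up. The point is only that a fixed `k` returns `k` chunks whether or not any of them are on topic.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, corpus: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Return the k highest-scoring chunks, however low those scores are."""
    q = embed(query)
    scored = [(cosine(q, embed(chunk)), chunk) for chunk in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def build_prompt(query: str, hits: list[tuple[float, str]]) -> str:
    """Hand the retrieved chunks to the model as its entire world."""
    context = "\n".join(chunk for _, chunk in hits)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical three-chunk corpus.
corpus = [
    "Refunds are issued within 14 days of a return.",
    "The office is closed on public holidays.",
    "Expense reports are due on the last Friday of the month.",
]

on_topic = retrieve("When are refunds issued?", corpus)
# Nothing in the corpus covers parental leave, yet two chunks still come back.
off_topic = retrieve("What is our parental leave policy?", corpus)
prompt = build_prompt("What is our parental leave policy?", off_topic)
```

The off-topic prompt looks structurally identical to the on-topic one. Unless the scores are logged or thresholded, nothing downstream can tell the difference.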
In search, ambiguity is shared between you and the index. In retrieval, the model resolves it for you — silently, and with confidence.
— working note, internal
02. Three places where it goes quiet
In our work with operators rolling out their first agent, the same three failures show up almost every time:
The corpus is contradictory. Two policies, both current, both retrieved, both treated as ground truth. The model splits the difference. Nobody notices.
The chunks are too small. A clause that depended on the sentence above it gets retrieved alone. The answer is technically grounded — and meaningfully wrong.
The query is the wrong shape. A user asks "how do we handle this?" but the document indexed under that phrase describes how it used to be handled, not how it is now.
None of these are model problems. None of them are solved by upgrading to a better embedding. They are content problems, masquerading as ML problems.
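The second failure, chunks that are too small, is the easiest of the three to illustrate. The sketch below is one possible mitigation, not a recommendation from the working notes: each chunk carries the sentence before it, so a dependent clause is never retrieved alone. The naive sentence splitter, the `chunk_with_context` helper, and the sample policy text are all assumptions made for the example.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_with_context(text: str, before: int = 1) -> list[str]:
    """One chunk per sentence, prefixed with up to `before` preceding sentences."""
    sentences = split_sentences(text)
    chunks = []
    for i in range(len(sentences)):
        start = max(0, i - before)
        chunks.append(" ".join(sentences[start:i + 1]))
    return chunks


# Made-up policy text: the third sentence is misleading without the second.
policy = (
    "Contractors are not eligible for travel reimbursement. "
    "Exceptions require written approval from a director. "
    "Approved exceptions are capped at 500 EUR."
)

chunks = chunk_with_context(policy)
```

Here the last chunk pairs the approval requirement with the cap, so the clause about the cap can no longer be retrieved on its own. The trade-off is a larger index and some repeated text across chunks.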
03. What to do instead of "switch to a better model"
When a RAG system disappoints, the operator instinct is to swap the LLM. Almost always wrong. The shortest path back to a working system is, in this order:
Look at what was retrieved, not what was generated. Pull twenty real questions, log the chunks, read them. You will know within an hour whether retrieval is working.
Trim the corpus before you tune the embeddings. Decommissioned policies, draft documents, duplicates of duplicates: in our experience, something like 30% of a typical knowledge base is noise by volume. Cutting that noise usually does more than upgrading models.
Rewrite the questions before rewriting the prompts. If users ask one thing and the corpus is indexed under another, the embedding can't bridge that gap. A glossary or a query-rewrite step often does more than a vector-store change.
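The second and third steps above can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: the `status` field on each document, the exact-duplicate check, the `trim_corpus` and `rewrite_query` helpers, and the glossary entries are all hypothetical. Real corpora usually need near-duplicate detection and a glossary built from actual user queries.

```python
import hashlib
import re


def trim_corpus(docs: list[dict]) -> list[dict]:
    """Keep only current documents, dropping exact duplicates
    (compared after lowercasing and collapsing whitespace)."""
    seen = set()
    kept = []
    for doc in docs:
        if doc["status"] != "current":
            continue  # retired policies and drafts never reach the index
        normalized = " ".join(doc["text"].lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept


def rewrite_query(query: str, glossary: dict[str, str]) -> str:
    """Replace user vocabulary with the corpus's own terms
    (whole words only, case-insensitive)."""
    for user_term, corpus_term in glossary.items():
        pattern = r"\b" + re.escape(user_term) + r"\b"
        query = re.sub(pattern, lambda _m: corpus_term, query, flags=re.IGNORECASE)
    return query


# Hypothetical documents and glossary.
docs = [
    {"id": "pto-2024", "status": "current", "text": "Time off requests go through the HR portal."},
    {"id": "pto-2024-copy", "status": "current", "text": "Time off  requests go through the HR portal."},
    {"id": "pto-2019", "status": "retired", "text": "Email your manager to request time off."},
    {"id": "pto-draft", "status": "draft", "text": "Proposed: time off requests via chat bot."},
]
glossary = {"PTO": "time off", "vacation": "time off"}

kept_ids = [d["id"] for d in trim_corpus(docs)]
rewritten = rewrite_query("How do I book PTO?", glossary)
```

Both functions run before any embedding is computed, which is the point: they change what the retriever can find without touching the model or the vector store.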
Rule of thumb: if you can't show a colleague the four chunks the agent saw before it answered, you don't have a RAG system. You have a black box that happens to use embeddings.
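One way to make that rule of thumb checkable is to write a trace record for every answer. The sketch below assumes a JSON Lines log and a made-up record shape (`RetrievedChunk`, `trace_record`, `append_trace`, and the sample question and chunks are all hypothetical). It is not tied to any particular vector store or framework.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RetrievedChunk:
    doc_id: str
    score: float
    text: str


def trace_record(question: str, chunks: list[RetrievedChunk], answer: str) -> str:
    """Serialize the question, the chunks the model saw, and its answer as one JSON line."""
    return json.dumps(
        {
            "question": question,
            "chunks": [asdict(c) for c in chunks],
            "answer": answer,
        },
        ensure_ascii=False,
    )


def append_trace(path: str, record: str) -> None:
    """Append one record to a JSON Lines file for later review."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(record + "\n")


# Hypothetical example of what gets logged for a single question.
record = trace_record(
    "How do we handle late invoices?",
    [
        RetrievedChunk("finance-2021", 0.81, "Late invoices were escalated to the CFO."),
        RetrievedChunk("finance-2024", 0.79, "Late invoices are handled by the billing team."),
    ],
    "Late invoices are escalated to the CFO.",
)
parsed = json.loads(record)
```

Calling `append_trace` once per answer gives you a file you can open next to a colleague. Reading twenty such records is the hour-long check described in the first step above.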
A short takeaway
RAG is not a synonym for "smart search." It is a way of putting the model on the hook for synthesizing whatever you give it. That makes it powerful when the corpus is clean and reliable, and quietly dangerous when it isn't.
The teams that get value from retrieval treat it the way good editors treat sources: skeptically, traceably, and with the assumption that the citation matters more than the prose around it.
Filed under: RAG · METHOD · PRIMER
First published: Apr 22, 2026