Evals are the only feature you should ship before launch.

There is a question we ask in every kickoff that decides, more reliably than any other signal, whether a project will ship: if the model's answer changes tomorrow morning, how will you know?

If the answer is "we'll feel it" — the project is in trouble. If the answer is "the eval will fail and Slack will tell us" — the project is on track. Almost everything else in an AI engagement is downstream of this.

01. The eval is the contract

Most teams treat the evaluation harness as testing infrastructure — something they will set up later, after the model works. This is exactly backwards. The eval is the contract between you and the model: a written, runnable specification of what counts as success.

Without it, you do not have a project — you have a feeling about whether the model is working. Feelings drift. Models change. Six weeks in, no one will be able to say with confidence whether the system is better or worse than it was, because no one wrote down what better was.

An eval is the smallest possible artifact that can disagree with you. That is its job. — working principle

02. What a minimum eval looks like

We are not advocating for elaborate ML infrastructure. The first eval on most projects is a spreadsheet:

A column for the input — a real question, a real document, a real ticket.
A column for the expected output — the answer a senior member of the team would give.
A column for the actual output — what the model produced.
A column for pass/fail — and a one-line note when it fails.

Twenty rows. Forty if you can. A junior engineer can wire this into a nightly job in an afternoon. The cost is small. The protection is enormous.
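As a minimal sketch of the spreadsheet above: assume the rows are exported as CSV with `input` and `expected` columns, and that `call_model` stands in for whatever model client the project actually uses. Both are illustrative assumptions, as are the sample questions. Exact matching after light normalization is the simplest possible grader, and most projects will want something looser.

```python
import csv
import io
from typing import Callable

# Hypothetical stand-in for the real model client; replace with your own call.
def call_model(prompt: str) -> str:
    canned = {
        "What is our refund window?": "30 days",
        "Which plan includes SSO?": "Business",
    }
    return canned.get(prompt, "I don't know")

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting does not fail a row."""
    return " ".join(text.lower().split())

def run_eval(rows: list[dict], model: Callable[[str], str]) -> list[dict]:
    """Fill in the 'actual', 'pass' and 'note' columns for each row."""
    results = []
    for row in rows:
        actual = model(row["input"])
        passed = normalize(actual) == normalize(row["expected"])
        note = "" if passed else f"expected {row['expected']!r}, got {actual!r}"
        results.append({**row, "actual": actual, "pass": passed, "note": note})
    return results

# Illustrative data; in practice this is the team's spreadsheet exported as CSV.
SHEET = """input,expected
What is our refund window?,30 days
Which plan includes SSO?,Enterprise
"""

rows = list(csv.DictReader(io.StringIO(SHEET)))
results = run_eval(rows, call_model)
for number, result in enumerate(results, start=1):
    status = "PASS" if result["pass"] else "FAIL"
    print(f"row {number}: {status} {result['note']}")
```

Saving `results` to a dated file on each nightly run is enough to compare one night against the next.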

The teams who do this stop having "the model seems worse this week" conversations. They have "row twelve regressed; here's the diff" conversations. That is the difference between engineering and superstition.
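A "row twelve regressed; here's the diff" report can be produced by comparing two saved runs. The sketch below assumes each run is a list of rows with `input`, `actual`, and `pass` fields, kept in the same order from one night to the next. Those field names and the sample data are illustrative, not a fixed format.

```python
import difflib

def find_regressions(previous: list[dict], current: list[dict]) -> list[dict]:
    """Return rows that passed in the previous run and fail in the current one.

    Assumes both runs list the same rows in the same order.
    """
    regressions = []
    for number, (old, new) in enumerate(zip(previous, current), start=1):
        if old["pass"] and not new["pass"]:
            diff = "\n".join(
                difflib.unified_diff(
                    old["actual"].splitlines(),
                    new["actual"].splitlines(),
                    fromfile="previous",
                    tofile="current",
                    lineterm="",
                )
            )
            regressions.append({"row": number, "input": new["input"], "diff": diff})
    return regressions

# Illustrative data: two saved runs of the same two-row eval.
last_night = [
    {"input": "What is our refund window?", "actual": "30 days", "pass": True},
    {"input": "Which plan includes SSO?", "actual": "Business", "pass": False},
]
tonight = [
    {"input": "What is our refund window?", "actual": "14 days", "pass": False},
    {"input": "Which plan includes SSO?", "actual": "Business", "pass": False},
]

regressions = find_regressions(last_night, tonight)
for reg in regressions:
    print(f"row {reg['row']} regressed: {reg['input']}")
    print(reg["diff"])
```

Rows that were already failing are deliberately left out, so the report only shows what changed since the last run.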

03. The unglamorous part

Building an eval set forces a conversation that almost everyone wants to avoid: what does good actually mean here? This is harder than it sounds, because the people who can answer it, the senior members of the team who handle the work today, have never had to write it down.

Plan for this conversation. Allocate a person, allocate a week, accept that the conversation will surface disagreements that have been quietly costing the business for years. The eval is the deliverable; the disagreements that surface during the build are arguably the more valuable output.

04. The two failure modes the eval prevents

Two specific kinds of failure show up in projects that ship without an eval, and almost never in projects that ship with one:

Silent regression on model upgrade. A frontier model upgrade lands. Nobody changes anything. The system gets worse on a category of inputs nobody is checking. Three weeks later, a customer complains. The eval would have flagged this overnight.
Confident hallucination on edge cases. The system answers a question it should have refused. The answer reads correct. Nobody catches it because the system is "working." The eval, if it covered the edge case, would have failed loudly.

Both of these are unfixable in retrospect. In projects with an eval, catching them is routine.
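Both failure modes can be written as ordinary eval rows. The sketch below is one way to do it, under stated assumptions: a `must_refuse` flag marks questions the system should decline, a refusal is detected by a short list of phrases, and `old_model` and `new_model` are hypothetical stand-ins for the system before and after an upgrade. None of these names come from a real library.

```python
from typing import Callable

# Assumption: the system signals a refusal with one of these phrases.
REFUSAL_MARKERS = ("i can't answer", "i don't know", "outside my scope")

def is_refusal(output: str) -> bool:
    lowered = output.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def grade(case: dict, output: str) -> bool:
    """A must-refuse case passes only on a refusal; other cases need the expected text."""
    if case["must_refuse"]:
        return is_refusal(output)
    return case["expected"].lower() in output.lower()

def failing_cases(cases: list[dict], model: Callable[[str], str]) -> list[str]:
    """Return the inputs of every case the model gets wrong."""
    return [c["input"] for c in cases if not grade(c, model(c["input"]))]

# Illustrative cases: one normal question, one edge case that must be refused.
CASES = [
    {"input": "What is our refund window?", "expected": "30 days", "must_refuse": False},
    {"input": "What will our share price be next year?", "expected": "", "must_refuse": True},
]

# Hypothetical stand-ins for the system before and after a model upgrade.
def old_model(prompt: str) -> str:
    if "share price" in prompt:
        return "I can't answer that."
    return "Refunds are accepted within 30 days."

def new_model(prompt: str) -> str:
    if "share price" in prompt:
        return "It will reach $120 by next June."  # reads correct, should be a refusal
    return "Refunds are accepted within 30 days."

for name, model in [("old", old_model), ("new", new_model)]:
    failures = failing_cases(CASES, model)
    summary = "FAILED: " + "; ".join(failures) if failures else "ok"
    print(f"{name} model: {summary}")

# A nightly job could pass this value to sys.exit so the failure is loud.
exit_code = 1 if failing_cases(CASES, new_model) else 0
```

Phrase matching is a crude refusal detector. It is used here only to keep the example self-contained.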

Operating principle: if you cannot show the eval to the person paying for the project, you do not have a project. You have an experiment that has not yet decided whether it succeeded.

A short closing

Every other feature in an AI rollout — the model, the prompt, the retrieval, the UI — is downstream of the eval. The eval is what tells you whether your other choices were correct. Build it first. Build it small. Run it on every change. Show it to the CFO. The teams who do this ship pilot two. The teams who don't, don't.
