
Evals are the only feature you should ship before launch.

Every AI rollout is a measurement problem dressed as a build problem. The teams who figure this out early are the ones who ship the second pilot.

Published: Jan 24, 2026
Reading time: 7 minutes
Category: RAG

There is a question we ask in every kickoff that decides, more reliably than any other signal, whether a project will ship: if the model's answer changes tomorrow morning, how will you know?

If the answer is "we'll feel it" — the project is in trouble. If the answer is "the eval will fail and Slack will tell us" — the project is on track. Almost everything else in an AI engagement is downstream of this.

01. The eval is the contract

Most teams treat the evaluation harness as testing infrastructure — something they will set up later, after the model works. This is exactly backwards. The eval is the contract between you and the model: a written, runnable specification of what counts as success.

Without it, you do not have a project — you have a feeling about whether the model is working. Feelings drift. Models change. Six weeks in, no one will be able to say with confidence whether the system is better or worse than it was, because no one wrote down what better was.

An eval is the smallest possible artifact that can disagree with you. That is its job. — working principle

02. What a minimum eval looks like

We are not advocating for elaborate ML infrastructure. The first eval on most projects is a spreadsheet:

  • A column for the input — a real question, a real document, a real ticket.
  • A column for the expected output — the answer a senior member of the team would give.
  • A column for the actual output — what the model produced.
  • A column for pass/fail — and a one-line note when it fails.

Twenty rows. Forty if you can. A junior engineer can wire this into a nightly job in an afternoon. The cost is small. The protection is enormous.
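As a rough illustration of that nightly job, here is a minimal Python sketch. It assumes the spreadsheet is exported as CSV with `input` and `expected` columns. The names `fake_model`, `exact_match`, `run_eval`, and `exit_code` are hypothetical, and the stand-in model should be replaced by whatever client the project really uses. Exact matching is the crudest possible grader; most projects will want something looser.

```python
import csv
import io
from typing import Callable, Dict, List


def exact_match(expected: str, actual: str) -> bool:
    """Simplest possible grader: case- and whitespace-insensitive equality."""
    return expected.strip().lower() == actual.strip().lower()


def run_eval(
    sheet: str,
    model: Callable[[str], str],
    grade: Callable[[str, str], bool] = exact_match,
) -> List[Dict[str, str]]:
    """Fill in the actual, pass/fail, and note columns for every row."""
    results = []
    for row in csv.DictReader(io.StringIO(sheet)):
        actual = model(row["input"])
        passed = grade(row["expected"], actual)
        results.append({
            "input": row["input"],
            "expected": row["expected"],
            "actual": actual,
            "pass": "pass" if passed else "fail",
            # One-line note, written only when the row fails.
            "note": "" if passed else f"expected {row['expected']!r}, got {actual!r}",
        })
    return results


def exit_code(results: List[Dict[str, str]]) -> int:
    """0 when every row passes, 1 otherwise, so a scheduler can see the failure."""
    return 0 if all(r["pass"] == "pass" for r in results) else 1


# Hypothetical stand-in for the real model call (assumption, not a real API).
def fake_model(question: str) -> str:
    answers = {"What is the refund window?": "30 days"}
    return answers.get(question, "I don't know")


# Illustrative rows only; a real sheet would hold twenty to forty real cases.
SHEET = """input,expected
What is the refund window?,30 days
Who approves refunds over $500?,The finance lead
"""

results = run_eval(SHEET, fake_model)
for number, r in enumerate(results, start=1):
    print(f"row {number}: {r['pass']} {r['note']}")
```

A scheduled job would finish with `sys.exit(exit_code(results))`, so that a single failing row fails the whole run instead of passing quietly.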

The teams who do this stop having "the model seems worse this week" conversations. They have "row twelve regressed; here's the diff" conversations. The difference is the difference between engineering and superstition.
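A "row twelve regressed" report can be as small as a comparison between last night's results and tonight's. The sketch below assumes both runs cover the same rows in the same order and that each result carries `actual` and `pass` fields. The function name `find_regressions` and the sample data are illustrative, not taken from a real system.

```python
from typing import Dict, List

Row = Dict[str, str]


def find_regressions(previous: List[Row], current: List[Row]) -> List[str]:
    """List rows that passed in the previous run and fail in the current one."""
    reports = []
    # Assumption: both runs contain the same rows in the same order.
    for number, (before, after) in enumerate(zip(previous, current), start=1):
        if before["pass"] == "pass" and after["pass"] == "fail":
            reports.append(
                f"row {number} regressed: {before['actual']!r} -> {after['actual']!r}"
            )
    return reports


# Illustrative data only.
last_night = [
    {"input": "What is the refund window?", "actual": "30 days", "pass": "pass"},
    {"input": "Who approves refunds over $500?", "actual": "The finance lead", "pass": "pass"},
]
tonight = [
    {"input": "What is the refund window?", "actual": "30 days", "pass": "pass"},
    {"input": "Who approves refunds over $500?", "actual": "Your manager", "pass": "fail"},
]

for line in find_regressions(last_night, tonight):
    print(line)
# prints: row 2 regressed: 'The finance lead' -> 'Your manager'
```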

03. The unglamorous part

Building an eval set forces a conversation that almost everyone wants to avoid: what does good actually mean here? This is harder than it sounds, because the people who can answer it, the senior team members who handle the work today, have never had to write it down.

Plan for this conversation. Allocate a person, allocate a week, accept that the conversation will surface disagreements that have been quietly costing the business for years. The eval is the deliverable; the disagreements that surface during the build are arguably the more valuable output.

04. The two failure modes the eval prevents

Two specific kinds of failure show up in projects that ship without an eval, and almost never in projects that ship with one:

  • Silent regression on model upgrade. A frontier model upgrade lands. Nobody changes anything. The system gets worse on a category of inputs nobody is checking. Three weeks later, a customer complains. The eval would have flagged this overnight.
  • Confident hallucination on edge cases. The system answers a question it should have refused. The answer reads correct. Nobody catches it because the system is "working." The eval, if it covered the edge case, would have failed loudly.

Both of these are unfixable in retrospect. In projects with an eval, catching them is routine.
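One way to make an eval cover both failure modes is to tag each row with a category and to mark rows where the correct behaviour is a refusal. The sketch below is a set of assumptions layered on the spreadsheet idea: the `category` column, the `REFUSE` sentinel, and the `REFUSAL_MARKERS` phrases are all hypothetical, and substring matching is a deliberately crude way to detect a refusal.

```python
from collections import defaultdict
from typing import Dict, List

# Assumption: phrases this particular system uses when it declines to answer.
REFUSAL_MARKERS = ("i don't know", "i can't answer", "not in the provided documents")


def is_refusal(answer: str) -> bool:
    """Crude refusal check based on known phrases."""
    text = answer.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def grade(row: Dict[str, str], actual: str) -> bool:
    """Rows marked REFUSE pass only when the system declines to answer."""
    if row["expected"] == "REFUSE":
        return is_refusal(actual)
    return row["expected"].strip().lower() == actual.strip().lower()


def pass_rate_by_category(rows: List[Dict[str, str]], actuals: List[str]) -> Dict[str, float]:
    """Pass rate per category, so a drop in one category is visible on its own."""
    totals = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for row, actual in zip(rows, actuals):
        bucket = totals[row["category"]]
        bucket[0] += int(grade(row, actual))
        bucket[1] += 1
    return {category: passed / total for category, (passed, total) in totals.items()}


# Illustrative rows and outputs only.
rows = [
    {"category": "billing", "expected": "30 days"},
    {"category": "billing", "expected": "The finance lead"},
    {"category": "out_of_scope", "expected": "REFUSE"},
    {"category": "out_of_scope", "expected": "REFUSE"},
]
actuals = [
    "30 days",
    "The finance lead",
    "I can't answer that from the provided documents.",
    "The warranty covers water damage.",  # confident answer where a refusal was expected
]

print(pass_rate_by_category(rows, actuals))
```

Comparing these per-category rates before and after a model upgrade shows a drop in one category even when the overall pass rate barely moves.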

Operating principle: if you cannot show the eval to the person paying for the project, you do not have a project. You have an experiment that has not yet decided whether it succeeded.

A short closing

Every other feature in an AI rollout — the model, the prompt, the retrieval, the UI — is downstream of the eval. The eval is what tells you whether your other choices were correct. Build it first. Build it small. Run it on every change. Show it to the CFO. The teams who do this ship pilot two. The teams who don't, don't.


Filed under: RAG · METHOD