Why most AI pilots fail at month three — and the four habits that save them.

A pilot that "works in the demo" is the easy part. Pilots die in the boring middle: when the data drifts, the champion gets pulled, and nobody owns the eval.

Published: May 04, 2026
Reading time: 12 minutes
Category: Strategy

We've helped fifteen small and mid-sized businesses ship AI into production over the last two years. Three of them are spectacular. The rest are useful. None of them looked like the demo we ran in week three.

Here's a thing nobody tells you about AI pilots: they don't fail in the way you expect. They don't fall over because the model is wrong, or the prompt is bad, or the vector database is slow. They fail because at month three the champion gets pulled onto a different fire, the data the agent was trained on drifts, and the person who would have caught the regression isn't watching anymore.

This is a piece about the four habits we've started insisting on — sometimes politely, sometimes not — before we'll take the engagement.

01. Own the eval before you own the model

If you cannot describe, in one paragraph, what "good" looks like for your agent, you do not have a project. You have a vibe. The first deliverable in any engagement we run is an evaluation harness — a dataset of real questions, real answers, and an automated way to check the agent against them.

Building this is unsexy. It is also the thing that lets you sleep at night. When the model upgrades next month — and it will — your eval is the only signal that tells you whether you got better, worse, or the same. Without it you are operating on faith.
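For a sense of scale, here is a minimal sketch of what such a harness could look like. It is an illustration under stated assumptions, not the harness we ship: the cases, the crude substring check, and the `agent_v1` / `agent_v2` stand-ins are all hypothetical, and a real agent call would replace the canned lookups.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected: str  # substring the answer must contain (a deliberately crude check)


def run_eval(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases whose answer contains the expected text."""
    if not cases:
        raise ValueError("an eval needs at least one case")
    passed = sum(
        1 for case in cases
        if case.expected.lower() in agent(case.question).lower()
    )
    return passed / len(cases)


def compare(old_score: float, new_score: float, tolerance: float = 0.01) -> str:
    """Label a model change as 'better', 'worse', or 'same'."""
    delta = new_score - old_score
    if delta > tolerance:
        return "better"
    if delta < -tolerance:
        return "worse"
    return "same"


# Hypothetical cases; in practice these come from real tickets or real RFQs.
CASES = [
    EvalCase("What is the lead time for part A-100?", "10 business days"),
    EvalCase("Do you ship to Canada?", "yes"),
    EvalCase("What is the minimum order quantity?", "50 units"),
    EvalCase("What is the return window?", "30 days"),
]


def agent_v1(question: str) -> str:
    """Stand-in for the current model: a canned lookup, not a real agent."""
    answers = {
        "What is the lead time for part A-100?": "Lead time is 10 business days.",
        "Do you ship to Canada?": "Yes, we ship to Canada.",
        "What is the minimum order quantity?": "The minimum is 50 units.",
        "What is the return window?": "Returns are accepted within 30 days.",
    }
    return answers.get(question, "I don't know.")


def agent_v2(question: str) -> str:
    """Stand-in for the upgraded model, which regresses on one question."""
    if question == "What is the return window?":
        return "Returns are accepted within 14 days."
    return agent_v1(question)


if __name__ == "__main__":
    old, new = run_eval(agent_v1, CASES), run_eval(agent_v2, CASES)
    print(f"v1={old:.2f} v2={new:.2f} -> {compare(old, new)}")
    # prints: v1=1.00 v2=0.75 -> worse
```

The substring check is the weakest part of this sketch; swapping in a stricter grader changes `run_eval` and nothing else. What matters is that the score exists before the upgrade lands, so there is something to compare against.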

"The eval is not a checkbox. It's the only feature that matters in the first thirty days."
From a postmortem we wish we hadn't written

The companies that fail at month three almost always built the eval at month two-and-a-half. By then they were patching, not measuring.

02. Pick a wedge, not a transformation

The biggest favor you can do your future self is to pick a workflow you can describe on a Post-it. Not "we want to use AI in operations." Not "intelligent automation across the back office." A wedge looks like:

  • Quote turnaround for incoming RFQs.
  • First-draft replies to support tickets in a single category.
  • Reconciling line items between two systems that should agree but don't.

Wedges are sized to fit one team's annoyance. They cost less than a senior engineer's quarterly comp. They ship in weeks, not quarters. And — critically — when they work, they finance the next wedge without anyone needing to write a memo about transformation.

Rule of thumb: if the workflow can't be described in a single sentence ending with the word "today" — as in "how we do this today" — it's not a wedge yet. It's a strategy. Put it back in the drawer.

03. Wire the lake before you build the agent

This is the habit that separates the engagements that compound from the ones that stay one-off projects. Most teams reach for the agent first because the agent is the visible part. The lake (the boring, deduplicated, schema-enforced pile of your company's actual data) is the part that determines whether the second pilot is cheap or expensive.

Wire it once. Reflect every system your business actually depends on into a place you control. Postgres, S3, Iceberg, whatever your team can operate. Don't boil the ocean — start with the systems your wedge needs — but write the connectors as if you'll need ten more.
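To make "write the connectors as if you'll need ten more" concrete, here is a small sketch of one possible shape. Everything in it is an assumption for illustration: the `Connector` interface, the `StaticConnector` stand-in, the `records` table, and the use of in-memory SQLite in place of whatever store your team actually operates.

```python
import json
import sqlite3
from typing import Iterable, Protocol


class Connector(Protocol):
    """Anything that can name itself and yield records carrying a stable 'id'."""

    name: str

    def fetch(self) -> Iterable[dict]: ...


class StaticConnector:
    """Hypothetical connector returning canned rows; a real one would call an API."""

    def __init__(self, name: str, rows: list[dict]) -> None:
        self.name = name
        self._rows = rows

    def fetch(self) -> Iterable[dict]:
        return iter(self._rows)


def init_lake(conn: sqlite3.Connection) -> None:
    # One table keyed by (source, record_id), so re-syncing never duplicates rows.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS records (
               source TEXT NOT NULL,
               record_id TEXT NOT NULL,
               payload TEXT NOT NULL,
               PRIMARY KEY (source, record_id)
           )"""
    )


def sync(conn: sqlite3.Connection, connectors: list[Connector]) -> int:
    """Upsert every record from every connector; return how many were written."""
    written = 0
    for connector in connectors:
        for row in connector.fetch():
            conn.execute(
                "INSERT OR REPLACE INTO records (source, record_id, payload) "
                "VALUES (?, ?, ?)",
                (connector.name, str(row["id"]), json.dumps(row, sort_keys=True)),
            )
            written += 1
    conn.commit()
    return written


if __name__ == "__main__":
    lake = sqlite3.connect(":memory:")
    init_lake(lake)
    crm = StaticConnector("crm", [{"id": 1, "customer": "Acme"}])
    erp = StaticConnector("erp", [{"id": "INV-7", "total": 1200}])
    sync(lake, [crm, erp])
    sync(lake, [crm, erp])  # second run overwrites in place
    print(lake.execute("SELECT COUNT(*) FROM records").fetchone()[0])  # prints 2
```

The design choice worth copying is the narrow interface: the eleventh connector only has to provide a `name` and a `fetch()`, and `sync` does not change.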

The third pilot will pay for the first lake. The fifth pilot will be free.

04. Stay boring on purpose

The best AI rollouts we've run look almost identical to good software rollouts from a decade ago. Code review. CI. A staging environment. Feature flags. A real on-call rotation when the agent is doing real work. The model is exotic. The way you ship the model should not be.

This is harder than it sounds, because the field rewards novelty in conversation and punishes it in production. Every quarter there is a new framework, a new pattern, a new way of orchestrating tool calls. Most of them are fine. Almost none of them are worth re-platforming for.
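Boring can be very small. As one example of the feature-flag habit, here is a sketch of a deterministic percentage rollout that decides which users get the agent. The flag name, the `handle_ticket` routing, and the bucket scheme are hypothetical choices made for illustration; a hosted flag service would do the same job.

```python
import hashlib


def rollout_bucket(flag: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in 0..99 using SHA-256."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def agent_enabled(flag: str, user_id: str, percent: int) -> bool:
    """True if this user falls inside the first `percent` buckets."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return rollout_bucket(flag, user_id) < percent


def handle_ticket(user_id: str, text: str, percent: int) -> str:
    # Hypothetical routing: the agent drafts only for users inside the rollout;
    # everyone else stays on the existing human workflow.
    if agent_enabled("agent-draft-replies", user_id, percent):
        return f"[agent draft] {text}"
    return f"[human queue] {text}"


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    enabled = sum(agent_enabled("agent-draft-replies", u, 10) for u in users)
    print(f"{enabled} of {len(users)} users routed to the agent at 10%")
```

Because the bucket depends only on the flag name and the user ID, a given user sees the same behavior on every request, and setting `percent` to 0 is an immediate rollback.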

The four-habit checklist, in plain English

  1. Eval first. Write down what good looks like, in tests, before you write prompts.
  2. Wedge, not vision. One workflow, one team, shipped in weeks.
  3. Lake before agent. The boring infrastructure determines the cost of pilot #2.
  4. Boring delivery. Treat AI rollouts like the production software they are.

A working theory

If you take one thing from this essay, take this: the failure mode of AI pilots is not technical. It is organizational. The team that ships pilot two has built habits that the team that never ships it has not. They are unfashionable habits. They are also the only ones that survive contact with month three.

If you'd like help building those habits, we do this for a living. If you'd rather build them yourself — and many of our best clients did, before they hired us — these four are the place to start.


Filed under: STRATEGY · POSTMORTEM · METHOD