Schemas are opinions — write them like it.

Your data lake is a working theory of how your business runs. Treat it that way and the agents on top of it stop hallucinating.

Published: Feb 21, 2026 · Reading time: 8 minutes · Category: Data lakes

Engineers think of schemas as a way to organize columns. Operators should think of them as something stranger: a theory of how their business operates, written down in tables and types. When the schema is wrong, the agent on top of it cannot help being wrong. When the schema is right, the agent cannot help being useful.

This is the part of data work that nobody talks about, because the people who do it well find it boring and the people who do it badly do not realize they are doing it.

01. A schema is a claim, not a container

When you decide that the customers table has a column called tier, you are not just choosing a database structure. You are claiming, in writing, that every customer can be placed into one of a finite set of tiers. The agent reading that schema will believe you. The next person reading the data will believe you. Nine months from now, the founder will write strategy on top of this claim.

If that claim is false — if some customers do not fit any tier, if a tier was renamed and the old name is still in the data, if the tiers were never agreed across teams — the schema is lying. The agent's downstream answers, which the founder will trust, will be confidently wrong.
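One way to find out whether the tier claim still holds is to test it against the rows themselves. The following is a minimal sketch, assuming rows arrive as plain dictionaries; the AGREED_TIERS set, the audit_tier_claim function, and the sample customers are all hypothetical and stand in for whatever your own teams have agreed on.

```python
from collections import Counter

# Hypothetical: the tiers the business says it has agreed on.
AGREED_TIERS = {"free", "pro", "enterprise"}

def audit_tier_claim(rows, allowed=AGREED_TIERS):
    """Count rows that support the tier claim and rows that contradict it."""
    violations = Counter()
    supported = 0
    for row in rows:
        tier = row.get("tier")
        if tier in allowed:
            supported += 1
        else:
            # None means the customer fits no tier; any other value is an
            # unknown or renamed tier that is still present in the data.
            violations[tier] += 1
    return supported, dict(violations)

# Illustrative sample data only.
customers = [
    {"id": 1, "tier": "pro"},
    {"id": 2, "tier": "premium"},  # old name for a renamed tier
    {"id": 3, "tier": None},       # fits no tier
    {"id": 4, "tier": "free"},
]

supported, violations = audit_tier_claim(customers)
print(supported, violations)
```

Any non-empty violations result means the schema is claiming something the data no longer supports, and that is worth settling before an agent reads the table.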

A schema is the smallest possible essay on what your business is. Almost nobody writes it that way. — working note

02. Three places schemas are usually wrong

In every diagnostic we run, the same three patterns show up in the data layer:

  • The "status" column with twenty-seven values. Originally three. Grew by accretion. Half duplicate one another and differ only in letter case. The agent will treat them as distinct categories. They are not.
  • The "type" column nobody owns. The CRM team uses it for one thing, the finance team uses it for another, the support team has stopped using it. The schema does not record this. The agent will blend all three meanings into one.
  • The foreign key that is mostly null. A relationship that exists in theory and rarely in practice. When the agent does find the key populated, it will assume the relationship is meaningful. It usually is not.

None of these are technical bugs. The data is "fine." The schema is what is wrong, because the schema is making a claim that the data has stopped supporting.
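The three patterns above can each be surfaced with a few lines of code. This is a rough sketch rather than a full diagnostic: the helper names, the written_by column (which assumes you can tell which team's system wrote a row), and the sample rows are all hypothetical.

```python
from collections import defaultdict

def case_duplicates(values):
    """Group spellings that collapse to one value once case and padding are ignored."""
    groups = defaultdict(set)
    for value in values:
        if value is not None:
            groups[value.strip().lower()].add(value)
    return {key: sorted(found) for key, found in groups.items() if len(found) > 1}

def values_by_writer(rows, column, writer_column):
    """List which values each writing system puts into a shared column."""
    seen = defaultdict(set)
    for row in rows:
        if row.get(column) is not None:
            seen[row.get(writer_column)].add(row[column])
    return {writer: sorted(found) for writer, found in seen.items()}

def null_rate(rows, column):
    """Fraction of rows where the column is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(column) is None) / len(rows)

# Illustrative sample data only.
rows = [
    {"status": "Open", "type": "lead", "written_by": "crm", "account_id": None},
    {"status": "open", "type": "invoice", "written_by": "finance", "account_id": None},
    {"status": "OPEN ", "type": "lead", "written_by": "crm", "account_id": 7},
    {"status": "closed", "type": None, "written_by": "support", "account_id": None},
]

status_dupes = case_duplicates(row["status"] for row in rows)
type_usage = values_by_writer(rows, "type", "written_by")
fk_null_rate = null_rate(rows, "account_id")
print(status_dupes, type_usage, fk_null_rate)
```

None of these checks decides what the column should mean. They only show where the written claim and the stored data have drifted apart, which is the conversation the owning teams then need to have.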

03. Write the opinion down

The exercise we ask operators to do, before any agent is built on a data lake, is unfashionably old: write a short essay — half a page per important table — that says, in English:

  • What this table is a record of.
  • What it is not a record of.
  • Which fields are reliably populated, and which are aspirational.
  • Which fields disagree across systems, and which we have made authoritative.

This document is not optional documentation. It is the schema, written in the language the agent will be given. When the agent answers a question wrong, this is where you go to figure out whether the answer was wrong because the model misread the schema, or because the schema was lying.
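If it helps to keep the essay next to the data, the four questions can be held in a small structure and rendered as plain English for the agent. A minimal sketch, assuming a Python dataclass is an acceptable home for it; the TableNote class, its field names, and the example values are hypothetical, and the prose is still written by a person.

```python
from dataclasses import dataclass, field

@dataclass
class TableNote:
    """Hypothetical container for the half-page essay about one table."""
    table: str
    record_of: str
    not_record_of: str
    reliable_fields: list = field(default_factory=list)
    aspirational_fields: list = field(default_factory=list)
    # Maps a contested field to the source declared authoritative for it.
    authoritative: dict = field(default_factory=dict)

    def render(self):
        """Return the note as plain English lines an agent can be given."""
        lines = [
            f"Table: {self.table}",
            f"This table is a record of {self.record_of}.",
            f"It is not a record of {self.not_record_of}.",
            "Reliably populated: " + (", ".join(self.reliable_fields) or "none listed"),
            "Aspirational: " + (", ".join(self.aspirational_fields) or "none listed"),
        ]
        for name, source in sorted(self.authoritative.items()):
            lines.append(f"For {name}, systems disagree; {source} is authoritative.")
        return "\n".join(lines)

# Illustrative values only.
note = TableNote(
    table="customers",
    record_of="every account that has signed a contract",
    not_record_of="trial users or leads",
    reliable_fields=["id", "created_at"],
    aspirational_fields=["tier"],
    authoritative={"billing_email": "the finance ledger"},
)
text = note.render()
print(text)
```

The structure adds nothing the essay does not already say. Its only job is to make sure the same words reach the agent every time, so a wrong answer can be traced back to a specific sentence.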

The operators who do this exercise stop arguing with their agents. The operators who skip it never quite trust theirs.

Rule of thumb: if you cannot describe a table's purpose in two sentences without naming a system, the schema is incomplete. The system is implementation; the table is supposed to outlive it.
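The rule of thumb can be turned into a crude check. The sketch below is only a heuristic, assuming you keep a list of your own system names; KNOWN_SYSTEMS and passes_rule_of_thumb are hypothetical, and the sentence splitting is deliberately naive.

```python
import re

# Hypothetical system names; replace with the ones your company actually runs.
KNOWN_SYSTEMS = {"oldcrm", "billingdb"}

def passes_rule_of_thumb(purpose, systems=KNOWN_SYSTEMS):
    """True if the purpose fits in two sentences and names no known system."""
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", purpose.strip()) if s]
    words = set(re.findall(r"[a-z0-9_]+", purpose.lower()))
    return len(sentences) <= 2 and not (words & systems)

good = passes_rule_of_thumb("One row per signed contract. Updated when terms change.")
bad = passes_rule_of_thumb("Rows exported nightly from BillingDB.")
print(good, bad)
```

A description that passes is not necessarily a good one. A description that fails is a prompt to rewrite it until the table could survive a change of vendor.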

A short closing

The agents you build will be exactly as honest as the schema underneath them. If the schema is a confident essay about how your business works, the agent will sound confident for the right reasons. If the schema is a pile of column names that nobody has audited in two years, the agent will hallucinate — and it will sound the same.

Treat your schema like writing. It is.


Filed under: DATA · METHOD