Tests, Evals, and the Substrate That Moves

Why production AI needs behavioral measurement, not just correctness checks.

Praveen · Engineering · 7 min read

Software engineering has evolved under a useful assumption: the substrate your code runs on is predictable and observable. It isn't actually static — compilers upgrade, libraries drift, hardware varies, and flaky tests have been a discipline of their own for decades. But the substrate changes on a schedule you mostly control, and when it changes you get to know about it. You upgrade the compiler, the test suite runs, and any new failures surface immediately in CI.

Production AI weakens that assumption.

The layer underneath your prompts is a model: a probabilistic function defined by weights you don't own, trained on data you didn't see, served through infrastructure you can't inspect. When you call a model alias like claude-3-5-sonnet-latest, the underlying version is allowed to change. When you pin to a dated version like claude-3-5-sonnet-20241022, weights are generally frozen, but behavior can still shift over time because of the surrounding system: safety filters, moderation pipelines, routing, tool-use plumbing, rate-limit handling. (Anthropic and OpenAI both document this to varying degrees in their model deprecation and versioning policies.)
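One way to make the alias-versus-pinned distinction explicit in your own code is a guard that rejects undated model IDs in production configuration. The sketch below is illustrative only: the `is_pinned` helper is hypothetical, and it assumes dated snapshots end in an eight-digit date, as in the example ID above. Other providers format dates differently, so the pattern would need adjusting.

```python
import re

# Assumption: a pinned snapshot ID ends in an eight-digit date (YYYYMMDD),
# matching the style of the dated example in the text.
_DATED_SUFFIX = re.compile(r"-\d{8}$")

def is_pinned(model_id: str) -> bool:
    """Return True if the model ID looks like a dated snapshot, not an alias."""
    return bool(_DATED_SUFFIX.search(model_id))

for candidate in ("claude-3-5-sonnet-20241022", "claude-3-5-sonnet-latest"):
    status = "pinned" if is_pinned(candidate) else "alias (may change underneath you)"
    print(f"{candidate}: {status}")
```

Pinning narrows the set of things that can move. It does not remove the need to measure behavior, since the surrounding system can still shift.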

The practical takeaway: your prompts, schemas, and code are the most stable part of the system. The layer that turns them into decisions is not.

This is not a failure of frontier providers. It is a structural property of building on systems whose behavior is not defined by a specification. And it has implications for how you think about quality, most sharply for AI systems where decisions carry real consequences and less so for lower-stakes applications.

The difference between verifying correctness and detecting change

A test answers one question: is this output acceptable? It defines a boundary — a specific input should produce a specific output, or an output within a bounded tolerance. When the test passes, you have evidence that at this moment, for this input, the system behaves within that boundary.

A test does not tell you whether anything has changed since yesterday. It has no memory. Passing a test today and passing the same test last week are, from the test's perspective, identical outcomes.

In conventional software, this is usually fine. Substrate changes are announced, versioned, and human-initiated, so changed and broken are close to the same thing. A test that once passed and now fails typically reflects a change in your own codebase, and you find it in your commit history.

In production AI, changed and broken can diverge. A test can still pass. Decisions can still be correct. And yet the system's behavior can have shifted in ways that matter — confidence scores that used to cluster at 0.9 now cluster at 0.7, edge cases that used to resolve one way now resolve another. Users experience the system differently even when tests continue to report green.
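A small sketch makes the divergence concrete. The confidence floor, the `passes_test` helper, and the two sets of confidence scores are all made up for illustration. The test stays green on both days while the distribution moves by twenty points.

```python
from statistics import mean

# Hypothetical specification boundary: approve with at least this confidence.
CONFIDENCE_FLOOR = 0.6

def passes_test(label: str, confidence: float) -> bool:
    """A stateless check: it only sees the current output, never an earlier one."""
    return label == "approve" and confidence >= CONFIDENCE_FLOOR

# Made-up confidences for the same five inputs on two different days.
last_week = [0.91, 0.89, 0.92, 0.90, 0.88]
today = [0.71, 0.69, 0.72, 0.70, 0.68]

all_green_last_week = all(passes_test("approve", c) for c in last_week)
all_green_today = all(passes_test("approve", c) for c in today)
shift = mean(today) - mean(last_week)

print(f"green last week: {all_green_last_week}, green today: {all_green_today}")
print(f"mean confidence shift: {shift:+.2f}")
```

Nothing in `passes_test` can report the shift, because the function never receives last week's values. Detecting it requires something that stores them.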

That is the gap evals address. Not as a replacement for tests, but as a different instrument aimed at a different question.

Evals as measurement, not verification

It helps to separate two concerns.

A test verifies. It is a binary instrument: the output either satisfies the specification or it doesn't. You write tests against a specification you wrote yourself. They encode your intent.

An eval measures. It captures the shape of the system's behavior — decisions, confidences, reasoning traces, latencies, token counts — across a representative set of inputs. It does not primarily ask is this right? It asks what is this doing?
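As a rough sketch of what "the shape of the system's behavior" might look like as data, the record below holds one observation per scenario. The `EvalRecord` type, its field names, and the JSON Lines serialization are assumptions chosen for illustration, not a prescribed schema.

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class EvalRecord:
    """One observation of system behavior for one scenario (hypothetical schema)."""
    scenario_id: str
    decision: str
    confidence: float
    reasoning: str
    latency_ms: float
    output_tokens: int

def to_jsonl(records: list[EvalRecord]) -> str:
    """Serialize records one per line so runs can be stored and compared later."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)

# Made-up example values.
records = [
    EvalRecord("refund-001", "approve", 0.91, "Within return window.", 840.0, 57),
    EvalRecord("refund-002", "deny", 0.78, "Item marked final sale.", 910.5, 64),
]
print(to_jsonl(records))
```

Note that the record carries no pass or fail field. It describes what happened, and judgment is applied later when runs are compared.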

These are different disciplines. Tests are for things you care about enough to specify. Evals are for things you need to observe whether you can fully specify them or not.

For production AI, evals deserve first-class status, because the behavior of a probabilistic system is not something you can fully specify in advance. You build a representative sample of real decisions, run it on a cadence, and compare what you see to what you saw last time. The instrument has memory; the test does not.

Baselines: what evals give you that tests cannot

The central artifact evals produce, and tests do not, is a baseline.

A baseline is a frozen record of system behavior at a moment in time, across a specific set of inputs. It is not itself a correctness judgment. It is a photograph. Later, when the model has changed, or the prompt has changed, or a tool response schema has changed, you take another photograph and compare.

The comparison is what tests cannot do. A test cannot tell you that decisions flipped for 12% of your scenarios, because a test doesn't know what yesterday's decisions were. A baseline can.
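The comparison itself can be very small once both photographs exist. The sketch below assumes each run is stored as a mapping from scenario ID to decision. The function names and the 25-scenario data set are hypothetical and constructed so that three decisions flip.

```python
def flipped_scenarios(baseline: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return IDs of scenarios present in both runs whose decision changed."""
    shared = baseline.keys() & current.keys()
    return sorted(sid for sid in shared if baseline[sid] != current[sid])

def flip_rate(baseline: dict[str, str], current: dict[str, str]) -> float:
    """Fraction of shared scenarios whose decision differs from the baseline."""
    shared = baseline.keys() & current.keys()
    if not shared:
        raise ValueError("no shared scenarios to compare")
    return len(flipped_scenarios(baseline, current)) / len(shared)

# Made-up data: 25 scenarios, 3 of which change decision between runs.
baseline = {f"s{i}": "approve" for i in range(25)}
current = dict(baseline)
for sid in ("s0", "s1", "s2"):
    current[sid] = "deny"

print(flipped_scenarios(baseline, current))
print(f"flip rate: {flip_rate(baseline, current):.0%}")
```

A flip is not automatically a regression. The new decision may be the better one. The point is that the diff tells you where to look.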

The comparison is also what humans cannot easily do at scale. You can read a handful of outputs side by side. You cannot read thousands. The instrument has to do it for you.

This framing reshapes what a mature AI engineering practice looks like. It is no longer enough to write tests and watch for red builds. You also need to capture what the system is doing, record it, and look at the delta when anything in the environment shifts.

Three consequences of taking evals seriously

First, evals should run on a cadence, not only on demand. A baseline is only useful if you have a recent point of comparison. A monthly eval gives you much less actionable signal than a daily or weekly one, because the longer the gap, the harder it is to attribute a delta to a specific change. The right cadence depends on how quickly the environment around your system actually moves.

Second, the delta is often more informative than the absolute number. A report that your system is 97% accurate is less actionable than one that says it is 97% accurate and down two points from last week. The trend is the signal; a single reading in isolation is harder to act on.
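A reporting helper can make the delta the default view. This is a minimal sketch, and the `accuracy_report` function and its wording are hypothetical. It assumes accuracy is stored as a fraction between 0 and 1 and that the previous run's value is available when one exists.

```python
from typing import Optional

def accuracy_report(current: float, previous: Optional[float]) -> str:
    """Format an accuracy reading together with its change since the previous run."""
    if previous is None:
        return f"accuracy {current:.1%} (no earlier reading to compare)"
    delta_points = (current - previous) * 100  # change in percentage points
    return f"accuracy {current:.1%} ({delta_points:+.1f} points vs. previous run)"

print(accuracy_report(0.97, None))
print(accuracy_report(0.97, 0.99))
```

The second call reports the same 97% as the first, but the attached drop of two points is what prompts someone to investigate.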

Third, evals inform deployment gates, not only dashboards. The point of detecting change is to prevent bad changes from reaching production. That requires the eval loop to be wired into whatever promotes models, prompts, and configurations. A human watching a dashboard weekly is a start — but for higher-stakes systems the loop should close automatically, with humans inspecting flagged diffs rather than browsing charts.
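Wiring evals into promotion can start as a single function that turns deltas into a go or no-go decision. The sketch below is one possible shape, not a recommended policy. The `evaluate_gate` function, the `GateResult` type, and both thresholds are hypothetical and would need tuning to the stakes of the system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GateResult:
    promote: bool
    reasons: list[str]  # human-readable explanations for a blocked promotion

def evaluate_gate(
    flip_rate: float,
    accuracy_delta_points: float,
    max_flip_rate: float = 0.05,            # hypothetical threshold
    max_accuracy_drop_points: float = 1.0,  # hypothetical threshold
) -> GateResult:
    """Block promotion when behavior moved more than the configured limits allow."""
    reasons: list[str] = []
    if flip_rate > max_flip_rate:
        reasons.append(f"flip rate {flip_rate:.0%} exceeds {max_flip_rate:.0%}")
    if accuracy_delta_points < -max_accuracy_drop_points:
        reasons.append(
            f"accuracy fell {-accuracy_delta_points:.1f} points "
            f"(limit {max_accuracy_drop_points:.1f})"
        )
    return GateResult(promote=not reasons, reasons=reasons)

result = evaluate_gate(flip_rate=0.12, accuracy_delta_points=-2.0)
# In a CI job, a nonzero exit code would stop the promotion step.
exit_code = 0 if result.promote else 1
print(result.promote, result.reasons, exit_code)
```

The `reasons` list is the part a human reads: it points at the flagged diffs rather than at a chart.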

None of this is revolutionary. All of it takes more investment than most teams have put into their testing infrastructure so far.

Closing

The underlying observation is simple. Software engineering evolved a set of practices under the assumption that substrate changes are announced, versioned, and human-initiated. Production AI relaxes that assumption in ways that vary by stakes and by provider policy — but rarely in your favor when it matters.

Tests are not going away. They still encode intent. They still catch errors you had the foresight to anticipate. What tests cannot do — and what you now need, at least for systems where decisions carry real weight — is tell you when the system behaves differently than it did yesterday, even when that difference does not cross a specification boundary.

Evals are the discipline for that question. Treat them as the measurement layer they are, run them continuously, and care about the delta as much as the absolute value. The outputs of a production AI system can drift whether you look or not. The only real choice is whether you see the drift before your users do.