Evals Over Gut Feeling: Why AI Features Fail Without Measurement // nh labs

The Feature That Works in the Meeting

There's a moment every team recognises once it has built its first AI feature. The prototype runs, someone types in a question, the model answers fluently and cleverly, and a warm feeling spreads through the room: this works. Three weeks later it goes live, and suddenly the same chatbot is inventing order numbers that never existed, the summariser is cutting the one paragraph that mattered, and the classifier is dropping every second complaint into the wrong bucket.

The uncomfortable question in the crisis meeting that follows is always the same: how much worse did it get – and since when? And nobody can answer it. There's no number, no comparison, no before. There was only a good feeling in a demo.

That's demo-driven development, and it's the single most common way AI projects fail. Not because the model is too weak, but because nobody ever measured what it actually does. A demo shows that a system can work in the best case. About the normal case, about the long tails of real usage, about regressions, it says precisely nothing.

What an Eval Actually Is

The answer to this problem is old and unglamorous: measure. In AI development the tool is called an eval – short for evaluation. And at its core it's surprisingly simple. It has two parts:

A dataset of representative inputs. Real examples the feature will face in production – the typical ones, the tricky ones, the annoying ones. Not the three pretty cases from the demo.
A way to score the outputs. A function that, for each input, says: good or bad, right or wrong, a number between 0 and 1.

That's it. An eval is to an AI feature what a test is to ordinary code: a repeatable, automated statement about whether the system does what it's supposed to. Anyone shipping software without tests is rightly considered reckless. Anyone shipping AI features without evals is doing exactly the same thing – it just goes unnoticed for longer, because a model sounds confident even when it's wrong.
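To make the two ingredients concrete, here is a minimal sketch in Python. The names `Example`, `run_eval`, `exact_match` and `toy_classifier` are illustrative assumptions, and the toy classifier merely stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Scoring function: 1.0 if the answer matches, 0.0 otherwise."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(system: Callable[[str], str],
             dataset: List[Example],
             score: Callable[[str, str], float]) -> float:
    """Run `system` on every example and return the mean score (0.0 to 1.0)."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    scores = [score(system(ex.input), ex.expected) for ex in dataset]
    return sum(scores) / len(scores)


# Hypothetical stand-in for the real feature (in practice: a model call).
def toy_classifier(text: str) -> str:
    return "complaint" if "broken" in text.lower() else "question"


# Hypothetical dataset: typical cases plus one tricky one.
DATASET = [
    Example("My order arrived broken.", "complaint"),
    Example("When do you open on Sundays?", "question"),
    Example("This is the third time I've been charged twice!", "complaint"),
    Example("Can I change my delivery address?", "question"),
]

print(f"task success: {run_eval(toy_classifier, DATASET, exact_match):.2f}")  # 0.75
```

The third example is the kind of case a demo never shows: the keyword-based stand-in misses it, and the eval turns that miss into a number.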

How you score depends on the task:

Structured tasks can be checked hard. If the model has to parse an invoice into JSON, sort an email into one of five categories, or extract a figure, there's a correct answer. Exact-match, assertions, schema validation – like a classic unit test.
Open-ended tasks need a softer judgement. For a summary, a support-chat reply, a generated paragraph, there's no single correct output. Here you rely either on a human rating or on a second model as a grader – LLM-as-judge – scoring against clear criteria.
Golden datasets are the backbone: a curated set of inputs paired with the expected or ideal output, which you check against again and again.

Collect those examples and run them automatically on every change, and you have a regression suite – the only authority that will honestly tell you whether your latest "improvement" actually was one.
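The two scoring styles above could look roughly like this. It is a sketch under assumptions: the invoice schema and the judge prompt are invented for illustration, and `judge` is any callable that sends a prompt to a grader model and returns its text reply. No particular provider API is implied.

```python
import json
from typing import Callable, Optional

# Hypothetical schema for an invoice-parsing feature: field name -> required type.
INVOICE_SCHEMA = {"invoice_number": str, "total": float, "currency": str}


def score_structured(output: str, expected: dict) -> dict:
    """Hard checks: valid JSON, matches the schema, equals the golden answer."""
    result = {"valid_json": False, "schema_ok": False, "exact_match": False}
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return result
    result["valid_json"] = True
    result["schema_ok"] = (
        isinstance(parsed, dict)
        and set(parsed) == set(INVOICE_SCHEMA)  # no missing and no extra fields
        and all(isinstance(parsed[k], t) for k, t in INVOICE_SCHEMA.items())
    )
    result["exact_match"] = result["schema_ok"] and parsed == expected
    return result


# Hypothetical grading prompt with explicit criteria.
JUDGE_PROMPT = """You are grading a summary against its source.
Criteria: (1) every claim is supported by the source, (2) the key point is kept.
Reply with a single integer from 1 (fails both) to 5 (meets both).

Source:
{source}

Summary:
{summary}
"""


def score_with_judge(judge: Callable[[str], str],
                     source: str, summary: str) -> Optional[float]:
    """Ask a grader for a 1-5 rating and map it to 0..1; None if unparseable."""
    reply = judge(JUDGE_PROMPT.format(source=source, summary=summary)).strip()
    if not reply.isdigit() or not 1 <= int(reply) <= 5:
        return None
    return (int(reply) - 1) / 4


golden = {"invoice_number": "INV-001", "total": 119.0, "currency": "EUR"}
good = '{"invoice_number": "INV-001", "total": 119.0, "currency": "EUR"}'
chatty = 'Sure! Here is the JSON: {"invoice_number": "INV-001"}'
print(score_structured(good, golden))    # all three checks True
print(score_structured(chatty, golden))  # all three checks False
print(score_with_judge(lambda prompt: "4", "source text", "summary text"))  # 0.75
```

Returning `None` for an unparseable judge reply keeps grader failures visible instead of silently counting them as bad answers.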

Why This Matters More Than for Ordinary Code

You might object here: we already know tests, this is nothing new. True – and yet the situation with AI is fundamentally more dangerous, for two reasons.

First: outputs are non-deterministic. Classic code does the same thing every time for the same input. A language model doesn't. The same request can produce a good answer today and a poor one tomorrow. A single spot-check – "I just tried it, looked fine" – therefore tells you almost nothing. You need many examples and a distribution, not an anecdote.
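One way to put a number on "a distribution, not an anecdote" is to run the same input many times and report the pass rate with an uncertainty range. The sketch below uses the Wilson score interval, a standard choice for proportions. The `flaky_model` function and its 80% success rate are assumptions that simulate a non-deterministic system.

```python
import math
import random
from typing import Tuple


def wilson_interval(passes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate of passes/n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def flaky_model(rng: random.Random) -> str:
    """Hypothetical model that answers correctly about 80% of the time."""
    return "correct" if rng.random() < 0.8 else "wrong"


rng = random.Random(0)  # seeded so the simulation is repeatable
runs = 50
passes = sum(flaky_model(rng) == "correct" for _ in range(runs))
low, high = wilson_interval(passes, runs)
print(f"{passes}/{runs} passed, plausible pass rate: {low:.2f} to {high:.2f}")

# A single successful spot-check, by contrast, pins down very little:
low, high = wilson_interval(1, 1)
print(f"1/1 passed, plausible pass rate: {low:.2f} to {high:.2f}")
```

With one passing run the interval stretches from roughly 0.21 to 1.00, which is the statistical version of "I just tried it, looked fine".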

Second, and this is the genuinely dangerous part: changes are global. In ordinary code a change is local. You alter a function, and only what calls that function is affected. With AI, every meaningful lever is global. You rephrase one line of the prompt, swap the model for a new version, nudge the temperature – and that potentially affects every output at once.

This is exactly where the trap sits. You tune the prompt to make the model answer more politely, you're pleased with the three cases you spot-checked – and you don't notice that the same change quietly dropped your hit rate on data extraction by 30%. Without an eval you don't see it. You see it when your customers do.

This risk becomes concrete the moment you switch models – which happens constantly, because providers deprecate versions and newer models are cheaper or faster. Moving from one Claude or GPT model to the next is not a drop-in replacement. The new model may be better on average and still worse on exactly the slice that matters for your product. Without an eval, that switch is a blindfolded leap. With an eval it's a number: task success 87% before, 91% after, format errors down from 4% to 1% – green light.
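The "better on average, worse on your slice" case is easy to check once results are tagged by slice. The following is a sketch with invented numbers: the slice names, the counts and the 2-point tolerance are all assumptions chosen for illustration.

```python
from collections import defaultdict


def success_by_slice(results):
    """results: iterable of (slice_name, passed) -> {slice_name: pass rate}."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [passes, count]
    for slice_name, passed in results:
        totals[slice_name][0] += int(passed)
        totals[slice_name][1] += 1
    return {name: passes / count for name, (passes, count) in totals.items()}


def overall_success(results):
    results = list(results)
    return sum(passed for _, passed in results) / len(results)


def regressed_slices(baseline, candidate, tolerance=0.02):
    """Slices where the candidate is worse than the baseline by more than `tolerance`."""
    base = success_by_slice(baseline)
    cand = success_by_slice(candidate)
    return {
        name: (base[name], cand[name])
        for name in base
        if name in cand and cand[name] < base[name] - tolerance
    }


def make_results(slice_name, passes, fails):
    """Helper that builds fake per-example results for the demo."""
    return [(slice_name, True)] * passes + [(slice_name, False)] * fails


# Hypothetical eval results for the old and the new model.
baseline = make_results("faq", 7, 3) + make_results("extraction", 9, 1)
candidate = make_results("faq", 10, 0) + make_results("extraction", 7, 3)

print(overall_success(baseline), overall_success(candidate))  # 0.8 0.85
print(regressed_slices(baseline, candidate))  # {'extraction': (0.9, 0.7)}
```

The overall score rises from 0.80 to 0.85 while extraction falls from 0.90 to 0.70, so a gate on the headline number alone would have approved the switch.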

What to Actually Measure

"Good or bad" is a start, but too blunt a measure. A useful eval breaks quality into dimensions you can track separately. Which ones depend on the feature – but this catalogue covers most cases:

Task success. The core question: did the system solve the task? Hit the right category, return the correct figure, give the usable answer.
Faithfulness / hallucination rate. For anything grounded in sources – RAG, summaries, Q&A over documents – the single most important number: how often does the model assert something that isn't in the source? A fluent, confident fabrication is more dangerous than an honest "I don't know."
Format adherence. When downstream systems consume the output, the format has to be right. Valid JSON, the correct schema, no prose preamble before the code. One extra field can break an entire pipeline.
Latency and cost. Even when quality holds: a feature that takes 30 seconds per answer or blows the token budget is unusable in production. Both belong in the eval, not in the surprise on the monthly invoice.
Refusal / over-caution rate. The quiet counterpart to hallucination: models that too often block with "I can't answer that" when they could and should. Excessive caution is just as much a product defect as excessive invention.
Failure-mode buckets. The most valuable thing about a mature eval isn't the overall grade, it's the sorting of failures by type. Does the model struggle with long inputs? With a particular language? With tables? Only the clusters tell you what to work on next.

A single percentage is reassuring and misleading. It's the breakdown that turns an eval from a grade into a diagnostic tool.
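As a sketch of what such a breakdown might look like in code: the `Record` fields below are assumptions about what earlier scoring steps have already produced for each example, and the sample records and failure tags are invented.

```python
import statistics
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    """One scored eval example (all flags assumed to come from earlier scorers)."""
    correct: bool
    faithful: bool
    valid_format: bool
    refused: bool
    latency_s: float
    cost_usd: float
    failure_tag: Optional[str] = None


def rate(flags) -> float:
    flags = list(flags)
    return sum(flags) / len(flags)


def report(records: List[Record]) -> dict:
    """Aggregate per-example records into the dimensions listed above."""
    latencies = [r.latency_s for r in records]
    return {
        "task_success": rate(r.correct for r in records),
        "hallucination_rate": rate(not r.faithful for r in records),
        "format_adherence": rate(r.valid_format for r in records),
        "refusal_rate": rate(r.refused for r in records),
        # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
        "p95_latency_s": statistics.quantiles(latencies, n=20)[-1],
        "mean_cost_usd": statistics.fmean(r.cost_usd for r in records),
        "failure_buckets": Counter(r.failure_tag for r in records if r.failure_tag),
    }


# Hypothetical results from one eval run.
records = [
    Record(True, True, True, False, 1.2, 0.002),
    Record(True, True, True, False, 0.8, 0.002),
    Record(False, False, True, False, 1.5, 0.003, "fabricated_fact"),
    Record(False, True, False, False, 4.8, 0.006, "long_input"),
    Record(False, True, True, True, 0.4, 0.001, "over_refusal"),
    Record(False, True, True, False, 5.1, 0.007, "long_input"),
]

summary = report(records)
for name, value in summary.items():
    print(f"{name}: {value}")
print("work on next:", summary["failure_buckets"].most_common(1)[0][0])  # long_input
```

The last line is the diagnostic part: the most common failure bucket, not the overall grade, suggests where to spend the next day of work.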

Where the Effort Doesn't Pay Off

Now the honest part – otherwise you draw the wrong conclusions and build eval infrastructure for things that don't need it.

Evals aren't free. A good dataset wants curating, an LLM judge wants calibrating, the suite wants maintaining when requirements shift. That effort is only justified when the feature is worth it. For a throwaway prototype, an internal script one person runs twice a month, or a one-off data migration, "try it and eyeball it" is perfectly legitimate. Vibes are a sensible strategy when the stakes are low. Nobody writes unit tests for a one-line bash command, and nobody should build a golden dataset for a weekend experiment.

There's a second, subtler danger: Goodhart's law. "When a measure becomes a target, it ceases to be a good measure." Tune your prompt against the same 50-example dataset long enough to push the number to 99%, and you may have learned only how to solve those 50 examples – and nothing beyond them. A stale eval set you've tuned against for months becomes its own illusion of quality. That's why the dataset has to stay alive: regularly extended, fed with real new failure cases, occasionally checked against a fresh, unseen set.
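One common guard, sketched here under assumptions, is to set aside a holdout slice that nobody tunes against and to watch the gap between the two scores. The hash-based split, the 20% fraction, the 5-point gap threshold and the example IDs are all illustrative choices, not a prescribed method.

```python
import hashlib


def is_holdout(example_id: str, holdout_fraction: float = 0.2) -> bool:
    """Assign an example to the holdout set based on a stable hash of its ID.

    Hashing keeps the assignment fixed across runs and machines, so an
    example never drifts from holdout into the set you tune against.
    """
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1]
    return bucket < holdout_fraction


def split(example_ids):
    """Return (dev, holdout): tune prompts on dev, only ever measure on holdout."""
    dev = [i for i in example_ids if not is_holdout(i)]
    holdout = [i for i in example_ids if is_holdout(i)]
    return dev, holdout


def looks_overfit(dev_score: float, holdout_score: float, max_gap: float = 0.05) -> bool:
    """Flag a suspicious gap between the tuned-against set and the unseen set."""
    return dev_score - holdout_score > max_gap


# Hypothetical example IDs.
ids = [f"case-{n:03d}" for n in range(200)]
dev, holdout = split(ids)
print(len(dev), "dev examples,", len(holdout), "holdout examples")

print(looks_overfit(0.99, 0.81))  # True: 99% on dev, 81% on unseen data
print(looks_overfit(0.90, 0.89))  # False
```

A holdout set only stays honest while it stays unseen, so it needs refreshing from new production cases just like the main set.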

The line doesn't run between "evals always" and "evals never" – it runs along the stakes. Proportional rigour, not ceremony. A customer-critical AI feature with revenue or trust riding on it deserves a serious eval suite. An internal helper deserves a sceptical glance and that's it. The skill lies in judging honestly which category you're in.

How to Start

The most common reason teams have no evals isn't resistance – it's the belief that this is a huge project. It isn't. Getting started is small and concrete:

Begin with 20 to 50 real examples. Not invented, not idealised. Draw on real or realistic inputs and write down, for each, what a good answer looks like. That act alone – articulating what you expect – sharpens your understanding of the feature more than any demo.
Score them. For structured tasks, a hard assertion. For open-ended tasks, a human rating in a spreadsheet to begin with, then an LLM judge with clear criteria. It doesn't need to be perfect, only consistent.
Wire it into CI. The decisive step: the eval runs automatically on every prompt change, every model swap, every pull request. Just like the unit tests. A change that drops the hit rate becomes visible before it ships – not after.
Grow it from production failures. Every real failure in production is a gift: it becomes the next eval case. The dataset grows organically in exactly the direction where the system is weak, and every bug you fix once is caught by the suite if it ever comes back.
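The last two steps can be sketched together: a golden set stored as JSON Lines, a gate that returns a process exit code for CI, and a helper that appends a production failure as a new case. The file name, the 90% threshold and `toy_system` are assumptions, and the demo writes to a temporary directory so it runs anywhere.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def add_case(path: Path, input_text: str, expected: str) -> None:
    """Append one case (for example a production failure) to the golden set."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"input": input_text, "expected": expected}) + "\n")


def load_cases(path: Path) -> List[dict]:
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def run_gate(system: Callable[[str], str], cases: List[dict], min_score: float) -> int:
    """Return an exit code: 0 if the score meets the threshold, 1 otherwise.

    In a CI job this value would be passed to sys.exit() so a drop fails the build.
    """
    passed = sum(system(c["input"]) == c["expected"] for c in cases)
    score = passed / len(cases)
    print(f"eval score: {score:.0%} on {len(cases)} cases (threshold {min_score:.0%})")
    return 0 if score >= min_score else 1


# Hypothetical stand-in for the real feature.
def toy_system(text: str) -> str:
    return "complaint" if "broken" in text.lower() else "question"


with tempfile.TemporaryDirectory() as tmp:
    golden = Path(tmp) / "golden.jsonl"
    add_case(golden, "My order arrived broken.", "complaint")
    add_case(golden, "When do you open on Sundays?", "question")
    exit_before = run_gate(toy_system, load_cases(golden), min_score=0.9)

    # A failure reported from production becomes the next eval case.
    add_case(golden, "I was charged twice for one order!", "complaint")
    exit_after = run_gate(toy_system, load_cases(golden), min_score=0.9)

print(exit_before, exit_after)  # 0 1
```

Once the production failure is in the file, the gate goes red until the system handles it, and it goes red again if a later change reintroduces the problem.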

After a few weeks you'll have, almost as a side effect, a regression suite that has your back. And the feeling at the next model switch is entirely different – not "let's hope nothing breaks" but "the numbers say it got better."

Conclusion

Serious AI work is engineering, not prompt roulette. The difference between a team that runs an AI feature reliably in production and one that lurches from crisis meeting to crisis meeting is rarely the model. It's whether anyone is measuring.

A demo produces a feeling. An eval produces a number – and only numbers can be compared, tracked, and improved. Without them every team is flying blind: it can't catch regressions, can't swap models safely, can't improve prompts systematically, and in the end can't even say whether the product is better today than it was last week.

Evaluation-driven development is to AI features what tests are to software: not a luxury, but the foundation you need before you can build anything serious at all. You don't have to start big. But you do have to start measuring. Everything else is hope – and hope is no strategy for a product people are meant to trust.