Ryan de Melo

Eval-Driven Development: How I Actually Build LLM Features Now

I do not write the prompt first anymore. I write the eval.

Two years ago I argued that evals were the new unit tests and that most teams shipping LLM features were shipping without any. That was the diagnosis. This is what the treatment turned into once I actually lived with it day to day. The slogan grew up into a workflow, and the workflow is closer to test-driven development than anything I expected: red, green, refactor, except the red is a model getting an answer wrong on a case I care about.

Here is the loop. Curate a small set of real cases. Write the grader before the feature. Run the whole set on every change. When production fails, the failure becomes a new case and never leaves. That is the entire method, and the discipline is in refusing to skip the first step when you are excited and the demo already works.

Write the eval first, and watch what it does to you

The reason to write the eval first is the same reason you write the test first. It forces you to say what “correct” means before you have fallen in love with an output that happens to look correct. The moment you have a fluent answer on your screen, your standards quietly drop to “well, that reads fine.” Writing the grader first locks the bar in place while you are still honest.

So a feature starts as a file of cases and a function that judges them. Not a prompt. Not a chain. A judgement.

# eval_extract_invoice.py
# Each case is a real document we got wrong, or a real one we must not regret.
# id is stable so a case can be tracked across months. Do not renumber these.

CASES = [
    {
        "id": "inv-0007",
        "doc": "fixtures/two_currencies_one_total.txt",
        "expect": {"total": "1840.00", "currency": "EUR"},
        # caught in prod: model summed a USD line into a EUR invoice
        "tags": ["currency", "regression"],
    },
    {
        "id": "inv-0012",
        "doc": "fixtures/handwritten_total_struck_through.txt",
        "expect": {"total": "920.50", "currency": "GBP"},
        "tags": ["ocr-noise"],
    },
]

def grade(case, output):
    want = case["expect"]
    # exact match on money. "close enough" on a total is how you lose trust.
    if output.get("total") != want["total"]:
        return 0.0, f"total {output.get('total')} != {want['total']}"
    if output.get("currency") != want["currency"]:
        return 0.0, f"currency {output.get('currency')} != {want['currency']}"
    return 1.0, "ok"

The first time I run this, the feature does not exist and every case scores zero. That is the red. The job is now narrow and concrete: make these specific cases pass without breaking the ones that already passed. I write the prompt next, then the code that wires it up, then I run the set again and watch the score climb. The model stops being a magic box and becomes a thing I am fitting to a fixed target.

The cases are the asset, and small is the point

The temptation is to chase a thousand cases and a dashboard. Resist it. A few dozen cases that each represent a real failure mode will teach you more than a thousand synthetic ones that all test the happy path in slightly different clothes. I keep the set small enough that I can read every case in one sitting and remember why it is there.

Each case earns its place by being a thing that actually went wrong, or a thing that would be expensive to get wrong. The invoice that mixed two currencies. The support reply that leaked a customer’s full name into a summary. The classification that was confidently in the wrong category for a regulated workflow. (The synthetic cases I generated early on mostly tested whether the model could read English, which it can, so they told me nothing.)

Here is the part nobody tells you. The eval set is not a measurement tool you build once. It is a living record of every mistake the system has ever made in front of a real user, and its value is almost entirely in the cases you added the hard way. A green run on a set that has never met production is a green light on an empty road.
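
One way to make "the failure becomes a new case and never leaves" mechanical is a small helper that appends to the case list. This is a minimal sketch, not the tooling behind this post: the name add_regression_case, the "note" field, and the example id and fixture path are all hypothetical. It assumes cases are dicts shaped like the ones in CASES above.

```python
def add_regression_case(cases, case_id, doc_path, expect, note, tags=()):
    """Append a production failure to the case list.

    Ids are stable and never reused, so a duplicate id is an error
    rather than a silent overwrite.
    """
    if any(c["id"] == case_id for c in cases):
        raise ValueError(f"case id {case_id!r} already exists; ids are never reused")
    case = {
        "id": case_id,
        "doc": doc_path,
        "expect": dict(expect),
        "note": note,  # why the case exists, in one line (hypothetical field)
        "tags": sorted(set(tags) | {"regression"}),
    }
    cases.append(case)
    return case


# Hypothetical usage: a credit note whose sign was dropped in production.
cases = []
add_regression_case(
    cases,
    "inv-0031",
    "fixtures/credit_note_negative_total.txt",
    {"total": "-120.00", "currency": "EUR"},
    note="model dropped the minus sign on a credit note",
    tags=["sign"],
)
```

Refusing a duplicate id is the design choice that matters here: it keeps the "do not renumber these" rule enforced by code instead of by memory.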

Run it on every change, automatically

The loop only works if the cost of running it is near zero, so I make it a script that runs on every change to the prompt, the parsing, the model version, the retrieval, anything. Same input, every commit. When a number drops, I know which change did it, because nothing else moved.

def run_suite(generate):
    # generate(doc) -> dict, the actual feature under test.
    # a crash is recorded as a fail: a case that scores 0 with the exception
    # as its reason is more honest than one that skips quietly, and one bad
    # case does not stop the rest of the suite from running.
    results = []
    for case in CASES:
        try:
            with open(case["doc"], encoding="utf-8") as f:
                doc = f.read()
            score, why = grade(case, generate(doc))
        except Exception as exc:
            score, why = 0.0, f"crashed: {exc!r}"
        results.append((case["id"], score, why))

    passed = sum(s for _, s, _ in results)
    print(f"{passed:.0f}/{len(results)} passed")
    for cid, score, why in results:
        if score < 1.0:
            print(f"  FAIL {cid}: {why}")     # the failures are the report
    return passed / len(results)

That prints the failures, not the passes, because the passes are not news. I have watched a “let me just try the newer model” change quietly tank three currency cases while improving the overall number, and the only reason I caught it was that the per-case output put the regressions in my face. The aggregate score lies by averaging. The case list does not.
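
The "aggregate improved while specific cases regressed" situation is easy to surface if each run is kept as a mapping from case id to score and two runs are compared case by case. The sketch below is an illustration under assumptions: diff_runs is a hypothetical helper, the scores are invented for the example, and the ids inv-0015 and inv-0019 are not from the case list above. It also assumes per-case scores are saved as a dict, which run_suite as written does not do.

```python
def diff_runs(baseline, candidate):
    """Compare two runs given as {case_id: score}.

    Returns (regressed, fixed): ids whose score dropped, and ids whose
    score rose. A case missing from a run is treated as scoring 0.
    """
    regressed = sorted(
        cid for cid, old in baseline.items() if candidate.get(cid, 0.0) < old
    )
    fixed = sorted(
        cid for cid, new in candidate.items() if new > baseline.get(cid, 0.0)
    )
    return regressed, fixed


# Invented scores: the candidate fixes three cases and breaks one.
baseline = {"inv-0007": 1.0, "inv-0012": 0.0, "inv-0015": 0.0, "inv-0019": 0.0}
candidate = {"inv-0007": 0.0, "inv-0012": 1.0, "inv-0015": 1.0, "inv-0019": 1.0}

regressed, fixed = diff_runs(baseline, candidate)
print(f"aggregate: {sum(baseline.values()):.0f} -> {sum(candidate.values()):.0f}")
for cid in regressed:
    print(f"  REGRESSED {cid}")
```

With these numbers the aggregate goes from 1 to 3 while inv-0007, the currency regression, goes from passing to failing. The total alone would read as a clean win.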

Where the model judge helps, and where it quietly lies

Not everything has a string you can match. “Is this summary faithful to the source.” “Is this tone appropriate for a customer.” For those I use a model as the grader, and I want to be honest about how often that goes wrong, because the LLM-as-judge pattern gets sold as a free lunch and it is not.

A model judge works when you give it a narrow, almost mechanical question and the evidence to answer it. It misleads when you ask it for taste. Ask “is this good” and you will get a confident, well-argued score that drifts with the phase of the moon. Ask “does every claim in the summary appear in the source, yes or no, and quote the one that does not,” and you get something you can almost trust.

def judge_faithful(source, summary):
    # narrow question, evidence required, no "rate it 1-10" mush.
    # the judge must point at the unsupported claim or shut up.
    prompt = f"""Compare the summary to the source.
Find any claim in the summary not supported by the source.
Reply JSON: {{"faithful": true|false, "unsupported": "<exact quote or empty>"}}

SOURCE:
{source}

SUMMARY:
{summary}"""
    # call_judge is assumed to return the reply already parsed into a dict.
    verdict = call_judge(prompt)        # cheap model is fine for a yes/no
    # spot-check: every few weeks I hand-grade 20 judged cases.
    # the day the judge and I disagree more than I like, the judge is fired.
    return verdict["faithful"], verdict["unsupported"]

The trap I fell into early was trusting the judge the way I trust the exact-match grader. They are not the same animal. A string comparison is right by construction. A model judge is a second model with its own failure modes, and if you never check it against your own judgement, you have built a system that grades itself and reports back that it is doing wonderfully. I now keep a slice of judged cases that I hand-grade on a schedule, and the judge stays only as long as it agrees with me. The moment it drifts, it is overruled and rewritten.
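
The scheduled hand-grading check can be reduced to one number plus a list of cases to reread. This is a rough sketch under assumptions: judge_agreement is a hypothetical helper, verdicts are stored as booleans keyed by case id, and the 0.9 threshold and the sum-* ids are placeholders for illustration, not recommendations or real cases.

```python
def judge_agreement(hand_labels, judge_labels):
    """Compare hand-graded verdicts with the model judge's verdicts.

    Both arguments map case id -> bool (True means "faithful").
    Returns (agreement_rate, disagreeing_ids) over the ids present in both.
    """
    shared = sorted(set(hand_labels) & set(judge_labels))
    if not shared:
        raise ValueError("no overlapping cases to compare")
    disagreements = [cid for cid in shared if hand_labels[cid] != judge_labels[cid]]
    return 1 - len(disagreements) / len(shared), disagreements


# Invented verdicts: the judge waved through one summary a human rejected.
hand = {"sum-01": True, "sum-02": False, "sum-03": True, "sum-04": True}
judge = {"sum-01": True, "sum-02": True, "sum-03": True, "sum-04": True}

rate, disagreeing = judge_agreement(hand, judge)
MIN_AGREEMENT = 0.9  # placeholder threshold, pick your own
if rate < MIN_AGREEMENT:
    print(f"judge agrees on {rate:.0%} of hand-graded cases; review {disagreeing}")
```

The list of disagreeing ids matters more than the rate. Rereading those cases is what shows whether the judge drifted or the hand grade was wrong.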

Where eval-driven development is overkill

It is not always worth it. If the feature is a one-off internal helper that three people use and nobody is hurt when it is wrong, writing a grader first is ceremony. If you genuinely cannot define correct yet, because the product question is still open, an eval will just freeze a wrong target and give you false confidence that you are converging on it. Evals are for features you will maintain, that face users, where a regression is expensive and silent. The whole point is to catch the regression you cannot see by reading one good output. If there is no such risk, skip the ritual and ship.

But that describes fewer features than people think it does, and almost no feature that touches money or a customer’s trust. For those, I will not start the prompt until the grader exists. The 2024 version of me said evals were the new unit tests. The 2026 version stopped treating them as a thing you add after, and started treating them as the thing you write first. That is the only change, and it changed everything downstream of it.

What was the last LLM feature you shipped without a single case you could fail it on, and how would you know if it broke tomorrow?

