My eval framework accused my agent of hallucinating. The bug was in the judge.
I built agent-eval-harness to answer a question most agent projects never ask out loud: does the multi-agent architecture actually beat a simpler baseline, measured honestly? It benchmarks my wayfinder Supervisor against a single-agent ReAct loop — same model, same five MCP tools, 40 tasks across 10 Python OSS repos — so the comparison isolates orchestration, not tooling.
One of its four metrics is citation grounding: the share of code symbols an agent cites that actually exist in the repository. It's the anti-hallucination metric. An agent that says "look at parse_args in cli.py" when no such function exists should pay for it.
Midway through analyzing the full 40-task run, wayfinder's citation score came back at 0.37. That number didn't smell right — this is a system whose entire design refuses to name symbols it can't ground in AST evidence. Either my agent was quietly worse than its architecture promised, or my measurement was wrong.
The bug
The measurement was wrong. The RepoSymbolResolver that powers the metric only credited top-level def and class names as "real" symbols. So when an agent cited perfectly legitimate attribute and method references — self.callback, ctx.params — the resolver couldn't find them among top-level names, and scored them as hallucinations.
Read that again from the judge's perspective: the agent was being penalized for precision. Citing the actual attribute a maintainer would grep for got scored as fabrication, while a vaguer citation of the enclosing function would have passed. The metric was optimizing against exactly the behavior I wanted.
The fix credits a dotted reference when the attribute genuinely occurs in the repo source — while preserving the anti-hallucination guarantee, because an invented attribute still appears nowhere and still fails.
Why the fix cost zero API dollars
Here's where an early design decision paid for itself. The harness has a strict run/score split: agent runs — the expensive part; a ReAct loop burns ~10× the tokens — are persisted to <arch>.runs.jsonl, and metrics are pure functions over those persisted traces. Fixing the resolver and re-scoring was a local, offline operation:
agent-eval rescore --runs-dir runs/full_v1 --dataset datasets/full_v1.jsonl
wayfinder's citation grounding went from 0.37 to ~0.80 without re-running a single agent. No re-spend, no fresh nondeterminism muddying the comparison — the same frozen traces, measured correctly this time.
What I took away
- Evals are code, and code has bugs. If a score surprises you, suspect the judge with the same energy you'd suspect the agent. A benchmark number you never audited is a rumor with decimals.
- Separate the expensive thing from the cheap thing. Running agents is costly and nondeterministic; scoring is neither. Persisting raw traces makes every future metric fix retroactive and free.
- Honest reporting compounds. The final report still shows the ReAct baseline beating wayfinder on raw answer quality (0.70 vs 0.48 factual) while failing 6/40 tasks at ~12× the token cost. Keeping the unflattering number in the headline is what makes the flattering ones believable.