The Evaluator Problem: How Self-Improving AI Systems Game Their Own Judges

David Borish

When AI Judges AI, the Results Are Skewed

AI systems that evaluate other AI systems have a measurable preference for AI-generated content. A preprint from researchers at the University of Cambridge, NVIDIA, and Flower Labs quantifies this directly: in paper-reviewing experiments, baseline AI reviewers accepted AI-generated papers at 1.42x to 1.91x the rate they accepted human-written ones. The bias wasn't subtle, and it wasn't correctable by swapping in a better fixed reviewer.

This finding sits at the center of a broader problem in AI self-improvement research. Systems that improve themselves by editing their own code and retaining versions that score better on benchmarks have reached impressive performance on coding and reasoning tasks. But they depend on a fixed evaluation criterion held outside the improvement loop. The benchmark doesn't change as the agent improves, which leaves these systems exposed to reward hacking: agents that learn to satisfy the metric rather than genuinely improve at the task.

The Red Queen Gödel Machine (RQGM) takes its name from biologist Leigh Van Valen's Red Queen hypothesis, which holds that species must continually adapt just to maintain fitness relative to competitors doing the same. The system treats evaluation as part of the improvement process itself: instead of holding the benchmark fixed, agents and evaluators improve together across a structured sequence of epochs.

The Stationary Benchmark Problem

Current state-of-the-art self-improving agents, including the Darwin and Huxley Gödel Machines and the HyperAgents framework that preceded this work, rely on evaluation criteria that remain fixed for the duration of a run. This works reasonably well when the task has an objective verifier, such as a coding benchmark that checks whether a program passes its tests. It breaks down in three situations the researchers identify: when no direct benchmark exists for the task, as is the case for paper writing and mathematical proof writing; when evaluation is slow or only weakly informative; and when static benchmarks saturate or become susceptible to gaming as agents improve.

The saturation problem is particularly relevant now, as self-improving agents routinely approach the ceiling of their benchmarks. A frozen evaluator that can no longer discriminate between strong and weak agents stops providing useful signal. In the proof-writing experiments, fixed-evaluator baselines stagnated at longer training horizons, while co-evolved systems continued improving.

Co-Evolution and Controlled Utility Evolution

The RQGM's core mechanism is what the researchers call controlled utility evolution. Search runs in fixed-evaluator epochs: within each epoch, one evaluator is frozen and grades every task agent, providing a stable signal. At the boundary between epochs, the system checks whether any challenger evaluator, developed in parallel during the search, outperforms the incumbent on a held-out ground-truth dataset. If one does, it replaces the incumbent, and records scored by the displaced evaluator are erased so the new evaluator can re-rank agents on its own terms.
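To make the epoch boundary concrete, here is a minimal sketch of the check-and-replace step described above. It is an illustration under stated assumptions, not the researchers' implementation: the names `Archive`, `anchor_accuracy`, and `epoch_boundary` are hypothetical, and so are the choices to measure an evaluator by plain accuracy on the anchor set and to treat a score of 0.5 or higher as an accept.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# All names are hypothetical; the article does not show the preprint's interfaces.
Evaluator = Callable[[str], float]   # maps an item to a score in [0, 1]
Anchor = List[Tuple[str, bool]]      # held-out (item, ground-truth label) pairs


@dataclass
class Archive:
    # agent id -> (name of the evaluator that graded it, score)
    records: Dict[str, Tuple[str, float]] = field(default_factory=dict)

    def erase_scored_by(self, evaluator_name: str) -> List[str]:
        """Delete records graded by the named evaluator; return the affected agent ids."""
        stale = [agent for agent, (name, _) in self.records.items() if name == evaluator_name]
        for agent in stale:
            del self.records[agent]
        return stale


def anchor_accuracy(evaluator: Evaluator, anchor: Anchor) -> float:
    """Share of anchor items classified correctly (assumption: score >= 0.5 means accept)."""
    correct = sum((evaluator(item) >= 0.5) == label for item, label in anchor)
    return correct / len(anchor)


def epoch_boundary(
    incumbent: Tuple[str, Evaluator],
    challengers: Dict[str, Evaluator],
    anchor: Anchor,
    archive: Archive,
) -> Tuple[str, Evaluator]:
    """Promote the best challenger if it beats the incumbent on the anchor set."""
    best_name, best_eval = incumbent
    best_acc = anchor_accuracy(best_eval, anchor)
    for name, challenger in challengers.items():
        acc = anchor_accuracy(challenger, anchor)
        if acc > best_acc:  # strict inequality: ties keep the incumbent
            best_name, best_eval, best_acc = name, challenger, acc
    if best_name != incumbent[0]:
        # Selective erasure: drop scores produced by the displaced evaluator
        # so the new evaluator can re-rank those agents on its own terms.
        archive.erase_scored_by(incumbent[0])
    return best_name, best_eval


# Toy usage with made-up evaluators and a two-item anchor set.
lenient = lambda item: 1.0
strict = lambda item: 1.0 if "sound" in item else 0.0
anchor = [("sound proof", True), ("flawed proof", False)]
archive = Archive({"agent_a": ("lenient", 0.9), "agent_b": ("lenient", 0.4)})
new_name, _ = epoch_boundary(("lenient", lenient), {"strict": strict}, anchor, archive)
```

In this toy run the strict evaluator wins on the anchor set, so it becomes the incumbent and both records graded by the lenient one are erased. Within an epoch nothing of the kind happens, which is what keeps the signal stable.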

This selective erasure turns out to be essential. When the researchers ran a control condition that kept stale scores in the archive rather than erasing them, the ranking system stayed pinned to the old evaluator's ordering and the new, stricter criterion couldn't reshape the search. With erasure, each evaluator replacement permanently re-ranked the archive. Spearman correlation analyses tracking agent rankings after each replacement showed that the reordering was substantial and didn't recover toward the original order, meaning each later evaluator enforced a genuinely stricter standard.
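For readers who want to see what the ranking comparison measures, the sketch below computes a Spearman rank correlation between agent rankings before and after an evaluator replacement. The scores are invented for illustration and the helper names are hypothetical; the formula used is the standard one for rankings without ties, and nothing here reproduces the paper's data.

```python
from typing import Dict


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank agents by score, 1 = best. Assumes no two agents share a score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {agent: position + 1 for position, agent in enumerate(ordered)}


def spearman(before: Dict[str, float], after: Dict[str, float]) -> float:
    """Spearman correlation for tie-free rankings: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    if set(before) != set(after) or len(before) < 2:
        raise ValueError("need the same agents in both rankings, and at least two")
    rank_before, rank_after = ranks(before), ranks(after)
    n = len(before)
    d_squared = sum((rank_before[a] - rank_after[a]) ** 2 for a in before)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Made-up scores for four agents under the old and the new evaluator.
old_scores = {"a": 0.9, "b": 0.7, "c": 0.5, "d": 0.3}
new_scores = {"a": 0.2, "b": 0.8, "c": 0.6, "d": 0.4}

rho = spearman(old_scores, new_scores)  # about -0.2 for this toy data
```

A value near 1 would mean the new evaluator largely preserved the old ordering; a value near 0 or below, as in this toy case, means the archive was substantially reordered.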

The framework also takes advantage of a structural feature of how the agents are organized. In the coding experiments, 90% of accepted code changes modified functionality shared by both the coder and the reviewer, rather than role-specific code. A single patch improved both roles simultaneously, which is why co-evolution tends to enrich the search rather than split effort across competing objectives.

Results Across Three Domains

The researchers tested the RQGM across coding (the Polyglot benchmark), scientific paper writing and reviewing, and Olympiad-level mathematical proof writing and grading. Each domain pairs a generator role with a learned evaluator and a ground-truth anchor that remains fixed throughout the run.

On coding, where objective tests already exist, adding a co-evolved code reviewer still improved performance. The RQGM reached a held-out pass rate of 71.7%, above the prior state of the art of 69.9%, while using 1.35x to 1.72x fewer tokens. The efficiency gain comes from the reviewer querying agents only once during code generation, whereas standard coding agents require multi-turn evaluation. The researchers note that the reviewer's signal complements test execution rather than replacing it.

Paper writing produced the most dramatic results. Co-evolved writers achieved average acceptance rates of 38.8% (generalist) to 40.5% (specialist) across a panel of four independent reviewers, compared to 21.8% for the prior state-of-the-art fixed-evaluator baseline. At matched compute cost, the RQGM writer already achieved 1.78x the baseline's acceptance rate. The specialist, trained for longer, reached 1.86x.

Mathematical proof writing showed more modest gains at shorter training horizons, with divergence appearing over time. The RQGM specialist prover reached a mean score of 4.33 out of 7 on International Mathematical Olympiad problems, compared to 3.73 for the fixed-evaluator baseline and 4.07 for a human-engineered verification pipeline that achieved gold-medal performance at IMO 2025. The RQGM specialist outperformed that system on near-complete proofs (scoring 6 of 7) but conceded ground on full solutions. The researchers attribute the remaining gap to search budget constraints.

Correcting the Bias Problem

The self-preference bias finding is worth examining in detail because it illustrates both the problem and the intervention. AI language models used as judges tend to accept AI-generated text more readily than human-written text. In the paper-reviewing experiments, this bias operated in the same direction as raw benchmark accuracy: accepting more papers in general looked like better performance on the metric. A reviewer that was lenient toward AI-generated papers appeared to perform well while providing a weak and gameable signal to the writer it was supposed to evaluate.

The RQGM addressed this by introducing an adversarial objective at an epoch boundary. After the first evaluator replacement, papers that the displaced reviewer had accepted formed an adversarial pool. The next epoch rewarded evolved reviewers specifically for rejecting those papers while maintaining accuracy on held-out human papers. The resulting reviewer accepted AI-generated and human-written papers at similar rates while retaining 80% ground-truth accuracy, positioned between the over-lenient baseline and an over-harsh fixed reviewer from prior work.
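One way to picture that two-part objective is as a single fitness score for candidate reviewers. The sketch below is an assumption-heavy illustration: the article does not say how the two terms are combined, so the equal weighting, the function name `reviewer_fitness`, and the toy papers are all hypothetical.

```python
from typing import Callable, List, Tuple

Reviewer = Callable[[str], bool]  # True means "accept" (hypothetical interface)


def reviewer_fitness(
    accepts: Reviewer,
    adversarial_pool: List[str],
    human_anchor: List[Tuple[str, bool]],
    weight: float = 0.5,  # assumed weighting; not taken from the paper
) -> float:
    """Reward rejecting the adversarial pool while staying accurate on human papers."""
    rejection_rate = sum(not accepts(p) for p in adversarial_pool) / len(adversarial_pool)
    accuracy = sum(accepts(p) == label for p, label in human_anchor) / len(human_anchor)
    return weight * rejection_rate + (1 - weight) * accuracy


# Made-up data: papers the displaced reviewer accepted, plus labeled human papers.
adversarial_pool = ["ai: fluent but unsupported", "ai: polished, no evidence"]
human_anchor = [
    ("human: rigorous result", True),
    ("human: rigorous analysis", True),
    ("human: unsupported claim", False),
    ("human: missing experiments", False),
]

over_lenient = lambda paper: True
over_harsh = lambda paper: False
calibrated = lambda paper: "rigorous" in paper

scores = {
    "over_lenient": reviewer_fitness(over_lenient, adversarial_pool, human_anchor),
    "over_harsh": reviewer_fitness(over_harsh, adversarial_pool, human_anchor),
    "calibrated": reviewer_fitness(calibrated, adversarial_pool, human_anchor),
}
```

The accuracy term is what stops a reviewer from winning simply by rejecting everything: in this toy setup the over-harsh reviewer beats the over-lenient one but still falls short of the calibrated one.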

The correction required the epoch structure. A fixed evaluator cannot accumulate evidence from a prior epoch and replay it as an adversarial objective in the next. The temporal design of controlled utility evolution made the intervention possible.

A Curriculum That Doesn't Scatter the Search

One of the more unexpected findings is how evaluator replacements shape the search over time. Each time a new, typically stricter evaluator replaces the incumbent, it re-ranks the agent archive. Agents that performed well under the old evaluator may lose ground; agents the old evaluator ranked poorly may move up. The researchers describe this as a curriculum-like effect: each generation of agents faces a more demanding judge, and the population adapts accordingly.

Crucially, the curriculum doesn't scatter the search. The strongest lineage, the chain of improvements leading to the best-performing agent, remained resilient across evaluator replacements. Re-ranking pulled different candidates into contention without disrupting that backbone. In the researchers' description, the curriculum raises the population around a stable backbone instead of spreading it across competing directions.

What the System Actually Learned to Do

The appendix of the paper provides a close look at what the co-evolution discovered, and it's instructive. The evolved code reviewer progressively narrowed what counted as a failure, moving toward evidence-bounded review: a patch could only fail if a specific defect was demonstrable from the diff itself, not inferred from missing project context. The co-evolved proof grader first introduced a conservative rubric-aware grading system, then calibrated it more generously, learning not to penalize terse or compressed proofs that cited standard theorems without spelling out every step.

One of the most consequential changes in the proof-writing experiments wasn't a prompt rewrite. A single agent, labeled node 66, disabled an inherited self-revision feature on the grounds that the revision step could degrade a correct, detailed proof into a weaker answer relying on an unproved named theorem. That three-line code change propagated forward through subsequent generations. In effect, the system had recognized, through accumulated evidence, that revision was actively harmful to the proofs it was producing.

Limitations and What Remains Open

The researchers are transparent about what this work does and doesn't establish. This is a preliminary preprint based on short search horizons, with all main experiments using a single foundation model (GPT-5.5, low-tier compute setting). The theoretical guarantees hold within each epoch but don't cover long-term convergence or cumulative regret from erased evidence. No human grading of the generated papers or proofs was performed; the paper-reviewing results measure cross-reviewer acceptance behavior across an automated reviewer panel rather than objective scientific merit.

The anchor dependence is a structural limitation: if the ground-truth dataset used to select evaluators is weak or biased, the evaluators will drift accordingly. The researchers note this is most problematic in domains without fully verifiable ground truth, which describes most of the domains where co-evolution is most useful.

For enterprise AI practitioners, the most immediate practical implication is the self-preference bias finding. Any pipeline using AI systems to evaluate AI-generated outputs, whether that means reviewing code, scoring proposals, or assessing writing quality, faces the same structural problem the researchers quantified. The evaluator will tend to favor AI-generated content, and that bias compounds when the evaluator is also shaping what gets produced.
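A first-pass audit for this bias can be as simple as comparing acceptance rates on AI-generated and human-written items judged by the same evaluator. The sketch below shows the arithmetic only; the function names and the decision lists are hypothetical, and the ratio is meaningful only if the two sets are of comparable underlying quality, which is an assumption the audit itself cannot check.

```python
from typing import Sequence


def acceptance_rate(decisions: Sequence[bool]) -> float:
    """Fraction of items the evaluator accepted."""
    if not decisions:
        raise ValueError("no decisions to summarize")
    return sum(decisions) / len(decisions)


def self_preference_ratio(ai_decisions: Sequence[bool], human_decisions: Sequence[bool]) -> float:
    """Acceptance rate on AI-generated items divided by the rate on human-written items."""
    human_rate = acceptance_rate(human_decisions)
    if human_rate == 0:
        raise ValueError("evaluator accepted no human-written items; ratio is undefined")
    return acceptance_rate(ai_decisions) / human_rate


# Invented decisions from one evaluator: 6 of 10 AI items and 4 of 10 human items accepted.
ai_decisions = [True] * 6 + [False] * 4
human_decisions = [True] * 4 + [False] * 6

ratio = self_preference_ratio(ai_decisions, human_decisions)  # roughly 1.5 here
```

A ratio well above 1 on quality-matched sets is the pattern the researchers report for baseline reviewers, where it ranged from 1.42x to 1.91x.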

The broader question the RQGM raises is about where capability in AI actually comes from. The results here show that what gets produced at the end of a search process depends critically on what the search was optimizing against throughout. When the evaluation criterion evolves alongside the agent, inside a controlled loop with structured boundaries and selective memory, the agent that emerges is substantially different from one trained against a fixed target. That pattern, in which capability takes shape inside a controlled loop before anything reaches the outside world, is one that enterprise teams building on self-improving systems will need to account for as these architectures move from research toward deployment.
