Coding Agent Raises Its Own SWE-bench Score from 20% to 50% by Rewriting Its Own Code
- David Borish


A coding agent developed by researchers at the University of British Columbia, Vector Institute, and Sakana AI has demonstrated something that has been theorized for decades but rarely demonstrated empirically: a system that modifies its own source code, evaluates the results on real benchmarks, and uses successful modifications to become more effective at making future modifications.
The Darwin Gödel Machine (DGM), described in a paper first posted to arXiv in May 2025 and updated in March 2026, improved its performance on SWE-bench — a widely used benchmark requiring agents to resolve real-world GitHub issues — from 20% to 50% through autonomous self-modification over 80 iterations. On Polyglot, a multi-language coding benchmark, performance climbed from 14.2% to 30.7%. Neither gain came from researchers adjusting the agent's design. The agent did it by rewriting its own Python codebase.
What the System Actually Does
The DGM starts from a single base agent equipped with two tools: a bash executor and a file editor. At each iteration, an agent from the archive is selected and prompted to examine its own code, identify a promising improvement, implement it, and then run a subset of benchmark tasks to see whether the modification helped. Modified agents that meet a performance threshold move into a growing archive. The cycle then repeats, with future iterations drawing from any agent in the archive — not just the most recent one.
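The cycle described above can be sketched in a few lines of Python. This is an illustrative toy, not the paper's implementation: `ToyAgent`, `evaluate`, and the numeric "quality" score are placeholders standing in for a real coding agent and a real benchmark run.

```python
import random

class ToyAgent:
    """Stand-in for a coding agent; 'quality' is a proxy for benchmark skill."""
    def __init__(self, quality):
        self.quality = quality

    def self_modify(self):
        # A self-modification may help or hurt; only evaluation decides its fate.
        return ToyAgent(self.quality + random.uniform(-0.05, 0.1))

def evaluate(agent):
    """Stand-in for running a subset of benchmark tasks; returns a score in [0, 1]."""
    return max(0.0, min(1.0, agent.quality))

def dgm_loop(base_agent, iterations=80, threshold=0.2):
    archive = [base_agent]
    for _ in range(iterations):
        parent = random.choice(archive)   # sample any archived agent, not just the newest
        child = parent.self_modify()      # the agent rewrites its own design
        if evaluate(child) >= threshold:  # empirical validation, no formal proofs
            archive.append(child)
    return archive
```

The key structural point the sketch preserves: the parent is drawn from the whole archive, and a child enters the archive only after its modification has been measured, not argued for.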
The self-modification phase uses Claude 3.5 Sonnet. The evaluation phase uses Claude 3.5 Sonnet for SWE-bench tasks and o3-mini for Polyglot. Experiments ran for 80 iterations, with two or four iterations running in parallel depending on the benchmark. A single full SWE-bench run took approximately two weeks and cost an estimated $22,000 in API calls.
The improvements the system discovered on its own are concrete and varied. It added better code editing tools, developed long-context window management strategies, and implemented peer-review mechanisms in which the agent checks its own proposed changes before committing them. These were not features specified by the researchers. They emerged from the self-modification loop.
The Gödel Machine Problem, and How DGM Sidesteps It
The name references Jürgen Schmidhuber's theoretical Gödel Machine, proposed in 2007, which described a self-improving AI that modifies itself only when it can formally prove the modification is net beneficial. The practical problem with that design is that proving most changes are beneficial is computationally intractable in real-world settings. Adding a new tool to an agent, for instance, might seem obviously useful, but the actual effect depends heavily on the underlying model and the tasks it faces — a testing tool optimized for one configuration may actively confuse the agent in another.
DGM abandons the proof requirement and replaces it with empirical validation. Rather than proving that a change will help, the system tries the change and measures the result. This mirrors how biological evolution operates: modifications are not vetted in advance; they are trialed and selected based on observed outcomes.
The "Darwin" in the name refers specifically to the archive mechanism. Rather than a simple hill-climbing approach that always builds off the most recent and best-performing version, DGM maintains all generated variants, including less-performant ones, as potential stepping stones. This matters because the experiments showed that some earlier agents that were initially outperformed by successors later proved essential for discovering better downstream variants — paths that a strictly greedy search would have pruned.
What Ablations Show
The paper tests two stripped-down variants to isolate which components drive performance. A version without self-improvement — where the same base agent repeatedly generates new agents but cannot modify its own code — showed stagnation. A version without open-ended exploration — where the agent always builds from the most recent version rather than sampling from the full archive — also fell significantly short.
Both components were necessary. Self-improvement without diversity in the search process hits local optima. Diversity without self-improvement means the agents generating variants never get better at that task. The combination of recursive self-modification and archive-based exploration is what produced the gains documented in the paper.
Scope, Limitations, and What's Not Claimed
The paper is careful about what the DGM does and doesn't do. The system modifies the design of the coding agent — tool use, workflows, context management — but does not modify the underlying foundation model weights. The researchers explicitly note that rewriting training scripts or fine-tuning foundation models is technically within the framework's scope but was excluded from these experiments due to computational cost and added complexity. That work is left for future research.
The DGM's SWE-bench performance, while substantially improved from its starting point, still falls short of top closed-source solutions on the full benchmark. Those solutions are built by teams of expert engineers using carefully handcrafted techniques, and the authors note that current foundation models do not yet match that level of reasoning capability. Whether extended DGM runs would eventually close that gap is an open question.
The computational expense is a real constraint. A $22,000 API cost per full experiment limits who can run this work and how quickly the field can iterate on it. Reducing that cost is identified as necessary for broader adoption.
The paper also acknowledges a structural vulnerability: the entire process assumes that benchmark performance is a reliable proxy for actual coding capability. If a modification improves benchmark scores through narrow optimization rather than genuine skill, the self-improvement loop could drift in unproductive directions. The researchers describe this risk but note they did not observe clear evidence of benchmark gaming in the current experiments.
Safety Constraints
All experiments were conducted with sandboxed execution environments, preventing agents from making changes outside their isolated containers. Human oversight was maintained throughout. The paper includes a dedicated section on safety, written proactively — the authors say they wanted to raise awareness about self-improving AI systems before such systems become substantially more capable than current ones.
The safety discussion acknowledges that the current DGM operates within bounds partly set by the limitations of frontier models. As foundation models improve, a system with this architecture would become more capable of making meaningful modifications. The researchers advocate for ongoing work on alignment and interpretability in self-improving systems, noting that as agents self-modify, their internal logic can grow increasingly difficult to trace and understand.
Traceability of modifications is built into the current design. Every agent in the archive represents a documented step in the evolutionary tree, and the branching structure of that tree is fully logged. This is a design choice with safety implications: if something goes wrong, researchers can trace which modification introduced the problem and understand how it propagated.
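A minimal sketch of what such lineage logging might look like: each archived agent records its parent and the modification that produced it, so any agent's full ancestry can be replayed. The class and field names here are hypothetical, chosen for illustration.

```python
class ArchiveNode:
    """One agent in the evolutionary tree: the self-modification that
    produced it, its benchmark score, and a pointer to its parent."""
    def __init__(self, agent_id, parent=None, patch="", score=0.0):
        self.agent_id = agent_id
        self.parent = parent  # None for the base agent
        self.patch = patch    # description or diff of the self-modification
        self.score = score

def lineage(node):
    """Return (agent_id, patch, score) tuples from the base agent to
    this node, oldest first — the audit trail the article describes."""
    path = []
    while node is not None:
        path.append((node.agent_id, node.patch, node.score))
        node = node.parent
    return list(reversed(path))
```

With this structure, answering "which modification introduced the problem?" is a walk up the tree rather than a forensic reconstruction.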
Relevance to AI Development Timelines
The DGM touches directly on a question that has become increasingly concrete in AI research: at what point do AI systems become meaningfully capable of accelerating their own improvement? METR's time horizon benchmarks have tracked AI agents' ability to complete extended autonomous tasks, and the trajectory suggests agents are gaining ground on tasks that previously required sustained human effort. The DGM is a specific instance of that dynamic — an agent that now handles, autonomously, a task that has historically required significant human engineering expertise.
The researchers frame this as a step toward AI that can automate the process of advancing AI itself. That framing is not incidental. If agents can reliably improve their own architecture and tooling, the rate at which AI capabilities develop could accelerate independent of direct human research effort. The DGM doesn't demonstrate that at scale, but it demonstrates the mechanism.
The open-source release of the full codebase means the research community can build on and probe the design directly. That transparency, combined with the detailed safety documentation in the paper, suggests the team is attempting to model responsible development of a capability that carries genuine risk if mishandled.