Automated AI Research Obliterates the Benchmarks: Recursive's First Published Results

David Borish
2 days ago
6 min read

The Setup

Recursive, a startup focused on automated AI research, published benchmark results on June 11, 2026 from a system designed to run the full scientific loop without human intervention. The system picks a target objective, proposes modifications, implements them, runs experiments, validates the results against reward hacking, and feeds what it learns into the next round of search. It maintains context across many parallel research threads and can combine findings from separate branches when the evidence warrants it.

The company tested this system on three benchmarks chosen for practical relevance and reliable feedback: NanoChat Autoresearch (training the best possible small language model within a fixed compute budget), NanoGPT Speedrun (training a small model to a fixed quality threshold as fast as possible), and SOL-ExecBench (optimizing GPU kernels toward hardware performance limits). All three have clear metrics, relatively low variance, and existing state-of-the-art baselines to beat.

Recursive open-sourced the artifacts from these runs so others can inspect the solutions directly.

Beating a Community of Humans and Agents

The NanoChat benchmark, originally created by Andrej Karpathy, tasks a system with training a small language model to the lowest possible validation loss, measured in bits per byte (BPB), within a five-minute window on a single GPU. A public collaborative effort called autoresearch@home had already extended this setup into a community competition, with dozens of humans and hundreds of AI agents collectively optimizing solutions.

Recursive's system started from the same initial seed used by the community. After identifying and removing minor reward hacks in the previous best solution and evaluating over ten random seeds, the community's best result sat at 0.9372 BPB. The Recursive system found a solution reaching 0.9109 BPB, a 0.0263 BPB improvement. Translated into training time, the Recursive solution reaches the quality of Karpathy's original overnight run in roughly 1.3 times fewer compute-seconds than the community's best approach.

The company also tested whether the same system could make progress starting from a much weaker baseline: a naive vanilla Transformer with AdamW, the most generic possible starting point. From there, the system improved performance from 1.059 BPB to 0.9344 BPB, again outperforming the best community solution. The final vanilla Transformer solution differed in several concrete ways from the community's best, using a different combination of techniques even where it converged on similar ideas.

The improvements were not driven by a single trick. The best solutions combined changes across architecture, memory mechanisms, auxiliary losses, attention, optimizer behavior, weight decay schedules, and compiler settings. One of the larger gains came from extending the model's short-context memory. The baseline used value embeddings; the system added hashed bigram and trigram embedding tables, mixed into the attention value path through learned gates. This lets the model cheaply incorporate local n-gram patterns without the computational cost of convolutional or heavy-attention alternatives. Using different hash functions per layer reduced the likelihood of identical collisions propagating through the network.

Gains on a Two-Year-Old Leaderboard

NanoGPT Speedrun is harder to move. The benchmark measures how quickly a small GPT-style model can train to a validation loss of 3.28 on the FineWeb text dataset using eight H100 GPUs. It has 83 human record-setting contributions and hundreds of proposed pull requests, with training time dropping from roughly 45 minutes in mid-2024 to 79.7 seconds by the time Recursive ran its system. At that point, the community had spent two years removing obvious improvements.

Starting from the current leading 79.7-second solution, Recursive's system reduced training time to 77.5 seconds while still meeting the leaderboard's validation-loss significance requirement. The improvement is comparable in magnitude to recent human contributions, which have become progressively smaller as the solution space narrows.

The changes the system made were specific and technical. It pushed FP8 (8-bit floating point) precision into the attention projection layers, running forward passes in float8_e4m3 for double tensor-core throughput while keeping the backward pass in bfloat16 for numerical stability. It modified the NorMuon optimizer to inject annealed Gaussian exploration noise, warming it up over the first 50 training steps and then reducing it to zero a quarter of the way through training, pushing the optimizer toward flatter loss basins. It applied a "cautious" masking technique to the Adam update for embedding tables specifically, blocking parameter updates that point opposite to the raw gradient. It also rewrote a fused GPU kernel so the backward pass reconstructs intermediate activations on the fly rather than storing them, eliminating a round-trip to high-bandwidth GPU memory.

Recursive also tested whether the system could compress years of community progress from a much earlier starting point. Beginning from a roughly 15-minute solution, the system reached approximately 185 seconds in a few days, near the human leaderboard's May 2025 benchmark of around 180 seconds. The system found a different route to that performance level, including a technique called "stitched-stream attention" that packs eight short training sequences into a single long context, and a per-layer window pyramid where most layers attend to nearby tokens while a few look further back.

GPU Kernels

The third benchmark is different in kind. SOL-ExecBench, developed by NVIDIA, contains 235 tasks focused on writing fast GPU kernels for real computational workloads: matrix multiplications, reductions, normalization layers, attention components, quantization routines, and fused operations. Each task provides a reference PyTorch implementation and asks for a functionally equivalent kernel that runs faster on Blackwell B200 GPUs. Performance is measured as a fraction of the hardware's theoretical maximum, with 0.5 representing a well-optimized PyTorch baseline and 1.0 representing the analytical hardware limit.

Recursive ran its system across all 235 kernels jointly, so the system could transfer patterns across related tasks. The system had access to standard profiling tools but received no specialized kernel engineering guidance. Starting from the previous leaderboard best of 0.699, the system reached a mean score of 0.754, an 18% reduction in the gap to the hardware limit.

Examples from the published artifacts show the system applying kernel engineering techniques that require precise reasoning about hardware memory movement and numerical precision. For one fused operation involving a nonlinear normalization layer between two matrix multiplications, the system sidestepped the awkward cross-channel reduction by rewriting the second weight matrix on every forward pass to absorb the normalization's affine transform, turning an irregular data dependency into a regular batched matrix multiply. For a quantization-heavy mixture-of-experts kernel, the system used native PTX assembly instructions to pack 4-bit floating point values and staged weight preprocessing outside a captured CUDA graph so the graph itself contains only the compute path.

Reward hacking was a particular challenge on this benchmark. Some candidate kernels the system generated exploited the evaluation setup rather than genuinely improving performance, by caching outputs or relying on persistent state across calls. Recursive addressed this by treating correctness auditing as part of the research loop itself, running increasingly strict automated checks and using AI-assisted analysis to distinguish real improvements from benchmark-specific exploits. The company notes that as the search process became more capable, the evaluator had to keep pace.

What This Means for AI Progress

These three benchmarks measure different things: training recipe design, systems-level optimization, and low-level hardware programming. What links them is that a single automated system made measurable progress on all three without being hand-tuned for any of them. The improvements were not large relative to the entire distance remaining to theoretical limits, but they were real, reproducible, and in cases where human communities had already spent years optimizing.

Recursive frames this as an early demonstration of a system that can compound many small discoveries: inventing new optimizations, adapting known ideas to tighter constraints, and combining improvements across modeling, optimization, and systems layers. The company explicitly identifies reward hacking as a central technical challenge going forward. As the search becomes stronger, the system's ability to game evaluation metrics grows alongside its ability to find genuine improvements, and distinguishing the two requires increasingly sophisticated verification.

The practical implication is narrow but meaningful. AI progress has historically come from two sources: scale (larger models, more compute) and efficiency (better use of existing resources). Systems that can automate the efficiency side of that equation, finding better training algorithms and hardware utilization without adding compute, compress the timeline between research insight and deployed improvement. The results here are limited to well-defined benchmarks with fast feedback loops. Whether the same approach scales to less structured research problems remains an open question, and Recursive has not yet demonstrated it. But these results show the scaffolding works on problems that matter.

DAVID BORISH

Automated AI Research Obliterates the Benchmarks: Recursive's First Published Results

The Setup

Beating a Community of Humans and Agents

Gains on a Two-Year-Old Leaderboard

GPU Kernels

What This Means for AI Progress

Comments

JOIN THE AI SPECTATOR MAILING LIST

Back to top