How AI Coding Agents Collapsed the Research Lab's Last Moat

David Borish
Apr 21

TurboQuant

Google's TurboQuant paper, accepted at ICLR 2026 (arXiv 2504.19874), describes a two-stage approach to compressing the KV cache that large language models use during inference. The KV cache stores key and value vectors for every token in a conversation. At long context lengths it can consume tens of gigabytes of memory. A 70-billion-parameter model serving a single 128K-context request requires roughly 40GB for the KV cache alone.
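The 40GB figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes a Llama-style 70B layout (80 layers, 8 key/value heads under grouped-query attention, head dimension 128, 16-bit values). None of those numbers come from the paper or from this article; they are one plausible configuration that happens to land on the same total.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Bytes needed to cache keys and values for one request."""
    # Leading factor of 2: one key vector and one value vector
    # per token, per layer, per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * context_tokens * bytes_per_value


# Assumed model shape (hypothetical, not stated in the article).
size = kv_cache_bytes(
    num_layers=80,
    num_kv_heads=8,
    head_dim=128,
    context_tokens=128 * 1024,
)
print(f"{size / 2**30:.1f} GiB")  # 40.0 GiB
```

A model without grouped-query attention would cache every attention head and need several times more, which is why the layout assumption matters.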

TurboQuant addresses this through two steps. The first, PolarQuant, applies a random rotation matrix to each key and value vector before quantization, which spreads variance uniformly across coordinates and makes subsequent compression more accurate. The second, a 1-bit Quantized Johnson-Lindenstrauss (QJL) transform applied to the residual, corrects for the inner product bias that MSE-optimal quantizers introduce. Together, the paper argues, the two stages compress the KV cache to 3.5 bits per channel with no measurable accuracy degradation on benchmarks including LongBench and the Needle-in-a-Haystack retrieval test.
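To see why rotating before quantizing helps, here is a minimal sketch of the first stage only. It is not the paper's quantizer: it assumes a plain uniform 4-bit scalar quantizer and a rotation drawn via QR decomposition, and the helper names are made up for illustration. The test vector has a single outlier channel, the case where a rotation is expected to matter most.

```python
import numpy as np


def random_rotation(dim, seed=0):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))  # sign fix so the draw is uniform over rotations


def quantize_uniform(x, bits):
    """Round each coordinate to one of 2**bits evenly spaced levels over [min, max]."""
    levels = 2**bits - 1
    lo, hi = x.min(), x.max()
    codes = np.round((x - lo) / (hi - lo) * levels)
    return codes / levels * (hi - lo) + lo


dim = 128
rng = np.random.default_rng(1)
x = rng.standard_normal(dim) * 0.1
x[7] = 10.0  # one outlier channel dominates the dynamic range

rot = random_rotation(dim)
plain = quantize_uniform(x, bits=4)
# Quantize in the rotated basis, then rotate back to compare like with like.
rotated = rot.T @ quantize_uniform(rot @ x, bits=4)

err_plain = np.linalg.norm(x - plain)
err_rotated = np.linalg.norm(x - rotated)
print(f"error without rotation: {err_plain:.3f}")
print(f"error with rotation:    {err_rotated:.3f}")
```

Without the rotation, the outlier stretches the quantization grid so far that the small coordinates collapse onto one or two levels. After the rotation, the outlier's energy is shared across all 128 coordinates and the same 4 bits cover a much narrower range.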

Google published the paper and a blog post on March 24, 2026. It did not publish code.

Seven Days

Tom Turney, a former Google staff engineer and current founder and CTO of Psyguard.ai, built his implementation by reading the paper's math and using Claude Code to translate formulas into working software. His timeline: three days for core algorithms, 141 tests, and a Python prototype; two more days to port that into C and integrate it into llama.cpp with Metal GPU kernels; the final two days on speed optimization, moving from 739 tokens per second to 2,747.

The result, turboquant_plus, was not a straight reproduction. Turney added sparse V decoding, which skips 90% of value decompressions at long context by gating on attention weights. He added asymmetric K/V compression, keeping keys at higher precision while aggressively compressing values. He added temporal decay for older tokens. The combination enabled a 104-billion-parameter model to run at 128K context on a MacBook with a perplexity score of 4.024.
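The sparse V idea is simple enough to sketch. The version below is an assumption-laden illustration, not Turney's code: the threshold, the per-token 4-bit scaling scheme, and the function name are all hypothetical. It dequantizes a value row only when its attention weight is large enough to matter, then compares the result against the dense computation.

```python
import numpy as np


def sparse_v_attention(weights, v_codes, v_scale, threshold=1e-4):
    """Weighted sum of values, dequantizing only rows whose weight exceeds threshold."""
    keep = np.flatnonzero(weights > threshold)
    # Only the kept rows are ever converted back to floating point.
    values = v_codes[keep].astype(np.float32) * v_scale[keep, None]
    return weights[keep] @ values, keep.size


rng = np.random.default_rng(0)
tokens, dim = 4096, 64
v = rng.standard_normal((tokens, dim)).astype(np.float32)
v_scale = np.abs(v).max(axis=1) / 7  # hypothetical per-token scale for signed 4-bit codes
v_codes = np.round(v / v_scale[:, None]).astype(np.int8)

logits = rng.standard_normal(tokens) * 4  # synthetic, peaked attention scores
weights = np.exp(logits - logits.max())
weights /= weights.sum()

out_sparse, used = sparse_v_attention(weights, v_codes, v_scale)
out_dense = weights @ (v_codes.astype(np.float32) * v_scale[:, None])
max_diff = np.abs(out_sparse - out_dense).max()
print(f"dequantized {used} of {tokens} rows, max abs difference {max_diff:.5f}")
```

How many rows get skipped depends entirely on how peaked the attention distribution is. The synthetic scores here are only a stand-in for real long-context attention.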

The asymmetric K/V finding deserves a closer look. The paper treated keys and values symmetrically. Turney's experiments found that key precision is the dominant quality factor because keys control attention routing through the softmax function. Compressing values aggressively while holding key precision costs almost nothing in output quality. This is a genuine contribution, not a reproduction.

What the Community Found

By the two-week mark after publication, five independent implementations existed across different languages and hardware targets. Several of these teams made a finding that complicates the paper's claims: QJL, the second stage of TurboQuant's two-stage design and the part the paper describes as its key theoretical innovation for unbiased inner product estimation, actually hurts performance in practice. Six or more independent teams confirmed this across Python, C, and Rust implementations.
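For readers who have not met it, the estimator at the center of this finding looks roughly like the sketch below. It is the generic 1-bit Johnson-Lindenstrauss construction with a Gaussian projection, written from the textbook identity rather than from the paper's kernels, and the function names are hypothetical. Each projection of the residual is reduced to a sign bit, and the inner product with a query is recovered from those bits plus the residual's norm.

```python
import numpy as np


def qjl_encode(r, proj):
    """Keep one sign bit per projection of r, plus the norm of r."""
    return np.sign(proj @ r), np.linalg.norm(r)


def qjl_inner_product(q, signs, r_norm, proj):
    """Estimate <q, r> from the sign bits; unbiased when proj has Gaussian entries."""
    m = proj.shape[0]
    return float(np.sqrt(np.pi / 2) * r_norm / m * ((proj @ q) @ signs))


rng = np.random.default_rng(0)
dim, m = 64, 20000  # m is deliberately large so that convergence is visible
q = rng.standard_normal(dim)
r = rng.standard_normal(dim)
proj = rng.standard_normal((m, dim))

signs, r_norm = qjl_encode(r, proj)
estimate = qjl_inner_product(q, signs, r_norm, proj)
exact = float(q @ r)
print(f"exact {exact:.3f}, estimate {estimate:.3f}")
```

The estimate is right on average, but each individual estimate carries noise that shrinks only as the number of projections grows.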

The practical conclusion multiple teams reached was to drop QJL and rely on the PolarQuant stage alone, which delivers the bulk of the compression benefit without the overhead. One developer in the llama.cpp discussion thread independently discovered that replacing random rotation with a Walsh-Hadamard Transform improves results further, then puzzled over why this worked when the paper specified random rotation. The answer, debated in the thread, appears to be that WHT spreads energy more uniformly and deterministically than random projection, reducing variance before the QJL step. The community was, in real time, iterating past the paper's own design choices.
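The Walsh-Hadamard Transform mentioned in the thread is a fixed orthogonal transform that can be computed with additions and subtractions alone. The sketch below is a standard textbook butterfly, not code from any of the community implementations. It shows the property the thread was discussing: a single large coordinate comes out with exactly the same magnitude in every output coordinate.

```python
import numpy as np


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = np.asarray(x, dtype=np.float64).copy()
    n = x.shape[0]
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Pair up blocks of length h and replace (a, b) with (a + b, a - b).
        x = x.reshape(-1, 2, h)
        x = np.stack((x[:, 0] + x[:, 1], x[:, 0] - x[:, 1]), axis=1)
        h *= 2
    return x.reshape(n) / np.sqrt(n)


spike = np.zeros(128)
spike[7] = 10.0  # all of the energy in one channel
spread = fwht(spike)

print(np.unique(np.round(np.abs(spread), 6)))      # a single magnitude, 10 / sqrt(128)
print(np.linalg.norm(spike), np.linalg.norm(spread))  # the norm is preserved
```

Because the normalized transform is its own inverse, applying `fwht` a second time recovers the original vector, so no rotation matrix needs to be stored.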

Community implementations also targeted hardware Google never considered. A full HIP/ROCm port appeared for AMD Radeon cards. MLX versions for Apple Silicon emerged. Work-in-progress integration surfaced in SGLang and vLLM. LM Studio users filed feature requests to support the new compression flags once the upstream llama.cpp PR merges.

The Academic Dispute

Running alongside the reproduction story is a separate controversy about the paper itself. Jianyang Gao, a postdoctoral researcher at ETH Zurich and first author of the prior RaBitQ method, published a detailed public statement laying out three specific accusations against the TurboQuant team.

The first is methodological similarity that was not disclosed. Both TurboQuant and RaBitQ apply a random rotation to input vectors before quantization. This is the core step in both methods. Gao argues that TurboQuant should have directly described this overlap. Instead, the final ICLR version moved its description of RaBitQ from the main text to the appendix, making the relationship harder to see. Gao notes that in January 2025, TurboQuant's second author, Majid Daliri, had contacted the RaBitQ team asking for help debugging a Python version he had built from their C++ source code. The TurboQuant team had detailed technical knowledge of RaBitQ before the paper was written.

The second accusation concerns a theoretical claim. The TurboQuant paper labels RaBitQ's mathematical analysis as suboptimal. Gao's team had provided detailed explanations by email in May 2025 demonstrating that RaBitQ achieves the same theoretical optimal bound. The TurboQuant authors confirmed they received this, and the paper retained the "suboptimal" characterization anyway.

The third is the most concrete. Gao alleges that the speed comparison benchmarks tested RaBitQ on a single-core CPU instance with multi-threading disabled while testing TurboQuant on GPU hardware. The paper does not disclose this difference in experimental conditions. The visible result, that TurboQuant is orders of magnitude faster, would then reflect the hardware disparity more than any real algorithmic difference.

Gao had flagged all three issues by email before the paper was submitted to ICLR. According to his account, the TurboQuant team acknowledged the problems and said it would address some of them after the conference concluded, but declined to address the core similarity claim. The Stanford NLP Group's official X account reposted Gao's statement. Gao's team has submitted a formal complaint to ICLR's ethics committee and plans to publish a technical report.

The Market Got It Wrong Anyway

The TurboQuant announcement triggered one of the sharpest selloffs in memory chip stocks since the 2025 tariff shock. Micron fell roughly 20% over five trading days from its March 18 earnings. SK Hynix dropped 6.2% in a single session. SanDisk shed 18% over five days despite having no connection to inference-time cache compression. The implied market logic was that if KV cache memory drops by 6x, demand for DRAM falls proportionally.

That logic misread what TurboQuant actually compresses. The KV cache is the working scratchpad a GPU uses while generating a response. It is not the memory used to store model weights, which is the larger and more hardware-intensive demand driver. Reducing KV cache pressure means a given GPU can serve more concurrent users or handle longer contexts; it does not shrink the GPU fleet required to run frontier models. Demand for memory infrastructure is driven primarily by the cost of loading and serving models, not by inference-time KV cache. The week the selloff happened, Anthropic was rationing Claude session limits because demand was exceeding its infrastructure capacity. Compression gains, when realized, tend to be absorbed by expanded usage rather than reduced hardware spend.
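A rough calculation makes the point concrete. Every hardware number below is a hypothetical chosen for illustration: an eight-accelerator node with 80 GiB each, and about 130 GiB of 16-bit weights for a 70B model. The per-request cache sizes reuse the article's 40GB figure, scaled from 16 bits to the paper's 3.5 bits per channel.

```python
def concurrent_requests(total_memory_gib, weights_gib, cache_per_request_gib):
    """How many long-context requests fit once the weights are resident."""
    return int((total_memory_gib - weights_gib) // cache_per_request_gib)


total_gib = 8 * 80    # hypothetical node: 8 accelerators x 80 GiB
weights_gib = 130     # about 70e9 parameters x 2 bytes, unchanged by KV compression
cache_fp16_gib = 40   # one 128K-context request at 16 bits
cache_compressed_gib = cache_fp16_gib * 3.5 / 16  # same request at 3.5 bits per channel

before = concurrent_requests(total_gib, weights_gib, cache_fp16_gib)
after = concurrent_requests(total_gib, weights_gib, cache_compressed_gib)
print(before, after)  # 12 58
```

The node is the same size in both cases. What changes is how many users it can serve at once.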

The market's reaction was already somewhat divorced from what the paper actually claims, and the academic controversy widens that gap. The market priced in the headline numbers without discounting for the disputed benchmarks, the withheld code, or the as-yet unverified performance on frontier-scale models.

What This Changes

The speed from paper to working implementation has implications beyond this particular case. Google's decision not to release code was presumably intended to maintain some advantage in deployment. That advantage lasted less than a week. The mechanism was not a team of engineers with access to the paper's authors; it was one person with Claude Code reading math formulas and running experiments.

This matters for how research labs think about publication. If withholding code no longer protects a competitive advantage, and the paper itself provides everything needed to reproduce and improve the work, the calculation changes. Some labs may respond by publishing less, which would be a loss for open science. Others may reckon that the reputational value of publishing outweighs whatever lead time the code embargo buys, particularly when community implementations tend to add hardware targets and optimizations the original authors never prioritized.

The QJL finding illustrates that second point. The paper's theoretical innovation degraded real-world performance in six independent implementations. The community found this within two weeks by testing across actual hardware. That kind of distributed, adversarial validation is hard to replicate inside a single lab, and it happens only when the work is out in public.
