Writing Code vs. Shipping Code: What a Study of 100,000 Developers Reveals About AI's Real Productivity Gains

David Borish
Jun 2

Every few months a new benchmark lands claiming that AI coding agents can outperform human developers on some suite of programming tasks. What almost none of those benchmarks measure is whether the code actually ships. A paper published this month by economists Mert Demirer (MIT), Leon Musolff (Wharton), and Liyuan Yang (MIT) asks exactly that question, and the answer has significant implications for how enterprises should think about AI-driven software investment.

The study draws on GitHub activity data for more than 100,000 developers combined with internal Microsoft telemetry on AI tool usage, spanning three generations of products: autocomplete tools like the original GitHub Copilot, synchronous agents like Claude Code that work interactively alongside the developer, and asynchronous agents like GitHub's Copilot Coding Agent and OpenAI's Codex that operate autonomously and submit pull requests for human review. By matching each adopter against a control developer with similar prior activity, the researchers constructed event studies that trace what happens to output at every stage of the software production chain after a developer starts using each tool class.
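
To make the matching step concrete, here is a minimal sketch of one way to pair adopters with controls. It is an illustration only: the paper's actual matching procedure is not described here, and the function name, the single "prior activity" number, and the greedy nearest-neighbor rule are all assumptions.

```python
def match_controls(adopters, candidates):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    adopters, candidates: dicts mapping a developer id to a single
    prior-activity number (for example, pre-adoption commits).
    Hypothetical helper; not the authors' procedure.
    Returns a dict mapping each adopter id to its matched control id.
    """
    available = dict(candidates)  # copy so the caller's dict is untouched
    matches = {}
    for dev, activity in sorted(adopters.items()):
        if not available:
            break  # more adopters than candidate controls
        # Closest remaining candidate by prior activity; ties broken by id.
        control = min(available, key=lambda c: (abs(available[c] - activity), c))
        matches[dev] = control
        del available[control]  # each control is used at most once
    return matches


# Made-up illustrative data, not from the study.
adopters = {"a1": 10, "a2": 50}
candidates = {"c1": 12, "c2": 48, "c3": 200}
print(match_controls(adopters, candidates))
```

An event study would then compare each adopter's output with its matched control's output week by week around the adoption date.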

The Gains Are Real, and They Grow Across Generations

The task-level productivity results are large by any measure in this literature. Autocomplete tools increase commits by about 40% in the long run. Adding synchronous agent usage brings cumulative commit growth to roughly 140%. Layering in autonomous async agents pushes that figure to 180%, counting both human-authored and agent-authored commits. These effects persist across the full 30-week post-adoption window the study tracks, distinguishing them from the short-lived activity spikes the researchers observe when developers adopt non-AI tools like GitHub Pro or Docker.

The pattern also tracks model releases rather than developer learning curves. When the researchers restrict attention to early adopters of Claude Code and align their data to calendar time, the productivity gains step up noticeably with each successive Opus model release. That finding argues against the interpretation that developers simply get better at prompting over time; the models themselves are doing more useful work as they improve.

Effects are larger for less active developers throughout, with the least-active quartile seeing commit increases of 85% from autocomplete alone and over 200% from sync agents. Even among the most active developers, sync agents produce a 62% long-run gain. These numbers are consistent with prior randomized controlled experiments on autocomplete, which the authors use as an external validation of their event-study design.

Where the Gains Go

The more consequential finding is what happens to those task-level gains as they travel up the software production hierarchy toward actual releases. The hierarchy the authors measure runs from lines of code through distinct files, commits, pull requests, distinct repositories, and finally formal releases.

For autocomplete, a 228% gain in lines of code attenuates to 36% at commits and 10% at releases. For sync agents, a 741% gain in lines of code attenuates to 109% at commits and 20% at releases. In both cases, each successive layer in the production process absorbs a significant fraction of the upstream productivity shock, leaving a much smaller effect by the time software reaches a formal version number.
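
A quick way to see the attenuation is to divide each downstream gain by the lines-of-code gain. The sketch below does that with the figures quoted above. The "pass-through" ratio is a descriptive shorthand assumed for this illustration, not a metric taken from the paper.

```python
# Percentage gains quoted in this article, by tool class and production stage.
GAINS = {
    "autocomplete": {"lines": 228, "commits": 36, "releases": 10},
    "sync_agents": {"lines": 741, "commits": 109, "releases": 20},
}


def pass_through(gains, upstream="lines"):
    """Share of the upstream percentage gain that survives at each stage.

    Illustrative ratio only (an assumption of this sketch).
    """
    base = gains[upstream]
    return {stage: gain / base for stage, gain in gains.items()}


for tool, gains in GAINS.items():
    shares = pass_through(gains)
    print(tool, {stage: round(share, 3) for stage, share in shares.items()})
```

On these numbers, roughly 15% of the lines-of-code gain survives to commits for both tool classes, and under 5% survives to releases.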

The attenuation pattern is not arbitrary. It reflects what the researchers call the weak-link hypothesis: when stages of production are complementary rather than substitutable, improving one stage has bounded effects on final output because the human-controlled downstream stages remain unchanged. The authors formalize this in a hierarchical production model with a constant elasticity of substitution (CES) structure. Calibrating the model against the autocomplete data yields an estimated elasticity of substitution of approximately 0.25 between AI-generated upstream output and downstream human effort.

That figure is well below 1, placing the technology firmly in what economists call the complements region, where even unlimited upstream automation produces finite final output gains. The intuition is direct: more code does not help if humans still have to review, integrate, and ship it at the same pace they always have.
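
The bound is easy to see in a stripped-down version of the idea. The sketch below uses a single two-input CES aggregate rather than the authors' full hierarchical model, and the equal share weight of 0.5 is an arbitrary assumption chosen for illustration; only the elasticity of 0.25 comes from the study.

```python
def ces_output(upstream, downstream, sigma=0.25, share=0.5):
    """Two-input CES aggregate with elasticity of substitution sigma.

    Simplified stand-in for the paper's hierarchical model; `share`
    is an assumed weight, not an estimate from the study.
    """
    if sigma <= 0 or sigma == 1:
        raise ValueError("sigma must be positive and different from 1")
    rho = (sigma - 1.0) / sigma  # sigma = 0.25 gives rho = -3
    return (share * upstream**rho + (1.0 - share) * downstream**rho) ** (1.0 / rho)


def ces_upper_bound(downstream, sigma=0.25, share=0.5):
    """Limit of ces_output as upstream grows without bound (needs sigma < 1)."""
    if not 0 < sigma < 1:
        raise ValueError("the bound only exists in the complements region")
    rho = (sigma - 1.0) / sigma
    # With rho < 0 the upstream term vanishes as upstream goes to infinity.
    return (1.0 - share) ** (1.0 / rho) * downstream


# Hold downstream human effort fixed at 1 and scale up upstream output.
for multiplier in (1, 2, 10, 1000):
    print(multiplier, round(ces_output(multiplier, 1.0), 4))
print("ceiling:", round(ces_upper_bound(1.0), 4))
```

Under these assumed weights, final output can never rise more than about 26% above its baseline, however much upstream code is produced. The exact ceiling depends on the share parameter, but its existence depends only on the elasticity being below 1.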

The gap between tools is also telling. Autocomplete enters the production hierarchy only at the code-writing layer; every subsequent layer remains entirely human. Async agents enter at four layers, generating code, files, commits, and pull requests autonomously. Yet even async agents face two layers of human-controlled attenuation before reaching releases, which is why their large headline commit numbers compress so sharply by the time the software ships.

The App Store Test

To move beyond GitHub and ask whether these productivity gains are reaching end users, the researchers assembled monthly panels across four major software marketplaces: the Apple App Store, Google Play, the Chrome Web Store, and SourceForge.

The supply-side results are visible. New iOS applications ran between 30,000 and 50,000 per month through early 2025, then climbed to approximately 100,000 per month by April 2026, with the acceleration aligning closely with the rise of agentic coding tools. Chrome extensions roughly doubled over the same period. Google Play showed a reversal of a prior declining trend rather than a sharp acceleration, likely reflecting that platform's heavier moderation burden.

The consumption side tells a different story. Across all three actively growing marketplaces, total usage, measured by ratings and downloads within the first three months of each cohort's launch, has not increased. Moreover, the share of new applications that fail to accumulate even a minimal audience (fewer than 10 ratings on iOS, fewer than 100 downloads on Android) has risen noticeably. On iOS that share moved from roughly 79% to 86%. On Chrome it jumped from 18% to 31%.

This pattern rules out the main channels through which the additional releases could have raised consumer welfare. Flat aggregate usage rules out both breakout hits and a long-tail effect in which many small-audience apps collectively create large total value, a mechanism documented in AI-assisted book publishing on Amazon but not visible in software. The rising share of near-zero-audience releases further rules out a matching channel, in which better-targeted niche apps substitute for existing ones without raising total usage. The marginal applications being produced in the agentic coding era are, on the evidence, largely invisible to users.

The authors note two competing interpretations. One is that the marginal AI-assisted applications are simply lower quality, reflecting a lower-cost, lower-friction entry bar that draws in projects with weak market potential. The other is demand-side congestion: app store discovery algorithms and user attention are finite, and a larger supply of new applications does not automatically translate into more discovery. The data cannot cleanly separate these two explanations.

What This Means for Enterprise Buyers

The paper is careful about scope. Its data cover public GitHub repositories and consumer app marketplaces; enterprise and internal software, which constitutes a large fraction of the industry, is not directly observed. The study also covers only about a year of the agentic coding era, which may be too short to see the full adoption response at the consumer level, since app discovery and word-of-mouth operate on longer timescales than code production.

Within those limits, the findings carry a clear practical message. Enterprises evaluating AI coding tools should distinguish between two categories of productivity gain: what happens at the developer's keyboard and what happens to deployable software. The keyboard gains are substantial and growing across tool generations. The deployment gains are real but smaller, and they are constrained by the human stages that still govern code review, integration, quality assurance, and release management. Investments in AI coding tools without corresponding investment in those downstream human processes will yield returns significantly below what the task-level productivity numbers suggest.

The implication the authors draw at the production level is worth stating precisely. With an elasticity of substitution of 0.25, even full automation of the upstream coding layers yields bounded gains in final output, because downstream human effort remains the binding constraint. The bottleneck, as the researchers put it, is shifting from writing code to reviewing, integrating, and distributing it. The next generation of AI tools that moves the needle on shipped software will be the one that can do meaningful work at those higher layers without sacrificing the human judgment that users ultimately depend on.

The study's data track Claude Code adoption events against Opus 4, Opus 4.1, and Opus 4.5 release dates and find visible step changes in productivity at each model transition, which is consistent with the general finding that the gains at the task level are real and improving. Whether model improvements will eventually compress the production hierarchy bottleneck, by generating code that requires substantially less human review, or by automating pull request integration with enough reliability to reduce that review burden, is the central empirical question this research leaves open.
