OpenAI Releases GPT-5.5 With State-of-the-Art Scores on Coding, Science, and Computer Use
- David Borish

- 21 hours ago

OpenAI released GPT-5.5 yesterday with benchmark numbers that tell a fairly clear story: the model is more capable than its predecessor across coding, knowledge work, and scientific research tasks, and it accomplishes that without the latency penalty that usually accompanies a larger, more capable model.
On Terminal-Bench 2.0, which measures complex command-line workflows requiring planning and tool coordination, GPT-5.5 reaches 82.7%, up from GPT-5.4's 75.1%. On OSWorld-Verified, which tests whether a model can operate real computer interfaces without human guidance, it hits 78.7%, compared to 75.0% for GPT-5.4. The model achieves these results while using fewer output tokens on both benchmarks, a detail that matters for the economics of deploying AI at scale.
The Coding Case
The strongest benchmark improvements show up in agentic coding. On Expert-SWE, an internal evaluation covering long-horizon coding tasks that OpenAI estimates take human engineers a median of 20 hours to complete, GPT-5.5 scores 73.1% against GPT-5.4's 68.5%. The gains hold on Terminal-Bench 2.0 and SWE-Bench Pro as well, and across all three, GPT-5.5 uses fewer tokens to get there.
The qualitative reports from early testers track with the numbers. Dan Shipper, CEO of Every, described a test where he rewound to a broken state of an application his team had spent days debugging. GPT-5.4 could not produce the same rewrite that a senior engineer eventually arrived at. GPT-5.5 could. Pietro Schirano at MagicPath reported that the model merged a branch containing hundreds of frontend and refactor changes into a substantially modified main branch in roughly 20 minutes, in a single pass. Senior engineers who tested the model said it predicted testing and review needs without being asked, and in at least one case, an engineer returned to find a 12-diff stack nearly complete.
One engineer at NVIDIA made the kind of remark that tends to get quoted in product announcements precisely because it captures something operationally real: losing access to the model felt like losing a limb.
The token efficiency point is not incidental. On Artificial Analysis's Coding Index, OpenAI says GPT-5.5 delivers state-of-the-art performance at half the cost of competitive frontier coding models. For organizations running large volumes of agentic coding tasks, that arithmetic compounds quickly.
Knowledge Work and Computer Use
The same reasoning improvements that make GPT-5.5 a stronger coding model also show up in broader knowledge work. On GDPval, which evaluates agents on tasks drawn from 44 occupations, the model scores 84.9%, against 83.0% for GPT-5.4 and 67.3% for Gemini 3.1 Pro. On Tau2-bench Telecom, which tests complex customer service workflows, it reaches 98.0% without prompt tuning, compared to 92.8% for GPT-5.4.
OpenAI's own internal use cases give a concrete sense of what that looks like in practice. The company's finance team used GPT-5.5 in Codex to review 24,771 K-1 tax forms totaling more than 71,000 pages, with a workflow that excluded personal information and accelerated the task by two weeks relative to the prior year. The communications team built an automated Slack agent for routing speaking requests: low-risk requests handled automatically, higher-risk ones routed to human review. A go-to-market employee automated weekly business reports and recovered five to ten hours per week.
These are not transformative use cases, but they are illustrative ones. The pattern is consistent: a task that previously required sustained human attention, or a team of people, gets compressed into a workflow that runs largely without intervention.
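The Slack routing workflow described above follows a simple pattern: score each incoming request for risk, handle the low-risk ones automatically, and escalate the rest. A minimal sketch of that pattern is below; the keywords, function names, and thresholds are hypothetical illustrations, not OpenAI's actual implementation.

```python
# Illustrative sketch of the routing pattern: score a speaking request
# for risk, auto-handle low-risk ones, escalate the rest to a human.
# Keywords and thresholds here are invented for illustration.

HIGH_RISK_KEYWORDS = {"press", "keynote", "regulatory", "government"}

def risk_score(request_text: str) -> int:
    """Count how many hypothetical high-risk keywords appear in the request."""
    words = set(request_text.lower().split())
    return len(words & HIGH_RISK_KEYWORDS)

def route(request_text: str) -> str:
    """Return the routing decision: automatic handling or human review."""
    return "auto_handle" if risk_score(request_text) == 0 else "human_review"
```

In practice the scoring step would be the model call itself rather than a keyword match, but the control flow, automate the easy majority and reserve human attention for the risky tail, is the same.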
On computer use specifically, GPT-5.5 scores 78.7% on OSWorld-Verified, narrowly ahead of Claude Opus 4.7's 78.0%. The model can navigate interfaces, click, type, and move across software tools. Combined with the improvements in task persistence, which early testers noted as a distinguishing quality over GPT-5.4, that makes it a more credible candidate for the kind of autonomous computer operation that has been discussed more than demonstrated in recent years.
Scientific Research Applications
GPT-5.5 shows improvement on scientific benchmarks that have not historically been a strong suit for large language models. On GeneBench, which covers multi-stage data analysis in genetics and quantitative biology, GPT-5.5 scores 25.0% against GPT-5.4's 19.0%. These problems require models to reason through ambiguous or incomplete data, address hidden confounders, and correctly apply modern statistical methods; OpenAI says such tasks often correspond to multi-day projects for domain experts. On BixBench, a real-world bioinformatics benchmark, GPT-5.5 reaches 80.5%, up from 74.0%.
Two researcher accounts published alongside the model release illustrate the range. Derya Unutmaz, an immunology professor at the Jackson Laboratory for Genomic Medicine, used GPT-5.5 Pro to analyze a dataset of 62 samples and nearly 28,000 genes, producing a research report that surfaced questions and insights he estimated would have taken his team months. Bartosz Naskręcki, a mathematics professor in Poland, used GPT-5.5 in Codex to build an algebraic geometry application from a single prompt in 11 minutes, visualizing the intersection of quadratic surfaces and converting the result into a Weierstrass model.
The most technically notable result is a proof about Ramsey numbers. An internal version of GPT-5.5, running with a custom harness, found a proof of a longstanding asymptotic fact about off-diagonal Ramsey numbers. The proof was later verified in Lean. Ramsey theory is a legitimate research frontier; results there are infrequent and technically demanding. The model's contribution here was not a code artifact or a summary but an original mathematical argument.
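The announcement does not specify which asymptotic fact was proved. For context on what statements in this area look like, a classical off-diagonal result of the same flavor is Kim's 1995 theorem on the Ramsey number R(3, t), which together with earlier upper bounds pins down its growth rate:

```latex
% Kim (1995), combined with the Ajtai–Komlós–Szemerédi upper bound:
% the off-diagonal Ramsey number R(3, t) grows on the order of t^2 / log t.
R(3, t) = \Theta\!\left(\frac{t^2}{\log t}\right) \quad \text{as } t \to \infty
```

Results of this kind are what "asymptotic facts about off-diagonal Ramsey numbers" typically means; whether the model's proof concerns R(3, t) or another family is not stated in the release.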
Efficiency at Scale
Serving a more capable model at the same per-token latency required rethinking the inference stack rather than making incremental adjustments. GPT-5.5 was co-designed for and trained on NVIDIA GB200 and GB300 NVL72 systems. One specific optimization stands out. Before GPT-5.5, incoming requests were split into a fixed number of chunks to balance work across GPU cores. A predetermined chunk count works reasonably well in aggregate but is not optimal for the actual distribution of request sizes in production. Codex analyzed weeks of production traffic and generated custom heuristic algorithms to partition and balance that work more accurately. The result was a 20% increase in token generation speed.
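OpenAI has not published the generated heuristics, but the underlying idea, replacing a fixed chunk count with size-aware partitioning, is a standard load-balancing technique. The sketch below contrasts a naive round-robin split with a greedy longest-processing-time heuristic that always places the largest pending request on the lightest chunk; it is a generic illustration of the approach, not OpenAI's algorithm.

```python
import heapq

def fixed_chunks(request_tokens: list[int], n_chunks: int) -> list[list[int]]:
    """Baseline: distribute requests round-robin into a fixed number of chunks,
    ignoring request size."""
    chunks = [[] for _ in range(n_chunks)]
    for i, size in enumerate(request_tokens):
        chunks[i % n_chunks].append(size)
    return chunks

def balanced_chunks(request_tokens: list[int], n_chunks: int) -> list[list[int]]:
    """Size-aware: greedily assign the largest request to the currently
    lightest chunk (longest-processing-time heuristic), so chunk totals
    stay close to equal even for skewed request-size distributions."""
    heap = [(0, i) for i in range(n_chunks)]  # (total tokens, chunk index)
    heapq.heapify(heap)
    chunks = [[] for _ in range(n_chunks)]
    for size in sorted(request_tokens, reverse=True):
        load, idx = heapq.heappop(heap)
        chunks[idx].append(size)
        heapq.heappush(heap, (load + size, idx))
    return chunks
```

The gap between the two widens as the request-size distribution gets more skewed, which is why a heuristic tuned to the observed production distribution can recover meaningful throughput.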
The model also, in OpenAI's telling, helped improve the infrastructure serving it, identifying and implementing improvements in the inference stack itself.
Safety and Cybersecurity
OpenAI is classifying GPT-5.5's cybersecurity and biological capabilities as High under its Preparedness Framework, one step below Critical. On CyberGym, it scores 81.8%, compared to 79.0% for GPT-5.4 and 73.1% for Claude Opus 4.7. On an internal set of capture-the-flag challenges, it reaches 88.1%.
The release includes tighter classifiers for potentially harmful cyber requests, which OpenAI acknowledges may generate friction for some users initially. To address that, the company is expanding access to cyber-permissive model variants through a Trusted Access for Cyber program. Verified users and organizations responsible for defending critical infrastructure can apply for access to models with fewer restrictions for legitimate security work. API availability for GPT-5.5 is coming soon, priced at $5 per million input tokens and $30 per million output tokens. The Pro variant, aimed at higher-accuracy work, will be priced at $30 per million input tokens and $180 per million output tokens.
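At those list prices, per-call cost is simple arithmetic. The helper below computes it from the quoted rates ($5 and $30 per million input and output tokens for the standard tier, $30 and $180 for Pro); the example token counts are made up for illustration.

```python
def call_cost(input_tokens: int, output_tokens: int,
              in_price: float = 5.0, out_price: float = 30.0) -> float:
    """USD cost of one API call, given prices per million tokens.
    Defaults are the quoted GPT-5.5 rates; pass 30.0 / 180.0 for Pro."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# A hypothetical 20k-token prompt producing a 4k-token answer:
standard = call_cost(20_000, 4_000)            # $0.22 at standard rates
pro = call_cost(20_000, 4_000, 30.0, 180.0)    # $1.32 at Pro rates
```

The 6x spread between standard and Pro is why the token-efficiency gains on agentic workloads matter: fewer output tokens per completed task shrinks the dominant term in the cost.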
What This Adds Up To
Taken together, the benchmark results and early-access accounts describe a model that is more capable on a wider range of tasks than its predecessor, runs at comparable speed, and costs less per completed task due to token efficiency. The gains are not uniform. Claude Opus 4.7 leads on MCP Atlas (79.1% to 75.3%) and on some long-context retrieval tasks. The competitive picture in frontier AI remains genuinely contested.
What GPT-5.5 represents more clearly is a shift in how agentic AI systems handle the friction points in real work: ambiguous failures, multi-day tasks, large codebases, and tasks that require persistent effort across many steps. Whether that shift proves durable in production, at scale, across more organizations, is the question the next few months will answer.