Inside OpenAI's AI Chemist: How GPT-5.4 Improved a Drug-Making Reaction

David Borish

When the AI Picked an Additive No One Expected

A drug molecule that cannot be synthesized is, for practical purposes, a molecule that does not exist. Medicinal chemists work inside that constraint every day. They can design promising compounds on paper, but if the chemistry to build them gives low yields or messy byproducts, the molecules get abandoned and the program moves on. Synthesis is one of the quiet bottlenecks in drug discovery, and it rarely makes headlines.

That bottleneck is the setting for a result OpenAI and Molecule.one published on June 17. Working together, they connected GPT-5.4 to Maria, an agentic chemistry system tied to a high-throughput laboratory, and handed it an open-ended assignment: improve one of several important reaction classes. The system generated research proposals, designed and ran experiments, analyzed the data, and proposed follow-up work. Its most promising idea focused on a reaction called Chan-Lam coupling, and the specific suggestion it made was one the human chemists found genuinely surprising.

The reaction that resists

Chan-Lam coupling forms carbon-nitrogen bonds, which appear throughout the structures of medicines. The reaction is widely used, but it does not perform evenly across every type of molecule. Coupling primary sulfonamides with boronic acids has historically given low yields, and that is a meaningful gap, because sulfonamides show up in anticancer drugs, antimicrobials, and diuretics. A more reliable version of this reaction would give chemists a broader and more practical way to build and explore those compounds.

Starting from the general goal of improving Chan-Lam coupling for process chemistry, GPT-5.4 narrowed the problem itself. It identified primary sulfonamides as a difficult, high-value substrate class, then suggested that mild oxidants, TEMPO among them, could improve the reaction. The proposal, labeled OAI-M1-03, was one of four the human team selected to test in the lab out of thousands the system ranked. Chemists reviewing it found the TEMPO suggestion both unexpected and interesting, which is part of why it was worth running.

How the system was put together

The workflow paired a frontier model with specialized agents and physical infrastructure, and each piece did a distinct job. Scientists working with Maria AI wrote prompts that GPT-5.4 used, inside a harness, to generate and rank thousands of research proposals. Human chemists reviewed the small set that ranked highest and chose four for testing. Maria AI then translated those high-level plans into detailed lab instructions, ran the experiments at high throughput, analyzed the raw data, and returned structured results to GPT-5.4 so it could plan the next round.

OpenAI describes the system as near-autonomous rather than fully autonomous, and the distinction is honest about where people stayed involved. Humans designed the steering and grading prompts, selected which proposals entered the lab, made limited corrections to experimental plans, helped prepare reagents and consumables, and repeated key experiments by hand. The largest human correction was a decision to avoid DMSO as a solvent, because the chemists were concerned it could react with the stronger oxidants being used for comparison. The model proposed the central ideas; the chemists supplied judgment and ran the parts of the work that still require hands.

The full process ran three months, from the first prompt on March 4 to sharing the OAI-M1-03 results with independent experts on June 4.

What the experiments showed

Maria ran 10,080 reactions across the two cycles of OAI-M1-03, more than a chemist running three reactions every working day would complete in a decade. That scale was not for show. Chemistry results can mislead when they rest on a handful of examples, since a reaction can look strong on one pair of starting materials and then fall apart across a wider set. Running thousands of reactions let the team identify TEMPO out of ten oxidants tested, watch the effect repeat across diverse combinations, and map where it stopped working.
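The decade comparison depends on how many days a year the hypothetical chemist spends at the bench. The back-of-the-envelope check below is only a sketch: the 250 working days per year is my assumption, not a figure from OpenAI or Molecule.one, and the function name is purely illustrative.

```python
# Rough check of the "decade of manual work" comparison.
# ASSUMPTION: 250 working days per year; the source does not state a figure.
REACTIONS_RUN = 10_080      # reactions reported for OAI-M1-03
REACTIONS_PER_DAY = 3       # manual pace used in the comparison
YEARS = 10


def manual_total(days_per_year: int) -> int:
    """Reactions one chemist would complete by hand over YEARS years."""
    return REACTIONS_PER_DAY * days_per_year * YEARS


for label, days in (("working days, assumed 250/yr", 250),
                    ("calendar days, 365/yr", 365)):
    total = manual_total(days)
    ratio = REACTIONS_RUN / total
    print(f"{label}: {total} manual reactions, campaign/manual = {ratio:.2f}")
```

On a working-day basis the campaign comes out ahead of a decade of manual effort (7,500 reactions). Counting every calendar day, the manual total (10,950) would edge past it, so the comparison is best read as a working-day estimate.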

The numbers moved in the right direction. Under the optimized conditions, measured yields improved for 88 percent of the boronic acids and 83 percent of the sulfonamides tested. Mean yield rose from 16.6 percent to 25.2 percent, and the share of reactions clearing 30 percent yield went from 15.6 percent to 37.5 percent. In the second round, the system proposed a more focused set of experiments and found that TEMPO could be swapped for a much cheaper analog, 4-hydroxy-TEMPO, with little loss in performance, a detail that matters for anyone who would actually use the method.
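To put those figures on a common footing, it helps to separate the change in percentage points from the relative change. The short sketch below uses only the four numbers quoted above; the helper name is illustrative.

```python
# Absolute (percentage-point) and relative change for the reported figures.
def change(before: float, after: float) -> tuple[float, float]:
    """Return (change in percentage points, ratio of after to before)."""
    return after - before, after / before


mean_pts, mean_ratio = change(16.6, 25.2)      # mean yield, percent
share_pts, share_ratio = change(15.6, 37.5)    # share of reactions above 30% yield

print(f"mean yield: +{mean_pts:.1f} points ({mean_ratio:.2f}x)")
print(f"reactions above 30% yield: +{share_pts:.1f} points ({share_ratio:.2f}x)")
```

Read this way, mean yield rose by 8.6 percentage points, roughly a 1.5-fold increase, while the share of reactions clearing the 30 percent mark rose by 21.9 points, roughly 2.4-fold. The gain is real, but the absolute yields remain modest.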

The result also survived the move out of the screening format. Human chemists reproduced representative reactions by hand at bench scale and saw higher yields for 11 of 14 substrate pairs, with the increase exceeding twofold for eight of them. That step carries weight because very small-scale experiments can introduce artifacts that vanish at larger scale, and because bench validation is customary before work is published. Four external chemistry experts reviewed the preprint and supported the view that the finding was novel and worth sharing.

Tim Cernak, an associate professor of medicinal chemistry at the University of Michigan, described the combination of high-throughput experimentation and modern AI as a new frontier for discovery. He called the result a demonstration that mild conditions and a practical oxidant can deliver a broadly useful substrate scope for one of the more popular reactions in drug synthesis. The stronger test, as OpenAI notes, still lies ahead: whether independent labs reproduce the result and whether chemists find it useful across a wider range of molecules.

What the result does not establish

The paper and the blog post are careful about scope, and that care is worth repeating. The work shows that a model can make a useful contribution to organic chemistry, going beyond summarizing literature or suggesting a single experiment. It proposed a specific, surprising hypothesis, surfaced it for human review, designed experiments, interpreted data, and planned follow-ups. It does not show that AI can run a chemistry research program end to end. Human judgment stayed essential, and the workflow depended on specialized high-throughput infrastructure that few labs have.

The result also does not establish that the method generalizes to other coupling reactions, other substrate classes, or manufacturing conditions. The yield estimates came from a high-throughput platform, and bench validation covered 14 substrate pairs. More work is needed to characterize the reaction mechanism, define the substrate scope, measure performance under different conditions, and reproduce the finding independently. Of the three other proposals tested in the lab, two were experimentally confirmed and one was disproven, with analysis still ongoing.

There is also the safety dimension, which OpenAI addresses directly. The same tools that support medicine and materials science can be misused, so the project was scoped to a legitimate medicinal-chemistry problem and involved no toxins, chemical weapons, or requests to design harmful compounds. The model had undergone relevant evaluations with the UK AI Security Institute and was built to refuse harmful requests, and the experimental workflow added control by keeping humans in charge of which proposals entered the lab and of the physical infrastructure itself.

Why this fits a longer pattern

What makes this result interesting is less the chemistry than the shape of the collaboration. A frontier model reviewed the literature, proposed an idea its human collaborators would not have predicted, helped design and analyze the experiments, and arrived at a finding that chemists could test against real molecules. The capability appeared first inside a controlled, instrumented environment, a high-throughput lab running microliter reactions, before anyone tried to move it toward practical bench workflows. The path from that controlled setting outward is exactly where the value, and the uncertainty, lives.

This connects to a thesis I explore in my forthcoming book, The Tony Hawk Paradox: capabilities tend to emerge and mature inside simulated or controlled environments well before they reshape the broader systems around them. An AI proposing a viable reaction additive is impressive on its own. The more important question is what happens as the controlled conditions that produced it, the automated lab, the human steering, the curated substrate set, give way to the messier conditions of independent replication and real manufacturing. That transition is where we will learn how much this actually changes about how drugs get made.

For now, the honest description is the one OpenAI offers: an early result, validated more carefully than most, that gives the scientific community something concrete to reproduce and build on. The next chapter belongs to the independent labs.
