Beyond Expert-Level: How Claude Approaches Biological Problems Humans Cannot Answer

David Borish
Apr 30

Benchmark results in AI tend to follow a predictable pattern: a model reaches human-level performance on a well-known test, the announcement generates attention, and researchers move on to a harder one. BioMysteryBench, a new bioinformatics evaluation developed internally at Anthropic, complicates that pattern in an interesting way. The benchmark was designed specifically to include problems that human experts cannot solve, which means model performance can now be measured in territory where there is no human ceiling to approach.

The benchmark consists of 99 questions drawn from fields including whole genome sequencing, single-cell RNA sequencing, ChIP-seq, proteomics, and metabolomics. Questions were written by domain experts and graded against objective ground-truth properties of the data, such as which organism a crystal structure belongs to, or which viral species appears in RNA sequencing data from a clinical sample where the infection was confirmed by PCR. The design sidesteps a persistent problem in scientific benchmarking: when answers are drawn from an individual scientist's conclusions, those conclusions carry the fingerprints of every subjective choice that scientist made along the way.

Why Biological Benchmarks Are Hard to Build

Most existing scientific benchmarks struggle with three properties specific to biology: there are many valid ways to approach any given research question; individual methodological choices in noisy datasets can produce entirely different conclusions; and the most consequential questions are often ones humans have not yet answered.

The paper uses metformin response as a concrete illustration. A 2011 study reported a genetic variant that predicts metformin response in type 2 diabetics, with a proposed mechanism involving AMPK activation. A follow-up study the next year, testing the same variant in pre-diabetics, found no effect. A 2012 meta-analysis pooling five cohorts concluded the original effect was real but more modest than reported. Three research teams, the same underlying question, three different answers. Evaluating a model against any one of those conclusions would build the evaluator's analytical choices into the benchmark itself.

BioMysteryBench avoids this by tying answers to properties of the data that can be verified independently of how any scientist chose to analyze them. Models were placed in a containerized environment with a minimal set of canonical bioinformatics tools, the ability to install additional tools via pip and conda, and access to standard biological databases including NCBI and Ensembl. They were graded on their final answer, not the analytical path they took to reach it.
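To make the "final answer only" grading idea concrete, here is a minimal sketch. It is not the benchmark's actual grader, which the article does not describe: the `Submission` record, the `normalize` helper, and the list of accepted answers are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    """Hypothetical record of one model run."""
    final_answer: str
    transcript: str = ""  # the analytical path; deliberately never graded


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting does not affect grading."""
    return " ".join(text.lower().split())


def grade(submission: Submission, accepted_answers: list[str]) -> bool:
    """Return True if the final answer matches any accepted ground-truth label.

    Only `final_answer` is inspected; `transcript` is ignored, mirroring the
    idea that the route taken to the answer is not what gets scored.
    """
    accepted = {normalize(a) for a in accepted_answers}
    return normalize(submission.final_answer) in accepted


# Illustrative data only, not taken from the benchmark.
run = Submission(final_answer="  Homo  sapiens ", transcript="aligned reads, inspected top hits")
is_correct = grade(run, ["Homo sapiens", "human"])
```

The point of the design is that two runs with completely different transcripts receive the same score whenever their final answers agree with the ground truth.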

Performance Results

Of the 99 questions, 76 were solved by at least one member of a panel of up to five domain experts. The remaining 23 were classified as human-difficult. That count reflects quality control, which removed four questions identified as malformed, so all 23 human-difficult questions have verified, objective solutions.

On the human-solvable set, Claude Sonnet 4.6 and more capable models performed at roughly expert level, while Claude Mythos reached a 30% solve rate on the human-difficult problems. A parallel benchmark released by Genentech and Roche while this work was being finalized adds external support for the finding. CompBioBench, built around 100 computational biology tasks using synthetic and augmented data, found that Claude Opus 4.6 reached 81% overall and 69% on its hardest questions.

How Claude Solved Problems Humans Could Not

The more analytically interesting section of the paper examines what Claude was actually doing when it answered questions that left human experts stumped. Two distinct patterns emerged.

The first involves the breadth of Claude's underlying knowledge base. Tasks that would require a human expert to run a meta-analysis or stitch together multiple databases, Claude solved by combining internal knowledge of biological mechanisms and ontologies with live analysis of the data. The practical effect is that Claude has already internalized something like the accumulated output of thousands of papers, and can apply that synthesis directly rather than reconstructing it from scratch.

The paper notes one case where this became a liability: Claude's prior knowledge caused it to override what the data was actually showing, producing an incorrect answer on a problem that human experts solved correctly. A large pretraining corpus is an advantage until it isn't.

The second pattern is methodological. When Opus 4.6 was uncertain about an answer, it often tried multiple different analytical approaches and chose the result that those approaches converged on. This mirrors a principle researchers learn from peer review: when you're not confident in a single line of evidence, look for agreement across independent methods before committing to a conclusion. That Claude appears to apply this instinctively, rather than requiring explicit instruction, is one of the more practically significant findings in the paper.

The paper also describes a subtler behavior. In some cases, while human experts used algorithms or databases to identify and annotate properties of a dataset, Claude recognized certain patterns or sequences directly, in the way that the first eukaryotic promoter was discovered when a scientist noticed the sequence "TATA" appearing repeatedly upstream of genes. The analogy is apt: pattern recognition of that kind has historically been difficult to systematize, and represents something different from database lookup or rule-following.

Reliability vs. Raw Accuracy

Because each problem was attempted five times, the researchers could distinguish between problems a model solved reliably and problems where a correct answer appeared to result from a reasoning path the model could not consistently reproduce. On the human-solvable set, Opus 4.6 was strongly bimodal: of the problems it solved at all, it solved 86% at least four times out of five. On the human-difficult set, that figure collapsed to 44%, and the share of solved problems that were solved only once or twice in five attempts jumped from 9% to 44%.

Sonnet 4.6 showed the same pattern more sharply: between the two sets, its share of reliably solved problems (at least four of five attempts) dropped from 75% to 22%, and its share of brittle wins (only one or two of five attempts) rose from 9% to 56%.
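The reliability breakdown above amounts to bucketing each solved problem by how many of its five attempts succeeded. A minimal sketch of that bookkeeping follows. The input format (one success count per problem) and the bucket names are assumptions made for illustration, and the counts are invented rather than taken from the benchmark.

```python
from collections import Counter

ATTEMPTS = 5  # each problem was attempted five times


def reliability_profile(solve_counts: list[int]) -> dict[str, float]:
    """Bucket solved problems by how often they were solved.

    `solve_counts` holds, per problem, the number of successful attempts
    (0..5). Fractions are computed over problems solved at least once:
      reliable: solved in at least 4 of 5 attempts
      brittle:  solved in only 1 or 2 of 5 attempts
      middle:   solved in exactly 3 of 5 attempts
    """
    if any(c < 0 or c > ATTEMPTS for c in solve_counts):
        raise ValueError("each count must be between 0 and 5")
    solved = [c for c in solve_counts if c > 0]
    if not solved:
        return {"reliable": 0.0, "middle": 0.0, "brittle": 0.0}
    buckets = Counter()
    for c in solved:
        if c >= 4:
            buckets["reliable"] += 1
        elif c <= 2:
            buckets["brittle"] += 1
        else:
            buckets["middle"] += 1
    total = len(solved)
    return {name: buckets[name] / total for name in ("reliable", "middle", "brittle")}


# Invented counts for ten problems; two were never solved and are excluded.
toy_counts = [5, 5, 4, 3, 1, 0, 0, 2, 5, 4]
profile = reliability_profile(toy_counts)
```

Because the denominators are problems solved at least once, the three shares always sum to one, which is why the reliable and brittle percentages quoted in the text do not need to add up to 100% on their own.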

What this means practically: when Opus 4.6 gets a human-difficult bioinformatics problem right, it is about as likely to have stumbled onto a reasoning path it cannot repeat as to have used a method it can reproduce, and for Sonnet 4.6 the brittle case is the more common one. A single correct answer looks the same in both cases, but the underlying situation is quite different. A model that reliably extracts the right answer from a ChIP-seq dataset can be trusted to do it again. A model that got there once in five tries is harder to depend on, even when it was right.

What the Benchmark Cannot Tell Us

For tasks that neither human experts nor models have solved, it is never fully certain whether those problems are impossible or just extraordinarily difficult. The validation notebooks confirm that a signal exists in the data and the data is well-formed, but they do not guarantee that anyone, human or model, can find that answer starting from scratch.

That uncertainty is part of what gives the benchmark its unusual design. Most evaluations are built around problems that humans can solve, so human performance provides the ceiling. BioMysteryBench deliberately included problems where no such ceiling exists. A more capable model could, in principle, be the first to solve one of them, with no way to verify the answer except through its own logic and the background evidence that the signal is present.

Practical Implications

The reliability analysis suggests where AI-assisted bioinformatics is most and least useful right now. For problems that fall within the human-solvable set, current models are dependable enough to treat as genuine collaborators: they arrive at correct answers consistently, and their analytical strategies sometimes differ from human approaches in ways that are instructive rather than just incidental. For problems at the difficulty frontier, the situation is more like an unreliable expert, occasionally brilliant but not someone to trust with a single high-stakes analysis.

The approach of running multiple analytical methods and converging on areas of agreement, which Claude appears to have developed on its own, is also something human researchers could apply more systematically. It is not a new idea in science, but the fact that a model defaults to it without being prompted suggests it may be a more reliable heuristic than the field generally treats it as.

BioMysteryBench does not settle the broader question of how close AI is to independent scientific contribution. What it does show is that the gap between model capability and trained expert performance is smaller than most scientific benchmarks have suggested, and that for a specific class of problems requiring broad knowledge synthesis across a very large literature, models may already be ahead.
