The API Is the Attack Surface: How Chinese Labs Extracted Claude's Reasoning Capabilities at Industrial Scale

David Borish

The Attack That Looked Like Normal Traffic

Between April 22 and June 5, 2026, operators affiliated with Alibaba and its AI lab Qwen sent 28.8 million queries to Claude through approximately 25,000 fraudulent accounts. Anthropic described the campaign in a June 10 letter to Senators Tim Scott and Elizabeth Warren as "the largest known distillation attack on Anthropic to date." Alibaba did not respond to requests for comment.

The 44-day campaign was not a hack in the conventional sense. There was no breach of Anthropic's infrastructure, no stolen model weights, no compromised employee credentials. The attackers accessed Claude exactly the way any developer would, through the API, with valid sessions and ordinary-looking requests. What distinguished the campaign from legitimate usage was pattern and volume: tens of thousands of carefully structured prompts, concentrated on the same narrow capability areas, arriving across hundreds of coordinated accounts simultaneously. The capabilities being targeted, according to Anthropic, were software engineering, agentic reasoning, complex planning, and tool use, all areas central to its frontier Mythos Preview model.

This was not Anthropic's first disclosure of this kind. In February 2026, the company published detailed technical findings linking three Chinese AI labs, DeepSeek, Moonshot AI, and MiniMax, to coordinated extraction campaigns totaling more than 16 million exchanges across roughly 24,000 fraudulent accounts. The February disclosures and the new Alibaba accusation trace the same operational template at increasing scale.

How the Extraction Actually Works

The underlying technique is called knowledge distillation, a legitimate machine learning method in which a smaller "student" model learns to replicate the behavior of a larger "teacher" model by training on the teacher's input-output pairs. Frontier labs use it routinely to build cheaper, faster versions of their own systems. The same process, run against a competitor's API without permission, is the attack.
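To make the teacher-student idea concrete, here is a deliberately tiny sketch. It is a toy under stated assumptions, not anyone's actual pipeline: the "teacher" is a hypothetical black-box function, the "student" is a two-parameter linear model, and all names (teacher, collect_pairs, fit_student) are invented for illustration. The point it shows is that the student is fitted using only recorded input-output pairs and never looks inside the teacher.

```python
import random

def teacher(x: float) -> float:
    """Hypothetical black-box teacher; the student never reads this body."""
    return 3.0 * x + 2.0

def collect_pairs(n: int, seed: int = 0) -> list[tuple[float, float]]:
    """Query the teacher n times and record (input, output) pairs."""
    rng = random.Random(seed)
    inputs = [rng.uniform(-10.0, 10.0) for _ in range(n)]
    return [(x, teacher(x)) for x in inputs]

def fit_student(pairs: list[tuple[float, float]]) -> tuple[float, float]:
    """Fit a linear student y = w*x + b to the pairs by ordinary least squares."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var = sum((x - mean_x) ** 2 for x, _ in pairs)
    w = cov / var
    b = mean_y - w * mean_x
    return w, b

pairs = collect_pairs(200)
w, b = fit_student(pairs)
print(f"student: y = {w:.3f}*x + {b:.3f}")
```

In this toy the student recovers the teacher's behavior almost exactly because the teacher is trivially simple. A language model is vastly more complex, so a real student only approximates the teacher, and only on the kinds of inputs that were actually queried.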

The mechanics begin with access. Because Anthropic restricts commercial access to Claude in China for national security reasons, the labs in question used commercial proxy services that resell API access at scale. These services operate what Anthropic calls "hydra cluster" architectures: networks of fraudulent accounts distributed across the API and third-party cloud platforms, designed so that banning a single account has no meaningful effect on the broader operation. In one identified case, a single proxy network managed more than 20,000 fraudulent accounts simultaneously, mixing extraction traffic with unrelated customer requests to complicate detection.

Once access is established, the extraction phase begins. Attackers generate large volumes of carefully crafted prompts designed to elicit the model's most capable responses across a targeted domain. A single prompt like "You are an expert data analyst combining statistical rigor with deep domain knowledge. Your goal is to deliver data-driven insights grounded in real data and supported by complete and transparent reasoning" looks unremarkable in isolation. When variations of that prompt arrive tens of thousands of times across hundreds of coordinated accounts, all targeting the same narrow capability, the pattern becomes diagnostic.
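A rough sketch of why the aggregate pattern is detectable even when each request looks ordinary: group prompts by word overlap and count how many distinct accounts send near-identical templates. This is an illustrative assumption about one possible signal, not a description of Anthropic's classifiers; the function names, the 0.7 similarity threshold, and the sample account IDs are all hypothetical.

```python
import re

def word_set(prompt: str) -> frozenset[str]:
    """Lowercase the prompt and keep only its alphabetic words."""
    return frozenset(re.findall(r"[a-z]+", prompt.lower()))

def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Share of words two prompts have in common (1.0 means identical sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def accounts_per_template(requests, threshold: float = 0.7) -> list[int]:
    """Greedy clustering: put each prompt in the first cluster whose
    representative is similar enough, then count distinct accounts per cluster."""
    clusters = []  # each entry: (representative word set, set of account ids)
    for account_id, prompt in requests:
        words = word_set(prompt)
        for representative, accounts in clusters:
            if jaccard(representative, words) >= threshold:
                accounts.add(account_id)
                break
        else:
            clusters.append((words, {account_id}))
    return [len(accounts) for _, accounts in clusters]

# Hypothetical sample traffic: three accounts send variations of one template.
requests = [
    ("acct-1", "You are an expert data analyst combining statistical rigor with deep domain knowledge."),
    ("acct-2", "You are an expert data analyst combining statistical rigor with deep market knowledge."),
    ("acct-3", "You are an expert data analyst, combining statistical rigor with deep domain knowledge!"),
    ("acct-4", "What is the capital of France?"),
]
print(accounts_per_template(requests))
```

A template shared by three accounts means nothing; the same template shared by hundreds of accounts, tens of thousands of times, is the kind of signal the article describes. Production systems would presumably use far more robust similarity measures than word overlap.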

The outputs collected serve two distinct training purposes. The first is supervised fine-tuning, in which a less capable open-weight model is trained directly on the high-quality input-output pairs harvested from the target system. The second, more sophisticated application is reinforcement learning: the extracted outputs are used to generate thousands of unique training tasks and reward signals, including rubric-based grading tasks that effectively turn Claude into a reward model for the attacking lab's own RL pipeline.
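The rubric-grading idea can be illustrated with a minimal sketch of how graded criteria collapse into a single reward number. Everything here is an assumption for illustration: the Criterion type, the weights, and especially the keyword checks, which stand in for the judgments a capable grading model would make. Nothing in the disclosures specifies this structure.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float
    check: Callable[[str], float]  # returns a score between 0.0 and 1.0

def rubric_reward(response: str, rubric: list[Criterion]) -> float:
    """Weighted average of per-criterion scores, usable as an RL reward."""
    total_weight = sum(c.weight for c in rubric)
    return sum(c.weight * c.check(response) for c in rubric) / total_weight

# Hypothetical rubric; simple keyword checks stand in for a grading model.
rubric = [
    Criterion("states_assumptions", 1.0, lambda r: 1.0 if "assum" in r.lower() else 0.0),
    Criterion("shows_steps", 2.0, lambda r: 1.0 if "step" in r.lower() else 0.0),
    Criterion("gives_answer", 1.0, lambda r: 1.0 if "answer" in r.lower() else 0.0),
]

good = "Assuming x > 0. Step 1: square both sides. Step 2: solve. Answer: 4"
weak = "Answer: 4"
print(rubric_reward(good, rubric), rubric_reward(weak, rubric))
```

The value of a strong grader lies in the quality of the per-criterion judgments, which is why using a frontier model in that role amounts to borrowing its evaluation ability rather than just its answers.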

DeepSeek's campaign, described in Anthropic's February technical disclosure, illustrates the sophistication of the latter approach. The operation used synchronized traffic across accounts with identical patterns and coordinated timing, a load-balancing arrangement designed to increase throughput and evade detection thresholds. In one notable technique, prompts asked Claude to "imagine and articulate the internal reasoning behind a completed response and write it out step by step," generating chain-of-thought training data at scale. A separate set of tasks asked Claude to produce censorship-safe alternatives to politically sensitive queries, likely to train DeepSeek's own models to navigate restricted topics in Chinese-market deployments.
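Synchronized timing of the kind described above can, in principle, be surfaced by comparing when accounts are active. The sketch below buckets each account's request timestamps into fixed windows and flags pairs whose active windows largely coincide. It is a simplified assumption about one possible heuristic, with hypothetical names, thresholds, and sample data; it does not represent Anthropic's detection tooling.

```python
from itertools import combinations

def active_windows(timestamps, window_s: int = 60) -> set[int]:
    """Map request timestamps (in seconds) to the set of windows they fall in."""
    return {int(t // window_s) for t in timestamps}

def synchronized_pairs(traffic, window_s: int = 60, min_overlap: float = 0.8):
    """Return account pairs whose active windows overlap at least min_overlap,
    measured as intersection over union."""
    windows = {acct: active_windows(ts, window_s) for acct, ts in traffic.items()}
    flagged = []
    for a, b in combinations(sorted(windows), 2):
        union = windows[a] | windows[b]
        if not union:
            continue
        overlap = len(windows[a] & windows[b]) / len(union)
        if overlap >= min_overlap:
            flagged.append((a, b))
    return flagged

# Hypothetical traffic: acct-1 and acct-2 fire in lockstep, acct-3 does not.
traffic = {
    "acct-1": [5, 65, 125, 185, 245],
    "acct-2": [7, 70, 130, 190, 250],
    "acct-3": [400, 1900, 5000],
}
print(synchronized_pairs(traffic))
```

Legitimate users also cluster around business hours, so a heuristic like this would generate false positives on its own. It is plausible only as one input among several, which is consistent with the layered detection the article describes later.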

Moonshot AI took the approach further by attempting to reconstruct Claude's reasoning traces directly, using a more targeted prompt strategy aimed at capturing the internal logic the model uses before producing a visible response. MiniMax ran the largest campaign of the February batch and demonstrated real-time operational adaptability: when Anthropic released a new Claude model during the active campaign, MiniMax redirected nearly half its traffic to the updated system within 24 hours.

Critically, none of this activity required access to Claude's source code, model weights, or original training data. The attack surface is the output layer itself.

What Gets Copied and What Doesn't

The technical distinction matters for understanding both what the attacks accomplish and where they fall short. A distilled student model learns to approximate the behavior of the teacher system, reproducing its reasoning patterns, stylistic tendencies, and problem-solving approaches across the domains that were queried during extraction. For agentic coding tasks or multi-step reasoning benchmarks, a well-executed distillation campaign can produce a student model that performs meaningfully close to the teacher on those specific tasks at a fraction of the original development cost.

What the student model does not inherit is the teacher's safety infrastructure. Anthropic's safety work, including its approach to constitutional AI and the behavioral guardrails built into Claude, is a property of the training process, not of any single interaction. A model trained on Claude's outputs without that underlying training regime does not carry those constraints forward. Anthropic flags this explicitly: illicitly distilled models can deploy frontier-level capabilities for offensive cyber operations, bioweapons research assistance, or disinformation generation without the refusals that the original system would produce.

The attack also doesn't transfer perfectly. Distillation captures behavior, not architecture. A student model trained on Claude's outputs will differ from Claude in ways that matter for deployment, particularly at the frontier where the hardest reasoning tasks live. What it can do is close the gap significantly, sparing the lab conducting the extraction years of research investment and billions in compute costs.

Whether Any Model Can Hold an Advantage

The Alibaba accusation, set against the February disclosures and Google's own reports of similar extraction campaigns against Gemini, raises a harder question: if the interface that makes a model useful is also the mechanism through which its capabilities can be copied, can any frontier AI model maintain a durable advantage?

The straightforward answer is that complete prevention is not possible. A model has to respond to prompts in order to be useful. Every useful response is a data point that an attacker can harvest. The only way to make distillation impossible would be to make the model unavailable, which defeats the purpose of building it.

The more useful framing is to ask what defenses cost the attacker and what they preserve for the defender. Anthropic has deployed several detection layers, including classifiers and behavioral fingerprinting systems designed to identify distillation attack patterns in API traffic, plus tools for detecting coordinated activity across large numbers of accounts. These defenses raise the cost and complexity of extraction. They do not eliminate it. As Anthropic's own documentation notes, the attacker's advantage is structural: they need to succeed once; the defender must succeed continuously.

Several researchers and analysts tracking this issue have noted the analogy to earlier chapters of US-China technology competition, including the industrial espionage campaigns of the 2006-2013 period that preceded the Obama-Xi cybersecurity agreement. The difference with AI distillation attacks is that the mechanism requires no traditional intrusion. The API is open by design. The attackers are, in a strict technical sense, legitimate users.

What this means for competitive advantage in AI is that the moat around model intelligence alone is narrowing. A frontier model represents a capability lead, but that lead exists on a timer from the moment the model becomes publicly accessible. As one analysis put it, successful AI systems accidentally teach the market. Every answer a model gives reveals information about how it reasons, what patterns it has internalized, and where its capabilities concentrate. The more useful the model, the more educational its outputs.

Where durable advantage tends to compound instead is in adjacent factors: proprietary data generated through closed-loop deployment, infrastructure economics and inference efficiency, the speed of the research iteration cycle, and the depth of integration into enterprise workflows. Google's advertising business, generating $68 billion in 2025 revenue on the back of behavioral data accumulated across decades, is a rough analog. The model's outputs are valuable; the data generated by users interacting with those outputs is often more valuable still.

Anthropic has framed the distillation attack problem partly as an argument for tighter export controls, on the theory that executing extraction campaigns at scale requires access to advanced compute, and restricting that compute limits what attackers can do with harvested outputs. The Commerce Department's June 12 restrictions on Anthropic's Mythos and Fable models, coming two days after the Alibaba letter was sent, reflect a related logic from the other direction: limiting who can access the models in the first place.

The Regulatory Response and Where It Goes

Legislative movement is underway. Senators Bill Hagerty and Andy Kim are pursuing an amendment to defense legislation that would authorize blacklisting or sanctioning entities conducting distillation campaigns. Anthropic has publicly called for coordinated action between government and industry, including threat-intelligence sharing across AI companies and cloud providers.

The Frontier Model Forum, a coalition including OpenAI, Anthropic, and Google, is pooling detection intelligence to monitor and block adversarial distillation attempts. The collaboration faces practical constraints, including antitrust uncertainties that limit how directly competitors can coordinate, and the fundamental detection problem that sophisticated extraction traffic is designed to look like normal usage.

Alibaba was added to the Pentagon's Chinese military companies list in June 2026, a designation it is challenging. The Commerce Department, despite classifying DeepSeek as a national security risk through an interagency review, had not placed it on the trade blacklist as of mid-June, in part to avoid escalating tensions with Beijing.

What Anthropic's disclosures have clarified, beyond the specific campaigns, is that the attack surface is intrinsic to the product. Every frontier AI company operating a public API is, by design, operating a system that can be systematically queried for training data. The defenses available are real, expensive to build, and incomplete. The question now facing the industry and policymakers is how much of the competitive advantage embedded in American AI investment can be preserved through technical and regulatory means, and at what cost to the openness and accessibility that made those models valuable in the first place.
