X-Token: NVIDIA's Projection-Based Fix for Cross-Tokenizer Distillation

David Borish
Jun 1

The Problem That No One Had Formally Proved

Knowledge distillation is a well-established technique. A smaller "student" model learns not just from labeled data but from the full probability distribution that a larger "teacher" model assigns to each possible next token at every position in a sequence. That distribution encodes the teacher's uncertainty and its sense of which alternatives are plausible, information that raw correct-answer training discards. Researchers refer to this as "dark knowledge."
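To make the idea concrete, here is a minimal sketch of a same-tokenizer distillation loss over a toy four-token vocabulary. The logit values and helper names are invented for illustration and are not taken from the paper.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits over a 4-token vocabulary shared by both models.
teacher_logits = [2.0, 1.5, 0.1, -1.0]
student_logits = [0.5, 0.4, 0.3, 0.2]

teacher_probs = softmax(teacher_logits)
student_probs = softmax(student_logits)

# Hard-label training only sees "token 0 is correct". Distillation also sees
# that the teacher considers token 1 nearly as plausible as token 0.
distill_loss = kl_divergence(teacher_probs, student_probs)
print("teacher distribution:", [round(p, 3) for p in teacher_probs])
print(f"KL(teacher || student) = {distill_loss:.4f}")
```

The loss only makes sense because index k means the same token for both models, which is exactly the assumption that breaks across tokenizers.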

The complication is that standard distillation requires both models to share the same tokenizer, the component that breaks text into discrete units for processing. Llama-3.2-1B, for example, cannot directly learn from Qwen3-4B or Phi-4-Mini because those models carve text into different pieces. Token positions across vocabularies have no natural correspondence. This restriction locks practitioners into whichever teacher model happens to use the same tokenizer as their student, regardless of whether a better teacher exists elsewhere.

Two methods have been developed to work around this. Universal Logit Distillation (ULD) sidesteps vocabulary alignment entirely by rank-sorting both distributions and minimizing the distance between ranks, treating each position as equivalent regardless of which token it represents. GOLD, the current state of the art, takes a more structured approach: it identifies tokens that have exact string matches across both vocabularies, applies standard KL divergence on those matched pairs, and applies ULD rank-matching on everything else.
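The rank-matching idea can be sketched in a few lines. This is an illustration only, assuming the distance is an L1 difference between the two distributions after sorting and zero-padding; the function name and toy values are hypothetical.

```python
def uld_rank_loss(teacher_probs, student_probs):
    """L1 distance between the two distributions after sorting by rank.

    Token identity is discarded: only the sorted probability values matter.
    """
    t = sorted(teacher_probs, reverse=True)
    s = sorted(student_probs, reverse=True)
    size = max(len(t), len(s))
    t += [0.0] * (size - len(t))  # pad the smaller vocabulary with zeros
    s += [0.0] * (size - len(s))
    return sum(abs(a - b) for a, b in zip(t, s))

teacher = [0.7, 0.2, 0.1]             # toy 3-token teacher vocabulary
student_a = [0.1, 0.7, 0.15, 0.05]    # most mass on student token 1
student_b = [0.7, 0.1, 0.05, 0.15]    # most mass on student token 0

loss_a = uld_rank_loss(teacher, student_a)
loss_b = uld_rank_loss(teacher, student_b)
print(loss_a, loss_b)
```

Both students receive the same loss even though they put their mass on different tokens, which is what "identity-agnostic" means in practice.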

NVIDIA's paper, submitted in May 2026, demonstrates that GOLD contains two structural failures that in certain configurations make it worse than using no teacher at all.

Two Failures, Opposite Causes

The first failure concerns what the paper calls the "uncommon-token" problem. Llama-3 stores multi-digit numbers as single tokens: the number 201 is one unit. Qwen3 splits numbers digit by digit: 2, 0, and 1 are three separate tokens. When Qwen3-4B is the teacher and Llama-3.2-1B is the student, none of Llama's multi-digit numerals have string matches in the Qwen vocabulary. All 1,100 of them, 100 two-digit numbers and 1,000 three-digit numbers, fall into the unmatched remainder that GOLD trains with rank-based ULD.

The consequences compound. Those tokens receive identity-agnostic noise from the rank-matching process. Additionally, GOLD's KL divergence term on the matched tokens propagates gradients that suppress every unmatched token's probability, regardless of whether the ground truth token is in that set.

The paper formally proves this in Proposition 1: GOLD's common-KL term induces non-negative loss gradients on every uncommon student logit, so each gradient-descent step pushes those logits, and with them the tokens' probabilities, systematically downward. The empirical result is stark. A Llama-3.2-1B student trained with GOLD using Qwen3-4B as teacher scores 2.56 on GSM8k, a math reasoning benchmark, compared to 12.89 for the same student trained with same-tokenizer distillation from a weaker Llama-3.2-3B teacher. The stronger teacher, through GOLD, produces a worse student than the weaker one.
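The sign of that gradient can be checked numerically. The sketch below assumes the common-KL term is a forward KL summed over matched tokens, with the student softmax taken over its full vocabulary; the paper's exact formulation may differ. Under that assumption the gradient with respect to an uncommon logit u works out to p_student(u) times the teacher mass on the matched set, which cannot be negative. All values are invented.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def common_kl(student_logits, teacher_probs, common):
    """Forward KL restricted to matched tokens (assumed form of the term)."""
    p_s = softmax(student_logits)
    return sum(teacher_probs[c] * math.log(teacher_probs[c] / p_s[c])
               for c in common)

# Toy student vocabulary of 4 tokens; tokens 0 and 1 are matched,
# tokens 2 and 3 are uncommon (think of "201" under a digit-splitting teacher).
student_logits = [1.0, 0.5, 0.2, -0.3]
teacher_probs = {0: 0.6, 1: 0.3}  # teacher mass, keyed by student index
common, uncommon = [0, 1], [2, 3]

p_s = softmax(student_logits)
teacher_mass = sum(teacher_probs[c] for c in common)
analytic = {u: p_s[u] * teacher_mass for u in uncommon}

def numeric_grad(u, h=1e-6):
    """Central finite difference of the loss with respect to logit u."""
    up, down = list(student_logits), list(student_logits)
    up[u] += h
    down[u] -= h
    return (common_kl(up, teacher_probs, common)
            - common_kl(down, teacher_probs, common)) / (2 * h)

numeric = {u: numeric_grad(u) for u in uncommon}
for u in uncommon:
    print(f"logit {u}: analytic={analytic[u]:.6f} numeric={numeric[u]:.6f}")
```

A positive gradient means gradient descent lowers the logit, so every uncommon token loses probability on every step, whether or not it is the correct answer.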

The second failure operates in the opposite direction. GOLD defines its matched token set using strict string equality after simple canonicalization. A student token like "Hundreds" maps naturally to the teacher sequence "Hund" followed by "reds," but because those two strings are not identical, GOLD excludes the pair entirely. The result is an overly conservative matched set that discards useful alignment signal even when the correspondence between tokens is well-defined.

These two failures require different remedies: the first needs the partition removed, the second needs it relaxed.

How X-Token Works

X-Token has three components that operate together. The first is span alignment. Because teacher and student tokenizers produce sequences of different lengths for the same text, X-Token uses dynamic programming to group tokens into chunks where each chunk pair decodes to the same underlying text substring. A chain-rule merge then combines per-token probabilities within each chunk into a single distribution. This alignment is cached per sequence, adding no per-step training overhead.
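A simplified version of span alignment looks like this. Note the substitution: instead of the dynamic programming the paper describes, this sketch uses a greedy two-pointer pass, which is enough when both tokenizations concatenate to exactly the same text. The merge step only combines the probabilities of the observed tokens in a chunk, not full distributions. Token strings and probabilities are hypothetical.

```python
import math

def align_spans(teacher_tokens, student_tokens):
    """Group tokens into the smallest chunks that decode to the same substring.

    Returns a list of ((t_start, t_end), (s_start, s_end)) half-open ranges.
    Assumes non-empty tokens and identical decoded text on both sides.
    """
    if "".join(teacher_tokens) != "".join(student_tokens):
        raise ValueError("both tokenizations must decode to the same text")
    spans = []
    i = j = 0
    while i < len(teacher_tokens) and j < len(student_tokens):
        i_start, j_start = i, j
        t_text, s_text = teacher_tokens[i], student_tokens[j]
        i, j = i + 1, j + 1
        while t_text != s_text:
            # Extend whichever side currently covers less text.
            if len(t_text) < len(s_text):
                t_text += teacher_tokens[i]
                i += 1
            else:
                s_text += student_tokens[j]
                j += 1
        spans.append(((i_start, i), (j_start, j)))
    return spans

def merge_chunk_probs(token_probs, spans, side):
    """Chain rule: a chunk's probability is the product of its token probabilities."""
    merged = []
    for span in spans:
        start, end = span[side]
        merged.append(math.exp(sum(math.log(p) for p in token_probs[start:end])))
    return merged

teacher_tokens = ["The", " answer", " is", " ", "2", "0", "1"]  # digit-splitting
student_tokens = ["The", " answer", " is", " ", "201"]          # whole numeral
teacher_token_probs = [0.9, 0.8, 0.95, 0.9, 0.9, 0.8, 0.5]      # invented

spans = align_spans(teacher_tokens, student_tokens)
chunk_probs = merge_chunk_probs(teacher_token_probs, spans, side=0)
print(spans)
print([round(p, 3) for p in chunk_probs])
```

Because the alignment depends only on the token strings, it can be computed once per sequence and reused, consistent with the caching the article mentions.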

The second component is a projection matrix W. Once spans are aligned, teacher and student distributions still operate over different vocabularies. W maps each student token to a weighted combination of teacher tokens. It is built deterministically in two passes before training begins. The first pass assigns exact matches: for every student-teacher token pair whose decoded strings match after canonicalization, the corresponding entry in W is set to one. The second pass handles unmatched student tokens by re-tokenizing their decoded text under the teacher tokenizer. If that process yields four or fewer tokens, W assigns exponentially decayed weights to that sequence, with the leading sub-token receiving the highest weight because it typically carries the most informative probability mass. Each row of W is normalized so that multiplying by W preserves the probability structure of the student's distribution.
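The two-pass construction can be sketched over toy vocabularies. Several pieces here are assumptions rather than details from the paper: the decay factor of 0.5, the `canonicalize` rule, and the greedy longest-match function standing in for a real teacher tokenizer.

```python
def canonicalize(token):
    """Hypothetical canonicalization: strip surrounding whitespace."""
    return token.strip()

def toy_teacher_tokenize(text, teacher_index):
    """Greedy longest-match stand-in for the teacher tokenizer."""
    tokens, pos = [], 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            if text[pos:end] in teacher_index:
                tokens.append(text[pos:end])
                pos = end
                break
        else:
            return None  # text cannot be covered by the toy vocabulary
    return tokens

def build_projection(student_vocab, teacher_vocab, max_len=4, decay=0.5):
    teacher_index = {canonicalize(t): k for k, t in enumerate(teacher_vocab)}
    W = [[0.0] * len(teacher_vocab) for _ in student_vocab]

    # Pass 1: exact string matches get weight one.
    for s, token in enumerate(student_vocab):
        key = canonicalize(token)
        if key in teacher_index:
            W[s][teacher_index[key]] = 1.0

    # Pass 2: re-tokenize unmatched student tokens under the teacher.
    for s, token in enumerate(student_vocab):
        if sum(W[s]) > 0:
            continue
        pieces = toy_teacher_tokenize(canonicalize(token), teacher_index)
        if pieces is None or len(pieces) > max_len:
            continue  # row stays empty
        for rank, piece in enumerate(pieces):
            W[s][teacher_index[piece]] += decay ** rank  # leading piece weighs most
        total = sum(W[s])
        W[s] = [w / total for w in W[s]]  # normalize the row to sum to one
    return W

student_vocab = ["the", "201", "Hundreds", "7", "20120"]
teacher_vocab = ["the", "2", "0", "1", "7", "Hund", "reds"]
W = build_projection(student_vocab, teacher_vocab)
for token, row in zip(student_vocab, W):
    print(f"{token!r:12}", [round(w, 3) for w in row])
```

In this toy run "201" spreads its weight over "2", "0", and "1" in decreasing order, "Hundreds" maps mostly to "Hund", and "20120" is left with an empty row because it re-tokenizes to five pieces.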

The third component is the two loss formulations. P-KL removes GOLD's partition entirely. It projects the student's probability distribution into teacher vocabulary space using W, then computes KL divergence directly between the teacher distribution and the projected student distribution. No uncommon set exists, so rank-based noise is eliminated. The suppressive gradient problem is also resolved: the projection routes the student's probability for "201" directly onto the teacher tokens for 2, 0, and 1 via W, giving the student coherent learning signal about those numerals.
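Here is a rough sketch of the P-KL computation on a three-token student vocabulary and a five-token teacher vocabulary. It assumes a forward KL from teacher to projected student and uses a hand-written W with a 4/7, 2/7, 1/7 split for "201"; the KL direction, the weights, and the probabilities are all illustrative choices, not values from the paper.

```python
import math

def project(student_probs, W):
    """Map a student distribution into teacher vocabulary space via W."""
    n_teacher = len(W[0])
    return [sum(student_probs[s] * W[s][t] for s in range(len(W)))
            for t in range(n_teacher)]

def kl(p, q, eps=1e-12):
    """KL(p || q); eps guards against zero entries in q."""
    return sum(pi * math.log(pi / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

# Student vocab: ["the", "201", "7"]; teacher vocab: ["the", "2", "0", "1", "7"].
W = [
    [1.0, 0.0, 0.0, 0.0, 0.0],              # "the" -> "the" (exact match)
    [0.0, 4 / 7, 2 / 7, 1 / 7, 0.0],        # "201" -> "2", "0", "1" (decayed)
    [0.0, 0.0, 0.0, 0.0, 1.0],              # "7"   -> "7"   (exact match)
]

student_probs = [0.2, 0.7, 0.1]
teacher_probs = [0.1, 0.6, 0.15, 0.1, 0.05]

projected = project(student_probs, W)
p_kl_loss = kl(teacher_probs, projected)
print("projected student:", [round(p, 3) for p in projected])
print(f"P-KL loss = {p_kl_loss:.4f}")
```

Since every row of W sums to one, the projected vector is still a probability distribution, and the student's mass on "201" is compared against the teacher's mass on the digit tokens rather than being pushed toward zero.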

H-KL applies when the partition is structurally sound, meaning critical tokens have string matches in the teacher vocabulary. In that case, GOLD's direct KL on matched pairs delivers sharper per-pair supervision than P-KL's blended projection. H-KL retains GOLD's hybrid loss structure but expands the matched set by admitting near-equivalent pairs like "Hundreds" and "Hund" using top-ranked mappings under W. Exact matches are preserved because they receive weight one in W, the highest possible value.
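One way to picture the expanded matched set: pair each student token with its highest-weighted teacher token under W. This is a guess at the mechanism based on the description above; the paper may apply thresholds or extra filtering that this sketch omits, and the function name is hypothetical.

```python
def expanded_matched_set(W, student_vocab, teacher_vocab):
    """Pair each student token with its top-weighted teacher token, if any."""
    matched = {}
    for s, row in enumerate(W):
        best = max(range(len(row)), key=row.__getitem__)
        if row[best] > 0:  # empty rows stay in the uncommon set
            matched[student_vocab[s]] = teacher_vocab[best]
    return matched

student_vocab = ["the", "Hundreds", "zzz"]
teacher_vocab = ["the", "Hund", "reds"]
W = [
    [1.0, 0.0, 0.0],        # exact match, weight one
    [0.0, 2 / 3, 1 / 3],    # "Hundreds" -> "Hund" + "reds"
    [0.0, 0.0, 0.0],        # no mapping
]

matched = expanded_matched_set(W, student_vocab, teacher_vocab)
print(matched)
```

Exact matches survive automatically because no other entry in their row can exceed a weight of one.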

Selecting between the two modes uses a coverage audit. The researcher examines what fraction of critical token categories, specifically multi-digit numerals for math tasks, appear in the matched set under a given teacher. Under Qwen3-4B, zero of Llama's two-digit numerals and zero of its three-digit numerals have matches. Under Phi-4-Mini, all of them do. The rule follows from the audit: use P-KL when critical tokens are unmatched, use H-KL when the matched set is sound.
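The audit itself is simple enough to sketch. The two vocabularies below are stand-ins, not the real Qwen3 or Phi-4-Mini vocabularies, and the `threshold` parameter is hypothetical: the article only reports the two extremes of 0% and 100% coverage. The numeral list assumes leading zeros are included, which is what yields 100 two-digit and 1,000 three-digit strings.

```python
def coverage(critical_tokens, teacher_vocab):
    """Fraction of critical student tokens with an exact teacher match."""
    hits = sum(1 for token in critical_tokens if token in teacher_vocab)
    return hits / len(critical_tokens)

def choose_mode(critical_tokens, teacher_vocab, threshold=1.0):
    """Pick H-KL only when critical-token coverage meets the threshold."""
    return "H-KL" if coverage(critical_tokens, teacher_vocab) >= threshold else "P-KL"

# Multi-digit numerals as single student tokens.
two_digit = [f"{n:02d}" for n in range(100)]
three_digit = [f"{n:03d}" for n in range(1000)]
critical = two_digit + three_digit

digit_split_vocab = set("0123456789")                 # teacher splits digit by digit
whole_numeral_vocab = digit_split_vocab | set(critical)  # teacher keeps numerals whole

print(len(critical), "critical tokens")
print("digit-splitting teacher:", coverage(critical, digit_split_vocab),
      "->", choose_mode(critical, digit_split_vocab))
print("whole-numeral teacher:", coverage(critical, whole_numeral_vocab),
      "->", choose_mode(critical, whole_numeral_vocab))
```

For a domain other than math, the same check applies with a different `critical` list, such as identifiers for code or units for scientific text.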

Results and What They Show

All experiments use Llama-3.2-1B as the student, trained on the NemotronClimbMix dataset for 30,000 steps with a batch size of 768 and context length of 4,096. Evaluation covers MMLU, GSM8k, MATH-Hendrycks, Winogrande, and HellaSwag, reported as 3-shot accuracy.

Against Qwen3-4B as teacher, GOLD reaches an average score of 35.03 across the five benchmarks, below even continued pre-training without any teacher (36.63). Pure ULD already outperforms GOLD at 36.77, which confirms that GOLD's partition itself is the primary source of failure when critical tokens are misaligned, not the rank-matching scheme in the uncommon set. X-Token with P-KL reaches 38.85. GSM8k alone moves from 2.56 under GOLD to 15.54 under P-KL, surpassing same-tokenizer distillation from the Llama-3.2-3B teacher (12.89) on that benchmark.

Against Phi-4-Mini, where multi-digit numerals all land in the matched set, GOLD reaches 38.66. H-KL improves that to 39.18. Applying P-KL to Phi-4-Mini drops performance to 37.50, which shows that the coverage audit selects the correct mode, and that using the wrong mode on a well-formed partition causes measurable harm.

X-Token also supports multi-teacher distillation without architectural changes. Each teacher receives its own projection matrix and the appropriate loss mode. A setup combining Phi-4-Mini with static weight 0.8 and Llama-3.2-3B with standard KL at weight 0.2 reaches an average of 40.48, which is 1.30 points above the best single cross-tokenizer result. The paper evaluates whether complementarity or volume drives those gains. Combining Phi-4-Mini and Qwen3-4B, two models with overlapping reasoning strengths, scores only 38.49, below the best single teacher. Adding Qwen3-4B as a third teacher alongside the best pair yields 40.15, with math performance degrading slightly. Teacher diversity appears to matter more than teacher count.
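Static weighting amounts to a fixed weighted sum of per-teacher losses. The sketch below uses the 0.8 and 0.2 weights from the article; the dictionary keys and the per-step loss values are invented for illustration.

```python
def combine_teacher_losses(losses, weights):
    """Weighted sum of per-teacher distillation losses with fixed weights."""
    if set(losses) != set(weights):
        raise ValueError("each teacher needs exactly one loss and one weight")
    return sum(weights[name] * losses[name] for name in losses)

# Weights from the article; the loss values for this step are made up.
weights = {"phi4_mini_hkl": 0.8, "llama_3b_kl": 0.2}
step_losses = {"phi4_mini_hkl": 1.25, "llama_3b_kl": 0.75}

total_loss = combine_teacher_losses(step_losses, weights)
print(f"combined loss = {total_loss:.3f}")
```

Each teacher's loss is computed in its own mode first (H-KL through its own W for the cross-tokenizer teacher, standard KL for the same-tokenizer one), so adding a teacher changes only this final sum.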

The paper also evaluates whether the projection matrix W should be frozen after initialization or jointly refined with the student. Jointly learned W outperforms frozen W under P-KL on the Qwen pair, suggesting that the rule-based initialization provides a useful starting point that gradient-based refinement can improve.

Limitations the Paper Acknowledges

The experiments use only Llama-3.2-1B as the student, trained under a continued pre-training setup. Whether results hold for larger students or instruction-tuned settings is not addressed. Only three teacher configurations are tested, all involving BPE-based tokenizers. The multi-token rule in Pass 2 skips any student token whose decoded text re-tokenizes to more than four teacher tokens, leaving those rows empty in W. Low-overlap tokenizer families, including SentencePiece and byte-level BPE, are reserved for future work. Finally, static weighting outperforms confidence-adaptive weighting across all multi-teacher configurations tested, but the paper does not explain why adaptive schemes underperform.

What This Means for Model Development

The practical implication is that cross-tokenizer distillation has been quietly failing for a specific class of practitioners: those who committed to a student model from one model family and wanted to leverage a stronger teacher from a different family. GOLD, the method most likely to be used in that situation, can perform worse than using no teacher at all when tokenizer fragmentation patterns conflict with the student's critical tokens.

X-Token's projection matrix is built from tokenizer strings before training, requiring no training data or learned initialization. The coverage audit that determines whether to use P-KL or H-KL is a defined, reproducible criterion. Both of these properties make the method usable without specialized expertise in vocabulary alignment.

The broader pattern here fits what I describe in my upcoming book The Tony Hawk Paradox: capabilities that seem available in principle, in this case learning from the best available teacher model, often encounter invisible structural barriers that block them in practice. X-Token's identification and formal proof of GOLD's suppressive gradient problem is a precise example of that dynamic. What appeared to be a functioning tool was producing active harm in a specific regime, and the harm went undetected because the aggregate loss remained plausible even as individual benchmark performance collapsed.

The finding should prompt anyone using GOLD-based cross-tokenizer distillation to audit which tokens in their student vocabulary fall outside the matched set under their chosen teacher, particularly for task-critical tokens in their target domain.
