
Researchers Built a Language Model From 1930 Text That Learned Python Without Knowing Computers Exist


What Talkie Is and Why It Exists

Nick Levine, David Duvenaud, and Alec Radford built talkie by training a 13-billion-parameter model on 260 billion tokens of pre-1931 English text, including books, newspapers, scientific journals, patents, periodicals, and case law. The knowledge cutoff is December 31, 1930, chosen because that is when works enter the public domain in the United States.


The motivation is methodological. Every major modern language model, regardless of the organization that built it, draws its training data from the web. Many also incorporate synthetic data or distillation from other models trained on the web. This creates a situation where researchers studying language model behavior cannot easily separate what is a property of large language models in general from what is a property of having been trained on this one dataset. Talkie exists to generate a point of comparison.


The researchers also see direct scientific value in the forecasting question: how well can a model whose knowledge ends at a fixed date anticipate events that came after it? Working from nearly 5,000 short event descriptions drawn from the New York Times "On This Day" feature, they calculated how surprising each description was to the model, measured in bits per byte of text.


Events from the 1950s and 1960s were measurably more surprising than those from the 1920s, which is what you would expect from a model with a 1930 knowledge cutoff. The team plans to use this framework to study how forecasting accuracy scales with model size and degrades at longer horizons.
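The bits-per-byte metric can be sketched simply: sum the model's per-token surprisal, convert from nats to bits, and normalize by the UTF-8 byte length so scores are comparable across tokenizers and text lengths. The function below is a minimal illustration of that arithmetic, not the paper's evaluation code.

```python
import math

def bits_per_byte(token_logprobs, text):
    """Convert per-token natural-log probabilities into bits per byte.

    Total surprisal in bits is -sum(log p) / ln(2); dividing by the
    UTF-8 byte length normalizes away tokenizer and length effects.
    Higher values mean the text was more surprising to the model.
    """
    total_bits = -sum(token_logprobs) / math.log(2)
    return total_bits / len(text.encode("utf-8"))

# A 10-byte string tokenized into 3 tokens, each with probability 0.5,
# carries 3 bits of surprisal, i.e. 0.3 bits per byte.
score = bits_per_byte([math.log(0.5)] * 3, "0123456789")
```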


The Coding Result


The most striking single finding in the paper involves Python. The model has no knowledge of digital computers. Python was created in 1991. Nothing in talkie's training data mentions it. Yet when given a few demonstration examples of Python functions in context, the model can write new correct programs.


The researchers tested this using HumanEval, a standard programming benchmark, giving the model 100 attempts per problem and measuring whether it produced at least one correct solution. Talkie substantially underperformed models trained on modern web data, which includes large quantities of code. But performance improved steadily with model scale, and the team argues the trend is meaningful.
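The 100-attempts-per-problem protocol is the pass@k framing that HumanEval is normally scored with. The standard unbiased estimator (from the original HumanEval paper) computes, from n sampled attempts with c correct, the probability that at least one of k draws succeeds; the sketch below shows that estimator, not talkie's specific evaluation harness.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn without replacement from n attempts (c of them
    correct) solves the problem. 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# One correct solution out of 100 attempts yields pass@100 = 1.0,
# while the same run estimates pass@1 at only 1%.
assert pass_at_k(100, 1, 100) == 1.0
```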


The successes are constrained. Every correct solution the model produced was either a simple one-liner or a small modification of a demonstration example. The most illustrative case is a rotation cipher: the model was given the encoding function and asked to write the decoding function. The correct answer required swapping a single character, replacing addition with subtraction. Talkie got it right. The researchers interpret this as evidence that the model had internalized the concept of inverse functions, despite having no exposure to Python specifically.
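The shift-cipher task can be reconstructed roughly as follows. This is a hypothetical rendering in the spirit of HumanEval's shift-cipher problem, not the exact prompt the model saw: the decoder is character-for-character identical to the encoder except that one `+` becomes a `-`.

```python
def encode_shift(s, shift=5):
    """Rotate each lowercase letter forward by `shift` positions."""
    return "".join(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")) for ch in s)

def decode_shift(s, shift=5):
    """The inverse function: identical to encode_shift except that
    the addition of the shift becomes a subtraction."""
    return "".join(chr((ord(ch) - ord("a") - shift) % 26 + ord("a")) for ch in s)

round_trip = decode_shift(encode_shift("python"))
```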


The Contamination Problem It Solves


Benchmark contamination is a known and persistent problem in language model evaluation. When a model's training data includes test questions or their close paraphrases, performance on those benchmarks overstates actual capability. The problem is difficult to audit in modern models because the training corpora are large, partially undisclosed, and drawn from the same web sources where benchmarks are published and discussed.

Talkie is contamination-free by construction. Nothing from HumanEval, or any other benchmark created after 1930, could have appeared in its training data. This makes it a cleaner tool for studying generalization, specifically the question of how far a model can operate beyond what it has directly seen.


How They Built a 1930 Conversation Partner


Converting the base model into a usable conversational system without importing modern expectations was one of the paper's more unusual engineering challenges. Standard instruction-tuning data reflects contemporary assumptions about what a language model assistant should be, how it should structure responses, and what topics are relevant. Using that data would bake in anachronistic behavior.


Instead, the team built post-training data from historical texts with regular structure: etiquette manuals, cookbooks, letter-writing guides, dictionaries, encyclopedias, and poetry collections. They generated instruction-response pairs from these sources, then used online direct preference optimization with Claude Sonnet 4.6 as the judge. The model's average instruction-following rating on a held-out evaluation set improved from 2.0 to 3.4 on a five-point scale over training. A final round of supervised fine-tuning used rejection-sampled multi-turn conversations between Claude Opus 4.6 and talkie to smooth rough edges in its conversational behavior.
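The direct preference optimization step described above optimizes a contrastive loss over judged response pairs. A minimal sketch of the per-pair DPO loss follows, with `beta` as an illustrative value; the paper's actual training configuration is not specified here.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin).

    logp_w / logp_l are the policy's summed log-probs for the
    judge-preferred and judge-rejected responses; ref_logp_* come from
    the frozen reference model. The loss falls as the policy shifts
    probability toward the response the judge preferred."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# At zero margin the loss is ln(2); favoring the preferred response lowers it.
baseline = dpo_loss(-10.0, -10.0, -10.0, -10.0)
improved = dpo_loss(-5.0, -10.0, -10.0, -10.0)
```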


The researchers acknowledge a tension here. Reinforcement learning with AI feedback shapes model behavior according to the preferences of whatever model is doing the judging. An earlier 7-billion-parameter version of talkie began producing listicles, a format that did not exist in 1930, because that's what modern AI judges reward. As they scale, the team hopes to use earlier vintage models as judges to build a fully era-appropriate post-training pipeline.


The Data Quality Problem


One of the larger obstacles to training vintage language models is OCR quality. No text from before 1931 was born digital. Everything had to be transcribed from physical sources, and standard OCR systems, designed for clean modern layouts, perform poorly on historical documents.


The researchers quantified this gap in controlled experiments. A model trained on pre-1931 text transcribed with conventional OCR achieved only 30% of the learning efficiency of a model trained on human-transcribed versions of the same documents, measured across 63 evaluation tasks. Simple regex cleaning brought that figure to 70%. The remaining gap is why the team is developing a dedicated vintage OCR system, though they note that modern vision-language models, while more accurate, tend to hallucinate modern facts into historical text, which would corrupt the corpus in a different way.
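"Simple regex cleaning" of OCR output typically means rules like the ones below: rejoining words hyphenated across line breaks, collapsing whitespace runs, and dropping non-printable garbage. These particular rules are illustrative examples, not the authors' actual pipeline.

```python
import re

def clean_ocr(text):
    """Illustrative regex cleanup of common OCR artifacts in scanned
    historical text; the specific rules here are assumptions, not the
    paper's documented pipeline."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # rejoin hyphenated line breaks
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces/tabs
    text = re.sub(r"[^\x20-\x7E\n]", "", text)    # drop non-printable bytes
    return text.strip()

cleaned = clean_ocr("tele-\nphone")
```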


Temporal Leakage and Its Limits


Despite systematic efforts to exclude post-1930 text, the model is not perfectly contained. An earlier 7-billion-parameter version clearly knew details about the Roosevelt presidency and New Deal legislation. The current 13-billion-parameter model has some awareness of World War II, the United Nations, and the division of Germany after the war. These facts entered through documents with faulty date metadata or historical texts with anachronistic editorial additions such as footnotes or introductions written long after original publication.

The team developed an n-gram-based anachronism classifier to filter the corpus, but it did not catch everything. Improving leakage detection is listed as a priority for future versions.
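An n-gram anachronism filter of the kind described can be sketched as a scan for word n-grams that occur almost exclusively after the cutoff. In practice the flagged set would be mined by comparing n-gram frequencies across dated pre- and post-1931 corpora; the hand-picked set here is purely hypothetical.

```python
def find_anachronisms(text, flagged_ngrams, n=2):
    """Minimal sketch of an n-gram anachronism filter: return the word
    n-grams in `text` that appear in a precomputed set of post-cutoff
    n-grams. `flagged_ngrams` is a hypothetical hand-picked set here."""
    words = text.lower().split()
    grams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
    return sorted(grams & flagged_ngrams)

flags = {"united nations", "new deal", "world war"}
hits = find_anachronisms("The United Nations met after the war.", flags)
```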


Scale Plans and the Open Question


Talkie-1930-13b is the largest vintage language model the researchers are aware of. They describe it as a step toward a GPT-3-scale model they plan to release this summer, with a preliminary estimate suggesting their corpus can grow to over a trillion tokens, which they believe would support a model comparable in capability to the original ChatGPT.


The scientific question motivating this scaling is one that Demis Hassabis has raised publicly: could a model trained on data through 1911 independently arrive at general relativity, the way Einstein did in 1915? Talkie's architecture frames that question as empirically testable rather than speculative. Whether a model with no knowledge of an invention can derive it from first principles, given enough scale, is something this line of research could eventually measure.


That is the deeper argument the paper is making. The coding result is striking because it is concrete: here is a model that had never seen Python producing correct Python functions. But the broader claim is that vintage models offer a way to study generalization, forecasting, and capability emergence with a degree of experimental control that modern models, all of them downstream of the same web, cannot provide.

© 2026 by David Borish IP, LLC, All Rights Reserved