top of page
  • LinkedIn
  • Instagram
  • Facebook
  • X

Your AI Keeps Giving You the Same Answer. Harvard Researchers Just Found a Way Around That

Your AI Keeps Giving You the Same Answer. Harvard Researchers Found a Way Around That
Your AI Keeps Giving You the Same Answer. Harvard Researchers Just Found a Way Around That

The Hidden Cost of Getting Better at Being Right


Every major AI lab is racing to improve benchmark scores. Models that answer factual questions more accurately, reason more reliably, and produce fewer errors are winning on leaderboards and attracting users. The competitive logic is clear. What's less visible is what gets lost in the process.


A new working paper from Harvard researchers Queenie Luo, Gary King, Michael Puett, and Michael D. Smith identifies a specific casualty of that accuracy race: the ability to generate genuinely diverse, creative responses over extended use. As models are trained to converge on correct answers, their probability distributions become more peaked around the most common outputs. That means more information gets pushed into statistical tails where standard decoding methods never look. Newer, more accurate models, the researchers found, generate narrower and more repetitive outputs than older ones when asked open-ended questions.


This matters most for a category of activity the researchers call the "search quest," which covers any prolonged, exploratory effort to find an answer that doesn't yet exist in a fixed form: choosing a wedding dress, finding a research topic, naming a startup, designing a product. Unlike a factual query with a correct answer, a search quest requires learning the space of possibilities. You don't know what you want until you've seen enough options to understand what's out there.


Current AI tools are poorly suited for this. They can produce a useful first list. After that, they start repeating themselves.


What Standard Decoding Actually Does


To understand the problem, it helps to understand how models generate text. At each step, a model computes a probability distribution over all possible next tokens. Standard decoding methods, including top-k and nucleus sampling, select from the highest-probability tokens. That keeps output fluent and on-topic, but it also means the vast majority of information encoded in a model's parameters is never accessed.


The researchers illustrate this with a simple example. When prompted to brainstorm book topics on 18th century world history, a model's top predicted tokens include "The," "Imp," "Political," and "Age." Following those paths produces European history topics: The Age of Enlightenment, Political and social changes in Europe. Move down to the 300th through 2000th positions in the ranking, and you find tokens like "Asia," "African," and "Russian," which lead to non-European topics the model clearly knows but never volunteers.


The problem compounds over time. As more web content is generated by AI, models trained on that content undergo what researchers call "model collapse," where even more knowledge gets pushed to the tails. The information is there. Standard decoding just never finds it.


The Recoding-Decoding Method


The Harvard team's solution doesn't require retraining models or accessing their internal architecture. It works by introducing two forms of randomness at generation time: a random priming phrase added to the beginning of the prompt, and a random diverting token placed at the start of each new sentence.


The priming phrase is constructed by randomly selecting from the 2,000 most common English nouns and embedding it in the phrase "Related to NOUN." The diverting token is drawn from the three-letter starting stems of the 5,000 most common English words. Both exploits a well-documented property of language models called positional bias: models attend more heavily to tokens at the beginning and end of input sequences.


The effect is to redirect generation toward lower-probability, less-traversed regions of the model's knowledge space while keeping outputs semantically relevant to the original prompt. If the prompt is "Brainstorm a world history book topic," adding the priming phrase "Related to FOOD" and the diverting token "Pas" might yield "Pasta and the silk road." Replacing those with "Related to SKY" and "Tib" might yield "Tibetan sky burials." The topics are unconventional, but they're not off-topic.


Critically, the method doesn't require a blank-slate architecture. For models that only offer chat completion APIs rather than raw completion APIs, the researchers simulate the completion endpoint through a system prompt, and validate that the simulated version substantially outperforms standard decoding, though the real completion API performs better still.


What the Data Shows


The battlefield test is the most visually striking result. Standard GPT-5.1 queried about interesting world history battlefields produced 19 unique locations across 1,000 runs. All were in Europe or North America, and all were famous in Western historiography: Gettysburg, Waterloo, Stalingrad, Marathon. RD produced 1,307 unique battlefields across a globally distributed range, including locations in East Asia, South Asia, India, Russia, the Middle East, Africa, and Australia. Standard decoding identified zero battlefields that RD did not also find.


Across 50 substantively different brainstorming topics, RD methods consistently outperformed all ordinary decoding variants on diversity, measured by the number of conceptually distinct clusters generated. Relevance stayed high: RD scored 0.99 on GPT-3.5, GPT-5.1, and Gemini-3, and 0.94 on DeepSeek-3, essentially matching the 0.99-1.00 scores of standard decoding. RD added diversity without losing relevance.


The creativity results are particularly notable. When the researchers measured what percentage of standard decoding's conceptual clusters were covered by RD, the answer was close to 100%. When they measured the reverse, standard decoding covered only about 30-40% of RD's clusters. RD explored a search space roughly three times larger while still passing through the same familiar territory.


On the five public datasets spanning 500 prompts across GRE writing topics, creative writing, image generation, and historical questions, RD increased diversity by 161% on GPT-5.1 and 140% on Gemini-3, while relevance scores remained within one percentage point of standard decoding.


One finding cuts against a common assumption: that newer, more capable models are better for creative tasks. The data shows the opposite. As model accuracy improves, the gap between standard decoding and RD widens. GPT-5.1 and Gemini-3 under standard decoding produce diversity scores of 0.27 on those five datasets. Older GPT-3.5 scores higher on diversity under standard decoding. Better benchmark performance, it turns out, comes at a measurable cost to output variety when the model is used for open-ended exploration.


The Collective Diversity Problem


The paper also addresses a consequence that extends beyond any individual user. Research published in Science Advances found that generative AI enhances individual creativity while reducing collective diversity: people using AI produce better individual work, but different people's outputs converge on the same ideas. The researchers replicated this pattern accidentally when they discovered that students in a university course had submitted nearly identical essays without communicating, apparently because they'd used AI to help structure their arguments.


RD addresses this by ensuring that different users asking similar questions receive genuinely different suggestions. In image-based tests using wedding dress and Halloween party themes, two independent batches of standard decoding outputs looked nearly identical: the same ratios of phoenixes, jellyfish, treehouses, and airships appeared across both runs.

Two batches of RD outputs diverged substantially, with one featuring traffic cones and Jurassic gardens while the other offered industrial ruins, Guy Fawkes bonfires, and pixelated video-game worlds. Measured in clusters, RD produced 244 conceptually distinct ideas from 250 generated items. Standard decoding produced 35.


What This Means for How People Use AI


The practical implications differ depending on how you're using AI. For factual queries with correct answers, standard decoding is appropriate. For exploratory tasks, the current dominant paradigm actively works against you, and the degradation gets worse the longer you query.


Users who've experienced this intuitively know it: asking an AI for research topic ideas, startup names, or design concepts yields a useful first response and then increasingly familiar territory. The model isn't hiding ideas from you deliberately. Standard decoding just doesn't visit the parts of the model where those ideas live.


The researchers also demonstrate that RD can be adapted for other purposes. By injecting domain-specific keywords probabilistically rather than randomly, it can tune how frequently certain topics appear in output, offering finer-grained control than prompt engineering alone. When given a prompt about 18th century world history and instructed to "minimally focus on China," prompt engineering either overrepresents China or fails to represent it proportionally. RD with a probabilistic injection threshold can be tuned to produce Chinese history content at precisely calibrated rates.


The paper presents the framework as extensible: the core architecture, a token-level editor that can modify outputs at any point during generation, could in principle be configured to reduce political imbalance in AI outputs, moderate opinionated responses, or provide culturally diverse content by varying the language of the injected stems.


A Constraint Worth Noting


The method requires either a completion API, which allows a model to continue generating from a mid-sequence position, or a simulated version using the chat completion API. Not all models offer the real completion API. For those that do, RD performs best. For those that don't, the simulation works but performs somewhat less well. As frontier labs increasingly restrict raw API access in favor of structured chat interfaces, the practical ceiling for this approach could depend on whether providers make completion endpoints available.


The research is also a reminder that what AI labs optimize for determines what they get, and sometimes what they sacrifice. The benchmarks that drive model development measure accuracy and correctness. The search quest, as the researchers define it, requires something different: not finding the right answer, but learning enough about the space of possible answers to eventually recognize one as your own.

 
 
 

Comments


JOIN THE AI SPECTATOR MAILING LIST

CONTACT

Contacting You About:

Thanks for submitting!

New York, NY           

Db @DavidBorish.com           

  • LinkedIn
  • Instagram
  • Facebook
  • X
Back to top

© 2026 by David Borish IP, LLC, All Rights Reserved

bottom of page