Same Question, Same Answer: Why AI Opinion Collapses in Public Debate

David Borish
2 days ago


The Experiment

Researchers Yekyung Kim, Yapei Chang, Chau Minh Pham, and Mohit Iyyer collected public debate responses from two sources: New York Times Room for Debate, where contributors write short essays of around 350 words on contested questions, and Boston Review forums, where respondents write longer pieces averaging about 1,150 words. Both corpora share the same structure: multiple writers responding independently to the same debate prompt.

The team then generated responses to every debate from five frontier models: GPT, Claude, Gemini, DeepSeek, and Minimax. They ran each model under three conditions. In the vanilla condition, each model was given only the debate question and asked to respond. In the diversified condition, models were explicitly instructed to vary their central claims, supporting arguments, and argumentative structure as widely as possible. In the position-guided condition, the model received an anonymized sketch of a specific human writer's main argument, biography, and tone, and was then asked to write from that writer's perspective.

To measure collapse, the researchers extracted a main argument and a list of supporting sub-arguments from each essay, then used an LLM judge to score pairwise overlap within each debate. Inter-annotator agreement on the coarse boundary between substantially overlapping and non-overlapping arguments reached a kappa of 0.80, lending reliability to the classification.
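For readers unfamiliar with kappa, the following is a minimal sketch of how a chance-corrected agreement score can be computed from two annotators' labels. It assumes Cohen's kappa on binary overlap labels; the article does not say which kappa variant the researchers used, and the function name and toy labels are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (Cohen's kappa).

    Assumes both lists label the same items in the same order and that
    the annotators do not both use a single identical label throughout
    (which would make expected agreement 1 and the ratio undefined).
    """
    n = len(labels_a)
    # Observed agreement: share of items where both annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled at random
    # according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy labels (hypothetical): 1 = substantially overlapping, 0 = not.
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 1, 0, 1]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on three of four items, but half of that agreement would be expected by chance, so kappa comes out lower than raw agreement.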

The Numbers

The gap between humans and vanilla AI responses is large. In the NYT corpus, 65.3% of human main arguments are unique within a debate, meaning no other human respondent made the same argument in the same thread. For vanilla LLM responses, that figure is 3.4%. The pattern holds in the longer Boston Review forums: 78.6% of human arguments are unique versus 18.4% for vanilla AI.
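The uniqueness figure can be illustrated with a short sketch. It assumes an argument counts as unique when the judge marks it as substantially overlapping with no other essay in the same debate; the function name, data layout, and toy judgments are hypothetical and not taken from the paper's code.

```python
def unique_rate(essay_ids, overlapping_pairs):
    """Fraction of essays whose main argument overlaps with no other
    essay in the same debate.

    essay_ids: identifiers of the essays in one debate.
    overlapping_pairs: pairs of ids the judge marked as substantially
    overlapping (order within a pair does not matter).
    """
    overlapping = {frozenset(pair) for pair in overlapping_pairs}
    unique = 0
    for essay in essay_ids:
        has_overlap = any(
            frozenset((essay, other)) in overlapping
            for other in essay_ids
            if other != essay
        )
        if not has_overlap:
            unique += 1
    return unique / len(essay_ids)

# Toy debate (hypothetical): four essays, one overlapping pair.
essays = ["a", "b", "c", "d"]
judged_overlaps = [("a", "b")]
rate = unique_rate(essays, judged_overlaps)  # "c" and "d" are unique: 0.5
```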

Asking models to diversify their outputs raises the uniqueness rate. Under diversified prompting, some models approach the human baseline. DeepSeek reaches roughly 63%, and Gemini exceeds the human unique rate at 82%. GPT stays well below it at 45%. But uniqueness within a single model's output is only part of the question. When you ask how much of the observed human argument space any given model actually covers, the picture is more constrained.

A typical diversified LLM recovers about half of the distinct human argument clusters from the same debate. Arguments made by multiple humans are recovered 98.1% of the time. Arguments made by a single human are recovered only 67.8% of the time, and narrower, more situated claims are missed more often.
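A rough sketch of this coverage measure, split by how many human writers made each argument, might look like the following. It assumes human arguments have already been grouped into clusters and that a cluster counts as recovered when at least one model essay matches it; the names and toy data are hypothetical.

```python
def coverage_by_support(human_clusters, recovered_ids):
    """Share of human argument clusters recovered by model output.

    human_clusters: dict mapping cluster id to the number of human
    writers who made that argument.
    recovered_ids: set of cluster ids matched by at least one model essay.
    """
    def rate(cluster_ids):
        cluster_ids = list(cluster_ids)
        if not cluster_ids:
            return None  # no clusters of this kind in the debate
        return sum(c in recovered_ids for c in cluster_ids) / len(cluster_ids)

    single = [c for c, writers in human_clusters.items() if writers == 1]
    multi = [c for c, writers in human_clusters.items() if writers > 1]
    return {
        "overall": rate(human_clusters),
        "single_writer": rate(single),
        "multi_writer": rate(multi),
    }

# Toy debate (hypothetical): two shared arguments, two one-off arguments.
clusters = {"c1": 3, "c2": 1, "c3": 1, "c4": 2}
recovered = {"c1", "c2", "c4"}
coverage = coverage_by_support(clusters, recovered)
# overall 0.75, single_writer 0.5, multi_writer 1.0
```

Separating the two groups is what exposes the pattern described above: widely shared arguments are easy for a model to hit, while one-off arguments are where coverage falls.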

The diversified outputs also introduce arguments that no human raised at all. In the NYT data, only 47.6% to 60.3% of AI-generated main arguments substantially overlap with something a human actually wrote.

Below the Surface

The collapse doesn't stop at the level of main arguments. The researchers also analyzed supporting sub-arguments, the specific claims and pieces of evidence writers use to back their central position. To isolate this, they focused on debates where humans and AI models converged on the same main argument, then asked whether the supporting reasons differed.

For the AI essays, they largely didn't. Among essays sharing the same main argument, 41% of human sub-arguments are unique within the debate. For vanilla AI, that figure drops to 9.1%. Diversified prompting recovers some ground, reaching 22.9%. Even the position-guided condition, where the model is given a specific human writer's perspective and asked to simulate their voice, produces a unique rate of only 18.4% when the target writer varies.

The qualitative difference is as notable as the quantitative one. Human sub-arguments tend to anchor in specific cases: a particular piece of legislation, a named institution, a causal chain grounded in one industry or community. AI sub-arguments tend toward portable frameworks: generic appeals to research, abstract institutional interventions, hedged generalities that stay uncommitted to any particular action.

The paper gives concrete examples. A human writer on drug enforcement: "Heavy-handed federal investment in drug enforcement has led to over half of the federal prison population being incarcerated for drug offenses." A model (Claude) on the same debate: "Research indicates that sex offenders have lower recidivism rates than other felons and that residency restrictions do not actually reduce reoffending." Both are supporting sub-arguments, but the human's names a specific causal relationship, while the model's cites a generalized research finding without grounding it in a particular case.

The researchers describe the AI pattern as convergence onto arguments that are hard to falsify, the kind that can attach to almost any debate without modification.

Structure Follows the Same Arc

Beyond what arguments are made, the study also examines how essays are built. Researchers annotated each paragraph by its argumentative role (thesis, support, rebuttal, concession, proposal, and others) and its discourse mode (argumentation, exposition, narration, description).

Human essays mix these elements across the piece. LLM essays follow a more compressed arc: open with a thesis, build support, close with a proposal. In the NYT data, the transition from a support paragraph to a proposal paragraph occurs in 29.4% of vanilla AI support transitions, compared with 12.3% for humans. The reverse pattern holds for sustained development: the support-to-support-to-support sequence appears 13.2 times per 100 paragraph triples in human essays, and 5.3 times in AI essays.
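The two structural statistics can be sketched as follows, assuming each essay has been reduced to an ordered list of paragraph role labels. The function names and toy essays are hypothetical illustrations, not the researchers' implementation.

```python
from collections import Counter

def transition_share(essays, source, target):
    """Share of transitions out of `source` paragraphs that land on `target`."""
    counts = Counter()
    for roles in essays:
        # Walk consecutive paragraph pairs within one essay.
        for current, following in zip(roles, roles[1:]):
            if current == source:
                counts[following] += 1
    total = sum(counts.values())
    return counts[target] / total if total else 0.0

def triple_rate_per_100(essays, triple):
    """Occurrences of `triple` per 100 consecutive paragraph triples."""
    hits = 0
    total = 0
    for roles in essays:
        for window in zip(roles, roles[1:], roles[2:]):
            total += 1
            if window == tuple(triple):
                hits += 1
    return 100 * hits / total if total else 0.0

# Toy essays (hypothetical role sequences).
essays = [
    ["thesis", "support", "proposal"],
    ["thesis", "support", "support", "support", "rebuttal"],
]
to_proposal = transition_share(essays, "support", "proposal")  # 1 of 4: 0.25
sustained = triple_rate_per_100(essays, ("support", "support", "support"))  # 1 of 4 triples: 25.0
```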

The discourse mode data reinforces this. Argumentation accounts for 71.5% of human paragraphs in the NYT corpus, compared with 97% of vanilla AI paragraphs. Humans mix in substantially more exposition and narration throughout their essays. AI essays rarely do.

Diversified and position-guided generation do not change this structural pattern much. Both still rely on argumentation at rates above 89%, far above the human baseline.

What It Means for Public Discourse

The study is careful about its scope. It measures argumentative diversity, not quality or persuasiveness. A unique argument is not necessarily a better one. The researchers also note that the human essays were written at different times than the AI responses, and that their dataset is limited to public debate forums, which may not generalize to other writing contexts.

Within those limits, the findings describe something worth tracking. AI models can already measurably shift reader beliefs, according to prior research the paper cites. If those models consistently produce the same small set of arguments in response to contested questions, and if their outputs recirculate through training data, search results, and editorial assistance, then the range of positions that readers encounter in public debate may narrow in ways that are difficult to observe directly.

The study finds that pooling all five models under diversified prompting recovers 73.9% of human main-argument clusters. That number may sound adequate until you consider what's in the remaining 26%: the arguments made by one writer rather than three, the interventions grounded in a specific community's experience, the framings that haven't yet become the conventional wisdom.

One additional finding deserves mention. In binary debates with clear pro and con sides, AI models take strong positions, defined as strong_support or strong_oppose, only 63.4% of the time. Humans do so 76.1% of the time. Asking models to diversify their outputs raises their coverage of both sides but doesn't close the gap in stance strength. The models hedge more, even when they're asked to commit.

A Note on Methods and Limitations

The argument extraction and pairwise overlap judgments in this study rely on LLM-based annotation at scale, specifically with Gemini Flash. The authors report inter-annotator agreement figures and describe multiple validation steps, including manual spot-checks by human annotators. Still, using a language model to measure the limitations of language models introduces a circularity worth acknowledging. The coarse-label agreement between the LLM judge and human annotators reached 93%, which is high, but systematic biases in how the judge defines overlap could shape the quantitative findings in ways that are hard to fully audit.

The corpus is also limited to two venues with specific editorial cultures. NYT Room for Debate and Boston Review forums attract writers with certain professional profiles and argumentative norms. Whether the same collapse pattern holds in legal writing, scientific peer review, or policy memos is an open question the paper appropriately flags.

Those caveats don't undermine the central finding. Across two corpora with different formats and essay lengths, the directional result is consistent: AI models from different providers converge on fewer arguments, deploy more generic supporting claims, and follow more predictable structures than the humans they're increasingly asked to assist.

The code and annotations are available at github.com/mungg/argument_collapse.
