What AI Benchmarks Are Actually Measuring Now: Inside the Artificial Analysis Intelligence Index v4.1


The headline score on an AI benchmark always conceals a set of choices. What tasks count, how answers are graded, which capabilities get weighted most heavily, and what a failure actually looks like all shape the final number before any model runs a single evaluation. With the release of Intelligence Index v4.1, Artificial Analysis has published a detailed methodology that makes those choices transparent, and the choices reveal something meaningful about where AI evaluation is heading.


The index assigns 34% of total weight to agentic tasks, 24% to coding, 24% to scientific reasoning, and 18% to general knowledge. That agentic weighting is the sharpest signal in the methodology. Static benchmarks that test knowledge retrieval or multiple-choice reasoning were never great proxies for what production AI systems are being asked to do, and the field's leading independent evaluators have moved decisively away from them.
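As a rough illustration of how weights like these fold into a single headline number, the sketch below takes a weighted average over the four categories. The weights are the ones quoted above; the per-category scores and the function name `composite_score` are hypothetical placeholders, and the published index weights individual evaluations rather than category averages, so treat this as a simplification.

```python
# Category weights as quoted in the article (fractions of the total index).
CATEGORY_WEIGHTS = {
    "agentic": 0.34,
    "coding": 0.24,
    "scientific_reasoning": 0.24,
    "general_knowledge": 0.18,
}


def composite_score(category_scores: dict[str, float]) -> float:
    """Weighted average of per-category scores, each on a 0-100 scale."""
    if set(category_scores) != set(CATEGORY_WEIGHTS):
        raise ValueError("scores must cover exactly the four categories")
    return sum(
        CATEGORY_WEIGHTS[name] * score for name, score in category_scores.items()
    )


# Hypothetical scores for illustration only; these are not real results.
example_scores = {
    "agentic": 70.0,
    "coding": 60.0,
    "scientific_reasoning": 50.0,
    "general_knowledge": 40.0,
}
example_index = composite_score(example_scores)
print(round(example_index, 1))
```

Because agentic tasks carry the largest weight, a ten-point change in the agentic score moves this composite by 3.4 points, almost twice the effect of the same change in general knowledge.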


Nine evaluations now make up the v4.1 suite: GDPval-AA v2, the τ³-Banking agent evaluation, Terminal-Bench v2.1, SciCode, AA-LCR, AA-Omniscience, Humanity's Last Exam, GPQA Diamond, and CritPt. Artificial Analysis estimates a 95% confidence interval of less than ±1% for the overall index, based on more than ten repeat runs of individual evaluations.
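To make the confidence-interval claim concrete, here is a minimal sketch of how an interval for a mean score can be estimated from repeat runs. It assumes a normal approximation and uses made-up run scores; Artificial Analysis does not describe its exact procedure in the material summarized here, so the function `mean_confidence_interval` and its inputs are illustrative only.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(
    run_scores: list[float], confidence: float = 0.95
) -> tuple[float, float]:
    """Normal-approximation confidence interval for the mean of repeat runs."""
    if len(run_scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * stdev(run_scores) / sqrt(len(run_scores))
    center = mean(run_scores)
    return center - half_width, center + half_width


# Hypothetical scores from twelve repeat runs (not real data).
hypothetical_runs = [
    61.2, 60.8, 61.5, 60.9, 61.1, 61.4,
    60.7, 61.3, 61.0, 61.2, 60.9, 61.1,
]
low, high = mean_confidence_interval(hypothetical_runs)
print(f"{low:.2f} to {high:.2f}")
```

With only around ten runs, a Student's t critical value would give a slightly wider interval than the normal approximation used here.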


GDPval-AA v2: Human Experts as the Calibration Point


The most architecturally significant component is GDPval-AA v2, which carries 20% of the index weight. Built on OpenAI's GDPval dataset, it covers 220 tasks across 44 occupations tied to GDP-contributing sectors in the United States. The tasks are open-ended professional work: a model receives a brief and a set of reference files, uses tools to produce one or more output files, and submits when done or abandons the task if it genuinely cannot complete it.


Grading happens through pairwise comparison. Each submission is matched against another model's submission on the same task and judged blind by one judge drawn from a three-member panel of frontier LLMs: GPT-5.5 at medium reasoning, Gemini 3.1 Pro Preview at high reasoning, and Claude Opus 4.8 at high effort. The critical calibration element is the Elo anchor: human expert deliverables are set at 1,000. Claude Fable 5's GDPval-AA Elo of 1,932 means it places substantially above human expert level on these tasks, at least as evaluated by the LLM judge panel. That number puts the capability question in sharper relief than a percentage score against a static question set ever could.
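One way to read an Elo gap is as an expected head-to-head preference rate. The sketch below applies the standard logistic Elo formula with the conventional 400-point scale. That scale is an assumption on my part: the article does not say which Elo variant Artificial Analysis fits, so the result is an interpretation aid and not a reported figure.

```python
def elo_expected_score(
    rating_a: float, rating_b: float, scale: float = 400.0
) -> float:
    """Expected score of A against B under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


HUMAN_EXPERT_ANCHOR = 1000.0  # anchor value stated in the article
MODEL_ELO = 1932.0            # Claude Fable 5's GDPval-AA Elo, as reported

# Probability that a judge prefers the model's deliverable over the
# human expert's, if the standard Elo formula applies (an assumption).
p_model_preferred = elo_expected_score(MODEL_ELO, HUMAN_EXPERT_ANCHOR)
print(f"{p_model_preferred:.3f}")
```

Under that assumption, a 932-point gap corresponds to the judge panel preferring the model's output in roughly 99.5% of pairings against the human expert deliverable.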


The v2 upgrade from GDPval-AA v1 includes an expanded sandbox with additional dependencies, a TeX Live LaTeX toolchain, and fixes to environment issues that were inflating score variance. It also expands the turn limit from prior versions to 250 turns per task and moves from a single judge to the three-judge panel. Elo scores from v1 were frozen and re-baselined; v1 scores and v2 scores are not directly comparable.


τ³-Banking: Knowledge Retrieval Under Pressure


The τ³-Banking evaluation, carrying 14% of the index weight, is a harder version of the customer-support agent benchmark Artificial Analysis ran in v4.0. Where τ²-Bench Telecom used a relatively contained policy set, τ³-Banking requires agents to navigate approximately 700 interconnected policy documents totaling roughly 195,000 tokens across 21 product categories. The 97 tasks involve realistic fintech customer-support scenarios with multi-step tool execution, and scoring is done against backend database state rather than conversational quality. The outcomes that matter are concrete: whether a dispute was actually opened, whether a provisional credit was actually issued. Pass@1 scoring is averaged across five repeats per task, with a maximum of 200 steps per task repeat.
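The state-based grading described above can be sketched in a few lines. Everything here is hypothetical: the record keys, the function names, and the exact-match rule are stand-ins for whatever checks the real harness performs. The point is only that a repeat passes or fails on the resulting database state, and that pass@1 is the average over independent repeats.

```python
from statistics import mean


def repeat_passed(final_state: dict, expected_state: dict) -> bool:
    """A repeat passes only if every expected backend record matches."""
    return all(
        final_state.get(key) == value for key, value in expected_state.items()
    )


def pass_at_1(repeat_results: list[bool]) -> float:
    """Fraction of independent single-attempt repeats that passed."""
    return mean(1.0 if passed else 0.0 for passed in repeat_results)


def benchmark_score(per_task_repeats: dict[str, list[bool]]) -> float:
    """Average of per-task pass@1 values across all tasks."""
    return mean(pass_at_1(results) for results in per_task_repeats.values())


# Hypothetical task: the agent must open a dispute and issue a credit.
expected = {"dispute_opened": True, "provisional_credit_cents": 4250}
polite_but_wrong = {"dispute_opened": True, "provisional_credit_cents": 0}
correct_outcome = {"dispute_opened": True, "provisional_credit_cents": 4250}

# Five repeats of one task: three end in the correct state, two do not.
repeats = [
    repeat_passed(state, expected)
    for state in [correct_outcome, polite_but_wrong, correct_outcome,
                  correct_outcome, polite_but_wrong]
]
task_score = pass_at_1(repeats)
```

A conversation that sounds helpful but leaves the credit unissued scores zero for that repeat, which is the distinction the benchmark is designed to capture.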


The difference between τ²-Bench Telecom and τ³-Banking reflects a wider judgment in the field: simpler agent benchmarks were producing scores that didn't generalize well to environments with large, poorly structured knowledge bases and consequential tool execution. Replacing the telecom domain with banking, and scaling the policy corpus by an order of magnitude, makes the evaluation considerably harder and more representative of real enterprise environments.


Terminal-Bench v2.1 and the Coding Suite


Terminal-Bench v2.1, which replaced Terminal-Bench Hard in the index, runs the same 89 curated tasks across software engineering, system administration, data processing, model training, and security. The upgrade isn't to the task set but to the evaluation environment. Version 2.1 corrects environment gaps and instruction ambiguities that were causing some task failures to reflect benchmark issues rather than model capability. Tasks are verified by test suite, and scoring is pass@1 averaged over three repeats, with a 250-episode cap per task.


SciCode, at 8% weight, asks models to write Python code that solves scientific computing sub-problems. It covers 288 sub-problems from the test set and includes scientist-annotated background information in the prompt. Scoring is pass@1 at the sub-problem level, so partial credit is possible, which matters for a benchmark where the hardest sub-problems require both domain knowledge and code execution skill.


Scientific Reasoning: HLE, GPQA Diamond, and CritPt


The scientific reasoning category collectively carries 24% of the index. Humanity's Last Exam, at 12% weight, tests 2,158 text-only questions across mathematics, humanities, and natural sciences. It was developed by the Center for AI Safety together with Scale AI and has attracted attention for how quickly frontier models have improved on it since its introduction.


One methodological note worth flagging: the HLE dataset was curated adversarially using GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, and o1-series models as reference systems. Direct comparisons between those models and models that weren't part of the curation process carry a potential bias, and Artificial Analysis flags this limitation explicitly.


GPQA Diamond uses 198 questions from the Graduate-Level Google-Proof Q&A benchmark, covering biology, physics, and chemistry. It is the subset selected so that both expert validators answered correctly and the majority of non-experts did not.


CritPt is new to the v4.1 index. It uses 70 challenge-level physics problems with unpublished, frontier-level content developed by a research team. Answer formats include numerical values, symbolic expressions in SymPy, and Python functions evaluated against test cases, with grading conducted through an official server and five repeats per question. CritPt's addition to the index reflects a broader judgment that frontier model scores on established physics benchmarks have become insufficiently discriminating.


AA-Omniscience: Rewarding Restraint


AA-Omniscience contributes two separate components to the index: an accuracy score and a non-hallucination rate, together carrying 12% of the total. The benchmark uses 6,000 questions across 42 topics including business, law, software engineering, and science. Its scoring structure is deliberate: correct answers receive positive weight, hallucinated responses are penalized, and abstentions carry no penalty. This rewards models that know what they don't know. Claude Fable 5 scored 40 on AA-Omniscience in v4.1 testing, seven points ahead of the prior leader, with the gain driven primarily by accuracy rather than by low hallucination rate.
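The incentive this creates is easiest to see with numbers. The sketch below assumes the simplest scoring consistent with that description: +1 for a correct answer, -1 for a hallucinated one, and 0 for an abstention, scaled to a range of -100 to 100. Those exact values, the outcome labels, and the function name `restraint_score` are assumptions for illustration; the article does not give the benchmark's precise formula.

```python
from collections import Counter

# Assumed per-answer values (not confirmed by the article).
OUTCOME_VALUE = {"correct": 1, "hallucinated": -1, "abstained": 0}


def restraint_score(outcomes: list[str]) -> float:
    """Net score from -100 to 100 that penalizes wrong answers, not abstentions."""
    if not outcomes:
        raise ValueError("need at least one graded answer")
    counts = Counter(outcomes)
    unknown = set(counts) - set(OUTCOME_VALUE)
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    net = sum(OUTCOME_VALUE[label] * n for label, n in counts.items())
    return 100.0 * net / len(outcomes)


# Two hypothetical models answering the same 100 questions.
always_guesses = ["correct"] * 50 + ["hallucinated"] * 50
knows_its_limits = ["correct"] * 45 + ["hallucinated"] * 5 + ["abstained"] * 50

guesser_score = restraint_score(always_guesses)
cautious_score = restraint_score(knows_its_limits)
```

In this toy case the model that always guesses gets more answers right yet nets zero, while the model that abstains when unsure nets 40 with fewer correct answers.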


The AA-LCR evaluation, carrying 6% of the index, tests long-context reasoning across approximately 100,000 tokens of input per question, spanning 100 questions across seven document categories including company reports, government consultations, and legal filings. It requires models to support a minimum 128K context window to score at all.


What Changed from v4.0 and Why It Matters


Three evaluations from v4.0 have been retired or moved off the index. IFBench, which tested instruction-following through rule-driven assessment of formatting, counting, and sentence manipulation, is no longer included in the Intelligence Index score, though Artificial Analysis continues to run it and report the results separately. MATH-500 and AIME 2025 were retired from active reporting entirely. The retirements reflect a pattern: evaluations that saturate quickly as frontier models improve, or that test a capability slice that higher-weight evaluations already capture, get replaced.


The Openness Index, tracked separately from the Intelligence Index, scores models on transparency across pre-training data, post-training data, methodology, and model availability. The current leader in openness among evaluated models is not the same as the leader in intelligence, which creates a real tension for enterprises with data sovereignty requirements. Open-weight models with permissive licenses cluster in the mid-range of the intelligence rankings, with the highest-scoring open model at the time of publication sitting around 55 on the index, compared to Claude Fable 5's 64.9.


What the Methodology Tells Us About the Market


Artificial Analysis' framework reflects a view that AI capability needs to be measured in terms of task completion under realistic conditions, not pattern matching against known question formats. The GDPval benchmark's human-expert Elo anchor is a useful illustration of where this thinking leads. Setting human expert deliverables at 1,000 on the Elo scale creates a reference point that will become increasingly important as model scores continue climbing.


A model at Elo 1,932 is not just "better than previous models" in some abstract sense; it's being evaluated against what a human expert would actually produce on the same task.

That framing matters for enterprise AI strategy because the relevant comparison for most deployment decisions isn't one model versus another; it's a model versus the human workflow it might replace or augment. Benchmarks that were designed to rank models against each other produce different signals than benchmarks calibrated to human performance. The v4.1 methodology is moving in the latter direction, and the model rankings are starting to reflect it.


 
 
 


© 2026 by David Borish IP, LLC, All Rights Reserved
