The Public Standard for Real-World AI Performance
Generic benchmarks only go so far.
Vals AI evaluates models on the real tasks each industry relies on.
Updated 7/1/2026
31
models tested
A benchmark that weights performance across finance and coding tasks, showing the potential impact that LLMs can have on the economy.
Top Models
Claude Fable 5
Claude Opus 4.8
Claude Sonnet 5
Updated 7/1/2026
22
models tested
A benchmark that weights performance across finance, coding, and education tasks, showing the potential impact that LLMs can have on the economy.
Top Models
Claude Fable 5
Claude Opus 4.8
Claude Sonnet 5
Updated 7/1/2026
15
models tested
Tests an agent's ability to complete legal work using documents, spreadsheets, presentations, and file-system tools.
Top Models
Claude Fable 5
Claude Opus 4.8
GLM 5.2
Updated 7/1/2026
120
models tested
Evaluating language models on a wide range of open source legal reasoning tasks.
Top Models
Claude Fable 5
Gemini 3.1 Pro Preview (02/26)
Gemini 3 Pro (11/25)
Updated 7/1/2026
New
14
models tested
Evaluating agents on legal research tasks across diverse areas of US law
Top Models
Claude Opus 4.8
Claude Sonnet 5
GPT 5.5
Private question-answer benchmark over Canadian court cases.
Updated 7/1/2026
117
models tested
A private benchmark evaluating understanding of long-context credit agreements
Top Models
Claude Fable 5
Grok 4.3
GPT 5.5
Updated 7/1/2026
New
17
models tested
Evaluating agents on Excel-based financial modeling tasks
Top Models
Claude Opus 4.8
Claude Sonnet 5
GPT 5.5
Updated 7/1/2026
29
models tested
Evaluating agents on core financial analyst tasks
Top Models
Gemini 3.5 Flash
Claude Fable 5
Claude Opus 4.8
Updated 7/1/2026
81
models tested
Evaluating models on reading and understanding tax certificates presented as images
Top Models
Claude Opus 4.7
Claude Sonnet 5
Claude Opus 4.8
Updated 7/1/2026
123
models tested
A Vals-created set of tax questions and responses
Top Models
Muse Spark
Claude Sonnet 4.6
Claude Fable 5
Updated 6/17/2026
68
models tested
Can models support the medical billing process?
Top Models
Gemini 3.1 Pro Preview (02/26)
Claude Fable 5
Gemini 3 Flash (12/25)
Updated 7/1/2026
66
models tested
Can models support doctors with their administrative work?
Top Models
Claude Fable 5
GPT 5.1
MiniMax-M3
Evaluating language model bias in medical questions.
Challenging national math exam given to top high-school students
Academic math benchmark on probability, algebra, and trigonometry
A multilingual benchmark for mathematical questions.
Updated 7/2/2026
117
models tested
Graduate-level Google-Proof Q&A benchmark evaluating models on questions that require deep reasoning.
Top Models
Gemini 3.1 Pro Preview (02/26)
Claude Fable 5
GPT 5.5
Updated 7/2/2026
116
models tested
Academic multiple-choice benchmark covering 14 subjects including STEM, humanities, and social sciences.
Top Models
Claude Fable 5
Gemini 3.1 Pro Preview (02/26)
Gemini 3 Pro (11/25)
Updated 7/2/2026
77
models tested
Multimodal Multi-task Benchmark
Top Models
Claude Fable 5
Gemini 3.5 Flash
GPT 5.5
Updated 7/1/2026
24
models tested
Can language models reimplement working programs in another language?
Top Models
Claude Fable 5
Claude Opus 4.8
GPT 5.5
Updated 6/9/2026
55
models tested
International Olympiad in Informatics
Top Models
Claude Fable 5
GPT 5.4 (xhigh)
GPT 5.2
Updated 7/2/2026
123
models tested
Our implementation of the LiveCodeBench benchmark
Top Models
Claude Fable 5
Gemini 3.1 Pro Preview (02/26)
GPT 5.2 Codex
Updated 7/1/2026
25
models tested
Can language models rebuild programs from scratch?
Top Models
Claude Fable 5
Claude Opus 4.8
GPT 5.5
Updated 6/24/2026
12
models tested
How important are skills for agents?
Top Models
GPT 5.5
GPT 5.5
Claude Opus 4.8
Updated 7/1/2026
65
models tested
Solving production software engineering tasks
Top Models
Claude Fable 5
Claude Opus 4.8
Claude Opus 4.8
Updated 7/1/2026
36
models tested
State-of-the-art set of difficult terminal-based tasks
Top Models
Claude Fable 5
GPT 5.5
Claude Sonnet 5
Updated 7/1/2026
67
models tested
Can models build web applications from scratch?
Top Models
Claude Fable 5
Claude Opus 4.8
Claude Sonnet 5
State-of-the-art set of difficult terminal-based tasks
Social Mobility
Updated 7/1/2026
Public Benefits Bench v1.1
13
models tested
Can AI help people navigate SNAP benefits?
Top Models
Claude Opus 4.8
Claude Sonnet 5
MiniMax-M3
Public Benefits Bench v1
Can AI help people navigate SNAP benefits?