The Public Standard for Real World AI Performance
Generic benchmarks only go so far.
Vals AI evaluates models on
the real tasks each industry relies on.
Updated 5/13/2026
16
models tested
A benchmark that reports weighted performance across finance and coding tasks, showing the potential impact that LLMs can have on the economy.
Top Models
GPT 5.5
Claude Opus 4.7
Claude Sonnet 4.6
Updated 5/13/2026
13
models tested
A benchmark that reports weighted performance across finance, coding, and education tasks, showing the potential impact that LLMs can have on the economy.
Top Models
GPT 5.5
Claude Opus 4.7
Claude Sonnet 4.6
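As a rough illustration of what "weighted performance" across task categories could mean, here is a minimal sketch of a weighted mean over per-category scores. The function name, the category scores, and the weights are all hypothetical; the page does not state the weights or the aggregation formula actually used.

```python
def weighted_index(scores, weights):
    """Return the weighted mean of per-category scores.

    Both arguments map a category name to a number. Weights are
    normalized by their sum, so they do not need to add up to 1.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"no score for categories: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


# Hypothetical numbers for illustration only, not real benchmark results.
example_scores = {"finance": 60.0, "coding": 80.0}
equal_weights = {"finance": 1.0, "coding": 1.0}
finance_heavy = {"finance": 3.0, "coding": 1.0}

print(weighted_index(example_scores, equal_weights))
print(weighted_index(example_scores, finance_heavy))
```

Changing the weights shifts the index toward the categories that matter most for a given use case, which is why two indices built from overlapping tasks can still rank models differently.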
Private question-answer benchmark over Canadian court cases.
Updated 5/4/2026
104
models tested
A private benchmark evaluating understanding of long-context credit agreements
Top Models
Grok 4.3
GPT 5.5
Kimi K2.5
Updated 5/13/2026
New
17
models tested
Evaluating agents on core financial analyst tasks
Top Models
GPT 5.5
Claude Opus 4.7
Claude Sonnet 4.6
Updated 5/4/2026
73
models tested
Evaluating reading and understanding tax certificates as images
Top Models
Claude Opus 4.7
Gemini 3.1 Pro Preview (02/26)
Gemini 3 Pro (11/25)
Updated 5/4/2026
110
models tested
A Vals-created set of tax questions and responses
Top Models
Muse Spark
Claude Sonnet 4.6
Claude Opus 4.6 (Thinking)
Updated 5/4/2026
56
models tested
Can models support the medical billing process?
Top Models
Gemini 3.1 Pro Preview (02/26)
Gemini 3 Flash (12/25)
Claude Opus 4.7
Updated 5/4/2026
56
models tested
Can models support doctors with their administrative work?
Top Models
GPT 5.1
GPT 5.5
Claude Opus 4.6 (Nonthinking)
Evaluating language model bias in medical questions.
Challenging national math exam given to top high-school students
Academic math benchmark on probability, algebra, and trigonometry
A multilingual benchmark for mathematical questions.
Updated 5/4/2026
104
models tested
Graduate-level Google-Proof Q&A benchmark evaluating models on questions that require deep reasoning.
Top Models
Gemini 3.1 Pro Preview (02/26)
GPT 5.5
Gemini 3 Pro (11/25)
Updated 5/4/2026
103
models tested
Academic multiple-choice benchmark covering 14 subjects including STEM, humanities, and social sciences.
Top Models
Gemini 3.1 Pro Preview (02/26)
Gemini 3 Pro (11/25)
Claude Opus 4.7
Updated 4/24/2026
70
models tested
Multimodal Multi-task Benchmark
Top Models
GPT 5.5
Gemini 3.1 Pro Preview (02/26)
Gemini 3 Flash (12/25)
Updated 5/7/2026
53
models tested
International Olympiad in Informatics
Top Models
GPT 5.4
GPT 5.2
Claude Opus 4.7
Updated 4/21/2026
109
models tested
Our implementation of the LiveCodeBench benchmark
Top Models
Gemini 3.1 Pro Preview (02/26)
GPT 5.2 Codex
DeepSeek V4
Updated 5/1/2026
47
models tested
Solving production software engineering tasks
Top Models
GPT 5.5
Claude Opus 4.7
Gemini 3.1 Pro Preview (02/26)
Updated 5/6/2026
New
61
models tested
State-of-the-art set of difficult terminal-based tasks
Top Models
GPT 5.5
Claude Opus 4.7
Gemini 3.1 Pro Preview (02/26)
Updated 5/5/2026
New
43
models tested
Can models build web applications from scratch?
Top Models
Claude Opus 4.7
GPT 5.5
GPT 5.4