Vals Index
Updated 3/20/2026
Vals Index
A benchmark that reports weighted performance across finance, law, and coding tasks, showing the potential impact that LLMs can have on the economy.
Top Model:
Claude Sonnet 4.6
Number of models tested
34
Updated 3/20/2026
Vals Multimodal Index
A benchmark that reports weighted performance across finance, law, coding, and education tasks, showing the potential impact that LLMs can have on the economy.
Top Model:
Claude Sonnet 4.6
Number of models tested
25
Legal Benchmarks
Updated 3/17/2026
CaseLaw (v2)
Private question-answer benchmark over Canadian court cases.
Top Model:
GPT 5.1
Number of models tested
41
Updated 3/18/2026
LegalBench
Evaluating language models on a wide range of open-source legal reasoning tasks.
Top Model:
Gemini 3.1 Pro Preview (02/26)
Number of models tested
113
Finance Benchmarks
Updated 3/17/2026
CorpFin (v2)
A private benchmark evaluating understanding of long-context credit agreements
Top Model:
Kimi K2.5
Number of models tested
92
Updated 3/17/2026
Finance Agent v1.1
Evaluating agents on core financial analyst tasks
Top Model:
Claude Sonnet 4.6
Number of models tested
40
Updated 3/17/2026
MortgageTax
Evaluating how well models read and understand tax certificates provided as images
Top Model:
Gemini 3.1 Pro Preview (02/26)
Number of models tested
66
Updated 3/18/2026
TaxEval (v2)
A Vals-created set of tax questions and responses
Top Model:
Claude Sonnet 4.6
Number of models tested
100
Healthcare Benchmarks
Updated 3/17/2026
MedCode
Can models support the medical billing process?
Top Model:
Gemini 3.1 Pro Preview (02/26)
Number of models tested
47
Updated 3/17/2026
MedScribe
Can models support doctors with their administrative work?
Top Model:
GPT 5.1
Number of models tested
47
MedQA
Evaluating language model bias in medical questions.
Math Benchmarks
Updated 3/17/2026
AIME
Challenging national math exam given to top high-school students
Top Model:
Gemini 3.1 Pro Preview (02/26)
Number of models tested
92
Updated 3/19/2026
ProofBench
Can models write math proofs that are formally verified?
Top System:
Aristotle
Number of systems tested
23
MATH 500
Academic math benchmark on probability, algebra, and trigonometry
MGSM
A multilingual benchmark for mathematical questions.
Academic Benchmarks
Updated 3/17/2026
GPQA
Graduate-level Google-Proof Q&A benchmark evaluating models on questions that require deep reasoning.
Top Model:
Gemini 3.1 Pro Preview (02/26)
Number of models tested
95
Updated 3/18/2026
MMLU Pro
Academic multiple-choice benchmark covering 14 subjects including STEM, humanities, and social sciences.
Top Model:
Gemini 3.1 Pro Preview (02/26)
Number of models tested
93
Updated 3/17/2026
MMMU
Multimodal Multi-task Benchmark
Top Model:
Gemini 3.1 Pro Preview (02/26)
Number of models tested
63
Education Benchmarks
Updated 3/17/2026
SAGE
Student Assessment with Generative Evaluation
Top Model:
Claude Opus 4.5 (Thinking)
Number of models tested
46
Coding Benchmarks
Updated 3/18/2026
IOI
International Olympiad in Informatics
Top Model:
GPT 5.4
Number of models tested
50
Updated 3/17/2026
LiveCodeBench
Our implementation of the LiveCodeBench benchmark
Top Model:
Gemini 3.1 Pro Preview (02/26)
Number of models tested
101
Updated 3/20/2026
SWE-bench
Solving production software engineering tasks
Top Model:
Gemini 3.1 Pro Preview (02/26)
Number of models tested
37
Updated 3/17/2026
Terminal-Bench 2.0
State-of-the-art set of difficult terminal-based tasks
Top Model:
Gemini 3.1 Pro Preview (02/26)
Number of models tested
46
Updated 3/20/2026
Vibe Code Bench v1.1
Can models build web applications from scratch?
Top Model:
GPT 5.4
Number of models tested
22
Terminal-Bench
State-of-the-art set of difficult terminal-based tasks
Beta Benchmarks
Updated 12/23/2025
Poker Agent
Which model can make the most money playing poker?
Top Model:
GPT 5.2
Number of models tested
17