Benchmarks

Vals Index

Updated 12/12/2025

Vals Index

A benchmark that reports weighted performance across finance, law, and coding tasks, showing the potential impact that LLMs can have on the economy.

Top Model:

GPT 5.2

Number of models tested

28

Updated 12/11/2025

New

Vals Multimodal Index

A benchmark that reports weighted performance across finance, law, coding, and education tasks, showing the potential impact that LLMs can have on the economy.

Top Model:

GPT 5.2

Number of models tested

17

Legal Benchmarks

Updated 12/11/2025

CaseLaw (v2)

Private question-answering benchmark over Canadian court cases.

Top Model:

GPT 5 Mini

Number of models tested

56

Updated 12/11/2025

LegalBench

Evaluating language models on a wide range of open source legal reasoning tasks.

Top Model:

Gemini 3 Pro (11/25)

Number of models tested

99

Finance Benchmarks

Updated 12/11/2025

CorpFin (v2)

A private benchmark evaluating understanding of long-context credit agreements

Top Model:

Grok 4 Fast (Reasoning)

Number of models tested

79

Updated 12/11/2025

Finance Agent

Evaluating agents on core financial analyst tasks

Top Model:

GPT 5.1

Number of models tested

54

Updated 12/11/2025

MortgageTax

Evaluating how well models read and understand tax certificates provided as images

Top Model:

Gemini 3 Pro (11/25)

Number of models tested

55

Updated 12/11/2025

TaxEval (v2)

A Vals-created set of tax questions and responses

Top Model:

Grok 3

Number of models tested

85

Healthcare Benchmarks

Updated 12/11/2025

MedQA

Evaluating language model bias in medical questions.

Top Model:

o1

Number of models tested

83

Math Benchmarks

Updated 12/11/2025

AIME

Challenging national math exam given to top high-school students

Top Model:

GPT 5.2

Number of models tested

75

Updated 12/7/2025

MATH 500

Academic math benchmark on probability, algebra, and trigonometry

Top Model:

Gemini 3 Pro (11/25)

Number of models tested

59

Updated 12/11/2025

MGSM

A multilingual benchmark for mathematical questions.

Top Model:

Claude Opus 4.5 (Thinking)

Number of models tested

72

Academic Benchmarks

Updated 12/11/2025

GPQA

Graduate-level Google-Proof Q&A benchmark evaluating models on questions that require deep reasoning.

Top Model:

GPT 5.2

Number of models tested

76

Updated 12/11/2025

MMLU Pro

Academic multiple-choice benchmark covering 14 subjects including STEM, humanities, and social sciences.

Top Model:

Gemini 3 Pro (11/25)

Number of models tested

77

Updated 12/7/2025

MMMU

Multimodal Multi-task Benchmark

Top Model:

Gemini 3 Pro (11/25)

Number of models tested

53

Education Benchmarks

Updated 12/11/2025

SAGE

Student Assessment with Generative Evaluation

Top Model:

Claude Opus 4.5 (Thinking)

Number of models tested

35

Coding Benchmarks

Updated 12/12/2025

SWE-bench

Solving production software engineering tasks

Top Model:

GPT 5.2

Number of models tested

40

Updated 12/11/2025

IOI

International Olympiad in Informatics

Top Model:

GPT 5.2

Number of models tested

37

Updated 12/11/2025

LiveCodeBench

Our implementation of the LiveCodeBench benchmark

Top Model:

GPT 5 Mini

Number of models tested

81

Updated 12/12/2025

Terminal-Bench

State-of-the-art set of difficult terminal-based tasks

Top Model:

GPT 5.2

Number of models tested

42

Updated 12/11/2025

New

Vibe Code Bench

Can models build web applications from scratch?

Top Model:

GPT 5.2

Number of models tested

15
