Public Enterprise LLM Benchmarks

Vals Index

Updated 12/12/2025

Vals Index

A benchmark of weighted performance across finance, law, and coding tasks, showing the potential impact LLMs can have on the economy.

Top Model: GPT 5.2
Number of models tested: 28

Updated 12/11/2025

New

Vals Multimodal Index

A benchmark of weighted performance across finance, law, coding, and education tasks, showing the potential impact LLMs can have on the economy.

Top Model: GPT 5.2
Number of models tested: 17
Finance Benchmarks

Updated 12/11/2025

CorpFin (v2)

A private benchmark evaluating understanding of long-context credit agreements

Top Model: Grok 4 Fast (Reasoning)
Number of models tested: 79

Updated 12/11/2025

Finance Agent

Evaluating agents on core financial analyst tasks

Top Model: GPT 5.1
Number of models tested: 54

Updated 12/11/2025

MortgageTax

Evaluating the reading and understanding of tax certificates provided as images

Top Model: Gemini 3 Pro (11/25)
Number of models tested: 55

Updated 12/11/2025

TaxEval (v2)

A Vals-created set of tax questions with reference answers

Top Model: Grok 3
Number of models tested: 85
Healthcare Benchmarks

Updated 12/11/2025

MedQA

Evaluating language model bias in medical questions.

Top Model: o1
Number of models tested: 83
Math Benchmarks

Updated 12/11/2025

AIME

Challenging national math exam given to top high-school students

Top Model: GPT 5.2
Number of models tested: 75

Updated 12/7/2025

MATH 500

Academic math benchmark on probability, algebra, and trigonometry

Top Model: Gemini 3 Pro (11/25)
Number of models tested: 59

Updated 12/11/2025

MGSM

A multilingual benchmark for mathematical questions.

Top Model: Claude Opus 4.5 (Thinking)
Number of models tested: 72
Academic Benchmarks

Updated 12/11/2025

GPQA

Graduate-level Google-Proof Q&A benchmark evaluating models on questions that require deep reasoning.

Top Model: GPT 5.2
Number of models tested: 76

Updated 12/11/2025

MMLU Pro

Academic multiple-choice benchmark covering 14 subjects including STEM, humanities, and social sciences.

Top Model: Gemini 3 Pro (11/25)
Number of models tested: 77

Updated 12/7/2025

MMMU

A massive multi-discipline multimodal understanding benchmark

Top Model: Gemini 3 Pro (11/25)
Number of models tested: 53
Education Benchmarks

Updated 12/11/2025

SAGE

Student Assessment with Generative Evaluation

Top Model: Claude Opus 4.5 (Thinking)
Number of models tested: 35
Coding Benchmarks

Updated 12/12/2025

SWE-bench

Solving production software engineering tasks

Top Model: GPT 5.2
Number of models tested: 40

Updated 12/11/2025

IOI

International Olympiad in Informatics

Top Model: GPT 5.2
Number of models tested: 37

Updated 12/11/2025

LiveCodeBench

Our implementation of the LiveCodeBench benchmark

Top Model: GPT 5 Mini
Number of models tested: 81

Updated 12/12/2025

Terminal-Bench

State-of-the-art set of difficult terminal-based tasks

Top Model: GPT 5.2
Number of models tested: 42

Updated 12/11/2025

New

Vibe Code Bench

Can models build web applications from scratch?

Top Model: GPT 5.2
Number of models tested: 15

Join our mailing list to receive benchmark updates

Model benchmarks are seriously lacking. With Vals AI, we report how language models perform on the industry-specific tasks where they will be used.
