The Public Standard for Real World AI Performance

Generic benchmarks only go so far.
Vals AI evaluates models on the real tasks each industry relies on.

Vals Index
Finance
Healthcare
Math
AIME

Challenging national math exam given to top high-school students

MATH 500

Academic math benchmark on probability, algebra, and trigonometry

MGSM

A multilingual benchmark for mathematical questions.

Academic
Education
Coding
Code Migration

Proprietary, updated 7/1/2026, 24 models tested

Can language models reimplement working programs in another language?

Top Models

1. Claude Fable 5: 55.1%
2. Claude Opus 4.8: 47.3%
3. GPT 5.5: 45.2%

IOI

Academic, updated 6/9/2026, 55 models tested

International Olympiad in Informatics

Top Models

1. Claude Fable 5: 72.3%
2. GPT 5.4 (xhigh): 67.8%
3. GPT 5.2: 54.8%

LiveCodeBench

Academic, updated 7/2/2026, 123 models tested

Our implementation of the LiveCodeBench benchmark

Top Models

1. Claude Fable 5: 89.8%
2. Gemini 3.1 Pro Preview (02/26): 88.5%
3. GPT 5.2 Codex: 88.0%

ProgramBench

Academic, updated 7/1/2026, 25 models tested

Can language models rebuild programs from scratch?

Top Models

1. Claude Fable 5: 2.0%
2. Claude Opus 4.8: 1.0%
3. GPT 5.5: 0.5%

SkillsBench

Academic, updated 6/24/2026, 12 models tested

How important are skills for agents?

Top Models

1. GPT 5.5: 62.6%
2. GPT 5.5: 62.2%
3. Claude Opus 4.8: 59.2%

SWE-bench Verified

Academic, updated 7/1/2026, 65 models tested

Solving production software engineering tasks

Top Models

1. Claude Fable 5: 95.0%
2. Claude Opus 4.8: 88.6%
3. Claude Opus 4.8: 85.8%

Terminal-Bench 2.1

Academic, updated 7/1/2026, 36 models tested

State-of-the-art set of difficult terminal-based tasks

Top Models

1. Claude Fable 5: 80.5%
2. GPT 5.5: 76.4%
3. Claude Sonnet 5: 74.5%

Vibe Code Bench v1.1

Proprietary, updated 7/1/2026, 67 models tested

Can models build web applications from scratch?

Top Models

1. Claude Fable 5: 90.4%
2. Claude Opus 4.8: 82.7%
3. Claude Sonnet 5: 81.3%

Terminal-Bench 2.0

State-of-the-art set of difficult terminal-based tasks

Beta
Games
Social Mobility