Independent Evaluation, Unbiased Benchmarks

Testing AI on Real-World Tasks

We benchmark the world's leading AI models on rigorous, domain-specific tasks in finance, law, software, healthcare, and more. We run all of our own evaluations and create many of our benchmarks in-house.

Vals AI Updates

Fresh updates from our testing queue

model
05/28/2026

Anthropic's Claude Opus 4.8 evaluated across our benchmark suite



Benchmarks

Accuracy           Ranking
70.17% ± 0.91      1 / 20
70.71% ± 0.75      1 / 16
66.71% ± 0.93      8 / 108
53.92% ± 0.16      2 / 20
53.22% ± 2.17      5 / 60
85.75% ± 1.93      6 / 60
69.91% ± 0.89      2 / 76
69.00% ± 4.65      1 / 33
92.42% ± 1.98      4 / 108
87.82% ± 0.95      3 / 113

Contact us
Or send us an email at contact@vals.ai
Proprietary Benchmarks (contact us to get access)
Academic Benchmarks

Read about our methodology.

Industry Leaderboard

Independent benchmarks for industry-specific AI performance.

Industry
Benchmark