Public Enterprise LLM Benchmarks

11/21/2025 (Benchmark): Our Vibe Code Benchmark is live! See the VibeCodeBench results.
11/24/2025 (Model): Claude Opus 4.5 is first on the Vals Index, SWE-Bench, and SAGE.

Best Performing Models

The top-performing models on the Vals Index, which spans a range of tasks across finance, coding, and law.

Vals Index, as of 11/24/2025:

1. Claude Opus 4.5 (Thinking), Anthropic: 63.9%
2. GPT 5.1, OpenAI: 60.5%
3. Gemini 3 Pro (11/25), Google: 59.5%
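The Vals Index is a composite score over the benchmarks listed further down this page. The page does not say how per-benchmark results are combined, so the sketch below assumes a plain unweighted mean purely for illustration; the function name and the per-benchmark scores in it are hypothetical, not real results.

```python
# Hypothetical sketch of aggregating per-benchmark accuracies into a single
# composite score. The actual Vals Index weighting is not published here;
# an unweighted mean is assumed only to illustrate the idea.
from statistics import mean


def composite_index(per_benchmark_accuracy: dict) -> float:
    """Average a model's accuracy (values in [0, 1]) across benchmarks."""
    return mean(per_benchmark_accuracy.values())


if __name__ == "__main__":
    # Placeholder scores for one model (not real data).
    example_scores = {"CorpFin": 0.70, "CaseLaw": 0.65, "SWE-bench": 0.55}
    print(f"Composite index: {composite_index(example_scores):.1%}")
```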

Best Open Weight Models

The top-performing open-weight models on the Vals Index, which spans a range of tasks across finance, coding, and law.

Vals Index, as of 11/24/2025:

1. GLM 4.6, zAI: 46.5%
2. Kimi K2 Thinking, Kimi: 41.6%
3. GPT OSS 120B, OpenAI: 39.3%

Pareto Efficient Models

The models from the Vals Index that are Pareto efficient on cost and accuracy: no other model delivers higher accuracy at a lower cost per test. A minimal sketch of how such a frontier can be computed follows the rankings below.

Vals Index, as of 11/24/2025 (accuracy plotted against cost per test):

1. Claude Opus 4.5 (Thinking), Anthropic: 63.9% accuracy at $2.70 per test
2. GPT 5.1, OpenAI: 60.5% accuracy at $0.29 per test
3. Grok 4.1 Fast (Reasoning), xAI: 49.4% accuracy at $0.05 per test
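As a minimal sketch (using only the three published points above; the helper name is ours, not part of the Vals tooling), a cost/accuracy Pareto frontier can be computed by sorting models by cost and keeping each model that improves on the best accuracy seen so far:

```python
# Sketch: compute the Pareto frontier over (cost per test, accuracy) points.
from dataclasses import dataclass
from typing import List


@dataclass
class ModelPoint:
    name: str
    cost_per_test: float  # USD
    accuracy: float       # fraction, e.g. 0.639 for 63.9%


def pareto_frontier(points: List[ModelPoint]) -> List[ModelPoint]:
    """Keep the points for which no other point has lower (or equal) cost
    and strictly higher accuracy."""
    # Sort by cost ascending; break cost ties by higher accuracy first.
    ordered = sorted(points, key=lambda p: (p.cost_per_test, -p.accuracy))
    frontier: List[ModelPoint] = []
    best_accuracy = float("-inf")
    for p in ordered:
        if p.accuracy > best_accuracy:
            frontier.append(p)
            best_accuracy = p.accuracy
    return frontier


if __name__ == "__main__":
    # The three published Vals Index points shown above.
    models = [
        ModelPoint("Claude Opus 4.5 (Thinking)", 2.70, 0.639),
        ModelPoint("GPT 5.1", 0.29, 0.605),
        ModelPoint("Grok 4.1 Fast (Reasoning)", 0.05, 0.494),
    ]
    for m in pareto_frontier(models):
        print(f"{m.name}: {m.accuracy:.1%} at ${m.cost_per_test:.2f}/test")
```

All three published points lie on the frontier, since each step up in cost also buys higher accuracy.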


Updates

11/24/2025 (Model): Claude Opus 4.5 is first on the Vals Index, SWE-Bench, and SAGE.

Benchmarks

The leaderboard covers the following benchmarks; the number in parentheses is how many models are ranked on each.

Vibe Code Bench (13)
SAGE (28)
FinanceAgent (50)
CorpFin (64)
CaseLaw (48)
TaxEval (81)
MortgageTax (51)
AIME (71)
MGSM (73)
LegalBench (96)
MedQA (77)
GPQA (73)
MMLU Pro (71)
MMMU (47)
LiveCodeBench (71)
IOI (27)
Terminal-Bench (37)
SWE-bench (35)
Vals Index (23)

Benchmarks are either academic or proprietary (contact us to get access to proprietary results).

Join our mailing list to receive benchmark updates

Model benchmarks are seriously lacking. With Vals AI, we report how language models perform on the industry-specific tasks where they will be used.
