Public Enterprise LLM Benchmarks

01/05/2026 | Benchmark: Poker Agent Released

Best Performing Models

Top-performing models on the Vals Index, which covers a range of tasks across finance, coding, and law.


Vals Index, 12/23/2025 (Vals Index Score)

1. GPT 5.2 (OpenAI): 64.49%
2. Claude Opus 4.5 (Thinking) (Anthropic): 63.77%
3. Gemini 3 Flash (12/25) (Google): 59.95%

Best Open Weight Models

Top-performing open-weight models on the Vals Index, which covers a range of tasks across finance, coding, and law.


Vals Index, 12/23/2025 (Vals Index Score)

1. GLM 4.7 (zAI): 56.21%
2. MiniMax-M2.1 (MiniMax): 51.39%
3. DeepSeek V3.2 (Nonthinking) (DeepSeek): 49.39%

Pareto Efficient Models

Top-performing models from the Vals Index that are also cost-efficient, i.e. models on the Pareto frontier of accuracy versus cost per test.


Vals Index, 12/23/2025 (accuracy | cost per test)

1. GPT 5.2 (OpenAI): 64.49% | $0.94
2. Claude Opus 4.5 (Thinking) (Anthropic): 63.77% | $0.87
3. Gemini 3 Flash (12/25) (Google): 59.95% | $0.16
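
One way to read "Pareto efficient" here: a model stays on the list if no other model is at least as accurate and at least as cheap, with a strict improvement on one of the two. The following is a minimal Python sketch of that check, using only the three data points listed in this section. The `Model` type and the `dominates` and `pareto_front` functions are illustrative names, not part of any Vals tooling, and the exact rule Vals uses to draw its curve is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    accuracy: float  # Vals Index accuracy, in percent
    cost: float      # cost per test, in USD


def dominates(a: Model, b: Model) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.cost <= b.cost
    strictly_better = a.accuracy > b.accuracy or a.cost < b.cost
    return no_worse and strictly_better


def pareto_front(models: list[Model]) -> list[Model]:
    """Return the models that no other model dominates, cheapest first."""
    front = [m for m in models if not any(dominates(o, m) for o in models)]
    return sorted(front, key=lambda m: m.cost)


# Data points copied from the list in this section (Vals Index, 12/23/2025).
MODELS = [
    Model("GPT 5.2", 64.49, 0.94),
    Model("Claude Opus 4.5 (Thinking)", 63.77, 0.87),
    Model("Gemini 3 Flash (12/25)", 59.95, 0.16),
]

for m in pareto_front(MODELS):
    print(f"{m.name}: {m.accuracy:.2f}% at ${m.cost:.2f} per test")
```

With these three points, each step up in accuracy also costs more, so none dominates another and all three remain on the frontier. A hypothetical fourth model that scored lower than Gemini 3 Flash (12/25) while costing more per test would be filtered out.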


Updates

01/05/2026 | Benchmark: Poker Agent Released


Model benchmarks are seriously lacking. With Vals AI, we report how language models perform on the industry-specific tasks where they will be used.
