New Finance Agent Benchmark Released

Public Enterprise LLM Benchmarks

Model benchmarks are seriously lacking. With Vals AI, we report how language models perform on the industry-specific tasks where they will be used.

Benchmark

08/11/2025

Is your model smarter than a high schooler? Introducing our IOI Benchmark

Recently, top LLM labs like OpenAI and Google reported that their models achieved gold medals on the International Mathematical Olympiad (IMO). This suggests that advanced models are saturating the IMO, so we decided to test models on the International Olympiad in Informatics (IOI)!

From our evaluations, we found:

  • Grok 4 wins convincingly, placing first on both the 2024 and 2025 exams.
  • Models struggle to write C++ at the level of the best high-school students – no models qualify for medals on either exam.
  • Only the largest and most expensive models even come close to placing. The only models to achieve >10% performance all cost at least $2 per question. Claude Opus 4.1 (Nonthinking) costs over $10 per question!
  • Consistent performance across the 2024 and 2025 exams suggests that LLM labs aren’t currently training on the IOI, meaning this benchmark is relatively free from data contamination.

View Our IOI Benchmark

Model

08/09/2025

Opus 4.1 (Thinking) Evaluated!

We just evaluated Claude Opus 4.1 (Thinking) on our non-agentic benchmarks. While it placed in the top 10 on 6 of our public benchmarks, its performance on our private benchmarks was fairly mediocre.

View Opus 4.1 (Thinking) Evaluated!

Model

08/08/2025

Opus 4.1 (Nonthinking) Evaluated!

We just released results on Claude Opus 4.1 (Nonthinking) and found that, despite achieving top spots on MMLU Pro and MGSM, it performs only marginally better than Claude Opus 4 (Nonthinking) across almost all of our benchmarks (<2% performance gain).

On our private benchmarks, Opus 4.1 fails to place among the top 10 models. On public benchmarks, however, it breaks the top 10 on 5 of the 9 we evaluated. This signals the need for more private benchmarks to evaluate meaningful differences between models and gauge true performance.

View Opus 4.1 (Nonthinking) Results

Latest Benchmarks

View All Benchmarks

Latest Model Releases

View All Models

GPT 5

Release date: 8/7/2025

View Model

GPT 5 Mini

Release date: 8/7/2025

View Model

GPT 5 Nano

Release date: 8/7/2025

View Model

Claude Opus 4.1 (Nonthinking)

Release date: 8/5/2025

View Model
Join our mailing list to stay up to date as new benchmarks and models are released.