New Finance Agent Benchmark Released

Public Enterprise LLM Benchmarks

Model benchmarks are seriously lacking. With Vals AI, we report how language models perform on the industry-specific tasks where they will be used.

Model

07/30/2025

Kimi K2 Instruct Evaluated on SWE-Bench

In our SWE-bench evaluation, Kimi K2 Instruct achieved 34% accuracy, barely more than half of Kimi’s published figures!

After investigating the model responses, we identified the following two sources of error:

  1. The model struggles to use tools: it often emits tool calls inside the response text itself rather than through the structured tool-calling interface (see the sketch after this list). We replicated the issue on multiple popular inference providers. However, even discarding such errors only increases accuracy by around 2%.
  2. The model often gets stuck repeating itself, leading to unnecessarily long and incorrect responses. This is a common failure mode for models run at zero temperature, though it is most prevalent among thinking models.
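To make these failure modes concrete, here is a minimal sketch of how such responses can be flagged. It assumes OpenAI-style chat completion message dicts; the inline-markup regex, the helper names, and the thresholds are our own illustrative choices, not the exact markup Kimi K2 emits or the harness we actually use.

```python
import re

# Illustrative pattern for tool-call markup leaking into the text body.
# The exact tokens vary by model and provider; this regex is an assumption,
# not the precise format Kimi K2 produces.
INLINE_TOOL_CALL = re.compile(
    r"<\|tool_call[^|]*\|>"        # special-token style markers
    r"|\"function\"\s*:\s*\{"      # raw JSON function-call fragments
)

def classify_tool_use(message: dict) -> str:
    """Classify how an OpenAI-style chat message invoked a tool, if at all."""
    if message.get("tool_calls"):
        return "structured"  # proper API-level call the harness can execute
    content = message.get("content") or ""
    if INLINE_TOOL_CALL.search(content):
        return "inline"      # tool call leaked into plain text (failure mode 1)
    return "none"

def looks_degenerate(text: str, window: int = 30, min_repeats: int = 3) -> bool:
    """Crude check for failure mode 2: does the final `window`-word span
    already appear `min_repeats` or more times in the response?"""
    words = text.split()
    if len(words) < window * min_repeats:
        return False
    normalized = " ".join(words)  # normalize whitespace so counts line up
    tail = " ".join(words[-window:])
    return normalized.count(tail) >= min_repeats
```

A classifier along these lines is how one would estimate the effect of discarding inline tool-call errors, as in the roughly 2% adjustment mentioned above.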

View Kimi K2 Results

Model

07/29/2025

NVIDIA Nemotron Super Evaluated

We evaluated Llama 3.3 Nemotron Super (Nonthinking) and Llama 3.3 Nemotron Super (Thinking) and found that the Thinking variant substantially outperforms the Nonthinking variant, with one significant exception: our proprietary Contract Law benchmark.

View Nemotron Super Results

Model

07/22/2025

Kimi K2 Instruct Evaluated on Non-Agentic Benchmarks!

We found that Kimi K2 Instruct is the new state-of-the-art open-source model according to our evaluations.

The model cracks the top 10 on Math500 and LiveCodeBench, narrowly beating out DeepSeek R1 on both. On other public benchmarks, however, Kimi K2 Instruct delivers middle-of-the-pack performance.

Kimi K2 Instruct struggles further on our proprietary benchmarks, failing to break the top 10 on any of them. It has particular trouble with legal tasks such as Case Law and Contract Law but performs comparatively better on finance tasks such as Corp Fin and Tax Eval.

The model offers solid value at $1.00 input/$3.00 output per million tokens, which is cheaper than DeepSeek R1 ($3.00/$7.00, both as hosted on Together AI) but more expensive than Mistral Medium 3.1 (05/2025) ($0.40/$2.00).
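As a quick sanity check on those rates, the arithmetic below prices out a hypothetical workload; the 1M-input/1M-output token volume is purely illustrative.

```python
# Per-million-token rates quoted above, as (input $, output $).
PRICES = {
    "Kimi K2 Instruct": (1.00, 3.00),
    "DeepSeek R1 (Together AI)": (3.00, 7.00),
    "Mistral Medium 3.1 (05/2025)": (0.40, 2.00),
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Total dollar cost for a given token volume at the listed rates."""
    in_rate, out_rate = PRICES[model]
    return (in_rate * input_tokens + out_rate * output_tokens) / 1_000_000

# Hypothetical workload: 1M input tokens and 1M output tokens.
for name in PRICES:
    print(f"{name}: ${cost_usd(name, 1_000_000, 1_000_000):.2f}")
# Kimi K2 Instruct: $4.00
# DeepSeek R1 (Together AI): $10.00
# Mistral Medium 3.1 (05/2025): $2.40
```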

We’re currently evaluating the model on SWE-bench, on which Kimi’s reported accuracy would top our leaderboard. Looking forward to seeing whether the model can live up to the hype!

View Kimi K2 Instruct Results


Latest Model Releases

View All Models

Kimi K2 Instruct
Release date: 7/11/2025
View Model

Grok 4
Release date: 7/9/2025
View Model

Magistral Medium 3.1 (06/2025)
Release date: 6/10/2025
View Model

Claude Sonnet 4 (Nonthinking)
Release date: 5/22/2025
View Model
Join our mailing list to receive benchmark updates.

Stay up to date as new benchmarks and models are released.