Updates
View All Updates
Model
07/30/2025
Kimi K2 Instruct Evaluated on SWE-Bench
In our SWE-bench evaluation, Kimi K2 Instruct achieved 34% accuracy, barely more than half of Kimi’s published figures!
After investigating the model responses, we identified the two following sources of error:
- The model struggles to use tools: it often embeds tool calls directly in its response text instead of emitting them as structured tool calls. We replicated the issue on multiple popular inference providers. However, even discarding responses with such errors increases accuracy by only around 2%.
- The model often gets stuck repeating itself, leading to unnecessarily long and incorrect responses. This is a common failure mode of models at zero temperature, though it’s most prevalent among thinking models; a rough sketch of how such failures can be flagged is shown below.
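As a minimal illustration (not our actual evaluation harness), the sketch below flags the two failure modes with simple heuristics: a regex for tool calls leaked into plain response text, and an n-gram counter for degenerate repetition. The function names, the `<|tool_call|>` tag format, and the thresholds are assumptions chosen for the example.

```python
import re
from collections import Counter

# Assumed tag/JSON shapes for an inline tool call; real serializations vary by provider.
TOOL_CALL_PATTERN = re.compile(r'<\|tool_call\|>|"function"\s*:\s*\{', re.IGNORECASE)

def leaks_tool_call_into_text(assistant_text: str) -> bool:
    """Return True if the plain-text completion contains what looks like an
    inline tool call instead of a structured tool-call field."""
    return bool(TOOL_CALL_PATTERN.search(assistant_text))

def is_degenerate_repetition(assistant_text: str, ngram: int = 8, max_repeats: int = 5) -> bool:
    """Flag responses where the same n-gram recurs many times -- the
    'stuck repeating itself' behavior described above. Thresholds are illustrative."""
    tokens = assistant_text.split()
    if len(tokens) < ngram * max_repeats:
        return False
    counts = Counter(tuple(tokens[i:i + ngram]) for i in range(len(tokens) - ngram + 1))
    return counts.most_common(1)[0][1] >= max_repeats

if __name__ == "__main__":
    sample = 'I will now call the tool: <|tool_call|> {"name": "run_tests"}'
    print(leaks_tool_call_into_text(sample))              # True
    print(is_degenerate_repetition("fix the bug " * 40))  # True
```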
View Kimi K2 Results
Model
07/29/2025
NVIDIA Nemotron Super Evaluated
We evaluated Llama 3.3 Nemotron Super (Nonthinking) and Llama 3.3 Nemotron Super (Thinking) and found the Thinking variant substantially outperforms the Nonthinking variant, with the significant exception of our proprietary Contract Law benchmark.
- On Contract Law, Llama 3.3 Nemotron Super (Thinking) struggles, ranking in the bottom 10 models. Meanwhile, Llama 3.3 Nemotron Super (Nonthinking) lands in the top 3!
- On TaxEval and CaseLaw, Llama 3.3 Nemotron Super (Nonthinking) struggles significantly, while Llama 3.3 Nemotron Super (Thinking) sits solidly middle-of-the-pack.
- On public benchmarks, Llama 3.3 Nemotron Super (Nonthinking) performs abysmally across the board. Llama 3.3 Nemotron Super (Thinking) improves on all public benchmarks but still struggles, particularly on MGSM (ranked 35/46) and MMLU Pro (ranked 31/43).
- Llama 3.3 Nemotron Super (Thinking) shows substantial gains over Llama 3.3 Nemotron Super (Nonthinking): on AIME, its rank improves from 37/44 to 14/44, and on CaseLaw, accuracy increases by 12%. These results highlight the benefits of the reasoning model.
View Nemotron Super Results
Model
07/22/2025
Kimi K2 Instruct Evaluated on Non-Agentic Benchmarks!
We found that Kimi K2 Instruct is the new state-of-the-art open-source model according to our evaluations.
The model cracks the top 10 on Math500 and LiveCodeBench, narrowly beating out DeepSeek R1 on both. On other public benchmarks, however, Kimi K2 Instruct delivers middle-of-the-pack performance.
Kimi K2 Instruct struggles on our proprietary benchmarks, though, failing to break the top 10 on any of them. It particularly struggles with legal tasks such as Case Law and Contract Law but performs comparatively better on finance tasks such as Corp Fin and Tax Eval.
The model offers solid value at $1.00 input/$3.00 output per million tokens, which is cheaper than DeepSeek R1 ($3.00/$7.00, both as hosted on Together AI) but more expensive than Mistral Medium 3.1 (05/2025) ($0.40/$2.00).
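To make the price gap concrete, here is a back-of-the-envelope sketch using the per-million-token rates quoted above. The request size (8,000 input tokens, 2,000 output tokens) is a hypothetical workload chosen for illustration, not a figure from our benchmark runs.

```python
# Per-million-token prices quoted above: (input $/M, output $/M).
PRICES = {
    "Kimi K2 Instruct": (1.00, 3.00),
    "DeepSeek R1 (Together AI)": (3.00, 7.00),
    "Mistral Medium 3.1 (05/2025)": (0.40, 2.00),
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at the listed per-million-token rates."""
    in_rate, out_rate = PRICES[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# Hypothetical request: 8k input tokens, 2k output tokens.
for model in PRICES:
    print(f"{model}: ${cost_usd(model, 8_000, 2_000):.4f}")
# Kimi K2 Instruct: $0.0140
# DeepSeek R1 (Together AI): $0.0380
# Mistral Medium 3.1 (05/2025): $0.0072
```

At this workload, Kimi K2 Instruct comes in at roughly a third of DeepSeek R1's cost while remaining about twice as expensive as Mistral Medium 3.1.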
We’re currently evaluating the model on SWE-bench, on which Kimi’s reported accuracy would top our leaderboard. Looking forward to seeing whether the model can live up to the hype!
View Kimi K2 Instruct Results
Latest Benchmarks
View All Benchmarks
Latest Model Releases
View All Models
Kimi K2 Instruct
Release date: 7/11/2025
Grok 4
Release date: 7/9/2025
Magistral Medium 3.1 (06/2025)
Release date: 6/10/2025
Claude Sonnet 4 (Nonthinking)
Release date: 5/22/2025