Updates
Benchmark
08/11/2025
Is your model smarter than a High-Schooler? Introducing our IOI Benchmark
Recently, top LLM labs like OpenAI and Google reported that their models achieved gold medals at the International Mathematical Olympiad (IMO). This suggests that advanced models are saturating the IMO, so we decided to test models on the International Olympiad in Informatics (IOI)!
From our evaluations, we found:
- Grok 4 wins convincingly, placing first on both the 2024 and 2025 exams.
- Models struggle to write C++ at the level of the best high-school students – no models qualify for medals on either exam.
- Only the largest and most expensive models even come close to placing. The only models to achieve >10% performance all cost at least $2 per question. Claude Opus 4.1 (Nonthinking) costs over $10 per question!
- Consistent performance across the 2024 and 2025 exams suggests that LLM labs aren’t currently training on the IOI, and that this benchmark is therefore relatively free from data contamination.
View Our IOI Benchmark
Model
08/09/2025
Opus 4.1 (Thinking) Evaluated!
We just evaluated Claude Opus 4.1 (Thinking) on our non-agentic benchmarks. While it placed in the top 10 on 6 of our public benchmarks, its performance on our private benchmarks was fairly mediocre.
- On our private benchmarks, Claude Opus 4.1 (Thinking) lands squarely in the middle of the pack, only barely making the top 10 on our TaxEval benchmark.
- On public benchmarks, however, Claude Opus 4.1 (Thinking) ranks in the top 10 on 6 of the benchmarks we evaluated. Notably, it takes 2nd place on MMLU Pro behind only Claude Opus 4.1 (Nonthinking) and claims 1st place on MGSM.
View Opus 4.1 (Thinking) Results
Model
08/08/2025
Opus 4.1 (Nonthinking) Evaluated!
We just released results for Claude Opus 4.1 (Nonthinking) and found that, despite achieving top spots on MMLU Pro and MGSM, the model performs only marginally better than Claude Opus 4 (Nonthinking) across almost all of our benchmarks (<2% performance gain).
On our private benchmarks, Opus 4.1 fails to place among the top 10 models. On public benchmarks, however, the model breaks into the top 10 on 5 of the 9 we evaluated. This signals the need for more private benchmarks to evaluate meaningful differences between models and gauge true performance.
View Opus 4.1 (Nonthinking) Results
Latest Model Releases
View All Models
GPT-5
Release date: 08/07/2025
GPT-5 Mini
Release date: 08/07/2025
GPT-5 Nano
Release date: 08/07/2025
Claude Opus 4.1 (Nonthinking)
Release date: 08/05/2025