
Updated: 3/18/2026

International Olympiad in Informatics

Key Takeaways

  • GPT 5.4 now leads IOI at 67.83% overall, a large jump over the previous leader GPT 5.2 at 54.83%.
  • The current top five are GPT 5.4, GPT 5.2, GPT 5.3 Codex, Gemini 3 Flash (12/25), and Gemini 3 Pro (11/25).
  • GPT 5.4 is strongest in both years (67.17% on IOI 2024 and 68.50% on IOI 2025), while GPT 5.2 shows a larger year-to-year spread (43.83% to 65.83%).
  • IOI remains highly differentiating: there is still a steep capability drop from the top tier to the middle of the leaderboard.

Why IOI?

Recently, top LLM labs like OpenAI and Google reported that their models achieved gold medals on the International Mathematical Olympiad (IMO). However, advanced models are starting to saturate IMO, meaning it may no longer effectively differentiate between the capabilities of top-performing models. Reports also suggest the evaluation process faced coordination challenges, with AI companies seeking expedited validation mid-competition that may not reflect standard IMO assessment procedures.

The International Olympiad in Informatics (IOI) offers several advantages as an LLM benchmark. Unlike the IMO, the IOI is not yet saturated, providing clear differentiation between model capabilities. The competition features standardized and automated grading, ensuring objective evaluation without subjective scoring. Additionally, the IOI has real-world relevance as it tests C++ programming skills that are directly applicable to software development.


Benchmark Design

We designed our benchmark to imitate competition conditions as closely as possible.

Agent Harness

We adapted our open-source agent harness from our Finance Agent Benchmark, giving it access to the following tools:

  1. a C++20 execution environment, which compiles and executes arbitrary code
  2. a submission tool used for grading, which executes submitted code and returns a score out of 100 possible points

These tools, particularly the submission tool, were based on the testing environment available to human contestants.

Please find our open-source implementation here.

Scoring

We further designed our submission tool to match the grading process used at the olympiad. Agents get up to 50 submissions, each of which is graded on a set of subtasks. A competitor receives credit for a subtask if any submission passes all of that subtask's tests. In particular, this means the final score can be higher than any individual submission's score, since the grading system combines credit earned on separate subtasks across submissions.
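The credit-combination rule above can be sketched in a few lines. This is an illustrative sketch, not the harness's actual grader: the function names and the example subtask point values are hypothetical.

```python
# Sketch of the IOI-style subtask credit rule described above (illustrative
# names and point values, not the actual grader implementation).
# A subtask earns its points if ANY submission passes all of its tests, so
# the final score is a per-subtask best across submissions.

def final_score(submissions, subtask_points):
    """submissions: list of dicts mapping subtask id -> True if that
    submission passed all of the subtask's tests.
    subtask_points: dict mapping subtask id -> point value."""
    total = 0
    for subtask, points in subtask_points.items():
        if any(sub.get(subtask, False) for sub in submissions):
            total += points
    return total

# Two submissions that each solve a different subtask: either alone scores
# 30, but combined credit yields 60, more than any single submission.
points = {"s1": 30, "s2": 30, "s3": 40}
subs = [
    {"s1": True, "s2": False, "s3": False},
    {"s1": False, "s2": True, "s3": False},
]
print(final_score(subs, points))  # 60
```

This is why an agent can benefit from targeting easy subtasks in separate submissions rather than attempting a single full solution.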


Results

GPT 5.4 leads with 67.83% overall, the best performance on IOI by a wide margin over the previous leader, GPT 5.2 (54.83%). GPT 5.3 Codex, Gemini 3 Flash (12/25) (39.08%), and Gemini 3 Pro (11/25) (38.83%) round out the top five, with Grok 4 further back at 26.17%. However, all models struggled on the task: none qualified for a medal in either year.

We chose to evaluate on both IOI 2024 and IOI 2025 to check for data contamination, which we suspect explains the performance decrease between those years on LiveCodeBench. By contrast, we find increased performance on IOI 2025, which we attribute to an easier exam: human contestant scores increased commensurately between the two years.
