
Updated: 3/18/2026

International Olympiad in Informatics

Key Takeaways

  • GPT 5.4 now leads IOI at 67.83% overall, a large jump over the previous leader GPT 5.2 at 54.83%.
  • The current top five are GPT 5.4, GPT 5.2, GPT 5.3 Codex, Gemini 3 Flash (12/25), and Gemini 3 Pro (11/25).
  • GPT 5.4 is strongest in both years (67.17% on IOI 2024 and 68.50% on IOI 2025), while GPT 5.2 shows a larger year-to-year spread (43.83% to 65.83%).
  • IOI remains highly differentiating: there is still a steep capability drop from the top tier to the middle of the leaderboard.

Why IOI?

Recently, top LLM labs like OpenAI and Google reported that their models achieved gold medals on the International Mathematical Olympiad (IMO). However, advanced models are starting to saturate IMO, meaning it may no longer effectively differentiate between the capabilities of top-performing models. Reports also suggest the evaluation process faced coordination challenges, with AI companies seeking expedited validation mid-competition that may not reflect standard IMO assessment procedures.

The International Olympiad in Informatics (IOI) offers several advantages as an LLM benchmark. Unlike the IMO, the IOI is not yet saturated, providing clear differentiation between model capabilities. The competition features standardized and automated grading, ensuring objective evaluation without subjective scoring. Additionally, the IOI has real-world relevance as it tests C++ programming skills that are directly applicable to software development.


Benchmark Design

We designed our benchmark to imitate competition conditions as closely as possible.

Agent Harness

We adapted our open-source agent harness from our Finance Agent Benchmark, giving it access to the following tools:

  1. a C++20 execution environment, which compiles and executes arbitrary code
  2. a submission tool used for grading, which executes submitted code and returns a score out of 100 possible points

These tools, particularly the submission tool, were based on the testing environment available to human contestants.

Please find our open-source implementation here.

Scoring

We further designed our submission tool to match the grading process used at the olympiad. Agents get up to 50 submissions, each of which is graded on a set of subtasks. A competitor receives credit for a subtask if any submission passes all of that subtask's tests. In particular, this means the final score can be higher than any individual submission's score, since the grading system combines credit earned on separate subtasks across submissions.
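The credit-combination rule above can be sketched in a few lines. This is an illustrative sketch, not the harness's actual grader: the function names and the example subtask point values are hypothetical.

```python
# Sketch of the IOI-style subtask credit rule described above (illustrative
# names and point values, not the actual grader implementation).
# A subtask earns its points if ANY submission passes all of its tests, so
# the final score is a per-subtask best across submissions.

def final_score(submissions, subtask_points):
    """submissions: list of dicts mapping subtask id -> True if that
    submission passed all of the subtask's tests.
    subtask_points: dict mapping subtask id -> point value."""
    total = 0
    for subtask, points in subtask_points.items():
        if any(sub.get(subtask, False) for sub in submissions):
            total += points
    return total

# Two submissions that each solve a different subtask: either alone scores
# 30, but combined credit yields 60, more than any single submission.
points = {"s1": 30, "s2": 30, "s3": 40}
subs = [
    {"s1": True, "s2": False, "s3": False},
    {"s1": False, "s2": True, "s3": False},
]
print(final_score(subs, points))  # 60
```

This is why an agent can benefit from targeting easy subtasks in separate submissions rather than attempting a single full solution.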


Results

GPT 5.4 leads with 67.83% overall, the best performance on IOI by a wide margin over the previous leader, GPT 5.2 (54.83%). GPT 5.3 Codex, Gemini 3 Flash (12/25) (39.08%), and Gemini 3 Pro (11/25) (38.83%) round out the top five, with Grok 4 further back at 26.17%. However, all models struggled on the task: none qualified for a medal in either year.

We chose to evaluate on both IOI 2024 and IOI 2025 to check for data contamination, which we suspect explains the performance decrease between those years on LiveCodeBench. By contrast, we find increased performance on IOI 2025, which we attribute to an easier exam: human contestant scores increased commensurately between the two years.
