AIME Benchmark

Key Takeaways

  • In general, the reasoning models performed best on these tasks, significantly outperforming both non-reasoning models and the historical median human score. o3 Mini took the top spot with 86.5% accuracy, followed by DeepSeek R1 and o1.
  • Unlike benchmarks such as MATH 500 and MGSM, AIME does not yet appear to be saturated, since its questions are significantly harder.
  • It is important to note that AIME questions and answers are publicly available, meaning they could be included in the models' pretraining corpora. AIME 2024 was released in February 2024, and AIME 2025 in February 2025. It is therefore suspicious that the models performed significantly better on the older 2024 questions than on the newer 2025 questions.
  • Gemini 2.0 Flash (001) demonstrated exceptional performance relative to its cost and speed.

Background

The American Invitational Mathematics Examination (AIME) is a prestigious, invite-only mathematics competition for high-school students who score in the top 5% of the AMC 12 mathematics exam. It consists of 15 questions of increasing difficulty, each with an answer that is a single integer from 0 to 999. The median score has historically been between 4 and 6 correct answers out of the 15 possible. Two versions of the test are given every year (thirty questions in total). You can view the questions from previous years on the AIME website.

This examination serves as a crucial gateway for students aiming to qualify for the USA Mathematical Olympiad (USAMO). The test is extremely challenging and covers a wide range of mathematical topics, including algebra, geometry, and number theory.

The results clearly illustrate that no current model has yet mastered this benchmark, although several achieve strong performance.


Methodology

For this benchmark, we used the thirty questions from each of the 2024 and 2025 versions of the test (sixty questions in total), modelling our approach after the repository from the GAIR NLP Lab.

To minimize parsing errors, we instructed the models with the following prompt template:

Please reason step by step, and put your final answer within \boxed{}

{Question}

The answer was then extracted from the boxed section and compared to the ground truth.
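As a rough sketch of this extraction step (a minimal regex-based parser written for illustration; the GAIR NLP Lab repository's actual extraction code may differ), the logic looks like the following:

```python
import re

def extract_boxed_answer(response: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a model response, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def is_correct(response: str, ground_truth: int) -> bool:
    """Compare the extracted answer with the AIME ground truth (an integer from 0 to 999)."""
    answer = extract_boxed_answer(response)
    try:
        return answer is not None and int(answer) == ground_truth
    except ValueError:
        return False

# Example: a response ending in "\boxed{016}" is parsed as 16 and scored correct.
print(is_correct(r"The total is 2016, so the answer is \boxed{016}.", 16))  # True
```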

Although a few questions included an image or diagram, all of the information needed to solve the problem was present in the question text, so we did not include these images.

Reducing variance

Given the small size of this benchmark, we ran each model 8 times on both AIME 2024 and AIME 2025 to reduce variance, and averaged the pass@1 accuracy across all runs for each model.
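For concreteness, here is a minimal sketch of the averaging step (the per-run correctness values would come from the evaluation harness; the numbers below are illustrative only):

```python
from statistics import mean

def average_pass_at_1(per_run_results: list[list[bool]]) -> float:
    """Average pass@1 accuracy across repeated runs.

    per_run_results[i][j] is True if the model answered question j correctly
    on run i (in our setup: 8 runs over the 60 AIME 2024/2025 questions).
    """
    return mean(mean(run) for run in per_run_results)

# Toy example with 3 runs over 4 questions (illustrative values, not real results).
runs = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
]
print(average_pass_at_1(runs))  # 0.75
```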


Highest-Performing Models

The highest-performing models in this benchmark were o3 Mini and DeepSeek R1, with o3 Mini achieving the highest score of 86.5% and DeepSeek R1 a more modest 74.0%.

o3 Mini

Release date: 1/31/2025

Accuracy: 86.5%

Latency: 154.6s

Cost: $1.10 / $4.40 (input / output, per 1M tokens)

  • o3 Mini scored the highest overall across all tasks.
  • Despite being a Mini model, it still had noticeably high latency.
  • The model has around 200 billion parameters.



DeepSeek R1

Release date: 1/20/2025

Accuracy: 74.0%

Latency: 153.9s

Cost: $8.00 / $8.00 (input / output, per 1M tokens)

  • DeepSeek R1 was the second-highest-performing model overall across all tasks.
  • The model has around 671 billion parameters.
  • It was one of the more popular models benchmarked.



Results

To visualize the performance differences among models, we provide the following scatter plots illustrating accuracy versus latency and price.

[Scatter plots: AIME accuracy vs. latency and accuracy vs. cost]

Analysis of the results indicates that the correctly answered questions were distributed among different models, suggesting that no single model has developed a comprehensive approach to solving these problems. This reinforces the notion that current AI models remain limited in their ability to solve advanced mathematical problems consistently.

To illustrate this, we present an example where Grok 2 produced an incorrect answer due to a calculation error, while DeepSeek R1 arrived at the correct solution.

Example Question From AIME 2024:

The 9 members of a baseball team went to an ice-cream parlor after their game.
Each player had a single-scoop cone of chocolate, vanilla, or strawberry ice cream.
At least one player chose each flavor, and the number of players who chose chocolate was greater than the number of players who chose vanilla, which was greater than the number of players who chose strawberry.
Let $N$ be the number of different assignments of flavors to players that meet these conditions. Find the remainder when $N$ is divided by 1000.

In this case, Grok 2 made a calculation error that led to an incorrect final answer.

Grok 2’s Solution:

### Case 3: \(a = 7, b = 1, c = 1\)
The number of ways to choose 7 players out of 9 for chocolate is \(\binom{9}{7}\).
After selecting 7 players for chocolate, 1 of the remaining 2 players must be chosen for vanilla, which is \(\binom{2}{1}\).
The last player automatically gets strawberry. Therefore, the number of ways to arrange this is:
\[
\binom{9}{7} \cdot \binom{2}{1} = 36 \cdot 2 = 72
\]

DeepSeek R1’s Correct Solution:

3. For \((4, 3, 2)\):
   \[
   \frac{9!}{4! \cdot 3! \cdot 2!} = \frac{362880}{24 \cdot 6 \cdot 2} = \frac{362880}{288} = 1260
   \]

The correct summation of cases should have been 252 + 504 + 1260 = 2016.

Applying modular arithmetic, the final result should have been 2016 mod 1000 = 16, which is the correct answer.
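As a quick sanity check on this arithmetic, the answer can be reproduced with a short brute-force sketch (written for illustration, not taken from either model's output) that enumerates every valid flavor split (chocolate > vanilla > strawberry, each chosen by at least one player, summing to 9) and adds up the corresponding multinomial coefficients:

```python
from math import factorial

def multinomial(n: int, parts: tuple[int, ...]) -> int:
    """Number of ways to split n labeled players into groups of the given sizes."""
    result = factorial(n)
    for p in parts:
        result //= factorial(p)
    return result

total = 0
for chocolate in range(1, 10):
    for vanilla in range(1, chocolate):        # vanilla < chocolate
        strawberry = 9 - chocolate - vanilla
        if 1 <= strawberry < vanilla:          # at least one player, strawberry < vanilla
            total += multinomial(9, (chocolate, vanilla, strawberry))

print(total, total % 1000)  # 2016 16
```

This reproduces the three valid cases (6, 2, 1), (5, 3, 1), and (4, 3, 2) and confirms the final answer of 16.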
