Key Takeaways
- In general, the reasoning models performed best on these tasks, significantly outperforming both non-reasoning models and human performance. o3 Mini took the top spot with 86.5% accuracy, followed by DeepSeek R1 and o1.
- Unlike benchmarks such as MATH 500 and MGSM, the AIME does not yet appear to be saturated, since its questions are significantly harder.
- It is important to note that AIME questions and answers are publicly available, meaning they could be included in a model's pretraining corpus. The AIME 2024 was released in February 2024, and the AIME 2025 in February 2025. It is also suspicious that the models performed significantly better on the older 2024 questions than on the newer 2025 questions.
- Gemini 2.0 Flash (001) demonstrated exceptional performance relative to its cost and speed.
Background
The American Invitational Mathematics Examination (AIME) is a prestigious, invite-only mathematics competition for high-school students who score in the top 5% on the AMC 12 mathematics exam. It consists of 15 questions of increasing difficulty, and the answer to every question is a single integer from 0 to 999. The median score has historically been between 4 and 6 correct answers out of the 15 possible. Two versions of the test are given every year (thirty questions total). You can view the questions from previous years on the AIME website.
This examination serves as a crucial gateway for students aiming to qualify for the USA Mathematical Olympiad (USAMO). In general, the test is extremely challenging, and covers a wide range of mathematical topics, including algebra, geometry, and number theory.
The results clearly illustrate that no current model has yet mastered this benchmark, although several achieve strong performance.
Methodology
For this benchmark, we used the thirty questions from each of the 2024 and 2025 versions of the test (sixty questions total), modeling our approach after the repository from the GAIR NLP Lab.
To minimize parsing errors, we instructed the models with the following prompt template:
Please reason step by step, and put your final answer within \boxed{}
{Question}
The answer was then extracted from the boxed section and compared to the ground truth.
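For illustration, the extraction and scoring step might look like the following Python sketch; the regex, function names, and template constant are our own and are not taken from the GAIR NLP repository.

```python
import re

# The prompt template shown above, with {question} as a placeholder
PROMPT_TEMPLATE = (
    "Please reason step by step, and put your final answer within \\boxed{{}}\n"
    "{question}"
)

def extract_boxed_answer(completion: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a model completion."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None

def is_correct(completion: str, ground_truth: int) -> bool:
    """Compare the extracted answer to the ground-truth AIME integer (0-999)."""
    answer = extract_boxed_answer(completion)
    try:
        return answer is not None and int(answer) == ground_truth
    except ValueError:
        return False
```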
Although a few questions included an image or diagram, all of the information needed to solve the problem was present in the question text, so we did not include these images.
Reducing Variance
Given the small size of this benchmark, we ran each model 8 times on both AIME 2024 and AIME 2025 to reduce variance. We averaged the pass@1 performance across all runs for each model.
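A minimal sketch of that averaging step is below; the data in the example is illustrative, not taken from our actual runs.

```python
from statistics import mean

def average_pass_at_1(runs: list[list[bool]]) -> float:
    """Average pass@1 over repeated runs.

    Each inner list holds per-question correctness flags for one run;
    pass@1 for a single run is simply its fraction of correct answers.
    """
    return mean(mean(flags) for flags in runs)

# Illustrative example: 3 runs over 5 questions (not real benchmark data)
runs = [
    [True, False, True, True, False],
    [True, True, True, False, False],
    [True, False, True, True, True],
]
print(f"Averaged pass@1: {average_pass_at_1(runs):.3f}")  # 0.667
```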
Highest-Performing Models
The highest-performing models in this benchmark were o3 Mini and DeepSeek R1, with o3 Mini achieving the highest score of 86.5% and DeepSeek R1 achieving a more modest 74.0%.
o3 Mini
Release date : 1/31/2025
Accuracy :
86.5%
Latency :
154.6s
Cost :
$1.10 / $4.40
- o3 Mini scored the highest overall across all tasks
- Despite being a Mini model, it still had a noticeably high latency.
- The model reportedly has around 200 billion parameters
DeepSeek R1
Release date : 1/20/2025
Accuracy :
74.0%
Latency :
153.9s
Cost :
$8.00 / $8.00
- DeepSeek R1 was the second-highest-performing model overall across all tasks
- The model has around 671 billion parameters
- One of the more popular models benchmarked
Results
To visualize the performance differences among models, we provide the following scatter plots illustrating accuracy versus latency and price.
Analysis of the results indicates that the set of correctly answered questions varied from model to model, suggesting that no single model has developed a comprehensive approach to solving these problems. This reinforces the notion that current AI models remain limited in their ability to solve advanced mathematical problems consistently.
To illustrate this, we present an example where Grok 2 produced an incorrect answer, while DeepSeek R1 arrived at the correct solution.
Example Question From AIME 2024:
The 9 members of a baseball team went to an ice-cream parlor after their game.
Each player had a single-scoop cone of chocolate, vanilla, or strawberry ice cream.
At least one player chose each flavor, and the number of players who chose chocolate was greater than the number of players who chose vanilla, which was greater than the number of players who chose strawberry.
Let $N$ be the number of different assignments of flavors to players that meet these conditions. Find the remainder when $N$ is divided by 1000.
In this case, Grok 2 counted a case that violates the problem's conditions (seven chocolate, one vanilla, and one strawberry, so the vanilla count is not strictly greater than the strawberry count), which led to an incorrect final answer.
Grok 2’s Solution:
### Case 3: \(a = 7, b = 1, c = 1\)
The number of ways to choose 7 players out of 9 for chocolate is \(\binom{9}{7}\).
After selecting 7 players for chocolate, 1 of the remaining 2 players must be chosen for vanilla, which is \(\binom{2}{1}\).
The last player automatically gets strawberry. Therefore, the number of ways to arrange this is:
\[
\binom{9}{7} \cdot \binom{2}{1} = 36 \cdot 2 = 72
\]
Deepseek’s Correct Solution:
3. For \((4, 3, 2)\):
\[
\frac{9!}{4! \cdot 3! \cdot 2!} = \frac{362880}{24 \cdot 6 \cdot 2} = \frac{362880}{288} = 1260
\]
The correct summation of cases should have been 252 + 504 + 1260 = 2016.
Applying modular arithmetic, the final result is 2016 mod 1000 = 16, which is the correct answer.
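For completeness, the valid cases and the final residue can be double-checked with a short Python snippet; this is our own sanity check, not part of either model's output.

```python
from math import factorial

def multinomial(total: int, parts: tuple[int, ...]) -> int:
    """Ways to split `total` labeled players into groups of the given sizes."""
    count = factorial(total)
    for size in parts:
        count //= factorial(size)
    return count

# Valid (chocolate, vanilla, strawberry) counts with c > v > s >= 1 and c + v + s = 9
cases = [(6, 2, 1), (5, 3, 1), (4, 3, 2)]
counts = [multinomial(9, case) for case in cases]
print(counts)              # [252, 504, 1260]
print(sum(counts) % 1000)  # 16
```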