Key Takeaways
- DeepSeek R1 topped the charts with a 92.2% accuracy, surpassing OpenAI’s o-series models
- OpenAI’s o3-mini, despite its “mini” label, achieved a remarkable 91.8% accuracy
- Google’s Gemini 2.0 Flash (001) struck a great balance between speed and accuracy, boasting an 89.0% accuracy with just 3.35s latency
- With the best models performing at over 90% accuracy, we may soon reach the limit of what this benchmark can measure. Since the questions are public, it may be time to create a new, private math benchmark that is not part of any pre-training corpus.
Methodology
This benchmark is an adaptation of the MATH benchmark, first published in Measuring Mathematical Problem Solving With the MATH Dataset. The MATH benchmark is commonly reported on new model releases.
We sample 500 diverse problems from this benchmark - spanning topics like probability, algebra, trigonometry, and geometry. The questions are designed to test a model’s ability to apply mathematical principles, execute complex calculations, and communicate solutions clearly.
Unlike the original paper, which fine-tuned models to produce LaTeX output, we used the following prompt template to ensure the models produce outputs in the correct format.
Answer the following math question, given in LaTeX format, clearly and concisely, and present the final answer as \(\boxed{x}\), where x is the fully simplified solution.
Example:
**Question:** \(\int_0^1 (3x^2 + 2x) \,dx\)
**Solution:** \(\int (3x^2 + 2x) \,dx = x^3 + x^2 + C\) Evaluating from 0 to 1: \((1^3 + 1^2) - (0^3 + 0^2) = 1 + 1 - 0 = 2 \boxed{2}\)
Now, solve the following question: {question}
We also used the parsing logic from the PRM800K dataset grader. We found it much more reliable at extracting and evaluating the model’s output: it was robust to differing formats and mathematical formulations, compared to the parsing logic from the original MATH paper.
All models were evaluated with temperature set to 0, except for the reasoning models that force a particular setting (for example, Claude 3.7 Sonnet requires temperature 1 in reasoning mode, and o1 does not accept a temperature parameter).
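A sketch of how such per-model overrides might be wired up (the model names, defaults, and helper are illustrative, not our actual harness):

```python
# Hypothetical per-model sampling configuration for the eval harness.
TEMPERATURE_OVERRIDES = {
    "claude-3.7-sonnet": 1.0,  # reasoning mode forces temperature 1
}
NO_TEMPERATURE = {"o1", "o3-mini"}  # API rejects the temperature parameter

def sampling_params(model: str) -> dict:
    """Return the sampling kwargs to pass for a given model."""
    if model in NO_TEMPERATURE:
        return {}  # omit the parameter entirely
    # default to greedy decoding unless the model forces otherwise
    return {"temperature": TEMPERATURE_OVERRIDES.get(model, 0.0)}
```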
Highest Performing Models
DeepSeek R1
Release date : 1/20/2025
Accuracy :
92.2%
Latency :
156.5s
Cost :
$8.00 / $8.00
- DeepSeek R1 was the best model on this benchmark, outperforming the other models by a slim margin.
- The outputs were very verbose, sometimes over 20,000 characters.
o3 Mini
Release date : 1/31/2025
Accuracy :
91.8%
Latency :
14.4s
Cost :
$1.10 / $4.40
- o3 mini achieved second place on this benchmark.
- It was very cost-effective compared to other models in a similar performance range.
Results
We display the accuracy versus the price of each model.
In general, the models performed well on the MATH benchmark: the top models solved over 90% of the questions correctly. As this benchmark is public and has been out for several years, it raises the question of whether the models were pre-trained on the test data.
That being said, although Grok 2 is a newer model, it was unable to solve several of the problems, only reaching 80% accuracy (it also had a relatively high latency and cost).
Mistral Large did well for its size, reaching 74.4% accuracy with only 123 billion parameters. Gemini 2.0 Flash (001) also performed extremely well for its cost, with 3.35s latency and 89.0% accuracy while being one of the cheapest models.
DeepSeek R1 may have been the top performer, but its thinking logic was surprisingly verbose, using over 26,000 characters on one problem. In contrast, o3 mini solved it in just around 10,000 characters (roughly 2,700 tokens).
Here’s a glimpse of o3 mini’s concise approach (post-reasoning):
We are given three vertices of a parallelogram:
A = (5, 3), B = (6, 8), C = (7, 4),
and we want to find the fourth vertex D = (x, y) with the extra condition that x > 7.
A key property of a parallelogram is that its diagonals bisect each other. This means the midpoints of the two diagonals are the same.
There are several ways to assign the vertices as opposite pairs, but only one choice will yield x > 7. Let's determine D by pairing B and C as opposite vertices and, consequently, A and D as the other pair.
1. Compute the midpoint of diagonal joining B and C:
Midpoint of BC = ((6 + 7)/2, (8 + 4)/2) = (13/2, 12/2) = (6.5, 6).
2. Since A and D are the opposite vertices of the other diagonal, their midpoint must equal (6.5, 6):
Midpoint of AD = ((5 + x)/2, (3 + y)/2) = (6.5, 6).
3. Set up the equations:
(5 + x)/2 = 6.5 ⟹ 5 + x = 13 ⟹ x = 8,
(3 + y)/2 = 6 ⟹ 3 + y = 12 ⟹ y = 9.
Thus, D = (8, 9) and indeed x = 8 is greater than 7.
Finally, we find x + y:
x + y = 8 + 9 = 17.
The final answer is \boxed{17}.
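The midpoint argument above is easy to check numerically; a few lines of Python reproduce the same diagonal-bisection step:

```python
# Diagonals of a parallelogram bisect each other, so
# midpoint(B, C) == midpoint(A, D), which gives D = 2 * midpoint(B, C) - A.
A, B, C = (5, 3), (6, 8), (7, 4)

mid_bc = ((B[0] + C[0]) / 2, (B[1] + C[1]) / 2)   # (6.5, 6.0)
D = (2 * mid_bc[0] - A[0], 2 * mid_bc[1] - A[1])  # (8.0, 9.0)

print(D, D[0] + D[1])  # (8.0, 9.0) 17.0
```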
With far fewer parameters (an estimated 200 billion vs. 671 billion), o3 mini’s efficiency highlights that more parameters don’t always equate to higher-quality answers.
Example
We provide examples of a question from the dataset, as well as the outputs from each model. You will see that each model chose different paths to solve the same problem. You will also see the logical deductions made by the models between each step in their attempts.
Q
Compute \(\frac{1}{\cos^2 10^\circ} + \frac{1}{\sin^2 20^\circ} + \frac{1}{\sin^2 40^\circ}\)
A
We wish to simplify
S = 1⁄cos²10° + 1⁄sin²20° + 1⁄sin²40°.
Notice that cos 10° = sin 80° (since sin (90°–θ) = cos θ). Thus, we can rewrite the first term as
1⁄cos²10° = 1⁄sin²80°.
So the sum becomes
S = 1⁄sin²20° + 1⁄sin²40° + 1⁄sin²80°.
A useful identity is
csc²θ = 1 + cot²θ,
so we may write
S = (1 + cot²20°) + (1 + cot²40°) + (1 + cot²80°) = 3 + (cot²20° + cot²40° + cot²80°).
A numerical check (or a derivation using product–sum identities) shows that
cot²20° + cot²40° + cot²80° = 8.
(For example, one may verify using a calculator that cot 20° ≈ 2.7475 so its square is ≈7.547, cot 40° ≈ 1.1918 (square ≈1.420), and cot 80° ≈ 0.1763 (square ≈0.031) whose sum is approximately 8.)
Thus, we have
S = 3 + 8 = 11.
Therefore, the fully simplified solution is
\boxed{11}.
INCORRECT
You can see that o3-mini chose the correct path to solve the problem, but failed to evaluate the cotangent sum correctly: cot²20° + cot²40° + cot²80° is 9, not 8, so the answer should be 3 + 9 = 12.
In contrast, o1 approximated its way to the answer, showing two very different strategies from models produced by the same provider.
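A quick numeric check confirms that the cotangent sum is 9, so the correct value of the original expression is 3 + 9 = 12:

```python
import math

def cot_squared(deg: float) -> float:
    """Square of the cotangent of an angle given in degrees."""
    r = math.radians(deg)
    return (math.cos(r) / math.sin(r)) ** 2

total = cot_squared(20) + cot_squared(40) + cot_squared(80)
print(total)  # ~9.0, so S = 3 + 9 = 12, not 11
```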