Key Takeaways
- Anthropic’s models lead the leaderboard on MGSM, demonstrating superior multilingual mathematical reasoning.
- The majority of evaluated models achieve high accuracy, suggesting that this benchmark is reaching saturation.
- DeepSeek R1, OpenAI’s o3 Mini and GPT-4o, and the Llama 3.3 models also perform well.
- All models perform better on the English version of the benchmark, underscoring the impact of pre-training data distribution.
Background
The Multilingual Grade School Math Benchmark (MGSM) is an academic evaluation benchmark designed to assess the ability of language models to solve grade-school math problems in multiple languages. Derived from the well-known GSM8K dataset, which consists of 8.5K high-quality, diverse math word problems, MGSM comprises a subset of 250 problems carefully translated by human annotators into 10 typologically diverse languages (including underrepresented languages such as Bengali, Telugu, and Swahili). Model providers commonly report MGSM scores when releasing new models.
Introduced in the paper Language Models are Multilingual Chain-of-Thought Reasoners by Shi et al. (2022) and supported by subsequent research on multilingual evaluation, the dataset measures not only a model’s numerical and reasoning capabilities but also its proficiency in processing linguistic variation across different scripts and cultural contexts.
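For readers who want to examine the underlying data, the translated problems are easy to load programmatically. The sketch below is a minimal example assuming the community mirror juletxara/mgsm on the Hugging Face Hub; the dataset id and field names are assumptions on our part, as the official release distributes per-language TSV files instead.

```python
# Minimal sketch for inspecting MGSM, assuming the community Hugging Face
# mirror "juletxara/mgsm" (the official release ships per-language TSVs).
from datasets import load_dataset

for lang in ["en", "bn", "te", "sw"]:  # English plus three lower-resource languages
    ds = load_dataset("juletxara/mgsm", lang, split="test")
    print(lang, len(ds))             # each language should contain the same 250 problems
    print(ds[0]["question"][:80])    # the field name "question" is an assumption
```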
Methodology
To ensure reproducible and fair comparisons, the evaluation of models on the MGSM benchmark was conducted using the same grading scripts provided in OpenAI’s SimpleEvals GitHub repository. This approach guarantees consistency in the prompt format and testing environment across all languages.
We used the following prompt template to query each model:
Solve this math problem. Give the reasoning steps before giving the final answer on the last line by itself in the format of "Answer:". Do not add anything other than the integer answer after "Answer:".
{Question}
For non-English evaluations, the prompt was accurately translated into the target language while preserving the original structure. All models were tested with a temperature setting of 0.
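As a concrete illustration, a minimal version of this protocol might look like the sketch below. It loosely mirrors the SimpleEvals grading approach of extracting the integer after "Answer:" and comparing it to the reference answer. Note that query_model is a hypothetical stand-in for whichever provider API client is under test, and the official scripts are more thorough (they handle per-language answer markers and number formatting).

```python
import re

# The English prompt template quoted above; non-English runs use a
# translated instruction with the same structure.
PROMPT_TEMPLATE = (
    'Solve this math problem. Give the reasoning steps before giving the '
    'final answer on the last line by itself in the format of "Answer:". '
    'Do not add anything other than the integer answer after "Answer:".'
    "\n\n{question}"
)

def extract_answer(completion: str) -> str | None:
    """Return the integer following the last "Answer:" marker, if any."""
    matches = re.findall(r"Answer:\s*(-?\d[\d,]*)", completion)
    return matches[-1].replace(",", "") if matches else None

def score(query_model, problems) -> float:
    """Fraction of (question, target) pairs answered correctly.

    `query_model` is a hypothetical callable wrapping whichever provider
    API is under test, invoked with temperature 0 as described above.
    """
    correct = 0
    for question, target in problems:
        completion = query_model(PROMPT_TEMPLATE.format(question=question),
                                 temperature=0)
        correct += extract_answer(completion) == str(target)
    return correct / len(problems)
```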
Highest-Performing Models
Claude 3.7 Sonnet (Thinking)
Release date: 2/24/2025
Accuracy: 92.8%
Latency: 21.8s
Cost: $3.00 / $15.00 per 1M input/output tokens
- Exhibited robust performance across all languages, including less-resourced ones such as Telugu.
- Marginally outperformed its non-thinking variant by 0.3 percentage points, suggesting that extended thinking adds little on a benchmark this close to saturation.
Results
The evaluation of MGSM reveals a highly competitive landscape:
- The top open-source model was DeepSeek R1 with an accuracy of 92.4%, closely followed by Llama 3.3 Instruct Turbo (70B).
- Among proprietary models, the Claude 3.7 Sonnet (Thinking) variant emerged as the best across all languages.
- OpenAI models performed consistently around the 90% accuracy mark; the strongest was o3 Mini at 91.6%, with an average latency of 13.17 seconds per question.
- Despite high overall scores, there is a noticeable drop on non-English languages. Models fared worst in Bengali, where the best result was 90.4% accuracy.
These results indicate that while state-of-the-art models excel in mathematical reasoning, language-specific performance discrepancies persist, likely due to the imbalance in training data across languages. For users seeking a model with strong multilingual mathematical capabilities, the benchmark provides a range of well-rounded options.
The high scores also suggest that models are approaching saturation on this benchmark, and that it may soon be unable to distinguish between models’ multilingual and mathematical reasoning capabilities.