Key Takeaways
- Anthropic’s models lead the leaderboard on MGSM, demonstrating superior multilingual mathematical reasoning.
- The majority of evaluated models achieve high accuracy, suggesting that this benchmark is reaching saturation.
- DeepSeek R1, OpenAI’s o3 Mini and GPT-4o, and the Llama 3.3 models also perform well.
- All models perform better on the English version of the benchmark, underscoring the impact of pre-training data distribution.
Background
The Multilingual Grade School Math Benchmark (MGSM) is an academic evaluation benchmark designed to assess the ability of language models to solve grade-school math problems in multiple languages. Derived from the well-known GSM8K dataset, which consists of 8.5K high-quality, diverse math word problems, MGSM comprises a subset of 250 problems carefully translated by human annotators into 10 typologically diverse languages (including underrepresented languages such as Bengali, Telugu, and Swahili). Model providers commonly report MGSM scores when releasing new models.
Introduced in the paper Language Models are Multilingual Chain-of-Thought Reasoners by Shi et al. (2022) and supported by subsequent research on multilingual evaluation, the dataset measures not only a model’s numerical and reasoning capabilities but also its proficiency in processing linguistic variation across different scripts and cultural contexts.
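For readers who want to examine the underlying data, the translated problems are easy to load programmatically. The sketch below is a minimal example assuming the community mirror juletxara/mgsm on the Hugging Face Hub; the dataset id and field names are assumptions on our part, as the official release distributes per-language TSV files instead.

```python
# Minimal sketch for inspecting MGSM, assuming the community Hugging Face
# mirror "juletxara/mgsm" (the official release ships per-language TSVs).
from datasets import load_dataset

for lang in ["en", "bn", "te", "sw"]:  # English plus three lower-resource languages
    ds = load_dataset("juletxara/mgsm", lang, split="test")
    print(lang, len(ds))             # each language should contain the same 250 problems
    print(ds[0]["question"][:80])    # the field name "question" is an assumption
```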
Methodology
To ensure reproducible and fair comparisons, the evaluation of models on the MGSM benchmark was conducted using the same grading scripts provided in OpenAI’s SimpleEvals GitHub repository. This approach guarantees consistency in the prompt format and testing environment across all languages.
We used the following prompt template to query each model:
Solve this math problem. Give the reasoning steps before giving the final answer on the last line by itself in the format of "Answer:". Do not add anything other than the integer answer after "Answer:".
{Question}
For non-English evaluations, the prompt was accurately translated into the target language while preserving the original structure. All models were tested with a temperature setting of 0.
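As a concrete illustration, a minimal version of this protocol might look like the sketch below. It loosely mirrors the SimpleEvals grading approach of extracting the integer after "Answer:" and comparing it to the reference answer. Note that query_model is a hypothetical stand-in for whichever provider API client is under test, and the official scripts are more thorough (they handle per-language answer markers and number formatting).

```python
import re

# The English prompt template quoted above; non-English runs use a
# translated instruction with the same structure.
PROMPT_TEMPLATE = (
    'Solve this math problem. Give the reasoning steps before giving the '
    'final answer on the last line by itself in the format of "Answer:". '
    'Do not add anything other than the integer answer after "Answer:".'
    "\n\n{question}"
)

def extract_answer(completion: str) -> str | None:
    """Return the integer following the last "Answer:" marker, if any."""
    matches = re.findall(r"Answer:\s*(-?\d[\d,]*)", completion)
    return matches[-1].replace(",", "") if matches else None

def score(query_model, problems) -> float:
    """Fraction of (question, target) pairs answered correctly.

    `query_model` is a hypothetical callable wrapping whichever provider
    API is under test, invoked with temperature 0 as described above.
    """
    correct = 0
    for question, target in problems:
        completion = query_model(PROMPT_TEMPLATE.format(question=question),
                                 temperature=0)
        correct += extract_answer(completion) == str(target)
    return correct / len(problems)
```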
Highest-Performing Models
Claude 3.7 Sonnet (Thinking)
Release date: 2/24/2025
Accuracy: 92.8%
Latency: 21.8s
Cost: $3.00 / $15.00 per 1M input/output tokens
- Exhibited robust performance across all languages, including less-resourced ones such as Telugu.
- Marginally outperformed its non-thinking variant by 0.3 percentage points, suggesting that extended thinking adds little on a benchmark this close to saturation.
Results
The evaluation of MGSM reveals a highly competitive landscape:
- The top open-source model was DeepSeek R1 with an accuracy of 92.4%, closely followed by Llama 3.3 Instruct Turbo (70B).
- Among proprietary models, the Claude 3.7 Sonnet (Thinking) variant emerged as the best across all languages.
- OpenAI models performed consistently around the 90% accuracy mark; the strongest was o3 Mini at 91.6%, with an average latency of 13.17 seconds per question.
- Despite high overall scores, there is a noticeable drop on non-English languages. Models fared worst in Bengali, where the best result was 90.4% accuracy.
These results indicate that while state-of-the-art models excel in mathematical reasoning, language-specific performance discrepancies persist, likely due to the imbalance in training data across languages. For users seeking a model with strong multilingual mathematical capabilities, the benchmark provides a range of well-rounded options.
The high scores also suggest that models are approaching saturation on this benchmark, and that it may soon be unable to distinguish between models’ multilingual and mathematical reasoning capabilities.