Key Takeaways
- o1 leads the benchmark, with an average accuracy of 83.5%.
- Claude 3.7 Sonnet (Thinking) is in second place, with an average accuracy of 82.7% across all tasks.
- The cost-effective Gemini 2.0 Flash delivers 77.4% average accuracy at $0.10/$0.40 per 1M tokens, the best price-performance ratio, with a fast 4.32s latency.
Context
MMLU is a commonly used academic evaluation that tests models on multiple-choice questions in subjects such as physics and chemistry. MMLU Pro is an improved version of MMLU that focuses on data quality and diversity. It consists of over 12,000 question-answer pairs, created through:
- Filtering and enhancing classic MMLU questions
- Expanding answer options from 4 to 10 choices
- Incorporating new sources (STEM Website, TheoremQA, SciBench)
- Expert verification of questions
The benchmark evaluates models on their ability to:
- Demonstrate deep subject matter expertise
- Apply complex reasoning to challenging problems
- Show consistent performance across varied domains
Highest Quality Models
o1
Release date: 12/17/2024
Accuracy: 83.5%
Latency: 26.9s
Cost: $15.00 / $60.00
- OpenAI's o1 still leads this benchmark, with the highest accuracy of any model we tested.
- Given the size of the dataset, o1 is quite expensive to run, costing far more than any other model we tested.
Claude 3.7 Sonnet (Thinking)
Release date: 2/24/2025
Accuracy: 82.7%
Latency: 31.7s
Cost: $3.00 / $15.00
- Claude 3.7 Sonnet (Thinking) achieves the second-highest accuracy of 82.7% on MMLU Pro.
- It demonstrates strong performance across all subject areas while being more cost-effective than o1.
- The model shows particularly strong results on STEM subjects, with accuracies above 85% in physics, chemistry and biology.
The results per subject are summarized in the graph below.
Additional Notes
Methodology
All reported results use 5-shot Chain-of-Thought (CoT) prompting: each prompt includes 5 worked examples from the question's category and encourages the model to think step by step. Few-shot CoT prompting is the approach used in the original paper. Here is the exact prompt:
The following are multiple-choice questions (with answers) about biology. Think step by step and then finish your answer with "The answer is (X)" where X is the correct letter choice.
Question: Which of the following represents an accurate statement concerning arthropods?
Options: A. They possess an exoskeleton composed primarily of peptidoglycan., B. They possess an open circulatory system with a dorsal heart., C. They are members of a biologically unsuccessful phylum incapable of exploiting diverse habitats and nutrition sources., D. They lack paired, jointed appendages.
Answer: Let's think step by step. Peptidoglycan is known to comprise the plasma membrane of most bacteria, rather than the exoskeleton of arthropods, which is made of chitin, which rules out (A). The answer (C) is false because arthropods are a highly successful phylum. Likewise, arthropods have paired, jointed appendages, which rules out (D). The only remaining option is (B), as arthropods have an open circulatory system with a dorsal tubular heart. The answer is (B).
Question: In a given population, 1 out of every 400 people has a cancer caused by a completely recessive allele, b. Assuming the population is in Hardy-Weinberg equilibrium, which of the following is the expected proportion of individuals who carry the b allele but are not expected to develop the cancer?
Options: A. 19/400, B. 1/400, C. 40/400, D. 38/400, E. 2/400, F. 1/200, G. 20/400, H. 50/400
Answer: Let's think step by step. According to the Hardy Weinberg Law, $p^2 + 2 p q + q^2 = 1$, and $p + q = 1$ where $p$ is the frequency of the dominant allele, $q$ is the frequency of the recessive allele, and $p^2$, $q^2$, and $2pq$ are the frequencies of dominant homozygous, recessive homozygous, and heterozygous individuals, respectively. The frequency of the recessive allele (q) is $\sqrt{\frac{1}{400}} = 0.05$. We have $p = 1 - q = 0.95$. The frequency of heterozygous individuals is $2pq = 2 \cdot 0.05 \cdot 0.95 = 0.095$. The number of heterozygous individuals is equal to the frequency of heterozygous individuals times the size of the population, or $0.095 * 400 = 38$. So we end up with 38/400. The answer is (D).
Question: A mutation in a bacterial enzyme changed a previously polar amino acid into a nonpolar amino acid. This amino acid was located at a site distant from the enzyme’s active site. How might this mutation alter the enzyme’s substrate specificity?
Options: A. By changing the enzyme’s pH optimum, B. By changing the enzyme's molecular weight, C. An amino acid change away from the active site increases the enzyme's substrate specificity., D. By changing the shape of the protein, E. By changing the enzyme's temperature optimum, F. By altering the enzyme's ability to be denatured, G. By changing the enzyme’s location in the cell, H. By changing the enzyme's color, I. An amino acid change away from the active site cannot alter the enzyme’s substrate specificity., J. By altering the enzyme's rate of reaction
Answer: Let's think step by step. A change in an amino acid leads to a change in the primary structure of the protein. A change in the primary structure may lead to a change in the secondary and the tertiary structure of the protein. A change in the tertiary structure means a change in the shape of the protein, so (C) has to be correct. Since the change does not affect the active site of the enzyme, we do not expect the activity of the enzyme to be affected. The answer is (D).
Question: Which of the following is not a way to form recombinant DNA?
Options: A. Translation, B. Conjugation, C. Specialized transduction, D. Transformation
Answer: Let's think step by step. The introduction of foreign DNA or RNA into bacteria or eukaryotic cells is a common technique in molecular biology and scientific research. There are multiple ways foreign DNA can be introduced into cells including transformation, transduction, conjugation, and transfection. In contrast, (A) is not a way to form DNA: during translation the ribosomes synthesize proteins from RNA. The answer is (A).
Question: Which of the following is not known to be involved in the control of cell division?
Options: A. Microtubules, B. Checkpoints, C. DNA polymerase, D. Centrosomes, E. Cyclins, F. Mitochondria, G. Protein kinases, H. Fibroblast cells
Answer: Let's think step by step. Normal cells move through the cell cycle in a regulated way. At the checkpoint stage, they use information about their own internal state and cues from the environment around them to decide whether to proceed with cell division. Cues like these act by changing the activity of core cell cycle regulators inside the cell. The most common regulators are cyclins and cyclin-dependent kinases. Fibroblast cells do not play any role in cell division. The answer is (H).
Which of the following would most likely provide examples of mitotic cell divisions?
A - cross section of muscle tissue
B - longitudinal section of a shoot tip
C - longitudinal section of a leaf vein
D - cross section of a fruit
E - cross section of a leaf
F - longitudinal section of a petal
G - longitudinal section of a seed
H - cross section of an anther (site of pollen production in a flower)
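The structure of the prompt above (an instruction header, five worked examples, then the unanswered test question) can be assembled programmatically. The following is a minimal sketch, not the harness used for these results: the `Example` class and the `format_options` and `build_prompt` helpers are hypothetical names, and the sketch assumes the test question is laid out the same way as the worked examples.

```python
from dataclasses import dataclass

LETTERS = "ABCDEFGHIJ"  # MMLU Pro questions have up to 10 options


@dataclass
class Example:
    """Hypothetical container for one question."""
    question: str
    options: list[str]
    cot_answer: str = ""  # worked solution; left empty for the test question


def format_options(options: list[str]) -> str:
    # Produces "A. first, B. second, ..." as in the worked examples.
    return ", ".join(f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options))


def build_prompt(subject: str, shots: list[Example], target: Example) -> str:
    header = (
        f"The following are multiple-choice questions (with answers) about {subject}. "
        'Think step by step and then finish your answer with "The answer is (X)" '
        "where X is the correct letter choice."
    )
    parts = [header]
    for ex in shots:
        parts.append(
            f"Question: {ex.question}\n"
            f"Options: {format_options(ex.options)}\n"
            f"Answer: Let's think step by step. {ex.cot_answer}"
        )
    # Assumption: the test question reuses the few-shot layout and ends with
    # the CoT cue so the model continues with its own reasoning.
    parts.append(
        f"Question: {target.question}\n"
        f"Options: {format_options(target.options)}\n"
        "Answer: Let's think step by step."
    )
    return "\n".join(parts)
```

Calling `build_prompt("biology", shots, target)` with five `Example` objects in `shots` would yield a 5-shot prompt for a biology question.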
To grade the answers for correctness, we use the following regex, consistent with the original paper:
(?:answer is \(?(B)\)?)|(?:[Aa]nswer:\s*(B))
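The pattern above is shown for a question whose correct letter is B. As a rough sketch of how it might be applied to any question, the snippet below rebuilds the same pattern around the correct letter and checks a model response against it. The `make_grader` and `is_correct` helpers are hypothetical names, not part of the original evaluation code.

```python
import re


def make_grader(correct_letter: str) -> re.Pattern:
    # Same pattern as above, with the correct letter substituted for "B".
    letter = re.escape(correct_letter)
    return re.compile(
        rf"(?:answer is \(?({letter})\)?)|(?:[Aa]nswer:\s*({letter}))"
    )


def is_correct(response: str, correct_letter: str) -> bool:
    # A response counts as correct if the pattern matches anywhere in it.
    return make_grader(correct_letter).search(response) is not None


if __name__ == "__main__":
    print(is_correct("So only (B) remains. The answer is (B).", "B"))
    print(is_correct("The answer is (C).", "B"))
```

Because the pattern is searched anywhere in the response, a response that mentions the correct letter in this form and then settles on a different one would still be counted as correct under this sketch.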
The overall score was calculated by averaging the 14 task-specific scores.
In total, the benchmark used ~15M input tokens per model.
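To give a sense of scale, the input-side cost of one run can be estimated from the ~15M input tokens and the listed prices. This is a back-of-the-envelope sketch that assumes the first figure in each price pair is the input price per 1M tokens; output costs are left out because total output token counts are not reported here.

```python
INPUT_TOKENS = 15_000_000  # approximate input tokens per model, from the text

# Assumption: first number of each listed price pair = USD per 1M input tokens.
input_price_per_million = {
    "o1": 15.00,
    "Claude 3.7 Sonnet (Thinking)": 3.00,
    "Gemini 2.0 Flash": 0.10,
}


def input_cost(price_per_million: float, tokens: int = INPUT_TOKENS) -> float:
    """Estimated input cost in USD for one full benchmark run."""
    return price_per_million * tokens / 1_000_000


costs = {model: input_cost(price) for model, price in input_price_per_million.items()}

if __name__ == "__main__":
    for model, cost in costs.items():
        print(f"{model}: ~${cost:,.2f} for input tokens")
```

Under these assumptions, the input tokens alone come to roughly $225 for o1, compared with about $45 for Claude 3.7 Sonnet (Thinking) and about $1.50 for Gemini 2.0 Flash.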
We double-checked that all the reasoning models had enough reasoning tokens to complete the task. For example, o1 was given 16,384 max output tokens but averaged only around 3,000 output tokens, including the reasoning.
Maximum output tokens per model:
- Default configuration: 8,192 tokens
- GPT-4o (2024-11-20), o1 (2024-12-17), o3-mini (2025-01-31): 16,384 tokens
- Claude 3.5 Sonnet, Claude 3.7 Sonnet: 8,192 tokens
- GPT-4o (2024-05-13), Llama 3.3 Instruct Turbo: 4,096 tokens
- Command R+: 4,000 tokens