GPQA Benchmark

Key Takeaways

  • The top overall performer, Claude 3.7 Sonnet (Thinking), achieved an average accuracy of 75.3% across the few-shot and zero-shot tasks.
  • Runner-up o3 Mini reached 75.0% average accuracy, with similar results on few-shot (74.2%) and zero-shot (75.8%).
  • Few-shot prompting generally improved performance across most models, with an average improvement of ~2-5 percentage points over zero-shot prompting.
  • In general, the reasoning models performed very strongly on this benchmark, likely because of the complex and multi-step nature of the questions.
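The average figures quoted above can be reproduced with simple arithmetic. The sketch below assumes the average is the unweighted mean of the few-shot and zero-shot accuracies; the function name is illustrative and not taken from the benchmark code.

```python
def average_accuracy(few_shot: float, zero_shot: float) -> float:
    """Unweighted mean of the two per-setting accuracies, in percent.

    Assumption: both settings are weighted equally, since both are
    evaluated on the same set of questions.
    """
    return round((few_shot + zero_shot) / 2, 1)


# o3 Mini figures quoted in the takeaways above.
o3_mini_average = average_accuracy(few_shot=74.2, zero_shot=75.8)
print(o3_mini_average)
```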

Context

The Graduate-Level Google-Proof Q&A (GPQA) benchmark (paper) is a public academic benchmark commonly used to measure models’ general question-answering performance. It evaluates language models on challenging, graduate-level questions across STEM fields. The questions were specifically designed to be “Google-proof”: they require deep understanding and reasoning rather than fact recall or search.

As in the original paper, this benchmark uses two evaluation approaches:

  • Zero-shot chain-of-thought: Models are asked to solve the problems with an instruction to explain their reasoning steps.
  • Few-shot chain-of-thought: In addition to the instructions, models are provided with 5 example questions and answers.

We focus on the “diamond” subset of 198 questions. These are questions that expert validators answered correctly and that no more than one out of three non-experts answered correctly. They are challenging, yet have unambiguous answers.
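The selection rule described above can be sketched as a simple filter. This is only an illustration of the criterion as stated here: the record type and field names (`experts_correct`, `non_experts_correct`) are hypothetical and do not reflect the actual schema of the GPQA dataset.

```python
from dataclasses import dataclass


@dataclass
class ValidatedQuestion:
    question_id: str
    experts_correct: bool     # hypothetical: expert validators answered correctly
    non_experts_correct: int  # hypothetical: how many of 3 non-experts were correct


def is_diamond(q: ValidatedQuestion) -> bool:
    """Experts correct, and at most one of three non-experts correct."""
    return q.experts_correct and q.non_experts_correct <= 1


def diamond_subset(questions: list) -> list:
    """Keep only the questions that meet the diamond criterion."""
    return [q for q in questions if is_diamond(q)]


# Made-up records, for illustration only.
sample = [
    ValidatedQuestion("q1", experts_correct=True, non_experts_correct=0),
    ValidatedQuestion("q2", experts_correct=True, non_experts_correct=2),
    ValidatedQuestion("q3", experts_correct=False, non_experts_correct=1),
]
print([q.question_id for q in diamond_subset(sample)])
```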


Highest Quality Models

Claude 3.7 Sonnet

Release date: 2/24/2025
Accuracy: 67.4%
Latency: 9.2s
Cost: $3.00 / $15.00

  • Claude 3.7 Sonnet (Thinking) leads with 75.3% average accuracy across the few-shot and zero-shot tasks.
  • It is significantly cheaper than OpenAI's o1 (third place), although not as cheap as o3 Mini (second place).


The results per question type are summarized in the graph below.

The accuracy-cost scatter plot shows that o3 Mini is a standout performer, achieving near-top accuracy at an extremely low cost. Although o1 also achieves high accuracy, it does so at a much higher price.


Additional Notes

Methodology

We used only the diamond subset (198 questions) as the evaluation set: these were the questions that were most challenging for non-experts but had the most agreement among experts. We replicated the original paper’s prompting and response-parsing techniques as closely as possible. Finally, GPT-4o (2024-11-20) was configured with a “no markdown formatting” system prompt to ensure compatibility with the paper’s answer extraction regex.

The prompts used were similar to those in the original GPQA paper, although for zero-shot CoT we prompt the model only once, rather than asking the question across two separate queries. Here are examples of the prompts used for both methods.

Zero-shot CoT Prompt:

What is the correct answer to this question:

Two quantum states with energies E1 and E2 have a lifetime of 10^-9 sec and 10^-8 sec, respectively. We want to clearly distinguish these two energy levels. Which one of the following options could be their energy difference so that they can be clearly resolved?

Choices:
(A) 10^-8 eV
(B) 10^-9 eV
(C) 10^-4 eV
(D) 10^-11 eV

Reason through your answer step-by-step. Then, based on your reasoning, provide the single most likely answer choice. Answer in the format "The correct answer is (insert answer here)."
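A prompt of this shape could be assembled programmatically along the following lines. This is a minimal sketch, not the benchmark's actual code: the function name and the exact whitespace between sections are assumptions, while the wording is copied from the example above.

```python
ZERO_SHOT_INSTRUCTION = (
    "Reason through your answer step-by-step. Then, based on your reasoning, "
    "provide the single most likely answer choice. Answer in the format "
    '"The correct answer is (insert answer here)."'
)


def build_zero_shot_prompt(question: str, choices: list) -> str:
    """Assemble a single-query zero-shot CoT prompt for a 4-choice question."""
    # Label the choices (A) through (D) in the order given.
    choice_lines = "\n".join(
        f"({letter}) {text}" for letter, text in zip("ABCD", choices)
    )
    return (
        "What is the correct answer to this question:\n\n"
        f"{question}\n\n"
        f"Choices:\n{choice_lines}\n\n"
        f"{ZERO_SHOT_INSTRUCTION}"
    )


prompt = build_zero_shot_prompt(
    "Which one of the following options could be their energy difference?",
    ["10^-8 eV", "10^-9 eV", "10^-4 eV", "10^-11 eV"],
)
print(prompt)
```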

Few-shot CoT Prompt:

Here are some example questions from experts. An explanation is given before the final answer.
Answer the final question yourself, giving your reasoning beforehand.

Question: In a given population, 1 out of every 400 people has a cancer caused by a completely recessive allele, b. Assuming the population is in Hardy-Weinberg equilibrium, which of the following is the expected proportion of individuals who carry the b allele but are not expected to develop the cancer?
(A) 1/400
(B) 19/400
(C) 20/400
(D) 38/400
Let's think step by step: The expected proportion of individuals who carry the b allele but are not expected to develop the cancer equals to the frequency of heterozygous allele in the given population. According to the Hardy-Weinberg equation p^2 + 2pq + q^2 = 1, where p is the frequency of dominant allele frequency, q is the frequency of recessive allele frequency, p^2 is the frequency of the homozygous dominant allele, q^2 is the frequency of the recessive allele, and 2pq is the frequency of the heterozygous allele. Given that q^2 = 1/400, hence, q = 0.05 and p = 1-q = 0.95. The frequency of the heterozygous allele is 2pq = 2*0.05*0.95 = 38/400.
The correct answer is (D)

Question: A Fe pellet of 0.056g is first dissolved in 10mL of hydrobromic acid HBr (0.1M). The resulting solution is then titrated by KMnO4 (0.02M). How many equivalence points are there?
(A) Two points, 25ml and 35ml
(B) One point, 25mL
(C) One point, 10ml
(D) Two points, 25ml and 30ml
Let's think step by step: HBr reacts with Fe to produce Fe2+. MnO4- initially reacts with Fe2+ followed by Br-. There are two equivalence points at 25ml and 35ml. In the beaker, the present species are Fe2+ and Br-. In a titration involving two analytes, it's essential to identify which reaction occurs first. Given the redox nature of the titration and the reduction potentials: E0 (Br2/Br-) = 1.09V, E0 (MnO4-/Mn2+) = 1.49V, and E0 (Fe3+/Fe2+) = 0.77V. With [Fe2+] determined as 0.1M, two reactions are considered. Reaction 1: MnO4- reacts with 5Fe2+ and 8H+ to produce Mn2+, 5Fe3+, and 4H2O. Reaction 2: 2MnO4- reacts with 10Br- and 16H+ to produce 2Mn2+ and 5Br2 with 8H2O as a byproduct. MnO4- first reacts with Fe2+ in a 1:5 ratio, making the first equivalence point at 25ml. Once Fe2+ is exhausted, MnO4- reacts with Br- in a 2:10 ratio, adding another 10ml for a total second equivalence point at 35ml.
The correct answer is (A)

Question: Consider a quantum mechanical system containing a particle of mass m moving in an isotropic three dimensional potential of the form V(r) = 1/2mω^2r^2 corresponding to the acted force obeying Hooke's law. Here, ω is the angular frequency of oscillation and r is the radial distance of the particle from the origin in spherical polar coordinate. What is the value of energy of the third excited state, and how many linearly independent eigenfunctions are possible for the same energy eigenvalue?
(A) 11 π^2ℏ^2/(2mr^2), 3
(B) (9/2) ℏω, 10
(C) 11 π^2ℏ^2/(2mr^2), 10
(D) (9/2) ℏω, 3
Let's think step by step: This problem is nothing but the three dimensional simple harmonic oscillator (SHO) problem. The energy spectrum of three dimensional SHO is En = (n+3/2)ℏω where n = 0,1,2,3..... For third excited state n=3. 3+3/2=6/2+3/2=9/2. Thus the corresponding energy is (9/2)ℏω. The degeneracy of the state is gn = (n+1)(n+2)/2. For n=3, degeneracy is (3+1)*(3+2)/2=4*5/2=10.
The correct answer is (B)

Question: Your overhear two chemists talking to each other as they leave a synthetic organic
chemistry lab. One asks the other "So, how did it go?" The second chemist replies, "Not well -
my compounds are on top of each other." What is the second chemist most likely referring to?
Choices:
(A) The compounds they are working with have similar polarities.
(B) The compounds they are working with have similar boiling points.
(C) The compounds they are working with are bonding to each other through non-covalent/van
der Waals interactions.
(D) The compounds they are working with have similar optical rotations.
Let's think step by step:
"On top of each other" commonly refers to two compounds that have similar Rf values on
chromatography (a common operation in synthetic chemistry). Similar Rf values arise for
compounds with similar polarities.
The correct answer is (A)

Question: Mitochondria are semi-autonomous cellular organelles in charge of energy production.
They encode for a part of their own translational machinery and respiratory complexes. Mito-
chondrial function is governed by over a thousand proteins imported from the cell, contributing
to processes like the transport of proteins, ribosome biogenesis and translation regulation,
respiratory oxidation, metabolism, and apoptotic signaling cascade. Mutations in the code
for mitochondrial protein networks can cause numerous diseases in humans that are inherited
through generations. Mutations of which of the mitochondrial proteins listed below are least
likely to be genetically transmitted from a father to his children?
Choices:
(A) Translocase of inner mitochondrial membrane 17B
(B) ATP binding cassette subfamily B member 8
(C) NADH dehydrogenase 2
(D) Tu translation elongation factor, mitochondrial
Let's think step by step: The colleague should know that mitochondria from fathers are rarely if
ever, transmitted to their offspring. Therefore, the protein encoded by the paternal mitochondrial
genome will most likely not be passed down the generation. NADH dehydrogenase 2 is the only
one encoded by the mitochondrial genome from the MT-ND2 gene among the listed proteins.
Leigh's syndrome, lactic acidosis, and metabolic diseases are all linked to a mutation in the ND2
gene. ATP binding cassette subfamily B member 8 (ABCB8) is a chromosome 7 encoded gene;
Tu translation elongation factor, mitochondrial is chromosome 16 gene TUFM. Translocase
of inner mitochondrial membrane 17B is chromosome X coded gene TIMM17B. There is no
evidence that it is maternally imprinted; hence, daughters may inherit the father's gene copy in a
50:50 ratio.
The correct answer is (C)


Two quantum states with energies E1 and E2 have a lifetime of 10^-9 sec and 10^-8 sec, respectively. We want to clearly distinguish these two energy levels. Which one of the following options could be their energy difference so that they can be clearly resolved?

Choices:
(A) 10^-8 eV
(B) 10^-11 eV
(C) 10^-9 eV
(D) 10^-4 eV

Give step by step reasoning before you answer, and when you're ready to answer, please use the format "The correct answer is (insert answer here)".
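The few-shot prompt follows a regular pattern: a header, the worked examples, the target question, and a closing instruction. The sketch below shows one way to assemble it. The `Example` type and function names are hypothetical, and the layout is simplified (for instance, the examples above are not uniform about including a "Choices:" line, which this sketch omits for examples).

```python
from dataclasses import dataclass

FEW_SHOT_HEADER = (
    "Here are some example questions from experts. "
    "An explanation is given before the final answer.\n"
    "Answer the final question yourself, giving your reasoning beforehand."
)
FEW_SHOT_FOOTER = (
    "Give step by step reasoning before you answer, and when you're ready to "
    'answer, please use the format "The correct answer is (insert answer here)".'
)


@dataclass
class Example:  # hypothetical container for one worked example
    question: str
    choices: list
    explanation: str
    answer: str  # one of "A", "B", "C", "D"


def format_choices(choices: list) -> str:
    return "\n".join(f"({letter}) {text}" for letter, text in zip("ABCD", choices))


def build_few_shot_prompt(examples: list, question: str, choices: list) -> str:
    """Header, worked examples, target question, then the closing instruction."""
    blocks = [FEW_SHOT_HEADER]
    for ex in examples:
        blocks.append(
            f"Question: {ex.question}\n{format_choices(ex.choices)}\n"
            f"Let's think step by step: {ex.explanation}\n"
            f"The correct answer is ({ex.answer})"
        )
    blocks.append(f"{question}\n\nChoices:\n{format_choices(choices)}")
    blocks.append(FEW_SHOT_FOOTER)
    return "\n\n".join(blocks)
```

In the setup described here, `examples` would hold the 5 worked examples shown above.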

We use the same answer regex as the original paper to check for correctness:

answer is \((C)\)|Answer: \((C)\)|answer: \((C)\)|answer \((C)\)
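As a rough illustration of how such a pattern can be applied, the sketch below generalizes it to capture any letter from A to D and compares the result against the expected choice. The generalization to `[A-D]` and the helper names are assumptions, not the paper's or this benchmark's exact code.

```python
import re

# Assumption: the "(C)" in the pattern above stands for the answer letter,
# so each alternative captures any single letter from A to D.
ANSWER_PATTERN = re.compile(
    r"answer is \(([A-D])\)|Answer: \(([A-D])\)|answer: \(([A-D])\)|answer \(([A-D])\)"
)


def extract_answer(response: str):
    """Return the first answer letter found in the response, or None."""
    match = ANSWER_PATTERN.search(response)
    if match is None:
        return None
    # Exactly one alternative matched, so exactly one group is non-None.
    return next(group for group in match.groups() if group is not None)


def is_correct(response: str, correct_letter: str) -> bool:
    return extract_answer(response) == correct_letter


print(extract_answer("Step 1 ... The correct answer is (C)."))
print(extract_answer("The correct answer is **(C)**"))
```

The second call finds no match because the markdown asterisks sit between "is" and the parenthesis. This is the kind of mismatch a “no markdown formatting” system prompt helps avoid.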