ProofBench

Key Takeaways

Harmonic’s Aristotle achieved the best overall performance at 71%, while GPT 5.4 is the top foundation model at 56%. Claude Opus 4.7 is the second-best foundation model at 54%.
Foundation models are improving on formal math, but there is still a large gap to specialized systems on hard Lean tasks (15 points between Aristotle and GPT 5.4).

Background

Large language models now routinely achieve impressive scores on mathematical contests such as the AIME, IMO, and Putnam. These results suggest substantial progress in mathematical reasoning— however, they come with an important caveat. Natural language solutions can mask subtle logical errors, unstated assumptions, or invalid steps that are difficult and time-consuming for humans to audit.

To address this gap, ProofBench evaluates whether models can translate mathematical reasoning into machine-checkable proofs. We use Lean 4, a proof assistant and programming language designed for formal mathematics, where every logical inference must be formally verified (i.e., be machine-checkable). Since a solution here either compiles or it does not— there is no partial credit for plausible reasoning- this makes for a more objective evaluation of mathematical reasoning skills.

In addition, mathematicians are increasingly using Lean 4 to formalize their work. Therefore, models that perform well on this benchmark will be valuable in assisting with mathematical research in the future.

Success on this benchmark therefore tests whether models can maintain logical correctness across chains of reasoning, and whether they can be effective with modern formal mathematics. Strong performance would indicate genuine progress toward AI systems that can assist mathematicians with formalization, verification, and large-scale proof development.

Results

To visualize the performance differences among models, we provide scatter plots illustrating accuracy versus latency and cost.

ProofBench

Benchmark Overview

Each benchmark task pairs a natural-language mathematical problem with its corresponding Lean 4 formal statement. Given both, a model must construct a formal proof that is accepted by the Lean proof checker. The benchmark measures proof success directly, providing a clear and unforgiving signal of correctness. Problems are drawn from advanced undergraduate and graduate-level sources, including qualifying exams and standard textbooks. They span a broad range of modern mathematics:

Probability theory and stochastic analysis
Measure theory
Real and functional analysis
Algebra and commutative algebra
Algebraic geometry
Number theory
Set theory, logic, and model theory

By focusing on higher levels of abstraction rather than high-school contest problems (e.g. AIME), ProofBench provides a complementary signal to existing math benchmarks and better reflects the challenges faced in real mathematical research.

Tool Use Analysis

Within the ProofBench harness, models have access to tools organized into the following bundles:

Bundle	Tools
Search	lean_loogle (Mathlib search)
Code Execution	lean_run_code (Lean code execution + feedback)
Submission	submit_proof (final proof submission)

Models may iteratively search the Mathlib library and test partial proofs or code in a Lean 4 environment before submitting a final solution. The submission tool is invoked exactly once per problem. The visualization below summarizes how different models allocate effort between search and code execution during problem solving.

ProofBench Tool Usage Count (Search vs Code Execution)

One observation in tool-use patterns is that models that tend to focus more on executing code generally seem to perform better. For instance, we observed that Claude Opus 4.5 (Thinking) quickly starts executing code and getting feedback, using the search tool more as a method to find lemmas it recognizes it needs in the middle of constructing its solution. A common failure mode we observed is models spiraling into a long series of search tool calls, attempting to find a theorem or notion that doesn’t exist, but continuing to search for variations of it in an attempt to find it.

Methodology

The benchmark consists of two splits: a public split and a private test split, each containing 100 problems. For every problem, we provide:

A natural-language statement of the theorem (no informal proof attached; models must generate the proof)
A carefully vetted Lean 4 formalization of the statement

Models are allowed up to 40 interaction turns per problem, enabling iterative development with tool feedback. No access to the informal proof is provided.

Note on evaluating Aristotle: To evaluate Aristotle, a formal-math system from Harmonic AI, we fed in the full Lean 4 statement file, with the natural language statement included as a comment at the top of the file. We emphasize that Aristotle uses its own internal system/harness, and not the evaluation harness we provide to other models. Additionally, the time reported indicates only solve time, and does not include queue time.

Data Collection Process and Examples

Prior benchmarks have faced numerous challenges with ensuring that the formalizations of the statements were correct. Statements may be misformalized in a number of ways, particularly if autoformalized by models. To circumvent these issues, problems were selected and formalized by a team of Lean 4 experts, including PhD students and experts in their respective domains. Statements and formalizations underwent a secondary review process before being included in the benchmark. We also collected natural language proofs for the problems, and while the main scores reported on this page do not give the model(s) access to the proofs, we plan to conduct ablations including the proofs.

For examples of the questions in the benchmark, see the panel below.

Model Output Examples

Prove that a finitely additive non-negative set function μ on a measurable space (Ω, F) with the "continuity" property Bₙ ∈ F, Bₙ ↓ ∅, μ(Bₙ) < ∞ → μ(Bₙ) → 0 must be countably additive if μ(Ω) < ∞.

Lean Formalization:

Lean 4

1import Mathlib
2
3open TopologicalSpace Filter MeasureTheory ProbabilityTheory Function
4
5open scoped NNReal ENNReal MeasureTheory ProbabilityTheory Topology
6
7namespace MeasureTheory
8
9variable {Ω : Type*} [MeasurableSpace Ω]
10
11theorem dembo_1_1_15 {μ : Set Ω → ℝ≥0}
12    (hAdd : ∀ A B, MeasurableSet A → MeasurableSet B → Disjoint A B → μ (A ∪ B) = μ A + μ B)
13    (hCont : ∀ B : ℕ → Set Ω,
14      (∀ n, MeasurableSet (B n)) → Antitone B → (⋂ n, B n) = ∅
15      → Tendsto (fun n => μ (B n)) atTop (𝓝 0))
16    {A : ℕ → Set Ω} (hA : ∀ n, MeasurableSet (A n)) (hDisj : Pairwise (Disjoint on A)) :
17    μ (⋃ n, A n) = ∑' n, μ (A n) := by
18  sorry
19
20end MeasureTheory

Lean 4

1by
2  have hEmpty : μ ∅ = 0 := by
3    have h := hAdd ∅ ∅ MeasurableSet.empty MeasurableSet.empty disjoint_bot_left
4    simp only [Set.union_empty] at h
5    have : μ ∅ + μ ∅ = μ ∅ := h.symm
6    have h3 : μ ∅ + μ ∅ - μ ∅ = μ ∅ - μ ∅ := by rw [this]
7    simp at h3; exact h3
8  set B := fun n => ⋃ k ≥ n, A k with hBdef
9  have hB_meas : ∀ n, MeasurableSet (B n) := fun n => 
10    MeasurableSet.biUnion (Set.to_countable _) (fun k _ => hA k)
11  have hB_anti : Antitone B := by
12    intro m n hmn
13    apply Set.biUnion_subset_biUnion_left
14    intro k hk; exact le_trans hmn hk
15  have hB_inter : ⋂ n, B n = ∅ := by
16    rw [Set.eq_empty_iff_forall_not_mem]
17    intro x hx
18    simp only [Set.mem_iInter] at hx
19    have h0 := hx 0; rw [hBdef] at h0
20    simp only [Set.mem_iUnion, ge_iff_le] at h0
21    obtain ⟨k, -, hxk⟩ := h0
22    have hk1 := hx (k + 1); rw [hBdef] at hk1
23    simp only [Set.mem_iUnion, ge_iff_le] at hk1
24    obtain ⟨j, hjk, hxj⟩ := hk1
25    have hkj : k ≠ j := by omega
26    exact Set.disjoint_iff.mp (hDisj hkj) ⟨hxk, hxj⟩
27  have hBtends : Tendsto (fun n => μ (B n)) atTop (𝓝 0) := hCont B hB_meas hB_anti hB_inter
28  have hUnion : ∀ n, (⋃ i, A i) = (⋃ k < n, A k) ∪ B n := by
29    intro n; ext x
30    simp only [Set.mem_union, Set.mem_iUnion, exists_prop, hBdef]
31    constructor
32    · rintro ⟨k, hxk⟩
33      by_cases hkn : k < n
34      · left; exact ⟨k, hkn, hxk⟩
35      · right; exact ⟨k, not_lt.mp hkn, hxk⟩
36    · rintro (⟨k, -, hxk⟩ | ⟨k, -, hxk⟩) <;> exact ⟨k, hxk⟩
37  have hDisjParts : ∀ n, Disjoint (⋃ k < n, A k) (B n) := by
38    intro n; rw [Set.disjoint_iff]
39    intro x ⟨hx1, hx2⟩
40    simp only [Set.mem_iUnion, exists_prop, hBdef] at hx1 hx2
41    obtain ⟨k1, hk1n, hxk1⟩ := hx1
42    obtain ⟨k2, hk2n, hxk2⟩ := hx2
43    have hne : k1 ≠ k2 := by omega
44    exact Set.disjoint_iff.mp (hDisj hne) ⟨hxk1, hxk2⟩
45  have hFinMeas : ∀ n, MeasurableSet (⋃ k < n, A k) := fun n => 
46    MeasurableSet.biUnion (Set.to_countable _) (fun k _ => hA k)
47  have hFinAdd : ∀ n, μ (⋃ k < n, A k) = ∑ k ∈ Finset.range n, μ (A k) := by
48    intro n
49    induction n with
50    | zero => simp [hEmpty]
51    | succ n ih =>
52      rw [Finset.sum_range_succ, ← ih]
53      have hunion : (⋃ k < n + 1, A k) = (⋃ k < n, A k) ∪ A n := by
54        ext x; simp only [Set.mem_iUnion, Set.mem_union, exists_prop]
55        constructor
56        · rintro ⟨k, hkn, hxk⟩
57          by_cases hkn' : k < n
58          · left; exact ⟨k, hkn', hxk⟩
59          · have heq : k = n := by omega
60            right; rw [heq] at hxk; exact hxk
61        · rintro (⟨k, hkn, hxk⟩ | hxn)
62          · exact ⟨k, Nat.lt_succ_of_lt hkn, hxk⟩
63          · exact ⟨n, Nat.lt_succ_self n, hxn⟩
64      have hdisj : Disjoint (⋃ k < n, A k) (A n) := by
65        rw [Set.disjoint_iff]; intro x ⟨hx1, hx2⟩
66        simp only [Set.mem_iUnion, exists_prop] at hx1
67        obtain ⟨k, hkn, hxk⟩ := hx1
68        have hne : k ≠ n := by omega
69        exact Set.disjoint_iff.mp (hDisj hne) ⟨hxk, hx2⟩
70      rw [hunion]; exact hAdd _ _ (hFinMeas n) (hA n) hdisj
71  have hKey : ∀ n, μ (⋃ i, A i) = (∑ k ∈ Finset.range n, μ (A k)) + μ (B n) := by
72    intro n
73    calc μ (⋃ i, A i) = μ ((⋃ k < n, A k) ∪ B n) := by rw [hUnion n]
74      _ = μ (⋃ k < n, A k) + μ (B n) := hAdd _ _ (hFinMeas n) (hB_meas n) (hDisjParts n)
75      _ = (∑ k ∈ Finset.range n, μ (A k)) + μ (B n) := by rw [hFinAdd n]
76  have hPartialSums : Tendsto (fun n => ∑ k ∈ Finset.range n, μ (A k)) atTop (𝓝 (μ (⋃ i, A i))) := by
77    have h1 : ∀ n, (∑ k ∈ Finset.range n, μ (A k)) = μ (⋃ i, A i) - μ (B n) := by
78      intro n
79      have := hKey n
80      have h2 : μ (⋃ i, A i) - μ (B n) = ∑ k ∈ Finset.range n, μ (A k) + μ (B n) - μ (B n) := by
81        rw [← this]
82      simp at h2; exact h2.symm
83    simp_rw [h1]
84    have : Tendsto (fun n => μ (⋃ i, A i) - μ (B n)) atTop (𝓝 (μ (⋃ i, A i) - 0)) := 
85      Tendsto.sub tendsto_const_nhds hBtends
86    simp at this; exact this
87  have hHasSum : HasSum (fun n => μ (A n)) (μ (⋃ i, A i)) := by
88    rw [NNReal.hasSum_iff_tendsto_nat]; exact hPartialSums
89  exact hHasSum.tsum_eq.symm

CORRECT

Correct proof.

Show that an A-scheme is separated (over A) if and only if it is separated over ℤ.

Lean Formalization:

Lean 4

1import Mathlib
2
3open AlgebraicGeometry
4
5variable {A : Type*} {X : Scheme} [CommRing A]
6    (struct_X : X ⟶ Spec (CommRingCat.of A))
7
8theorem algebraicGeometry_vakil_ch_11_2_E :
9    IsSeparated struct_X ↔ IsSeparated (instOverTerminalScheme X).hom := by
10  sorry

Lean 4

1by
2  simpa using
3    (show IsSeparated struct_X ↔ IsSeparated (CategoryTheory.Limits.terminal.from X) from by
4      constructor
5      · intro h
6        have hterm :
7            IsSeparated
8              (CategoryTheory.Limits.terminal.from (Spec (CommRingCat.of A))) := by
9          infer_instance
10        have hcomp :
11            IsSeparated
12              (CategoryTheory.CategoryStruct.comp struct_X
13                (CategoryTheory.Limits.terminal.from (Spec (CommRingCat.of A)))) :=
14          (CategoryTheory.MorphismProperty.IsStableUnderComposition.comp_mem
15            (P := @AlgebraicGeometry.IsSeparated)
16            struct_X (CategoryTheory.Limits.terminal.from (Spec (CommRingCat.of A))) h hterm)
17        simpa [CategoryTheory.Limits.terminal.comp_from struct_X] using hcomp
18      · intro h
19        have hcomp :
20            IsSeparated
21              (CategoryTheory.CategoryStruct.comp struct_X
22                (CategoryTheory.Limits.terminal.from (Spec (CommRingCat.of A)))) := by
23          simpa [CategoryTheory.Limits.terminal.comp_from struct_X] using h
24        haveI :
25            IsSeparated
26              (CategoryTheory.CategoryStruct.comp struct_X
27                (CategoryTheory.Limits.terminal.from (Spec (CommRingCat.of A)))) :=
28          hcomp
29        exact
30          AlgebraicGeometry.IsSeparated.of_comp struct_X
31            (CategoryTheory.Limits.terminal.from (Spec (CommRingCat.of A))))

CORRECT

Acknowledgements

We thank Kexing Ying, Vasilii Nesterov, Janitha Aswedige, Bingyu Xia, Elif Uskuplu, and Yizheng Zhu for contributing problem formalizations. We also thank Alex Gu for valuable feedback that improved the benchmark design. Finally, we are grateful to the teams behind lean-lsp-mcp, as well as Loogle.

Citations

[1] Formal Mathematical Reasoning: A New Frontier in AI (arXiv:2412.16075)
[2] The Lean 4 language (lean-lang.org)

Citation (BibTeX)

@misc{ravi2026proofbench,
title        = {ProofBench: Can Models Write Math Proofs That Are Formally Verified?},
author       = {Ravi, Nikil and Nashold, Langston and Krishnan, Rayan},
year         = {2026},
month        = jan,
howpublished = {Vals AI},
url          = {https://vals.ai/benchmarks/proof-bench},
note         = {Accessed: 2026-01-30}
}