Key Takeaways
- Harmonic’s Aristotle achieved the best overall performance at 71%, while GPT 5.4 is the top foundation model at 56%. Claude Opus 4.7 is the second-best foundation model at 54%.
- Foundation models are improving on formal math, but there is still a large gap to specialized systems on hard Lean tasks (15 points between Aristotle and GPT 5.4).
Background
Large language models now routinely achieve impressive scores on mathematical contests such as the AIME, IMO, and Putnam. These results suggest substantial progress in mathematical reasoning— however, they come with an important caveat. Natural language solutions can mask subtle logical errors, unstated assumptions, or invalid steps that are difficult and time-consuming for humans to audit.
To address this gap, ProofBench evaluates whether models can translate mathematical reasoning into machine-checkable proofs. We use Lean 4, a proof assistant and programming language designed for formal mathematics, where every logical inference must be formally verified (i.e., be machine-checkable). Since a solution here either compiles or it does not— there is no partial credit for plausible reasoning- this makes for a more objective evaluation of mathematical reasoning skills.
In addition, mathematicians are increasingly using Lean 4 to formalize their work. Therefore, models that perform well on this benchmark will be valuable in assisting with mathematical research in the future.
Success on this benchmark therefore tests whether models can maintain logical correctness across chains of reasoning, and whether they can be effective with modern formal mathematics. Strong performance would indicate genuine progress toward AI systems that can assist mathematicians with formalization, verification, and large-scale proof development.
Results
To visualize the performance differences among models, we provide scatter plots illustrating accuracy versus latency and cost.
Benchmark Overview
Each benchmark task pairs a natural-language mathematical problem with its corresponding Lean 4 formal statement. Given both, a model must construct a formal proof that is accepted by the Lean proof checker. The benchmark measures proof success directly, providing a clear and unforgiving signal of correctness. Problems are drawn from advanced undergraduate and graduate-level sources, including qualifying exams and standard textbooks. They span a broad range of modern mathematics:
- Probability theory and stochastic analysis
- Measure theory
- Real and functional analysis
- Algebra and commutative algebra
- Algebraic geometry
- Number theory
- Set theory, logic, and model theory
By focusing on higher levels of abstraction rather than high-school contest problems (e.g. AIME), ProofBench provides a complementary signal to existing math benchmarks and better reflects the challenges faced in real mathematical research.
Tool Use Analysis
Within the ProofBench harness, models have access to tools organized into the following bundles:
| Bundle | Tools |
|---|---|
| Search | lean_loogle (Mathlib search) |
| Code Execution | lean_run_code (Lean code execution + feedback) |
| Submission | submit_proof (final proof submission) |
Models may iteratively search the Mathlib library and test partial proofs or code in a Lean 4 environment before submitting a final solution. The submission tool is invoked exactly once per problem. The visualization below summarizes how different models allocate effort between search and code execution during problem solving.
One observation in tool-use patterns is that models that tend to focus more on executing code generally seem to perform better. For instance, we observed that Claude Opus 4.5 (Thinking) quickly starts executing code and getting feedback, using the search tool more as a method to find lemmas it recognizes it needs in the middle of constructing its solution. A common failure mode we observed is models spiraling into a long series of search tool calls, attempting to find a theorem or notion that doesn’t exist, but continuing to search for variations of it in an attempt to find it.
Methodology
The benchmark consists of two splits: a public split and a private test split, each containing 100 problems. For every problem, we provide:
- A natural-language statement of the theorem (no informal proof attached; models must generate the proof)
- A carefully vetted Lean 4 formalization of the statement
Models are allowed up to 40 interaction turns per problem, enabling iterative development with tool feedback. No access to the informal proof is provided.
Note on evaluating Aristotle: To evaluate Aristotle, a formal-math system from Harmonic AI, we fed in the full Lean 4 statement file, with the natural language statement included as a comment at the top of the file. We emphasize that Aristotle uses its own internal system/harness, and not the evaluation harness we provide to other models. Additionally, the time reported indicates only solve time, and does not include queue time.
Data Collection Process and Examples
Prior benchmarks have faced numerous challenges with ensuring that the formalizations of the statements were correct. Statements may be misformalized in a number of ways, particularly if autoformalized by models. To circumvent these issues, problems were selected and formalized by a team of Lean 4 experts, including PhD students and experts in their respective domains. Statements and formalizations underwent a secondary review process before being included in the benchmark. We also collected natural language proofs for the problems, and while the main scores reported on this page do not give the model(s) access to the proofs, we plan to conduct ablations including the proofs.
For examples of the questions in the benchmark, see the panel below.
Model Output Examples
Q
Prove that a finitely additive non-negative set function μ on a measurable space (Ω, F) with the "continuity" property Bₙ ∈ F, Bₙ ↓ ∅, μ(Bₙ) < ∞ → μ(Bₙ) → 0 must be countably additive if μ(Ω) < ∞.
L
Lean Formalization:1import Mathlib23open TopologicalSpace Filter MeasureTheory ProbabilityTheory Function45open scoped NNReal ENNReal MeasureTheory ProbabilityTheory Topology67namespace MeasureTheory89variable {Ω : Type*} [MeasurableSpace Ω]1011theorem dembo_1_1_15 {μ : Set Ω → ℝ≥0}12 (hAdd : ∀ A B, MeasurableSet A → MeasurableSet B → Disjoint A B → μ (A ∪ B) = μ A + μ B)13 (hCont : ∀ B : ℕ → Set Ω,14 (∀ n, MeasurableSet (B n)) → Antitone B → (⋂ n, B n) = ∅15 → Tendsto (fun n => μ (B n)) atTop (𝓝 0))16 {A : ℕ → Set Ω} (hA : ∀ n, MeasurableSet (A n)) (hDisj : Pairwise (Disjoint on A)) :17 μ (⋃ n, A n) = ∑' n, μ (A n) := by18 sorry1920end MeasureTheory1by2 have hEmpty : μ ∅ = 0 := by3 have h := hAdd ∅ ∅ MeasurableSet.empty MeasurableSet.empty disjoint_bot_left4 simp only [Set.union_empty] at h5 have : μ ∅ + μ ∅ = μ ∅ := h.symm6 have h3 : μ ∅ + μ ∅ - μ ∅ = μ ∅ - μ ∅ := by rw [this]7 simp at h3; exact h38 set B := fun n => ⋃ k ≥ n, A k with hBdef9 have hB_meas : ∀ n, MeasurableSet (B n) := fun n => 10 MeasurableSet.biUnion (Set.to_countable _) (fun k _ => hA k)11 have hB_anti : Antitone B := by12 intro m n hmn13 apply Set.biUnion_subset_biUnion_left14 intro k hk; exact le_trans hmn hk15 have hB_inter : ⋂ n, B n = ∅ := by16 rw [Set.eq_empty_iff_forall_not_mem]17 intro x hx18 simp only [Set.mem_iInter] at hx19 have h0 := hx 0; rw [hBdef] at h020 simp only [Set.mem_iUnion, ge_iff_le] at h021 obtain ⟨k, -, hxk⟩ := h022 have hk1 := hx (k + 1); rw [hBdef] at hk123 simp only [Set.mem_iUnion, ge_iff_le] at hk124 obtain ⟨j, hjk, hxj⟩ := hk125 have hkj : k ≠ j := by omega26 exact Set.disjoint_iff.mp (hDisj hkj) ⟨hxk, hxj⟩27 have hBtends : Tendsto (fun n => μ (B n)) atTop (𝓝 0) := hCont B hB_meas hB_anti hB_inter28 have hUnion : ∀ n, (⋃ i, A i) = (⋃ k < n, A k) ∪ B n := by29 intro n; ext x30 simp only [Set.mem_union, Set.mem_iUnion, exists_prop, hBdef]31 constructor32 · rintro ⟨k, hxk⟩33 by_cases hkn : k < n34 · left; exact ⟨k, hkn, hxk⟩35 · right; exact ⟨k, not_lt.mp hkn, hxk⟩36 · rintro (⟨k, -, hxk⟩ | ⟨k, -, hxk⟩) <;> exact ⟨k, hxk⟩37 have hDisjParts : ∀ n, Disjoint (⋃ k < n, A k) (B n) := by38 intro n; rw [Set.disjoint_iff]39 intro x ⟨hx1, hx2⟩40 simp only [Set.mem_iUnion, exists_prop, hBdef] at hx1 hx241 obtain ⟨k1, hk1n, hxk1⟩ := hx142 obtain ⟨k2, hk2n, hxk2⟩ := hx243 have hne : k1 ≠ k2 := by omega44 exact Set.disjoint_iff.mp (hDisj hne) ⟨hxk1, hxk2⟩45 have hFinMeas : ∀ n, MeasurableSet (⋃ k < n, A k) := fun n => 46 MeasurableSet.biUnion (Set.to_countable _) (fun k _ => hA k)47 have hFinAdd : ∀ n, μ (⋃ k < n, A k) = ∑ k ∈ Finset.range n, μ (A k) := by48 intro n49 induction n with50 | zero => simp [hEmpty]51 | succ n ih =>52 rw [Finset.sum_range_succ, ← ih]53 have hunion : (⋃ k < n + 1, A k) = (⋃ k < n, A k) ∪ A n := by54 ext x; simp only [Set.mem_iUnion, Set.mem_union, exists_prop]55 constructor56 · rintro ⟨k, hkn, hxk⟩57 by_cases hkn' : k < n58 · left; exact ⟨k, hkn', hxk⟩59 · have heq : k = n := by omega60 right; rw [heq] at hxk; exact hxk61 · rintro (⟨k, hkn, hxk⟩ | hxn)62 · exact ⟨k, Nat.lt_succ_of_lt hkn, hxk⟩63 · exact ⟨n, Nat.lt_succ_self n, hxn⟩64 have hdisj : Disjoint (⋃ k < n, A k) (A n) := by65 rw [Set.disjoint_iff]; intro x ⟨hx1, hx2⟩66 simp only [Set.mem_iUnion, exists_prop] at hx167 obtain ⟨k, hkn, hxk⟩ := hx168 have hne : k ≠ n := by omega69 exact Set.disjoint_iff.mp (hDisj hne) ⟨hxk, hx2⟩70 rw [hunion]; exact hAdd _ _ (hFinMeas n) (hA n) hdisj71 have hKey : ∀ n, μ (⋃ i, A i) = (∑ k ∈ Finset.range n, μ (A k)) + μ (B n) := by72 intro n73 calc μ (⋃ i, A i) = μ ((⋃ k < n, A k) ∪ B n) := by rw [hUnion n]74 _ = μ (⋃ k < n, A k) + μ (B n) := hAdd _ _ (hFinMeas n) (hB_meas n) (hDisjParts n)75 _ = (∑ k ∈ Finset.range n, μ (A k)) + μ (B n) := by rw [hFinAdd n]76 have hPartialSums : Tendsto (fun n => ∑ k ∈ Finset.range n, μ (A k)) atTop (𝓝 (μ (⋃ i, A i))) := by77 have h1 : ∀ n, (∑ k ∈ Finset.range n, μ (A k)) = μ (⋃ i, A i) - μ (B n) := by78 intro n79 have := hKey n80 have h2 : μ (⋃ i, A i) - μ (B n) = ∑ k ∈ Finset.range n, μ (A k) + μ (B n) - μ (B n) := by81 rw [← this]82 simp at h2; exact h2.symm83 simp_rw [h1]84 have : Tendsto (fun n => μ (⋃ i, A i) - μ (B n)) atTop (𝓝 (μ (⋃ i, A i) - 0)) := 85 Tendsto.sub tendsto_const_nhds hBtends86 simp at this; exact this87 have hHasSum : HasSum (fun n => μ (A n)) (μ (⋃ i, A i)) := by88 rw [NNReal.hasSum_iff_tendsto_nat]; exact hPartialSums89 exact hHasSum.tsum_eq.symmCORRECT
Correct proof.
Q
Show that an A-scheme is separated (over A) if and only if it is separated over ℤ.
L
Lean Formalization:1import Mathlib23open AlgebraicGeometry45variable {A : Type*} {X : Scheme} [CommRing A]6 (struct_X : X ⟶ Spec (CommRingCat.of A))78theorem algebraicGeometry_vakil_ch_11_2_E :9 IsSeparated struct_X ↔ IsSeparated (instOverTerminalScheme X).hom := by10 sorry1by2 simpa using3 (show IsSeparated struct_X ↔ IsSeparated (CategoryTheory.Limits.terminal.from X) from by4 constructor5 · intro h6 have hterm :7 IsSeparated8 (CategoryTheory.Limits.terminal.from (Spec (CommRingCat.of A))) := by9 infer_instance10 have hcomp :11 IsSeparated12 (CategoryTheory.CategoryStruct.comp struct_X13 (CategoryTheory.Limits.terminal.from (Spec (CommRingCat.of A)))) :=14 (CategoryTheory.MorphismProperty.IsStableUnderComposition.comp_mem15 (P := @AlgebraicGeometry.IsSeparated)16 struct_X (CategoryTheory.Limits.terminal.from (Spec (CommRingCat.of A))) h hterm)17 simpa [CategoryTheory.Limits.terminal.comp_from struct_X] using hcomp18 · intro h19 have hcomp :20 IsSeparated21 (CategoryTheory.CategoryStruct.comp struct_X22 (CategoryTheory.Limits.terminal.from (Spec (CommRingCat.of A)))) := by23 simpa [CategoryTheory.Limits.terminal.comp_from struct_X] using h24 haveI :25 IsSeparated26 (CategoryTheory.CategoryStruct.comp struct_X27 (CategoryTheory.Limits.terminal.from (Spec (CommRingCat.of A)))) :=28 hcomp29 exact30 AlgebraicGeometry.IsSeparated.of_comp struct_X31 (CategoryTheory.Limits.terminal.from (Spec (CommRingCat.of A))))CORRECT
Acknowledgements
We thank Kexing Ying, Vasilii Nesterov, Janitha Aswedige, Bingyu Xia, Elif Uskuplu, and Yizheng Zhu for contributing problem formalizations. We also thank Alex Gu for valuable feedback that improved the benchmark design. Finally, we are grateful to the teams behind lean-lsp-mcp, as well as Loogle.
Citations
[1] Formal Mathematical Reasoning: A New Frontier in AI (arXiv:2412.16075)
[2] The Lean 4 language (lean-lang.org)
Citation (BibTeX)
@misc{ravi2026proofbench,
title = {ProofBench: Can Models Write Math Proofs That Are Formally Verified?},
author = {Ravi, Nikil and Nashold, Langston and Krishnan, Rayan},
year = {2026},
month = jan,
howpublished = {Vals AI},
url = {https://vals.ai/benchmarks/proof-bench},
note = {Accessed: 2026-01-30}
}