Key Takeaways
- Claude Opus 4.8 leads at 69.4%, ahead of Claude Sonnet 5 (66.3%) and GPT 5.5 (64.5%). With partial-credit grading, even top scores are substantial-but-incomplete financial models, not client-ready deliverables.
- Difficulty is consistent across models: nearly all score highest on Dataroom Summaries and lowest on LBO and DCF, the most deeply interdependent models, where early errors cascade.
- Numerical accuracy is the bottleneck: the leading model passes 87% of formula and 74% of presentation checks but only 61% of numerical checks.
Background
The Excel Modeling Benchmark (EMB) evaluates whether AI agents can generate the complex Excel financial models typical of investment banking and private equity. Each task is built around real-world utility: it provides a written prompt describing a business scenario along with a set of source files, and asks the agent to build the complete model, including the workbook’s tabs, structure, formulas, and formatting.
The benchmark tests the kind of modeling work that junior analysts do in practice: forecasting how a company will perform, building out its financial statements, and working out what a company or a deal is worth. The core challenge is taking the provided inputs (historical financials, a customer database, company data provided in spreadsheets) and driving them through dozens of interdependent steps while keeping the logic consistent end to end. Correct models should be usable in real-world scenarios. Most importantly, the produced models must open in Excel without issue. Circular references should be handled cleanly, case toggles should switch the entire model between scenarios (e.g. base/bull/bear) from a single input, and changes to core inputs should propagate through the entire model. Each model is expected to take a human expert a minimum of 5 hours to create from scratch.
The tasks span seven model categories: LBO Models, DCF Models, M&A Models, Operating Models, Comps Spreads, Dataroom Summaries, and a catch-all Misc Models category. All tasks, gold-standard models, and rubrics are authored and peer-reviewed by financial experts, drawing on realistic scenarios.
Each task is evaluated in two modes. In Template mode the agent fills in a provided workbook skeleton and is graded by exact cell match against the gold model. In Scratch mode it builds the workbook from nothing and is graded by a partial-credit rubric scoring three kinds of checks: 1) numerical (do values match the ground truth), 2) formula (are cells wired to the correct inputs), and 3) presentation (IB conventions like color-coding and number formats, plus source citations). EMB scores both correctness and usability, so layout and modeling conventions are just as important as the numbers and formulas.
Results
The Pareto chart above shows how accuracy trades off against cost and latency. Claude Opus 4.8 is the most accurate at 69.4%, at $12 per task, while Claude Sonnet 5 ranks second at 66.3% but is the most expensive model of all at $15.44 per task. Both cost several times more than GPT 5.5, which reaches 64.5%. Among the leading models, Kimi K2.6 is the most cost-efficient, holding 58% accuracy at just $2.20 per task, and MiMo V2.5 Pro is cheaper still at $0.22 while staying above 50%.
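The dominance relation behind a Pareto chart can be sketched in a few lines. This is an illustration only: it uses the three accuracy and cost pairs quoted in the text, ignores latency, and the names `Point` and `pareto_front` are hypothetical rather than part of the benchmark's tooling.

```python
from typing import NamedTuple


class Point(NamedTuple):
    model: str
    accuracy: float  # percent, higher is better
    cost: float      # USD per task, lower is better


def pareto_front(points: list[Point]) -> list[Point]:
    """Return the points that no other point dominates.

    A point is dominated if another point is at least as accurate and at
    least as cheap, and strictly better on one of the two axes.
    """
    front = []
    for p in points:
        dominated = any(
            q.accuracy >= p.accuracy
            and q.cost <= p.cost
            and (q.accuracy > p.accuracy or q.cost < p.cost)
            for q in points
        )
        if not dominated:
            front.append(p)
    return front


# Only models whose accuracy and cost are both quoted in the text.
points = [
    Point("Claude Opus 4.8", 69.4, 12.00),
    Point("Claude Sonnet 5", 66.3, 15.44),
    Point("Kimi K2.6", 58.0, 2.20),
]
front_models = [p.model for p in pareto_front(points)]
```

On these three points, Claude Sonnet 5 falls off the frontier because Claude Opus 4.8 is both more accurate and cheaper, while Kimi K2.6 stays on it as the lower-cost option.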
Performance by category
Category scores for each model, shaded from low (red) to high (green).
| Model | LBO | DCF | M&A | Operating | Comps | Dataroom | Misc |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.8 | 54 | 66 | 80 | 67 | 73 | 86 | 59 |
| Claude Sonnet 5 | 60 | 56 | 76 | 65 | 69 | 85 | 52 |
| GPT 5.5 | 53 | 57 | 67 | 64 | 65 | 81 | 65 |
| Gemini 3.5 Flash | 46 | 58 | 68 | 59 | 70 | 81 | 62 |
| GLM 5.2 | 52 | 53 | 64 | 63 | 55 | 83 | 61 |
| Claude Sonnet 4.6 | 53 | 38 | 71 | 58 | 62 | 78 | 62 |
| Kimi K2.6 | 43 | 50 | 64 | 57 | 61 | 75 | 55 |
| Qwen 3.7 Max | 47 | 44 | 67 | 52 | 51 | 81 | 58 |
| MiMo V2.5 Pro | 48 | 44 | 64 | 55 | 47 | 78 | 51 |
| Gemini 3.1 Pro Preview (02/26) | 38 | 46 | 59 | 58 | 45 | 79 | 43 |
| DeepSeek V4 | 44 | 44 | 48 | 47 | 57 | 63 | 58 |
| Qwen 3.7 Plus | 38 | 37 | 59 | 58 | 39 | 72 | 42 |
| MiniMax-M3 | 45 | 37 | 48 | 49 | 53 | 53 | 50 |
| GPT 5.4 Mini | 42 | 41 | 50 | 51 | 40 | 48 | 46 |
| GPT 5.4 Nano | 33 | 38 | 44 | 37 | 47 | 72 | 43 |
| Grok 4.3 | 9 | 11 | 27 | 16 | 17 | 29 | 18 |
| Gemini 3.1 Flash Lite Preview | 9 | 5 | 7 | 5 | 10 | 15 | 9 |
Per-category performance is consistent across evaluated agents, with nearly all performing strongest on Dataroom Summaries and weakest on LBO and DCF models. LBO and DCF models are each one long chain of linked calculations, so an early error propagates through everything downstream; Dataroom Summaries lean more on summarizing provided data, so errors stay localized.
Scratch rubric breakdown
Scratch-mode rubric scores split by check type: numerical values, formulas, and presentation (formatting conventions, source citations, and workbook-wide checks).
| Model | Numerical | Formula | Presentation |
|---|---|---|---|
| Claude Opus 4.8 | 61 | 87 | 74 |
| Claude Sonnet 5 | 59 | 87 | 75 |
| GPT 5.5 | 58 | 86 | 78 |
| Gemini 3.5 Flash | 57 | 84 | 66 |
| GLM 5.2 | 59 | 84 | 77 |
| Claude Sonnet 4.6 | 58 | 82 | 74 |
| Kimi K2.6 | 52 | 79 | 70 |
| Qwen 3.7 Max | 53 | 78 | 73 |
| MiMo V2.5 Pro | 52 | 75 | 66 |
| Gemini 3.1 Pro Preview (02/26) | 51 | 78 | 65 |
| DeepSeek V4 | 48 | 73 | 64 |
| Qwen 3.7 Plus | 51 | 76 | 65 |
| MiniMax-M3 | 53 | 81 | 72 |
| GPT 5.4 Mini | 47 | 72 | 64 |
| GPT 5.4 Nano | 34 | 56 | 57 |
| Grok 4.3 | 20 | 47 | 52 |
| Gemini 3.1 Flash Lite Preview | 3 | 3 | 37 |
Formula and presentation checks pass far more often than numerical checks. A formula passes as long as it points at the right input cells, but if any of those inputs holds a wrong value, the formula computes a wrong result and fails the numerical check.
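A toy example may make the distinction concrete. The sketch below is not the benchmark's grader: the cell addresses, values, and the 0.1% tolerance are all assumptions chosen for illustration. It shows a formula that references the right inputs and still produces the wrong number because one input is wrong upstream.

```python
import math

# Hypothetical two-input model: revenue (B3) = units (B1) * price (B2).
gold = {"B1": 1_000.0, "B2": 25.0}
submitted = {"B1": 900.0, "B2": 25.0}  # B1 was derived incorrectly upstream

gold_refs = ("B1", "B2")       # inputs the gold formula in B3 points at
submitted_refs = ("B1", "B2")  # inputs the submitted formula in B3 points at


def revenue(cells: dict[str, float]) -> float:
    """Evaluate the B3 formula against a set of input cells."""
    return cells["B1"] * cells["B2"]


# Formula check: is the cell wired to the correct inputs?
formula_pass = set(submitted_refs) == set(gold_refs)

# Numerical check: does the computed value match the ground truth?
# The 0.1% relative tolerance is an assumption for this sketch.
numerical_pass = math.isclose(revenue(submitted), revenue(gold), rel_tol=1e-3)
```

Here `formula_pass` is true while `numerical_pass` is false: the wiring is right, but 22,500 does not match the expected 25,000.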
Template vs. Scratch
Each model's score under the two grading modes. Template is scored by exact cell match and Scratch by a partial-credit rubric.
For all but the strongest models, Scratch mode scores beat Template mode scores, and the gap is widest for the weakest models. Scratch mode measures financial content without requiring a rigid structure; Template mode additionally requires matching the gold model's structure cell for cell.
Generated tokens per turn
Models ranked by overall score, with the average tokens each generates per agent turn and a sparkline of how that varies from the start to the end of a run. The sparklines share a common scale, so taller means more output per turn.
| Model | Tokens / turn | Trend |
|---|---|---|
| 1. Claude Opus 4.8 | 4.6k | |
| 2. Claude Sonnet 5 | 2.0k | |
| 3. GPT 5.5 | 1.8k | |
| 4. Gemini 3.5 Flash | 1.7k | |
| 5. GLM 5.2 | 3.1k | |
| 6. Claude Sonnet 4.6 | 3.9k | |
| 7. Kimi K2.6 | 1.6k | |
| 8. Qwen 3.7 Max | 2.7k | |
| 9. MiMo V2.5 Pro | 3.0k | |
| 10. Gemini 3.1 Pro Preview (02/26) | 785 | |
| 11. DeepSeek V4 | 1.8k | |
| 12. Qwen 3.7 Plus | 2.7k | |
| 13. MiniMax-M3 | 1.3k | |
| 14. GPT 5.4 Mini | 3.4k | |
| 15. GPT 5.4 Nano | 1.3k | |
| 16. Grok 4.3 | 1.3k | |
| 17. Gemini 3.1 Flash Lite Preview | 501 | |
Generation is front-loaded: models write the bulk of the model early, then taper to smaller edits and checks. Claude Opus 4.8 is an outlier, generating far more per turn than any other model, with its output peaking a few turns in rather than at the very start.
Agents work the task almost entirely through two tools: bash to read the input spreadsheets and write the workbook, and execute to recalculate and verify it. The chart below reports the average turns, tool calls, and tool-call errors per task for each model.
Example Task
- Numerical: The Q1 2018 cohort's ARR by quarter in Dec-19 equals $24,640.
- Formula: Net Retention % equals 1 + the sum of upsell, cross-sell, downsell, and churn all divided by BoP ARR.
- Presentation: Customer Name (ID) has 'Customer' as a prefix.
- Presentation: Input tabs contain a source citation in the top-left corner.
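One reading of the Net Retention check above is 1 + (upsell + cross-sell + downsell + churn) / BoP ARR, with downsell and churn entered as negative ARR movements. The sketch below follows that reading; the function name and the cohort figures are hypothetical and are not taken from the example task.

```python
def net_retention(bop_arr, upsell, cross_sell, downsell, churn):
    """Net Retention % = 1 + (upsell + cross-sell + downsell + churn) / BoP ARR.

    Assumes downsell and churn are entered as negative ARR movements.
    """
    if bop_arr == 0:
        raise ValueError("BoP ARR must be non-zero")
    return 1 + (upsell + cross_sell + downsell + churn) / bop_arr


# Hypothetical cohort figures, not taken from the example task.
nrr = net_retention(
    bop_arr=100_000, upsell=8_000, cross_sell=4_000, downsell=-3_000, churn=-5_000
)
print(f"{nrr:.1%}")  # 104.0%
```

With these inputs the net ARR movement is +4,000 on a 100,000 base, so retention is 104%.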
Dataset
The dataset comprises 103 expert-authored, peer-reviewed questions, each paired with a gold-standard Excel model and a detailed grading rubric.
Each question is evaluated in both Template mode and Scratch mode. The dataset is divided into three parts: Public (1 open-sourced task), Private Validation (51 tasks available for license), and Test (51 tasks). All results reported on this page are based solely on the private, held-out Test set to prevent overfitting.
Task Taxonomy
EMB organizes tasks into seven model families reflecting real investment banking and private equity workflows.
1. LBO Models
Models the full equity story of a leveraged buyout from entry to exit, capturing the key levers of value creation. Goes beyond a basic debt schedule to incorporate bolt-on acquisitions, dividend recapitalizations, multi-tranche capital structures with PIK components, and sensitivity analysis on returns.
2. DCF Models
Intrinsic valuations that go beyond a single-entity discount, typically using a Sum-of-the-Parts approach. Each business segment is valued independently with its own cash flow profile, discount rate, and terminal value methodology.
3. M&A Models
Live-deal simulations focused on the mechanics of combining two financial entities. Covers the full range of transaction structures (stock, cash, mixed), pro-forma consolidation with purchase accounting adjustments, and contribution analysis showing each party’s economic share versus ownership received.
4. Operating Models
Deep dives into the underlying business rather than the transaction. Built from the bottom up with granular revenue drivers, department-level cost structures, and a fully linked balance sheet with working capital circularity.
5. Comps Spreads
Deal-level comparable company analysis with precise enterprise value builds for each comp. Emphasizes normalization (adjusting for lease liabilities, non-recurring items, and calendarization) to arrive at clean, apples-to-apples valuation multiples.
6. Dataroom Summaries
Post-dataroom analysis that validates the investment thesis through attribution and unit economics. Covers customer concentration, cohort-based retention metrics, revenue segmentation and bridging, and margin analysis with clear GAAP-to-adjusted reconciliation.
7. Misc. Models
A catch-all for the long tail of niche analyses beyond the six core families. These tasks test whether a model has learned general modeling principles rather than just the common templates.
13-Week Cash Flow
Short-term liquidity model built to manage a business through constrained conditions. Tracks weekly cash receipts and disbursements with high granularity, incorporating vendor payment timing, payroll cycles, and restructuring actions. Emphasizes variance analysis versus forecast, covenant visibility, and minimum liquidity thresholds to inform stakeholder negotiations and operational decision-making.
Equity Waterfall Returns
Models the distribution of proceeds across stakeholders at exit, capturing the full capital stack hierarchy. Allocates value through debt seniority, liquidation preferences, participation features, and conversion mechanics to show realized outcomes by tranche. Highlights breakpoints, IRRs, and MOICs for each class, with sensitivity to exit value and structure.
Project Finance Models
Asset-level valuation frameworks focused on long-duration infrastructure investments. Builds cash flows from contracted revenues, operating costs, and financing structures, incorporating tax equity, sculpted debt, and reserve accounts. Centers on NPV, IRR, and DSCR analysis under multiple scenarios, with detailed treatment of construction periods and ramp-up dynamics.
Multi-Tranche Returns Models
Evaluates investments with layered return structures across different instruments and stakeholders. Integrates preferred returns, warrants, options, and GP/LP promote tiers into a unified framework that captures timing and priority of payouts. Emphasizes scenario-driven outcomes, showing how returns shift across tranches under varying performance and exit assumptions.
Methodology
Agents are evaluated on a shared harness, and nearly all of their work runs through two tools. bash is the primary tool, used to read data files, inspect the workbook, and write the Excel model. execute opens the workbook and recalculates its formulas, letting the agent run and verify the model as it builds. The remaining tools cover data gathering and are used sparingly, because most tasks ship with the data a model needs. Web search and SEC filing search pull in outside information, price history retrieves market data, and retrieve and parse read supporting web pages. A submit tool finalizes the workbook for grading.
Each run is capped at an intentionally generous 3.5 hours of wall-clock time, though in practice models rarely come close to the limit.
Each task is written by a financial expert, who first builds a Gold model: a complete, correct reference implementation of the task specification. From the Gold model, a Template model is constructed by stripping out all values and formulas, leaving only the workbook structure as a starting point. In Template mode, each agent must fill in the Template workbook with the correct formulas and values, and the submission is scored on the proportion of graded cells whose values match the Gold workbook.
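Template scoring as described above can be sketched as a simple proportion. This is a minimal illustration, not the benchmark's grader: the cell keys, the sample values, and the choice to round numbers to six decimals before comparing are all assumptions.

```python
Cell = tuple[str, str]  # (sheet name, cell address), e.g. ("LBO", "D14")


def values_match(expected, actual, ndigits: int = 6) -> bool:
    """Compare two cell values; numbers are rounded first (an assumption)."""
    numeric = (int, float)
    if isinstance(expected, numeric) and isinstance(actual, numeric):
        return round(expected, ndigits) == round(actual, ndigits)
    return expected == actual


def template_score(gold: dict[Cell, object], submitted: dict[Cell, object]) -> float:
    """Fraction of graded (gold) cells whose submitted value matches."""
    if not gold:
        raise ValueError("no graded cells")
    matches = sum(
        values_match(expected, submitted.get(cell))
        for cell, expected in gold.items()
    )
    return matches / len(gold)


# Hypothetical graded cells.
gold = {("LBO", "D14"): 125.0, ("LBO", "D15"): 0.215, ("LBO", "D16"): "Base"}
submitted = {("LBO", "D14"): 125.0, ("LBO", "D15"): 0.198, ("LBO", "D16"): "Base"}
score = template_score(gold, submitted)  # 2 of 3 cells match
```

A cell missing from the submission counts as a mismatch, since `submitted.get(cell)` returns `None`.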
In Scratch mode, the agent is given the same instructions but no template workbook, and must build the model from scratch. The Scratch submission is scored on the fraction of rubric checks passed across numerical accuracy, formula correctness, and presentation (IB conventions like color-coding and number formats, plus source citations). Each rubric check is evaluated by an agentic LLM judge, GPT 5.4 Mini run at xhigh reasoning effort, which inspects the submitted workbook and returns a pass or fail with its reasoning.
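The aggregation of judge verdicts into a Scratch score can be sketched as below. The `CheckResult` structure, the function name, and the sample verdicts are hypothetical, and the sketch assumes every rubric check carries equal weight, which the description implies but does not spell out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CheckResult:
    kind: str       # "numerical", "formula", or "presentation"
    passed: bool    # the judge's pass/fail verdict
    reasoning: str  # the judge's explanation


def scratch_scores(results: list[CheckResult]) -> dict[str, float]:
    """Pass rate overall and per check type."""
    if not results:
        raise ValueError("no rubric checks")
    by_kind = defaultdict(list)
    for r in results:
        by_kind[r.kind].append(r.passed)
    scores = {kind: sum(v) / len(v) for kind, v in by_kind.items()}
    scores["overall"] = sum(r.passed for r in results) / len(results)
    return scores


# Hypothetical verdicts for one submission.
results = [
    CheckResult("numerical", False, "Dec-19 ARR does not match"),
    CheckResult("numerical", True, "value matches"),
    CheckResult("formula", True, "references BoP ARR"),
    CheckResult("presentation", True, "citation in top-left corner"),
]
scores = scratch_scores(results)
```

For these four verdicts the overall pass rate is 0.75, with numerical checks at 0.5 and the other two types at 1.0.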
In both modes, a subset of tasks requires the model to be driven by case toggles, such as Base, Bull, and Bear scenarios, with the final model evaluated for correctness across all cases.
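As a rough illustration of a toggle-driven model, the sketch below routes every assumption through a single case input and then checks the output under each case. The assumptions, the expected values, and the all-cases-must-match rule are hypothetical; the benchmark's actual per-case grading logic is not specified here.

```python
# Hypothetical scenario assumptions keyed by the single case input.
CASES = {
    "Base": {"growth": 0.05, "margin": 0.20},
    "Bull": {"growth": 0.08, "margin": 0.22},
    "Bear": {"growth": 0.02, "margin": 0.18},
}


def year_one_ebitda(case: str, revenue: float = 1_000.0) -> float:
    """Every downstream line reads its assumptions through the one toggle."""
    a = CASES[case]
    return revenue * (1 + a["growth"]) * a["margin"]


def passes_all_cases(expected: dict[str, float], tol: float = 1e-6) -> bool:
    """In this sketch, a check passes only if every case matches."""
    return all(abs(year_one_ebitda(c) - v) <= tol for c, v in expected.items())


expected = {"Base": 210.0, "Bull": 237.6, "Bear": 183.6}
ok = passes_all_cases(expected)
```

A model that hard-codes Base-case numbers anywhere downstream would match in one case and fail in the other two, which is the failure this kind of check is meant to catch.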
Each submission is graded by first recalculating all formulas through a real Microsoft Excel engine. Accuracy is the category mean: tasks are averaged within each of the seven categories, and the seven category scores are weighted equally.
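The category mean differs from a plain average over tasks whenever categories contain different numbers of tasks. A minimal sketch, using hypothetical task scores and only two categories for brevity:

```python
from collections import defaultdict
from statistics import mean


def category_mean(task_scores: list[tuple[str, float]]) -> float:
    """Average tasks within each category, then average the categories equally."""
    by_category = defaultdict(list)
    for category, score in task_scores:
        by_category[category].append(score)
    return mean(mean(scores) for scores in by_category.values())


# Hypothetical task scores; two categories shown instead of seven for brevity.
task_scores = [
    ("LBO", 40.0),
    ("LBO", 60.0),
    ("Dataroom", 90.0),
    ("Dataroom", 80.0),
    ("Dataroom", 70.0),
    ("Dataroom", 80.0),
]
accuracy = category_mean(task_scores)     # (50 + 80) / 2 = 65.0
pooled = mean(s for _, s in task_scores)  # 420 / 6 = 70.0
```

Weighting categories equally keeps a category with many tasks from dominating the headline score: here the pooled average is 70.0, while the category mean is 65.0.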
Because grading awards partial credit toward a complete model, high scores on EMB should be construed as substantial but incomplete work, not as client-ready deliverables.
Acknowledgements
We would like to thank Andrew Schettino and all of the financial experts who authored and peer-reviewed the Excel Modeling Benchmark.