Excel Modeling Benchmark

Key Takeaways

Claude Opus 4.8 leads at 69.4%, ahead of Claude Sonnet 5 (66.3%) and GPT 5.5 (64.5%). With partial-credit grading, even top scores are substantial-but-incomplete financial models, not client-ready deliverables.
Difficulty is consistent across models: nearly all score highest on Dataroom Summaries and lowest on LBO and DCF, the most deeply interdependent models, where early errors cascade.
Numerical accuracy is the bottleneck: the leading model passes 87% of formula and 74% of presentation checks but only 61% of numerical checks.

Background

The Excel Modeling Benchmark (EMB) evaluates whether AI agents can generate complex Excel financial models typical in Investment Banking and Private Equity. Each task is focused on real-world utility: it provides a written prompt describing a business scenario along with a set of source files, and asks the agent to build the complete model, including the workbook’s tabs, structure, formulas, and formatting.

The benchmark tests the kind of modeling work that junior analysts do in practice. This includes forecasting how a company will perform, building out its financial statements, and working out what a company or a deal is worth. The core challenge is taking the provided inputs (historical financials, a customer database, company data provided in spreadsheets) and driving them through dozens of interdependent steps while keeping the logic consistent end-to-end. Correct models should be usable in real-world scenarios. Most importantly, the produced models must open in Excel without errors. Circular references should be handled cleanly, case toggles should switch the entire model between scenarios (e.g. base/bull/bear) from a single input, and changes to core inputs should propagate through the entire model. Each model is expected to take a human expert at least 5 hours to create from scratch.

The tasks span seven model categories: LBO Models, DCF Models, M&A Models, Operating Models, Comps Spreads, Dataroom Summaries, and a catch-all Misc Models category. All tasks, gold-standard models, and rubrics are authored and peer-reviewed by financial experts, drawing on realistic scenarios.

Each task is evaluated in two modes. In Template mode the agent fills in a provided workbook skeleton and is graded by exact cell match against the gold model. In Scratch mode it builds the workbook from nothing and is graded by a partial-credit rubric scoring three kinds of checks: 1) numerical (do values match the ground truth), 2) formula (are cells wired to the correct inputs), and 3) presentation (IB conventions like color-coding and number formats, plus source citations). EMB scores both correctness and usability, so layout and modeling conventions are just as important as the numbers and formulas.

Results

EMB: accuracy vs. cost and latency (Pareto chart)

The Pareto chart above shows how accuracy trades off against cost and latency. Claude Opus 4.8 is the most accurate at 69.4% and costs $12 per task, while Claude Sonnet 5 ranks second at 66.3% but is the most expensive model of all at $15.44 per task. GPT 5.5 reaches 64.5% at a fraction of either model's cost. Among the leading models, Kimi K2.6 is the most cost-efficient, reaching 58% accuracy at just $2.20 per task, and MiMo V2.5 Pro is cheaper still at $0.22 per task while staying above 50%.
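One way to read the cost side of this trade-off is to check which models are Pareto-dominated, meaning some other model is at least as accurate and no more expensive. The sketch below uses only the three accuracy and cost pairs quoted above. The names `Point` and `pareto_frontier` are illustrative, latency is ignored, and models whose exact cost or accuracy is not stated in the text are left out.

```python
from typing import NamedTuple


class Point(NamedTuple):
    model: str
    cost_usd: float  # average cost per task
    accuracy: float  # overall accuracy, in percent


# Only the pairs stated in the text; other models are omitted on purpose.
points = [
    Point("Claude Opus 4.8", 12.00, 69.4),
    Point("Claude Sonnet 5", 15.44, 66.3),
    Point("Kimi K2.6", 2.20, 58.0),
]


def pareto_frontier(pts):
    """Keep the points that no other point beats on both cost and accuracy."""
    frontier = []
    for p in pts:
        dominated = any(
            q.cost_usd <= p.cost_usd
            and q.accuracy >= p.accuracy
            and (q.cost_usd < p.cost_usd or q.accuracy > p.accuracy)
            for q in pts
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda p: p.cost_usd)


frontier = pareto_frontier(points)
for p in frontier:
    print(f"{p.model}: {p.accuracy}% at ${p.cost_usd:.2f} per task")
```

On these three points, Claude Sonnet 5 drops off the frontier because Claude Opus 4.8 is both cheaper and more accurate, while Kimi K2.6 stays on it as the low-cost option.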

Performance by category

Category scores for each model, shaded from low (red) to high (green).

Model	LBO	DCF	M&A	Operating	Comps	Dataroom	Misc
Claude Opus 4.8	54	66	80	67	73	86	59
Claude Sonnet 5	60	56	76	65	69	85	52
GPT 5.5	53	57	67	64	65	81	65
Gemini 3.5 Flash	46	58	68	59	70	81	62
GLM 5.2	52	53	64	63	55	83	61
Claude Sonnet 4.6	53	38	71	58	62	78	62
Kimi K2.6	43	50	64	57	61	75	55
Qwen 3.7 Max	47	44	67	52	51	81	58
MiMo V2.5 Pro	48	44	64	55	47	78	51
Gemini 3.1 Pro Preview (02/26)	38	46	59	58	45	79	43
DeepSeek V4	44	44	48	47	57	63	58
Qwen 3.7 Plus	38	37	59	58	39	72	42
MiniMax-M3	45	37	48	49	53	53	50
GPT 5.4 Mini	42	41	50	51	40	48	46
GPT 5.4 Nano	33	38	44	37	47	72	43
Grok 4.3	9	11	27	16	17	29	18
Gemini 3.1 Flash Lite Preview	9	5	7	5	10	15	9

Per-category performance is consistent across evaluated agents, with nearly all performing strongest on Dataroom Summaries and weakest on LBO and DCF models. LBO and DCF models are each built as one long chain of linked calculations, so an early error propagates through the rest of the model; Dataroom Summaries lean more on summarizing provided data, so errors stay localized.

Scratch rubric breakdown

Scratch-mode rubric scores split by check type: numerical values, formulas, and presentation (formatting conventions, source citations, and workbook-wide checks).

Model	Numerical	Formula	Presentation
Claude Opus 4.8	61	87	74
Claude Sonnet 5	59	87	75
GPT 5.5	58	86	78
Gemini 3.5 Flash	57	84	66
GLM 5.2	59	84	77
Claude Sonnet 4.6	58	82	74
Kimi K2.6	52	79	70
Qwen 3.7 Max	53	78	73
MiMo V2.5 Pro	52	75	66
Gemini 3.1 Pro Preview (02/26)	51	78	65
DeepSeek V4	48	73	64
Qwen 3.7 Plus	51	76	65
MiniMax-M3	53	81	72
GPT 5.4 Mini	47	72	64
GPT 5.4 Nano	34	56	57
Grok 4.3	20	47	52
Gemini 3.1 Flash Lite Preview	3	3	37

Formula and presentation checks pass far more often than numerical checks. A formula only has to point at the right input cells to pass, but if any of those inputs is wrong it still computes the wrong value and fails the numerical checks.
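The gap can be illustrated with a toy three-cell chain. This is a hypothetical sketch, not the benchmark's grader: the cell names, the values, and the 0.5% tolerance are all assumptions made for illustration.

```python
def recalc(inputs):
    """Recompute the chained cells from the hardcoded inputs."""
    v = dict(inputs)
    v["ebit"] = v["revenue"] * v["margin"]
    v["tax"] = v["ebit"] * v["tax_rate"]
    v["net_income"] = v["ebit"] - v["tax"]
    return v


# Which cells each formula points at (the "wiring").
gold_refs = {
    "ebit": {"revenue", "margin"},
    "tax": {"ebit", "tax_rate"},
    "net_income": {"ebit", "tax"},
}
sub_refs = dict(gold_refs)  # the submission wires every formula correctly

gold = recalc({"revenue": 1000.0, "margin": 0.30, "tax_rate": 0.25})
sub = recalc({"revenue": 1000.0, "margin": 0.35, "tax_rate": 0.25})  # one wrong input

TOLERANCE = 0.005  # assumed relative tolerance for a numerical match

formula_pass = {c: sub_refs[c] == gold_refs[c] for c in gold_refs}
numerical_pass = {
    c: abs(sub[c] - gold[c]) <= TOLERANCE * abs(gold[c]) for c in gold_refs
}

print("formula checks passed:  ", sum(formula_pass.values()), "of", len(gold_refs))
print("numerical checks passed:", sum(numerical_pass.values()), "of", len(gold_refs))
```

A single mistyped margin leaves all three formula checks passing while every downstream value misses its target, which is the pattern the table shows at scale.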

Template vs. Scratch

Each model's score under the two grading modes. Template is scored by exact cell match and scratch by a partial-credit rubric.

For all but the strongest models, Scratch mode scores beat Template mode scores, and the gap is widest for the weakest models. Scratch mode measures financial content without requiring a rigid structure, whereas Template mode additionally requires matching the gold model's structure cell for cell.

Generated tokens per turn

Models ranked by overall score, with the average tokens each generates per agent turn and a sparkline of how that varies from the start to the end of a run. The sparklines share a common scale, so taller means more output per turn.

Model	Tokens / turn
1. Claude Opus 4.8	4.6k
2. Claude Sonnet 5	2.0k
3. GPT 5.5	1.8k
4. Gemini 3.5 Flash	1.7k
5. GLM 5.2	3.1k
6. Claude Sonnet 4.6	3.9k
7. Kimi K2.6	1.6k
8. Qwen 3.7 Max	2.7k
9. MiMo V2.5 Pro	3.0k
10. Gemini 3.1 Pro Preview (02/26)	785
11. DeepSeek V4	1.8k
12. Qwen 3.7 Plus	2.7k
13. MiniMax-M3	1.3k
14. GPT 5.4 Mini	3.4k
15. GPT 5.4 Nano	1.3k
16. Grok 4.3	1.3k
17. Gemini 3.1 Flash Lite Preview	501

Generation is front-loaded: models write the bulk of the workbook early, then taper to smaller edits and checks. Claude Opus 4.8 generates the most per turn of any model (4.6k tokens, ahead of Claude Sonnet 4.6 at 3.9k), and its output peaks a few turns in rather than at the very start.

Agents work the task almost entirely through two tools: bash to read the input spreadsheets and write the workbook, and execute to recalculate and verify it. The chart below reports the average turns, tool calls, and tool-call errors per task for each model.

Tool Calls Analysis

Example Task

Sample task · Dataroom Summaries

You are an associate at a PE firm who has just received access to the customer data cube for SaaS Company XYZ. Your Partner has asked you to pull together a summary analysis to highlight key trends in the customer data. It is currently June of 2021, and you have data available from January 2018 until the May-21 LTM period. In order to properly understand this customer data, you will create a customer retention analysis that analyzes rolling LTM trends around ARR, new bookings, upsell, cross-sell, downsell, and churn. You will also highlight both gross and net retention trends. Finally, you will analyze key customer retention trends around the beginning number of customers, new customers, churned customers, end of period customers, and overall customer retention. You have been provided with a data file named “Customer Data Raw.xlsx”. Begin by importing it into your workbook. This data will be referenced and linked to. Construct the model with the following tabs: 1) Active Data, 2) Active Customers, 3) Retention, 4) Customer Summary.

Active Data tab

Starting on the active data tab, use the 0/1 toggles to allow you to switch on and off different products and geographies. For example, if Product 1 is switched to 0, all values for customers with product 1 should be zeroed out. For purposes of this model, assume each toggle is active (switched to 1). Additionally, use this tab to calculate new logo, cross-sell, upsell, downsell, and churn amounts for each customer in each YoY period in dollar figures (e.g. Jan-19 new logo represents YoY change between Jan-18 and Jan-19). Inherit the table layout of the imported raw data with new sections created to the right for new logo, cross-sell, upsell, downsell, and churn. New logos and churn should be identified at the customer level, not the product level.

Active Customers tab

Create a section that depicts the unique customer list of the active data tab and illustrates customer-level ARR across the data provided. Then calculate new logo and churn YoY for Jan-19 through May-21 periods (in dollar figures). Additionally, for each customer, calculate the start date, cohort (by quarter, e.g. Q1 2018), max win (largest new logo growth for each customer), and max loss (largest churn for each customer).

Retention tab

After these two tabs are set up, you can now create your ARR and customer waterfalls. Show the waterfalls on an annual basis for each month starting in January 2019 (i.e., beginning ARR for January 2019 is ARR in January 2018). Begin with the ARR waterfall, and include the bridge from beginning ARR to ending ARR, inclusive of upsell, cross-sell, downsell, churn, and new logo bookings. From this, calculate bookings $ (upsell + cross-sell + new logo bookings), % termination (churn % of BoP ARR), gross retention, and net retention. Gross retention is calculated as 1 + downsell and churn divided by BoP ARR. Net retention is calculated as 1 + the sum of downsell, upsell, cross-sell, and churn divided by BoP ARR. Finally, create a customer waterfall bridging from beginning of period customers to end of period customers through new logo and churned logo calculations. Calculate logo retention off of this build. Add difference checks (rounded to nearest whole number) to confirm your total ARR and customer count numbers tie to the active data tab.

Customer Summary tab

To further analyze the customer trends, analyze customer count by cohort monthly sale sizes ($100k+ monthly sales, $75-100k monthly sales, $50-75k monthly sales, $25-50k monthly sales, and < $25k monthly sales) and ARR by cohort date (quarterly cohorts, starting with the Q1 2018 cohort) for each month starting in January 2018. Show totals for both the cohort by monthly sales and the cohort by quarter tables. Show the cohort revenue retention % starting for cohorts from Q1 2018 through Q1 2021. Revenue retention % for each cohort should be calculated as the current month’s ARR divided by the ARR in the last month of the quarter in which the cohort started (for example, for the Q2 2018 cohort, this would be June 2018). For the Q1 2018 cohort, assume the starting month is January 2018. The revenue retention % should be shown starting in the last month of the quarter in which the cohort started and should always start at 100.0%. Add difference checks (rounded to nearest whole number) to confirm your total ARR and customer count numbers tie to the retention tab. On the same tab, add a summary of your top 1, 3, 5, 10, and 20 customers over time and highlight their ARR and % of total ARR in each period. Additionally, show the top 10 customers as of the May 2021 period and how they have performed over time, starting in January 2018. Sum the top 10 customers for total top 10 ARR. Finally, please add a summary showing the ARR by month for the customers that are responsible for the 10 largest new logo bookings and the 10 largest customer churn events over the entire data period.

Formatting

- Hardcodes are in blue fonts
- References to other tabs are in green fonts
- Formulas using only data on a given tab are in black fonts
- Checks are in red fonts
- All numbers except percentages are rounded to the nearest whole number
- Percentages are displayed rounded to one decimal
- Percentages are italicized
- Negative numbers are in parentheses
- Customer Name (ID) has "Customer" as a prefix
- Product (ID) has "Product" as a prefix
- All dollar figures contain a "$" prefix

Sample rubric checks

Numerical: The Q1 2018 cohort's ARR by quarter in Dec-19 equals $24,640.
Formula: Net Retention % equals 1 + the sum of upsell, cross-sell, downsell, and churn all divided by BoP ARR.
Presentation: Customer Name (ID) has 'Customer' as a prefix
Presentation: Input tabs contain a source citation in the top-left corner
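The retention formulas in the sample task, including the net retention formula check above, can be written out as a short worked example. The dollar amounts are invented for illustration and are not taken from the task data. Downsell and churn are entered as negative amounts, which is the sign convention the "1 +" in the formulas implies; the function name `retention_metrics` is hypothetical.

```python
def retention_metrics(bop_arr, new_logo, upsell, cross_sell, downsell, churn):
    """ARR bridge and retention ratios; downsell and churn are negative amounts."""
    eop_arr = bop_arr + new_logo + upsell + cross_sell + downsell + churn
    return {
        "eop_arr": eop_arr,
        "bookings": upsell + cross_sell + new_logo,
        # Churn as a share of beginning-of-period ARR (negative under this convention).
        "termination_pct": churn / bop_arr,
        "gross_retention": 1 + (downsell + churn) / bop_arr,
        "net_retention": 1 + (upsell + cross_sell + downsell + churn) / bop_arr,
    }


# Illustrative figures only.
m = retention_metrics(
    bop_arr=1_000_000,
    new_logo=150_000,
    upsell=80_000,
    cross_sell=40_000,
    downsell=-30_000,
    churn=-70_000,
)

print(f"Ending ARR:      ${m['eop_arr']:,.0f}")
print(f"Bookings:        ${m['bookings']:,.0f}")
print(f"Gross retention: {m['gross_retention']:.1%}")
print(f"Net retention:   {m['net_retention']:.1%}")
```

New logo bookings enter the ARR bridge and the bookings total but are excluded from both retention ratios, since retention measures only what happens to the ARR that existed at the beginning of the period.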

Dataset

The dataset comprises 103 expert-authored, peer-reviewed tasks, each paired with a gold-standard Excel model and a detailed grading rubric.

Each task is evaluated in both Template mode and Scratch mode. The dataset is divided into three parts: Public (1 open-sourced task), Private Validation (51 tasks available for license), and Test (51 tasks). All results reported on this page are based solely on the private, held-out Test set to prevent overfitting.

Task Taxonomy

EMB organizes tasks into seven model families reflecting real investment banking and private equity workflows.

1. LBO Models

Models the full equity story of a leveraged buyout from entry to exit, capturing the key levers of value creation. Goes beyond a basic debt schedule to incorporate bolt-on acquisitions, dividend recapitalizations, multi-tranche capital structures with PIK components, and sensitivity analysis on returns.

2. DCF Models

Intrinsic valuations that go beyond a single-entity discount, typically using a Sum-of-the-Parts approach. Each business segment is valued independently with its own cash flow profile, discount rate, and terminal value methodology.

3. M&A Models

Live-deal simulations focused on the mechanics of combining two financial entities. Covers the full range of transaction structures (stock, cash, mixed), pro-forma consolidation with purchase accounting adjustments, and contribution analysis showing each party’s economic share versus ownership received.

4. Operating Models

Deep dives into the underlying business rather than the transaction. Built from the bottom up with granular revenue drivers, department-level cost structures, and a fully linked balance sheet with working capital circularity.

5. Comps Spreads

Deal-level comparable company analysis with precise enterprise value builds for each comp. Emphasizes normalization (adjusting for lease liabilities, non-recurring items, and calendarization) to arrive at clean, apples-to-apples valuation multiples.

6. Dataroom Summaries

Post-dataroom analysis that validates the investment thesis through attribution and unit economics. Covers customer concentration, cohort-based retention metrics, revenue segmentation and bridging, and margin analysis with clear GAAP-to-adjusted reconciliation.

7. Misc. Models

A catch-all for the long tail of niche analyses beyond the six core families. These tasks test whether a model has learned general modeling principles rather than just the common templates.

13-Week Cash Flow

Short-term liquidity model built to manage a business through constrained conditions. Tracks weekly cash receipts and disbursements with high granularity, incorporating vendor payment timing, payroll cycles, and restructuring actions. Emphasizes variance analysis versus forecast, covenant visibility, and minimum liquidity thresholds to inform stakeholder negotiations and operational decision-making.

Equity Waterfall Returns

Models the distribution of proceeds across stakeholders at exit, capturing the full capital stack hierarchy. Allocates value through debt seniority, liquidation preferences, participation features, and conversion mechanics to show realized outcomes by tranche. Highlights breakpoints, IRRs, and MOICs for each class, with sensitivity to exit value and structure.

Project Finance Models

Asset-level valuation frameworks focused on long-duration infrastructure investments. Builds cash flows from contracted revenues, operating costs, and financing structures, incorporating tax equity, sculpted debt, and reserve accounts. Centers on NPV, IRR, and DSCR analysis under multiple scenarios, with detailed treatment of construction periods and ramp-up dynamics.

Multi-Tranche Returns Models

Evaluates investments with layered return structures across different instruments and stakeholders. Integrates preferred returns, warrants, options, and GP/LP promote tiers into a unified framework that captures timing and priority of payouts. Emphasizes scenario-driven outcomes, showing how returns shift across tranches under varying performance and exit assumptions.

Methodology

Agents are evaluated on a shared harness, and nearly all of their work runs through two tools. bash is the primary tool, used to read data files, inspect the workbook, and write the Excel model. execute opens the workbook and recalculates its formulas, letting the agent run and verify the model as it builds. The remaining tools cover data gathering and are used sparingly, because most tasks ship with the data a model needs. Web search and SEC filing search pull in outside information, price history retrieves market data, and retrieve and parse read supporting web pages. A submit tool finalizes the workbook for grading.

Each run is capped at an intentionally generous 3.5 hours of wall-clock time, though in practice models rarely come close.

Each task is written by a financial expert, who first builds a Gold model: a complete, correct reference implementation of the task specification. A Template model is then constructed by stripping the Gold model of all values and formulas, leaving only the workbook structure as a starting point. In Template mode, each agent must fill in the Template workbook with the correct formulas and values, and the submission is scored on the proportion of graded cells whose values match the Gold workbook.
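A minimal sketch of this kind of cell-match scoring follows, assuming both workbooks have already been recalculated and flattened into dictionaries keyed by (sheet, cell address). The function name `template_score`, the dictionary layout, and the numeric tolerance are assumptions; the source does not describe how the harness compares values.

```python
import math


def _is_number(x):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(x, (int, float)) and not isinstance(x, bool)


def template_score(gold, submission, rel_tol=1e-6):
    """Fraction of graded Gold cells whose submitted value matches."""
    if not gold:
        raise ValueError("gold workbook has no graded cells")
    matched = 0
    for key, expected in gold.items():
        actual = submission.get(key)  # a missing cell counts as a mismatch
        if _is_number(expected) and _is_number(actual):
            matched += math.isclose(actual, expected, rel_tol=rel_tol, abs_tol=1e-9)
        else:
            matched += actual == expected
    return matched / len(gold)


# Hypothetical graded cells from a DCF tab.
gold = {
    ("DCF", "B5"): 1250.0,
    ("DCF", "B6"): 0.085,
    ("DCF", "B7"): "Base",
    ("DCF", "B8"): 14705.88,
}
submission = {
    ("DCF", "B5"): 1250.0,
    ("DCF", "B6"): 0.085,
    ("DCF", "B7"): "Base",
    ("DCF", "B8"): 14000.0,  # wrong terminal value
}

score = template_score(gold, submission)
print(f"Template score: {score:.0%}")
```

Because the comparison is keyed by cell address, a correct value placed in the wrong cell earns nothing, which is why this mode is stricter about structure than the rubric-based mode.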

In Scratch mode, the agent is given the same instructions but no template workbook, and must build the model from scratch. The Scratch submission is scored on the fraction of rubric checks passed across numerical accuracy, formula correctness, and presentation (IB conventions like color-coding and number formats, plus source citations). Each rubric check is evaluated by an agentic LLM judge, GPT 5.4 Mini run at xhigh reasoning effort, which inspects the submitted workbook and returns a pass or fail with its reasoning.
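As a rough sketch of how pass/fail verdicts could roll up into a Scratch score and a per-type breakdown, the snippet below treats every rubric check as equally weighted. That weighting, the verdict list, and the function name `scratch_score` are assumptions for illustration and do not come from the harness.

```python
from collections import defaultdict

# Hypothetical judge verdicts for one submission: (check type, passed).
verdicts = [
    ("numerical", True),
    ("numerical", False),
    ("numerical", False),
    ("numerical", True),
    ("formula", True),
    ("formula", True),
    ("presentation", True),
    ("presentation", False),
]


def scratch_score(verdicts):
    """Return (overall pass rate, pass rate by check type)."""
    totals, passes = defaultdict(int), defaultdict(int)
    for kind, passed in verdicts:
        totals[kind] += 1
        passes[kind] += int(passed)
    by_type = {kind: passes[kind] / totals[kind] for kind in totals}
    overall = sum(passes.values()) / sum(totals.values())
    return overall, by_type


overall, by_type = scratch_score(verdicts)
print(f"overall: {overall:.1%}")
for kind, rate in by_type.items():
    print(f"{kind}: {rate:.0%}")
```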

In both modes, a subset of tasks require the model to be driven by case toggles, such as Base, Bull, and Bear scenarios, with the final model evaluated for correctness across all cases.
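The idea of a single toggle driving the whole model can be sketched outside Excel as follows. The case assumptions, the function name `build_model`, and the three-year projection are hypothetical; in a real workbook the selector would be one input cell that every assumption row reads from.

```python
# Hypothetical scenario assumptions; only the selected case changes between runs.
CASES = {
    "Base": {"growth": 0.05, "margin": 0.20},
    "Bull": {"growth": 0.08, "margin": 0.22},
    "Bear": {"growth": 0.02, "margin": 0.18},
}


def build_model(active_case, base_revenue=100.0, years=3):
    """Project revenue and EBITDA using the assumptions of the selected case."""
    assumptions = CASES[active_case]  # the single toggle input
    revenue, level = [], base_revenue
    for _ in range(years):
        level *= 1 + assumptions["growth"]
        revenue.append(level)
    ebitda = [r * assumptions["margin"] for r in revenue]
    return {"revenue": revenue, "ebitda": ebitda}


# Rebuild the model under every case, as an evaluation across all cases would.
results = {case: build_model(case) for case in CASES}
for case, out in results.items():
    print(case, round(out["revenue"][-1], 2), round(out["ebitda"][-1], 2))
```

Every downstream line reads its assumptions through the one selector, so no output can be left behind on a stale case when the toggle changes.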

Each submission is graded by first recalculating all formulas through a real Microsoft Excel engine. Accuracy is the category mean: tasks are averaged within each of the seven categories, and the seven category scores are weighted equally.
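The category mean can be sketched as a two-level average. The per-task scores below are invented to show why it differs from a plain average over tasks, and the function name `category_mean_accuracy` is hypothetical. The last lines apply the same equal weighting to the rounded Claude Opus 4.8 row from the Performance by category table.

```python
from statistics import mean


def category_mean_accuracy(task_scores):
    """task_scores maps category -> list of per-task scores, in percent."""
    category_scores = {c: mean(s) for c, s in task_scores.items()}
    return mean(list(category_scores.values())), category_scores


# Invented example with unequal task counts per category.
task_scores = {
    "LBO": [40, 60],
    "Dataroom": [90, 80, 85, 85],
}
overall, per_category = category_mean_accuracy(task_scores)
pooled = mean(s for scores in task_scores.values() for s in scores)

print(round(overall, 1))  # 67.5: each category counts equally
print(round(pooled, 1))   # 73.3: a task-level average favors the larger category

# Rounded category scores for Claude Opus 4.8, copied from the category table.
opus_row = [54, 66, 80, 67, 73, 86, 59]
print(round(mean(opus_row), 1))  # 69.3
```

The 69.3 obtained from the rounded table values is close to the 69.4% headline figure; a gap of that size is consistent with the table showing rounded category scores.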

Because grading awards partial credit toward a complete model, high scores on EMB should be construed as substantial but incomplete work, not as client-ready deliverables.

Acknowledgements

We would like to thank Andrew Schettino and all of the financial experts who authored and peer-reviewed the Excel Modeling Benchmark.