TaxEval (v2) Benchmark

Rank  Model                          Accuracy  Cost (input / output)  Latency
1     GPT 5 Mini ★$                  80.1%     $0.25 / $2.00          36.78 s
2     o3                             79.0%     $2.00 / $8.00          25.18 s
3     Grok 3                         78.8%     $3.00 / $15.00         13.11 s
4     o4 Mini                        78.8%     $1.10 / $4.40          15.25 s
5     o1                             78.6%     $15.00 / $60.00        19.19 s
6     Claude 3.7 Sonnet (Thinking)   78.4%     $3.00 / $15.00         43.26 s
7     GPT 4.1 ⚡︎                     78.3%     $2.00 / $8.00          8.03 s
8     Claude Sonnet 4.5 (Thinking)   78.3%     $3.00 / $15.00         48.26 s
9     GPT 5                          78.3%     $1.25 / $10.00         79.88 s
10    GPT 4o (2024-11-20)            78.1%     $2.50 / $10.00         5.83 s

Legend: ★ Best Performing, $ Best Budget, ⚡︎ Best Speed

Key Takeaways

GPT 5 Mini, OpenAI’s latest reasoning model, takes the #1 spot with 80.1% accuracy.
Seven of the top ten models are reasoning models, highlighting the gains that reasoning capabilities bring to this task.
The top ten models are tightly clustered just below 80%, within two percentage points of each other (78.1% to 80.1%).
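
The clustering can be checked directly from the top-ten accuracies in the leaderboard. The short Python sketch below is only an illustration; the variable names are chosen here and are not part of the benchmark.

```python
# Top-ten accuracies (percent), copied from the leaderboard in rank order.
top_ten = [
    ("GPT 5 Mini", 80.1),
    ("o3", 79.0),
    ("Grok 3", 78.8),
    ("o4 Mini", 78.8),
    ("o1", 78.6),
    ("Claude 3.7 Sonnet (Thinking)", 78.4),
    ("GPT 4.1", 78.3),
    ("Claude Sonnet 4.5 (Thinking)", 78.3),
    ("GPT 5", 78.3),
    ("GPT 4o (2024-11-20)", 78.1),
]

accuracies = [accuracy for _, accuracy in top_ten]

# Gap between the best and worst model in the top ten, in percentage points.
spread = round(max(accuracies) - min(accuracies), 1)

# Lead of the #1 model over the #2 model, in percentage points.
lead = round(accuracies[0] - accuracies[1], 1)

print(f"Top-ten spread: {spread} points; lead of #1 over #2: {lead} points")
```

More than half of the total spread comes from the gap between the first and second place.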

Dataset and Context

TaxEval v2 evaluates models’ abilities to answer hard tax-related questions. This version focuses on both answer correctness and structured reasoning capabilities. This dataset was created in collaboration with financial and tax experts, who have both created and double-checked all questions and answers.

Some key features:

1,500+ total questions across validation and test sets
A balanced distribution of topics and question types
Comprehensive evaluation of both answers and reasoning steps

The benchmark consists of three main components:

Public Validation Set: 20 samples, available upon request (contact contact@vals.ai).
Private Validation Set: 300 samples for model evaluation, available for purchase to evaluate and improve models.
Test Set: 1,223 samples. These samples are never shared.
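
As a quick sanity check on the "1,500+" figure, the three component sizes can be added up. This is a minimal sketch; the dictionary keys are labels chosen here for illustration.

```python
# Sample counts for each component, as stated in the list above.
splits = {
    "public_validation": 20,
    "private_validation": 300,
    "test": 1223,
}

total = sum(splits.values())

# Show each component's size and its share of the whole benchmark.
for name, count in splits.items():
    print(f"{name}: {count} samples ({count / total:.1%} of {total})")
```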

The benchmark is composed of two tasks (each task uses the same questions, but they are evaluated differently):

Answer Correctness: The factual correctness of the answer, as compared to a ground truth.
Stepwise Reasoning: The quality and structure of the reasoning process as compared to the reasoning process of human experts. This ensures models not only provide correct answers but also demonstrate a clear thinking process.

The benchmark includes a diverse range of question types:

Application and Compliance (18.3%)
Comparative Analysis (16.2%)
Numerical Reasoning (16.7%)
Problem Solving and Critical Thinking (16.5%)
Semantic Analysis (18.0%)
Updates and Current Affairs (15.9%)
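
To give a rough sense of scale, the shares above can be applied to the 1,223-sample test set. This is a sketch under one assumption: that the listed percentages hold for the test set. The counts are approximate only, since the published shares add up to slightly more than 100%.

```python
# Published share of each question type, in percent.
shares = {
    "Application and Compliance": 18.3,
    "Comparative Analysis": 16.2,
    "Numerical Reasoning": 16.7,
    "Problem Solving and Critical Thinking": 16.5,
    "Semantic Analysis": 18.0,
    "Updates and Current Affairs": 15.9,
}

TEST_SET_SIZE = 1223  # size of the test set, from the component list

# Assumption: the shares apply to the test set as listed.
approx_counts = {
    name: round(TEST_SET_SIZE * pct / 100) for name, pct in shares.items()
}

share_total = round(sum(shares.values()), 1)

for name, count in approx_counts.items():
    print(f"{name}: about {count} questions")
print(f"Sum of published shares: {share_total}%")
```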

Each category is carefully balanced between the private validation and test sets to ensure representative sampling. The evaluation process uses Claude 3.5 Sonnet Latest as the judge for both answer correctness and stepwise reasoning assessment.
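
The two-task, judge-based setup could look roughly like the Python sketch below. Everything in it is hypothetical: the prompt wording, the `Sample` fields, the binary CORRECT/INCORRECT verdict, and the `stub_judge` stand-in are assumptions for illustration, not the benchmark's actual prompts or scoring code. In a real run, `judge` would wrap a call to the judge model.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: a judge takes a prompt and returns its raw verdict text.
Judge = Callable[[str], str]


@dataclass
class Sample:
    question: str
    reference_answer: str
    reference_reasoning: str


def build_prompt(task: str, question: str, reference: str, candidate: str) -> str:
    # Assumed prompt layout; the real judge prompts are not published in this text.
    return (
        f"Task: {task}\nQuestion: {question}\n"
        f"Reference: {reference}\nCandidate: {candidate}\n"
        "Reply with CORRECT or INCORRECT."
    )


def grade(sample: Sample, answer: str, reasoning: str, judge: Judge) -> dict[str, bool]:
    """Grade one model output on both tasks using the same question."""
    prompts = {
        "answer_correctness": build_prompt(
            "answer correctness", sample.question, sample.reference_answer, answer
        ),
        "stepwise_reasoning": build_prompt(
            "stepwise reasoning", sample.question, sample.reference_reasoning, reasoning
        ),
    }
    return {
        task: judge(prompt).strip().upper().startswith("CORRECT")
        for task, prompt in prompts.items()
    }


def stub_judge(prompt: str) -> str:
    # Stand-in for a real judge model: passes only if the reference text
    # appears verbatim in the candidate text.
    reference = prompt.split("Reference: ")[1].split("\nCandidate: ")[0]
    candidate = prompt.split("\nCandidate: ")[1].split("\nReply with")[0]
    return "CORRECT" if reference in candidate else "INCORRECT"


sample = Sample(
    question="What is the total tax on the investment income?",
    reference_answer="$18,280",
    reference_reasoning="interest tax, then collectibles tax, then NIIT",
)
result = grade(sample, "The total is $18,280.", "I only computed NIIT.", stub_judge)
print(result)
```

Keeping the judge behind a plain callable makes it easy to swap the stub for a real model call without touching the grading logic.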

Results

Model Output Examples

The hardest questions for the models typically require complex, multi-step calculations, reasoning about which laws and figures apply given the information provided, or access to more recent data.

Below is an example where only GPT 5 gets it right ($18,280), while the other models perform the calculations incorrectly. All models used a significant amount of reasoning tokens for this answer.

Q

Michael, a married taxpayer filing jointly, has an adjusted gross income (AGI) of $300,000 for 2023, excluding any investment income. During the year, he received $20,000 in interest from corporate bonds and $15,000 in interest from municipal bonds issued by his state of residence. In addition, he sold a collectible artwork for $100,000 that he had purchased 5 years ago for $60,000. Calculate Michael's tax liability related to his investment income, including any applicable taxes on capital gains and considering the Net Investment Income Tax (NIIT).

A

Response:

Michael owes a total of $18,280 in taxes related to his investment income.

CORRECT
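
The source does not show the working behind the $18,280 figure. The Python sketch below gives one decomposition that reproduces it, under stated assumptions: a 24% marginal rate on the corporate bond interest, a flat 28% rate on the collectible gain, the municipal bond interest treated as exempt, and the 3.8% NIIT applied with the $250,000 threshold for joint filers. The 24% rate in particular is an assumption made here, not something stated in the question.

```python
from decimal import Decimal

# Facts from the question.
agi_excluding_investment = Decimal("300000")
corporate_interest = Decimal("20000")
municipal_interest = Decimal("15000")  # treated as exempt, so it is not taxed below
collectible_gain = Decimal("100000") - Decimal("60000")  # sale price minus basis

# Rates and threshold used in this sketch.
ORDINARY_RATE = Decimal("0.24")      # assumption: marginal rate on the interest
COLLECTIBLES_RATE = Decimal("0.28")  # assumption: flat rate applied to the gain
NIIT_RATE = Decimal("0.038")
NIIT_THRESHOLD_MFJ = Decimal("250000")

# Net investment income excludes the exempt municipal interest.
net_investment_income = corporate_interest + collectible_gain
magi = agi_excluding_investment + net_investment_income

# NIIT applies to the lesser of net investment income and the excess of
# modified AGI over the threshold.
excess_over_threshold = max(magi - NIIT_THRESHOLD_MFJ, Decimal("0"))
niit_base = min(net_investment_income, excess_over_threshold)

tax_on_interest = corporate_interest * ORDINARY_RATE
tax_on_collectible = collectible_gain * COLLECTIBLES_RATE
niit = niit_base * NIIT_RATE

total = tax_on_interest + tax_on_collectible + niit
print(f"Interest: {tax_on_interest}, collectible: {tax_on_collectible}, NIIT: {niit}")
print(f"Total: {total}")
```

Under these assumptions the three parts are $4,800, $11,200, and $2,280, which add up to the $18,280 given in the response.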
