Task Type: TaxEval (v2) Benchmark

Last updated

| Rank | Provider | Model | Accuracy | Cost (input / output, per 1M tokens) | Latency |
|------|----------|-------|----------|--------------------------------------|---------|
| 1 | OpenAI | o3 | 79.0% | $10.00 / $40.00 | 25.18 s |
| 2 | xAI | Grok 3 Beta | 78.8% | $3.00 / $15.00 | 13.11 s |
| 3 | OpenAI | o4 Mini | 78.8% | $1.10 / $4.40 | 15.25 s |
| 4 | OpenAI | o1 | 78.6% | $15.00 / $60.00 | 19.19 s |
| 5 | Anthropic | Claude 3.7 Sonnet (Thinking) | 78.4% | $3.00 / $15.00 | 43.26 s |
| 6 | OpenAI | GPT 4.1 | 78.4% | $2.00 / $8.00 | 8.03 s |
| 7 | OpenAI | GPT 4o (2024-11-20) | 78.1% | $2.50 / $10.00 | 5.83 s |
| 8 | Google | Gemini 2.5 Pro Exp | 77.1% | $1.25 / $10.00 | 22.04 s |
| 9 | DeepSeek | DeepSeek R1 | 76.7% | $8.00 / $8.00 | 162.88 s |
| 10 | xAI | Grok 3 Mini Fast Beta Low Reasoning | 76.5% | $0.60 / $4.00 | 11.74 s |
| 11 | Anthropic | Claude 3.7 Sonnet | 75.9% | $3.00 / $15.00 | 6.44 s |
| 12 | xAI | Grok 3 Mini Fast Beta High Reasoning | 75.7% | $0.60 / $4.00 | 24.96 s |
| 13 | DeepSeek | DeepSeek V3 (03/24/2025) | 75.2% | $1.20 / $1.20 | 31.71 s |
| 14 | OpenAI | GPT 4.1 mini | 75.0% | $0.40 / $1.60 | 5.91 s |
| 15 | OpenAI | GPT 4o (2024-08-06) | 75.0% | $2.50 / $10.00 | 9.54 s |
| 16 | OpenAI | o3 Mini | 73.9% | $1.10 / $4.40 | 81.08 s |
| 17 | Google | Gemini 2.0 Flash Thinking Exp | 73.8% | $0.10 / $0.70 | 12.80 s |
| 18 | Anthropic | Claude 3.5 Sonnet Latest | 73.7% | $3.00 / $15.00 | 5.94 s |
| 19 | Google | Gemini 2.0 Flash Exp | 72.4% | $0.07 / $0.30 | 7.51 s |
| 20 | DeepSeek | DeepSeek V3 | 72.0% | $0.90 / $0.90 | 25.76 s |
| 21 | xAI | Grok 2 | 71.4% | $2.00 / $10.00 | 10.63 s |
| 22 | Google | Gemini 2.0 Pro Exp | 71.0% | $1.25 / $5.00 | 9.14 s |
| 23 | Google | Gemini 2.0 Flash (001) | 70.5% | $0.10 / $0.40 | 5.72 s |
| 24 | Meta | Llama 4 Maverick | 69.3% | $0.27 / $0.85 | 28.18 s |
| 25 | Mistral | Mistral Large (11/2024) | 67.7% | $2.00 / $6.00 | 12.00 s |
| 26 | Cohere | Command A | 66.6% | $2.50 / $10.00 | 10.13 s |
| 27 | Meta | Llama 3.1 Instruct Turbo (405B) | 66.3% | $3.50 / $3.50 | 24.91 s |
| 28 | OpenAI | GPT 4.1 nano | 66.1% | $0.10 / $0.40 | 3.04 s |
| 29 | AI21 Labs | Jamba 1.6 Large | 65.3% | $2.00 / $8.00 | 16.30 s |
| 30 | OpenAI | GPT 4o Mini | 64.9% | $0.15 / $0.60 | 9.00 s |
| 31 | Google | Gemini 1.5 Pro (002) | 64.9% | $1.25 / $5.00 | 9.91 s |
| 32 | Meta | Llama 3.3 Instruct Turbo (70B) | 63.9% | $0.88 / $0.88 | 3.86 s |
| 33 | Anthropic | Claude 3.5 Haiku Latest | 63.0% | $1.00 / $5.00 | 5.00 s |
| 34 | Mistral | Mistral Small 3.1 (03/2025) | 62.9% | $0.07 / $0.30 | 7.95 s |
| 35 | AI21 Labs | Jamba 1.5 Large | 62.7% | $2.00 / $8.00 | 20.32 s |
| 36 | Meta | Llama 3.1 Instruct Turbo (70B) | 61.1% | $0.88 / $0.88 | 4.39 s |
| 37 | Meta | Llama 4 Scout | 59.0% | $0.18 / $0.59 | 7.09 s |
| 38 | Mistral | Mistral Small (02/2024) | 54.1% | $0.20 / $0.60 | 8.97 s |
| 39 | Google | Gemini 1.5 Flash (002) | 53.4% | $0.07 / $0.30 | 3.74 s |
| 40 | AI21 Labs | Jamba 1.6 Mini | 50.3% | $0.20 / $0.40 | 4.39 s |
| 41 | AI21 Labs | Jamba 1.5 Mini | 46.9% | $0.20 / $0.40 | 4.82 s |
| 42 | Meta | Llama 3.1 Instruct Turbo (8B) | 39.0% | $0.18 / $0.18 | 2.45 s |

Key Takeaways


Dataset

TaxEval v2 is our latest benchmark for evaluating models' ability to handle tax-related questions. It was built from scratch rather than extending our previous TaxEval dataset, and it evaluates both answer correctness and structured reasoning. The dataset was created in collaboration with financial and tax experts, who both wrote and double-checked every question and answer.

Some key features:

  • 1,500+ total questions across validation and test sets
  • Balanced distribution of topics and question types
  • Comprehensive evaluation of both answers and reasoning steps

The benchmark consists of three main components:

  • Public Validation Set: 20 samples, publicly accessible (to request access, please contact contact@vals.ai).
  • Private Validation Set: 300 samples for model evaluation, available for purchase to evaluate and improve models.
  • Test Set: 1,223 samples. These samples are never shared.

The benchmark comprises two tasks (both use the same questions, but are evaluated differently):

  1. Answer Correctness: The factual correctness of the answer.
  2. Stepwise Reasoning: The quality and structure of the reasoning process is assessed, ensuring models not only provide correct answers but also demonstrate clear logical steps.

The benchmark includes a diverse range of question types:

  • Application and Compliance (18.3%)
  • Comparative Analysis (16.2%)
  • Numerical Reasoning (16.7%)
  • Problem Solving and Critical Thinking (16.5%)
  • Semantic Analysis (18.0%)
  • Updates and Current Affairs (15.9%)

Each category is carefully balanced between the private validation and test sets to ensure representative sampling. The evaluation process uses Claude 3.5 Sonnet Latest as judge for both answer correctness and stepwise reasoning assessment.
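As a rough illustration, an LLM-as-judge grading loop along these lines could look like the sketch below. This is hypothetical: `call_judge` stands in for a request to the judge model (Claude 3.5 Sonnet Latest in the benchmark), the verdict format is invented for illustration, and the actual evaluation harness is not public.

```python
# Hypothetical sketch of a two-task LLM-as-judge grading loop.
# call_judge is a stub standing in for a real request to the judge model,
# so the sketch is self-contained and runnable.

def call_judge(question, reference, response, rubric):
    # Placeholder: a real implementation would send a grading prompt to
    # the judge model and parse a structured verdict from its reply.
    return {"answer_correct": True, "reasoning_score": 4}

def grade(samples):
    """Score answer correctness and stepwise reasoning over a sample set."""
    correct, reasoning_total = 0, 0
    for s in samples:
        verdict = call_judge(s["q"], s["ref"], s["resp"],
                             rubric="answer correctness + stepwise reasoning")
        correct += verdict["answer_correct"]          # task 1: factual answer
        reasoning_total += verdict["reasoning_score"]  # task 2: reasoning quality
    n = len(samples)
    return correct / n, reasoning_total / n  # (accuracy, mean reasoning score)

samples = [{"q": "...", "ref": "...", "resp": "..."}] * 3
acc, reasoning = grade(samples)
print(acc, reasoning)  # 1.0 4.0 with the stubbed judge
```

The key design point is that the same model responses are scored twice against different rubrics, so a model can be rewarded for a correct answer while still being penalized for opaque or illogical intermediate steps.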


Overall Results

We can see a trend of OpenAI's and Anthropic's competitors (Google, xAI, Mistral, etc.) closing the gap and reaching performance similar to that of the top models.

[Chart: TaxEval results]

We can also see that although GPT-4o (2024-11-20) sits near the top of the chart, Gemini 2.0 Flash is not far behind while being far cheaper. The chart is also dominated by closed-source models: while it is true that more closed-source models were benchmarked, the open-weight Llama models managed only middling performance.




Model highlights

o1

Release date: 12/17/2024
Accuracy: 78.6%
Latency: 19.2 s
Cost: $15.00 / $60.00

  • o1 remains one of the strongest models on this benchmark, but at the highest price point on the leaderboard.
  • Grok 3 Beta edges it out on accuracy (78.8% vs. 78.6%) while demonstrating significantly better latency and cost efficiency.


DeepSeek R1

Release date: 1/20/2025
Accuracy: 76.7%
Latency: 162.9 s
Cost: $8.00 / $8.00

  • The Chinese-trained DeepSeek R1 model has been a revelation for the AI community, establishing itself as a formidable competitor to closed-source models while being completely open source.
  • Despite these impressive results, competing closed-source models have since evolved and now outperform DeepSeek R1.



Model Output Examples

The hardest questions for the models typically require complex multi-step calculations, reasoning about which laws and figures apply given the information, or access to more recent data (which ties directly to the model's training cutoff).

Below is an example where only GPT-4o (2024-11-20) answers correctly, while the other models perform the calculations incorrectly. Notably, the models need lengthy reasoning to reach the answer, which makes the output token limit significant here: cutting generation off early may not leave the model enough room to arrive at the answer.

Q

Michael, a married taxpayer filing jointly, has an adjusted gross income (AGI) of $300,000 for 2023, excluding any investment income. During the year, he received $20,000 in interest from corporate bonds and $15,000 in interest from municipal bonds issued by his state of residence. In addition, he sold a collectible artwork for $100,000 that he had purchased 5 years ago for $60,000. Calculate Michael's tax liability related to his investment income, including any applicable taxes on capital gains and considering the Net Investment Income Tax (NIIT).

A

Response:

Michael owes a total of $18,280 in taxes related to his investment income.

CORRECT
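For reference, the expected figure decomposes cleanly into three pieces: ordinary tax on the corporate bond interest (municipal bond interest is federally exempt), the 28% collectibles rate on the artwork gain, and the 3.8% NIIT. Below is a minimal sketch of the arithmetic, assuming this taxpayer's ordinary marginal rate is 24% (our assumption; the 28% collectibles cap and the $250,000 married-filing-jointly NIIT threshold are statutory 2023 figures).

```python
# Sketch of the expected TaxEval calculation for this question.
corporate_interest = 20_000            # taxable at ordinary rates
municipal_interest = 15_000            # exempt from federal income tax
collectible_gain = 100_000 - 60_000    # long-term gain on the artwork

ordinary_rate = 0.24       # assumed marginal bracket for this taxpayer
collectible_rate = 0.28    # collectibles gains capped at 28%
niit_rate = 0.038          # Net Investment Income Tax
magi_threshold = 250_000   # 2023 MFJ threshold for the NIIT

interest_tax = corporate_interest * ordinary_rate      # $4,800
collectible_tax = collectible_gain * collectible_rate  # $11,200

# NIIT applies to the lesser of net investment income and MAGI over the
# threshold; municipal interest is excluded from net investment income.
net_investment_income = corporate_interest + collectible_gain  # $60,000
magi = 300_000 + net_investment_income                         # $360,000
niit = niit_rate * min(net_investment_income, magi - magi_threshold)  # $2,280

total = interest_tax + collectible_tax + niit
print(f"${total:,.0f}")  # $18,280
```

Getting each of these three components right, in order, is exactly the kind of multi-step dependency that the stepwise reasoning task is designed to probe.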
