TaxEval (v2) Benchmark

Key Takeaways

  • o1 (2024-12-17) sets a new state of the art on the TaxEval dataset, demonstrating strong numerical reasoning ability.
  • Claude 3.7 Sonnet (Thinking) takes 2nd place, a significant improvement over Claude 3.5.
  • DeepSeek R1 showed exceptional performance, narrowing the gap between open-source models and their more expensive closed-source alternatives.
  • Gemini 2.0 Flash Exp and Grok 2, the latest models from Google and xAI, show strong performance, taking 5th and 6th place.

Dataset

TaxEval v2 is our latest benchmark for evaluating models’ ability to handle tax-related questions. It is a completely new benchmark, separate from our previous TaxEval dataset. This version focuses on both answer correctness and structured reasoning capabilities. The dataset was created in collaboration with financial and tax experts, who wrote and double-checked all questions and answers.

Some key features:

  • 1,500+ total questions across validation and test sets
  • Balanced distribution of topics and question types
  • Comprehensive evaluation of both answers and reasoning steps

The benchmark consists of three main components:

  • Public Validation Set: 20 samples, publicly accessible (to request access, please contact contact@vals.ai).
  • Private Validation Set: 300 samples for model evaluation, available for purchase to evaluate and improve models.
  • Test Set: 1,223 samples. These samples are never shared.

The benchmark is composed of two tasks (both use the same questions but are evaluated differently):

  1. Answer Correctness: The factual correctness of the answer.
  2. Stepwise Reasoning: The quality and structure of the reasoning process, checking that models not only provide correct answers but also demonstrate clear logical steps.

The benchmark includes a diverse range of question types:

  • Application and Compliance (18.3%)
  • Comparative Analysis (16.2%)
  • Numerical Reasoning (16.7%)
  • Problem Solving and Critical Thinking (16.5%)
  • Semantic Analysis (18.0%)
  • Updates and Current Affairs (15.9%)

Each category is carefully balanced between the private validation and test sets to ensure representative sampling. The evaluation process uses Claude 3.5 Sonnet Latest as the judge for both answer correctness and stepwise reasoning assessment.
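As a rough illustration of how a judge-based evaluation over the two tasks could be wired up, here is a minimal sketch. Everything in it is an assumption for illustration: the `Sample` fields, the task names, the binary pass/fail verdicts, and the `toy_judge` stand-in are hypothetical and do not describe the benchmark's actual prompts, scoring scale, or pipeline. In practice the judge would be a call to an LLM rather than a string check.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Sample:
    """One evaluated item (hypothetical field names)."""
    question: str
    reference_answer: str
    model_response: str


# A judge maps (task name, sample) to a pass/fail verdict.
# Binary verdicts are an assumption made for this sketch.
Judge = Callable[[str, Sample], bool]

# Hypothetical task identifiers mirroring the two tasks described above.
TASKS = ("answer_correctness", "stepwise_reasoning")


def evaluate(samples: Sequence[Sample], judge: Judge) -> dict[str, float]:
    """Score the same samples once per task and return the pass rate per task."""
    scores: dict[str, float] = {}
    for task in TASKS:
        verdicts = [judge(task, sample) for sample in samples]
        scores[task] = sum(verdicts) / len(verdicts) if verdicts else 0.0
    return scores


def toy_judge(task: str, sample: Sample) -> bool:
    """Deterministic stand-in for an LLM judge, used only to exercise evaluate()."""
    if task == "answer_correctness":
        # Pass if the reference answer appears verbatim in the response.
        return sample.reference_answer in sample.model_response
    # For reasoning, pass if the response has at least two non-empty lines.
    steps = [line for line in sample.model_response.splitlines() if line.strip()]
    return len(steps) >= 2


if __name__ == "__main__":
    demo = [
        Sample("q1", "$18,280", "Step 1: interest\nStep 2: gain\nTotal: $18,280"),
        Sample("q2", "$18,280", "$17,000"),
    ]
    print(evaluate(demo, toy_judge))
```

The point of the design is that both tasks reuse the same samples and differ only in what the judge is asked to assess, so swapping the toy judge for a real model call leaves `evaluate` unchanged.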


Overall Results

We can see a trend of OpenAI’s and Anthropic’s competitors (Google, xAI, Mistral, etc.) closing the gap and reaching performance similar to that of the top models.

[Chart: TaxEval]

We can also see that although GPT-4o is the best-performing model, Gemini 2.0 Flash is not far behind and is far cheaper. We also notice that the chart is generally dominated by closed-source models: although more closed-source models were benchmarked, the Llama models showed only middling performance.


[Chart: TaxEval]


Model highlights

o1

Release date: 12/17/2024
Accuracy: 78.6%
Latency: 19.2s
Cost: $15.00 / $60.00

  • o1 builds on OpenAI's new reasoning capabilities and shows their value when applied to tax questions.
  • This top performance comes at a high price: o1 is one of the slowest and most expensive models available.


DeepSeek R1

Release date: 1/20/2025
Accuracy: 76.7%
Latency: 162.9s
Cost: $8.00 / $8.00

  • A bombshell for the AI community, the Chinese-trained DeepSeek R1 model proves a worthy competitor to OpenAI's o1 model. It delivers comparable performance while being completely open source.
  • Companies looking to adopt R1 should avoid using the DeepSeek platform directly, which stores all data sent to it. Instead, the model can be run through inference providers such as Fireworks AI, Together AI, or cloud platforms.



Model Output Examples

The hardest questions for the models typically require complex, multi-step calculations, reasoning about which laws and figures apply given the information, or access to more recent data (which is directly tied to the training cutoff).

Below is an example where only GPT-4o (2024-11-20) gets it right, while the other models perform the calculations incorrectly. Note also that the models need long reasoning to reach the answer, which makes the output token limit significant here: forcing an early cutoff may not leave the model enough room to reach the answer.

Q

Michael, a married taxpayer filing jointly, has an adjusted gross income (AGI) of $300,000 for 2023, excluding any investment income. During the year, he received $20,000 in interest from corporate bonds and $15,000 in interest from municipal bonds issued by his state of residence. In addition, he sold a collectible artwork for $100,000 that he had purchased 5 years ago for $60,000. Calculate Michael's tax liability related to his investment income, including any applicable taxes on capital gains and considering the Net Investment Income Tax (NIIT).

A

Response:

Michael owes a total of $18,280 in taxes related to his investment income.

CORRECT
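The response above gives only the final figure. The sketch below shows one way the $18,280 total can be reproduced from the numbers in the question. The rates are assumptions, not a derivation taken from the benchmark: a 24% marginal rate on the corporate bond interest, the 28% collectibles rate applied to the full gain, municipal bond interest treated as federally tax-exempt and excluded from net investment income, and the 3.8% NIIT with the $250,000 threshold for married filing jointly. The function and constant names are illustrative.

```python
# Figures taken from the question text.
CORPORATE_INTEREST = 20_000
MUNICIPAL_INTEREST = 15_000  # assumed federally tax-exempt, so not used below
SALE_PRICE = 100_000
PURCHASE_PRICE = 60_000
AGI_EXCLUDING_INVESTMENTS = 300_000

# Rates in basis points (1 bp = 0.01%) so the arithmetic stays in integers.
ORDINARY_RATE_BPS = 2_400      # assumption: 24% marginal bracket
COLLECTIBLES_RATE_BPS = 2_800  # assumption: 28% applied to the whole gain
NIIT_RATE_BPS = 380            # 3.8% Net Investment Income Tax
NIIT_THRESHOLD_MFJ = 250_000   # NIIT threshold, married filing jointly


def tax(amount: int, rate_bps: int) -> int:
    """Tax in whole dollars for a rate given in basis points."""
    return amount * rate_bps // 10_000


def investment_tax_liability() -> dict[str, int]:
    """Break the investment-related tax into its three components plus a total."""
    gain = SALE_PRICE - PURCHASE_PRICE
    # Net investment income: taxable interest plus the capital gain.
    net_investment_income = CORPORATE_INTEREST + gain
    magi = AGI_EXCLUDING_INVESTMENTS + net_investment_income
    # NIIT applies to the lesser of NII and the excess of MAGI over the threshold.
    niit_base = min(net_investment_income, max(0, magi - NIIT_THRESHOLD_MFJ))
    parts = {
        "interest_tax": tax(CORPORATE_INTEREST, ORDINARY_RATE_BPS),
        "collectibles_tax": tax(gain, COLLECTIBLES_RATE_BPS),
        "niit": tax(niit_base, NIIT_RATE_BPS),
    }
    parts["total"] = sum(parts.values())
    return parts


if __name__ == "__main__":
    for name, value in investment_tax_liability().items():
        print(f"{name}: ${value:,}")
```

Under these assumptions the components are $4,800 on the interest, $11,200 on the collectibles gain, and $2,280 of NIIT, which sum to the $18,280 in the response. The number of dependent steps (excluding the municipal interest, picking the collectibles rate, computing the NIIT base) illustrates why a single slip leads to a wrong total.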
