Task Type: TaxEval (v2) Benchmark

Last updated

| Rank | Provider | Model | Accuracy | Cost (input / output, per 1M tokens) | Latency |
|------|----------|-------|----------|--------------------------------------|---------|
| 1 | OpenAI | o3 | 79.0% | $10.00 / $40.00 | 25.18 s |
| 2 | xAI | Grok 3 Beta | 78.8% | $3.00 / $15.00 | 13.11 s |
| 3 | OpenAI | o4 Mini | 78.8% | $1.10 / $4.40 | 15.25 s |
| 4 | OpenAI | o1 | 78.6% | $15.00 / $60.00 | 19.19 s |
| 5 | Anthropic | Claude 3.7 Sonnet (Thinking) | 78.4% | $3.00 / $15.00 | 43.26 s |
| 6 | OpenAI | GPT 4.1 | 78.4% | $2.00 / $8.00 | 8.03 s |
| 7 | OpenAI | GPT 4o (2024-11-20) | 78.1% | $2.50 / $10.00 | 5.83 s |
| 8 | Google | Gemini 2.5 Pro Exp | 77.1% | $1.25 / $10.00 | 22.04 s |
| 9 | DeepSeek | DeepSeek R1 | 76.7% | $8.00 / $8.00 | 162.88 s |
| 10 | xAI | Grok 3 Mini Fast Beta Low Reasoning | 76.5% | $0.60 / $4.00 | 11.74 s |
| 11 | Anthropic | Claude 3.7 Sonnet | 75.9% | $3.00 / $15.00 | 6.44 s |
| 12 | xAI | Grok 3 Mini Fast Beta High Reasoning | 75.7% | $0.60 / $4.00 | 24.96 s |
| 13 | DeepSeek | DeepSeek V3 (03/24/2025) | 75.2% | $1.20 / $1.20 | 31.71 s |
| 14 | OpenAI | GPT 4.1 mini | 75.0% | $0.40 / $1.60 | 5.91 s |
| 15 | OpenAI | GPT 4o (2024-08-06) | 75.0% | $2.50 / $10.00 | 9.54 s |
| 16 | OpenAI | o3 Mini | 73.9% | $1.10 / $4.40 | 81.08 s |
| 17 | Google | Gemini 2.0 Flash Thinking Exp | 73.8% | $0.10 / $0.70 | 12.80 s |
| 18 | Anthropic | Claude 3.5 Sonnet Latest | 73.7% | $3.00 / $15.00 | 5.94 s |
| 19 | Google | Gemini 2.0 Flash Exp | 72.4% | $0.07 / $0.30 | 7.51 s |
| 20 | DeepSeek | DeepSeek V3 | 72.0% | $0.90 / $0.90 | 25.76 s |
| 21 | xAI | Grok 2 | 71.4% | $2.00 / $10.00 | 10.63 s |
| 22 | Google | Gemini 2.0 Pro Exp | 71.0% | $1.25 / $5.00 | 9.14 s |
| 23 | Google | Gemini 2.0 Flash (001) | 70.5% | $0.10 / $0.40 | 5.72 s |
| 24 | Meta | Llama 4 Maverick | 69.3% | $0.27 / $0.85 | 28.18 s |
| 25 | Mistral | Mistral Large (11/2024) | 67.7% | $2.00 / $6.00 | 12.00 s |
| 26 | Cohere | Command A | 66.6% | $2.50 / $10.00 | 10.13 s |
| 27 | Meta | Llama 3.1 Instruct Turbo (405B) | 66.3% | $3.50 / $3.50 | 24.91 s |
| 28 | OpenAI | GPT 4.1 nano | 66.1% | $0.10 / $0.40 | 3.04 s |
| 29 | AI21 Labs | Jamba 1.6 Large | 65.3% | $2.00 / $8.00 | 16.30 s |
| 30 | OpenAI | GPT 4o Mini | 64.9% | $0.15 / $0.60 | 9.00 s |
| 31 | Google | Gemini 1.5 Pro (002) | 64.9% | $1.25 / $5.00 | 9.91 s |
| 32 | Meta | Llama 3.3 Instruct Turbo (70B) | 63.9% | $0.88 / $0.88 | 3.86 s |
| 33 | Anthropic | Claude 3.5 Haiku Latest | 63.0% | $1.00 / $5.00 | 5.00 s |
| 34 | Mistral | Mistral Small 3.1 (03/2025) | 62.9% | $0.07 / $0.30 | 7.95 s |
| 35 | AI21 Labs | Jamba 1.5 Large | 62.7% | $2.00 / $8.00 | 20.32 s |
| 36 | Meta | Llama 3.1 Instruct Turbo (70B) | 61.1% | $0.88 / $0.88 | 4.39 s |
| 37 | Meta | Llama 4 Scout | 59.0% | $0.18 / $0.59 | 7.09 s |
| 38 | Mistral | Mistral Small (02/2024) | 54.1% | $0.20 / $0.60 | 8.97 s |
| 39 | Google | Gemini 1.5 Flash (002) | 53.4% | $0.07 / $0.30 | 3.74 s |
| 40 | AI21 Labs | Jamba 1.6 Mini | 50.3% | $0.20 / $0.40 | 4.39 s |
| 41 | AI21 Labs | Jamba 1.5 Mini | 46.9% | $0.20 / $0.40 | 4.82 s |
| 42 | Meta | Llama 3.1 Instruct Turbo (8B) | 39.0% | $0.18 / $0.18 | 2.45 s |

Key Takeaways


Dataset

TaxEval v2 is our latest benchmark for evaluating models' ability to handle tax-related questions. It was built from scratch rather than extending our previous TaxEval dataset, and it evaluates both answer correctness and structured reasoning. The dataset was created in collaboration with financial and tax experts, who both wrote and double-checked every question and answer.

Some key features:

  • 1,500+ total questions across validation and test sets
  • Balanced distribution of topics and question types
  • Comprehensive evaluation of both answers and reasoning steps

The benchmark consists of three main components:

  • Public Validation Set: 20 samples, publicly accessible (to request access, please contact contact@vals.ai).
  • Private Validation Set: 300 samples for model evaluation, available for purchase to evaluate and improve models.
  • Test Set: 1,223 samples. These samples are never shared.

The benchmark comprises two tasks (both use the same questions, but are evaluated differently):

  1. Answer Correctness: The factual correctness of the answer.
  2. Stepwise Reasoning: The quality and structure of the reasoning process is assessed, ensuring models not only provide correct answers but also demonstrate clear logical steps.

The benchmark includes a diverse range of question types:

  • Application and Compliance (18.3%)
  • Comparative Analysis (16.2%)
  • Numerical Reasoning (16.7%)
  • Problem Solving and Critical Thinking (16.5%)
  • Semantic Analysis (18.0%)
  • Updates and Current Affairs (15.9%)

Each category is carefully balanced between the private validation and test sets to ensure representative sampling. The evaluation process uses Claude 3.5 Sonnet Latest as judge for both answer correctness and stepwise reasoning assessment.
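As a rough illustration, an LLM-as-judge grading loop along these lines could look like the sketch below. This is hypothetical: `call_judge` stands in for a request to the judge model (Claude 3.5 Sonnet Latest in the benchmark), the verdict format is invented for illustration, and the actual evaluation harness is not public.

```python
# Hypothetical sketch of a two-task LLM-as-judge grading loop.
# call_judge is a stub standing in for a real request to the judge model,
# so the sketch is self-contained and runnable.

def call_judge(question, reference, response, rubric):
    # Placeholder: a real implementation would send a grading prompt to
    # the judge model and parse a structured verdict from its reply.
    return {"answer_correct": True, "reasoning_score": 4}

def grade(samples):
    """Score answer correctness and stepwise reasoning over a sample set."""
    correct, reasoning_total = 0, 0
    for s in samples:
        verdict = call_judge(s["q"], s["ref"], s["resp"],
                             rubric="answer correctness + stepwise reasoning")
        correct += verdict["answer_correct"]          # task 1: factual answer
        reasoning_total += verdict["reasoning_score"]  # task 2: reasoning quality
    n = len(samples)
    return correct / n, reasoning_total / n  # (accuracy, mean reasoning score)

samples = [{"q": "...", "ref": "...", "resp": "..."}] * 3
acc, reasoning = grade(samples)
print(acc, reasoning)  # 1.0 4.0 with the stubbed judge
```

The key design point is that the same model responses are scored twice against different rubrics, so a model can be rewarded for a correct answer while still being penalized for opaque or illogical intermediate steps.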


Overall Results

We can see a trend of OpenAI's and Anthropic's competitors (Google, xAI, Mistral, etc.) closing the gap and reaching performance similar to that of the top models.

[Chart: TaxEval results]

We can also see that although GPT-4o (2024-11-20) sits near the top of the chart, Gemini 2.0 Flash is not far behind while being far cheaper. The chart is also dominated by closed-source models: while it is true that more closed-source models were benchmarked, the open-weight Llama models managed only middling performance.




Model highlights

o1

Release date: 12/17/2024
Accuracy: 78.6%
Latency: 19.2 s
Cost: $15.00 / $60.00

  • o1 remains one of the strongest models on this benchmark, but at the highest price point on the leaderboard.
  • Grok 3 Beta edges it out on accuracy (78.8% vs. 78.6%) while demonstrating significantly better latency and cost efficiency.


DeepSeek R1

Release date: 1/20/2025
Accuracy: 76.7%
Latency: 162.9 s
Cost: $8.00 / $8.00

  • The Chinese-trained DeepSeek R1 model has been a revelation for the AI community, establishing itself as a formidable competitor to closed-source models while being completely open source.
  • Despite these impressive results, competing closed-source models have since evolved and now outperform DeepSeek R1.



Model Output Examples

The hardest questions for the models typically require complex multi-step calculations, reasoning about which laws and figures apply given the information, or access to more recent data (which ties directly to the model's training cutoff).

Below is an example where only GPT-4o (2024-11-20) answers correctly, while the other models perform the calculations incorrectly. Notably, the models need lengthy reasoning to reach the answer, which makes the output token limit significant here: cutting generation off early may not leave the model enough room to arrive at the answer.

Q

Michael, a married taxpayer filing jointly, has an adjusted gross income (AGI) of $300,000 for 2023, excluding any investment income. During the year, he received $20,000 in interest from corporate bonds and $15,000 in interest from municipal bonds issued by his state of residence. In addition, he sold a collectible artwork for $100,000 that he had purchased 5 years ago for $60,000. Calculate Michael's tax liability related to his investment income, including any applicable taxes on capital gains and considering the Net Investment Income Tax (NIIT).

A

Response:

Michael owes a total of $18,280 in taxes related to his investment income.

CORRECT
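For reference, the expected figure decomposes cleanly into three pieces: ordinary tax on the corporate bond interest (municipal bond interest is federally exempt), the 28% collectibles rate on the artwork gain, and the 3.8% NIIT. Below is a minimal sketch of the arithmetic, assuming this taxpayer's ordinary marginal rate is 24% (our assumption; the 28% collectibles cap and the $250,000 married-filing-jointly NIIT threshold are statutory 2023 figures).

```python
# Sketch of the expected TaxEval calculation for this question.
corporate_interest = 20_000            # taxable at ordinary rates
municipal_interest = 15_000            # exempt from federal income tax
collectible_gain = 100_000 - 60_000    # long-term gain on the artwork

ordinary_rate = 0.24       # assumed marginal bracket for this taxpayer
collectible_rate = 0.28    # collectibles gains capped at 28%
niit_rate = 0.038          # Net Investment Income Tax
magi_threshold = 250_000   # 2023 MFJ threshold for the NIIT

interest_tax = corporate_interest * ordinary_rate      # $4,800
collectible_tax = collectible_gain * collectible_rate  # $11,200

# NIIT applies to the lesser of net investment income and MAGI over the
# threshold; municipal interest is excluded from net investment income.
net_investment_income = corporate_interest + collectible_gain  # $60,000
magi = 300_000 + net_investment_income                         # $360,000
niit = niit_rate * min(net_investment_income, magi - magi_threshold)  # $2,280

total = interest_tax + collectible_tax + niit
print(f"${total:,.0f}")  # $18,280
```

Getting each of these three components right, in order, is exactly the kind of multi-step dependency that the stepwise reasoning task is designed to probe.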
