TaxEval Benchmark

| Rank | Provider | Model | Cost (input / output, per 1M tokens) | Accuracy | Latency |
|---|---|---|---|---|---|
| 1 | OpenAI | o1 Preview | $15.00 / $60.00 | 83.5% | 7.38 s |
| 2 | OpenAI | o1 Mini | $3.00 / $12.00 | 72.4% | 3.41 s |
| 3 | Anthropic | Claude 3.5 Sonnet Latest | $3.00 / $15.00 | 67.1% | 0.43 s |
| 4 | OpenAI | GPT 4o | $2.50 / $10.00 | 66.4% | 0.71 s |
| 5 | Anthropic | Claude 3.5 Sonnet | $3.00 / $15.00 | 65.7% | 0.65 s |
| 6 | OpenAI | GPT 4 | $10.00 / $30.00 | 60.5% | 0.64 s |
| 7 | Anthropic | Claude 3 Opus | $15.00 / $75.00 | 58.4% | 2.06 s |
| 8 | Meta | Llama 3.1 Instruct Turbo (405B) | $3.50 / $3.50 | 57.8% | 0.81 s |
| 9 | OpenAI | GPT 4o Mini | $0.15 / $0.60 | 54.2% | 0.46 s |
| 10 | Meta | Llama 3.1 Instruct Turbo (70B) | $0.70 / $0.70 | 49.4% | 2.89 s |
| 11 | Anthropic | Claude 3 Sonnet | $3.00 / $15.00 | 48.0% | 0.81 s |
| 12 | Google | Gemini 1.5 Pro 001 | $1.25 / $5.00 | 46.1% | 1.05 s |
| 13 | Meta | Llama 3 (70B) | $0.90 / $0.90 | 44.6% | 0.57 s |
| 14 | OpenAI | GPT 3.5 | $0.50 / $1.50 | 40.2% | 0.34 s |
| 15 | Google | Gemini 1.0 Pro 002 | $0.50 / $1.50 | 38.4% | 0.53 s |
| 16 | Cohere | Command R+ | $3.00 / $15.00 | 37.5% | 0.25 s |
| 17 | Meta | Llama 3.1 Instruct Turbo (8B) | $0.18 / $0.18 | 35.7% | 0.29 s |
| 18 | Databricks | DBRX Instruct | $2.25 / $6.75 | 34.0% | 0.60 s |
| 19 | Cohere | Command R | $0.50 / $1.50 | 34.0% | 0.14 s |
| 20 | Meta | Llama 2 (70B) | $0.90 / $0.90 | 32.6% | 0.56 s |
| 21 | Mistral | Mixtral (8x7B) | $0.60 / $0.60 | 31.0% | 0.52 s |
| 22 | Meta | Llama 2 (13B) | $0.20 / $0.20 | 28.0% | 0.56 s |
| 23 | Meta | Llama 3 (8B) | $0.20 / $0.20 | 26.7% | 0.57 s |
| 24 | Mistral | Mistral (7B) | $0.18 / $0.18 | 25.8% | 0.52 s |
| 25 | Meta | Llama 2 (7B) | $0.20 / $0.20 | 23.9% | 0.38 s |


Key Takeaways

  • o1 Preview and o1 Mini performed far better than the other models on this dataset. They are much stronger at the numerical operations required for many of the tasks and at logically chaining the multiple steps a given calculation requires.
  • In general, tax questions are a very challenging domain for large language models. Most models struggled across the board, especially on mathematical reasoning tasks.
  • The upgraded Claude 3.5 Sonnet model performed better on the multiple-choice questions, but the previous version of the model performed better on free response.
  • Apart from Llama 3/3.1, open-source models performed only marginally better than random guessing. It will take considerable work for these models to perform to a high standard on tax reasoning questions.

Context

There has been considerable effort to measure language model performance on academic tasks and in chatbot settings, but these high-level benchmarks are contrived and do not translate to specific industry use cases. Further, model performance results released by LLM providers are highly biased: they are often manufactured to show state-of-the-art results.

Here we start to remedy this by reporting our third-party, application-specific findings and live leaderboard results on TaxEval. This dataset consists of multiple-choice and free-response US tax questions. Some of the major practice areas explored are as follows.

Income Tax:

  • Taxable income calculation: Understanding the differences between accounting income and taxable income, including permanent and temporary differences.
  • Tax rates: Applying the appropriate tax rates to calculate income tax expense.
  • Deferred tax assets and liabilities: Recognizing and measuring deferred tax assets and liabilities arising from temporary differences.
  • Effective tax rate: Calculating and analyzing the effective tax rate (a worked sketch of these calculations follows this list).
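
To make the flavor of these questions concrete, the snippet below is a minimal worked sketch, in Python, of the kind of multi-step arithmetic the income tax items require: reconciling book income to taxable income, splitting the tax effect into current and deferred components, and deriving the effective tax rate. All figures and names here are hypothetical and are not drawn from the dataset.

```python
# Hypothetical figures for illustration only -- not taken from TaxEval.
pretax_accounting_income = 500_000   # income per books
permanent_differences    = 20_000    # e.g. non-deductible fines (added back, never reverse)
temporary_differences    = -50_000   # e.g. accelerated tax depreciation (reverses later)
statutory_rate           = 0.21

# Taxable income = book income adjusted for permanent and temporary differences.
taxable_income = pretax_accounting_income + permanent_differences + temporary_differences

# Current tax is levied on taxable income; the temporary difference creates a
# deferred tax liability (or asset) that reverses in future periods.
current_tax  = taxable_income * statutory_rate          # 98,700
deferred_tax = -temporary_differences * statutory_rate  # 10,500 (liability)

# Total tax expense and the effective rate relative to book income.
tax_expense        = current_tax + deferred_tax         # 109,200
effective_tax_rate = tax_expense / pretax_accounting_income

print(f"Taxable income: {taxable_income:,}")             # 470,000
print(f"Effective tax rate: {effective_tax_rate:.1%}")   # 21.8%
```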

General Tax Concepts:

  • Matching principle: Applying the matching principle to recognize tax expense in the same period as the related revenue or expense.
  • Tax accounting methods: Understanding the differences between cash-basis and accrual-basis accounting for tax purposes.
  • Discontinued operations: Calculating the after-tax gain or loss on disposal of a discontinued operation (see the sketch following this list).
  • Intangible assets: Understanding the tax implications of impairment losses on intangible assets.
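
The discontinued operations item, in particular, reduces to a short net-of-tax computation. The sketch below uses hypothetical figures (not from the dataset) purely to show the shape of the calculation a model has to chain together.

```python
# Hypothetical discontinued-operations example -- figures are illustrative only.
operating_loss_of_component = -120_000  # loss from operating the component during the year
gain_on_disposal            = 300_000   # pre-tax gain on selling the component
tax_rate                    = 0.25

# Results of a discontinued operation are reported as a single amount, net of tax.
pretax_result    = operating_loss_of_component + gain_on_disposal  # 180,000
tax_effect       = pretax_result * tax_rate                        # 45,000
after_tax_result = pretax_result - tax_effect                      # 135,000

print(f"After-tax gain on discontinued operation: {after_tax_result:,.0f}")
```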

Overall Results

On a task-by-task basis, o1 Preview was often the best, but GPT-4o and Llama 3.1 also claimed two of the top task-specific spots. Much of o1's performance bump came from tasks in the “Rule” category; for example, it scored a nearly perfect 98% on “Rule QA”, one of the few free-response tasks in LegalBench.

o1 Preview

Release date: 9/12/2024
Accuracy: 83.5%
Latency: 7.4 s
Cost: $15.00 / $60.00

  • o1 Preview commands an impressive 16.4-percentage-point lead over Claude 3.5 Sonnet overall.
  • o1 Mini performed well on the multiple-choice questions (second only to o1 Preview), but less well on the free-response questions.
  • Despite its impressive performance, the model was also extremely difficult to work with: we had to prepend additional instructions for it to follow formatting directions (see the Additional Notes section).


The per-task results are shown in the bar chart below.

[Figure: TaxEval per-task results]

Most models had a relatively large gap between their multiple-choice and free-response performance; o1 Preview's gap was smaller than the others'. In terms of relative performance, there is a significant divide between the o1 models and the Anthropic models, and another divide between the Anthropic models and the rest of the field, particularly on the free-response questions. Gemini Pro and GPT-3.5 turned in decidedly middling performance, and the remaining open-source models were hopeless, with accuracies close to pure guessing.

Llama 3.1 405B performed well, but was not at the same level as Opus or o1 Preview on this task. It was competitive, however, and a significant step up from both the previous Llama generations and the other open-source models; even Llama 3 (70B) outscored both Cohere models.

[Figure: TaxEval cost vs. accuracy]

The cost-accuracy graph shows a handful of models sitting on or near the Pareto curve of tradeoffs: GPT-4o Mini, Llama 3.1 70B, Llama 3.1 405B, o1 Mini, and o1 Preview, with GPT-4o and Claude 3.5 Sonnet just off it. Outside of the o1 models, the accuracy differences within this group are modest relative to the price differences, which span more than an order of magnitude. GPT-4o Mini particularly stands out for its quality-to-price ratio; a cheaper model may still be the better choice in domains with high token usage or cost sensitivity.

The three models that strictly define the Pareto curve are all from OpenAI: GPT-4o Mini, o1 Mini, and o1 Preview. At their respective price points, these models sit a considerable margin above the competition. It seems OpenAI has begun to crack the code on building models with stronger math and reasoning performance.
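
As a sanity check on this reading of the chart, the Pareto set can be recomputed directly from the leaderboard. The sketch below is ours, not the methodology behind the chart: it assumes a blended price per million tokens weighted toward input tokens, and it keeps a model only if no other model is at least as cheap and strictly more accurate. Depending on how input and output prices are weighted, GPT-4o and Llama 3.1 405B can also land on the frontier, which is consistent with them sitting just off the curve.

```python
# Sketch: recompute the cost-accuracy Pareto frontier from the leaderboard.
# Prices are $ per 1M input / output tokens (from the table above); the 3:1
# input:output blend is our assumption, not the weighting used in this report.
# Only a subset of the leaderboard is included for brevity.
models = {
    "o1 Preview":        (15.00, 60.00, 83.5),
    "o1 Mini":           (3.00, 12.00, 72.4),
    "Claude 3.5 Sonnet": (3.00, 15.00, 67.1),
    "GPT-4o":            (2.50, 10.00, 66.4),
    "Llama 3.1 405B":    (3.50, 3.50, 57.8),
    "GPT-4o Mini":       (0.15, 0.60, 54.2),
    "Llama 3.1 70B":     (0.70, 0.70, 49.4),
}

def blended_cost(price_in: float, price_out: float, input_share: float = 0.75) -> float:
    """Blend input/output prices; input_share is the assumed fraction of input tokens."""
    return input_share * price_in + (1 - input_share) * price_out

def pareto_frontier(models: dict) -> list:
    """Keep a model if no other model is cheaper-or-equal and strictly more accurate."""
    frontier = []
    for name, (p_in, p_out, acc) in models.items():
        cost = blended_cost(p_in, p_out)
        dominated = any(
            blended_cost(q_in, q_out) <= cost and other_acc > acc
            for other, (q_in, q_out, other_acc) in models.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(models))
```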


Model Output Example

We show here one free-response question that Opus and Gemini answered incorrectly, while GPT-4 and o1 Preview answered correctly. The question concerns what type of fund a library should use.

This question is tricky because both answers could be defended, but an enterprise fund is the better answer. Enterprise funds are intended for services that are largely self-sustaining, and they use accrual accounting, which supports the annual determination of net income the question asks for. Special revenue funds, by contrast, can only be used for a specific purpose, so any surplus remains locked to that purpose.

Gemini answered only “Governmental fund”, which is the broad category rather than the specific type (special revenue or enterprise). This would not be useful to a user, who wants to know which type of governmental fund to establish.

Q: Answer concisely in one word, phrase or number. King City Council will be establishing a library fund. Library fees are expected to cover 55% of the library's annual resource requirements. King has decided that an annual determination of net income is desirable in order to maintain management control and accountability over library. What type of fund should King establish in order to meet their measurement objectives?

Model response: Special revenue fund. (INCORRECT)


Additional Notes

o1 Preview

The base prompt for both multiple choice and free response contains a directive for every model to “Answer concisely in one word, phrase or number.” o1 Preview, although it got the answers right, would consistently include far too much reasoning and descriptive text, which made evaluation difficult. Therefore, for the o1 evaluation, we prepended every prompt with the following directive: “THE OUTPUT SHOULD BE ONE WORD, A FEW WORD PHRASE, NUMBER, OR A MONETARY FIGURE. DO NOT EXPLAIN REASONING.”
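
In practice the adjustment was just string concatenation at prompt-construction time. The sketch below illustrates the idea; the function and variable names are ours for illustration, not code from our evaluation harness, and only the two directive strings are quoted from this report.

```python
# Illustrative sketch of the o1-specific prompt adjustment described above.
O1_DIRECTIVE = (
    "THE OUTPUT SHOULD BE ONE WORD, A FEW WORD PHRASE, NUMBER, "
    "OR A MONETARY FIGURE. DO NOT EXPLAIN REASONING.\n\n"
)
BASE_DIRECTIVE = "Answer concisely in one word, phrase or number. "

def build_prompt(question: str, model_name: str) -> str:
    """Prepend the stricter formatting directive only for o1-family models."""
    prompt = BASE_DIRECTIVE + question
    if model_name.startswith("o1"):
        prompt = O1_DIRECTIVE + prompt
    return prompt

print(build_prompt("What type of fund should King establish?", "o1-preview"))
```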

The o1 models also did not let us configure the temperature used for the evaluation.
