CorpFin (v2) Benchmark

| Rank | Provider | Model | Accuracy | Cost (input / output per 1M tokens) | Latency |
|------|----------|-------|----------|-------------------------------------|---------|
| 1 | OpenAI | GPT 4.1 | 71.2% | $2.00 / $8.00 | 10.33 s |
| 2 | xAI | Grok 3 Beta | 69.1% | $3.00 / $15.00 | 23.86 s |
| 3 | xAI | Grok 3 Mini Fast Beta High Reasoning | 68.6% | $0.60 / $4.00 | 14.12 s |
| 4 | Google | Gemini 2.5 Pro Exp | 68.4% | $1.25 / $10.00 | 17.99 s |
| 5 | Anthropic | Claude 3.7 Sonnet (Thinking) | 68.0% | $3.00 / $15.00 | 22.33 s |
| 6 | xAI | Grok 3 Mini Fast Beta Low Reasoning | 66.0% | $0.60 / $4.00 | 10.42 s |
| 7 | OpenAI | GPT 4.1 mini | 65.8% | $0.40 / $1.60 | 6.56 s |
| 8 | Anthropic | Claude 3.7 Sonnet | 65.5% | $3.00 / $15.00 | 15.15 s |
| 9 | DeepSeek | DeepSeek R1 | 63.2% | $8.00 / $8.00 | 46.80 s |
| 10 | DeepSeek | DeepSeek V3 (03/24/2025) | 60.9% | $1.20 / $1.20 | 36.59 s |
| 11 | DeepSeek | DeepSeek V3 | 60.7% | $0.90 / $0.90 | 28.88 s |
| 12 | Anthropic | Claude 3.5 Sonnet Latest | 60.5% | $3.00 / $15.00 | N/A |
| 13 | Anthropic | Claude 3.5 Haiku Latest | 58.2% | $1.00 / $5.00 | N/A |
| 14 | xAI | Grok 2 | 58.2% | $2.00 / $10.00 | 84.80 s |
| 15 | Meta | Llama 4 Maverick | 57.6% | $0.27 / $0.85 | 6.59 s |
| 16 | OpenAI | GPT 4o (2024-11-20) | 56.6% | $2.50 / $10.00 | 6.35 s |
| 17 | OpenAI | o3 Mini | 55.7% | $1.10 / $4.40 | 31.42 s |
| 18 | OpenAI | GPT 4o Mini | 55.0% | $0.15 / $0.60 | N/A |
| 19 | Cohere | Command A | 54.5% | $2.50 / $10.00 | 14.34 s |
| 20 | Meta | Llama 4 Scout | 53.9% | $0.18 / $0.59 | 9.22 s |
| 21 | Mistral | Mistral Small 3.1 (03/2025) | 53.2% | $0.07 / $0.30 | 12.25 s |
| 22 | Google | Gemini 2.0 Pro Exp | 53.1% | $1.25 / $5.00 | 18.13 s |
| 23 | AI21 Labs | Jamba 1.6 Large | 50.7% | $2.00 / $8.00 | 29.61 s |
| 24 | OpenAI | GPT 4.1 nano | 50.4% | $0.10 / $0.40 | 5.30 s |
| 25 | Google | Gemini 1.5 Pro (002) | 50.3% | $1.25 / $5.00 | 37.35 s |
| 26 | OpenAI | GPT 4o (2024-08-06) | 49.3% | $2.50 / $10.00 | N/A |
| 27 | Meta | Llama 3.1 Instruct Turbo (70B) | 47.2% | $0.88 / $0.88 | N/A |
| 28 | AI21 Labs | Jamba 1.5 Large | 46.6% | $2.00 / $8.00 | 10.74 s |
| 29 | Google | Gemini 1.5 Flash (002) | 46.6% | $0.07 / $0.30 | 28.38 s |
| 30 | AI21 Labs | Jamba 1.6 Mini | 46.0% | $0.20 / $0.40 | 4.28 s |
| 31 | Meta | Llama 3.1 Instruct Turbo (8B) | 43.5% | $0.18 / $0.18 | N/A |
| 32 | AI21 Labs | Jamba 1.5 Mini | 39.9% | $0.20 / $0.40 | 2.34 s |
| 33 | Google | Gemini 2.0 Flash (001) | 38.5% | $0.10 / $0.40 | 31.35 s |
| 34 | Google | Gemini 1.5 Flash (001) | 32.9% | $0.07 / $0.30 | N/A |

Key Takeaways

  • GPT-4.1 is the best-performing model on this benchmark, achieving 71.2% accuracy.
  • Grok 3 Beta is the second-best performing model, achieving 69.1% accuracy.
  • Grok 3 Mini Fast Beta High Reasoning is just behind at 68.6% accuracy.
  • o3 Mini appears to struggle with large context windows, performing poorly on the Max Fitting Context task. It tends to lose track of the question when it is provided at the beginning of a large context window (around 150k tokens or more).

Dataset

In both the finance and legal industries, it is common to ask a specific question about, or extract a piece of information from, a very long document. A common document type is the credit agreement: a contract, often over 200 pages, used when a large corporation receives a line of credit from a banking entity (you can find an example agreement here).

For this dataset, we worked with a team of experts, including financial analysts, legal professionals, and academics, to create a set of questions and answers about these credit agreements. The dataset is divided into:

  • Public Validation: 20 questions from 1 document, accessible to anyone on request (email contact@vals.ai for access).
  • Private Validation: 340 questions from 17 documents, available for purchase to evaluate and improve your own models.
  • Test: 858 questions from 43 documents. This is the privately held set for Vals AI benchmarking and is never shared.

The dataset is also organized into three distinct tasks, which all use the entire set of questions but pass a different subset of information into each model's context window. The three tasks are:

  1. Exact Pages: This task provides only the necessary pages required to answer each question, typically resulting in a small context of only a few pages.
  2. Shared Max Context: This task passes approximately 80 PDF pages, ensuring that all information needed for the correct answer is present. The subselection does not necessarily start at the first page, which can make it harder for models to understand the document's structure.
  3. Max Fitting Context: This task includes the largest possible chunk of the document, starting from the first page, that fits within the model’s context window. This means longer-context models have more information.
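
As a rough illustration of the Max Fitting Context setup, the sketch below truncates a document (starting from the first page) so that the document plus the question fit within a model's context window. The tokenizer and token budget are assumptions for illustration, not the exact Vals AI harness.

```python
# Hypothetical sketch: build a "Max Fitting Context"-style prompt by keeping the
# largest prefix of the document that fits in the model's context window.
# Token counting via tiktoken and the budget numbers are assumptions.
import tiktoken

def build_max_fitting_context(document: str, question: str,
                              context_limit: int = 128_000,
                              reserve_for_output: int = 2_000) -> str:
    enc = tiktoken.get_encoding("cl100k_base")
    budget = context_limit - reserve_for_output - len(enc.encode(question))
    doc_tokens = enc.encode(document)
    truncated = enc.decode(doc_tokens[:budget])  # largest prefix, from page one onward
    return f"{truncated}\n\nQuestion: {question}"
```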

The types of questions written by the experts include:

  • Basic extraction of terms and numbers: Some examples include “Who is the borrower’s legal counsel?” or “What currencies are the USD 325mm RCF first lien available in?”
  • Summarization and interpretation questions: Some examples include, “Is there an erroneous payment provision?” and “How will the loan proceeds be used?”
  • Numeric reasoning or calculation-based questions: Some examples include “How much initial debt capacity is available to the Borrower on day one?”
  • Questions that require referring to multiple sections of the provided text, especially previous definitions: For instance, “What is the minimum required amount of Available Liquidity that the company must maintain?” The model must refer to an earlier definition of Available Liquidity.
  • Giving opinions on terms based on market standards: For instance, “Are there any unusual terms used to define or adjust EBITDA?” These questions require the model to make a judgment call rather than simply state a fact.
  • Questions making use of industry jargon: There are several terms like “baskets”, which have a commonly understood meaning in the industry, but are almost never explicitly used in the agreement itself. An example is “Does the contract contain a Chewy Blocker?” (a type of clause meant to prevent a subsidiary from being released from its debt obligations).

Overall Results

Although the larger models, such as those from OpenAI and Anthropic, perform best, the gap between them and the smaller models is narrower than on other benchmarks. Once latency and price are taken into account, smaller models can sometimes deliver better accuracy for a given cost.
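
One way to make this trade-off concrete is to normalize accuracy by price. The snippet below uses a few rows from the leaderboard above and a naive blended price (the simple average of input and output cost per 1M tokens), which is an assumption about the token mix rather than an actual billing profile.

```python
# Accuracy points per blended dollar, using figures from the leaderboard above.
# The 50/50 input/output blend is a simplifying assumption.
models = {
    "GPT 4.1":           {"accuracy": 71.2, "in_cost": 2.00, "out_cost": 8.00},
    "GPT 4.1 mini":      {"accuracy": 65.8, "in_cost": 0.40, "out_cost": 1.60},
    "Mistral Small 3.1": {"accuracy": 53.2, "in_cost": 0.07, "out_cost": 0.30},
}

for name, m in models.items():
    blended = (m["in_cost"] + m["out_cost"]) / 2  # $ per 1M tokens
    print(f"{name:18s} {m['accuracy']:.1f}% accuracy, "
          f"{m['accuracy'] / blended:7.1f} accuracy points per blended $")
```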


Model highlights

GPT 4.1

Release date: 4/14/2025
Accuracy: 71.2%
Latency: 10.3 s
Cost: $2.00 / $8.00

  • GPT-4.1 (2025-04-14) shows exceptional performance on this benchmark, taking the #1 spot with the highest accuracy.
  • It demonstrates superior ability to process financial documents and extract relevant information across all task types.
  • The model does particularly well on the Max Fitting Context task, which is partially explained by its large context window.


Model Output Examples

Questions that require synthesizing content and calculating values across multiple sections of the document are among the hardest. Parsing the original PDFs also introduces many special characters, which can make it difficult to find information, for example in tables.
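
As a hedged illustration of that point, the snippet below shows the kind of normalization that text extracted from PDFs often needs before its content is easy to search; the specific rules are assumptions, not the Vals AI parsing pipeline.

```python
# Illustrative cleanup of text extracted from PDFs (assumed rules, not the actual pipeline).
import re
import unicodedata

def normalize_extracted_text(raw: str) -> str:
    text = unicodedata.normalize("NFKC", raw)   # fold ligatures and odd Unicode forms
    text = text.replace("\u00ad", "")           # drop soft hyphens
    text = re.sub(r"-\n(?=[a-z])", "", text)    # re-join words hyphenated across lines
    text = re.sub(r"[ \t]+", " ", text)         # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)      # collapse excessive blank lines
    return text.strip()
```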

Below is an example of a number retrieval question, on the Max Fitting Context task. This question can be difficult for models because in a long financial document, there can be several similar tables that relate to the question, but only one is correct.

In this case, the expected answer is 3.25 to 1.00.

Q: What is the Total Net Leverage Ratio limit for unlimited RPs and investments?

Response: 3.25 to 1.00

Grade: CORRECT


Additional Notes

Context Window

Our testing across different context lengths shows that the context window is a highly significant factor in model performance. Context window sizes vary widely across providers and models.

Evaluation Methodology

All evaluations were conducted using Claude 3.5 Sonnet Latest as the judge, with temperature set to 0. Even with fixed temperature, OpenAI models showed slight non-determinism (1-2% variance). The evaluation process revealed that both the judging prompt and the judge model can significantly impact results.
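
For readers who want to reproduce a comparable setup, below is a minimal LLM-as-judge sketch using the Anthropic Python SDK with temperature 0. The grading prompt is illustrative only and is not the actual Vals AI judging prompt.

```python
# Minimal LLM-as-judge sketch (illustrative prompt; not the Vals AI grading prompt).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge(question: str, expected: str, response: str) -> bool:
    result = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=10,
        temperature=0,  # fixed temperature, as described above
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\n"
                f"Expected answer: {expected}\n"
                f"Model answer: {response}\n"
                "Does the model answer match the expected answer? Reply CORRECT or INCORRECT."
            ),
        }],
    )
    return result.content[0].text.strip().upper().startswith("CORRECT")
```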

Retrieval-Augmented Generation (RAG)

As an alternative to including the entire document in the context window, RAG techniques are extremely common. This method breaks up long documents and databases into “chunks” which are first retrieved, and then passed to the model as context. This is another area of evaluation we will explore further.
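
A minimal sketch of this chunk-and-retrieve pattern is shown below, assuming sentence-transformers embeddings; the chunk size, overlap, embedding model, and top-k are illustrative choices rather than a recommended configuration.

```python
# Chunk a long document, embed the chunks, and retrieve the most relevant ones
# for a question. All parameters here are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

def retrieve(document: str, question: str, top_k: int = 5) -> list[str]:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    chunks = chunk(document)
    chunk_emb = model.encode(chunks, convert_to_tensor=True)
    query_emb = model.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, chunk_emb, top_k=top_k)[0]
    return [chunks[hit["corpus_id"]] for hit in hits]  # passed to the model as context
```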
