CorpFin (v2) Benchmark

| Rank | Provider | Model | Accuracy | Cost (input / output per 1M tokens) | Latency |
|------|----------|-------|----------|-------------------------------------|---------|
| 1 | OpenAI | GPT 4.1 | 71.2% | $2.00 / $8.00 | 10.33 s |
| 2 | xAI | Grok 3 Beta | 69.1% | $3.00 / $15.00 | 23.86 s |
| 3 | xAI | Grok 3 Mini Fast Beta High Reasoning | 68.6% | $0.60 / $4.00 | 14.12 s |
| 4 | Google | Gemini 2.5 Pro Exp | 68.4% | $1.25 / $10.00 | 17.99 s |
| 5 | Anthropic | Claude 3.7 Sonnet (Thinking) | 68.0% | $3.00 / $15.00 | 22.33 s |
| 6 | xAI | Grok 3 Mini Fast Beta Low Reasoning | 66.0% | $0.60 / $4.00 | 10.42 s |
| 7 | OpenAI | GPT 4.1 mini | 65.8% | $0.40 / $1.60 | 6.56 s |
| 8 | Anthropic | Claude 3.7 Sonnet | 65.5% | $3.00 / $15.00 | 15.15 s |
| 9 | DeepSeek | DeepSeek R1 | 63.2% | $8.00 / $8.00 | 46.80 s |
| 10 | DeepSeek | DeepSeek V3 (03/24/2025) | 60.9% | $1.20 / $1.20 | 36.59 s |
| 11 | DeepSeek | DeepSeek V3 | 60.7% | $0.90 / $0.90 | 28.88 s |
| 12 | Anthropic | Claude 3.5 Sonnet Latest | 60.5% | $3.00 / $15.00 | N/A |
| 13 | Anthropic | Claude 3.5 Haiku Latest | 58.2% | $1.00 / $5.00 | N/A |
| 14 | xAI | Grok 2 | 58.2% | $2.00 / $10.00 | 84.80 s |
| 15 | Meta | Llama 4 Maverick | 57.6% | $0.27 / $0.85 | 6.59 s |
| 16 | OpenAI | GPT 4o (2024-11-20) | 56.6% | $2.50 / $10.00 | 6.35 s |
| 17 | OpenAI | o3 Mini | 55.7% | $1.10 / $4.40 | 31.42 s |
| 18 | OpenAI | GPT 4o Mini | 55.0% | $0.15 / $0.60 | N/A |
| 19 | Cohere | Command A | 54.5% | $2.50 / $10.00 | 14.34 s |
| 20 | Meta | Llama 4 Scout | 53.9% | $0.18 / $0.59 | 9.22 s |
| 21 | Mistral | Mistral Small 3.1 (03/2025) | 53.2% | $0.07 / $0.30 | 12.25 s |
| 22 | Google | Gemini 2.0 Pro Exp | 53.1% | $1.25 / $5.00 | 18.13 s |
| 23 | AI21 Labs | Jamba 1.6 Large | 50.7% | $2.00 / $8.00 | 29.61 s |
| 24 | OpenAI | GPT 4.1 nano | 50.4% | $0.10 / $0.40 | 5.30 s |
| 25 | Google | Gemini 1.5 Pro (002) | 50.3% | $1.25 / $5.00 | 37.35 s |
| 26 | OpenAI | GPT 4o (2024-08-06) | 49.3% | $2.50 / $10.00 | N/A |
| 27 | Meta | Llama 3.1 Instruct Turbo (70B) | 47.2% | $0.88 / $0.88 | N/A |
| 28 | AI21 Labs | Jamba 1.5 Large | 46.6% | $2.00 / $8.00 | 10.74 s |
| 29 | Google | Gemini 1.5 Flash (002) | 46.6% | $0.07 / $0.30 | 28.38 s |
| 30 | AI21 Labs | Jamba 1.6 Mini | 46.0% | $0.20 / $0.40 | 4.28 s |
| 31 | Meta | Llama 3.1 Instruct Turbo (8B) | 43.5% | $0.18 / $0.18 | N/A |
| 32 | AI21 Labs | Jamba 1.5 Mini | 39.9% | $0.20 / $0.40 | 2.34 s |
| 33 | Google | Gemini 2.0 Flash (001) | 38.5% | $0.10 / $0.40 | 31.35 s |
| 34 | Google | Gemini 1.5 Flash (001) | 32.9% | $0.07 / $0.30 | N/A |

Key Takeaways

  • GPT-4.1 is the best-performing model on this benchmark, achieving 71.2% accuracy.
  • Grok 3 Beta is the second-best performing model, achieving 69.1% accuracy.
  • Grok 3 Mini Fast Beta High Reasoning is just behind at 68.6% accuracy.
  • o3 Mini appears to struggle with large context windows, performing poorly on the Max Fitting Context task. It tends to lose track of the question when it is provided at the beginning of a large context window (around 150k tokens or more).

Dataset

In both the finance and legal industries, it is common to ask a specific question about, or extract a piece of information from, a very long document. A common document type is the credit agreement: a contract, often over 200 pages, used when a large corporation receives a line of credit from a banking entity (you can find an example agreement here).

For this dataset, we worked with a team of experts, including financial analysts, legal professionals, and academics, to create a set of questions and answers about these credit agreements. The dataset is divided into:

  • Public Validation: 20 questions from 1 document, accessible to anyone on request (email contact@vals.ai for access).
  • Private Validation: 340 questions from 17 documents, available for purchase to evaluate and improve your own models.
  • Test: 858 questions from 43 documents. This is the privately held set for Vals AI benchmarking and is never shared.

The dataset is also organized into three distinct tasks, which all use the entire set of questions but pass a different subset of information into each model's context window. The three tasks are:

  1. Exact Pages: This task provides only the necessary pages required to answer each question, typically resulting in a small context of only a few pages.
  2. Shared Max Context: This task passes approximately 80 PDF pages, ensuring that all information needed for the correct answer is present. The subselection does not necessarily start at the first page, which can make it harder for models to understand the document's structure.
  3. Max Fitting Context: This task includes the largest possible chunk of the document, starting from the first page, that fits within the model’s context window. This means longer-context models have more information.
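
As a rough illustration of the Max Fitting Context setup, the sketch below truncates a document (starting from the first page) so that the document plus the question fit within a model's context window. The tokenizer and token budget are assumptions for illustration, not the exact Vals AI harness.

```python
# Hypothetical sketch: build a "Max Fitting Context"-style prompt by keeping the
# largest prefix of the document that fits in the model's context window.
# Token counting via tiktoken and the budget numbers are assumptions.
import tiktoken

def build_max_fitting_context(document: str, question: str,
                              context_limit: int = 128_000,
                              reserve_for_output: int = 2_000) -> str:
    enc = tiktoken.get_encoding("cl100k_base")
    budget = context_limit - reserve_for_output - len(enc.encode(question))
    doc_tokens = enc.encode(document)
    truncated = enc.decode(doc_tokens[:budget])  # largest prefix, from page one onward
    return f"{truncated}\n\nQuestion: {question}"
```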

The types of questions written by the experts include:

  • Basic extraction of terms and numbers: Some examples include “Who is the borrower’s legal counsel?” or “What currencies are the USD 325mm RCF first lien available in?”
  • Summarization and interpretation questions: Some examples include, “Is there an erroneous payment provision?” and “How will the loan proceeds be used?”
  • Numeric reasoning or calculation-based questions: Some examples include “How much initial debt capacity is available to the Borrower on day one?”
  • Questions that require referring to multiple sections of the provided text, especially previous definitions: For instance, “What is the minimum required amount of Available Liquidity that the company must maintain?” The model must refer to an earlier definition of Available Liquidity.
  • Giving opinions on terms based on market standards: For instance, “Are there any unusual terms used to define or adjust EBITDA?” These questions require the model to make a judgment call rather than simply state a fact.
  • Questions making use of industry jargon: There are several terms like “baskets”, which have a commonly understood meaning in the industry, but are almost never explicitly used in the agreement itself. An example is “Does the contract contain a Chewy Blocker?” (a type of clause meant to prevent a subsidiary from being released from its debt obligations).

Overall Results

Although the larger models, such as those from OpenAI and Anthropic, perform best, the gap between them and the smaller models is narrower than on other benchmarks. Once latency and price are taken into account, smaller models can sometimes deliver better accuracy for a given cost.
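
One way to make this trade-off concrete is to normalize accuracy by price. The snippet below uses a few rows from the leaderboard above and a naive blended price (the simple average of input and output cost per 1M tokens), which is an assumption about the token mix rather than an actual billing profile.

```python
# Accuracy points per blended dollar, using figures from the leaderboard above.
# The 50/50 input/output blend is a simplifying assumption.
models = {
    "GPT 4.1":           {"accuracy": 71.2, "in_cost": 2.00, "out_cost": 8.00},
    "GPT 4.1 mini":      {"accuracy": 65.8, "in_cost": 0.40, "out_cost": 1.60},
    "Mistral Small 3.1": {"accuracy": 53.2, "in_cost": 0.07, "out_cost": 0.30},
}

for name, m in models.items():
    blended = (m["in_cost"] + m["out_cost"]) / 2  # $ per 1M tokens
    print(f"{name:18s} {m['accuracy']:.1f}% accuracy, "
          f"{m['accuracy'] / blended:7.1f} accuracy points per blended $")
```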


Model highlights

GPT 4.1

Release date: 4/14/2025
Accuracy: 71.2%
Latency: 10.3 s
Cost: $2.00 / $8.00

  • GPT-4.1 (2025-04-14) shows exceptional performance on this benchmark, taking the #1 spot with the highest accuracy.
  • It demonstrates superior ability to process financial documents and extract relevant information across all task types.
  • The model does particularly well on the Max Fitting Context task, which is partially explained by its large context window.


Model Output Examples

Questions that require synthesizing content and calculating values across multiple sections of the document are among the hardest. Parsing the original PDFs also introduces many special characters, which can make it difficult to find information, for example in tables.
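
As a hedged illustration of that point, the snippet below shows the kind of normalization that text extracted from PDFs often needs before its content is easy to search; the specific rules are assumptions, not the Vals AI parsing pipeline.

```python
# Illustrative cleanup of text extracted from PDFs (assumed rules, not the actual pipeline).
import re
import unicodedata

def normalize_extracted_text(raw: str) -> str:
    text = unicodedata.normalize("NFKC", raw)   # fold ligatures and odd Unicode forms
    text = text.replace("\u00ad", "")           # drop soft hyphens
    text = re.sub(r"-\n(?=[a-z])", "", text)    # re-join words hyphenated across lines
    text = re.sub(r"[ \t]+", " ", text)         # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)      # collapse excessive blank lines
    return text.strip()
```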

Below is an example of a number retrieval question, on the Max Fitting Context task. This question can be difficult for models because in a long financial document, there can be several similar tables that relate to the question, but only one is correct.

In this case, the expected answer is 3.25 to 1.00.

Q: What is the Total Net Leverage Ratio limit for unlimited RPs and investments?

Response: 3.25 to 1.00

Grade: CORRECT


Additional Notes

Context Window

Our testing across different context lengths shows that the context window is a highly significant factor in model performance. Context window sizes vary widely across providers and models.

Evaluation Methodology

All evaluations were conducted using Claude 3.5 Sonnet Latest as the judge, with temperature set to 0. Even with fixed temperature, OpenAI models showed slight non-determinism (1-2% variance). The evaluation process revealed that both the judging prompt and the judge model can significantly impact results.
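
For readers who want to reproduce a comparable setup, below is a minimal LLM-as-judge sketch using the Anthropic Python SDK with temperature 0. The grading prompt is illustrative only and is not the actual Vals AI judging prompt.

```python
# Minimal LLM-as-judge sketch (illustrative prompt; not the Vals AI grading prompt).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge(question: str, expected: str, response: str) -> bool:
    result = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=10,
        temperature=0,  # fixed temperature, as described above
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\n"
                f"Expected answer: {expected}\n"
                f"Model answer: {response}\n"
                "Does the model answer match the expected answer? Reply CORRECT or INCORRECT."
            ),
        }],
    )
    return result.content[0].text.strip().upper().startswith("CORRECT")
```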

Retrieval-Augmented Generation (RAG)

As an alternative to including the entire document in the context window, RAG techniques are extremely common. This method breaks up long documents and databases into “chunks” which are first retrieved, and then passed to the model as context. This is another area of evaluation we will explore further.
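
A minimal sketch of this chunk-and-retrieve pattern is shown below, assuming sentence-transformers embeddings; the chunk size, overlap, embedding model, and top-k are illustrative choices rather than a recommended configuration.

```python
# Chunk a long document, embed the chunks, and retrieve the most relevant ones
# for a question. All parameters here are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

def retrieve(document: str, question: str, top_k: int = 5) -> list[str]:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    chunks = chunk(document)
    chunk_emb = model.encode(chunks, convert_to_tensor=True)
    query_emb = model.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, chunk_emb, top_k=top_k)[0]
    return [chunks[hit["corpus_id"]] for hit in hits]  # passed to the model as context
```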
