Key Takeaways
- Claude 3.7 Sonnet (Thinking) is the best-performing model on this benchmark, achieving 68% accuracy. Its performance is highly consistent across tasks, and it showed no issues handling a question placed at the beginning of a large context window or being given a partial document excerpt as context.
- Claude 3.7 Sonnet is just behind at 65.5% accuracy.
- DeepSeek R1 is the third-best performing model on this benchmark, achieving 63.2% accuracy. It also shows great consistency across tasks, but its latency is far worse because of the reasoning step.
- o3 Mini seems to struggle with large context windows, performing poorly on the Max Fitting Context task. It tends to lose track of the question when it is provided at the beginning of a large context window (around 150k tokens or more).
- On the Exact Pages task, DeepSeek R1 and o3 Mini lead the board. Their few-point advantage comes at the cost of much slower response times: DeepSeek R1 averages 20.65 seconds of latency even on a relatively short context (2 pages).
- On the Shared Max Context task, two of the top three models are reasoning models. Claude 3.5 Haiku Latest surprisingly takes second place. We noticed that these smaller models hedge less when given potentially incomplete information. In contrast, the larger models without reasoning ability are more likely to express doubt or refuse to answer because they recognize that the provided text is only an extracted chunk, and the eval system counts such hedging and refusals as incorrect answers.
Dataset
In both the finance and legal industries, it is a common task to ask a specific question about, or to understand a piece of information in, a very long document. A typical document type is the credit agreement: a contract, often over 200 pages, used when a large corporation receives a line of credit from a banking entity (you can find an example agreement here).
For this dataset, we worked with a team of experts, including financial analysts, legal professionals, and academics, to create a set of questions and answers about these credit agreements. The dataset is divided into:
- Public Validation: 20 questions from 1 document, accessible to anyone on request (email contact@vals.ai for access).
- Private Validation: 340 questions from 17 documents, available for purchase to evaluate and improve your own models.
- Test: 858 questions from 43 documents. This is the privately held set for Vals AI benchmarking and is never shared.
The dataset is also organized into three distinct tasks, which all use the entire set of questions but pass a different subset of information into each model’s context window. The three tasks are:
- Exact Pages: This task provides only the necessary pages required to answer each question, typically resulting in a small context of only a few pages.
- Shared Max Context: This task includes approximately 80 PDF pages while ensuring all information needed for the correct answer is included. The subselection doesn’t necessarily start at the first page, which can make document structure comprehension challenging for models.
- Max Fitting Context: This task includes the largest possible chunk of the document, starting from the first page, that fits within the model’s context window. This means longer-context models receive more information (a sketch of this truncation logic follows this list).
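As a rough illustration of the Max Fitting Context truncation, here is a minimal sketch assuming the document is already split into a list of page strings. It uses tiktoken’s `cl100k_base` encoding as a stand-in tokenizer (each provider tokenizes differently), and the function and parameter names are hypothetical rather than the benchmark’s actual code.

```python
import tiktoken

# Stand-in tokenizer; each provider tokenizes differently in practice.
ENCODING = tiktoken.get_encoding("cl100k_base")


def max_fitting_context(pages: list[str], context_limit: int, reserved: int = 2_000) -> str:
    """Concatenate pages from the first page onward until adding the next page
    would exceed the model's context window, keeping some tokens in reserve
    for the question and the answer."""
    budget = context_limit - reserved
    selected, used = [], 0
    for page in pages:
        page_tokens = len(ENCODING.encode(page))
        if used + page_tokens > budget:
            break
        selected.append(page)
        used += page_tokens
    return "\n\n".join(selected)


# A 200k-token model would see more of the document than a 128k-token one:
# context_200k = max_fitting_context(pages, context_limit=200_000)
# context_128k = max_fitting_context(pages, context_limit=128_000)
```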
The types of questions written by the experts include:
- Basic extraction of terms and numbers: Some examples include “Who is the borrower’s legal counsel?” or “What currencies are the USD 325mm RCF first lien available in?”
- Summarization and interpretation questions: Some examples include, “Is there an erroneous payment provision?” and “How will the loan proceeds be used?”
- Numeric reasoning or calculation-based questions: One example is “How much initial debt capacity is available to the Borrower on day one?”
- Questions involving referring to multiple sections of the provided text, especially to previous definitions: For instance, one example is “What is the minimum required amount of Available Liquidity that the company must maintain?” The model must refer to a previous definition of Available Liquidity.
- Giving opinions on terms based on market standards: For instance, “Are there any unusual terms used to define or adjust EBITDA?” These questions require the models to make a judgment call, rather than simply state a fact.
- Questions making use of industry jargon: There are several terms like “baskets”, which have a commonly understood meaning in the industry, but are almost never explicitly used in the agreement itself. An example is “Does the contract contain a Chewy Blocker?” (a type of clause meant to prevent a subsidiary from being released from its debt obligations).
Overall Results
Although the bigger models, like those from OpenAI and Anthropic, perform the best, the gap between them and the smaller models is not as large as in other benchmarks. Once latency and price are taken into account, the smaller models sometimes provide better accuracy for a given cost.
Model highlights
Claude 3.7 Sonnet
Release date: 2/24/2025
Accuracy: 65.5%
Latency: 15.1s
Cost: $3.00 / $15.00 (input / output)
- Claude 3.7 Sonnet (2025-02-24) shows very strong performance on this benchmark, even beating reasoning models like DeepSeek R1.
- It is also very consistent between the different tasks, showing minimal issues with the context window.
Model Output Examples
The questions that require synthesizing content and calculating values across multiple sections of the document are some of the hardest. Parsing the original PDFs also introduces many special characters, which can make it difficult to find information, in tables for example.
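To give a concrete sense of the cleanup this parsing noise calls for, here is a small, hypothetical normalization step; the artifacts vary by parser, so the patterns below are only examples, not the preprocessing actually used for the benchmark.

```python
import re


def normalize_parsed_text(text: str) -> str:
    """Collapse common PDF-extraction artifacts so terms and table values
    are easier to locate. The patterns are illustrative, not exhaustive."""
    text = text.replace("\u00a0", " ")       # non-breaking spaces
    text = re.sub(r"-\n(?=\w)", "", text)    # words hyphenated across line breaks
    text = re.sub(r"[ \t]+", " ", text)      # runs of spaces/tabs spilled out of tables
    text = re.sub(r"\n{3,}", "\n\n", text)   # excessive blank lines
    return text.strip()
```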
Below is an example of a number-retrieval question from the Max Fitting Context task. This kind of question can be difficult because a long financial document may contain several similar tables that relate to the question, but only one is correct.
In this case, the expected answer is 3.25 to 1.00.
Q: What is the Total Net Leverage Ratio limit for unlimited RPs and investments?
A: 3.25 to 1.00
CORRECT
Additional Notes
Context Window
Our testing across different context lengths reveals that the context window is an extremely significant factor in model performance. Window sizes differ across providers (a small check of whether a document fits a given model’s window is sketched after this list):
- Claude 3.5 Sonnet Latest and Claude 3.5 Haiku Latest can handle up to 200k tokens
- GPT-4o 2024-11-20 and Llama 3.1 Instruct Turbo (8B) are limited to 128k tokens
- Gemini 1.5 Flash (002) effectively has no limit for our test documents
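Here is the small check mentioned above: a minimal sketch that compares a document’s token count against these limits, again using `cl100k_base` as a stand-in tokenizer; the model keys and the reserve size are assumptions made for the example.

```python
import tiktoken

# Context limits (in tokens) for the models listed above; Gemini 1.5 Flash is
# omitted since its window exceeds every document in our test set.
CONTEXT_LIMITS = {
    "claude-3-5-sonnet-latest": 200_000,
    "claude-3-5-haiku-latest": 200_000,
    "gpt-4o-2024-11-20": 128_000,
    "llama-3.1-8b-instruct-turbo": 128_000,
}


def fits_in_context(document: str, model: str, reserved: int = 2_000) -> bool:
    """Return True if the document (plus a reserve for the question and the
    model's answer) fits within the model's context window."""
    n_tokens = len(tiktoken.get_encoding("cl100k_base").encode(document))
    return n_tokens + reserved <= CONTEXT_LIMITS[model]
```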
Evaluation Methodology
All evaluations were conducted using Claude 3.5 Sonnet Latest as the judge, with temperature set to 0. Even with a fixed temperature, OpenAI models showed slight non-determinism (1-2% variance). The evaluation process revealed that both the judging prompt and the choice of judge model can significantly affect results.
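For reference, a judging call of this kind might look like the sketch below, written with the Anthropic Python SDK; the judging prompt and the grading convention are simplified stand-ins, not the actual prompt used in this benchmark.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Simplified stand-in for the real judging prompt.
JUDGE_PROMPT = """You are grading a model's answer against a gold answer.
Question: {question}
Gold answer: {gold}
Model answer: {prediction}
Reply with exactly one word: CORRECT or INCORRECT."""


def judge(question: str, gold: str, prediction: str) -> bool:
    """Ask the judge model, at temperature 0, whether the prediction matches."""
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=10,
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, gold=gold, prediction=prediction
            ),
        }],
    )
    return response.content[0].text.strip().upper().startswith("CORRECT")
```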
Retrieval-Augmented Generation (RAG)
As an alternative to including the entire document in the context window, RAG techniques are extremely common. This method breaks long documents and databases into “chunks”, which are first retrieved and then passed to the model as context. This is another area of evaluation we will explore further.
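As a rough illustration of the idea (not the pipeline we will evaluate), the sketch below splits a document into overlapping chunks and retrieves the most similar ones by cosine similarity; `embed` stands for any text-embedding function and is a hypothetical placeholder.

```python
import numpy as np


def chunk_document(text: str, chunk_size: int = 2_000, overlap: int = 200) -> list[str]:
    """Split a long document into overlapping character-based chunks."""
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


def retrieve(question: str, chunks: list[str], embed, top_k: int = 5) -> list[str]:
    """Return the top_k chunks most similar to the question.

    `embed` is any function mapping a string to a vector (placeholder here);
    the retrieved chunks would then be passed to the model as context."""
    q = np.asarray(embed(question), dtype=float)
    scores = []
    for chunk in chunks:
        c = np.asarray(embed(chunk), dtype=float)
        scores.append(float(q @ c) / (np.linalg.norm(q) * np.linalg.norm(c) + 1e-9))
    order = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in order]
```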