Key Takeaways
- o1 preview is head and shoulders above its predecessors. Its improved reasoning capabilities likely help it understand and parse complex legal documents. This performance comes at a price, though, in both cost and latency.
- The Claude 3.5 Sonnet models (both the upgraded and the older version) and GPT-4o perform very similarly, with the Sonnet models slightly ahead. They are also priced similarly.
- Models tended to struggle when questions required them to refer to multiple sections at a time. For instance, if an answer was in the document but relied on a definition in a different section, the models often failed to locate that definition, resulting in incomplete or inaccurate answers.
- The Mixtral (8x7b) model performed surprisingly well. It had no refusals and consistently referred to correct sections for direct extraction questions.
- Although all models were tested with the same context windows for fairness, there is high potential for long context window models (Gemini 1.5, potentially Claude models) to perform these tasks very well in real-world applications. Read more about this in Additional Notes.
Context
There has been considerable effort to measure the performance of language models in academic and chatbot settings. However, these high-level benchmarks are often contrived, or not applicable to specific industry use cases. Furthermore, model performance results released by LLM providers are highly biased - they are often cherry-picked to show state-of-the-art results.
Here, we start to remedy this by reporting our third-party, application-specific findings and live leaderboard results on the CorpFin dataset. This dataset, created by vals.ai, consists of questions and answers about public commercial credit agreements. These credit agreements are contracts used when large corporations receive a line of credit from a banking entity (you can find an example agreement here). The types of questions are as follows:
Basic extraction of terms and numbers: Some examples include “Who is the borrower’s legal counsel?” or “What currencies are the USD 325mm RCF first lien available in?”
Summarization and interpretation questions: Some examples include, “Is there an erroneous payment provision?” and “How will the loan proceeds be used?”
Numeric reasoning or calculation-based questions: One example is “How much initial debt capacity is available to the Borrower on day one?”
Questions involving referring to multiple sections of the provided text, especially to previous definitions: For instance, one example is “What is the minimum required amount of Available Liquidity that the company must maintain?” The model must refer to a previous definition of Available Liquidity.
Giving opinions of terms based on market standards: For instance, “Are there any unusual terms used to define or adjust EBITDA?” These questions require the models to make a judgment call, rather than just make a statement of fact.
Questions making use of industry jargon: There are several terms like “baskets”, which have a commonly understood meaning in the industry, but are almost never explicitly used in the agreement itself. An example is “Does the contract contain a Chewy Blocker?” (a type of clause meant to prevent a subsidiary from being released from its debt obligations).
Highest Quality Models
o1 Preview
Release date: 9/12/2024
Accuracy: 76.4%
Latency: 10.5s
Cost: $15.00 / $60.00 (input / output, per 1M tokens)
- o1 preview performs far better than any previous model at this task - while other models have shown only incremental improvements, it represents a large jump in performance.
- It bucks the trend of price decreases we've seen from OpenAI previously - it is 3-5x the price of GPT-4o, the previous flagship model.
- It also uses more output tokens under the hood for its reasoning steps, which causes a corresponding latency hit.
Claude 3.5 Sonnet Latest
Release date: 10/22/2024
Accuracy: 71.8%
Latency: 1.9s
Cost: $3.00 / $15.00 (input / output, per 1M tokens)
- Anthropic's recently upgraded Claude 3.5 Sonnet is strong, fast, and cheap.
- The upgraded version boasts a few percentage points of improvement over its predecessor.
- It still has much room for improvement, particularly in math reasoning and multi-step look-up questions.
The results per question type are summarized in the graph below.
[Graph: accuracy by question type]
Most models perform within a fairly narrow margin of each other. This is likely because all models reliably performed well on the easier direct extraction questions. However, o1 preview stood out above the rest on the harder, differentiating questions - most likely because of its ability to handle multi-step questions and questions that require second-order reasoning. Claude 3.5 Sonnet also performed well.
Of the open-source models, Llama 3.1 405B performed by far the best, behind only the GPT-4 and Claude 3 families. The Mixtral model also performed exceptionally, beating out Gemini, GPT-3.5, and Cohere. This was largely because of its low refusal rate. While other models would say that they had incomplete information or that the answer was not contained in the text provided, Mixtral would always attempt an answer (and was often correct).
o1 preview's performance comes at a high cost. It is one of the most expensive models, only slightly cheaper than Claude 3 Opus. It also comes with a high latency penalty - over 10 seconds in some cases.
Claude 3.5 Sonnet and GPT-4o are similar in terms of performance. They also both come at similar, relatively reasonable costs. 3.5 Sonnet is the same price as Command R Plus and cheaper than Gemini 1.5, with much better performance. GPT-4o is 67% more expensive, but still within the same order of magnitude. For the easier tasks in the dataset, one of the more lightweight models like Llama, Command-R, or Mixtral may still be preferable.
A standout “budget” model is GPT-4o mini, performing 4th best despite being an order of magnitude cheaper than GPT-4o or Claude 3 Opus. If open source is preferred, one could use Llama 3.1 70B or Mixtral. Interestingly enough, the Pareto curve can be defined by only three models - GPT-4o mini, Claude 3.5 Sonnet, and o1 preview - as the sketch below illustrates.
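To make the Pareto-curve claim concrete, here is a minimal sketch of how a cost/accuracy Pareto frontier can be computed. Only the o1 preview and Claude 3.5 Sonnet figures come from this report; the GPT-4o mini accuracy is an illustrative placeholder, not a leaderboard number.

```python
# Minimal sketch: find the models that are Pareto-optimal on (cost, accuracy).
# A model is on the frontier if no other model is both cheaper and more accurate.
# Only the o1 preview and Claude 3.5 Sonnet numbers come from this report;
# the GPT-4o mini accuracy is an illustrative placeholder.

models = {
    "o1 preview":        {"input_cost": 15.00, "accuracy": 0.764},
    "Claude 3.5 Sonnet": {"input_cost": 3.00,  "accuracy": 0.718},
    "GPT-4o mini":       {"input_cost": 0.15,  "accuracy": 0.65},  # placeholder accuracy
}

def pareto_frontier(models: dict) -> list[str]:
    frontier = []
    for name, m in models.items():
        dominated = any(
            other is not m
            and other["input_cost"] <= m["input_cost"]
            and other["accuracy"] >= m["accuracy"]
            for other in models.values()
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(models))  # all three models, since each trades cost for accuracy
```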
Model Output Examples
We noticed that some of the harder questions were ones that required synthesizing content across multiple sections in the context provided. In this question, we ask how much money is made available in the credit agreement and in what currencies it can be accessed.
A perfect answer needs to mention that 400 million is available in a mix of USD and CAD, but with a cap of 50 million available in CAD. Opus and Gemini were able to produce answers like this. GPT-4 only gave the total amount available, saying that it could be funded in USD or CAD without specifying the limit for CAD. Llama 3.1 405B also successfully mentioned the sublimit for Canadian Dollars.
The following is the prompt that was provided to all models. The original input included 2 full pages from the credit agreement. For readability, only a subsection of the context is included.
Q
What is the total amount available through the Revolving Credit Facility (RCF) as specified in the credit agreement, and in which currencies can it be funded? Answer concisely in less than 3 sentences.
A
According to the credit agreement, the total amount available through the Revolving Credit Facility is $400,000,000. The Revolving Credit Loans can be funded in either U.S. Dollars or Canadian Dollars, with a sublimit of $50,000,000 for the portion denominated in Canadian Dollars.
CORRECT
Additional Notes
Context Window
Each large language model has a “context window”, which refers to the amount of input text that can be provided for the model to use. Different models have different maximum input sizes, which limit how much text can be provided.
For these tests, we ensured that we only provided as much input text as the model with the smallest context window could handle. This was 4096 tokens (word fragments that models take as inputs) or roughly 2-4 pages. We made sure that the pages provided in the context alongside each question were sufficient to give the correct answer.
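As an illustration of trimming context to a fixed token budget, the sketch below uses the tiktoken tokenizer and a 4096-token budget; the tokenizer choice and the file name are assumptions for illustration, not the exact preprocessing used for the benchmark.

```python
# Sketch: trim a document excerpt to a fixed token budget before prompting.
# Assumes the `tiktoken` library; the 4096-token budget matches the smallest
# context window used in these tests, but this is not the exact benchmark code.
import tiktoken

def truncate_to_budget(text: str, budget: int = 4096, encoding: str = "cl100k_base") -> str:
    enc = tiktoken.get_encoding(encoding)
    tokens = enc.encode(text)
    if len(tokens) <= budget:
        return text
    return enc.decode(tokens[:budget])

# Hypothetical file name, used only for illustration.
with open("credit_agreement_excerpt.txt") as f:
    context = truncate_to_budget(f.read())
```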
It’s worth noting that in industry use of these models, lawyers and analysts will likely want to submit the entire credit agreement for review, rather than doing the initial work to find the pages relevant to their query. However, credit agreements can easily run 300 pages (on the order of hundreds of thousands of tokens) and would not fit in the context window of most of these popular language models.
Long Context Models
The context passed into every model was relatively limited, because context window size varies widely between models. However, the Gemini 1.5 model and some privately supported versions of Claude 3 have 1-million-token context windows, which can support reading and answering questions about full credit agreements. We plan to design a task specifically to test long-context capabilities on the subset of models that can support it.
Retrieval-Augmented Generation (RAG)
Alternatively, the RAG technique has gained considerable popularity and is being further refined. This method breaks long documents and databases into “chunks”, which are first retrieved and then passed to the model as context. This is another area of evaluation we may explore further.
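As a rough illustration of the idea (not the pipeline used in this evaluation), the sketch below chunks a document, retrieves the chunks most similar to the question with TF-IDF similarity, and assembles them into a prompt. TF-IDF is chosen here purely for simplicity; production systems typically use embedding-based retrieval.

```python
# Rough RAG sketch: chunk a long document, retrieve the chunks most similar to
# the question, and build a prompt from them. Uses TF-IDF retrieval from
# scikit-learn for simplicity; real systems usually use embedding models.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def chunk(text: str, size: int = 1500, overlap: int = 200) -> list[str]:
    # Fixed-size character chunks with a small overlap between neighbors.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def retrieve(question: str, chunks: list[str], k: int = 4) -> list[str]:
    # Score each chunk against the question and keep the top-k matches.
    vectorizer = TfidfVectorizer().fit(chunks + [question])
    chunk_vecs = vectorizer.transform(chunks)
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, chunk_vecs)[0]
    top = scores.argsort()[::-1][:k]
    return [chunks[i] for i in top]

def build_prompt(question: str, document: str) -> str:
    # Concatenate the retrieved passages into the context for the model.
    passages = retrieve(question, chunk(document))
    return "Context:\n" + "\n---\n".join(passages) + f"\n\nQuestion: {question}"
```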