Key Takeaways
- OpenAI’s o1 preview has taken the lead on LegalBench by 1.7 percentage points. It does especially well on Rule-application tasks, most likely because of its enhanced reasoning capabilities.
- Llama 3.1 Instruct (405B and 70B) comes in second, beating out the remaining closed-source and open-source models. It is more expensive than other open-source models for inference, but still very competitive on price compared to the closed models.
- The upgraded Claude 3.5 Sonnet and GPT-4o are nearly tied for third and fourth place, and they are priced very similarly.
- GPT-4o Mini and Llama 3.1 70B stand out as great budget models, achieving strong performance while being significantly cheaper than the flagship models.
- A given model’s performance can vary dramatically across different legal tasks, and there is still significant room for improvement across the board.
Context
There has been considerable effort to measure language model performance on academic tasks and in chatbot settings, but these high-level benchmarks do not translate directly to specific industry use cases. Here we start to remedy this by reporting our application-specific findings and live leaderboard results on LegalBench, a large crowd-sourced collection of legal reasoning tasks. The dataset is quite comprehensive, covering six major categories:
- Issue-spotting, where a model must determine if a fact has relevance to a particular law or legal entity.
- Rule-recall, where a model must identify a relevant rule or state its characteristics.
- Rule-conclusion, where a model must predict a legal outcome.
- Rule-application, where a model must analyze how a rule was applied to reach a conclusion.
- Interpretation, where a model must parse and understand legal text.
- Rhetorical understanding, where a model must determine whether a legal argument performs a certain function.
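Nearly all of these tasks are multiple-choice and are scored by exact-match accuracy on a single word or phrase. The sketch below is a rough illustration of that kind of scoring; the record format, normalization rules, and `generate` wrapper are assumptions for the example, not our actual evaluation harness.

```python
# Minimal sketch of multiple-choice accuracy scoring for a LegalBench-style task.
# The record format and normalization rules are illustrative assumptions only.

def normalize(text: str) -> str:
    """Lowercase and strip surrounding whitespace/punctuation so 'Yes.' matches 'yes'."""
    return text.strip().strip(".\"'").lower()

def score_task(examples: list[dict], generate) -> float:
    """examples: [{'prompt': ..., 'answer': ...}]; generate: a callable model wrapper."""
    correct = 0
    for ex in examples:
        prediction = normalize(generate(ex["prompt"]))
        correct += prediction == normalize(ex["answer"])
    return correct / len(examples)

# Usage: accuracy = score_task(task_examples, lambda p: call_model(p))
```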
Overall Results
On the overall task, o1 preview is the clear winner, outperforming the other models on every question type.
o1 Preview
Release date : 9/12/2024
Accuracy :
81.7%
Latency :
10.3s
Cost :
$15.00 / $60.00 (input / output, per 1M tokens)
- o1 preview is the new state-of-the-art, performing especially well on Rule Tasks.
- It is very expensive, much more so than GPT-4o or Claude 3.5 Sonnet, and it also comes with much higher latency.
- It's also harder to prompt - it lacks the fine-grained controls we have collectively become accustomed to.
- o1 was very verbose on the non-multiple-choice tasks (see the model examples).
Llama 3.1 Instruct Turbo (405B)
Release date : 7/23/2024
Accuracy :
79.0%
Latency :
0.8s
Cost :
$3.50 / $3.50 (input / output, per 1M tokens)
- The latest Llama 3.1 405B model had a standout performance on LegalBench, and was state-of-the-art before being dethroned by o1 preview.
- The model is priced higher than other open-source models, at $3.50/MTok for both input and output. This is still in line with GPT-4o and 3.5 Sonnet's pricing, though.
- We will see how the closed-source models respond to this new entry into the foundation model market.
The results per question type are summarized in the graph below.
[Figure: results per question type, by model]
On a task-by-task basis, o1 preview was often the best, but GPT-4o and Llama 3.1 also claimed two of the top task-specific spots. We can also see that a lot of the bump in performance for o1 came from tasks in the “Rule” category - for example, it got a nearly perfect 98% on “Rule QA”, one of the few free-response tasks in LegalBench.
Gemini Pro, Claude Sonnet, and Cohere Command R+ competed for the middle spots on the leaderboard. The open-source models generally had average performance, although among them the Llama models were significantly better. Interestingly, the Llama models did not perform well on Rule tasks, and the upgraded Claude 3.5 Sonnet did not perform well on Issue tasks.
[Figure: cost vs. accuracy, by model]
The cost-accuracy graph shows a few models that define a Pareto curve of tradeoffs: GPT-4o Mini, Llama 3.1 70B, Llama 3.1 405B, and o1 preview. Just off the Pareto curve are GPT-4o and Claude 3.5 Sonnet. Among these models, the objective difference in performance is small (a matter of a few percentage points), yet their prices differ wildly. GPT-4o Mini particularly stands out for its very high quality-to-price ratio. A cheaper model may still be the better choice in domains with high token usage or cost sensitivity.
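For readers reproducing this kind of plot, the Pareto frontier can be computed directly from (cost, accuracy) pairs. The sketch below uses placeholder numbers, not the leaderboard’s actual figures.

```python
# Sketch: find the models on the cost-accuracy Pareto frontier.
# The numbers below are placeholders, not the leaderboard's actual figures.

models = {
    "model_a": {"cost_per_mtok": 0.15, "accuracy": 0.72},
    "model_b": {"cost_per_mtok": 3.50, "accuracy": 0.79},
    "model_c": {"cost_per_mtok": 15.00, "accuracy": 0.81},
    "model_d": {"cost_per_mtok": 10.00, "accuracy": 0.75},
}

def pareto_frontier(models: dict) -> list[str]:
    """A model is on the frontier if no other model is both cheaper and more accurate."""
    frontier = []
    for name, m in models.items():
        dominated = any(
            other["cost_per_mtok"] <= m["cost_per_mtok"]
            and other["accuracy"] >= m["accuracy"]
            and other != m
            for other in models.values()
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(models))  # model_d is dominated by the cheaper, more accurate model_b
```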
Among the other models, there is a somewhat logarithmic trend, with the more expensive models seeing diminishing returns for marginal cost. Models from previous “generations” (Claude 2, GPT-3.5, etc.) perform strictly worse for their price than the newer models; likely, providers are disincentivizing their use.
Gemini 1.5 does not perform significantly better than Gemini 1.0 — although its performance is better on certain tasks and categories, it performs significantly worse on others. It is often overly verbose, or does not understand the in-context examples without additional prompting.
Model Output Example
We show some examples from one of the LegalBench free response tasks (Rule QA) here. The models were asked “Where in the Federal Rules of Civil Procedure are notice requirements described?”
This question is tricky because both Rules 4 and 5 talk about the service process and notice requirements. However, Rule 4 is the main rule that defines the first notice in a proceeding, whereas Rule 5 gives the procedures for filing pleadings, etc.
This question is also difficult because it is implicit in the question that this is “notice” of a suit. GPT correctly determines that the question is talking about the initial notice, whereas Opus does not. Interestingly, Gemini gets the answer wrong, but it is much more terse than the other two models. Despite its strong overall performance, Llama 3.1 makes the same mistake as Opus, citing Rule 5 instead of Rule 4. Notably, o1 preview is a lot more verbose than the other models on this question.
It’s important to note that this is one question from a very large evaluation dataset and is not representative of overall performance.
Q
Where in the Federal Rules of Civil Procedure are notice requirements described?
A
The notice requirements in federal civil litigation are primarily described in Rule 5 of the Federal Rules of Civil Procedure (FRCP). Specifically: 1. Rule 5(a) outlines the service requirements for pleadings and other papers, stating that unless the rules provide otherwise, each paper must be served on every party. 2. Rule 5(b) describes how service can be made, including the methods of service and the permissible means of electronic service. 3. Rule 5(c) states that all papers after the complaint required to be served must be filed with the court within a reasonable time after service. 4. Rule 5(d) covers the filing requirements, including electronic filing and the filing of discovery materials. Additionally, other rules in the FRCP may contain specific notice requirements for certain situations, such as Rule 4 for serving a summons and complaint, Rule 6 for computing and extending time, and Rule 55 for default judgments.
INCORRECT
Additional Notes
Gemini
The Gemini documentation and error handling were extremely poor. Even after turning content moderation to the least restrictive setting, the API frequently returned an “unsafe content” error. In some cases, we could circumvent this because the error payload included the supposedly unsafe generation.
Often, we would get an index out-of-bounds error originating from within Gemini’s own Python SDK instead of a more meaningful error message. After further debugging, we believe this to be an additional layer of content moderation that is not exposed to the user in any capacity.
In general, Gemini has much stricter content moderation on its output than most other models. This was true even when Content Blocking was set to the lowest possible setting. Because of this, the model was not able to successfully produce outputs for many tests in the learned_hands tasks and others. These cases are treated as failures. The Gemini Pro model results may improve considerably if the content moderation is better calibrated.
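For reference, this is roughly how blocked generations can be handled: request the least restrictive safety settings and score any blocked output as a failure. The sketch assumes the google-generativeai Python SDK; category names and error behavior may differ across SDK versions.

```python
# Sketch: request Gemini output with the least restrictive safety settings and
# treat blocked generations as failures. Assumes the google-generativeai Python
# SDK; field names may vary between SDK versions.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-pro")

SAFETY_SETTINGS = [  # lowest blocking threshold for each category
    {"category": c, "threshold": "BLOCK_NONE"}
    for c in (
        "HARM_CATEGORY_HARASSMENT",
        "HARM_CATEGORY_HATE_SPEECH",
        "HARM_CATEGORY_SEXUALLY_EXPLICIT",
        "HARM_CATEGORY_DANGEROUS_CONTENT",
    )
]

def generate_or_fail(prompt: str) -> str | None:
    """Return the model's text, or None (a scored failure) if the output was blocked."""
    response = model.generate_content(prompt, safety_settings=SAFETY_SETTINGS)
    try:
        return response.text  # raises if the candidate was blocked
    except ValueError:
        return None
```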
Gemini Pro pricing is per character, not per token. We go by the pricing listed here, and assume an average of 4 characters per token.
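Concretely, the conversion from per-character to approximate per-token pricing is a single multiplication; the rate in the sketch below is a placeholder, not Google’s actual price.

```python
# Convert Gemini's per-character pricing to an approximate per-token price,
# assuming an average of 4 characters per token. The rate below is illustrative.
PRICE_PER_1K_CHARS = 0.000125   # placeholder: $ per 1,000 characters
CHARS_PER_TOKEN = 4

price_per_token = (PRICE_PER_1K_CHARS / 1000) * CHARS_PER_TOKEN
price_per_mtok = price_per_token * 1_000_000
print(f"~ ${price_per_mtok:.2f} per 1M tokens")
```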
Claude 2
Almost all LegalBench tasks are multiple-choice and expect a single word or choice as the output. Claude 2 had extreme difficulty producing outputs in this format. Even when explicit instructions were provided, such as “Don’t include an explanation for your choice” or “Answer in a single word only”, the model reliably produced a paragraph-length output with some explanation.
To give the model a chance, we wrote a custom parser for Claude 2. We asked it to produce outputs in a JSON format with ‘explanation’ and ‘answer’ keys, then extracted the ‘answer’ key. We did not perform this for any subsequent Claude models - Claude 3.0 and 3.5 were evaluated normally.
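A rough sketch of that workaround is shown below; the prompt wording and regex are illustrative, not the exact parser we used.

```python
# Sketch of the Claude 2 workaround: ask for a JSON object with 'explanation'
# and 'answer' keys, then extract only the 'answer'. Prompt wording is illustrative.
import json
import re

JSON_INSTRUCTION = (
    'Respond only with a JSON object of the form '
    '{"explanation": "...", "answer": "..."}.'
)

def extract_answer(raw_output: str) -> str | None:
    """Pull the 'answer' field out of a possibly chatty Claude 2 response."""
    match = re.search(r"\{.*\}", raw_output, flags=re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0)).get("answer")
    except json.JSONDecodeError:
        return None
```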
When asked for a single-word response, the Claude 3 models still struggled to consistently follow these directions. These problems were solved once we provided a system prompt with these expectations.
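A minimal sketch of that setup using the Anthropic Python SDK follows; the model name and prompt wording are examples, not our exact configuration.

```python
# Sketch: constrain Claude 3 to single-word answers via a system prompt.
# Assumes the Anthropic Python SDK; the model name and prompt wording are examples.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=5,
    system="Answer with a single word only. Do not explain your choice.",
    messages=[{"role": "user", "content": "Is this clause a non-compete? Yes or No."}],
)
print(response.content[0].text)
```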