Key Takeaways
- o1 Preview demonstrated the best performance, with a roughly 2-percentage-point margin over 3.5 Sonnet and a slightly larger one over GPT-4o - though at a high cost, given the dataset's long documents and the resulting token usage.
- Claude 3.5 Sonnet narrowly outperformed GPT-4o overall and excelled in the “criminal” task, while GPT-4o showed better performance in the “family” task.
- The Llama 3.1 70B model performed remarkably well, surpassing other open-source models and competing with leading closed-source models. Llama 3.1 405B, with a 4,096-token context window limit (too small for these documents), was excluded from the analysis.
- In some instances, the models provided answers superior to the human-established ground truth, as found during quality assurance checks.
- A common failure mode was identifying only part of the relevant document sections: the models leaned on their general knowledge rather than the specific document context, despite being instructed otherwise.
- The top-performing models demonstrated high objective accuracy, making them strong candidates for tasks like this, especially with additional prompt tuning before production use.
Context
Our previous two custom datasets, CorpFin and ContractLaw, focused on transactional use cases. In this report, we present our first study of using LLMs for litigation - the area of legal practice conducted through the public court system.
In collaboration with the legaltech startup Jurisage, we announce the creation of our latest dataset, “Case Law”. In litigation, it is common to determine how a specific legal question was answered in numerous precedent cases. For example, determining the amount of child support required in several family law cases with similar fact patterns is a typical inquiry.
We pose these questions to models using case law data provided by Jurisage, covering two areas of law: family and criminal. This dataset is private; the case data is not publicly accessible, and all cases used are from June 2024, beyond the models’ training cutoff dates. The Q&A pairs were written with Jurisage’s support and are entirely novel.
Case law in the US and Canada is challenging to use at scale. Strict licensing requirements prevent LLMs from being trained on it, and the organizations that have amassed substantial case histories provide access only for legal use, at significant cost.
Testing models on this data offers several benefits. First, practicing lawyers will query models with references to recent cases that foundation models and applications have not seen during training. Second, the majority of legal LLM evaluations have focused solely on US law; our study expands the scope to explore their application in other legal systems.
Family Law: Concerns matters related to divorce, custody, child support, and emergency protection orders (EPO).
Criminal Law: Involves issues such as detention, validity of guilty pleas, the ability to adduce fresh evidence on appeal, burden of proof, mandatory minimums, and sentencing objectives.
Each question is posed against a case document (approximately 20k - 30k tokens). The questions fall into two types - extractive and summative. Extractive answers are verified to closely match the ground-truth excerpts, whereas summative answers are checked to ensure they include the most pertinent semantic points (a grading sketch follows the examples below).
Example Summative Question: How are “access costs” characterized in situations where parents live in different cities or provinces? Answer the question by paraphrasing excerpts solely from the document.
Example Extractive Question: Was the payor’s child support reduced? Provide the answer using only an excerpt or multiple excerpts from the document, without additional description or explanation.
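As a rough illustration of this grading scheme, here is a minimal sketch - not our actual evaluation harness. The normalization, the fuzzy-match threshold, and the `judge` callable are all assumptions made for the example:

```python
import difflib
from typing import Callable

def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so formatting differences don't fail a match.
    return " ".join(text.lower().split())

def grade_extractive(answer: str, expected_excerpts: list[str], threshold: float = 0.9) -> bool:
    # Extractive: every expected excerpt must appear (near-)verbatim in the answer.
    answer_norm = normalize(answer)
    for excerpt in expected_excerpts:
        excerpt_norm = normalize(excerpt)
        if excerpt_norm in answer_norm:
            continue
        # Fuzzy fallback: what fraction of the excerpt's characters can be
        # aligned against the answer? Tolerates minor transcription noise.
        matcher = difflib.SequenceMatcher(None, excerpt_norm, answer_norm)
        covered = sum(block.size for block in matcher.get_matching_blocks())
        if covered / max(len(excerpt_norm), 1) < threshold:
            return False
    return True

def grade_summative(answer: str, key_points: list[str],
                    judge: Callable[[str, str], bool]) -> float:
    # Summative: `judge` is any callable (e.g., an LLM-as-judge wrapper) that
    # decides whether the answer covers a given semantic point; the score is
    # the fraction of key points covered.
    covered = sum(1 for point in key_points if judge(answer, point))
    return covered / max(len(key_points), 1)
```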
Highest Quality Models
o1 Preview
Release date: 9/12/2024
Accuracy: 87.3%
Latency: 19.4s
Cost: $15.00 / $60.00 (USD per 1M input / output tokens)
- o1 Preview performed the best on this task, with a 2.4-percentage-point margin over 3.5 Sonnet.
- On the long documents in this dataset, it's quite slow - taking almost 20s on average.
- It's also quite expensive, especially given the longer-context nature of this task.
Claude 3.5 Sonnet Latest
Release date: 10/22/2024
Accuracy: 84.9%
Latency: 4.5s
Cost: $3.00 / $15.00 (USD per 1M input / output tokens)
- Anthropic's recently released Claude 3.5 Sonnet was the second-highest-performing model, and #1 on the 'criminal' subtask.
- What's more, it achieved this performance at a mid-range price point - far cheaper than its predecessor, Claude 3 Opus.
- When it got questions wrong, it was often because it picked a closely related segment of the case that was a less optimal citation.
The results per question type are summarized in the graph below.
[Figure: accuracy per question type (extractive vs. summative) for each model]
The total spread between the top-performing and worst-performing models was 29.2 percentage points. Clearly, this is a dataset that strong models can perform well on - but that weak models struggle with. o1 Preview was the winner in aggregate, but 3.5 Sonnet claimed the top spot on the “criminal” subtask.
Interestingly enough, Claude 3.0 Sonnet outperforms Claude 3.0 Opus - perhaps it is better suited to the extraction tasks, despite (or because of) being less powerful.
For its size, Mixtral punched above its weight, beating out both Cohere and Gemini. It is also one of the few open-source models with a 32,000-token context window. However, it was overshadowed by the recently released Llama 3.1 models.
[Figure: accuracy vs. cost for each model, with the Pareto frontier highlighted]
If both cost and latency are no object, then o1 Preview seems like the clear-cut winner. However, that is rarely the case - and on a document-heavy dataset like this one, o1 Preview gets expensive quickly.
At its price point, therefore, Claude 3.5 Sonnet is also compelling: it comes close to o1 Preview's performance at a fraction of the cost. We will see how Anthropic responds to o1 - by continuing to make cheaper models, or by releasing another expensive, powerful model (a la Opus).
Although Cohere Command R is technically on the Pareto curve, it is the third-worst-performing model - so it would likely only be useful where cost is a very high priority. Even so, Mixtral or Llama 3.1 would likely be the better option in practice.
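To make the cost tradeoff concrete, here is a back-of-the-envelope estimate using the list prices from the model cards above. The ~500-token answer length is an assumption, and o1 Preview's hidden reasoning tokens (billed as output) would push its real figure higher:

```python
# USD per 1M input / output tokens, from the model cards above.
PRICES = {
    "o1 Preview":        (15.00, 60.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
}

DOC_TOKENS = 25_000     # midpoint of the ~20k-30k-token case documents
OUTPUT_TOKENS = 500     # assumed answer length (excludes o1's reasoning tokens)

for model, (price_in, price_out) in PRICES.items():
    cost = DOC_TOKENS / 1e6 * price_in + OUTPUT_TOKENS / 1e6 * price_out
    print(f"{model}: ~${cost:.3f} per question")

# o1 Preview: ~$0.405 per question
# Claude 3.5 Sonnet: ~$0.083 per question
```

At roughly five times Sonnet's per-question cost before reasoning tokens are counted, o1 Preview's edge of a couple of percentage points is hard to justify for high-volume document workloads.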
Model Output Examples
Here is an example of one of the extraction questions. In this case, R. v. Jones, 2024 ONSC 3181, the respondent Mr. Jones was carrying a handgun while under a weapons prohibition and was also in possession of narcotics.
The prompt asks what “mitigating factors” exist - reasons to potentially give Mr. Jones a lower sentence. There are two relevant passages we’d expect the model to give - a correct answer would provide both, verbatim.
Expected Excerpt 1: Regrettably, there are few mitigating factors in this case. Mr. Jones did not plead guilty, but he had an uncontested trial after an unsuccessful Charter motion. Proceeding this way certainly saved the state the resources required to prosecute him, including court time, which is at a premium in this post-pandemic world. Although it is not as mitigating as a full guilty plea before trials or motions, it is certainly mitigating. Perhaps more importantly, Mr. Jones appears to be remorseful, and I take that into account. It is also mitigating that Mr. Jones has the support of his family. The EPSR was helpful in setting out some of the mitigating factors associated with Mr. Jones’ background. He had a significantly deprived upbringing. He had few advantages and many disadvantages. There was nothing in his previous life to set him up for success, and much to set him up for failure. That has an impact on his moral blameworthiness. It is also mitigating that Mr. Jones spent a great deal of time in harsh pre-trial detention, where conditions were difficult. I will have more to say about this mitigating factor in a moment.
Expected Excerpt 2: I take these harsh conditions of custody into account as a mitigating factor: R. v. Duncan, supra; R. v. Marshall, 2021 ONCA 344.
The first excerpt is a section noting that there are not many mitigating factors, but summarizing the few that exist: he was remorseful, he had an uncontested trial, he came from a significantly disadvantaged background, and he experienced harsh pre-trial detention conditions. The second expected excerpt elaborates on the harsh pre-trial conditions.
Generally, all of the models were able to accurately pull the first excerpt (for example, o1 successfully pulled both). However, only some of the models accurately included the second expected excerpt.
Q
What did the court consider as mitigating factors? Answer the question only with an excerpt or multiple excerpts from the document - do not include additional description or explanation.
A
26. Regrettably, there are few mitigating factors in this case. Mr. Jones did not plead guilty, but he had an uncontested trial after an unsuccessful Charter motion. Proceeding this way certainly saved the state the resources required to prosecute him, including court time, which is at a premium in this post-pandemic world. Although it is not as mitigating as a full guilty plea before trials or motions, it is certainly mitigating. Perhaps more importantly, Mr. Jones appears to be remorseful, and I take that into account. It is also mitigating that Mr. Jones has the support of his family. The EPSR was helpful in setting out some of the mitigating factors associated with Mr. Jones' background. He had a significantly deprived upbringing. He had few advantages and many disadvantages. There was nothing in his previous life to set him up for success, and much to set him up for failure. That has an impact on his moral blameworthiness. It is also mitigating that Mr. Jones spent a great deal of time in harsh pre-trial detention, where conditions were difficult. I will have more to say about this mitigating factor in a moment. 35. I take these harsh conditions of custody into account as a mitigating factor: R. v. Duncan, supra; R. v. Marshall, 2021 ONCA 344. 36. I also note that the EPSR was ordered on the day that Mr. Jones had his uncontested trial before me. It was originally supposed to be ready in the fall of 2023. Through no fault of his, Mr. Jones was required to wait in remand while the EPSR was prepared. I find that this is also a mitigating factor. In R. v. Hartling, 2020 ONCA 243, the Court of Appeal reduced a sentence by five months because of the time it took to obtain a Gladue report – as in this case, well over a year. I will treat the lengthy time it took to obtain the EPSR as a mitigating factor.
CORRECT
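One practical wrinkle in grading answers like the one above: the model prepends paragraph numbers ("26.", "35.") and may use straight apostrophes where the source document uses curly ones, so a naive string-equality check fails even for a faithful quotation. Here is a minimal containment check with light normalization - a hypothetical sketch, since the actual harness may normalize differently:

```python
import re

def contains_verbatim(answer: str, excerpt: str) -> bool:
    # True if `excerpt` appears inside `answer` once quote styles and
    # whitespace are normalized away.
    def norm(t: str) -> str:
        t = t.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')
        return re.sub(r"\s+", " ", t).strip()
    return norm(excerpt) in norm(answer)

# Toy demonstration: the paragraph number and apostrophe style differ, but the
# excerpt is still found verbatim inside the answer.
excerpt = "Mr. Jones\u2019 background was significantly deprived."
answer = "35. Mr. Jones' background was significantly deprived. I take that into account."
print(contains_verbatim(answer, excerpt))  # True
```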