ContractLaw Benchmark

| Rank | Provider | Model | Cost (input / output, $ per 1M tokens) | Accuracy | Latency |
|---|---|---|---|---|---|
| 1 | Meta | Llama 3.1 Instruct Turbo (405B) | $3.50 / $3.50 | 75.2% | 2.19 s |
| 2 | Anthropic | Claude 3 Opus | $15.00 / $75.00 | 74.0% | 5.97 s |
| 3 | OpenAI | o1 Mini | $3.00 / $12.00 | 72.8% | 4.01 s |
| 4 | OpenAI | GPT 4o Mini | $0.15 / $0.60 | 72.4% | 1.92 s |
| 5 | OpenAI | GPT 4 | $10.00 / $30.00 | 71.8% | 3.26 s |
| 6 | OpenAI | o1 Preview | $15.00 / $60.00 | 69.0% | 12.83 s |
| 7 | Anthropic | Claude 3.5 Sonnet Latest | $3.00 / $15.00 | 68.7% | 2.28 s |
| 8 | Meta | Llama 3.1 Instruct Turbo (70B) | $0.70 / $0.70 | 68.6% | 4.74 s |
| 9 | Cohere | Command R+ | $3.00 / $15.00 | 68.2% | 1.17 s |
| 10 | Anthropic | Claude 3.5 Sonnet | $3.00 / $15.00 | 68.2% | 1.61 s |
| 11 | Google | Gemini 1.5 Pro 001 | $1.25 / $5.00 | 68.0% | 4.11 s |
| 12 | Anthropic | Claude 3 Sonnet | $3.00 / $15.00 | 67.6% | 3.03 s |
| 13 | Meta | Llama 3 (70B) | $0.90 / $0.90 | 66.8% | 2.92 s |
| 14 | Google | Gemini 1.0 Pro 002 | $0.50 / $1.50 | 66.7% | 0.89 s |
| 15 | Cohere | Command R | $0.50 / $1.50 | 66.7% | 0.56 s |
| 16 | Mistral | Mixtral (8x7B) | $0.60 / $0.60 | 64.9% | 1.11 s |
| 17 | Meta | Llama 3.1 Instruct Turbo (8B) | $0.18 / $0.18 | 63.4% | 0.92 s |
| 18 | Mistral | Mistral (7B) | $0.18 / $0.18 | 63.0% | 0.87 s |
| 19 | Meta | Llama 3 (8B) | $0.20 / $0.20 | 62.7% | 0.57 s |
| 20 | OpenAI | GPT 4o | $2.50 / $10.00 | 61.7% | 1.32 s |
| 21 | OpenAI | GPT 3.5 | $0.50 / $1.50 | 61.6% | 1.09 s |
| 22 | Meta | Llama 2 (70B) | $0.90 / $0.90 | 59.7% | 1.39 s |
| 23 | Meta | Llama 2 (13B) | $0.20 / $0.20 | 55.9% | 1.47 s |
| 24 | Databricks | DBRX Instruct | $2.25 / $6.75 | 48.0% | 1.26 s |
| 25 | Meta | Llama 2 (7B) | $0.20 / $0.20 | 42.5% | 0.80 s |


Key Takeaways

  • Meta’s Llama 3.1 405B was the top-performing model: it achieved 75.2% accuracy, setting a new SOTA on this task. All of the Llama 3.1 models performed particularly well on the extraction task.
  • Anthropic’s Claude 3 Opus model was second. It showed particular strength in determining whether contract language was in accordance with firm standards and suggesting corrections.
  • o1 Mini performed the best of the OpenAI models, although all of the GPT-4 models were clustered relatively closely (likely within random deviation). Surprisingly, o1 Preview performed worse than the others, mainly due to its poor performance on matching tasks.
  • Overall, language models are reasonably capable of performing tasks on contract law-related questions for documents of this type. It is likely that we will continue to see improvement as new models are released.

Context

There has been a considerable effort to measure language model performance in academic tasks and chatbot settings, but these high-level benchmarks are contrived and not applicable to specific industry use cases. Further, model performance results released by LLM providers are highly biased - they are often manufactured to show state-of-the-art results.

Here we start to remedy this by reporting our third-party, application-specific findings and live leaderboard results on the ContractLaw dataset, which was created in collaboration with SpeedLegal. The dataset consists of three task types, each evaluated across a range of contract types. The tasks are as follows.

Extraction: The model is asked to retrieve the part of the contract that relates to a given term. It must understand the legal term being searched for and extract the relevant phrase or sentence. Example extraction terms include “Non-Competition Covenant” and “Governing Law”.

Matching: The model is given an excerpt of a contract and a standard text and must determine whether the contract upholds the expected standard. When lawyers review legal contracts, they determine whether the language is within the expectations of their client; statements that are too risky or non-standard should be identified and corrected before contracts are signed. Here, the model was asked whether a given statement should be flagged.

Correction: Given an excerpt of contract text and a standard text, the model is asked to correct the contract text to meet the standard. This is the fix a lawyer might write before sending a revised contract to the opposing party for review.

All three tasks were evaluated over five contract types: Non-Disclosure Agreements (NDA), Data Processing Agreements (DPA), Master Service Agreements (MSA), Sales Agreements, and Employment Agreements.
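To make the task setup concrete, here is a minimal sketch of how prompts for the three task types might be assembled. The templates, field names, and wording below are our own illustrative assumptions, not the benchmark's actual prompts.

```python
# Illustrative only: our own reconstruction of the three ContractLaw task
# types, not the benchmark's actual prompts.

TEMPLATES = {
    "extraction": (
        "You are a lawyer reviewing a {contract_type} contract.\n"
        "Extract the phrase or sentence that relates to the term: {term}.\n\n"
        "Contract:\n{contract_text}"
    ),
    "matching": (
        "You are a lawyer reviewing a {contract_type} contract.\n"
        "Standard text:\n{standard_text}\n\n"
        "Contract excerpt:\n{contract_excerpt}\n\n"
        "Does the excerpt meet the standard? Answer 'matched' or 'unmatched'."
    ),
    "correction": (
        "You are a lawyer reviewing a {contract_type} contract.\n"
        "Standard text:\n{standard_text}\n\n"
        "Contract excerpt:\n{contract_excerpt}\n\n"
        "Correct the excerpt so that it meets the standard. "
        "Respond with a provision suggested fix."
    ),
}

def build_prompt(task: str, **fields: str) -> str:
    """Fill in the template for one of the three task types."""
    return TEMPLATES[task].format(**fields)

# Example (hypothetical values):
print(build_prompt(
    "extraction",
    contract_type="NDA",
    term="Governing Law",
    contract_text="...",
))
```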


Highest Quality Models

Llama 3.1 Instruct Turbo (405B)

Release date: 7/23/2024
Accuracy: 75.2%
Latency: 2.2s
Cost: $3.50 / $3.50

  • The latest Llama-3.1 405B model achieved state of the art on this task.
  • It was particularly good on extraction tasks, achieving the top three spots.
  • Like the other open-source models, it is priced the same for input and output tokens - but at a higher rate than the rest of them.



Claude 3 Opus

Release date: 2/29/2024
Accuracy: 74.0%
Latency: 6.0s
Cost: $15.00 / $75.00

  • The Opus model was overall the second-best performing model, by accuracy. Unlike in other tasks, it outperformed the newer Sonnet 3.5 model.
  • More than any other model, it showed the ability to correct contract law language so that it accords with a standard.
  • Opus did not perform well on extraction questions; its performance there was equivalent to the less powerful Claude 3 Sonnet.


The results per question type are summarized in the graph below.

[Chart: ContractLaw results by question type]

Llama 3.1 was the top-performing model on the extraction task, and overall. It was also tied for first on the correction task, which was clearly the most challenging task overall and separated Llama, GPT-4, the Claude 3 models, and Gemini from the rest of the pack. It is understandable why this task was so challenging: correction requires the model to interpret a standard text and generate a novel revision of the contract text.

GPT-4o Mini performed the best on the matching task, followed by Opus and Llama 3.1. Despite its small size and low cost, this budget model packs a powerful punch, and is very competitive.

GPT-4o performed a few percentage points better than GPT-4 on extraction and correction — this was expected, and matches what we see on other tasks. However, it performed extremely poorly on the matching task. A deeper look into its outputs showed that it was outputting the same answer on almost every prompt — since the classes were balanced, it achieved roughly 50% performance. This brought its overall average significantly down.
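The degenerate behaviour described above is easy to check for. The sketch below is our own illustration (not the benchmark's evaluation code) of why a model that emits the same label on every matching prompt lands at roughly 50% accuracy when the two classes are balanced.

```python
from collections import Counter

def constant_answer_rate(predictions: list[str]) -> float:
    """Share of predictions taken up by the single most common label."""
    top_count = Counter(predictions).most_common(1)[0][1]
    return top_count / len(predictions)

def accuracy(predictions: list[str], labels: list[str]) -> float:
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Hypothetical illustration: a model that always answers "matched"
labels = ["matched", "unmatched"] * 50   # balanced classes
predictions = ["matched"] * 100          # constant output

print(constant_answer_rate(predictions))  # 1.0 -> degenerate behaviour
print(accuracy(predictions, labels))      # 0.5 -> ~50% on balanced data
```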

o1 Preview did not perform as well as one would expect on this task. This is mostly because it performs very poorly on the matching task (deciding whether a contract matches a standard text). Although it outputs the binary label (matched or unmatched) in the correct format, it outputs the wrong label at a much higher frequency than other models.

Gemini 1.5 Pro performs exceedingly well on the ContractLaw extraction tasks, second only to Llama. It performs reasonably well on matching tasks too (4th). However, it does very poorly on correction: it is much too verbose rather than cleanly answering the task.

Anthropic’s latest Claude 3.5 Sonnet performed a few percentage points better than Sonnet 3.0; however, we did not see the massive performance gains or new SOTAs that it achieved on the TaxEval and CorpFin datasets. It performed especially poorly on extraction, and middle-of-the-pack on the other two tasks.

[Chart: ContractLaw accuracy vs. cost]

Overall, the accuracy-cost scatter plot follows a logarithmic curve: the highest-performing models see increasingly diminishing returns for their higher costs.
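As a rough illustration of that shape, the sketch below regresses accuracy on the logarithm of cost for a handful of models from the leaderboard above, using output-token price as a crude proxy for total cost. This is not how the chart itself was produced; it simply shows the diminishing-returns fit.

```python
import numpy as np

# (output-token price $/1M, accuracy %) for a few models from the leaderboard
cost = np.array([0.60, 1.50, 5.00, 15.00, 60.00, 75.00])  # 4o Mini ... Opus
acc = np.array([72.4, 66.7, 68.0, 68.7, 69.0, 74.0])

# Fit accuracy ~ a * ln(cost) + b: under this model, each doubling of cost
# buys a fixed (and fairly small) number of accuracy points.
a, b = np.polyfit(np.log(cost), acc, deg=1)
print(f"accuracy ~ {a:.2f} * ln(cost) + {b:.2f}")
```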

There is a cluster of “mid-range” models (Gemini 1.5, Sonnet 3/3.5, and R+) that all performed very similarly, at a very close price point. Gemini is actually the cheapest of the three, both for input tokens and, especially, for output tokens ($5 / M compared to $15 / M).

Interestingly enough, Llama 3.1 is actually more expensive than Claude 3.5 Sonnet (for input tokens), and matches GPT-4o. This coincides with Meta marketing it as the first “premium” open-source model: it is priced as one. This is also in line with the model’s size: it is significantly larger than previous open-source models.

Within the “budget” models, GPT-4o Mini is the clear winner - significantly cheaper and better than any other lightweight or midweight model.

The two o1 models are priced very similarly to the Anthropic models of the previous generation: o1 Preview is priced similarly to Opus (although slightly cheaper), and likewise o1 Mini is priced similarly to 3.5 Sonnet (although also slightly cheaper on output tokens).


Model Output Examples

In the following example, we asked a model to take a contract and suggest a correction in keeping with the provided standard text. With each question, we also provided the model with a few in-context examples of ideal corrections.

The challenge with this task is to adapt the existing contract language in a way that is in keeping with the standard. Simply replacing the text with the standard text does not suffice. Models must understand the nuance of the clauses to form a good correction.
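A minimal sketch of how such a few-shot correction prompt might be assembled is below. The instruction line is taken from the example question shown further down; the example format, field names, and number of in-context examples are assumptions on our part, not the benchmark's exact setup.

```python
def build_correction_prompt(examples, standard_text, contract_excerpt):
    """Assemble a few-shot correction prompt from in-context examples.

    `examples` is a list of dicts with 'standard', 'excerpt', and 'fix'
    keys (hypothetical field names).
    """
    parts = [
        "You are a lawyer reviewing an NDA contract text. Please correct "
        "the contract text to match the criteria/information included in "
        "the standard text. Respond with a provision suggested fix.\n"
    ]
    for ex in examples:
        parts.append(
            f"Standard text: {ex['standard']}\n"
            f"Contract text: {ex['excerpt']}\n"
            f"Provision Suggested Fix: {ex['fix']}\n"
        )
    parts.append(
        f"Standard text: {standard_text}\n"
        f"Contract text: {contract_excerpt}\n"
        "Provision Suggested Fix:"
    )
    return "\n".join(parts)
```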

For the question asked, the answer we were looking for was “This Agreement shall continue for a period of three (3) years from the Effective Date or until such time as a definitive agreement(s) is entered into by the Parties with respect to the Purpose, whichever occurs first.”

In particular, it is important for the correction to describe that the agreement should continue for a period of three years OR until a definitive agreement is reached. Both parts of this logical statement must be included.

In the example, we see that Gemini 1.0 Pro and GPT-4 are able to produce an answer like this, while Opus simply reproduces the standard text. Llama 3.1 405B also produces an accurate suggested fix, though it is more verbose (this is still a pass under our grading system).

Q: You are a lawyer reviewing an NDA contract text. Please correct the contract text to match the criteria/information included in the standard text. Respond with a provision suggested fix.

A (Model Response): Provision Suggested Fix: This Agreement, and all obligations thereof unless otherwise stated in the relevant provisions, shall continue for a period of three (3) years from the Effective Date.

Grade: INCORRECT
