Key Takeaways
- Claude Opus 4.8 leads Legal Research Bench with 43.75% all-pass accuracy, followed by GPT 5.5 at 40.39% and Claude Sonnet 4.6 at 38.46%. The top three sit within about five points of one another and are clearly separated from the rest of the field.
- Under partial-credit scoring, top models exceed 80% weighted pass rate, yet no model clears 44% under strict all-pass grading, where every rubric check must pass. The gap shows models often get most of an answer right but fail on one or two required elements.
- Performance varies considerably by practice area. Health and Administrative / Regulatory score highest on average; Family and Immigration are by far the hardest.
- Top models research efficiently: Claude Opus 4.8 averages 12 turns and 28 tool calls per task, while the weaker performers run far more without matching accuracy. Kimi K2.6, for instance, averages over 90 turns and 100 tool calls.
- Reconciling conflicting authority is the most reliable failure mode: every model scores lower on questions that require synthesizing across jurisdictions, courts, or regimes than on those that don’t, a drop of 6 to 17 points per model and 10 points when pooled.
Background
Legal Research Bench evaluates AI agents on realistic legal research tasks drawn from diverse areas of U.S. law. Each task requires an agent to research a legal question using a set of tools, including case law search, web search, and document retrieval, and then produce a well-supported answer.
The benchmark tests whether models can conduct the kind of multi-source research that junior associates and paralegals routinely perform: finding relevant statutes and case law, applying precedent to a fact pattern, and synthesizing across jurisdictions or regulatory regimes. The questions span U.S. federal and state law and go beyond simple retrieval: each demands multiple steps of legal reasoning rather than a single lookup.
The tasks span eight practice areas: Administrative / Regulatory, Business & Commercial, Civil Litigation, Constitutional / Civil Rights, Criminal, Family, Health, and Immigration. This breadth tests general legal reasoning ability across common research workflows.
All questions, gold-standard answers, and rubrics are authored and peer-reviewed by practicing lawyers, drawing on real research questions that arise in legal practice.
Scoring
Each question carries a rubric, written by legal experts. Experts include only items that a correct answer must contain; each rubric item is a required element, not an optional one. Rubrics weight items by importance and range from 1 to 31 items (mean 9.35, mode 10).
Our primary metric is all-pass: a question counts as correct only if the response satisfies every rubric item, scoring 100% if all checks pass and 0% otherwise. We also report a partial-credit weighted score, the share of rubric points earned. All grading is performed by an LLM judge; see Methodology for the harness and judge validation.
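The two metrics can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's grading code: the `RubricItem` class, the function names, and the example weights are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    weight: float  # importance weight assigned by the rubric author (hypothetical values below)
    passed: bool   # the judge's verdict for this item


def all_pass(items):
    """Return 1.0 only if every rubric item passed, otherwise 0.0."""
    return 1.0 if all(item.passed for item in items) else 0.0


def weighted_score(items):
    """Return the share of rubric points earned: passed weight / total weight."""
    total = sum(item.weight for item in items)
    earned = sum(item.weight for item in items if item.passed)
    return earned / total if total else 0.0


# Hypothetical ten-item rubric in which one low-weight item is missed.
rubric = [RubricItem(weight=2.0, passed=True) for _ in range(9)]
rubric.append(RubricItem(weight=1.0, passed=False))

print(all_pass(rubric))                  # 0.0
print(round(weighted_score(rubric), 3))  # 0.947
```

In this made-up case a single missed item leaves the weighted score near 95% while the all-pass score drops to zero, which is the kind of gap the two metrics are meant to expose.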
Results
The Pareto chart above shows how accuracy trades off against cost and latency across 13 models, with additional metrics such as answer length and sources cited available as alternate axes. Claude Opus 4.8 leads on strict accuracy at 43.75% at a cost of ~$2.82 per task and latency of ~12 minutes. GPT 5.5 sits second at 40.39% but at more than twice the cost and roughly three times the latency. GLM 5.2 is the most cost-efficient of the leaders, reaching 31.25% at under a dollar per task, while Gemini 3.5 Flash offers a strong mid-tier option at 30.77% accuracy with the lowest latency among top performers.
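For readers who want to reproduce this kind of trade-off view on their own runs, the sketch below picks out the models that are not dominated on accuracy and cost. The function name, the placeholder model names, and every number in `points` are illustrative assumptions, not benchmark results.

```python
def pareto_frontier(points):
    """Return the names of points not dominated on (higher accuracy, lower cost).

    A point is dominated if another point is at least as good on both axes
    and strictly better on at least one.
    """
    frontier = []
    for name, acc, cost in points:
        dominated = any(
            (a >= acc and c <= cost) and (a > acc or c < cost)
            for _, a, c in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Placeholder data: (model, all-pass accuracy in %, cost per task in USD).
points = [
    ("model_a", 44.0, 3.00),
    ("model_b", 40.0, 6.50),
    ("model_c", 31.0, 0.90),
    ("model_d", 25.0, 1.50),
]

print(pareto_frontier(points))  # ['model_a', 'model_c']
```

The same function works for latency by passing latency in place of cost.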
Performance by Practice Area
The benchmark spans eight broad areas of U.S. federal and state practice, weighted toward the domains where professional legal research is most often performed. In descending order of coverage: Administrative / Regulatory, Criminal, Business & Commercial, Constitutional / Civil Rights, Health, Civil Litigation, Family, and Immigration. The four largest categories together account for roughly 70% of the benchmark. Because many questions implicate more than one area of law, these categories are not mutually exclusive; the distribution reflects the interconnected nature of legal research, where a single matter frequently draws on multiple bodies of law at once.
Mean all-pass accuracy varies widely across practice areas, from about 43% in Health and 34% in Administrative / Regulatory at the top down to roughly 21% in Immigration and 11% in Family at the bottom. Constitutional / Civil Rights, Business & Commercial, Civil Litigation, and Criminal cluster in the mid-20s.
The chart above shows the best single-model all-pass score in each practice area, with areas ordered by their average score; hover to see the top three models. No one model leads everywhere: Claude Opus 4.8 tops Business & Commercial, Civil Litigation, Criminal, and Constitutional / Civil Rights; GPT 5.5 leads Administrative / Regulatory, Health (66.7%), and Immigration; and Claude Sonnet 4.6 tops Family. In Family, the hardest area, even the strongest model reaches only 27.3% all-pass.
The per-model heatmap below breaks the same data out model-by-model, with practice areas ordered hardest-first (left) to easiest (right). Health and Administrative / Regulatory stay green across nearly every model, while Family is the reddest column, and even Claude Opus 4.8 manages only 18% all-pass there. The accuracy spread within a single area is wide: Health ranges from about 17% to 67% across models, so practice-area difficulty and model capability compound rather than one dominating the other.
Performance by Question Type
Each task carries two kinds of label: base reasoning types that capture the core legal work a question demands, and overlay flags that mark an added source of difficulty layered on top of that work. A question may involve more than one base type and may carry either, both, or neither flag, so each bar below reports the all-pass rate across all questions that carry that label.
There are three base reasoning types:
- Statutory Interpretation: determining the meaning and scope of statutory text and how courts have construed it.
- Regulatory Framework Interpretation: working out how administrative regulations and agency guidance implement a statute and operate together.
- Doctrinal Rule Reasoning: identifying the controlling legal test and applying precedent to a fact pattern.
Pooled across the evaluated models, Regulatory Framework Interpretation is the most tractable at 36.6% all-pass, while Statutory Interpretation (25.4%) and Doctrinal Rule Reasoning (24.8%) are harder, since both demand reasoning beyond locating the governing text.
The two overlay flags mark complications that can attach to any base type:
- Reconciliation: the question requires synthesizing across multiple authorities, such as comparing jurisdictions, weighing agreement against disagreement among courts, resolving conflicting precedent, or untangling interacting legal regimes.
- Temporal Validity: the answer turns on whether a rule is still good law or on how the rule evolved over time.
Reconciliation is the clearest difficulty signal in the benchmark. At 19.9% all-pass it is the hardest category of all, well below the 27.3% overall rate, and the gap holds for every single model: each one scores lower on reconciliation questions than on the rest. Synthesizing conflicting authority, rather than locating a single rule, is where models break down most reliably. Temporal Validity questions, by contrast, score 25.2%, close to the overall rate, so reasoning about whether a rule still holds is not by itself a major obstacle.
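Because labels overlap, per-label rates are computed over every question that carries the label, so one question can count toward several bars. The sketch below shows that bookkeeping, plus the with-flag versus without-flag comparison used for Reconciliation. The record layout (`labels`, `all_pass`) and the four sample questions are assumptions for illustration only.

```python
from collections import defaultdict


def all_pass_rate_by_label(questions):
    """All-pass rate per label; a question counts once for each label it carries."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for q in questions:
        for label in q["labels"]:
            totals[label] += 1
            passes[label] += int(q["all_pass"])
    return {label: passes[label] / totals[label] for label in totals}


def flag_gap(questions, flag):
    """All-pass rate without the flag minus the rate with it.

    Assumes at least one question on each side of the split.
    """
    with_flag = [q["all_pass"] for q in questions if flag in q["labels"]]
    without = [q["all_pass"] for q in questions if flag not in q["labels"]]
    return sum(without) / len(without) - sum(with_flag) / len(with_flag)


# Hypothetical graded questions for one model.
questions = [
    {"labels": ["Statutory Interpretation", "Reconciliation"], "all_pass": False},
    {"labels": ["Statutory Interpretation"], "all_pass": True},
    {"labels": ["Doctrinal Rule Reasoning", "Temporal Validity"], "all_pass": True},
    {"labels": ["Regulatory Framework Interpretation", "Reconciliation"], "all_pass": False},
]

rates = all_pass_rate_by_label(questions)
print(rates["Statutory Interpretation"])     # 0.5
print(flag_gap(questions, "Reconciliation"))  # 1.0
```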
The heatmap below breaks the same categories out model by model, with overlay flags marked in orange.
Weighted vs All-Pass Scoring
Under partial-credit scoring, top models exceed an 80% weighted pass rate, yet strict all-pass accuracy remains below 45% for every model. The gap shows that models often satisfy most rubric checks but miss one or two required elements.
We report all-pass as our primary metric because, in legal work, a partially correct answer can be more dangerous than a wrong one: it may read as sound while omitting a critical point. A weighted score above 80% can mask that an answer misses a required element.
Sources Cited
Most models cite more sources on average than the gold-standard answers do, but source quality and relevance, not count, drive the source-check score in grading.
Answer Length
Every model writes far longer answers than the gold-standard references, which average 536 words. Claude Sonnet 4.6 is the most verbose at roughly 2,300 words (over four times the reference length), while GPT 5.4 Mini is the most concise. Length reflects verbosity rather than accuracy: the longest responses are not the highest-scoring, and grading rewards correct, well-supported reasoning over volume.
Tool Call Analysis
Agents iterate with legal research tools (case-law search, web search, document retrieval, and HTML parsing), storing fetched content in a session database until they submit a final answer. Each task has a three-hour time limit.
web_search is the most-used tool across models, averaging 26 calls per session versus 11 for courtlistener_search. Top performers allocate effort more selectively: Claude Opus 4.8 balances case-law and web search in roughly equal proportion, while weaker models often run many more total calls, especially web_search, without converting that into higher accuracy. retrieve_information and parse_html_page support extraction from fetched documents once sources are identified.
Model Output Examples
Below are the agents’ full answers to a single representative Business & Commercial question, one of the benchmark’s public sample tasks: a Virginia equipment-leasing dispute that turns on UCC Article 2A finance-lease rules. Switch tabs to compare how each model framed its analysis; models are ordered by overall benchmark accuracy.
Question
Client is a Virginia corporation ("Client") that hauls refrigerated freight out of Rockbridge County. In early 2024, Client needed two new reefer trailers and went to a national manufacturer ("Manufacturer") to pick out a model and negotiate the refrigeration specs. Client then arranged for a Virginia equipment financing company ("Lessor") to buy the trailers ("Trailers") from Manufacturer for $85,000 each and lease them to Client under a written 60-month lease ("Lease") at $2,200 per month per trailer, and Lessor gave Client a complete copy of its purchase agreement with Manufacturer, which included Manufacturer's standard two-year warranty on all mechanical and refrigeration components, before Client signed.
Four months after delivery, both Trailers had compressor failures that left the units unable to hold temperature, and Manufacturer agreed to fix them under the warranty but filed for Chapter 7 two months later, where the trustee confirmed it wouldn't honor outstanding warranty claims. That said, the Lease doesn't include any warranties from Lessor and tells Client to take equipment claims to Manufacturer. It also says that Client's payments are "absolute and unconditional, irrespective of any defense or any right of setoff, counterclaim, or recoupment." Client has stopped paying Lessor. Given these facts, can Client stop its payments to Lessor, and is Lessor responsible for the defective Trailers?
## Short Answer
**No on both counts.** Under Virginia's enactment of UCC Article 2A (Va. Code Title 8.2A):
1. **Client cannot lawfully stop paying Lessor.** The transaction is a **"finance lease"** that is **not a consumer lease**, so by statute Client's payment promises became **"irrevocable and independent" upon its acceptance of the Trailers** (Va. Code § 8.2A‑407). The Lease's "absolute and unconditional… no setoff, counterclaim, or recoupment" language is a classic enforceable **"hell or high water"** clause. The compressor failures and the collapse of the warranty are **not defenses** to the rent. If Client withholds payment, it is in default.
2. **Lessor is not responsible for the defective Trailers.** A finance lessor functions as a financier, not a warrantor. Virginia law **expressly excludes finance leases from the implied warranties** of merchantability (§ 8.2A‑212) and fitness (§ 8.2A‑213), the Lessor made **no express warranty** (§ 8.2A‑210; the Lease contains none), and the Lease channels equipment claims to Manufacturer. The warranties run from **Manufacturer**, and § 8.2A‑209 passes the benefit of Manufacturer's two‑year warranty through to Client to enforce **against Manufacturer** — but the risk that Manufacturer fails or goes bankrupt falls on **Client (the finance lessee), not Lessor**.
---
## Analysis
### 1. The Lease is a "finance lease" under Va. Code § 8.2A‑103(1)(g)
A "finance lease" requires three things, all satisfied here:
- **(i) The lessor did not "select, manufacture, or supply the goods."** Client picked the model and negotiated the refrigeration specs directly with Manufacturer; Lessor merely bought the units. ✔
- **(ii) "The lessor acquires the goods… in connection with the lease."** Lessor purchased the Trailers from Manufacturer for $85,000 each specifically to lease them to Client. ✔
- **(iii) One of the four notice conditions occurs** — here condition **(A)**: "The lessee receives a copy of the contract by which the lessor acquired the goods… before signing the lease contract." The facts state Lessor gave Client **a complete copy of its purchase agreement with Manufacturer (including the two‑year warranty) before Client signed.** ✔
This is also **not a "consumer lease,"** which is limited to a lessee "who is an individual" leasing "primarily for a personal, family, or household purpose" (§ 8.2A‑103(1)(e)). Client is a Virginia corporation leasing reefer trailers for its freight business. The special non‑consumer finance‑lease rules therefore apply.
### 2. Client cannot stop paying — the statutory and contractual "hell or high water" rule
**Statutory rule — § 8.2A‑407.** "In the case of a finance lease that is not a consumer lease the lessee's promises under the lease contract become **irrevocable and independent upon the lessee's acceptance of the goods**" (subsection (1)). Such a promise is "**effective and enforceable**" and "**not subject to cancellation, termination, modification, repudiation, excuse, or substitution** without the consent of the party to whom the promise runs" (subsection (2)).
Client accepted the Trailers — it took delivery and operated them for roughly four months before the compressor failures (acceptance under §§ 8.2A‑515/‑516). Once accepted, Client's promise to pay rent is, by operation of statute, **independent of the condition of the equipment**. The later compressor failures, the unit's inability to hold temperature, and Manufacturer's failure to honor the warranty are **not grounds to stop payment, reduce payment, or assert setoff/recoupment against Lessor.**
**Contractual clause — preserved by § 8.2A‑407(3).** The Lease's statement that payments are "**absolute and unconditional, irrespective of any defense or any right of setoff, counterclaim, or recoupment**" is a textbook **"hell or high water" clause.** Section 8.2A‑407(3) confirms the section "does not affect the validity under any other law of a covenant in any lease contract making the lessee's promises irrevocable and independent upon the lessee's acceptance of the goods." So the clause is independently enforceable on its own terms.
**Rationale.** The Official Comment to UCC § 2A‑407 explains that this irrevocable‑promise rule is the statutory "hell or high water clause," and that it is the **"price" the finance lessee pays for its third‑party‑beneficiary status to the supplier's warranties** under § 2A‑209: the lessee continues paying the financier and looks to the supplier for defects. Courts applying these uniform provisions agree. See *C & J Vantage Leasing Co. v. Wolfe*, 795 N.W.2d 65 (Iowa 2011) (a hell‑or‑high‑water clause makes the rent obligation "**independent of the state of the goods and irrevocable, so that the lessee looks to the manufacturer or supplier of goods for warranties and remedies for defects in the goods, not to the lessor**"; for a finance lease "an express hell‑or‑high‑water clause is unnecessary because such a provision automatically attaches… by statute," and an express clause is "fully enforceable").
**Consequence of stopping payment.** Ceasing payments puts Client **in default**, exposing it to acceleration, repossession, and a deficiency claim plus the Lessor's other remedies under the Lease and §§ 8.2A‑523 *et seq.* Withholding rent is not a permissible self‑help remedy here.
### 3. Lessor is not responsible for the defective Trailers
In a finance lease the lessor is essentially a secured financier and does **not** stand behind the equipment:
- **No implied warranty of merchantability.** § 8.2A‑212(1) applies "**Except in a finance lease**." So Lessor makes none.
- **No implied warranty of fitness for a particular purpose.** § 8.2A‑213 likewise applies "**Except in a finance lease**."
- **No express warranty from Lessor.** Express warranties arise only from the lessor's own affirmations, descriptions, samples, or models (§ 8.2A‑210). The Lease "doesn't include any warranties from Lessor" and tells Client to take equipment claims to Manufacturer — so none exists.
**Where Client's warranty rights do lie — against Manufacturer (the supplier).** Under § 8.2A‑209(1), "the benefit of a supplier's promises… and of all warranties, whether express or implied, under the supply contract, extends to the lessee to the extent of the lessee's leasehold interest under a finance lease… but is subject to the terms of the warranty and of the supply contract and all defenses or claims arising therefrom." Manufacturer's two‑year warranty on the mechanical and refrigeration components therefore runs through to Client, and Client retains any independent rights it has against Manufacturer (§ 8.2A‑209(4)). But § 8.2A‑209 does **not** make Lessor a warrantor; it channels the claim to the supplier.
**Who bears Manufacturer's insolvency?** Client does, not Lessor. With Manufacturer in Chapter 7 and the trustee disclaiming outstanding warranty claims, Client's practical recourse is to **file a proof of claim in Manufacturer's bankruptcy** (most likely a general unsecured claim, often worth only cents on the dollar) and to pursue any other responsible parties (e.g., the compressor/component maker or any separate dealer obligations). The allocation of supplier‑default risk to the finance lessee — while it keeps paying the lessor — is the deliberate (if harsh) design of the finance‑lease structure.
### 4. No applicable escape hatch on these facts
The hell‑or‑high‑water rule does **not** bar defenses going to the **formation or validity of the lease itself** — e.g., fraud in the inducement or unconscionability (*C & J Vantage*, recognizing such carve‑outs; see also Va. Code § 8.2A‑108 on unconscionability). But nothing in the facts suggests any wrongdoing by Lessor: Lessor delivered/paid for conforming trailers and **affirmatively gave Client the complete supply contract disclosing Manufacturer's warranty terms before signing.** This is an arm's‑length commercial deal between businesses, so unconscionability is not realistically available. And even if a court were to recharacterize the deal as a disguised secured sale rather than a true/finance lease, the **express** "absolute and unconditional / no‑setoff" clause would still be independently enforceable (*C & J Vantage*) — so Client's payment obligation stands either way.
### Practical bottom line
Client should **keep paying Lessor** (or negotiate a workout) and direct its defect remedies to **Manufacturer's bankruptcy estate** and any other responsible third parties. The defective Trailers and the failed warranty are claims against Manufacturer, not defenses against Lessor, and Lessor bears no responsibility for the defects.
---
```json
{
"sources": [
{
"name": "Code of Virginia, Title 8.2A (Commercial Code \u2014 Leases), \u00a7 8.2A-103 (General definitions; \"Finance lease\" (1)(g); \"Consumer lease\" (1)(e); \"Supplier\"; \"Supply contract\")",
"federal_or_state": "State",
"primary_or_secondary": "Primary",
"url": "https://law.lis.virginia.gov/vacode/8.2A-103",
"status": "Enacted",
"citation": "Va. Code Ann. \u00a7 8.2A-103"
},
{
"name": "Code of Virginia, Title 8.2A (Commercial Code \u2014 Leases), \u00a7 8.2A-407 (Irrevocable promises: finance leases)",
"federal_or_state": "State",
"primary_or_secondary": "Primary",
"url": "https://law.lis.virginia.gov/vacode/8.2A-407",
"status": "Enacted",
"citation": "Va. Code Ann. \u00a7 8.2A-407"
},
{
"name": "Code of Virginia, Title 8.2A (Commercial Code \u2014 Leases), \u00a7 8.2A-209 (Lessee under finance lease as beneficiary of supply contract)",
"federal_or_state": "State",
"primary_or_secondary": "Primary",
"url": "https://law.lis.virginia.gov/vacode/8.2A-209",
"status": "Enacted",
"citation": "Va. Code Ann. \u00a7 8.2A-209"
},
{
"name": "Code of Virginia, Title 8.2A (Commercial Code \u2014 Leases), \u00a7 8.2A-212 (Implied warranty of merchantability; \"Except in a finance lease\")",
"federal_or_state": "State",
"primary_or_secondary": "Primary",
"url": "https://law.lis.virginia.gov/vacode/8.2A-212",
"status": "Enacted",
"citation": "Va. Code Ann. \u00a7 8.2A-212"
},
{
"name": "Code of Virginia, Title 8.2A (Commercial Code \u2014 Leases), \u00a7 8.2A-213 (Implied warranty of fitness for particular purpose; \"Except in a finance lease\")",
"federal_or_state": "State",
"primary_or_secondary": "Primary",
"url": "https://law.lis.virginia.gov/vacode/8.2A-213",
"status": "Enacted",
"citation": "Va. Code Ann. \u00a7 8.2A-213"
},
{
"name": "Code of Virginia, Title 8.2A (Commercial Code \u2014 Leases), \u00a7 8.2A-210 (Express warranties)",
"federal_or_state": "State",
"primary_or_secondary": "Primary",
"url": "https://law.lis.virginia.gov/vacode/8.2A-210",
"status": "Enacted",
"citation": "Va. Code Ann. \u00a7 8.2A-210"
},
{
"name": "C & J Vantage Leasing Co. v. Wolfe",
"federal_or_state": "State",
"primary_or_secondary": "Primary",
"url": "https://www.courtlistener.com/opinion/4472454/c-j-vantage-leasing-co-assignor-to-frontier-leasing-corp-assignee-v/",
"status": "N/A",
"citation": "C & J Vantage Leasing Co. v. Wolfe, 795 N.W.2d 65 (Iowa 2011)"
},
{
"name": "Uniform Commercial Code, Official Comment to \u00a7 2A-407 (statutory \"hell or high water\" clause; the \"price\" of the finance lessee's third-party-beneficiary status under \u00a7 2A-209) (as reproduced with the codified comment to Kan. Stat. Ann. \u00a7 84-2a-407)",
"federal_or_state": "State",
"primary_or_secondary": "Secondary",
"url": "https://www.ksrevisor.gov/statutes/chapters/ch84/084_002a_0407.html",
"status": "N/A",
"citation": "U.C.C. \u00a7 2A-407 cmt. 1"
}
]
}
```
Trajectory Analysis
The chart below traces how each model researched that same question. The top three models by accuracy are shown by default; use the selector to add or remove others. Every tool call is bucketed into a category and laid out in the order it was made, so each bar shows both how long the run was and how the model spent its effort.
The models reach an answer in very different ways. Claude Opus 4.8 works through 38 tool calls, opening with web search to frame the issue, then leaning on document reading and extraction to parse the Virginia statutes and pull specific passages, with a few targeted case-law searches before answer submission. GPT 5.5 takes a more case-law-driven path across 49 calls. Kimi K2.6 runs by far the longest at 174 calls, the vast majority of them web searches, without converting that volume into stronger accuracy. At the other extreme, Gemini 3.1 Pro Preview (02/26) is the most economical at 19 calls, followed by GLM 5.2 at 37. Every model closes with one submission step.
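The bucketing described above can be approximated as follows. The tool names come from the Methodology section, but the category labels, the mapping itself, and the sample trajectory are assumptions; the chart's actual categories may differ.

```python
from collections import Counter

# Hypothetical mapping from harness tool names to chart categories.
CATEGORY_BY_TOOL = {
    "courtlistener_search": "case-law search",
    "web_search": "web search",
    "parse_html_page": "document reading",
    "retrieve_information": "extraction",
    "submit_final_result": "submission",
}


def bucket_trajectory(tool_calls):
    """Map each tool call to its category, preserving call order."""
    return [CATEGORY_BY_TOOL.get(name, "other") for name in tool_calls]


# Made-up six-call run, in the order the calls were issued.
trajectory = [
    "web_search",
    "web_search",
    "parse_html_page",
    "retrieve_information",
    "courtlistener_search",
    "submit_final_result",
]

buckets = bucket_trajectory(trajectory)
counts = Counter(buckets)

print(len(buckets))          # 6 (run length)
print(counts["web search"])  # 2 (effort spent on web search)
```

Keeping the ordered list alongside the counts preserves both things a bar is meant to show: how long the run was and where the effort went.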
Methodology
Agents are evaluated on a shared harness with access to five tools. Each task has a three-hour time limit; the agent may iterate with tools until it submits a final answer or the limit is reached.
- courtlistener_search: searches the CourtListener database for case law, statutes, and legal documents
- web_search: general web search for secondary sources, regulatory text, and commentary
- retrieve_information: queries over previously fetched documents to extract specific passages
- parse_html_page: downloads and parses an HTML page from a URL
- submit_final_result: submits the agent’s final answer for evaluation
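A rough picture of the iterate-until-submit loop is sketched below. It is not the released harness: `run_task`, the `agent_step` callback, the stub tools, and the returned dictionary are hypothetical, and how the real harness handles a timeout is not described here.

```python
import time

TIME_LIMIT_SECONDS = 3 * 60 * 60  # three-hour limit per task


def run_task(agent_step, tools, time_limit=TIME_LIMIT_SECONDS, clock=time.monotonic):
    """Call tools until the agent submits a final answer or time runs out."""
    start = clock()
    history = []  # (tool_name, argument, result) for every completed call
    while clock() - start < time_limit:
        tool_name, argument = agent_step(history)
        if tool_name == "submit_final_result":
            return {"answer": argument, "tool_calls": len(history) + 1, "timed_out": False}
        result = tools[tool_name](argument)
        history.append((tool_name, argument, result))
    return {"answer": None, "tool_calls": len(history), "timed_out": True}


# Stub tools and a scripted agent, standing in for a real model.
tools = {
    "web_search": lambda query: f"results for {query}",
    "parse_html_page": lambda url: f"text of {url}",
}
script = iter([
    ("web_search", "UCC 2A-407 finance lease"),
    ("parse_html_page", "https://law.lis.virginia.gov/vacode/8.2A-407"),
    ("submit_final_result", "final answer text"),
])


def scripted_agent(history):
    """Return the next scripted action, ignoring the history."""
    return next(script)


outcome = run_task(scripted_agent, tools)
print(outcome["tool_calls"], outcome["timed_out"])  # 3 False
```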
Grading
All responses are graded by GPT 5.4 as a judge against each task’s rubric (see Scoring for the all-pass metric and rubric structure). We selected this judge because its grades align more closely with a majority vote of three human expert reviewers than any single human reviewer does with that majority: GPT 5.4 agrees with the majority vote 87.4% of the time, compared to an average human baseline agreement of 83.4%.
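One way to compute this kind of agreement figure is sketched below. It assumes binary pass/fail verdicts per rubric item and counts each reviewer's own vote toward the majority they are compared against; the validation study may have used a different protocol, and all votes shown are invented.

```python
from collections import Counter


def majority(votes):
    """Most common verdict among an odd number of reviewers."""
    return Counter(votes).most_common(1)[0][0]


def agreement_with_majority(grader, human_votes):
    """Fraction of items where the grader's verdict matches the human majority."""
    matches = sum(g == majority(v) for g, v in zip(grader, human_votes))
    return matches / len(grader)


# Invented verdicts: one tuple of three reviewer votes per rubric item.
human_votes = [
    (True, True, False),
    (False, False, False),
    (True, False, False),
    (True, True, True),
]
judge = [True, False, True, True]

judge_rate = agreement_with_majority(judge, human_votes)

# Each reviewer's agreement with the majority, averaged across reviewers.
per_reviewer = [
    agreement_with_majority([votes[i] for votes in human_votes], human_votes)
    for i in range(3)
]
human_baseline = sum(per_reviewer) / len(per_reviewer)

print(judge_rate)                # 0.75
print(round(human_baseline, 3))  # 0.833
```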
Dataset
The dataset comprises 413 expert-authored, peer-reviewed questions, each paired with a gold-standard answer, authoritative legal sources, and a detailed grading rubric. It is divided into three parts: Public (5 open-source samples), Private Validation (200 samples available for license), and Test (208 samples).
- The Public set and agent harness are fully open and can be accessed here.
- The Private Validation set is available for license. Interested parties are encouraged to contact us directly for access.
- The Test set will remain private. All results reported on this page are based solely on the Test set to prevent overfitting.
The dataset splits were sampled to preserve the distribution of question types, practice areas, and difficulty.
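A distribution-preserving split is commonly done with stratified sampling. The sketch below stratifies on a single key, which is a simplification: the actual procedure, and how it handles questions that carry several labels, is not specified here. The function name, the `area` field, and the toy data are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(items, key, fractions, seed=0):
    """Split items into len(fractions) parts, one stratum at a time."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    parts = [[] for _ in fractions]
    for group in strata.values():
        rng.shuffle(group)
        start = 0
        for i, frac in enumerate(fractions):
            # The last part takes the remainder so no item is dropped.
            if i == len(fractions) - 1:
                end = len(group)
            else:
                end = start + round(frac * len(group))
            parts[i].extend(group[start:end])
            start = end
    return parts


# Toy dataset: 10 Health questions and 30 Criminal questions.
items = [{"id": i, "area": "Health" if i % 4 == 0 else "Criminal"} for i in range(40)]
validation, test = stratified_split(items, key=lambda q: q["area"], fractions=[0.5, 0.5])

print(len(validation), len(test))                       # 20 20
print(sum(q["area"] == "Health" for q in validation))  # 5
```

Each part keeps the same 1:3 ratio of Health to Criminal questions as the full toy set.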
Citation
If you use this benchmark in your research, please cite:
Citation (BibTeX)
@misc{valsai2026legalresearch,
title = {Legal Research Bench: Evaluating Agents on US Legal Research Tasks},
author = {Vals AI},
year = {2026},
month = jun,
howpublished = {Vals AI},
url = {https://github.com/vals-ai/legal-research-bench},
}