Harvey's Legal Agent Benchmark

Key Takeaways

Claude Fable 5 is the current top model by Harvey final score at 11.25%, followed by Claude Opus 4.8 at 9.58%.
MiniMax-M3 is the strongest open-weight model in this run at 4.17%, while DeepSeek V4 matches GPT 5.5 on final score at 3.75%.
Top models satisfy most individual criteria, with criteria pass rates around 85%.

Benchmark

Harvey recently released the Legal Agent Benchmark to test how well models support legal work in an agentic setting. It currently has two datasets: a public set and a held-out test set. The results here are from the held-out set; Harvey has already released initial results.

Each task asks an agent to produce legal work against a set of task-specific criteria. The agent is provided with six tools: Read File, Edit File, Write File, Glob, Bash, and Grep. It also has three skills: docx, pptx, and xlsx.

Reported results use the same methodology as Harvey’s initial leaderboard. Criteria pass rate is included to show how often models satisfy individual requirements, even when they do not fully resolve the task.

Results

Among models with a nonzero final score, Grok 4.3, DeepSeek V4, MiniMax-M3, Claude Sonnet 4.6, Claude Opus 4.8, and Claude Fable 5 form the cost/performance frontier.
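To make the frontier concrete: a model sits on the cost/performance frontier when no other model is at least as cheap and at least as accurate while being strictly better on one of the two. The sketch below assumes each model is summarized by a single (cost, score) pair; the model names and cost figures are hypothetical placeholders, not values from this run.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    cost: float   # hypothetical cost per task, arbitrary units
    score: float  # final score in percent


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Return the points not dominated by any other point, sorted by cost."""
    frontier = []
    for p in points:
        dominated = any(
            q.cost <= p.cost
            and q.score >= p.score
            and (q.cost < p.cost or q.score > p.score)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda p: p.cost)


# Placeholder data for illustration only.
example = [
    Point("model-a", cost=1.0, score=0.5),
    Point("model-b", cost=2.0, score=4.0),
    Point("model-c", cost=3.0, score=3.0),  # dominated by model-b
    Point("model-d", cost=8.0, score=11.0),
]
frontier_names = [p.name for p in pareto_frontier(example)]
print(frontier_names)
```

In this toy data, model-c drops out because model-b is both cheaper and higher scoring.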

Claude Fable 5 leads on the overall Harvey final score at 11.25%, with Claude Opus 4.8 second at 9.58% and Claude Sonnet 4.6 third at 5.0%. Fable 5 fell back to Claude Opus 4.8 on 4 tasks; counting those as failures gives a no-fallback score of 10.42%. The criteria pass rates are much higher: Fable 5 reaches 90.48%, Opus 4.8 reaches 87.86%, and Sonnet 4.6 reaches 86.66%.
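The no-fallback adjustment can be sketched as follows for a single judge. The records are made up for illustration and are not the run's data, and the field names `passed` and `used_fallback` are assumptions rather than the harness's actual schema.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    passed: bool         # task resolved, i.e. every criterion passed
    used_fallback: bool  # a fallback model produced the submission


def pass_rate(results: list[TaskResult], count_fallback_as_failure: bool) -> float:
    """Percent of tasks resolved, optionally failing every fallback task."""
    resolved = sum(
        r.passed and not (count_fallback_as_failure and r.used_fallback)
        for r in results
    )
    return 100.0 * resolved / len(results)


# Made-up example: 10 tasks, 3 resolved, one of them only via the fallback.
results = (
    [TaskResult(passed=True, used_fallback=False)] * 2
    + [TaskResult(passed=True, used_fallback=True)]
    + [TaskResult(passed=False, used_fallback=True)]
    + [TaskResult(passed=False, used_fallback=False)] * 6
)
with_fallback = pass_rate(results, count_fallback_as_failure=False)
no_fallback = pass_rate(results, count_fallback_as_failure=True)
print(with_fallback, no_fallback)
```

Only fallback tasks that were otherwise resolved change the score, so the gap between the two numbers can be smaller than the number of fallback tasks suggests.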

Criteria Pass Rate by Task Type

Percent of criteria passed per task type

Task type	Claude Fable 5 (90.48% avg)	Claude Opus 4.8 (87.86% avg)	Claude Sonnet 4.6 (86.66% avg)	MiniMax-M3 (86.28% avg)	DeepSeek V4 (84.40% avg)
Intellectual Property	92.6%	90.5%	90.1%	90.5%	88.3%
Corporate M&A	92.0%	92.9%	91.0%	90.1%	88.5%
Data Privacy/Cybersecurity	94.3%	94.3%	89.0%	92.2%	85.9%
Energy/Natural Resources	96.0%	97.0%	95.0%	93.4%	93.4%
Real Estate	93.3%	91.0%	90.4%	88.3%	89.2%
Banking Finance	92.6%	87.7%	91.4%	86.6%	86.2%
Corporate Governance	92.1%	89.0%	87.0%	88.0%	86.0%
Capital Markets	91.4%	92.2%	86.7%	89.0%	86.3%
Trusts & Estates/Private Client	92.0%	87.8%	85.1%	85.5%	88.2%
International Trade Sanctions	92.5%	85.7%	81.7%	86.5%	82.9%

The leaderboard can be filtered by task type and switched between task pass rate and criteria pass rate. Models perform best on task resolution in Energy/Natural Resources and Healthcare/Life Sciences, while criteria pass rates averaged across all models are highest in Intellectual Property, Corporate M&A, and Data Privacy/Cybersecurity.

Criteria Pass Rate vs. Task Resolution

Model	Criteria pass rate	Task resolution
Claude Fable 5	90.5%	11.3%
Claude Opus 4.8	87.9%	9.6%
Claude Sonnet 4.6	86.7%	5.0%
MiniMax-M3	86.3%	4.2%
DeepSeek V4	84.4%	3.8%
Qwen 3.7 Max	83.5%	1.7%
GPT 5.5	80.0%	3.8%
Gemini 3.5 Flash	79.3%	2.5%
GPT 5.4 (xhigh)	74.5%	0.0%
Kimi K2.6	74.1%	1.7%
GLM 5.1	71.7%	0.0%
Grok 4.3	64.8%	0.4%
Qwen 3.7 Plus	62.7%	0.0%

Harvey grades a task as resolved only if every criterion passes. A model can satisfy most individual criteria and still miss task resolution credit.

There is a clear trend: strong models and agents satisfy most criteria, around 85% for top models. The remaining gaps are large enough that task resolution stays low even when criterion-level performance looks strong.
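A back-of-the-envelope calculation shows how quickly all-or-nothing grading compounds. Assume, purely for illustration, that a task has n criteria and that each passes independently with probability p. Neither assumption comes from the benchmark, and real criteria are likely correlated. Under that model, the chance of resolving the task is p to the power n.

```python
def all_pass_probability(p: float, n: int) -> float:
    """Probability that n independent criteria all pass, each with probability p."""
    return p ** n


# Hypothetical criteria counts; the benchmark's per-task counts are not assumed here.
for n in (5, 10, 20, 40):
    print(n, round(all_pass_probability(0.9, n), 3))
```

With a 90% per-criterion pass rate, this toy model gives roughly 59% at 5 criteria, 35% at 10, 12% at 20, and 1.5% at 40.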

Tool Calling Statistics

Average tool count per item, across the six tools in the benchmark harness.

Models heavily prefer Bash and Read File. Write File appears regularly for some models, while Edit File, Glob, and Grep are lower-volume.
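As a rough sketch of how such an average could be computed, assume each item's trace is available as a list of tool names in call order. That trace format is an assumption for illustration, not the harness's actual log schema, and the two traces are made up.

```python
from collections import Counter

# The six tools exposed by the benchmark harness.
TOOLS = ["Read File", "Edit File", "Write File", "Glob", "Bash", "Grep"]


def average_tool_counts(traces: list[list[str]]) -> dict[str, float]:
    """Average number of calls per item for each tool."""
    totals = Counter(call for trace in traces for call in trace)
    return {tool: totals[tool] / len(traces) for tool in TOOLS}


# Made-up traces for two items.
traces = [
    ["Bash", "Read File", "Bash", "Write File"],
    ["Read File", "Bash"],
]
averages = average_tool_counts(traces)
print(averages)
```

Tools that never appear in a trace still get an entry with an average of 0.0, which keeps the per-model rows comparable.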

The available skills rely on shell commands, both in their instructions and to run their scripts. The tools also often encourage models to read files through the harness.

The skills do not directly emphasize Edit File, Write File, Glob, or Grep. Those tools still appear in traces, but less consistently than Bash and Read File. Grep is not well suited to .docx, .pptx, and .xlsx files, which are compressed ZIP containers rather than plain text. Likewise, Edit File is useful for text-based files such as .md files, not for these file types.
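The point about Grep can be reproduced with a small stand-in. A .docx file is a ZIP archive of XML parts, so the text is usually stored compressed. The sketch below builds a minimal archive in memory (a real .docx has more parts) and compares a byte-level search over the archive with a search after decompressing the XML part.

```python
import io
import zipfile

# Minimal stand-in for a .docx: one deflate-compressed XML part in a ZIP archive.
xml = (
    "<w:p><w:t>The indemnification clause survives termination.</w:t></w:p>" * 50
).encode("utf-8")
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("word/document.xml", xml)
raw = buf.getvalue()

needle = b"indemnification"

# Byte-level search over the archive, which is roughly what a plain grep sees.
found_in_raw = needle in raw

# Decompress the XML part first, then search its contents.
with zipfile.ZipFile(io.BytesIO(raw)) as zf:
    found_in_xml = needle in zf.read("word/document.xml")

print(found_in_raw, found_in_xml)
```

This is consistent with the traces: reaching the text requires an unpacking step, which is naturally done from a shell command or a script.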

Skill Invocation Statistics

Average skill invocations per item, across the three skills available to the agent.

Skill usage is dominated by docx, followed by xlsx. pptx is used less often, but it is not absent.

Methodology

We use Harvey’s generation and grading protocol in the same environment, with internet access disabled.

Harvey grades each submission with two LLM judges. Each judge computes a task pass rate. A task passes only if 100% of its criteria pass, and Harvey’s final score is the average of the two judge task pass rates.
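The scoring rule above can be sketched in a few lines. This is an illustration, not Harvey's grading code: the dictionary layout (task id mapped to a list of per-criterion verdicts) and the verdicts themselves are made up, and pooling criteria across tasks for the criteria pass rate is an assumption about how that metric is aggregated.

```python
Verdicts = dict[str, list[bool]]  # task id -> pass/fail verdict per criterion


def task_pass_rate(verdicts: Verdicts) -> float:
    """Percent of tasks where every criterion passed, for one judge."""
    resolved = sum(all(criteria) for criteria in verdicts.values())
    return 100.0 * resolved / len(verdicts)


def criteria_pass_rate(verdicts: Verdicts) -> float:
    """Percent of individual criteria passed, pooled across tasks (assumed)."""
    passed = sum(sum(criteria) for criteria in verdicts.values())
    total = sum(len(criteria) for criteria in verdicts.values())
    return 100.0 * passed / total


def final_score(judge_a: Verdicts, judge_b: Verdicts) -> float:
    """Average of the two judges' task pass rates."""
    return (task_pass_rate(judge_a) + task_pass_rate(judge_b)) / 2


# Made-up verdicts for four tasks from two judges.
judge_a = {"t1": [True, True, True], "t2": [True, False, True],
           "t3": [True, True], "t4": [False, True]}
judge_b = {"t1": [True, True, False], "t2": [True, True, True],
           "t3": [True, False], "t4": [False, True]}

score = final_score(judge_a, judge_b)
criteria_a = criteria_pass_rate(judge_a)
print(score, criteria_a)
```

In this toy data the first judge passes 8 of 10 criteria yet resolves only 2 of 4 tasks, which mirrors the gap between criteria pass rate and final score in the results.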

The two judges were GPT 5.5 and Claude Sonnet 4.6. GPT 5.5 used medium reasoning and Claude Sonnet 4.6 was not modified.

While running the benchmark, we found that redline criteria need DOCX tracked changes to be preserved when submitted files are read, so that judges can see inserted and deleted text. We fixed that bug and merged the fix upstream in harveyai/harvey-labs#76. The scoring rubric is unchanged.

The benchmark was modified to use our model library, an abstraction over various LLM provider APIs, and to run on Valkyrie, our framework for running agentic benchmarks. These are infrastructure changes and do not impact model performance.

To improve judge performance and reduce cost, we split the instruction prompt given to each judge so that common elements could be cached. The prompt content itself is unchanged.
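One way to picture the split, as a hedged sketch: put the text shared by every judge call first and mark it cacheable, then append the part that varies. The `Segment` type, the `cacheable` flag, and the example strings are hypothetical and do not correspond to any provider's API or to the actual judge prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    text: str
    cacheable: bool  # hypothetical marker; real providers expose caching differently


def split_prompt(common: str, specific: str) -> list[Segment]:
    """Place the shared instructions first so a prefix cache can reuse them."""
    return [
        Segment(text=common, cacheable=True),     # identical across judge calls
        Segment(text=specific, cacheable=False),  # varies per submission or criterion
    ]


def join_prompt(segments: list[Segment]) -> str:
    """Reassemble the full prompt text the judge sees."""
    return "".join(segment.text for segment in segments)


# Made-up prompt text for illustration.
common = "You are grading a legal work product against a rubric.\n"
specific = "Criterion: the memo identifies the governing law.\n"
segments = split_prompt(common, specific)
unchanged = join_prompt(segments) == common + specific
print(unchanged)
```

The round-trip check captures the property that matters here: splitting changes where the cache boundary falls, not what the judge reads.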