Anthropic's most intelligent model.

Release Date: 10/22/2024

Avg. Accuracy: 74.3%

Latency: 2.00s

Performance by Benchmark

| Benchmark   | Accuracy | Ranking |
|-------------|----------|---------|
| LegalBench  | 78.8%    | 6 / 24  |
| CorpFin     | 71.8%    | 3 / 26  |
| CaseLaw     | 84.9%    | 3 / 18  |
| ContractLaw | 68.7%    | 10 / 25 |
| TaxEval     | 67.1%    | 4 / 25  |

Cost Analysis

Input Cost: $3.00 / M Tokens

Output Cost: $15.00 / M Tokens

Cost Per Test: $0.23 / 100 tests

Overview

Claude 3.5 Sonnet is Anthropic’s latest mid-tier model, positioned between the more powerful Opus and the previous 3.0 versions. It offers a strong balance of performance and cost-effectiveness.

Key Specifications

  • Context Window: 200,000 tokens
  • Output Limit: 8,192 tokens
  • Training Cutoff: April 2024
  • Pricing:
    • Input: $3.00 per million tokens
    • Output: $15.00 per million tokens
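Given these rates, per-request cost is simple arithmetic: tokens divided by one million, times the rate for each direction. A minimal sketch (the function name and sample token counts are illustrative, not part of any API):

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_rate: float = 3.00, output_rate: float = 15.00) -> float:
    """Estimate request cost in USD from token counts.

    Rates are USD per million tokens (Claude 3.5 Sonnet: $3 in, $15 out).
    """
    return (input_tokens / 1_000_000) * input_rate + (output_tokens / 1_000_000) * output_rate


# Example: a 1,000-token prompt with a 500-token completion.
cost = estimate_cost(1_000, 500)
print(f"${cost:.4f}")  # $0.0105
```

Note that output tokens cost 5x input tokens at these rates, so long completions dominate the bill for generation-heavy workloads.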

Performance Highlights

  • Legal Domain: Particularly strong in criminal law tasks, outperforming GPT-4 in several legal reasoning benchmarks
  • Cost-Efficiency: Better performance/cost ratio compared to Claude 3 Opus
  • Consistency: Shows more stable performance across different task types compared to previous versions

Benchmark Results

The model shows strong performance across our benchmarks. It is state-of-the-art on our Corporate Finance benchmark, but comparatively weak on the Contract Law benchmark.

Consistently high performing across benchmarks:

  • CaseLaw: One of the top three models for these tasks.
  • TaxEval: Comparatively strong performance with significant room for improvement.
  • CorpFin: Strong question-answering ability over credit agreements.
  • ContractLaw: The worst performing domain for this model.

Use Case Recommendations

Best suited for:

  • Legal document analysis
  • Complex reasoning tasks
  • Long-form content generation
  • Tasks requiring high accuracy with cost constraints

Limitations

  • Occasionally produces verbose outputs

Comparison with Other Models

Major improvements over Claude 3.0 Sonnet:

  • Improved accuracy in legal reasoning
  • Better handling of nuanced instructions
  • More consistent performance across tasks