Anthropic's most intelligent model.

Release Date: 10/22/2024
Avg. Accuracy: 69.9%
Latency: 8.66s

Performance by Benchmark

| Benchmark   | Accuracy | Ranking |
| ----------- | -------- | ------- |
| CorpFin     | 60.5%    | 7 / 25  |
| CaseLaw     | 84.9%    | 3 / 44  |
| ContractLaw | 68.7%    | 12 / 51 |
| TaxEval     | 73.7%    | 11 / 31 |
| MortgageTax | 78.1%    | 4 / 15  |
| Math500     | 72.4%    | 20 / 27 |
| AIME        | 10.0%    | 18 / 23 |
| MGSM        | 92.5%    | 3 / 25  |
| LegalBench  | 78.8%    | 11 / 49 |
| MedQA       | 83.2%    | 12 / 29 |
| GPQA        | 59.1%    | 8 / 24  |
| MMLU Pro    | 78.4%    | 7 / 24  |
| MMMU        | 68.9%    | 6 / 14  |
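
For reference, the headline 69.9% matches a simple unweighted mean of the thirteen per-benchmark scores. A quick sanity check in Python (the script and its names are ours, not part of the benchmark tooling):

```python
# Sanity check: the headline "Avg. Accuracy" appears to be the
# unweighted mean of the thirteen per-benchmark scores above.
scores = {
    "CorpFin": 60.5, "CaseLaw": 84.9, "ContractLaw": 68.7,
    "TaxEval": 73.7, "MortgageTax": 78.1, "Math500": 72.4,
    "AIME": 10.0, "MGSM": 92.5, "LegalBench": 78.8,
    "MedQA": 83.2, "GPQA": 59.1, "MMLU Pro": 78.4, "MMMU": 68.9,
}

avg = sum(scores.values()) / len(scores)
print(f"Avg. Accuracy: {avg:.1f}%")  # -> Avg. Accuracy: 69.9%
```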

The benchmarks above include both academic and proprietary datasets (contact us to get access to the proprietary benchmarks).

Cost Analysis

  • Input Cost: $3.00 / M tokens
  • Output Cost: $15.00 / M tokens
  • Input Cost (per char): $0.83 / M chars
  • Output Cost (per char): $5.20 / M chars
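
A minimal sketch of estimating per-request cost at these rates (the helper function and the example token counts are hypothetical, for illustration only):

```python
# Hypothetical cost estimator at the listed per-million-token rates.
INPUT_USD_PER_TOKEN = 3.00 / 1_000_000
OUTPUT_USD_PER_TOKEN = 15.00 / 1_000_000

def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request at the listed rates."""
    return (input_tokens * INPUT_USD_PER_TOKEN
            + output_tokens * OUTPUT_USD_PER_TOKEN)

# Example: a 10,000-token prompt with a 1,000-token completion.
print(f"${request_cost_usd(10_000, 1_000):.3f}")  # -> $0.045
```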

Overview

Claude 3.5 Sonnet is Anthropic's mid-tier model, sitting between the smaller Haiku and the larger Opus in the Claude family, and the successor to Claude 3 Sonnet. It offers a strong balance of performance and cost-effectiveness.

Key Specifications

  • Context Window: 200,000 tokens
  • Output Limit: 8,192 tokens
  • Training Cutoff: April 2024
  • Pricing:
    • Input: $3.00 per million tokens
    • Output: $15.00 per million tokens
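
As a minimal sketch, a request that exercises these limits via the Anthropic Python SDK might look like the following. The model ID is inferred from the release date above, so verify it against Anthropic's documentation; the SDK reads ANTHROPIC_API_KEY from the environment.

```python
# Minimal sketch using the Anthropic Python SDK (pip install anthropic).
# Assumption: the model ID is "claude-3-5-sonnet-20241022", matching the
# 10/22/2024 release date shown above.
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=8192,  # the model's output limit
    messages=[
        {"role": "user",
         "content": "Summarize the key terms of this credit agreement: ..."},
    ],
)
print(response.content[0].text)
```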

Performance Highlights

  • Legal Domain: Particularly strong in criminal law tasks, outperforming GPT-4 in several legal reasoning benchmarks
  • Cost-Efficiency: Better performance/cost ratio compared to Claude 3 Opus
  • Consistency: Shows more stable performance across different task types compared to previous versions

Benchmark Results

The model shows strong performance across our benchmarks, ranking in the top five on CaseLaw, MGSM, and MortgageTax. It is comparatively weak on the Contract Law benchmark.

Notable results across benchmarks:

  • CaseLaw: One of the top three models for these tasks.
  • TaxEval: Comparatively strong performance with significant room for improvement.
  • CorpFin: Strong question-answering ability over credit agreements.
  • ContractLaw: The worst performing domain for this model.

Use Case Recommendations

Best suited for:

  • Legal document analysis
  • Complex reasoning tasks
  • Long-form content generation
  • Tasks requiring high accuracy with cost constraints

Limitations

  • Occasionally produces verbose outputs
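
One common mitigation for verbosity (our suggestion, not an official recommendation) is to cap max_tokens and request brevity in the system prompt:

```python
# Sketch: constrain verbosity with a hard output cap and a terse system
# prompt. Same assumptions about the model ID and API key as above.
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,                               # hard cap on output length
    system="Answer in at most three sentences.",  # steer toward brevity
    messages=[
        {"role": "user",
         "content": "What does a forum selection clause do?"},
    ],
)
print(response.content[0].text)
```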

Comparison with Other Models

Major improvements over Claude 3.0 Sonnet:

  • Improved accuracy in legal reasoning
  • Better handling of nuanced instructions
  • More consistent performance across tasks