We evaluated Anthropic’s new Claude Sonnet 5 across our benchmark suite. It comes in at #3 on the Vals Index (68.61%), behind only Claude Fable 5 (75.15%) and Claude Opus 4.8 (70.36%), and narrowly ahead of GPT 5.5 (67.95%).
-
It’s a substantial generational step: +8.5 points on the Vals Index over Claude Sonnet 4.6 (60.07%), and almost all of that gain comes from coding.
-
The model’s main strength is coding. Compared with Sonnet 4.6, Claude Sonnet 5 jumps +30.7 points on Vibe Code Bench (56.22% → 86.90%) and +17.2 points on Terminal-Bench 2.1 (57.30% → 74.53%), while SWE-bench Verified actually slips by about 2 points (77.45% → 75.49%). The gains are concentrated in sustained, multi-step agentic coding rather than one-shot patch generation.
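For readers who want to check the arithmetic, here is a minimal sketch that recomputes the point deltas from the scores quoted in this post. The `scores` table and the `point_deltas` helper are illustrative names, not part of any evaluation harness, and the only inputs are the percentages reported above.

```python
# Scores in percent, copied from this post: (Claude Sonnet 4.6, Claude Sonnet 5).
# The dictionary layout and names are hypothetical, chosen for illustration.
scores = {
    "Vals Index": (60.07, 68.61),
    "Vibe Code Bench": (56.22, 86.90),
    "Terminal-Bench 2.1": (57.30, 74.53),
    "SWE-bench Verified": (77.45, 75.49),
}


def point_deltas(table):
    """Return the percentage-point change per benchmark, rounded to one decimal."""
    return {name: round(new - old, 1) for name, (old, new) in table.items()}


deltas = point_deltas(scores)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f} points")
```

These are differences in percentage points, not relative improvements, which matches how the deltas are quoted in the text.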
-
Outside coding, it’s a quieter step. Sonnet 5 posts 67.95% on CorpFin v2 (+1.3 points vs. Sonnet 4.6) and 51.98% on Finance Agent (+0.9 points). Its Vals Index score still clears last-generation Claude Opus 4.7 (66.10%).
We saw a small but non-trivial number of refusals on CorpFin v2: 15 of 858 tasks (about 1.7%), mostly flagged as “bio.”
The model has a 1M-token context window. Evaluations were run with 128k max output tokens and default temperature and top-p, with compute effort set to “max” on all benchmarks except Terminal-Bench 2.1, which was run at “high.” The Terminal-Bench 2.1 score is an average over 3 runs (range: 71.9%–78.7%).
Congrats to the Anthropic team on the release!