-
We evaluated Anthropic’s new Claude Fable 5 across our benchmark suite. It resets the top of the leaderboard, taking #1 on the Vals Index (75.15%) and the Vals Multimodal Index (74.15%).
-
Fable 5 topped most of the suite: #1 on CorpFin v2 (71.83%), MedScribe (88.52%), LegalBench (88.56%), MMLU Pro (91.50%), MMMU (89.31%), and ProofBench (77.00%).
-
Coding is the headline. Claude Fable 5 is #1 on every coding benchmark for which we have a scored Fable 5 result: Vibe Code Bench (90.35%), SWE-bench Verified (95.00%), Terminal-Bench 2.1 (80.52%), LiveCodeBench (89.78%), and IOI (72.25%).
-
Vibe Code Bench is the standout: Anthropic now holds the top three spots, and Fable 5 leads the best non-Anthropic model by more than 20 points. Six months ago, no model cracked 20% on this benchmark; Fable 5 now reaches 90.35%.
-
Compared with Claude Opus 4.8, Fable 5 improves by roughly 5 points on the Vals Index and roughly 8 points on Vibe Code Bench, even though Opus 4.8 took #1 just a week ago.
The API returned a high rate of refusals, especially on bio- and cyber-related questions. Because of this, we ran Fable 5 with Claude Opus 4.8 as a fallback: if Fable 5 refused a task, Opus 4.8 handled the request instead. Refusals, and therefore the fallback, mainly affected Terminal-Bench 2.1, GPQA, MMLU Pro, and MMMU.
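For readers who want a concrete picture of the fallback setup, here is a minimal Python sketch of the routing logic described above. It is an illustration under stated assumptions, not our actual harness: `ModelReply`, `RoutedReply`, `run_with_fallback`, and `fallback_rate` are hypothetical names, the models are passed in as plain callables rather than real API clients, and how a refusal is detected (the `refused` flag) is assumed rather than taken from any provider's API.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class ModelReply:
    text: str
    refused: bool  # hypothetical flag; refusal detection itself is an assumption


@dataclass
class RoutedReply:
    text: str
    answered_by: str  # which model produced the final answer
    refused: bool     # True only if the answering model also refused


# A "model" here is any callable that maps a prompt to a ModelReply.
Model = Callable[[str], ModelReply]


def run_with_fallback(
    prompt: str,
    primary: Model,
    fallback: Model,
    primary_name: str = "primary",
    fallback_name: str = "fallback",
) -> RoutedReply:
    """Ask the primary model first; if it refuses, hand the same prompt to the fallback."""
    first = primary(prompt)
    if not first.refused:
        return RoutedReply(first.text, primary_name, False)
    second = fallback(prompt)
    return RoutedReply(second.text, fallback_name, second.refused)


def fallback_rate(replies: Sequence[RoutedReply], fallback_name: str = "fallback") -> float:
    """Fraction of tasks that ended up being answered by the fallback model."""
    if not replies:
        return 0.0
    return sum(r.answered_by == fallback_name for r in replies) / len(replies)


if __name__ == "__main__":
    # Stub models for demonstration only; real runs would call a model API here.
    def stub_primary(prompt: str) -> ModelReply:
        if "pathogen" in prompt:
            return ModelReply("", refused=True)
        return ModelReply("primary answer", refused=False)

    def stub_fallback(prompt: str) -> ModelReply:
        return ModelReply("fallback answer", refused=False)

    prompts = ["sum two integers", "describe a pathogen"]
    results = [run_with_fallback(p, stub_primary, stub_fallback) for p in prompts]
    for r in results:
        print(r.answered_by, "->", r.text)
    print("fallback rate:", fallback_rate(results))
```

Tracking `answered_by` per task is the useful part of the design: it makes it possible to report, for each benchmark, what share of responses came from the fallback model rather than the model under evaluation.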
The model has a 1M-token context window and 128k max output tokens. Evaluations were run with compute effort set to “max” and temperature set to 1.0.
Congrats to the Anthropic team on the release!