Updates
Model
03/28/2025
Gemini 2.5 Pro Exp evaluated on all benchmarks!
- Gemini 2.5 Pro Exp is Google’s latest experimental model and the new state of the art, achieving an average accuracy of 82.3% across all benchmarks at a latency of 24.68s (see the averaging sketch after this list).
- The model ranks #1 on many of our benchmarks including CorpFin, Math500, LegalBench, GPQA, MMLU Pro, and MMMU.
- It excels in academic benchmarks, with standout performances on Math500 (95.2%), MedQA (93.0%), and MGSM (92.2%).
- Gemini 2.5 Pro Exp demonstrates strong legal reasoning capabilities with 86.1% accuracy on CaseLaw and 83.6% on LegalBench, though it scores lower on ContractLaw (64.7%).
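For context on the headline figure: a minimal sketch of how such an average can be computed, assuming a simple unweighted mean over per-benchmark accuracies (the exact aggregation and weighting used for the 82.3% figure are not spelled out above, and the scores below are only the subset quoted in this post):

```python
# Minimal sketch: unweighted mean of per-benchmark accuracies.
# Assumptions: a simple macro-average; the scores below are only the
# subset quoted in this post, so the result (~85.8%) will not match
# the 82.3% average reported over the full benchmark suite.
scores = {
    "Math500": 95.2,
    "MedQA": 93.0,
    "MGSM": 92.2,
    "CaseLaw": 86.1,
    "LegalBench": 83.6,
    "ContractLaw": 64.7,
}

average = sum(scores.values()) / len(scores)
print(f"Average accuracy over {len(scores)} benchmarks: {average:.1f}%")
# -> Average accuracy over 6 benchmarks: 85.8%
```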
Model
03/26/2025
DeepSeek V3 evaluated on all benchmarks!
- DeepSeek V3 is DeepSeek’s latest model; DeepSeek reports generation speeds of 60 tokens/second, 3x faster than V2. It achieves an average accuracy of 73.9%, 4.2% higher than previous versions.
- DeepSeek V3 performs comparably to, and slightly better than, Claude 3.7 Sonnet (71.7%).
- The model demonstrates strong legal capabilities, scoring particularly well on CaseLaw and LegalBench, though it scores lower on ContractLaw.
- It shows impressive academic versatility with top-tier performance on MGSM, Math500, and MedQA.
Benchmark
03/26/2025
New Multimodal Benchmark: MMMU Evaluates Visual Reasoning Across 30 Subjects
Today, we’re releasing results from MMMU (Massive Multi-discipline Multimodal Understanding), a comprehensive evaluation of AI models’ ability to reason across multiple modalities, spanning 30 subjects in 6 major disciplines.
- o1 achieved the highest overall accuracy at 77.7%, surpassing the low human-expert baseline (76.2%).
- Claude 3.7 Sonnet (Thinking) delivers performance nearly identical to o1’s at a more favorable price point.
- Even the best models remain well below the high human-expert baseline (88.6%), leaving substantial room for further advancement.
Latest Model Releases
Anthropic Claude 3.7 Sonnet (Thinking)
Release date: 02/24/2025
Anthropic Claude 3.7 Sonnet
Release date: 02/24/2025
OpenAI o3-mini
Release date: 01/31/2025
DeepSeek R1
Release date: 01/20/2025