Public Enterprise LLM Benchmarks

Model benchmarks are seriously lacking. With Vals AI, we report how language models perform on the industry-specific tasks where they will be used.

Updates

Model

03/28/2025

Gemini 2.5 Pro Exp evaluated on all benchmarks!

We just evaluated Gemini 2.5 Pro Exp on all benchmarks!

  • Gemini 2.5 Pro Exp is Google’s latest experimental model and the new state of the art, achieving an impressive average accuracy of 82.3% across all benchmarks with a latency of 24.68s (a quick sketch of how such a cross-benchmark average is computed follows this list).
  • The model ranks #1 on many of our benchmarks including CorpFin, Math500, LegalBench, GPQA, MMLU Pro, and MMMU.
  • It excels in academic benchmarks, with standout performances on Math500 (95.2%), MedQA (93.0%), and MGSM (92.2%).
  • Gemini 2.5 Pro Exp demonstrates strong legal reasoning capabilities with 86.1% accuracy on CaseLaw and 83.6% on LegalBench, though it scores lower on ContractLaw (64.7%).
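
As a rough illustration of how a cross-benchmark average like the 82.3% figure is formed, the sketch below takes only the per-benchmark scores quoted in this update and computes their unweighted mean in Python. The equal weighting is our assumption, and because just six scores are included here, the result (85.8%) does not match the reported 82.3% average over all benchmarks.

```python
# Illustrative sketch only: unweighted mean of the six per-benchmark
# accuracies quoted in this update. Equal weighting is an assumption,
# and this subset does not reproduce the reported 82.3% average,
# which covers all benchmarks in the suite.
scores = {
    "Math500": 95.2,
    "MedQA": 93.0,
    "MGSM": 92.2,
    "CaseLaw": 86.1,
    "LegalBench": 83.6,
    "ContractLaw": 64.7,
}
average = sum(scores.values()) / len(scores)
print(f"Average over these benchmarks: {average:.1f}%")  # -> 85.8%
```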

View Models Page

Model

03/26/2025

DeepSeek V3 evaluated on all benchmarks!

We just evaluated DeepSeek V3 on all benchmarks!

  • DeepSeek V3 is DeepSeek’s latest model, boasting speeds of 60 tokens/second (a claimed 3x speedup over V2) and reaching an average accuracy of 73.9%, 4.2% better than previous versions.
  • DeepSeek V3 (73.9%) performs comparably to, and slightly better than, Claude 3.7 Sonnet (71.7%).
  • The model demonstrates strong legal capabilities, scoring particularly well on CaseLaw and LegalBench, though it scores lower on ContractLaw.
  • It shows impressive academic versatility with top-tier performance on MGSM, Math500, and MedQA.

View Models Page

Benchmark

03/26/2025

New Multimodal Benchmark: MMMU Evaluates Visual Reasoning Across 30 Subjects

Today, we’re releasing results from MMMU (the Massive Multi-discipline Multimodal Understanding benchmark), a comprehensive evaluation of AI models’ ability to reason across multiple modalities, spanning 30 subjects in 6 major disciplines.

  • o1 achieved the highest overall accuracy at 77.7%, surpassing the lowest-performing human experts (76.2%).
  • Claude 3.7 Sonnet (Thinking) delivers performance nearly identical to o1 at a more favorable price point.
  • Even the best models remain well below the performance of the best human experts (88.6%), highlighting opportunities for further advancement.

View Benchmark

Latest Benchmarks

View All Benchmarks

Latest Model Releases

Anthropic Claude 3.7 Sonnet (Thinking)

View Model

Anthropic Claude 3.7 Sonnet

View Model

OpenAI O3 Mini

View Model

DeepSeek R1

View Model
Join our mailing list to receive benchmark updates.

Stay up to date as new benchmarks and models are released.