Model
Gemini 2.5 Pro Exp evaluated on all benchmarks!
We just evaluated Gemini 2.5 Pro Exp on all benchmarks!
- Gemini 2.5 Pro Exp is Google’s latest experimental model and the new state of the art, achieving an impressive average accuracy of 82.3% across all benchmarks with a latency of 24.68s.
- The model ranks #1 on many of our benchmarks, including CorpFin, Math500, LegalBench, GPQA, MMLU Pro, and MMMU.
- It excels in academic benchmarks, with standout performances on Math500 (95.2%), MedQA (93.0%), and MGSM (92.2%).
- Gemini 2.5 Pro Exp demonstrates strong legal reasoning capabilities with 86.1% accuracy on CaseLaw and 83.6% on LegalBench, though it scores lower on ContractLaw (64.7%).
Model
DeepSeek V3 evaluated on all benchmarks!
We just evaluated DeepSeek V3 on all benchmarks!
- DeepSeek V3 is DeepSeek’s latest model, boasting speeds of 60 tokens/second and claiming to be 3x faster than V2, with an average accuracy of 73.9% (4.2% better than previous versions).
- DeepSeek V3 performs comparably to Claude 3.7 Sonnet (71.7%), edging it out slightly.
- The model demonstrates strong legal capabilities, scoring particularly well on CaseLaw and LegalBench, though it scores lower on ContractLaw.
- It shows impressive academic versatility with top-tier performance on MGSM, Math500, and MedQA.
Benchmark
New Multimodal Benchmark: MMMU Evaluates Visual Reasoning Across 30 Subjects
Today, we’re releasing results from the Massive Multi-discipline Multimodal Understanding benchmark (MMMU), a comprehensive evaluation of AI models’ ability to reason across multiple modalities, spanning 30 subjects in 6 major disciplines.
- o1 achieved the highest overall accuracy at 77.7%, surpassing the lowest-performing human experts (76.2%).
- Claude 3.7 Sonnet (Thinking) delivers performance nearly identical to o1 at a more favorable price point.
- Even the best models remain well below the best human experts (88.6%), highlighting opportunities for further advancement.
Model
Command A evaluated on all benchmarks!
We just evaluated Command A on all benchmarks!
- Command A is Cohere’s most efficient and performant model to date, targeting agentic AI, multilingual tasks, and human evaluations of real-life use cases.
- On our proprietary benchmarks, Command A shows mixed performance, ranking 23rd out of 28 models on TaxEval but a respectable 10th out of 22 models on CorpFin.
- The model performs better on some academic benchmarks, scoring 78.7% on LegalBench (9th place) and 86.8% on MGSM (13th place).
- However, it struggles with AIME (13.3%, 12th place) and GPQA (29.3%, 18th place).
Model
Jamba 1.6 Large and Mini Evaluated on All Benchmarks.
We just evaluated Jamba 1.6 Large and Jamba 1.6 Mini models!
- Jamba 1.6 Large and Jamba 1.6 Mini are the latest versions of the open source Jamba models, developed by AI21 Labs.
- On our private benchmarks, Jamba 1.6 Large shows reasonable performance, placing 16th out of 27 models on TaxEval with 65.3% accuracy, beating GPT-4o Mini and Claude 3.5 Haiku.
- However, neither model is competitive on the public benchmarks: they take the last two places on AIME and GPQA.
Benchmark
Academic Benchmarks Released: GPQA, MMLU, AIME (2024 and 2025), Math 500, and MGSM
Today, we’ve released five new academic benchmarks on our site: three evaluating mathematical reasoning and two covering general question-answering.
Unlike results released by model providers on these benchmarks, we applied a consistent methodology and prompt-template across models, ensuring an apples-to-apples comparison. You can find detailed information about our evaluation approach on each benchmark’s page:
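To illustrate what a consistent prompt template means in practice, here is a minimal sketch, with a hypothetical template, placeholder model names, and a stand-in `query_model` client rather than our actual evaluation harness, of how the same rendered prompt can be sent to every model before scoring:

```python
# Minimal sketch of a shared prompt template applied to every model.
# The template text, model identifiers, and `query_model` callable below
# are illustrative assumptions, not the exact harness used for these results.

PROMPT_TEMPLATE = (
    "Answer the following multiple-choice question.\n"
    "Respond with the letter of the correct option only.\n\n"
    "Question: {question}\n"
    "Options:\n{options}\n"
    "Answer:"
)

MODELS = ["model-a", "model-b", "model-c"]  # placeholder identifiers


def build_prompt(question: str, options: list[str]) -> str:
    """Render one benchmark item with the shared template."""
    formatted_options = "\n".join(
        f"{letter}. {text}" for letter, text in zip("ABCD", options)
    )
    return PROMPT_TEMPLATE.format(question=question, options=formatted_options)


def evaluate(items: list[dict], query_model) -> dict[str, float]:
    """Score every model on identically rendered prompts.

    `query_model(model_name, prompt)` stands in for whatever client actually
    calls the model; it must return the model's raw answer string.
    """
    scores = {}
    for model in MODELS:
        correct = 0
        for item in items:
            prompt = build_prompt(item["question"], item["options"])
            answer = query_model(model, prompt).strip().upper()[:1]
            correct += answer == item["label"]
        scores[model] = correct / len(items)
    return scores
```

Because every model sees the same rendered prompt and the same scoring rule, differences in the resulting accuracies reflect the models themselves rather than prompt engineering.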
Benchmark
New Multimodal Mortgage Tax Benchmark Released
We just released a new benchmark in partnership with Vontive!
- The MortgageTax benchmark evaluates language models on extracting information from tax certificates.
- It tests multimodal capabilities with 1258 document images, including both computer-written and handwritten content.
- The benchmark includes two key tasks: semantic extraction (identifying year, parcel number, county) and numerical extraction (calculating annualized amounts).
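To make the two task types concrete, here is a small sketch with hypothetical field names and scoring rules; it is not the benchmark's actual schema or grading code, only an illustration of what the expected outputs and a simple exact-match or numeric check could look like:

```python
from dataclasses import dataclass

# Illustrative record types for the two MortgageTax task families.
# The field names and the numeric tolerance below are assumptions made for
# this sketch, not the benchmark's real schema or grader.


@dataclass
class SemanticExtraction:
    tax_year: str
    parcel_number: str
    county: str


@dataclass
class NumericalExtraction:
    annualized_amount: float  # e.g. a semiannual bill scaled to a yearly figure


def score_semantic(pred: SemanticExtraction, gold: SemanticExtraction) -> float:
    """Fraction of fields that match exactly after light normalization."""
    fields = ["tax_year", "parcel_number", "county"]
    hits = sum(
        getattr(pred, f).strip().lower() == getattr(gold, f).strip().lower()
        for f in fields
    )
    return hits / len(fields)


def score_numerical(
    pred: NumericalExtraction, gold: NumericalExtraction, tolerance: float = 0.01
) -> bool:
    """Accept the predicted amount if it is within one cent of the gold amount."""
    return abs(pred.annualized_amount - gold.annualized_amount) <= tolerance
```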
Claude 3.7 Sonnet leads the pack with 80.6% accuracy, and the rest of the top three are also Anthropic models.
News
Vals Legal AI Report Released
We just released the Vals Legal AI Report (VLAIR)! Whereas our previous benchmarks study foundation model performance, here we investigate the ability of the most popular legal AI products to perform real-world legal tasks.
To build a large, high-quality dataset, we worked with some of the top global law firms, including Reed Smith, Fisher Phillips, McDermott Will & Emery, Ogletree Deakins, and Paul Hastings, among others. This is the first benchmark in which we also collected a human baseline against which we measure performance.
In sum, this enabled us to study how these legal AI systems perform on practical tasks, and especially how the work of generative AI tools compares to that of a human lawyer.
Read the report for full results.
Model
Anthropic's Claude 3.7 Sonnet Evaluated on All Benchmarks.
We just evaluated Anthropic’s Claude 3.7 Sonnet model!
- We evaluated the model with Thinking Disabled on all benchmarks. It shows great performance and reaches second place on CorpFin, just behind its Thinking Enabled counterpart.
- We also evaluated the model with Thinking Enabled. Unlike most models that excel in specific areas, Anthropic’s Claude 3.7 Sonnet (Thinking) demonstrates remarkable consistency, achieving top-tier performance across all evaluated benchmarks. The remaining two benchmarks are currently in progress due to their higher token requirements.
We have also run Gemini 2.0 Flash Thinking Exp and Gemini 2.0 Pro Exp on most benchmarks.
Model
OpenAI's o3-mini Evaluated on All Benchmarks.
We just evaluated OpenAI’s o3-mini model!
- The model shows a good price-performance trade-off, placing near the top on our most recent proprietary benchmarks, such as TaxEval.
- However, o3-mini seems to struggle with large context windows, performing poorly on the Max Fitting Context task of CorpFin. It tends to lose track of the question when it is provided at the beginning of a large context window (around 150k tokens or more).
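To illustrate the effect described above, here is a minimal sketch of how the same question can be placed before or after a very long document to compare the two conditions. The prompt wording and the `query_model` callable are assumptions for this example, not our actual CorpFin harness:

```python
# Sketch of a "question-first" vs "question-last" probe for long contexts.
# The document loading and model call are placeholders; only the prompt
# construction is the point of this example.


def build_prompt(document: str, question: str, question_first: bool) -> str:
    """Place the question either before or after a long document."""
    if question_first:
        return f"Question: {question}\n\nDocument:\n{document}\n\nAnswer:"
    return f"Document:\n{document}\n\nQuestion: {question}\n\nAnswer:"


def compare_positions(document: str, question: str, query_model) -> dict[str, str]:
    """Return the model's answers under both placements.

    `query_model(prompt)` stands in for whatever client calls the model.
    A model that "loses" a question placed before ~150k tokens of context
    would tend to answer correctly only in the question-last condition.
    """
    return {
        "question_first": query_model(build_prompt(document, question, True)),
        "question_last": query_model(build_prompt(document, question, False)),
    }
```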
We have also run DeepSeek R1 on our CorpFin benchmark, on which it reaches the top place, beating all other models we have tested.
Model
DeepSeek R1 Evaluated on TaxEval, CaseLaw, ContractLaw
🐳 We just evaluated DeepSeek’s R1 model on three of our private datasets! 🐳
- The model demonstrates strong reasoning ability, rivaling OpenAI’s o1 model on our TaxEval dataset.
- However, R1 performs extremely poorly on ContractLaw and only middling on CaseLaw. The model’s performance is not uniform, suggesting task-specific evaluation should be done before adoption.
- Overall, this large Chinese model shows impressive ability and further closes the gap between closed and open-source models.
Benchmark
Two New Proprietary Benchmarks Released
We just released two new benchmarks!
- We have released a completely new version of our CorpFin benchmark, with 1200 expert-generated financial questions over very long context documents (200-300 pages).
- We have also released a completely new TaxEval benchmark, with more than 1500 expert-reviewed tax questions.
We are also releasing results for several new models, such as Grok 2 and Gemini 2.0 Flash Exp.
Benchmark
New Medical Benchmark Released
Vals AI and Graphite Digital partnered to release the first medical benchmark on Vals AI.
This report offers the first exhaustive, third-party evaluation of over 15 of the most popular LLMs on graduate-level medical questions.
We assessed models under two conditions, unbiased and bias-injected questions, measuring the models’ general accuracy and their ability to handle racial bias in medical contexts.
Our top-performing model was OpenAI’s o1 Preview, and the best value was Meta’s Llama 3.1 70b.
Read the full report to find out more!
News
Refresh to Vals AI
We’ve just implemented a re-design of this benchmarking website!
Apart from being easier on the eyes, this new version of the site is much more useful.
- Model cards are displayed on their own dedicated pages, showing results across all benchmarks.
- Every Benchmark page is time-stamped and updated with changelogs.
- Our Methodology page now shares more details around our approach and plan.
Model
Results for the new Claude 3.5 Sonnet (Latest) model
- On LegalBench, it’s now exactly tied with GPT-4o, and it beats 4o on CorpFin and CaseLaw.
- It usually, but not always, performs a few percentage points better than the previous version - for example, on LegalBench (+1.3%), ContractLaw Overall (+0.5%), and CorpFin (+0.8%).
- There are some instances where it experienced a performance regression, including TaxEval Free Response (-3.2%) and CaseLaw Overall (-0.1%).
- Although it’s competitive with 4o, it’s still not at the level of OpenAI’s o1, which still claims the top spots on almost all of our leaderboards.
News
Vals AI Legal Report Announced
Vals AI and Legaltech Hub are partnering with leading law firms and top legal AI vendors to conduct a first-of-its-kind benchmark.
The study will evaluate the platforms across eight legal tasks, including Document Q&A, Legal Research, and EDGAR Research. All data will be collected from the law firms to ensure it’s representative of real legal work.
The report will be published in early 2025.