Updates
Model
05/05/2025
Google's Gemini 2.5 Flash evaluated on most benchmarks
We just evaluated Gemini 2.5 Flash Preview on most benchmarks.
- Gemini 2.5 Flash Preview is a lightweight alternative to Google’s flagship model, Gemini 2.5 Pro Exp. It runs at a fraction of the cost and latency, making it a more accessible option.
- Like Claude 3.7 Sonnet, Gemini 2.5 Flash Preview is a hybrid reasoning model, meaning it can adaptively choose how much to think before responding (see the sketch after this list).
- Gemini 2.5 Flash Preview excels on LegalBench, coming in second only to the flagship Gemini 2.5 Pro Exp and outperforming its own thinking variant, Gemini 2.5 Flash Preview (Thinking), by 1%.
- We consistently ran into difficulties with Google’s API during evaluation, which prevented us from reporting full results. We’re working with a representative from the Gemini team to resolve those issues.
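For readers who want to experiment with this adaptive-thinking behavior, here is a minimal sketch (ours, not part of the evaluation) using Google’s google-genai Python SDK. The model ID matches the 04/17/2025 preview release, and the thinking_budget knob caps how many tokens the model may spend reasoning; the prompt is just a placeholder.

```python
# Minimal sketch (ours, not from the evaluation) of steering Gemini 2.5
# Flash Preview's hybrid reasoning via the google-genai Python SDK.
# thinking_budget caps the tokens the model may spend reasoning;
# 0 turns thinking off entirely.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",  # preview model ID at release
    contents="What is 17 * 24?",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=0)
    ),
)
print(response.text)
```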
Model
05/05/2025
Qwen 3 235B evaluations released!
We just evaluated Qwen 3 235B on all benchmarks!
- Qwen 3 235B demonstrates exceptional math reasoning capabilities, ranking #3 on Math500, #5 on AIME, and #3 on MGSM.
- In its “thinking allowed” mode, Qwen 3 outperforms several prominent closed-source reasoning models, including Claude 3.7 Sonnet and o4-mini, on mathematical reasoning tasks (see the sketch after this list).
- Private benchmark challenges: Qwen 3 shows limitations on proprietary benchmarks, particularly TaxEval, where it ranks #29 out of 43 evaluated models.
- This evaluation showcases Qwen 3’s strong specialized reasoning capabilities while highlighting areas where further improvement could lift its performance on domain-specific tasks.
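As a point of reference, here is a minimal sketch of how Qwen 3’s thinking mode can be toggled with Hugging Face Transformers; the enable_thinking flag comes from the public Qwen3 chat template, and the example prompt is ours.

```python
# Minimal sketch of toggling Qwen 3's thinking mode with Hugging Face
# Transformers. The enable_thinking flag is part of the public Qwen3
# chat template; the prompt below is our own example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-235B-A22B")

messages = [{"role": "user", "content": "Solve for x: 3x + 5 = 20"}]

# Thinking on: the template leaves room for a <think>...</think> trace.
prompt_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Thinking off: per the model card, the template inserts an empty think
# block so the model answers directly.
prompt_direct = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

print(prompt_thinking)
print(prompt_direct)
```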
Benchmark
04/22/2025
Our new Finance Agent Benchmark is live!
- Our new Finance Agent Benchmark evaluates AI agents’ ability to perform tasks expected of entry-level financial analysts.
- Developed in collaboration with industry experts, it includes 537 questions covering skills like simple retrieval, market research, and projections.
- The models are expected to use a set of four tools to search the web or the EDGAR database and to parse the results when answering questions (a hypothetical tool declaration is sketched after this list).
- Current AI models do not exceed 50% accuracy, highlighting the need for further development before reliable deployment in the finance industry.
- At the time of this benchmark’s release, o3 is the best-performing model, reaching 48.3% accuracy, but at an average cost of $3.69 per question.
- It is followed closely by Claude 3.7 Sonnet (Thinking), which reaches 44.1% accuracy at a much lower $1.05 per question.
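The benchmark’s actual tool interfaces are not reproduced in this post, so the sketch below is purely hypothetical: the tool names and schemas are ours, illustrating how two of the four tools might be declared in the JSON-schema function-calling style most agent frameworks accept.

```python
# Hypothetical illustration only: the benchmark's real tool definitions
# are not public in this post, so the names and schemas below are ours.
# They show how two of the four tools might be declared in the common
# JSON-schema function-calling style.
WEB_SEARCH_TOOL = {
    "name": "web_search",  # hypothetical name
    "description": "Search the web and return result snippets with URLs.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Free-text query."},
        },
        "required": ["query"],
    },
}

EDGAR_SEARCH_TOOL = {
    "name": "edgar_search",  # hypothetical name
    "description": "Search SEC EDGAR filings by company and form type.",
    "parameters": {
        "type": "object",
        "properties": {
            "company": {"type": "string", "description": "Company name or ticker."},
            "form_type": {"type": "string", "description": "e.g. '10-K', '10-Q'."},
        },
        "required": ["company"],
    },
}
```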
Latest Model Releases
Qwen 3 (235B)
Release date: 04/28/2025
Gemini 2.5 Flash Preview
Release date: 04/17/2025
Gemini 2.5 Flash Preview (Thinking)
Release date: 04/17/2025
o4 Mini
Release date: 04/16/2025