Updates
Model
05/05/2025
Google's Gemini 2.5 Flash evaluated on most benchmarks
We just evaluated Gemini 2.5 Flash Preview on most benchmarks.
- Gemini 2.5 Flash Preview is a lightweight alternative to Google’s flagship model, Gemini 2.5 Pro Exp. It runs at a fraction of the cost and latency, making it a more accessible option.
- Like Claude 3.7 Sonnet, Gemini 2.5 Flash Preview is a hybrid reasoning model, meaning it can adaptively choose how much to think before responding (see the sketch after this list).
- Gemini 2.5 Flash Preview excels on LegalBench, coming in second only to the flagship Gemini 2.5 Pro Exp and outperforming its own thinking variant, Gemini 2.5 Flash Preview (Thinking), by 1%.
- We consistently ran into difficulties with Google’s API during evaluation, which prevented us from reporting full results. We’re working with a representative from the Gemini team to resolve those issues.
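For readers who want to experiment with this adaptive-thinking behavior, here is a minimal sketch (ours, not part of the evaluation) using Google’s google-genai Python SDK. The model ID matches the 04/17/2025 preview release, and the thinking_budget knob caps how many tokens the model may spend reasoning; the prompt is just a placeholder.

```python
# Minimal sketch (ours, not from the evaluation) of steering Gemini 2.5
# Flash Preview's hybrid reasoning via the google-genai Python SDK.
# thinking_budget caps the tokens the model may spend reasoning;
# 0 turns thinking off entirely.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",  # preview model ID at release
    contents="What is 17 * 24?",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=0)
    ),
)
print(response.text)
```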
Model
05/05/2025
Qwen 3 235B evaluations released!
We just evaluated Qwen 3 235B on all benchmarks!
- Qwen 3 235B demonstrates exceptional math reasoning capabilities, ranking #3 on Math500, #5 on AIME, and #3 on MGSM.
- In its “thinking allowed” mode, Qwen 3 outperforms several prominent closed-source reasoning models, including Claude 3.7 Sonnet and o4-mini, on mathematical reasoning tasks (see the sketch after this list).
- Private benchmark challenges: Qwen 3 shows limitations on proprietary benchmarks, particularly TaxEval, where it ranks #29 out of 43 evaluated models.
- This evaluation showcases Qwen 3’s strong specialized reasoning capabilities while highlighting areas where further improvement could lift its performance on domain-specific tasks.
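As a point of reference, here is a minimal sketch of how Qwen 3’s thinking mode can be toggled with Hugging Face Transformers; the enable_thinking flag comes from the public Qwen3 chat template, and the example prompt is ours.

```python
# Minimal sketch of toggling Qwen 3's thinking mode with Hugging Face
# Transformers. The enable_thinking flag is part of the public Qwen3
# chat template; the prompt below is our own example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-235B-A22B")

messages = [{"role": "user", "content": "Solve for x: 3x + 5 = 20"}]

# Thinking on: the template leaves room for a <think>...</think> trace.
prompt_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Thinking off: per the model card, the template inserts an empty think
# block so the model answers directly.
prompt_direct = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

print(prompt_thinking)
print(prompt_direct)
```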
Benchmark
04/22/2025
Our new Finance Agent Benchmark is live!
- Our new Finance Agent Benchmark evaluates AI agents’ ability to perform tasks expected of entry-level financial analysts.
- Developed in collaboration with industry experts, it includes 537 questions covering skills like simple retrieval, market research, and projections.
- The models are expected to use a set of four tools to search the web or the EDGAR database and to parse the results when answering questions (a hypothetical tool declaration is sketched after this list).
- Current AI models do not exceed 50% accuracy, highlighting the need for further development before reliable deployment in the finance industry.
- At the time of this benchmark’s release, o3 is the best-performing model, reaching 48.3% accuracy, but at an average cost of $3.69 per question.
- It is followed closely by Claude 3.7 Sonnet (Thinking), which reaches 44.1% accuracy at a much lower $1.05 per question.
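The benchmark’s actual tool interfaces are not reproduced in this post, so the sketch below is purely hypothetical: the tool names and schemas are ours, illustrating how two of the four tools might be declared in the JSON-schema function-calling style most agent frameworks accept.

```python
# Hypothetical illustration only: the benchmark's real tool definitions
# are not public in this post, so the names and schemas below are ours.
# They show how two of the four tools might be declared in the common
# JSON-schema function-calling style.
WEB_SEARCH_TOOL = {
    "name": "web_search",  # hypothetical name
    "description": "Search the web and return result snippets with URLs.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Free-text query."},
        },
        "required": ["query"],
    },
}

EDGAR_SEARCH_TOOL = {
    "name": "edgar_search",  # hypothetical name
    "description": "Search SEC EDGAR filings by company and form type.",
    "parameters": {
        "type": "object",
        "properties": {
            "company": {"type": "string", "description": "Company name or ticker."},
            "form_type": {"type": "string", "description": "e.g. '10-K', '10-Q'."},
        },
        "required": ["company"],
    },
}
```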
Latest Model Releases
Qwen 3 (235B)
Release date: 04/28/2025
Gemini 2.5 Flash Preview
Release date: 04/17/2025
Gemini 2.5 Flash Preview (Thinking)
Release date: 04/17/2025
o4 Mini
Release date: 04/16/2025