New Finance Agent Benchmark Released

Public Enterprise LLM Benchmarks

Model benchmarks are seriously lacking. With Vals AI, we report how language models perform on the industry-specific tasks where they will be used.

Model

05/05/2025

Google's Gemini 2.5 Flash evaluated on most benchmarks

We just evaluated Gemini 2.5 Flash Preview on most benchmarks.

Model

05/05/2025

Qwen 3 235B evaluations released!

We just evaluated Qwen 3 235B on all benchmarks!

  • Qwen 3 235B demonstrates exceptional math reasoning capabilities, ranking #3 on Math500, #5 on AIME, and #3 on MGSM.

  • With its “thinking allowed” approach, Qwen 3 outperforms several prominent closed-source reasoning models including Claude 3.7 Sonnet and o4-mini in mathematical reasoning tasks.

  • Private benchmark challenges: Qwen 3 shows limitations on proprietary benchmarks, particularly struggling on TaxEval where it ranks #29 out of 43 evaluated models.

  • This evaluation showcases Qwen 3’s strong specialized reasoning capabilities while highlighting areas where further improvements could enhance its performance on domain-specific tasks.

Benchmark

04/22/2025

Our new Finance Agent Benchmark is live!

  • Our new Finance Agent Benchmark evaluates AI agents’ ability to perform tasks expected of entry-level financial analysts.
  • Developed in collaboration with industry experts, it includes 537 questions covering skills like simple retrieval, market research, and projections.
  • Models are expected to use a set of four tools to search the web or the EDGAR database and to parse the results to answer the questions.
  • Current AI models do not exceed 50% accuracy, highlighting the need for further development before reliable deployment in the finance industry.
  • At the time of this benchmark’s release, o3 is the best-performing model, reaching 48.3% accuracy, but at an average cost of $3.69 per question.
  • It is followed closely by Claude Sonnet 3.7 Thinking, which reaches 44.1% accuracy at a much lower cost of $1.05 per question.
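
To put the accuracy and cost figures above side by side, the following is a rough sketch of the arithmetic, not part of the benchmark itself. It assumes the reported per-question cost applies uniformly across all 537 questions, which the announcement does not state; the names Result, total_cost, and cost_per_correct are hypothetical helpers introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """Figures quoted in the announcement for one model."""
    model: str
    accuracy: float           # fraction of questions answered correctly
    cost_per_question: float  # reported cost per question, in USD


NUM_QUESTIONS = 537  # benchmark size stated in the announcement

RESULTS = [
    Result("o3", 0.483, 3.69),
    Result("Claude Sonnet 3.7 Thinking", 0.441, 1.05),
]


def total_cost(num_questions: int, cost_per_question: float) -> float:
    """Estimated cost of one full run, ASSUMING a uniform per-question cost."""
    return num_questions * cost_per_question


def cost_per_correct(cost_per_question: float, accuracy: float) -> float:
    """Expected spend per correctly answered question."""
    if accuracy <= 0:
        raise ValueError("accuracy must be positive")
    return cost_per_question / accuracy


for r in RESULTS:
    run = total_cost(NUM_QUESTIONS, r.cost_per_question)
    per_correct = cost_per_correct(r.cost_per_question, r.accuracy)
    print(f"{r.model}: ~${run:,.2f} per full run, ~${per_correct:.2f} per correct answer")
```

Under these assumptions, the 4.2-point accuracy lead of o3 comes at roughly three times the spend per correct answer.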

Latest Model Releases

Qwen 3 (235B)

Release date: 4/28/2025

Gemini 2.5 Flash Preview

Release date: 4/17/2025

Gemini 2.5 Flash Preview (Thinking)

Release date: 4/17/2025

o4 Mini

Release date: 4/16/2025