New Finance Agent Benchmark Released

Public Enterprise LLM Benchmarks

Model benchmarks are seriously lacking. At Vals AI, we report how language models perform on the industry-specific tasks where they will actually be used.

Benchmark

04/22/2025

Our new Finance Agent Benchmark is live!

  • Our new Finance Agent Benchmark evaluates AI agents’ ability to perform tasks expected of entry-level financial analysts.
  • Developed in collaboration with industry experts, it includes 537 questions covering skills like simple retrieval, market research, and projections.
  • Models must use a set of four tools to search the web and the SEC's EDGAR database, then parse the results to answer each question (a minimal sketch of this tool-calling setup follows this list).
  • Current AI models do not exceed 50% accuracy, highlighting the need for further development before they can be reliably deployed in the finance industry.
  • At the time of this benchmark's release, o3 is the best-performing model at 48.3% accuracy, but at an average cost of $3.69 per question.
  • It is followed closely by Claude 3.7 Sonnet (Thinking), which reaches 44.1% accuracy at a much lower $1.05 per question.
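For illustration, here is a minimal sketch of the kind of tool-calling loop such an agent runs. The tool names, schemas, and dispatch stub below are assumptions for demonstration only, not the benchmark's actual harness (which defines four tools in total).

```python
# Minimal sketch of a tool-using finance agent (illustrative only).
# Tool names, schemas, and dispatch logic are assumptions; the benchmark's
# real harness may differ.
import json
from openai import OpenAI

client = OpenAI()

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web and return result snippets.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "search_edgar",
            "description": "Search SEC EDGAR filings by company and form type.",
            "parameters": {
                "type": "object",
                "properties": {
                    "company": {"type": "string"},
                    "form_type": {"type": "string"},
                },
                "required": ["company"],
            },
        },
    },
]


def dispatch_tool(name: str, args: dict) -> str:
    """Stub tool executor; a real harness would call search and EDGAR APIs."""
    if name == "search_web":
        return f"[web results for: {args['query']}]"
    if name == "search_edgar":
        return f"[EDGAR filings for: {args['company']}]"
    return "unknown tool"


def run_agent(question: str, model: str = "gpt-4.1") -> str:
    """Let the model call tools until it returns a final answer."""
    messages = [{"role": "user", "content": question}]
    while True:
        resp = client.chat.completions.create(
            model=model, messages=messages, tools=TOOLS
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content  # final answer, no more tool calls
        messages.append(msg)
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            messages.append(
                {
                    "role": "tool",
                    "tool_call_id": call.id,
                    "content": dispatch_tool(call.function.name, args),
                }
            )
```

A harness along these lines runs once per question, which is one reason per-question cost varies with how many tool-calling rounds a model takes.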

View Benchmark

Model

04/18/2025

o3 and o4 Mini evaluated on all benchmarks!

We just evaluated o3 and o4 Mini on all benchmarks!

  • o3 achieved the #1 overall accuracy ranking on our benchmarks, with exceptional performance on complex reasoning tests like MMMU (#1/22), MMLU Pro (#1/35), GPQA (#1/35) and proprietary benchmarks like TaxEval (#1/42) and CorpFin (#2/35).

  • o4 Mini achieved the second-highest accuracy across our benchmarks (82.8%), driven by strong performance on public math tests like MGSM (#1/36), MMMU (#2/22), and Math500 (#4/38).

  • Legal benchmark weaknesses: Both models demonstrated significant weaknesses on our proprietary legal benchmarks, with lower ranks on ContractLaw (o3: #34/62, o4 Mini: #14/62) and CaseLaw (o3: #15/55, o4 Mini: #18/55).

  • Cost-effectiveness comparison: With similar performance levels, cost becomes a key differentiator. o4 Mini costs $4.40 per million output tokens, compared to $40.00 for o3, a roughly ninefold price difference that makes o4 Mini the more economical choice for many use cases.
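To make the price gap concrete, here is a back-of-the-envelope comparison; the token volume is an arbitrary placeholder, and only the two per-million-token output prices come from the bullet above.

```python
# Back-of-the-envelope output-cost comparison (token volume is a placeholder).
O3_OUTPUT_PRICE = 40.00       # USD per 1M output tokens
O4_MINI_OUTPUT_PRICE = 4.40   # USD per 1M output tokens

output_tokens = 2_000_000     # hypothetical tokens generated in an eval run

o3_cost = output_tokens / 1e6 * O3_OUTPUT_PRICE
o4_mini_cost = output_tokens / 1e6 * O4_MINI_OUTPUT_PRICE

print(f"o3: ${o3_cost:.2f}, o4 Mini: ${o4_mini_cost:.2f}, "
      f"ratio: {o3_cost / o4_mini_cost:.1f}x")
# o3: $80.00, o4 Mini: $8.80, ratio: 9.1x
```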

View Models Page

Model

04/15/2025

GPT 4.1, 4.1 Mini, and 4.1 Nano evaluated on all benchmarks!

We just evaluated GPT 4.1, GPT 4.1 Mini, and GPT 4.1 Nano on all benchmarks!

  • GPT 4.1 delivers impressive results with a 75.5% average accuracy across benchmarks.

  • Impressive performance on proprietary benchmarks: GPT 4.1 is now the leader on CorpFin (71.2%) and shows strong performance on CaseLaw (85.8%, #4/53), as well as MMLU Pro (80.5%, #6/33).

  • GPT 4.1 Nano and GPT 4.1 Mini bring AI to time-sensitive applications, with outstanding latencies of only 3.62s and 6.60s respectively, while still achieving 59.1% and 75.1% average accuracy.

  • Compact but capable! Despite its size, GPT 4.1 Mini performs admirably on Math500 (88.8%, #10/36) and MGSM (87.9%, #20/34).

  • Size versus performance tradeoff: The smaller models do show lower performance on some complex tasks, with GPT 4.1 Nano ranking near the bottom on MMLU Pro (62.3%, #30/33) and MGSM (69.8%, #32/34).

View Models Page

Latest Benchmarks

View All Benchmarks

Latest Model Releases

View All Models

Anthropic Claude 3.7 Sonnet (Thinking)

Release date: 2/19/2025

View Model

Anthropic Claude 3.7 Sonnet

Release date: 2/19/2025

View Model

OpenAI o3 Mini

Release date: 1/31/2025

View Model

DeepSeek R1

Release date: 1/20/2025

View Model
Join our mailing list to receive benchmark updates.

Stay up to date as new benchmarks and models are released.