Updates
Benchmark
04/22/2025
Our new Finance Agent Benchmark is live!
- Our new Finance Agent Benchmark evaluates AI agents’ ability to perform tasks expected of entry-level financial analysts.
- Developed in collaboration with industry experts, it includes 537 questions covering skills like simple retrieval, market research, and projections.
- Models must use a set of four tools to search the web or the EDGAR database and parse the results to answer each question (a rough illustration of such a toolset follows this list).
- Current AI models do not exceed 50% accuracy, highlighting the need for further development before reliable deployment in the finance industry.
- At the time of this benchmark’s release, o3 is the best-performing model, reaching 48.3% accuracy, but at an average cost of $3.69 per question.
- It is followed closely by Claude 3.7 Sonnet (Thinking), which reaches 44.1% accuracy at a much lower cost of $1.05 per question.
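As a rough illustration of the fixed-toolset setup described above, the sketch below shows how a small web/EDGAR toolset might be declared and dispatched to an agent. The tool names, schemas, and stub implementations are hypothetical placeholders, not the benchmark’s actual harness.

```python
# Minimal sketch (hypothetical, not the benchmark's actual harness):
# an agent is handed a small, fixed toolset for web and EDGAR search.

import json

# Hypothetical tool schemas in the common JSON "function calling" style.
TOOLS = [
    {
        "name": "web_search",
        "description": "Search the web and return result snippets.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "edgar_search",
        "description": "Search SEC EDGAR filings by company and form type.",
        "parameters": {
            "type": "object",
            "properties": {
                "company": {"type": "string"},
                "form_type": {"type": "string"},  # e.g. "10-K"
            },
            "required": ["company"],
        },
    },
]


def dispatch(tool_name: str, arguments: dict) -> str:
    """Route a model-issued tool call to a stub implementation."""
    if tool_name == "web_search":
        return json.dumps({"results": [f"stub result for {arguments['query']}"]})
    if tool_name == "edgar_search":
        form = arguments.get("form_type", "10-K")
        return json.dumps({"filings": [f"stub {form} for {arguments['company']}"]})
    raise ValueError(f"unknown tool: {tool_name}")


if __name__ == "__main__":
    # In an evaluation loop, the model proposes a tool call, the harness
    # executes it, and the parsed result is fed back until the model
    # produces a final answer.
    print(dispatch("edgar_search", {"company": "ACME Corp", "form_type": "10-K"}))
```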
Model
04/18/2025
o3 and o4 Mini evaluated on all benchmarks!
We just evaluated o3 and o4 Mini on all benchmarks!
- o3 achieved the #1 overall accuracy ranking on our benchmarks, with exceptional performance on complex reasoning tests like MMMU (#1/22), MMLU Pro (#1/35), and GPQA (#1/35), and on proprietary benchmarks like TaxEval (#1/42) and CorpFin (#2/35).
- o4 Mini achieved the second-highest accuracy across our benchmarks (82.8%), driven by strong performance on public math tests like MGSM (#1/36), MMMU (#2/22), and Math500 (#4/38).
- Legal benchmark weaknesses: Both models demonstrated significant weaknesses on our proprietary legal benchmarks, with lower ranks on ContractLaw (o3: #34/62, o4 Mini: #14/62) and CaseLaw (o3: #15/55, o4 Mini: #18/55).
- Cost-effectiveness comparison: With similar performance levels, cost becomes a key differentiator. o4 Mini costs $4.40 per million output tokens, compared to $40.00 for o3, a roughly tenfold price difference that makes o4 Mini the more economical choice for many use cases.
Model
04/15/2025
GPT 4.1, 4.1 Mini, and 4.1 Nano evaluated on all benchmarks!
We just evaluated GPT 4.1, GPT 4.1 Mini, and GPT 4.1 Nano on all benchmarks!
- GPT 4.1 delivers impressive results, with a 75.5% average accuracy across benchmarks.
- Impressive performance on proprietary benchmarks! GPT 4.1 is now the leader on CorpFin (71.2%) and shows strong performance on CaseLaw (85.8%, #4/53) and MMLU Pro (80.5%, #6/33).
- GPT 4.1 Nano and GPT 4.1 Mini bring AI to time-sensitive applications with outstanding latencies of only 3.62s and 6.60s, respectively, while still achieving 59.1% and 75.1% average accuracy.
- Compact but capable! Despite its size, GPT 4.1 Mini performs admirably on Math500 (88.8%, #10/36) and MGSM (87.9%, #20/34).
- Size versus performance tradeoff: The smaller models do show lower performance on some complex tasks, with GPT 4.1 Nano ranking near the bottom on MMLU Pro (62.3%, #30/33) and MGSM (69.8%, #32/34).
Latest Model Releases
Anthropic Claude 3.7 Sonnet (Thinking)
Release date: 2/19/2025
Anthropic Claude 3.7 Sonnet
Release date: 2/19/2025
OpenAI o3 Mini
Release date: 1/31/2025
DeepSeek R1
Release date: 1/20/2025