New Finance Agent Benchmark Released

Public Enterprise LLM Benchmarks

Model benchmarks are seriously lacking. With Vals AI, we report how language models perform on the industry-specific tasks where they will be used.

Model

09/08/2025

Qwen3 Max Preview Benchmarked

We evaluated Qwen 3 Max Preview on our benchmarks. Despite the model’s large size, we found the performance did not live up to the hype.

  • On our benchmarks, it generally landed in the middle of the pack: it did not reach the top 5 on any benchmark, and on most it was outside the top 20. On Finance Agent, it reached only 17% accuracy.

  • Qwen 3 Max Preview did have comparatively strong performance on MGSM and GPQA, but these benchmarks are saturated, and incremental gains here do not signify meaningful differences in model intelligence.

  • Unlike the rest of the Qwen series, this model is not open source, which gives up one of the series’ main benefits. It is also more expensive than its open-source counterpart, Qwen 3 (235B), but often performs worse.

Alibaba has so far released only the non-reasoning version of Max Preview. We’re excited to benchmark the reasoning version when it’s available, since reasoning may improve results on benchmarks like AIME.

View Qwen3 Max Preview Results

Model

08/29/2025

GLM 4.5 Evaluated!

We evaluated Z.ai’s GLM 4.5 model and found the following:

GLM 4.5 definitely still has room for improvement. We’re looking forward to seeing how open-source models continue to progress, but for now there is still a long way to go.

View GLM 4.5 Results

Benchmark

08/27/2025

Grok Code Evaluated on Coding Benchmarks!

We evaluated xAI’s Grok Code Fast on three of our coding benchmarks and found it to be much faster (and cheaper) for practical coding tasks, but significantly worse than xAI’s flagship model Grok 4 in general. Our findings are below:

Grok Code Fast is a snappier (and cheaper) model optimized for coding. Our results show significant room for improvement relative to other frontier models, including xAI’s Grok 4, but it performs competitively on practical coding tasks while offering lower latency and cost.

View Grok Code Results


Latest Model Releases

View All Models

Qwen 3 Max Preview

Release date: 9/5/2025

Grok Code Fast

Release date: 8/25/2025

GPT 5 Nano

Release date: 8/7/2025

GPT 5 Mini

Release date: 8/7/2025