Updates
Model
09/08/2025
Qwen3 Max Preview Benchmarked
We evaluated Qwen 3 Max Preview on our benchmarks. Despite the model's large size, we found that its performance did not live up to the hype.
- On our benchmarks it was generally in the middle of the pack: it did not crack the top 5 on any benchmark, and on most it fell outside the top 20. On Finance Agent, it managed only 17% accuracy.
- Qwen 3 Max Preview did perform comparatively well on MGSM and GPQA, but those benchmarks are saturated, and incremental gains there do not signify meaningful differences in model intelligence.
- This model is not open source, which has been one of the main benefits of the Qwen series. It is also more expensive than its open-source counterpart, Qwen 3 (235B), yet often performs worse.
Alibaba has so far released only the non-reasoning version of Max Preview. We're excited to benchmark the reasoning version when it's available, which may improve results on benchmarks like AIME.
View Qwen3 Max Preview Results
Model
08/29/2025
GLM 4.5 Evaluated!
We evaluated Z.ai’s GLM 4.5 model and found the following:
- GLM 4.5 delivers solid top-twenty results on AIME (#5/51), GPQA (#16/53), MMLU Pro (#15/51), LiveCodeBench (#15/53), and our own CaseLaw benchmark (#20/27).
- When compared directly to U.S. open-source peers, GLM 4.5 performs better than models such as Llama 4 Maverick, but is still outperformed by GPT OSS 120B across nearly every benchmark.
GLM 4.5 clearly still has room for improvement. We look forward to seeing how open-source models continue to progress, but for now there is a long way to go.
View GLM 4.5 Results
Benchmark
08/27/2025
Grok Code Evaluated on Coding Benchmarks!
We evaluated xAI’s Grok Code Fast on three of our coding benchmarks and found it to be much faster (and cheaper) for practical coding tasks, but significantly worse than xAI’s flagship model Grok 4 in general. Our findings are below:
- Grok Code Fast scores 62% on LiveCodeBench, placing the model in the middle of the pack, comparable to other reasoning models like Claude Sonnet 4 (Nonthinking), but at a tenth of the price.
- On IOI, Grok Code Fast scores 4.3%, placing 8th of 12. By contrast, Grok 4 scores 26.2% and places first overall!
- On SWE-bench, Grok Code Fast gets an impressive 57.6%, placing 4th right behind Grok 4's 58.6%, but with a latency of 264.68s compared to Grok 4's 704.8s.
Grok Code Fast is a snappier (and cheaper) model optimized for coding. While our results show significant room for improvement relative to frontier models such as xAI's Grok 4, it performs competitively on practical coding tasks while offering real benefits in latency and cost.
View Grok Code Results
Latest Model Releases
Qwen 3 Max Preview
Release date: 9/5/2025
Grok Code Fast
Release date: 8/25/2025
GPT 5 Nano
Release date: 8/7/2025
GPT 5 Mini
Release date: 8/7/2025