We found that Qwen 3 Max Thinking shines on financial tasks, but struggles on coding evaluations when compared to its predecessor, Qwen 3 Max.
We attribute the difference to the use of a new “reasoning” mode, like those offered by closed-source providers.
Its most impressive finish is second on CorpFin, behind Kimi K2.5.
It also improves by 10% over its predecessor on our Finance Agent Benchmark.
By contrast, it performs worse than Qwen 3 Max on both SWE-bench and Terminal-Bench 2, placing 20th and 25th, respectively.
We also found that the model can get expensive on long-context agentic tasks, with the highest pricing tier equivalent to that of the Claude Sonnet models.
Additionally, the model is worse than its predecessor at context caching, which further increases the effective price per token.
Congrats to the team at Alibaba on the release!