Public Enterprise LLM Benchmarks

Model benchmarks are seriously lacking. With Vals AI, we report how language models perform on the industry-specific tasks where they will be used.

Updates

Model

03/28/2025

Gemini 2.5 Pro Exp evaluated on all benchmarks!

We just evaluated Gemini 2.5 Pro Exp on all benchmarks!

  • Gemini 2.5 Pro Exp is Google’s latest experimental model and the new state of the art, achieving an impressive average accuracy of 82.3% across all benchmarks with a latency of 24.68s (a quick sketch of how such a cross-benchmark average is computed follows this list).
  • The model ranks #1 on many of our benchmarks including CorpFin, Math500, LegalBench, GPQA, MMLU Pro, and MMMU.
  • It excels in academic benchmarks, with standout performances on Math500 (95.2%), MedQA (93.0%), and MGSM (92.2%).
  • Gemini 2.5 Pro Exp demonstrates strong legal reasoning capabilities with 86.1% accuracy on CaseLaw and 83.6% on LegalBench, though it scores lower on ContractLaw (64.7%).
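
As a rough illustration of how a cross-benchmark average like the 82.3% figure is formed, the sketch below takes only the per-benchmark scores quoted in this update and computes their unweighted mean in Python. The equal weighting is our assumption, and because just six scores are included here, the result (85.8%) does not match the reported 82.3% average over all benchmarks.

```python
# Illustrative sketch only: unweighted mean of the six per-benchmark
# accuracies quoted in this update. Equal weighting is an assumption,
# and this subset does not reproduce the reported 82.3% average,
# which covers all benchmarks in the suite.
scores = {
    "Math500": 95.2,
    "MedQA": 93.0,
    "MGSM": 92.2,
    "CaseLaw": 86.1,
    "LegalBench": 83.6,
    "ContractLaw": 64.7,
}
average = sum(scores.values()) / len(scores)
print(f"Average over these benchmarks: {average:.1f}%")  # -> 85.8%
```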

View Models Page

Model

03/26/2025

DeepSeek V3 evaluated on all benchmarks!

We just evaluated DeepSeek V3 on all benchmarks!

  • DeepSeek V3 is DeepSeek’s latest model, boasting speeds of 60 tokens/second (a claimed 3x speedup over V2) and reaching an average accuracy of 73.9%, 4.2% better than previous versions.
  • DeepSeek V3 (73.9%) performs comparably to, and slightly better than, Claude 3.7 Sonnet (71.7%).
  • The model demonstrates strong legal capabilities, scoring particularly well on CaseLaw and LegalBench, though it scores lower on ContractLaw.
  • It shows impressive academic versatility with top-tier performance on MGSM, Math500, and MedQA.

View Models Page

Benchmark

03/26/2025

New Multimodal Benchmark: MMMU Evaluates Visual Reasoning Across 30 Subjects

Today, we’re releasing results from MMMU (the Massive Multi-discipline Multimodal Understanding benchmark), a comprehensive evaluation of AI models’ ability to reason across multiple modalities, spanning 30 subjects in 6 major disciplines.

  • o1 achieved the highest overall accuracy at 77.7%, surpassing the lowest-performing human experts (76.2%).
  • Claude 3.7 Sonnet (Thinking) delivers performance nearly identical to o1 at a more favorable price point.
  • Even the best models remain well below the performance of the best human experts (88.6%), highlighting opportunities for further advancement.

View Benchmark

Latest Benchmarks

View All Benchmarks

Latest Model Releases

Anthropic Claude 3.7 Sonnet (Thinking)

View Model

Anthropic Claude 3.7 Sonnet

View Model

OpenAI O3 Mini

View Model

DeepSeek R1

View Model
Join our mailing list to receive benchmark updates.

Stay up to date as new benchmarks and models are released.