Benchmarks
Private question-answer benchmark over Canadian court cases.
Updated 03/28/2025
Benchmarking model performance on contract law tasks
Evaluating language models on a wide range of open-source legal reasoning tasks.
Our completely new version of the CorpFin benchmark
Evaluating language models on mortgage tax certificates
Our completely new version of the TaxEval benchmark
Evaluating language model bias in medical questions.
Extremely challenging math exam given to students
A multilingual benchmark for mathematical questions.
Academic math benchmark on probability, algebra, and trigonometry
Graduate-level Google-Proof Q&A benchmark evaluating models on questions that require deep reasoning.
Academic multiple-choice benchmark covering 14 subjects including STEM, humanities, and social sciences.
Multimodal Multi-task Benchmark