Overview
Our benchmarks measure the capability and reliability of AI models and agents on realistic tasks. In contrast to contrived, exam-style benchmarks, we focus on economically valuable and scientifically important domains—finance, healthcare, math, coding, and more. Developed in collaboration with domain experts, our datasets are carefully curated to be of extremely high quality and to push models to their limits.
Task Design
Our benchmarks reflect the complexity of real-world tasks, which necessitates evaluating multiple types of capabilities:
- Tool-Use: How well can models call the right tools to solve problems?
- Multiple Modalities: How well can models handle images, tabular data, files, and other modalities beyond text?
- Reasoning: Models are increasingly trained to output reasoning before answering; do these capabilities actually improve real-world utility?
- Long-Context Capabilities: Can models reason over long contexts, such as extensive legal documents or large codebases?
- Long-Horizon Tasks: Can models autonomously work on tasks that take minutes, hours, or longer?
Public and Private Sets
A major problem with evaluations of AI models is test-set leakage [1]. Benchmark data can contaminate training sets either directly or through synthetic data [2], undermining the validity of reported results. We therefore offer private benchmarking; for transparency and fairness, most of our benchmarks come with three datasets:
- Public Validation Set: A completely open dataset, to provide transparency in the types of samples we use for evaluation.
- Private Validation Set: A larger, privately held dataset, which we license to companies for their own internal validation. We provide statistical evidence that performance on this set is correlated with performance on our test set.
- Test Set: This dataset remains private at all times and is the only dataset used for the benchmark results we publish; keeping it private prevents leakage into the training data of foundation models.
Metrics and Evaluation
Benchmarks often report only accuracy numbers; however, it is important to consider factors such as efficiency, cost, time taken per test, failure modes, and more. Our evaluation framework provides detailed insights into model performance through multiple metrics:
- Accuracy: Evaluates the correctness of model outputs for each task and benchmark. This includes strict accuracy checks, as well as rubric-based LLM-as-a-judge accuracy metrics.
- Latency: Measures the time a model takes to return a complete response.
- Cost: Analyzes the operational cost of running each model through its API provider.
- Additional quantitative and qualitative insights: For each benchmark, we also provide further information, including but not limited to statistics regarding tool-use, qualitative insights about the nature of the errors, and comparisons between different models. This provides information beyond the raw benchmark numbers, and also helps contextualize the performance of models.
This information enables us to offer a more holistic view of model performance, spanning accuracy, reliability, efficiency, and qualitative insights.
Error Bars
We report standard errors alongside benchmark scores to reflect statistical uncertainty.
Our methodology depends on how the benchmark is structured:
Single-run benchmarks
For benchmarks evaluated once, we follow standard uncertainty reporting practice, as suggested by “Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations” by Evan Miller [3]. Error bars are computed as the standard error of the mean (SEM) over instance-level scores.
In particular, let $x_1, \dots, x_n$ be the instance-level scores. The standard error of the mean (SEM), using the sample standard deviation $s$, is given by:

$$\mathrm{SEM} = \frac{s}{\sqrt{n}}, \qquad s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2},$$

where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ is the mean.
These error bars capture measurement uncertainty in the benchmark itself. They do not reflect variability across prompts, seeds, deployment settings, or the stochastic nature of LLM generation.
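As a concrete illustration, here is a minimal Python sketch of this computation; the scores below are hypothetical placeholders, not taken from any benchmark.

```python
import math

def sem(scores: list[float]) -> float:
    """Standard error of the mean over instance-level scores,
    using the sample (n - 1) standard deviation."""
    n = len(scores)
    mean = sum(scores) / n
    sample_var = sum((x - mean) ** 2 for x in scores) / (n - 1)
    return math.sqrt(sample_var) / math.sqrt(n)

# Hypothetical instance-level scores (1 = correct, 0 = incorrect)
scores = [1, 0, 1, 1, 0, 1, 1, 1]
print(f"accuracy = {sum(scores) / len(scores):.3f} ± {sem(scores):.3f} (SEM)")
```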
Multiple-run benchmarks (AIME, Caselaw)
When a benchmark includes multiple independent runs, we compute the SEM over the per-run scores, estimating uncertainty across runs rather than across individual instances.
Let $\bar{x}_1, \dots, \bar{x}_k$ be the average scores from each of the $k$ independent runs, and let the overall average score be $\bar{x} = \frac{1}{k} \sum_{j=1}^{k} \bar{x}_j$.

The standard error over runs is then given by:

$$\mathrm{SE}_{\text{runs}} = \sqrt{\frac{1}{k(k-1)} \sum_{j=1}^{k} (\bar{x}_j - \bar{x})^2}.$$
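For illustration, a minimal Python sketch of the same computation, using hypothetical per-run averages:

```python
import math

def se_over_runs(run_means: list[float]) -> float:
    """Standard error computed over per-run average scores."""
    k = len(run_means)
    overall = sum(run_means) / k
    sample_var = sum((m - overall) ** 2 for m in run_means) / (k - 1)
    return math.sqrt(sample_var / k)

# Hypothetical average scores from k = 4 independent runs
run_means = [0.62, 0.58, 0.65, 0.60]
overall = sum(run_means) / len(run_means)
print(f"score = {overall:.3f} ± {se_over_runs(run_means):.3f} (SE over runs)")
```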
Composite benchmarks
For benchmarks that combine multiple tasks, we propagate uncertainty from each component using weighted variance pooling.
Let the component standard errors be $\mathrm{SE}_1, \dots, \mathrm{SE}_m$ and the corresponding weights be $w_1, \dots, w_m$, normalized so that $\sum_{i=1}^{m} w_i = 1$.

The propagated standard error is then:

$$\mathrm{SE}_{\text{composite}} = \sqrt{\sum_{i=1}^{m} w_i^2 \, \mathrm{SE}_i^2}.$$
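The sketch below illustrates this propagation, assuming the composite score is a weighted average of component scores with weights normalized to sum to 1; the numbers are hypothetical.

```python
import math

def composite_se(component_ses: list[float], weights: list[float]) -> float:
    """Propagate component standard errors to a weighted-average
    composite score (weights assumed to sum to 1)."""
    return math.sqrt(sum((w * se) ** 2 for se, w in zip(component_ses, weights)))

# Hypothetical component standard errors and weights
component_ses = [0.012, 0.020, 0.015]
weights = [0.5, 0.3, 0.2]
print(f"composite SE = {composite_se(component_ses, weights):.4f}")
```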
In all cases, we use standard statistical definitions of the standard error of the mean, with sample standard deviation where applicable.
Evaluating not just models, but agentic systems and scaffolds
Since LLMs are often used as part of agentic systems with general scaffolds, and as components of larger workflows or products, it is important to design evaluations that measure the capabilities of these systems as a whole. Our benchmarks test crucial aspects of this, such as tool calling, multi-turn flows, coding skills, and computer use.
In the future, our benchmarks will evolve to test not only the agentic systems we design, but also custom user-provided scaffolds and products.
These benchmarks ensure comprehensive evaluation of AI systems, addressing their growing utility in real-world applications and their ability to function autonomously in larger systems.
References
[1] https://arxiv.org/abs/2410.08385
[3] Evan Miller, "Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations", https://arxiv.org/abs/2411.00640