Overview
Our benchmarks measure the capability and reliability of AI models and agents on realistic tasks. In contrast to contrived, exam-style benchmarks, we focus on economically valuable and scientifically important domains—finance, healthcare, math, coding, and more. Developed in collaboration with domain experts, our datasets are carefully curated to be of extremely high quality and to push models to their limits.
Task Design
Our benchmarks reflect the complexity of real-world tasks, which necessitates evaluating multiple types of capabilities:
- Tool-Use: How well can models call the right tools to solve problems?
- Multiple Modalities: How well can models handle images, tabular data, files, and other modalities beyond text?
- Reasoning: Models are increasingly trained to output reasoning before answering; do these capabilities actually improve real-world utility?
- Long-Context Capabilities: Can models reason over long contexts, such as extensive legal documents or large codebases?
- Long-Horizon Tasks: Can models autonomously work on tasks that take minutes, hours, or longer?
Public and Private Sets
A major problem with evaluations of AI models is test-set leakage [1]. Benchmark data can contaminate training sets either directly or through synthetic data [2], undermining the validity of reported results. We therefore offer private benchmarking; for transparency and fairness, most of our benchmarks come with three datasets:
- Public Validation Set: A completely open dataset, to provide transparency in the types of samples we use for evaluation.
- Private Validation Set: A larger, privately held dataset, which we license to companies for their own internal validation. We provide statistical evidence that performance on this set is correlated with performance on our test set.
- Test Set: This dataset remains private at all times and is the only dataset used for the benchmark results we publish; keeping it private prevents leakage into the training data of foundation models.
Metrics and Evaluation
Benchmarks often report only accuracy numbers; however, it is important to consider factors such as efficiency, cost, time taken per test, failure modes, and more. Our evaluation framework provides detailed insights into model performance through multiple metrics:
- Accuracy: Evaluates the correctness of model outputs for each task and benchmark. This includes strict accuracy checks, as well as rubric-based LLM-as-a-judge accuracy metrics.
- Latency: Measures the time a model takes to return a complete response.
- Cost: Analyzes the operational cost of running each model through its API provider.
- Additional quantitative and qualitative insights: For each benchmark, we also provide further information, including but not limited to statistics regarding tool-use, qualitative insights about the nature of the errors, and comparisons between different models. This provides information beyond the raw benchmark numbers, and also helps contextualize the performance of models.
This information enables us to offer a more holistic view of model performance, spanning accuracy, reliability, efficiency, and qualitative insights.
Error Bars
We report standard errors alongside benchmark scores to reflect statistical uncertainty.
Our methodology depends on how the benchmark is structured:
Single-run benchmarks
For benchmarks evaluated once, we follow standard uncertainty reporting practice, as suggested by “Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations” by Evan Miller [3]. Error bars are computed as the standard error of the mean (SEM) over instance-level scores.
In particular, let $x_1, \dots, x_n$ be the instance-level scores. The standard error of the mean (SEM), using the sample standard deviation $s$, is given by:

$$\mathrm{SEM} = \frac{s}{\sqrt{n}}, \qquad s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2},$$

where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ is the mean.
These error bars capture measurement uncertainty in the benchmark itself. They do not reflect variability across prompts, seeds, deployment settings, or the stochastic nature of LLM generation.
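As a concrete illustration, here is a minimal Python sketch of this computation; the scores below are hypothetical placeholders, not taken from any benchmark.

```python
import math

def sem(scores: list[float]) -> float:
    """Standard error of the mean over instance-level scores,
    using the sample (n - 1) standard deviation."""
    n = len(scores)
    mean = sum(scores) / n
    sample_var = sum((x - mean) ** 2 for x in scores) / (n - 1)
    return math.sqrt(sample_var) / math.sqrt(n)

# Hypothetical instance-level scores (1 = correct, 0 = incorrect)
scores = [1, 0, 1, 1, 0, 1, 1, 1]
print(f"accuracy = {sum(scores) / len(scores):.3f} ± {sem(scores):.3f} (SEM)")
```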
Multiple-run benchmarks (AIME, Caselaw)
When a benchmark includes multiple independent runs, we compute the SEM over the per-run scores, estimating uncertainty across runs rather than across individual instances.
Let $\bar{x}_1, \dots, \bar{x}_k$ be the average scores from each of the $k$ independent runs, and let the overall average score be $\bar{x} = \frac{1}{k} \sum_{j=1}^{k} \bar{x}_j$.

The standard error over runs is then given by:

$$\mathrm{SE}_{\text{runs}} = \sqrt{\frac{1}{k(k-1)} \sum_{j=1}^{k} (\bar{x}_j - \bar{x})^2}.$$
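For illustration, a minimal Python sketch of the same computation, using hypothetical per-run averages:

```python
import math

def se_over_runs(run_means: list[float]) -> float:
    """Standard error computed over per-run average scores."""
    k = len(run_means)
    overall = sum(run_means) / k
    sample_var = sum((m - overall) ** 2 for m in run_means) / (k - 1)
    return math.sqrt(sample_var / k)

# Hypothetical average scores from k = 4 independent runs
run_means = [0.62, 0.58, 0.65, 0.60]
overall = sum(run_means) / len(run_means)
print(f"score = {overall:.3f} ± {se_over_runs(run_means):.3f} (SE over runs)")
```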
Composite benchmarks
For benchmarks that combine multiple tasks, we propagate uncertainty from each component using weighted variance pooling.
Let the component standard errors be $\mathrm{SE}_1, \dots, \mathrm{SE}_m$ and the corresponding weights be $w_1, \dots, w_m$, normalized so that $\sum_{i=1}^{m} w_i = 1$.

The propagated standard error is then:

$$\mathrm{SE}_{\text{composite}} = \sqrt{\sum_{i=1}^{m} w_i^2 \, \mathrm{SE}_i^2}.$$
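The sketch below illustrates this propagation, assuming the composite score is a weighted average of component scores with weights normalized to sum to 1; the numbers are hypothetical.

```python
import math

def composite_se(component_ses: list[float], weights: list[float]) -> float:
    """Propagate component standard errors to a weighted-average
    composite score (weights assumed to sum to 1)."""
    return math.sqrt(sum((w * se) ** 2 for se, w in zip(component_ses, weights)))

# Hypothetical component standard errors and weights
component_ses = [0.012, 0.020, 0.015]
weights = [0.5, 0.3, 0.2]
print(f"composite SE = {composite_se(component_ses, weights):.4f}")
```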
In all cases, we use standard statistical definitions of the standard error of the mean, with sample standard deviation where applicable.
Evaluating not just models, but agentic systems and scaffolds
Since LLMs are often used as part of agentic systems with general scaffolds, and as components of larger workflows or products, it is important to design evaluations that measure the capabilities of these systems as a whole. Our benchmarks test crucial aspects of this, such as tool calling, multi-turn flows, coding skills, and computer use.
In the future, our benchmarks will evolve to test not only the agentic systems we design, but also custom user-provided scaffolds and products.
These benchmarks ensure comprehensive evaluation of AI systems, addressing their growing utility in real-world applications and their ability to function autonomously in larger systems.
References
[1] https://arxiv.org/abs/2410.08385
[3] Evan Miller, "Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations", https://arxiv.org/abs/2411.00640