Vals Index

Motivation

As AI capabilities rapidly advance, understanding their potential to transform economic sectors has become increasingly critical for organizations making deployment decisions. Unlike existing aggregated metrics that treat all capabilities equally, the Vals Index is designed to reflect the potential economic impact of AI models on the U.S. economy. We accomplish this by computing a weighted average of model performance across key sectors, where the weights correspond to each sector’s contribution to the U.S. economy in trillions of dollars.

Vals AI has developed a comprehensive suite of benchmarks measuring AI models’ ability to perform real-world tasks across finance and software engineering. These benchmarks were designed to evaluate practical performance on actual professional workflows, making them well-suited for assessing economic impact. The Vals Index leverages this existing work to provide a high-signal measure that accounts for the real-world tradeoffs between capability, latency, and cost that practitioners face when deploying AI systems.

Results

Industry Average Accuracy Comparison

Key Takeaways

AI models are advancing rapidly in their ability to handle complex, real-world tasks across critical economic sectors. The results demonstrate that frontier models are becoming increasingly capable at automating work in finance and software engineering—domains that collectively represent a substantial portion of economic activity. Claude Fable 5 leads the Vals Index at 75.15%, ahead of Claude Opus 4.8, GPT 5.5, and Claude Opus 4.7.

Methodology

Benchmark Selection, Economic Weighting, and Formula

The Vals Index aggregates performance across two major economic sectors, weighted by their approximate contribution to U.S. GDP. Market size estimates are based on data from Federal Reserve Economic Data (FRED) and the Bureau of Labor Statistics. While this is a considerable oversimplification of how AI might impact the economy, it provides a useful proxy for the potential economic significance of model capabilities:

Finance (weight: 2.0): ~$2T contribution to U.S. GDP

CorpFin: Corporate finance document analysis
Finance Agent v2: Multi-step financial reasoning tasks

Coding (weight: 1.4): ~$1.4T contribution to U.S. GDP

SWE-bench Verified: Real-world software engineering tasks
Terminal-Bench 2.1: Command-line interface problem solving
Vibe Code Bench: End-to-end app-building tasks

These weights combine in the following formula:

Coding = 0.25 * SWE_Bench + 0.25 * TBench + 0.5 * VibeCodeBench
Vals_Index = (2.0 * AVG(CorpFin, FinanceAgent) + 1.4 * Coding) / 3.4

The denominator (3.4) is the sum of the sector weights (2.0 + 1.4). Dividing by it keeps the index on the same 0-100 scale as the underlying benchmark scores, so the result is a weighted average of sector performance in proportion to each sector's economic contribution.
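The formula can be sketched in a few lines of Python. This is an illustrative sketch, not Vals AI's implementation: the function and constant names are hypothetical, and the scores in the usage line are made-up placeholders, not real benchmark results.

```python
# Sector weights taken from the text (approximate U.S. GDP contribution, in $T).
FINANCE_WEIGHT = 2.0
CODING_WEIGHT = 1.4


def coding_score(swe_bench: float, tbench: float, vibe_code_bench: float) -> float:
    """Weighted average of the three coding benchmarks (weights sum to 1)."""
    return 0.25 * swe_bench + 0.25 * tbench + 0.5 * vibe_code_bench


def vals_index(corp_fin: float, finance_agent: float,
               swe_bench: float, tbench: float, vibe_code_bench: float) -> float:
    """Combine sector scores (each on a 0-100 scale) into the index."""
    finance = (corp_fin + finance_agent) / 2  # AVG(CorpFin, FinanceAgent)
    coding = coding_score(swe_bench, tbench, vibe_code_bench)
    total_weight = FINANCE_WEIGHT + CODING_WEIGHT  # 3.4
    return (FINANCE_WEIGHT * finance + CODING_WEIGHT * coding) / total_weight


# Placeholder scores for illustration only (not real results).
example = vals_index(corp_fin=60, finance_agent=40,
                     swe_bench=80, tbench=40, vibe_code_bench=20)
print(round(example, 2))
```

With these placeholder inputs the finance average is 50 and the coding score is 40, so the index is (2.0 * 50 + 1.4 * 40) / 3.4, or roughly 45.9. A model scoring 100 on every benchmark would score 100 on the index.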

Subset Selection Process

To enable efficient and cost-effective evaluation while maintaining strong correlation with full benchmark performance, we developed representative subsets for several benchmarks:

Selection Methodology: Each subset was chosen with a sampling process that maximizes correlation with full benchmark scores. We validated this approach on holdout models to confirm that subset performance reliably predicts full benchmark results.
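One way to picture the holdout check is to compare each holdout model's subset score with its full-benchmark score and measure how closely the two track each other. The sketch below uses Pearson correlation as an assumption; the text does not say which correlation measure or acceptance threshold was used, and the function name is hypothetical.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between paired scores (e.g. subset vs. full)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one series is constant")
    return cov / sqrt(var_x * var_y)
```

In use, `xs` would hold the subset scores of the holdout models and `ys` their full-benchmark scores, in the same model order. A value close to 1 suggests the subset ranks and spaces models much as the full benchmark does.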

Benchmark-Specific Subsets:

SWE-bench Verified: 33 randomly sampled instances from each of the three easier difficulty levels (categorized by solution time: <15min, 15min-1hr, 1-4hr), plus all 3 instances from the hardest category (>4hr)
CorpFin: 3 randomly selected questions per unique document from the original test set
Finance Agent v2: 16-model index subset evaluated with three runs per model
Vibe Code Bench: 22 representative app-building tasks selected to cover a range of UI, data, and workflow patterns
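The SWE-bench Verified subset described above is a form of stratified sampling. The sketch below shows one reading of it: draw up to 33 instances per difficulty level, which takes every instance when a level has fewer than 33. The function name, instance IDs, and pool sizes are hypothetical placeholders; only the per-level quota of 33 and the 3 instances in the hardest level come from the text.

```python
import random
from collections import defaultdict


def stratified_subset(instances, per_stratum, seed=0):
    """Sample up to `per_stratum` instance IDs from each difficulty level.

    `instances` is a list of (instance_id, difficulty_level) pairs.
    Levels with fewer than `per_stratum` instances are taken in full.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_level = defaultdict(list)
    for instance_id, level in instances:
        by_level[level].append(instance_id)

    subset = []
    for level in sorted(by_level):
        ids = sorted(by_level[level])
        k = min(per_stratum, len(ids))
        subset.extend(rng.sample(ids, k))
    return subset


# Hypothetical pool: sizes of the three easier levels are invented for the demo.
pool_sizes = {"<15min": 50, "15min-1hr": 50, "1-4hr": 50, ">4hr": 3}
pool = [(f"{level}-{i}", level)
        for level, size in pool_sizes.items()
        for i in range(size)]

subset = stratified_subset(pool, per_stratum=33)
print(len(subset))  # 33 + 33 + 33 + 3 = 102
```

The same pattern would apply to the CorpFin subset if each unique document were treated as a stratum with a quota of 3 questions.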

Full Benchmarks:

Terminal-Bench 2.1: All questions evaluated (no sampling)

This methodology ensures the Vals Index provides a rapid, cost-effective evaluation framework while maintaining the predictive validity needed for reliable model comparison.

Updates

5/27/2026

Updated the coding bucket to Terminal-Bench 2.1. The Vals Index now uses Terminal-Bench 2.1 in place of Terminal-Bench 2.0, with the same 0.25 weight within the coding bucket.

5/13/2026

Swapped Finance Agent to Finance Agent v2. Finance now uses the Finance Agent v2 index subset, averaging three runs per model.

5/4/2026

Added Vibe Code Bench to the coding bucket. Coding is now a weighted average of three benchmarks (SWE-bench Verified 0.25, Terminal-Bench 2.0 0.25, Vibe Code Bench 0.5), giving end-to-end app-building tasks half of the coding signal. VCB is evaluated on a 22-task subset selected for coverage across UI, data, and workflow patterns.
Removed the Law sector (CaseLaw). The CaseLaw benchmark had become saturated and no longer provided useful differentiation between models, so it was removed from the index. The denominator was rebalanced from 3.7 → 3.4 to reflect the dropped 0.3 law weight, and the Industry Average chart no longer displays a Law column.