SWE-bench Benchmark


Takeaways

  • Claude Sonnet 4.5 (Thinking) leads with 69.8%, the highest score across all models, though it incurs significantly higher computational costs than the other top performers.

  • GPT 5 Codex is second at 69.4%, followed closely by GPT 5 at 68.8%. Both models are in turn closely followed by Claude Sonnet 4 (Nonthinking).

  • Grok Code Fast delivers impressive results at 57.6% with much lower latency (264.68s) than the other top-performing models. Optimized for coding tasks, it trades some accuracy for significantly better speed relative to the other top models.


[Chart: Instance Resolution by Model]

Background

SWE-bench, introduced by Jimenez et al. in their seminal paper “Can Language Models Resolve Real-World GitHub Issues?”, has emerged as a prominent benchmark for evaluating Large Language Models (LLMs) in software engineering contexts.

The benchmark comprises 500 tasks, each executed within an isolated Docker container. These tasks represent real-world GitHub issues from various repositories. Models are provided with a suite of agentic tools and must generate a “patch” to resolve each issue. The success of a model’s solution is determined by running unit tests against the generated patch.
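
To make the resolution criterion concrete, the sketch below shows the general shape of this scoring step: apply the model's patch inside the task's container, then run the task's unit tests. It is illustrative only; the container name, patch-application command, and test command are placeholders, not the official harness interface.

```python
import subprocess


def evaluate_patch(container: str, patch: str, test_cmd: str) -> bool:
    """Apply a model-generated patch inside the task's Docker container and
    run the task's unit tests. Illustrative sketch: the container name, patch
    command, and test command are placeholders, not the official harness API."""
    # Apply the generated patch inside the isolated container.
    applied = subprocess.run(
        ["docker", "exec", "-i", container, "git", "apply", "-"],
        input=patch, text=True, capture_output=True,
    )
    if applied.returncode != 0:
        return False  # the patch does not apply cleanly

    # An instance counts as "resolved" only if the relevant unit tests pass.
    tests = subprocess.run(
        ["docker", "exec", container, "bash", "-lc", test_cmd],
        capture_output=True, text=True,
    )
    return tests.returncode == 0
```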

A notable complexity of SWE-bench lies in its dual evaluation of both the agentic harness and the underlying foundation model. As a result, foundation model labs adopt different methodologies when reporting their results. Additionally, the benchmark’s computational requirements make reproducing results resource-intensive.

To enable fair and consistent comparisons across foundation models, we implemented a standardized evaluation framework (based on SWE-agent), which is used for all of our evaluations.
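
As a rough illustration of what "standardized" means in practice, the settings held fixed across models look something like the dictionary below. The field names are our own shorthand for this write-up, not SWE-agent's actual configuration schema.

```python
# Settings held constant across all evaluated models (shorthand field names,
# not SWE-agent's real configuration schema).
EVAL_CONFIG = {
    "harness": "SWE-agent",
    "dataset": "SWE-bench Verified (500 instances)",
    "tool_bundles": ["Default", "Search", "Edit/Replace", "Submit", "Bash"],
    "max_steps_per_task": 150,          # see Methodology
    "max_tokens": "provider maximum",
    "hardware": "8 vCPU, 32 GB RAM EC2 instance",
}
```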

Results


Models that perform well on SWE-bench tend to be proficient with Bash and search tools. Increasing tool usage beyond a certain threshold does not significantly impact performance.

There is also a clear trend that closed-source models perform better on SWE-bench than open-source models, with the exception of Qwen Max Instruct, which performed well above average.

The three hardest tasks are estimated to take more than 4 hours to complete. SOTA models have been unable to solve more than one task in this category. The clearest performance differences appear among tasks that take between 15 minutes and 1 hour to complete. Models that perform well on these tasks score higher overall on the benchmark.
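
One way to reproduce this difficulty breakdown is to group per-instance outcomes by their annotated time-to-fix bucket. The sketch below assumes each result record carries a `difficulty` label and a boolean `resolved` flag; both field names are assumptions for illustration.

```python
from collections import defaultdict


def resolve_rate_by_difficulty(results: list[dict]) -> dict[str, float]:
    """Resolution rate per annotated difficulty bucket (e.g. "15 min - 1 hour",
    "1-4 hours", ">4 hours"). Field names are illustrative assumptions."""
    totals = defaultdict(int)
    solved = defaultdict(int)
    for r in results:
        totals[r["difficulty"]] += 1
        solved[r["difficulty"]] += int(r["resolved"])
    return {bucket: solved[bucket] / totals[bucket] for bucket in totals}
```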


Tool Use Analysis

Within the SWE-agent harness, tools are organized into groups called bundles. The following table details which tools are contained within each bundle.

Bundle          Tools
Default         create, goto, open, scroll_down, scroll_up
Search          find_file, search_file, search_dir
Edit/Replace    edit, insert
Submit          submit
Bash            bash
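
To ground the percentages shown in the chart below, here is a minimal sketch of how per-bundle usage could be tallied from a trajectory's list of tool calls, using the bundle mapping from the table above. The input format (a flat list of tool names per trajectory) is an assumption.

```python
from collections import Counter

# Tool-to-bundle mapping, taken directly from the table above.
BUNDLES = {
    "create": "Default", "goto": "Default", "open": "Default",
    "scroll_down": "Default", "scroll_up": "Default",
    "find_file": "Search", "search_file": "Search", "search_dir": "Search",
    "edit": "Edit/Replace", "insert": "Edit/Replace",
    "submit": "Submit",
    "bash": "Bash",
}


def bundle_usage_pct(tool_calls: list[str]) -> dict[str, float]:
    """Percentage of calls falling in each bundle for one trajectory.
    The input (a flat list of tool names) is an assumed format."""
    counts = Counter(BUNDLES.get(tool, "Other") for tool in tool_calls)
    total = sum(counts.values()) or 1
    return {bundle: 100.0 * n / total for bundle, n in counts.items()}
```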

The following visualization shows the distribution of tool usage across different models for each task, represented as percentages.


[Chart: SWE-bench Tool Usage]

The tool usage patterns reveal distinct strategic approaches across models. o4 Mini leans heavily on the Default and Search bundles. In contrast, Claude Sonnet 4 (Nonthinking) shows a more balanced approach, with moderate usage across all tool categories: approximately 9,000-10,000 Default tool calls and fewer search operations, indicating a more targeted problem-solving methodology.

Most models show relatively low usage of the Edit/Replace, Submit, and Bash tools compared to the Default and Search categories. These usage patterns align with the performance results, where o3’s exhaustive search approach and Claude Sonnet 4’s balanced strategy both contribute to their strong accuracy scores, though at different computational costs.

Methodology

For the standardized harness, we used the one created by the SWE-agent team. The agentic tools provided included the ability to edit and search files, run bash commands, diff files, and more. You can find more information here: SWE-agent.

We used the SWE-bench Verified subset of the dataset. SWE-bench Verified is a human-validated subset of the SWE-bench dataset released by OpenAI in August 2024. Each task in the split has been carefully reviewed and validated by human experts, resulting in a curated set of 500 high-quality test cases from the original benchmark. You can find more information about the Verified split of the dataset here.
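
For reference, a minimal way to pull the Verified split is via the Hugging Face datasets library. The dataset identifier below is the one we believe is current, so confirm it against the official SWE-bench documentation.

```python
from datasets import load_dataset  # pip install datasets

# Dataset identifier is assumed; confirm against the official SWE-bench docs.
verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

print(len(verified))                           # expected: 500 instances
print(verified[0]["instance_id"])              # repository + issue identifier
print(verified[0]["problem_statement"][:200])  # the GitHub issue text
```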

All models had access to the same tools to ensure a fair, apples-to-apples comparison. Models were run with the provider’s default configuration, except for the max token limit, which was always set to the highest value possible. Due to the limited number of cache breakpoints per Anthropic API key, we used the rotating-key option provided by SWE-agent.

All experiments were run on an 8 vCPU, 32 GB RAM EC2 instance. Latency was calculated starting from the first step the model took within each task.
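
The latency figure for each task can be derived from the step timestamps in its trajectory. The sketch below assumes each step record carries a UNIX `timestamp` field, which is an illustrative format rather than SWE-agent's actual trajectory schema.

```python
def task_latency_seconds(trajectory: list[dict]) -> float:
    """Latency measured from the first step the model took in a task.
    Assumes each step dict has a UNIX 'timestamp' field (illustrative)."""
    if not trajectory:
        return 0.0
    start = trajectory[0]["timestamp"]  # first agent step
    end = trajectory[-1]["timestamp"]   # final step (e.g. submit)
    return end - start
```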

Evaluated models were constrained to a maximum of 150 steps per task. This limit was determined by analyzing the highest step count needed to resolve instances in the “1-4 hours” difficulty category, with a 150% buffer added to ensure fair comparison across all models. We selected the “1-4 hours” category (comprising 42 instances) because it provided the best balance of complexity and variance, allowing us to capture a comprehensive range of step counts across resolved instances, where a “step” is a single interaction turn within the SWE-agent framework. Both larger and smaller models were evaluated across this test set.
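
Enforcing the 150-step cap looks roughly like the loop below; `agent.step` and the returned action dictionary are hypothetical stand-ins for the harness's real interfaces.

```python
MAX_STEPS = 150  # per-task cap derived from the "1-4 hours" category plus buffer


def run_with_step_cap(agent, task, max_steps: int = MAX_STEPS):
    """Run an agent on one task, stopping at the step cap.
    `agent.step` and the action dict are hypothetical interfaces."""
    for _ in range(max_steps):
        action = agent.step(task)          # one interaction turn
        if action.get("name") == "submit":
            return action.get("patch")     # agent finished with a patch
    return None                            # hit the cap without submitting
```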

It may be possible to build a better harness than SWE-agent for a given model; for example, Anthropic has claimed their custom harness leads to a ten-percentage-point improvement in accuracy. However, our aim was to adopt a fair framework with which to evaluate all models.
