SWE-bench Benchmark


Takeaways

  • Claude Sonnet 4.5 (Thinking) leads with 69.8%, the highest score across all models, though it incurs significantly higher computational costs than the other top performers.

  • GPT 5 Codex is second at 69.4%, followed closely by GPT 5 at 68.8%. Both models are in turn closely followed by Claude Sonnet 4 (Nonthinking).

  • Grok Code Fast delivers impressive results at 57.6% with much lower latency (264.68s) than the other top-performing models. Optimized for coding tasks, it trades some accuracy for significantly better speed relative to the other top models.


[Chart: Instance Resolution by Model]

Background

SWE-bench, introduced by Jimenez et al. in their seminal paper “Can Language Models Resolve Real-World GitHub Issues?”, has emerged as a prominent benchmark for evaluating Large Language Models (LLMs) in software engineering contexts.

The benchmark comprises 500 tasks, each executed within an isolated Docker container. These tasks represent real-world GitHub issues from various repositories. Models are provided with a suite of agentic tools and must generate a “patch” to resolve each issue. The success of a model’s solution is determined by running unit tests against the generated patch.
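
To make the resolution criterion concrete, the sketch below shows the general shape of this scoring step: apply the model's patch inside the task's container, then run the task's unit tests. It is illustrative only; the container name, patch-application command, and test command are placeholders, not the official harness interface.

```python
import subprocess


def evaluate_patch(container: str, patch: str, test_cmd: str) -> bool:
    """Apply a model-generated patch inside the task's Docker container and
    run the task's unit tests. Illustrative sketch: the container name, patch
    command, and test command are placeholders, not the official harness API."""
    # Apply the generated patch inside the isolated container.
    applied = subprocess.run(
        ["docker", "exec", "-i", container, "git", "apply", "-"],
        input=patch, text=True, capture_output=True,
    )
    if applied.returncode != 0:
        return False  # the patch does not apply cleanly

    # An instance counts as "resolved" only if the relevant unit tests pass.
    tests = subprocess.run(
        ["docker", "exec", container, "bash", "-lc", test_cmd],
        capture_output=True, text=True,
    )
    return tests.returncode == 0
```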

A notable complexity of SWE-bench lies in its dual evaluation of both the agentic harness and the underlying foundation model. As a result, foundation model labs adopt different methodologies when reporting their results. Additionally, the benchmark’s computational requirements make reproducing results resource-intensive.

To enable fair and consistent comparisons across foundation models, we implemented a standardized evaluation framework (based on SWE-agent), which is used for all of our evaluations.
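
As a rough illustration of what "standardized" means in practice, the settings held fixed across models look something like the dictionary below. The field names are our own shorthand for this write-up, not SWE-agent's actual configuration schema.

```python
# Settings held constant across all evaluated models (shorthand field names,
# not SWE-agent's real configuration schema).
EVAL_CONFIG = {
    "harness": "SWE-agent",
    "dataset": "SWE-bench Verified (500 instances)",
    "tool_bundles": ["Default", "Search", "Edit/Replace", "Submit", "Bash"],
    "max_steps_per_task": 150,          # see Methodology
    "max_tokens": "provider maximum",
    "hardware": "8 vCPU, 32 GB RAM EC2 instance",
}
```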

Results


Models that perform well on SWE-bench tend to be proficient with Bash and search tools. Increasing tool usage beyond a certain threshold does not significantly impact performance.

There is also a clear trend that closed-source models perform better on SWE-bench than open-source models, with the exception of Qwen Max Instruct, which performed well above average.

The three hardest tasks are estimated to take more than 4 hours to complete. SOTA models have been unable to solve more than one task in this category. The clearest performance differences appear among tasks that take between 15 minutes and 1 hour to complete. Models that perform well on these tasks score higher overall on the benchmark.
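
One way to reproduce this difficulty breakdown is to group per-instance outcomes by their annotated time-to-fix bucket. The sketch below assumes each result record carries a `difficulty` label and a boolean `resolved` flag; both field names are assumptions for illustration.

```python
from collections import defaultdict


def resolve_rate_by_difficulty(results: list[dict]) -> dict[str, float]:
    """Resolution rate per annotated difficulty bucket (e.g. "15 min - 1 hour",
    "1-4 hours", ">4 hours"). Field names are illustrative assumptions."""
    totals = defaultdict(int)
    solved = defaultdict(int)
    for r in results:
        totals[r["difficulty"]] += 1
        solved[r["difficulty"]] += int(r["resolved"])
    return {bucket: solved[bucket] / totals[bucket] for bucket in totals}
```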


Tool Use Analysis

Within the SWE-agent harness, tools are organized into groups called bundles. The following table details which tools are contained within each bundle.

Bundle          Tools
Default         create, goto, open, scroll_down, scroll_up
Search          find_file, search_file, search_dir
Edit/Replace    edit, insert
Submit          submit
Bash            bash
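
To ground the percentages shown in the chart below, here is a minimal sketch of how per-bundle usage could be tallied from a trajectory's list of tool calls, using the bundle mapping from the table above. The input format (a flat list of tool names per trajectory) is an assumption.

```python
from collections import Counter

# Tool-to-bundle mapping, taken directly from the table above.
BUNDLES = {
    "create": "Default", "goto": "Default", "open": "Default",
    "scroll_down": "Default", "scroll_up": "Default",
    "find_file": "Search", "search_file": "Search", "search_dir": "Search",
    "edit": "Edit/Replace", "insert": "Edit/Replace",
    "submit": "Submit",
    "bash": "Bash",
}


def bundle_usage_pct(tool_calls: list[str]) -> dict[str, float]:
    """Percentage of calls falling in each bundle for one trajectory.
    The input (a flat list of tool names) is an assumed format."""
    counts = Counter(BUNDLES.get(tool, "Other") for tool in tool_calls)
    total = sum(counts.values()) or 1
    return {bundle: 100.0 * n / total for bundle, n in counts.items()}
```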

The following visualization shows the distribution of tool usage across different models for each task, represented as percentages.


[Chart: SWE-bench Tool Usage]

The tool usage patterns reveal distinct strategic approaches across models. o4 Mini leans heavily on the Default and Search bundles. In contrast, Claude Sonnet 4 (Nonthinking) shows a more balanced approach, with moderate usage across all tool categories: approximately 9,000-10,000 Default tool calls and fewer search operations, indicating a more targeted problem-solving methodology.

Most models show relatively low usage of the Edit/Replace, Submit, and Bash tools compared to the Default and Search categories. These usage patterns align with the performance results, where o3’s exhaustive search approach and Claude Sonnet 4’s balanced strategy both contribute to their strong accuracy scores, though at different computational costs.

Methodology

For the standardized harness, we used the one created by the SWE-agent team. The agentic tools provided included the ability to edit and search files, run bash commands, diff files, and more. You can find more information here: SWE-agent.

We used the SWE-bench Verified subset of the dataset. SWE-bench Verified is a human-validated subset of the SWE-bench dataset released by OpenAI in August 2024. Each task in the split has been carefully reviewed and validated by human experts, resulting in a curated set of 500 high-quality test cases from the original benchmark. You can find more information about the Verified split of the dataset here.
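
For reference, a minimal way to pull the Verified split is via the Hugging Face datasets library. The dataset identifier below is the one we believe is current, so confirm it against the official SWE-bench documentation.

```python
from datasets import load_dataset  # pip install datasets

# Dataset identifier is assumed; confirm against the official SWE-bench docs.
verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

print(len(verified))                           # expected: 500 instances
print(verified[0]["instance_id"])              # repository + issue identifier
print(verified[0]["problem_statement"][:200])  # the GitHub issue text
```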

All models had access to the same tools to ensure a fair, apples-to-apples comparison. Models were run with the provider’s default configuration, except for the max token limit, which was always set to the highest value possible. Due to the limited number of cache breakpoints per Anthropic API key, we used the rotating-key option provided by SWE-agent.

All experiments were run on an 8 vCPU, 32 GB RAM EC2 instance. Latency was calculated starting from the first step the model took within each task.
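
The latency figure for each task can be derived from the step timestamps in its trajectory. The sketch below assumes each step record carries a UNIX `timestamp` field, which is an illustrative format rather than SWE-agent's actual trajectory schema.

```python
def task_latency_seconds(trajectory: list[dict]) -> float:
    """Latency measured from the first step the model took in a task.
    Assumes each step dict has a UNIX 'timestamp' field (illustrative)."""
    if not trajectory:
        return 0.0
    start = trajectory[0]["timestamp"]  # first agent step
    end = trajectory[-1]["timestamp"]   # final step (e.g. submit)
    return end - start
```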

Evaluated models were constrained to a maximum of 150 steps per task. This limit was determined by analyzing the highest step count needed to resolve instances in the “1-4 hours” difficulty category, with a 150% buffer added to ensure fair comparison across all models. We selected the “1-4 hours” category (comprising 42 instances) because it provided the best balance of complexity and variance, allowing us to capture a comprehensive range of step counts across resolved instances, where a “step” is a single interaction turn within the SWE-agent framework. Both larger and smaller models were evaluated across this test set.
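
Enforcing the 150-step cap looks roughly like the loop below; `agent.step` and the returned action dictionary are hypothetical stand-ins for the harness's real interfaces.

```python
MAX_STEPS = 150  # per-task cap derived from the "1-4 hours" category plus buffer


def run_with_step_cap(agent, task, max_steps: int = MAX_STEPS):
    """Run an agent on one task, stopping at the step cap.
    `agent.step` and the action dict are hypothetical interfaces."""
    for _ in range(max_steps):
        action = agent.step(task)          # one interaction turn
        if action.get("name") == "submit":
            return action.get("patch")     # agent finished with a patch
    return None                            # hit the cap without submitting
```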

It may be possible to build a better harness than SWE-agent for a given model; for example, Anthropic has claimed their custom harness leads to a ten-percentage-point improvement in accuracy. However, our aim was to adopt a fair framework with which to evaluate all models.
