Published February 17, 2026

SWE-fficiency: Evaluating AI Agent Efficiency in Code Fixing


A new benchmark assesses how quickly and accurately AI agents fix code, not just identify problems – taking into account time, attempts, and real-world working conditions.

Event Source: OpenHands

When we evaluate an AI's ability to write or fix code, we usually look at one thing: did it complete the task? Found a bug – good. Fixed it – excellent. Yet, in real-world work, it's not just about what was accomplished, but how: how much time the agent spent, how many times it queried the model, what tools it used, and how economically it utilized resources.

These very questions – focusing on the process, not just the result – formed the basis for a new benchmark called SWE-fficiency. Its creators believe it's time to evaluate AI agents not just on whether they solve a task, but on how reasonably they act during the process.


Why the Path to a Solution Matters, Not Just Accuracy

The classic approach to evaluating AI programming agents is straightforward: assign a task, then check the result. If the code works, the task is considered solved. However, under real-world conditions, this represents only part of the picture.

Imagine one agent finds and fixes a bug in three minutes, querying the language model twice. Another agent tackles the same task but takes half an hour and makes twenty requests. Formally, both completed the job, but it's clear the first approach is far more efficient – and cheaper, considering the cost of model calls.

SWE-fficiency aims to account for precisely this. The benchmark evaluates not only an agent's ability to find a solution but also how rationally it does so: how many steps it takes, how long the execution lasts, how often it queries the model, and what tools it uses.


How the Benchmark Is Structured

At the core of SWE-fficiency is a set of real-world code-fixing tasks. These aren't synthetic examples but situations developers face in their daily work: they need to find a bug, understand its cause, and make changes to get the code working again.

But unlike traditional benchmarks, this one records not just the final result, but the entire process:

  • How many times the agent queried the language model;
  • How long the entire fixing process took;
  • Which tools were used – editing files, running tests, searching through code;
  • How many attempts were needed to reach a working solution.
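As an illustration, the per-run data such a benchmark records could be modeled as a small structure like the one below. This is a hypothetical sketch of a trajectory record; the source does not describe SWE-fficiency's actual data format, and all field names here are invented.

```python
from dataclasses import dataclass, field


@dataclass
class TaskTrajectory:
    """Hypothetical record of one agent run on one code-fixing task."""
    task_id: str
    solved: bool
    llm_calls: int                      # how many times the agent queried the model
    wall_time_s: float                  # how long the whole fixing process took
    tool_uses: dict[str, int] = field(default_factory=dict)  # e.g. {"edit": 3, "test": 2, "search": 1}
    attempts: int = 1                   # tries before reaching a working solution


# Example run: solved in one attempt with two model calls in three minutes.
run = TaskTrajectory(
    task_id="repo-issue-123",
    solved=True,
    llm_calls=2,
    wall_time_s=180.0,
    tool_uses={"edit": 1, "test": 1},
)
```

A record like this keeps the process visible alongside the outcome, so two agents that both "solved" a task can still be told apart by how they got there.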

This allows for comparing agents not just by the percentage of tasks solved, but by how effective they are under real-world conditions. One agent might solve 70% of tasks, but do so quickly and economically. Another might handle 80% but consume significantly more resources. Which one is better depends on the context of its use.
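To make that tradeoff concrete, here is a minimal sketch, with entirely invented numbers and prices, of comparing two agents by cost per solved task rather than by accuracy alone:

```python
def cost_per_solved(solve_rate: float, tasks: int,
                    avg_llm_calls: float, price_per_call: float) -> float:
    """Total model-call cost divided by the number of tasks actually solved."""
    total_cost = tasks * avg_llm_calls * price_per_call
    return total_cost / (tasks * solve_rate)


# Agent A: solves 70% of tasks, averaging 2 model calls per task.
# Agent B: solves 80% of tasks, averaging 20 model calls per task.
# (A hypothetical $0.05 per call is assumed for illustration.)
a = cost_per_solved(0.70, tasks=100, avg_llm_calls=2, price_per_call=0.05)
b = cost_per_solved(0.80, tasks=100, avg_llm_calls=20, price_per_call=0.05)
# a ≈ $0.14 per solved task, b = $1.25 per solved task
```

Under these made-up numbers the less accurate agent is almost nine times cheaper per solved task, which is exactly the kind of distinction an accuracy-only benchmark cannot surface.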


What the Initial Results Show

The benchmark's authors tested several popular AI agents, and the results were surprising: high task-solving accuracy does not always mean high efficiency.

Some agents performed well on classic benchmarks but took many unnecessary actions: running tests multiple times, editing the same files repeatedly, and querying the model even when it wasn't necessary. Others worked faster and more precisely, even though their overall accuracy was slightly lower.

This is an important signal: if we want AI agents to become genuinely useful tools in development, it's not enough to simply teach them to find the right answers. We need to teach them to act reasonably, without wasting time and resources.


Why This Matters to Developers

For those creating AI agents, SWE-fficiency offers a new perspective on quality assessment. Now, it's possible to see not just the final accuracy score but also understand how an agent arrives at a solution. This helps identify weak spots: for instance, if an agent queries the model too often, perhaps its ability to analyze context needs improvement. If it spends a lot of time editing code, the problem might lie in how it plans its actions.

For those who use agents in their work, this is also useful. When choosing a tool, you can focus not only on whether it can handle the task but also on how quickly and economically it will do so. Ultimately, this affects both the cost of use and the overall user experience.


What's Next

SWE-fficiency is an attempt to shift the focus from the result to the process. For now, the benchmark is new, and it's unclear how widely it will be adopted. But the idea itself seems logical: if we want AI agents to become a part of everyday development, it's crucial to teach them to work not just correctly, but efficiently.

Perhaps, over time, other metrics will emerge that account for not only accuracy but also speed, cost, and user experience. For now, however, SWE-fficiency is one of the first steps in this direction.

Original Title: SWE-fficiency: Evaluating How to Fix Code, Not Just What to Fix
Publication Date: Feb 16, 2026
OpenHands (openhands.dev): an open-source project developing AI agents for software engineering and automation tasks.


How This Text Was Created

This material is not a direct retelling of the original publication. The news item was first selected as an event significant for understanding AI development; a processing framework was then set: what to clarify, what context to add, and where to place emphasis. This allowed a single announcement to be turned into a coherent, meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Claude Sonnet 4.5 (Anthropic): Analyzing the Original Publication and Writing the Text. The neural network studies the original material and generates a coherent text.

2. Gemini 2.5 Pro (Google DeepMind): Translation to English.

3. Gemini 2.5 Flash (Google DeepMind): Text Review and Editing. Correction of errors, inaccuracies, and ambiguous phrasing.

4. DeepSeek-V3.2 (DeepSeek): Preparing the Illustration Description. Generating a textual prompt for the visual model.

5. FLUX.2 Pro (Black Forest Labs): Creating the Illustration. Generating an image based on the prepared prompt.
