When we evaluate an AI's ability to write or fix code, we usually look at one thing: did it complete the task? Found a bug – good. Fixed it – excellent. Yet in real-world work, it's not just about what was accomplished, but how: how much time the agent spent, how many times it queried the model, what tools it used, and how economically it used its resources.
These very questions – focusing on the process, not just the result – formed the basis for a new benchmark called SWE-fficiency. Its creators believe it's time to evaluate AI agents not just on whether they solve a task, but on how reasonably they act during the process.
Why the Path to a Solution Matters, Not Just Accuracy
The classic approach to evaluating AI programming agents is straightforward: assign a task, then check the result. If the code works, the task is considered solved. However, under real-world conditions, this represents only part of the picture.
Imagine one agent finds and fixes a bug in three minutes, querying the language model twice. Another agent tackles the same task but takes half an hour and makes twenty requests. Formally, both completed the job, but it's clear the first approach is far more efficient – and cheaper, considering the cost of model calls.
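The gap between these two runs is easy to put in numbers. Here is a minimal sketch of that comparison; the per-call and per-minute prices are illustrative assumptions, not real rates:

```python
# Hypothetical cost comparison for the two agents described above.
# cost_per_call and compute_per_minute are assumed prices, not real ones.

def run_cost(minutes: float, llm_calls: int,
             cost_per_call: float = 0.05,
             compute_per_minute: float = 0.01) -> float:
    """Rough dollar cost of one agent run."""
    return llm_calls * cost_per_call + minutes * compute_per_minute

fast_agent = run_cost(minutes=3, llm_calls=2)    # 2*0.05 + 3*0.01  = 0.13
slow_agent = run_cost(minutes=30, llm_calls=20)  # 20*0.05 + 30*0.01 = 1.30

print(f"fast: ${fast_agent:.2f}, slow: ${slow_agent:.2f}, "
      f"ratio: {slow_agent / fast_agent:.0f}x")
```

Under these assumed prices, the second agent costs roughly ten times as much per solved task, even though both runs count as "success" on a result-only benchmark.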
SWE-fficiency aims to account for precisely this. The benchmark evaluates not only an agent's ability to find a solution but also how rationally it does so: how many steps it takes, how long the execution lasts, how often it queries the model, and what tools it uses.
How the Benchmark Is Structured
At the core of SWE-fficiency is a set of real-world code-fixing tasks. These aren't synthetic examples but situations developers face in their daily work: they need to find a bug, understand its cause, and make changes to get the code working again.
But unlike traditional benchmarks, this one records not just the final result, but the entire process:
- How many times the agent queried the language model;
- How long the entire fixing process took;
- Which tools were used – editing files, running tests, searching through code;
- How many attempts were needed to reach a working solution.
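The four quantities above can be pictured as a per-run trace record. The sketch below is an illustration of that idea; the field names are hypothetical and not SWE-fficiency's actual schema:

```python
# A minimal sketch of a per-run trace that a process-aware benchmark
# could record. Field names here are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class RunTrace:
    task_id: str
    solved: bool
    llm_calls: int = 0          # how many times the agent queried the model
    wall_time_s: float = 0.0    # how long the entire fixing process took
    tool_actions: dict = field(default_factory=dict)  # e.g. {"edit": 2, "test": 1}
    attempts: int = 1           # attempts before reaching a working solution

    def log_tool(self, name: str) -> None:
        """Count one use of a tool (file edit, test run, code search, ...)."""
        self.tool_actions[name] = self.tool_actions.get(name, 0) + 1

trace = RunTrace(task_id="repo-issue-1234", solved=True,
                 llm_calls=5, wall_time_s=212.0)
for action in ("search", "edit", "test", "edit"):
    trace.log_tool(action)
print(trace.tool_actions)  # {'search': 1, 'edit': 2, 'test': 1}
```

With traces like this, two agents that both solve a task can still be told apart by how much work each one did along the way.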
This allows for comparing agents not just by the percentage of tasks solved, but by how effective they are under real-world conditions. One agent might solve 70% of tasks, but do so quickly and economically. Another might handle 80% but consume significantly more resources. Which one is better depends on the context of its use.
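One simple way to make that trade-off concrete is to normalize the solve rate by cost. The metric and numbers below are a hedged illustration of the 70%-vs-80% example above, not a metric the benchmark itself defines:

```python
# Illustrative accuracy-vs-cost comparison for the two hypothetical
# agents from the text. Costs per task are assumed values.

def tasks_per_dollar(solve_rate: float, avg_cost_per_task: float) -> float:
    """Solved tasks per dollar spent: a simple efficiency-aware score."""
    return solve_rate / avg_cost_per_task

agent_a = tasks_per_dollar(solve_rate=0.70, avg_cost_per_task=0.20)  # frugal
agent_b = tasks_per_dollar(solve_rate=0.80, avg_cost_per_task=0.60)  # costly

print(f"A: {agent_a:.2f} solved/$, B: {agent_b:.2f} solved/$")
```

On raw accuracy agent B wins; per dollar spent, agent A comes out well ahead. Which ranking matters depends on the context of use, exactly as the text argues.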
What the Initial Results Show
The benchmark's authors tested several popular AI agents, and the results were surprising: high task-solving accuracy doesn't always translate into high efficiency.
Some agents performed well on classic benchmarks but took many unnecessary actions: running tests multiple times, editing the same files repeatedly, and querying the model even when it wasn't necessary. Others worked faster and more precisely, even though their overall accuracy was slightly lower.
This is an important signal: if we want AI agents to become genuinely useful tools in development, it's not enough to simply teach them to find the right answers. We need to teach them to act reasonably, without wasting time and resources.
Why This Matters to Developers
For those creating AI agents, SWE-fficiency offers a new perspective on quality assessment. It lets you not just see the final accuracy score, but also understand how an agent arrives at a solution. This helps identify weak spots: for instance, if an agent queries the model too often, perhaps its ability to analyze context needs improvement. If it spends a lot of time editing code, the problem might lie in how it plans its actions.
For those who use agents in their work, this is also useful. When choosing a tool, you can focus not only on whether it can handle the task but also on how quickly and economically it will do so. Ultimately, this affects both the cost of use and the overall user experience.
What's Next
SWE-fficiency is an attempt to shift the focus from the result to the process. For now, the benchmark is new, and it's unclear how widely it will be adopted. But the idea itself seems logical: if we want AI agents to become a part of everyday development, it's crucial to teach them to work not just correctly, but efficiently.
Perhaps, over time, other metrics will emerge that account for not only accuracy but also speed, cost, and user experience. For now, however, SWE-fficiency is one of the first steps in this direction.