When we evaluate an AI's ability to write or fix code, we usually look at one thing: did it complete the task? Found a bug – good. Fixed it – excellent. Yet in real-world work, it's not just about what was accomplished, but how: how much time the agent spent, how many times it queried the model, what tools it used, and how economically it used its resources.
These very questions – focusing on the process, not just the result – formed the basis for a new benchmark called SWE-fficiency. Its creators believe it's time to evaluate AI agents not just on whether they solve a task, but on how reasonably they act during the process.
Why the Path to a Solution Matters, Not Just Accuracy
The classic approach to evaluating AI programming agents is straightforward: assign a task, then check the result. If the code works, the task is considered solved. However, under real-world conditions, this represents only part of the picture.
Imagine one agent finds and fixes a bug in three minutes, querying the language model twice. Another agent tackles the same task but takes half an hour and makes twenty requests. Formally, both completed the job, but it's clear the first approach is far more efficient – and cheaper, considering the cost of model calls.
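The gap between these two runs is easy to put in numbers. Here is a minimal sketch of that comparison; the per-call and per-minute prices are illustrative assumptions, not real rates:

```python
# Hypothetical cost comparison for the two agents described above.
# cost_per_call and compute_per_minute are assumed prices, not real ones.

def run_cost(minutes: float, llm_calls: int,
             cost_per_call: float = 0.05,
             compute_per_minute: float = 0.01) -> float:
    """Rough dollar cost of one agent run."""
    return llm_calls * cost_per_call + minutes * compute_per_minute

fast_agent = run_cost(minutes=3, llm_calls=2)    # 2*0.05 + 3*0.01  = 0.13
slow_agent = run_cost(minutes=30, llm_calls=20)  # 20*0.05 + 30*0.01 = 1.30

print(f"fast: ${fast_agent:.2f}, slow: ${slow_agent:.2f}, "
      f"ratio: {slow_agent / fast_agent:.0f}x")
```

Under these assumed prices, the second agent costs roughly ten times as much per solved task, even though both runs count as "success" on a result-only benchmark.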
SWE-fficiency aims to account for precisely this. The benchmark evaluates not only an agent's ability to find a solution but also how rationally it does so: how many steps it takes, how long the execution lasts, how often it queries the model, and what tools it uses.
How the Benchmark Is Structured
At the core of SWE-fficiency is a set of real-world code-fixing tasks. These aren't synthetic examples but situations developers face in their daily work: they need to find a bug, understand its cause, and make changes to get the code working again.
But unlike traditional benchmarks, this one records not just the final result, but the entire process:
- How many times the agent queried the language model;
- How long the entire fixing process took;
- Which tools were used – editing files, running tests, searching through code;
- How many attempts were needed to reach a working solution.
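The four quantities above can be pictured as a per-run trace record. The sketch below is an illustration of that idea; the field names are hypothetical and not SWE-fficiency's actual schema:

```python
# A minimal sketch of a per-run trace that a process-aware benchmark
# could record. Field names here are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class RunTrace:
    task_id: str
    solved: bool
    llm_calls: int = 0          # how many times the agent queried the model
    wall_time_s: float = 0.0    # how long the entire fixing process took
    tool_actions: dict = field(default_factory=dict)  # e.g. {"edit": 2, "test": 1}
    attempts: int = 1           # attempts before reaching a working solution

    def log_tool(self, name: str) -> None:
        """Count one use of a tool (file edit, test run, code search, ...)."""
        self.tool_actions[name] = self.tool_actions.get(name, 0) + 1

trace = RunTrace(task_id="repo-issue-1234", solved=True,
                 llm_calls=5, wall_time_s=212.0)
for action in ("search", "edit", "test", "edit"):
    trace.log_tool(action)
print(trace.tool_actions)  # {'search': 1, 'edit': 2, 'test': 1}
```

With traces like this, two agents that both solve a task can still be told apart by how much work each one did along the way.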
This allows for comparing agents not just by the percentage of tasks solved, but by how effective they are under real-world conditions. One agent might solve 70% of tasks, but do so quickly and economically. Another might handle 80% but consume significantly more resources. Which one is better depends on the context of its use.
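One simple way to make that trade-off concrete is to normalize the solve rate by cost. The metric and numbers below are a hedged illustration of the 70%-vs-80% example above, not a metric the benchmark itself defines:

```python
# Illustrative accuracy-vs-cost comparison for the two hypothetical
# agents from the text. Costs per task are assumed values.

def tasks_per_dollar(solve_rate: float, avg_cost_per_task: float) -> float:
    """Solved tasks per dollar spent: a simple efficiency-aware score."""
    return solve_rate / avg_cost_per_task

agent_a = tasks_per_dollar(solve_rate=0.70, avg_cost_per_task=0.20)  # frugal
agent_b = tasks_per_dollar(solve_rate=0.80, avg_cost_per_task=0.60)  # costly

print(f"A: {agent_a:.2f} solved/$, B: {agent_b:.2f} solved/$")
```

On raw accuracy agent B wins; per dollar spent, agent A comes out well ahead. Which ranking matters depends on the context of use, exactly as the text argues.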
What the Initial Results Show
The benchmark's authors tested several popular AI agents, and the results were surprising: high task-solving accuracy doesn't always translate into high efficiency.
Some agents performed well on classic benchmarks but took many unnecessary actions: running tests multiple times, editing the same files repeatedly, and querying the model even when it wasn't necessary. Others worked faster and more precisely, even though their overall accuracy was slightly lower.
This is an important signal: if we want AI agents to become genuinely useful tools in development, it's not enough to simply teach them to find the right answers. We need to teach them to act reasonably, without wasting time and resources.
Why This Matters to Developers
For those creating AI agents, SWE-fficiency offers a new perspective on quality assessment. It lets you not just see the final accuracy score, but also understand how an agent arrives at a solution. This helps identify weak spots: for instance, if an agent queries the model too often, perhaps its ability to analyze context needs improvement. If it spends a lot of time editing code, the problem might lie in how it plans its actions.
For those who use agents in their work, this is also useful. When choosing a tool, you can focus not only on whether it can handle the task but also on how quickly and economically it will do so. Ultimately, this affects both the cost of use and the overall user experience.
What's Next
SWE-fficiency is an attempt to shift the focus from the result to the process. For now, the benchmark is new, and it's unclear how widely it will be adopted. But the idea itself seems logical: if we want AI agents to become a part of everyday development, it's crucial to teach them to work not just correctly, but efficiently.
Perhaps, over time, other metrics will emerge that account for not only accuracy but also speed, cost, and user experience. For now, however, SWE-fficiency is one of the first steps in this direction.