When it comes to AI agents that can write code, fix bugs, and solve technical problems, a question eventually arises: how do we know which one performs best? The answer seems simple – you need a test. But in reality, creating a fair and indicative test for such systems turns out to be surprisingly difficult.
The team behind OpenHands – an open-source platform for running AI agents to solve development tasks – decided to tackle this very issue. They didn't just create another ranking; they tried to understand what was wrong with existing ones and make them fairer and more informative.
What Is the OpenHands Index and Why Is It Needed?
The OpenHands Index is a set of tasks used to evaluate AI agents that can work with code. Its foundation is SWE-bench Verified, one of the most well-known sets of real-world tasks from open repositories on GitHub. The concept is simple: an agent is given a real bug or task from a real project, and the test checks whether it can produce a working fix.
The problem is that such benchmarks start to "leak" over time. Models are trained on data from the internet, which means some of the test tasks may have ended up in the training data – even if implicitly. This is called data contamination, and it's a serious issue: the agent appears to solve tasks it has already "seen", and its results no longer reflect its true abilities.
This is exactly where the authors started: they analyzed to what extent existing benchmark tasks were "exposed" in public sources.
What's Wrong with Conventional Metrics?
The standard metric in these tests is the percentage of solved tasks. It seems logical: the more tasks an agent solves, the better it is. But the authors discovered that this metric hides important nuances.
First, the tasks in the test are not uniform in difficulty. Some are solved by almost all agents – they are too simple to reveal anything about the system's real capabilities. Others are solved by no one – and they are equally uninformative. The most informative tasks turn out to be the "mid-level" ones, where some agents succeed and others fail.
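This idea can be made concrete. A minimal sketch, using a hypothetical 0/1 results matrix (not data from the actual index): for each task, compute the fraction of agents that solve it. The variance p·(1−p) of that fraction is zero exactly when everyone or no one succeeds – those are the uninformative tasks.

```python
# Hypothetical 0/1 results: rows = agents, columns = tasks.
results = [
    [1, 1, 0, 1, 0],  # agent A
    [1, 0, 0, 1, 1],  # agent B
    [1, 1, 0, 0, 1],  # agent C
]

def discrimination(results):
    """For each task, return (solve rate p, variance p*(1-p)).

    Variance is 0 when everyone (or no one) solves the task --
    such tasks say nothing about relative agent quality.
    """
    n_agents = len(results)
    scores = []
    for col in range(len(results[0])):
        p = sum(row[col] for row in results) / n_agents
        scores.append((p, p * (1 - p)))
    return scores

for task, (p, var) in enumerate(discrimination(results)):
    label = "informative" if 0 < p < 1 else "uninformative"
    print(f"task {task}: solve rate {p:.2f}, variance {var:.2f} -> {label}")
```

In this toy matrix, tasks 0 and 2 (solved by all and by none) carry zero variance, while the mixed-result tasks are the ones worth keeping.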
Second, when several agents achieve a similar final percentage, it doesn't mean they are solving the same tasks. Two agents with the same score might be solving completely different sets of tasks – each has its own strengths. If you only look at the final score, this information is lost.
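A simple way to surface this hidden difference is to compare the sets of solved task IDs rather than the counts. A sketch with made-up task IDs, using Jaccard overlap:

```python
# Two agents with identical scores (4 solved tasks each) --
# task IDs are hypothetical, only the set logic matters.
solved_a = {"task-1", "task-2", "task-5", "task-8"}
solved_b = {"task-1", "task-3", "task-6", "task-9"}

def jaccard(a, b):
    """Overlap of two solved-task sets: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

print(f"scores: {len(solved_a)} vs {len(solved_b)}")          # identical
print(f"overlap: {jaccard(solved_a, solved_b):.2f}")          # small
```

Here both agents "score" 4, yet they agree on only one task out of seven – exactly the information a single percentage throws away.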
Third, different runs of the same agent produce different results – AI systems are not always reproducible. This means that small differences in final scores might just be statistical noise, not a real advantage of one system over another.
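One practical consequence is that a score should be reported as a range, not a single number. A minimal sketch with hypothetical per-run pass rates:

```python
import statistics

# Hypothetical pass rates from five runs of the same agent
# on the same benchmark.
runs = [0.44, 0.41, 0.46, 0.43, 0.45]

mean = statistics.mean(runs)
stdev = statistics.stdev(runs)  # sample standard deviation

# Report the score with its run-to-run spread:
print(f"pass rate: {mean:.3f} +/- {stdev:.3f}")
```

With a spread of roughly two percentage points between runs, a single-run score of 44% tells you much less than the range 44% ± 2% does.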
How They Tried to Fix This
The OpenHands team proposed several changes to how agents are evaluated – and to how test tasks are formulated.
The first is updating the task set. The authors added new tasks that are less likely to have already appeared in models' training data. This helps make the test less «predictable» for agents that might have accidentally memorized something similar.
The second is a shift toward tasks that actually differentiate between agents. A task solved by everyone or no one doesn't help determine which is better. Therefore, the focus is placed on tasks where there is a real spread in the results.
The third is a more careful interpretation of the results. The authors urge against drawing bold conclusions from small differences in scores, especially if they fall within the margin of statistical error. Two agents with scores of 43% and 45% are likely a "tie", not a victory for one over the other.
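The 43%-versus-45% intuition can be checked with a back-of-the-envelope significance test. A sketch using a standard two-proportion approximation (this is generic statistics, not the authors' specific methodology; the task counts are illustrative):

```python
import math

def score_diff_is_noise(p1, p2, n, z=1.96):
    """Rough check: is the gap between two pass rates within
    sampling error at ~95% confidence?

    Independent two-proportion approximation; assumes both agents
    are scored on the same number of tasks n.
    """
    se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    return abs(p1 - p2) < z * se

# On a benchmark of a few hundred tasks, 43% vs 45% is a tie:
print(score_diff_is_noise(0.43, 0.45, 500))   # within the noise
# Only with an order of magnitude more tasks does the gap hold up:
print(score_diff_is_noise(0.43, 0.45, 5000))  # outside the noise
```

At 500 tasks the sampling error alone is about ±3 points per agent, so a 2-point gap says essentially nothing about which system is better.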
Why Bother with Benchmark Quality Anyway?
This might seem like a technical issue, far removed from practice. But in reality, a lot depends on the quality of these tests.
If a benchmark is flawed, companies and researchers start optimizing their agents for it – not for real-world tasks. The agent will look good in the rankings but perform poorly in practice. This is a well-known trap: when a measure becomes a target, it ceases to be a good measure.
Moreover, users and developers who choose AI tools for their projects rely on public rankings. If these rankings are misleading, decisions are made based on false data.
The authors of OpenHands are honest about this: they admit their own index is imperfect and describe concrete steps to improve it – not because it's beneficial from a marketing standpoint, but because it's hard to move forward without proper measurement.
What Remains an Open Question
The problem of data contamination won't just disappear on its own. The longer a benchmark exists, the higher the probability that its tasks will end up in the training data of new models. It's a kind of race: test creators must constantly update tasks to maintain their relevance.
The question of how to compare agents that approach tasks in fundamentally different ways also remains open. One might be cautious and precise, while another is fast but prone to errors. The single metric of "percentage of solved tasks" doesn't capture this behavioral difference.
Finally, there's always the question: how similar are the benchmark tasks to what real developers actually face? SWE-bench Verified uses tasks from open repositories, but real-world development isn't just about bugs in public projects. It involves internal systems, non-standard architectures, and non-obvious contexts.
The OpenHands team doesn't claim to have solved all these issues. But the very fact that they've raised them and attempted to unpack them is a step toward a more honest conversation about what AI agents can actually do, versus what the pretty numbers in the tables might suggest.