When discussing the capabilities of language models, standard benchmarks are usually employed. These benchmarks show the percentage of correct answers on test sets, but it's not always clear how this correlates with real-world work. OpenHands decided to approach evaluation differently – they launched the OpenHands Index, a benchmark for AI agents that tests them on actual tasks from GitHub.
What Is This Benchmark?
The OpenHands Index is, essentially, a collection of real-world problems from open-source repositories. It includes bug fixes, adding new features, improving documentation, and other typical developer tasks. Agents receive a task description and must independently write code, modify the necessary files, and solve the problem in a way that passes verification.
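To make the idea concrete, here is a minimal sketch of what a single task record in such a benchmark might contain. The field names are illustrative assumptions, not the actual OpenHands Index schema.

```python
# Hypothetical shape of one benchmark task; field names are assumptions,
# not the real OpenHands Index data format.
from dataclasses import dataclass


@dataclass
class BenchmarkTask:
    repo: str               # GitHub repository the task comes from
    issue_description: str  # problem statement given to the agent
    base_commit: str        # repository state the agent starts from
    test_command: str       # command used to verify the solution


task = BenchmarkTask(
    repo="example-org/example-repo",
    issue_description="Fix crash when parsing empty config files",
    base_commit="abc123",
    test_command="pytest tests/test_config.py",
)
```

The key point is the last field: each task carries its own verification recipe, so solutions can be judged automatically rather than by eye.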
Simply put, these are not abstract questions like "what will this code output?", but comprehensive work: making sense of someone else's project, understanding the context, locating the right spot, and making the correct edit.
Why Is This Important? 🔍
Most existing code benchmarks test models on synthetic tasks or isolated functions. These tests can verify logic, syntax knowledge, and algorithmic skills. But in real development, everything is more complex: you need to understand the project architecture, work with multiple files simultaneously, and account for dependencies and code style.
The OpenHands Index attempts to get closer to this reality. Here, the agent doesn't simply write a function – it works with an entire repository, just like a human would.
How Does Verification Work?
Each task in the index is linked to a specific issue or pull request from GitHub. The agent has access to the repository code, the problem description, and context. It must:
- analyze the task;
- find the necessary files;
- make changes;
- ensure the code works (if tests exist).
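The steps above can be sketched as a simple loop. Everything here is a hypothetical placeholder, not the OpenHands API: in a real agent an LLM would plan and generate the edits, while this toy version only demonstrates the "locate relevant files, then modify them" flow.

```python
# Toy sketch of the agent workflow: analyze the task, find relevant files,
# make changes. All names are hypothetical, not the OpenHands interface.


def solve_task(issue_description: str, files: dict[str, str]) -> dict[str, str]:
    """Return the set of edited files for a given issue description."""
    # "Analysis": naively treat files mentioning words from the issue as relevant.
    relevant = {
        path: src
        for path, src in files.items()
        if any(word in src for word in issue_description.split())
    }
    # "Editing": a real agent would generate a patch here; we just tag the files.
    return {path: src + "\n# patched" for path, src in relevant.items()}


repo = {"config.py": "def load(path): ...", "util.py": "def helper(): ..."}
edits = solve_task("fix load bug in config", repo)
```

Even this caricature shows why the task is hard: before any code is written, the agent has to decide which of potentially thousands of files actually matter.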
After this, the solution is automatically checked. The success criterion is functional correctness: the solution must meet the requirements outlined in the issue.
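A hedged sketch of what such a check can look like: run the project's test suite against the patched repository and treat a zero exit status as success. The command and paths here are illustrative, not the benchmark's actual harness.

```python
# Illustrative functional-correctness check: run the tests in the patched
# repo and report pass/fail. Not the actual OpenHands verification harness.
import subprocess
import sys


def verify_solution(repo_path: str, test_command: list[str]) -> bool:
    """Return True if the test suite passes in the patched repository."""
    result = subprocess.run(
        test_command, cwd=repo_path, capture_output=True, text=True
    )
    return result.returncode == 0


# A trivially passing one-liner stands in for a real test suite here.
ok = verify_solution(".", [sys.executable, "-c", "assert 1 + 1 == 2"])
```

Exit-code-based checking is what makes evaluation scalable: no human has to read the diff, only the tests have to pass.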
First Results 📊
OpenHands has already tested several models on their index. The results show that even advanced models do not successfully handle every task. This is expected: working with real projects requires not only knowledge of a programming language but also the ability to navigate someone else's code, understand the developer's intent, and account for many nuances.
Interestingly, some models handle certain types of tasks better than others. For example, bug fixes may be easier than adding new functionality because there is already error context, and the location where something broke is often indicated.
Who Might Find This Useful?
First, AI agent developers. If you are creating a tool for programming automation, the OpenHands Index provides a clear way to check how well it performs in practice.
Second, those choosing a model for their work. Instead of relying only on abstract metrics, you can see how a specific model handles tasks similar to yours.
Third, this is a useful signal for the entire industry. The more realistic benchmarks there are, the clearer it becomes where models are truly strong, and where there is still work to be done.
What's Next?
OpenHands plans to expand the index by adding new tasks and repositories. This is important because task diversity helps avoid overfitting to specific patterns. The broader the set, the harder it is for models to "fit" the solution to known examples.
The team also promises to open-source the data and methodology so that others can reproduce the results or use the index for their own experiments.
Limitations and Questions
Of course, this approach also has its difficulties. First, real GitHub tasks can be ambiguous. Sometimes even humans argue about how to properly solve a problem. Automatic verification cannot always account for all nuances.
Second, the set of tasks is still finite. There is a risk that models will eventually start optimizing indirectly for it, especially if the data gets into training sets.
Third, it is not yet entirely clear how the index accounts for code quality. It is one thing to solve a task; it's another to do it cleanly, readably, and in accordance with the project's style.
Nevertheless, this is a step in the right direction. Realistic benchmarks help us better understand where AI agents can be useful right now, and where they still need to develop.