Most tests for AI agents are designed in a similar way: you take a task, provide a fixed codebase, and the model attempts to fix or write something. If it succeeds, it passes. But there's a problem: real code doesn't stand still. Repositories get updated, dependencies change, and issues are opened and closed – and what worked yesterday might be obsolete today.
The creators of a new benchmark, EvoClaw, set out to address exactly this mismatch.
What's Wrong with the Old Tests?
One of the best-known benchmarks in automated code generation is SWE-Bench. It's built on real GitHub issues: each task pairs an issue with a snapshot of the repository from the time the issue appeared, and an AI agent is asked to resolve it. Sounds reasonable.
However, over time, a few inconvenient truths have emerged. First, many of these tasks are already known – they've made their way into the models' training data. The agent isn't "solving" the task; it's essentially "recalling" it. Second, the benchmark is static: same tasks, same repositories. Models start to overfit to it, and the results no longer reflect their true capabilities.
Simply put: a high score on SWE-Bench no longer guarantees that an agent can handle a live project.
EvoClaw: The Benchmark That Doesn't Stand Still
EvoClaw is an attempt to create a benchmark that can't be memorized. The main idea is that test tasks are sourced from repositories in real time. As soon as a new bug or task appears in a project, it can become part of the test. The repository is used in its current state, not as a snapshot from the past.
This means the agent faces tasks that appeared after its training data was collected, so it cannot have seen them before. No "recalling" – just genuine problem-solving in conditions that are as close as possible to the work of a real developer.
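The freshness idea described above can be illustrated with a minimal sketch. The article does not describe how EvoClaw actually selects tasks, so everything here is an assumption made for illustration: the `Task` record, the `fresh_tasks` helper, the repository names, and the cutoff date are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Task:
    """A hypothetical benchmark task drawn from a repository issue."""
    repo: str
    issue_id: int
    opened_on: date


def fresh_tasks(tasks, training_cutoff):
    """Keep only tasks opened strictly after the model's training cutoff."""
    return [t for t in tasks if t.opened_on > training_cutoff]


# Illustrative data only; these are not real EvoClaw tasks.
tasks = [
    Task("example/repo", 101, date(2023, 5, 1)),
    Task("example/repo", 202, date(2025, 2, 10)),
    Task("example/other", 303, date(2025, 3, 4)),
]
cutoff = date(2024, 12, 31)  # assumed training cutoff, purely illustrative

selected = fresh_tasks(tasks, cutoff)
print([t.issue_id for t in selected])
```

Filtering by date is only one plausible way to keep a model from "recalling" a task. It says nothing about whether the surrounding repository code was in the training data, only that the task itself is newer than the cutoff.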
Another important point: EvoClaw tracks not only whether the agent solves the task right now, but also whether its solution remains functional as the project continues to evolve. This is a completely different level of requirement – the agent must write code that "lives" within the project, not code that merely passes the tests at the moment of evaluation.
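One way to picture this "does the solution keep working" requirement is to re-run the checks for the agent's change against later revisions of the project. The sketch below is not EvoClaw's actual method, which the article does not detail. The function names, the revision labels, and the simulated check are all hypothetical.

```python
from typing import Callable, Optional, Sequence


def first_breaking_revision(
    revisions: Sequence[str],
    passes_at: Callable[[str], bool],
) -> Optional[str]:
    """Return the first later revision where the agent's change fails, or None."""
    for revision in revisions:
        if not passes_at(revision):
            return revision
    return None


def survival_rate(
    revisions: Sequence[str],
    passes_at: Callable[[str], bool],
) -> float:
    """Fraction of later revisions at which the change still passes."""
    if not revisions:
        return 1.0  # nothing has changed yet, so nothing has broken
    passing = sum(1 for revision in revisions if passes_at(revision))
    return passing / len(revisions)


# Simulated history: the change breaks at "rev3" (an assumption for the demo).
later_revisions = ["rev1", "rev2", "rev3", "rev4"]
failing = {"rev3"}


def simulated_check(revision: str) -> bool:
    """Stand-in for running the project's tests at a given revision."""
    return revision not in failing


print(first_breaking_revision(later_revisions, simulated_check))
print(survival_rate(later_revisions, simulated_check))
```

In a real setup, `simulated_check` would be replaced by something that checks out the revision and runs the project's test suite. The point of the sketch is only that the score depends on a sequence of project states rather than a single one.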
Why Is This Important Right Now?
In recent months, AI agents for code generation have taken a prominent place in the industry. Tools built on the latest models from OpenAI and Anthropic are already in active use among developers, and companies are racing to report impressive benchmark results.
But this is precisely where the question arises: what's really behind these numbers? If a model gets a high score on a test it has effectively already seen during training, that's not an indicator of its real capabilities. It's an indicator of a good memory.
EvoClaw offers a different criterion: not "how well the agent knows old tasks", but "how well it handles new ones". And this distinction is fundamental – especially for those who are seriously considering AI agents as assistants in real-world development.
What Did the Initial Results Show?
The creators of EvoClaw tested several modern agents on it, including OpenHands, an open-source platform for AI software development agents. The agents scored noticeably lower than they do on the usual static tests.
This in itself is revealing. Not because the agents are "bad", but because the gap between performance on outdated benchmarks and performance in real-world conditions proved to be significant. This is exactly the kind of data the industry needs to move in the right direction.
A Live Benchmark as Infrastructure
An interesting detail: EvoClaw is designed not as a one-off study, but as a continuously updated system. The authors plan to regularly add new tasks from current repositories, so it will be fundamentally impossible to "cram" for it.
This changes the very logic of evaluation. Instead of aiming for a high score on a fixed test, agent developers will be forced to create systems that can genuinely solve unfamiliar tasks in unfamiliar code. And that is much closer to what everyone actually wants from an AI development assistant.
In short: EvoClaw is an attempt to move the bar for evaluating AI agents closer to reality. The question is not "how many points did the model score?" but "can it handle something it has never seen before?" For now, this might sound like a nuance – but for an industry that is increasingly integrating such agents into live projects, the question is not academic but purely practical.