The Cursor team has a tool named Bugbot. Its job is to automatically find and fix bugs in code. It sounds simple, but in practice, evaluating and improving such systems is tough. Standard metrics like "how many tests passed" don't always reflect the real quality of the work.
Recently, they shared how they solved this problem: they created their own AI-based metric and used it to systematically improve Bugbot.
The Challenge of Evaluating Fixes
When a bot fixes a bug, you need to understand: is this actually a good fix? You can check if the tests pass afterward. However, tests aren't always available, and they don't always cover every important aspect.
You could bring in human reviewers to evaluate each fix, but that's slow and expensive, especially when you're experimenting and want to quickly vet dozens of variants.
Cursor decided they needed an automated metric that would evaluate fixes almost exactly the way an experienced developer would.
The AI-Based Metric
They built a special model that looks at the fix and assigns a score: how well did the bot handle the task? Simply put, one AI checks the work of another.
This metric considers not just the fact of the fix, but also code quality, the completeness of the solution, and potential side effects. In other words, it tries to mimic how a human would grade the result.
Of course, such a metric isn't perfect. But if it correlates well enough with human evaluations, it can be used for rapid experimentation.
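As a rough sketch of how such a judge's output might be turned into a single number, consider a weighted rubric over the aspects mentioned above. The dimensions and weights here are illustrative assumptions, not Cursor's actual criteria:

```python
# Hypothetical rubric dimensions, loosely based on the aspects mentioned
# above (correctness of the fix, code quality, completeness, side effects).
# The weights are illustrative assumptions, not Cursor's actual values.
WEIGHTS = {
    "correctness": 0.4,
    "code_quality": 0.2,
    "completeness": 0.2,
    "side_effects": 0.2,
}

def aggregate_judge_scores(sub_scores: dict) -> float:
    """Combine per-dimension scores (0-10) from a judge model into one number."""
    return sum(WEIGHTS[dim] * sub_scores[dim] for dim in WEIGHTS)

# Example: sub-scores a judge model might return for one proposed fix.
fix_scores = {"correctness": 9, "code_quality": 7, "completeness": 8, "side_effects": 6}
print(round(aggregate_judge_scores(fix_scores), 2))  # 7.8
```

Weighting correctness above the other dimensions reflects the idea that a stylistically clean fix that doesn't actually resolve the bug should still score poorly.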
How This Helped Improve Bugbot
Once this metric was in place, the improvement process became far more manageable. Previously, it was hard to tell which changes to the system actually helped. Now the team can run a test, get a numerical score, and compare different approaches.
The team began systematically testing hypotheses: changing prompts, tuning model parameters, and experimenting with the context passed to the bot. After every change, the metric showed whether things got better or worse.
This approach allowed them to find several key improvements that might otherwise have gone unnoticed.
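The comparison step described above can be sketched as computing the mean-score difference between two variants over the same evaluation set. The variant names and scores below are invented for illustration:

```python
from statistics import mean

def compare(baseline: list, candidate: list) -> float:
    """Return the mean-score difference of a candidate over the baseline.

    Each list holds the metric's scores for the same set of bug-fix tasks,
    one score per task. A positive result means the change helped on average.
    """
    return mean(candidate) - mean(baseline)

# Illustrative scores for two prompt variants over five tasks.
old_prompt = [6.5, 7.0, 5.5, 8.0, 6.0]   # mean 6.6
new_prompt = [7.0, 7.5, 6.5, 8.0, 7.0]   # mean 7.2
print(round(compare(old_prompt, new_prompt), 2))  # 0.6
```

In practice you would also want enough tasks (and perhaps a significance test) to be sure a small difference isn't noise, but the basic loop is just this: change something, re-score, compare.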
What This Means for AI Tool Development
The Bugbot story is a great example of how to accelerate the development of complex AI systems. When you have a reliable metric, you can experiment faster and with more confidence.
This is especially important for tools that work with code. For code, the quality of a result is often non-obvious, and you can't simply compute accuracy or recall.
The approach with custom AI-based metrics can be useful not just for debugging, but also for other tasks: code generation, refactoring, and automatic reviews.
Open Questions
A few interesting questions remain. First, how closely does such a metric track actual user preferences? An AI judge can learn to evaluate code against specific criteria, but there is always a risk that it misses something important or, conversely, overweights superficial aspects.
Second, how do you train and calibrate such a metric? Most likely, you need a set of benchmark examples labeled by humans. This takes time and effort, though still less than constant manual evaluation of every experiment.
But overall, the idea looks sound: using AI not just as a working tool, but also as a way to measure the quality of other AI systems.