When developers use an AI code editor, they quickly start to feel the difference between models. One responds more accurately, another makes frequent mistakes, and a third is good at simple tasks but struggles with complex ones. But how do you measure this feeling? How do you turn «this model seems better» into concrete data you can rely on for decision-making?
This is the very question the team at Cursor – the company developing the eponymous code editor with deep AI integration – is tackling. They recently published a detailed post about how their internal model evaluation system is structured. And behind this description lies a rather interesting story about why standard approaches don't work.
Why «Just Run a Test» Doesn't Work
In the world of AI, there are public benchmarks – sets of tasks used to compare models. It's convenient: you take two models, run them through the same tasks, and look at the numbers. The problem is that such tests often don't reflect real-world work.
Cursor is a tool for programmers. People use it to write code, edit it, understand others' projects, and find bugs. These are very specific scenarios, and they differ significantly from the abstract tasks in public benchmarks. A model might excel at algorithmic puzzles but fail to understand what a developer actually wants when they ask to «fix this piece of code.»
Moreover, there's another subtlety: models change. Providers update them regularly – sometimes with an announcement, sometimes silently. What worked well last week might behave differently today. Therefore, a static test, run once, quickly becomes obsolete.
Online and Offline: Two Perspectives on a Single Task
Cursor decided not to choose between «live data» and «lab tests» but to use both approaches together. This is what they call a hybrid evaluation system.
Offline evaluation involves controlled experiments. The team takes real tasks that users solved with Cursor and turns them into reproducible tests. It's crucial that these tasks are taken from real-world practice, not artificially created. This is already a step forward compared to abstract benchmarks.
Such tests can be run repeatedly, comparing models under identical conditions to get stable results. This is convenient for quick checks: before rolling out a new model to users, they can run it through a set of tasks and see how it performs.
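To make the idea concrete, here is a minimal sketch of an offline comparison. Everything in it is an assumption made for illustration: the `Task` shape, the `check` functions, and the two toy "models" are hypothetical stand-ins, not Cursor's actual format or infrastructure. The point is only that every model sees the same tasks and is scored the same way.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Task:
    """A reproducible test case (hypothetical shape, not Cursor's format)."""
    name: str
    prompt: str
    check: Callable[[str], bool]  # True if the model's output is acceptable


# Stand-in for a real model call: takes a prompt, returns the model's output.
Model = Callable[[str], str]


def pass_rate(model: Model, tasks: List[Task]) -> float:
    """Fraction of tasks whose output passes the task's check."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = sum(1 for task in tasks if task.check(model(task.prompt)))
    return passed / len(tasks)


def compare(models: Dict[str, Model], tasks: List[Task]) -> Dict[str, float]:
    """Run every model on the same tasks so the scores are comparable."""
    return {name: pass_rate(model, tasks) for name, model in models.items()}


# Toy tasks and toy models, invented purely for this example.
TASKS = [
    Task("rename", "rename foo to bar: foo()", lambda out: out == "bar()"),
    Task("upper", "uppercase: hi", lambda out: out == "HI"),
]


def model_a(prompt: str) -> str:
    return "bar()" if prompt.startswith("rename") else "HI"


def model_b(prompt: str) -> str:
    return "bar()"  # answers every prompt the same way


RESULTS = compare({"model_a": model_a, "model_b": model_b}, TASKS)
print(RESULTS)
```

In a real setup the `check` would more likely run a test suite or compare against a reference edit, and the task list would be far larger than two items.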
Online evaluation is about observing what happens in actual use. When real users work with the editor, the system captures signals: whether the user accepted the model's suggestion, edited it, rejected it immediately, or came back to it later. These signals are indirect, but they show how useful the model is in practice, which a synthetic test cannot measure.
Simply put: offline evaluation answers the question «how correctly does the model solve the task?» while online evaluation answers «how useful is this to a real person in their actual work?»
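The online signals described above can be turned into simple rates. The sketch below is an illustration under stated assumptions: the three outcome labels (`accepted`, `edited`, `rejected`) and the flat list of events are hypothetical, and a real system would record much richer data per interaction.

```python
from collections import Counter
from typing import Dict, Iterable

# Assumed outcome labels for this sketch; the real signal set is not public.
OUTCOMES = ("accepted", "edited", "rejected")


def outcome_rates(events: Iterable[str]) -> Dict[str, float]:
    """Share of suggestions that ended in each outcome."""
    counts = Counter(events)
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown outcomes: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no events to aggregate")
    return {outcome: counts[outcome] / total for outcome in OUTCOMES}


# Invented events, one per suggestion shown to a user.
EVENTS = ["accepted", "accepted", "edited", "rejected"]
RATES = outcome_rates(EVENTS)
print(RATES)
```

Tracking "edited" separately from "accepted" matters here: a suggestion that needed changes was useful, but less so than one taken as is.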
Where the Test Tasks Come From
One of the key principles of Cursor's approach is that tests must reflect what users are actually doing. It sounds obvious, but in practice, this requires dedicated effort.
The team analyzes which scenarios are most common: editing code, generating new snippets, working with large files, and answering questions about unfamiliar code. Then, from real interactions, they select examples that represent these scenarios well and turn them into reproducible tests.
At the same time, it's important to ensure the tests don't «get stale.» If developers today are actively using a new technology, but the tests were created a year ago for different tasks, the evaluation results will be inaccurate. Therefore, the system requires regular updates.
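One simple way to keep an eye on staleness is to record when each test was created and flag the old ones for review. This is a sketch, not a description of Cursor's process: the `EvalTask` record, the task names, and the 180-day threshold are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass(frozen=True)
class EvalTask:
    """Hypothetical record of a test task and the day it was created."""
    name: str
    created: date


def stale_tasks(tasks: List[EvalTask], today: date, max_age_days: int = 180) -> List[str]:
    """Names of tasks created more than max_age_days before today."""
    cutoff = today - timedelta(days=max_age_days)
    return [task.name for task in tasks if task.created < cutoff]


# Invented tasks and dates, for illustration only.
TASKS = [
    EvalTask("refactor-legacy-module", date(2024, 1, 10)),
    EvalTask("explain-unfamiliar-file", date(2024, 11, 2)),
]
STALE = stale_tasks(TASKS, today=date(2025, 1, 1))
print(STALE)
```

Age is only a proxy. A flagged task may still be relevant, so the output is a review list rather than a delete list.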
What Exactly Counts as a «Good Response»
This is perhaps the most difficult part. Evaluating code is not the same as evaluating text: code can be checked objectively, because it either works or it doesn't. But whether it works is only one of the criteria.
Cursor looks at several things simultaneously. First, correctness: does the code do what was expected of it? Second, contextual relevance: did the model consider the project's specifics, coding style, and existing conventions? Third, user behavior: how did the person react to the suggestion – did they accept it as is, modify it, or ignore it?
The last point is particularly interesting. If a model suggests technically correct code, but the developer doesn't use it, that's a signal that something went wrong. Maybe the model misunderstood the task. Maybe the answer was too generic. Maybe the style didn't match. Online signals help reveal this.
Why This Matters Beyond Just Cursor
This story about a model evaluation system might seem like a behind-the-scenes look at one company's internal workings. But in reality, it raises a question that is important for the entire industry.
Currently, most teams integrating AI into their products face the same problem: how to know if a model works well specifically for their task. Public rankings and benchmarks provide a general idea, but they don't answer the question, «but how will it work for us?»
Cursor's approach is an example of how to build a custom evaluation system based on real-world usage. They aren't trying to create a universal benchmark. They are trying to understand what works well for their users – and to regularly check if that has changed.
This approach requires resources: you need to collect data, build infrastructure, and maintain the relevance of the tests. But without it, choosing between models becomes a guessing game.
Open Questions
Cursor frankly admits that the system isn't perfect. Several questions remain open.
The first is the problem of test «contamination.» The longer a set of tasks exists, the higher the risk that models have been trained on similar examples and simply «know the right answer.» This makes the results less indicative. Therefore, the tests must be updated regularly, which requires constant effort.
The second is the interpretation of behavioral signals. If a user doesn't accept a model's suggestion, it doesn't always mean the model was wrong. Perhaps the person changed their mind, got distracted, or simply preferred to write the code themselves. Separating «the model was wrong» from «it was just a matter of circumstance» is a non-trivial task.
The third is user diversity. Different developers work in different ways. Some code in Python, others in Go. Some work on large legacy projects, while others start from scratch. Averaged metrics can hide important differences: a model that is good «on average» might perform poorly in a specific scenario.
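The averaging problem is easy to show with numbers. In the sketch below, the segments, the counts, and the record format are invented for illustration; the only claim is arithmetic: a reasonable overall acceptance rate can coexist with a poor rate in one segment.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (segment, accepted) pairs; the segment could be a language or project type.
Record = Tuple[str, bool]


def overall_rate(records: List[Record]) -> float:
    """Acceptance rate across all records, ignoring segments."""
    if not records:
        raise ValueError("no records")
    return sum(1 for _, accepted in records if accepted) / len(records)


def rate_by_segment(records: List[Record]) -> Dict[str, float]:
    """Acceptance rate computed separately for each segment."""
    totals: Dict[str, int] = defaultdict(int)
    accepted_counts: Dict[str, int] = defaultdict(int)
    for segment, accepted in records:
        totals[segment] += 1
        if accepted:
            accepted_counts[segment] += 1
    return {segment: accepted_counts[segment] / totals[segment] for segment in totals}


# Invented data: 10 Python suggestions (8 accepted), 5 Go suggestions (1 accepted).
RECORDS: List[Record] = (
    [("python", True)] * 8
    + [("python", False)] * 2
    + [("go", True)] * 1
    + [("go", False)] * 4
)

OVERALL = overall_rate(RECORDS)
BY_SEGMENT = rate_by_segment(RECORDS)
print(OVERALL, BY_SEGMENT)
```

Here the overall rate is 9 out of 15, or 60%, while Go users accept only 1 suggestion in 5. Reporting the per-segment numbers next to the average makes that gap visible.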
These limitations don't make the system useless – they simply show that evaluating the quality of AI models in real products is not a one-time task, but an ongoing process. And Cursor, by all appearances, treats it exactly that way.