Imagine giving a coding assistant a task: rewrite a large part of a project. You have no tests, no verification, and no way to know exactly what changed or whether something important broke. Sounds risky, right? Teams that build AI agents without a proper methodology are in roughly the same position: moving fast, but without a safety net.
This is the exact problem Databricks faced. The company develops data platforms and actively creates its own AI agents. At some point, it became clear that simply writing agents and deploying them to production wasn't enough. A systematic approach was needed, one that would allow them to move quickly while ensuring that nothing broke unnoticed.
And so, coSTAR was born.
coSTAR is an internal methodology at Databricks for developing and deploying AI agents. It's not a library, nor a framework in the conventional sense, but rather a set of principles and practices that answers the question: how do you build agents correctly to avoid excruciating pain later?
Simply put, it's an attempt to bring order to something that is inherently difficult to control. AI agents are not static programs. They make decisions, call tools, and interact with external systems. The same agent can behave differently depending on the context, the phrasing of a query, or the state of the model. This makes their development fundamentally different from conventional programming.
In traditional software development, the logic is straightforward: you write code, run a test, and see the result. It doesn't work that way with agents. A test might pass, but the agent could still do something unexpected in a real-world scenario. That's why you need not just a set of tests, but an entire culture for working with such systems.
The name coSTAR is an acronym, with each letter representing a specific principle.
Context – the agent must clearly understand the environment it's operating in, its tasks, and its constraints. This sounds obvious, but in practice, an ill-defined context is the cause of most bizarre errors.
Objective – the agent must have a clearly formulated goal. Not a vague “help the user,” but a specific result that can be measured and evaluated.
Steps – decomposing the task into understandable stages. The agent shouldn't try to solve everything in one big step; it's better to break the work down into manageable parts.
Tone – how the agent communicates with the user: its style, level of formality, and manner of explanation. This affects not only perception but also trust in the system.
Audience – the agent must understand who it's working for. An answer for an experienced developer is fundamentally different from an answer for a novice, even if the question is the same.
Response – the final block that describes what the result should look like: its format, structure, and level of detail.
Essentially, coSTAR is a way to think about an agent's prompts and behavior systematically, rather than intuitively. Instead of guessing anew each time how to correctly formulate a task for an agent, the team works with a unified structure.
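To make the idea of a unified structure concrete, here is a minimal sketch in Python. It assumes the six sections are simply stored as fields and rendered into one prompt in a fixed order. The class name AgentBrief, the render method, and the example values are all hypothetical; the public description does not say how Databricks represents this internally.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AgentBrief:
    """Hypothetical container for the six sections described above."""
    context: str
    objective: str
    steps: list[str]
    tone: str
    audience: str
    response: str

    def render(self) -> str:
        """Render all sections as one prompt, always in the same order."""
        parts = []
        for f in fields(self):
            value = getattr(self, f.name)
            if not value:
                # An empty section is rejected instead of being silently skipped.
                raise ValueError(f"section '{f.name}' is empty")
            if isinstance(value, list):
                # Number the steps so the agent sees an explicit sequence.
                value = "\n".join(f"{i}. {s}" for i, s in enumerate(value, 1))
            parts.append(f"# {f.name.upper()}\n{value}")
        return "\n\n".join(parts)


# Example values are made up for illustration.
brief = AgentBrief(
    context="You answer questions about a SQL warehouse the user can query.",
    objective="Return one query that answers the question and runs without error.",
    steps=["Restate the question", "Pick the tables", "Write the query"],
    tone="Concise and neutral",
    audience="Data analysts who know SQL",
    response="A single SQL code block followed by a one-sentence explanation",
)
prompt = brief.render()
```

Because every brief has the same six fields, two prompts can be compared section by section, and a missing section fails loudly at render time rather than showing up later as odd agent behavior.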
One of the key questions in agent development is the age-old trade-off between speed and stability. If you move too fast, something is bound to break. If you do everything slowly and carefully, your competitors will pull ahead.
Databricks believes that the right methodology resolves this trade-off, or at least significantly mitigates it. When a team has a common language, a shared structure, and clear evaluation criteria, every change becomes more predictable. There's no need to renegotiate what constitutes a good result every single time.
This is especially important in the context of iterative development. Agents are constantly changing: models are updated, new tools are added, and user requirements evolve. Without a clear evaluation system, every such update is a lottery. With coSTAR, it's a managed process with clear control points.
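One way to picture such a control point is a gate that compares evaluation scores before and after a change. The sketch below is an assumption about what a control point could look like, not a description of Databricks tooling: the function regression_gate, the criterion names, the tolerance, and the scores are all invented for illustration.

```python
def regression_gate(baseline: dict[str, float],
                    candidate: dict[str, float],
                    tolerance: float = 0.02) -> list[str]:
    """Return the criteria where the candidate scores worse than the
    baseline by more than `tolerance`. A criterion missing from the
    candidate also counts as a regression."""
    regressions = []
    for name, old_score in baseline.items():
        new_score = candidate.get(name)
        if new_score is None or old_score - new_score > tolerance:
            regressions.append(name)
    return regressions


# Illustrative scores only, not measurements from any real agent.
baseline = {"task_success": 0.90, "format_valid": 1.00, "tone_match": 0.80}
candidate = {"task_success": 0.93, "format_valid": 0.95, "tone_match": 0.79}

# An empty list would mean the update may proceed.
blocked_by = regression_gate(baseline, candidate)
```

Here the update improves task success but is blocked by the drop in format validity, while the small dip in tone stays within tolerance.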
The way Databricks approaches agent quality evaluation deserves special attention. This is perhaps the most non-trivial part of the entire methodology.
With regular code, things are relatively simple: a test either passes or it fails. An agent's output rarely reduces to that. It might give a technically correct answer that is useless in the given context. Or it might follow every step correctly and still arrive at a bizarre conclusion. Or it could work perfectly on test examples and behave unpredictably on real queries.
Therefore, in coSTAR, evaluation is structured across several levels simultaneously. There are automated checks, which are fast and help catch obvious regressions. There's evaluation using another language model, which is slower but can detect semantic errors that automated checks miss. And there is human evaluation – the most expensive, but necessary for final quality control.
This multi-layered approach means the team does not have to choose between speed and accuracy: quick checks run at every step, and deeper ones run where they are truly needed.
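The layering can be sketched as a short pipeline in which each layer runs only if the cheaper one before it passed. This is a simplified illustration under stated assumptions: the function evaluate, the thresholds, and the checks are hypothetical, and stub_judge is a hardcoded stand-in for a real call to a second language model. Sending only borderline scores to a human is also a choice made for this sketch; the source says only that human evaluation is needed for final quality control.

```python
from typing import Callable


def evaluate(answer: str,
             fast_checks: list[Callable[[str], bool]],
             judge: Callable[[str], float],
             pass_score: float = 0.8,
             review_score: float = 0.5) -> str:
    """Return "pass", "fail", or "needs_human_review" for one answer."""
    # Layer 1: cheap deterministic checks catch obvious regressions.
    if not all(check(answer) for check in fast_checks):
        return "fail"
    # Layer 2: the slower model-based judge runs only after layer 1 passes.
    score = judge(answer)
    if score >= pass_score:
        return "pass"
    # Layer 3: borderline scores are queued for a person to review.
    if score >= review_score:
        return "needs_human_review"
    return "fail"


def non_empty(answer: str) -> bool:
    return bool(answer.strip())


def mentions_select(answer: str) -> bool:
    return "SELECT" in answer.upper()


def stub_judge(answer: str) -> float:
    # Placeholder for a language-model judge; the scores are hardcoded.
    return 0.9 if "because" in answer.lower() else 0.6


checks = [non_empty, mentions_select]
verdict_empty = evaluate("", checks, stub_judge)
verdict_terse = evaluate("SELECT 1", checks, stub_judge)
verdict_explained = evaluate("SELECT 1, because it is the simplest probe", checks, stub_judge)
```

The ordering is the point: the expensive layers see only the answers that the cheaper layers could not reject.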
Databricks described its approach publicly not to show off, but rather because the problem is universal. Any team seriously involved in AI agent development sooner or later faces the same questions: how to evaluate quality, how not to break what's already working, and how to move fast while maintaining control.
There are still few ready-made answers to these questions in the industry. Most teams either invent their own approaches from scratch or work intuitively – and keep making the same mistakes. The emergence of public methodologies like coSTAR is an attempt to raise the overall level of maturity in the field of agent development.
This doesn't mean coSTAR is a one-size-fits-all solution. It reflects the specific experience of a specific team under specific conditions. But the logic itself – first agree on a common structure, then move fast – is applicable far more broadly than just one product or company.
Despite the elegance of the approach, questions remain. How well does coSTAR scale to complex multi-agent systems, where many agents interact with each other? And how applicable is it for teams that don't have the same deep expertise in data and AI as Databricks?
There are no answers to these questions in the public description – and that's okay. The methodology is a living thing, evolving along with practice. Agent systems are still a young field, and even for those deeply involved, much remains in the experimental stage.
But the very fact that companies are starting to systematize and publicly describe their experiences is a good sign. It means the field is maturing. And the next agent you deploy to production is a little more likely to behave as you expect it to. 🙂