Published on March 22, 2026

coSTAR: How Databricks Launches AI Agents Quickly and Reliably

Databricks has developed its own approach to creating AI agents – the coSTAR system, which allows the team to work quickly without losing control over quality.

Development | Source: Databricks | 6–8 min read

Imagine giving a coding assistant the task of rewriting a large part of a project when you have no tests, no verification, and no way to know what exactly has changed or whether something important has broken. Sounds risky, right? This is roughly the situation teams find themselves in when developing AI agents without a proper methodology: moving fast, but without a safety net.

This is the exact problem Databricks faced. The company develops data platforms and actively creates its own AI agents. At some point, it became clear that simply writing agents and deploying them to production wasn't enough. A systematic approach was needed, one that would allow them to move quickly while ensuring that nothing broke unnoticed.

And so, coSTAR was born.

What Is coSTAR and Why Is It Needed?

coSTAR is an internal methodology at Databricks for developing and deploying AI agents. It's not a library, nor a framework in the conventional sense, but rather a set of principles and practices that answers the question: how do you build agents correctly to avoid excruciating pain later?

Simply put, it's an attempt to bring order to something that is inherently difficult to control. AI agents are not static programs. They make decisions, call tools, and interact with external systems. The same agent can behave differently depending on the context, the phrasing of a query, or the state of the model. This makes their development fundamentally different from conventional programming.

In traditional software development, the logic is straightforward: you write code, run a test, and see the result. It doesn't work that way with agents. A test might pass, but the agent could still do something unexpected in a real-world scenario. That's why you need not just a set of tests, but an entire culture for working with such systems.

The Components of the Approach

The name coSTAR is an acronym, with each letter representing a specific principle.

Context – the agent must clearly understand the environment it's operating in, its tasks, and its constraints. This sounds obvious, but in practice, an ill-defined context is the cause of most bizarre errors.

Objective – the agent must have a clearly formulated goal. Not a vague “help the user,” but a specific result that can be measured and evaluated.

Steps – decomposing the task into understandable stages. The agent shouldn't try to solve everything in one big step; it's better to break the work down into manageable parts.

Tone – how the agent communicates with the user: its style, level of formality, and manner of explanation. This affects not only perception but also trust in the system.

Audience – the agent must understand who it's working for. An answer for an experienced developer is fundamentally different from an answer for a novice, even if the question is the same.

Response – the final block that describes what the result should look like: its format, structure, and level of detail.

Essentially, coSTAR is a way to think about an agent's prompts and behavior systematically, rather than intuitively. Instead of guessing anew each time how to correctly formulate a task for an agent, the team works with a unified structure.
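The six components above can be captured in one small data structure. The following is a minimal sketch in Python, not Databricks' implementation: the AgentSpec class, its field names, the example values, and the section headings in the rendered prompt are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AgentSpec:
    """Hypothetical container for the six components described above."""
    context: str
    objective: str
    steps: tuple[str, ...]
    tone: str
    audience: str
    response: str

    def validate(self) -> None:
        # Reject empty components so a vague spec fails early.
        for f in fields(self):
            if not getattr(self, f.name):
                raise ValueError(f"missing component: {f.name}")

    def render(self) -> str:
        # Emit the components in a fixed order as one prompt string.
        self.validate()
        numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(self.steps, 1))
        return "\n\n".join([
            f"# CONTEXT\n{self.context}",
            f"# OBJECTIVE\n{self.objective}",
            f"# STEPS\n{numbered}",
            f"# TONE\n{self.tone}",
            f"# AUDIENCE\n{self.audience}",
            f"# RESPONSE\n{self.response}",
        ])


# Illustrative values only.
spec = AgentSpec(
    context="You answer questions about an internal sales table.",
    objective="Return one SQL query that answers the user's question.",
    steps=("Identify the needed columns.", "Write the query.", "Explain it briefly."),
    tone="Concise and neutral.",
    audience="Analysts who know SQL basics.",
    response="A SQL code block followed by a two-sentence explanation.",
)
prompt = spec.render()
print(prompt)
```

Because every agent is described with the same fields, a reviewer can compare two specs component by component instead of reading two free-form prompts.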

Fast and Reliable – Can You Have Both?

One of the key questions in agent development is the age-old trade-off between speed and stability. If you move too fast, something is bound to break. If you do everything slowly and carefully, your competitors will pull ahead.

Databricks believes that the right methodology resolves this trade-off, or at least significantly mitigates it. When a team has a common language, a shared structure, and clear evaluation criteria, every change becomes more predictable. There's no need to renegotiate what constitutes a good result every single time.

This is especially important in the context of iterative development. Agents are constantly changing: models are updated, new tools are added, and user requirements evolve. Without a clear evaluation system, every such update is a lottery. With coSTAR, it's a managed process with clear control points.
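One way to picture a control point is a gate that compares a candidate agent's evaluation scores with those of the version currently in production. The sketch below is an assumption about how such a gate could look, not something the source describes: the regression_gate function, the metric names, the scores, and the tolerance value are all hypothetical.

```python
def regression_gate(baseline: dict[str, float],
                    candidate: dict[str, float],
                    tolerance: float = 0.02) -> list[str]:
    """Return the metrics on which the candidate regressed.

    A metric counts as regressed when the candidate is missing it or
    scores more than `tolerance` below the baseline.
    """
    regressions = []
    for metric, old_score in baseline.items():
        new_score = candidate.get(metric)
        if new_score is None or new_score < old_score - tolerance:
            regressions.append(metric)
    return regressions


# Illustrative scores only; higher is better for every metric here.
baseline = {"task_success": 0.90, "tool_call_accuracy": 0.95}
candidate = {"task_success": 0.91, "tool_call_accuracy": 0.88}

failed = regression_gate(baseline, candidate)
if failed:
    print("blocked, regressed on:", failed)
else:
    print("ok to ship")
```

The tolerance absorbs small run-to-run noise in agent evaluations, so that only drops larger than that margin block a release.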

Evaluation – The Hardest Part

The way Databricks approaches agent quality evaluation deserves special attention. This is perhaps the most non-trivial part of the entire methodology.

With regular code, things are relatively simple: a test either passes or it fails. An agent's output is harder to judge. It might provide a technically correct answer that is completely useless in the given context. It might follow all the steps correctly but arrive at a bizarre conclusion. Or it could work perfectly on test examples and behave unpredictably on real queries.

Therefore, in coSTAR, evaluation is structured across several levels simultaneously. There are automated checks, which are fast and help catch obvious regressions. There's evaluation using another language model, which is slower but can detect semantic errors that automated checks miss. And there is human evaluation – the most expensive, but necessary for final quality control.

This multi-layered approach lets the team combine speed and accuracy instead of choosing between them: quick checks at every step, and deeper ones where they are truly needed.
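The three levels can be sketched as a small pipeline in Python. This is an illustration under stated assumptions, not the Databricks implementation: the function names, the thresholds, and the rule that only ambiguous judge scores are escalated to a person are all hypothetical, and stub_judge stands in for a real call to a language model.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    passed: Optional[bool]  # None means a human still has to decide
    tier: str               # which level produced the verdict
    reason: str


def automated_checks(answer: str) -> Optional[str]:
    """Level 1: cheap deterministic checks. Returns a failure reason or None."""
    if not answer.strip():
        return "empty answer"
    if len(answer) > 2000:
        return "answer too long"
    return None


def evaluate(question: str, answer: str,
             judge: Callable[[str, str], float],
             pass_at: float = 0.8, fail_below: float = 0.4) -> Verdict:
    # Level 1: fast checks catch obvious regressions first.
    failure = automated_checks(answer)
    if failure is not None:
        return Verdict(False, "automated", failure)
    # Level 2: a model-based judge scores the answer between 0 and 1.
    score = judge(question, answer)
    if score >= pass_at:
        return Verdict(True, "model_judge", f"score={score:.2f}")
    if score < fail_below:
        return Verdict(False, "model_judge", f"score={score:.2f}")
    # Level 3: anything in between is queued for a person.
    return Verdict(None, "human_review", f"ambiguous score={score:.2f}")


def stub_judge(question: str, answer: str) -> float:
    # Placeholder for a real language-model call; the scoring rule is made up.
    return 0.9 if "lakehouse" in answer.lower() else 0.5


print(evaluate("What is a lakehouse?", "", stub_judge))
print(evaluate("What is a lakehouse?", "A lakehouse combines a lake and a warehouse.", stub_judge))
print(evaluate("What is a lakehouse?", "It depends.", stub_judge))
```

Ordering the levels from cheapest to most expensive means most answers never reach the slower and costlier stages.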

Why This Isn't Just About Databricks

Databricks described its approach publicly not to show off, but rather because the problem is universal. Any team seriously involved in AI agent development sooner or later faces the same questions: how to evaluate quality, how not to break what's already working, and how to move fast while maintaining control.

There are still few ready-made answers to these questions in the industry. Most teams either invent their own approaches from scratch or work intuitively – and periodically make the same mistakes over and over. The emergence of public methodologies like coSTAR is an attempt to raise the overall level of maturity in the field of agent development.

This doesn't mean coSTAR is a one-size-fits-all solution. It reflects the specific experience of a specific team under specific conditions. But the logic itself – first agree on a common structure, then move fast – is applicable far more broadly than just one product or company.

What Remains an Open Question

Despite the elegance of the approach, open questions remain. How well does coSTAR scale to truly complex systems in which many agents interact with each other? How applicable is it for teams that don't have the same deep expertise in data and AI as Databricks?

There are no answers to these questions in the public description – and that's okay. The methodology is a living thing, evolving along with practice. Agent systems are still a young field, and even for those deeply involved, much remains in the experimental stage.

But the very fact that companies are starting to systematize and publicly describe their experiences is a good sign. It means the field is maturing. And the next agent you deploy to production is a little more likely to behave as you expect it to. 🙂

Link to Original: https://www.databricks.com/blog/costar-how-we-ship-ai-agents-databricks-fast-without-breaking-things

Original Title: coSTAR: How We Ship AI Agents at Databricks Fast, Without Breaking Things

Publication Date: Mar 21, 2026

Databricks (www.databricks.com): a U.S.-based platform for data analytics and machine learning built on a Lakehouse architecture.
