Published on March 13, 2026

Kimi AI Agents: Scaling Infrastructure for Hundreds of Thousands of Users

How Kimi Runs Hundreds of Thousands of AI Agents Simultaneously: A Look at the Infrastructure

Exploring how Kimi created a scalable and secure environment for AI agents on the Alibaba Cloud platform.

Infrastructure, 6–8 min read
Event Source: Alibaba Cloud

AI agents are no longer just chatbots that answer questions. They are programs that act: they run code, open browsers, work with files, and perform multi-step tasks. All of this requires not only powerful AI but also a well-designed infrastructure – an environment where each agent can operate in an isolated, fast, and secure manner.

Kimi is a popular AI assistant developed by the Chinese company Moonshot AI. Its key features are a long context window and the ability to function as a full-fledged agent: searching for information, analyzing documents, and writing and executing code. As the number of users grows and tasks become more complex, a natural question arises: how does it all hold together? What does it run on, how does it scale, and what prevents one user's agents from "interfering" with another's?

AI Agent vs. Language Model: Actions and Infrastructure Challenges

An Agent Isn't Just an Answer; It's an Action

When a standard language model answers a question, it simply generates text. An agent does more: it can decide that to provide an answer, it first needs to do something – run a script, visit a website, or open a file. Simply put, an agent is a model with "hands."

But this is where the infrastructure headache begins. If an agent runs code, that code needs to be executed somewhere. And it must be done in a way that:

  • one user's code cannot affect another's data;
  • the environment deploys quickly – users don't want to wait;
  • resources are not wasted when the agent is idle;
  • the system can withstand sudden spikes in load – for instance, if hundreds of thousands of people launch agents simultaneously.

For Kimi, this isn't an abstract problem – it's a daily reality.

Isolated Sandboxes for AI Agents: Security and Predictability

A Separate Environment for Each – Isolation as the Foundation

The key architectural decision made by the Kimi team was to use isolated sandboxes. Each agent operates in its own separate environment, as if each user had their own small virtual computer.

This is important for two reasons. The first is security: whatever an agent does inside its sandbox will not affect others. The second is predictability: the environment is the same for everyone, and its behavior can be controlled.

To implement this, Kimi uses Alibaba Cloud infrastructure. Specifically, it relies on two services: ACK (Alibaba Cloud Container Service for Kubernetes) and ACS (Alibaba Cloud Container Compute Service, a serverless container offering). To put it without abbreviations: the first is a platform for managing containers (small, isolated software environments), and the second allows these containers to be launched "on demand" without keeping servers constantly running.
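As a rough illustration of what "one isolated environment per agent session" can look like on Kubernetes, the sketch below builds a Pod manifest as a plain Python dictionary. The field names are standard Kubernetes Pod fields, but the image name, label keys, and resource figures are hypothetical; the source does not describe Kimi's actual manifests.

```python
import json


def build_sandbox_pod(session_id: str, image: str = "example.com/agent-sandbox:latest") -> dict:
    """Return a Kubernetes Pod manifest for one agent session (illustrative only)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            # One Pod per session; the label makes it easy to find and delete later.
            "name": f"sandbox-{session_id}",
            "labels": {"app": "agent-sandbox", "session-id": session_id},
        },
        "spec": {
            "restartPolicy": "Never",
            # The sandbox does not need credentials for the cluster API.
            "automountServiceAccountToken": False,
            "containers": [
                {
                    "name": "sandbox",
                    "image": image,  # hypothetical image name
                    "resources": {
                        # Assumed figures: limits cap what a single session can consume.
                        "requests": {"cpu": "500m", "memory": "512Mi"},
                        "limits": {"cpu": "1", "memory": "1Gi"},
                    },
                    "securityContext": {
                        "runAsNonRoot": True,
                        "allowPrivilegeEscalation": False,
                        "readOnlyRootFilesystem": True,
                    },
                }
            ],
        },
    }


print(json.dumps(build_sandbox_pod("abc123"), indent=2))
```

Resource limits and a restrictive security context are only one layer of isolation. A production setup would typically add a stronger runtime boundary and network rules on top.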

Instant Start for AI Agents: Mechanisms for Fast Environment Deployment

Instant Start: Why It's Harder Than It Looks

Imagine a user clicks a button, asks an agent to do something – and waits. If the environment takes 30-60 seconds to deploy, it's annoying. If it takes 2-3 seconds, it's tolerable. If it's less than a second, it's practically unnoticeable.

Traditional cloud approaches handle this poorly: launching a full-fledged virtual machine takes time. Containers start faster, but they still take a noticeable amount of time to pull an image and boot. That's why Kimi uses a mechanism of pre-warmed pools – pre-prepared environments that are ready to go and waiting to be assigned. When a request comes in, the agent doesn't have to wait for a "boot-up"; it immediately gets a ready-made environment.

Simply put, it's like keeping several clean workstations ready so a new employee can sit down and start working immediately, instead of waiting for a desk to be assembled and a computer to be set up.
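The idea of a pre-warmed pool can be sketched in a few lines. This is a minimal in-memory model, not Kimi's implementation: the `WarmPool` class, its method names, and the `sandbox-N` identifiers are all hypothetical, and `_boot` stands in for the slow work of actually starting a container.

```python
import itertools
from collections import deque


class WarmPool:
    """Keeps a target number of ready sandboxes; falls back to a cold start when empty."""

    def __init__(self, target_size: int):
        self.target_size = target_size
        self._ids = itertools.count(1)
        self._ready = deque()
        self.cold_starts = 0
        self.refill()

    def _boot(self) -> str:
        # Stand-in for the slow step: pulling an image and starting a container.
        return f"sandbox-{next(self._ids)}"

    def refill(self) -> None:
        """Top the pool back up to target_size (a background task in a real system)."""
        while len(self._ready) < self.target_size:
            self._ready.append(self._boot())

    def acquire(self) -> str:
        """Hand out a ready sandbox immediately, or boot one on demand if none is left."""
        if self._ready:
            return self._ready.popleft()
        self.cold_starts += 1
        return self._boot()


pool = WarmPool(target_size=2)
first, second, third = pool.acquire(), pool.acquire(), pool.acquire()
print(first, second, third, pool.cold_starts)  # sandbox-1 sandbox-2 sandbox-3 1
```

The first two requests are served from the pool. The third arrives before a refill and pays the cold-start cost, which is exactly the case a well-sized pool tries to make rare.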

Scaling AI Agents: Managing Hundreds of Thousands of Concurrent Sessions

Hundreds of Thousands Simultaneously: How Is That Even Possible?

One of the key claims in the description of Kimi's infrastructure is its support for hundreds of thousands of concurrent agent sessions. That's a serious number.

This is where elasticity comes to the forefront – the ability of the infrastructure to quickly scale up and down depending on the load. In the early hours, when most users are asleep, there are few agents, and fewer resources are needed. During the day, at peak activity, the system rapidly deploys additional capacity. In the evening, as the load decreases, excess resources are released.

The serverless approach (in this case, via ACS) makes this possible: instead of keeping thousands of servers running constantly, it allocates computing resources only when they are actually needed. This is both cheaper and more efficient.

However, it's not just the speed of scaling that matters, but also its precision: the system must predict how many resources will be needed in the near future, so that it neither creates queues nor wastes capacity. This is achieved using load forecasting mechanisms – the system analyzes request dynamics and prepares the required number of environments in advance.
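The source does not say which forecasting method is used, so the following is only one simple possibility: an exponentially weighted moving average of recent session arrivals, plus a safety margin. The function names, the smoothing factor `alpha`, and the `headroom` value are assumptions chosen for illustration.

```python
import math


def forecast_next(arrivals: list[float], alpha: float = 0.5) -> float:
    """Exponentially weighted moving average of per-interval session arrivals."""
    if not arrivals:
        return 0.0
    estimate = arrivals[0]
    for observed in arrivals[1:]:
        # Recent intervals count more than older ones.
        estimate = alpha * observed + (1 - alpha) * estimate
    return estimate


def target_pool_size(arrivals: list[float], headroom: float = 0.25, minimum: int = 1) -> int:
    """Number of environments to pre-warm for the next interval, with a safety margin."""
    predicted = forecast_next(arrivals)
    return max(minimum, math.ceil(predicted * (1 + headroom)))


# Arrivals doubling each interval: 100, 200, 400 new sessions.
print(forecast_next([100, 200, 400]))     # 275.0
print(target_pool_size([100, 200, 400]))  # 344
```

Note that the estimate (275) lags behind the latest observation (400) when load grows quickly. That lag is the trade-off of smoothing, and it is why the precision of the forecast matters as much as the speed of scaling.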

Data Management for AI Agents: Storage in Isolated Environments

The Agent Is Working – But What About the Data?

Another practical question: as an agent works, it creates things – files, intermediate results, cache. All of this needs to be stored somewhere and accessed quickly.

Since each agent session lives in its own isolated environment, it's crucial that storage is organized accordingly: a separate space for each session, fast access, and no overlap between users. In Kimi's infrastructure, this is solved through integration with Alibaba Cloud's storage services, which are mounted directly into each sandbox.

When a session ends, the environment is cleared. This is important not only for security but also for cost-saving: there's no reason to store what is no longer needed.
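The lifecycle described here (a private space per session, removed when the session ends) can be modeled locally with a temporary directory. This is only an analogy: in the setup described above the storage is a cloud service mounted into the sandbox, and the `session_workspace` helper below is a hypothetical name.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def session_workspace(session_id: str):
    """Create a private directory for one session and delete it when the session ends."""
    with tempfile.TemporaryDirectory(prefix=f"session-{session_id}-") as root:
        yield Path(root)
    # Leaving the inner "with" block removes the directory and everything in it.


with session_workspace("abc123") as workspace:
    (workspace / "result.txt").write_text("intermediate output")
    print(workspace.exists())  # True while the session is running

print(workspace.exists())  # False once the session has ended
```

Because every session gets its own freshly created directory, two sessions never share a path, and cleanup happens even if the code inside the session raises an exception.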

Network Isolation for AI Agents: Controlled Internet Access

Network Isolation: Agents on the Internet, but Under Control

Many agent tasks involve internet access – searching for information, downloading data, interacting with external services. This creates a potential risk: what if an agent does something undesirable or accesses something it shouldn't?

To manage this, the network traffic of each sandbox is controlled separately. Roughly speaking, an agent can access the internet, but only through "managed gateways" where rules can be set: what is allowed, what is forbidden, and what to block. Moreover, the traffic of different users is not mixed – each network session is isolated, just like the computing environment.
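A gateway rule check of this kind might look like the sketch below. The rule set is entirely hypothetical (the source does not list Kimi's actual rules), and a real gateway would enforce such rules at the network layer rather than in application code.

```python
from urllib.parse import urlsplit

# Hypothetical rule set for illustration only.
ALLOWED_SCHEMES = {"http", "https"}
# Loopback names plus a link-local address commonly used by cloud metadata services.
BLOCKED_HOSTS = {"localhost", "127.0.0.1", "169.254.169.254"}
# Placeholder suffix standing in for private, internal-only domains.
BLOCKED_SUFFIXES = ("internal.example",)


def is_egress_allowed(url: str) -> bool:
    """Return True if an outbound request to `url` passes the gateway rules."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.scheme not in ALLOWED_SCHEMES or not host:
        return False
    if host in BLOCKED_HOSTS:
        return False
    # Block the listed domains and any of their subdomains.
    return not any(host == s or host.endswith("." + s) for s in BLOCKED_SUFFIXES)


for url in (
    "https://example.com/page",
    "http://169.254.169.254/latest/meta-data",
    "https://db.internal.example/",
    "ftp://example.com/file",
):
    print(url, is_egress_allowed(url))
```

Only the first URL is allowed. The others are rejected for targeting a metadata address, an internal domain, and a disallowed scheme, respectively.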

Implications for the AI Agent Industry: Infrastructure Trends and Challenges

What This Means for the Industry

The story of Kimi's infrastructure is interesting not just in its own right. It clearly reflects a broader trend: as AI agents become part of real-world products, the focus shifts not only to the quality of the model itself but also to how that model is integrated into a functioning system.

Creating a good language model is difficult. But creating a system where that model operates reliably, quickly, and securely for hundreds of thousands of people simultaneously is a separate and equally serious engineering challenge.

For developers building their own agent-based applications, Kimi's experience offers a valuable example of how to solve the problem of scaling: it's not just about "getting more servers," but about building an architecture where elasticity, isolation, and fast start-up times are designed in from the very beginning.

For now, agent systems of this scale are rare. But the direction is clear: AI is moving from being a "smart text assistant" to an "autonomous task executor," and the infrastructure must keep pace with this transition.

Original Title: Deep Dive: How Kimi's AI Agent Runs on Alibaba Cloud
Publication Date: Mar 12, 2026
Source: Alibaba Cloud (www.alibabacloud.com), the cloud and AI division of the Chinese company Alibaba, providing infrastructure and AI services for businesses.


From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Claude Sonnet 4.6 (Anthropic): Analyzing the Original Publication and Writing the Text. The neural network studies the original material and generates a coherent text.
2. Gemini 2.5 Pro (Google DeepMind): Translation into English.
3. Gemini 2.5 Flash (Google DeepMind): Text Review and Editing. Correction of errors, inaccuracies, and ambiguous phrasing.
4. DeepSeek-V3.2 (DeepSeek): Preparing the Illustration Description. Generating a textual prompt for the visual model.
5. FLUX.2 Pro (Black Forest Labs): Creating the Illustration. Generating an image based on the prepared prompt.
