Published on March 26, 2026

SGLang Elastic EP: Surviving Partial GPU Outages in Large Language Models

When a GPU Fails and the System Keeps Running: How SGLang Learned to Survive Partial Outages

SGLang developers have introduced a mechanism that allows the system to remain operational during partial failures in large GPU clusters.

Infrastructure / Technical context
Event Source: LMSYS ORG
4-5 min read

Imagine you have a large cluster of dozens of GPUs collectively serving a powerful language model. Suddenly, one of the cards fails. What happens next? In most cases, the whole thing goes down. The system either stops completely or requires a restart and load redistribution. For a production environment where continuity is critical, this is a serious problem.

This is exactly the problem faced by teams deploying large models such as DeepSeek's across many accelerators. To address it, SGLang introduced a new mechanism: Elastic EP, or Elastic Expert Parallelism.

What Is Expert Parallelism and Why It Fails

To understand the core of the problem, we need a brief look at how Mixture of Experts (MoE) models are structured. Put simply, such a model does not run every input through all of its parameters. Instead, parts of the network are split into numerous "experts": individual sub-networks, of which only a few are activated for any given token. A router sends different tokens to different experts, and the experts themselves are distributed across multiple GPUs.
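The routing idea described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the expert count, the GPU count, the contiguous placement, the router scores, and the function names `place_experts` and `top_k_experts` are invented and are not taken from SGLang or from any specific model. In real MoE models the routing decision is made per token by a learned router.

```python
# Toy illustration only: all sizes, scores and names here are hypothetical.

def place_experts(num_experts, num_gpus):
    """Assign experts to GPUs in equal contiguous blocks (an assumed layout)."""
    if num_experts % num_gpus != 0:
        raise ValueError("this sketch assumes experts divide evenly across GPUs")
    per_gpu = num_experts // num_gpus
    return {expert: expert // per_gpu for expert in range(num_experts)}


def top_k_experts(scores, k):
    """Return the indices of the k highest router scores for one token."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return ranked[:k]


placement = place_experts(num_experts=8, num_gpus=4)

# Made-up router scores for a single token, one score per expert.
token_scores = [0.1, 0.7, 0.05, 0.2, 0.9, 0.3, 0.0, 0.4]
chosen = top_k_experts(token_scores, k=2)

# The GPUs that must be reachable for this token to be processed.
gpus_needed = sorted({placement[e] for e in chosen})
print("experts:", chosen, "GPUs:", gpus_needed)
```

With these made-up scores, the token needs experts 4 and 1, which live on GPUs 2 and 0. If either of those GPUs is unreachable, this token cannot be processed under a static placement.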

This allows models with hundreds of billions of parameters to run on real hardware. But this setup has a vulnerability: if one GPU fails, the experts on it become unavailable. The system doesn't know how to continue operating without them and either freezes or crashes.

Previously, the only solution was a full cluster restart with load redistribution – a process that is costly in terms of time and resources.

Elasticity as the Answer to Fragility

Elastic EP changes the system's logic for handling failures. Simply put: if a cluster node stops responding, the system doesn't go down with it but rather reconfigures itself on the fly.

The mechanism works as follows. All GPUs in the cluster are aware of their neighbors' configurations beforehand. When one node fails, the remaining nodes automatically redistribute among themselves the load of the experts that were on the faulty node. Requests continue to be processed, albeit with higher latency or somewhat lower throughput, but without a complete halt.
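The redistribution step can be sketched as simple bookkeeping over an expert-to-GPU map. This is not SGLang's implementation: the function `redistribute`, the least-loaded-first policy, and the cluster layout are assumptions chosen for illustration. The sketch also ignores how expert weights actually reach the surviving GPUs and how in-flight requests are handled.

```python
def redistribute(placement, gpus, failed_gpu):
    """Return a new expert -> GPU map with failed_gpu's experts reassigned.

    Each orphaned expert goes to the surviving GPU that currently hosts the
    fewest experts (ties broken by lowest GPU id). This policy is an
    assumption for the sketch, not the policy SGLang uses.
    """
    survivors = [g for g in gpus if g != failed_gpu]
    if not survivors:
        raise RuntimeError("no surviving GPUs to take over the experts")

    # Count how many experts each surviving GPU already hosts.
    load = {g: 0 for g in survivors}
    for gpu in placement.values():
        if gpu in load:
            load[gpu] += 1

    new_placement = dict(placement)  # leave the original map untouched
    orphaned = sorted(e for e, g in placement.items() if g == failed_gpu)
    for expert in orphaned:
        target = min(survivors, key=lambda g: (load[g], g))
        new_placement[expert] = target
        load[target] += 1
    return new_placement


# Hypothetical cluster: 8 experts, 4 GPUs, two experts per GPU.
gpus = [0, 1, 2, 3]
placement = {expert: expert // 2 for expert in range(8)}

after_failure = redistribute(placement, gpus, failed_gpu=1)
print(after_failure)
```

In this toy run, experts 2 and 3 from the failed GPU 1 move to GPUs 0 and 2. Those GPUs now host three experts each instead of two, which is one way to picture the reduced throughput that follows a failure.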

It's similar to how a good delivery service works: if one courier gets sick, the orders don't get stuck indefinitely – they are handed over to colleagues, even if it means a slight delay.

What This Means in Practice

For teams deploying large models in a production environment, this is a fundamental shift. Before Elastic EP, even a single GPU failure could take down the entire inference cluster for long enough to violate SLAs and create incidents for users.

Now, a partial failure is no longer a catastrophe. The system continues to serve requests. The faulty node can be replaced or restarted in the background without halting operations.

This is especially important for MoE models such as DeepSeek's, which require dozens of GPUs even for a basic deployment. The larger the cluster, the higher the probability that at least one node will fail at some point, because that probability compounds with every device added.
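The statistical point can be made concrete with a short calculation. If each device fails independently with probability p over some period, the chance that at least one of N devices fails is 1 - (1 - p)^N. Both the independence assumption and the 1% per-device figure below are illustrative assumptions, not measured failure rates.

```python
def p_any_failure(num_gpus, p_single):
    """Probability that at least one of num_gpus devices fails.

    Assumes failures are independent and every device has the same
    failure probability p_single over the period of interest.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability")
    return 1.0 - (1.0 - p_single) ** num_gpus


# The 1% per-device figure is a made-up number for illustration.
for n in (1, 8, 64):
    print(n, round(p_any_failure(n, 0.01), 3))
```

Under these assumptions the chance of seeing at least one failure rises from 1% for a single device to roughly 8% for 8 devices and roughly 47% for 64.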

A Small Detail with Big Consequences

Interestingly, the idea of elasticity in distributed systems is not new in itself. Similar mechanisms have long been used in databases, network services, and cloud platforms. But in the context of large language model inference – especially with MoE architecture and expert parallelism – this is relatively new territory.

The complexity here lies not in the idea itself, but in its implementation: ensuring the correct redistribution of experts without losing request context, without desynchronization between nodes, and without a significant performance drop during the reconfiguration.

The SGLang developers solved this problem within an open-source inference framework – which automatically means the solution is available to the community, not just to large corporations with their own engineering teams.

What Remains Unanswered

Several questions remain open for now. How large is the overhead during load redistribution at the moment of failure? How does the system behave when multiple nodes fail simultaneously? What is the scalability limit of the mechanism?

These are normal questions for any new technology. Elastic EP is not a magic bullet for all problems, but a specific tool for a specific scenario: a partial failure in a cluster during MoE model inference. And in this scenario, it solves a real and painful problem.

For an industry that is actively moving toward agentic systems and continuous inference – where downtime is especially costly – this is a step in the right direction.

Original Title: Elastic EP in SGLang: Achieving Partial Failure Tolerance for DeepSeek MoE Deployments
Publication Date: Mar 25, 2026
LMSYS ORG (lmsys.org): a U.S.-based non-profit research organization studying scalable language models and distributed training systems.

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Claude Sonnet 4.6 (Anthropic): Analyzing the Original Publication and Writing the Text. The neural network studies the original material and generates a coherent text.
2. Gemini 2.5 Pro (Google DeepMind): Translation into English.
3. Gemini 2.5 Flash (Google DeepMind): Text Review and Editing. Correction of errors, inaccuracies, and ambiguous phrasing.
4. DeepSeek-V3.2 (DeepSeek): Preparing the Illustration Description. Generating a textual prompt for the visual model.
5. FLUX.2 Pro (Black Forest Labs): Creating the Illustration. Generating an image based on the prepared prompt.
