Published on March 26, 2026

SGLang Elastic EP: Surviving Partial GPU Outages in Large Language Models

When a GPU Fails and the System Keeps Running: How SGLang Learned to Survive Partial Outages

SGLang developers have introduced a mechanism that allows the system to remain operational during partial failures in large GPU clusters.


Event Source: LMSYS ORG

Imagine you have a large cluster of dozens of GPUs collectively serving a powerful language model. Suddenly, one of the cards fails. What happens next? In most cases, the whole thing goes down. The system either stops completely or requires a restart and load redistribution. For a production environment where continuity is critical, this is a serious problem.

This is exactly the problem developers face when deploying large models like DeepSeek across many accelerators. To solve it, SGLang introduced a new mechanism: Elastic EP, or Elastic Expert Parallelism.

Expert Parallelism Explained and Why It Fails


To understand the core of the problem, we need a brief look at how modern large models built on the MoE (Mixture of Experts) architecture are structured. Put simply, such a model doesn't push every computation through one monolithic network on a single device. Instead, parts of it are divided into numerous "experts": individual sub-networks, of which only a few are activated for any given token. A router sends different tokens to different experts, and the experts themselves are distributed across various GPUs.

This allows models with hundreds of billions of parameters to run on real hardware. But this setup has a vulnerability: if one GPU fails, the experts on it become unavailable. The system doesn't know how to continue operating without them and either freezes or crashes.
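The vulnerability can be made concrete with a toy example. The sketch below assumes a static placement of 8 experts over 4 GPUs and top-2 routing; the sizes, names, and router scores are all hypothetical and are not taken from SGLang or DeepSeek.

```python
# Illustrative only: a toy MoE layer with a static expert-to-GPU placement.
# All names and sizes here are hypothetical, not SGLang's.

NUM_EXPERTS = 8
NUM_GPUS = 4
EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS


def gpu_for_expert(expert_id: int) -> int:
    """Static placement: experts 0-1 on GPU 0, 2-3 on GPU 1, and so on."""
    return expert_id // EXPERTS_PER_GPU


def top_k_experts(router_scores: list[float], k: int = 2) -> list[int]:
    """Pick the k experts with the highest router score for one token."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda e: router_scores[e], reverse=True)
    return sorted(ranked[:k])


def can_serve(router_scores: list[float], failed_gpus: set[int], k: int = 2) -> bool:
    """A token can be served only if every expert it needs sits on a live GPU."""
    return all(gpu_for_expert(e) not in failed_gpus
               for e in top_k_experts(router_scores, k))


# Made-up router scores for a single token.
scores = [0.05, 0.10, 0.40, 0.05, 0.05, 0.25, 0.05, 0.05]
chosen = top_k_experts(scores)
print(chosen, [gpu_for_expert(e) for e in chosen])
print(can_serve(scores, failed_gpus=set()))
print(can_serve(scores, failed_gpus={1}))
```

With a fixed placement like this, any token whose chosen experts include one on the failed GPU has nowhere to go, which is why a single card can stall the whole deployment.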

Previously, the only solution was a full cluster restart with load redistribution – a process that is costly in terms of time and resources.

Elastic EP Solution for LLM Fragility


Elastic EP changes the system's logic for handling failures. Simply put: if a cluster node stops responding, the system doesn't go down with it but rather reconfigures itself on the fly.

The mechanism works as follows. All GPUs in the cluster know their neighbors' configurations in advance. When one node fails, the remaining nodes automatically redistribute among themselves the experts that were hosted on the faulty node. Requests continue to be processed, perhaps more slowly or at slightly lower throughput, but without a complete halt.
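One way to picture the redistribution step is a small placement function that every surviving node could evaluate on its own and arrive at the same result. The sketch below is a hypothetical greedy least-loaded policy written for illustration; the function name `redistribute_experts` and the placement format are assumptions, and the article does not describe the policy SGLang actually uses.

```python
# Hypothetical sketch: reassign a failed rank's experts to the surviving ranks.
# This is a greedy least-loaded policy chosen for illustration; it is not
# taken from SGLang's Elastic EP implementation.

def redistribute_experts(placement: dict[int, list[int]],
                         failed_rank: int) -> dict[int, list[int]]:
    """Return a new placement with the failed rank's experts spread over the rest.

    placement maps a GPU rank to the list of expert ids it hosts.
    """
    if failed_rank not in placement:
        raise KeyError(f"unknown rank {failed_rank}")
    # Copy the surviving ranks so the caller's placement is left untouched.
    survivors = {rank: list(experts)
                 for rank, experts in placement.items() if rank != failed_rank}
    if not survivors:
        raise RuntimeError("no surviving ranks left to host the experts")
    for expert in placement[failed_rank]:
        # Give each orphaned expert to the rank that currently hosts the fewest;
        # ties go to the lowest rank id so every node computes the same answer.
        target = min(survivors, key=lambda rank: (len(survivors[rank]), rank))
        survivors[target].append(expert)
    return survivors


before = {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}
after = redistribute_experts(before, failed_rank=1)
print(after)
```

Because the tie-break is deterministic, nodes that start from the same known configuration would not need to negotiate the new layout. A real system also has to move or reload the expert weights, which this sketch leaves out.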

It's similar to how a good delivery service works: if one courier gets sick, the orders don't get stuck indefinitely – they are handed over to colleagues, even if it means a slight delay.

Practical Implications of Elastic EP for LLMs


For teams deploying large models in a production environment, this is a fundamental shift. Before Elastic EP, even a single GPU failure could take down the entire inference cluster for long enough to violate SLAs and create incidents for users.

Now, a partial failure is no longer a catastrophe. The system continues to serve requests. The faulty node can be replaced or restarted in the background without halting operations.

This is especially important for MoE models like DeepSeek, which require dozens of GPUs even for a basic deployment. The larger the cluster, the higher the probability that a node will fail at some point. It's simple statistics.
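The statistics can be sketched in a few lines. Assuming each GPU fails independently with the same probability p over some time window, the chance that at least one of N GPUs fails is 1 - (1 - p)^N. The value of p used below is invented purely to show the trend and is not a measured failure rate.

```python
def prob_any_failure(num_gpus: int, p_single: float) -> float:
    """Probability that at least one of num_gpus independent GPUs fails."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability")
    return 1.0 - (1.0 - p_single) ** num_gpus


# p_single = 0.001 is an invented figure used only to show the trend.
for n in (8, 64, 256):
    print(n, round(prob_any_failure(n, 0.001), 4))
```

The independence assumption is a simplification, since real failures are often correlated through shared hosts, power, or networking. The direction of the effect still holds: more devices mean more chances for something to break.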

Elasticity in LLMs: A New Implementation Challenge


Interestingly, the idea of elasticity in distributed systems is not new in itself. Similar mechanisms have long been used in databases, network services, and cloud platforms. But in the context of large language model inference – especially with MoE architecture and expert parallelism – this is relatively new territory.

The complexity here lies not in the idea itself, but in its implementation: ensuring the correct redistribution of experts without losing request context, without desynchronization between nodes, and without a significant performance drop during the reconfiguration.

The SGLang developers solved this problem within an open-source inference framework, which means the solution is available to the community, not just to large corporations with their own engineering teams.

Future Questions and Unanswered Aspects of Elastic EP


Several questions remain open for now. How large is the overhead during load redistribution at the moment of failure? How does the system behave when multiple nodes fail simultaneously? What is the scalability limit of the mechanism?

These are normal questions for any new technology. Elastic EP is not a magic bullet for all problems, but a specific tool for a specific scenario: a partial failure in a cluster during MoE model inference. And in this scenario, it solves a real and painful problem.

For an industry that is actively moving toward agentic systems and continuous inference – where downtime is especially costly – this is a step in the right direction.


Link to Original: https://lmsys.org/blog/2026-03-25-eep-partial-failure-tolerance

Original Title: Elastic EP in SGLang: Achieving Partial Failure Tolerance for DeepSeek MoE Deployments

Publication Date: Mar 25, 2026

LMSYS ORG (lmsys.org): a U.S.-based non-profit research organization studying scalable language models and distributed training systems.
