Published on April 2, 2026

One GPU Failure Shouldn't Bring Down the Entire System

The Mooncake and Volcano Engine teams have integrated an elastic expert parallelism mechanism into the SGLang framework, allowing it to withstand partial failures without requiring a restart.

Event Source: LMSYS ORG. Reading time: 4–6 minutes

Large language models built on the Mixture of Experts (MoE) architecture do not push every input through one large neural network. Instead, they contain many specialized subnetworks, and only a fraction of them are activated for each token. This saves computational resources, but it requires a specific way of laying the model out across the hardware.

To serve such models at an industrial scale, the standard approach is to use wide expert parallelism, where a single copy of the model is distributed across 32 or more GPUs. This allows for processing large streams of requests faster and more cheaply. The problem is that the more GPUs are involved, the higher the probability that at least one of them will fail. And in a classic deployment scheme, a single failed process brings down the entire inference instance.
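To put a number on that scaling risk: if each GPU process fails independently with probability p over some time window, the chance that at least one of N fails is 1 - (1 - p)^N. The sketch below assumes independent failures and uses an illustrative p of 0.1%; neither the independence assumption nor that figure comes from the source.

```python
def prob_any_failure(n_gpus: int, p_single: float) -> float:
    """Probability that at least one of n_gpus fails, assuming independent failures."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be in [0, 1]")
    return 1.0 - (1.0 - p_single) ** n_gpus


if __name__ == "__main__":
    p = 0.001  # illustrative per-GPU failure probability, not a measured value
    for n in (1, 8, 32, 64):
        print(f"{n:>3} GPUs: {prob_any_failure(n, p):.2%}")
```

The per-instance risk grows almost linearly with the GPU count while p is small, which is why wide deployments feel failures far more often than single-GPU ones.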

Why This Is a Serious Problem

Imagine you have a service running on 32 GPUs, and one of them fails. In a traditional setup, this means a full restart, with all the ensuing consequences: several minutes of downtime, a lost request queue, and a strain on the infrastructure. With high traffic volumes, even a couple of minutes of downtime translates to significant losses.

This is precisely the vulnerability that the Mooncake team, in collaboration with Volcano Engine, set out to address by integrating a mechanism called Elastic EP – elastic expert parallelism – into the SGLang framework.

The Idea: Breaking the Rigid Link

In a standard setup, each “expert” (subnetwork) is rigidly tied to a specific GPU. If that GPU fails, the expert becomes unavailable, and the system cannot continue to operate.

Elastic EP changes this logic: experts are stored with redundancy, meaning some of them are replicated across multiple GPUs. If one of the devices fails, the system detects it, redistributes the load to the remaining GPUs, and continues processing requests – without a complete shutdown.

Simply put: the model loses a bit of “power,” but it doesn't stop.
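The redundancy idea can be illustrated with a toy placement table. This is a minimal sketch, not the SGLang or Mooncake implementation: the function names `place_experts` and `lost_experts` and the stride-based layout are assumptions made for illustration only.

```python
def place_experts(n_experts: int, n_gpus: int, replicas: int) -> dict[int, set[int]]:
    """Map each expert id to the set of GPU ranks holding a copy.

    Copies of one expert are spread by a fixed stride so that they land
    on different ranks (hypothetical layout, requires replicas <= n_gpus).
    """
    if replicas > n_gpus:
        raise ValueError("cannot place more replicas than GPUs")
    stride = n_gpus // replicas
    return {
        expert: {(expert + r * stride) % n_gpus for r in range(replicas)}
        for expert in range(n_experts)
    }


def lost_experts(placement: dict[int, set[int]], failed: set[int]) -> list[int]:
    """Experts whose every copy sits on a failed rank."""
    return [e for e, ranks in placement.items() if ranks <= failed]


if __name__ == "__main__":
    placement = place_experts(n_experts=256, n_gpus=32, replicas=2)
    print("lost after rank 3 fails:", lost_experts(placement, {3}))
    print("lost after ranks 3 and 19 fail:", len(lost_experts(placement, {3, 19})))
```

In this toy layout a single failure never removes an expert, but two failures that happen to hold both copies of the same experts do. How many simultaneous failures a real deployment survives therefore depends on how the replicas are placed.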

What the Tests Showed

To test the solution under near-production conditions, the team ran the DeepSeek V3.2 model on four nodes – 32 GPUs in total – with 256 backup experts. This configuration allowed the system to survive the simultaneous failure of up to 16 processes.

During the experiment, some processes were forcibly terminated, and the recovery time was measured. The result: the service interruption lasted less than 10 seconds, compared with the 2–3 minutes required for a full restart. That cuts the interruption by more than 90%.
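The size of that improvement follows directly from the reported figures. The short calculation below takes 10 seconds as the upper bound on recovery time, so the real reduction is at least what it prints; the function name is illustrative.

```python
def downtime_reduction(recovery_s: float, restart_s: float) -> float:
    """Fractional reduction in interruption time versus a full restart."""
    return 1.0 - recovery_s / restart_s


# Full restart takes 2-3 minutes according to the reported test.
for restart_s in (120, 180):
    print(f"restart {restart_s}s -> {downtime_reduction(10, restart_s):.1%} shorter")
# prints:
# restart 120s -> 91.7% shorter
# restart 180s -> 94.4% shorter
```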

Moreover, when there are no failures, the system with Elastic EP performs on par with the standard approach. In other words, the added reliability comes with no performance penalty in normal operation.

Two Layers of Protection

Under the hood, the solution operates on two levels simultaneously.

The first is the scheduler level. This is the system's “gatekeeper”: it constantly monitors the status of all GPUs and, if one stops responding, immediately removes it from the task distribution queue. New requests are sent only to healthy resources – without any interruptions.
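One common way to build such a gatekeeper is a heartbeat timeout. The sketch below is an assumption about the general technique, not a description of how the SGLang scheduler does it: `HealthTracker`, `dispatch`, and the timeout value are hypothetical names and parameters.

```python
import time


class HealthTracker:
    """Tracks the last heartbeat per rank and reports which ranks are healthy."""

    def __init__(self, ranks, timeout_s: float, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock
        now = clock()
        self._last_seen = {rank: now for rank in ranks}
        self._evicted: set[int] = set()

    def heartbeat(self, rank: int) -> None:
        # Evicted ranks stay out: rejoining a failed rank is not modeled here.
        if rank not in self._evicted:
            self._last_seen[rank] = self._clock()

    def healthy_ranks(self) -> list[int]:
        now = self._clock()
        for rank, seen in self._last_seen.items():
            if now - seen > self._timeout_s:
                self._evicted.add(rank)
        return sorted(r for r in self._last_seen if r not in self._evicted)


def dispatch(request_id: int, tracker: HealthTracker) -> int:
    """Pick a healthy rank for a request (simple modulo choice)."""
    healthy = tracker.healthy_ranks()
    if not healthy:
        raise RuntimeError("no healthy ranks left")
    return healthy[request_id % len(healthy)]
```

The clock is injectable so the eviction logic can be exercised without waiting in real time. A production scheduler would also need to handle requests that were already in flight on the failed rank.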

The second is the expert parallelism level itself. Here, a more nuanced process takes place: the system dynamically reassigns experts from the failed GPUs to the surviving ones to ensure computations continue correctly from a mathematical standpoint. This helps avoid severe interruptions at the execution level.
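A rough sketch of what such a remapping might look like, assuming replicas of an expert hold identical weights so that routing to any surviving copy gives the same result. The function `rebuild_routing` and the least-loaded heuristic are illustrative assumptions, not the algorithm used in Elastic EP.

```python
def rebuild_routing(placement: dict[int, set[int]], failed: set[int]) -> dict[int, int]:
    """Choose one surviving rank per expert, preferring the least-loaded rank."""
    load: dict[int, int] = {}
    routing: dict[int, int] = {}
    for expert in sorted(placement):
        survivors = sorted(placement[expert] - failed)
        if not survivors:
            raise RuntimeError(f"expert {expert} has no surviving replica")
        # Ties go to the lowest rank because survivors is sorted.
        target = min(survivors, key=lambda rank: load.get(rank, 0))
        routing[expert] = target
        load[target] = load.get(target, 0) + 1
    return routing


if __name__ == "__main__":
    # Four experts, each replicated on two of four ranks (toy placement).
    placement = {0: {0, 2}, 1: {1, 3}, 2: {0, 2}, 3: {1, 3}}
    print(rebuild_routing(placement, failed={2}))
    # prints: {0: 0, 1: 1, 2: 0, 3: 3}
```

After rank 2 fails, experts 0 and 2 can only go to rank 0, while experts 1 and 3 are split between ranks 1 and 3. If every copy of some expert is gone, the sketch raises an error, since no remapping can restore correct outputs in that case.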

Together, these two mechanisms transform a fragile MoE system into a much more resilient structure.

Mooncake as the Communication Backbone

The Mooncake EP library plays a key role in this implementation, acting as a fault-tolerant communication layer between GPUs. It is responsible for fast data transfer between nodes, tracking failures, and rebuilding communication routes when hardware fails partially.

An important detail: the library is designed to be integrated into the existing SGLang infrastructure without extensive refactoring. This lowers the barrier to entry for those looking to add fault tolerance to their existing systems.

Additionally, within the same Elastic EP framework, the NVIDIA Dynamo team proposed an implementation based on their own communication backend, NIXL EP. This shows that the architecture is designed to be extensible, allowing different teams to plug in their own implementations on top of the general framework.

Why This Matters Beyond This Specific Project

MoE models are not an exotic concept. DeepSeek and a number of other large models use this exact architecture. As these models are increasingly deployed in production systems, the issue of infrastructure reliability becomes just as important as the quality of the model itself.

Until now, wide expert parallelism has been somewhat like walking a tightrope over a chasm without a safety net: it works well as long as everything is fine, but one slip means a total fall. Elastic EP provides that very safety net.

The question of full dynamic process recovery remains open – that is, the ability to automatically “return” a failed GPU to service without restarting the entire instance. According to the team, this functionality is under active development.

Nevertheless, the solution already implemented – reducing downtime from several minutes to mere seconds – fundamentally changes the reliability equation for systems where continuous operation is critical.

Original Title: Elastic EP in SGLang: Achieving Partial Failure Tolerance for DeepSeek MoE Deployments
Publication Date: Mar 25, 2026
LMSYS ORG lmsys.org A U.S.-based non-profit research organization studying scalable language models and distributed training systems.

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Analyzing the Original Publication and Writing the Text: Claude Sonnet 4.6 (Anthropic). The neural network studies the original material and generates a coherent text.
2. Translation into English: Gemini 2.5 Pro (Google DeepMind).
3. Text Review and Editing: Gemini 2.5 Flash (Google DeepMind). Correction of errors, inaccuracies, and ambiguous phrasing.
4. Preparing the Illustration Description: DeepSeek-V3.2 (DeepSeek). Generating a textual prompt for the visual model.
5. Creating the Illustration: FLUX.2 Pro (Black Forest Labs). Generating an image based on the prepared prompt.
