Published on April 2, 2026

How to handle single GPU failure in MoE systems

One GPU Failure Shouldn't Bring Down the Entire System

The Mooncake and Volcano Engine teams have integrated an elastic expert parallelism mechanism into the SGLang framework, allowing it to withstand partial failures without requiring a restart.

Event Source: LMSYS ORG (4–6 min read)

Large language models built on the MoE (Mixture of Experts) architecture do not push every input through one monolithic neural network. Instead, they contain multiple specialized subnetworks, and only a fraction of them are activated for each token. This saves computational resources but places special demands on how the hardware is organized.

To serve such models at an industrial scale, the standard approach is to use wide expert parallelism, where a single copy of the model is distributed across 32 or more GPUs. This allows for processing large streams of requests faster and more cheaply. The problem is that the more GPUs are involved, the higher the probability that at least one of them will fail. And in a classic deployment scheme, a single failed process brings down the entire inference instance.
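The scaling argument can be made concrete with a little arithmetic. The sketch below assumes that GPUs fail independently and with the same probability over some time window; the 1% figure is purely illustrative and does not come from the original post.

```python
def p_any_failure(n_gpus: int, p_single: float) -> float:
    """Probability that at least one of n_gpus independent devices fails.

    Assumes every device fails with the same probability p_single in the
    chosen time window, independently of the others (a simplification).
    """
    return 1.0 - (1.0 - p_single) ** n_gpus


if __name__ == "__main__":
    # Hypothetical 1% per-GPU failure probability per window.
    for n in (1, 8, 32, 64):
        print(f"{n:>3} GPUs -> {p_any_failure(n, 0.01):.3f}")
```

Under these assumptions, a 32-GPU instance sees at least one failure in roughly 27.5% of windows, even though each individual GPU is 99% reliable.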

The impact of single GPU failure on MoE systems

Why This Is a Serious Problem

Imagine you have a service running on 32 GPUs, and one of them fails. In a traditional setup, this means a full restart, with all the ensuing consequences: several minutes of downtime, a lost request queue, and a strain on the infrastructure. With high traffic volumes, even a couple of minutes of downtime translates to significant losses.

This is precisely the vulnerability that the Mooncake team, in collaboration with Volcano Engine, set out to address by integrating a mechanism called Elastic EP – elastic expert parallelism – into the SGLang framework.

Elastic EP concept: Redundant experts for fault tolerance

The Idea: Breaking the Rigid Link

In a standard setup, each “expert” (subnetwork) is rigidly tied to a specific GPU. If that GPU fails, the expert becomes unavailable, and the system cannot continue to operate.

Elastic EP changes this logic: experts are stored with redundancy, meaning some of them are replicated across multiple GPUs. If one of the devices fails, the system detects it, redistributes the load to the remaining GPUs, and continues processing requests – without a complete shutdown.

Simply put: the model loses a bit of “power,” but it doesn't stop.
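To illustrate the redundancy idea, here is a minimal sketch of a replicated expert placement and a check for whether the model can still be served after a set of GPUs fails. The function names and the strided layout are hypothetical and chosen for simplicity; this is not the placement policy used by Elastic EP or SGLang.

```python
def place_experts(n_experts: int, n_gpus: int, replicas: int) -> dict[int, list[int]]:
    """Map each expert id to `replicas` distinct GPU ranks (illustrative layout)."""
    if not 1 <= replicas <= n_gpus:
        raise ValueError("replicas must be between 1 and n_gpus")
    stride = n_gpus // replicas  # spacing between copies of the same expert
    return {
        e: [(e + r * stride) % n_gpus for r in range(replicas)]
        for e in range(n_experts)
    }


def surviving_replicas(placement: dict[int, list[int]], failed: set[int]) -> dict[int, list[int]]:
    """Drop failed GPU ranks from every expert's replica list."""
    return {e: [g for g in gpus if g not in failed] for e, gpus in placement.items()}


def still_servable(placement: dict[int, list[int]], failed: set[int]) -> bool:
    """True if every expert keeps at least one live copy."""
    return all(surviving_replicas(placement, failed).values())


if __name__ == "__main__":
    # Assumed numbers: 256 logical experts, each stored twice, on 32 GPUs.
    placement = place_experts(n_experts=256, n_gpus=32, replicas=2)
    print(still_servable(placement, {3}))        # one GPU lost
    print(still_servable(placement, {3, 19}))    # both copies of some experts lost
```

With this particular layout, whether the system survives depends on which GPUs fail, not only on how many: losing both GPUs that hold the two copies of an expert is fatal, while losing many GPUs that never share an expert is not.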

Elastic EP performance in DeepSeek V3.2 tests

What the Tests Showed

To test the solution under near-production conditions, the team ran the DeepSeek V3.2 model on four nodes – 32 GPUs in total – with 256 redundant (backup) experts. This configuration allowed the system to survive the simultaneous failure of up to 16 processes.

During the experiment, some processes were forcibly terminated and the recovery time was measured. The result: service was interrupted for less than 10 seconds, compared with the 2–3 minutes required for a full restart. That cuts downtime by more than 90% (10 seconds against 120–180 seconds is a reduction of roughly 92–94%).

Moreover, when there are no failures, the system with Elastic EP performs on par with the standard approach. In other words, the added reliability carries no performance penalty in normal operation.

Dual-layer protection for MoE system resilience

Two Layers of Protection

Under the hood, the solution operates on two levels simultaneously.

The first is the scheduler level. This is the system's “gatekeeper”: it constantly monitors the status of all GPUs and, if one stops responding, immediately removes it from the task distribution queue. New requests are sent only to healthy resources – without any interruptions.
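The scheduler-level behaviour can be sketched roughly as follows. This example assumes a heartbeat-with-timeout scheme and round-robin dispatch; the original post does not say how health is detected, and the class and method names here are hypothetical rather than part of SGLang.

```python
import time
from typing import Callable, Iterable


class HealthAwareScheduler:
    """Toy scheduler that dispatches only to ranks with a recent heartbeat."""

    def __init__(self, ranks: Iterable[int], timeout_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic):
        self.ranks = list(ranks)
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the logic can be tested without sleeping
        now = clock()
        self.last_seen = {r: now for r in self.ranks}
        self._cursor = 0

    def heartbeat(self, rank: int) -> None:
        """Record that `rank` reported in just now."""
        self.last_seen[rank] = self.clock()

    def healthy(self) -> list[int]:
        """Ranks whose last heartbeat is within the timeout."""
        now = self.clock()
        return [r for r in self.ranks if now - self.last_seen[r] <= self.timeout_s]

    def pick(self) -> int:
        """Round-robin over healthy ranks only."""
        alive = self.healthy()
        if not alive:
            raise RuntimeError("no healthy ranks left")
        rank = alive[self._cursor % len(alive)]
        self._cursor += 1
        return rank
```

A rank that stops sending heartbeats simply drops out of `healthy()` once the timeout passes, so new requests stop reaching it without any restart of the scheduler itself.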

The second is the expert parallelism level itself. Here, a more nuanced process takes place: the system dynamically reassigns experts from the failed GPUs to the surviving ones so that computations remain mathematically correct. This helps avoid severe interruptions at the execution level.
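One way to picture the expert-level step is as rebuilding a dispatch table: for every expert, pick a surviving GPU that holds a copy, preferring the least-loaded one. The sketch below is an assumption about how such a remapping could look, with a hypothetical `rebuild_dispatch` helper; it is not the algorithm used by Elastic EP.

```python
def rebuild_dispatch(placement: dict[int, list[int]], failed: set[int]) -> dict[int, int]:
    """Choose one surviving GPU per expert, balancing the number of experts per GPU.

    placement maps expert id -> GPU ranks that hold a copy of that expert.
    Raises RuntimeError if some expert has lost all of its copies.
    """
    load: dict[int, int] = {}      # experts currently assigned to each GPU
    dispatch: dict[int, int] = {}  # expert id -> GPU that will serve it
    for expert in sorted(placement):
        alive = [g for g in placement[expert] if g not in failed]
        if not alive:
            raise RuntimeError(f"expert {expert} has no surviving replica")
        # Least-loaded survivor wins; ties are broken by the lower rank.
        target = min(alive, key=lambda g: (load.get(g, 0), g))
        dispatch[expert] = target
        load[target] = load.get(target, 0) + 1
    return dispatch


if __name__ == "__main__":
    # Four experts, each replicated on two of four GPUs (illustrative data).
    placement = {0: [0, 2], 1: [1, 3], 2: [2, 0], 3: [3, 1]}
    print(rebuild_dispatch(placement, failed={2}))
```

Because every token is still routed to a copy of the same expert weights, the output of the layer is unchanged; only the physical location of the computation moves.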

Together, these two mechanisms transform a fragile MoE system into a much more resilient structure.

Mooncake library: Fault-tolerant communication for MoE models

Mooncake as the Communication Backbone

The Mooncake EP library plays a key role in this implementation, acting as a fault-tolerant communication layer between GPUs. It is responsible for fast data transfer between nodes, tracking failures, and rebuilding communication routes when hardware fails partially.

An important detail: the library is designed to be integrated into the existing SGLang infrastructure without extensive refactoring. This lowers the barrier to entry for those looking to add fault tolerance to their existing systems.

Additionally, within the same Elastic EP framework, the NVIDIA Dynamo team proposed an implementation based on their own communication backend, NIXL EP. This shows that the architecture is designed to be extensible, allowing different teams to plug in their own implementations on top of the general framework.

The importance of fault tolerance in MoE models

Why This Matters Beyond This Specific Project

MoE models are not an exotic concept. DeepSeek and a number of other large models use this exact architecture. As these models are increasingly deployed in production systems, the issue of infrastructure reliability becomes just as important as the quality of the model itself.

Until now, wide expert parallelism has been somewhat like walking a tightrope over a chasm without a safety net: it works well as long as everything is fine, but one slip means a total fall. Elastic EP provides that very safety net.

The question of full dynamic process recovery remains open – that is, the ability to automatically “return” a failed GPU to service without restarting the entire instance. According to the team, this functionality is under active development.

Nevertheless, the solution already implemented – reducing downtime from several minutes to mere seconds – fundamentally changes the reliability equation for systems where continuous operation is critical.


Link to Original: https://www.lmsys.org/blog/2026-03-25-eep-partial-failure-tolerance/

Original Title: Elastic EP in SGLang: Achieving Partial Failure Tolerance for DeepSeek MoE Deployments

Publication Date: Mar 25, 2026

LMSYS ORG (lmsys.org): A U.S.-based non-profit research organization studying scalable language models and distributed training systems.
