Large language models built on the Mixture of Experts (MoE) architecture are structured differently from dense models: instead of a single large neural network, they use many specialized subnetworks, only a fraction of which are activated for each token. This saves computational resources but requires the model to be spread across hardware in a particular way.
To serve such models at an industrial scale, the standard approach is to use wide expert parallelism, where a single copy of the model is distributed across 32 or more GPUs. This allows for processing large streams of requests faster and more cheaply. The problem is that the more GPUs are involved, the higher the probability that at least one of them will fail. And in a classic deployment scheme, a single failed process brings down the entire inference instance.
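The scaling of failure risk with GPU count can be illustrated with a short calculation. This is a minimal sketch that assumes failures are independent and that every GPU has the same failure probability over some time window; the 0.1% figure and the function name are illustrative assumptions, not measurements from the project.

```python
def prob_any_failure(num_gpus: int, p_single: float) -> float:
    """Probability that at least one of num_gpus devices fails,
    assuming independent failures with identical probability p_single."""
    return 1.0 - (1.0 - p_single) ** num_gpus


if __name__ == "__main__":
    # Hypothetical per-GPU failure probability over some fixed window.
    p = 0.001
    for n in (1, 8, 32, 64):
        print(f"{n:>2} GPUs -> P(at least one failure) = {prob_any_failure(n, p):.4f}")
```

Under these assumptions the risk grows almost linearly with the number of GPUs while the per-device probability is small, which is why wide deployments feel the problem first.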
Why This Is a Serious Problem
Imagine you have a service running on 32 GPUs, and one of them fails. In a traditional setup, this means a full restart, with all the ensuing consequences: several minutes of downtime, a lost request queue, and a strain on the infrastructure. With high traffic volumes, even a couple of minutes of downtime translates to significant losses.
This is precisely the vulnerability that the Mooncake team, in collaboration with Volcano Engine, set out to address by integrating a mechanism called Elastic EP – elastic expert parallelism – into the SGLang framework.
The Idea: Breaking the Rigid Link
In a standard setup, each “expert” (subnetwork) is rigidly tied to a specific GPU. If that GPU fails, the expert becomes unavailable, and the system cannot continue to operate.
Elastic EP changes this logic: experts are stored with redundancy, meaning some of them are replicated across multiple GPUs. If one of the devices fails, the system detects it, redistributes the load to the remaining GPUs, and continues processing requests – without a complete shutdown.
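The idea of redundant placement can be sketched in a few lines. This is an illustration only, not the SGLang or Mooncake implementation: the placement rule (replicas spaced evenly around the GPU ring) and the names `build_placement` and `route` are assumptions made for the example.

```python
def build_placement(num_experts: int, num_gpus: int, replicas: int) -> dict[int, list[int]]:
    """Map each expert id to the GPUs that hold a copy of it.

    Replicas of one expert are spaced num_gpus // replicas apart,
    so they always land on distinct devices.
    """
    if not 1 <= replicas <= num_gpus:
        raise ValueError("replicas must be between 1 and num_gpus")
    stride = num_gpus // replicas
    return {
        expert: [(expert + r * stride) % num_gpus for r in range(replicas)]
        for expert in range(num_experts)
    }


def route(expert: int, placement: dict[int, list[int]], failed: set[int]) -> int:
    """Return the first healthy GPU holding the expert, or raise if none is left."""
    for gpu in placement[expert]:
        if gpu not in failed:
            return gpu
    raise RuntimeError(f"all replicas of expert {expert} are on failed GPUs")


if __name__ == "__main__":
    placement = build_placement(num_experts=8, num_gpus=4, replicas=2)
    failed = {0}  # simulate the loss of GPU 0
    for expert in range(8):
        print(f"expert {expert} -> GPU {route(expert, placement, failed)}")
```

With two copies of every expert, losing GPU 0 in this toy setup leaves every expert reachable; losing both GPUs that hold the same expert would not, which is why the amount of redundancy determines how many failures can be survived.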
Simply put: the model loses a bit of “power,” but it doesn't stop.
What the Tests Showed
To test the solution under near-production conditions, the team ran the DeepSeek V3.2 model on four nodes – 32 GPUs in total – with 256 redundant experts (backup copies). This configuration allowed the system to survive the simultaneous failure of up to 16 processes.
During the experiment, some processes were forcibly terminated, and the recovery time was measured. The result: the service interruption was less than 10 seconds, compared to the 2–3 minutes required for a full restart. That cuts downtime by more than 90%.
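The comparison can be checked with simple arithmetic. The sketch below uses only the figures quoted above (10 seconds as the upper bound on recovery, 120 to 180 seconds for a full restart); the function name is an illustrative choice.

```python
def downtime_reduction(recovery_s: float, restart_s: float) -> float:
    """Fraction of downtime eliminated relative to a full restart."""
    return 1.0 - recovery_s / restart_s


if __name__ == "__main__":
    # 10 s is the reported upper bound; 120 s and 180 s span the 2-3 minute restart.
    for restart_s in (120.0, 180.0):
        saved = downtime_reduction(10.0, restart_s)
        print(f"restart {restart_s:.0f} s -> {saved:.1%} less downtime")
```

Depending on where in the 2–3 minute range the restart falls, the reduction works out to roughly 92% to 94%, and more if recovery takes less than the full 10 seconds.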
Moreover, when there are no failures, the performance of the system with Elastic EP matches that of the standard approach. In other words, the added reliability comes with no performance penalty in normal operation.
Two Layers of Protection
Under the hood, the solution operates on two levels simultaneously.
The first is the scheduler level. This is the system's “gatekeeper”: it constantly monitors the status of all GPUs and, if one stops responding, immediately removes it from the task distribution queue. New requests are sent only to healthy resources – without any interruptions.
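The scheduler-level behavior can be pictured as a heartbeat tracker combined with a dispatcher that only sees healthy ranks. This is a hypothetical sketch, not SGLang's scheduler code: the class `HealthTracker`, the function `dispatch`, the timeout-based detection, and the round-robin policy are all assumptions made for illustration.

```python
import time
from typing import Callable, Iterable


class HealthTracker:
    """Marks a rank unhealthy once its last heartbeat is older than timeout_s."""

    def __init__(self, ranks: Iterable[int], timeout_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        now = clock()
        self.last_seen = {rank: now for rank in ranks}

    def heartbeat(self, rank: int) -> None:
        """Record that the rank is alive right now."""
        self.last_seen[rank] = self.clock()

    def healthy(self) -> list[int]:
        """Return the sorted list of ranks seen within the timeout."""
        now = self.clock()
        return sorted(r for r, t in self.last_seen.items() if now - t <= self.timeout_s)


def dispatch(requests: list[str], healthy_ranks: list[int]) -> dict[str, int]:
    """Spread requests round-robin over healthy ranks only."""
    if not healthy_ranks:
        raise RuntimeError("no healthy ranks available")
    return {req: healthy_ranks[i % len(healthy_ranks)] for i, req in enumerate(requests)}


if __name__ == "__main__":
    fake_time = [0.0]  # controllable clock so the example is deterministic
    tracker = HealthTracker(range(4), timeout_s=5.0, clock=lambda: fake_time[0])
    fake_time[0] = 4.0
    for rank in (0, 1, 3):  # rank 2 stops sending heartbeats
        tracker.heartbeat(rank)
    fake_time[0] = 6.0
    print(dispatch(["req-a", "req-b", "req-c", "req-d"], tracker.healthy()))
```

Injecting the clock keeps the failure detection testable without real waiting; a production scheduler would more likely learn about failures from the communication layer than from a simple timeout.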
The second is the expert parallelism level itself. Here, a more nuanced process takes place: the system dynamically reassigns experts from the failed GPUs to the surviving ones so that computations remain mathematically correct. This helps avoid severe interruptions at the execution level.
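One simple way to picture the reassignment step is a greedy rebalance that hands each orphaned expert to the currently least-loaded surviving GPU. This is an assumed policy chosen for illustration, not the algorithm used by Elastic EP; the function name `reassign` is hypothetical, and the sketch ignores how expert weights actually reach their new device (in a redundant setup they would come from a surviving copy).

```python
import heapq


def reassign(assignment: dict[int, list[int]], failed: set[int]) -> dict[int, list[int]]:
    """Move experts from failed GPUs onto the least-loaded surviving GPUs.

    assignment maps a GPU id to the list of expert ids it serves.
    Returns a new mapping that contains only surviving GPUs.
    """
    survivors = {gpu: list(experts) for gpu, experts in assignment.items() if gpu not in failed}
    if not survivors:
        raise RuntimeError("no surviving GPUs to take over the experts")

    orphans = sorted(e for gpu in failed for e in assignment.get(gpu, []))

    # Min-heap keyed by current load, with the GPU id as a tie-breaker.
    heap = [(len(experts), gpu) for gpu, experts in survivors.items()]
    heapq.heapify(heap)
    for expert in orphans:
        load, gpu = heapq.heappop(heap)
        survivors[gpu].append(expert)
        heapq.heappush(heap, (load + 1, gpu))
    return survivors


if __name__ == "__main__":
    before = {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}
    after = reassign(before, failed={3})
    for gpu, experts in sorted(after.items()):
        print(f"GPU {gpu}: experts {experts}")
```

Every expert remains served after the rebalance, so the router can still reach any expert the model selects; the cost is that the surviving GPUs each carry a slightly larger share of the work.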
Together, these two mechanisms transform a fragile MoE system into a much more resilient structure.
Mooncake as the Communication Backbone
The Mooncake EP library plays a key role in this implementation, acting as a fault-tolerant communication layer between GPUs. It is responsible for fast data transfer between nodes, tracking failures, and rebuilding communication routes when hardware fails partially.
An important detail: the library is designed to be integrated into the existing SGLang infrastructure without extensive refactoring. This lowers the barrier to entry for those looking to add fault tolerance to their existing systems.
Additionally, within the same Elastic EP framework, the NVIDIA Dynamo team proposed an implementation based on their own communication backend, NIXL EP. This shows that the architecture is designed to be extensible, allowing different teams to plug in their own implementations on top of the general framework.
Why This Matters Beyond This Specific Project
MoE models are not an exotic concept. DeepSeek and a number of other large models use this exact architecture. As these models are increasingly deployed in production systems, the issue of infrastructure reliability becomes just as important as the quality of the model itself.
Until now, wide expert parallelism has been somewhat like walking a tightrope over a chasm without a safety net: it works well as long as everything is fine, but one slip means a total fall. Elastic EP provides that very safety net.
The question of full dynamic process recovery remains open – that is, the ability to automatically “return” a failed GPU to service without restarting the entire instance. According to the team, this functionality is under active development.
Nevertheless, the solution already implemented – reducing downtime from several minutes to mere seconds – fundamentally changes the reliability equation for systems where continuous operation is critical.