Published on March 10, 2026

Understanding Numerical Divergence in MoE Language Models

When «Same» and «Same Result» Are Not the Same: Numerical Divergence in MoE Models

Same weights, same prompt – yet the results differ slightly. Why this happens and why it matters for neural network training.

Development · 5 – 7 min read
Event Source: Fireworks AI

Imagine solving the same math problem twice on a calculator and getting different answers. It sounds like a glitch. But this is exactly what happens with modern language models when the same calculations are carried out in a slightly different way on different systems. The cause is not faulty weights or different data: the numbers are simply added in a different order.

The Fireworks AI team has published a detailed breakdown of exactly where these discrepancies arise – and why they matter. We are talking about so-called MoE models (Mixture of Experts), which include the likes of Kimi K2.5, Qwen3.5-MoE, and DeepSeek V3.

Architecture of MoE Models and Computational Complexity

What is MoE and Why It's More Complex

Put simply, a regular language model processes every token (roughly a word or a piece of a word) through the same computational blocks. MoE models are structured differently: they have several «experts» – individual sub-networks – and for each token, the system dynamically chooses which ones to use. This allows the model to scale without a proportional increase in computational load.
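The routing idea can be sketched in a few lines. This is a minimal illustration, not the implementation of any particular model: the toy experts, the logit values, and the choice to renormalize the weights of the selected experts are all assumptions made for the example.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # assumption: weights are renormalized over the top k
    return [(i, probs[i] / norm) for i in top]

def moe_layer(x, router_logits, experts, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))

# Hypothetical toy experts: each one is just a scalar function.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
router_logits = [0.1, 2.0, -1.0, 1.5]

selected = route_top_k(router_logits, k=2)
output = moe_layer(3.0, router_logits, experts, k=2)
print(selected, output)
```

Only two of the four experts run for this token, which is where the savings in computation come from.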

However, this architecture makes the model's behavior more sensitive to numerical errors. If a minor calculation drift changes the choice of «experts», the subsequent chain of calculations follows a different path. A tiny error at the start is amplified across each of the dozens of layers.
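A small sketch of that sensitivity, using made-up router scores in which two experts are nearly tied. The size of the drift is a hypothetical stand-in for accumulated rounding differences, not a measured value.

```python
def top_k(scores, k):
    """Indices of the k largest scores, returned in ascending index order."""
    best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(best)

# Hypothetical router scores: experts 2 and 3 are almost tied.
router_logits = [0.2, 1.3, 0.7500001, 0.75, -0.4]
# Hypothetical numerical drift on a single score.
drift = [0.0, 0.0, -2e-7, 0.0, 0.0]
perturbed = [s + d for s, d in zip(router_logits, drift)]

before = top_k(router_logits, k=2)
after = top_k(perturbed, k=2)
print(before, after)  # [1, 2] [1, 3]
```

A change in the seventh decimal place is enough to send the token to a different expert, and from that point on the two runs compute different functions.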

Causes of Numerical Divergence in Floating Point Calculations

Where the Divergence Comes From

The root of the problem lies in a property of numerical computation that usually stays behind the scenes: floating-point addition is not associative in a strict mathematical sense. That is, (a + b) + c and a + (b + c) can yield different results – not because of an error, but because of how numbers are rounded at each intermediate step.
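The effect is easy to reproduce in ordinary double-precision arithmetic. The three values below are arbitrary; the same thing happens, more often and more visibly, in the lower-precision formats used for model inference.

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # a + b is rounded first, then c is added
right = a + (b + c)  # b + c is rounded first, then a is added

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```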

In everyday life, this is negligible. But in a model running calculations through 61 layers, such errors accumulate. And if the model «saw» numbers added in one order during training but they are added in another during production deployment – the results start to diverge.

Sources of Calculation Discrepancies in AI Inference

Three Places Where the Order Changes

Fireworks engineers identified three main sources of such discrepancies.

First is how different systems synchronize calculations across multiple GPUs. When a model runs on several GPUs at once, the partial results from each must be summed. The standard tool for this (NCCL) does it in one order, while optimized inference kernels do it in another. Mathematically, both are correct. Numerically, they do not give identical results.
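As an illustration, here are two ways to sum one partial result per GPU. The function names and the two orders are assumptions chosen for the sketch; which order a given library or kernel actually uses is not stated in the source. The magnitudes are deliberately exaggerated so that the difference shows up even in double precision.

```python
import math
from functools import reduce

def sequential_reduce(parts):
    """Add partial results one after another: ((p0 + p1) + p2) + p3."""
    return reduce(lambda acc, p: acc + p, parts)

def tree_reduce(parts):
    """Add neighbours pairwise, then add the pair sums: (p0 + p1) + (p2 + p3)."""
    parts = list(parts)
    while len(parts) > 1:
        paired = [parts[i] + parts[i + 1] for i in range(0, len(parts) - 1, 2)]
        if len(parts) % 2:
            paired.append(parts[-1])  # odd element carries over to the next round
        parts = paired
    return parts[0]

# Hypothetical partial sums, one per GPU.
partials = [1e16, 1.0, -1e16, 1.0]

seq = sequential_reduce(partials)
tree = tree_reduce(partials)
exact = math.fsum(partials)  # correctly rounded sum, for comparison
print(seq, tree, exact)      # 1.0 0.0 2.0
```

Both orders are legitimate ways to compute the same sum, yet they disagree with each other and with the exact answer.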

The second source is combining several operations into one (so-called «fusion kernels»). To save memory and speed up calculations, inference engine developers often merge several sequential operations into one. This changes the internal summation order, and the normalization that follows the addition receives a slightly different number as input.
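A rough sketch of how a regrouped addition reaches the normalization step. The input values, the RMS-style normalization, and the specific regrouping attributed to the fused path are assumptions for illustration; real kernels differ in their details.

```python
import math

def rms_norm(vec, eps=1e-6):
    """Divide every element by the root mean square of the vector."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]

# Hypothetical per-element inputs: two partial results and a residual value.
partial_a = [0.1, 0.5, -0.25]
partial_b = [0.2, 0.25, 0.5]
residual = [0.3, 1.0, 0.75]

# Unfused path: the partial results are combined first, then the residual is added.
unfused = [(a + b) + r for a, b, r in zip(partial_a, partial_b, residual)]
# Fused path (assumed ordering): the residual is folded in before the last partial.
fused = [a + (b + r) for a, b, r in zip(partial_a, partial_b, residual)]

norm_unfused = rms_norm(unfused)
norm_fused = rms_norm(fused)
max_gap = max(abs(u - f) for u, f in zip(norm_unfused, norm_fused))
print(unfused[0], fused[0], max_gap)
```

Only the first element of the sum differs here, and only in its last bit, but that is already a different input to the normalization.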

The third is the specific nature of MoE layers, where several operations are fused into a single kernel: weighted summation of expert outputs, multi-GPU synchronization, and normalization for the next block. This type of layer is present in 58 out of the model's 61 layers, and errors here accumulate with particular intensity.

Impact of Numerical Errors on RLHF and Model Fine-tuning

Why It Goes Unnoticed – and Why It's Still a Problem ⚠️

The most frustrating part of these discrepancies is that they don't break the model in an obvious way. Text is generated normally, and answers look reasonable. The divergence is only discovered through a precise comparison of the probabilities the model assigns to each subsequent token.

For the average user, this is likely completely invisible. But for systems that fine-tune models based on feedback (such as RLHF, reinforcement learning from human feedback), this is fundamentally important. Such systems use a «reference» version of the model to compare new behaviors against. If the inference version yields slightly different probabilities than the training version, the fine-tuning system receives a distorted signal and may start optimizing for the «wrong thing».
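To make the distorted signal concrete, here is a sketch with invented log-probabilities for the same four sampled tokens, computed from the same weights by two stacks. The numbers are hypothetical, and the quantities shown (per-token gap and the ratio derived from it) are generic ingredients of such training objectives, not the formula of a specific system.

```python
import math

# Hypothetical per-token log-probabilities of the same tokens, same weights.
logp_training = [-1.2040, -0.3567, -2.3026, -0.1054]
logp_inference = [-1.2040, -0.3571, -2.2950, -0.1054]

# With identical numerics every gap would be 0 and every ratio exactly 1.
per_token_gap = [t - i for t, i in zip(logp_training, logp_inference)]
ratio = [math.exp(g) for g in per_token_gap]

# The mean gap is what a comparison against the reference would report,
# even though the model itself has not changed at all.
spurious_signal = sum(per_token_gap) / len(per_token_gap)
print(ratio, spurious_signal)
```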

In short: «The model isn't broken, but its exact copy for training purposes is no longer quite a copy.»

Analyzing Numerical Divergence in Qwen3.5-MoE Image Tokens

The Case of Qwen3.5-MoE: Where Divergence Became Visible

The case of the Qwen3.5-MoE model when working with images proved particularly telling. For text tokens, the divergence remained small. But for tokens representing images, the divergence metric (the authors use a variant of KL divergence – a measure of the difference between two probability distributions) grew by about 60 times.
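For reference, the textbook form of KL divergence between two next-token distributions can be computed as follows. The source says the authors use a variant of this metric, so treat the sketch as the general idea rather than their exact formula; the logit values are hypothetical.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def kl_divergence(logits_p, logits_q):
    """KL(P || Q) in nats for two distributions given as logits."""
    log_p = log_softmax(logits_p)
    log_q = log_softmax(logits_q)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

reference_logits = [2.0, 0.5, -1.0, 0.0]   # hypothetical values
drifted_logits = [2.0, 0.52, -1.0, -0.01]  # same position, slightly different numerics

kl_same = kl_divergence(reference_logits, reference_logits)
kl_drift = kl_divergence(reference_logits, drifted_logits)
print(kl_same, kl_drift)
```

Identical distributions give exactly zero; any drift in the logits, beyond a uniform shift, gives a small positive value that can be tracked per token.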

The reason turned out to be where exactly rounding occurs during the summation of expert outputs. In the reference implementation, each expert's contribution was rounded to a less precise format before addition. In the optimized Fireworks version, everything was first added in a more precise format, and rounding happened only at the end. Mathematically, both approaches are correct. But since the first variant served as the reference, the different numbers produced by the second counted as divergence.
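The two rounding placements can be compared directly. The sketch below assumes bfloat16 as the less precise format and uses Python floats as the more precise one; neither format is named in the text above. The weights and expert outputs are hypothetical values chosen so that the contributions nearly cancel, which makes the rounding visible.

```python
import struct

def to_bf16(x):
    """Round a value to bfloat16 (round-to-nearest-even) and return it as a float.
    Sketch only: no special handling of NaN, infinity, or overflow."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

router_weights = [0.5, 0.5]
expert_outputs = [2.01171875, -2.0]

contributions = [w * y for w, y in zip(router_weights, expert_outputs)]

# Variant A: round each contribution first, then add.
round_then_add = to_bf16(sum(to_bf16(c) for c in contributions))
# Variant B: add in higher precision, round once at the end.
add_then_round = to_bf16(sum(contributions))

print(round_then_add)  # 0.0078125
print(add_then_round)  # 0.005859375
```

Neither result is wrong. They are two valid roundings of the same mathematical expression, and a training pipeline that expects one of them will register the other as drift.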

When the MoE blocks in the optimized version were replaced with reference ones, the divergence dropped to zero. This allowed for the exact localization of the problem's source.

Practical Implications for MoE Model Deployment and Accuracy

What This Implies

For developers and researchers working on fine-tuning or deploying MoE models, the takeaways from this breakdown are quite practical.

«The same math» does not mean «the same numbers» – and this must be verified explicitly, not taken on faith. Simply checking if the model «generates reasonable text» is not enough if you are using an inference engine as part of a training pipeline.
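An explicit check can be as simple as comparing per-token log-probabilities from the two stacks against a tolerance. The function name, the tolerance, and the sample values below are hypothetical; a suitable threshold depends on the model and the task.

```python
def check_parity(logp_training, logp_inference, atol):
    """Return (ok, worst_index, worst_gap) for per-token log-probabilities."""
    if len(logp_training) != len(logp_inference) or not logp_training:
        raise ValueError("expected two non-empty sequences of equal length")
    gaps = [abs(t - i) for t, i in zip(logp_training, logp_inference)]
    worst = max(range(len(gaps)), key=lambda idx: gaps[idx])
    return gaps[worst] <= atol, worst, gaps[worst]

# Hypothetical log-probabilities for the same three tokens.
training = [-0.51, -1.90, -0.02]
inference = [-0.51, -1.95, -0.02]

ok, worst_token, worst_gap = check_parity(training, inference, atol=0.01)
print(ok, worst_token, worst_gap)
```

Reporting the position of the worst token, not just a pass or fail, makes it easier to trace the gap back to a specific layer or input type.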

The authors also emphasize that a single «turn off all optimizations» button isn't the best solution, as different tasks require different trade-offs. Accuracy is vital for RLHF. Speed is vital for production inference. The ideal scenario involves granular settings that allow for choosing the right balance.

Finally, this is another reminder that MoE models – despite their efficiency – introduce an additional layer of fragility related to the discrete choice of experts. A small numerical deviation at the input can change which experts are activated – and this choice then propagates through all subsequent layers.

The problem is neither unique nor catastrophic. But it perfectly illustrates how engineering compromises made for the sake of performance can come back to haunt you where you least expect it.

Original Title: Training-Inference Parity in MoE Models: Where Numerics Drift
Publication Date: Mar 10, 2026
Fireworks AI fireworks.ai U.S.-based AI infrastructure company from Redwood City building platforms for running, fine-tuning, and scaling generative models with high-performance inference.

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Analyzing the Original Publication and Writing the Text (Claude Sonnet 4.6, Anthropic)

The neural network studies the original material and generates a coherent text

2. Translation into English (Gemini 3 Pro, Google DeepMind)

3. Text Review and Editing (Gemini 3 Pro, Google DeepMind)

Correction of errors, inaccuracies, and ambiguous phrasing

4. Preparing the Illustration Description (DeepSeek-V3.2, DeepSeek)

Generating a textual prompt for the visual model

5. Creating the Illustration (FLUX.2 Pro, Black Forest Labs)

Generating an image based on the prepared prompt
