Imagine solving the same math problem twice on a calculator and getting different answers. It sounds like a glitch. But this is exactly what happens with modern language models when the same calculations are performed slightly differently on different systems. The cause is not corrupted weights or different data: the numbers are simply added in a different order.
The Fireworks AI team has published a detailed breakdown of exactly where these discrepancies arise – and why they matter. We are talking about so-called MoE models (Mixture of Experts), which include the likes of Kimi K2.5, Qwen3.5-MoE, and DeepSeek V3.
What is MoE and Why It's More Complex
Put simply, a regular language model processes every token (roughly, a word or a fragment of one) through the same computational blocks. MoE models are structured differently: they have several "experts" (individual sub-networks), and for each token the system dynamically chooses which ones to use. This lets the model grow in size without a proportional increase in computational load.
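A minimal sketch of this routing idea, assuming a toy setup in which each expert is a plain scalar function and the router scores are given rather than computed. The names `moe_forward` and `router_logits` are illustrative and do not come from any specific model.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(token, router_logits, experts, k=2):
    """Route one token to its top-k experts and mix their outputs."""
    # Pick the k experts with the highest router scores.
    top = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)[:k]
    # Renormalize the gate weights over the selected experts only.
    gates = softmax([router_logits[i] for i in top])
    # Only the chosen experts run; the others cost nothing for this token.
    output = sum(g * experts[i](token) for g, i in zip(gates, top))
    return top, output

# Four toy "experts" (hypothetical): each is just a scalar function here.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
chosen, y = moe_forward(3.0, router_logits=[0.1, 2.0, -1.0, 1.0], experts=experts, k=2)
print(chosen, y)
```

Here only experts 1 and 3 are evaluated for the token, which is where the savings come from: the layer holds four experts' worth of parameters but pays for two.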
However, this architecture makes the model's behavior more sensitive to numerical errors. If a minor calculation drift changes the choice of experts, the subsequent chain of calculations follows a different path. A tiny error at the start can then be amplified in each of the dozens of layers that follow.
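To illustrate how small such a drift can be, here is a sketch with made-up router scores in which two experts are almost tied. The numbers are hypothetical; the point is only that a change in the seventh decimal place is enough to flip a discrete choice.

```python
def top_k(scores, k):
    """Indices of the k largest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Hypothetical router scores for one token; experts 1 and 2 are almost tied.
scores_training = [0.10, 0.7000001, 0.7000000, 0.05]
# The same scores after a tiny drift on one entry, e.g. from a different summation order.
scores_inference = [0.10, 0.7000001, 0.7000003, 0.05]

picked_training = top_k(scores_training, k=1)
picked_inference = top_k(scores_inference, k=1)
print(picked_training, picked_inference)  # [1] [2]
```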
Where the Divergence Comes From
The root of the problem lies in a property of numerical computation that usually stays behind the scenes: floating-point addition is not associative in the strict mathematical sense. That is, (a + b) + c and a + (b + c) can yield different results, not because of an error, but because the result is rounded at each intermediate step.
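The effect is easy to reproduce with ordinary double-precision numbers. The snippet below is a generic illustration, not taken from the Fireworks write-up:

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # rounded after a + b, then again after adding c
right = a + (b + c)   # rounded after b + c, then again after adding a

print(left)            # 0.6000000000000001
print(right)           # 0.6
print(left == right)   # False: the grouping changes the result
print(a + b == b + a)  # True: swapping the two operands does not
```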
In everyday arithmetic, the effect is negligible. But in a model that passes its calculations through 61 layers, such errors accumulate. And if the numbers were added in one order during training and in another during production deployment, the results start to diverge.
Three Places Where the Order Changes
Fireworks engineers identified three main sources of such discrepancies.
The first is how different systems synchronize calculations across multiple GPUs. When a model runs on several GPUs at once, the partial results from each GPU must be summed. The standard tool for this (NCCL) sums them in one order, while kernels optimized for faster inference sum them in another. Mathematically, both are correct. Numerically, they do not produce identical results.
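The following toy example shows how two reduction orders can disagree. It is not how NCCL or any particular inference kernel is implemented: `sequential_reduce` and `tree_reduce` are hypothetical stand-ins for "one order" and "another order", and the values are chosen to exaggerate the effect. float32 rounding is simulated with the standard `struct` module.

```python
import struct

def f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def sequential_reduce(parts):
    """Sum left to right, rounding to float32 after every addition."""
    acc = f32(parts[0])
    for p in parts[1:]:
        acc = f32(acc + f32(p))
    return acc

def tree_reduce(parts):
    """Sum neighbouring pairs first, then sum the pair results."""
    vals = [f32(p) for p in parts]
    while len(vals) > 1:
        paired = [f32(vals[i] + vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            paired.append(vals[-1])  # odd element carries over unchanged
        vals = paired
    return vals[0]

# One partial result per GPU (hypothetical values); the exact sum is 2.0.
partials = [1e8, 1.0, -1e8, 1.0]

print(sequential_reduce(partials))  # 1.0
print(tree_reduce(partials))        # 0.0
```

Neither order recovers the exact sum here, and the two orders do not agree with each other either. In real workloads the gap is far smaller, but it is the same mechanism.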
The second source is combining several operations into one (so-called "fused kernels"). To save memory and speed up calculations, inference engine developers often merge several sequential operations into a single kernel. This changes the internal summation order, and the normalization that follows the addition receives a slightly different number as input.
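A sketch of that effect, assuming a hypothetical fused kernel that folds the residual into the accumulator before adding the per-GPU partial results. float16 is used only because the standard library can round to it; the formats and the order used by real kernels may differ. The RMSNorm here has no learned scale.

```python
import math
import struct

def f16(x):
    """Round a Python float to the nearest float16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def rms_norm(vec, eps=1e-6):
    """RMSNorm without a learned scale: divide by the root mean square."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]

# Hypothetical values: a residual stream and one partial result per GPU.
residual = [2048.0, 1.0]
partial_gpu0 = [0.75, 0.5]
partial_gpu1 = [0.75, 0.25]

# Unfused: sum the partials first, then add the residual.
reduced = [f16(a + b) for a, b in zip(partial_gpu0, partial_gpu1)]
h_unfused = [f16(r + s) for r, s in zip(residual, reduced)]

# Fused (one possible order): start from the residual, then add each partial.
acc = [f16(r + a) for r, a in zip(residual, partial_gpu0)]
h_fused = [f16(x + b) for x, b in zip(acc, partial_gpu1)]

print(h_unfused, h_fused)  # [2050.0, 1.75] [2048.0, 1.75]
print(rms_norm(h_unfused), rms_norm(h_fused))
```

Only the first component of the sum differs, but because the normalization divides every component by the same root mean square, the second component of the normalized output changes as well.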
The third is the specific nature of MoE layers, where several operations are fused into a single kernel: weighted summation of expert outputs, multi-GPU synchronization, and normalization for the next block. This type of layer is present in 58 out of the model's 61 layers, and errors here accumulate with particular intensity.
Why It Goes Unnoticed – and Why It's Still a Problem
The most frustrating part of these discrepancies is that they don't break the model in an obvious way. Text is generated normally, and answers look reasonable. The divergence is only discovered through a precise comparison of the probabilities the model assigns to each subsequent token.
For the average user, this is most likely invisible. But for systems that fine-tune models based on feedback (for example, RLHF), it is fundamentally important. Such systems use a "reference" version of the model to compare new behaviors against. If the inference version yields slightly different probabilities than the training version, the fine-tuning system receives a distorted signal and may start optimizing for the wrong thing.
In short: "The model isn't broken, but its exact copy for training purposes is no longer quite a copy."
The Case of Qwen3.5-MoE: Where Divergence Became Visible
The Qwen3.5-MoE model proved particularly telling when working with images. For text tokens, the divergence remained small. But for tokens representing images, the divergence metric (the authors use a variant of KL divergence, a measure of the difference between two probability distributions) grew roughly 60-fold.
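For readers unfamiliar with the metric, here is a sketch of plain KL divergence between two next-token distributions. The article does not specify which variant the authors use, so this is the textbook form, and the three-token distributions are made up for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical next-token probabilities over a three-token vocabulary.
reference = [0.70, 0.20, 0.10]
close = [0.69, 0.21, 0.10]
further = [0.60, 0.28, 0.12]

print(kl_divergence(reference, reference))  # 0.0 for identical distributions
print(kl_divergence(reference, close))
print(kl_divergence(reference, further))
```

The value is zero only when the two distributions match exactly and grows as they drift apart, which makes it a convenient per-token measure of how far an inference engine has moved from the reference.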
The reason turned out to be where exactly rounding occurs during the summation of expert outputs. In the reference implementation, each expert's contribution was rounded to a less precise format before addition. In the optimized Fireworks version, everything was first added in a more precise format, and rounding happened only at the end. Mathematically, both approaches are correct. But since the first variant was the benchmark, the second variant produced different numbers.
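The two rounding strategies can be sketched as follows. This is an illustration under assumptions, not the code of either implementation: float16 stands in for the "less precise format", Python's float64 for the "more precise" one, and the gate weights and expert outputs are invented.

```python
import struct

def f16(x):
    """Round a Python float to the nearest float16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Hypothetical gate weights and expert outputs for a single hidden unit.
gates = [0.5, 0.25, 0.25]
expert_outputs = [2.0008, 4.0016, 4.0016]
contributions = [g * y for g, y in zip(gates, expert_outputs)]  # about 1.0004 each

# Reference-style: round every contribution to low precision, then add.
reference_sum = f16(sum(f16(c) for c in contributions))

# Optimized-style: add in high precision, round once at the end.
optimized_sum = f16(sum(contributions))

print(reference_sum)  # 3.0
print(optimized_sum)  # 3.001953125
```

The second result is closer to the exact sum of about 3.0012, which is the sense in which both approaches are defensible. It is still a different number from the one the reference path produces.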
When the MoE blocks in the optimized version were replaced with reference ones, the divergence dropped to zero. This allowed for the exact localization of the problem's source.
What This Implies
For developers and researchers working on fine-tuning or deploying MoE models, the takeaways from this breakdown are quite practical.
"The same math" does not mean "the same numbers", and this must be verified explicitly, not taken on faith. Simply checking that the model "generates reasonable text" is not enough if you are using an inference engine as part of a training pipeline.
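One way such an explicit check might look is sketched below. The helper `max_logprob_gap`, the sample log-probabilities, and the tolerance are all hypothetical; in practice the values would come from scoring the same tokens with the training stack and with the inference engine, and the acceptable gap depends on the pipeline.

```python
def max_logprob_gap(reference_logprobs, candidate_logprobs):
    """Largest absolute per-token difference between two log-prob sequences."""
    if len(reference_logprobs) != len(candidate_logprobs):
        raise ValueError("sequences must cover the same tokens")
    return max(abs(r - c) for r, c in zip(reference_logprobs, candidate_logprobs))

# Hypothetical per-token log-probabilities for one prompt from two engines.
trainer_logprobs = [-0.105, -2.303, -0.511, -1.204]
inference_logprobs = [-0.105, -2.310, -0.511, -1.190]

TOLERANCE = 1e-2  # assumed threshold, not a recommended value
gap = max_logprob_gap(trainer_logprobs, inference_logprobs)
print(f"max gap = {gap:.4f}, within tolerance: {gap <= TOLERANCE}")
```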
The authors also emphasize that a single "turn off all optimizations" button isn't the best solution, as different tasks require different trade-offs. Accuracy is vital for RLHF. Speed is vital for production inference. The ideal scenario involves granular settings that allow for choosing the right balance.
Finally, this is another reminder that MoE models – despite their efficiency – introduce an additional layer of fragility related to the discrete choice of experts. A small numerical deviation at the input can change which experts are activated – and this choice then propagates through all subsequent layers.
The problem is neither unique nor catastrophic. But it perfectly illustrates how engineering compromises made for the sake of performance can come back to haunt you where you least expect it.