When it comes to training large language models, the main constraints are time and resources. Modern models require a vast amount of computation, so any acceleration of this process, even by just a few percent, translates into real savings in time, energy, and money.
The PyTorch developer team recently announced a result that is anything but ordinary: the training of the Llama 4 Scout model was accelerated by 30.2% compared to the standard approach, all without any loss in quality. The model converged to the same results as it did in the conventional training mode.
It's All About the Number Format
It might sound a bit surprising, but the key change wasn't the training algorithm or the model's architecture. The difference lies in how numbers are stored and processed during computation.
During training, neural networks operate on a vast number of numerical values. The format in which these numbers are represented directly affects the speed and precision of the computations. The format most widely used today is BF16 (short for "bfloat16"), which stores each value in 16 bits. It provides sufficient precision and has long been the de facto standard for training large models.
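To make the idea of a number format concrete, here is a minimal Python sketch that rounds a value to the nearest bfloat16. It relies on the fact that bfloat16 keeps the top 16 bits of a 32-bit float (1 sign bit, 8 exponent bits, 7 mantissa bits). The function name to_bf16 is purely illustrative, and special values such as NaN and infinity are not handled.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    # Reinterpret the value as the 32 bits of an IEEE float32.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add a rounding bias so that dropping the low 16 bits rounds to nearest,
    # with ties going to the value whose last kept bit is 0.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for value in (1.0, 3.14159, 0.1, 1000.123):
    print(f"{value!r:>10} -> {to_bf16(value)!r}")
```

With only 7 mantissa bits, bfloat16 keeps roughly two to three significant decimal digits, while its 8 exponent bits give it the same range as a 32-bit float.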
The new approach uses the MXFP8 format, a more compact representation that stores each value in 8 bits, so it takes up less memory and is processed faster. To put it simply: if BF16 is like working with numbers with two decimal places, MXFP8 is like working with one, but with a clever scaling system that prevents the loss of important information. Values are grouped into small blocks (32 values each in the MX specification), and each block carries its own shared scaling factor.
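The scaling idea behind MXFP8 can be sketched in plain Python. This is an illustration under stated assumptions, not the TorchAO implementation: blocks of 32 values, one shared power-of-two scale per block, FP8 E4M3 elements, and a simple rule that derives the scale from the largest magnitude in the block. All function names are hypothetical.

```python
import math

E4M3_MAX = 448.0     # largest finite FP8 E4M3 value
E4M3_EMAX = 8        # exponent of that value (448 = 1.75 * 2**8)
E4M3_MIN_EXP = -6    # smallest normal exponent; smaller values are subnormal
BLOCK_SIZE = 32      # number of elements that share one scale


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest FP8 E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)
    # frexp gives a = m * 2**e with 0.5 <= m < 1, so floor(log2(a)) = e - 1.
    exp = max(math.frexp(a)[1] - 1, E4M3_MIN_EXP)
    step = 2.0 ** (exp - 3)  # 3 mantissa bits: 8 steps per power of two
    return math.copysign(min(round(a / step) * step, E4M3_MAX), x)


def quantize_block(block):
    """Return (shared_exponent, fp8_elements) for one block of floats."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Power-of-two scale that moves the largest element near the top of
    # the E4M3 range, clamped to the range of an 8-bit exponent-only scale.
    shared_exp = (math.frexp(amax)[1] - 1) - E4M3_EMAX
    shared_exp = max(-127, min(127, shared_exp))
    scale = 2.0 ** shared_exp
    return shared_exp, [round_to_e4m3(v / scale) for v in block]


def dequantize_block(shared_exp, elements):
    """Reconstruct approximate floats from a quantized block."""
    scale = 2.0 ** shared_exp
    return [e * scale for e in elements]


# Two blocks with very different magnitudes each get their own scale.
small = [1e-4 * (i + 1) for i in range(BLOCK_SIZE)]
large = [50.0 * (i + 1) for i in range(BLOCK_SIZE)]
for name, block in (("small", small), ("large", large)):
    exp, q = quantize_block(block)
    restored = dequantize_block(exp, q)
    worst = max(abs(r - v) / abs(v) for r, v in zip(restored, block))
    print(f"{name}: shared scale 2**{exp}, worst relative error {worst:.3f}")
```

Because the scale is chosen per block rather than per tensor, a block of tiny values and a block of large values both make use of the narrow FP8 range. A single global scale would push one of them toward zero or into saturation.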
The main challenge with such "truncated" formats is the risk of losing precision during training. A model's performance can degrade if numbers are rounded too coarsely. This is precisely why transitioning to MXFP8 has been a non-trivial task, especially for complex architectures.
Llama 4 Scout Is More Than Just a Large Model
A key detail: Llama 4 Scout belongs to the class of so-called MoE (Mixture of Experts) models. This is an architecture where the model doesn't activate all of its parameters at once. Instead, for each token it processes, only a subset of "experts" (specialized blocks within the model) is engaged.
This approach allows for the creation of very large models without a proportional increase in computational cost. However, it also creates additional challenges when working with non-standard number formats: the load is distributed unevenly, making it harder to maintain computational stability.
This is why bringing MXFP8 to an MoE architecture took extra engineering: the team had to develop specialized tools to ensure it worked correctly.
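To illustrate how expert routing leads to uneven load, here is a small, self-contained Python sketch of top-k routing. It is a toy model, not the Llama 4 Scout router: the number of experts, the value of top_k, the random router scores, and the per-expert bias are all made up for the example.

```python
import math
import random
from collections import Counter


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, top_k):
    """Pick the top_k experts for one token and renormalize their weights."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


random.seed(0)
NUM_EXPERTS, TOP_K, NUM_TOKENS = 8, 2, 1000  # illustrative values only

# Hypothetical bias: pretend the router slightly prefers higher-numbered experts.
bias = [0.3 * e for e in range(NUM_EXPERTS)]

load = Counter()
for _ in range(NUM_TOKENS):
    logits = [random.gauss(0.0, 1.0) + bias[e] for e in range(NUM_EXPERTS)]
    for expert, _weight in route(logits, TOP_K):
        load[expert] += 1

for expert in range(NUM_EXPERTS):
    print(f"expert {expert}: {load[expert]} tokens")
```

Each token touches only TOP_K of the NUM_EXPERTS experts, and the number of tokens per expert varies. In a real system this means the per-expert matrix multiplications have different, data-dependent sizes, which is part of what makes low-precision kernels for MoE layers harder to build.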
The Results in Practice
The experiments were conducted on an NVIDIA GB200 GPU cluster – one of the most powerful solutions for AI training available today. The result – a 30.2% acceleration – accounts for about 81% of the theoretical maximum that can be achieved by switching to MXFP8. This means the practical results were very close to the theory, which is a good sign in itself.
Moreover, the training quality was not compromised: the model's convergence curves with MXFP8 matched those with BF16. Simply put, the model learned the same thing – just faster.
The implementation was carried out using the TorchAO and TorchTitan libraries – tools from the PyTorch ecosystem designed for optimizing and scaling model training. Details of the implementation are publicly available.
Why This Matters Beyond a Single Experiment
A 30% speedup isn't just a nice number in a report. In the context of training large models, it means the same result can be achieved in roughly three-quarters of the usual time. Alternatively, with the same budget, it enables the training of a model that was previously unattainable.
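The arithmetic behind "three-quarters of the usual time" can be checked in a few lines of Python. The sketch assumes the 30.2% figure refers to training throughput (work done per unit of time); the 30-day baseline is a hypothetical run length chosen only for illustration.

```python
def time_fraction(speedup_pct: float) -> float:
    """Fraction of the baseline wall-clock time needed after a throughput speedup."""
    return 1.0 / (1.0 + speedup_pct / 100.0)


SPEEDUP_PCT = 30.2     # figure reported for Llama 4 Scout
BASELINE_DAYS = 30.0   # hypothetical BF16 run length, for illustration only

fraction = time_fraction(SPEEDUP_PCT)
print(f"time needed: {fraction:.1%} of baseline")
print(f"a {BASELINE_DAYS:.0f}-day run would take about {BASELINE_DAYS * fraction:.1f} days")
```

Under that assumption the run needs about 77% of the original time, so a 30-day job would finish roughly a week earlier.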
For large labs that train models on thousands of accelerators for weeks at a time, this kind of optimization changes the scale of what's possible. But it may matter to a wider audience as well: as MoE architectures become standard and the tools for optimizing them become more accessible, such techniques could shift from being "experimental" to "commonplace."
It's important to note that this is about training, not inference (i.e., running an already trained model). Inference optimization is a separate and actively developing field. Optimizing the training process itself is a more complex challenge, and progress here has been slower.
Open Questions
It remains unclear how easily this approach can be ported to other models and architectures. Llama 4 Scout is a specific model with unique characteristics, and what worked here may not be applicable elsewhere without modification.
There is also the question of accessibility: the NVIDIA GB200 is data center-grade hardware, not something you'd find in an average research lab. Whether these results can be replicated on less exotic hardware remains an open question.
Nevertheless, the very fact that switching to a more compact numerical format yields a 30% speedup with equivalent quality – and that this has been confirmed on a real, modern architecture – seems like a significant step toward more efficient training of large models.