Published on March 19, 2026

New Data Format Speeds Up Llama 4 Scout Training by 30%

Researchers have accelerated the training of the Llama 4 Scout model by 30.2% without compromising result quality, thanks to a change in the numerical data format.

Event Source: PyTorch

When it comes to training large language models, one of the main challenges is time and resources. Modern models require a vast amount of computation, meaning that any acceleration of this process – even by just a few percent – translates into real savings in time, energy, and money.

The PyTorch developer team recently announced a result that is anything but ordinary: the training of the Llama 4 Scout model was accelerated by 30.2% compared to the standard approach, all without any loss in quality. The model converged to the same results as it did in the conventional training mode.

It's All About the Number Format

It might sound a bit surprising, but the key change wasn't the training algorithm or the model's architecture. The difference lies in how numbers are stored and processed during computation.

During training, neural networks operate on a vast number of numerical values. The format in which these numbers are represented directly affects the speed and precision of the computations. The most widely used format today is BF16 (short for "bfloat16"), which stores each value in 16 bits. It provides sufficient precision and has long been the de facto standard for training large models.

The new approach uses the MXFP8 format – a more compact representation that stores each value in 8 bits instead of 16, so it takes up less memory and is processed faster. To put it simply: if BF16 is like working with numbers with two decimal places, MXFP8 is like working with one, but with a scaling system, in which small blocks of values share a common scale factor, that prevents the loss of important information.
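To make the "compact values plus shared scale" idea concrete, here is a minimal pure-Python sketch of block-scaled 8-bit quantization. It assumes the layout described in the OCP Microscaling (MX) specification: blocks of 32 values, FP8 E4M3 elements, and one power-of-two scale per block. The function names and the simple scale-selection rule are illustrative assumptions, not the TorchAO kernels used in the experiment.

```python
import math

BLOCK_SIZE = 32      # values that share one scale in MX formats
E4M3_MAX = 448.0     # largest finite FP8 E4M3 value
E4M3_EMAX = 8        # exponent of that value (448 = 1.75 * 2**8)
E4M3_EMIN = -6       # smallest normal exponent
MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest FP8 E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    magnitude = min(abs(x), E4M3_MAX)
    # floor(log2(magnitude)), clamped so tiny values fall on the subnormal grid
    exponent = max(math.frexp(magnitude)[1] - 1, E4M3_EMIN)
    step = 2.0 ** (exponent - MANTISSA_BITS)  # spacing of representable values
    rounded = min(round(magnitude / step) * step, E4M3_MAX)
    return math.copysign(rounded, x)


def quantize_block(block):
    """Return (shared_exponent, fp8_elements) for one block of floats."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Simplified rule: align the block's largest value with E4M3's top exponent.
    shared_exponent = (math.frexp(amax)[1] - 1) - E4M3_EMAX
    scale = 2.0 ** shared_exponent
    return shared_exponent, [round_to_e4m3(v / scale) for v in block]


def dequantize_block(shared_exponent, elements):
    """Multiply the stored elements by the shared power-of-two scale."""
    scale = 2.0 ** shared_exponent
    return [e * scale for e in elements]


# Hypothetical block of small values, far below what raw E4M3 resolves well.
block = [0.001 * (i + 1) for i in range(BLOCK_SIZE)]
shared_exponent, elements = quantize_block(block)
restored = dequantize_block(shared_exponent, elements)
max_rel_error = max(abs(r - v) / abs(v) for r, v in zip(restored, block))

# Storage cost per value: 32 eight-bit elements plus one 8-bit shared scale.
bits_per_value_mxfp8 = (BLOCK_SIZE * 8 + 8) / BLOCK_SIZE
bits_per_value_bf16 = 16
```

In this sketch the shared scale moves each block into the range where the 8-bit elements are most precise, so every value in the example block is restored to within 6.25% of the original. The storage cost works out to 8.25 bits per value, compared with 16 for BF16.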

The main challenge with such "truncated" formats is the risk of losing precision during training. A model's quality can degrade if numbers are rounded too coarsely. This is precisely why transitioning to MXFP8 is a non-trivial task, especially for complex architectures.

Llama 4 Scout Is More Than Just a Large Model

A key detail: Llama 4 Scout belongs to the class of so-called MoE (Mixture of Experts) models. In this architecture, the model doesn't activate all of its parameters at once. Instead, for each token it processes, only a subset of "experts" – specialized blocks within the model – is engaged.
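The routing idea can be sketched in a few lines of Python. This is a generic top-k router, shown only to illustrate how a subset of experts is selected. The expert count, the scores, and the function name are hypothetical and do not reflect Llama 4 Scout's actual configuration.

```python
import math


def route_tokens(router_logits, top_k=1):
    """For each token, pick the top_k highest-scoring experts.

    Returns, per token, a list of (expert_index, weight) pairs whose
    weights are a softmax over the selected experts only.
    """
    routes = []
    for logits in router_logits:
        chosen = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:top_k]
        peak = max(logits[e] for e in chosen)          # subtract for numerical stability
        exps = [math.exp(logits[e] - peak) for e in chosen]
        total = sum(exps)
        routes.append([(e, w / total) for e, w in zip(chosen, exps)])
    return routes


# Hypothetical router scores: 3 tokens, 4 experts.
router_logits = [
    [2.0, 0.1, -1.0, 0.3],
    [0.0, 1.5, 1.4, -0.2],
    [-0.5, 0.2, 0.1, 3.0],
]
routes = route_tokens(router_logits, top_k=2)
for token_index, route in enumerate(routes):
    print(token_index, route)
```

With `top_k=2`, each token runs through only two of the four expert blocks, so the other experts' parameters are not used for that token.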

This approach allows for the creation of very large models without a proportional increase in computational cost. However, it also creates additional challenges when working with non-standard number formats: tokens are distributed unevenly across experts, so the load on each expert varies, making it harder to maintain computational stability.
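A small sketch of what uneven load looks like in practice. The token-to-expert assignments below are made up for illustration. In a real MoE layer they come from the router and change with every batch.

```python
from collections import Counter


def tokens_per_expert(assignments, num_experts):
    """Count how many tokens were routed to each expert."""
    counts = Counter(assignments)
    return [counts.get(expert, 0) for expert in range(num_experts)]


# Hypothetical routing result: the expert index chosen for each of 8 tokens.
assignments = [0, 0, 3, 0, 1, 0, 3, 0]
num_experts = 4

counts = tokens_per_expert(assignments, num_experts)
ideal_share = len(assignments) / num_experts   # perfectly balanced load
imbalance = max(counts) / ideal_share          # 1.0 would mean perfect balance

print(counts)      # [5, 1, 0, 2]
print(imbalance)   # 2.5
```

Here one expert receives five of the eight tokens while another receives none, so the per-expert workloads differ in size from batch to batch.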

This is why the team had to develop specialized tools to make MXFP8 work correctly on an MoE architecture.

The Results in Practice

The experiments were conducted on an NVIDIA GB200 GPU cluster – one of the most powerful solutions for AI training available today. The result – a 30.2% acceleration – accounts for about 81% of the theoretical maximum that can be achieved by switching to MXFP8. This means the practical results were very close to the theory, which is a good sign in itself.

Moreover, the training quality was not compromised: the model's convergence curves with MXFP8 matched those with BF16. Simply put, the model learned the same thing – just faster.

The implementation was carried out using the TorchAO and TorchTitan libraries – tools from the PyTorch ecosystem designed for optimizing and scaling model training. Details of the implementation are publicly available.

Why This Matters Beyond a Single Experiment

A 30% speedup isn't just a nice number in a report. In the context of training large models, a 1.3x speedup means the same result can be achieved in roughly 77% of the usual time (1 / 1.302 ≈ 0.77). Alternatively, the same compute budget covers about 30% more training, which can put a larger model or a longer run within reach.
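The arithmetic behind this estimate is short. The sketch assumes the 30.2% figure is a throughput gain (that is, 1.302x, in line with the "1.3x" in the original title), and the 21-day baseline run is a hypothetical number chosen for illustration.

```python
def time_fraction(speedup_percent: float) -> float:
    """Fraction of the original wall-clock time needed after a throughput speedup."""
    return 1.0 / (1.0 + speedup_percent / 100.0)


SPEEDUP_PERCENT = 30.2   # figure reported in the article
BASELINE_DAYS = 21.0     # hypothetical BF16 training run

fraction = time_fraction(SPEEDUP_PERCENT)
days_with_mxfp8 = BASELINE_DAYS * fraction
days_saved = BASELINE_DAYS - days_with_mxfp8

print(f"time needed: {fraction:.1%} of baseline")
print(f"{BASELINE_DAYS:.0f}-day run becomes {days_with_mxfp8:.1f} days "
      f"({days_saved:.1f} days saved)")
```

Note that a 30.2% throughput gain shortens the run by about 23%, not 30%, because the time needed scales with the inverse of throughput.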

For large labs that train models on thousands of accelerators for weeks at a time, this kind of optimization changes the scale of what's possible. But it may matter to a wider audience too: as MoE architectures become standard and the tools for optimizing them become more accessible, such techniques could shift from being "experimental" to "commonplace."

It's important to note that this is about training, not inference (i.e., running an already trained model). Inference optimization is a separate and actively developing field. Optimizing the training process itself is a more complex challenge, and progress here has been slower.

Open Questions

It remains unclear how easily this approach can be ported to other models and architectures. Llama 4 Scout is a specific model with unique characteristics, and what worked here may not be applicable elsewhere without modification.

There is also the question of accessibility: the NVIDIA GB200 is data center-grade hardware, not something you'd find in an average research lab. Whether these results can be replicated on less exotic hardware remains an open question.

Nevertheless, the very fact that switching to a more compact numerical format yields a 30% speedup with equivalent quality – and that this has been confirmed on a real, modern architecture – seems like a significant step toward more efficient training of large models.

Link to Original: https://pytorch.org/blog/mxfp8-training-for-moes-1-3x-training-speedup-vs-bf16-for-llama4-scout-on-gb200-cluster-using-torchao-and-torchtitan/

Original Title: MXFP8 Training for MoEs: 1.3x training speedup vs BF16 for Llama4 Scout on GB200 cluster using TorchAO and TorchTitan

Publication Date: Mar 12, 2026

PyTorch (pytorch.org): an international open-source deep learning framework and community widely used for research and development in artificial intelligence and machine learning.
