Published on March 19, 2026

New Data Format Speeds Up Llama 4 Scout Training by 30%

Researchers have accelerated the training of the Llama 4 Scout model by 30.2% without compromising result quality, thanks to a change in the numerical data format.

Event Source: PyTorch

When it comes to training large language models, one of the main challenges is time and resources. Modern models require a vast amount of computation, meaning that any acceleration of this process – even by just a few percent – translates into real savings in time, energy, and money.

The PyTorch developer team recently announced a result that is anything but ordinary: the training of the Llama 4 Scout model was accelerated by 30.2% compared to the standard approach, all without any loss in quality. The model converged to the same results as it did in the conventional training mode.

It's All About the Number Format

It might sound a bit surprising, but the key change wasn't the training algorithm or the model's architecture. The difference lies in how numbers are stored and processed during computation.

During training, neural networks operate on a vast number of numerical values. The format in which these numbers are represented directly affects the speed and precision of the computations. The format most widely used today is BF16 (short for "bfloat16"), a 16-bit floating-point format that provides sufficient precision and has long been the de facto standard for training large models.
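
To make the format concrete: BF16 keeps the sign bit and the 8 exponent bits of a 32-bit float but only 7 mantissa bits. The following is a minimal sketch of that conversion in pure Python, not code from the PyTorch post. The helper name `to_bf16` is hypothetical, and handling of NaN and infinity is left out.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a Python float to the nearest bfloat16 value (returned as a float)."""
    # Reinterpret the value as the 32 raw bits of an IEEE 754 float32.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 low bits that BF16 drops.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) & 0xFFFFFFFF) >> 16 << 16
    # Reinterpret the surviving 16 high bits (padded with zeros) as a float32.
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for value in (1.0, 3.14159265, 0.1, 1e-3):
    approx = to_bf16(value)
    rel_err = abs(approx - value) / value
    print(f"{value!r:>12} -> {approx!r} (relative error {rel_err:.2e})")
```

With 7 stored mantissa bits, the relative rounding error stays below 2^-8 (about 0.4%), which in practice has proven sufficient for training.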

The new approach uses the MXFP8 format – a more compact representation of numbers that takes up less memory and is processed faster. To put it simply: if BF16 is like working with numbers with two decimal places, MXFP8 is like working with one, but with a clever scaling system that prevents the loss of important information.
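
The "clever scaling system" can be sketched as follows. This is an illustrative pure-Python model, not the TorchAO implementation. It assumes the block layout described in the OCP Microscaling specification (32 elements sharing one power-of-two scale) with FP8 E4M3 elements. The scale selection rule is one common choice, and all function names are hypothetical.

```python
import math

BLOCK_SIZE = 32      # elements that share one scale (assumed MX block size)
E4M3_MAX = 448.0     # largest finite FP8 E4M3 value
E4M3_EMAX = 8        # exponent of the largest E4M3 binade (2**8 = 256)
E4M3_EMIN = -6       # smallest normal exponent; below it values are subnormal
MANTISSA_BITS = 3    # stored mantissa bits in E4M3


def round_to_e4m3(x):
    """Round x to the nearest value representable in FP8 E4M3 (saturating)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)                 # saturate instead of overflowing
    _, e = math.frexp(a)                      # a = m * 2**e with 0.5 <= m < 1
    exponent = max(e - 1, E4M3_EMIN)          # clamp into the subnormal range
    step = 2.0 ** (exponent - MANTISSA_BITS)  # spacing between neighbours
    return sign * round(a / step) * step


def quantize_block(block):
    """Return (shared power-of-two scale, list of E4M3 values) for one block."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    _, e = math.frexp(amax)
    # Pick the scale so the largest element lands in the top E4M3 binade.
    scale = 2.0 ** ((e - 1) - E4M3_EMAX)
    return scale, [round_to_e4m3(v / scale) for v in block]


def dequantize_block(scale, codes):
    """Multiply the stored 8-bit values back by the shared scale."""
    return [scale * c for c in codes]


block = [0.01 * (i - 16) for i in range(BLOCK_SIZE)]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
worst = max(abs(a - b) for a, b in zip(block, restored))
print(f"shared scale = 2**{int(math.log2(scale))}, worst absolute error = {worst:.5f}")
```

Because the scale is chosen per block rather than per tensor, a few large values in one part of a tensor do not force small values elsewhere to be rounded to zero.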

The main challenge with such "truncated" formats is the risk of losing precision during training. A model's quality can degrade if numbers are rounded too coarsely. This is precisely why transitioning to MXFP8 has been a non-trivial task, especially for complex architectures.

Llama 4 Scout Is More Than Just a Large Model

A key detail: Llama 4 Scout belongs to the class of so-called MoE (Mixture of Experts) models. This is an architecture where the model doesn't activate all of its parameters at once. Instead, for each token it processes, only a subset of "experts" (specialized blocks within the model) is engaged.

This approach allows for the creation of very large models without a proportional increase in computational cost. However, it also creates additional challenges when working with non-standard number formats: the load is distributed unevenly, making it harder to maintain computational stability.
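
The routing idea, and the uneven load it produces, can be illustrated with a small sketch. This is a generic top-k router in pure Python, not Llama 4 Scout's actual routing code. The router scores, the number of experts, and `top_k=1` are made-up values for illustration, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw router scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_scores, top_k=1):
    """Return [(expert_index, weight), ...] for the top_k experts of one token."""
    probs = softmax(router_scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)
    # Renormalise so the weights of the selected experts sum to 1.
    return [(i, probs[i] / norm) for i in chosen]


def expert_load(batch_scores, num_experts, top_k=1):
    """Count how many tokens in a batch are sent to each expert."""
    load = [0] * num_experts
    for scores in batch_scores:
        for expert_index, _ in route_token(scores, top_k):
            load[expert_index] += 1
    return load


# Hypothetical router scores for 4 tokens and 4 experts.
batch = [
    [2.0, 0.1, 0.1, 0.1],
    [1.5, 0.2, 0.0, 0.3],
    [0.1, 0.1, 3.0, 0.2],
    [2.2, 0.0, 0.1, 0.4],
]
print(expert_load(batch, num_experts=4))
```

In this toy batch, one expert receives most of the tokens while others receive none, so the per-expert matrix multiplications have different and changing sizes from step to step. That variability is part of what makes low-precision kernels harder to apply to MoE layers than to dense ones.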

This is what makes applying MXFP8 to an MoE architecture particularly difficult. The team had to develop specialized tools to ensure it worked correctly.

The Results in Practice

The experiments were conducted on an NVIDIA GB200 GPU cluster, one of the most powerful solutions for AI training available today. The resulting 30.2% acceleration is about 81% of the theoretical maximum speedup achievable by switching to MXFP8. In other words, the practical results came close to the theoretical limit, which is a good sign in itself.

Moreover, the training quality was not compromised: the model's convergence curves with MXFP8 matched those with BF16. Simply put, the model learned the same thing – just faster.

The implementation was carried out using the TorchAO and TorchTitan libraries – tools from the PyTorch ecosystem designed for optimizing and scaling model training. Details of the implementation are publicly available.

Why This Matters Beyond a Single Experiment

A 30% speedup isn't just a nice number in a report. In the context of training large models, it means the same result can be achieved in roughly three-quarters of the usual time (1 / 1.3 ≈ 0.77). Alternatively, the same compute budget covers about 30% more training, which can bring a previously unaffordable training run within reach.
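
The arithmetic behind that estimate is short enough to check directly. The sketch below assumes that the reported 30.2% acceleration means 1.302 times the baseline training throughput, which matches the "1.3x training speedup" wording of the original title. The 30-day baseline run is a made-up example, and the function name is hypothetical.

```python
def time_fraction(speedup_percent: float) -> float:
    """Fraction of the baseline wall-clock time needed after a throughput speedup."""
    return 1.0 / (1.0 + speedup_percent / 100.0)


speedup = 30.2                # reported acceleration, in percent
baseline_days = 30            # hypothetical length of a BF16 training run
fraction = time_fraction(speedup)

print(f"time needed: {fraction:.1%} of the baseline")
print(f"a {baseline_days}-day run would take about {baseline_days * fraction:.1f} days")
print(f"work done in the same time: {1.0 + speedup / 100.0:.3f}x")
```

Note that a 30% speedup shortens the run by about 23%, not 30%, because the saving is 1 - 1/1.302.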

For large labs that train models on thousands of accelerators for weeks at a time, this kind of optimization changes the scale of what's possible. The implications may also reach further: as MoE architectures become standard and the tools for optimizing them become more accessible, such techniques could shift from "experimental" to "commonplace."

It's important to note that this is about training, not inference (i.e., running an already trained model). Inference optimization is a separate and actively developing field. Optimizing the training process itself is a more complex challenge, and progress here has been slower.

Open Questions

It remains unclear how easily this approach can be ported to other models and architectures. Llama 4 Scout is a specific model with unique characteristics, and what worked here may not be applicable elsewhere without modification.

There is also the question of accessibility: the NVIDIA GB200 is data center-grade hardware, not something you'd find in an average research lab. Whether these results can be replicated on less exotic hardware remains an open question.

Nevertheless, the very fact that switching to a more compact numerical format yields a 30% speedup with equivalent quality – and that this has been confirmed on a real, modern architecture – seems like a significant step toward more efficient training of large models.

Original Title: MXFP8 Training for MoEs: 1.3x training speedup vs BF16 for Llama4 Scout on GB200 cluster using TorchAO and TorchTitan
Publication Date: Mar 12, 2026
Source: PyTorch (pytorch.org), an international open-source deep learning framework and community widely used for research and development in artificial intelligence and machine learning.

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Claude Sonnet 4.6 (Anthropic): Analyzing the Original Publication and Writing the Text. The neural network studies the original material and generates a coherent text.
2. Gemini 2.5 Pro (Google DeepMind): Translation into English.
3. Gemini 2.5 Flash (Google DeepMind): Text Review and Editing. Correction of errors, inaccuracies, and ambiguous phrasing.
4. DeepSeek-V3.2 (DeepSeek): Preparing the Illustration Description. Generating a textual prompt for the visual model.
5. FLUX.2 Pro (Black Forest Labs): Creating the Illustration. Generating an image based on the prepared prompt.