Published on March 26, 2026

DeepSeek-V3 Now Trains 41% Faster: What's Behind It?

The PyTorch and Nebius teams joined forces to accelerate the pre-training of DeepSeek-V3 on modern GPUs, and the results exceeded expectations.

Infrastructure / Technical context
Event Source: PyTorch
4 – 6 min read

Large language models don't just appear out of thin air. Behind every release are weeks or even months of computation on hundreds of powerful GPUs. One of the constant challenges in this field is to accelerate training, make it cheaper, and reduce memory constraints. The latest result from the collaboration between the PyTorch and Nebius teams addresses just that.

What Happened?

Engineers from both teams ran pre-training of the DeepSeek-V3 model on a cluster of 256 NVIDIA B200 GPUs. DeepSeek-V3 is a so-called MoE (Mixture of Experts) model: it contains 671 billion parameters, but only a fraction of them (about 37 billion) are activated for any given token. This allows it to achieve high performance with relatively moderate computational costs – at least by the standards of such a scale.
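
To make "only a fraction are active" concrete, here is a minimal sketch of top-k expert routing in plain Python. It assumes a generic softmax router; the expert count, the value of top_k, and the function name route_token are illustrative choices, not DeepSeek-V3's actual configuration or gating function.

```python
import math
import random


def route_token(router_logits, top_k):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts."""
    # Softmax turns the raw router scores into probabilities over all experts.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the top_k experts; every other expert is skipped for this token.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalise so the selected experts' weights sum to 1.
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


random.seed(0)
NUM_EXPERTS = 8  # illustrative value, not DeepSeek-V3's real expert count
TOP_K = 2        # illustrative value
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selection = route_token(logits, TOP_K)
active_fraction = TOP_K / NUM_EXPERTS
print("experts used for this token:", selection)
print("fraction of experts active:", active_fraction)
```

With these toy numbers each token touches 2 of 8 experts, so most expert weights sit idle for that token even though they all count toward the model's total size.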

The result: pre-training was accelerated by up to 41% compared to earlier approaches. In short – the same work, but significantly faster.

How Was This Achieved?

Two independent improvements were responsible. Each can be applied on its own, and when combined their gains add up.

New Number Format: MXFP8

Modern neural networks operate on massive arrays of numbers. The speed and memory usage depend directly on the format in which these numbers are stored and processed. The 'lighter' the format, the faster the computations, but the higher the risk of losing precision.

MXFP8 is one of these 'light' formats: each value takes 8 bits instead of the 16 used by FP16 or BF16. Its key feature is finer-grained scaling: values are split into small blocks (32 elements in the Open Compute Project's Microscaling specification), and each block gets its own shared scale factor. Simply put, this lets the format stay compact while remaining sufficiently precise – a combination that was previously difficult to achieve.
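
The effect of scaling small groups independently can be shown with a simplified emulation in plain Python. This is a sketch under stated assumptions, not the TorchTitan implementation: it models an E4M3-style 8-bit grid and a power-of-two scale shared by each block, and the helper names (round_to_e4m3, quantize_dequantize, max_relative_error) are made up for this example. It compares one scale for the whole array with one scale per block of 32 values.

```python
import math

FP8_MAX = 448.0    # largest finite magnitude of the E4M3 element format
FP8_EMAX = 8       # exponent of the largest E4M3 binade (2**8 = 256)
FP8_MIN_EXP = -6   # smallest normal exponent; below it the grid is subnormal
MANTISSA_BITS = 3


def round_to_e4m3(x):
    """Round x to the nearest point of a simplified E4M3 grid (saturating, no NaN)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_MAX)
    _, e = math.frexp(mag)  # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, FP8_MIN_EXP)
    step = 2.0 ** (exp - MANTISSA_BITS)
    return sign * round(mag / step) * step


def quantize_dequantize(values, block_size):
    """Quantize with one shared power-of-two scale per block, then restore."""
    restored = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        if amax == 0.0:
            restored.extend(0.0 for _ in block)
            continue
        _, e = math.frexp(amax)
        # Shift the block so its largest value lands in the top E4M3 binade.
        scale = 2.0 ** ((e - 1) - FP8_EMAX)
        restored.extend(round_to_e4m3(v / scale) * scale for v in block)
    return restored


def max_relative_error(original, restored):
    return max(abs(o - r) / abs(o) for o, r in zip(original, restored) if o != 0.0)


# Toy data: 32 tiny values followed by 32 large ones.
small = [1e-4 * (1 + i / 32) for i in range(32)]
large = [100.0 * (1 + i / 32) for i in range(32)]
values = small + large

tensor_err = max_relative_error(values, quantize_dequantize(values, len(values)))
block_err = max_relative_error(values, quantize_dequantize(values, 32))
print(f"one scale for all 64 values: max relative error {tensor_err:.3f}")
print(f"one scale per 32 values:     max relative error {block_err:.3f}")
```

With a single scale, the large values dictate the range and the tiny ones are rounded away to zero. With per-block scales, each group is stored relative to its own maximum, so both groups keep a few significant bits.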

Using MXFP8 during the DeepSeek-V3 training process significantly accelerated computations without any noticeable loss in the final model's quality. It is important to note that this specifically pertains to pre-training – the most expensive stage, where the model learns 'from scratch' on vast amounts of text.

DeepEP: Smarter Data Transfer Between GPUs

When hundreds of GPUs work together, data is constantly transferred between them. This is especially pronounced in MoE models: different 'experts' reside on different GPUs, and at each training step, the right data must be delivered to the right expert. This creates a serious load on the network infrastructure.

DeepEP is a library designed to optimize exactly this communication. Developed by the DeepSeek team, it specifically targets MoE architectures. Integrating DeepEP into the training framework reduced the 'idle time' during which GPUs wait for data, and so made better use of the hardware.
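
The communication pattern described above can be sketched as simple bookkeeping: given which experts each token was routed to, count how many tokens every GPU has to send to every other GPU. This is an illustration only and does not use DeepEP's API; the function name build_dispatch_plan, the expert-to-GPU layout, and the toy routing table are all assumptions made for the example.

```python
from collections import Counter


def build_dispatch_plan(token_owner, token_experts, experts_per_gpu):
    """Count how many tokens each GPU must deliver to each destination GPU."""
    traffic = Counter()
    for owner_gpu, experts in zip(token_owner, token_experts):
        # Assumed layout: expert e lives on GPU e // experts_per_gpu.
        # A token is sent once per destination GPU, even if several of its
        # experts sit on that same GPU.
        destinations = {expert // experts_per_gpu for expert in experts}
        for dest_gpu in destinations:
            traffic[(owner_gpu, dest_gpu)] += 1
    return traffic


# Toy setup: 4 GPUs, 2 experts per GPU, 5 tokens routed to 2 experts each.
token_owner = [0, 0, 1, 2, 3]
token_experts = [(0, 5), (2, 3), (1, 6), (4, 5), (7, 0)]
plan = build_dispatch_plan(token_owner, token_experts, experts_per_gpu=2)

remote = sum(n for (src, dst), n in plan.items() if src != dst)
local = sum(n for (src, dst), n in plan.items() if src == dst)
print("transfers that cross GPUs:", remote)
print("transfers that stay local:", local)
```

Every cross-GPU entry in the plan is network traffic that has to complete before the receiving expert can start computing, which is where the waiting time comes from.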

Where Does TorchTitan Come In?

TorchTitan is a training framework from the PyTorch team: a set of tools and approaches for running large-scale model training reliably and flexibly. Support for both MXFP8 and DeepEP was integrated into TorchTitan, and all experiments were run on top of it.

Two configurations were tested: a simplified 16-billion-parameter version of DeepSeek-V3 and the full-size 671-billion-parameter one. Both variants showed significant acceleration, while the training quality was not compromised.

Why Is This More Important Than It Seems?

At first glance, this might sound like a purely technical story. But there is something more significant behind it.

Training models like DeepSeek-V3 is expensive. Very expensive. Every percentage point of speedup here isn't just about being 'faster'; it translates to real resource savings: less time on GPU clusters, less electricity, and less money. At the scale of hundreds of GPUs and weeks of computation, 41% is a figure that has a very tangible monetary equivalent.
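
A quick back-of-the-envelope calculation shows what such a figure means in GPU-hours. It rests on two assumptions that are not from the source: that "41% faster" means 41% higher training throughput, and a hypothetical baseline run of 30 days on the 256-GPU cluster.

```python
def time_saved_fraction(speedup_percent):
    """Fraction of wall-clock time saved if throughput rises by speedup_percent."""
    return 1.0 - 1.0 / (1.0 + speedup_percent / 100.0)


def gpu_hours(num_gpus, days):
    return num_gpus * days * 24


NUM_GPUS = 256       # cluster size from the article
BASELINE_DAYS = 30   # hypothetical run length, chosen only for illustration

baseline_hours = gpu_hours(NUM_GPUS, BASELINE_DAYS)
saved = time_saved_fraction(41.0)
hours_saved = baseline_hours * saved

print(f"baseline: {baseline_hours} GPU-hours")
print(f"time saved: {saved:.1%}, roughly {hours_saved:,.0f} GPU-hours")
```

Note that under this reading a 41% throughput gain shortens the run by about 29%, not 41%: the same work finishes in 1/1.41 of the original time.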

Furthermore, the openness of these results plays a key role. PyTorch is an open ecosystem, and the improvements integrated into TorchTitan are, in theory, available to anyone working on similar tasks. This is not just an internal optimization for a single company but a contribution to the shared infrastructure for training large models.

How Applicable Is This in the Real World?

Here, an honest disclaimer is in order. The experiments ran on a cluster of 256 NVIDIA B200 GPUs – hardware that is extremely expensive and not yet widespread. Most individuals and even small organizations do not work with such configurations directly.

Nevertheless, approaches refined on such systems tend to migrate to more accessible tools over time. MXFP8 is already supported in several other projects, including AMD ROCm, where it has also been covered in connection with DeepSeek-V3. The industry is clearly betting on the format as the next step beyond FP16 and BF16.

As an open-source library, DeepEP is also gradually attracting attention from those working with MoE models – not only at the scale of DeepSeek but also in more modest research projects.

What's the Bottom Line?

The collaboration between PyTorch and Nebius on training DeepSeek-V3 is a prime example of how engineering cooperation within an open ecosystem can yield measurable results. There's no 'breakthrough' here in the sense of a new architecture or a novel idea, but rather solid engineering: taking two proven tools, integrating them into an existing framework, and achieving an acceleration that is hard to ignore.

For those who follow the infrastructure developments for training large models, this is an event worth keeping in mind. It is precisely these kinds of iterations that determine how quickly and affordably the next generations of AI systems will appear.

Original Title: Enabling Up to 41% Faster Pre-training: MXFP8 and DeepEP for DeepSeek-V3 on B200 with TorchTitan
Publication Date: Mar 25, 2026
Source: PyTorch (pytorch.org), an international open-source deep learning framework and community widely used for research and development in artificial intelligence and machine learning.

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Claude Sonnet 4.6 (Anthropic): Analyzing the Original Publication and Writing the Text. The neural network studies the original material and generates a coherent text.
2. Gemini 2.5 Pro (Google DeepMind): Translation into English.
3. Gemini 2.5 Flash (Google DeepMind): Text Review and Editing. Correction of errors, inaccuracies, and ambiguous phrasing.
4. DeepSeek-V3.2 (DeepSeek): Preparing the Illustration Description. Generating a textual prompt for the visual model.
5. FLUX.2 Pro (Black Forest Labs): Creating the Illustration. Generating an image based on the prepared prompt.
