Published on March 26, 2026

DeepSeek-V3 Now Trains 41% Faster: What's Behind It?

The PyTorch and Nebius teams joined forces to accelerate the pre-training of DeepSeek-V3 on NVIDIA B200 GPUs and reported a speedup of up to 41%.

Event Source: PyTorch

Large language models don't just appear out of thin air. Behind every release are weeks or even months of computation on hundreds of powerful GPUs. One of the constant challenges in this field is to accelerate training, make it cheaper, and reduce memory constraints. The latest result from the collaboration between the PyTorch and Nebius teams addresses just that.

What Happened?

Engineers from both teams ran pre-training of the DeepSeek-V3 model on a cluster of 256 NVIDIA B200 GPUs. DeepSeek-V3 is a Mixture of Experts (MoE) model: it contains 671 billion parameters, but only a fraction of them (about 37 billion) are active for any given token. This lets it achieve high performance at a relatively moderate computational cost – at least by the standards of models this large.
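To make "only a fraction are active" concrete, here is a minimal sketch of top-k expert routing in plain Python. Everything in it is illustrative: the names `top_k_route`, `make_expert`, and `moe_forward`, the 8 toy experts, and the choice of 2 active experts per token are assumptions made for the example, not DeepSeek-V3's actual router or sizes.

```python
import math

def top_k_route(scores, k):
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    exp_scores = [math.exp(scores[i]) for i in top]
    total = sum(exp_scores)
    return top, [e / total for e in exp_scores]

NUM_EXPERTS = 8   # toy value, not the real model's expert count
TOP_K = 2         # toy value: experts activated per token
calls = [0] * NUM_EXPERTS  # how many times each expert actually ran

def make_expert(idx):
    """Hypothetical 'expert': a scalar function that also counts its calls."""
    def expert(x):
        calls[idx] += 1
        return (idx + 1) * x
    return expert

experts = [make_expert(i) for i in range(NUM_EXPERTS)]

def moe_forward(token, router_scores, k=TOP_K):
    """Run only the selected experts and mix their outputs by router weight."""
    chosen, weights = top_k_route(router_scores, k)
    output = sum(w * experts[i](token) for i, w in zip(chosen, weights))
    return output, chosen

# Router scores would normally come from a learned layer; these are made up.
router_scores = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.3, 1.2]
output, chosen = moe_forward(3.0, router_scores)
print("experts used:", chosen, "expert calls:", sum(calls))
```

All 8 experts hold parameters, but only the 2 selected ones do any work for this token. That is the sense in which a model's total parameter count can far exceed the compute spent per token.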

The result: pre-training was accelerated by up to 41% compared to earlier approaches. In short – the same work, but significantly faster.

How Was This Achieved?

Two independent improvements made this possible. Each can be applied on its own, and their effects add up when the two are combined.

New Number Format: MXFP8

Modern neural networks operate on massive arrays of numbers. The speed and memory usage depend directly on the format in which these numbers are stored and processed. The 'lighter' the format, the faster the computations, but the higher the risk of losing precision.

MXFP8 is one of these 'light' formats: an 8-bit floating-point format from the microscaling (MX) family. Its key feature is finer control over how numbers are stored: small blocks of values (32 elements in the MX specification) each share their own scale factor, so every block is scaled independently. Simply put, this allows it to be both compact and sufficiently precise – a combination that was previously difficult to achieve.
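The idea of scaling small groups independently can be sketched in plain Python. This is a simplified simulation under stated assumptions, not the kernel code used in training: it assumes blocks of 32 values, a power-of-two scale per block, and FP8 E4M3 elements (3 mantissa bits, maximum value 448). The helper names and the rule for choosing the scale are illustrative choices, and NaN handling is ignored.

```python
import math

BLOCK_SIZE = 32     # assumed MX block size: one shared scale per 32 elements
E4M3_MAX = 448.0    # largest finite FP8 E4M3 value
E4M3_MIN_EXP = -6   # smallest normal exponent in E4M3

def round_to_e4m3(x):
    """Round x to the nearest E4M3-representable value (simplified, saturating)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    _, e = math.frexp(a)                # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, E4M3_MIN_EXP)      # clamp into the subnormal range
    step = 2.0 ** (exp - 3)             # spacing given 3 mantissa bits
    return sign * min(round(a / step) * step, E4M3_MAX)

def power_of_two_scale(values):
    """Smallest power of two that brings the largest magnitude within E4M3 range."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 1.0
    return 2.0 ** math.ceil(math.log2(amax / E4M3_MAX))

def quantize_blockwise(values, block_size=BLOCK_SIZE):
    """Quantize and dequantize with one scale per block (the MX-style approach)."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = power_of_two_scale(block)
        out.extend(round_to_e4m3(v / scale) * scale for v in block)
    return out

def quantize_per_tensor(values):
    """Quantize and dequantize with a single scale for the whole tensor."""
    scale = power_of_two_scale(values)
    return [round_to_e4m3(v / scale) * scale for v in values]

# Two blocks with very different magnitudes (made-up data).
small = [0.001 * (1 + i / 64) for i in range(BLOCK_SIZE)]
large = [1000.0 * (1 + i / 64) for i in range(BLOCK_SIZE)]
tensor = small + large

blockwise = quantize_blockwise(tensor)
per_tensor = quantize_per_tensor(tensor)

worst = max(abs(q - v) / v for q, v in zip(blockwise, tensor))
print("worst relative error with per-block scales:", worst)
print("small block with one global scale:", set(per_tensor[:BLOCK_SIZE]))
```

With one scale for the whole tensor, the small-magnitude block is rounded to zero because the scale is dictated by the large values. With a scale per block, every value keeps a relative error below about 6%, which is what 3 mantissa bits allow.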

Using MXFP8 during the DeepSeek-V3 training process significantly accelerated computations without any noticeable loss in the final model's quality. It is important to note that this specifically pertains to pre-training – the most expensive stage, where the model learns 'from scratch' on vast amounts of text.

DeepEP: Smarter Data Transfer Between GPUs

When hundreds of GPUs work together, data is constantly transferred between them. This is especially pronounced in MoE models: different 'experts' reside on different GPUs, and at each training step, the right data must be delivered to the right expert. This creates a serious load on the network infrastructure.

DeepEP is a library designed to optimize this very communication. Developed by the DeepSeek team, it specifically targets MoE architectures. Integrating DeepEP into the training framework made it possible to reduce 'idle time,' when GPUs are waiting for data, and thereby better utilize hardware resources.
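The communication pattern that DeepEP targets can be illustrated with a small single-process simulation. This sketch does not use DeepEP's API. The functions `dispatch` and `combine`, the two simulated ranks, and the top-1 routing are all hypothetical simplifications chosen to show why tokens have to cross GPU boundaries.

```python
from collections import defaultdict

def dispatch(tokens_per_rank, expert_of_token, experts_per_rank):
    """Group every token by the rank that hosts its assigned expert."""
    inbox = defaultdict(list)  # destination rank -> [(src rank, position, expert, token)]
    cross_rank = 0
    for src, tokens in enumerate(tokens_per_rank):
        for pos, tok in enumerate(tokens):
            expert = expert_of_token[src][pos]
            dst = expert // experts_per_rank   # experts are laid out contiguously
            if dst != src:
                cross_rank += 1                # this token must travel over the network
            inbox[dst].append((src, pos, expert, tok))
    return inbox, cross_rank

def combine(inbox, tokens_per_rank, expert_fn):
    """Run each expert where it lives and send results back to the source slot."""
    outputs = [[None] * len(tokens) for tokens in tokens_per_rank]
    for items in inbox.values():
        for src, pos, expert, tok in items:
            outputs[src][pos] = expert_fn(expert, tok)
    return outputs

# Made-up setup: 2 ranks, 2 experts per rank (experts 0-1 on rank 0, 2-3 on rank 1).
tokens_per_rank = [[1.0, 2.0, 3.0], [4.0, 5.0]]
expert_of_token = [[0, 3, 2], [1, 3]]   # router decisions, assumed given

def toy_expert(expert, x):
    """Hypothetical expert computation: scale the token by (expert id + 1)."""
    return (expert + 1) * x

inbox, cross_rank = dispatch(tokens_per_rank, expert_of_token, experts_per_rank=2)
outputs = combine(inbox, tokens_per_rank, toy_expert)
print("tokens sent to another rank:", cross_rank, "of 5")
print("outputs:", outputs)
```

Even in this tiny case, 3 of the 5 tokens leave their rank and their results have to come back. In a real cluster both steps are all-to-all exchanges repeated at every MoE layer, which is why reducing the time GPUs spend waiting on them matters.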

Where Does TorchTitan Come In?

TorchTitan is the PyTorch team's framework for large-scale model training: a set of tools and techniques for running such jobs reliably and flexibly. Support for both MXFP8 and DeepEP was integrated into TorchTitan, and all the experiments were run on top of it.

Two configurations were tested: a simplified 16-billion-parameter version of DeepSeek-V3 and the full-size 671-billion-parameter one. Both variants showed significant acceleration, while the training quality was not compromised.

Why Is This More Important Than It Seems?

At first glance, this might sound like a purely technical story. But there is something more significant behind it.

Training models like DeepSeek-V3 is expensive. Very expensive. Every percentage point of speedup here isn't just about being 'faster'; it translates to real resource savings: less time on GPU clusters, less electricity, and less money. At the scale of hundreds of GPUs and weeks of computation, 41% is a figure that has a very tangible monetary equivalent.
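A quick back-of-the-envelope calculation shows what such a speedup means in GPU-hours. It rests on two assumptions that are not taken from the source: that "41% faster" refers to training throughput (so the same work takes 1/1.41 of the time), and a hypothetical baseline run of 30 days. The cluster size of 256 GPUs is the one used in the experiments.

```python
def time_fraction(speedup_pct):
    """Fraction of the original wall-clock time left after a throughput speedup."""
    return 1.0 / (1.0 + speedup_pct / 100.0)

def gpu_hours(num_gpus, days):
    """Total GPU-hours consumed by a run of the given length."""
    return num_gpus * days * 24

NUM_GPUS = 256        # cluster size from the experiments
BASELINE_DAYS = 30    # hypothetical run length, for illustration only
SPEEDUP_PCT = 41      # best-case figure ("up to 41%")

baseline = gpu_hours(NUM_GPUS, BASELINE_DAYS)
accelerated = baseline * time_fraction(SPEEDUP_PCT)
saved = baseline - accelerated

print(f"time needed: {time_fraction(SPEEDUP_PCT):.1%} of baseline")
print(f"GPU-hours: {baseline:,.0f} -> {accelerated:,.0f} (saved {saved:,.0f})")
```

Under these assumptions, a 41% throughput gain cuts wall-clock time by about 29%, not 41%. For the hypothetical 30-day run, that is roughly 53,600 GPU-hours saved out of 184,320.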

Furthermore, the openness of these results plays a key role. PyTorch is an open ecosystem, and the improvements integrated into TorchTitan are, in theory, available to anyone working on similar tasks. This is not just an internal optimization for a single company but a contribution to the shared infrastructure for training large models.

How Applicable Is This in the Real World?

Here, an honest disclaimer is in order. We are talking about experiments on a cluster of 256 NVIDIA B200 GPUs – which is extremely expensive and not yet widespread hardware. Most individuals and even small organizations do not work with such configurations directly.

Nevertheless, approaches perfected on such systems tend to migrate to more accessible tools over time. MXFP8 is already supported in several other projects, including AMD ROCm, where it has also been discussed in connection with DeepSeek-V3. It's a format the industry is clearly betting on as the next step beyond FP16 and BF16.

As an open-source library, DeepEP is also gradually attracting attention from those working with MoE models – not only at the scale of DeepSeek but also in more modest research projects.

What's the Bottom Line?

The collaboration between PyTorch and Nebius on training DeepSeek-V3 is a prime example of how engineering cooperation within an open ecosystem can yield measurable results. There's no 'breakthrough' here in the sense of a new architecture or a novel idea, but rather solid engineering: taking two proven tools, integrating them into an existing framework, and achieving an acceleration that is hard to ignore.

For those who follow the infrastructure developments for training large models, this is an event worth keeping in mind. It is precisely these kinds of iterations that determine how quickly and affordably the next generations of AI systems will appear.

Link to Original: https://pytorch.org/blog/enabling-up-to-41-faster-pre-training-mxfp8-and-deepep-for-deepseek-v3-on-b200-with-torchtitan/

Original Title: Enabling Up to 41% Faster Pre-training: MXFP8 and DeepEP for DeepSeek-V3 on B200 with TorchTitan

Publication Date: Mar 25, 2026

PyTorch (pytorch.org): an international open-source deep learning framework and community widely used for research and development in artificial intelligence and machine learning.
