Large language models don't just appear out of thin air. Behind every release are weeks or even months of computation on hundreds of powerful GPUs. A constant challenge in this field is to make training faster and cheaper while reducing its memory requirements. The latest result from the collaboration between the PyTorch and Nebius teams addresses just that.
What Happened?
Engineers from both teams ran DeepSeek-V3 pre-training on a cluster of 256 NVIDIA B200 GPUs. DeepSeek-V3 is a so-called MoE (Mixture of Experts) model: it contains 671 billion parameters, but only a fraction of them (about 37 billion) are activated for any given token. This allows it to achieve high performance at a relatively moderate computational cost, at least by the standards of models this large.
The result: pre-training was accelerated by up to 41% compared to earlier approaches. In short – the same work, but significantly faster.
How Was This Achieved?
The speedup comes from two independent improvements. Each can be applied on its own, and when combined, their effects add up.
New Number Format: MXFP8
Modern neural networks operate on massive arrays of numbers. The speed and memory usage depend directly on the format in which these numbers are stored and processed. The 'lighter' the format, the faster the computations, but the higher the risk of losing precision.
MXFP8 is one of these 'light' formats: an 8-bit floating-point format from the microscaling (MX) family. Its key feature is finer control over how numbers are stored. Instead of one scale factor for a whole tensor, each small block of values (32 in the OCP Microscaling specification) is scaled independently. Simply put, this allows the format to be both compact and sufficiently precise, a combination that was previously difficult to achieve.
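The idea of scaling small groups of values independently can be illustrated with a short sketch. It is not the TorchTitan or PyTorch implementation: the block size of 32, the power-of-two shared scale, and the E4M3 element type are assumptions taken from the OCP Microscaling specification, and every function name here is hypothetical. The rounding helper is simplified and ignores details such as NaN handling.

```python
import math

BLOCK_SIZE = 32      # elements sharing one scale (assumed from the OCP MX spec)
E4M3_MAX = 448.0     # largest finite FP8 E4M3 magnitude
E4M3_EMAX = 8        # exponent of the largest E4M3 binade (2**8)
E4M3_EMIN = -6       # exponent of the smallest normal E4M3 value


def floor_log2(x: float) -> int:
    """Exact floor(log2(x)) for a positive float."""
    return math.frexp(x)[1] - 1


def round_to_e4m3(x: float) -> float:
    """Simplified rounding to the FP8 E4M3 grid (3 mantissa bits, saturating)."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)
    exp = max(floor_log2(mag), E4M3_EMIN)
    step = 2.0 ** (exp - 3)          # spacing between neighbouring values
    return math.copysign(round(mag / step) * step, x)


def quantize_block(block):
    """Return (shared_exponent, elements) for one block of floats."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Power-of-two scale that moves the block maximum into the top E4M3 binade.
    shared_exp = floor_log2(amax) - E4M3_EMAX
    scale = 2.0 ** shared_exp
    return shared_exp, [round_to_e4m3(v / scale) for v in block]


def dequantize_block(shared_exp, elements):
    """Recover approximate original values from a quantized block."""
    scale = 2.0 ** shared_exp
    return [e * scale for e in elements]


def quantize(values):
    """Split a flat list into blocks and quantize each block independently."""
    return [quantize_block(values[i:i + BLOCK_SIZE])
            for i in range(0, len(values), BLOCK_SIZE)]


if __name__ == "__main__":
    # Two blocks with very different magnitudes each get their own scale.
    data = [0.001 * (i + 1) for i in range(32)] + [10.0 * (i + 1) for i in range(32)]
    for shared_exp, elements in quantize(data):
        print(shared_exp, dequantize_block(shared_exp, elements)[:3])
```

Because each block carries its own exponent, a block of tiny values and a block of large values both keep about three bits of mantissa precision, which a single tensor-wide scale could not guarantee.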
Using MXFP8 during DeepSeek-V3 training significantly accelerated computations without any noticeable loss in the final model's quality. Note that this applies specifically to pre-training, the most expensive stage, where the model learns 'from scratch' on vast amounts of text.
DeepEP: Smarter Data Transfer Between GPUs
When hundreds of GPUs work together, data is constantly transferred between them. This is especially pronounced in MoE models: different 'experts' reside on different GPUs, and at each training step, the right data must be delivered to the right expert. This creates a serious load on the network infrastructure.
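To make the communication pattern concrete, here is a minimal sketch of how tokens might be routed to experts that live on different GPUs. It does not use DeepEP's API; the function names, the top-k routing rule, and the contiguous placement of experts on GPUs are all assumptions made for illustration.

```python
def top_k_experts(scores, k):
    """Indices of the k highest-scoring experts for one token."""
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]


def plan_dispatch(token_scores, token_gpu, experts_per_gpu, k):
    """Group (token, expert) pairs by destination GPU and count remote sends."""
    sends = {}      # destination GPU -> list of (token_id, expert_id)
    remote = 0      # pairs whose expert lives on a different GPU than the token
    for tok, scores in enumerate(token_scores):
        for expert in top_k_experts(scores, k):
            # Assumption: experts are placed on GPUs in contiguous groups.
            dest = expert // experts_per_gpu
            sends.setdefault(dest, []).append((tok, expert))
            if dest != token_gpu[tok]:
                remote += 1
    return sends, remote


if __name__ == "__main__":
    # Router scores for 3 tokens over 4 experts (2 experts per GPU, 2 GPUs).
    token_scores = [
        [0.10, 0.70, 0.15, 0.05],
        [0.40, 0.10, 0.30, 0.20],
        [0.05, 0.05, 0.50, 0.40],
    ]
    token_gpu = [0, 0, 1]   # GPU on which each token currently resides
    sends, remote = plan_dispatch(token_scores, token_gpu, experts_per_gpu=2, k=2)
    print(sends)
    print("cross-GPU sends:", remote)
```

Every pair counted as remote is a piece of data that must cross the network before the expert can start computing, and the results must travel back afterwards. This is the traffic that a dedicated communication library aims to make cheaper.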
DeepEP is a library designed to optimize this very communication. Developed by the DeepSeek team, it specifically targets MoE architectures. Integrating DeepEP into the training framework reduced the idle time GPUs spend waiting for data, and thereby made better use of the hardware.
Where Does TorchTitan Come In?
TorchTitan is a training framework from the PyTorch team: a set of tools and approaches for running large-scale training of large models reliably and flexibly. Support for both MXFP8 and DeepEP was integrated into TorchTitan, and all experiments were run on top of it.
Two configurations were tested: a simplified 16-billion-parameter version of DeepSeek-V3 and the full-size 671-billion-parameter one. Both showed significant acceleration, and training quality was not compromised.
Why Is This More Important Than It Seems?
At first glance, this might sound like a purely technical story. But there is something more significant behind it.
Training models like DeepSeek-V3 is expensive. Very expensive. Every percentage point of speedup here isn't just about being 'faster'; it translates to real resource savings: less time on GPU clusters, less electricity, and less money. At the scale of hundreds of GPUs and weeks of computation, 41% is a figure that has a very tangible monetary equivalent.
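The arithmetic behind that claim can be sketched as follows. Two assumptions are made here that the results themselves do not state: that the 41% figure refers to training throughput, and that the baseline run lasts 30 days (a purely hypothetical number). Under the first assumption, a 41% throughput gain corresponds to roughly 29% less wall-clock time, since the same work finishes in 1/1.41 of the original time.

```python
def time_saved_fraction(speedup_pct: float) -> float:
    """Fraction of wall-clock time saved if throughput rises by speedup_pct percent."""
    return 1.0 - 1.0 / (1.0 + speedup_pct / 100.0)


def gpu_hours(num_gpus: int, days: float) -> float:
    """Total GPU-hours consumed by a run on num_gpus GPUs lasting the given days."""
    return num_gpus * days * 24


if __name__ == "__main__":
    baseline_days = 30      # hypothetical baseline run length
    num_gpus = 256          # cluster size from the experiments
    saved = time_saved_fraction(41.0)
    baseline = gpu_hours(num_gpus, baseline_days)
    print(f"time saved: {saved:.1%}")
    print(f"baseline GPU-hours: {baseline:,.0f}")
    print(f"GPU-hours saved: {baseline * saved:,.0f}")
```

Multiplying the saved GPU-hours by an hourly GPU price gives a rough monetary estimate for a given run.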
Furthermore, the openness of these results plays a key role. PyTorch is an open ecosystem, and the improvements integrated into TorchTitan are, in theory, available to anyone working on similar tasks. This is not just an internal optimization for a single company but a contribution to the shared infrastructure for training large models.
How Applicable Is This in the Real World?
Here, an honest disclaimer is in order. We are talking about experiments on a cluster of 256 NVIDIA B200 GPUs – which is extremely expensive and not yet widespread hardware. Most individuals and even small organizations do not work with such configurations directly.
Nevertheless, approaches perfected on such systems tend to migrate to more accessible tools over time. MXFP8 is already supported in several other projects, including AMD ROCm, where it has also been discussed in connection with DeepSeek-V3. It's a format the industry is clearly betting on as the next step beyond FP16 and BF16.
As an open-source library, DeepEP is also gradually attracting attention from those working with MoE models – not only at the scale of DeepSeek but also in more modest research projects.
What's the Bottom Line?
The collaboration between PyTorch and Nebius on training DeepSeek-V3 is a prime example of how engineering cooperation within an open ecosystem can yield measurable results. There's no 'breakthrough' here in the sense of a new architecture or a novel idea, but rather solid engineering: taking two proven tools, integrating them into an existing framework, and achieving an acceleration that is hard to ignore.
For those who follow infrastructure developments in large-model training, this is a result worth noting. It is precisely these kinds of iterations that determine how quickly and affordably the next generations of AI systems will appear.