Published on March 5, 2026

DeepSpeed Learns to Train Complex AI Models More Efficiently: What's Changed and Why It Matters

DeepSpeed has received two significant updates: support for training multimodal models and a memory-saving mode using low-precision computations.

Development / Technical context, 4–5 min read

Event Source: PyTorch

Most people who use AI tools don't think about what goes into creating them. But behind the scenes, there are vast computing resources, complex engineering, and a constant battle for efficiency. One of the key tools in this battle is DeepSpeed, a library developed by Microsoft specifically for training large neural networks. It recently received two notable updates, each addressing aspects that previously posed serious limitations.

Challenges of Training Multimodal AI Models

Why Training Complex Models Is So Difficult

When we talk about modern AI systems, we're increasingly referring to multimodal models – those that can work with multiple types of data at once: text, images, and audio. Simply put, these are models that don't just read text but also 'see' pictures or 'hear' sound.

Such models are more complex than standard ones: they contain several separate components, each responsible for its own data type. This is where the training difficulties begin. Standard neural network training involves a so-called backward pass, the step in which the model 'learns' from its mistakes: the gradient of the error with respect to each parameter is computed so that the parameters can be adjusted. Technically, this step starts from a single number, a scalar loss value.

But in multimodal models, it's not that simple. There can be several loss terms, one for each component. Previously, DeepSpeed couldn't handle this correctly, so developers either had to find workarounds or accept that the library didn't support their scenario.
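To make the constraint concrete, here is a minimal sketch in plain PyTorch (not DeepSpeed). The two `Linear` layers and the names `text_encoder` and `image_encoder` are hypothetical stand-ins for the components of a multimodal model. The sketch shows that a no-argument `backward()` call accepts only a scalar, and that the usual workaround is to collapse all loss terms into one number first.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins for two components of a multimodal model.
text_encoder = torch.nn.Linear(4, 2)
image_encoder = torch.nn.Linear(8, 2)

text_loss = text_encoder(torch.randn(3, 4)).pow(2).mean()    # scalar
image_loss = image_encoder(torch.randn(3, 8)).pow(2).mean()  # scalar

# Stacking the two losses gives a tensor of shape (2,), which is not a scalar.
losses = torch.stack([text_loss, image_loss])

scalar_required = False
try:
    losses.backward()  # no explicit gradient: only valid for scalar outputs
except RuntimeError:
    scalar_required = True

# Classic workaround: collapse every loss term into a single scalar first.
total_loss = losses.sum()
total_loss.backward()

print("no-argument backward() rejected the non-scalar:", scalar_required)
print("text encoder gradient shape:", tuple(text_encoder.weight.grad.shape))
print("image encoder gradient shape:", tuple(image_encoder.weight.grad.shape))
```

Summing works for simple cases, but it forces every component's loss through one code path, which is the kind of adaptation the article describes.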

DeepSpeed Update for Multimodal Backward Pass and PyTorch Compatibility

First Update: The Backward Pass Now Works as It Should

The new version of DeepSpeed solves this problem directly. The backward pass now supports not only the standard scenario with a single scalar loss but also more complex cases, including those where several values are passed to it or where the computation is structured differently.

An important detail: the developers have made the new interface identical to the one used in PyTorch, one of the most popular tools for working with neural networks. Because the API matches a familiar one, migrating to DeepSpeed doesn't require rewriting code from scratch. You can take an existing project and enable the optimizations with almost no changes.
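For reference, the sketch below shows three backward call forms that plain PyTorch itself provides: a scalar loss, a non-scalar output with an explicit upstream gradient, and several losses passed together. It uses only PyTorch and a toy `Linear` layer (a hypothetical stand-in), so it illustrates the interface being mirrored rather than DeepSpeed's own code; which of these forms DeepSpeed accepts should be checked against the original post and the DeepSpeed documentation.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(4, 3)  # toy stand-in for a real model
x = torch.randn(5, 4)

# Form 1: scalar loss, no arguments.
layer(x).pow(2).mean().backward()
grad_scalar = layer.weight.grad.clone()
layer.zero_grad()

# Form 2: non-scalar output plus an explicit upstream gradient.
# Weighting every element by 1/N reproduces the gradient of the mean.
out = layer(x).pow(2)  # shape (5, 3)
out.backward(gradient=torch.full_like(out, 1.0 / out.numel()))
grad_explicit = layer.weight.grad.clone()
layer.zero_grad()

# Form 3: several scalar losses passed in one call; their gradients accumulate.
out = layer(x)
loss_a = out[:, 0].pow(2).mean()
loss_b = out[:, 1:].pow(2).mean()
torch.autograd.backward([loss_a, loss_b])
grad_multi = layer.weight.grad.clone()

print("forms 1 and 2 agree:", torch.allclose(grad_scalar, grad_explicit))
print("form 3 gradient shape:", tuple(grad_multi.shape))
```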

For teams developing multimodal systems, this means the barrier to using DeepSpeed has been significantly lowered. Previously, they had to either adapt their code to the library's limitations or forgo its benefits. Now, they don't have to make that choice.

Optimizing VRAM Usage with DeepSpeed Memory Efficiency Updates

Memory – A Resource That's Always in Short Supply

The second update addresses another chronic problem: memory. Training large models requires a very large amount of GPU memory (VRAM). Even with powerful hardware, there's rarely enough of it: either the model doesn't fit at all, or you have to shrink the batch of data processed in each step, which slows down training.

One way to handle this is to store the model's weights in a lower-precision numerical format. In short: numbers in a computer can be stored with varying degrees of detail. The standard 32-bit floating-point format (FP32) takes 4 bytes per value and offers high precision. A 16-bit format such as FP16 or BF16 takes 2 bytes, half the memory, and in most cases this doesn't significantly affect the quality of the result.
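The arithmetic behind this is simple enough to sketch. The helper `weight_memory_gb` and the 7-billion-parameter model size below are hypothetical, chosen only for illustration, and the estimate covers the weights alone; gradients, optimizer state, and activations add to the total.

```python
# Bytes needed to store one parameter in common floating-point formats.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}


def weight_memory_gb(num_params: int, fmt: str) -> float:
    """Memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


num_params = 7_000_000_000  # hypothetical 7-billion-parameter model

for fmt in ("fp32", "bf16"):
    print(f"{fmt}: {weight_memory_gb(num_params, fmt):.1f} GB")
# fp32: 28.0 GB
# bf16: 14.0 GB
```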

DeepSpeed now supports a mode in which model parameters are stored in such a 'lightweight' format. This allows you to either fit a larger model on the same hardware or process a larger batch in a single training step, which ultimately speeds up the entire process.
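The effect of storing parameters in a lighter format can be observed in plain PyTorch. The sketch below is not DeepSpeed's mode and does not use its configuration; it converts a small toy model to BF16 (one possible 16-bit format, assumed here for illustration) and compares the bytes taken by its parameters. The helper `param_bytes` is hypothetical.

```python
import torch


def param_bytes(model: torch.nn.Module) -> int:
    """Total bytes occupied by the model's parameters."""
    return sum(p.numel() * p.element_size() for p in model.parameters())


# Toy model; layer sizes are arbitrary.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
)

fp32_bytes = param_bytes(model)   # parameters are FP32 by default
model = model.to(torch.bfloat16)  # convert parameters to BF16 in place
bf16_bytes = param_bytes(model)

print(f"fp32: {fp32_bytes:,} bytes")
print(f"bf16: {bf16_bytes:,} bytes")

# The cost of the smaller format: fine detail is rounded away.
rounded = torch.tensor(1.001, dtype=torch.bfloat16).item()
print("1.001 stored in bf16 becomes", rounded)
```

The rounding in the last lines is why low-precision storage is a trade-off rather than a free win: very small weight updates can be lost, so mixed-precision setups commonly keep some state in higher precision.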

What This Means in Practice

Both updates solve real problems faced by people involved in training models. But it's important to understand: they don't make AI training a simple task for everyone – it remains a complex and costly endeavor. This is about removing specific technical barriers that hindered efficient work.

For those building multimodal systems – and the number of such projects is growing – this is a significant relief. Fewer workarounds, less adaptation, and more compatibility with existing code.

For those facing memory constraints – which is almost everyone working with large models – this provides an additional tool to squeeze more performance out of existing hardware.

Neither update changes the game overnight. But together, they make DeepSpeed a more versatile tool, better suited to how modern AI projects are structured.

#event #technical context #neural networks #ai development #ai training #engineering #infrastructure #multimodal models #model training optimization #large model training optimization #energy efficiency

Link to Original: https://pytorch.org/blog/enhancing-multimodal-training-and-memory-efficiency-with-deepspeed/

Original Title: Enhancing Multimodal Training and Memory Efficiency with DeepSpeed

Publication Date: Feb 25, 2026

PyTorch pytorch.org An international open-source deep learning framework and community widely used for research and development in artificial intelligence and machine learning.



