Published on March 5, 2026

DeepSpeed Learns to Train Complex AI Models More Efficiently: What's Changed and Why It Matters

DeepSpeed has received two significant updates: support for training multimodal models and a memory-saving mode using low-precision computations.

Event Source: PyTorch · 4–5 min read

Most people who use AI tools don't think about what goes into creating them. But behind the scenes, there are vast computing resources, complex engineering, and a constant battle for efficiency. One of the key tools in this battle is DeepSpeed, a library developed by Microsoft specifically for training large neural networks. It recently received two notable updates, each addressing aspects that previously posed serious limitations.

Why Training Complex Models Is So Difficult

When we talk about modern AI systems, we're increasingly referring to multimodal models – those that can work with multiple types of data at once: text, images, and audio. Simply put, these are models that don't just read text but also 'see' pictures or 'hear' sound.

Such models are more complex than standard ones: they contain several separate components, each responsible for its own data type. This is where the training difficulties begin. The standard process for training a neural network involves a so-called backward pass – the step in which the model 'learns' from its mistakes by working out how each parameter should change. By default, this step starts from one specific number – a scalar loss value.
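To make the scalar requirement concrete, here is a minimal sketch in plain PyTorch (not DeepSpeed). The tensor `w` and the toy loss are made up for illustration. It shows that a backward pass from a scalar needs no extra arguments, while a backward pass from a non-scalar tensor fails unless an explicit gradient is supplied.

```python
import torch

# A tiny trainable parameter vector, purely for illustration.
w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Case 1: a scalar loss. backward() needs no arguments.
loss = (w ** 2).sum()
loss.backward()
scalar_grad = w.grad.clone()  # d/dw sum(w^2) = 2w

# Case 2: a non-scalar output. backward() without arguments is rejected.
w.grad = None
per_element = w ** 2
try:
    per_element.backward()
    raised = False
except RuntimeError:
    raised = True

# Case 3: the same non-scalar output, with an explicit gradient
# that says how much each output element contributes.
w.grad = None
per_element = w ** 2
per_element.backward(gradient=torch.ones_like(per_element))
explicit_grad = w.grad.clone()

print(raised, scalar_grad, explicit_grad)
```

With a gradient of all ones, the third case produces the same parameter gradients as summing the outputs into a scalar first.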

But in multimodal models, it's not that simple. There can be multiple sources of error – one for each component. Previously, DeepSpeed couldn't handle this correctly. Developers faced limitations: they either had to find workarounds or accept that the library didn't support their required scenario.
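The sketch below illustrates the kind of situation described here, again in plain PyTorch rather than DeepSpeed. The two `nn.Linear` layers are hypothetical stand-ins for a text component and an image component, and the losses are toy values. It compares the common workaround (adding the losses into one scalar) with handing both losses to autograd in a single call.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins for two modality-specific components.
text_encoder = nn.Linear(4, 2)
image_encoder = nn.Linear(6, 2)

text_batch = torch.randn(3, 4)
image_batch = torch.randn(3, 6)


def component_losses():
    """Return one toy scalar loss per component."""
    text_loss = text_encoder(text_batch).pow(2).mean()
    image_loss = image_encoder(image_batch).pow(2).mean()
    return text_loss, image_loss


# Workaround: collapse both losses into a single scalar.
text_loss, image_loss = component_losses()
(text_loss + image_loss).backward()
summed_grad = text_encoder.weight.grad.clone()

# Alternative: pass both losses to autograd in one call.
text_encoder.zero_grad()
image_encoder.zero_grad()
text_loss, image_loss = component_losses()
torch.autograd.backward([text_loss, image_loss])
multi_grad = text_encoder.weight.grad.clone()

print(torch.allclose(summed_grad, multi_grad))
```

In this toy case both routes give the same gradients. The point is only that training code is not always shaped around one scalar loss, which is the pattern a training library has to accommodate.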

First Update: The Backward Pass Now Works as It Should

The new version of DeepSpeed solves this problem directly. The backward pass now supports not only the standard scenario with a single scalar loss but also more complex cases – including when multiple values are passed to it, or when the computations are structured differently.

An important detail: the developers have made the new interface identical to the one used in PyTorch, one of the most popular tools for working with neural networks. This is a crucial point. If the API matches a familiar one, migrating to DeepSpeed doesn't require rewriting code from scratch. You can take an existing project and simply enable optimizations – with almost no changes.

For teams developing multimodal systems, this means the barrier to using DeepSpeed has been significantly lowered. Previously, they had to either adapt their code to the library's limitations or forgo its benefits. Now, they don't have to make that choice.

Memory – A Resource That's Always in Short Supply

The second update addresses another chronic problem: memory. Training large models requires a colossal amount of GPU memory. Even with powerful hardware, there's never enough of it: either the model doesn't fit at all, or you have to process less data per training step, which slows down the process.

One way to handle this is to store the model's weights in a less precise numerical format. In short: numbers in a computer can be stored with varying degrees of detail. The standard format takes up more space but ensures high precision. A less precise format uses less memory, and in most cases, this doesn't significantly affect the quality of the result.
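A rough back-of-the-envelope calculation shows the scale of the saving. The sketch below assumes the common formats: 32-bit floating point (4 bytes per value) as the standard one and 16-bit formats such as fp16 or bf16 (2 bytes per value) as the lighter ones. The parameter count is a hypothetical example, and the estimate covers only the weights themselves, not gradients, optimizer states, or activations.

```python
# Bytes needed to store one value in each format.
BYTES_PER_VALUE = {"fp32": 4, "fp16": 2, "bf16": 2}


def weight_memory_gib(num_parameters: int, dtype: str) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_parameters * BYTES_PER_VALUE[dtype] / 2**30


# Hypothetical model size, chosen only for illustration.
num_parameters = 7_000_000_000

fp32_gib = weight_memory_gib(num_parameters, "fp32")
bf16_gib = weight_memory_gib(num_parameters, "bf16")

print(f"fp32 weights: {fp32_gib:.1f} GiB")
print(f"bf16 weights: {bf16_gib:.1f} GiB")
```

For this hypothetical model the weights take about 26 GiB in fp32 and about 13 GiB in bf16, so halving the bytes per value halves the footprint of the weights.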

DeepSpeed now supports a mode where model parameters are stored in such a 'lightweight' format. This allows you to either run a larger model on the same hardware or use more data in a single training step – which ultimately speeds up the entire process.

What This Means in Practice

Both updates solve real problems faced by people involved in training models. But it's important to understand: they don't make AI training a simple task for everyone – it remains a complex and costly endeavor. This is about removing specific technical barriers that hindered efficient work.

For those building multimodal systems – and the number of such projects is growing – this is a significant relief. Fewer workarounds, less adaptation, and more compatibility with existing code.

For those facing memory constraints – which is almost everyone working with large models – this provides an additional tool to squeeze more performance out of existing hardware.

Neither of these updates changes the game overnight. But together, they make DeepSpeed a more versatile tool, better suited to how modern AI projects are structured.

Original Title: Enhancing Multimodal Training and Memory Efficiency with DeepSpeed
Publication Date: Feb 25, 2026
PyTorch pytorch.org An international open-source deep learning framework and community widely used for research and development in artificial intelligence and machine learning.

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Claude Sonnet 4.6 (Anthropic): Analyzing the Original Publication and Writing the Text. The neural network studies the original material and generates a coherent text.
2. Gemini 2.5 Pro (Google DeepMind): Translation into English.
3. Gemini 2.5 Flash (Google DeepMind): Text Review and Editing. Correction of errors, inaccuracies, and ambiguous phrasing.
4. DeepSeek-V3.2 (DeepSeek): Preparing the Illustration Description. Generating a textual prompt for the visual model.
5. FLUX.2 Pro (Black Forest Labs): Creating the Illustration. Generating an image based on the prepared prompt.
