Published on March 24, 2026

PyTorch 2.11: Faster, Broader, Closer to the Metal

PyTorch 2.11 has been released. This update to the popular neural network training framework brings notable improvements for distributed systems and Apple Silicon.


PyTorch remains one of the cornerstone tools for modern neural network training, relied upon by academic researchers and industry giants alike. On March 23, version 2.11 was launched, boasting 2,723 commits from 432 contributors since the previous release. While not a complete revolution, it is a substantial cumulative update – particularly for those working with multi-GPU setups or training models on a Mac.

Differentiable Collective Operations for Distributed Training

When Thousands of GPUs Must Act as One

One of the central changes in this release concerns distributed training – scenarios where a model is trained across dozens or hundreds of accelerators simultaneously. In such systems, devices constantly exchange data: synchronizing gradients, redistributing tensors, and aligning states. These operations are known as collectives.

Until now, a practical hurdle was that these operations acted as a black box for backpropagation. Simply put: when a model trains, it doesn't just make a prediction; it works out how to adjust its weights to improve accuracy. To do that, it must be able to trace back through every operation performed, and collective operations used to block this flow.

In PyTorch 2.11, this roadblock has been cleared: collective operations are now differentiable. The backward pass now flows through them without the need for extra workarounds. For most users, this change will go unnoticed in day-to-day tasks, but researchers building custom training architectures on large clusters will find it grants them significantly more freedom.
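To see what differentiability through a collective means in practice, here is a minimal single-process sketch. It uses the autograd-aware wrappers in `torch.distributed.nn.functional` (these predate 2.11; the release extends differentiability to the core collectives), with a world size of 1 so it runs on any machine:

```python
import os
import torch
import torch.distributed as dist
import torch.distributed.nn.functional as dist_fn

# Single-process group, just to make the sketch runnable anywhere.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Autograd-aware all_reduce: the sum over ranks participates in the graph,
# so backward() propagates gradients through the collective.
y = dist_fn.all_reduce(x)
loss = (y ** 2).sum()
loss.backward()

print(x.grad)  # with world_size=1, simply 2 * x

dist.destroy_process_group()
```

With more ranks, the backward pass performs a matching collective of its own, so every device receives the correct gradient contribution.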

FlexAttention Performance Gains with FlashAttention-4 Backend

The Attention Mechanism Just Got Faster – and It Shows

Most modern language and multimodal models rely on the attention mechanism – a key operation that allows a model to "understand" which parts of the input data are relevant to each other. The speed of this operation directly dictates how fast a model can train or perform inference.

PyTorch includes a tool called FlexAttention, which allows this mechanism to be flexibly tailored to specific tasks. In version 2.11, it received a new backend based on FlashAttention-4 – a more efficient implementation optimized for modern Hopper and Blackwell GPU architectures (NVIDIA's latest flagship lineups).

Performance gains on heavy workloads range from 1.2× to 3.2× compared to the previous implementation. In short: the same model and the same hardware, but training is noticeably faster. While this is currently an experimental feature and its behavior may evolve as it stabilizes, it is available for testing right now.

Enhanced MPS Support and Metal Acceleration for Apple Silicon

The Mac is Becoming a Serious ML Platform

Developers working on MacBooks with Apple Silicon (M-series) are getting a significant boost in capabilities with this release. MPS is the backend through which PyTorch leverages the GPU inside Apple chips to accelerate computations. Previously, support was incomplete: some operations simply wouldn't run on this backend, causing PyTorch to either fall back to a slower mode or trigger errors.

In 2.11, new mathematical operations have been added, support for data types (integers and complex numbers) has been expanded, and asynchronous error reporting has been introduced. The latter is a crucial detail: PyTorch can now report errors occurring directly on the GPU, such as out-of-bounds indexing. Previously, such bugs were hard to track down and often led to silent failures.
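The usual device-selection pattern keeps code portable between a Mac and anything else. A minimal sketch, with an integer-tensor operation of the kind the expanded dtype support covers:

```python
import torch

# Prefer MPS on Apple Silicon, fall back to CPU everywhere else.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

x = torch.randint(0, 10, (4, 4), device=device)  # integer dtypes on MPS
y = (x * x).float().mean()

print(device, y.item())
```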

This doesn't mean a Mac is about to replace a full-scale server cluster. However, for local development, experimentation, and running models on machines without a dedicated external GPU, the situation has improved dramatically.

GPU Export Support for LSTM and GRU Models

Model Exporting – Now for Recurrent Networks Too

Recurrent networks (LSTM, GRU, and the like) are architectures that excel with sequential data: text, time series, and audio. While currently less popular than Transformers, they remain widely used, especially in tasks where computational efficiency is paramount.

Until now, these architectures faced a hurdle: they were difficult to export for deployment. Exporting is the process where a trained model is "packaged" into a format ready to run in a production environment without unnecessary dependencies. In 2.11, LSTM and GRU models finally have full GPU export support, including dynamic input sizes. This makes them first-class citizens in the standard development cycle – from training to deployment.

XPUGraph for Intel and TopK Optimizations for AMD GPUs

Intel and AMD Haven't Been Forgotten

For Intel GPUs, support for XPUGraph has been added – a mechanism that allows a sequence of operations to be "recorded" and then replayed multiple times, eliminating the overhead of repeated launches. Essentially: Python runs the graph once, bakes it in, and thereafter execution happens directly on the hardware without intermediate steps. This reduces CPU load and speeds up repetitive inference tasks.

For AMD GPUs, support for on-device assertions has arrived, along with improvements to the TopK operator – the operation used to find the largest values in a tensor, which is common in ranking and sampling tasks.
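The operator itself is the same on every backend; for reference, this is the call the AMD optimizations target:

```python
import torch

scores = torch.tensor([0.1, 0.7, 0.3, 0.9, 0.5])

# Top-3 values and their positions -- the core of ranking and top-k sampling.
values, indices = torch.topk(scores, k=3)

print(values)   # tensor([0.9000, 0.7000, 0.5000])
print(indices)  # tensor([3, 1, 4])
```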

Half-Precision FP16 Matrix Multiplication on CPUs

FP16 on CPU: A Boon for Edge Devices

Another small but practical change: PyTorch can now perform matrix multiplications in FP16 (half-precision arithmetic) on standard processors. Previously, FP16 was almost exclusively the domain of GPUs. On the CPU, this opens the door for faster inference on devices without a graphics card – CPU-only inference servers, edge devices, or embedded systems.

TorchScript Deprecation and Migration to Torch Export

What's Going Away: TorchScript is Officially Deprecated

TorchScript was the legacy mechanism for serializing and running PyTorch models. It served its purpose but accumulated various limitations. It was declared legacy as of version 2.10, and now it is officially deprecated: developers are encouraged to migrate to the more modern torch.export approach.

For those just starting out, this means new projects should be built using the latest tools from the get-go. Existing TorchScript projects will continue to work for now, but investing more time into this method is no longer recommended.

PyTorch Moves to Bimonthly Release Schedule

Release Cycles are Picking Up Speed

It is also worth noting a change in the release schedule: starting this year, PyTorch is moving from a quarterly cycle to a bimonthly one. This means improvements will reach users faster. On one hand, this is great news – less waiting for crucial fixes and new features. On the other hand, developers who rigorously test software before updating will have a bit more work to do to keep up with the changes.

Overall, PyTorch 2.11 is a refined, feature-packed release. While it lacks flashy "headline" announcements, it offers real-world improvements across the board: from faster attention mechanisms to broader platform support. It is precisely these kinds of updates that make a tool truly mature.

Original Title: PyTorch 2.11 Release Blog
Publication Date: Mar 23, 2026
Source: PyTorch (pytorch.org), an open-source deep learning framework and community widely used for research and development in artificial intelligence and machine learning.


