Published on March 21, 2026

TorchSpec: Accelerating Large Language Models Without Sacrificing Quality

The PyTorch team has introduced TorchSpec, a tool for training the draft models used in speculative decoding, a technique that speeds up text generation in large language models.

Event Source: PyTorch

Over the past year, large language models have grown significantly in both size and capability. Modern models such as Kimi K2.5, GLM 5, and Qwen 3.5 handle very long context windows and demonstrate impressive results across a wide variety of tasks. However, the more powerful the model, the more pressing the question becomes: how can we make it respond quickly without requiring endless computational resources?

One of the most promising approaches to this problem is speculative decoding. The PyTorch team has released TorchSpec, a tool for training, at an industrial scale, the draft models this approach depends on. Let's dive into the details.

Why Is Text Generation So Slow?

When a language model generates text, it does so step-by-step: token by token. A token is roughly equivalent to a word or part of a word. The model calculates each subsequent token anew, based on everything that came before it. Simply put, the model cannot “think ahead” – it proceeds strictly one step at a time.

For large models, each such step is costly: the model's parameters must be read from memory and pushed through intensive computations. As a result, generation speed is limited not so much by processor power as by the data transfer rate between memory and compute units. This is known as being memory-bound, and it becomes the main bottleneck.
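
A back-of-the-envelope estimate makes the memory-bound limit concrete. The sketch below assumes a dense model generating for a single request, where every token requires streaming all weights from memory once. It ignores the KV cache and compute time, and all the numbers are hypothetical, not taken from the source.

```python
def max_tokens_per_second(num_params: float,
                          bytes_per_param: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Rough upper bound on decode speed when each token must read
    every weight from memory once (dense model, batch size 1)."""
    weight_bytes = num_params * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes


# Hypothetical setup: 70e9 parameters, 2 bytes per parameter,
# and 2e12 bytes/s of memory bandwidth.
bound = max_tokens_per_second(70e9, 2, 2e12)
print(f"{bound:.1f} tokens/s upper bound")
```

Under these assumptions the ceiling is set entirely by bandwidth: a faster processor would not raise it, while producing several tokens per weight read would.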

Speculative Decoding: Guessing to Speed Up

The idea behind speculative decoding is quite elegant. Instead of having the large, main model generate each token on its own, a small, fast, and lightweight “draft” model works alongside it. It proposes several tokens ahead at once, essentially making educated guesses about what the large model will output. Then, the large model verifies these guesses in a single pass – either accepting or correcting them.

If the small model guessed correctly, we obtain several tokens for the price of a single verification. This is significantly faster than generating each token individually. At the same time, the quality of the final text is not compromised: the large model still checks every prediction and discards anything that doesn't align with its own probabilities.
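
The draft-then-verify loop can be sketched with toy models. This is a simplified greedy variant: a drafted token is accepted only if it equals the token the target would have picked, whereas production systems typically use a probabilistic acceptance rule. Both "models" here are hypothetical deterministic functions, and the k+1 target checks in each round stand in for what a real system does in one batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy model: prefix -> next token


def target_next(prefix: List[int]) -> int:
    # Toy stand-in for the large model: a fixed rule over the last two tokens.
    return (prefix[-1] * 3 + prefix[-2] + 1) % 10


def draft_next(prefix: List[int]) -> int:
    # Toy stand-in for the draft model: copies the target except after token 8.
    return 0 if prefix[-1] == 8 else target_next(prefix)


def generate_plain(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))  # one model pass per token
    return tokens


def generate_speculative(target: NextToken, draft: NextToken,
                         prompt: List[int], n_new: int,
                         k: int = 4) -> Tuple[List[int], int]:
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(tokens) < goal:
        # 1) The draft proposes k tokens, building on its own guesses.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2) The target verifies the proposal (counted as a single pass).
        target_passes += 1
        accepted: List[int] = []
        for guess in proposal:
            expected = target(tokens + accepted)
            accepted.append(expected)  # the target's token is always what is kept
            if guess != expected:
                break  # first mismatch: keep the correction, drop the rest
        else:
            # Every guess matched, so the same pass yields one extra token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], target_passes


prompt = [1, 2]
plain = generate_plain(target_next, prompt, n_new=12)
fast, passes = generate_speculative(target_next, draft_next, prompt, n_new=12)
print("identical output:", plain == fast, "| target passes:", passes)
```

Because every kept token is the target's own choice, the speculative output is identical to plain generation. The only thing that changes is how many target passes were needed to produce it.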

The key here is how well the small model can “predict” the large one. This is measured by the acceptance rate: the proportion of drafted tokens that the large model accepts. The higher this value, the greater the speedup.
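
To see how strongly the acceptance rate drives the gain, consider a simplified model that is an assumption of this sketch rather than something stated in the source: each drafted token is accepted independently with probability `alpha`, and `k` tokens are drafted per round. The expected number of tokens produced per verification pass is then a geometric sum.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass, assuming each of the k
    drafted tokens is accepted independently with probability alpha.
    Equals 1 + alpha + alpha**2 + ... + alpha**k."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)  # all k drafts accepted, plus one bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


for alpha in (0.5, 0.7, 0.9):
    tokens = expected_tokens_per_pass(alpha, k=4)
    print(f"alpha={alpha}: {tokens:.2f} tokens per target pass")
```

Even with a perfect draft the result is capped at `k + 1`, and with `alpha = 0` it falls back to one token per pass, which is ordinary generation.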

Where Does the Small Model Come From – and Why Is It Not Trivial?

You could take a pre-existing small model and simply run it alongside the large one. This works, but not perfectly: the small model was trained independently, so it does not necessarily predict the behavior of the specific large model well.

A more promising approach is to train the small “draft” model specifically for the large model, so it mimics its behavior as accurately as possible. This is precisely what TorchSpec does: it provides the infrastructure for such training at a large scale, with support for distributed computing, where training runs across multiple accelerators simultaneously.

This is important because this kind of training used to be technically complex: it required running both the small and large models simultaneously, synchronizing their operations, and intelligently distributing the load across devices. Without specialized tools, this turned into a major engineering challenge.

What Does TorchSpec Do in Practice?

TorchSpec is integrated into the PyTorch ecosystem and allows for training draft models in a mode the authors call online speculative decoding training. “Online” here means that the large model participates directly in the training process: the draft model receives real-time feedback from it, rather than being trained on pre-prepared data.

This approach yields better adaptation: the draft model learns to match the specific target model rather than generic internet text. This directly increases the acceptance rate and, consequently, the final speedup.
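
The idea of learning from the target's live outputs can be illustrated with a toy in pure Python. This does not use TorchSpec's API, and the source does not describe its loss function. The sketch assumes a generic distillation setup: at each step the frozen "target" is queried for its next-token distribution, and the "draft" takes one gradient step on the cross-entropy against it. The table-based models, sizes, and learning rate are all hypothetical.

```python
import math
import random

VOCAB, CONTEXTS = 5, 8
rng = random.Random(0)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Frozen toy "target": fixed random logits for each context.
target_logits = [[rng.gauss(0.0, 2.0) for _ in range(VOCAB)]
                 for _ in range(CONTEXTS)]
# Trainable toy "draft": starts out uniform over the vocabulary.
draft_logits = [[0.0] * VOCAB for _ in range(CONTEXTS)]


def target_probs(ctx):
    return softmax(target_logits[ctx])


def mean_cross_entropy():
    total = 0.0
    for ctx in range(CONTEXTS):
        p, q = target_probs(ctx), softmax(draft_logits[ctx])
        total -= sum(pi * math.log(qi) for pi, qi in zip(p, q))
    return total / CONTEXTS


def mean_acceptance():
    # Under standard speculative sampling, a drafted token is accepted
    # with probability sum_v min(p(v), q(v)).
    total = 0.0
    for ctx in range(CONTEXTS):
        p, q = target_probs(ctx), softmax(draft_logits[ctx])
        total += sum(min(pi, qi) for pi, qi in zip(p, q))
    return total / CONTEXTS


def train_online(steps, lr=0.5):
    for _ in range(steps):
        ctx = rng.randrange(CONTEXTS)
        p = target_probs(ctx)           # query the target during training
        q = softmax(draft_logits[ctx])
        for v in range(VOCAB):
            # Gradient of cross-entropy H(p, q) w.r.t. draft logits is q - p.
            draft_logits[ctx][v] -= lr * (q[v] - p[v])


loss_before, acc_before = mean_cross_entropy(), mean_acceptance()
train_online(steps=2000)
loss_after, acc_after = mean_cross_entropy(), mean_acceptance()
print(f"cross-entropy: {loss_before:.3f} -> {loss_after:.3f}")
print(f"acceptance:    {acc_before:.3f} -> {acc_after:.3f}")
```

As the draft's distribution moves toward the target's, the estimated acceptance probability rises, which is the link between this kind of training and the final speedup.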

The tool supports working with multiple accelerators in parallel, which is crucial when dealing with modern large models. In effect, TorchSpec handles the entire infrastructure side of things: how to distribute the models across devices, how to synchronize their operations, and how to manage the data flow between them.

Numbers Worth Mentioning

The TorchSpec team presents experimental results showing a significant speed increase over standard generation. The specific numbers depend on the configuration (model sizes, task, hardware), but the general idea is this: speculative decoding with a well-trained draft model can be several times faster than classic token-by-token generation.
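
Why the numbers vary so much with configuration can be seen in a simple cost model. It is an assumption of this sketch, not a result from the TorchSpec announcement: each drafted token is accepted independently with probability `alpha`, a round drafts `k` tokens, and one draft step costs `draft_cost` times as much as one target pass. All input values below are hypothetical.

```python
def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Estimated speedup over plain decoding under a simplified model.

    alpha:      per-token acceptance probability (assumed independent)
    k:          tokens drafted per round
    draft_cost: cost of one draft step relative to one target pass
    """
    if alpha == 1.0:
        tokens_per_round = float(k + 1)
    else:
        tokens_per_round = (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
    cost_per_round = k * draft_cost + 1.0  # k draft steps plus one target pass
    return tokens_per_round / cost_per_round


# Hypothetical configurations: (alpha, k, draft_cost).
for alpha, k, cost in [(0.8, 4, 0.05), (0.6, 4, 0.05), (0.3, 4, 0.2)]:
    print(f"alpha={alpha}, k={k}, draft_cost={cost}: "
          f"{estimated_speedup(alpha, k, cost):.2f}x")
```

In this model a well-matched, cheap draft gives a severalfold gain, while a poorly matched or expensive draft can make generation slower than plain decoding.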

Moreover, response quality is preserved: since the large model still verifies every drafted token, the output matches what standard generation would produce, just faster.

Who Needs This and Why?

If you just use language models through a chat interface, you don't need TorchSpec directly – it's a tool for those who deploy and maintain models. But for teams working with large models in production, it can be very practical.

Faster generation means less waiting time for users, fewer computational resources per request, and, consequently, lower infrastructure costs. With high traffic volumes, this translates to significant savings.

Furthermore, TorchSpec is an open-source tool within the PyTorch ecosystem, meaning teams can adapt it to their needs without having to build everything from scratch.

What Questions Remain?

Speculative decoding is not a silver bullet. The approach's effectiveness heavily depends on how well the draft model “matches” the behavior of the large one. If the task is unconventional or the generation style varies greatly, the acceptance rate can drop, and the speed gain will be less than expected.

It's also worth noting that training the draft model itself requires resources. This is a one-time cost that pays off during operation, but a barrier to entry still exists.

Finally, online training involving a large model is technically more complex than standard training: you need to manage two models simultaneously, monitor the load balance, and ensure the process is stable. TorchSpec simplifies this, but it doesn't make the task trivial.

Overall, the release of TorchSpec is a step toward making speculative decoding more accessible to a wider range of teams, not just those willing to spend months on infrastructure development. We'll see how the tool is adopted in practice.

Original Title: TorchSpec: Speculative Decoding Training at Scale
Publication Date: Mar 20, 2026
Source: PyTorch (pytorch.org), an international open-source deep learning framework and community widely used for research and development in artificial intelligence and machine learning.

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Analyzing the Original Publication and Writing the Text: Claude Sonnet 4.6 (Anthropic). The neural network studies the original material and generates a coherent text.
2. Translation into English: Gemini 2.5 Pro (Google DeepMind).
3. Text Review and Editing: Gemini 2.5 Flash (Google DeepMind). Correction of errors, inaccuracies, and ambiguous phrasing.
4. Preparing the Illustration Description: DeepSeek-V3.2 (DeepSeek). Generating a textual prompt for the visual model.
5. Creating the Illustration: FLUX.2 Pro (Black Forest Labs). Generating an image based on the prepared prompt.
