Over the past year, large language models have grown significantly – both in size and in capabilities. Modern models like Kimi K2.5, GLM 5, or Qwen 3.5 support very long context windows and demonstrate impressive results across a wide variety of tasks. However, the more powerful the model, the more pressing the question becomes: how can we make it respond quickly without requiring endless computational resources?
One of the most promising approaches to solving this problem is speculative decoding. The PyTorch team has released TorchSpec, a tool for training the draft models that this approach relies on at an industrial scale. Let's dive into the details.
Why Is Text Generation So Slow?
When a language model generates text, it does so step-by-step: token by token. A token is roughly equivalent to a word or part of a word. The model calculates each subsequent token anew, based on everything that came before it. Simply put, the model cannot “think ahead” – it proceeds strictly one step at a time.
For large models, each such step is a costly operation: the hardware has to read a huge number of parameters from memory and move them to the compute units before it can do the arithmetic. As a result, generation speed is limited not so much by processor power as by the data transfer rate between memory and computational units. A workload like this is called memory-bound, and memory bandwidth becomes the main bottleneck.
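To get a feel for the memory-bound limit, here is a back-of-the-envelope sketch. It assumes a dense model whose weights are all read once per generated token, a batch size of one, and no other memory traffic. The parameter count, weight precision, and bandwidth figure are hypothetical values chosen for illustration, not measurements of any particular model or accelerator.

```python
def min_seconds_per_token(num_params: float,
                          bytes_per_param: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Lower bound on per-token latency if every weight is read once per token."""
    return num_params * bytes_per_param / bandwidth_bytes_per_s


# Hypothetical setup: 70 billion parameters, 2 bytes each (16-bit weights),
# and 2e12 bytes/s of memory bandwidth.
latency = min_seconds_per_token(70e9, 2, 2e12)
tokens_per_second = 1 / latency

print(f"{latency * 1000:.0f} ms per token, at most {tokens_per_second:.1f} tokens/s")
```

Under these assumptions the ceiling is set entirely by how fast the weights can be streamed from memory, regardless of how much arithmetic the chip could do in the same time.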
Speculative Decoding: Guessing to Speed Up
The idea behind speculative decoding is quite elegant. Instead of having the large, main model generate each token on its own, a small, fast, and lightweight “draft” model works alongside it. It proposes several tokens ahead at once, essentially making educated guesses about what the large model will output. Then, the large model verifies these guesses in a single pass – either accepting or correcting them.
If the small model guessed correctly, we obtain several tokens for the price of a single verification. This is significantly faster than generating each token individually. At the same time, the quality of the final text is not compromised: the large model still checks every prediction and discards anything that doesn't align with its own probabilities.
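The draft-then-verify loop can be sketched in a few lines. This is a toy illustration, not TorchSpec's or any library's implementation: both "models" are hypothetical deterministic functions over integer tokens, and verification uses the simple greedy rule (accept a drafted token only if it equals the large model's own choice). Production systems that sample from probabilities use a rejection-sampling rule instead, and run the verification as one batched forward pass rather than a Python loop.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token prefix to the next token


def target_model(prefix: List[int]) -> int:
    """Hypothetical stand-in for the large model: a deterministic toy rule."""
    return (sum(prefix) * 31 + len(prefix)) % 10


def draft_model(prefix: List[int]) -> int:
    """Hypothetical stand-in for the draft model: wrong on every 4th position."""
    guess = target_model(prefix)
    return (guess + 1) % 10 if len(prefix) % 4 == 0 else guess


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Classic generation: one model call per new token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Returns the generated tokens and the number of verification passes."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    verify_passes = 0
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens, one after another.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks every proposed position. A real system scores
        #    all k positions in a single forward pass; here we loop for clarity.
        verify_passes += 1
        for i in range(k):
            expected = target(tokens + proposal[:i])
            if proposal[i] != expected:
                # First mismatch: keep the accepted prefix, then use the
                # target's own token as the correction.
                tokens.extend(proposal[:i] + [expected])
                break
        else:
            # All k guesses accepted; the same pass also yields one extra token.
            tokens.extend(proposal + [target(tokens + proposal)])
    return tokens[:goal], verify_passes


prompt = [1, 2, 3]
baseline = greedy_decode(target_model, prompt, 12)
result, passes = speculative_decode(target_model, draft_model, prompt, 12)
print(result == baseline, passes)
```

Every token that ends up in the output is one the large model would have chosen itself, which is why the result matches plain greedy decoding while needing fewer passes of the large model.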
The key here is how well the small model can “predict” the large one. This metric is called the acceptance rate – the proportion of drafted tokens that the large model accepts. The higher this value, the more tokens each verification pass yields, and the greater the speedup.
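The link between acceptance rate and speedup can be made concrete with a standard simplified model. The sketch below assumes that each drafted token is accepted independently with the same probability `alpha`, that the draft proposes `k` tokens per round, and that one draft step costs a fraction `draft_cost` of one large-model pass. All three are modeling assumptions, and the function names are made up for this example; real acceptance varies from token to token.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per verification pass.

    Sums alpha**0 + alpha**1 + ... + alpha**k: the token the target always
    contributes, plus each drafted token that survives verification.
    """
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Tokens per pass divided by the relative cost of one draft-and-verify round."""
    return expected_tokens_per_pass(alpha, k) / (k * draft_cost + 1)


# Illustrative values only: k = 4 drafted tokens, draft step at 5% of a target pass.
for alpha in (0.5, 0.7, 0.9):
    tokens = expected_tokens_per_pass(alpha, 4)
    speedup = estimated_speedup(alpha, 4, 0.05)
    print(f"alpha={alpha}: {tokens:.2f} tokens/pass, ~{speedup:.2f}x")
```

The estimate grows quickly as `alpha` approaches 1, which is why a draft model tuned to one specific large model is worth the effort.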
Where Does the Small Model Come From – and Why Is It Not Trivial?
You could take a pre-existing small model and simply run it alongside the large one. This works, but not perfectly: the small model was trained independently, so it does not necessarily predict the specific large model's behavior well.
A more promising approach is to train the small “draft” model specifically for the large model, so it mimics its behavior as accurately as possible. This is precisely what TorchSpec does: it provides the infrastructure for such training at a large scale, with support for distributed computing, where training runs across multiple accelerators simultaneously.
This is important because this kind of training used to be technically complex: it required running both the small and large models simultaneously, synchronizing their operations, and intelligently distributing the load across devices. Without specialized tools, this turned into a major engineering challenge.
What Does TorchSpec Do in Practice?
TorchSpec is integrated into the PyTorch ecosystem and allows for training draft models in a mode the authors call online speculative decoding training. “Online” here means that the large model participates directly in the training process: the draft model receives real-time feedback from it, rather than being trained on pre-prepared data.
This approach yields a higher quality of adaptation: the draft model learns to imitate the specific target model rather than generic internet text. This directly increases the acceptance rate and, consequently, the final speedup.
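As a rough illustration of what "learning from the target model" means, here is a deliberately tiny sketch in plain Python. It is not TorchSpec's API or training recipe: the target is a hypothetical lookup table, the draft is a table of logits, and the choice of loss (cross-entropy against the target's next-token distribution) is an assumption made for this example. The only point is that the target is queried inside the training loop instead of being replaced by a fixed dataset.

```python
import math
import random
from typing import Dict, List

VOCAB = 3  # toy vocabulary size


def target_probs(context: int) -> List[float]:
    """Hypothetical frozen large model: next-token distribution per context."""
    table = {0: [0.7, 0.2, 0.1], 1: [0.1, 0.1, 0.8]}
    return table[context]


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q): how far the draft distribution q is from the target p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def train_draft_online(steps: int = 500, lr: float = 0.5,
                       seed: int = 0) -> Dict[int, List[float]]:
    rng = random.Random(seed)
    draft_logits = {0: [0.0] * VOCAB, 1: [0.0] * VOCAB}
    for _ in range(steps):
        context = rng.choice([0, 1])
        p = target_probs(context)            # target is queried during training
        q = softmax(draft_logits[context])   # draft's current prediction
        # Gradient of cross-entropy H(p, q) with respect to the logits is q - p.
        for i in range(VOCAB):
            draft_logits[context][i] -= lr * (q[i] - p[i])
    return draft_logits


trained = train_draft_online()
for ctx in (0, 1):
    before = kl_divergence(target_probs(ctx), softmax([0.0] * VOCAB))
    after = kl_divergence(target_probs(ctx), softmax(trained[ctx]))
    print(f"context {ctx}: KL before={before:.4f}, after={after:.6f}")
```

In this toy the target's answers never change, so the online aspect is only structural; with a real large model, the draft sees the target's actual outputs on whatever data flows through training.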
The tool supports working with multiple accelerators in parallel, which is crucial when dealing with modern large models. In effect, TorchSpec handles the entire infrastructure side of things: how to distribute the models across devices, how to synchronize their operations, and how to manage the data flow between them.
Numbers Worth Mentioning
The TorchSpec team presents experimental results showing a significant speed increase compared to standard generation. The specific numbers depend on the configuration – model sizes, task, hardware – but the general idea is this: speculative decoding with a well-trained draft model can be several times faster than classic token-by-token generation.
Moreover, the quality of the responses remains equivalent: since the large model still verifies every drafted token and keeps only those it agrees with, the user gets the same result as with standard generation – just faster.
Who Needs This and Why?
If you just use language models through a chat interface, you don't need TorchSpec directly – it's a tool for those who deploy and maintain models. But for teams working with large models in production, it can be very practical.
Faster generation means less waiting time for users, fewer computational resources per request, and, consequently, lower infrastructure costs. With high traffic volumes, this translates to significant savings.
Furthermore, TorchSpec is an open-source tool within the PyTorch ecosystem, meaning teams can adapt it to their needs without having to build everything from scratch.
What Questions Remain?
Speculative decoding is not a silver bullet. The approach's effectiveness heavily depends on how well the draft model “matches” the behavior of the large one. If the task is unconventional or the generation style varies greatly, the acceptance rate can drop, and the speed gain will be less than expected.
It's also worth noting that training the draft model itself requires resources. This is a one-time cost that pays off during operation, but a barrier to entry still exists.
Finally, online training involving a large model is technically more complex than standard training: you need to manage two models simultaneously, monitor the load balance, and ensure the process is stable. TorchSpec simplifies this, but it doesn't make the task trivial.
Overall, the release of TorchSpec is a step toward making speculative decoding more accessible to a wider range of teams, not just those willing to spend months on infrastructure development. We'll see how the tool is adopted in practice.