There's one thing about large language models that annoys almost everyone who works with them: they are slow. Not because they're inefficient, but because they generate text one token at a time, with each new token depending on everything generated before it. This is a fundamental architectural limitation, and it's not easy to circumvent.
One of the methods the industry has adopted in recent years is called speculative decoding. The idea is to use a small auxiliary model – a “speculator” – that guesses the next several tokens the larger model will produce. The large model then checks the whole draft in a single pass instead of generating those tokens one by one. Where the draft matches what the large model would have written itself, the tokens are accepted and time is saved; at the first mismatch, the large model's own token is used and the rest of the draft is discarded.
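The draft-and-verify loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from Aurora or any real inference engine: `target_model` and `draft_model` are hypothetical toy functions standing in for real networks, decoding is greedy, and the target's checks run in a plain loop, whereas a real system would score all draft positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

# A "model" here is just a function from a token prefix to the next token.
Model = Callable[[List[int]], int]


def target_model(prefix: List[int]) -> int:
    """Toy stand-in for the large model (hypothetical deterministic rule)."""
    return (prefix[-1] * 3 + len(prefix)) % 7


def draft_model(prefix: List[int]) -> int:
    """Toy speculator: agrees with the target except when the prefix
    length is a multiple of 4, where it deliberately guesses wrong."""
    guess = target_model(prefix)
    return (guess + 1) % 7 if len(prefix) % 4 == 0 else guess


def speculative_decode(
    prompt: List[int], n_new: int, k: int, draft: Model, target: Model
) -> Tuple[List[int], int]:
    """Generate n_new tokens; return (tokens, number of verification passes)."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(tokens) < goal:
        # 1. The speculator drafts k tokens autoregressively.
        draft_tokens: List[int] = []
        for _ in range(k):
            draft_tokens.append(draft(tokens + draft_tokens))

        # 2. The target checks every draft position. We count this as one
        #    pass because a real model would do it in a single batched call.
        target_passes += 1
        accepted: List[int] = []
        for i in range(k):
            expected = target(tokens + draft_tokens[:i])
            accepted.append(expected)  # matches the draft, or corrects it
            if draft_tokens[i] != expected:
                break  # discard the rest of the draft
        else:
            # Whole draft accepted: the same pass yields one bonus token.
            accepted.append(target(tokens + draft_tokens))
        tokens.extend(accepted)
    return tokens[:goal], target_passes
```

With the toy rules above, `speculative_decode([1], 12, 4, draft_model, target_model)` returns exactly the sequence that plain greedy decoding with `target_model` would produce, but it needs 3 verification passes instead of 12 single-token steps.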
It works. But the classic approach has a weak spot.
A Speculator That Doesn't Learn
Usually, the speculator is configured once – before the system goes into production. It's trained on a specific dataset, fixed in place, and then operates as is. This is a static approach: the model doesn't adapt to actual user queries, doesn't account for the topics or text types where it most often makes mistakes, and doesn't improve with experience.
The problem is that real-world traffic is hard to predict. Queries can be concentrated in a specific topic, style, or language – and if the speculator wasn't trained on such data, fewer of its guesses are accepted. As a result, the speedup shrinks as well.
Aurora: A Speculator That Learns on the Fly
The team at Together AI has released Aurora – an open-source framework that changes this approach. Instead of setting up the speculator and forgetting about it, Aurora turns it into a self-learning system. Every processed query becomes a training example: the model sees where it guessed correctly and where it didn't, and gradually adjusts to the real-world workload.
At its core is reinforcement learning (RL). This is the same approach used to “align” large models with human preferences, but here the task is different: the speculator learns to maximize the proportion of correctly guessed tokens. The more accurately it predicts the continuation, the more time is saved during generation.
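To make the idea of learning from accept/reject feedback concrete, here is a toy sketch. It is not Aurora's algorithm or API, neither of which is described here: it assumes a hypothetical tabular speculator over a 4-token vocabulary, a reward of 1 for an accepted guess and 0 otherwise, and a plain REINFORCE-style update. The names `OnlineSpeculator`, `target_next`, and `run` are invented for illustration.

```python
import math
import random
from typing import List

VOCAB = 4  # toy vocabulary size (assumption)


class OnlineSpeculator:
    """Hypothetical speculator: one row of logits per previous token."""

    def __init__(self, lr: float = 0.5, seed: int = 0) -> None:
        self.logits = [[0.0] * VOCAB for _ in range(VOCAB)]
        self.lr = lr
        self.rng = random.Random(seed)

    def probs(self, prev: int) -> List[float]:
        """Softmax over the logits for the given previous token."""
        row = self.logits[prev]
        top = max(row)
        exps = [math.exp(x - top) for x in row]
        total = sum(exps)
        return [e / total for e in exps]

    def guess(self, prev: int) -> int:
        """Sample a next-token guess from the current policy."""
        return self.rng.choices(range(VOCAB), weights=self.probs(prev))[0]

    def update(self, prev: int, guess: int, accepted: bool) -> None:
        """REINFORCE step: reward * gradient of log-probability of the guess."""
        reward = 1.0 if accepted else 0.0
        p = self.probs(prev)
        for tok in range(VOCAB):
            grad = (1.0 if tok == guess else 0.0) - p[tok]
            self.logits[prev][tok] += self.lr * reward * grad


def target_next(prev: int) -> int:
    """Toy stand-in for the large model's verdict on live traffic."""
    return (prev + 1) % VOCAB


def run(steps: int = 2000) -> List[bool]:
    """Serve `steps` tokens, learning from every accept/reject signal."""
    spec = OnlineSpeculator()
    prev = 0
    history: List[bool] = []
    for _ in range(steps):
        guess = spec.guess(prev)
        truth = target_next(prev)
        accepted = guess == truth
        spec.update(prev, guess, accepted)  # each served token is a training example
        history.append(accepted)
        prev = truth
    return history
```

Comparing the share of accepted guesses early in `run()` with the share near the end shows the speculator adapting to the traffic it actually sees. A production system would have to do the same with a neural draft model, batched updates, and safeguards against drift.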
The result, according to the developers, is an approximately 1.25x speedup compared to a well-trained static speculator. The figure may seem modest, but it's important to understand the context: this isn't an improvement over having no optimization at all, but an improvement on top of an already optimized system. This is an additional gain that comes “for free” – simply because the system continues to learn as it operates.
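A rough way to see how a higher share of accepted guesses turns into speed: if each draft token is accepted independently with probability `alpha` (a simplifying assumption) and the speculator drafts `k` tokens per round, the expected number of tokens produced per large-model pass is (1 - alpha^(k+1)) / (1 - alpha), a standard result from the speculative decoding literature. The sketch below only evaluates that formula. Any `alpha` and `k` plugged into it are illustrative rather than Aurora measurements, and the cost of running the speculator itself is ignored.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model pass, assuming each of the
    k draft tokens is accepted independently with probability alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus one bonus token from the target.
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def relative_gain(alpha_before: float, alpha_after: float, k: int) -> float:
    """Ratio of tokens per pass after vs. before the speculator improves."""
    return expected_tokens_per_pass(alpha_after, k) / expected_tokens_per_pass(
        alpha_before, k
    )
```

Because the formula grows faster than linearly as `alpha` approaches 1, even a moderate improvement in an already good speculator can produce a visible gain in tokens per pass.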
Why This Is Interesting Beyond the Numbers
Aurora is open source. This means the framework is available for study, modification, and use. For teams that deploy large language models in production and are looking for ways to reduce latency without upgrading hardware, such a tool can be quite useful.
But perhaps more interesting is the principle itself. A system that improves its performance not through additional training on pre-collected data but by observing its own activity – that's a different way of thinking about optimization. Not “set it up once and run,” but “launch it and let it learn.”
In a broader context, this fits into the industry's growing interest in making AI systems more efficient without constantly increasing computational power. The recent attention to memory compression algorithms – like Google's newly introduced TurboQuant, which is described as significantly reducing a model's working memory usage without loss of accuracy – is part of the same trend. These are different approaches to the same goal: squeezing more out of what's already there.
Aurora is a lab result that has become a public tool. How it will behave in different deployment conditions, on various models, and with different types of traffic is a question that remains to be tested in practice. Self-training on live traffic sounds appealing, but it also comes with risks: if incoming queries are heavily skewed or atypical, the speculator might adapt to them at the expense of its overall versatility.
Furthermore, integrating RL training into a real-time system is a non-trivial engineering challenge. How easily Aurora solves this for teams without deep expertise in machine learning remains to be seen.
Nevertheless, the direction is clear: accelerating language models is shifting from a one-time setup to a continuous process. Aurora is one of the first open-source steps in that direction.