Published on April 1, 2026


Aurora: How AI Learned to Predict Its Responses and Continuously Improve

Together AI has introduced Aurora – an open-source framework that transforms language model acceleration into a self-learning system, improving on the fly.

Infrastructure / Technical context
Source: Together.ai | 4–5 min read


There's one thing about large language models that annoys almost everyone who works with them: they are slow. Not because they're badly engineered, but because they generate text one token (roughly a word or a word fragment) at a time, and each new token requires a full pass through the model. This is a fundamental architectural limitation, and it's not easy to circumvent.

One of the methods the industry has adopted in recent years is called speculative decoding. The idea is to use a small auxiliary model, a “speculator,” that tries to guess the next several tokens the larger model will produce. The large model then checks the whole guessed chunk in a single pass instead of generating it step by step, and keeps the guessed tokens up to the first mismatch. Simply put: the small model creates a draft, and the large one checks it. Where the draft matches what the large model would have written itself, time is saved.
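The draft-and-verify loop described above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not Aurora's or Together AI's code: both "models" are plain functions that return the next token greedily, and the names `speculative_step`, `toy_target`, and `toy_draft` are hypothetical.

```python
from typing import Callable, List

Token = str
# A "model" here is any function that maps a token prefix to the next token.
NextToken = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft: NextToken,
                     target: NextToken, k: int = 4) -> List[Token]:
    """Run one draft-and-verify round; return the tokens to append to prefix."""
    # 1. The small model drafts k tokens, one after another.
    drafted: List[Token] = []
    for _ in range(k):
        drafted.append(draft(prefix + drafted))

    # 2. The large model checks each drafted position. A real system scores
    #    all positions in one batched pass; this loop only mimics the outcome.
    accepted: List[Token] = []
    for tok in drafted:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the large model's token, drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. The whole draft matched, so the large model adds one more token.
    accepted.append(target(prefix + accepted))
    return accepted


# Toy stand-ins: the target "knows" a sentence, the draft errs at position 3.
SENTENCE = "the cat sat on the mat today".split()


def toy_target(prefix: List[Token]) -> Token:
    return SENTENCE[len(prefix)] if len(prefix) < len(SENTENCE) else "<eos>"


def toy_draft(prefix: List[Token]) -> Token:
    return "under" if len(prefix) == 3 else toy_target(prefix)


result = speculative_step(["the"], toy_draft, toy_target, k=4)
# result == ['cat', 'sat', 'on']: two drafted tokens accepted, one corrected
```

Even with a wrong guess in the draft, a single verification round yields three tokens here, where plain decoding would need three separate passes of the large model.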

It works. But the classic approach has a weak spot.


A Speculator That Doesn't Learn

Usually, the speculator is configured once – before the system goes into production. It's trained on a specific dataset, fixed in place, and then operates as is. This is a static approach: the model doesn't adapt to actual user queries, doesn't account for the topics or text types where it most often makes mistakes, and doesn't improve with experience.

The problem is that real-world traffic is hard to predict. Queries can be concentrated in a specific topic, style, or language – and if the speculator wasn't trained on this, its accuracy drops. As a result, the speedup drops as well.
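To see how strongly the speedup depends on accuracy, a common back-of-the-envelope model from the speculative decoding literature helps. It assumes that each drafted token is accepted independently with the same probability `alpha`, and it ignores the cost of running the small model. The acceptance rates below are hypothetical and are not figures from Together AI.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per large-model pass.

    alpha: probability that a single drafted token is accepted (assumed
           independent and identical across positions).
    k:     number of tokens the speculator drafts per round.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # every draft accepted, plus one bonus token
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


on_distribution = expected_tokens_per_pass(0.8, 4)   # about 3.36 tokens
off_distribution = expected_tokens_per_pass(0.6, 4)  # about 2.31 tokens
```

Under these assumptions, a drop in acceptance from 80% to 60% cuts the output of each large-model pass by almost a third, which is why a speculator that is out of step with real traffic loses much of its benefit.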


Aurora: A Speculator That Learns on the Fly

The team at Together AI has released Aurora – an open-source framework that changes this approach. Instead of setting up the speculator and forgetting about it, Aurora turns it into a self-learning system. Every processed query becomes a training example: the model sees where it guessed correctly and where it didn't, and gradually adjusts to the real-world workload.

At its core is reinforcement learning (RL). This is the same approach used to “align” large models with human preferences, but here the task is different: the speculator learns to maximize the proportion of correctly guessed tokens. The more accurately it predicts the continuation, the more time is saved during generation.
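The idea of learning from accept/reject feedback can be illustrated with a toy example. The sketch below is not Aurora's algorithm, whose details the announcement summarized here does not spell out. It assumes a tiny tabular speculator with one softmax distribution per context token, a reward of 1 when its guess matches the large model and 0 otherwise, and a plain REINFORCE update. All names (`OnlineSpeculator`, `simulate`) are hypothetical.

```python
import math
import random
from collections import defaultdict


class OnlineSpeculator:
    """Tabular softmax policy: one vector of logits per context token."""

    def __init__(self, vocab, lr=0.5, seed=0):
        self.vocab = list(vocab)
        self.lr = lr
        self.rng = random.Random(seed)
        self.logits = defaultdict(lambda: [0.0] * len(self.vocab))

    def probs(self, context):
        row = self.logits[context]
        m = max(row)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        return [e / total for e in exps]

    def guess(self, context):
        """Sample the index of the guessed next token."""
        weights = self.probs(context)
        return self.rng.choices(range(len(self.vocab)), weights=weights)[0]

    def update(self, context, action, reward):
        """REINFORCE step: grad of log pi(action) is onehot(action) - probs."""
        p = self.probs(context)
        row = self.logits[context]
        for i in range(len(row)):
            grad = (1.0 if i == action else 0.0) - p[i]
            row[i] += self.lr * reward * grad


def simulate(steps=2000, seed=0):
    # Stand-in for the large model: a fixed next-token rule.
    target_next = {"a": "b", "b": "c", "c": "a"}
    vocab = ["a", "b", "c"]
    spec = OnlineSpeculator(vocab, seed=seed)
    traffic = random.Random(seed + 1)  # simulated incoming requests
    history = []
    for _ in range(steps):
        context = traffic.choice(vocab)
        action = spec.guess(context)
        reward = 1.0 if vocab[action] == target_next[context] else 0.0
        spec.update(context, action, reward)  # learn from this request
        history.append(reward)
    return history


history = simulate()
early = sum(history[:200]) / 200   # acceptance rate on the first requests
late = sum(history[-200:]) / 200   # acceptance rate after learning online
```

In this toy setup the speculator starts by guessing uniformly at random, and its acceptance rate rises as requests flow through it, with no separate training phase. A production system would have to do the same for a neural speculator while serving live traffic, which is a far harder engineering problem.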

The result, according to the developers, is an approximately 1.25x speedup compared to a well-trained static speculator. The figure may seem modest, but it's important to understand the context: this isn't an improvement over having no optimization at all, but an improvement on top of an already optimized system. This is an additional gain that comes “for free” – simply because the system continues to learn as it operates.


Why This Is Interesting Beyond the Numbers

Aurora is open source. This means the framework is available for study, modification, and use. For teams that deploy large language models in production and are looking for ways to reduce latency without upgrading hardware, such a tool can be quite useful.

But perhaps more interesting is the principle itself. A system that improves its performance not through additional training on pre-collected data but by observing its own activity – that's a different way of thinking about optimization. Not “set it up once and run,” but “launch it and let it learn.”

In a broader context, this fits into the industry's growing interest in making AI systems more efficient without constantly increasing computational power. The recent interest in memory compression algorithms – like Google's newly introduced TurboQuant, which significantly reduces a model's working memory usage without loss of accuracy – is part of the same trend. Different approaches to the same goal: squeezing more out of what's already there.

What Remains an Open Question

Aurora is a lab result that has become a public tool. How it will behave in different deployment conditions, on various models, and with different types of traffic is a question that remains to be tested in practice. Self-training on live traffic sounds appealing, but it also comes with risks: if incoming queries are heavily skewed or atypical, the speculator might adapt to them at the expense of its overall versatility.

Furthermore, integrating RL training into a real-time system is a non-trivial engineering challenge. How easily Aurora solves this for teams without deep expertise in machine learning remains to be seen.

Nevertheless, the direction is clear: language model acceleration is ceasing to be a one-time setup and is becoming a continuous process. Aurora is one of the first open-source steps in that direction.

Original Title: Aurora
Publication Date: Mar 31, 2026
Together.ai www.together.ai A U.S.-based platform for running and scaling open AI models.


From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Analyzing the Original Publication and Writing the Text: Claude Sonnet 4.6 (Anthropic). The neural network studies the original material and generates a coherent text.
2. Translation into English: Gemini 2.5 Pro (Google DeepMind).
3. Text Review and Editing: Gemini 2.5 Flash (Google DeepMind). Correction of errors, inaccuracies, and ambiguous phrasing.
4. Preparing the Illustration Description: DeepSeek-V3.2 (DeepSeek). Generating a textual prompt for the visual model.
5. Creating the Illustration: FLUX.2 Pro (Black Forest Labs). Generating an image based on the prepared prompt.
