Most modern language models – like ChatGPT, Claude, and Gemini – operate on the same principle: they generate text one word (or, more accurately, one token) at a time. This is similar to someone writing a sentence without knowing in advance how it will end. The method works, but it has an inherent speed limitation: the longer the response, the longer you wait.
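The loop behind this approach can be sketched in a few lines. Here `next_token` is a toy stand-in for a real model's forward pass – the point is that each token depends on everything generated so far, so the steps cannot run in parallel:

```python
def next_token(context: list[str]) -> str:
    # Toy "model": deterministically continues a fixed sentence.
    # A real model would run a full forward pass over the context here.
    canned = ["The", "answer", "is", "42", "<eos>"]
    return canned[len(context)]

def generate(max_tokens: int = 10) -> list[str]:
    tokens: list[str] = []
    for _ in range(max_tokens):
        tok = next_token(tokens)   # one forward pass per emitted token
        if tok == "<eos>":
            break
        tokens.append(tok)
    return tokens

print(generate())  # → ['The', 'answer', 'is', '42']
```

Four tokens cost four sequential model calls; a thousand-token answer costs a thousand, one after another.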
Inception Labs has taken a different path. Their Mercury series models are built on a diffusion approach – the same one used in image generators like Stable Diffusion. But here, text is generated instead of pictures. In short: the model doesn't write words sequentially but 'develops' the entire response at once, gradually refining it from noise. This is a fundamentally different architecture, and it has one clear advantage: speed.
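A heavily simplified sketch of the idea, assuming a masked-denoising style of text diffusion (the general family of techniques – Mercury's exact recipe is not public here). The draft starts fully masked, and each refinement step fills in several positions at once. The toy `denoise_step` cheats by copying from a fixed target where a real model would predict:

```python
import random

def denoise_step(draft, target, fill_fraction=0.5):
    # One refinement step: reveal a fraction of the still-masked positions.
    # A real model would choose positions by confidence and predict the
    # tokens itself; this toy picks positions at random and copies them.
    masked = [i for i, t in enumerate(draft) if t == "[MASK]"]
    k = max(1, int(len(masked) * fill_fraction))
    for i in random.sample(masked, min(k, len(masked))):
        draft[i] = target[i]
    return draft

def generate_diffusion(target, steps=8):
    draft = ["[MASK]"] * len(target)       # start from pure "noise"
    for _ in range(steps):
        if "[MASK]" not in draft:
            break
        draft = denoise_step(draft, target)
    return draft

print(generate_diffusion(["The", "answer", "is", "42"]))
# → ['The', 'answer', 'is', '42']
```

The key difference from the autoregressive loop: the whole sequence is touched on every step, so the number of model calls depends on the number of refinement steps, not on the response length.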
What Is Mercury 2 and Why Is It Needed?
Mercury 2 is the new generation of diffusion language models from Inception Labs. The company introduced it along with its own benchmark called PinchBench, which simultaneously measures response quality, speed, and generation cost. The idea is that evaluating a model solely on quality is like choosing a car based only on its top speed while ignoring fuel consumption.
PinchBench combines these three parameters into a single score: how well the model responds, how quickly it does so, and how much it costs. By this metric, Mercury 2 shows results comparable to leading models – at a significantly lower computational cost.
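The actual PinchBench formula isn't given here, but the idea of folding three axes into one number can be illustrated with a hypothetical scoring function. The weights and the log scaling below are assumptions for illustration, not PinchBench's method:

```python
import math

def combined_score(quality, tokens_per_sec, usd_per_mtok,
                   w_quality=1.0, w_speed=0.2, w_cost=0.2):
    # Hypothetical score, NOT the real PinchBench formula:
    # reward quality, reward throughput, penalize price per million tokens.
    # Throughput and price are on log scales so order-of-magnitude
    # differences matter rather than raw deltas.
    return (w_quality * quality
            + w_speed * math.log10(tokens_per_sec)
            - w_cost * math.log10(usd_per_mtok))

# A fast, cheap model with the same quality outscores a slow, pricey one:
print(combined_score(quality=0.80, tokens_per_sec=1000, usd_per_mtok=0.5))
print(combined_score(quality=0.80, tokens_per_sec=100,  usd_per_mtok=5.0))
```

Whatever the real weighting is, the principle is the same: two models with identical quality can land far apart once speed and cost enter the score.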
Speed That Changes the Application Logic
Mercury 2 generates text at speeds of around 1000 tokens per second and higher – several times faster than most standard autoregressive models with comparable quality. But it's not just about the numbers.
High speed changes how the model can be used altogether. When a response arrives almost instantly, it opens up scenarios that were previously impractical: running multiple agents in parallel, rapid real-time iteration, and processing a large stream of short tasks without noticeable delays. Simply put, the model ceases to be the bottleneck in the system.
This is especially important for so-called agentic systems – where multiple AI components work together, each performing its own step, and the total response time is the sum of all delays. If each step takes seconds, the entire chain gets stretched out. If each step takes milliseconds, the picture changes dramatically.
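The arithmetic is easy to make concrete. With illustrative numbers – a ten-step chain, roughly 200 tokens generated per step – per-step throughput dominates the end-to-end time:

```python
def chain_latency_ms(steps, tokens_per_step, tokens_per_sec):
    # Total latency of a sequential agent chain: each step must finish
    # before the next begins, so per-step delays simply add up.
    per_step_ms = tokens_per_step * 1000 / tokens_per_sec
    return steps * per_step_ms

slow = chain_latency_ms(steps=10, tokens_per_step=200, tokens_per_sec=50)
fast = chain_latency_ms(steps=10, tokens_per_step=200, tokens_per_sec=1000)
print(f"50 tok/s:   {slow:.0f} ms total")   # 40000 ms, i.e. 40 seconds
print(f"1000 tok/s: {fast:.0f} ms total")   # 2000 ms, i.e. 2 seconds
```

The same chain drops from 40 seconds to 2 – the difference between a tool you wait on and one that feels interactive.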
The Era of the Personal Agent: What Does That Even Mean?
Inception Labs talks about the 'era of the personal agent' – and this isn't just a marketing phrase. Behind it lies a specific idea: an AI assistant that functions not as a search engine (ask a question, get an answer), but as a full-fledged task executor.
Imagine asking your assistant not to 'find me information about flights,' but to 'book a ticket for Friday, check if I have any conflicts in my calendar, and remind me about it on Thursday morning.' This is a chain of actions that needs to be performed sequentially, accessing different tools and considering the context. It is precisely these kinds of tasks that are called agentic.
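The flight example can be sketched as a plain sequential pipeline. Every function here is a hypothetical stand-in for a real tool call (calendar API, booking API, reminder service) – the point is only the shape of the chain, where each step consumes the previous step's result:

```python
def check_calendar(day):
    # Stand-in for a real calendar API; pretend there are no conflicts.
    return []

def book_flight(day):
    # Stand-in for a real booking API.
    return {"day": day, "confirmation": "ABC123"}

def set_reminder(when, text):
    # Stand-in for a real reminder service.
    return f"reminder set for {when}: {text}"

def run_agent(day="Friday"):
    steps = []
    conflicts = check_calendar(day)            # step 1: gather context
    steps.append(("calendar", conflicts))
    if not conflicts:
        booking = book_flight(day)             # step 2: act on the result
        steps.append(("booking", booking))
        steps.append(("reminder",              # step 3: follow up
                      set_reminder("Thursday morning",
                                   f"flight {booking['confirmation']}")))
    return steps

for name, result in run_agent():
    print(name, result)
```

Each arrow in that chain is a round trip through the model, which is why per-step latency compounds so quickly.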
For this to work in real time and not cost as much as renting a server, the model must be fast and cheap. Mercury 2 is an attempt to fill this specific gap.
Diffusion in Text: A Brief Look at Why It's Not Simple
Applying a diffusion approach to text is a non-trivial task. With images, it's relatively straightforward: pixels can be 'noised' and gradually restored. It's more complex with text – words are discrete, and you can't just 'slightly change' them as smoothly as a pixel's color.
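The discreteness problem can be seen directly: you can nudge a pixel value by a hundredth, but there is no token 'slightly between' two words. One common workaround in the discrete-diffusion literature – an assumption about the general family, not necessarily Mercury's design – is to corrupt text by masking tokens instead of adding continuous noise:

```python
import random

def corrupt(tokens, p, rng):
    # Discrete "noising": replace each token with [MASK] with probability p.
    # This plays the role that Gaussian noise plays for pixels.
    return [t if rng.random() >= p else "[MASK]" for t in tokens]

rng = random.Random(42)
sentence = ["the", "cat", "sat", "on", "the", "mat"]
for p in (0.3, 0.6, 1.0):
    print(p, corrupt(sentence, p, rng))
```

At `p = 1.0` the sentence is pure 'noise' (all masks); the model is then trained to run this process in reverse, recovering text step by step.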
This is precisely why diffusion language models have long lagged behind autoregressive ones in terms of quality. Mercury 2, based on the presented results, significantly closes this gap – especially on tasks where text coherence, instruction following, and working with code are important.
This doesn't mean the diffusion approach is already better in every aspect. But it is becoming a viable alternative, not just an academic experiment.
The Bottom Line
Mercury 2 isn't just another 'smartest model in the world.' It's an attempt to rethink the balance between speed, cost, and quality in language models. Inception Labs is betting that the future of AI systems lies not in a single powerful model that thinks slowly and expensively, but in fast, affordable components that can be run in parallel and at scale.
Whether this bet will pay off, only time will tell. But the very fact that diffusion language models have reached a level where they can be seriously compared with market leaders shows that the solution space in AI is expanding. And that, as a rule, is good news for everyone who uses these solutions.