Published on March 25, 2026

Mercury 2: Fast AI Models and the First Steps Towards a Personal Assistant

Inception Labs has introduced Mercury 2 – a diffusion language model that operates quickly and affordably, paving the way for a new approach to creating AI agents.

Event Source: Inception

Most modern language models – like ChatGPT, Claude, and Gemini – operate on the same principle: they generate text one word (or, more precisely, one token) at a time. It's like typing blindly, without knowing in advance how the sentence will end. The method works, but it has an inherent speed limit: the longer the response, the longer you wait.

Inception Labs has taken a different path. Their Mercury series models are built on a diffusion approach – the same one used in image generators like Stable Diffusion. But here, text is generated instead of pictures. In short: the model doesn't write words sequentially but 'develops' the entire response at once, gradually refining it from noise. This is a fundamentally different architecture, and it has one clear advantage: speed.
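The 'developing from noise' idea can be sketched in a few lines. This is a deliberately simplified illustration, not Inception's actual algorithm: the sequence starts fully masked, and each refinement round fills several positions in parallel, so the number of rounds can be far smaller than the number of tokens.

```python
import random

# Toy sketch of diffusion-style text generation: start from a fully
# masked ("noisy") sequence and refine all positions over a few rounds,
# filling several tokens per round instead of one at a time.
# TARGET stands in for what a trained model would actually predict.

TARGET = ["Diffusion", "models", "refine", "whole", "answers", "at", "once"]
MASK = "_"

def denoise_step(seq, rng):
    """Unmask a batch of positions in parallel (stand-in for one model pass)."""
    masked = [i for i, tok in enumerate(seq) if tok == MASK]
    for i in rng.sample(masked, k=min(3, len(masked))):
        seq[i] = TARGET[i]
    return seq

def generate(seed=0):
    rng = random.Random(seed)
    seq = [MASK] * len(TARGET)
    rounds = 0
    while MASK in seq:            # a handful of rounds, not one per token
        seq = denoise_step(seq, rng)
        rounds += 1
    return " ".join(seq), rounds

text, rounds = generate()
print(text, "| rounds:", rounds)
```

Here seven tokens are produced in three rounds; a real diffusion LM similarly trades many cheap sequential steps for fewer, more parallel ones.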

What is Mercury 2 and Why Is It Needed?

Mercury 2 is the new generation of diffusion language models from Inception Labs. The company introduced it along with its own benchmark called PinchBench, which measures not only the quality of responses but also their speed and generation cost simultaneously. The idea is that evaluating a model solely on quality is like choosing a car based only on its top speed while ignoring fuel consumption.

PinchBench combines these three parameters into a single score: how well the model responds, how quickly it does so, and how much it costs. By this metric, Mercury 2 shows results comparable to leading models – at a significantly lower computational cost.
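The article doesn't publish PinchBench's actual formula, so the following is a purely illustrative stand-in: a made-up score that rewards quality, rewards throughput, and penalizes cost. Every name, number, and weight here is invented for illustration.

```python
# Hypothetical combined score in the spirit of "quality + speed + cost".
# NOT the real PinchBench formula; all weights and inputs are made up.

def toy_score(quality, tokens_per_s, usd_per_mtok):
    # Higher quality and throughput help; higher cost per million tokens hurts.
    return quality * (tokens_per_s / 1000) / (1 + usd_per_mtok)

fast_cheap = toy_score(quality=0.85, tokens_per_s=1000, usd_per_mtok=0.5)
slow_pricey = toy_score(quality=0.90, tokens_per_s=100, usd_per_mtok=5.0)
print(fast_cheap > slow_pricey)
```

The takeaway matches the car analogy above: once speed and cost enter the score, a slightly less accurate but much faster and cheaper model can come out ahead.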

Speed That Changes the Application Logic

Mercury 2 generates text at roughly 1,000 tokens per second or more – several times faster than most standard autoregressive models of comparable quality. But it's not just about the numbers.

High speed changes how the model can be used altogether. When a response arrives almost instantly, it opens up scenarios that were previously impractical: running multiple agents in parallel, rapid real-time iteration, and processing a large stream of short tasks without noticeable delays. Simply put, the model ceases to be the bottleneck in the system.

This is especially important for so-called agentic systems – where multiple AI components work together, each performing its own step, and the total response time is the sum of all delays. If each step takes seconds, the entire chain gets stretched out. If each step takes milliseconds, the picture changes dramatically.
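The "sum of all delays" point is simple arithmetic, sketched below with made-up numbers: an eight-step agent chain at two seconds per step versus the same chain at a tenth of that per-step latency.

```python
# Back-of-the-envelope illustration: in an agent chain, user-visible
# latency is the sum of per-step delays, so per-step speed compounds.
# Step count and timings are invented for illustration only.

steps = 8            # e.g., plan, search, read, summarize, ... in one run

slow_step_s = 2.0    # seconds per step for a slower model
fast_step_s = 0.2    # seconds per step for a ~10x faster model

print(f"slow chain: {steps * slow_step_s:.1f} s")
print(f"fast chain: {steps * fast_step_s:.1f} s")
```

Sixteen seconds feels like waiting on a loading screen; under two seconds feels interactive, even though the chain itself is unchanged.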

The Era of the Personal Agent: What Does That Even Mean?

Inception Labs talks about the 'era of the personal agent' – and this isn't just a marketing phrase. Behind it lies a specific idea: an AI assistant that functions not as a search engine (ask a question, get an answer), but as a full-fledged task executor.

Imagine asking your assistant not to 'find me information about flights,' but to 'book a ticket for Friday, check if I have any conflicts in my calendar, and remind me about it on Thursday morning.' This is a chain of actions that needs to be performed sequentially, accessing different tools and considering the context. It is precisely these kinds of tasks that are called agentic.
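The flight-booking example can be sketched as a chain of tool calls. The three tools below are hypothetical stubs invented for illustration; a real agent would call actual booking, calendar, and reminder APIs, and an LLM would decide the order and arguments.

```python
# Minimal sketch of an "agentic" task as a sequential chain of tool
# calls. All three tools are made-up stubs standing in for real APIs.

def book_flight(day):
    return f"ticket booked for {day}"

def check_calendar(day):
    return f"no conflicts on {day}"

def set_reminder(when, about):
    return f"reminder set for {when}: {about}"

def run_agent():
    log = []
    log.append(book_flight("Friday"))           # step 1: act
    log.append(check_calendar("Friday"))        # step 2: check context
    log.append(set_reminder("Thursday morning", log[0]))  # step 3: follow up
    return log

for entry in run_agent():
    print(entry)
```

Note that step 3 depends on the result of step 1: the steps cannot all run at once, which is why per-step model latency dominates the user's experience.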

For this to work in real time and not cost as much as renting a server, the model must be fast and cheap. Mercury 2 is an attempt to fill this specific gap.

Diffusion in Text: A Brief Look at Why It's Not Simple

Applying a diffusion approach to text is a non-trivial task. With images, it's relatively straightforward: pixels can be 'noised' and gradually restored. It's more complex with text – words are discrete, and you can't just 'slightly change' them as smoothly as a pixel's color.

This is precisely why diffusion language models have long lagged behind autoregressive ones in terms of quality. Mercury 2, based on the presented results, significantly closes this gap – especially on tasks where text coherence, instruction following, and working with code are important.

This doesn't mean the diffusion approach is already better in every aspect. But it is becoming a viable alternative, not just an academic experiment.

The Bottom Line

Mercury 2 isn't just another 'smartest model in the world.' It's an attempt to rethink the balance between speed, cost, and quality in language models. Inception Labs is betting that the future of AI systems lies not in a single powerful model that thinks slowly and expensively, but in fast, affordable components that can be run in parallel and at scale.

Whether this bet will pay off, only time will tell. But the very fact that diffusion language models have reached a level where they can be seriously compared with market leaders shows that the solution space in AI is expanding. And that, as a rule, is good news for everyone who uses these solutions.

Original Title: Mercury 2 on PinchBench: Fast Diffusion Models and the Personal Agent Era
Publication Date: Mar 24, 2026
Source: Inception (www.inceptionlabs.ai), a U.S.-based AI company developing next-generation diffusion large language models (dLLMs) and language technologies designed for high-speed text generation and multimodal tasks.


How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Claude Sonnet 4.6 (Anthropic) – Analyzing the Original Publication and Writing the Text: the neural network studies the original material and generates a coherent text.

2. Gemini 2.5 Pro (Google DeepMind) – Translation into English.

3. Gemini 2.5 Flash (Google DeepMind) – Text Review and Editing: correction of errors, inaccuracies, and ambiguous phrasing.

4. DeepSeek-V3.2 (DeepSeek) – Preparing the Illustration Description: generating a textual prompt for the visual model.

5. FLUX.2 Pro (Black Forest Labs) – Creating the Illustration: generating an image based on the prepared prompt.
