Published February 3, 2026

Why AI Voice Agents Are Switching to Direct Speech Processing

We explore how direct speech processing differs from text-based intermediate steps and why it matters for the voice assistants of the future.

Source: Ultravox · Reading time: 4–6 minutes

Most voice assistants we encounter today work in a rather roundabout way. First, they recognize your speech and turn it into text, then process that text through a language model, and finally synthesize the response back into voice. This results in a sort of chain: speech → text → text → speech. This works, but a lot of nuance gets lost along the way.
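The cascaded pipeline described above can be sketched in a few lines. This is a minimal illustration only: the three stage functions are hypothetical placeholders standing in for a real ASR model, language model, and TTS engine, not any actual API.

```python
# Illustrative sketch of the cascaded "text chain" pipeline.
# All three stage functions are hypothetical stand-ins, not real APIs.

def speech_to_text(audio: bytes) -> str:
    """ASR stage: the words survive, but intonation, pauses,
    pacing, and emotion are dropped here."""
    return "set a timer for ten minutes"  # placeholder transcript

def generate_reply(text: str) -> str:
    """LLM stage: operates on the bare transcript only."""
    return f"OK: {text}"  # placeholder response

def text_to_speech(text: str) -> bytes:
    """TTS stage: synthesizes a voice for the reply text."""
    return text.encode("utf-8")  # placeholder waveform

def cascaded_agent(audio: bytes) -> bytes:
    # speech -> text -> text -> speech: each arrow is a lossy hand-off
    transcript = speech_to_text(audio)
    reply = generate_reply(transcript)
    return text_to_speech(reply)
```

Note that nothing about how the user sounded ever reaches `generate_reply`: the only thing crossing each hand-off is a string.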

The alternative approach is called speech-to-speech, meaning direct speech processing without intermediate translation into text. The model listens to the voice, processes it directly, and responds with a voice. It sounds logical, but in practice, such systems remained complex and expensive for a long time. Now the situation is changing, and the Ultravox team decided to find out just how justified this approach is.
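By contrast, the direct approach collapses the chain into a single hop. The sketch below is again hypothetical: `speech_model` is a stand-in for a trained speech-to-speech model, not a real library call.

```python
# Minimal sketch of the direct approach: one hypothetical model maps
# input audio straight to reply audio, with no transcript in between.

def speech_model(audio_in: bytes) -> bytes:
    """Stand-in for a trained speech-to-speech model (not a real API)."""
    return b"reply-" + audio_in  # placeholder reply waveform

def direct_agent(audio_in: bytes) -> bytes:
    # speech -> speech: a single hop, so intonation, pacing, and
    # emotion in the input are available to the model end to end
    return speech_model(audio_in)
```

The design point is that there is no intermediate text representation for acoustic information to be lost in.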

Why Text-Based Voice Assistants Lose Speech Nuances

What Gets Lost in the Text Chain

When you speak to a standard voice assistant, it doesn't hear your intonation, pauses, pacing, or emotions. All of this disappears at the text conversion stage. The system sees only words, as if you had typed them. This is fine for simple tasks like setting a timer but becomes a problem when it comes to more complex interactions.

Imagine you call customer support and speak with irritation or uncertainty. A text-based system won't notice this. It will answer based on the meaning of the words but won't account for your state. Direct speech processing allows the model to pick up on these details and react more naturally.

Furthermore, the text chain adds latency. Each stage requires time: recognizing, processing, synthesizing. In a live dialogue, this feels like unnatural pauses. Speech-to-speech systems can work faster because they don't need to take as many intermediate steps.
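The latency argument is simple arithmetic: in a cascade the stage latencies add up, while a direct model pays for one hop. The numbers below are illustrative assumptions, not measured benchmarks.

```python
# Back-of-envelope latency accounting.
# All millisecond figures are assumptions for illustration only.
CASCADE_MS = {"asr": 300, "llm": 500, "tts": 250}
S2S_MS = {"speech_model": 600}

cascade_total = sum(CASCADE_MS.values())  # stages run one after another
s2s_total = sum(S2S_MS.values())          # one model, one hop

print(cascade_total, s2s_total)  # 1050 600
```

Even if the single speech model is individually slower than any one cascade stage, it can still win overall because it avoids the serialized hand-offs.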

How to Evaluate Voice Agent Quality

Comparing the approaches requires an evaluation methodology. For this purpose, Ultravox developed a benchmark called AIEWF Eval. The name stands for AI Enterprise Workflow Evaluation: it tests working scenarios for business.

The essence is that the assessment is not conducted on abstract tasks but on real-world use cases: ordering via a call center, product consultation, or technical support. This is important because voice agents are most often needed in precisely such contexts where speed, accuracy, and natural communication are key.

The benchmark checks several aspects: how correctly the model understands the request, how quickly it reacts, how natural the response sounds, and whether it retains the conversation context. This allows for a more complete picture than simply measuring recognition accuracy or generation speed.
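One way to picture a multi-aspect evaluation like this is as a per-scenario scorecard. The structure below is a hypothetical sketch: the field names, score ranges, and equal weighting are assumptions for illustration, not the actual AIEWF Eval scoring scheme.

```python
from dataclasses import dataclass

# Hypothetical scorecard for one test scenario. Field names and the
# equal weighting are assumptions, not the real AIEWF Eval formula.

@dataclass
class ScenarioScore:
    understanding: float  # did the model grasp the request? (0..1)
    latency_ok: float     # reaction speed relative to a target (0..1)
    naturalness: float    # how natural the spoken reply sounds (0..1)
    context_kept: float   # did it retain conversation context? (0..1)

    def overall(self) -> float:
        # equal weights for illustration; a real benchmark would tune these
        parts = (self.understanding, self.latency_ok,
                 self.naturalness, self.context_kept)
        return sum(parts) / len(parts)

score = ScenarioScore(0.9, 0.8, 0.7, 1.0)
print(round(score.overall(), 2))  # 0.85
```

The point of aggregating several aspects is exactly what the text describes: a single number like recognition accuracy or generation speed hides trade-offs that matter in live dialogue.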

Speech-to-Speech vs Text-Based Voice AI Performance Comparison

The Results: Where Speech-to-Speech Wins

Testing showed that direct speech processing indeed offers advantages in several areas. First is reaction speed. Models working directly with speech showed lower latency between replies, which makes the dialogue more lively.

Second is naturalness. When the model processes speech directly, it better preserves the intonation and rhythm of the conversation. This doesn't mean it mimics a human perfectly, but it sounds less mechanical compared to systems assembling an answer from synthesized fragments.

Third is context understanding. Speech models can take into account not just the words but how they are spoken. This helps to more accurately determine a person's intent, especially in ambiguous situations.

There are limitations too. Speech-to-speech models require more computing resources during training and currently handle rare languages and highly specialized vocabulary less well. But for English and typical business scenarios, they already show stable results.

Best Use Cases for Speech-to-Speech Voice AI Technology

Who Needs This Right Now

Direct speech processing is especially useful where speed and the emotional tone of the dialogue matter: call centers, where clients want to solve a problem quickly rather than wait while the robot “thinks”; advisory services, where it is important to create the impression of live communication; and educational apps, where the model must react to a student's intonation to tell whether they are keeping up or getting confused.

For simple tasks like setting an alarm or checking the weather, a text chain is quite sufficient. But the more complex the scenario, the more noticeable the benefits of the speech-to-speech approach become.

Future Development of Direct Speech Processing Technology

What's Next

The development of voice models is moving toward greater integration of speech capabilities. While previously direct speech processing was available only to large companies with serious resources, now more accessible solutions are appearing. Ultravox, for example, offers tools for developers who want to embed speech-to-speech functionality into their products.

Open questions remain: how to scale such systems to support a larger number of languages, how to make them more energy-efficient, and how to ensure security and privacy when processing voice data. But the direction has been set, and judging by the test results, it is justified.

Direct speech processing won't replace text models completely, but it will become the standard for those tasks where the liveliness and naturalness of interaction are important. And the more accessible these technologies become, the more often we will encounter them in everyday life.

#applied-analysis #research-review #ai-development #ai-linguistics #interfaces #human-machine-interaction #ai-benchmarks #voice-ai-agents
Original Title: Why speech-to-speech is the future for AI voice agents: Unpacking the AIEWF Eval
Publication Date: Feb 2, 2026
Ultravox www.ultravox.ai An international project developing AI models for speech synthesis and speech understanding.

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Claude Sonnet 4.5 (Anthropic): Analyzing the Original Publication and Writing the Text. The model studies the source material and generates a coherent text.

2. Gemini 3 Pro Preview (Google DeepMind): Translation into English.

3. Gemini 2.5 Flash (Google DeepMind): Text Review and Editing. Correction of errors, inaccuracies, and ambiguous phrasing.

4. DeepSeek-V3.2 (DeepSeek): Preparing the Illustration Description. Generating a textual prompt for the visual model.

5. FLUX.2 Pro (Black Forest Labs): Creating the Illustration. Generating an image based on the prepared prompt.
