Most voice assistants we encounter today work in a rather roundabout way. First, they recognize your speech and turn it into text, then process that text through a language model, and finally synthesize the response back into voice. This results in a sort of chain: speech → text → text → speech. This works, but a lot of nuance gets lost along the way.
The alternative approach is called speech-to-speech, meaning direct speech processing without intermediate translation into text. The model listens to the voice, processes it directly, and responds with a voice. It sounds logical, but in practice, such systems remained complex and expensive for a long time. Now the situation is changing, and the Ultravox team decided to find out just how justified this approach is.
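The difference between the two architectures can be sketched in a few lines. This is a toy illustration, not a real system: an utterance is modeled as words plus prosody, and every function name here is invented for the sketch. The point is structural — the cascaded pipeline discards prosody at the recognition step, while a direct model still has it when forming the reply.

```python
# Toy model: an utterance carries both words and prosody (tone, pauses).
# All names below are hypothetical placeholders, not a real API.

def asr(utterance: dict) -> str:
    """Speech -> text: only the words survive this step."""
    return utterance["words"]

def reply_from_text(text: str) -> str:
    """Stand-in for an LLM that sees words alone."""
    return f"Answering: {text}"

def reply_from_speech(utterance: dict) -> str:
    """Stand-in for a speech-to-speech model that also sees prosody."""
    tone = utterance["prosody"]["emotion"]
    return f"Answering ({tone} caller): {utterance['words']}"

utterance = {"words": "my order never arrived",
             "prosody": {"emotion": "irritated"}}

cascaded = reply_from_text(asr(utterance))  # prosody already lost here
direct = reply_from_speech(utterance)       # prosody still available

print(cascaded)  # Answering: my order never arrived
print(direct)    # Answering (irritated caller): my order never arrived
```

In the cascaded version, no downstream component can recover the caller's tone, because the text interface never carried it.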
Why Text-Based Voice Assistants Lose Speech Nuances
What Gets Lost in the Text Chain
When you speak to a standard voice assistant, it doesn't hear your intonation, pauses, pacing, or emotions. All of this disappears at the text conversion stage. The system sees only words, as if you had typed them. This is fine for simple tasks like setting a timer but becomes a problem when it comes to more complex interactions.
Imagine you call customer support and speak with irritation or uncertainty. A text-based system won't notice this. It will answer based on the meaning of the words but won't account for your state. Direct speech processing allows the model to pick up on these details and react more naturally.
Furthermore, the text chain adds latency. Each stage requires time: recognizing, processing, synthesizing. In a live dialogue, this feels like unnatural pauses. Speech-to-speech systems can work faster because they skip the intermediate conversion steps.
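A back-of-the-envelope calculation shows why serial stages hurt. The millisecond figures below are illustrative assumptions, not measurements from the benchmark; the structural point is that cascaded latencies add up, while a direct model makes one pass.

```python
# Illustrative latency budget for one conversational turn.
# The numbers are assumptions chosen for the sketch, not benchmark data.

cascaded_stages_ms = {"asr": 300, "llm": 500, "tts": 250}
cascaded_latency_ms = sum(cascaded_stages_ms.values())  # stages run serially

direct_latency_ms = 600  # one model, one pass (assumed figure)

print(cascaded_latency_ms)  # 1050
print(direct_latency_ms)    # 600
```

Even if each individual stage is fast, a listener experiences the sum, and anything past a few hundred milliseconds starts to read as an unnatural pause.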
How to Evaluate Voice Agent Quality
To compare different approaches, an evaluation methodology is needed. Ultravox built a dedicated benchmark for this purpose called AIEWF Eval. The name stands for AI Enterprise Workflow Evaluation, that is, testing on working business scenarios.
The essence is that the assessment is not conducted on abstract tasks but on real-world use cases: ordering via a call center, product consultation, or technical support. This is important because voice agents are most often needed in precisely such contexts where speed, accuracy, and natural communication are key.
The benchmark checks several aspects: how correctly the model understands the request, how quickly it reacts, how natural the response sounds, and whether it retains the conversation context. This allows for a more complete picture than simply measuring recognition accuracy or generation speed.
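One way such a multi-aspect evaluation could be aggregated is a weighted score per dialogue. This is only a sketch of the idea: the aspect names follow the text above, but the weights and example scores are invented, and the source does not describe how AIEWF Eval actually combines its metrics.

```python
# Hypothetical aggregation of the four aspects named in the text.
# Weights and scores are invented for illustration.

from dataclasses import dataclass

@dataclass
class DialogueScores:
    understanding: float  # did the model grasp the request? (0..1)
    latency: float        # normalized: 1.0 = instant, 0.0 = too slow
    naturalness: float    # how natural does the response sound?
    context: float        # does it retain earlier turns?

def overall(s: DialogueScores, weights=(0.4, 0.2, 0.2, 0.2)) -> float:
    parts = (s.understanding, s.latency, s.naturalness, s.context)
    return sum(w * p for w, p in zip(weights, parts))

run = DialogueScores(understanding=0.9, latency=0.7,
                     naturalness=0.8, context=0.85)
print(round(overall(run), 3))  # 0.83
```

Scoring whole scenarios this way, rather than recognition accuracy alone, is what lets a benchmark distinguish a system that transcribes well from one that actually converses well.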
Speech-to-Speech vs Text-Based Voice AI Performance Comparison
The Results: Where Speech-to-Speech Wins
Testing showed that direct speech processing indeed offers advantages in several areas. First is reaction speed. Models working directly with speech showed lower latency between replies, which makes the dialogue more lively.
Second is naturalness. When the model processes speech directly, it better preserves the intonation and rhythm of the conversation. This doesn't mean it mimics a human perfectly, but it sounds less mechanical compared to systems assembling an answer from synthesized fragments.
Third is context understanding. Speech models can take into account not just the words but how they are spoken. This helps to more accurately determine a person's intent, especially in ambiguous situations.
There are limitations too. Speech-to-speech models require more computing resources at the training stage and currently cope worse with less common languages and highly specialized vocabulary. But for English and typical business scenarios, they are already showing stable results.
Best Use Cases for Speech-to-Speech Voice AI Technology
Who Needs This Right Now
Direct speech processing is especially useful where speed and the emotional tone of the dialogue matter: call centers, where clients want to solve a problem quickly rather than wait while the robot “thinks”; advisory services, where it is important to create the impression of live communication; and educational apps, where the model must react to a student's intonation to tell whether they are managing well or have gotten confused.
For simple tasks like setting an alarm or checking the weather, a text chain is quite sufficient. But the more complex the scenario, the more noticeable the benefits of the speech-to-speech approach become.
Future Development of Direct Speech Processing Technology
What's Next
The development of voice models is moving toward greater integration of speech capabilities. While previously direct speech processing was available only to large companies with serious resources, now more accessible solutions are appearing. Ultravox, for example, offers tools for developers who want to embed speech-to-speech functionality into their products.
Open questions remain: how to scale such systems to support a larger number of languages, how to make them more energy-efficient, and how to ensure security and privacy when processing voice data. But the direction has been set, and judging by the test results, it is justified.
Direct speech processing won't replace text models completely, but it will become the standard for those tasks where the liveliness and naturalness of interaction are important. And the more accessible these technologies become, the more often we will encounter them in everyday life.