Most voice assistants we encounter today work in a rather roundabout way. First, they recognize your speech and turn it into text, then process that text through a language model, and finally synthesize the response back into voice. This results in a sort of chain: speech → text → text → speech. This works, but a lot of nuance gets lost along the way.
The alternative approach is called speech-to-speech, meaning direct speech processing without intermediate translation into text. The model listens to the voice, processes it directly, and responds with a voice. It sounds logical, but in practice, such systems remained complex and expensive for a long time. Now the situation is changing, and the Ultravox team decided to find out just how justified this approach is.
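The difference between the two architectures can be sketched in a few lines. This is a toy illustration, not a real system: an utterance is modeled as words plus prosody, and every function name here is invented for the sketch. The point is structural — the cascaded pipeline discards prosody at the recognition step, while a direct model still has it when forming the reply.

```python
# Toy model: an utterance carries both words and prosody (tone, pauses).
# All names below are hypothetical placeholders, not a real API.

def asr(utterance: dict) -> str:
    """Speech -> text: only the words survive this step."""
    return utterance["words"]

def reply_from_text(text: str) -> str:
    """Stand-in for an LLM that sees words alone."""
    return f"Answering: {text}"

def reply_from_speech(utterance: dict) -> str:
    """Stand-in for a speech-to-speech model that also sees prosody."""
    tone = utterance["prosody"]["emotion"]
    return f"Answering ({tone} caller): {utterance['words']}"

utterance = {"words": "my order never arrived",
             "prosody": {"emotion": "irritated"}}

cascaded = reply_from_text(asr(utterance))  # prosody already lost here
direct = reply_from_speech(utterance)       # prosody still available

print(cascaded)  # Answering: my order never arrived
print(direct)    # Answering (irritated caller): my order never arrived
```

In the cascaded version, no downstream component can recover the caller's tone, because the text interface never carried it.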
Why Text-Based Voice Assistants Lose Speech Nuances
What Gets Lost in the Text Chain
When you speak to a standard voice assistant, it doesn't hear your intonation, pauses, pacing, or emotions. All of this disappears at the text conversion stage. The system sees only words, as if you had typed them. This is fine for simple tasks like setting a timer but becomes a problem when it comes to more complex interactions.
Imagine you call customer support and speak with irritation or uncertainty. A text-based system won't notice this. It will answer based on the meaning of the words but won't account for your state. Direct speech processing allows the model to pick up on these details and react more naturally.
Furthermore, the text chain adds latency. Each stage requires time: recognizing, processing, synthesizing. In a live dialogue, this feels like unnatural pauses. Speech-to-speech systems can work faster because they skip the intermediate conversion steps.
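A back-of-the-envelope calculation shows why serial stages hurt. The millisecond figures below are illustrative assumptions, not measurements from the benchmark; the structural point is that cascaded latencies add up, while a direct model makes one pass.

```python
# Illustrative latency budget for one conversational turn.
# The numbers are assumptions chosen for the sketch, not benchmark data.

cascaded_stages_ms = {"asr": 300, "llm": 500, "tts": 250}
cascaded_latency_ms = sum(cascaded_stages_ms.values())  # stages run serially

direct_latency_ms = 600  # one model, one pass (assumed figure)

print(cascaded_latency_ms)  # 1050
print(direct_latency_ms)    # 600
```

Even if each individual stage is fast, a listener experiences the sum, and anything past a few hundred milliseconds starts to read as an unnatural pause.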
How to Evaluate Voice Agent Quality
To compare different approaches, an evaluation methodology is needed. Ultravox built a dedicated benchmark for this purpose called AIEWF Eval. The name stands for AI Enterprise Workflow Evaluation, that is, testing on working business scenarios.
The essence is that the assessment is not conducted on abstract tasks but on real-world use cases: ordering via a call center, product consultation, or technical support. This is important because voice agents are most often needed in precisely such contexts where speed, accuracy, and natural communication are key.
The benchmark checks several aspects: how correctly the model understands the request, how quickly it reacts, how natural the response sounds, and whether it retains the conversation context. This allows for a more complete picture than simply measuring recognition accuracy or generation speed.
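One way such a multi-aspect evaluation could be aggregated is a weighted score per dialogue. This is only a sketch of the idea: the aspect names follow the text above, but the weights and example scores are invented, and the source does not describe how AIEWF Eval actually combines its metrics.

```python
# Hypothetical aggregation of the four aspects named in the text.
# Weights and scores are invented for illustration.

from dataclasses import dataclass

@dataclass
class DialogueScores:
    understanding: float  # did the model grasp the request? (0..1)
    latency: float        # normalized: 1.0 = instant, 0.0 = too slow
    naturalness: float    # how natural does the response sound?
    context: float        # does it retain earlier turns?

def overall(s: DialogueScores, weights=(0.4, 0.2, 0.2, 0.2)) -> float:
    parts = (s.understanding, s.latency, s.naturalness, s.context)
    return sum(w * p for w, p in zip(weights, parts))

run = DialogueScores(understanding=0.9, latency=0.7,
                     naturalness=0.8, context=0.85)
print(round(overall(run), 3))  # 0.83
```

Scoring whole scenarios this way, rather than recognition accuracy alone, is what lets a benchmark distinguish a system that transcribes well from one that actually converses well.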
Speech-to-Speech vs Text-Based Voice AI Performance Comparison
The Results: Where Speech-to-Speech Wins
Testing showed that direct speech processing indeed offers advantages in several areas. First is reaction speed. Models working directly with speech showed lower latency between replies, which makes the dialogue more lively.
Second is naturalness. When the model processes speech directly, it better preserves the intonation and rhythm of the conversation. This doesn't mean it mimics a human perfectly, but it sounds less mechanical compared to systems assembling an answer from synthesized fragments.
Third is context understanding. Speech models can take into account not just the words but how they are spoken. This helps to more accurately determine a person's intent, especially in ambiguous situations.
There are limitations too. Speech-to-speech models require more computing resources at the training stage and currently cope worse with less common languages and highly specialized vocabulary. But for English and typical business scenarios, they are already showing stable results.
Best Use Cases for Speech-to-Speech Voice AI Technology
Who Needs This Right Now
Direct speech processing is especially useful where speed and the emotional tone of the dialogue matter: call centers, where clients want to solve a problem quickly rather than wait while the robot “thinks”; advisory services, where it is important to create the impression of live communication; and educational apps, where the model must react to a student's intonation to tell whether they are managing well or have gotten confused.
For simple tasks like setting an alarm or checking the weather, a text chain is quite sufficient. But the more complex the scenario, the more noticeable the benefits of the speech-to-speech approach become.
Future Development of Direct Speech Processing Technology
What's Next
The development of voice models is moving toward greater integration of speech capabilities. While previously direct speech processing was available only to large companies with serious resources, now more accessible solutions are appearing. Ultravox, for example, offers tools for developers who want to embed speech-to-speech functionality into their products.
Open questions remain: how to scale such systems to support a larger number of languages, how to make them more energy-efficient, and how to ensure security and privacy when processing voice data. But the direction has been set, and judging by the test results, it is justified.
Direct speech processing won't replace text models completely, but it will become the standard for those tasks where the liveliness and naturalness of interaction are important. And the more accessible these technologies become, the more often we will encounter them in everyday life.