Limitations of Traditional Automatic Speech Recognition Systems
Why Just Transcribing Is Not Enough
In India, voice is the primary way people interact with technology. Farmers check crop prices, delivery drivers get their routes, and the elderly navigate WhatsApp – and all of them speak far more often than they type. The reason is simple: keyboards just can't keep up with the fluid nature of Indian languages, and speaking is simply more natural than texting.
But here's the paradox: traditional Automatic Speech Recognition (ASR) systems perform quite well on test data featuring clean, scripted speech, but they start to struggle in real-world conditions. As it turns out, transcription accuracy isn't everything. Speech in India demands something more than just turning sounds into text.
On February 2, 2026, the Sarvam AI team introduced Sarvam Audio – an audio extension for the Sarvam 3B language model. The project aims to solve three key problems that recognition systems face in the Indian context.
Main Challenges in Real-World Speech Recognition
Three Problems Hindering Speech Understanding
The first is code-switching (mixing languages). Indians freely pepper their speech with English words. Sometimes these need to be kept in the Latin alphabet, and other times they should be transliterated into the native script. There is no single format that fits every situation.
The second is multiple voices at once. In real life, people often talk over each other – in meetings, interviews, or casual chats. To recognize everything correctly, a system needs to do more than just turn sounds into words; it has to understand exactly who said what.
The third is context. A system must account for previous remarks in a dialogue or information from a long audio recording. Without this, short phrases, ambiguous expressions, or noisy snippets are regularly misinterpreted.
Sarvam Audio attempts to tackle all three problems simultaneously.
One Model – Five Output Formats
Simply put, the system can deliver results in different forms depending on where they will be used. This isn't just technical flexibility; it's a necessity. Indian speech is multilingual by default, and different tasks require different presentation styles.
Sarvam Audio supports five transcription modes:
- Verbatim Transcription – reproducing the text word-for-word. Ideal for call centers and quality control where every detail matters.
- Normalized Without Code-Switching – text with proper punctuation where numbers are written as digits. Useful for recording addresses and order numbers in logistics and e-commerce.
- Normalized With Code-Switching – uses the native script, but English terms remain in the Latin alphabet. This is the format for banking operations and tech support where app names and services are mentioned.
- Fully Latin – the entire text is written using Latin letters, which is convenient for searching and messaging. This works well for WhatsApp Business.
- Smart Translation – you speak in any Indian language and receive English text. Helpful for content creators looking to reach a global audience.
Crucially, the format is not fixed in advance but chosen at the moment of the request: the application specifies which output style it needs on each call.
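To make the per-request format choice concrete, here is a minimal sketch in Python. The `TranscriptionRequest` class and the mode names are illustrative assumptions for this article, not the real Sarvam API:

```python
# Illustrative sketch only: the class and mode names below are
# assumptions made for this article, not the actual Sarvam Audio API.
from dataclasses import dataclass

MODES = {
    "verbatim",        # word-for-word, every detail kept
    "normalized",      # punctuation added, numbers as digits
    "normalized_cs",   # native script, English terms stay in Latin
    "latin",           # entire text romanized
    "translate_en",    # any Indic language -> English text
}

@dataclass
class TranscriptionRequest:
    audio_path: str
    mode: str

    def __post_init__(self) -> None:
        # the style is validated per request, not per loaded model
        if self.mode not in MODES:
            raise ValueError(f"unknown mode: {self.mode!r}")

# The same audio can be requested in different styles at call time:
req = TranscriptionRequest("order_call.wav", mode="normalized")
print(req.mode)  # normalized
```

The point of the sketch is the design choice itself: one model, with the output style passed as a request parameter rather than baked in at deployment.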
The team verified quality on the IndicVoices benchmark – a dataset covering a wide range of real-world Indian speech conditions. Sarvam Audio was compared against GPT-4o-Transcribe and Gemini-3-Flash using the Word Error Rate metric (lower is better). Sarvam Audio showed the best results across all three transcription modes, suggesting that format control does not come at the cost of accuracy.
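For readers unfamiliar with the metric: Word Error Rate is the word-level edit distance between the reference and the hypothesis, divided by the length of the reference. A minimal implementation for intuition (production evaluations typically use established tools such as jiwer):

```python
# Minimal Word Error Rate: edit distance over words / reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / len(ref)

print(wer("send the location now", "send location now"))  # 0.25
```

One dropped word out of a four-word reference gives a WER of 0.25, which is why the metric punishes both missed and invented words.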
Advanced Speaker Diarization for Multi-Party Conversations
Who Said What and When
Real-world audio is rarely a monologue. Meetings, interviews, discussions – these all involve several people whose remarks often overlap. Correctly recognizing this flow means not just transcribing words, but accurately identifying who they belong to.
Sarvam Audio handles this task on recordings up to 60 minutes long and demonstrates superior results compared to its peers in diarization – the task of partitioning audio by speaker. The model doesn't just transcribe; it labels exactly who uttered which phrase.
The team evaluated quality on their own benchmark, compiled from real meeting recordings with expert labeling. The tests included audio files ranging from 1 to 60 minutes with up to 8 speakers and significant voice overlaps. Two metrics were used: Word Diarization Error Rate (the percentage of words attributed to the wrong person) and Diarization Error Rate (the total speaker identification error, including misses and false alarms). In both cases, the lower the score, the higher the quality.
Context as the Key to Understanding
Context is the "secret sauce" required to parse live speech. The architecture of Sarvam Audio is built on a language model base, allowing it to factor in context through text descriptions or conversation history. This significantly improves transcription quality in tricky situations.
For example, when a user says "नौ" (nau) in response to a question about quantity, the system uses the dialogue context to understand it's the Hindi word for "nine", not the English "no". In a noisy recording, if someone says "Bhaiya, loc son bhejo", the model draws on the delivery theme and restores the correct phrase: "Bhaiya, location bhejo". In a conversation about the stock market, Sarvam Audio will transcribe "M&M" as "Mahindra & Mahindra" rather than a literal "M and M".
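A toy way to see the effect of context: among several candidate transcriptions, prefer the one that shares the most words with the dialogue context. Real models condition on context inside the network rather than rescoring afterwards; `pick_with_context` below is purely illustrative:

```python
# Toy context-aware rescoring: score each candidate transcription by how
# many of its words appear in the dialogue context, and keep the best.
# Illustrative heuristic only, not how Sarvam Audio works internally.
def pick_with_context(candidates: list[str], context: str) -> str:
    ctx = set(context.lower().split())
    def score(candidate: str) -> int:
        return sum(word in ctx for word in candidate.lower().split())
    return max(candidates, key=score)

context = "delivery driver asking where to drop the parcel, share location"
print(pick_with_context(
    ["Bhaiya, loc son bhejo", "Bhaiya, location bhejo"], context))
```

Even this crude heuristic recovers "location bhejo" from the noisy alternative, because the delivery context makes one reading far more plausible than the other.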
The team tested this on a benchmark simulating real conversational speech in Indian languages. Instead of classic word-level accuracy metrics, they used an LLM-based evaluation – this better reflects how well the system preserves the gist and key entities in commands and dialogues.
Two parameters were measured: Intent Preservation (was the main action understood correctly?) and Entity Preservation (names, numbers, places, and organizations). Sarvam Audio consistently outperforms Gemini-3-Flash on both counts.
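The entity-preservation idea can be approximated even without an LLM judge: check what fraction of the reference entities survive verbatim in the model's output. The real benchmark uses LLM-based evaluation; this crude string check is only a stand-in to show what is being measured:

```python
# Crude stand-in for entity preservation: the share of reference
# entities (names, numbers, places) found verbatim in the output.
# The actual benchmark uses an LLM judge instead of string matching.
def entity_preservation(ref_entities: list[str], hypothesis: str) -> float:
    if not ref_entities:
        return 1.0
    hyp = hypothesis.lower()
    return sum(entity.lower() in hyp for entity in ref_entities) / len(ref_entities)

print(entity_preservation(
    ["Mahindra & Mahindra", "9"],
    "Buy 9 shares of Mahindra & Mahindra tomorrow"))  # 1.0
```

The limitation is obvious: "nine" would not match "9" here, which is exactly why the benchmark relies on an LLM judge rather than literal comparison.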
The evaluation framework has been made public, and the benchmark itself – the Synthetic Contextual ASR Benchmark (Indic) – has been uploaded to Hugging Face. It covers 10 major Indian languages and is built on synthetic data from sectors like banking, e-commerce, and healthcare. Each example includes audio, a ground-truth transcription, a language tag, and the full conversation context: the bot's role, dialogue history, and the prompt.
Voice Function Calling and Parameter Extraction from Audio
From Speech Straight to Action
Voice assistants are everywhere now. Most work in two stages: first, the audio is transcribed by a speech recognition system (ASR), then the text is processed by a language model (LLM). This works, but it introduces latency and often leads to a loss of context – especially with short or noisy phrases.
Sarvam Audio proves that high-precision function calling and parameter extraction can be performed directly from the audio stream – without an intermediate text conversion step.
By working directly with speech, the system:
- better preserves intent and context;
- significantly reduces latency;
- simplifies the overall solution architecture.
In an example provided in the article, a user engages in a Tamil-language dialogue with a bill-pay bot. After the system clarifies all the details – account type, provider, account number, and amount – the user confirms the transaction. Sarvam Audio instantly identifies the required function and its arguments based on the dialogue context and triggers it without any extra conversions.
This approach allows for the deployment of reliable voice agents based on small, specialized datasets without resorting to heavyweight models.
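The bill-pay flow can be sketched as a direct mapping from model output to a tool call. The `pay_bill` function, the provider name, and the `model_output` structure are illustrative assumptions made for this article, not the real Sarvam API:

```python
# Illustrative sketch of audio-native function calling: the model emits
# a function name plus arguments directly, and the application dispatches
# it. All names and the output structure here are assumptions.
from typing import Any, Callable

def pay_bill(account_type: str, provider: str,
             account_number: str, amount: float) -> str:
    return f"Paid ₹{amount} to {provider} ({account_type} {account_number})"

TOOLS: dict[str, Callable[..., str]] = {"pay_bill": pay_bill}

# What an audio-native model might emit once the Tamil-speaking user
# confirms the transaction (values are made up for the example):
model_output: dict[str, Any] = {
    "function": "pay_bill",
    "arguments": {
        "account_type": "electricity",
        "provider": "TNEB",
        "account_number": "554-221",
        "amount": 1200.0,
    },
}

result = TOOLS[model_output["function"]](**model_output["arguments"])
print(result)  # Paid ₹1200.0 to TNEB (electricity 554-221)
```

The contrast with the two-stage pipeline is that no intermediate transcript is produced or re-parsed: the structured call is the model's output.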
What's Next
Sarvam Audio is reimagining speech recognition for India from the ground up. This isn't just high-quality transcription for 22 Indian languages and Indian English. It's a system that accounts for harsh realities: language mixing, script variations, long recordings, overlapping voices, and complex context.
The model's main edge is that it goes beyond classic recognition. Built-in context handling, diarization, output format control, and direct speech-to-command conversion lay the groundwork for a new generation of voice applications built specifically for Indian users.
Sarvam Audio will soon be available on the Sarvam Dashboard platform. As the developers themselves put it: "Voice is the interface. Sarvam Audio makes it truly work for India."