When it comes to evaluating AI systems, the most obvious question is: how exactly do we measure "good"? For speech recognition, the standard answer sounds simple: take a test set of audio recordings with ready-made text transcriptions, run the model, and see how many words it recognized incorrectly. The fewer the errors, the better the model. It all seems straightforward. But in the case of Indian languages, this simplicity proves deceptive.
The Sarvam AI team undertook a large-scale effort to evaluate Automatic Speech Recognition (ASR) systems for the languages of India. Their main conclusion wasn't about the numbers in the tables, but about how difficult it is to obtain these numbers honestly.
The Problem Isn't the Models, It's the Data
India is a country of immense linguistic diversity. There are over twenty officially recognized languages and hundreds of dialects. However, most existing datasets for training and testing ASR systems were created either for English or for a few major world languages. For Hindi, Tamil, Bengali, Telugu, and other Indian languages, the situation is significantly worse.
The researchers at Sarvam AI found that available test datasets – that is, sets of audio recordings with correct transcriptions used to evaluate a model – are often either too small or do not reflect real-world speech. Some of them contain studio recordings with perfect pronunciation, which bear little resemblance to how people speak in everyday life: with accents, in noisy environments, quickly, and with pauses.
Simply put: if a test doesn't match reality, then the evaluation based on it means very little.
What Exactly Was Tested and Why It's Not So Simple
The team created its own test sets for several Indian languages, aiming to cover different recording conditions, accents, and speech styles. This is a labor-intensive task in itself: it involves collecting audio, recruiting native speakers for transcription, verifying the quality of the transcriptions, and ensuring the sample is sufficiently diverse.
A separate challenge is the evaluation metric. The standard metric in ASR is Word Error Rate (WER): the number of word substitutions, deletions, and insertions needed to turn the model's output into the reference transcription, divided by the number of words in the reference. But for languages with rich morphology – where a single word can have dozens of forms depending on the context – this metric doesn't work as well as it does for English. A small error in one part of a word, such as the root or a suffix, counts the whole word as incorrect, so the score can look poor even though the meaning of the phrase remains clear.
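To make the metric concrete, here is a minimal sketch of WER computed with a word-level edit distance. It is not the scoring code used in the study. The character-level variant (CER) is not mentioned in the source and is included only as a contrast, to show how one wrong character in a long word is scored at each level. The example strings are illustrative English, not data from the evaluation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference item missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra item in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate over code points, ignoring spaces.

    Note: for Indic scripts a code point is not always one visible
    character, so a real evaluation may prefer grapheme clusters.
    """
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2/6, about 0.333

# One wrong character in a single long word: the whole word is an error.
print(wer("internationalization", "internationalisation"))  # 1.0
print(cer("internationalization", "internationalisation"))  # 0.05
```

The last two lines show the effect described above: the same one-character slip costs 100% at the word level but only 5% at the character level, and the gap grows as words get longer.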
For some languages, the researchers also looked at how models handle code-switching – when a speaker switches from one language to another mid-sentence. In India, this is a very common scenario: a person might start a sentence in Hindi and finish it in English, or insert a word from a regional language into a sentence in an official one. Most models handle this poorly.
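One rough way to find code-switched utterances in a test set is to look for changes of writing system between neighboring words. The sketch below is an assumption for illustration, not the method used in the study. It only works when the two languages are transcribed in different scripts (for example Devanagari and Latin) and would miss Hindi written in Latin letters. The function names and the sample sentence are made up.

```python
import unicodedata


def word_script(word):
    """Return the script of the first letter in a word, e.g. 'LATIN'.

    The script is read from the first token of the Unicode character
    name ('DEVANAGARI LETTER KA' -> 'DEVANAGARI').
    """
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            return name.split(" ")[0] if name else "UNKNOWN"
    return "OTHER"  # digits, punctuation, and so on


def count_script_switches(sentence):
    """Count positions where neighboring words use different scripts."""
    scripts = [s for s in map(word_script, sentence.split()) if s != "OTHER"]
    return sum(1 for a, b in zip(scripts, scripts[1:]) if a != b)


# Hindi sentence with one English word inserted (illustrative example).
sentence = "मैं कल office जाऊंगा"
print([word_script(w) for w in sentence.split()])
# ['DEVANAGARI', 'DEVANAGARI', 'LATIN', 'DEVANAGARI']
print(count_script_switches(sentence))  # 2
```

A count like this can be used to split a test set into monolingual and code-switched subsets, so that error rates are reported separately for each.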
What the Model Comparison Showed
As part of the study, several speech recognition systems were compared against each other on the same test sets. Among those tested were both global solutions designed for a wide range of languages and models created specifically with the Indian context in mind, including Sarvam's own developments.
The results showed that general-purpose models often underperform specialized ones, particularly on the languages the latter were designed for. This is not surprising: a general model has to "spread" its attention across dozens of languages at once, whereas a model "tuned" for a specific language or language group can better capture its nuances – phonetics, rhythm, and typical structures.
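A comparison of this kind usually comes down to a table of error rates per system and per language, computed on identical utterances. The following sketch shows one way to pool the counts. The system names, language code, and numbers are placeholders invented for the example and are not results from the study.

```python
from collections import defaultdict


def corpus_wer(rows):
    """Pool word errors per (system, language) pair.

    rows: iterable of (system, language, word_errors, reference_words),
    one entry per utterance. Errors and reference lengths are summed
    before dividing, so long utterances weigh more than short ones.
    """
    totals = defaultdict(lambda: [0, 0])
    for system, language, errors, ref_words in rows:
        totals[(system, language)][0] += errors
        totals[(system, language)][1] += ref_words
    return {
        key: errs / words
        for key, (errs, words) in totals.items()
        if words > 0
    }


# Placeholder counts for two hypothetical systems on the same two utterances.
rows = [
    ("system_a", "hi", 2, 10),
    ("system_a", "hi", 1, 10),
    ("system_b", "hi", 4, 10),
    ("system_b", "hi", 2, 10),
]
for (system, language), score in sorted(corpus_wer(rows).items()):
    print(f"{system} {language} WER={score:.2f}")
# system_a hi WER=0.15
# system_b hi WER=0.30
```

Pooling counts this way is a design choice: averaging per-utterance WER instead would let very short utterances dominate the score. Either is defensible as long as every system is scored the same way on the same data.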
At the same time, the researchers noted that even specialized systems are still far from the level considered acceptable for practical use, especially for languages with less training data or high dialectal variation.
Why This Matters Beyond Academic Interest
Speech recognition isn't just about voice typing. It is the foundation for voice assistants, video subtitling, interface accessibility for people with poor reading or writing skills, real-time automatic translation, and many other applications.
For India, with its multilingual population where a significant number of residents use their voice more than a keyboard, the quality of ASR systems is a matter of genuine access to technology. If a model understands Tamil or Marathi poorly, then for millions of people, an entire class of services simply doesn't work as it should.
This is precisely why honest evaluation is not an academic exercise but a practical necessity. You can't improve what you don't measure well.
Open Questions Remain
The work by Sarvam AI raises several questions that do not yet have definitive answers.
First, there's the issue of standardization. To compare models fairly, common test sets agreed upon by the community are needed. Such a standard does not yet exist for Indian languages, and different teams evaluate systems on different data, making it difficult to compare results.
Second is the balance between generality and specialization. Creating a separate model for each of the twenty-plus languages is expensive and labor-intensive. Making one universal model means accepting that it will perform worse on each specific language. How to find a reasonable compromise remains an open question.
Third is the data itself. A good model requires a large volume of high-quality training recordings. For languages with fewer speakers or without strong digital infrastructure, it is simply difficult to collect this data in the required volume.
In a sense, the Sarvam AI study is an honest look at the current state of ASR for Indian languages. It is not a triumphant report, but rather a diagnostic: here's what works, here's what doesn't, and here's why it's hard to measure. Such work may be less spectacular than announcements of new models, but it is no less important for the advancement of the technology.