The Indian company Sarvam AI has released Saaras V3 – an automatic speech recognition model designed for the languages of India. This is the third version of the system, which now understands 12 languages: Hindi, Bengali, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu, Urdu, Gujarati, and Indian English.
Why Does This Even Matter?
Dozens of languages are spoken in India, and most major speech recognition systems don't work very well with them. Models like Whisper from OpenAI or solutions from Google were trained mainly on data from Western countries where English dominates. They simply have little material for Indian languages, especially when it comes to colloquial variants, accents, or mixing languages in a single phrase.
Sarvam is trying to solve this problem by creating models specifically for the region. And judging by the test results, they are succeeding.
What's New in Version 3?
Saaras V3 was trained on 45,000 hours of audio – about five times more than the previous version. The data was collected from varied sources: call-center conversations, YouTube, podcasts, and recordings made on streets and in offices. Crucially, the sample includes both formal and everyday speech – the way people actually talk.
The model has become better at handling several complex issues:
- Switching between languages within a single phrase. For India, this is the norm: a person might start a sentence in Hindi, insert an English word, and finish in Punjabi. Previously, models often “broke” on this.
- Accents and dialects. Each state has its own pronunciation variant, and Saaras V3 accounts for this diversity.
- Background noise. Recordings from streets, transport, and crowded places – all of this ended up in the training sample, and the model learned to work in such conditions.
Comparison with Competitors
Sarvam conducted tests on open datasets and compared Saaras V3 with several popular models: Whisper Large V3 Turbo from OpenAI, Gemini 2.0 Flash from Google, and its own previous version. The main metric is the Word Error Rate (the percentage of incorrectly recognized words). The lower it is, the better.
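For readers unfamiliar with the metric: WER is the number of word-level substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the reference length. A minimal sketch using a standard edit-distance computation (not Sarvam's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words, computed with dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost, # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of four reference words -> WER of 0.25 (25%)
print(wer("the cat sat down", "the cat sat up"))  # 0.25
```

So a Hindi WER of 8% means roughly one word in twelve comes out wrong.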
The results look like this: Saaras V3 showed the best result in most languages. For example, in Hindi, its word error rate is about 8%, while Whisper has about 12%, and Gemini has about 14%. In Bengali, the difference is even more noticeable: Saaras V3 has approximately 10%, Whisper has about 18%, and Gemini – over 20%.
There are a couple of exceptions. In Urdu, Gemini 2.0 Flash showed a slightly better result than Saaras V3. In Indian English, the difference between the models is minimal, but Saaras V3 is still slightly ahead.
How It Works in Practice
Sarvam offers several formats for using the model. There is an API through which you can send audio and receive text. There is a streaming mode – where the model recognizes speech in real-time as the person speaks. This is convenient for applications like live subtitles or voice assistants.
The model supports files in popular formats: MP3, WAV, FLAC, OGG, and others. The maximum audio duration for a single request is two hours.
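A client would need to check those constraints before uploading. The sketch below encodes the limits the article states (formats, two-hour cap); everything about the HTTP call itself – the endpoint URL, header names, and field names – is a placeholder, not Sarvam's documented API:

```python
import os
from typing import Optional

# Constraints from the article: supported formats and the per-request duration cap.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".flac", ".ogg"}
MAX_DURATION_SECONDS = 2 * 60 * 60  # two hours per request

def validate_audio(path: str, duration_seconds: float) -> None:
    """Reject files the service would refuse before wasting an upload."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported format: {ext}")
    if duration_seconds > MAX_DURATION_SECONDS:
        raise ValueError("audio exceeds the two-hour limit; split it first")

def build_transcription_request(path: str, language: Optional[str] = None) -> dict:
    """Assemble a hypothetical multipart request for a speech-to-text endpoint."""
    request = {
        "url": "https://api.example.com/v1/transcribe",   # placeholder endpoint
        "headers": {"Authorization": "Bearer <YOUR_API_KEY>"},
        "data": {"model": "saaras:v3"},                   # hypothetical model id
        "file_path": path,
    }
    if language:
        request["data"]["language"] = language
    return request
```

In practice the returned dict would be handed to an HTTP client (e.g. `requests.post(url, headers=..., data=..., files=...)`); consult the provider's actual API reference for the real endpoint and parameters.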
The company also released a lightweight version of the model – Saaras Lite. It runs faster and requires fewer resources, at a slight cost in accuracy. This is an option for cases where speed matters more than perfect recognition quality.
Who Is This For?
The main usage scenarios are call centers, educational platforms, medical documentation, content platforms, and voice interfaces. There are many startups and companies in India creating products in local languages, and for them, speech recognition accuracy is a critical parameter.
For example, if you are making an app for recording medical consultations in Tamil, you need a model that won't mix up terms and will correctly understand the doctor's accent. Or if you are launching an online learning platform in Bengali, it is important that the subtitles are accurate; otherwise, students simply won't understand the material.
What Remains Behind the Scenes
Sarvam does not disclose details of the model's architecture and does not release its weights publicly. This is a commercial product, and access to it is paid: there is an API for developers, but the model itself cannot be downloaded and run locally.
Another point: all tests were conducted on open datasets, but results may differ in real-world conditions. For example, if your application has specific terminology or non-standard accents, the model might work worse than in benchmarks.
Finally, although Saaras V3 supports 12 languages, there are far more of them in India. There are languages with fewer speakers for which there are no decent speech recognition systems at all. This is a problem that no one has solved yet.
What's Next?
Sarvam plans to expand the list of languages and improve quality on those already supported. The company is also working on models for other tasks – for example, on speech synthesis systems and language models oriented towards the Indian context.
This is an important step for the Indian market. Technologies that work in local languages give access to digital services to millions of people who do not speak English. And if models like Saaras V3 continue to develop, it could change how people interact with technology in the region.