Published February 12, 2026

Sarvam AI Releases Saaras V3 Speech Recognition for Indian Languages

An Indian company has introduced a new version of its speech recognition system that supports 12 languages and outperforms major competitors in accuracy.

Source: Sarvam

The Indian company Sarvam AI has released Saaras V3 – an automatic speech recognition model designed for the languages of India. This is the third version of the system, which now understands 12 languages: Hindi, Bengali, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu, Urdu, Gujarati, and Indian English.

Why Does This Even Matter?

Dozens of languages are spoken in India, yet most major speech recognition systems handle them poorly. Models like OpenAI's Whisper or Google's solutions were trained mainly on data from Western countries where English dominates. They simply have little training material for Indian languages, especially colloquial variants, regional accents, or phrases that mix several languages.

Sarvam is trying to solve this problem by creating models specifically for the region. And judging by the test results, they are succeeding.

What's New in Version 3?

Saaras V3 was trained on 45,000 hours of audio – about five times more than the previous version. The data was collected from varied sources: call-center calls, YouTube, podcasts, and recordings from streets and offices. Importantly, the sample includes both formal and everyday speech – the kind people actually use.

The model has become better at handling several complex issues:

  • Switching between languages within a single phrase. For India, this is the norm: a person might start a sentence in Hindi, insert an English word, and finish in Punjabi. Previously, models often “broke” on this.
  • Accents and dialects. Each state has its own pronunciation variant, and Saaras V3 accounts for this diversity.
  • Background noise. Recordings from streets, transport, and crowded places – all of this ended up in the training sample, and the model learned to work in such conditions.

Comparison with Competitors

Sarvam ran tests on open datasets and compared Saaras V3 with several popular models: OpenAI's Whisper Large V3 Turbo, Google's Gemini 2.0 Flash, and its own previous version. The main metric is the word error rate (WER) – the percentage of incorrectly recognized words; the lower it is, the better.
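To make the metric concrete (this snippet is illustrative and not from the Sarvam announcement), WER is the word-level edit distance between a reference transcript and the model's hypothesis, divided by the number of reference words:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table: d[i][j] = edit distance between
    # the first i reference words and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, if the reference is "the cat sat" and the model outputs "the bat sat", one of three words is wrong, so WER is about 33%. A WER of 8% in Hindi thus means roughly one word in twelve is misrecognized.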

The results look like this: Saaras V3 showed the best result in most languages. For example, in Hindi, its word error rate is about 8%, while Whisper has about 12%, and Gemini has about 14%. In Bengali, the difference is even more noticeable: Saaras V3 has approximately 10%, Whisper has about 18%, and Gemini – over 20%.

There are a couple of exceptions. In Urdu, Gemini 2.0 Flash showed a slightly better result than Saaras V3. In Indian English, the difference between the models is minimal, but Saaras V3 is still slightly ahead.

How It Works in Practice

Sarvam offers several ways to use the model. There is an API to which you send audio and get text back, and a streaming mode in which the model recognizes speech in real time as the person speaks – convenient for applications like live subtitles or voice assistants.

The model supports files in popular formats: MP3, WAV, FLAC, OGG, and others. The maximum audio duration for a single request is two hours.
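A client-side request could be sketched as below. The endpoint URL, model identifier (`saaras:v3`), and field names are illustrative placeholders, not Sarvam's documented API; only the supported formats and the two-hour per-request limit come from the article.

```python
import os

# From the article: supported container formats and the per-request limit.
SUPPORTED_FORMATS = {".mp3", ".wav", ".flac", ".ogg"}
MAX_DURATION_SECONDS = 2 * 60 * 60  # two hours

def build_transcribe_request(audio_path: str, duration_seconds: float,
                             language: str = "hi-IN") -> dict:
    """Validate the input locally and assemble a request payload.

    All endpoint/field names here are hypothetical placeholders.
    """
    ext = os.path.splitext(audio_path)[1].lower()
    if ext not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported audio format: {ext}")
    if duration_seconds > MAX_DURATION_SECONDS:
        raise ValueError("audio exceeds the two-hour per-request limit")
    return {
        "url": "https://api.example.com/speech-to-text",  # placeholder URL
        "data": {"model": "saaras:v3", "language": language},
        "file": audio_path,
    }

# Sending the request might then look like (requires the `requests` package):
# req = build_transcribe_request("call.wav", duration_seconds=95.0)
# resp = requests.post(req["url"], data=req["data"],
#                      files={"audio": open(req["file"], "rb")})
# print(resp.json())
```

Validating the format and duration before upload avoids a round trip that the server would reject anyway.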

The company also released a lightweight version of the model – Saaras Lite. It works faster and requires fewer resources but loses slightly in accuracy. This is an option for cases where speed is important, not perfect recognition quality.

Who Is This For?

The main usage scenarios are call centers, educational platforms, medical documentation, content platforms, and voice interfaces. There are many startups and companies in India creating products in local languages, and for them, speech recognition accuracy is a critical parameter.

For example, if you are making an app for recording medical consultations in Tamil, you need a model that won't mix up terms and will correctly understand the doctor's accent. Or if you are launching an online learning platform in Bengali, it is important that the subtitles are accurate; otherwise, students simply won't understand the material.

What Remains Behind the Scenes

Sarvam does not disclose details of the model's architecture and does not release it publicly. This is a commercial product, and access to it is paid: there is an API for developers, but the model itself cannot be downloaded and run locally.

Another point: all tests were conducted on open datasets, but results may differ in real-world conditions. For example, if your application has specific terminology or non-standard accents, the model might work worse than in benchmarks.

Finally, although Saaras V3 supports 12 languages, there are far more of them in India. There are languages with fewer speakers for which there are no decent speech recognition systems at all. This is a problem that no one has solved yet.

What's Next?

Sarvam plans to expand the list of languages and improve quality on those already supported. The company is also working on models for other tasks – for example, on speech synthesis systems and language models oriented towards the Indian context.

This is an important step for the Indian market. Technologies that work in local languages give access to digital services to millions of people who do not speak English. And if models like Saaras V3 continue to develop, it could change how people interact with technology in the region.

Original Title: Introducing Saaras V3
Publication Date: Feb 10, 2026
Sarvam (www.sarvam.ai) – an Indian AI company developing language models and speech technologies for local languages and services.

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item was selected as an event important for understanding AI development. Then a processing framework was defined: what needed clarification, what context to add, and where to place emphasis. This turned a single announcement into a coherent, meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Claude Sonnet 4.5 (Anthropic) – Analyzing the original publication and writing the text: the neural network studies the source material and generates a coherent text.
2. Gemini 3 Pro Preview (Google DeepMind) – Translation into English.
3. Gemini 2.5 Flash (Google DeepMind) – Text review and editing: correction of errors, inaccuracies, and ambiguous phrasing.
4. DeepSeek-V3.2 (DeepSeek) – Preparing the illustration description: generating a textual prompt for the visual model.
5. FLUX.2 Pro (Black Forest Labs) – Creating the illustration: generating an image based on the prepared prompt.

