Published on March 7, 2026

Sarvam Releases Open Language Models with Support for Indian Languages

Indian company Sarvam AI has open-sourced two large language models – 30B and 105B – with a focus on supporting the languages of India.


Source: Sarvam (5–7 min read)

Most large language models today are trained predominantly on English text. This is no accident: far more English text is available online than text in Hindi, Tamil, Bengali, or Telugu. As a result, these models handle English well but are noticeably weaker in languages spoken by hundreds of millions of people.

Indian company Sarvam AI has decided to tackle this problem head-on. It recently open-sourced two of its language models – Sarvam 30B and Sarvam 105B. The numbers in the names indicate the number of parameters, 30 billion and 105 billion respectively – roughly speaking, this is the model's 'size,' which affects its ability to understand and generate text. More parameters typically mean the model can handle more complex tasks.

Why Are Language-Specific Models Needed Anyway?

When we say a model 'supports' a certain language, it doesn't always mean it performs well with it. The model might translate text or answer simple questions passably, yet still struggle with cultural context, specific idioms, or mixed-language speech – for instance, when someone writes in Hindi but inserts English words.

In India, this kind of language mixing is the norm, not the exception. Additionally, the country has 22 officially recognized languages, and many of them are fundamentally different from each other in structure. Training a model that can work with most of them with equal confidence is a non-trivial task.

Sarvam took a systematic approach. Both models were trained on a large corpus of texts in Indian languages, including Hindi, Tamil, Telugu, Kannada, Malayalam, Bengali, Marathi, Gujarati, Odia, and Punjabi. Special attention was paid to ensuring the models could understand mixed input – when a person switches between languages right in the middle of a sentence.
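
To make the idea of mixed input concrete, here is a minimal Python sketch that flags text combining two or more writing systems. It is an illustration only, not Sarvam's method: the function names and the 10% threshold are assumptions made for this example, and the script label is simply the first word of each letter's Unicode name. The sketch cannot detect romanized mixing (Hindi typed in Latin letters), which would need a word-level language identifier.

```python
import unicodedata
from collections import Counter
from typing import Optional


def script_of(ch: str) -> Optional[str]:
    """Return a coarse script label for a letter, or None for anything else."""
    if not ch.isalpha():
        return None  # skip digits, punctuation, spaces, combining marks
    # Unicode names begin with the script, e.g. "DEVANAGARI LETTER HA".
    name = unicodedata.name(ch, "")
    return name.split()[0] if name else None


def script_counts(text: str) -> Counter:
    """Count the letters in `text` per script label."""
    return Counter(s for s in map(script_of, text) if s is not None)


def is_code_mixed(text: str, min_share: float = 0.1) -> bool:
    """True if at least two scripts each hold `min_share` of the letters.

    `min_share` is a hypothetical threshold chosen for illustration.
    """
    counts = script_counts(text)
    total = sum(counts.values())
    if total == 0:
        return False
    significant = [s for s, n in counts.items() if n / total >= min_share]
    return len(significant) >= 2


if __name__ == "__main__":
    sample = "मुझे यह movie बहुत पसंद आई"  # Hindi sentence with one English word
    print(script_counts(sample))
    print(is_code_mixed(sample))
```

A check like this is only a first-pass signal, for example for routing or for measuring how much mixed-script text a dataset contains.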

30B and 105B – What's the Difference?

To put it simply, they are two models of different scales for different tasks.

Sarvam 30B is more compact. It is designed for cases where speed and accessibility are important: for example, for running on limited hardware or in situations where a quick response is needed. At the same time, according to the company, its performance on Indian languages surpasses that of many larger, general-purpose models.

Sarvam 105B is significantly larger and more powerful. It is aimed at complex tasks: detailed answers, reasoning, and professional contexts. According to the developers, in several tests related to Indian languages and local context, this model shows results comparable to leading commercial models.

Open Source Is Important

Both versions have been released in open access. This means that developers, researchers, and companies can take the models, study them, adapt them for their own needs, and integrate them into their own products – without having to pay for an API or depend on an external service.

This is particularly significant for the Indian tech ecosystem. Many startups and non-profit organizations working with local languages simply cannot afford the recurring costs of commercial models. Open weights lower this barrier.

Furthermore, openness allows for independent verification of how the model performs – which is crucial in sensitive areas like healthcare, education, or legal services, where errors in language understanding can have real-world consequences.

Where Did the Training Data Come From?

One of the most pressing questions when developing language models for specific languages is the data. There is significantly less text in Indian languages available on the open internet than in English, and the quality of what is available often leaves much to be desired.

Sarvam has created its own text corpus, which includes both web data and specially collected and annotated materials in the target languages. The company also handled data filtering and cleaning – a separate and labor-intensive task that is often underestimated.
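
As a rough idea of what filtering and cleaning can involve, the following Python sketch normalizes text, drops very short documents, drops documents that are mostly not in the target script, and removes exact duplicates. It is a generic illustration, not Sarvam's pipeline: the function names, the 20-character minimum, and the 50% script-share threshold are all assumptions made for this example. Real pipelines usually add near-duplicate detection and quality classifiers.

```python
import hashlib
import unicodedata
from typing import Iterable, List


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def target_script_share(text: str, script: str = "DEVANAGARI") -> float:
    """Fraction of letters whose Unicode name starts with `script`."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(1 for ch in letters if unicodedata.name(ch, "").startswith(script))
    return hits / len(letters)


def clean_corpus(
    docs: Iterable[str],
    script: str = "DEVANAGARI",
    min_chars: int = 20,      # hypothetical length cutoff
    min_share: float = 0.5,   # hypothetical script-share cutoff
) -> List[str]:
    """Keep normalized documents that pass all filters, without exact repeats."""
    seen = set()
    kept = []
    for doc in docs:
        text = normalize(doc)
        if len(text) < min_chars:
            continue  # too short to be useful
        if target_script_share(text, script) < min_share:
            continue  # mostly not in the target script
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(text)
    return kept
```

Note that a script-share filter like this would also discard genuinely mixed-language text if the threshold is set too high, so the cutoff has to be chosen with the target use in mind.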

In essence, a significant portion of the team's effort went not into the model's architecture itself, but into gathering enough high-quality training data. This is a typical story for languages that are often called 'low-resource' – not because they have few speakers, but because their digital presence has historically been small.

Who Will Find This Useful?

In short – anyone who is building products for an Indian audience and wants them to truly understand their users.

This could include educational platforms that need to explain material in a student's native language. Or medical services where the accuracy of understanding phrasing is critical. Or voice assistants, chatbots, and document processing tools – the list is long.

For developers who previously had to use general-purpose models and put up with their weaknesses in specific languages, an open alternative is now a real, practical option.

What Remains Unresolved

Open source is a good thing, but it doesn't solve all problems on its own. Running a 105-billion-parameter model requires significant computational resources, which not everyone has. The more compact version is more accessible, but it also has infrastructure requirements.
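
A back-of-the-envelope estimate shows the scale involved. The sketch below multiplies the parameter count by the bytes used per parameter at common numeric precisions. It counts only the memory needed to hold the weights and ignores activations, the attention cache, and framework overhead; the precision table and function name are illustrative assumptions, not figures published by Sarvam.

```python
# Bytes needed to store one parameter at each precision (standard sizes).
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    models = {"Sarvam 30B": 30e9, "Sarvam 105B": 105e9}
    for name, params in models.items():
        for precision in BYTES_PER_PARAM:
            gb = weight_memory_gb(params, precision)
            print(f"{name:12s} {precision:5s} ~{gb:6.1f} GB")
```

At 16-bit precision, 105 billion parameters come to about 210 GB of weights and 30 billion to about 60 GB. 4-bit quantization would bring these down to roughly 52.5 GB and 15 GB, usually at some cost in output quality.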

There is also the question of long-term support: open models live only as long as the team developing them has the energy and resources. Sarvam is a relatively young company, and what support for these models will look like in a year or two remains to be seen.

Finally, open weights are not the same as open data. Information about what exactly the models were trained on is only partially available, and this limits the potential for independent auditing.

Nevertheless, the very fact that high-quality, open models focused on Indian languages have appeared is a step that the local tech community has long been waiting for. And judging by the initial feedback, the interest in them is very real.


Link to Original: https://www.sarvam.ai/blogs/sarvam-30b-105b

Original Title: Open-Sourcing Sarvam 30B and 105B

Publication Date: Mar 6, 2026

Sarvam (www.sarvam.ai): Indian AI company developing language models and speech technologies for local languages and services.



