Most large language models today are trained predominantly on English text. This is no accident: there is far more English text on the internet than text in Hindi, Tamil, Bengali, or Telugu. As a result, these models perform well in English but noticeably worse in languages spoken by hundreds of millions of people.
The Indian company Sarvam AI has decided to tackle this problem head-on. It recently open-sourced two of its language models – Sarvam 30B and Sarvam 105B. The numbers in the names indicate the parameter count in billions – roughly speaking, the model's 'size,' which affects its ability to understand and generate text. More parameters typically mean the model can handle more complex tasks.
When we say a model 'supports' a certain language, that doesn't always mean it handles the language well. The model might translate text or answer simple questions passably, yet struggle with cultural context, specific idioms, or mixed-language speech – for instance, when someone writes in Hindi but inserts English words.
In India, this kind of language mixing is the norm, not the exception. Additionally, the country has 22 officially recognized languages, and many of them are fundamentally different from each other in structure. Training a model that can work with most of them with equal confidence is a non-trivial task.
Sarvam took a systematic approach. Both models were trained on a large corpus of texts in Indian languages, including Hindi, Tamil, Telugu, Kannada, Malayalam, Bengali, Marathi, Gujarati, Odia, and Punjabi. Special attention was paid to ensuring the models could understand mixed input – when a person switches between languages right in the middle of a sentence.
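To make 'mixed input' concrete, here is a minimal sketch of how one might flag text that mixes writing systems. It is an illustration only, not a description of how Sarvam detects or handles such input: the function names and the 10% threshold are assumptions, and it looks at scripts rather than languages, so Hindi typed in Latin letters would go unnoticed.

```python
import unicodedata
from collections import Counter


def script_counts(text: str) -> Counter:
    """Count letters and combining marks per script.

    The script label is the first word of the Unicode character name,
    e.g. 'DEVANAGARI LETTER KA' -> 'DEVANAGARI'. This is a coarse heuristic.
    """
    counts = Counter()
    for ch in text:
        # Keep letters (L*) and combining marks (M*); skip digits, spaces, punctuation.
        if unicodedata.category(ch)[0] in ("L", "M"):
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts


def is_script_mixed(text: str, min_share: float = 0.1) -> bool:
    """True if at least two scripts each make up min_share of the letters."""
    counts = script_counts(text)
    total = sum(counts.values())
    if total == 0:
        return False
    significant = [s for s, n in counts.items() if n / total >= min_share]
    return len(significant) >= 2


# A Hindi sentence with two English words inserted.
sample = "मुझे कल meeting के लिए late हो जाएगा"
print(script_counts(sample))
print(is_script_mixed(sample))
```

A check like this only finds the easy cases. Understanding a sentence that switches languages midway is the hard part, and that is what the models themselves are trained for.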
Put simply, Sarvam 30B and Sarvam 105B are two models of different scale aimed at different tasks.
Sarvam 30B is more compact. It is designed for cases where speed and accessibility are important: for example, for running on limited hardware or in situations where a quick response is needed. At the same time, according to the company, its performance on Indian languages surpasses that of many larger, general-purpose models.
Sarvam 105B is significantly larger and more powerful. It is aimed at complex tasks: detailed answers, reasoning, and professional contexts. According to the developers, in several tests focused on Indian languages and local context, this model shows results comparable to leading commercial models.
Both versions are openly available. This means that developers, researchers, and companies can take the models, study them, adapt them to their own needs, and integrate them into their own products – without having to pay for an API or depend on an external service.
This is particularly significant for the Indian tech ecosystem. Many startups and non-profit organizations working with local languages simply cannot afford the recurring costs of commercial models. Open weights lower this barrier.
Furthermore, openness allows for independent verification of how the model performs – which is crucial in sensitive areas like healthcare, education, or legal services, where errors in language understanding can have real-world consequences.
One of the most pressing questions when developing language models for specific languages is data. Far less text in Indian languages is available on the open internet than in English, and the quality of what is available often leaves much to be desired.
Sarvam has created its own text corpus, which includes both web data and specially collected and annotated materials in the target languages. The company also handled data filtering and cleaning – a separate and labor-intensive task that is often underestimated.
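For a sense of what filtering and cleaning can involve, here is a deliberately small sketch for a Hindi corpus. It is not Sarvam's pipeline, whose details are not public in this article. The function names and thresholds are hypothetical, and real pipelines typically add near-duplicate detection, quality scoring, and much more.

```python
import hashlib
import unicodedata


def devanagari_share(text: str) -> float:
    """Fraction of letters and combining marks in the Devanagari block (U+0900-U+097F)."""
    letters = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not letters:
        return 0.0
    hits = sum(1 for ch in letters if "\u0900" <= ch <= "\u097F")
    return hits / len(letters)


def clean_corpus(docs, min_chars=20, min_share=0.5):
    """Normalize, then drop short, off-script, and exactly duplicated documents."""
    seen = set()
    kept = []
    for doc in docs:
        # Collapse whitespace and apply Unicode NFC so equal texts compare equal.
        text = unicodedata.normalize("NFC", " ".join(doc.split()))
        if len(text) < min_chars:
            continue  # too short to be useful
        if devanagari_share(text) < min_share:
            continue  # mostly not in the target script
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(text)
    return kept


docs = [
    "भारत में कई भाषाएँ बोली जाती हैं।",
    "भारत  में कई   भाषाएँ बोली जाती हैं।",  # same text, different spacing
    "This page is entirely in English text.",
    "नमस्ते",
]
print(clean_corpus(docs))
```

Of the four sample documents, only one survives: the second is a duplicate once whitespace is collapsed, the third is in the wrong script, and the fourth is too short. A script-share filter this strict would also discard mixed-language text, so a real pipeline aimed at code-mixed input would need a gentler rule.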
In essence, a significant portion of the team's effort went not into the model's architecture itself, but into gathering enough high-quality training data. This is a typical story for languages that are often called 'low-resource' – not because they have few speakers, but because their digital presence has historically been small.
In short, the models are aimed at anyone building products for an Indian audience who wants those products to truly understand their users.
This could include educational platforms that need to explain material in a student's native language. Or medical services where understanding exactly how something is phrased is critical. Or voice assistants, chatbots, and document processing tools – the list is long.
For developers who previously had to use general-purpose models and put up with their weaknesses in specific languages, an open alternative is now a real, practical option.
Open source is a good thing, but it doesn't solve all problems on its own. Running a 105-billion-parameter model requires significant computational resources, which not everyone has. The more compact version is more accessible, but it also has infrastructure requirements.
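A back-of-the-envelope calculation shows why. The sketch below estimates only the memory needed to hold the weights, taking the nominal parameter counts from the model names (the exact counts may differ) and assuming common storage precisions. It ignores activations, the attention cache, and framework overhead, so real requirements are higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Nominal sizes taken from the model names; an assumption, not official figures.
models = [("Sarvam 30B", 30e9), ("Sarvam 105B", 105e9)]

for name, params in models:
    for bits in (16, 8, 4):
        gb = weight_memory_gb(params, bits)
        print(f"{name} at {bits}-bit: ~{gb:.1f} GB")
```

At 16 bits per parameter this works out to roughly 60 GB for the smaller model and 210 GB for the larger one, which is more than a single typical consumer GPU can hold. Storing weights at 4 bits cuts those figures to about 15 GB and 52.5 GB, usually at some cost in output quality.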
There is also the question of long-term support: open models live only as long as the team developing them has the energy and resources. Sarvam is a relatively young company, and what support for these models will look like in a year or two remains to be seen.
Finally, open weights are not the same as open data. Information about what exactly the models were trained on is only partially available, and this limits the potential for independent auditing.
Nevertheless, the very fact that high-quality, open models focused on Indian languages have appeared is a step that the local tech community has long been waiting for. And judging by the initial feedback, the interest in them is very real.