Published February 9, 2026

Sarvam Vision: A Document-Processing Model with Indic Language Expertise

An Indian startup has released a compact multimodal model capable of recognizing text in 22 of the country's languages – often more accurately than global counterparts.

Source: Sarvam · Reading time: 5–7 minutes

Multimodal AI Capabilities for Document Processing

When Knowledge is Locked in Documents

On February 5, Indian company Sarvam AI introduced Sarvam Vision – a multimodal model capable of working with both text and images. While Sarvam had previously released voice and text solutions, the developers have now expanded into the visual realm.

The model is built on state-space architecture and contains three billion parameters. While it is not the largest model on the market – for comparison, GPT-4 is estimated to contain hundreds of billions of parameters – its compactness is an advantage: the model runs faster and requires fewer resources.

The primary mission of Sarvam Vision is document processing. The model can describe images, recognize text in photos, interpret charts, and parse complex tables. Crucially, it is specialized for Indic languages.

Challenges of Recognizing Regional Indic Languages

The Problem Global Models Don't Solve

In India, vast amounts of information are still stored on paper: scanned archives, historical documents, and government bulletins. For this data to be accessible for research or business use, it must be digitized.

The issue is that existing document recognition solutions perform well in English but stall on India's regional languages. Global models treat them as secondary, leading to a drop in recognition accuracy.

Sarvam took a different approach: rather than adapting a Western model, they trained their own from scratch, paying special attention to India's 22 official languages – from Hindi and Bengali to less common ones like Santali or Maithili.

Dataset Preparation and Training Process

How the Model Was Trained

The developers assembled an extensive dataset: synthetic and real "image–text" pairs for all languages. The dataset included scientific papers, financial reports, government documents, historical manuscripts, textbooks, and newspapers.

Data was prepared separately for each document type. For example, for charts, tasks were created for structure extraction, description, and analysis. For tables, the focus was on understanding cell relationships.
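
To make the per-document-type preparation concrete, here is a minimal sketch of how such task records might be generated. The field names, task labels, and prompts are purely illustrative assumptions, not Sarvam's actual data schema.

```python
# Hypothetical sketch of per-document-type training tasks.
# Field names and task types are illustrative, not Sarvam's actual schema.

def make_chart_tasks(image_path: str) -> list[dict]:
    """Build the three chart tasks described in the article:
    structure extraction, description, and analysis."""
    return [
        {"image": image_path, "task": "extract_structure",
         "prompt": "List the axes, series, and data points of this chart."},
        {"image": image_path, "task": "describe",
         "prompt": "Describe what this chart shows in one paragraph."},
        {"image": image_path, "task": "analyze",
         "prompt": "What trend does this chart illustrate?"},
    ]

def make_table_tasks(image_path: str) -> list[dict]:
    """Table tasks focus on relationships between cells, per the article."""
    return [
        {"image": image_path, "task": "cell_relations",
         "prompt": "For each cell, state its row and column headers."},
    ]

tasks = make_chart_tasks("chart_001.png") + make_table_tasks("table_001.png")
print(len(tasks))  # 4
```

The point of splitting by document type is that each visual format gets prompts matched to what is hard about it: charts need structural reads, tables need cell-relation reasoning.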

Training took place in several stages: first, pre-training the base model, then fine-tuning for specific tasks, and finally reinforcement learning based on verifiable rewards – meaning the model received feedback based on how accurately it completed the task.
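
The idea of a "verifiable reward" can be illustrated with a minimal sketch: for an OCR task, the reward is a directly checkable score, here character-level similarity between the model's transcript and the ground truth, rather than a learned preference model. This is a generic illustration of the technique, not Sarvam's actual reward function.

```python
# Minimal "verifiable reward" sketch for an OCR task: the reward can be
# computed exactly from the model output and a known ground truth.
from difflib import SequenceMatcher

def ocr_reward(model_output: str, ground_truth: str) -> float:
    """Return a reward in [0, 1]: 1.0 means an exact transcript."""
    return SequenceMatcher(None, model_output, ground_truth).ratio()

print(ocr_reward("नमस्ते दुनिया", "नमस्ते दुनिया"))          # 1.0
print(round(ocr_reward("hello world", "hello w0rld"), 2))  # 0.91
```

Because the score is computed, not judged, the model can be trained against it at scale without human labelers in the loop.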

Benchmark Results

Sarvam compared its model against competitors on several popular tests. On the English-language portion of olmOCR-Bench (a benchmark for evaluating document text recognition), Sarvam Vision performed on par with or better than GPT-5.2, Gemini 3 Pro, and other large models. The model performed particularly well with mathematical texts, tables, and scans of old documents.

But the highlight is the Indic languages. Since no standard benchmarks existed for them, Sarvam created their own: the Sarvam Indic OCR Bench. It includes over 20,000 document samples in 22 languages, ranging from 19th-century texts to modern materials.

In this test, Sarvam Vision outperformed all other models, including Gemini 3 Pro and Claude Opus 4.5. For Hindi, recognition accuracy was nearly 96%; for Bengali, it was 93%; and for Tamil, it was also 93%. Even for less common languages like Odia or Dogri, the results were significantly better than those of its competitors.

Extracting Structured Data from Visual Elements

Knowledge is More Than Just Text

The developers emphasize an important point: the model's task is not just to extract text, but to extract knowledge. Documents contain not only words but also tables, charts, illustrations, and infographics. To fully understand a document, every pixel must be taken into account.

For example, Sarvam Vision can:

  • recognize handwritten text on a historical document;
  • extract data from a complex nested table;
  • describe the content of a chart in Hindi or Tamil;
  • convert visual information into a structured JSON format.
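
The last item above can be made concrete. Below is a hypothetical example of what "visual information as structured JSON" might look like for a table extracted from an image; the schema is an illustrative assumption, and the actual Sarvam Vision output format may differ.

```python
import json

# Hypothetical model response for the "table to structured JSON" use case.
# The schema is illustrative, not Sarvam Vision's documented output format.
model_response = """
{
  "type": "table",
  "caption": "Quarterly revenue (INR crore)",
  "columns": ["Quarter", "Revenue"],
  "rows": [["Q1", "120"], ["Q2", "135"]]
}
"""

table = json.loads(model_response)

# Turn rows into records keyed by column name for downstream use.
records = [dict(zip(table["columns"], row)) for row in table["rows"]]
print(records[0]["Revenue"])  # 120
```

Once a document's tables land in a shape like this, they can feed databases or analytics pipelines directly, which is the "knowledge, not just text" goal the developers describe.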

In the article, the authors provided several examples. The model correctly recognized a handwritten letter in English, scanned Tamil text from a 19th-century book, a complex table with nested rows, and a data chart in Hindi.

Computer Vision and Real World Scene Recognition

The Model Does More Than Just Document Work

Although the primary focus is on document processing, Sarvam Vision also handles general computer vision tasks. The model can describe scenes, recognize text "in the wild" – for instance, on shop signs or road signs – and extract structured information from photographs.

The developers demonstrated the model describing a street with a bike lane in English and Kannada, recognizing a notice in Gujarati, extracting flight schedules from an airport display in Kannada, and reading a handwritten school text about Kalam.

Limitations and Performance Challenges

Where the Model Fails

Sarvam honestly admits that the model is not perfect. The article provides two examples of failures.

The first is an incorrect translation of a shop name from Bengali. The model recognized the sign but translated it incorrectly.

The second involves challenges with low-resource languages. When the model was asked to describe a street scene in Santali, it ignored the instruction and replied in English. For rare languages, the quality of instruction-following remains inconsistent.

Future Development and API Availability

What's Next

Sarvam Vision is available via API. Throughout February 2026, the company is providing free access to the document recognition API with no volume limits. This is a great opportunity to test the model in real-world conditions.
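
For readers who want to try the API, here is a sketch of what a document-recognition request might look like over plain HTTP. The endpoint URL, header names, and payload fields below are all assumptions for illustration; consult Sarvam's API documentation for the real interface.

```python
# Hypothetical sketch of calling a document-recognition HTTP API.
# URL, headers, and payload fields are placeholders, not Sarvam's real API.
import base64
import json
import urllib.request

def build_ocr_request(image_bytes: bytes, api_key: str) -> urllib.request.Request:
    payload = json.dumps({
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "output": "markdown",  # illustrative output option
    }).encode("utf-8")
    return urllib.request.Request(
        url="https://api.example.com/v1/ocr",  # placeholder endpoint
        data=payload,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )

req = build_ocr_request(b"\x89PNG...", "YOUR_API_KEY")
print(req.get_method())  # POST
# urllib.request.urlopen(req) would send it; omitted here.
```

During the free-access period, a loop over a folder of scans against such an endpoint would be a quick way to gauge accuracy on your own documents.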

The developers plan to evolve the model in the fields of education, healthcare, and video analytics. Additionally, the Sarvam team invites developers to their Discord to discuss updates and share feedback.

It is noteworthy that a compact, specialized model can compete with much larger, general-purpose solutions – especially in areas where accuracy with non-English languages is critical.

Original Title: Sarvam Vision
Publication Date: Feb 8, 2026
Sarvam (www.sarvam.ai): an Indian AI company developing language models and speech technologies for local languages and services.

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Claude Sonnet 4.5 (Anthropic) – Analyzing the Original Publication and Writing the Text: the neural network studies the original material and generates a coherent text.

2. Gemini 3 Pro (Google DeepMind) – Translation into English.

3. Gemini 3 Flash Preview (Google DeepMind) – Text Review and Editing: correction of errors, inaccuracies, and ambiguous phrasing.

4. DeepSeek-V3.2 (DeepSeek) – Preparing the Illustration Description: generating a textual prompt for the visual model.

5. FLUX.2 Pro (Black Forest Labs) – Creating the Illustration: generating an image based on the prepared prompt.
