Multimodal AI Capabilities for Document Processing
When Knowledge is Locked in Documents
On February 5, Indian company Sarvam AI introduced Sarvam Vision – a multimodal model capable of working with both text and images. While Sarvam had previously released voice and text solutions, the developers have now expanded into the visual realm.
The model is built on state-space architecture and contains three billion parameters. While it is not the largest model on the market – for comparison, GPT-4 is estimated to contain hundreds of billions of parameters – its compactness is an advantage: the model runs faster and requires fewer resources.
The primary mission of Sarvam Vision is document processing. The model can describe images, recognize text in photos, interpret charts, and parse complex tables. Crucially, it is specialized for Indic languages.
Challenges of Recognizing Regional Indic Languages
The Problem Global Models Don't Solve
In India, vast amounts of information are still stored on paper: scanned archives, historical documents, and government bulletins. For this data to be accessible for research or business use, it must be digitized.
The issue is that existing document recognition solutions perform well in English but stall on India's regional languages. Global models treat them as secondary, leading to a drop in recognition accuracy.
Sarvam took a different approach: rather than adapting a Western model, they trained their own from scratch, paying special attention to India's 22 official languages – from Hindi and Bengali to less common ones like Santali or Maithili.
Dataset Preparation and Training Process
How the Model Was Trained
The developers assembled an extensive dataset: synthetic and real "image–text" pairs for all languages. The dataset included scientific papers, financial reports, government documents, historical manuscripts, textbooks, and newspapers.
Data was prepared separately for each document type. For example, for charts, tasks were created for structure extraction, description, and analysis. For tables, the focus was on understanding cell relationships.
Training took place in several stages: first, pre-training the base model; then fine-tuning for specific tasks; and finally reinforcement learning with verifiable rewards, meaning the model received feedback according to how accurately it completed each task.
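For OCR, a "verifiable reward" can be computed mechanically by comparing the model's transcription against the ground-truth text. The sketch below uses normalized edit distance as the reward signal; that particular metric is an illustrative assumption, not a confirmed detail of Sarvam's training setup.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[len(b)]

def ocr_reward(prediction: str, reference: str) -> float:
    """Reward in [0, 1]: 1.0 for a perfect transcription, lower as errors grow."""
    dist = levenshtein(prediction, reference)
    return max(0.0, 1.0 - dist / max(len(reference), 1))
```

Because the reward is computed directly from data rather than from human judgment, it scales to millions of training examples, which is what makes this style of reinforcement learning practical for document recognition.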
Benchmark Results
Sarvam compared its model against competitors on several popular tests. On the English-language portion of olmOCR-Bench (a benchmark for evaluating document text recognition), Sarvam Vision performed on par with or better than GPT-5.2, Gemini 3 Pro, and other large models. The model performed particularly well with mathematical texts, tables, and scans of old documents.
But the highlight is the Indic languages. Since no standard benchmarks existed for them, Sarvam created their own: the Sarvam Indic OCR Bench. It includes over 20,000 document samples in 22 languages, ranging from nineteenth-century texts to modern materials.
In this test, Sarvam Vision outperformed all other models, including Gemini 3 Pro and Claude Opus 4.5. For Hindi, recognition accuracy was nearly 96%; for Bengali, it was 93%; and for Tamil, it was also 93%. Even for less common languages like Odia or Dogri, the results were significantly better than those of its competitors.
Extracting Structured Data from Visual Elements
Knowledge is More Than Just Text
The developers emphasize an important point: the model's task is not just to extract text, but to extract knowledge. Documents contain not only words but also tables, charts, illustrations, and infographics. To fully understand a document, every pixel must be taken into account.
For example, Sarvam Vision can:
- recognize handwritten text on a historical document;
- extract data from a complex nested table;
- describe the content of a chart in Hindi or Tamil;
- convert visual information into a structured JSON format.
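To make the last capability concrete, here is a minimal sketch of the final step: once a vision model has recognized a table's header and data rows, turning them into structured JSON is a straightforward mapping. The function and field names below are illustrative, not Sarvam's actual output schema.

```python
import json

def table_to_json(header, rows):
    """Map each recognized data row onto the header to get a list of records."""
    return [dict(zip(header, row)) for row in rows]

# Example: a table a vision model might have extracted from a scanned page.
header = ["State", "Language", "Speakers (millions)"]
rows = [
    ["West Bengal", "Bengali", 97],
    ["Tamil Nadu", "Tamil", 69],
]

records = table_to_json(header, rows)
print(json.dumps(records, ensure_ascii=False, indent=2))
```

The hard part, of course, is the recognition itself (especially for nested tables, where cell boundaries and spans must be resolved first); the JSON conversion shown here is only the last mile.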
In the article, the authors provided several examples. The model correctly recognized a handwritten letter in English, scanned Tamil text from a book from the 1800s, a complex table with nested rows, and a data chart in Hindi.
Computer Vision and Real World Scene Recognition
The Model Does More Than Just Document Work
Although the primary focus is on document processing, Sarvam Vision also handles general computer vision tasks. The model can describe scenes, recognize text "in the wild" (for instance, on shop signs or road signs), and extract structured information from photographs.
The developers demonstrated the model describing a street with a bike lane in English and Kannada, recognizing a notice in Gujarati, extracting flight schedules from an airport display in Kannada, and reading a handwritten school text about Kalam.
Limitations and Performance Challenges
Where the Model Fails
Sarvam honestly admits that the model is not perfect. The article provides two examples of failures.
The first is a shop sign in Bengali: the model recognized the text on the sign correctly but mistranslated the shop's name.
The second involves challenges with low-resource languages. When the model was asked to describe a street scene in Santali, it ignored the instruction and replied in English. For rare languages, the quality of instruction-following remains inconsistent.
Future Development and API Availability
What's Next
Sarvam Vision is available via API. Throughout February 2026, the company is providing free access to the document recognition API with no volume limits. This is a great opportunity to test the model in real-world conditions.
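Since access is via API, a minimal client sketch may help. Everything below (the endpoint URL, header names, and request fields) is an assumption for illustration only; Sarvam's official API documentation defines the real interface.

```python
import base64
import json
import urllib.request

# Placeholder endpoint -- NOT a real Sarvam URL; substitute the address
# from the official API documentation.
API_URL = "https://example.invalid/v1/ocr"

def build_payload(image_bytes: bytes, language: str = "hi") -> dict:
    """Wrap a scanned page in a JSON request body (field names hypothetical)."""
    return {
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "language": language,
    }

def recognize(image_bytes: bytes, api_key: str, language: str = "hi") -> dict:
    """Send one page for recognition and return the parsed JSON response."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(image_bytes, language)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # auth scheme assumed
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

During the free-access period, a sketch like this, pointed at the documented endpoint, would be enough to batch-test the model on a folder of scanned pages.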
The developers plan to evolve the model in the fields of education, healthcare, and video analytics. Additionally, the Sarvam team invites developers to their Discord to discuss updates and share feedback.
It is noteworthy that a compact, specialized model can compete with much larger, general-purpose solutions – especially in areas where accuracy with non-English languages is critical.