Multimodal AI Capabilities for Document Processing
When Knowledge is Locked in Documents
On February 5, Indian company Sarvam AI introduced Sarvam Vision – a multimodal model capable of working with both text and images. While Sarvam had previously released voice and text solutions, the developers have now expanded into the visual realm.
The model is built on state-space architecture and contains three billion parameters. While it is not the largest model on the market – for comparison, GPT-4 is estimated to contain hundreds of billions of parameters – its compactness is an advantage: the model runs faster and requires fewer resources.
The primary mission of Sarvam Vision is document processing. The model can describe images, recognize text in photos, interpret charts, and parse complex tables. Crucially, it is specialized for Indic languages.
Challenges of Recognizing Regional Indic Languages
The Problem Global Models Don't Solve
In India, vast amounts of information are still stored on paper: scanned archives, historical documents, and government bulletins. For this data to be accessible for research or business use, it must be digitized.
The issue is that existing document recognition solutions perform well in English but stall on India's regional languages. Global models treat them as secondary, leading to a drop in recognition accuracy.
Sarvam took a different approach: rather than adapting a Western model, they trained their own from scratch, paying special attention to India's 22 official languages – from Hindi and Bengali to less common ones like Santali or Maithili.
Dataset Preparation and Training Process
How the Model Was Trained
The developers assembled an extensive dataset: synthetic and real "image–text" pairs for all languages. The dataset included scientific papers, financial reports, government documents, historical manuscripts, textbooks, and newspapers.
Data was prepared separately for each document type. For example, for charts, tasks were created for structure extraction, description, and analysis. For tables, the focus was on understanding cell relationships.
Training took place in several stages: first, pre-training the base model; then fine-tuning for specific tasks; and finally reinforcement learning with verifiable rewards, meaning the model received feedback according to how accurately it completed each task.
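For OCR, a "verifiable reward" can be computed mechanically by comparing the model's transcription against the ground-truth text. The sketch below uses normalized edit distance as the reward signal; that particular metric is an illustrative assumption, not a confirmed detail of Sarvam's training setup.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[len(b)]

def ocr_reward(prediction: str, reference: str) -> float:
    """Reward in [0, 1]: 1.0 for a perfect transcription, lower as errors grow."""
    dist = levenshtein(prediction, reference)
    return max(0.0, 1.0 - dist / max(len(reference), 1))
```

Because the reward is computed directly from data rather than from human judgment, it scales to millions of training examples, which is what makes this style of reinforcement learning practical for document recognition.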
Benchmark Results
Sarvam compared its model against competitors on several popular tests. On the English-language portion of olmOCR-Bench (a benchmark for evaluating document text recognition), Sarvam Vision performed on par with or better than GPT-5.2, Gemini 3 Pro, and other large models. The model performed particularly well with mathematical texts, tables, and scans of old documents.
But the highlight is the Indic languages. Since no standard benchmarks existed for them, Sarvam created their own: the Sarvam Indic OCR Bench. It includes over 20,000 document samples in 22 languages, ranging from nineteenth-century texts to modern materials.
In this test, Sarvam Vision outperformed all other models, including Gemini 3 Pro and Claude Opus 4.5. For Hindi, recognition accuracy was nearly 96%; for Bengali, it was 93%; and for Tamil, it was also 93%. Even for less common languages like Odia or Dogri, the results were significantly better than those of its competitors.
Extracting Structured Data from Visual Elements
Knowledge is More Than Just Text
The developers emphasize an important point: the model's task is not just to extract text, but to extract knowledge. Documents contain not only words but also tables, charts, illustrations, and infographics. To fully understand a document, every pixel must be taken into account.
For example, Sarvam Vision can:
- recognize handwritten text on a historical document;
- extract data from a complex nested table;
- describe the content of a chart in Hindi or Tamil;
- convert visual information into a structured JSON format.
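To make the last capability concrete, here is a minimal sketch of the final step: once a vision model has recognized a table's header and data rows, turning them into structured JSON is a straightforward mapping. The function and field names below are illustrative, not Sarvam's actual output schema.

```python
import json

def table_to_json(header, rows):
    """Map each recognized data row onto the header to get a list of records."""
    return [dict(zip(header, row)) for row in rows]

# Example: a table a vision model might have extracted from a scanned page.
header = ["State", "Language", "Speakers (millions)"]
rows = [
    ["West Bengal", "Bengali", 97],
    ["Tamil Nadu", "Tamil", 69],
]

records = table_to_json(header, rows)
print(json.dumps(records, ensure_ascii=False, indent=2))
```

The hard part, of course, is the recognition itself (especially for nested tables, where cell boundaries and spans must be resolved first); the JSON conversion shown here is only the last mile.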
In the article, the authors provided several examples. The model correctly recognized a handwritten letter in English, scanned Tamil text from a book from the 1800s, a complex table with nested rows, and a data chart in Hindi.
Computer Vision and Real World Scene Recognition
The Model Does More Than Just Document Work
Although the primary focus is on document processing, Sarvam Vision also handles general computer vision tasks. The model can describe scenes, recognize text "in the wild" (for instance, on shop signs or road signs), and extract structured information from photographs.
The developers demonstrated the model describing a street with a bike lane in English and Kannada, recognizing a notice in Gujarati, extracting flight schedules from an airport display in Kannada, and reading a handwritten school text about Kalam.
Limitations and Performance Challenges
Where the Model Fails
Sarvam honestly admits that the model is not perfect. The article provides two examples of failures.
The first is a shop sign in Bengali: the model recognized the text on the sign correctly but mistranslated the shop's name.
The second involves challenges with low-resource languages. When the model was asked to describe a street scene in Santali, it ignored the instruction and replied in English. For rare languages, the quality of instruction-following remains inconsistent.
Future Development and API Availability
What's Next
Sarvam Vision is available via API. Throughout February 2026, the company is providing free access to the document recognition API with no volume limits. This is a great opportunity to test the model in real-world conditions.
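Since access is via API, a minimal client sketch may help. Everything below (the endpoint URL, header names, and request fields) is an assumption for illustration only; Sarvam's official API documentation defines the real interface.

```python
import base64
import json
import urllib.request

# Placeholder endpoint -- NOT a real Sarvam URL; substitute the address
# from the official API documentation.
API_URL = "https://example.invalid/v1/ocr"

def build_payload(image_bytes: bytes, language: str = "hi") -> dict:
    """Wrap a scanned page in a JSON request body (field names hypothetical)."""
    return {
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "language": language,
    }

def recognize(image_bytes: bytes, api_key: str, language: str = "hi") -> dict:
    """Send one page for recognition and return the parsed JSON response."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(image_bytes, language)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # auth scheme assumed
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

During the free-access period, a sketch like this, pointed at the documented endpoint, would be enough to batch-test the model on a folder of scanned pages.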
The developers plan to evolve the model in the fields of education, healthcare, and video analytics. Additionally, the Sarvam team invites developers to their Discord to discuss updates and share feedback.
It is noteworthy that a compact, specialized model can compete with much larger, general-purpose solutions – especially in areas where accuracy with non-English languages is critical.