Published on March 20, 2026

16 AI Models, 9,000+ Documents: Who Came Out on Top?

A large-scale test of 16 AI models on real-world documents revealed surprising results: expensive solutions don't always outperform their more affordable counterparts.

Products
Event Source: Nanonets, 5 – 8 min read

When companies choose an AI for document processing, they usually rely on marketing promises, others' experiences, or intuition. Conducting their own comparison is expensive and time-consuming. The Nanonets team decided to fill this gap and did what many had been putting off: they took 16 popular models and ran them through over 9,000 real-world documents. The results were interesting enough to share in more detail.

What Exactly Was Tested

The testing was based on three open benchmarks that cover different aspects of document processing.

The first is DocVQA. Here, the model receives an image of a document and must answer a specific question about its content. Simply put, it's a test of how well the model can 'read' a document and find the necessary information within it.

The second is InfographicVQA. It's a similar task, but more complex: the documents contain infographics, charts, and tables. The model needs to understand not just the text, but the visual structure as well.

The third is ArxivQA, which is built on scientific articles with formulas, diagrams, and specific formatting. This is arguably the most demanding of the three formats.

In all cases, the evaluation measured not just whether the model produced an answer, but how accurate that answer was, using the standard ANLS metric (Average Normalized Levenshtein Similarity). This metric scores how close the answer is to the correct one at the character level, so a response with slightly different wording still earns partial credit.
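To make the metric concrete, here is a minimal sketch of how ANLS is commonly computed. The 0.5 cutoff is the usual default for this metric; the article does not state which settings the study used, so treat the threshold, the lowercasing step, and the function names as assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def answer_score(prediction: str, references: list[str],
                 threshold: float = 0.5) -> float:
    """Best similarity against any accepted answer; 0 if too far off.

    The threshold of 0.5 is an assumed default, not a value from the study.
    """
    pred = prediction.strip().lower()
    best = 0.0
    for ref in references:
        gold = ref.strip().lower()
        longest = max(len(pred), len(gold))
        distance = levenshtein(pred, gold) / longest if longest else 0.0
        score = 1.0 - distance if distance < threshold else 0.0
        best = max(best, score)
    return best


def anls(predictions: list[str], references: list[list[str]]) -> float:
    """Average of per-question scores across the whole test set."""
    if not predictions:
        return 0.0
    scores = [answer_score(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)
```

With this definition, an exact match (ignoring case) scores 1.0, a one-character slip in a twelve-character answer scores 11/12, and an answer that differs in at least half of its characters scores 0.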

The Contenders

Models from the major market players participated in the test: GPT-4o and GPT-4 Turbo from OpenAI, Gemini 1.5 Flash and Gemini 1.5 Pro from Google, Claude Opus and Claude Sonnet from Anthropic, as well as a number of other solutions – including models from Mistral, open-source options, and specialized document processing systems.

That makes 16 models in total, ranging from flagship commercial products to more affordable and compact alternatives.

Unexpected Findings

The first thing that stands out is that the most expensive models aren't always the best for document processing specifically. This might seem like an obvious point, but in practice, it's often overlooked when choosing a tool, as people focus on a model's overall rating rather than its performance on a specific task.

Gemini 1.5 Flash proved to be one of the most well-balanced options. Despite its relatively low cost, it demonstrated high accuracy across most document types. This is a rare combination – you usually have to choose between speed, price, and quality.

GPT-4o remained stable and reliable, especially on structured documents. It didn't always lead the pack, but it rarely fell short – what you might call a 'safe bet'.

Claude Sonnet handled long and complex documents well, where maintaining context is crucial. Its advantage was less noticeable on short and simple forms.

As for the scientific articles from ArxivQA, this is where the gap between the models was widest. Formulas, non-standard formatting, and dense technical text proved to be significantly more challenging for most models than standard business documents.

The Cost Factor

The study also analyzed the cost of processing. And this is where the picture becomes particularly relevant for anyone considering a real-world implementation.

The price difference between the models can be several-fold. Meanwhile, the difference in quality for typical tasks is much smaller. In short: it's likely not worth overpaying for a top-tier model to process standard invoices or forms. But if the documents are complex, non-standard, or require a deep understanding of context, a more powerful (and more expensive) model can be worth the investment.

This isn't a universal truth, but rather a guideline: the document type should influence the choice of model just as much as its overall rating or brand recognition.
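One way to turn this guideline into a selection rule is to pick the cheapest model whose accuracy is close enough to the best one. The sketch below is an illustration, not the study's method: the `Candidate` fields, the function name, and the default tolerance are all assumptions, and the numbers in the example are placeholders rather than results from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float          # e.g. mean score on your own documents, 0..1
    cost_per_1k_docs: float  # in whatever currency you budget in


def cheapest_within_tolerance(candidates: list[Candidate],
                              tolerance: float = 0.02) -> Candidate:
    """Return the cheapest candidate within `tolerance` of the best accuracy."""
    if not candidates:
        raise ValueError("no candidates to choose from")
    best_accuracy = max(c.accuracy for c in candidates)
    good_enough = [c for c in candidates
                   if c.accuracy >= best_accuracy - tolerance]
    # Cheapest first; on equal cost prefer the more accurate model.
    return min(good_enough, key=lambda c: (c.cost_per_1k_docs, -c.accuracy))


if __name__ == "__main__":
    # Placeholder values for illustration only; NOT results from the study.
    shortlist = [
        Candidate("model_a", accuracy=0.91, cost_per_1k_docs=40.0),
        Candidate("model_b", accuracy=0.90, cost_per_1k_docs=8.0),
        Candidate("model_c", accuracy=0.80, cost_per_1k_docs=2.0),
    ]
    print(cheapest_within_tolerance(shortlist).name)
```

For standard invoices or forms a loose tolerance may be acceptable. For complex or non-standard documents, setting the tolerance to zero makes the rule fall back to the most accurate model regardless of price.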

One Benchmark Is Not a Verdict

An important caveat the study's authors themselves emphasize is that benchmarks are a snapshot, not the full picture. Real-world document processing involves many factors that are difficult to replicate in a test environment: scan quality, non-standard fonts, mixed languages, and documents with damage or illegible sections.

Furthermore, models are constantly being updated. Results that are relevant today might look different in a few months, especially given the pace at which the major players are evolving.

So, the right conclusion to draw from this study is not 'use model X', but rather, 'Before you choose, test it on your own data.' A benchmark provides a starting point, but it's no substitute for testing on a real-world task.

Why This Matters

Document processing is one of the most common tasks companies face when implementing AI. Invoices, contracts, application forms, medical records, tax forms – all of these require accurate information extraction, and the cost of an error here is very real.

Meanwhile, the market for tools for this task is overheated: every company claims to have the best accuracy and speed. Independent comparisons on real-world volumes are rare. That's why studies like this are valuable: they provide at least some neutral frame of reference in a space where marketing often drowns out the technical details.

Of course, Nanonets is not an independent research institute; the company has its own product in the IDP (Intelligent Document Processing) market. This is worth keeping in mind when interpreting the results. But the methodology is open, the benchmarks are public – anyone who wants to can reproduce the test and verify the conclusions.

What This Means for You

If you are choosing or evaluating AI tools for document processing, here are a few practical considerations that stem from this research:

  • Don't rely solely on general model rankings – the task of document processing is specific, and leaders in chatbots are not necessarily leaders here.
  • The type of document matters. A model that excels at processing invoices might struggle with scientific papers or infographics.
  • Cost and quality don't always correlate as you might expect. There are models with a good price-to-accuracy ratio – these should be considered first for typical tasks.
  • Any benchmark is a guide, not the final answer. Testing on your own documents remains a mandatory step before implementation.
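As a rough sketch of what testing on your own documents can look like in practice, the snippet below breaks scores down by document type instead of reporting one overall number. The record layout (model, document type, score) and the function names are assumptions made for illustration; the scores would come from whatever accuracy metric you apply to your own labeled samples.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_type(results):
    """Group (model, doc_type, score) records into {doc_type: {model: mean}}."""
    buckets = defaultdict(lambda: defaultdict(list))
    for model, doc_type, score in results:
        buckets[doc_type][model].append(score)
    return {
        doc_type: {model: mean(scores) for model, scores in per_model.items()}
        for doc_type, per_model in buckets.items()
    }


def best_model_per_type(results):
    """For each document type, name the model with the highest mean score."""
    table = mean_score_by_type(results)
    return {
        doc_type: max(per_model, key=per_model.get)
        for doc_type, per_model in table.items()
    }
```

A breakdown like this makes it visible when one model leads on invoices while another leads on scientific papers, which a single averaged score would hide.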

The study doesn't name a universal winner, and that's the right outcome: for this task, a universal winner most likely doesn't exist.

Original Title: We ran 16 AI Models on 9,000+ Real Documents. Here's What We Found.
Publication Date: Mar 11, 2026
Nanonets (nanonets.com): A U.S.-based company using AI to automate document processing and visual data analysis.

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role: analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Analyzing the Original Publication and Writing the Text
   Claude Sonnet 4.6 (Anthropic)
   The neural network studies the original material and generates a coherent text.

2. Translating into English
   Gemini 2.5 Pro (Google DeepMind)

3. Text Review and Editing
   Gemini 2.5 Flash (Google DeepMind)
   Correction of errors, inaccuracies, and ambiguous phrasing.

4. Preparing the Illustration Description
   DeepSeek-V3.2 (DeepSeek)
   Generating a textual prompt for the visual model.

5. Creating the Illustration
   FLUX.2 Pro (Black Forest Labs)
   Generating an image based on the prepared prompt.
