When companies choose an AI for document processing, they usually rely on marketing promises, others' experiences, or intuition. Conducting their own comparison is expensive and time-consuming. The Nanonets team decided to fill this gap and did what many had been putting off: they took 16 popular models and ran them through over 9,000 real-world documents. The results were interesting enough to share in more detail.
What Exactly Was Tested
The testing was based on three open benchmarks that cover different aspects of document processing.
The first is DocVQA. Here, the model receives an image of a document and must answer a specific question about its content. Simply put, it's a test of how well the model can 'read' a document and find the necessary information within it.
The second is InfographicVQA. It's a similar task, but more complex: the documents contain infographics, charts, and tables. The model needs to understand not just the text, but the visual structure as well.
The third is ArxivQA, which is built on scientific articles with formulas, diagrams, and specialized formatting. This is arguably the most demanding of the three.
In all cases, answers were not simply marked right or wrong. They were scored with the standard ANLS (Average Normalized Levenshtein Similarity) metric, which gives partial credit to an answer that is close to the correct one even if the wording differs slightly.
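To make the metric concrete, here is a minimal Python sketch of ANLS as it is commonly defined for DocVQA-style benchmarks. It is not the code used in the study. The function names are illustrative, and the 0.5 threshold and the lowercase/whitespace normalization are the usual conventions, assumed here rather than taken from the study.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls_score(prediction: str, references: Sequence[str],
               threshold: float = 0.5) -> float:
    """Score one answer against all accepted reference answers (best match wins)."""
    pred = " ".join(prediction.lower().split())
    best = 0.0
    for reference in references:
        gold = " ".join(reference.lower().split())
        longest = max(len(pred), len(gold))
        if longest == 0:
            similarity = 1.0  # both strings are empty
        else:
            distance = levenshtein(pred, gold) / longest
            # Answers that are too far from the reference get no credit at all.
            similarity = 1.0 - distance if distance < threshold else 0.0
        best = max(best, similarity)
    return best


def anls(predictions: Sequence[str], references: Sequence[List[str]]) -> float:
    """Average the per-question scores over the whole dataset."""
    scores = [anls_score(p, refs) for p, refs in zip(predictions, references)]
    return sum(scores) / len(scores) if scores else 0.0
```

Under these assumptions, an answer of "invoice 1235" against the reference "invoice 1234" differs by one character out of twelve and scores 11/12, while an unrelated answer scores 0.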
The Contenders
Models from the major market players participated in the test: GPT-4o and GPT-4 Turbo from OpenAI, Gemini 1.5 Flash and Gemini 1.5 Pro from Google, Claude Opus and Claude Sonnet from Anthropic, as well as a number of other solutions – including models from Mistral, open-source options, and specialized document processing systems.
In total, 16 models were tested, ranging from flagship commercial products to more affordable and compact alternatives.
Unexpected Findings
The first thing that stands out is that the most expensive models aren't always the best for document processing specifically. This might seem like an obvious point, but in practice, it's often overlooked when choosing a tool, as people focus on a model's overall rating rather than its performance on a specific task.
Gemini 1.5 Flash proved to be one of the most well-balanced options. Despite its relatively low cost, it demonstrated high accuracy across most document types. This is a rare combination – you usually have to choose between speed, price, and quality.
GPT-4o remained stable and reliable, especially on structured documents. It didn't always lead the pack, but it rarely fell short – what you might call a 'safe bet'.
Claude Sonnet handled long and complex documents well, where maintaining context is crucial. Its advantage was less noticeable on short and simple forms.
As for the scientific articles from ArxivQA, this is where the gap between the models was widest. Formulas, non-standard formatting, and dense technical text proved to be significantly more challenging for most models than standard business documents.
The Cost Factor
The study also analyzed processing cost, and this is where the results become particularly relevant for anyone planning a real-world implementation.
The price difference between the models can be several-fold. Meanwhile, the difference in quality for typical tasks is much smaller. In short: it's likely not worth overpaying for a top-tier model to process standard invoices or forms. But if the documents are complex, non-standard, or require a deep understanding of context, a more powerful (and more expensive) model can be worth the investment.
This isn't a universal truth, but rather a guideline: the document type should influence the choice of model just as much as its overall rating or brand recognition.
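One way to turn this guideline into a repeatable decision is to pick the cheapest model whose accuracy is within a chosen tolerance of the best one. The sketch below illustrates that idea only. The model names, accuracy figures, and prices are invented placeholders, not results from the study, and `cheapest_within_tolerance` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float           # measured on your own documents, 0.0 to 1.0
    cost_per_1k_docs: float   # in whatever currency you budget in


def cheapest_within_tolerance(candidates: Sequence[Candidate],
                              tolerance: float = 0.02) -> Candidate:
    """Return the cheapest candidate whose accuracy is at most `tolerance`
    below the best accuracy in the list."""
    if not candidates:
        raise ValueError("no candidates to choose from")
    best_accuracy = max(c.accuracy for c in candidates)
    eligible = [c for c in candidates if best_accuracy - c.accuracy <= tolerance]
    return min(eligible, key=lambda c: c.cost_per_1k_docs)


# Placeholder numbers for illustration only; they are NOT from the study.
candidates = [
    Candidate("model_a", accuracy=0.93, cost_per_1k_docs=40.0),
    Candidate("model_b", accuracy=0.92, cost_per_1k_docs=8.0),
    Candidate("model_c", accuracy=0.85, cost_per_1k_docs=2.0),
]

choice = cheapest_within_tolerance(candidates, tolerance=0.02)
print(choice.name)
```

The tolerance is where the document type comes in: for standard invoices a few points of accuracy may be an acceptable trade, while for complex or high-stakes documents the tolerance would be set close to zero, which pushes the choice back toward the strongest model.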
One Benchmark Is Not a Verdict
An important caveat the study's authors themselves emphasize is that benchmarks are a snapshot, not the full picture. Real-world document processing involves many factors that are difficult to replicate in a test environment: scan quality, non-standard fonts, mixed languages, and documents with damage or illegible sections.
Furthermore, models are constantly being updated. Results that hold today might look different in a few months, especially given how quickly the major players release new versions.
So, the right conclusion to draw from this study is not 'use model X', but rather, 'Before you choose, test it on your own data.' A benchmark provides a starting point, but it's no substitute for testing on a real-world task.
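A test on your own data does not need to be elaborate. The sketch below shows one possible shape for it: a small labeled set of question/answer pairs, a model wrapped as a plain function, and accuracy reported per document type. Everything here is an assumption for illustration. `Sample`, `accuracy_by_doc_type`, and `stub_model` are hypothetical names, the stub stands in for a real API call, and normalized exact match is used as a stricter, simpler substitute for ANLS.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple


class Sample(NamedTuple):
    doc_type: str   # e.g. "invoice", "paper"
    document: str   # path or ID of the document image
    question: str
    expected: str   # the answer a human verified


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def accuracy_by_doc_type(model: Callable[[str, str], str],
                         samples: Iterable[Sample]) -> Dict[str, float]:
    """Run `model(document, question)` on every sample and report the share
    of exact (normalized) matches for each document type."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for sample in samples:
        total[sample.doc_type] += 1
        answer = model(sample.document, sample.question)
        if normalize(answer) == normalize(sample.expected):
            correct[sample.doc_type] += 1
    return {doc_type: correct[doc_type] / total[doc_type] for doc_type in total}


# Stand-in for a real model call; replace with your own client code.
def stub_model(document: str, question: str) -> str:
    canned = {"inv1.png": "120.00", "inv2.png": "80.00", "p1.png": " Figure 2 "}
    return canned[document]


samples = [
    Sample("invoice", "inv1.png", "What is the total?", "120.00"),
    Sample("invoice", "inv2.png", "What is the total?", "75.50"),
    Sample("paper", "p1.png", "Which figure shows the results?", "figure 2"),
]

report = accuracy_by_doc_type(stub_model, samples)
print(report)
```

Reporting by document type rather than as a single number keeps a model's weak spots visible instead of averaging them away.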
Why This Matters
Document processing is one of the most common tasks companies face when implementing AI. Invoices, contracts, application forms, medical records, tax forms – all of these require accurate information extraction, and the cost of an error here is very real.
Meanwhile, the market for document processing tools is overheated: every vendor claims to have the best accuracy and speed. Independent comparisons on real-world volumes are rare. That's why studies like this are valuable: they provide at least some neutral frame of reference in a space where marketing often drowns out the technical details.
Of course, Nanonets is not an independent research institute; the company has its own product in the IDP (Intelligent Document Processing) market. This is worth keeping in mind when interpreting the results. But the methodology is open, the benchmarks are public – anyone who wants to can reproduce the test and verify the conclusions.
What This Means for You
If you are choosing or evaluating AI tools for document processing, here are a few practical considerations that stem from this research:
- Don't rely solely on general model rankings – document processing is a specialized task, and the models that lead chatbot rankings do not necessarily lead here.
- The type of document matters. A model that excels at processing invoices might struggle with scientific papers or infographics.
- Cost and quality don't always correlate as you might expect. There are models with a good price-to-accuracy ratio – these should be considered first for typical tasks.
- Any benchmark is a guide, not the final answer. Testing on your own documents remains a mandatory step before implementation.
The study doesn't name a universal winner – and to be honest, that's the right outcome, because for this task a universal winner most likely doesn't exist.