If you ask where the most valuable information in corporate documents is hidden, the answer is almost always the same: in tables. Financial reports, technical specifications, medical data – all of these are typically organized as tables. And it's precisely tables that most AI tools have historically struggled with.
Tables Are More Than Just Text
Simply put, recognizing a table is harder than it looks. It's not just a collection of words; it's a structure where the row and column of each element are crucial. Merged cells, nested headers, and complex layouts – all of this turns the task into a puzzle, even for powerful models. This is why many companies still process documents manually or pay for specialized services.
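To make the point about structure concrete, here is a minimal Python sketch of how a table with merged cells and a nested header might be represented. The `Cell` class, its field names, and the `to_grid` helper are illustrative assumptions for this article; they are not LightOnOCR-2's output format or any library's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One table cell; the class and its field names are illustrative."""
    text: str
    row: int          # zero-based row of the cell's top-left corner
    col: int          # zero-based column of the cell's top-left corner
    rowspan: int = 1  # rows covered (greater than 1 for a vertically merged cell)
    colspan: int = 1  # columns covered (greater than 1 for a horizontally merged cell)


def to_grid(cells: list[Cell]) -> list[list[str]]:
    """Expand cells into a dense grid, repeating merged text in every covered slot."""
    n_rows = max(c.row + c.rowspan for c in cells)
    n_cols = max(c.col + c.colspan for c in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        for r in range(c.row, c.row + c.rowspan):
            for k in range(c.col, c.col + c.colspan):
                grid[r][k] = c.text
    return grid


# A nested header: "Revenue" spans the two year columns below it,
# and "Region" is merged vertically across both header rows.
cells = [
    Cell("Region", 0, 0, rowspan=2),
    Cell("Revenue", 0, 1, colspan=2),
    Cell("2023", 1, 1),
    Cell("2024", 1, 2),
    Cell("EMEA", 2, 0),
    Cell("120", 2, 1),
    Cell("135", 2, 2),
]

for row in to_grid(cells):
    print(row)
```

The same seven strings read as a flat word sequence would lose the fact that "120" belongs to EMEA, Revenue, and 2023 at once. That position information is what a table extractor has to recover.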
Against this backdrop, LightOn has introduced the second version of its model: LightOnOCR-2. This is an open-source model specializing in optical character recognition (OCR), that is, recognizing characters and structures in scanned or photographed documents. But the key achievement here isn't just recognizing characters; it's the ability to accurately extract tables with all their rows, columns, and interrelationships.
How LightOnOCR-2 Outperformed Commercial Giants
In comparative testing, LightOnOCR-2 surpassed a whole range of well-known solutions – Claude, GPT-5, Qwen3, Mistral, and Mathpix – specifically in the task of table extraction. This is noteworthy for several reasons.
First, most of the listed models are commercial, backed by large companies with vast resources. LightOnOCR-2 is open-source, meaning its code and weights are available to everyone. Second, large general-purpose models like GPT-5 or Claude can do many things, but they often fall short against more specialized solutions where precision in a specific task is required.
It's like having a multi-tool that's good for most jobs, but when you need to do something precisely, you reach for a specialized instrument. LightOnOCR-2 is that kind of instrument: the model is fine-tuned for working with documents, and it's in this niche that it delivers better results than the larger "jacks-of-all-trades."
Why This Matters for Document Processing
The task of table extraction isn't just an abstract benchmark. Behind it lies a very real need: every day, companies work with vast numbers of documents in which the data is packed into tables. Banks process financial reports, hospitals handle medical records, and logistics companies deal with invoices. An error in a single cell can distort the entire picture.
Until now, automating this process was either expensive (subscription-based commercial solutions) or unreliable (general-purpose models that only "understand" tables approximately). LightOnOCR-2 offers a third option: a high-precision, open-source solution that organizations can deploy on their own infrastructure.
This is especially relevant for organizations that need to avoid sending documents to external cloud services – whether for confidentiality reasons or due to regulatory requirements. Deploying an open-source model locally solves this problem.
Open Source as a Competitive Advantage
LightOnOCR-2 arrives at a time when open-source models are increasingly challenging commercial ones in specialized tasks. Recently, Google released the Gemma 4 family – also open-source models under the Apache 2.0 license – which compete with much larger solutions in certain scenarios. The trend is clear: open-source projects are no longer in the "second league" and are starting to set the standard in specific niches.
In the case of LightOnOCR-2, that niche is working with documents and tables. And based on the test results, the open-source model doesn't just match its commercial counterparts – it surpasses them.
What Remains an Open Question
Benchmark results are always a snapshot taken under specific conditions. How the model performs on real-world documents with non-standard layouts, in languages with different typography, or on tables with partially damaged or unreadable data – these are separate questions that can only be answered in practice, not in lab tests.
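One way a team could start answering these questions on its own documents is to compare extracted tables against a small hand-checked reference set. The Python sketch below uses a simple cell-level exact match. The `cell_accuracy` function and the sample data are hypothetical, and this is not the metric used in the benchmark described above.

```python
def cell_accuracy(predicted: list[list[str]], reference: list[list[str]]) -> float:
    """Fraction of reference cells whose text was extracted at the same row and column."""
    total = sum(len(row) for row in reference)
    if total == 0:
        return 0.0
    matches = 0
    for r, ref_row in enumerate(reference):
        for c, ref_text in enumerate(ref_row):
            # A cell missing from the prediction counts as a miss.
            in_bounds = r < len(predicted) and c < len(predicted[r])
            if in_bounds and predicted[r][c].strip() == ref_text.strip():
                matches += 1
    return matches / total


# Hand-checked reference table versus a hypothetical extraction result:
# one misread digit ("0.75" instead of "0.15") and one dropped cell.
reference = [["Item", "Qty", "Price"], ["Bolt", "40", "0.15"], ["Nut", "40", "0.05"]]
predicted = [["Item", "Qty", "Price"], ["Bolt", "40", "0.75"], ["Nut", "40"]]

score = cell_accuracy(predicted, reference)
print(f"{score:.2f}")
```

Exact match is deliberately strict: a single wrong digit counts as a full miss, which fits the earlier point that one bad cell can distort the whole picture. It does not give partial credit for cells shifted by a row or column, so a real evaluation would likely need a more forgiving structural metric as well.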
Nevertheless, the emergence of a strong open-source alternative in a niche long dominated by commercial solutions is a significant event. This is especially true for teams that are looking for a reliable document processing tool, do not want to depend on external APIs, and want to understand what's happening "under the hood."