Published January 23, 2026

How to Teach AI to Properly Read Arabic and Hebrew PDF Files

AI21 Labs has developed a method enabling language models to accurately extract text from documents in right-to-left languages.

Event Source: AI21 Labs

When we talk about working with documents, we usually refer to Latin scripts – English, Spanish, French. But what about Arabic or Hebrew? In these languages, the text runs from right to left, and this creates unexpected problems for systems trying to extract information from PDF files.

What's the Problem with RTL Languages?

Right-to-left (RTL) languages require a special approach. When you open a PDF in Arabic or Hebrew, the document structure can be complex: tables, columns, sidebars – all of which need to be read in the correct order. If the system doesn't understand the text direction, it might jumble the lines, mix up columns, or produce complete gibberish.
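The failure mode is easy to demonstrate. A minimal Python sketch (illustrative only, not AI21's code), assuming a PDF that stores glyphs in visual left-to-right drawing order, as some RTL PDFs do: a naive extractor that concatenates glyphs in storage order returns the text backwards, and restoring logical order for a purely RTL line means reversing the visual sequence.

```python
import unicodedata

def is_strong_rtl(ch: str) -> bool:
    # Strong right-to-left characters per the Unicode bidi property:
    # "R" (Hebrew and similar scripts) and "AL" (Arabic letters).
    return unicodedata.bidirectional(ch) in ("R", "AL")

# "Shalom" as a naive extractor might return it when the PDF stores
# glyphs in visual order: the letters come out backwards.
extracted = "םולש"

# For a line consisting only of RTL characters, logical reading order
# is the reverse of the visual sequence.
logical = extracted[::-1]
print(logical)  # שלום
```

Real pages mix directions (numbers, Latin acronyms inside RTL text), which is why production systems implement the full Unicode bidirectional algorithm rather than a plain reversal.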

Until now, most PDF parsing tools were created with Latin scripts in mind. RTL languages remained on the periphery, and the processing quality for such documents was noticeably worse.

How AI21 Labs Approached the Solution

The AI21 Labs team decided not to reinvent the wheel but to utilize what already works well – models designed for left-to-right (LTR) languages. The idea is to leverage the strengths of existing systems and adapt them for RTL.

Simply put, they taught the model to "see" the RTL document as if it were LTR, while still accounting for all the specific characteristics of the text direction. This made it possible to achieve results comparable to the best systems for English.

What Exactly They Did

The main approach is based on several steps:

  • Direction-aware preprocessing. The document is analyzed with the understanding that the text flows from right to left. The system determines the reading order of elements on the page.
  • Using LTR models. Instead of training a new model from scratch, existing systems trained for Latin scripts are used. They are applied to the RTL text after special preparation.
  • Testing on real documents. The model was tested on various document types – from simple texts to complex tables and multi-column layouts.
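The publication does not include AI21's implementation, but the starting point of direction-aware preprocessing, deciding the base direction of a text run, can be sketched with the Unicode first-strong heuristic (UAX #9, rules P2/P3):

```python
import unicodedata

def base_direction(text: str) -> str:
    """First-strong heuristic (UAX #9, rules P2/P3): the first character
    with a strong bidi class determines the paragraph's base direction."""
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls in ("R", "AL"):   # Hebrew, Arabic letters
            return "rtl"
        if cls == "L":           # Latin and other LTR scripts
            return "ltr"
    return "ltr"  # no strong character found; default to LTR

print(base_direction("שלום עולם"))     # rtl
print(base_direction("Hello, world"))  # ltr
print(base_direction("123 مرحبا"))     # rtl (digits are not strong)
```

Note that digits and punctuation have weak or neutral bidi classes, so a line beginning with a number still takes its direction from the first letter that follows.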

The result: the parsing quality of RTL documents has reached a level previously only available for English and other LTR languages.
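One concrete piece of the reading-order work can be illustrated with a toy layout model (entirely hypothetical, not AI21's method): on a multi-column RTL page, the rightmost column is read first, top to bottom within each column.

```python
def rtl_reading_order(blocks):
    """Order text blocks of a multi-column RTL page: rightmost column
    first, then top to bottom within each column.

    Each block is (x_left, y_top, text). A simplified model that
    assumes blocks in the same column share the same x coordinate."""
    return sorted(blocks, key=lambda b: (-b[0], b[1]))

page = [
    (0,   0,  "left column, line 1"),
    (300, 0,  "right column, line 1"),
    (300, 50, "right column, line 2"),
    (0,   50, "left column, line 2"),
]
for _, _, text in rtl_reading_order(page):
    print(text)
# right column, line 1
# right column, line 2
# left column, line 1
# left column, line 2
```

Real documents need column clustering rather than exact x matches, but the inversion relative to LTR layouts (sort right-to-left instead of left-to-right) is the essential difference.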

Why This Matters

Hundreds of millions of people use Arabic, Hebrew, and other RTL languages. For them, working with documents is just as much a daily task as it is for everyone else. Yet, automation tools often let them down.

Imagine a bank that wants to automatically process applications in Arabic. Or a government organization working with documents in Hebrew. If parsing works poorly, one has to do everything manually or put up with errors.

Now that the quality of RTL document processing has caught up with that of Latin scripts, new opportunities for automation are opening up in regions where it was previously difficult to implement.

What's Next

This approach demonstrates that it is not necessary to create separate systems for each language from scratch. You can use existing developments and adapt them to new tasks. This saves time and resources.

Of course, nuances remain. RTL languages vary: Arabic, with its connected letters, differs from Hebrew, whose letters are written as separate block shapes. There are also Persian, Urdu, and others, and each of them may require its own specific adjustments.

But the main point is that it has been shown that the quality gap can be bridged. And this is good news for everyone who works with documents in languages that have long remained on the sidelines in the world of AI tools.

#applied analysis #technical context #neural networks #ai linguistics #engineering #data #rtl language models
Original Title: Closing the parsing gap: reaching SOTA RTL parsing by leveraging LTR capabilities
Publication Date: Jan 22, 2026
AI21 Labs www.ai21.com An Israeli company building large language models and AI tools for working with text.

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Analyzing the Original Publication and Writing the Text
Claude Sonnet 4.5 (Anthropic). The neural network studies the original material and generates a coherent text.

2. Translation into English
Gemini 3 Pro Preview (Google DeepMind).

3. Text Review and Editing
Gemini 2.5 Flash (Google DeepMind). Correction of errors, inaccuracies, and ambiguous phrasing.

4. Preparing the Illustration Description
DeepSeek-V3.2 (DeepSeek). Generating a textual prompt for the visual model.

5. Creating the Illustration
FLUX.2 Pro (Black Forest Labs). Generating an image based on the prepared prompt.
