When we talk about working with documents, we usually mean documents in Latin-script languages – English, Spanish, French. But what about Arabic or Hebrew? In these languages, the text runs from right to left, and this creates unexpected problems for systems trying to extract information from PDF files.
What's the Problem with RTL Languages?
Right-to-left (RTL) languages require a special approach. When you open a PDF in Arabic or Hebrew, the document structure can be complex: tables, columns, sidebars – all of which need to be read in the correct order. If the system doesn't understand the text direction, it might jumble the lines, mix up columns, or produce complete gibberish.
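Before anything else, a system has to know which direction a piece of text runs in. The snippet below is a minimal sketch of one way to guess that, using the Unicode bidirectional class of each character from Python's standard `unicodedata` module. The function name `dominant_direction` and the simple majority rule are assumptions made for illustration; they are not taken from AI21 Labs' work.

```python
import unicodedata

# Strong bidirectional classes defined by Unicode:
# "R" covers Hebrew-type letters, "AL" covers Arabic-type letters, "L" covers LTR letters.
RTL_CLASSES = {"R", "AL"}
LTR_CLASSES = {"L"}


def dominant_direction(text: str) -> str:
    """Return 'rtl', 'ltr', or 'neutral' by counting strongly directional characters.

    Hypothetical helper: a plain majority vote, with ties resolved as 'ltr'.
    """
    rtl = ltr = 0
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in RTL_CLASSES:
            rtl += 1
        elif bidi_class in LTR_CLASSES:
            ltr += 1
    if rtl == 0 and ltr == 0:
        # Only digits, spaces, or punctuation: no strong direction to report.
        return "neutral"
    return "rtl" if rtl > ltr else "ltr"
```

A majority vote like this is enough to route a line or a block to RTL-aware handling, but it does not replace the full Unicode Bidirectional Algorithm, which is what decides the order of characters inside mixed-direction text.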
Until now, most PDF parsing tools were created with Latin scripts in mind. RTL languages remained on the periphery, and the processing quality for such documents was noticeably worse.
How AI21 Labs Approached the Solution
The AI21 Labs team decided not to reinvent the wheel but to utilize what already works well – models designed for left-to-right (LTR) languages. The idea is to leverage the strengths of existing systems and adapt them for RTL.
Simply put, they taught the model to "see" the RTL document as if it were LTR, while still accounting for the specifics of the text direction. This made it possible to achieve results comparable to the best systems for the English language.
What Exactly They Did
The main approach is based on several steps:
- Direction-aware preprocessing. The document is analyzed with the understanding that the text flows from right to left. The system determines the reading order of elements on the page.
- Using LTR models. Instead of training a new model from scratch, existing systems trained for Latin scripts are used. They are applied to the RTL text after special preparation.
- Testing on real documents. The model was tested on various document types – from simple texts to complex tables and multi-column layouts.
The result: the parsing quality of RTL documents has reached a level previously only available for English and other LTR languages.
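To make the idea of reusing an LTR system concrete, here is a small sketch of one possible preparation step: mirror the page horizontally, run an LTR reading-order routine, and return the original blocks in that order. Everything here is an assumption for illustration. The source does not say that mirroring is what AI21 Labs does, and `Block`, `mirror`, `ltr_order`, and `rtl_order` are hypothetical names. The row-based `ltr_order` function is only a stand-in for a real LTR layout model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A text block with a bounding box; the origin is top-left and y grows downward."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def mirror(block: Block, page_width: float) -> Block:
    """Flip a block horizontally so the right edge of the page becomes the left edge."""
    return Block(block.text, page_width - block.x1, block.y0,
                 page_width - block.x0, block.y1)


def ltr_order(blocks: list[Block], row_tolerance: float = 5.0) -> list[int]:
    """Stand-in for an LTR model: rows top to bottom, left to right within a row."""
    by_top = sorted(range(len(blocks)), key=lambda i: blocks[i].y0)
    rows: list[list[int]] = []
    for i in by_top:
        # A block joins the current row if its top edge is close to the row's first block.
        if rows and abs(blocks[i].y0 - blocks[rows[-1][0]].y0) <= row_tolerance:
            rows[-1].append(i)
        else:
            rows.append([i])
    return [i for row in rows for i in sorted(row, key=lambda j: blocks[j].x0)]


def rtl_order(blocks: list[Block], page_width: float,
              row_tolerance: float = 5.0) -> list[Block]:
    """Order blocks for RTL reading by applying the LTR routine to a mirrored page."""
    mirrored = [mirror(b, page_width) for b in blocks]
    return [blocks[i] for i in ltr_order(mirrored, row_tolerance)]
```

For a two-column table on a 600-unit-wide page, `rtl_order` returns the right-hand cell of each row before the left-hand one, which is the order an Arabic or Hebrew reader would follow. The LTR routine itself is untouched; only its input is prepared differently.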
Why This Matters
Hundreds of millions of people use Arabic, Hebrew, and other RTL languages. For them, working with documents is just as much a daily task as it is for everyone else. Yet, automation tools often let them down.
Imagine a bank that wants to automatically process applications in Arabic. Or a government organization working with documents in Hebrew. If parsing works poorly, one has to do everything manually or put up with errors.
Now that the quality of RTL document processing has caught up with that of Latin scripts, new opportunities open up for automation in regions where it was previously difficult to implement.
What's Next
This approach demonstrates that it is not necessary to create separate systems for each language from scratch. You can use existing developments and adapt them to new tasks. This saves time and resources.
Of course, nuances remain. RTL languages vary: Arabic with its connected letters differs from Hebrew, where the letters are block-style. There are also Persian, Urdu, and others. Each of them may require its own specific adjustments.
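As one example of a script-specific adjustment, text extracted from Arabic PDFs sometimes arrives as positional "presentation form" code points (isolated, initial, medial, or final shapes) rather than the base letters. The sketch below folds them back with Unicode NFKC normalization from the standard library. It illustrates the kind of adjustment meant here and is not a step the source attributes to AI21 Labs; the function name `fold_presentation_forms` is hypothetical.

```python
import unicodedata


def fold_presentation_forms(text: str) -> str:
    """Map compatibility characters, including Arabic presentation forms, to base letters.

    Note: NFKC also rewrites other compatibility characters (for example
    full-width digits or Latin ligatures), so apply it only where that is acceptable.
    """
    return unicodedata.normalize("NFKC", text)
```

Hebrew base letters have no such positional variants, so the same call leaves ordinary Hebrew text unchanged.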
But the main point is that the quality gap has been shown to be bridgeable. And this is good news for everyone who works with documents in languages that have long remained on the sidelines in the world of AI tools.