If you've been following how companies are implementing AI in real-world operations, you've likely come across the acronym RAG (retrieval-augmented generation). Simply put, it's an approach where a language model doesn't just generate a response from what it 'remembers' from its training data, but first searches for relevant information in documents and only then provides an answer. It's like a smart search engine built into an AI assistant.
The problem is that, in practice, this approach hits an unexpected bottleneck – and not where you'd typically look for one.
Where the Pipeline Breaks Down
When discussing AI systems, the conversation often revolves around the quality of the model itself: how accurately it responds, how well it reasons, and whether it confuses facts. But in real-world corporate implementations, the bottleneck often turns out to be something more mundane: document preparation.
Imagine an organization has thousands of PDF files, Word documents, spreadsheets, and scanned pages. Before the model can 'read' these documents and answer questions about them, each file must be parsed, cleaned, broken down into meaningful chunks, and converted into numerical representations known as embeddings. Only then does this information enter a vector database, from which the model draws context for each query.
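To make the stage concrete, here is a minimal sketch of such an ingestion pipeline in Python. It assumes the sentence-transformers library for the embedding step; parse_file and the store object are hypothetical stand-ins for whatever parser and vector database an organization actually uses.

```python
# Minimal single-machine ingestion sketch: parse -> chunk -> embed -> store.
# Assumes sentence-transformers; parse_file() and the store object are
# hypothetical placeholders for a real parser and vector-DB client.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedder

def parse_file(path: str) -> str:
    """Hypothetical stand-in: extract plain text from a PDF/DOCX/scan."""
    raise NotImplementedError

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size chunking with a small overlap between neighbours."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def ingest(paths: list[str], store) -> None:
    for path in paths:                    # sequential: the bottleneck at scale
        chunks = chunk(parse_file(path))
        vectors = model.encode(chunks)    # one embedding vector per chunk
        store.add(chunks, vectors)        # hypothetical vector-DB insert
```

Run over hundreds of thousands of files, the sequential loop in ingest is exactly where hours turn into days.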
This might sound like a technical detail, but in practice, when processing hundreds of thousands of documents, this very stage can take hours or even days. And if the data is constantly being updated, this delay becomes a systemic problem.
Divide and Conquer: Distributed Processing as the Solution
Red Hat, in collaboration with Anyscale, is offering a solution based on distributed data processing. The idea isn't new in the world of big data, but its application to RAG pipelines is a logical and pragmatic step.
Instead of processing documents sequentially on a single machine, the task is split into parallel streams that run simultaneously on multiple cluster nodes. It's like handing a stack of books to a whole team instead of a single reader: each person takes a share, and the overall speed multiplies.
Technically, this is implemented with Ray Data, a framework for distributed data processing, in conjunction with Docling, a tool for extracting structured content from PDFs, Word files, scanned images, and other formats, including their tables and embedded text.
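As an illustration, a distributed version of the parsing step might look roughly like the sketch below. The Ray Data calls (from_items, map_batches, write_parquet) and Docling's DocumentConverter follow those projects' documented APIs, but the file list, batch size, and output location are assumptions made for the example.

```python
# Sketch of distributed document parsing with Ray Data + Docling.
# Each batch of file paths is converted on whichever cluster node
# Ray schedules it to, so parsing runs in parallel across workers.
import ray
from docling.document_converter import DocumentConverter

def parse_batch(batch: dict) -> dict:
    converter = DocumentConverter()  # one converter per batch/worker
    texts = [
        converter.convert(path).document.export_to_markdown()
        for path in batch["path"]
    ]
    return {"path": batch["path"], "text": texts}

ray.init()  # connect to an existing cluster, or start a local one

file_paths = ["docs/report.pdf", "docs/contract.docx"]  # hypothetical inputs
ds = ray.data.from_items([{"path": p} for p in file_paths])
parsed = ds.map_batches(parse_batch, batch_size=8)
parsed.write_parquet("/data/parsed/")  # hypothetical output location
```

In a real pipeline, the embedding step could be chained on as a second map_batches call so that GPU workers pick it up; the sketch above covers only the parsing stage.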
All of this is deployed on Red Hat OpenShift AI, a platform that provides the infrastructure layer: managing compute resources, storage, GPU acceleration, and everything else needed for such systems to operate stably in a corporate environment.
What Exactly Docling Can Do
Docling is not just a PDF parser. The tool can handle complex layouts: recognizing tables, separating columns, processing headers and captions, and understanding the document's hierarchy. This is crucial because most corporate documents are structured nothing like clean, linear text – they contain sidebars, footnotes, multi-column layouts, and scans with an OCR layer.
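For a sense of the API, Docling's documented entry point is a single converter call; the file name here is made up, and export methods may vary between versions.

```python
# Convert one document with Docling and export it with structure intact.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("annual_report.pdf")  # hypothetical input file

# export_to_markdown() preserves reading order, headings, and tables
# (tables come out as Markdown tables rather than jumbled cell text).
print(result.document.export_to_markdown())
```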
If a parser 'reads' a document incorrectly – mixing up the order of paragraphs or losing data from a table – the model's responses based on that document will be unreliable. The quality of data preparation directly impacts the quality of the final RAG solution.
Why This Matters Right Now
RAG is actively becoming a part of the corporate AI stack, and these are no longer just experiments but production deployments. Organizations want their internal models to answer questions based on up-to-date documentation, contracts, regulations, and knowledge bases. And the faster new data enters the system, the more 'live' and useful it becomes.
The document-processing bottleneck is not a theoretical problem but something teams encounter in real-world projects. Solving it with distributed processing feels natural: it scales horizontally (you can simply add more machines), doesn't require rewriting the pipeline logic from scratch, and fits into existing OpenShift-based infrastructure.
A distinct advantage of this approach is unification. Instead of assembling a pipeline from disparate tools, the team gets a single environment where data management, computation, and model control are all in one place. This reduces the operational load and simplifies debugging when something goes wrong.
What's Left Unanswered
The solution looks compelling at an architectural level, but several questions remain unanswered. How easy is it for a team without deep expertise in distributed systems to set all this up? How will Docling perform with documents in languages other than English, especially if they have specific layouts? How does the system handle low-quality documents – those that are poorly scanned or have an inconsistent structure?
These questions don't undermine the value of the approach, but they are important for anyone considering such a solution as the foundation for a production system, not just a research prototype.
Overall, this is an earnest, pragmatic attempt to address a real problem that the industry has somewhat underestimated amid the race for better model quality. Data must not only be stored but also prepared quickly and correctly; that, as it turns out, is a non-trivial task in itself.