Published on March 17, 2026

M4-RAG: When AI Seeks Answers in Images, Not Just Text, and Across Multiple Languages

Researchers have introduced M4-RAG, a large-scale benchmark for evaluating systems that answer questions about images by drawing on external knowledge and operating in multiple languages.

Research
Event Source: Capital One
5–7 min read

Imagine showing an AI a photograph of an old building and asking, “What is this place?” The model must not only describe the picture but also find the necessary information from external sources to provide a meaningful answer. Now, let's make it more complex: the question isn't asked in English, but in Japanese, Arabic, or Russian. These are precisely the kinds of challenges at the core of the new M4-RAG study, presented at the CVPR conference.

Why Does AI Need “External Search” Anyway?

Most modern language models are trained on vast datasets, but this data is fixed at the time of training. To put it simply, the model only knows what was in its “textbook.” If it needs fresh or highly specialized information, it might not know the answer or, even worse, invent one.

This is why an approach has gained ground in recent years in which a model first searches an external database for relevant information before answering, much like a student going to the library before an exam. This approach is called RAG, which stands for Retrieval-Augmented Generation. The idea is simple: before you answer, find something that can help.
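The "search first, answer second" idea can be sketched in a few lines. This is a minimal illustration, not how any particular RAG system is built: the three-sentence knowledge base, the word-overlap scoring, and the function names `retrieve` and `build_prompt` are all assumptions made for the example. Real systems typically use learned embeddings and a large document store, and pass the resulting prompt to a language model.

```python
import math
from collections import Counter

# Hypothetical toy knowledge base; the entries are illustrative only.
DOCUMENTS = [
    "The Eiffel Tower in Paris was completed in 1889.",
    "Retrieval-Augmented Generation looks up documents before answering.",
    "CVPR is a conference on computer vision and pattern recognition.",
]


def tokenize(text):
    """Lowercase the text and strip basic punctuation from each word."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return [w for w in words if w]


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, documents, k=1):
    """Return the k documents whose wording is closest to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        documents, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True
    )
    return ranked[:k]


def build_prompt(question, documents, k=1):
    """Place the retrieved text in front of the question for a generator model."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


print(build_prompt("When was the Eiffel Tower completed?", DOCUMENTS))
```

The generation step is deliberately left out: the point is only that the model receives the retrieved passage alongside the question instead of relying on its "textbook" alone.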

Until recently, RAG systems worked primarily with text. But the real world is different: information comes in the form of photos, diagrams, charts, and documents with images. This raises a natural question: how well does AI handle such tasks when visual information is involved? And how do you measure that “well”?

What Is M4-RAG and Why Is It Needed?

M4-RAG is a large-scale benchmark: a standardized set of tasks for evaluating systems that answer questions about images by relying on external search. The M4 acronym reflects the dimensions the tool covers: multilingual, multimodal (combining text and images), massive in scale, and multifaceted in tasks.

To put it even more simply, it's something like a standardized exam for AI systems that can (or claim they can) search for information in pictures, and do it in different languages.

The task of Visual Question Answering (VQA) is not new in itself. But the specific combination of three factors (visual content, external search, and multilingualism) has rarely been studied as a unified system before. M4-RAG fills this gap.

Why Languages Are Not Just About “Translation”

Multilingualism in the context of AI is a serious topic in its own right. Most powerful models are trained predominantly on English text. This means their capabilities in other languages are often noticeably weaker, even if the model formally “understands” several languages.

When the need to work with images and search external sources is added to the mix, the complexity increases sharply. The system must not only “see” the picture but also formulate the correct search query, find a suitable source, and extract the necessary information, all in a language that may be far from English.
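The chain of steps described above (look at the picture, form a query, search, answer in the user's language) can be written down as a small pipeline. This is a sketch under loud assumptions: the class `VisualRagPipeline` and all four components are hypothetical stand-ins invented for illustration, and nothing here reflects how M4-RAG or any evaluated system is actually implemented. In practice each stand-in would be replaced by a vision-language model, a retriever, and a generator.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VisualRagPipeline:
    """Chains four stages; every component is a pluggable, hypothetical callable."""

    describe_image: Callable[[bytes], str]        # image -> text description
    build_query: Callable[[str, str], str]        # (description, question) -> query
    search: Callable[[str], List[str]]            # query -> retrieved passages
    answer: Callable[[str, List[str], str], str]  # (question, passages, language)

    def run(self, image: bytes, question: str, language: str) -> str:
        description = self.describe_image(image)
        query = self.build_query(description, question)
        passages = self.search(query)
        return self.answer(question, passages, language)


# Stand-in components so the sketch runs without any model or network access.
def fake_describe(image: bytes) -> str:
    return "a stone tower with a clock"


def fake_build_query(description: str, question: str) -> str:
    return f"{description} {question}"


def fake_search(query: str) -> List[str]:
    return [f"(passage found for: {query})"]


def fake_answer(question: str, passages: List[str], language: str) -> str:
    return f"[{language}] answer based on {len(passages)} passage(s)"


pipeline = VisualRagPipeline(fake_describe, fake_build_query, fake_search, fake_answer)
print(pipeline.run(b"", "What is this place?", "ja"))
```

Keeping the stages separate makes it clear where things can go wrong: a weak image description or a poorly formed query in a non-English language degrades everything downstream, even if the final generator is strong.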

M4-RAG makes it possible to test how well a system handles these complex scenarios. This is important: if we want AI tools to be truly accessible to people worldwide, not just English speakers, we need to be able to measure the quality of their performance in different languages, and purposefully improve it.
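Measuring quality "in different languages" usually means reporting a score per language rather than a single overall number. The sketch below assumes the simplest possible metric, case-insensitive exact match; the article does not say which metrics M4-RAG uses, and the function name `accuracy_by_language` and the sample records are made up purely for illustration.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Compute exact-match accuracy separately for each language.

    records: iterable of (language, predicted_answer, gold_answer) tuples.
    Exact match is an assumption of this sketch, not the benchmark's metric.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for language, predicted, gold in records:
        total[language] += 1
        if predicted.strip().lower() == gold.strip().lower():
            correct[language] += 1
    return {language: correct[language] / total[language] for language in total}


# Invented placeholder records, not real benchmark data.
sample = [
    ("en", "Eiffel Tower", "Eiffel Tower"),
    ("en", "big ben", "Big Ben"),
    ("ja", "Tokyo Tower", "Tokyo Tower"),
    ("ja", "Kyoto Station", "Osaka Castle"),
]

for language, score in accuracy_by_language(sample).items():
    print(language, score)
```

A per-language breakdown like this is what makes a gap between English and other languages visible instead of being averaged away.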

A Benchmark as a Tool for Progress

It might seem that creating an “exam” is a less interesting task than creating the model itself. But in the research community, benchmarks are highly valued, and for good reason.

Without a common standard for measurement, different development teams can't honestly compare their results. Each team could test its model on examples convenient for it and get impressive numbers, but those numbers would say nothing about real quality. A good benchmark provides a single “ruler” by which to compare approaches and track actual progress.

The acceptance of M4-RAG at CVPR, one of the most authoritative venues in computer vision, indicates that the research community sees a need for such a ruler and considers the proposed approach serious enough to serve as a starting point for future research.

What Does This Change in Practice?

For ordinary users, the emergence of a benchmark won't change anything directly, of course: it's a tool for researchers and developers. But the indirect consequences are quite tangible.

Systems that can answer questions about images by relying on up-to-date external information, and do it in different languages, are not an abstraction. They could be, for example, a tool that helps interpret a medical scan by drawing on the latest clinical data. Or a service that finds a product's specifications and alternatives from a photo. Or an educational assistant that explains a historical document by pulling context from external sources.

The more accurately we can measure the quality of such systems, the faster they will improve. M4-RAG is a step in exactly that direction.

Open Questions

Any benchmark is a snapshot of reality, not reality itself. The question always remains: how well does it reflect the scenarios that systems will encounter in real-world use? Might it turn out that a model that aces this exam still struggles with the live queries of real users?

Moreover, multimodal search with multilingual support is an area where far less data is available than for English text. This creates a structural inequality that a single benchmark cannot solve. It can only make the problem visible and measurable.

But this is where progress begins: first, you learn to measure, then you learn to improve. M4-RAG takes on the first part of that job.

Original Title: M4-RAG: A multimodal RAG
Publication Date: Jun 3, 2026
Capital One (www.capitalone.com): A U.S.-based financial technology corporation applying artificial intelligence and machine learning to banking services, data analytics, and financial process automation.

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Claude Sonnet 4.6 (Anthropic): Analyzing the Original Publication and Writing the Text. The neural network studies the original material and generates a coherent text.
2. Gemini 2.5 Pro (Google DeepMind): Translation into English.
3. Gemini 2.5 Flash (Google DeepMind): Text Review and Editing. Correction of errors, inaccuracies, and ambiguous phrasing.
4. DeepSeek-V3.2 (DeepSeek): Preparing the Illustration Description. Generating a textual prompt for the visual model.
5. FLUX.2 Pro (Black Forest Labs): Creating the Illustration. Generating an image based on the prepared prompt.
