Published on March 16, 2026


MR3: A Model That Evaluates AI Responses in Dozens of Languages Without Predefined Rules

Researchers have introduced the MR3 model, which evaluates the quality of language model responses across multiple languages – without rigid criteria or evaluation templates.

Research / Technical context · 4–5 minute read
Event Source: Capital One

When a large language model answers a question, someone has to decide whether the answer is good or not. In production systems, this role is increasingly being filled by specialized evaluator models called reward models. They are trained to distinguish good responses from bad ones and help the primary model improve through further training.
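The judging role described above is usually applied pairwise: the evaluator scores two candidate responses and the better one wins. The sketch below illustrates that pattern; the `score` function here is a toy placeholder, not MR3's actual interface, which the article does not specify.

```python
# Minimal sketch of how an evaluator (reward model) is used to rank responses.
# `score` is a hypothetical stand-in for a trained neural reward model.

def score(prompt: str, response: str) -> float:
    """Toy reward: favors responses that address the prompt's wording.
    A real reward model would be a trained neural network."""
    overlap = len(set(prompt.lower().split()) & set(response.lower().split()))
    return overlap / (1 + len(response.split()))

def pick_better(prompt: str, response_a: str, response_b: str) -> str:
    """Pairwise judging: the standard way evaluator models are applied."""
    if score(prompt, response_a) >= score(prompt, response_b):
        return response_a
    return response_b

prompt = "What is a reward model?"
a = "A reward model scores responses so a policy model can be improved."
b = "I like turtles."
assert pick_better(prompt, a, b) == a
```

In production the toy `score` would be replaced by a call to the evaluator model itself; the surrounding pairwise logic stays the same.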

It sounds simple, but in practice, there are several inconvenient limitations. First, most of these evaluators are trained primarily in English. Second, they are usually tied to a specific set of criteria – that is, predefined rules about what constitutes a good or bad answer. Change the task, and you have to change or retrain the evaluator.

The researchers who presented the MR3 model at the ICLR conference sought to address these two limitations.


What is MR3 and What Makes It Special

MR3 is a new type of evaluator model. Its full name, Multilingual Rubric-Agnostic Reward Reasoning Model, describes a multilingual evaluator that reasons about response quality without depending on predefined criteria.

Let's break down what that means.

Multilingual. In terms of language coverage, MR3 surpasses anything that has existed in this field before. Simply put, the model can evaluate responses not only in English but also in dozens of other languages – which is crucial for systems that serve a multilingual audience.

Rubric-Agnostic. Most evaluators work based on a rubric: there is a list of rules, and the response is checked against each one. MR3 is designed differently – it can make an assessment based on the context of the task, without needing predefined rules about what is considered correct. This makes it more versatile: the same model can be applied in various scenarios without reconfiguration.
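The difference between the two designs can be made concrete by looking at how a judging prompt is assembled. The templates below are illustrative assumptions for the sake of contrast, not MR3's actual prompt format.

```python
# Sketch contrasting rubric-based and rubric-agnostic judging prompts.
# Both templates are assumptions made for illustration.

def rubric_prompt(task: str, answer: str, rubric: list[str]) -> str:
    """Rubric-based: the judge checks the answer against fixed criteria.
    Change the task, and the rubric has to change with it."""
    criteria = "\n".join(f"- {c}" for c in rubric)
    return (f"Task: {task}\nAnswer: {answer}\n"
            f"Score the answer against each criterion:\n{criteria}")

def rubric_agnostic_prompt(task: str, answer: str) -> str:
    """Rubric-agnostic: the judge infers what matters from the task itself,
    so the same template works across scenarios without reconfiguration."""
    return (f"Task: {task}\nAnswer: {answer}\n"
            "Decide from the task context what a good answer requires, "
            "explain your reasoning, then give a verdict.")

p = rubric_agnostic_prompt("Summarize the article.", "MR3 evaluates responses.")
assert "criterion" not in p  # no fixed criteria are baked in
```

The practical point is the second function's signature: no rubric argument, so switching tasks requires no new list of rules.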

Reasoning as Part of the Evaluation. The word reasoning in its name isn't just for show. The model doesn't just output a score directly; it first constructs a chain of reasoning explaining why one answer is better than another and outlining its strengths and weaknesses. This makes the evaluation more transparent and, as a rule, more reliable.
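A "reason first, then score" evaluator produces output with two parts that downstream code has to separate. The sketch below shows one way to parse such output; the `Verdict:` tag format is an assumption, since real judge models vary in how they mark their final decision.

```python
# Sketch of parsing a "reason first, then verdict" judge output.
# The "Verdict:" marker is an assumed convention, not MR3's actual format.
import re

def parse_judgment(raw: str) -> tuple[str, str]:
    """Split a judge's output into its reasoning chain and final verdict."""
    match = re.search(r"(.*)Verdict:\s*(A|B)\s*$", raw, flags=re.DOTALL)
    if not match:
        raise ValueError("no verdict found in judge output")
    return match.group(1).strip(), match.group(2)

raw = ("Response A cites the definition and gives an example; "
       "Response B is off-topic.\nVerdict: A")
reasoning, verdict = parse_judgment(raw)
assert verdict == "A"
assert "off-topic" in reasoning
```

Keeping the reasoning alongside the verdict is what makes the evaluation auditable: a human can check whether the stated strengths and weaknesses actually justify the score.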


Why Is This Needed – and For Whom?

To understand the practical value of MR3, it helps to recall how the process of improving language models works.

Modern large models are trained not only on text from the internet but also with the help of feedback, where the system learns from evaluations of its own responses. This approach is known as Reinforcement Learning from Human Feedback (RLHF) or its automated variations. The evaluator model acts as a judge here: it looks at a response and says how good it is.
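One of the simplest automated variations of this feedback loop is best-of-n selection: sample several candidate responses, let the evaluator score them, and keep the best as training signal. The sketch below is a deliberately simplified illustration of that loop; every function body is a toy placeholder, not a real RLHF implementation.

```python
# Highly simplified sketch of the feedback loop the article describes:
# the evaluator scores candidates, and the best one becomes training signal.
# All function bodies are toy placeholders.
import random

def generate_candidates(prompt: str, n: int = 4) -> list[str]:
    """Placeholder for sampling several responses from the policy model."""
    return [f"candidate {i} for: {prompt}" for i in range(n)]

def reward(prompt: str, response: str) -> float:
    """Placeholder reward model; a real one is a trained neural judge."""
    return random.random()

def feedback_step(prompt: str) -> str:
    """One loop iteration: generate, score, keep the best candidate."""
    candidates = generate_candidates(prompt)
    best = max(candidates, key=lambda r: reward(prompt, r))
    return best  # in full RLHF this would feed a policy update instead

random.seed(0)
best = feedback_step("Explain reward models.")
assert best.startswith("candidate")
```

If the `reward` function only works well in English, every iteration of this loop quietly degrades the model's non-English behavior, which is exactly the gap a multilingual evaluator is meant to close.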

If an evaluator works in only one language, the quality of fine-tuning in other languages inevitably suffers. This is particularly problematic for companies and teams building products for a multilingual audience.

Furthermore, if an evaluator is rigidly tied to specific criteria, it has to be retrained every time the task changes. MR3 removes this limitation, as it is capable of adapting to new evaluation conditions without retraining.


What This Means for the Industry

The work on MR3 was presented at ICLR, one of the leading conferences in machine learning, which means the approach has passed rigorous peer review.

For researchers and teams developing multilingual systems, MR3 offers an interesting alternative to current solutions. Instead of maintaining separate evaluators for different languages or tasks, they can use a single, more flexible model with broader coverage.

This is especially relevant as language models increasingly expand beyond the English language. The demand for quality assessment tools that work just as well in Spanish, Arabic, or Hindi as they do in English is very real and growing.


Open Questions

Like most research papers, the work on MR3 leaves some questions that have yet to be clarified in practice.

Being rubric-agnostic is one of the model's strengths, but it also creates an area of uncertainty. When an evaluator builds its own evaluation logic without relying on explicit rules, the question arises: how stable and predictable are its judgments in different contexts? Verifying this in real-world production scenarios is more difficult than on test datasets.

The quality of its multilingual performance is also not uniform: models generally perform better in languages with large amounts of training data. How consistently MR3 handles lower-resource languages is a question that requires separate study.

Nevertheless, the direction in which MR3 is moving seems logical: the quality evaluation of language models should be as flexible and multilingual as the models themselves. And in this regard, MR3 takes a significant step forward.

Original Title: MR3: Multilingual rubric-agnostic reward reasoning models
Publication Date: Apr 23, 2026
Capital One (www.capitalone.com): a U.S.-based financial technology corporation applying artificial intelligence and machine learning to banking services, data analytics, and financial process automation.


From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Claude Sonnet 4.6 (Anthropic): Analyzing the Original Publication and Writing the Text. The neural network studies the original material and generates a coherent text.

2. Gemini 2.5 Pro (Google DeepMind): Translation into English.

3. Gemini 2.5 Flash (Google DeepMind): Text Review and Editing. Correction of errors, inaccuracies, and ambiguous phrasing.

4. DeepSeek-V3.2 (DeepSeek): Preparing the Illustration Description. Generating a textual prompt for the visual model.

5. FLUX.2 Pro (Black Forest Labs): Creating the Illustration. Generating an image based on the prepared prompt.
