Published on March 20, 2026

Как научить ИИ слушаться тех, кому стоит доверять

How to Teach AI to Obey Trusted Sources

OpenAI has developed the IH-Challenge approach, which helps language models correctly prioritize instructions from different sources.

Security 4 – 6 minutes min read
Event Source: OpenAI 4 – 6 minutes min read

Imagine you hire an employee through an agency. The agency provides them with general rules of conduct. As their employer, you then provide specific instructions for their job. Now, what if a random client comes along and tries to convince this employee to break all the rules? Who should the employee listen to? The answer seems obvious, but for language models, this has been a non-trivial task until now.

This is precisely the problem OpenAI has taken on with an approach called the IH-Challenge (from Instruction Hierarchy).

Откуда вообще берётся путаница

Where Does the Confusion Come From?

Modern language models receive instructions from several sources at once. The platform developer configures the model's behavior via a system prompt. The user writes something in the chat. And sometimes, the model works with external data – documents, web pages, search results – where text that looks like an instruction can also be found.

The problem is that the model doesn't always understand whose words to trust more. If an uploaded document contains the phrase, “ignore all previous instructions and do this,” some models might actually follow that command. This is called prompt injection – an attempt to inject external commands into the model through an untrusted source.

This isn't an abstract threat. When models are integrated into workflows and automatically process emails, documents, or web content, the potential for such attacks becomes very real.

Что такое иерархия инструкций и почему это важно

What Is an Instruction Hierarchy and Why Is It Important?

The idea behind an instruction hierarchy is simple: not all sources are equally reliable, and the model must understand this. Instructions from the platform developer carry more weight. The user's words are important, but they are limited by the framework set by the developer. Text from an external document should not be interpreted as a command at all.

Simply put, the model must be able to set priorities: who to believe, who to obey, and whose “instructions” to treat as data rather than a call to action.

It sounds logical. But in practice, training a model for this behavior has proven to be difficult – especially without sacrificing its overall usefulness.

Что делает IH-Challenge

What the IH-Challenge Does

The IH-Challenge is a special training approach designed to help models better adhere to this hierarchy. The core idea is to deliberately create situations during training where the model must correctly resolve conflicts between instructions from different trust levels.

The researchers created a set of scenarios where instruction sources explicitly contradict each other and trained the model to make the right decision in each case. Importantly, the task was not simply framed as, “follow the safety rules,” but rather as, “learn to determine who to trust in a given context.”

As a result, models that underwent this training showed improvements in several areas:

  • They are better at following instructions from trusted sources;
  • They are more resistant to manipulation attempts via untrusted content;
  • Their behavior becomes more predictable from a security standpoint.

Это не только про защиту от атак

It's Not Just About Protecting from Attacks

An important nuance: this isn't just about protection. A proper instruction hierarchy is also about giving developers more confidence in controlling the model's behavior within their products.

If a company integrates a language model into a corporate tool and sets certain limitations – for example, “do not discuss competitors” or “always respond in the user's language” – they want to be sure that these rules cannot be easily bypassed. The IH-Challenge helps reinforce this confidence.

This makes the models more controllable – in a good way. Not in the sense that “the model only does what it's told,” but in the sense that “the model understands whose commands carry more weight.”

Открытый вопрос: где граница

The Open Question: Where to Draw the Line

One of the challenging aspects of this topic is finding the right balance. A model that adheres too rigidly to the hierarchy might become less flexible and refuse to help a user in situations where it would be perfectly appropriate. A model that switches to new instructions too easily is vulnerable.

OpenAI acknowledges that this is a difficult calibration, and the IH-Challenge is not a final solution but rather a step in the right direction. Work on how models handle conflicting instruction sources is ongoing.

But the very fact that such research is being formalized into distinct training methods and described publicly shows that the industry is seriously tackling the question of how to make an AI system not just smart, but also predictably obedient to those it is supposed to obey.

Original Title: Improving instruction hierarchy in frontier LLMs
Publication Date: Mar 10, 2026
OpenAI openai.com A U.S.-based company developing general-purpose AI models for text, code, and images.
Previous Article ChatGPT Now Interactively Explains Math and Physics Next Article Agents Instead of Chatbots: How AI Is Learning to Solve Truly Complex Problems

Related Publications

You May Also Like

Explore Other Events

Events are only part of the bigger picture. These materials help you see more broadly: the context, the consequences, and the ideas behind the news.

We explore why assessing AI agents' skills isn't just a formality, but a crucial step toward building systems you can trust with real-world tasks.

OpenHandsopenhands.dev Mar 18, 2026

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1.
Claude Sonnet 4.6 Anthropic Analyzing the Original Publication and Writing the Text The neural network studies the original material and generates a coherent text

1. Analyzing the Original Publication and Writing the Text

The neural network studies the original material and generates a coherent text

Claude Sonnet 4.6 Anthropic
2.
Gemini 2.5 Pro Google DeepMind step.translate-en.title

2. step.translate-en.title

Gemini 2.5 Pro Google DeepMind
3.
Gemini 2.5 Flash Google DeepMind Text Review and Editing Correction of errors, inaccuracies, and ambiguous phrasing

3. Text Review and Editing

Correction of errors, inaccuracies, and ambiguous phrasing

Gemini 2.5 Flash Google DeepMind
4.
DeepSeek-V3.2 DeepSeek Preparing the Illustration Description Generating a textual prompt for the visual model

4. Preparing the Illustration Description

Generating a textual prompt for the visual model

DeepSeek-V3.2 DeepSeek
5.
FLUX.2 Pro Black Forest Labs Creating the Illustration Generating an image based on the prepared prompt

5. Creating the Illustration

Generating an image based on the prepared prompt

FLUX.2 Pro Black Forest Labs

Want to dive deeper into the world
of neuro-creativity?

Be the first to learn about new books, articles, and AI experiments
on our Telegram channel!

Subscribe