Published on March 20, 2026

Как научить ИИ слушаться тех, кому стоит доверять

How to Teach AI to Obey Trusted Sources

OpenAI has developed the IH-Challenge approach, which helps language models correctly prioritize instructions from different sources.

Security 4 – 6 minutes min read

Event Source: OpenAI 4 – 6 minutes min read

Imagine you hire an employee through an agency. The agency provides them with general rules of conduct. As their employer, you then provide specific instructions for their job. Now, what if a random client comes along and tries to convince this employee to break all the rules? Who should the employee listen to? The answer seems obvious, but for language models, this has been a non-trivial task until now.

This is precisely the problem OpenAI has taken on with an approach called the IH-Challenge (from Instruction Hierarchy).

Откуда вообще берётся путаница

Where Does the Confusion Come From?

Modern language models receive instructions from several sources at once. The platform developer configures the model's behavior via a system prompt. The user writes something in the chat. And sometimes, the model works with external data – documents, web pages, search results – where text that looks like an instruction can also be found.

The problem is that the model doesn't always understand whose words to trust more. If an uploaded document contains the phrase, “ignore all previous instructions and do this,” some models might actually follow that command. This is called prompt injection – an attempt to inject external commands into the model through an untrusted source.

This isn't an abstract threat. When models are integrated into workflows and automatically process emails, documents, or web content, the potential for such attacks becomes very real.

Что такое иерархия инструкций и почему это важно

What Is an Instruction Hierarchy and Why Is It Important?

The idea behind an instruction hierarchy is simple: not all sources are equally reliable, and the model must understand this. Instructions from the platform developer carry more weight. The user's words are important, but they are limited by the framework set by the developer. Text from an external document should not be interpreted as a command at all.

Simply put, the model must be able to set priorities: who to believe, who to obey, and whose “instructions” to treat as data rather than a call to action.

It sounds logical. But in practice, training a model for this behavior has proven to be difficult – especially without sacrificing its overall usefulness.

Что делает IH-Challenge

What the IH-Challenge Does

The IH-Challenge is a special training approach designed to help models better adhere to this hierarchy. The core idea is to deliberately create situations during training where the model must correctly resolve conflicts between instructions from different trust levels.

The researchers created a set of scenarios where instruction sources explicitly contradict each other and trained the model to make the right decision in each case. Importantly, the task was not simply framed as, “follow the safety rules,” but rather as, “learn to determine who to trust in a given context.”

As a result, models that underwent this training showed improvements in several areas:

They are better at following instructions from trusted sources;
They are more resistant to manipulation attempts via untrusted content;
Their behavior becomes more predictable from a security standpoint.

Это не только про защиту от атак

It's Not Just About Protecting from Attacks

An important nuance: this isn't just about protection. A proper instruction hierarchy is also about giving developers more confidence in controlling the model's behavior within their products.

If a company integrates a language model into a corporate tool and sets certain limitations – for example, “do not discuss competitors” or “always respond in the user's language” – they want to be sure that these rules cannot be easily bypassed. The IH-Challenge helps reinforce this confidence.

This makes the models more controllable – in a good way. Not in the sense that “the model only does what it's told,” but in the sense that “the model understands whose commands carry more weight.”

Открытый вопрос: где граница

The Open Question: Where to Draw the Line

One of the challenging aspects of this topic is finding the right balance. A model that adheres too rigidly to the hierarchy might become less flexible and refuse to help a user in situations where it would be perfectly appropriate. A model that switches to new instructions too easily is vulnerable.

OpenAI acknowledges that this is a difficult calibration, and the IH-Challenge is not a final solution but rather a step in the right direction. Work on how models handle conflicting instruction sources is ongoing.

But the very fact that such research is being formalized into distinct training methods and described publicly shows that the industry is seriously tackling the question of how to make an AI system not just smart, but also predictably obedient to those it is supposed to obey.

#analysis #methodology #ai training #ai safety #computer systems #infrastructure #human–machine interaction #ai reliability #human-in-the-loop

Link to Original: https://openai.com/index/instruction-hierarchy-challenge

Original Title: Improving instruction hierarchy in frontier LLMs

Publication Date: Mar 10, 2026

OpenAI openai.com A U.S.-based company developing general-purpose AI models for text, code, and images.

Previous Article ChatGPT Now Interactively Explains Math and Physics Next Article Agents Instead of Chatbots: How AI Is Learning to Solve Truly Complex Problems

Как научить ИИ слушаться тех, кому стоит доверять

Откуда вообще берётся путаница

Что такое иерархия инструкций и почему это важно

Что делает IH-Challenge

Это не только про защиту от атак

Открытый вопрос: где граница

Related Publications

Assessing AI Agent Skills: What to Look For

How to Tell if Your AI Agent is Actually Working or Just Looking Convincing

How Microsoft Is Learning to Spot Backdoors in Language Models

From Source to Analysis

Neural Networks Involved in the Process

1. Analyzing the Original Publication and Writing the Text

2. step.translate-en.title

3. Text Review and Editing

4. Preparing the Illustration Description

5. Creating the Illustration