Imagine you hire an employee through an agency. The agency provides them with general rules of conduct. As their employer, you then provide specific instructions for their job. Now, what if a random client comes along and tries to convince this employee to break all the rules? Who should the employee listen to? The answer seems obvious, but for language models, this has been a non-trivial task until now.
This is precisely the problem OpenAI has taken on with an approach called the IH-Challenge (from Instruction Hierarchy).
Where Does the Confusion Come From?
Modern language models receive instructions from several sources at once. The platform developer configures the model's behavior via a system prompt. The user writes something in the chat. And sometimes, the model works with external data – documents, web pages, search results – where text that looks like an instruction can also be found.
The problem is that the model doesn't always understand whose words to trust more. If an uploaded document contains the phrase, “ignore all previous instructions and do this,” some models might actually follow that command. This is called prompt injection – an attempt to inject external commands into the model through an untrusted source.
This isn't an abstract threat. When models are integrated into workflows and automatically process emails, documents, or web content, the potential for such attacks becomes very real.
What Is an Instruction Hierarchy and Why Is It Important?
The idea behind an instruction hierarchy is simple: not all sources are equally reliable, and the model must understand this. Instructions from the platform developer carry more weight. The user's words are important, but they are limited by the framework set by the developer. Text from an external document should not be interpreted as a command at all.
Simply put, the model must be able to set priorities: who to believe, who to obey, and whose “instructions” to treat as data rather than a call to action.
It sounds logical. But in practice, training a model for this behavior has proven to be difficult – especially without sacrificing its overall usefulness.
What the IH-Challenge Does
The IH-Challenge is a special training approach designed to help models better adhere to this hierarchy. The core idea is to deliberately create situations during training where the model must correctly resolve conflicts between instructions from different trust levels.
The researchers created a set of scenarios where instruction sources explicitly contradict each other and trained the model to make the right decision in each case. Importantly, the task was not simply framed as, “follow the safety rules,” but rather as, “learn to determine who to trust in a given context.”
As a result, models that underwent this training showed improvements in several areas:
- They are better at following instructions from trusted sources;
- They are more resistant to manipulation attempts via untrusted content;
- Their behavior becomes more predictable from a security standpoint.
It's Not Just About Protecting from Attacks
An important nuance: this isn't just about protection. A proper instruction hierarchy is also about giving developers more confidence in controlling the model's behavior within their products.
If a company integrates a language model into a corporate tool and sets certain limitations – for example, “do not discuss competitors” or “always respond in the user's language” – they want to be sure that these rules cannot be easily bypassed. The IH-Challenge helps reinforce this confidence.
This makes the models more controllable – in a good way. Not in the sense that “the model only does what it's told,” but in the sense that “the model understands whose commands carry more weight.”
The Open Question: Where to Draw the Line
One of the challenging aspects of this topic is finding the right balance. A model that adheres too rigidly to the hierarchy might become less flexible and refuse to help a user in situations where it would be perfectly appropriate. A model that switches to new instructions too easily is vulnerable.
OpenAI acknowledges that this is a difficult calibration, and the IH-Challenge is not a final solution but rather a step in the right direction. Work on how models handle conflicting instruction sources is ongoing.
But the very fact that such research is being formalized into distinct training methods and described publicly shows that the industry is seriously tackling the question of how to make an AI system not just smart, but also predictably obedient to those it is supposed to obey.