Imagine you're in an unfamiliar building and need to find the exit. You don't have a floor plan, but you methodically explore the corridors: peering into passages, remembering where you've already been, and building a rough map in your head. This seems simple – almost automatic. Yet, this very task proved to be a serious challenge for modern AI systems.
Researchers from Stanford developed a special test called Theory of Space and used it to evaluate six leading AI models. The results showed that what humans do almost without thinking poses a fundamental difficulty for AI.
The task is essentially this: the model is placed in a virtual space and must actively explore it – moving around, noticing new details, and updating its understanding of the space's layout. Then, it must use this accumulated knowledge to make decisions: where to go next, where things are located, and how to get from one point to another.
To put it simply, the model must not just perceive the space, but build an internal model of it as it explores – and revise that model if new information changes the picture.
This is precisely what researchers call 'spatial beliefs' – a dynamic, updatable representation of how the surrounding environment is structured. It's not a static map provided beforehand, but knowledge that needs to be constructed independently during the process.
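To make the idea of an updatable spatial belief concrete, here is a minimal Python sketch. It is not the benchmark's implementation: the grid layout, the Cell labels, and the observe and explore helpers are all hypothetical, invented for illustration. The agent starts out knowing only its own position, observes the four adjacent cells at each step, and writes what it sees into a belief map. For brevity, the agent jumps between known free cells rather than walking step by step.

```python
from enum import Enum


class Cell(Enum):
    FREE = "."
    WALL = "#"


# Hypothetical ground-truth layout. The agent never reads it directly;
# it only learns about it through observe().
WORLD = [
    "#####",
    "#..E#",
    "#.#.#",
    "#S..#",
    "#####",
]


def observe(pos):
    """Return what is visible from pos: the four adjacent cells."""
    r, c = pos
    seen = {}
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        seen[(nr, nc)] = Cell.WALL if WORLD[nr][nc] == "#" else Cell.FREE
    return seen


def explore(start):
    """Build a belief map by visiting every free cell reachable from start."""
    belief = {start: Cell.FREE}   # cells missing from the dict are unknown
    to_visit = [start]
    visited = set()
    while to_visit:
        pos = to_visit.pop()
        if pos in visited:
            continue
        visited.add(pos)
        observation = observe(pos)
        belief.update(observation)  # newer observations overwrite older entries
        to_visit.extend(
            cell for cell, kind in observation.items()
            if kind is Cell.FREE and cell not in visited
        )
    return belief


belief = explore((3, 1))  # (3, 1) is the cell marked "S"
free_cells = [cell for cell, kind in belief.items() if kind is Cell.FREE]
```

The important design choice is that the belief is a plain dictionary that starts almost empty and is overwritten as observations arrive. Nothing is given in advance, and any earlier entry can be replaced by a later one.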
None of the six models tested handled the task reliably. Moreover, all of them exhibited similar systemic weaknesses.
First: The Models Explore Poorly
It turned out that AI systems are not good at planning their exploration of a space. Instead of methodically navigating an unfamiliar environment – as a human would – they perform chaotic or inefficient actions. The researchers called this the 'exploration bottleneck': the model doesn't understand where or why to move to learn something new.
This is critical because, without effective exploration, it's impossible to gather enough information to build an accurate representation of the space.
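One way to picture what planned exploration means is a greedy rule: move to the reachable cell that would reveal the most unobserved neighbors. The sketch below is an illustration under assumptions, not the strategy the study prescribes or measures. The function names, the unbounded grid, and the example coordinates are all hypothetical.

```python
def neighbors(cell):
    """Four adjacent cells on an unbounded grid (an assumption of this sketch)."""
    r, c = cell
    return [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]


def information_gain(known, cell):
    """Number of cells next to `cell` that the agent has not observed yet."""
    return sum(1 for n in neighbors(cell) if n not in known)


def pick_next_move(known, reachable):
    """Greedy choice: go where the most unobserved cells would be revealed."""
    return max(reachable, key=lambda cell: information_gain(known, cell))


# Hypothetical state: cells already observed, and cells the agent could move to.
known = {(0, 0), (0, 1), (1, 0), (1, 1), (0, 2)}
reachable = [(0, 1), (1, 1), (0, 2)]

next_move = pick_next_move(known, reachable)
```

Here (0, 2) borders three unobserved cells, more than either alternative, so the rule selects it. An agent that moves without any such criterion keeps revisiting what it already knows.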
Second: Text and Images Exist in Parallel Worlds
Modern powerful models can work with both text and images. It would seem this should help with spatial tasks: you look at a picture and understand where you are. But in practice, things turned out to be more complicated.
The study revealed a persistent gap between two modes of operation: space described in words versus space shown visually. Given the same situation, the models perform much worse on the visual version than on the textual description.
In other words, for these models, 'seeing' and 'understanding space' are still two different things.
Third: Once a Belief Is Formed, It's Hard to Change
This is perhaps the most surprising finding. The models demonstrate what the researchers call 'belief inertia': once they form a certain representation of a space, they struggle to revise it – even when new data clearly indicates that their previous understanding was wrong.
It's like a person who has made up their mind about a route and then, upon encountering a locked door, continues to insist that the exit must be right there instead of reconsidering the path. People do this occasionally; for the AI models, it proved to be a consistent pattern.
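The locked-door example can be turned into a toy calculation. The sketch below is only an illustration of the concept, not how the study measures belief inertia: the revise function, the inertia parameter, and the numbers are all assumptions. A belief is a number between 0 and 1 (how sure the agent is that the exit lies behind the door), and inertia controls how much of the old belief survives new evidence.

```python
def revise(prior, evidence, inertia):
    """Blend an old belief with new evidence.

    inertia = 0.0 adopts the evidence completely;
    inertia = 1.0 ignores it and keeps the prior unchanged.
    """
    return inertia * prior + (1.0 - inertia) * evidence


prior = 0.9      # hypothetical: the agent is fairly sure the exit is behind this door
evidence = 0.0   # the door turns out to be locked

flexible = revise(prior, evidence, inertia=0.1)  # belief drops to about 0.09
stubborn = revise(prior, evidence, inertia=0.9)  # belief stays at about 0.81
```

With low inertia, one contradicting observation is enough to abandon the route. With high inertia, the agent still rates the locked door as the most likely exit, which is the pattern the locked-door analogy describes.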
The task of spatial orientation might seem highly specialized – so what if a model can't navigate virtual rooms? But in reality, this is about a much more fundamental ability.
Spatial reasoning isn't just about maps and navigation. It's about the ability to build a dynamic model of reality: to update one's beliefs as new information arrives, to understand what you don't yet know, and to purposefully seek it out. These are precisely the skills needed, for example, by a robot that has to operate in the real world, or by an AI assistant solving multi-step problems in changing conditions.
If a model can't revise its understanding of a situation based on new observations, that's a problem that extends far beyond spatial tasks. It's a question of how well AI can adapt to reality, not just answer questions based on a pre-defined context.
Most existing tests for AI are designed around the 'given a task – get an answer' principle. All the necessary information is present in the prompt. The model doesn't need to search, explore, or clarify anything – it just needs to correctly process what's provided.
Theory of Space is fundamentally different. Here, the model must decide for itself what actions to take to obtain the necessary information. This is called active exploration – and it's what distinguishes 'understanding' from 'pattern reproduction.'
This approach is closer to how real intelligence works. We don't receive all the context in advance – we gather it, often on the fly. And if an AI system is to operate in the real world, not just in controlled test environments, this ability becomes key.
The study's results don't mean that modern AI models are bad in general. They mean that the models have a specific, measurable gap – and now it has a name and a method for measuring it.
Having a clear benchmark is useful in itself. The industry has long been looking for ways to understand what large models can and cannot do, beyond standard tasks like text generation or question answering. Theory of Space provides one such tool.
For those developing autonomous systems, robots, or AI agents capable of acting in the real world, this research points to specific, unresolved challenges: flexible knowledge updates, the ability to plan exploration, and working with visual information in a dynamic context.
The study honestly documents the problems but doesn't offer ready-made solutions – which is normal for this type of work. Understanding where the gap lies is often more important than immediately closing it.
It remains unclear how much the identified weaknesses stem from the architectural limitations of the models themselves and how much from the way they were trained. Perhaps some of the problems can be solved by fine-tuning on active exploration tasks. Or perhaps more profound changes are needed in how models work with accumulated context.
A separate open question is the transfer of these findings to real-world scenarios. The test operates in a virtual environment, and how accurately it reflects the models' behavior in more complex, physical, or mixed-reality conditions remains to be seen.
But the fact that such questions can now be asked with the support of concrete data is already a step forward.