Published on March 17, 2026

Why AI Can't «Read» the World Like We Do

Researchers tested how well vision-language models hold up against misleading geographical cues – and the models proved easy to mislead.

Research, 5 – 7 min read
Event Source: Capital One

Imagine looking at a photo of a street café. Tables on the sidewalk, signs, characteristic architecture – and you almost instantly get a feeling: this is Paris or, perhaps, Rome. Your brain captures dozens of small details at once and assembles them into a coherent picture. Modern AI systems that can «look» at images and answer questions about them have also learned to do something similar. But how well do they cope when the picture is intentionally misleading?

This is the question a group of researchers addressed in work presented at CVPR – one of the key conferences in the field of computer vision. They set out to test vision-language models (VLMs): systems that take in both images and text, and then answer questions, describe scenes, or reason about a photo's content.

What Are VLMs and Why Test Their Robustness

Simply put, a VLM is an AI that can not only read text but also «look» at pictures. You show it a photo and ask, «What is pictured here?» or «What country was this taken in?» – and the model responds.

Such systems are already used in a wide variety of scenarios: from describing surroundings to blind people to automatically processing documents with illustrations. The broader the application, the more crucial it is to understand where the model might make mistakes – and especially in what situations it can be easily confused.

A model's robustness is its ability to provide correct answers even when the input data is slightly altered or contains «noise.» If a small change in the picture or caption drastically changes the model's response, that is a sign the system relies on superficial features rather than on an understanding of what is actually depicted.
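One way to make this definition concrete is to count how often a model's answer changes when its input is perturbed. The sketch below is purely illustrative: `flip_rate` and the toy `keyword_model` are hypothetical names that do not come from the study, and a real check would call an actual VLM on images rather than this caption-only stand-in.

```python
from typing import Callable, Sequence


def flip_rate(model: Callable[[str], str], original: str, perturbed: Sequence[str]) -> float:
    """Fraction of perturbed inputs whose answer differs from the answer on the original."""
    if not perturbed:
        raise ValueError("need at least one perturbed input")
    baseline = model(original)
    flips = sum(1 for variant in perturbed if model(variant) != baseline)
    return flips / len(perturbed)


def keyword_model(caption: str) -> str:
    """Toy stand-in for a VLM (hypothetical): it answers from surface keywords only."""
    text = caption.lower()
    if "eiffel" in text:
        return "France"
    if "colosseum" in text:
        return "Italy"
    return "unknown"


original = "Street cafe near the Eiffel Tower"
variants = [
    "Street cafe near the Eiffel Tower, sunny day",         # harmless noise
    "Street cafe near the Eiffel Tower replica",            # keyword still present
    "Street cafe, postcard of the Colosseum on the table",  # misleading detail
    "Street cafe near the tower",                           # keyword removed
]

# Two of the four variants change the toy model's answer.
print(flip_rate(keyword_model, original, variants))  # prints 0.5
```

A flip rate close to zero on harmless perturbations is what one would hope for; a high value points to the reliance on surface features described above.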

Tourists with a Poor Sense of Direction

The study's authors framed the problem figuratively but accurately: VLMs behave like disoriented tourists. They may know a lot about the world in general, but they get lost when familiar landmarks are out of place.

To test this idea, the researchers created a special set of tests – a kind of «cultural stress test.» It is based on the idea of spoofing geographical cues: the models were shown images with visual or textual elements intentionally creating a false impression of the location. For example, a photograph with distinctive cultural markers of one country might be accompanied by hints pointing to a completely different region.

The goal was simple: to see if the model could maintain a correct judgment when surrounded by an intentionally distorted context. Would it resist the false cues or follow them?
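The question of whether a model resists or follows a false cue can be framed as a three-way outcome per test case. The following is a minimal sketch under stated assumptions: the `SpoofCase` structure, the `classify` and `summarize` helpers, and the sample answers are all made up for illustration and are not the authors' test suite or data.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class SpoofCase:
    image_id: str       # hypothetical identifier of the test image
    true_location: str  # where the scene actually is
    planted_cue: str    # location suggested by the misleading hint


def classify(case: SpoofCase, answer: str) -> str:
    """Label a model answer as 'resisted', 'followed', or 'other'."""
    normalized = answer.strip().lower()
    if normalized == case.true_location.lower():
        return "resisted"   # the model kept the correct judgment
    if normalized == case.planted_cue.lower():
        return "followed"   # the model went along with the false cue
    return "other"          # wrong, but not because of the planted cue


def summarize(results: Iterable[Tuple[SpoofCase, str]]) -> Counter:
    """Count outcomes over (case, model answer) pairs."""
    return Counter(classify(case, answer) for case, answer in results)


# Invented answers, only to show how the bookkeeping works.
results = [
    (SpoofCase("img-001", "Japan", "Brazil"), "Brazil"),
    (SpoofCase("img-002", "Italy", "France"), " italy "),
    (SpoofCase("img-003", "India", "Mexico"), "Peru"),
]
summary = summarize(results)
print(summary["followed"], summary["resisted"], summary["other"])  # prints 1 1 1
```

Separating "followed" from "other" matters here: only the first shows that the planted cue, rather than general difficulty, caused the error.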

What the Results Showed

Vision-language models showed significant instability precisely when it came to cultural and geographical features. As soon as the context was slightly altered – by adding misleading text, swapping background details, or mixing visual signals from different cultures – the models began to make mistakes.

This suggests that many VLMs treat cultural context not as something understood holistically, but as a set of superficial patterns. It's as if they have «learned by heart» that certain visual elements are associated with specific places, without developing the deeper reasoning that would allow them to resist manipulation.

A human in a similar situation would likely notice the contradiction: «Wait, the architecture is clearly not from here. Something isn't right.» The models, however, often followed the planted cue without noticing the discrepancy.

Why This Matters Beyond the Test Environment

One might think: so what, it's just a lab experiment. But in practice, such situations occur much more often than one might expect.

Take, for example, content moderation systems that analyze images along with text captions. Or apps that help users navigate unfamiliar places using photos. Or tourism and educational services that rely on automatic recognition of cultural context. In all these cases, resilience to intentionally or accidentally distorted cues is not an academic problem, but a very practical one.

Furthermore, the study raises a broader question about how models actually «understand» culture. Or more precisely – do they understand it at all, or have they just memorized the statistical correlations between visual elements and geographical names? Based on the results, the latter seems more likely.

A Test Suite as a Tool for the Industry

Beyond the findings themselves, the researchers offered something of practical value: a structured test suite for evaluating the cultural robustness of VLMs. Simply put – a ready-made tool that developers can use to check their models for such vulnerabilities.

This is important because the industry currently lacks a unified standard for this type of evaluation. Most existing benchmarks check whether a model correctly recognizes objects or answers questions about an image's content. But very few systematically check what happens when the input data is intentionally distorted specifically in the cultural and geographical dimensions.

Such a tool is a step toward making these checks a routine part of how developers test their systems.
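In a routine testing process, a check of this kind might take the form of a simple pass/fail gate that compares accuracy on clean inputs with accuracy on spoofed ones. The sketch below is an assumption about how a team could wire this up, not something the study prescribes: the function name `passes_robustness_gate` and the 10-point default threshold are arbitrary illustrative choices.

```python
def passes_robustness_gate(
    clean_correct: int,
    spoofed_correct: int,
    total: int,
    max_drop: float = 0.10,  # arbitrary illustrative threshold, not from the study
) -> bool:
    """Return True if accuracy drops by at most max_drop when cues are spoofed."""
    if total <= 0:
        raise ValueError("total must be positive")
    drop = (clean_correct - spoofed_correct) / total
    return drop <= max_drop


# Hypothetical counts over 100 test images.
print(passes_robustness_gate(90, 85, 100))  # prints True  (drop of 0.05)
print(passes_robustness_gate(90, 60, 100))  # prints False (drop of 0.30)
```

A gate like this could run alongside ordinary accuracy checks, so that a model update that becomes easier to mislead is caught before release.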

Open Questions

The work also honestly acknowledges what remains outside its scope. The research focuses on a specific type of vulnerability – geographical and cultural cues. Other types of «confusing» context are left out, as is the question of how to fine-tune models to make them more robust in this regard.

Also open is the question of the problem's nature itself: is it a lack of data during training, an architectural feature, or something more fundamental related to how «understanding» works in neural networks at all? The research is more of an accurate diagnosis of the disease than a proposed cure – but a good diagnosis is often the first necessary step.

Ultimately, this work serves as a reminder: AI systems that appear confident and competent may be built on a more fragile foundation than it seems from the outside. And the more widely they are used in the real world, the more crucial it is to understand exactly where this fragility manifests.

Original Title: VLMs are confused tourists
Publication Date: Jun 3, 2026
Capital One (www.capitalone.com): a U.S.-based financial technology corporation applying artificial intelligence and machine learning to banking services, data analytics, and financial process automation.

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Claude Sonnet 4.6 (Anthropic): Analyzing the Original Publication and Writing the Text. The neural network studies the original material and generates a coherent text.
2. Gemini 2.5 Pro (Google DeepMind): Translating into English.
3. Gemini 2.5 Flash (Google DeepMind): Text Review and Editing. Correction of errors, inaccuracies, and ambiguous phrasing.
4. DeepSeek-V3.2 (DeepSeek): Preparing the Illustration Description. Generating a textual prompt for the visual model.
5. FLUX.2 Pro (Black Forest Labs): Creating the Illustration. Generating an image based on the prepared prompt.
