Published on March 19, 2026

MolmoPoint: A New AI Approach to Precise Object Pointing in Images

Researchers from the Allen Institute for AI (Ai2) have introduced MolmoPoint, an architecture that lets vision-language models point more accurately to specific objects in images.

Most of us are used to interacting with language models through text: you ask a question, you get an answer. But modern AI systems also have a visual side: they can look at images and answer questions about what's in them. Models that combine the two are called vision-language models, and the field is developing rapidly.

One of the more challenging skills in this field is not describing an image in general terms but pinpointing a specific object in it. In other words, not just saying "there is a cat in the image," but "here it is, here is its nose, here is its right ear." This task of precise object pointing is what MolmoPoint, the new development from Allen AI, aims to solve.

Why Precise Object Pointing is Essential for AI

It might seem like a minor detail. But in practice, the ability to accurately pinpoint an object is the foundation for many useful applications. For a robot that needs to pick up an item from a table, it's not enough to know that "the glass is on the left": it needs precise coordinates. An augmented reality system needs to understand exactly what the user is looking at. A medical AI needs to point out not just "something suspicious in the scan," but a specific location.

Simply put, object pointing is the bridge between understanding and action. And the more precise this bridge is, the broader the model's applicability in real-world tasks.

Previous AI Approaches and Their Limitations

Previous approaches to object pointing in images varied, but most of them essentially boiled down to one of two methods: either the model would predict a bounding box around the object, or it would output a set of coordinate points to mark the location.
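To make the two output formats concrete, here is a minimal sketch in Python. The class names `Point` and `BoundingBox` and the pixel values are hypothetical and chosen only for illustration; they do not come from any particular model's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """A single pointing target in pixel coordinates."""
    x: float
    y: float


@dataclass(frozen=True)
class BoundingBox:
    """An axis-aligned rectangle given by its top-left and bottom-right corners."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def center(self) -> Point:
        """Midpoint of the rectangle, a common way to reduce a box to one point."""
        return Point((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)

    def contains(self, p: Point) -> bool:
        """True if the point lies inside the rectangle or on its border."""
        return self.x_min <= p.x <= self.x_max and self.y_min <= p.y <= self.y_max


# Hypothetical example: a box around a cat's head and a point on its nose.
head_box = BoundingBox(x_min=40, y_min=60, x_max=120, y_max=100)
nose = Point(x=80, y=80)

print(head_box.center())
print(head_box.contains(nose))
```

A box describes the extent of an object, while a point marks one location on it. Converting a box to a point by taking its center is easy, but going the other way loses information, which is one reason the two formats are not interchangeable.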

Both strategies work, but they share a common weakness: achieving high accuracy requires a lot of labeled data. And data labeling is expensive, slow, and requires human effort. Furthermore, a model trained on coordinates in one format may struggle with the same task in a different format or on a different type of image.

Additionally, the pointing mechanism was often "tacked on" to the main model rather than organically integrated into it. This limited the potential for co-training and reduced overall efficiency.

What MolmoPoint Offers: Key Innovations

The Allen AI team proposed a different way of structuring the architecture itself, that is, how the model "thinks" about pointing to an object.

The key idea of MolmoPoint is to separate visual understanding from precise positioning. Instead of requiring one part of the model to do everything at once, the authors created a separate component that specifically handles localization, that is, finding the precise location in the image.

In this setup, the model doesn't predict just a single point or a rectangle. It works with heatmaps, essentially "probability maps" in which brighter areas correspond to higher confidence that the target object is located there. This is more flexible, because it allows pointing to objects of various sizes and shapes without the rigid format of a bounding box.
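As a rough illustration of the general heatmap idea (not MolmoPoint's actual implementation, which this article does not detail), the sketch below turns a small grid of scores into a location in two ways: by taking the brightest cell, and by taking the probability-weighted average position. The function names and the toy grid are assumptions made for this example.

```python
from typing import List, Tuple

Heatmap = List[List[float]]  # rows of non-negative scores, indexed [y][x]


def normalize(heatmap: Heatmap) -> Heatmap:
    """Scale the scores so they sum to 1 and can be read as probabilities."""
    total = sum(sum(row) for row in heatmap)
    if total <= 0:
        raise ValueError("heatmap must contain some positive score")
    return [[v / total for v in row] for row in heatmap]


def peak(heatmap: Heatmap) -> Tuple[int, int]:
    """Return (x, y) of the brightest cell; the first one wins on ties."""
    best = max(
        ((v, x, y) for y, row in enumerate(heatmap) for x, v in enumerate(row)),
        key=lambda t: t[0],
    )
    return best[1], best[2]


def expected_location(heatmap: Heatmap) -> Tuple[float, float]:
    """Probability-weighted mean (x, y); it may fall between grid cells."""
    probs = normalize(heatmap)
    ex = sum(p * x for row in probs for x, p in enumerate(row))
    ey = sum(p * y for y, row in enumerate(probs) for p in row)
    return ex, ey


# Toy 4x4 heatmap: most of the confidence sits in the column x = 2.
toy = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 3.0, 0.0],
    [0.0, 1.0, 3.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

print(peak(toy))
print(expected_location(toy))
```

The peak gives a single confident cell, while the weighted average reflects how the confidence is spread. Neither forces the answer into a rectangle, which is the flexibility the paragraph above describes.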

Another important aspect is the training process. MolmoPoint is designed to efficiently use synthetically generated data – data created automatically rather than labeled by hand. This significantly reduces the reliance on expensive manual labeling and clears a path for scaling: the more synthetic data is used, the better the model becomes at its task.
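The appeal of synthetic data is that the label comes for free: whoever generates the image already knows where the object is. The sketch below shows that principle on a deliberately trivial case, a filled rectangle on a blank grid. The function `make_synthetic_sample`, the grid size, and the prompt text are all hypothetical and say nothing about how the MolmoPoint training data was actually produced.

```python
import random
from typing import Any, Dict, List


def make_synthetic_sample(width: int, height: int, rng: random.Random) -> Dict[str, Any]:
    """Draw one filled rectangle on a blank grid and return the image with its label."""
    w = rng.randint(2, width // 2)
    h = rng.randint(2, height // 2)
    x0 = rng.randint(0, width - w)
    y0 = rng.randint(0, height - h)

    image: List[List[int]] = [[0] * width for _ in range(height)]
    for y in range(y0, y0 + h):
        for x in range(x0, x0 + w):
            image[y][x] = 1

    # The ground-truth point is known exactly because we placed the object ourselves.
    point = (x0 + (w - 1) / 2, y0 + (h - 1) / 2)
    return {"image": image, "point": point, "prompt": "point to the rectangle"}


rng = random.Random(0)  # fixed seed so the dataset is reproducible
dataset = [make_synthetic_sample(16, 12, rng) for _ in range(1000)]
print(len(dataset), dataset[0]["point"])
```

Producing a thousand labeled samples here costs a fraction of a second and no human effort. Real synthetic pipelines are far more elaborate, but the economics are the same.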

MolmoPoint Performance and Accuracy

According to the results reported by the Allen AI team, MolmoPoint demonstrates significantly higher pointing accuracy than previous approaches. This includes cases where vision-language models previously struggled, such as objects with indistinct boundaries, fine details, or scenes crowded with similar items.
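The article does not say how pointing accuracy is measured. One plausible definition, assumed here purely for illustration, counts a prediction as correct when the predicted point lands inside the target object's mask. The function names below are hypothetical.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]  # 1 where the target object is, 0 elsewhere, indexed [y][x]


def point_hits_mask(point: Tuple[float, float], mask: Mask) -> bool:
    """True if the point, rounded to the nearest pixel, lies on the object."""
    x, y = int(round(point[0])), int(round(point[1]))
    in_bounds = 0 <= y < len(mask) and 0 <= x < len(mask[0])
    return in_bounds and mask[y][x] == 1


def pointing_accuracy(points: Sequence[Tuple[float, float]], masks: Sequence[Mask]) -> float:
    """Fraction of predicted points that fall inside their target mask."""
    if not points or len(points) != len(masks):
        raise ValueError("need one mask per point, and at least one pair")
    hits = sum(point_hits_mask(p, m) for p, m in zip(points, masks))
    return hits / len(points)


# Toy check: a 3x3 mask whose object is the single center pixel.
center_only = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
predictions = [(1.2, 0.9), (0.0, 0.0)]  # the first is on the object, the second is not
print(pointing_accuracy(predictions, [center_only, center_only]))
```

Under a metric like this, fine details and crowded scenes are hard precisely because the target masks are small and a near miss counts as a miss.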

Moreover, the model remains relatively compact and doesn't require a drastic increase in computational resources. This is crucial, as accuracy improvements often come at the cost of significant system complexity, rendering them impractical for real-world applications.

MolmoPoint: Part of the Broader Molmo AI Initiative

MolmoPoint isn't an isolated development. It expands on the Molmo line of models, which Allen AI has been developing for some time. Molmo was initially built as an open, research-oriented alternative to closed-source commercial systems, with a focus on transparency and the reproducibility of results.

The addition of MolmoPoint is a step toward more "action-oriented" models: those that don't just understand what's in an image but can also interact with visual information on a more concrete, operational level. In the future, this will be crucial for robotics, augmented reality interfaces, assistive technologies for people with disabilities, and many other applications where precise spatial pointing is critical.

Allen AI's Openness in Sharing MolmoPoint

The Allen AI team is publishing the model's weights, code, and a description of their approach. This is in line with the institute's overall philosophy of making research accessible to other developers and researchers, rather than keeping it behind closed doors.

For those working on related tasks – such as training robots, developing visual interfaces, or medical imaging – this means that MolmoPoint can be adopted and tested in their own projects without having to build everything from scratch.

Open Questions and Future Directions for MolmoPoint AI

As is often the case with research publications, some questions remain open. How well does MolmoPoint perform in conditions that differ significantly from its training data? How does it handle low-quality images or unconventional visual scenes? How effectively can synthetic data replace real-world labeling in the most complex cases?

This isn't a criticism – it's a normal part of the process for any new development. Answers to these questions will likely emerge as other teams begin to work with the model and publish their findings.

For now, MolmoPoint looks like a significant step forward in a field that is still actively taking shape: teaching AI not just to see, but to point with precision.

Original Title: MolmoPoint: Better pointing architecture for vision-language models
Publication Date: Mar 18, 2026
Ai2 allenai.org A U.S.-based research institute developing language models and AI systems for science and education.

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role: analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Claude Sonnet 4.6 (Anthropic): Analyzing the Original Publication and Writing the Text. The neural network studies the original material and generates a coherent text.
2. Gemini 2.5 Pro (Google DeepMind): Translation into English.
3. Gemini 2.5 Flash (Google DeepMind): Text Review and Editing. Correction of errors, inaccuracies, and ambiguous phrasing.
4. DeepSeek-V3.2 (DeepSeek): Preparing the Illustration Description. Generating a textual prompt for the visual model.
5. FLUX.2 Pro (Black Forest Labs): Creating the Illustration. Generating an image based on the prepared prompt.
