Published on March 25, 2026

MolmoWeb: An Open AI Agent for Autonomous Web Browsing

The Allen Institute has introduced MolmoWeb, an open-source web agent. It navigates browsers visually, much like a human, and outperforms many proprietary competitors.

Source: Ai2. Reading time: 6–8 minutes.

Imagine this: you ask an AI to find the cheapest direct flight from one city to another. It doesn't just give you a list of links – it opens the browser itself, goes to the right website, enters the search parameters, scrolls through the results, and brings back a ready answer. This is exactly how web agents work – systems capable of performing tasks in a browser just as a human would.

Such tools exist, but until recently, the most powerful ones were closed-source: trained on secret data, inaccessible for study or independent verification. The Allen Institute for AI (Ai2) decided to change the situation and released MolmoWeb – a completely open web agent, including the model, training data, evaluation tools, and code.

How the MolmoWeb Multimodal Model Operates

See and Act

MolmoWeb is built on the Molmo 2 multimodal model and is available in two versions, with 4B and 8B parameters. Put simply, "multimodal" means the model can work not only with text but also with images.

The agent's operating principle is surprisingly simple: look at the screen, decide what to do, and execute the action. At each step, the model receives the task, a screenshot of the current browser state, and a history of previous actions. It then writes a brief explanation of its intent and takes the next step: it clicks, types text, scrolls the page, opens a tab, or reports the result to the user.
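The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Action`, `FakeBrowser`, `run_agent`, and `scripted_policy` are hypothetical names invented here, the "screenshot" is a plain string rather than an image, and the scripted policy stands in for the model. None of this is MolmoWeb's actual interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    kind: str           # hypothetical action names: "click", "type", "scroll", "answer"
    thought: str        # the brief explanation produced before acting
    argument: str = ""  # target element, text to type, or the final answer


@dataclass
class FakeBrowser:
    """Stand-in for a real browser; the 'screen' is text instead of pixels."""
    screen: str = "home page"

    def screenshot(self) -> str:
        return self.screen

    def execute(self, action: Action) -> None:
        self.screen = f"after {action.kind}({action.argument})"


# A policy maps (task, screenshot, history) to the next action.
Policy = Callable[[str, str, List[Action]], Action]


def run_agent(task: str, browser: FakeBrowser, policy: Policy,
              max_steps: int = 10) -> Tuple[Optional[str], List[Action]]:
    """Observe, decide, act; stop when the policy reports an answer."""
    history: List[Action] = []
    for _ in range(max_steps):
        observation = browser.screenshot()
        action = policy(task, observation, history)
        history.append(action)
        if action.kind == "answer":
            return action.argument, history
        browser.execute(action)
    return None, history  # step budget exhausted without an answer


def scripted_policy(task: str, screenshot: str, history: List[Action]) -> Action:
    """Toy replacement for the model: replays a fixed three-step script."""
    script = [
        Action("click", "Open the search box", "search"),
        Action("type", "Enter the query", task),
        Action("answer", "The result is visible", "done"),
    ]
    return script[min(len(history), len(script) - 1)]
```

Calling `run_agent("cheap flights", FakeBrowser(), scripted_policy)` walks through click, type, and answer. In a real system the policy would be the model, and the browser would return actual screenshots.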

A key difference between MolmoWeb and several other agents is that it works with the visual representation of the page, that is, with screenshots, rather than with HTML code or other internal site structures. This is as close as possible to human behavior: you see a button, you click it. The approach has practical advantages: a screenshot takes up much less "room" in the model's input than a full page structure, and a site's visual interface changes less often than its code. The agent's actions are also easier to track and understand, because it sees the same thing the user does.

As a result, MolmoWeb handles a wide range of everyday tasks: navigating multi-page sites, filling out forms, searching and filtering products, and extracting necessary information. And all of this without the need to use a specific site's dedicated API.

MolmoWebMix Open Dataset and Training Process

Where the Training Data Comes From

One of the main difficulties in developing web agents is the lack of public training data. Most existing systems are trained on closed datasets. The creators of MolmoWeb solved this problem differently: alongside the model, they published MolmoWebMix – a large, open dataset created specifically for training visual web agents.

The dataset consists of several parts. The first is real-user demonstrations: crowdworkers performed various browser tasks using a Chrome extension that recorded their actions and screenshots. The result is over 30,000 recorded sessions covering more than 1,100 sites and over 590,000 individual subtasks. This is the largest publicly available dataset of its kind.

The second part consists of synthetic trajectories generated automatically. Specialized agents independently explored sites based on their structure, completed tasks, and verified results without human intervention. This allowed the dataset to scale beyond what can be collected manually.

The third part is data for "vision" training: tasks for determining the position of interface elements on the screen and answering questions about screenshot content. This block alone contains over 2.2 million question-answer pairs collected from nearly 400 sites.
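To make the element-localization task more concrete, here is a minimal sketch of how such predictions are often scored: a predicted click point counts as correct if it falls inside the target element's bounding box. The article does not say how MolmoWebMix or the benchmarks score these tasks, so the point-in-box rule, the `Box` class, and `grounding_accuracy` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in screenshot pixels


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of an interface element, in pixels."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, point: Point) -> bool:
        x, y = point
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(predictions: List[Point], targets: List[Box]) -> float:
    """Fraction of predicted points that land inside their target box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(box.contains(point) for point, box in zip(predictions, targets))
    return hits / len(targets)
```

For example, with a button at `Box(0, 0, 20, 20)`, a prediction of `(10, 10)` is a hit and `(500, 500)` is a miss.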

Equally important is what the training data leaves out: the Ai2 team intentionally avoided distillation from proprietary systems. This means MolmoWeb did not learn to mimic closed agents; its browsing skills come from the open data described above.

Performance Benchmarks and Evaluation Results

Testing Results

MolmoWeb was evaluated on four benchmarks requiring interaction with real websites. The tests cover general web navigation, multi-step tasks across a wide range of resources, complex queries in online stores, and instruction-following accuracy.

Despite its relatively modest size, both versions of the model showed results on par with the best open-source web agents. The 8B version scored 78.2% on WebVoyager, 42.3% on DeepShop, and 49.5% on WebTailBench, surpassing competing open models. The smaller 4B version also outperformed larger alternatives in some tests, including situations where the competitor used significantly more steps.

Another curious result: if you run several independent agent sessions and pick the best result, the quality increases sharply. With this approach, the 8B version reaches 94.7% on WebVoyager compared to 78.2% in a single run. Simply put: the more computing resources invested in the agent's workflow, the more reliably it performs.
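The difference between a single run and a best-of-several score can be shown with a small calculation. The sketch below assumes that a task counts as solved if at least one of the first k independent attempts succeeds. The function names are hypothetical, and the article does not specify how the best attempt is selected.

```python
from typing import Dict, List


def single_run_rate(outcomes: Dict[str, List[bool]]) -> float:
    """Success rate when only the first attempt at each task is counted."""
    return sum(attempts[0] for attempts in outcomes.values()) / len(outcomes)


def best_of_k_rate(outcomes: Dict[str, List[bool]], k: int) -> float:
    """Success rate when a task counts as solved if any of the
    first k attempts succeeded."""
    return sum(any(attempts[:k]) for attempts in outcomes.values()) / len(outcomes)
```

With made-up outcomes for four tasks, say `[True, True, False]`, `[False, True, False]`, `[False, False, False]`, and `[False, False, True]`, the single-run rate is 0.25 while the best-of-3 rate is 0.75. These numbers are illustrative only and are not benchmark results. The gain comes at a cost: each extra attempt is a full additional agent session.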

Separately, the model's ability to accurately "see" interface elements, that is, to find buttons, fields, and links on the screen, was tested. Here, the specialized version of MolmoWeb (8B) outperformed not only other open models but also a number of large proprietary systems.

Technical Limitations and Security Considerations

Limitations and Developer Caveats

The team is candid about current shortcomings. Because the model only sees screenshots, it sometimes misreads text on the screen. It can get confused when an action is mistimed, for example when it scrolls a page before the page has fully loaded. Complex tasks with many conditions are harder, and some manipulations, such as dragging elements or scrolling within a separate block, remain problematic for now.

For security and privacy reasons, MolmoWeb was also not trained on tasks that involve logging in to websites or making financial transactions.

Many open questions remain in this field. How should an agent comply with site terms of use? How can access to undesirable content be prevented? How can users' personal data be protected and irreversible actions avoided? The developers do not pretend to have ready-made answers, and that is precisely why they are opening up all of their work: the more people can study and improve the system, the faster these problems will be solved.

Significance of Open Source Web Agents

Why It Matters

The situation with web agents today resembles the development of language models before open alternatives appeared: capabilities were concentrated in the hands of a few companies, reproducing or verifying them was nearly impossible, and the research community worked under a deficit of information.

MolmoWeb is an attempt to change the dynamic. An open model, data, training pipeline, and evaluation tools mean that any developer or researcher can not only use the agent but also understand how it works, fine-tune it for a specific task, or suggest improvements.

The internet is the largest software platform in the world. Agents capable of working reliably in a browser can significantly expand people's access to information and digital services. MolmoWeb has become one of the first steps in this direction taken openly. 🌐

Original Title: MolmoWeb: An open agent for automating web tasks
Publication Date: Mar 24, 2026
Source: Ai2 (allenai.org), a U.S.-based research institute developing language models and AI systems for science and education.

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Claude Sonnet 4.6 (Anthropic): Analyzing the Original Publication and Writing the Text. The neural network studies the original material and generates a coherent text.
2. Gemini 2.5 Pro (Google DeepMind): Translation into English.
3. Gemini 2.5 Flash (Google DeepMind): Text Review and Editing. Correction of errors, inaccuracies, and ambiguous phrasing.
4. DeepSeek-V3.2 (DeepSeek): Preparing the Illustration Description. Generating a textual prompt for the visual model.
5. FLUX.2 Pro (Black Forest Labs): Creating the Illustration. Generating an image based on the prepared prompt.
