Imagine this: you ask an AI to find the cheapest direct flight from one city to another. It doesn't just give you a list of links – it opens the browser itself, goes to the right website, enters the search parameters, scrolls through the results, and brings back a ready answer. This is exactly how web agents work – systems capable of performing tasks in a browser just as a human would.
Such tools exist, but until recently, the most powerful ones were closed-source: trained on secret data, inaccessible for study or independent verification. The Allen Institute for AI (Ai2) decided to change the situation and released MolmoWeb – a completely open web agent, including the model, training data, evaluation tools, and code.
See and Act
MolmoWeb is built on the Molmo 2 multimodal model and comes in two sizes, with 4B and 8B parameters. Put simply, "multimodal" means the model can work not only with text but also with images.
The agent's operating loop is surprisingly simple: look at the screen, decide what to do, and execute the action. At each step, the model receives the task, a screenshot of the current browser state, and a history of previous actions. It then writes a brief explanation of its intent and takes the next step: it clicks, types text, scrolls the page, opens a tab, or reports the result to the user.
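The look, decide, act cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MolmoWeb's actual code: the names `Step`, `run_agent`, `take_screenshot`, `policy`, and `execute` are hypothetical, and the "model" here is a scripted stand-in rather than a real network or browser.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class Step:
    """One agent step: a short stated intent plus the action taken (hypothetical format)."""
    thought: str
    action: str          # e.g. "click", "type", "scroll", "answer"
    argument: str = ""


def run_agent(
    task: str,
    take_screenshot: Callable[[], Any],
    policy: Callable[[str, Any, List[Step]], Step],
    execute: Callable[[Step], None],
    max_steps: int = 10,
) -> Optional[str]:
    """Repeat observe -> decide -> act until the policy answers or steps run out."""
    history: List[Step] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()             # look at the screen
        step = policy(task, screenshot, history)   # decide what to do next
        history.append(step)
        if step.action == "answer":                # report the result to the user
            return step.argument
        execute(step)                              # otherwise act in the browser
    return None                                    # step budget exhausted, no answer


# Toy usage with scripted stand-ins (no real browser or model involved).
screens = iter(["search page", "results page"])


def scripted_policy(task: str, screenshot: Any, history: List[Step]) -> Step:
    if screenshot == "search page":
        return Step("The search form is visible, so enter the route.", "type", "route query")
    return Step("Results are listed, so report the cheapest fare.", "answer", "cheapest fare found")


performed: List[Step] = []
answer = run_agent(
    "Find the cheapest direct flight",
    lambda: next(screens),
    scripted_policy,
    performed.append,
)
print(answer, len(performed))
```

In a real setup, `take_screenshot` and `execute` would be backed by a browser automation layer and `policy` by the model itself. The loop only ever passes the task, the current screenshot, and the action history, matching the three inputs listed above.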
A key difference between MolmoWeb and several other agents is that it works with the visual representation of the page, that is, with screenshots, rather than with HTML code or other internal site structures. This is as close as possible to human behavior: you see a button, you click it. The approach has practical advantages: a screenshot takes up far less of the model's input than a full page structure, and a site's visual interface changes less often than its underlying code. The agent's actions are also easier to track and understand, because it sees the same thing the user does.
As a result, MolmoWeb handles a wide range of everyday tasks: navigating multi-page sites, filling out forms, searching and filtering products, and extracting necessary information. And all of this without the need to use a specific site's dedicated API.
Where the Training Data Comes From
One of the main difficulties in developing web agents is the lack of public training data. Most existing systems are trained on closed datasets. The creators of MolmoWeb solved this problem differently: alongside the model, they published MolmoWebMix – a large, open dataset created specifically for training visual web agents.
The dataset consists of several parts. The first is real-user demonstrations: crowdworkers performed various browser tasks using a Chrome extension that recorded their actions and screenshots. The result is over 30,000 recorded sessions covering more than 1,100 sites and over 590,000 individual subtasks. This is the largest publicly available dataset of its kind.
The second part consists of synthetic trajectories generated automatically. Specialized agents independently explored sites based on their structure, completed tasks, and verified results without human intervention. This allowed the dataset to scale beyond what can be collected manually.
The third part is data for "vision" training: tasks for determining the position of interface elements on the screen and answering questions about screenshot content. This block alone contains over 2.2 million question-answer pairs collected from nearly 400 sites.
Equally important is what the training data leaves out: the Ai2 team intentionally avoided distillation from proprietary systems. This means MolmoWeb did not learn to mimic closed agents; its web-agent skills come from its own openly published data.
Testing Results
MolmoWeb was evaluated on four benchmarks requiring interaction with real websites. The tests cover general web navigation, multi-step tasks across a wide range of resources, complex queries in online stores, and instruction-following accuracy.
Despite their relatively modest size, both versions of the model are competitive with the best open-source web agents. The 8B version scored 78.2% on WebVoyager, 42.3% on DeepShop, and 49.5% on WebTailBench, surpassing competing open models on these benchmarks. The smaller 4B version also outperformed larger alternatives in some tests, including cases where the competitor used significantly more steps.
Another curious result: if you run several independent agent sessions and pick the best result, the quality increases sharply. With this approach, the 8B version reaches 94.7% on WebVoyager compared to 78.2% in a single run. Simply put: the more computing resources invested in the agent's workflow, the more reliably it performs.
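To make the gap between a single run and a best-of-several run concrete, here is a small sketch of how the two numbers can be computed. It assumes a task counts as solved if at least one of its independent runs succeeds; whether MolmoWeb's evaluation selects the best run in exactly this way is an assumption. The function names and the outcome table are invented for illustration and are not real benchmark results.

```python
from typing import Dict, List


def single_run_rate(outcomes: Dict[str, List[bool]]) -> float:
    """Success rate averaged over every individual run of every task."""
    runs = [ok for attempts in outcomes.values() for ok in attempts]
    return sum(runs) / len(runs)


def best_of_k_rate(outcomes: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks where at least one of the first k runs succeeded."""
    solved = sum(any(attempts[:k]) for attempts in outcomes.values())
    return solved / len(outcomes)


# Invented toy data: four tasks, four independent runs each.
toy_outcomes = {
    "task_a": [True, True, False, True],
    "task_b": [False, True, False, False],
    "task_c": [False, False, False, False],
    "task_d": [True, False, True, True],
}

print(single_run_rate(toy_outcomes))    # 7 successes out of 16 runs
print(best_of_k_rate(toy_outcomes, 4))  # 3 of 4 tasks solved at least once
```

Tasks like `task_b`, which succeed only occasionally, are what push the best-of-several score above the single-run average. Tasks that never succeed, like `task_c`, are not helped by extra runs.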
The model's ability to accurately "see" interface elements, that is, to find buttons, fields, and links on the screen, was tested separately. Here, the specialized version of MolmoWeb (8B) outperformed not only other open models but also a number of large proprietary systems.
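One common way to score this kind of element-finding test is to check whether the point the model predicts falls inside the target element's bounding box. The sketch below shows that scoring rule. It is an assumption that MolmoWeb's evaluation works this way, and the `Box` class, the `grounding_accuracy` function, and the coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of an interface element, in screenshot pixels."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(predictions: List[Tuple[float, float]], targets: List[Box]) -> float:
    """Share of predicted points that land inside their target element's box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(box.contains(x, y) for (x, y), box in zip(predictions, targets))
    return hits / len(targets)


# Invented example: the first point hits its button, the second misses its link.
targets = [Box(10, 10, 110, 40), Box(200, 300, 260, 330)]
predictions = [(50.0, 20.0), (100.0, 100.0)]
print(grounding_accuracy(predictions, targets))
```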
Limitations and Developer Caveats
The team is candid about current shortcomings. Because the model sees only screenshots, it sometimes misreads text on the screen. It can get confused when an action happens at the wrong moment, for example when it scrolls a page before the page has fully loaded. Complex tasks with many conditions are harder, and some manipulations, such as dragging elements or scrolling within a separate block, remain problematic for now.
For security and privacy reasons, MolmoWeb was also not trained on tasks that involve logging in to websites or making financial transactions.
Many open questions remain in this field. How should an agent comply with a site's terms of use? How can access to undesirable content be prevented? How can users' personal data be protected and irreversible actions avoided? The developers do not pretend to have ready-made answers, and that is precisely why they are releasing all of their work openly: the more people can study and improve the system, the faster these problems will be solved.
Why It Matters
The situation with web agents today resembles that of language models before open alternatives appeared: capabilities were concentrated in the hands of a few companies, reproducing or verifying their results was nearly impossible, and the research community had to work with very limited information.
MolmoWeb is an attempt to change the dynamic. An open model, data, training pipeline, and evaluation tools mean that any developer or researcher can not only use the agent but also understand how it works, fine-tune it for a specific task, or suggest improvements.
The internet is the largest software platform in the world. Agents capable of working reliably in a browser can significantly expand people's access to information and digital services. MolmoWeb is one of the first steps in this direction to be taken in the open.