Training robots is an expensive business. The cost lies not so much in buying the hardware as in what happens before the robot learns to do anything. People have to operate it manually, demonstrating the desired behavior over and over again. That takes hundreds of recording hours, dozens of sites, and coordinated infrastructure. Open X-Embodiment, one of the largest open datasets of its kind, was compiled by 21 organizations and contains over a million real-world trajectories. DROID, another well-known dataset, consists of 350 hours of teleoperation collected across 13 institutions. Data collection at this scale is a monumental task, and it remains the primary bottleneck for most labs.
That is why the idea of training a robot entirely in simulation, without a single real-world demonstration, looks both enticing and risky. Enticing, because simulation is cheap, scalable, and reproducible. Risky, because the real world differs from the virtual one, and this "sim-to-real gap" has traditionally been seen as one of the main hurdles.
Virtual Experience – Real Results
Ai2 (the Allen Institute for AI) set out to test whether this gap could be bridged not through more realistic simulation, but through sheer diversity. The idea is this: if the model sees enough varied virtual scenes, with different objects, lighting, camera angles, textures, and physical conditions, it will learn to generalize and transfer that experience to reality.
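The diversity idea is commonly known as domain randomization. Below is a minimal Python sketch of it, not Ai2's actual pipeline: every name, field, and numeric range here (`SceneConfig`, `sample_scene`, the object and texture lists) is a hypothetical placeholder chosen for illustration. It only shows the core mechanic, which is that each training scene is drawn by sampling every visual and physical parameter independently.

```python
from __future__ import annotations

import random
from dataclasses import dataclass

# Hypothetical asset pools; a real pipeline would draw from far larger libraries.
OBJECTS = ["mug", "bowl", "block", "bottle"]
TEXTURES = ["wood", "marble", "metal", "cloth"]


@dataclass(frozen=True)
class SceneConfig:
    """One randomized virtual scene (all fields are illustrative assumptions)."""
    obj: str
    table_texture: str
    light_intensity: float  # arbitrary units
    camera_yaw_deg: float   # camera angle offset in degrees
    friction: float         # surface friction coefficient


def sample_scene(rng: random.Random) -> SceneConfig:
    """Draw one scene; each parameter is sampled independently of the others."""
    return SceneConfig(
        obj=rng.choice(OBJECTS),
        table_texture=rng.choice(TEXTURES),
        light_intensity=rng.uniform(0.3, 1.5),
        camera_yaw_deg=rng.uniform(-30.0, 30.0),
        friction=rng.uniform(0.4, 1.0),
    )


def sample_scenes(n: int, seed: int = 0) -> list[SceneConfig]:
    """Generate n scenes from a seeded generator so a dataset can be reproduced."""
    rng = random.Random(seed)
    return [sample_scene(rng) for _ in range(n)]


if __name__ == "__main__":
    for scene in sample_scenes(3):
        print(scene)
```

Seeding the generator reflects the reproducibility that makes simulation attractive: the same seed yields the same set of scenes, while a larger `n` widens the variety the model sees at no extra collection cost.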
On March 11, 2026, Ai2 introduced MolmoBot – a suite of models for controlling robotic manipulators trained exclusively on synthetic data. No real-world teleoperation. No fine-tuning on real scenes. Just simulation – and then straight to a real robot.
The results proved unexpectedly compelling. On tasks like "pick up an object and place it in the right spot," the top model in the suite outperformed π0.5, a system from Physical Intelligence trained on vast amounts of real-world data. Notably, MolmoBot had never seen these objects or scenes before: not in simulation, and certainly not in reality.
What MolmoBot Can Do 🤖
The suite covers several types of tasks:
- picking up objects and moving them across a table;
- interacting with articulated parts: drawers, cabinets, microwaves;
- opening doors – including the approach, grabbing the handle, and moving through the full range of motion.
The robot can be controlled with words, for example "pick up," "put down," or "close," or by pointing to a spot. All of this works across two different platforms: the Franka FR3 stationary manipulator and the Rainbow Robotics RB-Y1 mobile robot.
Simply put, this isn't a niche system built for one task and one robot. It is an attempt to create something more universal and keep it open-source.
Why This Matters More Than It Seems
Most modern systems that rely on simulation use it as a supplement to real-world data. Simulation helps, but real-world demonstrations remain the core. MolmoBot removes real-world demonstrations entirely.
For the industry, this shifts the very nature of the bottleneck. Currently, the main constraint is data collection: you need people, robots, space, and time. If simulation works as the sole source of training data, the critical factor is no longer collection, but the design of virtual environments. And that is a task that can be scaled using computation and open tools, without an army of operators.
This is especially vital for academic labs. Many simply cannot afford the teleoperation infrastructure or a partnership on the scale of Open X-Embodiment. MolmoBot, along with the open MolmoSpaces ecosystem – a set of tools for generating synthetic data – potentially makes manipulation robotics more accessible.
A Fair Assessment
It is important to understand that MolmoBot does not claim to be the ultimate solution to the "robot problem." It is a hypothesis test: can simulation-only training work effectively for manipulation? The answer, at least for the tasks tested, seems to be yes.
However, many open questions remain. How will the system behave in more complex, chaotic environments? How will it handle tasks requiring fine tactile feedback, which simulations replicate inaccurately? Where exactly does it break, and what is needed to fix it?
The authors themselves state they want to see where the model fails. This is exactly why they have released not just the models, but the entire tech stack: data, generation pipelines, training code, and the technical report. This is unusual for robotics, where most heavy-duty systems remain behind closed doors.
In short: MolmoBot is an argument that synthetic data can become the foundation of robot training, rather than just a supplement to it. For now, it is just one experiment, albeit a convincing one. But the direction it sets looks like one of the most realistic paths toward making robots accessible to more than just giant corporations.