Imagine you need to highlight not just «a person» in a photo, but «a person in a blue shirt standing by the left railing of the bridge and looking down.» Most computer vision models would stumble here – they are built for simple categories but lose the plot when descriptions get specific. Moondream excels at exactly this: it understands elaborate verbal prompts and accurately isolates the desired object in an image. On March 10, 2026, the team released an updated version of this feature.
Segmentation means a model doesn't just locate an object in a picture but traces its outline. The result is a mask: a precise shape of the object that can be used for photo editing, scene analysis, automated data labeling, and similar tasks.
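To make the idea of a mask concrete, here is a minimal sketch in plain Python. It is a generic illustration, not Moondream's output format: the toy grid and the helper names `mask_area` and `mask_bbox` are made up for this example.

```python
# A toy 5x6 binary mask: 1 marks pixels that belong to the object, 0 is background.
# (Illustrative data only; not produced by any model.)
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]


def mask_area(mask):
    """Count the pixels that belong to the object."""
    return sum(sum(row) for row in mask)


def mask_bbox(mask):
    """Return the tight bounding box (x_min, y_min, x_max, y_max), or None if the mask is empty."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)


print(mask_area(mask))  # 9
print(mask_bbox(mask))  # (1, 1, 4, 3)
```

A bounding box only says where the object roughly is. The mask says which pixels are the object, which is what editing and labeling tools need.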
What sets Moondream apart is its ability to handle referring expressions – descriptive phrases in natural language. Not just «find a car», but «find the white Porsche 911 in the foreground.» Or «laundry on the floor.» Or «Waldo number 25317.» This is fundamentally more challenging than simply recognizing an object category.
What's New in the Update
The new version of the model brings improvements across three key areas.
Higher Mask Quality. Moondream natively generates masks in SVG format, a vector format that stays sharp at any scale, whereas pixel-based masks blur or turn jagged when enlarged. The new version traces object contours more precisely than the previous one.
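The practical benefit of a vector mask is that it can be turned into pixels at whatever resolution you need. The sketch below illustrates this under loud assumptions: the path string is invented, and the parser handles only straight-line polygons written with absolute `M`/`L` commands and a closing `Z`. Real SVG paths, including whatever Moondream actually emits, may use curves and other commands that this toy code does not support.

```python
import re


def parse_polygon_path(d):
    """Parse an SVG path that uses only absolute M/L commands and a closing Z."""
    numbers = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", d)]
    return list(zip(numbers[0::2], numbers[1::2]))


def point_in_polygon(x, y, polygon):
    """Even-odd ray-casting test: count edge crossings to the right of the point."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rasterize(d, view_w, view_h, out_w, out_h):
    """Sample the path at each output pixel centre to build a binary mask."""
    polygon = parse_polygon_path(d)
    mask = []
    for row in range(out_h):
        y = (row + 0.5) * view_h / out_h
        mask.append([
            1 if point_in_polygon((col + 0.5) * view_w / out_w, y, polygon) else 0
            for col in range(out_w)
        ])
    return mask


# Hypothetical mask: a square from (2, 2) to (8, 8) in a 10x10 coordinate space.
path = "M 2 2 L 8 2 L 8 8 L 2 8 Z"
small = rasterize(path, 10, 10, 10, 10)
large = rasterize(path, 10, 10, 100, 100)

print(sum(map(sum, small)))  # 36 of 100 pixels
print(sum(map(sum, large)))  # 3600 of 10000 pixels
```

The same path covers the same fraction of the image at both resolutions, because each rasterization samples the exact shape directly instead of enlarging an existing pixel grid.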
40% Speed Boost. The gain matters most for those processing large volumes of images or building applications where low latency is critical.
Improved Benchmark Scores. Segmentation quality is evaluated on dedicated datasets such as RefCOCO, RefCOCO+, and RefCOCOg. These test how accurately a model understands different types of descriptions: spatial locations, physical appearance, and long, complex phrases. The new version outperformed the previous one across all of these tests. Notably, the previous benchmark leader was also Moondream, so the team has beaten its own record.
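The article does not say which metric the team reports. Benchmarks of this kind are commonly scored with intersection over union (IoU) between the predicted mask and a hand-drawn reference mask, so the sketch below assumes that metric. The function names and toy masks are illustrative.

```python
def iou(pred, target):
    """Intersection over union of two same-sized binary masks (lists of 0/1 rows)."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly, so treat that case as a score of 1.0.
    return intersection / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over (prediction, reference) pairs, one pair per test phrase."""
    return sum(iou(p, t) for p, t in pairs) / len(pairs)


# Toy example: the prediction is shifted one column to the left of the reference.
pred = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
]
target = [
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 0],
]

print(round(iou(pred, target), 3))                         # 0.333
print(mean_iou([(pred, target), (target, target)]))
```

A score of 1.0 means the predicted outline matches the reference exactly. Even a one-pixel shift on a small object costs a lot, which is why contour precision shows up directly in these scores.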
What About the Competition?
In September 2025, when Moondream first launched its segmentation feature as part of Moondream 3 Preview, it immediately topped the benchmarks. Since then, several other models with similar capabilities have emerged, but according to the team, Moondream maintains its lead.
A prime example is the comparison with Meta's SAM 3. While SAM 3 can segment objects based on simple prompts like «car» or «person», it struggles with more nuanced descriptions such as «the person touching the door.» To handle these, one usually has to plug in an additional large language model, which increases both processing time and cost. Moondream handles such queries natively, without intermediaries.
Generally, there is a clear divide in this field: powerful multimodal models understand complex descriptions but are slow and expensive. Lightweight models are fast but trip over anything more complex than a simple noun. Moondream positions itself as the solution that checks both boxes simultaneously.
Who Benefits Right Now
The update is already live in Moondream Cloud. If you are already using segmentation through this service, the improvements will be applied automatically; no extra setup is required.
For those who prefer running models locally, the team announced that the local version will be released in the coming days. Along with it, a technical paper is expected for those who want to dive into the implementation details.
In short: Moondream is doubling down on the sweet spot between accuracy and speed in a niche where most tools sacrifice one for the other. The March 10 update is another big step in that direction.