Alibaba has released Qwen3.5, the inaugural model in its new Qwen3 generation. Its core feature is native understanding of text, images, and audio: not separate modules or adapters attached to a language model, but a single model trained on all three from the ground up.
What Does "Native Multimodality" Mean?
Typically, language models are trained on text first, and components for processing images or sound are "bolted on" afterwards. This approach works, but not always seamlessly: the model can lose meaning when switching between modalities or require extra processing steps.
Qwen3.5 has taken a different path. From the start, it was trained to perceive different data types as part of a single process. Simply put, for this model, text, images, and audio are not separate "languages" but natural ways of expressing information. This allows it to better understand context when information arrives in several formats at once.
Why Is This Necessary?
The idea is to move closer to the operating principles of agents: programs that perform tasks in a real-world environment. It's not enough for an agent to simply answer a question. It needs to see an interface, hear commands, read instructions, and act on all of this information at once.
A model designed from the ground up to process all of this together is better suited to such scenarios. For example, it can analyze an application screenshot, listen to a user's voice command, and suggest the next step, all in a single, unified flow rather than switching between modes.
What Can Qwen3.5 Do?
The model is trained to work with three main data types:
- Text: like any language model.
- Images: it can analyze content, describe objects, and understand scenes.
- Audio: it recognizes speech and sounds and can use them to understand context.
Moreover, Qwen3.5 doesn't just process each modality separately; it combines them. For instance, if you give it an image containing text and ask a question aloud, it can draw on all three sources to formulate an answer.
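To make the scenario concrete, the combined request above can be sketched as a single message carrying all three modalities. This assumes an OpenAI-compatible chat API serving the model locally; the content-part field names, the file names, and the model identifier "qwen3.5" are illustrative assumptions, not a confirmed Qwen interface.

```python
import json

# Hypothetical payload: one user turn mixing audio, image, and text.
# Field names follow the common OpenAI-style content-part convention;
# the actual Qwen serving interface may differ.
payload = {
    "model": "qwen3.5",  # assumed model identifier
    "messages": [
        {
            "role": "user",
            "content": [
                # The spoken question, attached as audio
                {"type": "input_audio", "audio_url": "question.wav"},
                # The image that contains text to be read
                {"type": "image_url", "image_url": {"url": "photo.png"}},
                # A typed hint alongside the other modalities
                {"type": "text", "text": "Answer the spoken question about this image."},
            ],
        }
    ],
}

# The request body a client would POST to the chat endpoint:
print(json.dumps(payload, indent=2))
```

The point of the sketch is that all three inputs travel in one message, so the model can resolve the question against the image and its embedded text in a single pass, with no separate transcription or captioning step.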
Open Weights and Availability
Alibaba has released the model with open weights. This means developers can download it, study it, use it in their projects, or fine-tune it for specific tasks. For researchers and teams working on agents or multimodal applications, this is crucial: there's no need to wait for an API or pay for access – you can start experimenting right away.
Open weights also allow the community to evaluate how well native multimodality performs in practice. This isn't just a marketing claim – it can be verified independently.
What's Next?
Qwen3.5 is the first model in the Qwen3 lineup, but it's unlikely to be the last. Alibaba calls it a step toward "native multimodal agents", which sounds like a long-term goal. More versions will likely follow, perhaps with more parameters, improved accuracy, or support for additional modalities.
It's still unclear how well the model handles complex agent-based tasks in real-world conditions. Native multimodality is an architectural advantage, but the final quality depends on the data it was trained on and how it behaves in unexpected situations.
Who Is This For?
Qwen3.5 may be of interest to those working on projects that require combining multiple data types:
- Developers of agents and assistants that need to interact with interfaces and users simultaneously.
- Researchers studying multimodal models and their capabilities.
- Teams creating content analysis applications, such as for video processing, where images, sound, and text are all important.
For the average user, this is still more of a glimpse into the future. But if the trend toward native multimodality continues, we might soon see assistants that understand context much better than they do now – not because they were taught each skill separately, but because they are designed differently from the ground up.