Published February 17, 2026

Qwen3.5: The First Natively Multimodal Model

Alibaba has introduced Qwen3.5, the first model in a new generation of the Qwen family, able to process text, images, and audio natively, with no additional adapters.

Event Source: Alibaba Cloud · Reading Time: 3–5 minutes

Alibaba has released Qwen3.5, the inaugural model of a new Qwen generation. Its core feature is the ability to understand text, images, and audio from the ground up: not through separate modules or adapters, but as a unified whole.

What Does "Native Multimodality" Mean?

Typically, language models are trained to work with text, and then components for processing images or sound are "bolted on". This approach works, but not always seamlessly; the model can lose some meaning when switching between modalities or require extra processing steps.

Qwen3.5 has taken a different path. From its inception, it was trained to perceive different data types as part of a single process. Simply put, for this model, text, images, and audio are not separate "languages" but natural ways of expressing information. This allows the model to better understand context when information is received in various formats simultaneously.
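To make the contrast concrete, here is a deliberately simplified Python sketch (not Qwen's actual code, and every name in it is invented): the adapter approach regroups modalities and projects them into the text space after the fact, while the native approach keeps one interleaved sequence from the start.

```python
# A deliberately simplified sketch, not Qwen's actual architecture:
# it only contrasts bolting adapters onto a text model with training
# on one interleaved stream. All names and values are invented.
from dataclasses import dataclass

@dataclass
class Chunk:
    modality: str  # "text", "image", or "audio"
    position: int  # where the chunk occurred in the original input
    payload: str   # stand-in for raw data

def project(chunk: Chunk) -> Chunk:
    # Stand-in for a learned adapter that maps image/audio features
    # into the text model's embedding space after the fact.
    return Chunk("text", chunk.position, f"proj({chunk.payload})")

def adapter_style(chunks: list[Chunk]) -> list[Chunk]:
    """Encode each modality separately, then append the projected
    results to the text sequence: cross-modal order is lost, and
    each projection step can discard nuance."""
    text = [c for c in chunks if c.modality == "text"]
    images = [project(c) for c in chunks if c.modality == "image"]
    audio = [project(c) for c in chunks if c.modality == "audio"]
    return text + images + audio

def native_style(chunks: list[Chunk]) -> list[Chunk]:
    """Feed every modality into one sequence in its original order,
    so attention can relate words, pixels, and sounds directly."""
    return sorted(chunks, key=lambda c: c.position)

if __name__ == "__main__":
    stream = [
        Chunk("audio", 0, "voice: 'what is on screen?'"),
        Chunk("image", 1, "screenshot"),
        Chunk("text", 2, "caption: Settings"),
    ]
    print(adapter_style(stream))  # modalities regrouped, order lost
    print(native_style(stream))   # one interleaved sequence
```

The point of the toy example is the ordering: in the native version, the model sees the voice question, the screenshot, and the caption in the order they actually arrived.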

Why Is This Necessary?

The idea is to move closer to the operating principles of agents – programs that can perform tasks in a real-world environment. It's not enough for an agent to simply answer a question. It needs to see an interface, hear commands, read instructions, and act on all this information simultaneously.

If a model is designed from the ground up to process all of this together, it is better suited to such scenarios. For example, it can analyze an application screenshot, listen to a user's voice command, and suggest the next step – all without switching between modes, in a single, unified flow.

What Can Qwen3.5 Do?

The model is trained to work with three main data types:

  • Text: like any language model.
  • Images: it can analyze content, describe objects, and understand scenes.
  • Audio: it recognizes speech and sounds and can use them to understand context.

Moreover, Qwen3.5 doesn't just process each modality separately; it attempts to combine them. For instance, if you give it an image containing text and ask a question by voice, it can draw on all three sources to formulate an answer.
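Here is what that combined request could look like in practice. This is a hypothetical sketch assuming Qwen3.5 is exposed through an OpenAI-compatible endpoint, as earlier Qwen models are via Alibaba Cloud's DashScope compatible mode; the endpoint URL, model name, and file names are assumptions rather than confirmed details of the release.

```python
# A hypothetical sketch: the endpoint URL, model name, and file
# names below are assumptions, not confirmed details of the release.
import base64

from openai import OpenAI

# Earlier Qwen models are served via DashScope's OpenAI-compatible
# mode; we assume here that Qwen3.5 is reachable the same way.
client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",  # assumed
)

def encode_b64(path: str) -> str:
    """Read a local file and return its base64-encoded contents."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

# One message carries an image that contains text plus a spoken
# question, so the model can combine all three sources in one pass.
response = client.chat.completions.create(
    model="qwen3.5",  # hypothetical model ID
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{encode_b64('slide.png')}"}},
            {"type": "input_audio",
             "input_audio": {"data": encode_b64("question.wav"), "format": "wav"}},
            {"type": "text",
             "text": "Answer the spoken question using the image."},
        ],
    }],
)
print(response.choices[0].message.content)
```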

Open Weights and Availability

Alibaba has released the model with open weights. This means developers can download it, study it, use it in their projects, or fine-tune it for specific tasks. For researchers and teams working on agents or multimodal applications, this is crucial: there's no need to wait for an API or pay for access – you can start experimenting right away.

Open weights also allow the community to evaluate how well native multimodality performs in practice. This isn't just a marketing claim – it can be verified independently.
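For a first local experiment, loading the open weights might look like the following. This is a minimal sketch assuming the weights are published on Hugging Face; the repository ID and the exact transformers classes are guesses that will depend on the actual release.

```python
# A minimal sketch, assuming the weights land on Hugging Face.
# "Qwen/Qwen3.5" is a hypothetical repository ID, and the exact
# model/processor classes will depend on how the release is
# integrated into transformers.
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "Qwen/Qwen3.5"  # hypothetical
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # spread layers across available devices
    torch_dtype="auto",  # keep the dtype the weights ship in
)

# From here the model can be studied, benchmarked on your own
# multimodal data, or fine-tuned for a specific task.
inputs = processor(
    text="In one sentence, what is native multimodality?",
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```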

What's Next?

Qwen3.5 is the first model in the new lineup, but it's unlikely to be the last. Alibaba calls it a step toward "native multimodal agents", which sounds like a long-term goal. More versions will likely follow – perhaps with more parameters, improved accuracy, or support for additional modalities.

It's still unclear how well the model handles complex agent-based tasks in real-world conditions. Native multimodality is an architectural advantage, but the final quality depends on the data it was trained on and how it behaves in unexpected situations.

Who Is This For?

Qwen3.5 may be of interest to those working on projects that require combining multiple data types:

  • Developers of agents and assistants that need to interact with interfaces and users simultaneously.
  • Researchers studying multimodal models and their capabilities.
  • Teams creating content analysis applications, such as for video processing, where images, sound, and text are all important.

For the average user, this is still more of a glimpse into the future. But if the trend toward native multimodality continues, we might soon see assistants that understand context much better than they do now – not because they were taught each skill separately, but because they are designed differently from the ground up.

Original Title: Qwen3.5: Towards Native Multimodal Agents
Publication Date: Feb 17, 2026
Source: Alibaba Cloud (www.alibabacloud.com), the cloud computing and AI division of Alibaba, providing infrastructure and AI services for businesses.

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Claude Sonnet 4.5 (Anthropic): Analyzing the Original Publication and Writing the Text. The neural network studies the original material and generates a coherent text.
2. Gemini 2.5 Pro (Google DeepMind): Translation into English.
3. Gemini 2.5 Flash (Google DeepMind): Text Review and Editing. Correction of errors, inaccuracies, and ambiguous phrasing.
4. DeepSeek-V3.2 (DeepSeek): Preparing the Illustration Description. Generating a textual prompt for the visual model.
5. FLUX.2 Pro (Black Forest Labs): Creating the Illustration. Generating an image based on the prepared prompt.
