Published February 13, 2026

How AMD and Qwen Optimized MI300X GPUs for Peak Performance

The Qwen team optimized their models to effectively run on AMD MI300X GPUs, achieving a response latency as low as 15 ms per token and full image generation in just 0.4 seconds.

Source: LMSYS ORG

When people talk about accelerating language models, they usually mean NVIDIA. However, that's not the only option. The Qwen team decided to showcase the capabilities of AMD accelerators, and the results are quite impressive.

We are referring to the MI300X series – professional AMD GPUs designed for handling large models. Qwen took their third-generation models, including the multimodal Qwen3-VL, and pushed their performance on this hardware to a level where latency is no longer an issue, even for interactive tasks.

What Did They Speed Up?

Simply put, a language model runs in two main stages. The first is prefill, where the model processes your entire request before it begins generating a response. The second is decode, where it outputs the response tokens one by one.
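
As a rough mental model, total response time is a one-off prefill cost plus a per-token decode cost. A minimal sketch – the prefill figure below is illustrative, only the 15 ms decode latency comes from the article:

```python
def response_time_ms(prompt_tokens: int, output_tokens: int,
                     prefill_ms_per_token: float = 0.5,
                     decode_ms_per_token: float = 15.0) -> float:
    """Rough latency model: prefill processes the whole prompt once
    (parallel, cheap per token), then decode emits output tokens
    one at a time (sequential, dominates total time)."""
    prefill = prompt_tokens * prefill_ms_per_token
    decode = output_tokens * decode_ms_per_token
    return prefill + decode

# A 200-token prompt with a 100-token answer at 15 ms/token decode:
print(response_time_ms(200, 100))  # 100 + 1500 = 1600.0 ms
```

The asymmetry explains why the two stages are optimized separately: prefill is compute-bound and parallel, while decode is memory-bandwidth-bound and strictly sequential.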

Qwen's objective was to make both of these stages run as fast as possible on AMD hardware. To achieve this, they employed several techniques:

  • Quantization – compressing the model's weights to 4 bits instead of the standard 16. This reduces the amount of data that needs to be moved in memory and accelerates computations.
  • Continuous batching – a method for processing multiple requests simultaneously without waiting for previous ones to finish. This is crucial for server scenarios where requests are constantly arriving.
  • Specialized kernels for the attention operation – a key component of transformer models. Here, they utilized FlashAttention-2 and optimized versions for AMD.
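
The memory saving from the first item is easy to estimate: a 32-billion-parameter model stores 2 bytes per weight in FP16 but only half a byte at 4 bits. A back-of-the-envelope sketch (quantization scales and activations are ignored here):

```python
def weights_gib(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB, ignoring the small
    overhead of quantization scales and zero points."""
    return n_params * bits_per_weight / 8 / 2**30

params = 32e9
print(round(weights_gib(params, 16), 1))  # FP16:  ~59.6 GiB
print(round(weights_gib(params, 4), 1))   # 4-bit: ~14.9 GiB
```

At 4 bits, the weights of a 32B model fit comfortably within the MI300X's 192 GB of HBM3, leaving ample room for the KV cache that continuous batching needs.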

All of these optimizations made it possible to extract performance from the hardware that typically requires more expensive solutions.

What This Means in Practice

The team tested several configurations. For example, the Qwen2.5-Coder-32B-Instruct model with AWQ quantization (4-bit) on a single MI300X card outputs approximately 66 tokens per second when handling a single request. The per-token latency is about 15 milliseconds.

For comparison, this means the model can generate a 100-token response (roughly 75 words) in one and a half seconds. This is already a very comfortable speed for conversational interfaces.
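
The conversion between per-token latency and throughput is simple arithmetic, using the figures from the article:

```python
def tokens_per_second(latency_ms_per_token: float) -> float:
    """Throughput implied by a given per-token decode latency."""
    return 1000.0 / latency_ms_per_token

def response_seconds(n_tokens: int, latency_ms_per_token: float) -> float:
    """Wall-clock time to decode an n-token response."""
    return n_tokens * latency_ms_per_token / 1000.0

print(round(tokens_per_second(15), 1))  # ~66.7 tokens/s
print(response_seconds(100, 15))        # 1.5 s for a 100-token reply
```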

If you increase the number of concurrent requests, the throughput also increases. On two MI300X cards, the model can process up to 32 requests in parallel with a total speed of about 1,000 tokens per second. This demonstrates server-scale performance.
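
The batching trade-off can be made concrete: each request in a batch sees a lower individual rate than a lone request would, but the aggregate throughput is far higher. A quick check of the per-request share, using the article's numbers:

```python
def per_request_tps(total_tps: float, concurrent_requests: int) -> float:
    """Aggregate throughput divided evenly across concurrent requests
    (an idealization: real schedulers do not split perfectly evenly)."""
    return total_tps / concurrent_requests

# 32 parallel requests sharing ~1,000 tokens/s of aggregate throughput:
print(per_request_tps(1000, 32))  # 31.25 tokens/s per request
```

Each user still gets over 30 tokens per second – faster than most people read – which is exactly the trade-off continuous batching is designed to exploit.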

What About Multimodal Models?

Qwen3-VL deserves special mention – it's a version of the model that works not only with text but also with images. Here, the task is more complex: the image must first be converted into a set of tokens, then processed along with the text, and finally, a response – or a new image – must be generated.

On an MI300X, the Qwen3-VL-7B model with 4-bit quantization can generate a 1024×1024 pixel image in about 0.4 seconds. This is noticeably faster than most diffusion models, which are typically used for image generation.
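
Vision-language models typically turn an image into tokens by cutting it into fixed-size patches, each becoming one token. A sketch of that bookkeeping – the 16-pixel patch size below is a common ViT convention, not a figure from the article:

```python
def image_tokens(width: int, height: int, patch: int = 16) -> int:
    """Number of patch tokens for an image under ViT-style
    patchification with no token merging (patch size is an
    assumption, not Qwen3-VL's actual configuration)."""
    return (width // patch) * (height // patch)

print(image_tokens(1024, 1024))  # 64 * 64 = 4096 patch tokens
```

This is why image inputs are heavier than text of comparable "length": a single high-resolution picture can occupy thousands of token slots in the sequence.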

The latency when working with both text and images simultaneously is about 18 milliseconds per token. In other words, it's almost as fast as the text-only models.

Why This Matters

First, it demonstrates that the AMD MI300X is a perfectly viable option for large model inference. Previously, such tasks were almost exclusively handled by NVIDIA, and there were few alternatives.

Second, Qwen's results confirm that quantization and proper optimization make it possible to run models with 30+ billion parameters on a single card – and to do so quickly. This lowers infrastructure requirements and makes model deployment more affordable.

Third, the image generation speed of Qwen3-VL opens up possibilities for interactive applications: editors, assistants, and interfaces where the user expects an instant response.

The Fine Print

Of course, there are nuances. 4-bit quantization always entails a slight loss in quality – the model becomes a little less accurate. In most cases, this is unnoticeable, but for tasks requiring high precision, it can make a difference.

It's also worth noting that these results were achieved under optimal conditions: using specially configured software, with up-to-date library versions, and taking into account the specifics of AMD's architecture. Real-world scenarios may introduce additional complexities – for example, when integrating with existing systems or working with other models.

Finally, the MI300X is still professional-grade hardware, and its cost is comparable to top-tier NVIDIA solutions. This means it is not a budget alternative but rather another option for those building serious infrastructure.

The Bottom Line

The Qwen team has demonstrated that their third-generation models can run on the AMD MI300X with latencies suitable for interactive applications. Text generation is around 15 ms per token, while image generation takes as little as 0.4 seconds for a 1024×1024 picture.

This is the result of a combination of quantization, optimized kernels, and proper memory management. And it's another sign that the market for AI accelerators is becoming increasingly diverse.

Original Title: Unleashing Computational Power: Ultimate Latency Optimization of Qwen3 and Qwen3-VL on AMD MI300X Series
Publication Date: Feb 11, 2026
LMSYS ORG lmsys.org A U.S.-based non-profit research organization studying scalable language models and distributed training systems.

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Analyzing the Original Publication and Writing the Text – Claude Sonnet 4.5 (Anthropic). The neural network studies the original material and generates a coherent text.

2. Translation into English – Gemini 2.5 Pro (Google DeepMind).

3. Text Review and Editing – Gemini 2.5 Flash (Google DeepMind). Correction of errors, inaccuracies, and ambiguous phrasing.

4. Preparing the Illustration Description – DeepSeek-V3.2 (DeepSeek). Generating a textual prompt for the visual model.

5. Creating the Illustration – FLUX.2 Pro (Black Forest Labs). Generating an image based on the prepared prompt.
