When people talk about accelerating language models, they usually mean NVIDIA. However, that's not the only option. The Qwen team decided to showcase the capabilities of AMD accelerators, and the results are quite impressive.
We are referring to the Instinct MI300X series – AMD's professional accelerators designed for handling large models. Qwen took their third-generation models, including the multimodal Qwen3-VL, and pushed their performance on this hardware to the point where latency is no longer an issue, even for interactive tasks.
What Did They Speed Up?
Simply put, a language model serves a request in two phases. The first is prefill, where the model processes your entire prompt before it begins generating a response. The second is decode, where it outputs tokens one by one.
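The two phases can be sketched with a toy loop. Everything here is an illustrative stand-in – the `forward` function and the list-based "KV cache" are not a real inference API:

```python
# Toy stand-in for a model's forward pass: the "KV cache" is just the
# accumulated context, and the "next token" is a deterministic function
# of it, so the phase structure is visible without a real model.
def forward(tokens, cache=None):
    cache = (cache or []) + list(tokens)
    return cache, cache[-1] + 1  # (updated cache, next token)

def generate(prompt_tokens, max_new_tokens):
    # Prefill: one pass over the whole prompt fills the cache.
    cache, token = forward(prompt_tokens)
    output = [token]
    # Decode: one token per step, each step reusing the cache.
    for _ in range(max_new_tokens - 1):
        cache, token = forward([token], cache)
        output.append(token)
    return output

print(generate([1, 2, 3], 4))  # → [4, 5, 6, 7]
```

Prefill is one large, highly parallel pass; decode is many small sequential steps – which is why the two phases benefit from different optimizations.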
Qwen's objective was to make both of these stages run as fast as possible on AMD hardware. To achieve this, they employed several techniques:
- Quantization – compressing the model's weights to 4 bits instead of the standard 16. This reduces the amount of data that needs to be moved in memory and accelerates computations.
- Continuous batching – a method for processing multiple requests simultaneously without waiting for previous ones to finish. This is crucial for server scenarios where requests are constantly arriving.
- Specialized kernels for the attention operation – a key component of transformer models. Here, they utilized FlashAttention-2 and optimized versions for AMD.
All of these optimizations made it possible to extract performance from the hardware that typically requires more expensive solutions.
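To make the quantization idea concrete, here is a minimal sketch of group-wise 4-bit weight compression in plain Python. The grouping and shared-scale logic echo the general idea behind AWQ-style schemes, but this is an illustration, not the actual AWQ algorithm:

```python
# Minimal group-wise 4-bit quantization sketch (illustrative, not real AWQ).
# Each group of weights shares one float scale; the weights themselves
# become integers in [0, 15] – 4 bits instead of 16.

def quantize_4bit(weights, group_size=4):
    groups = []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / 7.5 or 1.0
        codes = [min(15, max(0, round(w / scale) + 8)) for w in group]
        groups.append((scale, codes))
    return groups

def dequantize_4bit(groups):
    # Reverse the mapping: code -> (code - 8) * scale.
    return [(c - 8) * scale for scale, codes in groups for c in codes]

weights = [0.5, -0.25, 0.75, -0.1]
restored = dequantize_4bit(quantize_4bit(weights))
# Each restored value lands within half a quantization step of the original.
```

The shared per-group scale bounds the reconstruction error, which is why 4-bit weights usually cost little accuracy while quartering memory traffic.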
What This Means in Practice
The team tested several configurations. For example, the Qwen2.5-Coder-32B-Instruct model with AWQ quantization (4-bit) on a single MI300X card outputs approximately 66 tokens per second when handling a single request. The per-token latency is about 15 milliseconds.
For comparison, this means the model can generate a 100-token response (roughly 75 words) in one and a half seconds. This is already a very comfortable speed for conversational interfaces.
If you increase the number of concurrent requests, the throughput also increases. On two MI300X cards, the model can process up to 32 requests in parallel with a total speed of about 1,000 tokens per second. This demonstrates server-scale performance.
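The reported figures are easy to sanity-check against each other, using only the numbers from this section:

```python
# Single-request figures: ~15 ms/token implies ~66-67 tokens/s.
per_token_s = 0.015
print(1 / per_token_s)    # ≈ 66.7 tokens/s

# A 100-token response at that rate takes ~1.5 s, as stated above.
print(100 * per_token_s)  # 1.5

# Batched serving: ~1,000 tokens/s aggregate across 32 concurrent
# requests works out to ~31 tokens/s per request.
print(1000 / 32)          # 31.25
```

Note that per-request speed drops under batching while total throughput rises – the usual latency/throughput trade-off in serving.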
What About Multimodal Models?
Qwen3-VL deserves special mention – it's a version of the model that works not only with text but also with images. Here, the task is more complex: the image must first be converted into a set of tokens, then processed along with the text, and finally, a response – or a new image – must be generated.
On an MI300X, the Qwen3-VL-7B model with 4-bit quantization can generate a 1024×1024 pixel image in about 0.4 seconds. This is noticeably faster than most diffusion models, which are typically used for image generation.
The latency when working with both text and images simultaneously is about 18 milliseconds per token. In other words, it's almost as fast as the text-only models.
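The gap versus text-only generation is small and easy to quantify from the two latency figures above:

```python
text_ms, multimodal_ms = 15, 18  # per-token latencies from this article
print(1000 / text_ms)            # ≈ 66.7 tokens/s, text only
print(1000 / multimodal_ms)      # ≈ 55.6 tokens/s with an image in context
print((multimodal_ms - text_ms) / text_ms)  # 0.2 -> about 20% slower per token
```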
Why This Matters
First, it demonstrates that the AMD MI300X is a perfectly viable option for large-model inference – a task that, until recently, ran almost exclusively on NVIDIA hardware.
Second, Qwen's results confirm that quantization and proper optimization make it possible to run models with 30+ billion parameters on a single card – and to do so quickly. This lowers infrastructure requirements and makes model deployment more affordable.
Third, the image generation speed of Qwen3-VL opens up possibilities for interactive applications: editors, assistants, and interfaces where the user expects an instant response.
The Fine Print
Of course, there are nuances. 4-bit quantization always entails a slight loss in quality – the model becomes a little less accurate. In most cases, this is unnoticeable, but for tasks requiring high precision, it can make a difference.
It's also worth noting that these results were achieved under optimal conditions: using specially configured software, with up-to-date library versions, and taking into account the specifics of AMD's architecture. Real-world scenarios may introduce additional complexities – for example, when integrating with existing systems or working with other models.
Finally, the MI300X is still professional-grade hardware, and its cost is comparable to top-tier NVIDIA solutions. This means it is not a budget alternative but rather another option for those building serious infrastructure.
The Bottom Line
The Qwen team has demonstrated that their third-generation models can run on the AMD MI300X with latencies suitable for interactive applications. Text generation is around 15 ms per token, while image generation takes as little as 0.4 seconds for a 1024×1024 picture.
This is the result of a combination of quantization, optimized kernels, and proper memory management. And it's another sign that the market for AI accelerators is becoming increasingly diverse.