When people talk about accelerating language models, they usually mean NVIDIA. However, that's not the only option. The Qwen team decided to showcase the capabilities of AMD accelerators, and the results are quite impressive.
We are referring to the Instinct MI300X series – AMD's professional accelerators designed for handling large models. Qwen took their third-generation models, including the multimodal Qwen3-VL, and pushed their performance on this hardware to the point where latency is no longer an issue, even for interactive tasks.
What Did They Speed Up?
Simply put, a language model serves a request in two phases. The first is prefill, where the model processes your entire prompt before it begins generating a response. The second is decode, where it outputs tokens one by one.
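The two phases can be sketched with a toy loop. Everything here is an illustrative stand-in – the `forward` function and the list-based "KV cache" are not a real inference API:

```python
# Toy stand-in for a model's forward pass: the "KV cache" is just the
# accumulated context, and the "next token" is a deterministic function
# of it, so the phase structure is visible without a real model.
def forward(tokens, cache=None):
    cache = (cache or []) + list(tokens)
    return cache, cache[-1] + 1  # (updated cache, next token)

def generate(prompt_tokens, max_new_tokens):
    # Prefill: one pass over the whole prompt fills the cache.
    cache, token = forward(prompt_tokens)
    output = [token]
    # Decode: one token per step, each step reusing the cache.
    for _ in range(max_new_tokens - 1):
        cache, token = forward([token], cache)
        output.append(token)
    return output

print(generate([1, 2, 3], 4))  # → [4, 5, 6, 7]
```

Prefill is one large, highly parallel pass; decode is many small sequential steps – which is why the two phases benefit from different optimizations.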
Qwen's objective was to make both of these stages run as fast as possible on AMD hardware. To achieve this, they employed several techniques:
- Quantization – compressing the model's weights to 4 bits instead of the standard 16. This reduces the amount of data that needs to be moved in memory and accelerates computations.
- Continuous batching – a method for processing multiple requests simultaneously without waiting for previous ones to finish. This is crucial for server scenarios where requests are constantly arriving.
- Specialized kernels for the attention operation – a key component of transformer models. Here, they utilized FlashAttention-2 and optimized versions for AMD.
All of these optimizations made it possible to extract performance from the hardware that typically requires more expensive solutions.
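To make the quantization idea concrete, here is a minimal sketch of group-wise 4-bit weight compression in plain Python. The grouping and shared-scale logic echo the general idea behind AWQ-style schemes, but this is an illustration, not the actual AWQ algorithm:

```python
# Minimal group-wise 4-bit quantization sketch (illustrative, not real AWQ).
# Each group of weights shares one float scale; the weights themselves
# become integers in [0, 15] – 4 bits instead of 16.

def quantize_4bit(weights, group_size=4):
    groups = []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / 7.5 or 1.0
        codes = [min(15, max(0, round(w / scale) + 8)) for w in group]
        groups.append((scale, codes))
    return groups

def dequantize_4bit(groups):
    # Reverse the mapping: code -> (code - 8) * scale.
    return [(c - 8) * scale for scale, codes in groups for c in codes]

weights = [0.5, -0.25, 0.75, -0.1]
restored = dequantize_4bit(quantize_4bit(weights))
# Each restored value lands within half a quantization step of the original.
```

The shared per-group scale bounds the reconstruction error, which is why 4-bit weights usually cost little accuracy while quartering memory traffic.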
What This Means in Practice
The team tested several configurations. For example, the Qwen2.5-Coder-32B-Instruct model with AWQ quantization (4-bit) on a single MI300X card outputs approximately 66 tokens per second when handling a single request. The per-token latency is about 15 milliseconds.
For comparison, this means the model can generate a 100-token response (roughly 75 words) in one and a half seconds. This is already a very comfortable speed for conversational interfaces.
If you increase the number of concurrent requests, the throughput also increases. On two MI300X cards, the model can process up to 32 requests in parallel with a total speed of about 1,000 tokens per second. This demonstrates server-scale performance.
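The reported figures are easy to sanity-check against each other, using only the numbers from this section:

```python
# Single-request figures: ~15 ms/token implies ~66-67 tokens/s.
per_token_s = 0.015
print(1 / per_token_s)    # ≈ 66.7 tokens/s

# A 100-token response at that rate takes ~1.5 s, as stated above.
print(100 * per_token_s)  # 1.5

# Batched serving: ~1,000 tokens/s aggregate across 32 concurrent
# requests works out to ~31 tokens/s per request.
print(1000 / 32)          # 31.25
```

Note that per-request speed drops under batching while total throughput rises – the usual latency/throughput trade-off in serving.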
What About Multimodal Models?
Qwen3-VL deserves special mention – it's a version of the model that works not only with text but also with images. Here, the task is more complex: the image must first be converted into a set of tokens, then processed along with the text, and finally, a response – or a new image – must be generated.
On an MI300X, the Qwen3-VL-7B model with 4-bit quantization can generate a 1024×1024 pixel image in about 0.4 seconds. This is noticeably faster than most diffusion models, which are typically used for image generation.
The latency when working with both text and images simultaneously is about 18 milliseconds per token. In other words, it's almost as fast as the text-only models.
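The gap versus text-only generation is small and easy to quantify from the two latency figures above:

```python
text_ms, multimodal_ms = 15, 18  # per-token latencies from this article
print(1000 / text_ms)            # ≈ 66.7 tokens/s, text only
print(1000 / multimodal_ms)      # ≈ 55.6 tokens/s with an image in context
print((multimodal_ms - text_ms) / text_ms)  # 0.2 -> about 20% slower per token
```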
Why This Matters
First, it demonstrates that the AMD MI300X is a perfectly viable option for large-model inference – a task that, until recently, ran almost exclusively on NVIDIA hardware.
Second, Qwen's results confirm that quantization and proper optimization make it possible to run models with 30+ billion parameters on a single card – and to do so quickly. This lowers infrastructure requirements and makes model deployment more affordable.
Third, the image generation speed of Qwen3-VL opens up possibilities for interactive applications: editors, assistants, and interfaces where the user expects an instant response.
The Fine Print
Of course, there are nuances. 4-bit quantization always entails a slight loss in quality – the model becomes a little less accurate. In most cases, this is unnoticeable, but for tasks requiring high precision, it can make a difference.
It's also worth noting that these results were achieved under optimal conditions: using specially configured software, with up-to-date library versions, and taking into account the specifics of AMD's architecture. Real-world scenarios may introduce additional complexities – for example, when integrating with existing systems or working with other models.
Finally, the MI300X is still professional-grade hardware, and its cost is comparable to top-tier NVIDIA solutions. This means it is not a budget alternative but rather another option for those building serious infrastructure.
The Bottom Line
The Qwen team has demonstrated that their third-generation models can run on the AMD MI300X with latencies suitable for interactive applications. Text generation is around 15 ms per token, while image generation takes as little as 0.4 seconds for a 1024×1024 picture.
This is the result of a combination of quantization, optimized kernels, and proper memory management. And it's another sign that the market for AI accelerators is becoming increasingly diverse.