When a company takes part in an industry benchmark, the expected headline is usually some version of "our GPU got faster." This time AMD went further: in MLPerf Inference 6.0, the company not only improved its results on familiar tasks but also entered entirely new ones, including text-to-video generation. The results are interesting enough to warrant a closer look.
What Is MLPerf and Why Is It Important?
MLPerf is a standardized benchmark suite for comparing the performance of different hardware on machine learning tasks. Simply put, it is a common exam for AI accelerators: everyone takes the same "tests," and the results can be compared fairly.
Inference is the mode in which an already trained model answers queries rather than learning from data. It is the mode that matters for real-world products: when you type something into a chatbot or ask a system to generate something, the model is running inference.
Version 6.0 proved to be noteworthy for AMD for several reasons.
A Million Tokens Per Second – Is That a Lot?
One of the key results of this round is that AMD surpassed the 1 million tokens per second mark for the first time in MLPerf Inference. A token is roughly a word or part of a word; it is the unit used to measure the speed of language models.
A million tokens per second isn't a figure for a single GPU. We are talking about a cluster of several servers working together. AMD achieved this result on the Llama 2 70B model with a configuration of 11 nodes and 87 Instinct MI355X GPUs, and on the GPT-OSS-120B model with 12 nodes and 94 GPUs.
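As a rough sanity check on these cluster figures, dividing the 1 million tokens per second threshold by the GPU count gives a floor on the average throughput each GPU must sustain. The Python sketch below is illustrative only: the function name is hypothetical, and 1,000,000 is the threshold the text says was surpassed, not the exact measured result, so the real per-GPU averages are somewhat higher.

```python
# Lower bound on average per-GPU throughput implied by the cluster results.
# Assumption: 1_000_000 tokens/s is the threshold that was surpassed, not the
# exact measured total, so the values computed here are floors.

THRESHOLD_TOKENS_PER_S = 1_000_000

# (model, nodes, GPUs) as reported in the text above.
CLUSTERS = [
    ("Llama 2 70B", 11, 87),
    ("GPT-OSS-120B", 12, 94),
]


def per_gpu_floor(total_tokens_per_s: float, gpu_count: int) -> float:
    """Average tokens/s each GPU must deliver to reach the given total."""
    if gpu_count <= 0:
        raise ValueError("gpu_count must be positive")
    return total_tokens_per_s / gpu_count


for model, nodes, gpus in CLUSTERS:
    floor = per_gpu_floor(THRESHOLD_TOKENS_PER_S, gpus)
    print(f"{model}: {nodes} nodes, {gpus} GPUs, "
          f"at least {floor:,.0f} tokens/s per GPU")
```

On these inputs the floor is on the order of ten to twelve thousand tokens per second per GPU for both models.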
Why is this important? Because real-world production systems – especially those serving thousands of users simultaneously – run on clusters, not single cards. The ability to scale without losing efficiency is a key requirement for infrastructure.
When scaling from one node to 11, efficiency held at 93% in the standard Offline and Server modes and 98% in Interactive mode. This is close to ideal linear scaling: each added server contributes almost as much performance as the first one, with little coordination overhead.
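To make the scaling percentages concrete, here is a minimal Python sketch. It assumes the common definition of scaling efficiency (cluster throughput divided by node count times single-node throughput); the text does not spell out the exact formula behind the published figures, and the function names are illustrative.

```python
# Scaling efficiency under the common definition (an assumption here):
#   efficiency = cluster_throughput / (nodes * single_node_throughput)
# Equivalently, the effective speedup over one node is nodes * efficiency.


def scaling_efficiency(cluster_tps: float, single_node_tps: float, nodes: int) -> float:
    """Fraction of ideal linear scaling achieved by the cluster."""
    if nodes <= 0 or single_node_tps <= 0:
        raise ValueError("nodes and single_node_tps must be positive")
    return cluster_tps / (nodes * single_node_tps)


def effective_speedup(nodes: int, efficiency: float) -> float:
    """How many 'ideal' nodes the cluster is worth at a given efficiency."""
    return nodes * efficiency


# Efficiencies quoted in the text for the 11-node Llama 2 70B runs.
for mode, eff in [("standard modes", 0.93), ("Interactive", 0.98)]:
    speedup = effective_speedup(11, eff)
    print(f"{mode}: 11 nodes at {eff:.0%} efficiency, "
          f"about {speedup:.2f}x one node")
```

Under this definition, 11 nodes at 93% efficiency behave like roughly 10.2 perfectly scaling nodes, and at 98% like roughly 10.8.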
How the MI355X Stacks Up Against Competitors
AMD compared its results with the NVIDIA B200 and B300 on the Llama 2 70B task – the most common language benchmark in MLPerf.
On a single node, the picture is as follows: compared to the NVIDIA B200, the AMD Instinct MI355X platform matched its performance in Offline mode, reached 97% of it in Server mode, and 119% in Interactive mode. Against the newer B300, the figures were 93%, 92%, and 104%, respectively.
This isn't a victory on all fronts, but it's not a case of falling behind either. It is particularly telling that AMD is competitive across all three modes – not just in the one where its results are most favorable.
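The single-node comparison is easier to scan when each figure is expressed as a gap from parity. The short Python sketch below does that with the Llama 2 70B numbers quoted above; the dictionary layout and helper names are illustrative, and "matched" in Offline mode is recorded as exactly 100%, which is an assumption.

```python
# MI355X throughput as a percentage of each NVIDIA platform (Llama 2 70B,
# single node), copied from the text. 100 means parity; the Offline-vs-B200
# entry is taken as exactly 100 because the text says performance "matched".
RELATIVE_PERF = {
    "B200": {"Offline": 100, "Server": 97, "Interactive": 119},
    "B300": {"Offline": 93, "Server": 92, "Interactive": 104},
}


def gap_vs_parity(percent: int) -> int:
    """Percentage points above (+) or below (-) parity."""
    return percent - 100


def summarize(results: dict) -> list:
    """One human-readable line per (rival, mode) pair."""
    lines = []
    for rival, modes in results.items():
        for mode, pct in modes.items():
            gap = gap_vs_parity(pct)
            verdict = "ahead" if gap > 0 else "behind" if gap < 0 else "level"
            lines.append(f"vs {rival} {mode}: {pct}% ({gap:+d} pts, {verdict})")
    return lines


for line in summarize(RELATIVE_PERF):
    print(line)
```

Laid out this way, the pattern in the text is visible at a glance: the MI355X is ahead in Interactive mode against both rivals and within a few points elsewhere.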
The generational improvement is also worth noting: compared to the previous AMD Instinct MI325X model, the new MI355X delivered 3.1 times more tokens per second on the Llama 2 70B Server benchmark. That's a significant leap in just six months.
GPT-OSS-120B: A Competitive Debut
One of the new benchmarks in MLPerf Inference 6.0 is the GPT-OSS-120B model, which appears in the suite for the first time. That makes the results particularly interesting: instead of rerunning a familiar model, submitters had to get it working, optimize it, and meet the accuracy requirements from scratch – all within a tight deadline.
AMD succeeded: in single-node tests, the MI355X platform showed 111% of the NVIDIA B200's performance in Offline mode and 115% in Server mode. Against the B300, the figures were 91% and 82%, respectively.
In multi-node scaling, GPT-OSS-120B also became the second model to break the 1 million tokens per second barrier. Scaling efficiency with 12 nodes reached 92% in Offline mode and 93% in Server mode.
Video Generation: New Territory
Perhaps the most surprising development in this round was AMD's venture beyond language models. For the first time, the company submitted results for the Wan-2.2-t2v test, which evaluates video generation from a text description.
This is a fundamentally different type of task: the model generates a sequence of frames rather than text. Such workloads have a different computational profile and need significantly more memory.
AMD submitted these results in the Open division. The submission covers only the Single Stream mode and omits the Offline part that a full Closed-division submission requires. The Single Stream run itself, however, met the Closed-division requirements and can be compared directly with the results of other participants.
The official result: 93% of the NVIDIA B200's performance and 87% of the B300's in Single Stream. After the deadline, further optimizations raised these figures to 108% of the B200 and parity with the B300, and an unofficial Offline run reached 111% of the B200. These post-deadline figures are not part of the official submission and have not been verified by MLCommons, but they show how quickly performance can improve with continued tuning.
The very fact of participating in this test speaks volumes: generative AI is no longer limited to text, and AMD clearly intends to cover a broader spectrum of tasks.
Not Just AMD: The Partner Ecosystem
Another important detail is reproducibility. In this round, nine partners submitted their own results on AMD hardware: Cisco, Dell, Giga Computing, HPE, MangoBoost, MiTAC, Oracle, Supermicro, and Red Hat. In terms of the number of partners, AMD tied for first place among all participants.
The tests covered four GPU models: MI300X, MI325X, MI350X, and MI355X. Notably, partners' results on the MI355X deviated from AMD's own figures by no more than 4%, and in some cases by no more than 1% – even on tasks that were new to all participants.
This is important for a practical reason: a customer buys a server from Dell or HPE and gets roughly the same numbers as in AMD's official tests. The gap between lab results and the real hardware from partners is minimal.
The First Heterogeneous Test: Three GPU Generations, Two Countries
Another unconventional result from this round was the first submission in MLPerf history using three different generations of AMD GPUs simultaneously. A configuration of MI300X, MI325X, and MI355X, assembled by Dell and MangoBoost, achieved 141,521 tokens per second in Server mode and 151,843 in Offline mode on Llama 2 70B.
A detail that makes this result particularly interesting: the MI355X was located in a Dell lab in the US, while the MI300X and MI325X were in Korea. This means the test was not just evaluating a mixed configuration but distributed inference across different geographical locations.
The practical significance here is clear: most companies don't overhaul their entire infrastructure at once. The ability to use different GPU generations in a single cluster is a scenario that real-world data centers constantly face.
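One common way to keep a mixed-generation cluster busy is to hand each pool of GPUs a share of the incoming requests proportional to its throughput. The Python sketch below illustrates that idea only. It is not a description of how the Dell and MangoBoost submission scheduled work, the `split_requests` helper is hypothetical, and the pool weights are made-up placeholders rather than measured MLPerf numbers.

```python
# Hypothetical sketch: split a batch of requests across GPU pools in
# proportion to each pool's relative throughput. The weights below are
# placeholders, NOT measured results, and this is not the scheduling method
# used in the actual heterogeneous submission.


def split_requests(total_requests: int, weights: dict) -> dict:
    """Largest-remainder split of total_requests proportional to weights."""
    if total_requests < 0:
        raise ValueError("total_requests must be non-negative")
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("at least one weight must be positive")

    # Exact (fractional) share for each pool, then round down.
    exact = {name: total_requests * w / weight_sum for name, w in weights.items()}
    shares = {name: int(value) for name, value in exact.items()}

    # Give the requests lost to rounding to the largest fractional parts.
    leftover = total_requests - sum(shares.values())
    by_remainder = sorted(exact, key=lambda n: exact[n] - shares[n], reverse=True)
    for name in by_remainder[:leftover]:
        shares[name] += 1
    return shares


# Placeholder relative throughputs for the three GPU generations.
POOL_WEIGHTS = {"MI300X": 10, "MI325X": 12, "MI355X": 30}

shares = split_requests(1000, POOL_WEIGHTS)
for pool, count in shares.items():
    print(f"{pool}: {count} requests")
```

The largest-remainder step guarantees that every request is assigned and that no pool drifts more than one request from its proportional share. A real system would also have to account for the network latency between sites, which this sketch ignores.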
What's Next
AMD is sticking to an annual update cycle for its Instinct lineup: the MI300X laid the foundation in 2023, the MI325X expanded on it in 2024, and the MI350 Series, including the MI355X, added new data types and larger memory capacity in 2025. A transition to the MI400 series on the CDNA 5 architecture is planned for 2026 and is tied to AMD Helios, a rack-scale solution for large-scale AI deployments.
In this context, MLPerf Inference 6.0 is not just a set of numbers but a demonstration that AMD is consistently moving towards cluster- and rack-scale infrastructure: with predictable hardware, reproducible results, and a software stack capable of operating in heterogeneous configurations.
The competition with NVIDIA remains uneven on many fronts – but the gap is narrowing, and the scope of supported tasks is broadening. And that in itself is worthy of attention.