MLPerf Inference is an industry benchmark for AI systems, run by the MLCommons consortium. In short, it's a kind of official test: companies take real models, run them on their hardware, and publish the results publicly. No closed-door demos – just numbers that can be compared. Rounds are held twice a year, and each new release showcases the industry's progress.
In the MLPerf Inference v6.0 round, Red Hat AI secured top spots in several categories. This is remarkable in itself because hardware manufacturers are usually the ones leading the pack. Here, a company that focuses on its software stack and open-source tools came to the forefront.
Three Models, Three Stories
Red Hat AI tested three models, each with a different profile.
The first is Whisper. It's a speech recognition model that transcribes audio into text. The task might seem simple, but in practice, it requires fast processing of a data stream, especially when requests are coming in continuously. It was in this category that Red Hat achieved one of its best results.
The second is Qwen3-VL. This is a multimodal model: it works with text and images in the same request. Simply put, you can show it a picture and ask a question, and it will understand both. Such models are more complex to serve because they need to process different data types coherently.
The third is GPT-OSS-120B. This is an open-weight large language model with roughly 120 billion parameters. The more parameters a model has, the more memory it needs and the harder it is to serve quickly. Keeping such a model within acceptable limits for latency and throughput is a non-trivial engineering challenge.
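To put a rough number on that memory requirement, here is a back-of-the-envelope sketch. It assumes a nominal 120 billion parameters (taken from the model's name, not from a measured checkpoint) and counts only the weights. The helper name weight_memory_gib and the listed precisions are illustrative choices, not something from the MLPerf submission.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to hold the model weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / (1024 ** 3)


# Assumption: nominal parameter count implied by the model name.
PARAMS = 120e9

# Illustrative precisions; the precision used in the benchmark is not stated here.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label} weights: {weight_memory_gib(PARAMS, bits):.1f} GiB")
```

The estimate leaves out the key-value cache, activations, and runtime overhead, all of which grow with the number of concurrent requests, so real deployments need headroom on top of these figures.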
Why This Is More Than Just “Good Numbers”
Many MLPerf participants optimize for a specific test: they take one model, one piece of hardware, one scenario, and squeeze out the maximum performance there. Red Hat took a slightly different path: three different models, GPUs from two different manufacturers (NVIDIA and AMD), and a unified software approach.
This is important because, in real-world deployments, companies rarely operate with a perfectly homogeneous infrastructure. Some use NVIDIA, while others are starting to consider AMD as an alternative. If your toolkit works well on both, that's a practical advantage, not just a line in a press release.
How It Worked Under the Hood – Just What You Need to Know
Red Hat AI used vLLM, an inference engine for running large language models that is optimized for high throughput. It can efficiently manage memory and process many requests in parallel without sacrificing speed.
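One reason an engine can serve many requests in parallel without wasting capacity is continuous batching: a finished request frees its slot right away instead of waiting for the rest of its batch. The toy simulation below illustrates the idea only. It is not vLLM's scheduler, and the function names and request lengths are made up for the example.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Fixed batches: each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Freed slots are refilled from the queue before every step."""
    queue = deque(lengths)
    running = []  # remaining decode steps for each in-flight request
    steps = 0
    while queue or running:
        # Admit waiting requests while there is a free slot.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # Requests with one step left finish now and leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical workload: short and long requests interleaved (decode steps each).
lengths = [2, 8, 2, 8, 2, 8, 2, 8]
print("static:", static_batching_steps(lengths, batch_size=2))
print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

With two slots, the fixed batches need 32 steps because every short request waits for a long neighbor, while the continuous version finishes in 22.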
Additionally, they used llm-d, an open-source framework for distributed inference whose scheduler routes requests across multiple nodes. It allows inference to scale horizontally without manually configuring each node.
All of this ran on top of OpenShift AI, a platform for running AI workloads in enterprise environments. Its role here was less about acceleration and more about making such systems reproducible and manageable to deploy in real-world conditions, not just in a lab.
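The routing idea can be sketched with a generic least-loaded policy: each incoming request goes to whichever node currently has the least outstanding work. This is an assumption made for illustration. The article does not describe llm-d's actual routing logic, and the function name and cost values below are hypothetical.

```python
import heapq


def route_least_loaded(request_costs, num_nodes):
    """Send each request to the node with the smallest outstanding load.

    Returns the chosen node per request and the final load per node.
    """
    # Min-heap of (current_load, node_id); ties go to the lower node id.
    heap = [(0, node) for node in range(num_nodes)]
    heapq.heapify(heap)

    assignment = []
    for cost in request_costs:
        load, node = heapq.heappop(heap)
        assignment.append(node)
        heapq.heappush(heap, (load + cost, node))

    loads = [0] * num_nodes
    for load, node in heap:
        loads[node] = load
    return assignment, loads


# Hypothetical request costs (for example, expected output tokens).
assignment, loads = route_least_loaded([5, 1, 1, 1, 4], num_nodes=2)
print("assignment:", assignment)
print("loads:", loads)
```

For these costs, strict round-robin would leave one node with a load of 10 and the other with 2, whereas the least-loaded policy ends at 5 and 7. A production scheduler would likely weigh more signals than a single cost number.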
Simply put, the team didn't invent specialized solutions just for impressive numbers in a benchmark; they used the same stack that is applied in real products. This changes the meaning of the result slightly: it's not a “synthetic record,” but a demonstration that existing tools are genuinely competitive.
Openness as a Strategy
Another point worth noting is that all the components used by Red Hat are open source. vLLM, llm-d, and the models are not proprietary developments kept closed within the company. Participating in MLPerf with an open stack is both a demonstration of capabilities and an argument that open source in AI infrastructure is no longer just a “budget option.”
For the industry, this is no small matter. For a long time, the unwritten rule was: if you want the best performance, use closed, proprietary solutions optimized for specific hardware. Results like these are gradually blurring that line.
MLPerf is a good guidepost, but it's not the absolute truth. The test measures performance under strictly defined conditions: specific models, specific load scenarios, and specific metrics. In real-world systems, the conditions are always different – different requests, different usage profiles, and different constraints.
Furthermore, optimizing for a benchmark and optimizing for production are not the same thing. The teams participating in MLPerf know the rules of the game and prepare for them. How well these same results can be reproduced “in the wild” is a separate question that no test can definitively answer.
Nevertheless, MLPerf remains one of the few places where different approaches can be fairly compared under more or less controlled conditions. And Red Hat AI's appearance there with an open stack, multiple models, and two GPU platforms is, at the very least, a signal that this direction was chosen deliberately.