Published on April 1, 2026

Red Hat AI Achieves Top Results in MLPerf Inference v6.0 – Key Insights

Red Hat AI secured top spots in MLPerf Inference v6.0, the latest round of the benchmark, with results for three models on both NVIDIA and AMD GPUs.

Infrastructure | Event Source: Red Hat | 4–6 min read

MLPerf Inference is an industry benchmark for AI systems, run by the MLCommons consortium. In short, it is a standardized public test: companies take real models, run them on their own hardware and software under common rules, and publish the results. No closed-door demos, just numbers that can be compared. New rounds are held roughly twice a year, and each one shows how far the industry has progressed.

In MLPerf Inference v6.0, Red Hat AI secured top spots in several categories. That is notable in itself, because hardware manufacturers usually lead the pack. Here, a company focused on its software stack and open-source tools came to the forefront.

Three AI Models, Three Performance Stories

Red Hat AI tested three models, each with a different profile.

The first is Whisper. It's a speech recognition model that transcribes audio into text. The task might seem simple, but in practice, it requires fast processing of a data stream, especially when requests are coming in continuously. It was in this category that Red Hat achieved one of its best results.

The second is Qwen3-VL, a multimodal model: it works with images as well as text in the same request. Simply put, you can show it a picture and ask a question about it, and it will take both into account. Such models are harder to serve because each request combines image processing with text generation, and both stages have to run efficiently together.

The third is GPT-OSS-120B, OpenAI's open-weight large language model with roughly 120 billion parameters. The more parameters a model has, the more memory and compute it needs. Keeping a model of this size within acceptable latency and throughput limits is a non-trivial engineering challenge.
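To make the memory point concrete, here is a back-of-the-envelope sketch: the memory taken by a model's weights is roughly its parameter count multiplied by the bytes stored per parameter. The 120 billion figure is the round number from the model's name; the precisions below are illustrative assumptions, not the configuration Red Hat actually used, and the estimate deliberately ignores the KV cache, activations, and runtime overhead.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return num_params * bytes_per_param / 1e9


# Illustrative precisions only; the article does not state the deployment precision.
PRECISIONS = {"FP16/BF16": 16, "FP8": 8, "4-bit": 4}

if __name__ == "__main__":
    params = 120e9  # round figure taken from the model name "GPT-OSS-120B"
    for name, bits in PRECISIONS.items():
        print(f"{name:>10}: ~{weight_memory_gb(params, bits):.0f} GB of weights")
```

Running it gives about 240, 120, and 60 GB. At 16-bit precision the weights alone exceed the 80 GB of memory on many current data-center GPUs, which is why a model of this size is typically quantized, split across several GPUs, or both.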

Why These MLPerf Results Go Beyond Simple Benchmarks

Many MLPerf participants optimize for a narrow configuration: one model, one hardware platform, one scenario, and then squeeze out maximum performance there. Red Hat took a different path: three different models, GPUs from two different manufacturers (NVIDIA and AMD), and a single, unified software approach.

This is important because, in real-world deployments, companies rarely operate with a perfectly homogeneous infrastructure. Some use NVIDIA, while others are starting to consider AMD as an alternative. If your toolkit works well on both, that's a practical advantage, not just a line in a press release.

Technical Details of Red Hat's MLPerf Inference Setup

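Red Hat AI used vLLM, an open-source inference and serving engine built for high-throughput model serving. It manages GPU memory for the attention key-value (KV) cache in small blocks rather than one large reservation per request, and it batches incoming requests continuously, so many requests can be processed in parallel without sacrificing speed.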
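The memory side of this can be illustrated with a toy version of the block-based KV-cache bookkeeping that vLLM's PagedAttention technique is built around: instead of reserving space for a request's maximum possible length up front, memory is handed out in small fixed-size blocks as tokens arrive and is returned as soon as the request finishes. This is a simplified sketch for illustration only; the `BlockAllocator` class, its methods, and the block size of 16 tokens are hypothetical and do not reproduce vLLM's actual code or API.

```python
import math


class BlockAllocator:
    """Toy paged KV-cache allocator (hypothetical; not vLLM's implementation)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size               # tokens stored per block (assumed value)
        self.free_blocks = list(range(num_blocks))  # ids of unused physical blocks
        self.tables: dict[str, list[int]] = {}     # request id -> its physical block ids
        self.lengths: dict[str, int] = {}          # request id -> tokens stored so far

    def append_tokens(self, request_id: str, n: int) -> bool:
        """Reserve space for n more tokens; return False if not enough blocks are free."""
        table = self.tables.get(request_id, [])
        new_len = self.lengths.get(request_id, 0) + n
        needed = math.ceil(new_len / self.block_size) - len(table)
        if needed > len(self.free_blocks):
            return False  # a real scheduler would queue or preempt the request here
        for _ in range(needed):
            table.append(self.free_blocks.pop())
        self.tables[request_id] = table
        self.lengths[request_id] = new_len
        return True

    def free(self, request_id: str) -> None:
        """Return all blocks of a finished request to the free pool."""
        self.free_blocks.extend(self.tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


if __name__ == "__main__":
    alloc = BlockAllocator(num_blocks=8, block_size=16)
    alloc.append_tokens("req-1", 20)  # 20 tokens -> 2 blocks
    alloc.append_tokens("req-2", 5)   # 5 tokens -> 1 block
    print(len(alloc.free_blocks), "blocks free while both requests run")
    alloc.free("req-1")               # finished request returns its blocks
    print(len(alloc.free_blocks), "blocks free after req-1 finishes")
```

The design point is that a short request only ever holds the few blocks it actually uses, so the memory a naive allocator would reserve for unused maximum-length space stays available for additional requests in the same batch.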

Additionally, the team used llm-d, a Kubernetes-native framework for distributed inference built around vLLM. Its scheduler routes requests across multiple vLLM instances, which lets inference scale horizontally: the load is spread over several nodes without configuring each one by hand.
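A minimal sketch of the kind of decision a request scheduler makes when inference is spread over several replicas: send each new request to the replica with the least outstanding work. The `Replica` and `LeastLoadedRouter` names, the node names, and the least-loaded policy are illustrative assumptions; llm-d's actual scheduler is described as weighing additional signals, such as which replica already holds a matching prompt prefix in its KV cache, and this code is not its API.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    in_flight: int = 0  # requests currently being served by this replica


@dataclass
class LeastLoadedRouter:
    """Hypothetical router: picks the replica with the fewest in-flight requests."""
    replicas: list[Replica] = field(default_factory=list)

    def route(self) -> Replica:
        # min() returns the first replica among ties, so list order breaks ties.
        # Raises ValueError if no replicas are registered.
        target = min(self.replicas, key=lambda r: r.in_flight)
        target.in_flight += 1
        return target

    def complete(self, replica: Replica) -> None:
        # Called when a replica finishes a request, freeing capacity for new ones.
        replica.in_flight -= 1


if __name__ == "__main__":
    router = LeastLoadedRouter([Replica("node-a"), Replica("node-b"), Replica("node-c")])
    picks = [router.route().name for _ in range(5)]
    print(picks)
```

With three idle replicas, five requests land round-robin style on node-a, node-b, node-c, node-a, node-b; once a replica finishes work, it becomes the preferred target again. Adding replicas to the list is all it takes to scale out, which is the horizontal-scaling idea in miniature.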

All of this ran on Red Hat OpenShift AI, a platform for running AI workloads in enterprise environments. Its role here was not acceleration as such, but making it possible to deploy these systems reproducibly and manageably in real-world conditions, not just in a lab.

Simply put, the team didn't build special-purpose solutions just to post impressive benchmark numbers; it used the same stack that ships in real products. That changes the meaning of the result: it is not a “synthetic record” but a demonstration that existing tools are genuinely competitive.

Open Source as a Core Strategy in AI Infrastructure

Another point worth noting is that the components Red Hat used are open: vLLM and llm-d are open-source projects, and all three models have publicly released weights. None of them is a proprietary development kept closed within a single company. Participating in MLPerf with an open stack both demonstrates capabilities and argues that open source in AI infrastructure is no longer just a “budget option.”

For the industry, this is no small matter. For a long time, the unwritten rule was: if you want the best performance, use closed, proprietary solutions optimized for specific hardware. Results like these are gradually blurring that line.

What Remains Behind the Scenes

MLPerf is a good guidepost, but it's not the absolute truth. The test measures performance under strictly defined conditions: specific models, specific load scenarios, and specific metrics. In real-world systems, the conditions are always different – different requests, different usage profiles, and different constraints.

Furthermore, optimizing for a benchmark and optimizing for production are not the same thing. The teams participating in MLPerf know the rules of the game and prepare for them. How well these same results can be reproduced “in the wild” is a separate question that no test can definitively answer.

Nevertheless, MLPerf remains one of the few places where different approaches can be fairly compared under more or less controlled conditions. And Red Hat AI's appearance there with an open stack, multiple models, and two GPU platforms is, at the very least, a signal that this direction was chosen deliberately.

Original Title: Red Hat AI tops MLPerf Inference v6.0 with vLLM on Qwen3-VL, Whisper, and GPT-OSS-120B
Publication Date: Apr 1, 2026
Source: Red Hat (www.redhat.com), a global company developing open software platforms and infrastructure solutions with AI support.

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Claude Sonnet 4.6 (Anthropic). Analyzing the Original Publication and Writing the Text: the neural network studies the original material and generates a coherent text.

2. Gemini 2.5 Pro (Google DeepMind). Translation into English.

3. Gemini 2.5 Flash (Google DeepMind). Text Review and Editing: correction of errors, inaccuracies, and ambiguous phrasing.

4. DeepSeek-V3.2 (DeepSeek). Preparing the Illustration Description: generating a textual prompt for the visual model.

5. FLUX.2 Pro (Black Forest Labs). Creating the Illustration: generating an image based on the prepared prompt.