MLPerf Inference is an industry-standard benchmark for AI systems. It is an independent set of tasks that measures how quickly and efficiently a given hardware platform handles real-world usage scenarios. Rather than synthetic workloads, the benchmark evaluates tasks that real systems encounter: image recognition, speech processing, and working with large language models. The results are published openly, and companies rely on them when choosing infrastructure for deploying AI.
In the latest round – MLPerf Inference v6.0 – Red Hat and NVIDIA collaborated and demonstrated some of the best results across several categories.
Why Is Such a Test Even Needed?
As long as AI remains something abstract, no one really questions how it works "under the hood." But as soon as it comes to real-world deployment – in the cloud, in a corporate environment, or in production – very specific requirements immediately arise: how many requests the system processes per second, how quickly it delivers the first response, and how stably it performs under load.
MLPerf is designed precisely to provide comparable and verifiable answers to these questions. The benchmark covers several scenarios: a model can be run in maximum throughput mode (how many requests it can process per unit of time) or in a mode with strict latency constraints (as in real-world applications where a user expects an immediate response).
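To make the two modes concrete, here is a minimal Python sketch of how one might compute throughput and check a tail-latency bound from a list of per-request latencies. It is not MLPerf's own measurement harness: the function names, the nearest-rank percentile method, and the sample numbers are all illustrative assumptions.

```python
def throughput(num_completed: int, wall_seconds: float) -> float:
    """Requests completed per second over the whole run."""
    return num_completed / wall_seconds


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def meets_latency_bound(latencies: list[float], pct: int, bound: float) -> bool:
    """True if the pct-th percentile latency is within the bound."""
    return percentile(latencies, pct) <= bound


# Hypothetical run: 100 requests finished in 20 seconds of wall time.
# Most took 0.1 s, two were slow outliers.
latencies = [0.1] * 98 + [0.5, 0.9]

print(throughput(len(latencies), 20.0))        # requests per second
print(percentile(latencies, 99))               # 99th-percentile latency, seconds
print(meets_latency_bound(latencies, 99, 0.6)) # does the run satisfy a 0.6 s bound?
```

The point of the sketch is that the two modes answer different questions: the first number rewards raw volume, while the percentile check fails a run whose slowest requests are too slow, no matter how high its average throughput is.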
What Exactly Was Tested?
In this round, the task set covered several areas. First, vision: image classification and object detection. Second, speech: automatic recognition and audio transcription. Third, and perhaps most interesting at the moment, reasoning: this includes large language models, particularly Llama 3.1 405B, one of the most demanding open models available today.
Llama 3.1 405B became one of the main challenges of the round: the MLPerf organizers added it specifically to assess how platforms handle models that require a colossal amount of computation for each generated token.
Collaboration as a Key to Results
What set this submission apart was not simply running a pre-built stack on powerful hardware, but a deep, collaborative engineering effort between Red Hat and NVIDIA. Simply put, the teams worked together to tune the software and hardware components for each other.
Red Hat is responsible for the enterprise Linux platform and the software stack on which AI services are deployed. NVIDIA is responsible for the hardware infrastructure and optimized computing libraries. When these two layers are designed in tandem rather than separately, the benchmark results look markedly different, and this is precisely what the v6.0 figures confirmed.
This approach isn't just about getting a nice entry in the results table. For companies deploying AI in a production environment, it sends a signal: the Red Hat + NVIDIA combination was tested and optimized not in isolation, but in the exact configuration that can be replicated in a real-world infrastructure.
What the Numbers Say
The results were recorded across several categories, covering both throughput and latency on various models. In tasks related to language models and reasoning, as well as speech and image recognition, the partners demonstrated leading performance among the participants that published results.
The performance on Llama 3.1 405B deserves special attention. With 405 billion parameters, this model is demanding even on flagship hardware, and achieving both a fast first-response time and high throughput at once is a non-trivial task. Nevertheless, the results on this model were among the best of all who published official data for this benchmark.
Why This Matters Beyond the Scoreboard
MLPerf is more than just a competition. It's a way for the industry to agree on a common language for evaluation. When different teams publish results according to the same rules, customers and developers can compare platforms without marketing distortions.
Red Hat's participation in this round is also notable because the enterprise Linux environment has historically been seen as a neutral foundation rather than an active participant in the race for AI performance. The joint results with NVIDIA are changing this picture: the software stack is becoming as significant a factor as the hardware.
This is especially relevant in the context of growing interest in open models like Llama. Companies are increasingly deploying them on their own, rather than through cloud APIs. And in this case, the question of how efficiently a specific software-hardware stack handles the load becomes very practical – it directly impacts the cost of operation.
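The link between throughput and operating cost can be shown with simple arithmetic. The sketch below is only an illustration: the function name, the hourly price, and the tokens-per-second figure are hypothetical placeholders, not numbers from the MLPerf results.

```python
import math


def cost_per_million_tokens(hourly_cost: float, tokens_per_second: float) -> float:
    """Cost of generating one million tokens on a system that runs
    at a steady tokens_per_second and costs hourly_cost per hour."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost / tokens_per_hour * 1_000_000


# Hypothetical server: 36 currency units per hour.
baseline = cost_per_million_tokens(36.0, 1000.0)
# Same hardware and price, but a better-tuned stack doubles throughput.
tuned = cost_per_million_tokens(36.0, 2000.0)

print(baseline, tuned)
print(math.isclose(tuned, baseline / 2))
```

Under these assumptions, doubling throughput on the same hardware halves the cost per token, which is why stack-level tuning matters to anyone paying for their own infrastructure.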
What's Left Out of the Picture?
It's worth noting: MLPerf measures performance under strictly defined conditions and on specific models. Real-world usage scenarios are more diverse: they can involve mixed workloads, non-standard configurations, and additional security and reliability requirements. A benchmark is a good guide, but not a universal guarantee.
Nevertheless, publishing official results in MLPerf is a deliberate step towards transparency. And the fact that Red Hat and NVIDIA did it jointly speaks to the serious level of engineering integration between the two platforms.