Published March 4, 2026

25x Inference Speedup: What's Happening with AI Performance on New NVIDIA Hardware

The new NVIDIA GB300 NVL72 server, paired with the SGLang framework, has demonstrated a 25x performance boost when running language models.

Infrastructure
Event Source: LMSYS ORG
Reading Time: 4–6 minutes

When we discuss the operational speed of language models, there's quite a bit happening behind the scenes. One of the key metrics here is inference: how quickly a model responds to requests once it's already trained and simply “working.” This is precisely where NVIDIA and the SGLang team recently achieved a result that's hard to ignore – a 25x speedup compared to previous configurations.

Understanding AI Inference and Its Role in Performance

What Is Inference and Why Is It Important?

Simply put, inference occurs when you type a question to ChatGPT or any other AI assistant and await a response. Everything that transpires between hitting the “send” button and the text appearing on your screen – that's inference. The faster and more efficient it is, the more users can interact with the model simultaneously, and the cheaper it is for the provider to operate.
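The two numbers users actually feel during inference are time to first token (how long before anything appears) and total response time. As a toy illustration — not tied to any real serving stack, with a stand-in "model" that just sleeps between tokens — they can be measured like this:

```python
import time

def fake_model_stream(prompt, n_tokens=5, delay=0.01):
    """Stand-in for a real LLM: yields one token at a time."""
    for i in range(n_tokens):
        time.sleep(delay)  # simulated per-token compute
        yield f"tok{i}"

start = time.perf_counter()
first_token_at = None
tokens = []
for tok in fake_model_stream("What is inference?"):
    if first_token_at is None:
        # Time to first token: the delay the user notices most
        first_token_at = time.perf_counter() - start
    tokens.append(tok)
total = time.perf_counter() - start  # total generation latency

print(f"TTFT: {first_token_at:.3f}s, total: {total:.3f}s, tokens: {len(tokens)}")
```

Real serving systems optimize both numbers at once: first-token latency for responsiveness, and total throughput for cost.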

For companies deploying large language models in production (that is, in real-world services, not a research environment), inference speed directly translates to cost savings and improved user experience quality. Therefore, any significant speedup in this area isn't just a technical record, but a practical shift.

Overview of NVIDIA GB300 NVL72 Architecture and Specs

The GB300 NVL72: What Kind of System Is It?

The NVIDIA GB300 NVL72 is a next-generation server platform. Without delving into the architectural details, the key takeaway is this: it's an extremely powerful system designed specifically for artificial intelligence tasks. The "72" in the name refers to the 72 GPUs inside, linked into a single high-speed NVLink domain, making it one of the most computationally dense servers available today.

Systems like these aren't used in home computers or even typical corporate servers. This is infrastructure on the scale of major cloud providers and research labs – the kind that handles millions of AI requests daily.

SGLang Framework for Large Language Model Optimization

SGLang: The Framework for Squeezing Out Maximum Performance

SGLang is a system designed for efficiently running large language models. It's developed by the LMSYS team, also known for Chatbot Arena – a platform where users compare responses from different AI models.

To put it very simply, SGLang's job is to make the model run as fast as possible and serve as many concurrent requests as possible. It's not the model itself, but a “wrapper” that manages how requests are received, processed, and returned to the user.

The key feature of SGLang in this context is its ability to work efficiently with new hardware – not every framework can truly unlock such a platform's potential. SGLang paired with the GB300 NVL72 is an example of software and hardware working in synergy, not just side-by-side.

Analyzing the 25x Throughput Increase on GB300 NVL72

25x: Where Does That Number Come From?

The authors of the publication report a 25x performance increase compared to the previous generation of configurations. This is measured in what's known as throughput – the number of tokens (think of them as “pieces” of text) that the system can process per unit of time.
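Throughput in this sense is simple arithmetic: tokens processed divided by wall-clock time. A minimal sketch — the 25x figure is the publication's claim, but the absolute token counts and timings below are made up purely for illustration:

```python
def throughput(tokens_processed: int, seconds: float) -> float:
    """Tokens per second: the metric behind the reported speedup."""
    return tokens_processed / seconds

# Illustrative numbers only -- not taken from the benchmark itself.
baseline = throughput(tokens_processed=100_000, seconds=50.0)      # 2,000 tok/s
new_system = throughput(tokens_processed=2_500_000, seconds=50.0)  # 50,000 tok/s

speedup = new_system / baseline
print(f"Speedup: {speedup:.0f}x")
```

With these made-up inputs the ratio comes out to 25x; in the real benchmark, the absolute numbers depend on model size, batch size, and context length.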

It's important to understand: this result isn't just a case of “we installed new hardware and everything got faster.” Several factors played a role here simultaneously: the new GB300 hardware architecture, SGLang's internal optimizations specially adapted for this platform, and smart memory and computation management when processing long contexts.

To draw an analogy: it's like the difference between hauling cargo one load at a time in a passenger car versus organizing a convoy of trucks with smart logistics – the amount of work done in the same timeframe is incomparable.

Optimizing LLM Performance for Long Context Processing

Long Contexts: A Separate Story

One of the challenges of working with modern language models is processing lengthy texts. When a model needs to “keep in mind” a large document or a long conversation, it requires significantly more resources than a short query.

SGLang on the GB300 NVL72 shows particularly noticeable improvements in this exact scenario. Simply put, the longer the context, the more tangible the advantage of the new configuration becomes. This is important for tasks like analyzing large documents, multi-turn conversations, or working with codebases.
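Much of the resource cost of long contexts comes from the KV cache – the per-token state the model must keep in GPU memory, which grows linearly with context length. A back-of-the-envelope estimate, using hypothetical model dimensions and FP16 values (real deployments vary widely):

```python
def kv_cache_bytes(context_len: int, layers: int = 32,
                   kv_heads: int = 8, head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:
    """Rough KV-cache size for a hypothetical model:
    2 (K and V) * layers * kv_heads * head_dim * dtype_bytes per token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return context_len * per_token

for ctx in (1_000, 32_000, 128_000):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>7} tokens -> {gib:.2f} GiB of KV cache")
```

Even with these modest assumed dimensions, a 128K-token context needs over 15 GiB of cache per request – which is why memory management at long context lengths is where serving frameworks win or lose.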

Practical Implications for AI Developers and Enterprises

What This Means in Practice

For the end user, there won't be an immediate effect – the GB300 NVL72 isn't something that will appear in a cloud service tomorrow morning. But in the medium term, results like these shape what eventually reaches us in the form of faster responses, more affordable APIs, and more complex tasks that AI can solve in real time.

For developers and companies building products on top of large language models, this is a signal: the new generation of infrastructure is truly changing the equation. While running a large model used to require trade-offs between speed, cost, and quality, these results show that the room for maneuver is expanding.

Current Limitations and Future of AI Infrastructure Scaling

Open Questions

Despite the impressive figure, it's worth keeping a few things in mind. First, the 25x speedup is a result under specific testing conditions. Real-world workloads might behave differently. Second, the availability of the GB300 NVL72 is still limited – it's hardware for major players, not the mass market. Third, SGLang's optimizations for specific hardware mean that some of its advantages simply won't materialize without that hardware.

This isn't an attempt to downplay the result – rather, it's a reminder that in AI infrastructure, as in any other field, there's always a gap between a lab record and everyday reality. But the direction of progress is clear: inference performance is growing rapidly, and that's good news for the entire industry.

Original Title: Unlocking 25x Inference Performance with SGLang on NVIDIA GB300 NVL72
Publication Date: Feb 20, 2026
LMSYS ORG (lmsys.org) – a U.S.-based non-profit research organization studying scalable language models and distributed training systems.

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Claude Sonnet 4.6 (Anthropic) – Analyzing the Original Publication and Writing the Text. The neural network studies the original material and generates a coherent text.

2. Gemini 2.5 Pro (Google DeepMind) – Translating the Text into English.

3. Gemini 2.5 Flash (Google DeepMind) – Text Review and Editing. Correction of errors, inaccuracies, and ambiguous phrasing.

4. DeepSeek-V3.2 (DeepSeek) – Preparing the Illustration Description. Generating a textual prompt for the visual model.

5. FLUX.2 Pro (Black Forest Labs) – Creating the Illustration. Generating an image based on the prepared prompt.
