Published March 4, 2026

25x Inference Speedup: What's Happening with AI Performance on New NVIDIA Hardware

The new NVIDIA GB300 NVL72 server, paired with the SGLang framework, has demonstrated a 25x performance boost when running language models.

Infrastructure
Event Source: LMSYS ORG
Reading Time: 4–6 minutes

When we discuss the operational speed of language models, there's quite a bit happening behind the scenes. One of the key metrics here is inference: how quickly a model responds to requests once it's already trained and simply “working.” This is precisely where NVIDIA and the SGLang team recently achieved a result that's hard to ignore – a 25x speedup compared to previous configurations.

Understanding AI Inference and Its Role in Performance

What Is Inference and Why Is It Important?

Simply put, inference occurs when you type a question to ChatGPT or any other AI assistant and await a response. Everything that transpires between hitting the “send” button and the text appearing on your screen – that's inference. The faster and more efficient it is, the more users can interact with the model simultaneously, and the cheaper it is for the provider to operate.
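The two numbers users actually feel during inference are time to first token (how long before anything appears) and total response time. As a toy illustration — not tied to any real serving stack, with a stand-in "model" that just sleeps between tokens — they can be measured like this:

```python
import time

def fake_model_stream(prompt, n_tokens=5, delay=0.01):
    """Stand-in for a real LLM: yields one token at a time."""
    for i in range(n_tokens):
        time.sleep(delay)  # simulated per-token compute
        yield f"tok{i}"

start = time.perf_counter()
first_token_at = None
tokens = []
for tok in fake_model_stream("What is inference?"):
    if first_token_at is None:
        # Time to first token: the delay the user notices most
        first_token_at = time.perf_counter() - start
    tokens.append(tok)
total = time.perf_counter() - start  # total generation latency

print(f"TTFT: {first_token_at:.3f}s, total: {total:.3f}s, tokens: {len(tokens)}")
```

Real serving systems optimize both numbers at once: first-token latency for responsiveness, and total throughput for cost.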

For companies deploying large language models in production (that is, in real-world services, not a research environment), inference speed directly translates to cost savings and improved user experience quality. Therefore, any significant speedup in this area isn't just a technical record, but a practical shift.

Overview of NVIDIA GB300 NVL72 Architecture and Specs

The GB300 NVL72: What Kind of System Is It?

The NVIDIA GB300 NVL72 is a next-generation server platform. Without delving into the architectural details, the key takeaway is this: it's an extremely powerful system designed specifically for artificial intelligence tasks. The "72" in the name refers to the 72 GPUs inside, linked into a single high-speed NVLink domain, making it one of the most computationally dense servers available today.

Systems like these aren't used in home computers or even typical corporate servers. This is infrastructure on the scale of major cloud providers and research labs – the kind that handles millions of AI requests daily.

SGLang Framework for Large Language Model Optimization

SGLang: The Framework for Squeezing Out Maximum Performance

SGLang is a system designed for efficiently running large language models. It's developed by the LMSYS team, also known for Chatbot Arena – a platform where users compare responses from different AI models.

To put it very simply, SGLang's job is to make the model run as fast as possible and serve as many concurrent requests as possible. It's not the model itself, but a “wrapper” that manages how requests are received, processed, and returned to the user.

The key feature of SGLang in this context is its ability to work efficiently with new hardware – not every framework can truly unlock such a platform's potential. SGLang paired with the GB300 NVL72 is an example of software and hardware working in synergy, not just side-by-side.

Analyzing the 25x Throughput Increase on GB300 NVL72

25x: Where Does That Number Come From?

The authors of the publication report a 25x performance increase compared to the previous generation of configurations. This is measured in what's known as throughput – the number of tokens (think of them as “pieces” of text) that the system can process per unit of time.
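Throughput in this sense is simple arithmetic: tokens processed divided by wall-clock time. A minimal sketch — the 25x figure is the publication's claim, but the absolute token counts and timings below are made up purely for illustration:

```python
def throughput(tokens_processed: int, seconds: float) -> float:
    """Tokens per second: the metric behind the reported speedup."""
    return tokens_processed / seconds

# Illustrative numbers only -- not taken from the benchmark itself.
baseline = throughput(tokens_processed=100_000, seconds=50.0)      # 2,000 tok/s
new_system = throughput(tokens_processed=2_500_000, seconds=50.0)  # 50,000 tok/s

speedup = new_system / baseline
print(f"Speedup: {speedup:.0f}x")
```

With these made-up inputs the ratio comes out to 25x; in the real benchmark, the absolute numbers depend on model size, batch size, and context length.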

It's important to understand: this result isn't just a case of “we installed new hardware and everything got faster.” Several factors played a role here simultaneously: the new GB300 hardware architecture, SGLang's internal optimizations specially adapted for this platform, and smart memory and computation management when processing long contexts.

To draw an analogy: it's like the difference between hauling cargo one load at a time in a passenger car versus organizing a convoy of trucks with smart logistics – the amount of work done in the same timeframe is incomparable.

Optimizing LLM Performance for Long Context Processing

Long Contexts: A Separate Story

One of the challenges of working with modern language models is processing lengthy texts. When a model needs to “keep in mind” a large document or a long conversation, it requires significantly more resources than a short query.

SGLang on the GB300 NVL72 shows particularly noticeable improvements in this exact scenario. Simply put, the longer the context, the more tangible the advantage of the new configuration becomes. This is important for tasks like analyzing large documents, multi-turn conversations, or working with codebases.
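Much of the resource cost of long contexts comes from the KV cache – the per-token state the model must keep in GPU memory, which grows linearly with context length. A back-of-the-envelope estimate, using hypothetical model dimensions and FP16 values (real deployments vary widely):

```python
def kv_cache_bytes(context_len: int, layers: int = 32,
                   kv_heads: int = 8, head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:
    """Rough KV-cache size for a hypothetical model:
    2 (K and V) * layers * kv_heads * head_dim * dtype_bytes per token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return context_len * per_token

for ctx in (1_000, 32_000, 128_000):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>7} tokens -> {gib:.2f} GiB of KV cache")
```

Even with these modest assumed dimensions, a 128K-token context needs over 15 GiB of cache per request – which is why memory management at long context lengths is where serving frameworks win or lose.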

Practical Implications for AI Developers and Enterprises

What This Means in Practice

For the end user, there won't be an immediate effect – the GB300 NVL72 isn't something that will appear in a cloud service tomorrow morning. But in the medium term, results like these shape what eventually reaches us in the form of faster responses, more affordable APIs, and more complex tasks that AI can solve in real time.

For developers and companies building products on top of large language models, this is a signal: the new generation of infrastructure is truly changing the equation. While running a large model used to require trade-offs between speed, cost, and quality, these results show that the room for maneuver is expanding.

Current Limitations and Future of AI Infrastructure Scaling

Open Questions

Despite the impressive figure, it's worth keeping a few things in mind. First, the 25x speedup is a result under specific testing conditions. Real-world workloads might behave differently. Second, the availability of the GB300 NVL72 is still limited – it's hardware for major players, not the mass market. Third, SGLang's optimizations for specific hardware mean that some of its advantages simply won't materialize without that hardware.

This isn't an attempt to downplay the result – rather, it's a reminder that in AI infrastructure, as in any other field, there's always a gap between a lab record and everyday reality. But the direction of progress is clear: inference performance is growing rapidly, and that's good news for the entire industry.

Original Title: Unlocking 25x Inference Performance with SGLang on NVIDIA GB300 NVL72
Publication Date: Feb 20, 2026
LMSYS ORG (lmsys.org) – a U.S.-based non-profit research organization studying scalable language models and distributed training systems.

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Claude Sonnet 4.6 (Anthropic) – Analyzing the Original Publication and Writing the Text. The neural network studies the original material and generates a coherent text.

2. Gemini 2.5 Pro (Google DeepMind) – Translating the Text into English.

3. Gemini 2.5 Flash (Google DeepMind) – Text Review and Editing. Correction of errors, inaccuracies, and ambiguous phrasing.

4. DeepSeek-V3.2 (DeepSeek) – Preparing the Illustration Description. Generating a textual prompt for the visual model.

5. FLUX.2 Pro (Black Forest Labs) – Creating the Illustration. Generating an image based on the prepared prompt.
