When people discuss large language models, the conversation usually revolves around how 'smart' their answers are. However, for those who use these models in their work, another question arises: how fast and cost-effective are they? This is particularly relevant when processing long texts – such as large documents, lengthy dialogues, and complex tasks with contexts spanning thousands of words.
This very question prompted a new study from the LMSYS team, which tested the DeepSeek model on the new NVIDIA GB300 NVL72 accelerator. The results were significant enough to warrant sharing.
Long Context Is Its Own Challenge
In short, the longer the text a model processes, the more memory and computational resources it requires. And the cost doesn't grow 'just a little': the memory needed to cache intermediate results grows with every token of context, while the attention computation grows even faster, roughly quadratically with sequence length. When processing long sequences, the model must hold vast amounts of intermediate data in memory, and this is where standard configurations begin to struggle.
Simply put, if you want a model to read an entire book or a large technical document and answer questions about it, the workload is fundamentally different from that of answering a short query.
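To make the scaling concrete, here is a rough back-of-the-envelope sketch of how the KV cache (the intermediate data the model keeps in memory for every token it has seen) grows with context length. The model dimensions below are hypothetical illustrative values, not DeepSeek's actual architecture.

```python
# Rough estimate of KV-cache memory vs. context length.
# Layer count, KV heads, and head dimension are made-up
# illustrative numbers, not any real model's configuration.

def kv_cache_gib(seq_len, layers=60, kv_heads=8, head_dim=128, bytes_per_elem=2):
    """GiB needed to hold keys and values for one sequence."""
    # 2 tensors (K and V) per layer, one vector per token per KV head,
    # 2 bytes per element assuming 16-bit precision
    total_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len
    return total_bytes / 2**30

for n in (1_000, 32_000, 128_000):
    print(f"{n:>7} tokens -> {kv_cache_gib(n):6.2f} GiB per sequence")
```

Even in this toy model, a 128,000-token context needs over a hundred times the cache memory of a 1,000-token query, for every sequence being served at once. That is why memory capacity, not raw compute, is so often the limiting factor.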
DeepSeek is an interesting model in this regard: it supports a very large context window, which makes it particularly attractive for such scenarios. But for this window to be truly practical, it requires the right hardware.
GB300 NVL72: What It Is and Why It Matters
The NVIDIA GB300 NVL72 is the latest accelerator configuration designed for large-scale inference tasks (that is, running already-trained models, not training them). The main difference from the previous generation is a significantly larger amount of memory and faster memory performance.
For long contexts, this is critical: memory capacity and bandwidth are most often the bottleneck. The GB300 NVL72 alleviates some of these constraints.
In their study, LMSYS compared DeepSeek's performance on the GB300 NVL72 with the previous generation, the H100 NVL8. This is a fair comparison, as the H100 is a widely used configuration that many are currently relying on.
What the Tests Showed
The results were notable in several areas.
First, generation speed on long contexts increased significantly. For short queries, the difference between hardware generations is usually not as dramatic. However, the longer the context, the more the GB300 pulls ahead. This is exactly the kind of situation where new hardware solves a real problem, rather than just adding percentage points to a benchmark.
Second, the system's throughput – that is, how many requests it can process per unit of time – also increased. This is important for practical deployment: if the hardware finishes each request faster, it can serve more users over the same period.
Third, the researchers noted improvements in the so-called prefill stage – the phase where the model 'reads' the input text before starting to generate a response. For long contexts, this stage can consume a significant amount of time, and this is where the GB300 showed a particularly noticeable boost.
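A toy latency model helps show why prefill dominates precisely on long contexts. The per-phase token rates below are made-up illustrative numbers, not measured GB300 or H100 figures: prefill processes the whole prompt in parallel (fast per token), while decode emits tokens one at a time (slow per token).

```python
# Toy two-phase latency model: prefill ("reading" the prompt)
# plus decode (generating the answer). The throughput numbers
# are hypothetical placeholders for illustration only.

def request_latency(prompt_tokens, output_tokens,
                    prefill_tok_per_s=20_000, decode_tok_per_s=80):
    """Return (prefill_seconds, decode_seconds) for one request."""
    prefill = prompt_tokens / prefill_tok_per_s
    decode = output_tokens / decode_tok_per_s
    return prefill, decode

for prompt in (500, 100_000):
    p, d = request_latency(prompt, output_tokens=400)
    print(f"{prompt:>7}-token prompt: prefill {p:5.2f}s, decode {d:.2f}s, "
          f"prefill share {p / (p + d):.0%}")
```

With a short prompt, prefill is a rounding error; with a 100,000-token prompt it accounts for roughly half the total time in this sketch. Speeding up prefill therefore pays off almost exclusively on long-context workloads, which matches where the GB300 showed its largest gains.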
Why It's Not Just About Speed
Speed is convenient, but behind it lies something more practical: cost.
When a model runs faster and processes more requests on the same hardware, the cost per query decreases. For services that handle large volumes of text – legal documents, medical records, code, long support dialogues – this translates to direct savings.
Furthermore, a long context enables scenarios that were previously unrealistic in real time. For example, analyzing a large contract with an immediate response or an agent-based system that maintains a long history of interactions without losing context.
A Few Nuances to Consider
The results look convincing, but there is some important context to keep in mind.
The GB300 NVL72 is very expensive hardware that is not yet widely available. Most companies are currently working with H100 or earlier configurations. So this is more about the future outlook than a sign that everyone will be switching to the new infrastructure tomorrow.
It's also worth noting that the tests were conducted under specific conditions – on a particular model (DeepSeek) and in a specific configuration. How applicable these results are to other models and other workloads is a separate question that will require further investigation.
Finally, the very fact that LMSYS and NVIDIA are publishing these results is more than just a technical report. It's part of a broader conversation about how the industry will handle the growing demands for long contexts. The demand for this is increasing: models are getting smarter, tasks are becoming more complex, and documents are getting longer.
Conclusion: Hardware Is Catching Up to Model Ambitions
For a long time, there has been a somewhat paradoxical situation: models were theoretically capable of handling very long texts, but in practice, it was too slow or too expensive to be truly viable.
The GB300 NVL72 takes a step toward closing this gap. It's not a complete solution, and it's not for everyone just yet, but the direction is clear. Long context is ceasing to be an exotic feature and is gradually becoming a standard that can be supported by real-world infrastructure with reasonable performance.
For those building products on top of language models, this is a positive signal: scenarios that seemed premature just a year ago are now becoming technically feasible.