Published February 6, 2026

How to Scale vLLM and Avoid Out-of-Memory Errors

The AI21 Labs team shared their experience optimizing vLLM, a popular tool for deploying language models that often hits critical out-of-memory errors when workloads scale.

Source: AI21 Labs

vLLM is one of the most popular tools for running large language models. It is fast, user-friendly, and widely used in production environments. However, there is a problem that almost everyone faces when scaling workloads: the OOM (Out of Memory) error, i.e. running out of GPU memory.

Engineers at AI21 Labs encountered this while working with their Jamba model. They were running vLLM on GPU servers and increasing the number of requests, until at some point the system would crash. Moreover, predicting the moment of failure was tricky: the system could run stably for a while and then suddenly fail without warning.

Why vLLM Consumes So Much Memory

The issue lies in how vLLM manages resources. When a model processes a request, it needs to store intermediate data, known as KV caches. These are a kind of "margin notes" that help the model maintain the conversation context and generate text faster.

vLLM pre-allocates a significant amount of VRAM for these caches. The idea is to avoid wasting time on dynamic memory allocation during computation. But if there are many requests or they contain long contexts, this reserve is quickly exhausted – and the system throws an error.

Simply put: vLLM prioritizes maximum speed, so it grabs memory with a safety margin, but sometimes this amount proves to be either excessive or, conversely, insufficient depending on the specific workload.
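To get a feel for why these caches dominate memory, you can estimate their size with the standard formula for transformer attention: each token stores one key and one value vector per layer. The sketch below uses illustrative numbers, not Jamba's actual configuration (Jamba is a hybrid Mamba/Transformer, so its real KV footprint differs):

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector
    (factor of 2) per attention layer, stored in fp16 by default."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Illustrative numbers only (not Jamba's real config):
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(f"{per_token} bytes/token; a 4096-token context needs "
      f"{per_token * 4096 / 2**30:.2f} GiB")  # 131072 bytes/token -> 0.50 GiB
```

Half a gigabyte per long request adds up fast, which is exactly why a handful of long-context requests can exhaust a reserve sized for many short ones.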

AI21 Labs Solutions to Reduce Memory Usage

The team tested several approaches. First, they tried simply limiting the number of concurrent requests to reduce the load. This only partially helped; in certain scenarios, memory still overflowed.

Next, they started experimenting with vLLM settings, specifically the gpu_memory_utilization parameter, which sets the fraction of GPU memory vLLM is allowed to claim (covering model weights, activations, and the KV caches). The default is 0.9, i.e. 90%, but AI21 found that this setting was too aggressive for their tasks.

They lowered the value to 80% and then to 70% – the system became more stable. However, this meant that some GPU resources were sitting idle, and overall throughput dropped. This solution was far from perfect.
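The trade-off they hit can be seen with simple arithmetic: lowering gpu_memory_utilization shrinks the KV-cache budget, which directly limits how many requests can run concurrently. A rough back-of-the-envelope sketch, with all sizes as illustrative assumptions:

```python
def kv_cache_budget_gib(total_vram_gib: float, weights_gib: float,
                        gpu_memory_utilization: float = 0.90,
                        activation_reserve_gib: float = 1.0) -> float:
    """Rough KV-cache budget: vLLM caps itself at
    gpu_memory_utilization * total VRAM; what remains after model
    weights and a working reserve is available for KV caches."""
    usable = total_vram_gib * gpu_memory_utilization
    return max(0.0, usable - weights_gib - activation_reserve_gib)

# On an 80 GiB GPU with 31 GiB of weights (illustrative numbers):
print(round(kv_cache_budget_gib(80, 31, 0.90), 2))  # 40.0 GiB for KV caches
print(round(kv_cache_budget_gib(80, 31, 0.70), 2))  # 24.0 GiB: fewer concurrent requests
```

Dropping from 90% to 70% here cuts the cache budget by 40%, which is why the fix bought stability at the cost of throughput.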

The Key Breakthrough: Dynamic Batching Management

The solution was found in how batches – groups of requests processed simultaneously – are formed. Instead of hard memory limits, the team focused on queue management.

vLLM tries to pack as many requests as possible into a single batch to maximize GPU utilization. But if a batch contains several long requests with large contexts, the memory limit can be exceeded right in the middle of the operation.

AI21 Labs implemented a system that monitors the actual memory required for current requests in real-time and dynamically adjusts the batch size. If the system sees that free memory is running low, it pauses adding new requests to the batch and waits for resources to free up.

There is no complex magic here – it is more like careful balancing. But the effect was significant: the number of OOM errors was slashed, while throughput remained high.
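The admission logic described above can be sketched in a few lines. This is not AI21's actual implementation (they did not publish it); it is a minimal greedy version of the same idea: admit a queued request only if its worst-case token footprint still fits in the free KV-cache budget.

```python
def admit_requests(queue, free_kv_tokens: int, safety_margin: int = 256):
    """Greedy admission control: add queued requests (prompt_len, max_new)
    to the batch only while their worst-case footprint fits in the
    remaining KV-cache budget, keeping a safety margin. A sketch of the
    idea, not AI21's actual implementation."""
    batch, remaining = [], free_kv_tokens - safety_margin
    for prompt_len, max_new in queue:
        footprint = prompt_len + max_new
        if footprint > remaining:
            break  # pause admission; wait for running requests to finish
        batch.append((prompt_len, max_new))
        remaining -= footprint
    return batch, remaining + safety_margin

queue = [(1000, 200), (3000, 500), (8000, 1000)]
batch, left = admit_requests(queue, free_kv_tokens=5000)
# The 8000-token request is held back until memory frees up.
```

The safety margin plays the same role as the "waits for resources to free up" pause: it keeps a long request from landing exactly on the memory ceiling mid-generation.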

Practical Implications for Production Deployments

AI21 Labs' experience confirms an important point: vLLM's default settings are great for a quick start but aren't a one-size-fits-all solution. When launching a model into production with a real-world workload, it is essential to analyze exactly how the system consumes resources.

This is especially relevant when:

  • requests vary greatly in length;
  • the load is uneven (sharp spikes after periods of silence);
  • heavy models run on limited hardware.

In such scenarios, a "set it and forget it" approach won't work. You need to implement memory monitoring, experiment with batching parameters, and potentially layer on your own load management logic.
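The "implement memory monitoring" advice can start very small. A minimal sketch, with hypothetical thresholds, is a watermark tracker that flags samples approaching the KV budget so you know when to retune batching before OOM actually hits:

```python
class KVUsageMonitor:
    """Tracks peak observed KV-cache usage (in tokens) and warns when a
    sample crosses a fraction of the configured budget. A minimal sketch
    of 'monitor before you tune'; the thresholds are assumptions."""
    def __init__(self, budget_tokens: int, warn_fraction: float = 0.85):
        self.budget = budget_tokens
        self.warn_at = int(budget_tokens * warn_fraction)
        self.peak = 0

    def record(self, used_tokens: int) -> bool:
        """Record a usage sample; return True if it is in the warning zone."""
        self.peak = max(self.peak, used_tokens)
        return used_tokens >= self.warn_at

mon = KVUsageMonitor(budget_tokens=10_000)
mon.record(6_000)   # fine
mon.record(9_000)   # True: 9000 >= 8500, time to revisit batching limits
```

In a real deployment the samples would come from the serving engine's own metrics rather than hand-fed numbers.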

Future Development and Optimization Opportunities

AI21 Labs did not disclose all the technical details of the implementation, as it is an internal solution for their infrastructure. However, the general concept is clear: vLLM provides an excellent foundation, but for stable operation under high load, the tool requires fine-tuning.

The question remains: will the vLLM project itself evolve toward smarter memory management? The project is actively supported by the community, so it's quite likely that such mechanisms will eventually appear in the base version "out of the box".

In the meantime, those scaling vLLM should view configuration not as a one-off task, but as a continuous process. Out-of-memory errors aren't a death sentence; they're a signal that it's time to optimize the system's parameters.

Original Title: Go big or go OOM: the art of scaling vLLM
Publication Date: Feb 6, 2026
AI21 Labs (www.ai21.com) is an Israeli company building large language models and AI tools for working with text.

How This Text Was Created

This material is not a direct retelling of the original publication. The news item was first selected as an event important for understanding AI development; then a processing framework was set: what to clarify, what context to add, and where to place emphasis. This turns a single announcement or update into a coherent, meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role: analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Analyzing the Original Publication and Writing the Text — Claude Sonnet 4.5 (Anthropic): studies the original material and generates a coherent text.

2. Text Review and Editing — Gemini 3 Flash Preview (Google DeepMind): correction of errors, inaccuracies, and ambiguous phrasing.

3. Preparing the Illustration Description — DeepSeek-V3.2 (DeepSeek): generating a textual prompt for the visual model.

4. Creating the Illustration — FLUX.2 Pro (Black Forest Labs): generating an image based on the prepared prompt.
