vLLM is one of the most popular tools for running large language models. It is fast, user-friendly, and widely used in production environments. However, there is a problem almost everyone faces when scaling workloads: the OOM (Out Of Memory) error, a shortage of GPU memory.
Engineers at AI21 Labs encountered this while working with their Jamba model. They were running vLLM on GPU servers and increasing the number of requests – until at some point the system would crash. Worse, predicting the moment of failure was tricky: the system could run stably for a while and then suddenly fall over.
Why vLLM Consumes So Much Memory
The issue lies in how vLLM manages resources. When a model processes a request, it needs to store intermediate data – known as KV caches. These are a kind of "margin notes" that help the model maintain the conversation context and generate text faster.
vLLM pre-allocates a significant amount of VRAM for these caches. The idea is to avoid wasting time on dynamic memory allocation during computation. But if there are many requests or they contain long contexts, this reserve is quickly exhausted – and the system throws an error.
Simply put: vLLM prioritizes maximum speed, so it grabs memory with a safety margin, but sometimes this amount proves to be either excessive or, conversely, insufficient depending on the specific workload.
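To see why these caches add up so quickly, here is a back-of-envelope sizing sketch. The formula (2 vectors per token per layer, each of size heads × head dimension) is standard for transformer KV caches; the concrete dimensions below are an assumed 7B-class configuration for illustration, not Jamba's actual architecture.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # Each token stores one key and one value vector per layer (factor of 2),
    # each of size num_kv_heads * head_dim, at dtype_bytes per element
    # (2 bytes for fp16/bf16).
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Illustrative 7B-class config (assumed dimensions):
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
print(per_token)                 # 524288 bytes, i.e. 0.5 MiB per token
print(per_token * 4096 / 2**30)  # 2.0 GiB for a single 4096-token context
```

At half a mebibyte per token, a handful of long-context requests can consume tens of gigabytes, which is why vLLM reserves the cache region up front.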
AI21 Labs' Solutions to Reduce Memory Usage
The team tested several approaches. First, they tried simply limiting the number of concurrent requests to reduce the load. This only partially helped; in certain scenarios, memory still overflowed.
Next, they started experimenting with vLLM settings – specifically, the gpu_memory_utilization parameter, which caps the fraction of GPU memory the engine may claim (model weights, activations, and KV caches combined). By default this is 0.9, i.e. 90%, but AI21 found that setting too aggressive for their workloads.
They lowered the value to 80% and then to 70%, and the system became more stable. However, this meant that some GPU resources sat idle and overall throughput dropped, so the trade-off was far from perfect.
The Key Breakthrough: Dynamic Batching Management
The solution was found in how batches – groups of requests processed simultaneously – are formed. Instead of hard memory limits, the team focused on queue management.
vLLM tries to pack as many requests as possible into a single batch to maximize GPU utilization. But if a batch contains several long requests with large contexts, the memory limit can be exceeded right in the middle of the operation.
AI21 Labs implemented a system that monitors the actual memory required for current requests in real-time and dynamically adjusts the batch size. If the system sees that free memory is running low, it pauses adding new requests to the batch and waits for resources to free up.
There is no complex magic here – it is more like careful balancing. But the effect was significant: the number of OOM errors was slashed, while throughput remained high.
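AI21 Labs has not published this code, so the following is a hypothetical sketch of the idea only: admit queued requests into the batch while their estimated KV-cache footprint fits the budget, and make the rest wait until running requests free capacity. The class name and token-based accounting are my own illustration.

```python
from collections import deque

class MemoryAwareBatcher:
    """Hypothetical sketch of memory-aware batch admission (not AI21's code)."""

    def __init__(self, budget_tokens):
        self.budget_tokens = budget_tokens   # total KV-cache capacity, in tokens
        self.in_flight_tokens = 0            # tokens held by running requests
        self.queue = deque()                 # waiting (request_id, est_tokens)

    def submit(self, request_id, est_tokens):
        self.queue.append((request_id, est_tokens))

    def form_batch(self):
        # Admit requests in arrival order while their estimated footprint
        # fits; stop as soon as one would exceed the budget.
        batch = []
        while self.queue:
            rid, tokens = self.queue[0]
            if self.in_flight_tokens + tokens > self.budget_tokens:
                break
            self.queue.popleft()
            self.in_flight_tokens += tokens
            batch.append(rid)
        return batch

    def finish(self, est_tokens):
        # A request completed; release its share of the budget.
        self.in_flight_tokens -= est_tokens

# Budget of 10k tokens: "c" must wait until "a" finishes and frees capacity.
b = MemoryAwareBatcher(budget_tokens=10_000)
b.submit("a", 4_000); b.submit("b", 6_000); b.submit("c", 3_000)
print(b.form_batch())   # ['a', 'b'] -- adding 'c' would exceed the budget
b.finish(4_000)
print(b.form_batch())   # ['c']
```

The key design choice is that admission is deferred rather than refused: a request that does not fit now simply stays queued, so no work is lost and memory never overshoots the budget.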
Practical Implications for Production Deployments
AI21 Labs' experience confirms an important point: vLLM's default settings are great for a quick start but aren't a one-size-fits-all solution. When launching a model into production with a real-world workload, it is essential to analyze exactly how the system consumes resources.
This is especially relevant when:
- requests vary greatly in length;
- the load is uneven (sharp spikes after periods of silence);
- heavy models run on limited hardware.
In such scenarios, a "set it and forget it" approach won't work. You need to implement memory monitoring, experiment with batching parameters, and potentially layer on your own load management logic.
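As one concrete example of such monitoring, here is a toy low-watermark check: when free KV-cache capacity drops below a threshold, signal the admission layer to pause. The class, the 15% watermark, and the string signals are all assumptions for illustration; in a real deployment you would feed it live cache-usage metrics from the serving engine.

```python
class HeadroomMonitor:
    """Toy sketch: warn when free KV-cache capacity dips below a watermark."""

    def __init__(self, capacity, low_watermark=0.15):
        self.capacity = capacity          # total cache capacity (any unit)
        self.low_watermark = low_watermark

    def check(self, used):
        # Signal the admission layer based on the remaining free fraction.
        free_frac = (self.capacity - used) / self.capacity
        return "pause-admission" if free_frac < self.low_watermark else "ok"

mon = HeadroomMonitor(capacity=1000)
print(mon.check(700))   # 'ok' -- 30% free
print(mon.check(900))   # 'pause-admission' -- only 10% free
```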
Future Development and Optimization Opportunities
AI21 Labs did not disclose all the technical details of the implementation, as it is an internal solution for their infrastructure. However, the general concept is clear: vLLM provides an excellent foundation, but for stable operation under high load, the tool requires fine-tuning.
The question remains: will the vLLM project itself evolve toward smarter memory management? The project is actively supported by the community, so it's quite likely that such mechanisms will eventually appear in the base version "out of the box".
In the meantime, those scaling vLLM should view configuration not as a one-off task, but as a continuous process. Out-of-memory errors aren't a death sentence; they're a signal that it's time to optimize the system's parameters.