Published on April 2, 2026

New Approach to Running AI Models in Production Efficiently

When One GPU Isn't Enough, and a Second Is Too Costly: A New Approach to Running AI in Production

Two new open-source projects offer a way to run multiple AI models on a single GPU with dynamic memory management, with the aim of not sacrificing performance.

Infrastructure, 5–7 min read

Event Source: Red Hat

Running large language models in a real-world setting often looks simpler than it actually is. Downloading a model, loading it into a GPU's memory, and getting a response – it sounds easy. However, once you move from experiments to production, the picture becomes much more complicated.

Imagine you have several models, each dedicated to a specific task. One answers user questions, another processes documents, and a third generates code. Each one takes up a significant chunk of GPU memory. Keeping them all loaded simultaneously is often impossible – there simply isn't enough memory. You have to choose: either buy expensive hardware or accept that models will be constantly unloaded and reloaded, which wastes time.

GPU Memory Limitations for Large Language Models

The Problem We Don't Talk About

GPU memory is an expensive and limited resource. Many modern language models need tens of gigabytes just to hold their weights. On top of that, there's the KV cache – a working space that stores the context of the current conversation or task. The longer the dialogue, the more space it consumes.
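To make the growth of the KV cache concrete, here is a back-of-the-envelope sketch. It uses the standard estimate for a decoder-only transformer: one key and one value vector per token, per layer, per KV head. The model shape below (32 layers, 32 KV heads, head dimension 128, 16-bit values) is a hypothetical example, not a figure from the original post.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV cache size for a decoder-only transformer.

    Each token stores one key and one value vector (hence the factor 2)
    per layer and per KV head.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


GIB = 1024 ** 3

# Hypothetical model shape, chosen only for illustration.
short_chat = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=512)
long_chat = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)

print(f"512 tokens:  {short_chat / GIB:.2f} GiB")
print(f"4096 tokens: {long_chat / GIB:.2f} GiB")
```

Under these assumptions the cache grows linearly with context length, and it is multiplied again by the number of concurrent requests, which is why it competes with model weights for the same memory.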

This leads to an unpleasant situation: the GPU sits idle during low-load periods but cannot accommodate additional models because its memory is occupied. During peak moments, the system begins to struggle as everything hits the same memory limit.

Put simply, current solutions are poor at sharing GPU memory dynamically among multiple models according to which one is busy and which is idle.

Implementing Virtual Memory for GPU AI Workloads

Virtual Memory: Not a New Idea, But Largely Unexplored for GPUs

On regular computers, this problem was solved long ago. The operating system uses virtual memory: if there isn't enough RAM, some data is temporarily moved to the disk and brought back as needed. Applications don't notice the difference – they operate as if they have as much memory as they require.

For GPU inference, such a mechanism has been largely absent in practice. In a typical serving setup, a graphics card either holds a model entirely or not at all, with no swapping or dynamic reallocation between tasks.

This is precisely the gap that two new open-source projects, kvcached and Sardeenz, have attempted to fill.

kvcached and Sardeenz: Tools for GPU Memory Management

The kvcached and Sardeenz Approach

The idea behind both tools is essentially to bring the concept of virtual memory to the GPU environment for AI tasks.

kvcached manages the KV cache – that same working area that expands as the model runs. Instead of keeping the entire cache on the GPU, the system can move its fragments between VRAM, RAM, and the disk, depending on how urgently they are needed at that moment. This allows for handling more concurrent requests on the same hardware.
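The tiering idea can be illustrated with a toy sketch. This is not kvcached's API or its actual algorithm: the class `TieredCache`, its method names, and the least-recently-used demotion policy are all assumptions made for illustration, and the "tiers" are plain Python dictionaries rather than real VRAM, RAM, or disk.

```python
from collections import OrderedDict


class TieredCache:
    """Toy three-tier store: blocks spill from VRAM to RAM to disk in LRU order."""

    TIERS = ("vram", "ram", "disk")

    def __init__(self, vram_blocks, ram_blocks):
        # Capacity per tier in blocks; the disk tier is treated as unbounded.
        self.capacity = {"vram": vram_blocks, "ram": ram_blocks, "disk": None}
        # Each tier keeps insertion order: least recently used block first.
        self.tiers = {name: OrderedDict() for name in self.TIERS}

    def _insert(self, tier_index, block_id, data):
        name = self.TIERS[tier_index]
        tier = self.tiers[name]
        tier[block_id] = data
        cap = self.capacity[name]
        # On overflow, demote the least recently used block one tier down.
        while cap is not None and len(tier) > cap:
            old_id, old_data = tier.popitem(last=False)
            self._insert(tier_index + 1, old_id, old_data)

    def put(self, block_id, data):
        """Store a new or updated block in the fastest tier."""
        for tier in self.tiers.values():
            tier.pop(block_id, None)
        self._insert(0, block_id, data)

    def get(self, block_id):
        """Return a block, promoting it back to VRAM because it is needed now."""
        for tier in self.tiers.values():
            if block_id in tier:
                data = tier.pop(block_id)
                self._insert(0, block_id, data)
                return data
        raise KeyError(block_id)

    def location(self, block_id):
        """Name of the tier currently holding the block, or None."""
        for name in self.TIERS:
            if block_id in self.tiers[name]:
                return name
        return None


cache = TieredCache(vram_blocks=2, ram_blocks=2)
for i in range(5):
    cache.put(f"block-{i}", b"kv")

placement = {f"block-{i}": cache.location(f"block-{i}") for i in range(5)}
cache.get("block-0")  # touching a cold block promotes it back to VRAM
```

After the five inserts, the two newest blocks sit in the fast tier and the oldest has been pushed all the way to disk. Reading the oldest block brings it back and demotes something else, which is the trade a real system would have to make against transfer latency.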

Sardeenz operates at a higher level: it manages the models themselves. Its job is to fit multiple models onto a single GPU and dynamically allocate available memory among them. If one model is actively used, it receives more resources. If another is idle, it can be partially evicted to free up space for its neighbors. The model isn't completely unloaded – it just takes a back seat and can be brought back quickly when needed again.
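One simple way to picture this kind of allocation is sketched below. It is not Sardeenz's code or policy: `ModelSlot`, `rebalance`, the notion of a per-model resident "floor", and the equal split among active models are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ModelSlot:
    name: str
    floor_gb: float  # memory kept resident even when idle (hypothetical)
    active: bool     # whether the model is currently serving requests


def rebalance(total_gb, slots):
    """Give every model its floor, then split the spare memory among active models."""
    floors = sum(s.floor_gb for s in slots)
    if floors > total_gb:
        raise ValueError("resident floors alone exceed GPU memory")
    spare = total_gb - floors
    active = [s for s in slots if s.active]
    bonus = spare / len(active) if active else 0.0
    return {s.name: s.floor_gb + (bonus if s.active else 0.0) for s in slots}


# Illustrative numbers only: an 80 GB GPU shared by three models.
slots = [
    ModelSlot("chat", floor_gb=16, active=True),
    ModelSlot("docs", floor_gb=12, active=False),
    ModelSlot("code", floor_gb=20, active=False),
]
daytime = rebalance(80, slots)

slots[2].active = True  # the code model wakes up and claims a share
both_busy = rebalance(80, slots)
```

In this sketch an idle model is never dropped below its floor, which mirrors the "back seat" behavior described above: it stays partly resident so it can resume without a full reload.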

Additionally, Sardeenz includes a web interface for managing models: you can see in real time what's loaded, what's idle, and intervene manually if necessary.

Practical Benefits of Dynamic GPU Memory Allocation

Why This Matters in Practice

If you have one powerful GPU and several models that are needed at different times – for example, one is active during the day, another at night – then without such a tool, you're forced to either keep them all in memory, if they even fit (and waste resources), or reload them manually (and lose time with each swap). Both options are inconvenient.

With dynamic memory management, the situation changes: the system itself determines what's important at the moment and allocates resources accordingly. The GPU's load becomes more balanced, which means the infrastructure cost per unit of useful work decreases.

This is especially relevant for small teams and companies that cannot afford a dedicated server for each model, or for those working with cloud GPUs, where every idle hour costs real money.

Early Stages of GPU Virtual Memory Implementation

Still an Experiment, but the Idea Is Taking Shape

It's important to understand: both projects are in their early stages. This isn't a ready-made enterprise solution with support and guarantees. These are open-source tools that offer a specific approach to a long-standing problem.

The idea itself – bringing the principles of virtual memory to the world of GPU inference – is not new in theory. However, its practical implementation in the form of ready-to-use tools is only just emerging. And that, perhaps, is the main point: not the discovery of a principle, but the appearance of a working prototype that you can actually try.

The question of performance under a real workload remains open: how noticeable are the latencies when moving the cache between memory tiers? How does the system behave when all models become active simultaneously? The answers to these questions will emerge as the tools are tested in real-world conditions.
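A rough lower bound on the cost of moving data between tiers is simply size divided by bandwidth. The sketch below shows the arithmetic; the bandwidth figures are assumed round numbers for illustration, not measurements of either tool or of any particular hardware.

```python
def transfer_seconds(num_bytes, bandwidth_bytes_per_s):
    """Lower bound on transfer time, ignoring latency, contention, and overheads."""
    return num_bytes / bandwidth_bytes_per_s


GIB = 1024 ** 3

# Assumed sustained bandwidths, for illustration only; measure on real hardware.
ASSUMED_LINKS = {
    "host RAM <-> GPU": 25e9,
    "disk <-> host RAM": 3e9,
}

estimates = {link: transfer_seconds(2 * GIB, bw) for link, bw in ASSUMED_LINKS.items()}
for link, seconds in estimates.items():
    print(f"{link}: at least {seconds * 1000:.0f} ms for 2 GiB")
```

Even this optimistic estimate suggests that fetching cache from a slower tier can add a delay a user would notice, so real-world measurements matter more than the principle.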

Nevertheless, the direction seems logical. Hardware is getting more expensive, models are becoming heavier, and the demand for infrastructure efficiency is only growing. Tools that help you get more out of what you already have – not by purchasing new servers, but by managing existing ones more intelligently – will be in high demand.


Link to Original: https://www.redhat.com/en/blog/running-llms-dynamically-production-limited-resources-hard-we-think-theres-room-another-approach

Original Title: Running LLMs dynamically, in production, on limited resources, is hard. We think there's room for another approach…

Publication Date: Apr 2, 2026

Red Hat (www.redhat.com): a global company developing open software platforms and infrastructure solutions with AI support.
