Published on April 2, 2026

When One GPU Isn't Enough, and a Second Is Too Costly: A New Approach to Running AI in Production

Two new open-source projects offer a way to run multiple AI models on a single GPU with dynamic memory management, without sacrificing performance.

Infrastructure | 5-7 min read
Event Source: Red Hat

Running large language models in a real-world setting often looks simpler than it actually is. Download a model, load it into GPU memory, get a response: it sounds easy. However, once you move from experiments to production, the picture becomes much more complicated.

Imagine you have several models, each dedicated to a specific task. One answers user questions, another processes documents, and a third generates code. Each one takes up a significant chunk of GPU memory. Keeping them all loaded simultaneously is often impossible – there simply isn't enough memory. You have to choose: either buy expensive hardware or accept that models will be constantly unloaded and reloaded, which wastes time.

The Problem We Don't Talk About

GPU memory is an expensive and limited resource. Many modern language models need tens of gigabytes just to hold their weights. On top of that, there's the KV cache – a working area that stores the attention keys and values for the tokens processed so far, so the model does not have to recompute them for every new token. The longer the dialogue, the more space it consumes.
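To make the growth of the KV cache concrete, here is a minimal back-of-the-envelope sketch. It uses the standard size estimate for a decoder-only transformer (keys plus values, per layer, per token). The model shape below is hypothetical and chosen only for illustration; it does not describe any specific model from the article.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV cache size for a decoder-only transformer.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 corresponds to 16-bit floats.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model shape, for illustration only.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)
print(f"{size / 2**30:.2f} GiB per sequence")
```

Under these assumptions a single 8,192-token sequence needs 1 GiB of cache, and the figure scales linearly with both context length and the number of concurrent sequences.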

This leads to an unpleasant situation: the GPU sits idle during low-load periods but cannot accommodate additional models because its memory is occupied. During peak moments, the system begins to struggle as everything hits the same memory limit.

Put simply, current solutions are poor at sharing GPU memory dynamically among multiple models based on which one is actively working and which is idle.

Virtual Memory: Not a New Idea, But Largely Unexplored for GPUs

On regular computers, this problem was solved long ago. The operating system uses virtual memory: if there isn't enough RAM, some data is temporarily moved to the disk and brought back as needed. Applications don't notice the difference – they operate as if they have as much memory as they require.

For GPU inference, such a mechanism has been largely missing in practice. In a typical serving setup, a graphics card either holds a model entirely or not at all, with no swapping or dynamic reallocation between tasks.

This is precisely the gap that two new open-source projects, kvcached and Sardeenz, have attempted to fill.

The kvcached and Sardeenz Approach

The idea behind both tools is essentially to bring the concept of virtual memory to the GPU environment for AI tasks.

kvcached manages the KV cache – the same working area that grows as the model runs. Instead of keeping the entire cache on the GPU, the system can move its fragments between VRAM, RAM, and disk, depending on how urgently they are needed at that moment. This allows the same hardware to handle more concurrent requests.
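The tiering idea described above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not kvcached's actual API or eviction policy: the class name `TieredBlockStore`, its methods, and the least-recently-used rule are all hypothetical, and "blocks" here are plain Python objects rather than GPU memory.

```python
from collections import OrderedDict


class TieredBlockStore:
    """Toy model of cache blocks spread over memory tiers, fastest first.

    Hypothetical illustration only: real systems move GPU pages, not dict entries.
    """

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks), fastest tier first.
        # The last tier is treated as unbounded, so its limit is ignored.
        self.names = [name for name, _ in capacities]
        self.limits = [limit for _, limit in capacities]
        self.tiers = [OrderedDict() for _ in capacities]

    def _demote_overflow(self):
        # Push least recently used blocks one tier down until each tier fits.
        for i in range(len(self.tiers) - 1):
            while len(self.tiers[i]) > self.limits[i]:
                block_id, data = self.tiers[i].popitem(last=False)
                self.tiers[i + 1][block_id] = data

    def put(self, block_id, data):
        # New or updated blocks always land in the fastest tier.
        for tier in self.tiers:
            tier.pop(block_id, None)
        self.tiers[0][block_id] = data
        self._demote_overflow()

    def get(self, block_id):
        # Accessing a block promotes it back to the fastest tier.
        for tier in self.tiers:
            if block_id in tier:
                data = tier.pop(block_id)
                self.tiers[0][block_id] = data
                self._demote_overflow()
                return data
        raise KeyError(block_id)

    def location(self, block_id):
        for name, tier in zip(self.names, self.tiers):
            if block_id in tier:
                return name
        return None


store = TieredBlockStore([("vram", 2), ("ram", 2), ("disk", None)])
for i in range(5):
    store.put(i, f"kv-block-{i}")
print({i: store.location(i) for i in range(5)})

store.get(0)  # touching the oldest block pulls it back into VRAM
print({i: store.location(i) for i in range(5)})
```

In this sketch the oldest block drifts down to disk as newer blocks arrive, and a single access brings it back to the fastest tier at the cost of demoting something else.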

Sardeenz operates at a higher level: it manages the models themselves. Its job is to fit multiple models onto a single GPU and dynamically allocate available memory among them. If one model is actively used, it receives more resources. If another is idle, it can be partially evicted to free up space for its neighbors. The model isn't completely unloaded – it just takes a back seat and can be brought back quickly when needed again.
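One way to picture this kind of allocation is the following minimal sketch. It is not Sardeenz's real scheduler: the `Model` class, the `rebalance` function, the `min_fraction` floor, and all sizes are hypothetical, chosen to mirror the description that idle models are partially evicted rather than fully unloaded.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    size_gb: float            # memory needed when fully resident
    last_used: float          # timestamp of the most recent request
    resident_gb: float = 0.0  # memory currently granted on the GPU


def rebalance(models, budget_gb, min_fraction=0.25):
    """Grant full residency to the most recently used models first.

    Every model keeps at least min_fraction of its size on the GPU
    (an assumed policy) so it can be restored quickly.
    """
    floor = {m.name: m.size_gb * min_fraction for m in models}
    if sum(floor.values()) > budget_gb:
        raise ValueError("budget too small to keep every model partially resident")

    spare = budget_gb - sum(floor.values())
    for m in sorted(models, key=lambda m: m.last_used, reverse=True):
        extra = min(m.size_gb - floor[m.name], spare)
        m.resident_gb = floor[m.name] + extra
        spare -= extra
    return {m.name: m.resident_gb for m in models}


# Hypothetical models and sizes, for illustration only.
models = [
    Model("chat", size_gb=16, last_used=100),
    Model("docs", size_gb=12, last_used=50),
    Model("code", size_gb=20, last_used=10),
]
plan = rebalance(models, budget_gb=30)
print(plan)
```

With a 30 GB budget, the most recently used model stays fully resident, the next one keeps most of its memory, and the idle one shrinks to its floor instead of being unloaded.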

Additionally, Sardeenz includes a web interface for managing models: you can see in real time what's loaded, what's idle, and intervene manually if necessary.

Why This Matters in Practice

If you have one powerful GPU and several models that are needed at different times – for example, one is active during the day, another at night – then without such a tool, you're forced to either keep them all in memory (and waste resources) or reload them manually (and lose time with each swap). Both options are inconvenient.

With dynamic memory management, the situation changes: the system itself determines what's important at the moment and allocates resources accordingly. The GPU's load becomes more balanced, which means the infrastructure cost per unit of useful work decreases.

This is especially relevant for small teams and companies that cannot afford a dedicated server for each model, or for those working with cloud GPUs, where every idle GPU hour costs real money.

Still an Experiment, but the Idea Is Taking Shape

It's important to understand: both projects are in their early stages. This isn't a ready-made enterprise solution with support and guarantees. These are open-source tools that offer a specific approach to a long-standing problem.

The idea itself – bringing the principles of virtual memory to the world of GPU inference – is not new in theory. However, its practical implementation in the form of ready-to-use tools is only just emerging. And that, perhaps, is the main point: not the discovery of a principle, but the appearance of a working prototype that you can actually try.

The question of performance under a real workload remains open. How noticeable is the latency when moving the cache between memory tiers? How does the system behave when all models become active simultaneously? The answers will emerge as the tools are tested in real-world conditions.
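Until real measurements are available, the latency question can at least be framed with simple arithmetic: transfer time is roughly data size divided by link bandwidth. The sketch below is only that rough model. The bandwidth figures are illustrative assumptions, not benchmarks of kvcached, Sardeenz, or any particular hardware, and should be replaced with values measured on your own system.

```python
GIB = 2**30


def transfer_seconds(size_bytes, bandwidth_bytes_per_s, fixed_overhead_s=0.0):
    """Rough time to move data between memory tiers: overhead + size / bandwidth."""
    return fixed_overhead_s + size_bytes / bandwidth_bytes_per_s


# Assumed, illustrative bandwidths; measure your own hardware before relying on them.
ASSUMED_BANDWIDTH = {
    "ram_to_vram": 25 * GIB,   # hypothetical host-to-GPU link
    "disk_to_vram": 3 * GIB,   # hypothetical fast SSD read path
}

for path, bandwidth in ASSUMED_BANDWIDTH.items():
    ms = transfer_seconds(1 * GIB, bandwidth) * 1000
    print(f"{path}: about {ms:.0f} ms per GiB")
```

Even under these optimistic assumptions, pulling a gigabyte back from disk takes noticeably longer than from RAM, which is why the choice of what to evict matters as much as the ability to evict.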

Nevertheless, the direction seems logical. Hardware is getting more expensive, models are becoming heavier, and the demand for infrastructure efficiency is only growing. Tools that help you get more out of what you already have – not by purchasing new servers, but by managing existing ones more intelligently – will be in high demand.

Original Title: Running LLMs dynamically, in production, on limited resources, is hard. We think there's room for another approach…
Publication Date: Apr 2, 2026
Source: Red Hat (www.redhat.com), a global company developing open software platforms and infrastructure solutions with AI support.

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role: analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and shows how each technology contributed to the material.

1. Analyzing the Original Publication and Writing the Text: Claude Sonnet 4.6 (Anthropic). The neural network studies the original material and generates a coherent text.
2. Translation into English: Gemini 2.5 Pro (Google DeepMind).
3. Text Review and Editing: Gemini 2.5 Flash (Google DeepMind). Correction of errors, inaccuracies, and ambiguous phrasing.
4. Preparing the Illustration Description: DeepSeek-V3.2 (DeepSeek). Generating a textual prompt for the visual model.
5. Creating the Illustration: FLUX.2 Pro (Black Forest Labs). Generating an image based on the prepared prompt.
