Running large language models in a real-world setting often looks simpler than it is. Download a model, load it into GPU memory, get a response: it sounds easy. Once you move from experiments to production, though, the picture becomes much more complicated.
Imagine you have several models, each dedicated to a specific task. One answers user questions, another processes documents, and a third generates code. Each one takes up a significant chunk of GPU memory. Keeping them all loaded simultaneously is often impossible – there simply isn't enough memory. You have to choose: either buy expensive hardware or accept that models will be constantly unloaded and reloaded, which wastes time.
The Problem We Don't Talk About
GPU memory is an expensive and limited resource. Many modern language models need tens of gigabytes just to hold their weights. On top of that, there's the KV cache – a buffer that stores the attention keys and values for every token processed so far, so they don't have to be recomputed at each step. The longer the dialogue, the more space it consumes.
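To make the growth concrete, here is a rough back-of-the-envelope estimate of KV cache size. The formula is the standard one for transformer attention (keys plus values, per layer, per token); the model shape used below is a hypothetical example, not a description of any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Bytes needed to cache attention keys and values for num_tokens tokens."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Hypothetical shape: 32 layers, 8 KV heads of dimension 128, 16-bit values.
per_token = kv_cache_bytes(32, 8, 128, num_tokens=1)
full_context = kv_cache_bytes(32, 8, 128, num_tokens=32_768)

print(per_token // 1024, "KiB per token")          # 128 KiB per token
print(full_context / 2**30, "GiB for 32k tokens")  # 4.0 GiB for 32k tokens
```

Under these assumptions a single 32k-token conversation takes about 4 GiB on top of the weights, and every concurrent request adds its own share.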
This leads to an unpleasant situation: the GPU sits idle during low-load periods but cannot accommodate additional models because its memory is occupied. During peak moments, the system begins to struggle as everything hits the same memory limit.
Put simply, current solutions are not good at sharing GPU memory dynamically among multiple models, shifting it toward whichever model is busy and away from those that are idle.
Virtual Memory: Not a New Idea, But Largely Unexplored for GPUs
On regular computers, this problem was solved long ago. The operating system uses virtual memory: if there isn't enough RAM, some data is temporarily moved to the disk and brought back as needed. Applications don't notice the difference – they operate as if they have as much memory as they require.
For GPU inference, such a mechanism has been largely absent in practice. A typical serving setup either holds a model on the graphics card entirely or not at all, with no swapping or dynamic reallocation between tasks.
This is precisely the gap that two new open-source projects, kvcached and Sardeenz, have attempted to fill.
The kvcached and Sardeenz Approach
The idea behind both tools is essentially to bring the concept of virtual memory to the GPU environment for AI tasks.
kvcached manages the KV cache – that same working area that expands as the model runs. Instead of keeping the entire cache on the GPU, the system can move its fragments between VRAM, RAM, and the disk, depending on how urgently they are needed at that moment. This allows for handling more concurrent requests on the same hardware.
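The tiering idea described above can be illustrated with a toy sketch. This is not kvcached's code or API: the class name, the tier names, and the least-recently-used demotion policy are all assumptions chosen for illustration, and capacities are counted in abstract blocks rather than bytes.

```python
from collections import OrderedDict


class TieredBlockCache:
    """Toy LRU cache: cold blocks are demoted from fast tiers to slower ones."""

    def __init__(self, capacities):
        # capacities is ordered fastest to slowest, e.g. {"vram": 2, "ram": 2, "disk": 4}
        self.order = list(capacities)
        self.capacities = dict(capacities)
        self.tiers = {name: OrderedDict() for name in self.order}

    def _insert(self, level, block_id, data):
        name = self.order[level]
        tier = self.tiers[name]
        tier[block_id] = data
        tier.move_to_end(block_id)  # mark as most recently used
        if len(tier) > self.capacities[name]:
            victim_id, victim = tier.popitem(last=False)  # least recently used
            if level + 1 < len(self.order):
                self._insert(level + 1, victim_id, victim)  # demote one tier down
            # Otherwise the block falls off the slowest tier and is dropped.

    def put(self, block_id, data):
        for tier in self.tiers.values():
            tier.pop(block_id, None)  # avoid duplicates across tiers
        self._insert(0, block_id, data)

    def get(self, block_id):
        for name in self.order:
            if block_id in self.tiers[name]:
                data = self.tiers[name].pop(block_id)
                self._insert(0, block_id, data)  # promote back to the fastest tier
                return data
        return None

    def location(self, block_id):
        for name in self.order:
            if block_id in self.tiers[name]:
                return name
        return None


cache = TieredBlockCache({"vram": 2, "ram": 2, "disk": 4})
for block in ("a", "b", "c"):
    cache.put(block, f"kv-data-{block}")
print(cache.location("a"))  # ram (pushed out of vram by "c")
cache.get("a")
print(cache.location("a"), cache.location("b"))  # vram ram
```

A real system would move GPU tensors rather than Python objects and would have to hide the transfer latency, but the bookkeeping has the same shape: hot blocks stay close to the compute, cold ones drift outward.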
Sardeenz operates at a higher level: it manages the models themselves. Its job is to fit multiple models onto a single GPU and dynamically allocate available memory among them. If one model is actively used, it receives more resources. If another is idle, it can be partially evicted to free up space for its neighbors. The model isn't completely unloaded – it just takes a back seat and can be brought back quickly when needed again.
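A minimal sketch of that allocation policy might look like the following. It is not Sardeenz's implementation: the `Model` class, the per-model memory floor, and the equal split among active models are hypothetical simplifications meant only to show the idea of idle models shrinking without being unloaded.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    min_gb: float         # assumed floor kept resident even while idle
    active: bool = False


def rebalance(total_gb, models):
    """Give every model its floor, then split the remainder among active models."""
    floor = sum(m.min_gb for m in models)
    if floor > total_gb:
        raise ValueError("memory floors alone exceed the GPU's capacity")
    spare = total_gb - floor
    active = [m for m in models if m.active]
    share = spare / len(active) if active else 0.0
    return {m.name: m.min_gb + (share if m.active else 0.0) for m in models}


models = [
    Model("chat", min_gb=16, active=True),
    Model("docs", min_gb=10),
    Model("code", min_gb=14),
]
print(rebalance(80, models))  # {'chat': 56.0, 'docs': 10.0, 'code': 14.0}

models[2].active = True       # the code model wakes up
print(rebalance(80, models))  # {'chat': 36.0, 'docs': 10.0, 'code': 34.0}
```

The idle models keep only their floor, so they can resume without a full reload, while the active ones absorb whatever is left.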
Additionally, Sardeenz includes a web interface for managing models: you can see in real time what's loaded, what's idle, and intervene manually if necessary.
Why This Matters in Practice
Suppose you have one powerful GPU and several models that are needed at different times – for example, one is active during the day, another at night. Without such a tool, you're forced to either keep them all in memory (and waste resources) or reload them manually (and lose time with each swap). Both options are inconvenient.
With dynamic memory management, the situation changes: the system itself determines what's important at the moment and allocates resources accordingly. The GPU's load becomes more balanced, which means the infrastructure cost per unit of useful work decreases.
This is especially relevant for small teams and companies that cannot afford a dedicated server for each model, or for those working with cloud GPUs, where every idle hour still costs real money.
Still an Experiment, but the Idea Is Taking Shape
Both projects are in their early stages. This isn't a ready-made enterprise solution with support and guarantees. These are open-source tools that offer a specific approach to a long-standing problem.
The idea itself – bringing the principles of virtual memory to the world of GPU inference – is not new in theory. However, its practical implementation in the form of ready-to-use tools is only just emerging. And that, perhaps, is the main point: not the discovery of a principle, but the appearance of a working prototype that you can actually try.
The question of performance under a real workload remains open: how noticeable are the latencies when moving the cache between memory tiers? How does the system behave when all models become active simultaneously? The answers to these questions will emerge as the tools are tested in real-world conditions.
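Until measurements are available, a lower bound on the latency can at least be estimated as size divided by bandwidth. The bandwidth figures below are assumed round numbers for illustration, not benchmarks of either tool or of any particular hardware.

```python
def transfer_ms(size_gib, bandwidth_gib_per_s):
    """Best-case time in milliseconds to move size_gib at the given bandwidth."""
    return size_gib * 1000 / bandwidth_gib_per_s


# Assumed effective bandwidths in GiB/s (illustrative placeholders only).
ASSUMED_BANDWIDTH = {"ram_to_vram": 25.0, "disk_to_vram": 5.0}

for path, bandwidth in ASSUMED_BANDWIDTH.items():
    print(path, transfer_ms(1.0, bandwidth), "ms per GiB")
# ram_to_vram 40.0 ms per GiB
# disk_to_vram 200.0 ms per GiB
```

Even under these optimistic assumptions, pulling several gigabytes of cache back from disk takes on the order of a second, which is why the behavior under simultaneous load matters so much.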
Nevertheless, the direction seems logical. Hardware is getting more expensive, models are becoming heavier, and the demand for infrastructure efficiency is only growing. Tools that help you get more out of what you already have – not by purchasing new servers, but by managing existing ones more intelligently – will be in high demand.