One of the most expensive problems in modern AI is not so much about intelligence as about memory. The longer a conversation with a language model, the more context it needs to retain, and the more memory it requires. And not just any memory, but very fast and expensive memory. This is a large part of why maintaining big AI systems is so costly, and why conventional devices struggle to run powerful models locally.
Google Research has proposed a solution that seems surprisingly elegant for such a “heavy” problem. The algorithm is called TurboQuant, and its core idea is to radically compress the data a model stores in its working memory without sacrificing the quality of its responses.
What is AI's working memory, and why is it so “heavy”?
When a language model engages in a dialogue, it doesn't reread everything from scratch with each new message. Instead, it saves intermediate text processing results in a special area – the so-called KV cache. To put it simply, it's like a working notepad: the model jots down important details from the already processed text to avoid starting its calculations all over again.
The problem is that this “notepad” grows very quickly. The longer the context, the more data needs to be stored. And this data must be stored in fast memory directly on the accelerators (graphics cards or specialized chips), which is already scarce. This is why processing long documents or multi-turn dialogues is so resource-intensive.
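To get a feel for how quickly the “notepad” grows, here is a rough back-of-the-envelope sketch. It assumes the common layout in which the cache holds one key vector and one value vector per layer, per attention head, per token; the model shape below is hypothetical and chosen only for illustration, not taken from any model mentioned in this article.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Approximate KV cache size in bytes.

    Assumes one key and one value vector per layer, per head, per token,
    stored at bytes_per_value bytes each (2 bytes = 16-bit floats).
    """
    keys_and_values = 2
    return (keys_and_values * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


# Hypothetical model shape, for illustration only.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for tokens in (1_000, 32_000, 128_000):
    size_gib = kv_cache_bytes(seq_len=tokens, **config) / 2**30
    print(f"{tokens:>7} tokens -> {size_gib:.2f} GiB")
```

With these made-up numbers, every token costs 128 KiB of cache, and the total grows linearly with the length of the context.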
Until now, the industry has dealt with this by scaling up hardware. TurboQuant offers a different approach – rethinking the logic of data storage.
Polar Coordinates Instead of Bulky Tables
TurboQuant is based on two methods that work in tandem.
The first is called PolarQuant. Instead of storing data as cumbersome multidimensional coordinates, the algorithm converts it into a polar system, meaning it only remembers the direction and distance. It's like the difference between describing a person's location with a full address versus simply saying, “two hundred meters north.” For neural networks, as it turns out, the direction of a vector is far more important than its precise coordinates, and this allows for significant savings.
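The “direction and distance” idea can be shown with a deliberately simplified sketch. It is not the published PolarQuant algorithm, only an illustration of the principle: consecutive coordinates are treated as 2D points, each point is converted to a radius and an angle, and the angle is snapped to one of a few evenly spaced directions. The function names and the choice of 4 bits per angle are assumptions made for this example, and the radius is kept at full precision here.

```python
import math


def to_polar(x, y):
    """Convert a 2D point to (radius, angle in radians)."""
    return math.hypot(x, y), math.atan2(y, x)


def quantize_angle(angle, bits):
    """Snap an angle to one of 2**bits evenly spaced directions; return its index."""
    levels = 2 ** bits
    step = 2 * math.pi / levels
    return round(angle / step) % levels


def compress_pairs(vector, bits=4):
    """Store each coordinate pair as (radius, small integer angle index).

    Expects a vector with an even number of coordinates.
    """
    codes = []
    for x, y in zip(vector[0::2], vector[1::2]):
        radius, angle = to_polar(x, y)
        codes.append((radius, quantize_angle(angle, bits)))
    return codes


def decompress_pairs(codes, bits=4):
    """Rebuild an approximate vector from the radii and angle indices."""
    step = 2 * math.pi / (2 ** bits)
    restored = []
    for radius, index in codes:
        angle = index * step
        restored.extend((radius * math.cos(angle), radius * math.sin(angle)))
    return restored


vector = [0.3, -1.2, 2.0, 0.5]
codes = compress_pairs(vector, bits=4)
restored = decompress_pairs(codes, bits=4)
```

Because each pair keeps its exact length and its direction is off by at most half a step, the restored vector points in almost the same direction as the original, even though each angle now fits into 4 bits.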
The second method is QJL. It acts as a corrector: when compression is high, small distortions inevitably appear, and QJL “moves” this noise into a mathematically safe area where it doesn't affect the final calculations. This allows the algorithm to be aggressive with compression without sacrificing accuracy.
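One way to read “mathematically safe” is that the rounding error averages out instead of pushing the result in one direction. The sketch below illustrates that with a one-bit quantized Johnson-Lindenstrauss style estimator, which is what the name QJL is assumed to refer to here: a key is reduced to the signs of a random projection plus its length, the query is left unquantized, and the dot product between them can still be estimated without systematic bias. The helper names and sizes are hypothetical, and how this step is wired into TurboQuant itself is not shown.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def make_projection(num_rows, dim, seed=0):
    """Random Gaussian matrix shared by keys and queries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_rows)]


def encode_key(projection, key):
    """Keep one sign bit per projection row, plus the key's length."""
    signs = [1 if dot(row, key) >= 0 else -1 for row in projection]
    return signs, math.sqrt(dot(key, key))


def estimate_dot(projection, query, encoded_key):
    """Estimate <query, key> from the sign bits; the query is not quantized."""
    signs, key_norm = encoded_key
    total = sum(dot(row, query) * s for row, s in zip(projection, signs))
    # sqrt(pi/2) undoes the shrinkage caused by keeping only the signs.
    return math.sqrt(math.pi / 2) * key_norm * total / len(projection)


key = [0.5, -1.0, 2.0, 0.25]
query = [1.0, 0.5, 1.5, -0.5]
projection = make_projection(num_rows=4096, dim=len(key))

exact = dot(query, key)
approx = estimate_dot(projection, query, encode_key(projection, key))
print(f"exact={exact:.3f}  estimated={approx:.3f}")
```

The individual sign bits are extremely crude, but their errors cancel on average, so the estimate lands close to the exact value as the number of projection rows grows.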
Together, both methods make it possible to compress the KV cache 6-fold while maintaining the same quality of responses. In tests on the Gemma and Mistral models, TurboQuant not only reduced memory consumption but also sped up computations by up to 8 times on NVIDIA H100 chips.
Why This Matters Beyond the Lab
Putting the technical details aside, the implications are quite obvious.
First, cost reduction. Less memory means less hardware, which means it's cheaper to run the models. For companies spending huge sums on AI infrastructure, this translates to direct savings.
Second, accessibility. If models start running on less memory, it will become easier to run them on conventional devices – laptops, phones, and local servers. Today, powerful models require specialized hardware partly because of their high memory consumption.
Third, context scale. The same resources that previously allowed for processing, say, 10 pages of text can now cover 60. This directly affects how long and coherent conversations with AI can be, and how large a document it can analyze in a single pass.
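The arithmetic behind “10 pages become 60” is simple enough to spell out. In the sketch below, the memory budget and the per-token cache cost are hypothetical numbers picked for illustration; only the sixfold ratio comes from the reported results.

```python
def tokens_that_fit(memory_budget_bytes, bytes_per_token, compression_ratio=1):
    """How many tokens of KV cache fit into a fixed memory budget."""
    return int(memory_budget_bytes * compression_ratio // bytes_per_token)


budget = 8 * 2**30        # hypothetical: 8 GiB set aside for the cache
per_token = 128 * 2**10   # hypothetical: 128 KiB of cache per token

baseline = tokens_that_fit(budget, per_token)
compressed = tokens_that_fit(budget, per_token, compression_ratio=6)
print(baseline, compressed, compressed / baseline)
```

Whatever the actual budget is, the context that fits into it scales with the compression ratio.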
A Reaction Few Expected
The publication of the research had an unexpected effect on the financial markets. The stock prices of memory manufacturers – companies that profit from the ever-increasing storage capacity in data centers – declined. The logic is simple: if AI systems suddenly require six times less memory, the demand for the corresponding “hardware” could shrink.
However, analysts are divided. Some see it as a direct blow to the memory market. Others point to the so-called Jevons paradox: when a technology becomes more efficient, it gets used more actively, and as a result, overall resource consumption doesn't fall, but rises. If AI becomes cheaper to operate, it will likely be applied in far more scenarios, and the total demand for memory might remain the same or even increase.
The comparison to DeepSeek, which commentators are already drawing, is also relevant: the Chinese model showed that high efficiency is achievable with modest resources. TurboQuant moves in the same direction, but it targets the memory a model uses while it is running, not the resources spent on training it.
It's important to understand that for now, TurboQuant remains a laboratory development. Google plans to present it at the ICLR 2026 conference, where the underlying methods will be described in detail. There are still several steps to go before its actual implementation in products or widespread adoption.
There are open questions as well. How well does the algorithm perform on architectures other than those it was tested on? How does it handle very long contexts or non-standard tasks? How can it be integrated into existing systems without major overhauls?
None of these questions diminish the significance of the finding – but they serve as a reminder that between “it works in an experiment” and “it works everywhere,” there is usually a great deal of engineering work.
Nevertheless, the very fact that the memory problem in AI is being solved not by scaling up hardware but by rethinking the underlying mathematics is a truly interesting signal. An industry accustomed to buying its way forward is increasingly starting to compute it instead. 🧮