Published on March 6, 2026

FlashOptim: How to Compress a Neural Network Without Losing Quality

What if you could train a massive neural network using half the memory – without breaking anything? That's exactly what the creators of FlashOptim are exploring.

Computer Science, 11–16 min read

Author: Dr. Kim Lee

«When I was digging into this paper, what struck me wasn't just the compression itself, but how rigorously the authors defined the boundaries where it's safe. It's rare to see something proven mathematically, instead of just 'validated by experiments.' I'd like to believe this kind of approach will become the standard, not the exception. And it makes you wonder: how soon will techniques like this be built into standard libraries, so that researchers can stop worrying about bytes and just run it and forget it?» – Dr. Kim Lee

Memory Consumption in Large Language Models

A Problem Everyone Sees, But That's Harder to Solve Than It Looks

Imagine you want to train a large language model – say, something like the ones that can write text, answer questions, and carry on a conversation. Let's assume the model has 7 billion parameters. That sounds impressive, but let's translate it into practical terms: how much memory do you need to train all of that?

Each parameter in the model is a floating-point number. And during training, the neural network remembers not just the number itself, but also several of its “companions”: the gradient (which way to adjust the parameter) and the optimizer's state variables (something like a memory of previous training steps to move more accurately in the right direction). In 32-bit precision, the parameter and each of these “companions” take up 4 bytes apiece. In total, a single parameter can hog up to 16 bytes of memory.

Multiply 16 bytes by 7 billion parameters, and you get about 112 gigabytes of accelerator memory. That's more than most researchers have. Modern professional GPUs, like those used in science labs, often have 40 or 80 gigabytes of memory – and even that isn't enough. Training large models becomes a task accessible only to organizations with expensive infrastructure.

This is where FlashOptim comes in – a set of techniques designed to radically reduce memory consumption during training without sacrificing model quality.

Understanding Mixed Precision Training

What Is “Mixed Precision” and Why Is It Such a Big Deal?

Before diving into FlashOptim, you need to understand one basic principle: mixed-precision training. It sounds complicated, but the idea is simple.

Numbers in a computer can be stored with varying precision. For example, a 32-bit number is like a detailed notebook where you can jot down many decimal places. A 16-bit number is a shorter, less precise record, but it takes up half the space. When training neural networks, 32-bit precision is necessary for some operations (like optimizer weight updates), while 16-bit is sufficient for others (like the actual computations in the network).

Mixed precision is a compromise: we use 16-bit numbers where we can and 32-bit numbers where accuracy is critical. This already saves memory, but not enough. The optimizer still stores its internal variables in 32 bits, and they're the ones that “eat up” most of the memory.

FlashOptim goes a step further by proposing two key techniques: smart splitting of the master weights and compression of the optimizer's state variables.

FlashOptim Technique One: Smart Weight Splitting

During standard mixed-precision training, there's a so-called “master copy” of the weights – a 32-bit version of each parameter that the optimizer updates at every step. This copy is needed because small changes can simply get “lost” with 16-bit parameter storage due to insufficient precision. The master copy acts as an insurance policy against these losses.
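To see why the master copy matters, here is a minimal sketch in plain Python. It emulates bfloat16 by keeping only the top 16 bits of a 32-bit float. The helper `to_bf16`, the starting weight of 1.0, and the update size of 0.001 are all illustrative assumptions, and an ordinary Python float stands in for the high-precision master copy.

```python
import struct

def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (emulated via float32 bit tricks)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # keep sign, exponent, 7 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]

update = 0.001  # hypothetical step, smaller than the bf16 spacing near 1.0
steps = 10

# (a) Weight kept only in 16 bits: each small update is rounded away.
w16_only = 1.0
for _ in range(steps):
    w16_only = to_bf16(w16_only + update)

# (b) High-precision master copy accumulates; 16-bit view is derived from it.
master = 1.0
for _ in range(steps):
    master += update
w16_from_master = to_bf16(master)

print(w16_only)         # still 1.0: the updates were lost
print(w16_from_master)  # above 1.0: the accumulated change survives
```

The point of the sketch is only the contrast between the two loops: the same ten updates vanish in case (a) and survive in case (b).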

But storing a full 32-bit copy for every parameter is wasteful. The authors of FlashOptim propose a different approach called Master Weight Splitting. The idea is not to store the entire 32-bit copy, but only the difference between it and the 16-bit version – the so-called delta.

Imagine you have a high-resolution city map (32-bit) and a simplified tourist map (16-bit). Instead of carrying both, you take the tourist map and a small slip of paper with notes like, “this street is a bit wider here, and that building was demolished”. This slip of paper is the delta.

The FlashOptim authors took this classic approach a step further: they proved that this delta can be stored not in 16 bits, but in just 8 bits – provided that the quantization error (the loss of precision when converting to fewer bits) does not exceed a certain threshold. They derived a precise mathematical bound for this error and showed that if a parameter update fits within the allowable range, an 8-bit delta works perfectly. If the update goes beyond the limits, the system can be carefully scaled or, in rare cases, temporarily switched to a 16-bit format.

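As a result, the parameters are stored as a 16-bit copy plus an 8-bit delta: a total of 3 bytes, compared with 4 bytes for the 32-bit master copy alone and 6 bytes once the separate 16-bit working copy is counted. The saving seems small, but with billions of parameters, the numbers become impressive.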
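A rough sketch of the splitting idea follows. It is not the authors' exact scheme: the scaling rule (mapping half of one bfloat16 spacing onto the int8 range) and the helper names `to_bf16`, `bf16_spacing`, `split`, and `merge` are assumptions made for illustration. It relies on the fact that rounding to the nearest 16-bit value leaves a residual no larger than half the spacing between neighbouring 16-bit values.

```python
import math
import struct

def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (emulated via float32 bit tricks)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def bf16_spacing(w16: float) -> float:
    """Gap between neighbouring bfloat16 values at the magnitude of w16."""
    if w16 == 0.0:
        return 0.0
    _, exp = math.frexp(w16)   # w16 = m * 2**exp with 0.5 <= |m| < 1
    return 2.0 ** (exp - 8)    # bfloat16 keeps 8 significant bits

def split(w: float) -> tuple[float, int]:
    """Store w as a 16-bit value plus a signed 8-bit delta (3 bytes in total)."""
    w16 = to_bf16(w)
    gap = bf16_spacing(w16)
    if gap == 0.0:
        return w16, 0
    # The residual lies within +/- gap/2; map that range onto [-127, 127].
    delta = round((w - w16) / gap * 254)
    return w16, max(-127, min(127, delta))

def merge(w16: float, delta: int) -> float:
    """Reconstruct an approximation of the original high-precision weight."""
    return w16 + delta / 254 * bf16_spacing(w16)

w = 0.123456789
w16, delta = split(w)
print(abs(w16 - w))                # error of the 16-bit copy alone
print(abs(merge(w16, delta) - w))  # much smaller error with the 8-bit delta
```

In this sketch the reconstruction error is bounded by the 16-bit spacing divided by 508, which is the kind of explicit bound the text refers to, although the paper's own bound may be stated differently.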

FlashOptim Technique Two: Compressing Optimizer Memory via Companding

Now for the most interesting part – the optimizer's state variables. To understand this, we need a quick explanation of why they exist at all.

The popular AdamW optimizer, used to train most modern language models, tracks two “moments” for each parameter. The first moment is the average value of recent gradients (roughly, which direction we've been moving in lately). The second moment is the average of the squared gradients (how much that direction has been changing). Knowing both numbers allows AdamW to take adaptive steps: moving quickly where it's confident and cautiously where the gradients are “jumpy”.
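For readers who prefer code to metaphors, here is the textbook AdamW update for a single scalar parameter. The function name `adamw_step` is made up for this example, and the hyperparameter defaults are commonly used values rather than anything taken from the FlashOptim paper. The two variables `m` and `v` are the moments that have to be kept in memory between steps.

```python
import math

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW step for a scalar weight w with gradient g at step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * g       # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * g * g   # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)          # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    # Adaptive step plus decoupled weight decay.
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
w, m, v = adamw_step(w, g=0.5, m=m, v=v, t=1)
print(w, m, v)
```

Every parameter in the model carries its own `m` and `v`, which is why the optimizer state ends up costing as much memory as it does.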

Think of an experienced rock climber who remembers not only where they're climbing but also how unstable the rock is at each point. That's the purpose of the two moments. Each is typically stored in a 32-bit format – totaling 8 bytes just for the optimizer state.

FlashOptim proposes storing these moments in an 8-bit format. But this creates a problem: a straightforward compression of a 32-bit number to 8 bits severely reduces precision. This works especially poorly for values that are widely scattered across a scale – and the moment values are just like that: most are small, but some large ones pop up.

To solve this, the authors use companding – a term from telephony formed from “compressing” and “expanding”. The idea is to non-linearly “squash” the scale of values before saving and “stretch” it back out when reading. This allows more 8-bit “slots” to be used for small values (where precision is needed) and fewer for large ones (where precision is less critical).

A good analogy is audio recording. It's important to capture quiet sounds in a recording accurately because they create the nuances of the music. Loud sounds are “coarser” and can tolerate greater error. That's why audio formats use a non-linear volume scale – quiet levels are marked more densely, loud ones more sparsely. Companding does the same thing for the numbers in the optimizer.

The FlashOptim authors developed two types of companding:

Analytical Logarithmic Companding. The compression function is based on a logarithmic scale – a mathematically sound choice that works well for the typical distributions of moment values. It can be set in advance and requires no additional training.
Learnable Companding. Here, the parameters of the compression function are themselves learned during model training. A small auxiliary network adapts to the specific value distributions for a particular task – this is potentially more accurate but slightly more complex to implement.

Both approaches allow each optimizer state variable to be stored in 1 byte instead of 4. For AdamW with its two moments, that's a reduction from 8 bytes to 2.
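The effect of logarithmic companding can be illustrated with the classic mu-law curve from telephony. This is a stand-in chosen for familiarity: the constant `MU = 255`, the function names, and the test value are assumptions, and the exact compression function used in FlashOptim may differ.

```python
import math

MU = 255.0  # mu-law constant from telephony; an assumption, not from the paper

def compress(x: float, scale: float) -> int:
    """Map x in [-scale, scale] to a signed 8-bit code on a logarithmic scale."""
    y = math.copysign(math.log1p(MU * abs(x) / scale) / math.log1p(MU), x)
    return round(y * 127)

def expand(code: int, scale: float) -> float:
    """Invert compress(): turn the 8-bit code back into an approximate value."""
    y = code / 127
    return math.copysign(scale * math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def linear_roundtrip(x: float, scale: float) -> float:
    """Plain 8-bit quantization with evenly spaced levels, for comparison."""
    return round(x / scale * 127) / 127 * scale

scale = 1.0    # largest magnitude in the (hypothetical) block of values
small = 0.001  # a typical small moment value

print(linear_roundtrip(small, scale))         # collapses to 0.0
print(expand(compress(small, scale), scale))  # stays close to 0.001
```

With evenly spaced levels the small value disappears entirely, while the companded code keeps it within a few percent, at the price of coarser steps near the top of the range.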

The Third Component: 16-Bit Gradients

Here, the FlashOptim authors aren't inventing anything new – they're simply sticking to established practice. In standard mixed-precision training, gradients have long been stored in a 16-bit format (FP16 or BF16). This is sufficient for accurate weight updates, and no drop in quality is observed. So, the 2 bytes per parameter for gradients remain unchanged.

Additionally, if you use a strategy of immediately freeing gradients (i.e., deleting them right after updating the weights without waiting for the step to end), the memory for gradients doesn't need to be counted as constantly occupied – bringing the total consumption down to 5 bytes per parameter.

Memory Consumption: Before and After FlashOptim

The Bottom Line in Bytes: Before and After

Now let's put it all together. With standard mixed-precision training using the AdamW optimizer, one parameter requires:

4 bytes – 32-bit master copy of the parameter,
2 bytes – 16-bit working parameter (used in computations),
2 bytes – gradient in BF16 format,
4 bytes – first moment of the optimizer,
4 bytes – second moment of the optimizer.

Total – 16 bytes per parameter. With FlashOptim, the picture changes:

2 bytes – 16-bit copy of the parameter,
1 byte – 8-bit delta,
2 bytes – gradient in BF16 format,
1 byte – first moment in 8-bit quantized format,
1 byte – second moment in 8-bit quantized format.

Total – 7 bytes. And with gradient freeing, it's 5 bytes. That's more than a two-fold reduction in memory consumption compared to the original 16 bytes.

For a model with 7 billion parameters, this means going from a requirement of 100+ gigabytes to around 35–50 gigabytes. That's the difference between “you need a special computing cluster” and “you can run it on a single powerful GPU”.
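The arithmetic behind these figures is easy to reproduce. The short script below uses the per-parameter byte counts listed above; the dictionary keys and the helper `total_gb` are just labels chosen for this example. It counts only weights, gradients, and optimizer state, so memory for activations and other buffers comes on top.

```python
# Bytes per parameter, taken from the two breakdowns in the text.
standard = {
    "master copy (32-bit)": 4,
    "working copy (16-bit)": 2,
    "gradient (BF16)": 2,
    "first moment (32-bit)": 4,
    "second moment (32-bit)": 4,
}
flashoptim = {
    "parameter (16-bit)": 2,
    "delta (8-bit)": 1,
    "gradient (BF16)": 2,
    "first moment (8-bit)": 1,
    "second moment (8-bit)": 1,
}

def total_gb(bytes_per_param: int, n_params: int) -> float:
    """Memory in decimal gigabytes for a given per-parameter cost."""
    return bytes_per_param * n_params / 1e9

n_params = 7_000_000_000
std_bytes = sum(standard.values())
flash_bytes = sum(flashoptim.values())
flash_freed_bytes = flash_bytes - flashoptim["gradient (BF16)"]

for label, b in [("standard", std_bytes),
                 ("FlashOptim", flash_bytes),
                 ("FlashOptim + gradient freeing", flash_freed_bytes)]:
    print(f"{label}: {b} bytes/param, {total_gb(b, n_params):.0f} GB")
# standard: 16 bytes/param, 112 GB
# FlashOptim: 7 bytes/param, 49 GB
# FlashOptim + gradient freeing: 5 bytes/param, 35 GB
```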

FlashOptim Experimental Results

What the Experiments Showed

The authors tested FlashOptim on several optimizers: SGD, AdamW, and Lion. The tasks included computer vision (image classification on the ImageNet dataset using ResNet and Vision Transformer architectures) and natural language processing tasks – including fine-tuning the Llama-3.1-8B model.

The result that the authors especially emphasize is this: no measurable degradation in quality was recorded in any of the tests. Image classification accuracy remained within the margin of statistical error. After fine-tuning with FlashOptim, language models showed comparable results on the perplexity metric (a measure of how confidently a model predicts text – the lower, the better) and other language understanding tasks.

This is a critically important finding. Compression methods often provide memory gains at the cost of model accuracy – and this trade-off is usually what stops them from being adopted. FlashOptim, according to the published experiments, allows you to bypass this compromise.

Another bonus is a reduction in checkpoint size. A checkpoint is a saved “snapshot” of the model's state at a specific point in training: parameters, optimizer states, everything. Since FlashOptim stores states more compactly, checkpoints are more than halved in size. When working with large models, this means significant savings in disk space and much faster file transfers.

FlashOptim Impact for Researchers and Small Labs

Why This Matters for Those Outside of Big Tech

For a long time, training large neural networks has been the domain of organizations with hundreds of GPUs and petabytes of storage. Smaller research groups, universities, and independent labs were effectively cut off from working with such models – not due to a lack of ideas or algorithms, but simply due to memory constraints.

FlashOptim changes this equation. It works as an add-on to existing optimizers – SGD, AdamW, Lion – and is compatible with popular deep learning frameworks like PyTorch. This means you don't have to rewrite all your training code to use it. You just need to swap out the optimizer, and the memory savings are achieved almost automatically.

Compatibility with the API (the software interface through which developers interact with libraries) isn't a technical whim; it's a fundamental choice. It's what determines whether a tool will actually be used or will remain a nice result in a research paper.

Behind FlashOptim: The Mathematics of Quantization

How Complicated Is What's Happening Inside?

Behind the ease of use lies some non-trivial mathematics. Quantization – compressing a number from one format to another with fewer bits – always introduces an error. The question is how critical this error is to the final quality of the model.
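How large that error can get is something you can check directly. The sketch below is a generic uniform quantizer, not FlashOptim's own code. The function `quantize` and the test range of -1 to 1 are assumptions for illustration. It confirms the standard fact that rounding to the nearest level never misses by more than half a step, and that each extra bit halves the step.

```python
def quantize(x: float, bits: int, lo: float = -1.0, hi: float = 1.0):
    """Round x to the nearest of 2**bits evenly spaced levels in [lo, hi]."""
    levels = 2 ** bits - 1
    step = (hi - lo) / levels
    code = round((x - lo) / step)  # integer code that would be stored
    return lo + code * step, step  # reconstructed value and the step size

# Sweep the range and record the worst-case error for several bit widths.
xs = [i / 1000 - 1 for i in range(2001)]
worst = {}
for bits in (4, 8, 16):
    step = quantize(0.0, bits)[1]
    worst[bits] = max(abs(quantize(x, bits)[0] - x) for x in xs)
    print(bits, worst[bits], step / 2)
```

Knowing the worst case in advance is what turns “we tried it, and it works” into a guarantee: if the tolerable error is known, the required number of bits follows.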

For weights and gradients, quantization error can be catastrophic – which is why model parameters are usually stored in high-precision formats. For optimizer states – the first and second moments of AdamW – the situation is different: these numbers are less sensitive to precision, and an 8-bit representation with a well-chosen companding function turns out to be sufficient.

The authors of FlashOptim conducted a detailed analysis: they formally proved that the quantization error from storing the weight delta in 8 bits remains below the threshold where it would affect training. This isn't just an empirical observation (“we tried it, and it works”), but a mathematically grounded statement – which makes the approach significantly more reliable.

An analogy from everyday life: if you round a long number like “3.14159265358979” to “3.14”, it's perfectly acceptable for most practical calculations. But if that number is used to navigate a spacecraft, every digit counts. FlashOptim methodically figures out in which of the neural network's “calculations” a “3.14” is good enough – and uses this knowledge to save memory where it's safe.

Two Companding Modes for Optimizer States

Two Companding Modes: A Choice for the Task

An interesting detail is that FlashOptim doesn't force a single method for compressing optimizer states. The two companding options are geared toward different situations:

Analytical logarithmic companding is suited for cases where predictability and simplicity are needed. The function is predefined, requires no extra computation during training, and works well for typical distributions of moment values. It's the “standard edition”.

Learnable companding is the “deluxe edition”. The parameters of the compression function are fine-tuned for a specific model and task during training. This involves slightly more overhead but potentially yields more accurate quantization for non-standard distributions. Practical experiments showed that both methods work well – which itself speaks to the robustness of the overall concept.

FlashOptim in Practice: Key Benefits and Implications

What This Means in Practice

FlashOptim isn't a revolution in neural network architecture or a breakthrough in optimization theory. It's a piece of precision engineering: identifying where memory is wasted during training, proving that it can be saved there without harming the result, and implementing it neatly and compatibly with existing code.

The result is a memory requirement cut by more than half while maintaining model quality, compatibility with popular optimizers and frameworks, smaller checkpoint sizes, and the ability to train 7-billion-parameter-scale models on hardware with 35–50 gigabytes of memory instead of 100+.

In a world where access to computing resources increasingly determines who can conduct cutting-edge research, tools like FlashOptim shift that imbalance toward greater accessibility. And that, perhaps, is a result no less important than the technique itself.

Source: https://arxiv.org/abs/2602.23349v1

Original Title: FlashOptim: Optimizers for Memory Efficient Training

Article Publication Date: Feb 26, 2026

Original Article Authors: Jose Javier Gonzalez Ortiz, Abhay Gupta, Chris Renard, Davis Blalock
