Training large language models is a costly process – and not just in money, but in time, energy, and carbon footprint. Any optimization that speeds up training without sacrificing quality is therefore a valuable find for the industry.
The AI21 Labs team published a study showing that language model training time can be cut by roughly 25% simply by changing how data is packed into batches. The approach was dubbed padding minimization. It sounds technically complex, but the essence is simple.
Efficiency Issues with Standard LLM Data Padding
What's the problem with the standard approach
When a model learns, data is fed in portions – batches. The problem is that texts vary in length: one sentence might consist of five words, another of fifty. But batched tensor computation requires all examples in a single batch to have identical length. Short texts therefore have to be artificially extended with empty tokens – this is exactly what padding is.
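A minimal sketch of what padding looks like on token IDs. The pad ID of 0 and the helper name are arbitrary choices for illustration, not any particular library's API:

```python
# Pad variable-length token sequences with a pad ID so the batch
# forms a rectangle (every row the same length).
def pad_batch(sequences, pad_id=0):
    max_len = max(len(s) for s in sequences)
    return [s + [pad_id] * (max_len - len(s)) for s in sequences]

batch = [[5, 12, 7], [3, 9], [1, 2, 3, 4, 5, 6]]
padded = pad_batch(batch)
# Every row now has length 6; the shorter rows end in pad tokens,
# e.g. [3, 9] becomes [3, 9, 0, 0, 0, 0].
```

The longest example dictates the width of the whole batch, which is why one fifty-word text in a batch of five-word texts wastes most of the tensor.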
The difficulty is that the model spends computational resources processing these empty tokens. It “reads” them and passes them through every layer, even though they carry no useful information – their contribution to the loss is typically masked out, but the compute is spent all the same. The more placeholders there are, the more the hardware works for nothing.
Usually this is mitigated with a simple method: grouping texts of roughly the same length into one batch. Whether a batch ends up with short examples or long ones, the texts in it are close in size, so fewer placeholders are needed. This helps, but the result remains less than ideal.
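The standard length-grouping trick can be sketched in a few lines. The function names and the padding-fraction metric below are illustrative, not from the AI21 paper:

```python
# Length-grouped batching: sort texts by length, then cut the sorted
# list into consecutive batches of similar-length texts.
def length_grouped_batches(sequences, batch_size):
    ordered = sorted(sequences, key=len)
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]

def padding_fraction(batches):
    """Share of the batch tensors occupied by padding tokens."""
    padded = sum(len(b) * max(len(s) for s in b) for b in batches)
    real = sum(len(s) for b in batches for s in b)
    return 1 - real / padded

# Toy sequences with lengths 1, 10, 2, 9, 3, 8.
seqs = [list(range(n)) for n in (1, 10, 2, 9, 3, 8)]
naive = [seqs[i:i + 2] for i in range(0, len(seqs), 2)]  # arrival order
grouped = length_grouped_batches(seqs, batch_size=2)     # sorted by length
# Here grouping cuts the padding share from about 39% to about 18%.
```

Even in this toy case the waste drops sharply, but it does not reach zero – which is the gap the AI21 approach targets.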
AI21 Labs Padding Minimization Method for Efficient Training
What AI21 Labs proposed
The researchers decided to go further. Instead of simply grouping texts by length, they began looking for the optimal combination of fragments in each batch so that the total number of placeholders would be minimal.
Simply put, the task is formulated like this: there is a set of texts of different lengths and a fixed batch size. You need to select a combination of texts so they pack as densely as possible – like in “Tetris”, where you need to avoid gaps.
To do this, the team developed an algorithm that works quickly and requires no changes to the model itself. It is universal: it can be applied to any architecture – GPT, LLaMA, or BERT. Hence the definition in the original publication – model-agnostic, meaning independent of the specific model.
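The article does not spell out AI21's exact algorithm, so the sketch below only illustrates the general idea with a classic heuristic, first-fit-decreasing bin packing: each “slot” is a fixed token budget, several short texts can share one slot, and padding fills only the leftover space. The slot-budget framing and all names are assumptions for illustration, not AI21's implementation:

```python
# First-fit-decreasing packing: place each text (represented by its
# token length, assumed <= slot_size) into the first slot with room,
# opening a new slot only when nothing fits.
def pack_first_fit_decreasing(lengths, slot_size):
    slots, free = [], []  # free[i] = remaining room in slots[i]
    for n in sorted(lengths, reverse=True):
        for i, room in enumerate(free):
            if n <= room:          # first slot where the text fits
                slots[i].append(n)
                free[i] -= n
                break
        else:                      # no slot had room: open a new one
            slots.append([n])
            free.append(slot_size - n)
    return slots

slots = pack_first_fit_decreasing([5, 50, 45, 10, 30, 60], slot_size=60)
# -> [[60], [50, 10], [45, 5], [30]]: short texts fill the gaps left
# by long ones, so padding covers only the remaining free space.
```

Any heuristic in this family is cheap to run and touches only the data pipeline, which is exactly why the approach can stay model-agnostic: GPT, LLaMA, or BERT all just receive denser batches.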
Training Performance and Model Quality Results
How it works in practice
AI21 Labs tested the approach when training a model with 1.5 billion parameters. They used standard datasets on which language models are usually trained. The result: training time was reduced by approximately 25%.
At the same time, the model's quality did not suffer. On test tasks, it showed the same results as a model trained in the conventional way. This is a solution without compromises: the same accuracy, but faster.
Importantly, the speedup is not achieved through architectural tricks or reduced numerical precision. The model simply stops wasting time processing emptiness. It sounds obvious, but until now no one had done this so consistently.
Benefits of Reduced Training Time and Resource Consumption
Why this matters
Training large models is one of the most expensive operations in modern machine learning. Large neural networks can train for weeks or months on thousands of GPUs. Cutting the time by a quarter is not just a convenience, but real resource saving.
For research labs, this means the ability to conduct more experiments in the same amount of time. For companies, it means lower infrastructure costs. For the industry as a whole, it means reduced energy consumption and carbon footprint.
Furthermore, the approach is universal. It can be applied to any model at any stage of training and with any data. There is no need to rewrite code or change the architecture. It is enough to change the way batches are formed, and you're done.
Future Outlook for Model Agnostic Training Optimizations
What's next
AI21 Labs did not specify whether they plan to release the code to the public, but the idea itself is transparent enough for reproduction. Implementations from the community will likely appear in the near future.
Interestingly, optimizations like this often stay in the shadows. They don't make loud headlines, and they don't shift the paradigm of working with models. But it is precisely such improvements – small, technical, and almost unnoticeable – that together make model training more accessible and efficient.
Perhaps in a few years, padding minimization will become standard practice. And then we will wonder how we managed without it before – much like we now wonder that models were once trained without “mixed precision” or “gradient checkpointing”.