Published February 12, 2026

How to Cut Language Model Training Time by 25% Without Quality Loss

Researchers at AI21 Labs have demonstrated that a simple optimization of how training data is packed into batches can significantly speed up LLM training without any changes to the neural network architecture.

Research
Source: AI21 Labs. Reading time: 4–5 minutes.

Training large language models is a costly process, and not just in money: it also consumes time, energy, and carbon budget. Any optimization that speeds up training without sacrificing quality is therefore a valuable win for the industry.

The AI21 Labs team published a study showing that language model training time can be cut by roughly 25% simply by changing how data is packed into batches. The approach was dubbed padding minimization. It sounds technically complex, but the essence is simple.

Efficiency Issues with Standard LLM Data Padding

What's the problem with the standard approach

When a model is trained, data is fed in portions called batches. The problem is that texts vary in length: one sentence might consist of five words, another of fifty. But all examples in a batch must have identical length so they fit into a single tensor. Short texts therefore have to be artificially extended with empty tokens; this is exactly what padding is.
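A minimal sketch of the mechanism (the function name and the choice of `0` as the padding token are illustrative, not from the paper):

```python
PAD = 0  # conventional ID for the padding token

def pad_batch(sequences):
    """Pad every token-ID sequence up to the longest one in the batch."""
    max_len = max(len(s) for s in sequences)
    return [s + [PAD] * (max_len - len(s)) for s in sequences]

batch = [[5, 12, 7], [3, 9, 14, 2, 8, 11, 6]]
padded = pad_batch(batch)
# Both examples now have length 7; the first one carries four empty tokens
# that the model will process anyway.
```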

The difficulty is that the model spends computational resources on these empty tokens. It “reads” them and passes them through every layer in both the forward and backward pass, even though they carry no useful information. The more placeholders there are, the more hardware cycles are wasted.

Usually, this is mitigated with a simple method: grouping texts of roughly the same length into one batch. If a batch contains only short examples, there are few placeholders; if only long ones, also few, since the texts are close in size. This helps, but the result remains less than ideal.
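The standard length-bucketing trick can be sketched in a few lines (the helper names are mine, chosen for illustration):

```python
def bucket_batches(lengths, batch_size):
    """Sort examples by length, then slice them into consecutive batches."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

def padding_waste(lengths, batches):
    # Wasted tokens: for each example, the gap to its batch's longest text.
    return sum(max(lengths[j] for j in b) - lengths[i]
               for b in batches for i in b)

lengths = [5, 50, 7, 48, 6, 52]
naive = [[0, 1], [2, 3], [4, 5]]       # batches in arrival order
bucketed = bucket_batches(lengths, 2)  # [[0, 4], [2, 3], [1, 5]]
# Arrival order wastes 132 padding tokens; bucketing cuts this to 44.
```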

AI21 Labs Padding Minimization Method for Efficient Training

What AI21 Labs proposed

The researchers decided to go further. Instead of simply grouping texts by length, they began looking for the optimal combination of fragments in each batch so that the total number of placeholders would be minimal.

Simply put, the task is formulated like this: there is a set of texts of different lengths and a fixed batch size. You need to select a combination of texts so they pack as densely as possible – like in “Tetris”, where you need to avoid gaps.

To do this, the team developed an algorithm that works quickly and requires no changes to the model itself. It is universal: it can be applied to any architecture – GPT, LLaMA, or BERT. Hence the definition in the original publication – model-agnostic, meaning independent of the specific model.
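The article does not spell out AI21's actual algorithm, so the following is only an illustrative greedy sketch of the combinatorial idea (a best-fit-decreasing heuristic; all names are hypothetical): place each text, longest first, into the open batch where it adds the least padding.

```python
def pack_batches(lengths, batch_size):
    """Greedy best-fit-decreasing packing to reduce total padding."""
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    batches = []  # each entry is a list of example indices
    for i in order:
        best, best_cost = None, None
        for b in batches:
            if len(b) == batch_size:
                continue  # batch already full
            cur_max = max(lengths[j] for j in b)
            new_max = max(cur_max, lengths[i])
            # Extra padding created by putting example i into batch b.
            cost = (new_max - cur_max) * len(b) + (new_max - lengths[i])
            if best_cost is None or cost < best_cost:
                best, best_cost = b, cost
        if best is None:
            batches.append([i])  # no open batch fits: start a new one
        else:
            best.append(i)
    return batches

batches = pack_batches([5, 50, 7, 48, 6, 52], batch_size=2)
# Texts of similar length end up together, e.g. the two ~50-token texts.
```

Because it only decides which examples share a batch, a heuristic like this slots into the data loader without touching the model, which is what makes the real method model-agnostic.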

Training Performance and Model Quality Results

How it works in practice

AI21 Labs tested the approach when training a model with 1.5 billion parameters. They used standard datasets on which language models are usually trained. The result: training time was reduced by approximately 25%.

At the same time, the model's quality did not suffer. On test tasks, it showed the same results as a model trained in the conventional way. This is a solution without compromises: the same accuracy, but faster.

Importantly, the speedup comes neither from architectural tricks nor from reduced numerical precision. Simply put, the model stops wasting time processing emptiness. It sounds obvious, but until now, no one had done this so consistently.

Benefits of Reduced Training Time and Resource Consumption

Why this matters

Training large models is one of the most expensive operations in modern machine learning. Large neural networks can train for weeks or months on thousands of GPUs. Cutting the time by a quarter is not just a convenience, but real resource saving.

For research labs, this means the ability to conduct more experiments in the same amount of time. For companies, it means lower infrastructure costs. For the industry as a whole, it means reduced energy consumption and carbon footprint.

Furthermore, the approach is universal. It can be applied to any model at any stage of training and with any data. There is no need to rewrite code or change the architecture. It is enough to change the way batches are formed, and you're done.

Future Outlook for Model Agnostic Training Optimizations

What's next

AI21 Labs did not specify whether they plan to release the code to the public, but the idea itself is transparent enough for reproduction. Implementations from the community will likely appear in the near future.

Interestingly, such optimizations often remain in the shadows. Loud headlines are not written about them, and they do not shift the paradigm of working with models. But it is precisely such improvements – small, technical, and almost unnoticeable – that, in sum, make model training more accessible and efficient.

Perhaps in a few years, padding minimization will become standard practice. And then we will wonder how we managed without it before – much like we now wonder that models were once trained without “mixed precision” or “gradient checkpointing”.

Original Title: Reducing LLM training waste with model-agnostic padding minimization
Publication Date: Feb 11, 2026
AI21 Labs (www.ai21.com) is an Israeli company building large language models and AI tools for working with text.

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item was selected as an event important for understanding AI development. Then a processing framework was defined: what needed clarification, what context to add, and where to place emphasis. This turned a single announcement into a coherent, meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Claude Sonnet 4.5 (Anthropic): Analyzing the Original Publication and Writing the Text. The neural network studies the original material and generates a coherent text.

2. Gemini 3 Pro (Google DeepMind): Translation into English.

3. Gemini 3 Flash Preview (Google DeepMind): Text Review and Editing. Correction of errors, inaccuracies, and ambiguous phrasing.

4. DeepSeek-V3.2 (DeepSeek): Preparing the Illustration Description. Generating a textual prompt for the visual model.

5. FLUX.2 Pro (Black Forest Labs): Creating the Illustration. Generating an image based on the prepared prompt.
