Published on March 10, 2026

How to Train AI on Million-Token Texts: A Game-Changing Idea

Researchers have proposed a method for distributing the processing of ultra-long texts across multiple GPUs, allowing models to be trained on contexts of up to one million tokens.

Infrastructure / Technical context (5–7 minute read)
Event Source: Hugging Face

One of the most notable trends in the development of language models is the growth of the so-called “context window.” In short, this is the amount of text the model can simultaneously hold in memory while processing a request. A few years ago, we were talking about a few thousand tokens. Today, it's hundreds of thousands or even millions (a token is roughly a word or part of a word).
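
To make the unit concrete, here is a quick illustration using the Hugging Face transformers library with an arbitrary publicly available tokenizer (GPT-2 is my choice for the demo; the publication doesn't prescribe one):

```python
# Quick illustration of what a "token" is: a tokenizer splits text into
# word-or-subword pieces, and the context window is measured in these pieces.
from transformers import AutoTokenizer  # pip install transformers

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer works for this demo

text = "Sequence parallelism splits very long texts across several GPUs."
ids = tokenizer(text)["input_ids"]

print(len(ids), "tokens")
print(tokenizer.convert_ids_to_tokens(ids))  # common words stay whole, rare ones split
```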

However, as the context grows, a serious engineering problem arises: training such models becomes physically difficult. Not just in the sense that it “requires effort,” but literally – it doesn't fit into the memory of a single graphics processing unit (GPU). And this is where the story of Ulysses Sequence Parallelism begins.

Why a Long Context Is a Headache for Hardware

When a model processes text, it doesn't just read it word by word. It builds connections between all parts of the text simultaneously, comparing every word with every other. This is called the attention mechanism. And the longer the text, the more such connections need to be calculated and stored in memory.

For short texts, this is fine. But imagine you need to hold not just a page in your head, but an entire book – and remember how every sentence relates to every other one. This is exactly what happens when working with a million-token context. A single GPU's memory simply can't handle it.
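
To get a feel for the numbers, here is a rough back-of-envelope calculation (my own illustrative figures, not numbers from the publication) for the naive case where the full matrix of token-to-token attention scores is stored:

```python
# Back-of-envelope: a naive attention implementation stores one score for every
# pair of tokens, so memory grows with the square of the sequence length.

def naive_score_matrix_gib(seq_len: int, bytes_per_score: int = 2) -> float:
    """Size of one seq_len x seq_len score matrix in GiB (fp16 scores assumed)."""
    return seq_len * seq_len * bytes_per_score / 1024**3

for n in (8_000, 128_000, 1_000_000):
    print(f"{n:>9} tokens -> {naive_score_matrix_gib(n):8.1f} GiB per head, per layer")

# Roughly: 8k tokens -> 0.1 GiB, 128k -> ~31 GiB, 1M -> ~1,860 GiB per head, per layer.
# Fused kernels such as Flash Attention avoid materializing this matrix, but the
# activations of a million-token sequence still overwhelm a single GPU.
```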

The standard solution is to split the model into parts and distribute it across multiple GPUs. But organizing this to work efficiently and without excessive latency is difficult.

The Idea of Sequence Parallelism

Ulysses Sequence Parallelism is an approach where a long sequence of tokens is divided among several GPUs, not by splitting the model, but by splitting the text itself. Each processor gets its own “chunk” of the input text and processes it.

The problem is that the attention mechanism is “global” by nature: to process one fragment correctly, you need to know what's happening in the others. Therefore, the GPUs need to periodically exchange information with each other.

The key idea behind DeepSpeed Ulysses, on which this approach is based, is to minimize this data exchange. Instead of shuffling the entire text between GPUs, each GPU sends only the intermediate attention data (its share of queries, keys, and values) that the others actually need, so that every GPU ends up computing complete attention for its own subset of attention heads. This makes the process significantly more efficient.

To put it simply, imagine several people reading different chapters of the same book and then briefly summarizing the key points for each other – instead of everyone rereading the whole thing from the beginning. The principle is roughly the same.
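
The sketch below is a deliberately simplified, single-process simulation of my own (not code from the publication) showing the data movement behind this idea: each "GPU" starts with its own slice of the sequence and all attention heads, and an all-to-all style exchange leaves it with the full sequence for just a few heads, which is enough to run ordinary attention locally.

```python
import torch

world_size = 4                      # number of GPUs, simulated here as list indices
seq_len, n_heads, head_dim = 16, 8, 32

# Each rank starts with its own chunk of the sequence but ALL attention heads:
# shape per rank: [seq_len / P, n_heads, head_dim]
per_rank_seq = [torch.randn(seq_len // world_size, n_heads, head_dim)
                for _ in range(world_size)]

# The "all-to-all" exchange: rank r sends the head slice destined for rank d,
# and receives the matching sequence chunks from every other rank.
heads_per_rank = n_heads // world_size
per_rank_heads = []
for dst in range(world_size):
    chunks = [shard[:, dst * heads_per_rank:(dst + 1) * heads_per_rank, :]
              for shard in per_rank_seq]
    # After the exchange, rank `dst` holds the FULL sequence for a few heads:
    # shape: [seq_len, n_heads / P, head_dim]
    per_rank_heads.append(torch.cat(chunks, dim=0))

print(per_rank_seq[0].shape)    # torch.Size([4, 8, 32])  -- sequence split, all heads
print(per_rank_heads[0].shape)  # torch.Size([16, 2, 32]) -- full sequence, 2 of 8 heads
# Each rank can now run standard (global) attention for its own heads, and a
# mirror-image exchange afterwards restores the original sequence split.
```

In DeepSpeed Ulysses this exchange is implemented with collective all-to-all operations applied to the queries, keys, and values before attention and to the output afterwards, which is what keeps the per-GPU traffic modest even for very long sequences.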

What Has Been Implemented and How It Works in Practice

The publication on Hugging Face presents an implementation of this approach integrated into the model training ecosystem. Importantly, the authors didn't just describe the idea – they integrated it into existing tools so that developers wouldn't have to rewrite everything from scratch.

The implementation supports interoperability with other types of parallelism, such as distributing model weights across multiple GPUs. This allows for combining approaches and flexibly scaling the training depending on the available hardware.

In practice, this means that it's now possible to train models on contexts of up to a million tokens on clusters of multiple GPUs – without needing to invent your own infrastructure from scratch. This is what makes the publication practically significant: it's not just “we came up with a method,” but “here is a working tool you can pick up and use.”
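
As an illustration of how the pieces can be combined (the group sizes below are my own arbitrary numbers, not a configuration from the publication), one common way to lay this out is to split the GPUs into sequence-parallel groups and treat the groups themselves as data-parallel replicas:

```python
# Illustrative layout: GPUs inside a group share one long sequence between them,
# while the groups run in parallel over different batches (data parallelism).

n_gpus = 32
seq_parallel_size = 8                               # GPUs cooperating on one sequence
data_parallel_size = n_gpus // seq_parallel_size    # independent model replicas

seq_groups = [list(range(g * seq_parallel_size, (g + 1) * seq_parallel_size))
              for g in range(data_parallel_size)]

tokens_per_sequence = 1_000_000
for i, group in enumerate(seq_groups):
    print(f"replica {i}: GPUs {group} each hold "
          f"{tokens_per_sequence // seq_parallel_size:,} of {tokens_per_sequence:,} tokens")
```

Weight sharding (for example, ZeRO or FSDP-style techniques) can be layered on top of the same layout, which is the kind of interoperability described above.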

How Fast Does It Really Work?

The authors provide test results on long sequences. As the number of GPUs increases, the scaling efficiency remains high – meaning that adding more processors genuinely accelerates training proportionally, rather than just slightly improving the situation.

This is important because in distributed systems, a “bottleneck” often arises: communication between GPUs starts to slow down the entire process. Ulysses Sequence Parallelism is designed to avoid this by minimizing the amount of data transferred in the most computationally “expensive” part.

Moreover, the approach pairs well with other optimizations, particularly with the so-called Flash Attention, which speeds up the attention calculation itself. Together, they provide a significant performance boost when working with long contexts.
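
To see why the two compose cleanly, note that after the Ulysses-style exchange each GPU holds the full sequence for a handful of heads, which is exactly the layout a fused attention kernel expects. The snippet below uses PyTorch's scaled_dot_product_attention as a stand-in for a Flash Attention style kernel (the tensor sizes are arbitrary examples of mine, not measurements from the publication):

```python
import torch
import torch.nn.functional as F

# Per-GPU view after the sequence<->head exchange: full sequence, few heads.
batch, local_heads, seq_len, head_dim = 1, 2, 4_096, 64
q = torch.randn(batch, local_heads, seq_len, head_dim)
k = torch.randn_like(q)
v = torch.randn_like(q)

# A fused kernel computes attention without materializing the full
# seq_len x seq_len score matrix, so memory grows linearly with seq_len.
# On a GPU with fp16/bf16 inputs this call dispatches to a Flash-Attention-style kernel.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 2, 4096, 64])
```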

Who Needs This and Why?

Long contexts are needed for more than just allowing a model to “read” a large document. They open up a whole class of tasks that were previously inaccessible or difficult to solve:

  • analyzing large codebases in their entirety, not in parts;
  • working with long legal, scientific, or medical documents;
  • tasks where a dialogue history spanning several hours is important;
  • complex, multi-step reasoning that requires a large “workspace.”

Until recently, training models with such a context required either enormous resources or serious engineering work. Ulysses Sequence Parallelism lowers this barrier – not to zero, of course, but significantly.

This is particularly relevant for research teams and companies that are fine-tuning existing models for specific tasks. They are the ones who most often face memory limitations when working with long texts.

Open Questions

The approach looks convincing, but it has its limitations. It is most effective when the number of GPUs sharing a sequence matches the model's structure – in particular, how evenly the attention heads can be divided among them; if this ratio is off, efficiency decreases.

Furthermore, the implementation requires specific configuration tailored to the model's architecture and the cluster setup. It's not a “press a button and it works” solution, but a tool that requires an understanding of how your training system is structured.

Finally, there's the question of how this approach will scale to even larger contexts – say, tens of millions of tokens. The authors don't claim to have solved the problem once and for all; this is more of an important and well-executed step in a direction that continues to evolve rapidly.

Overall, Ulysses Sequence Parallelism is an example of how “under-the-hood” engineering work pushes the capabilities of AI forward. Not through a new architecture or a breakthrough algorithm, but because someone effectively solved a specific infrastructural problem – and made the solution available to others.

Original Title: Ulysses Sequence Parallelism: Training with Million-Token Contexts
Publication Date: Mar 9, 2026
Source: Hugging Face (huggingface.co) – a U.S.-based open platform and company for hosting, training, and sharing AI models.