Published February 11, 2026

Unsloth Speeds Up MoE Model Training 12x and Boosts Context Window

Unsloth's new kernels and mathematical optimizations slash memory requirements by 35%, boost training speeds by 12x, and enable context windows six times longer than the original.

Event Source: Unsloth

Unsloth MoE Training Speed and Memory Optimizations

Not all expert models are created equal... but now they're all equally fast

Mixture of Experts (MoE) is an architecture that has gained popularity for its ability to scale a model's "intelligence" without a radical surge in computational cost. Simply put: instead of running every request through the entire massive neural network, the system selects a few "experts" from a large pool and activates only them. It sounds elegant, but there is a catch: until recently, training such models was painfully slow and required a massive amount of memory.
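As a toy illustration of that routing step (plain NumPy with made-up shapes; this is a conceptual sketch, not Unsloth's code), a top-k router scores every expert per token and keeps only the best few:

```python
import numpy as np

def route_tokens(hidden, router_weights, top_k=2):
    """Toy MoE router: score every expert per token, keep only the top-k.

    hidden: (tokens, dim) activations; router_weights: (dim, n_experts).
    Returns the chosen expert indices and their normalized gate weights.
    """
    logits = hidden @ router_weights                   # (tokens, n_experts)
    top_idx = np.argsort(logits, axis=-1)[:, -top_k:]  # indices of best experts
    top_scores = np.take_along_axis(logits, top_idx, axis=-1)
    # Softmax over just the selected experts -> mixing weights per token
    gates = np.exp(top_scores - top_scores.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)
    return top_idx, gates

rng = np.random.default_rng(0)
idx, gates = route_tokens(rng.normal(size=(4, 16)), rng.normal(size=(16, 8)))
print(idx.shape, gates.shape)  # (4, 2) (4, 2): 2 experts chosen per token
```

Only the selected experts then run their forward pass; the outputs are mixed using the gate weights.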

The Unsloth team has released a suite of optimizations that speed up MoE model training by approximately 12x compared to traditional approaches, reduce VRAM requirements by over a third, and support context windows six times longer. All this, without any loss in accuracy.

Challenges of Training Mixture of Experts Models

Why MoEs used to be a headache to train

Imagine a model with a hundred experts – separate neural networks, each with its own weights. When a token arrives, the system decides which of these experts to activate (usually 6–8 of them). The rest remain on standby.

The problem was that expert weights were stored as a list of separate layers. To process data, you had to loop through this list: first one expert, then the second, then the third. It was incredibly slow.

Recently, PyTorch introduced the grouped_mm function – a way to perform multiple matrix multiplications at once instead of sequential passes. The Transformers library version 5 began using this function, yielding a sixfold speedup. Unsloth went further by writing custom Triton kernels that double the speed again and cut memory consumption by over a third.
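The difference is easy to see in miniature. The sketch below (NumPy, with equal-sized groups for simplicity; real MoE batches have ragged per-expert groups, which is what grouped_mm and the custom Triton kernels actually handle) compares the per-expert loop with a single batched multiplication:

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, tokens_per_expert, d_in, d_out = 8, 32, 64, 64
x = rng.normal(size=(n_experts, tokens_per_expert, d_in))  # tokens grouped by expert
w = rng.normal(size=(n_experts, d_in, d_out))              # one weight matrix per expert

# Old approach: iterate over the expert list, one matmul at a time.
looped = np.stack([x[e] @ w[e] for e in range(n_experts)])

# Grouped approach: a single batched multiplication over all experts at once.
grouped = np.einsum("eti,eio->eto", x, w)

print(np.allclose(looped, grouped))  # True: same math, one kernel launch
```

The results are identical; the win comes from replacing many small sequential operations with one large one that keeps the GPU saturated.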

Optimizing MoE Training with Split LoRA

How it works: Split LoRA instead of materialization

Usually, when training an MoE using LoRA (Low-Rank Adaptation – a fine-tuning method that adds small adapters to a model instead of updating all weights), the following happens: the system first merges the adapter with the base weights and then runs the data through this merged version. The problem is that such merging requires storing an intermediate matrix for every single expert. If there are 128 experts (as in Qwen3-30B), you run out of memory instantly.

Unsloth uses a different order of operations. Instead of assembling the full weight first and then applying it to the data, the system applies the LoRA adapter to the data first and then multiplies the result by the second part of the adapter. Mathematically, the result remains the same (matrix multiplication is associative), but this order of operations eliminates the need to store bulky intermediate matrices.
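This reordering is plain associativity and can be checked numerically. In the minimal NumPy sketch below (shapes chosen arbitrarily for illustration), the "materialized" path builds the full d×d merged matrix, while the "split" path only ever creates rank-r intermediates:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, tokens = 512, 16, 64          # hidden size, LoRA rank, batch of tokens
W = rng.normal(size=(d, d))         # frozen base weight (one expert)
A = rng.normal(size=(r, d))         # LoRA "down" projection
B = rng.normal(size=(d, r))         # LoRA "up" projection
x = rng.normal(size=(tokens, d))

# Materialized: build the merged (d x d) matrix per expert, then apply it.
merged = x @ (W + B @ A).T

# Split: apply W, A, B to the activations directly; no d x d intermediate.
split = x @ W.T + (x @ A.T) @ B.T

print(np.allclose(merged, split))  # True: associativity guarantees equality
```

With 128 experts, the materialized path needs 128 extra d×d buffers; the split path's largest extra intermediate is a tokens×r matrix, which is where the memory savings come from.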

This saves a massive amount of memory and time. For Qwen3-30B-A3B, this approach allows for context windows up to 32,000 tokens. For gpt-oss, it's even higher.

MoE Training Performance Benchmarks on NVIDIA GPUs

Benchmarks: From data center hardware to consumer GPUs

On the NVIDIA B200, training gpt-oss with an 8,192-token context is now 7x faster compared to Transformers v5, with memory consumption down by 36%. At a 16K context, Transformers v5 throws an "out of memory" error, while Unsloth keeps chugging along.

Qwen3-30B-A3B on the B200 shows a roughly 1.7x speedup with a 35% memory saving. On the H100, speed increased by 1.77x; notably, at an 8K context, Unsloth uses less memory than the base version does at 4K.

GLM 4.7 Flash is a model with 64 experts and one shared expert (DeepSeek-style configuration). On an RTX PRO 6000, Unsloth delivers a 2.1x speed boost and 15% memory savings.

Crucially, these optimizations aren't limited to server-grade GPUs like the H100 or B200: they also run on consumer cards such as the RTX 3090 and on older data-center GPUs such as the A100. Support starts from the NVIDIA T4 (Turing) generation.

Automatic backend selection

Unsloth automatically selects the acceleration method based on your hardware. If you're using an H100 or newer, it enables grouped_mm, the optimized PyTorch function. If you have an A100 or an older version of PyTorch, Unsloth's custom Triton kernels kick in, which run 2.5x faster than grouped_mm on the A100. If the hardware supports neither, it falls back to base PyTorch, though all memory optimizations are still preserved.

Modes can be toggled manually via the UNSLOTH_MOE_BACKEND environment variable, but the system defaults to the optimal choice.
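The priority order above can be sketched as a small dispatch function. The UNSLOTH_MOE_BACKEND variable comes from the announcement, but the function name, capability thresholds, and backend labels here are illustrative assumptions, not Unsloth's source:

```python
import os

def pick_moe_backend(compute_capability):
    """Hypothetical backend dispatch mirroring the priority described above.

    compute_capability: CUDA compute capability as a float (e.g. 9.0 for
    H100, 8.0 for A100, 7.5 for T4), or None if no CUDA device is present.
    """
    forced = os.environ.get("UNSLOTH_MOE_BACKEND")
    if forced:
        return forced                      # manual override always wins
    if compute_capability is None:
        return "torch_fallback"            # no GPU: plain PyTorch path
    if compute_capability >= 9.0:          # Hopper (H100) and newer
        return "grouped_mm"
    if compute_capability >= 7.5:          # T4 (7.5) through A100 (8.0)
        return "triton_kernels"
    return "torch_fallback"                # pre-Turing: base PyTorch path

print(pick_moe_backend(9.0))  # grouped_mm
print(pick_moe_backend(8.0))  # triton_kernels
```

Whatever the dispatch lands on, the memory optimizations apply; only the matmul backend changes.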

Supported MoE Models and Architectures

Which models are supported

The update affects the Qwen3 family (including Thinking, Instruct, VL, and Coder versions), gpt-oss (20B, 120B, safeguard), GLM (4.5, 4.6, 4.6-Air, 4.7, 4.7-Flash), and DeepSeek (V3, R1, V3.1, V3.2). Unsloth should also support other MoE models, even if they aren't explicitly listed in the documentation.

Bonus: Additional updates

Alongside the MoE speedup, the team introduced several other improvements:

  • Gemma-3 now defaults to Flex-Attention, which reduces memory complexity from O(N²) to O(N) and speeds up training more than threefold. At an 8K context, it saves 24.8 GB, and at 16K the model no longer crashes due to memory limits.
  • Fine-tuning visual models now supports mixed data: you can feed text and images interchangeably.
  • Windows is now officially supported without needing WSL.
  • Compatibility with trl==0.27.1 and transformers==5.1.0 has reached 80% (across all 120 Unsloth notebooks, up from 30%). Full support is expected in the coming days.

Benefits of Accessible MoE Model Fine-Tuning

Why this matters

MoE models allow for the creation of massive neural networks without a proportional increase in compute costs. However, until recently, training them remained expensive and slow. Unsloth's new optimizations make MoE more accessible: now, fine-tuning 30B-parameter models is possible on a single consumer GPU rather than just on clusters. This significantly lowers the barrier for researchers wanting to experiment with such architectures.

In short: "Faster training, lower memory usage, and long context support, all without sacrificing model performance."

Original Title: Faster MoE Training
Source: Unsloth (unsloth.ai), a U.S.-based project optimizing the training and fine-tuning of language models.

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Claude Sonnet 4.5 (Anthropic): Analyzing the Original Publication and Writing the Text. The neural network studies the original material and generates a coherent text.

2. Gemini 3 Pro (Google DeepMind): Translating the Text into English.

3. Gemini 3 Flash Preview (Google DeepMind): Text Review and Editing. Correction of errors, inaccuracies, and ambiguous phrasing.

4. DeepSeek-V3.2 (DeepSeek): Preparing the Illustration Description. Generating a textual prompt for the visual model.

5. FLUX.2 Pro (Black Forest Labs): Creating the Illustration. Generating an image based on the prepared prompt.

