Unsloth MoE Training Speed and Memory Optimizations
Not all expert models are created equal... but now they're all equally fast
Mixture of Experts (MoE) is an architecture that has gained popularity for its ability to scale a model's "intelligence" without a radical surge in computational costs. Simply put: instead of running every request through the entire massive neural network, the system selects a few "experts" from a large pool and activates only them. It sounds elegant, but there's a catch: until recently, training such models was painstakingly slow and required a massive amount of memory.
The Unsloth team has released a suite of optimizations that speed up MoE model training by approximately 12x compared to traditional approaches, reduce VRAM requirements by over a third, and support context windows six times longer. All this, without any loss in accuracy.
Challenges of Training Mixture of Experts Models
Why MoEs used to be a headache to train
Imagine a model with a hundred experts – separate neural networks, each with its own weights. When a token arrives, the system decides which of these experts to activate (usually 6–8 of them). The rest remain on standby.
The problem was that expert weights were stored as a list of separate layers. To process data, you had to loop through this list: first one expert, then the second, then the third. It was incredibly slow.
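To make the routing and the slow per-expert loop concrete, here is a minimal NumPy sketch. All dimensions and the softmax-over-top-k weighting are illustrative assumptions, not Unsloth's or any specific model's code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts, top_k = 16, 32, 8, 2

x = rng.standard_normal((n_tokens, d_model))
router_w = rng.standard_normal((d_model, n_experts))
# Experts stored as a list of separate weight matrices -- the layout
# that forces the sequential loop below.
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

# Router picks the top-k experts per token and weights their outputs.
logits = x @ router_w
top_experts = np.argsort(logits, axis=1)[:, -top_k:]      # (n_tokens, top_k)
scores = np.take_along_axis(logits, top_experts, axis=1)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Slow path: iterate over the expert list, one small matmul per expert.
out = np.zeros_like(x)
for e, w_e in enumerate(experts):
    token_idx, slot = np.nonzero(top_experts == e)
    if token_idx.size:
        out[token_idx] += weights[token_idx, slot, None] * (x[token_idx] @ w_e)
```

With 128 experts this loop launches up to 128 tiny matmuls per layer, which is exactly the inefficiency the next section addresses.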
Recently, PyTorch introduced the grouped_mm function – a way to perform multiple matrix multiplications at once instead of sequential passes. The Transformers library version 5 began using this function, yielding a sixfold speedup. Unsloth went further by writing custom Triton kernels that double the speed again and cut memory consumption by over a third.
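The gain from grouping can be illustrated without the actual `grouped_mm` API (which additionally handles variable-sized groups in one kernel). In this sketch, tokens are assumed to be already gathered per expert and padded to a fixed capacity; the single batched `einsum` then replaces the per-expert loop:

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, capacity, d_model = 8, 4, 32

# Tokens already routed and gathered per expert (padded to fixed capacity).
tokens_per_expert = rng.standard_normal((n_experts, capacity, d_model))
# Expert weights stacked into one tensor instead of a Python list.
expert_weights = rng.standard_normal((n_experts, d_model, d_model))

# One batched matmul replaces n_experts separate kernel launches.
batched_out = np.einsum("ecd,edf->ecf", tokens_per_expert, expert_weights)

# Equivalent slow loop, for comparison.
looped_out = np.stack(
    [tokens_per_expert[e] @ expert_weights[e] for e in range(n_experts)]
)
assert np.allclose(batched_out, looped_out)
```

On a GPU, collapsing many small matmuls into one large grouped operation is what turns the sixfold (and, with custom Triton kernels, twelvefold) speedup from a claim into arithmetic: fewer kernel launches, better utilization.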
Optimizing MoE Training with Split LoRA
How it works: Split LoRA instead of materialization
Usually, when training an MoE using LoRA (Low-Rank Adaptation – a fine-tuning method that adds small adapters to a model instead of updating all weights), the following happens: the system first merges the adapter with the base weights and then runs the data through this merged version. The problem is that such merging requires storing an intermediate matrix for every single expert. If there are 128 experts (as in Qwen3-30B), you run out of memory instantly.
Unsloth uses a different order of operations. Instead of assembling the full weight first and then applying it to the data, the system applies the LoRA adapter to the data first and then multiplies the result by the second part of the adapter. Mathematically, the result remains the same (matrix multiplication is associative), but this order of operations eliminates the need to store bulky intermediate matrices.
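The associativity argument is easy to verify numerically. A minimal sketch with made-up dimensions (a single expert, LoRA scaling omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_in, d_out, rank = 64, 512, 512, 16

x = rng.standard_normal((n_tokens, d_in))
W = rng.standard_normal((d_in, d_out))   # frozen base weight (one expert)
A = rng.standard_normal((d_in, rank))    # LoRA down-projection
B = rng.standard_normal((rank, d_out))   # LoRA up-projection

# Naive path: materialize the merged weight. With 128 experts this means
# 128 full d_in x d_out intermediate matrices -- the memory blow-up.
merged = x @ (W + A @ B)

# Split path: never form W + A @ B; apply A to the activations first.
# The largest intermediate is now (n_tokens, rank) instead of (d_in, d_out).
split = x @ W + (x @ A) @ B

assert np.allclose(merged, split)
```

The results match because matrix multiplication is associative: `x @ (A @ B) == (x @ A) @ B`. Only the memory footprint of the intermediates differs.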
This saves a massive amount of memory and time. For Qwen3-30B-A3B, this approach allows for context windows up to 32,000 tokens. For gpt-oss, it's even higher.
MoE Training Performance Benchmarks on NVIDIA GPUs
Benchmarks: From data center hardware to consumer GPUs
On the NVIDIA B200, training gpt-oss with an 8,192-token context is now 7x faster compared to Transformers v5, with memory consumption down by 36%. At a 16K context, Transformers v5 throws an "out of memory" error, while Unsloth keeps chugging along.
Qwen3-30B-A3B on the B200 shows a roughly 1.7x speedup with a 35% memory saving. On the H100, speed increased by 1.77x; notably, at an 8K context, Unsloth uses less memory than the base version does at 4K.
GLM 4.7 Flash is a model with 64 experts and one shared expert (DeepSeek-style configuration). On an RTX PRO 6000, Unsloth delivers a 2.1x speed boost and 15% memory savings.
Crucially, these optimizations aren't limited to server-grade GPUs like the H100 or B200; they also work on older data-center cards like the A100 and consumer cards like the RTX 3090. Support starts with the NVIDIA T4 generation.
Automatic backend selection
Unsloth automatically selects the acceleration method based on your hardware. If you're using an H100 or newer, it enables grouped_mm, the optimized PyTorch function. If you have an A100 or an older version of PyTorch, Unsloth's custom Triton kernels kick in, which run 2.5x faster than grouped_mm on the A100. If the hardware supports neither, it falls back to base PyTorch, though all memory optimizations are still preserved.
Modes can be toggled manually via the UNSLOTH_MOE_BACKEND environment variable, but the system defaults to the optimal choice.
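The fallback chain described above can be sketched as a simple dispatch function. This is a hypothetical illustration, not Unsloth's actual code: the `UNSLOTH_MOE_BACKEND` variable name comes from the article, but the function name, compute-capability thresholds, and return values are assumptions:

```python
import os

def select_moe_backend(compute_capability: tuple, has_grouped_mm: bool) -> str:
    """Hypothetical sketch of the backend fallback chain described above."""
    # Manual override via the documented environment variable.
    override = os.environ.get("UNSLOTH_MOE_BACKEND")
    if override:
        return override
    # H100 (sm_90) or newer with a recent PyTorch: use grouped_mm.
    if compute_capability >= (9, 0) and has_grouped_mm:
        return "grouped_mm"
    # A100-class or older PyTorch: custom Triton kernels
    # (2.5x faster than grouped_mm on the A100, per the article).
    if compute_capability >= (7, 5):  # T4 is sm_75
        return "triton"
    # Last resort: plain PyTorch; memory optimizations still apply.
    return "torch"

print(select_moe_backend((9, 0), has_grouped_mm=True))   # grouped_mm on H100
print(select_moe_backend((8, 0), has_grouped_mm=True))   # triton on A100
```

Note that an A100 falls through to the Triton path even when `grouped_mm` is available, matching the benchmark claim above.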
Supported MoE Models and Architectures
Which models are supported
The update affects the Qwen3 family (including Thinking, Instruct, VL, and Coder versions), gpt-oss (20B, 120B, safeguard), GLM (4.5, 4.6, 4.6-Air, 4.7, 4.7-Flash), and DeepSeek (V3, R1, V3.1, V3.2). Unsloth should also support other MoE models, even if they aren't explicitly listed in the documentation.
Bonus: Additional updates
Alongside the MoE speedup, the team introduced several other improvements:
- Gemma-3 now defaults to Flex-Attention, which reduces memory complexity from O(N²) to O(N) and speeds up training more than threefold. At an 8K context, it saves 24.8 GB, and at 16K, the model no longer crashes due to memory limits.
- Fine-tuning vision models now supports mixed data: text-only and image samples can be combined in one dataset.
- Windows is now officially supported without needing WSL.
- Compatibility with trl==0.27.1 and transformers==5.1.0 has reached 80% (across all 120 Unsloth notebooks, up from 30%). Full support is expected in the coming days.
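The O(N²)-vs-O(N) claim for attention memory is easy to see with back-of-envelope arithmetic: materializing the attention score matrix costs N × N elements per head per layer, so doubling the context quadruples that cost. The numbers below are illustrative and do not attempt to reproduce the 24.8 GB figure, which depends on the model's head and layer counts:

```python
# Score matrix size for a single head in a single layer, bf16 precision.
bytes_per_elem = 2  # bf16
for n in (4096, 8192, 16384):
    score_bytes = n * n * bytes_per_elem
    print(f"N={n:>6}: {score_bytes / 2**20:.0f} MiB per head per layer")
# N=  4096: 32 MiB, N=  8192: 128 MiB, N= 16384: 512 MiB
```

Flex-Attention (like FlashAttention) avoids materializing this matrix, which is where the O(N) memory behavior comes from.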
Benefits of Accessible MoE Model Fine-Tuning
Why this matters
MoE models allow for the creation of massive neural networks without a proportional increase in compute costs. However, until recently, training them remained expensive and slow. Unsloth's new optimizations make MoE more accessible: now, fine-tuning 30B-parameter models is possible on a single consumer GPU rather than just on clusters. This significantly lowers the barrier for researchers wanting to experiment with such architectures.
In short: "Faster training, lower memory usage, and long context support – all without sacrificing model performance."