Unsloth MoE Training Speed and Memory Optimizations
Not all expert models are created equal... but now they're all equally fast
Mixture of Experts (MoE) is an architecture that has gained popularity for its ability to scale a model's "intelligence" without a radical surge in computational costs. Simply put: instead of running every request through the entire massive neural network, the system selects a few "experts" from a large pool and activates only them. It sounds elegant, but there's a catch: until recently, training such models was painstakingly slow and required a massive amount of memory.
The Unsloth team has released a suite of optimizations that speed up MoE model training by approximately 12x compared to traditional approaches, reduce VRAM requirements by over a third, and support context windows six times longer. All this, without any loss in accuracy.
Challenges of Training Mixture of Experts Models
Why MoEs used to be a headache to train
Imagine a model with a hundred experts – separate neural networks, each with its own weights. When a token arrives, the system decides which of these experts to activate (usually 6–8 of them). The rest remain on standby.
The problem was that expert weights were stored as a list of separate layers. To process data, you had to loop through this list: first one expert, then the second, then the third. It was incredibly slow.
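To make the routing and the slow per-expert loop concrete, here is a minimal NumPy sketch. All dimensions and the softmax-over-top-k weighting are illustrative assumptions, not Unsloth's or any specific model's code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts, top_k = 16, 32, 8, 2

x = rng.standard_normal((n_tokens, d_model))
router_w = rng.standard_normal((d_model, n_experts))
# Experts stored as a list of separate weight matrices -- the layout
# that forces the sequential loop below.
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

# Router picks the top-k experts per token and weights their outputs.
logits = x @ router_w
top_experts = np.argsort(logits, axis=1)[:, -top_k:]      # (n_tokens, top_k)
scores = np.take_along_axis(logits, top_experts, axis=1)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Slow path: iterate over the expert list, one small matmul per expert.
out = np.zeros_like(x)
for e, w_e in enumerate(experts):
    token_idx, slot = np.nonzero(top_experts == e)
    if token_idx.size:
        out[token_idx] += weights[token_idx, slot, None] * (x[token_idx] @ w_e)
```

With 128 experts this loop launches up to 128 tiny matmuls per layer, which is exactly the inefficiency the next section addresses.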
Recently, PyTorch introduced the grouped_mm function – a way to perform multiple matrix multiplications at once instead of sequential passes. The Transformers library version 5 began using this function, yielding a sixfold speedup. Unsloth went further by writing custom Triton kernels that double the speed again and cut memory consumption by over a third.
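The gain from grouping can be illustrated without the actual `grouped_mm` API (which additionally handles variable-sized groups in one kernel). In this sketch, tokens are assumed to be already gathered per expert and padded to a fixed capacity; the single batched `einsum` then replaces the per-expert loop:

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, capacity, d_model = 8, 4, 32

# Tokens already routed and gathered per expert (padded to fixed capacity).
tokens_per_expert = rng.standard_normal((n_experts, capacity, d_model))
# Expert weights stacked into one tensor instead of a Python list.
expert_weights = rng.standard_normal((n_experts, d_model, d_model))

# One batched matmul replaces n_experts separate kernel launches.
batched_out = np.einsum("ecd,edf->ecf", tokens_per_expert, expert_weights)

# Equivalent slow loop, for comparison.
looped_out = np.stack(
    [tokens_per_expert[e] @ expert_weights[e] for e in range(n_experts)]
)
assert np.allclose(batched_out, looped_out)
```

On a GPU, collapsing many small matmuls into one large grouped operation is what turns the sixfold (and, with custom Triton kernels, twelvefold) speedup from a claim into arithmetic: fewer kernel launches, better utilization.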
Optimizing MoE Training with Split LoRA
How it works: Split LoRA instead of materialization
Usually, when training an MoE using LoRA (Low-Rank Adaptation – a fine-tuning method that adds small adapters to a model instead of updating all weights), the following happens: the system first merges the adapter with the base weights and then runs the data through this merged version. The problem is that such merging requires storing an intermediate matrix for every single expert. If there are 128 experts (as in Qwen3-30B), you run out of memory instantly.
Unsloth uses a different order of operations. Instead of assembling the full weight first and then applying it to the data, the system applies the LoRA adapter to the data first and then multiplies the result by the second part of the adapter. Mathematically, the result remains the same (matrix multiplication is associative), but this order of operations eliminates the need to store bulky intermediate matrices.
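The associativity argument is easy to verify numerically. A minimal sketch with made-up dimensions (a single expert, LoRA scaling omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_in, d_out, rank = 64, 512, 512, 16

x = rng.standard_normal((n_tokens, d_in))
W = rng.standard_normal((d_in, d_out))   # frozen base weight (one expert)
A = rng.standard_normal((d_in, rank))    # LoRA down-projection
B = rng.standard_normal((rank, d_out))   # LoRA up-projection

# Naive path: materialize the merged weight. With 128 experts this means
# 128 full d_in x d_out intermediate matrices -- the memory blow-up.
merged = x @ (W + A @ B)

# Split path: never form W + A @ B; apply A to the activations first.
# The largest intermediate is now (n_tokens, rank) instead of (d_in, d_out).
split = x @ W + (x @ A) @ B

assert np.allclose(merged, split)
```

The results match because matrix multiplication is associative: `x @ (A @ B) == (x @ A) @ B`. Only the memory footprint of the intermediates differs.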
This saves a massive amount of memory and time. For Qwen3-30B-A3B, this approach allows for context windows up to 32,000 tokens. For gpt-oss, it's even higher.
MoE Training Performance Benchmarks on NVIDIA GPUs
Benchmarks: From data center hardware to consumer GPUs
On the NVIDIA B200, training gpt-oss with an 8,192-token context is now 7x faster compared to Transformers v5, with memory consumption down by 36%. At a 16K context, Transformers v5 throws an "out of memory" error, while Unsloth keeps chugging along.
Qwen3-30B-A3B on the B200 shows a roughly 1.7x speedup with a 35% memory saving. On the H100, speed increased by 1.77x; notably, at an 8K context, Unsloth uses less memory than the base version does at 4K.
GLM 4.7 Flash is a model with 64 experts and one shared expert (DeepSeek-style configuration). On an RTX PRO 6000, Unsloth delivers a 2.1x speed boost and 15% memory savings.
Crucially, these optimizations aren't limited to server-grade GPUs like the H100 or B200; they also work on older data-center cards like the A100 and consumer cards like the RTX 3090. Support starts with the NVIDIA T4 generation.
Automatic backend selection
Unsloth automatically selects the acceleration method based on your hardware. If you're using an H100 or newer, it enables grouped_mm, the optimized PyTorch function. If you have an A100 or an older version of PyTorch, Unsloth's custom Triton kernels kick in, which run 2.5x faster than grouped_mm on the A100. If the hardware supports neither, it falls back to base PyTorch, though all memory optimizations are still preserved.
Modes can be toggled manually via the UNSLOTH_MOE_BACKEND environment variable, but the system defaults to the optimal choice.
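The fallback chain described above can be sketched as a simple dispatch function. This is a hypothetical illustration, not Unsloth's actual code: the `UNSLOTH_MOE_BACKEND` variable name comes from the article, but the function name, compute-capability thresholds, and return values are assumptions:

```python
import os

def select_moe_backend(compute_capability: tuple, has_grouped_mm: bool) -> str:
    """Hypothetical sketch of the backend fallback chain described above."""
    # Manual override via the documented environment variable.
    override = os.environ.get("UNSLOTH_MOE_BACKEND")
    if override:
        return override
    # H100 (sm_90) or newer with a recent PyTorch: use grouped_mm.
    if compute_capability >= (9, 0) and has_grouped_mm:
        return "grouped_mm"
    # A100-class or older PyTorch: custom Triton kernels
    # (2.5x faster than grouped_mm on the A100, per the article).
    if compute_capability >= (7, 5):  # T4 is sm_75
        return "triton"
    # Last resort: plain PyTorch; memory optimizations still apply.
    return "torch"

print(select_moe_backend((9, 0), has_grouped_mm=True))   # grouped_mm on H100
print(select_moe_backend((8, 0), has_grouped_mm=True))   # triton on A100
```

Note that an A100 falls through to the Triton path even when `grouped_mm` is available, matching the benchmark claim above.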
Supported MoE Models and Architectures
Which models are supported
The update affects the Qwen3 family (including Thinking, Instruct, VL, and Coder versions), gpt-oss (20B, 120B, safeguard), GLM (4.5, 4.6, 4.6-Air, 4.7, 4.7-Flash), and DeepSeek (V3, R1, V3.1, V3.2). Unsloth should also support other MoE models, even if they aren't explicitly listed in the documentation.
Bonus: Additional updates
Alongside the MoE speedup, the team introduced several other improvements:
- Gemma-3 now defaults to Flex-Attention, which reduces memory complexity from O(N²) to O(N) and speeds up training more than threefold. At an 8K context, it saves 24.8 GB, and at 16K, the model no longer crashes due to memory limits.
- Fine-tuning vision models now supports mixed data: text-only and image samples can be combined in one dataset.
- Windows is now officially supported without needing WSL.
- Compatibility with trl==0.27.1 and transformers==5.1.0 has reached 80% (across all 120 Unsloth notebooks, up from 30%). Full support is expected in the coming days.
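The O(N²)-vs-O(N) claim for attention memory is easy to see with back-of-envelope arithmetic: materializing the attention score matrix costs N × N elements per head per layer, so doubling the context quadruples that cost. The numbers below are illustrative and do not attempt to reproduce the 24.8 GB figure, which depends on the model's head and layer counts:

```python
# Score matrix size for a single head in a single layer, bf16 precision.
bytes_per_elem = 2  # bf16
for n in (4096, 8192, 16384):
    score_bytes = n * n * bytes_per_elem
    print(f"N={n:>6}: {score_bytes / 2**20:.0f} MiB per head per layer")
# N=  4096: 32 MiB, N=  8192: 128 MiB, N= 16384: 512 MiB
```

Flex-Attention (like FlashAttention) avoids materializing this matrix, which is where the O(N) memory behavior comes from.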
Benefits of Accessible MoE Model Fine-Tuning
Why this matters
MoE models allow for the creation of massive neural networks without a proportional increase in compute costs. However, until recently, training them remained expensive and slow. Unsloth's new optimizations make MoE more accessible: now, fine-tuning 30B-parameter models is possible on a single consumer GPU rather than just on clusters. This significantly lowers the barrier for researchers wanting to experiment with such architectures.
In short: "Faster training, lower memory usage, and long context support – all without sacrificing model performance."