Published January 28, 2026

MiniMax M2-her: How the Voice Model That Speaks 39 Languages Works

We delve into the inner workings of the new MiniMax voice model, which can simultaneously understand speech, recognize the speaker, and generate a response.

Source: MiniMax | Reading time: 5–7 minutes

MiniMax has released the M2-her voice model – a system that can listen, understand, and respond with voice in near real time. Moreover, it does this in 39 languages, including Russian. But the most interesting part is how it works under the hood.

What M2-her Is and How It Stands Out

M2-her is not just a language model with a speech synthesizer bolted onto it. It is a system that works with voice directly: it receives audio, processes it, and generates a response in audio form as well. There is no intermediate step involving text.

Previously, voice assistants followed a pipeline: first recognize speech into text, then process the text with a language model, and finally synthesize the answer back into speech. Here, everything happens inside a single model, which provides several advantages: lower latency, more control over intonation, and the ability to account for non-verbal cues – pauses, tone, and emotion.
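
The contrast between the two designs can be sketched in a few lines of toy code. Every function name below is an illustrative stand-in, not MiniMax's actual API; the point is only where the paralinguistic information is lost:

```python
# Toy, self-contained contrast between the cascaded and end-to-end designs.
# Every function name here is an illustrative stand-in, not MiniMax's API.

def speech_to_text(audio):
    return "hello"                       # stand-in ASR: tone is dropped here

def language_model(text):
    return f"reply to: {text}"

def text_to_speech(text):
    return {"waveform": text, "intonation": "default"}  # tone re-invented

def cascaded_pipeline(audio):
    """Classic three-hop design: paralinguistic cues vanish at step one."""
    return text_to_speech(language_model(speech_to_text(audio)))

def audio_encoder(audio):
    """Discrete tokens that still carry speaker identity and tone."""
    return {"tokens": [1, 2, 3], "tone": audio["tone"]}

def speech_language_model(state):
    return {"tokens": [4, 5, 6], "tone": state["tone"]}

def audio_decoder(state):
    return {"waveform": state["tokens"], "intonation": state["tone"]}

def end_to_end_pipeline(audio):
    """Single model: the user's tone survives from input to output."""
    return audio_decoder(speech_language_model(audio_encoder(audio)))

question = {"samples": [0.1, 0.2], "tone": "cheerful"}
print(cascaded_pipeline(question)["intonation"])    # default  (tone lost)
print(end_to_end_pipeline(question)["intonation"])  # cheerful (tone kept)
```

In the cascaded version the tone never reaches the language model, so the synthesizer has to invent intonation from scratch; in the end-to-end version it rides along in the tokens.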

M2-her is built on the foundation of the MiniMax-01 large language model, which already knows how to work with text in different languages. Now, a voice layer has been added to it.

How the Model Understands and Creates Speech 🎤

All the magic happens thanks to two components: an audio encoder and decoder.

The Encoder accepts audio and turns it into a set of tokens – discrete units that the language model can work with. An architecture called Grouped Residual FSQ (Finite Scalar Quantization) is used for this. Simply put, sound is compressed into a compact representation that preserves important information: what was said, who is speaking, and with what intonation.
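
The core idea of FSQ is simple enough to show in a few lines: each latent dimension is bounded and snapped to a small grid of values, and the grid indices become the discrete tokens. This is a minimal sketch of plain FSQ; the grouped and residual stages that M2-her's encoder adds on top are omitted:

```python
import math

def fsq_quantize(z, levels):
    """Minimal sketch of Finite Scalar Quantization (FSQ): bound each
    latent dimension with tanh, then snap it to one of `levels[i]`
    evenly spaced grid points on [-1, 1]. The grouped and residual
    stages used in M2-her's encoder are omitted for brevity."""
    codes = []
    for x, n_levels in zip(z, levels):
        x = math.tanh(x)                    # squash into (-1, 1)
        step = 2.0 / (n_levels - 1)         # grid spacing on [-1, 1]
        idx = round((x + 1.0) / step)       # nearest grid index
        codes.append(min(max(idx, 0), n_levels - 1))
    return codes                            # one discrete code per dimension

# A 4-dim latent with level sizes [8, 8, 5, 5] yields 8*8*5*5 = 1600
# possible codes without any learned codebook.
print(fsq_quantize([0.3, -1.2, 0.0, 2.5], [8, 8, 5, 5]))  # [5, 1, 2, 4]
```

Unlike classic VQ-VAE codebooks, this grid needs no learned embedding table, which is part of FSQ's appeal.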

The encoder was trained on 200,000 hours of audio in 39 languages. This helped it learn to distinguish not only words but also accents, speech mannerisms, and background noise.

The Decoder does the reverse: it takes tokens from the language model and turns them back into sound. Here, SpeechFlow is used – a diffusion-based architecture that generates audio stage-by-stage, refining details at each step. This allows for more natural speech with correct pauses and intonations.
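
The stage-by-stage refinement idea can be illustrated with a toy loop: start from noise and move the sample a fraction of the way toward a prediction at each step. In a real diffusion/flow decoder that prediction comes from a neural network conditioned on the audio tokens; here a fixed vector stands in for it:

```python
import random

def iterative_refine(prediction, steps=10):
    """Toy sketch of stage-wise generation: begin with pure noise and
    nudge the sample toward `prediction` at every step, with larger
    steps near the end. `prediction` stands in for the output of a
    neural network conditioned on the audio tokens."""
    x = [random.gauss(0.0, 1.0) for _ in prediction]  # start from noise
    for t in range(steps):
        alpha = 1.0 / (steps - t)        # step size grows toward the end
        x = [xi + alpha * (pi - xi) for xi, pi in zip(x, prediction)]
    return x

target = [0.2, -0.5, 0.9]                # stand-in for a chunk of waveform
refined = iterative_refine(target)
print(all(abs(a - b) < 1e-9 for a, b in zip(refined, target)))  # True
```

Early steps fix the coarse structure while later steps polish the details, which is why such decoders produce more natural pauses and intonation than one-shot synthesis.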

Training the Model: Three Stages

M2-her was trained in three stages, and each solved a specific task.

The first stage involved teaching the model to understand the connection between text and voice. Massive datasets of text–audio pairs were used, and the model was trained to predict audio tokens from text. At this stage, the model learned not only to pronounce words but also to choose the correct pace, tone, and timbre.

The second stage was teaching the model to conduct a dialogue. Synthetic data was used: conversations were generated from text datasets and then turned into audio with speech synthesizers. The model learned to understand the context of the conversation, remember previous replies, and stay on topic.

The third stage was fine-tuning using human feedback, with an approach similar to RLHF (Reinforcement Learning from Human Feedback). People were asked to rate the model's responses on various criteria: how useful, natural, and context-appropriate they were. Based on these ratings, the model adjusted its behavior.
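
The heart of preference tuning can be illustrated with the Bradley–Terry loss commonly used to train reward models in RLHF-style pipelines. This is a generic textbook formulation, not MiniMax's disclosed recipe:

```python
import math

def bt_loss(score_preferred, score_rejected):
    """Bradley-Terry preference loss: negative log-probability that the
    human-preferred reply out-scores the rejected one under a sigmoid."""
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Reward model already agrees with the human rating -> small loss
print(round(bt_loss(2.0, 0.5), 3))  # 0.201
# Reward model contradicts the rating -> large loss, strong learning signal
print(round(bt_loss(0.5, 2.0), 3))  # 1.701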

What the Model Can Do

M2-her shows impressive results in several areas:

  • Speech recognition. The model can transcribe audio in 39 languages. On the LibriSpeech benchmark (English), it achieved a 1.74% word error rate (WER) – on par with professional recognition systems.
  • Speaker identification. The model can distinguish between different people's voices and understand exactly who is speaking at a given moment. The equal error rate (EER) on the VoxCeleb1 benchmark is 0.22%, which is close to the best specialized models.
  • Speech generation. The model can speak in different languages, mimic a specific person's manner of speech, and change intonation depending on the context. On the SEED-TTS benchmark, the speech quality score is 4.48 out of 5, and the similarity to the original voice is 4.32 out of 5.
  • Conducting a dialogue. The model understands the conversation context, can answer complex questions, and clarify details. In tests on dialogue capabilities, it scored 7.95 out of 10 points – which is higher than many competitors.
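
Figures like the 1.74% WER above come from a standard word-level edit-distance calculation. A minimal reference implementation of the metric:

```python
def wer(reference, hypothesis):
    """Word Error Rate: (substitutions + insertions + deletions) divided
    by the number of reference words, via classic edit distance."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                           # delete all i reference words
    for j in range(len(h) + 1):
        d[0][j] = j                           # insert all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(r)][len(h)] / len(r)

# One substitution in four reference words -> WER of 25%
print(wer("turn on the lights", "turn off the lights"))  # 0.25
```

A WER of 1.74% therefore means fewer than two word-level errors per hundred reference words.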

Where This Can Be Useful

Voice models of this level open up several interesting possibilities.

Next-generation voice assistants. Instead of mechanical answers – a natural conversation with pauses, intonation, and an understanding of context. You will be able not just to issue commands but to talk as you would with a human.

Multilingual support. The model can communicate in 39 languages, and this isn't just text translation. It understands cultural nuances, accents, and the manner of speech of each language.

Voiceover and dubbing. It is possible to clone an actor's voice and use it for dubbing in other languages. Moreover, the model will preserve not only the timbre but also the manner of speech and emotions.

Education and accessibility. Voice interfaces can help people with disabilities, as well as those learning a new language – the model can maintain a conversation, correct mistakes, and adjust to the interlocutor's level.

What's Next

MiniMax plans to continue developing the model. Immediate plans include improving speech generation quality, expanding the set of languages, and reducing response latency.

The company is also working on making the model more controllable. It is already possible to set parameters like speech speed, timbre, and emotional tone. In the future, it may become possible to fine-tune the model's behavior for specific tasks even more precisely.
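
As a purely hypothetical illustration, a request exposing such controls might look like the JSON below. Every field name here is invented for the example and does not reflect MiniMax's real API:

```python
import json

# Hypothetical request shape for a controllable voice reply.
# All field names are invented for illustration; this is NOT MiniMax's API.
request = {
    "input_audio": "question.wav",   # the user's spoken turn
    "reply_controls": {
        "speed": 1.1,                # 10% faster than a neutral pace
        "timbre": "warm",            # requested voice character
        "emotion": "calm",           # target emotional tone
    },
    "language": "ru",                # one of the 39 supported languages
}
print(json.dumps(request, indent=2))
```

The design point is that such knobs sit outside the model's dialogue context, so an application can enforce them independently of what the user says.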

Another important point is safety and usage ethics. The model can copy voices, and this creates risks of misuse. MiniMax says it is working on protection mechanisms: synthetic speech detection, speaker authentication, and monitoring the use of cloned voices.

For now, M2-her is more of a research project demonstrating the technology's capabilities. But if you look at the pace of voice model development over the last year, one can assume that mass market products based on them will appear quite soon.

#technical context #educational content #neural networks #ai linguistics #engineering #interfaces #generative models #multimodal models
Original Title: A Deep Dive into the MiniMax-M2-her
Publication Date: Jan 26, 2026
MiniMax (www.minimax.io) – a Chinese AI company developing large language and multimodal models for dialogue and content generation.

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Analyzing the Original Publication and Writing the Text – Claude Sonnet 4.5 (Anthropic). The neural network studies the original material and generates a coherent text.
2. Translation into English – Gemini 3 Pro Preview (Google DeepMind).
3. Text Review and Editing – Gemini 2.5 Flash (Google DeepMind). Correction of errors, inaccuracies, and ambiguous phrasing.
4. Preparing the Illustration Description – DeepSeek-V3.2 (DeepSeek). Generating a textual prompt for the visual model.
5. Creating the Illustration – FLUX.2 Pro (Black Forest Labs). Generating an image based on the prepared prompt.
