MiniMax has released the M2-her voice model – a system that can listen, understand, and respond with voice in near real time. Moreover, it does this in 39 languages, including Russian. But the most interesting part is how it is built under the hood.
What Is M2-her and Why Is It Unique
M2-her is not just a language model with a speech synthesizer bolted onto it. It is a system that works with voice directly: it receives audio, processes it, and generates a response in audio form as well. There is no intermediate step involving text.
Previously, voice assistants worked as a pipeline: first recognize speech into text, then process the text with a language model, and finally synthesize the answer back into voice. Here, everything happens inside a single model, which provides several advantages: lower latency, more control over intonation, and the ability to account for non-verbal cues – pauses, tone, and emotion.
M2-her is built on the foundation of the MiniMax-01 large language model, which already knows how to work with text in different languages. Now, a voice layer has been added to it.
How M2-her Processes and Synthesizes Speech
All the magic happens thanks to two components: an audio encoder and decoder.
The Encoder accepts audio and turns it into a set of tokens – discrete units that the language model can work with. An architecture called Grouped Residual FSQ (Finite Scalar Quantization) is used for this. Simply put, sound is compressed into a compact representation that preserves important information: what was said, who is speaking, and with what intonation.
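MiniMax has not published the exact Grouped Residual FSQ configuration, but the core idea of finite scalar quantization is simple enough to sketch: squash each latent dimension into a bounded range, snap it to a small fixed grid of levels, and pack the per-dimension level indices into one integer token. The 7-level grid and 3-dimensional latent below are made-up illustrative choices, not the model's real parameters.

```python
import numpy as np

def fsq_quantize(z: np.ndarray, levels: int = 7):
    """Quantize each scalar to one of `levels` fixed values in [-1, 1].

    Returns (values, indices): the snapped values and their integer level
    indices in 0..levels-1. An odd number of levels keeps the grid
    symmetric around zero.
    """
    half = (levels - 1) // 2
    bounded = np.tanh(z)                        # squash into (-1, 1)
    idx = np.round(bounded * half).astype(int)  # integer level in -half..half
    return idx / half, idx + half

def indices_to_token(indices: np.ndarray, levels: int = 7) -> int:
    """Pack per-dimension level indices into a single integer token (mixed radix)."""
    token = 0
    for i in indices:
        token = token * levels + int(i)
    return token

z = np.array([0.3, -1.2, 0.8])   # one hypothetical latent frame from the encoder
values, indices = fsq_quantize(z)
token = indices_to_token(indices)
```

Unlike a learned VQ codebook, this grid is fixed, which avoids codebook-collapse problems; the "grouped residual" part of the real architecture applies several such quantizers to successive residuals, which is omitted here.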
The encoder was trained on 200,000 hours of audio in 39 languages. This helped it learn to distinguish not only words but also accents, speech mannerisms, and background noise.
The Decoder does the reverse: it takes tokens from the language model and turns them back into sound. Here, SpeechFlow is used – a diffusion-based architecture that generates audio stage-by-stage, refining details at each step. This allows for more natural speech with correct pauses and intonations.
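SpeechFlow's actual network is not reproduced here, but the step-by-step refinement it relies on is easy to illustrate: start from noise and repeatedly nudge the sample toward the clean signal with small integration steps. The `velocity` function below is a stand-in for the trained network (which would be conditioned on the audio tokens); here it simply pulls toward a fixed target so the loop is runnable.

```python
import numpy as np

def velocity(x: np.ndarray, t: float, target: np.ndarray) -> np.ndarray:
    # Stand-in for the learned velocity/denoising network.
    # `t` is unused in this toy version but kept for the real signature shape.
    return target - x

def decode(target: np.ndarray, steps: int = 10, seed: int = 0) -> np.ndarray:
    """Euler-integrate from pure noise toward the target over `steps` refinements."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)    # pure noise at t = 0
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity(x, i * dt, target)  # one refinement step
    return x

target = np.array([0.1, -0.4, 0.7])          # pretend "clean audio" frame
out = decode(target, steps=50)
```

Each step only closes part of the gap, which is exactly why such decoders can trade latency for quality by varying the number of refinement steps.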
Training the Model: Three Stages
M2-her was trained in three stages, and each solved a specific task.
The first stage involved teaching the model to understand the connection between text and voice. Massive datasets of text-audio pairs were used, and the model was trained to predict audio tokens from text. Here, the model learned not only to pronounce words but also to choose the correct pace, tone, and timbre.
The second stage was teaching the model to conduct a dialogue. Synthetic data was used: conversations were generated from text datasets and then turned into audio with speech synthesizers. The model learned to understand the context of a conversation, remember previous turns, and give relevant answers.
The third stage was fine-tuning with human feedback, an approach similar to RLHF (Reinforcement Learning from Human Feedback). People were asked to rate the model's responses on several criteria: how useful, natural, and context-appropriate they were. Based on these ratings, the model's behavior was adjusted.
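One common way to turn such human ratings into a training signal – not necessarily the exact method MiniMax used – is a pairwise preference loss: a reward model scores two candidate responses, and the loss pushes the preferred one's score above the rejected one's (the Bradley-Terry objective used in standard RLHF reward modeling). The scores below are made-up numbers, not outputs of any real model.

```python
import math

def preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(preferred - rejected)."""
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

loss_good = preference_loss(2.0, 0.5)   # ranking already correct: small loss
loss_bad = preference_loss(0.5, 2.0)    # ranking inverted: large loss
```

Minimizing this loss over many rated pairs teaches the reward model which behaviors humans prefer; the voice model is then optimized against that reward.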
Key Capabilities of the M2-her Model
M2-her shows impressive results in several areas:
- Speech recognition. The model can transcribe audio in 39 languages. On the LibriSpeech benchmark (English), it achieves 1.74% WER (word error rate) – on par with professional recognition systems.
- Speaker identification. The model can distinguish between different people's voices and understand exactly who is speaking at a given moment. On the VoxCeleb1 benchmark it achieves 0.22% EER (equal error rate), close to the best specialized models.
- Speech generation. The model can speak in different languages, mimic a specific person's manner of speech, and change intonation depending on the context. On the SEED-TTS benchmark, the speech quality score is 4.48 out of 5, and the similarity to the original voice is 4.32 out of 5.
- Conducting a dialogue. The model understands the conversation context, can answer complex questions, and clarify details. In tests on dialogue capabilities, it scored 7.95 out of 10 points – which is higher than many competitors.
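The WER figure cited above has a simple definition: the word-level edit distance (insertions, deletions, substitutions) between the model's transcript and the reference, divided by the number of reference words. A minimal sketch of how such a score is computed:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via classic dynamic-programming edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # delete all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

score = wer("the cat sat on the mat", "the cat sat in the mat")
# one substitution over six reference words -> 1/6
```

Benchmark WER is this quantity aggregated over a whole test set; 1.74% means fewer than two word errors per hundred reference words.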
Applications and Potential Uses of M2-her
Voice models of this level open up several interesting possibilities.
Next-generation voice assistants. Instead of mechanical answers – a natural conversation with pauses, intonations, and context understanding. It will be possible to not just issue commands, but to talk as if with a human.
Multilingual support. The model can communicate in 39 languages, and this isn't just text translation. It understands cultural nuances, accents, and the speech patterns characteristic of each language.
Voiceover and dubbing. It is possible to clone an actor's voice and use it for dubbing in other languages. Moreover, the model will preserve not only the timbre but also the manner of speech and emotions.
Education and accessibility. Voice interfaces can help people with disabilities, as well as those learning a new language – the model can maintain a conversation, correct mistakes, and adjust to the interlocutor's level.
Future Development and Plans
MiniMax plans to continue developing the model. Immediate plans include improving speech generation quality, expanding the set of languages, and reducing response latency.
The company is also working on making the model more controllable. It is already possible to set parameters like speech speed, timbre, and emotional tone. In the future, it may become possible to fine-tune the model's behavior for specific tasks even more precisely.
Another important point is safety and usage ethics. The model can copy voices, and this creates risks of misuse. MiniMax says it is working on protection mechanisms: synthetic speech detection, speaker authentication, and monitoring the use of cloned voices.
For now, M2-her is more of a research project demonstrating the technology's capabilities. But if you look at the pace of voice model development over the last year, one can assume that mass market products based on them will appear quite soon.