Published January 22, 2026

How Agentic Models Are Trained After Base Training

MiniMax has discussed its approach to fine-tuning language models that do more than just answer questions – they execute complex tasks by interacting with tools.

Categories: Technical Context, Development
Event Source: MiniMax · Reading Time: 6–8 minutes

When we talk about AI agents (models that don't just answer questions but execute complex tasks through chains of actions), base intelligence isn't the only thing that matters. We also need to teach the model how to use tools correctly, plan steps, and maintain focus throughout a long dialogue.

The MiniMax team published a detailed breakdown of their approach to the post-training of agentic models. In short: after the model has undergone base training on textual data, it is further tuned to operate as an agent, with function calls, external API usage, and multi-step planning.

What Post-Training Is and Why It's Needed

A base model can generate text, answer questions, and reason. But to become a helpful agent, it needs to be taught to:

  • understand when to call an external tool (e.g., search, calculator, or API);
  • correctly formulate requests for these tools;
  • interpret results and integrate them into subsequent logic;
  • plan a sequence of actions to solve a complex task.
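
The capabilities above can be sketched as a minimal tool-use loop. This is an illustrative toy, not MiniMax's implementation: the tool names, the action format, and the scripted stand-in for the model are all assumptions.

```python
# Hypothetical tool registry: names and implementations are illustrative only.
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
    "search": lambda query: f"results for '{query}'",
}

def run_agent(model_step, task, max_steps=5):
    """Drive a model through a tool-use loop.

    `model_step` maps (task, history) to either a tool call
    {"tool": ..., "input": ...} or a final answer {"answer": ...}.
    """
    history = []
    for _ in range(max_steps):
        action = model_step(task, history)
        if "answer" in action:                 # the model decided to finish
            return action["answer"], history
        tool = TOOLS[action["tool"]]           # pick the requested tool
        result = tool(action["input"])         # execute it
        history.append({"action": action, "result": result})  # feed back

    return None, history

# A scripted "model" standing in for a real LLM policy.
def scripted_model(task, history):
    if not history:
        return {"tool": "calculator", "input": "2 + 3"}
    return {"answer": history[-1]["result"]}

answer, trace = run_agent(scripted_model, "What is 2 + 3?")
```

Post-training teaches the model to produce exactly the kind of structured actions this loop consumes.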

This is exactly what post-training deals with. MiniMax uses several stages: supervised fine-tuning (SFT), reinforcement learning (RL), and combinations thereof. The goal is to make the model not just smart, but practically applicable in real-world scenarios.

How Data Is Collected for Training an Agent

One of the key challenges is getting high-quality examples of how an agent should work. MiniMax uses several sources:

Synthetic data. The model generates tool usage examples itself, which are then filtered and verified. This allows for quick scaling of the dataset but requires strict quality control.
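
The "generate, then filter and verify" step might look like the following sketch. The example structure and the verifier are assumptions for illustration; MiniMax's actual filtering criteria are not described in the source.

```python
def filter_synthetic_examples(examples, verifier):
    """Keep only synthetic tool-use traces that pass verification.

    `verifier` returns True when a trace's final answer checks out;
    malformed tool calls are rejected before it even runs.
    """
    kept = []
    for ex in examples:
        calls_ok = all("tool" in step and "input" in step for step in ex["trace"])
        if calls_ok and verifier(ex):
            kept.append(ex)
    return kept

# Illustrative examples: one valid trace, one with a malformed step.
examples = [
    {"trace": [{"tool": "search", "input": "weather"}], "answer": "sunny"},
    {"trace": [{"tool": "search"}], "answer": "???"},  # missing "input"
]
clean = filter_synthetic_examples(examples, verifier=lambda ex: ex["answer"] != "???")
```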

User data. Real dialogues and requests help identify which tasks occur most frequently in practice. Anonymization and filtering are crucial here: not all user requests are suitable for training.

Expert labeling. For complex scenarios, humans are brought in to manually label the correct sequences of actions. This is expensive but yields high-quality examples.

MiniMax notes that the balance between these sources is critically important. Too much synthetic data, and the model might overfit to artificial patterns. Too much unfiltered real data, and noise and errors creep in.

Supervised Fine-Tuning: Learning by Example

At the first stage, the model learns from labeled examples. It is shown: here is the task, here is the correct sequence of steps, and here is how the answer should look.

Here, it is important not just to "feed" the model more data. One must ensure task diversity, making sure examples cover different types of tools and scenarios. MiniMax uses curriculum learning: starting with simple tasks and gradually increasing their complexity.
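
At its simplest, curriculum learning is just ordering training data from easy to hard. A minimal sketch, assuming trace length as a stand-in difficulty measure (the actual criterion MiniMax uses is not specified):

```python
def build_curriculum(examples, difficulty):
    """Order training examples from simple to complex (curriculum learning).

    `difficulty` scores an example, e.g. by the number of tool calls
    in its reference trace; training then moves from easy to hard.
    """
    return sorted(examples, key=difficulty)

# Illustrative examples scored by hypothetical trace length.
examples = [
    {"task": "multi-step research", "trace_len": 12},
    {"task": "single calculation", "trace_len": 1},
    {"task": "two-tool lookup", "trace_len": 3},
]
curriculum = build_curriculum(examples, difficulty=lambda ex: ex["trace_len"])
```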

Another point: formatting. Agentic models work with structured function calls (JSON objects, special tokens). A format error can break the whole chain of actions, so at the SFT stage the model is trained to strictly follow the required syntax.
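
A format checker of the kind that sits next to such training might look like this. The call schema (`name` plus `arguments`) is a common convention, not MiniMax's documented format:

```python
import json

def validate_call(raw, schema):
    """Check that a model-emitted function call is syntactically valid.

    Returns (ok, reason). `schema` maps tool names to their required
    argument names; both are illustrative.
    """
    try:
        call = json.loads(raw)               # must be valid JSON at all
    except json.JSONDecodeError as e:
        return False, f"invalid JSON: {e.msg}"
    if call.get("name") not in schema:
        return False, "unknown tool name"
    missing = set(schema[call["name"]]) - set(call.get("arguments", {}))
    if missing:
        return False, f"missing arguments: {sorted(missing)}"
    return True, "ok"

schema = {"web_search": ["query"]}
ok, _ = validate_call('{"name": "web_search", "arguments": {"query": "MiniMax"}}', schema)
bad, reason = validate_call('{"name": "web_search", "arguments": {}}', schema)
```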

Reinforcement Learning: Teaching Through Rewards

After SFT, the model already knows how to call functions and follow examples. But it's not yet optimal: it might choose inefficient paths, take unnecessary steps, or sometimes make planning mistakes.

For fine-tuning, reinforcement learning is used. The model receives a task, tries to solve it, and then gets a reward depending on the result. If the task is solved correctly and efficiently, the reward is high. If there is an error or too many unnecessary actions, it is low.

MiniMax experimented with different reward functions. It turned out that it's important to consider not just the final result, but intermediate steps too. For example, if the model called the right tool but formulated the request inaccurately, this also needs to be factored into the reward.
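
Such a reward function, crediting intermediate steps as well as the final outcome, could be sketched as follows. The weights and step flags are assumptions; the source only says that intermediate quality is factored in:

```python
def trajectory_reward(steps, task_solved, w_final=1.0, w_step=0.1):
    """Combine a final-outcome reward with per-step shaping terms.

    Each step carries flags for choosing the right tool and phrasing
    the request well; the weights are illustrative.
    """
    final = w_final if task_solved else 0.0
    shaping = 0.0
    for step in steps:
        if step["right_tool"]:
            shaping += w_step          # credit for picking the right tool
            if not step["good_query"]:
                shaping -= w_step / 2  # partial penalty for a sloppy request
    return final + shaping

steps = [
    {"right_tool": True, "good_query": True},
    {"right_tool": True, "good_query": False},  # right tool, imprecise query
]
r = trajectory_reward(steps, task_solved=True)
```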

Another problem with RL is instability. The model can suddenly "unlearn" what it knew before if a specific metric is optimized too aggressively. Therefore, techniques like reward shaping and a KL penalty are used so the model doesn't drift too far from its initial behavior.
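
The KL penalty idea, in its common per-token estimate, subtracts a term proportional to how much more likely the policy finds its own action than the reference (SFT) model does. A minimal sketch, with an illustrative coefficient `beta`:

```python
def kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Task reward minus a KL-style penalty toward the reference model.

    Uses the common estimate beta * (logp_policy - logp_ref);
    beta controls how far the policy may drift from its SFT starting point.
    """
    return reward - beta * (logp_policy - logp_ref)

# If the policy assigns much higher log-probability to its action than
# the reference model does, the penalty pulls the reward down.
drifted = kl_penalized_reward(1.0, logp_policy=-1.0, logp_ref=-3.0)
aligned = kl_penalized_reward(1.0, logp_policy=-2.0, logp_ref=-2.0)
```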

What About Long Action Chains

One of the main challenges for agents is working with multi-step tasks. Imagine: you need to find information online, process it, call an API, analyze the result, and give an answer. This can take dozens of steps.

The longer the chain, the higher the probability of error. MiniMax discovered that models often "lose the thread" on long tasks: forgetting intermediate results or starting to repeat the same actions.

To solve this, they added special techniques:

  • Intermediate checkpoints. The model periodically "summarizes" the current state of the task: what has already been done and what is left.
  • Explicit planning. Before starting execution, the model first generates a plan of action and then follows it. This helps not to get lost in the process.
  • Error recovery. If the model realizes it made a mistake, it can roll back and try a different path.
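
The first two techniques, explicit planning plus periodic state summaries, can be combined in one loop. A toy sketch under assumed names; the summarization interval and summary format are illustrative:

```python
def run_with_checkpoints(plan, execute, summarize_every=2):
    """Execute a pre-generated plan, summarizing state at intervals.

    `execute` runs one step; every `summarize_every` steps a compact
    summary of what is done and what remains is produced, which is
    what keeps long chains from "losing the thread".
    """
    done, summaries = [], []
    for i, step in enumerate(plan, start=1):
        done.append(execute(step))
        if i % summarize_every == 0:
            remaining = plan[i:]
            summaries.append(f"done {i}/{len(plan)} steps; remaining: {remaining}")
    return done, summaries

plan = ["search", "parse", "call API", "analyze"]
done, summaries = run_with_checkpoints(plan, execute=lambda s: f"{s}: ok")
```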

These mechanisms aren't always needed for simple tasks but are critically important for complex scenarios.

Quality Assessment: How to Know the Agent Is Working Well

With ordinary language models, everything is relatively clear: there are benchmarks, metrics like perplexity, and human evaluation. With agents, it's harder.

We need to evaluate not just the quality of the final answer, but also:

  • correctness of tool selection;
  • efficiency of the action sequence;
  • correctness of API request formation;
  • ability to handle unexpected situations (e.g., when an API returns an error).
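
The automated side of these checks could be a set of binary metrics computed over a trace. Field names and criteria here are assumptions for illustration:

```python
def score_trace(trace, expected_tools):
    """Automated checks over an agent trace.

    Returns binary metrics: did the agent pick the expected tools,
    avoid redundant repeated calls, and keep every call well-formed?
    """
    tools_used = [step.get("tool") for step in trace]
    calls = [(step.get("tool"), step.get("input")) for step in trace]
    return {
        "right_tools": set(tools_used) == set(expected_tools),
        "no_repeats": len(calls) == len(set(calls)),
        "well_formed": all("tool" in s and "input" in s for s in trace),
    }

trace = [
    {"tool": "search", "input": "MiniMax agents"},
    {"tool": "search", "input": "MiniMax agents"},  # redundant repeat
]
metrics = score_trace(trace, expected_tools=["search"])
```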

MiniMax uses a combination of automated metrics and human evaluation. Automation checks formal correctness: proper call format and absence of syntax errors. Humans evaluate the meaningfulness of actions and the quality of the task solution.

Another important point is stress testing. The agent is checked against edge cases: incomplete information, contradictory data, tool unavailability. How will the model behave if a search API suddenly returns an empty result? Will it break or try a different approach?
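
The empty-search-result edge case from the text can be turned into a tiny stress test: an agent that tries a fallback strategy instead of breaking. Function names are hypothetical:

```python
def answer_with_fallback(search, fallback, query):
    """Handle an empty search result instead of breaking.

    If the primary `search` returns nothing (the edge case above),
    the agent retries with a `fallback` strategy rather than failing.
    """
    results = search(query)
    if not results:                    # the API returned an empty result
        results = fallback(query)      # try a different approach
    return results[0] if results else "no information found"

# Stress test: the primary search always comes back empty.
empty_search = lambda q: []
fallback = lambda q: [f"cached answer for '{q}'"]

resilient = answer_with_fallback(empty_search, fallback, "weather")
degraded = answer_with_fallback(empty_search, lambda q: [], "weather")
```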

What's Next

MiniMax sees several directions for the development of agentic models:

Multimodality. Currently, most agents work with text. But tasks often require processing images, video, and audio. We need models that can naturally work with different data types.

Personalization. An agent must account for user context, preferences, and interaction history. This requires new approaches to training and long-term memory storage.

Safety. Agents that can call external APIs and perform actions in the real world carry risks. Control mechanisms are needed to ensure the model doesn't do something undesirable.

Post-training of agentic models is not just fine-tuning on additional data. It is a distinct engineering task with its own challenges: from collecting high-quality examples to stabilizing reinforcement learning. But the result is worth it: models that don't just talk, but actually help solve tasks.

#technical context #methodology #neural networks #ai training #engineering #model architecture #data #model scaling #generative agents
Original Title: Post-Training Experience and Insights for Agent Models
Publication Date: Jan 21, 2026
MiniMax www.minimax.io A Chinese AI company developing large language and multimodal models for dialogue and content generation.

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role: analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Analyzing the Original Publication and Writing the Text: Claude Sonnet 4.5 (Anthropic). The neural network studies the original material and generates a coherent text.
2. Translation into English: Gemini 3 Pro Preview (Google DeepMind).
3. Text Review and Editing: Gemini 2.5 Flash (Google DeepMind). Correction of errors, inaccuracies, and ambiguous phrasing.
4. Preparing the Illustration Description: DeepSeek-V3.2 (DeepSeek). Generating a textual prompt for the visual model.
5. Creating the Illustration: FLUX.2 Pro (Black Forest Labs). Generating an image based on the prepared prompt.
