Published on March 10, 2026

How AI Helps Find Failures in Large Model Training

AMD has shared how to automate failure diagnostics in large-scale AI model training using an LLM-based agent system.

Event Source: AMD

When a company or research group trains a large language model, it uses hundreds or even thousands of graphics processing units (GPUs) working in tandem. This is a complex system, and like any complex system, it occasionally fails. Something goes wrong with the hardware, network, memory, or software. And that's when the real headache begins: figuring out what exactly happened.

Usually, an engineer handles this by analyzing logs, collecting metrics, and forming hypotheses. The work is time-consuming and requires deep expertise. The AMD team proposes a different approach: handing diagnostics over to an agent system built on a language model.

Why Diagnostics Is a Separate Problem

Training a large model isn't just about “running a script and waiting.” It's a continuous process that can last for weeks. During this time, the system generates vast amounts of data: GPU load, temperature, network latency, training speed, and node-level errors. If something goes wrong, the cause has to be found quickly, or hours or even days of computation could be wasted.
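To make "something goes wrong" concrete, here is a minimal Python sketch of how a slowdown could be spotted in a series of per-step training times. The function name, the baseline window, and the 50% threshold are illustrative assumptions, not details from AMD's post.

```python
from statistics import median

def slowdown_windows(step_times, baseline_steps=5, threshold=0.5):
    """Find index ranges where step time exceeds the baseline by more than `threshold`.

    The baseline is the median of the first `baseline_steps` samples
    (an assumption made for this sketch).
    """
    baseline = median(step_times[:baseline_steps])
    windows, start = [], None
    for i, t in enumerate(step_times):
        slow = t > baseline * (1.0 + threshold)
        if slow and start is None:
            start = i                      # a slow stretch begins
        elif not slow and start is not None:
            windows.append((start, i - 1)) # the slow stretch just ended
            start = None
    if start is not None:                  # series ended while still slow
        windows.append((start, len(step_times) - 1))
    return baseline, windows

# Made-up step times in seconds; steps 6 to 8 are noticeably slower.
step_times = [1.0, 1.0, 1.1, 1.0, 1.0, 1.0, 1.8, 1.9, 1.7, 1.0, 1.0]
baseline, windows = slowdown_windows(step_times)
```

A detector like this only says when training slowed down. Working out why is the part the rest of the article is about.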

The problem is that the data is voluminous, fragmented, and hard to interpret without context. An engineer encountering such an incident for the first time might spend several hours just figuring out where to start. Moreover, the culprit could be anything: a slow network node, a faulty GPU, a configuration error, or simply a temporary load spike.

A Single Point of Data Collection

At the core of the system is a unified time-series repository – a database where all metrics from the GPUs, host machines, network, and the training process itself are collected. Everything is written to disk, so data isn't lost even during failures. Simply put, if something breaks, the system has a complete picture of what was happening at the moment of failure and before it.

This approach is useful in itself: instead of manually collecting data from various sources, an engineer gets a single snapshot of everything happening at the right moment in time.
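The idea of a single repository can be sketched with Python's built-in sqlite3 module. This is an illustration under stated assumptions: the table layout, the column names, and the helper functions are hypothetical, and the source does not say which database AMD uses.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) the metrics store; pass a file path to persist it on disk."""
    conn = sqlite3.connect(path)
    # Hypothetical schema: one row per sample, whatever its origin.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS metrics (
               ts     REAL NOT NULL,
               source TEXT NOT NULL,
               node   TEXT NOT NULL,
               name   TEXT NOT NULL,
               value  REAL NOT NULL
           )"""
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_metrics_ts ON metrics (ts)")
    return conn

def record(conn, ts, source, node, name, value):
    """Store one sample; the `with` block commits the write immediately."""
    with conn:
        conn.execute(
            "INSERT INTO metrics VALUES (?, ?, ?, ?, ?)",
            (ts, source, node, name, value),
        )

def snapshot(conn, failure_ts, lookback):
    """Return all samples from every source in [failure_ts - lookback, failure_ts]."""
    cur = conn.execute(
        "SELECT ts, source, node, name, value FROM metrics "
        "WHERE ts BETWEEN ? AND ? ORDER BY ts",
        (failure_ts - lookback, failure_ts),
    )
    return cur.fetchall()

store = open_store()
record(store, 10.0, "gpu", "node-03", "temperature_c", 71.0)
record(store, 95.0, "network", "node-03", "latency_ms", 4.2)
record(store, 100.0, "training", "job", "step_time_s", 1.9)
rows = snapshot(store, failure_ts=100.0, lookback=30.0)
```

Because GPU, network, and training samples share one table and one time axis, a single query returns the cross-source picture around the failure.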

The Agent as an Analyst

The key idea, however, is not just to store the data but to analyze it automatically. To do this, an LLM agent is connected to the repository: a language model that can not only answer questions but also purposefully investigate a problem.

How does this work in practice? The agent receives a description of the incident – for example, “training slowed down by 40% during a specific interval” – and starts querying the data. It formulates its own queries to the metrics database, looks at the results, refines its hypotheses, and moves on. This isn't a linear search through a checklist, but an iterative process: the agent narrows down the list of suspects step by step.

Ultimately, the system provides an explanation: what most likely happened, on which node or component, and what data confirms it.
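The loop described above can be sketched in Python as follows. Everything here is a toy stand-in: the metric values are made up, `scripted_policy` replaces the language model that would choose the next query, and the median-based outlier rule is an assumption of this sketch, not AMD's method.

```python
from dataclasses import dataclass, field
from statistics import median

@dataclass
class Finding:
    summary: str
    component: str
    evidence: list = field(default_factory=list)

# Made-up per-node values during the incident window.
METRICS = {
    "step_time_s": {"node-01": 1.0, "node-02": 1.0, "node-03": 1.4, "node-04": 1.0},
    "network_latency_ms": {"node-01": 0.9, "node-02": 1.0, "node-03": 6.5, "node-04": 1.1},
    "gpu_temperature_c": {"node-01": 70.0, "node-02": 71.0, "node-03": 70.5, "node-04": 69.0},
}

def query(metric, nodes):
    """Tool the agent can call: fetch one metric for the given nodes."""
    return {n: METRICS[metric][n] for n in nodes}

def outliers(values, ratio=1.3):
    """Nodes whose value exceeds the cross-node median by `ratio` (illustrative rule)."""
    mid = median(values.values())
    return {n for n, v in values.items() if v > ratio * mid}

def scripted_policy(history):
    """Stand-in for the language model: pick the next metric, or None to stop."""
    plan = ["step_time_s", "network_latency_ms", "gpu_temperature_c"]
    return plan[len(history)] if len(history) < len(plan) else None

def diagnose(incident, nodes, policy=scripted_policy, max_steps=10):
    suspects, history, evidence = set(nodes), [], []
    for _ in range(max_steps):
        metric = policy(history)
        if metric is None:
            break
        values = query(metric, nodes)
        flagged = outliers(values) & suspects   # only keep nodes still under suspicion
        history.append((metric, sorted(flagged)))
        if flagged:                             # narrow the suspect list
            suspects = flagged
            evidence.extend(f"{metric}={values[n]} on {n}" for n in sorted(flagged))
    narrowed = len(suspects) < len(nodes)
    component = ", ".join(sorted(suspects)) if narrowed else "undetermined"
    return Finding(f"Likely cause of: {incident}", component, evidence)

result = diagnose(
    "training slowed down by 40% during a specific interval",
    ["node-01", "node-02", "node-03", "node-04"],
)
```

In a real agent, the policy would be a model call that sees the incident text and the query history. The fixed plan here only keeps the control flow visible and the example runnable.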

Not a Replacement for an Engineer, but Significant Help

It's important to understand that the agent doesn't make decisions or apply fixes on its own. It diagnoses and presents its findings to a human. But even this changes the situation dramatically: instead of spending several hours on data collection and initial analysis, an engineer receives a structured hypothesis and can immediately move on to verification and fixing the problem.

This is especially relevant for teams working with large clusters. The more nodes there are, the higher the probability of failure at any given moment – and the more costly every delay in diagnostics becomes.
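The scaling effect can be checked with a short calculation. The sketch assumes that nodes fail independently and with the same probability in a given time window, which is a simplification, and the 0.1% per-node figure is invented for illustration.

```python
def prob_any_failure(per_node_prob, num_nodes):
    """Probability that at least one of `num_nodes` independent nodes fails."""
    return 1.0 - (1.0 - per_node_prob) ** num_nodes

# Hypothetical per-node failure probability of 0.1% in some time window.
for n in (10, 100, 1000):
    print(n, round(prob_any_failure(0.001, n), 4))
```

Under these assumptions, the chance of at least one failure grows from roughly 1% on 10 nodes to roughly 63% on 1,000 nodes, even though each individual node is no less reliable.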

Scale as a Separate Challenge

Scaling up training is not just about “adding more GPUs.” As the cluster size grows, so does the complexity: more points of failure, more interdependencies, and more data to analyze. A human physically cannot monitor everything manually.

This is precisely where the agent-based approach shows its value. The system doesn't get tired, doesn't miss metrics, and doesn't lose its train of thought when switching between data sources. It follows the same logic for a failure on 10 nodes as it does for a failure on a thousand.

What's Behind This

AMD demonstrates this approach on training runs that use MaxText and the Slurm job scheduler – standard tools in research and industrial clusters. But what's more important than the specific tech stack is the idea itself: using a language model not for text generation, but for solving operational tasks like diagnostics, analysis, and finding the root causes of failures.

This is part of a broader trend where LLMs are beginning to function not as chatbots, but as automation tools within an infrastructure. Such systems are called “agentic” precisely because they don't just answer a question, but independently go through several steps to achieve a goal.

Open Questions

Like any approach, this one has unresolved issues. How reliable is the agent with non-standard failures that weren't in its “experience”? How does it handle situations where data is insufficient or contradictory? To what extent can its conclusions be considered reliable without additional verification?

These questions remain open for now. Agent-based diagnostics is not a silver bullet, but a tool that must be used deliberately and with an understanding of its limitations. As a first move toward automating the routine parts of an engineer's job, though, it points in a sensible direction.

Original Title: Agentic Diagnosis for LLM Training at Scale – ROCm Blogs
Publication Date: Mar 9, 2026
AMD www.amd.com An international company manufacturing processors and computing accelerators for AI workloads.

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role: analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Claude Sonnet 4.6 (Anthropic): Analyzing the Original Publication and Writing the Text. The neural network studies the original material and generates a coherent text.
2. Gemini 2.5 Pro (Google DeepMind): Translation into English.
3. Gemini 2.5 Flash (Google DeepMind): Text Review and Editing. Correction of errors, inaccuracies, and ambiguous phrasing.
4. DeepSeek-V3.2 (DeepSeek): Preparing the Illustration Description. Generating a textual prompt for the visual model.
5. FLUX.2 Pro (Black Forest Labs): Creating the Illustration. Generating an image based on the prepared prompt.
