Published on March 10, 2026

How AI Helps Find Failures in Large Model Training

AMD has shared how to automate failure diagnostics in large-scale AI model training using an LLM-based agent system.

Infrastructure / Technical context, 4 – 6 min read

Event Source: AMD

When a company or research group trains a large language model, it uses hundreds or even thousands of graphics processing units (GPUs) working in tandem. This is a complex system, and like any complex system, it occasionally fails. Something goes wrong with the hardware, network, memory, or software. And that's when the real headache begins: figuring out what exactly happened.

Usually, an engineer handles this. They analyze logs, collect metrics, and form hypotheses. This is time-consuming and requires deep expertise. The AMD team proposed a different approach: handing over the diagnostics to an agent system based on a language model.

Why Diagnostics Is a Separate Problem

Training a large model isn't just about “running a script and waiting.” It's a continuous process that can last for weeks. During this time, the system generates vast amounts of data: GPU load, temperature, network latency, training speed, and errors on nodes. If something goes wrong, the cause has to be found quickly; otherwise hours or even days of computation can be wasted.

The problem is that the data is voluminous, fragmented, and hard to interpret without context. An engineer encountering such an incident for the first time might spend several hours just figuring out where to start. Moreover, the culprit could be anything: a slow network node, a faulty GPU, a configuration error, or simply a temporary load spike.

A Single Point of Data Collection

At the core of the system is a unified time-series repository – a database where all metrics from the GPUs, host machines, network, and the training process itself are collected. Everything is written to disk, so data isn't lost even during failures. Simply put, if something breaks, the system has a complete picture of what was happening at the moment of failure and before it.

This approach is useful in itself: instead of manually collecting data from various sources, an engineer gets a single snapshot of everything happening at the right moment in time.
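
To make the idea of a single on-disk repository concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not AMD's storage layer: the table schema, the function names (open_store, record, window) and the metric names are assumptions chosen for illustration.

```python
import sqlite3


def open_store(path):
    """Open (or create) a metrics store; the schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        "ts REAL NOT NULL, node TEXT NOT NULL, "
        "metric TEXT NOT NULL, value REAL NOT NULL)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_metric_ts ON samples (metric, ts)"
    )
    return conn


def record(conn, ts, node, metric, value):
    """Append one sample and commit it, so it is durable when path is a file."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO samples VALUES (?, ?, ?, ?)", (ts, node, metric, value)
        )


def window(conn, metric, start, end):
    """Return (ts, node, value) rows for one metric in [start, end]."""
    cur = conn.execute(
        "SELECT ts, node, value FROM samples "
        "WHERE metric = ? AND ts BETWEEN ? AND ? ORDER BY ts, node",
        (metric, start, end),
    )
    return cur.fetchall()


# ":memory:" keeps the demo self-contained; pass a file path to persist.
conn = open_store(":memory:")
record(conn, 1.0, "node-a", "gpu_util", 97.0)
record(conn, 2.0, "node-b", "gpu_util", 12.0)
record(conn, 2.0, "node-a", "step_time_s", 1.4)
print(window(conn, "gpu_util", 0.0, 5.0))
```

The point of the sketch is the shape of the data: every source writes into the same table, keyed by time, node and metric, so one query can return everything around the moment of failure.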

The Agent as an Analyst

But the key idea here is not just to store data, but to analyze it automatically. For this, an LLM agent is connected to the repository: a language model that can not only answer questions but also purposefully investigate a problem.

How does this work in practice? The agent receives a description of the incident – for example, “training slowed down by 40% during a specific interval” – and starts querying the data. It formulates its own queries to the metrics database, looks at the results, refines its hypotheses, and moves on. This isn't a linear search through a checklist, but an iterative process: the agent narrows down the list of suspects step by step.

Ultimately, the system provides an explanation: what most likely happened, on which node or component, and what data confirms it.
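
The loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not AMD's agent: the metric values are made up, next_query is a fixed-order stand-in for the language model's choice of the next query, and the "more than 1.2 times the median" rule is an arbitrary threshold.

```python
from statistics import median

# Made-up per-node averages for the incident window (hypothetical data).
METRICS = {
    "step_time_s": {"node-a": 1.0, "node-b": 1.0, "node-c": 1.4},
    "net_latency_ms": {"node-a": 0.2, "node-b": 0.2, "node-c": 0.2},
    "gpu_temp_c": {"node-a": 70.0, "node-b": 71.0, "node-c": 93.0},
}


def query(metric):
    """Stand-in for a query against the metrics repository."""
    return METRICS[metric]


def next_query(asked):
    """Stand-in for the language model's choice of what to look at next.

    Here it is a fixed order; a real agent would decide from the incident
    description and the evidence gathered so far.
    """
    for metric in ("step_time_s", "net_latency_ms", "gpu_temp_c"):
        if metric not in asked:
            return metric
    return None


def high_nodes(values, ratio=1.2):
    """Nodes whose value is more than `ratio` times the cross-node median."""
    mid = median(values.values())
    return {node for node, value in values.items() if value > mid * ratio}


def investigate(incident, max_steps=10):
    """Query, inspect, narrow the suspects, repeat; then report."""
    suspects, evidence, asked = None, [], []
    for _ in range(max_steps):
        metric = next_query(asked)
        if metric is None:
            break
        asked.append(metric)
        flagged = high_nodes(query(metric))
        if not flagged:
            evidence.append(f"{metric}: no outliers")
            continue
        evidence.append(f"{metric}: high on {sorted(flagged)}")
        narrowed = flagged if suspects is None else suspects & flagged
        if narrowed:  # keep the earlier suspects if the sets do not overlap
            suspects = narrowed
    return {
        "incident": incident,
        "suspects": sorted(suspects or ()),
        "evidence": evidence,
    }


report = investigate("training slowed down by 40% during a specific interval")
print(report["suspects"])
for line in report["evidence"]:
    print(" ", line)
```

The returned dictionary mirrors the explanation the article describes: the most likely component plus the data that supports it. The network hypothesis is recorded as checked and ruled out rather than silently skipped.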

Not a Replacement for an Engineer, but Significant Help

It's important to understand that the agent doesn't make decisions or apply fixes on its own. It diagnoses and presents its findings to a human. But even this changes the situation dramatically: instead of spending several hours on data collection and initial analysis, an engineer receives a structured hypothesis and can immediately move on to verification and fixing the problem.

This is especially relevant for teams working with large clusters. The more nodes there are, the higher the probability of failure at any given moment – and the more costly every delay in diagnostics becomes.
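
A back-of-the-envelope calculation shows why cluster size matters. The sketch below assumes that nodes fail independently and with the same probability over some fixed period; the per-node figure of 0.1% is invented for illustration and is not a measured value.

```python
def p_any_failure(nodes, p_node):
    """Probability that at least one of `nodes` independent nodes fails."""
    return 1.0 - (1.0 - p_node) ** nodes


# Hypothetical per-node failure probability for the period of interest.
P_NODE = 0.001

for n in (10, 100, 1000):
    print(f"{n:>5} nodes: {p_any_failure(n, P_NODE):.1%}")
```

Under these assumptions the chance of at least one failure rises from about 1% on ten nodes to more than 60% on a thousand, even though each individual node is just as reliable as before.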

Scale as a Separate Challenge

Scaling up training is not just about “adding more GPUs.” As the cluster size grows, so does the complexity: more points of failure, more interdependencies, and more data to analyze. No human can monitor all of it by hand.

This is precisely where the agent-based approach shows its value. The system doesn't get tired, doesn't miss metrics, and doesn't lose its train of thought when switching between data sources. It follows the same logic for a failure on 10 nodes as it does for a failure on a thousand.
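
One way to picture "the same logic at any scale" is a check that never depends on the node count. The sketch below flags slow nodes with a modified z-score based on the median and the median absolute deviation (MAD). It is a generic statistical technique, not something the AMD post is known to use; the threshold of 3.5 and the synthetic step times are assumptions.

```python
from statistics import median


def stragglers(step_times, threshold=3.5):
    """Return nodes whose step time is a robust outlier on the slow side.

    Uses the modified z-score 0.6745 * (x - median) / MAD.
    """
    values = list(step_times.values())
    mid = median(values)
    mad = median(abs(v - mid) for v in values)
    if mad == 0:
        # More than half the nodes match the median exactly;
        # fall back to flagging anything slower than the median.
        return sorted(n for n, v in step_times.items() if v > mid)
    return sorted(
        n for n, v in step_times.items()
        if 0.6745 * (v - mid) / mad > threshold
    )


def make_cluster(n):
    """Synthetic step times (seconds) with one slow node; made-up data."""
    times = {f"node-{i}": 1.0 + 0.01 * (i % 5) for i in range(n)}
    times["node-7"] = 1.6
    return times


# The same call works unchanged for a small and a large cluster.
print(stragglers(make_cluster(10)))
print(stragglers(make_cluster(1000)))
```

Because the rule is relative to the cluster's own median, nothing has to be retuned as the cluster grows.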

What's Behind This

AMD demonstrates this approach on training runs that use MaxText and the Slurm job scheduler – standard tools in research and industrial clusters. But what's more important than the specific tech stack is the idea itself: using a language model not for text generation, but for solving operational tasks like diagnostics, analysis, and finding the root causes of failures.

This is part of a broader trend where LLMs are beginning to function not as chatbots, but as automation tools within an infrastructure. Such systems are called “agentic” precisely because they don't just answer a question, but independently go through several steps to achieve a goal.

Open Questions

Like any approach, this one has unresolved issues. How well does the agent cope with non-standard failures that weren't in its “experience”? How does it handle situations where data is insufficient or contradictory? To what extent can its conclusions be trusted without additional verification?

These questions remain open for now. Agent-based diagnostics is not a silver bullet, but a tool that must be used deliberately and with an understanding of its limitations. Still, as a first step toward automating the routine parts of an engineer's job, it points in a sensible direction.

Link to Original: https://rocm.blogs.amd.com/software-tools-optimization/maxtext-slurm-agentic-diagnosis/README.html

Original Title: Agentic Diagnosis for LLM Training at Scale – ROCm Blogs

Publication Date: Mar 9, 2026
