When a company or research group trains a large language model, it uses hundreds or even thousands of graphics processing units (GPUs) working in concert. This is a complex system, and like any complex system, it occasionally fails. Something goes wrong with the hardware, network, memory, or software. And that's when the real headache begins: figuring out what exactly happened.
Usually, an engineer handles this. They analyze logs, collect metrics, and form hypotheses. This is time-consuming and requires deep expertise. A team at AMD has proposed a different approach: handing the diagnostics over to an agentic system built on a language model.
Why Diagnostics Is a Separate Problem
Training a large model isn't just about “running a script and waiting.” It's a continuous process that can last for weeks. During this time, the system generates vast amounts of data: GPU load, temperature, network latency, training speed, and errors on nodes. If something goes wrong, the cause needs to be found quickly, or hours or even days of computation could be wasted.
The problem is that the data is voluminous, fragmented, and hard to interpret without context. An engineer encountering such an incident for the first time might spend several hours just figuring out where to start. Moreover, the culprit could be anything: a slow network node, a faulty GPU, a configuration error, or simply a temporary load spike.
A Single Point of Data Collection
At the core of the system is a unified time-series repository – a database where all metrics from the GPUs, host machines, network, and the training process itself are collected. Everything is written to disk, so data isn't lost even during failures. Simply put, if something breaks, the system has a complete picture of what was happening at the moment of failure and before it.
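To make the idea of a single store concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not AMD's implementation: the table layout, the metric names, and the helpers open_store, record, and window are all assumptions made for illustration. The example runs in memory; passing a file path instead would persist the samples to disk.

```python
import sqlite3

# Hypothetical schema: one row per (timestamp, node, metric) sample.
SCHEMA = """
CREATE TABLE IF NOT EXISTS samples (
    ts     REAL NOT NULL,   -- Unix timestamp in seconds
    node   TEXT NOT NULL,   -- host name, e.g. node-07
    metric TEXT NOT NULL,   -- e.g. gpu_util, step_time_s
    value  REAL NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_samples ON samples (metric, ts);
"""

def open_store(path=":memory:"):
    """Open (or create) the store. Pass a file path to persist to disk."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def record(conn, ts, node, metric, value):
    """Append one sample; the `with` block commits the transaction."""
    with conn:
        conn.execute(
            "INSERT INTO samples VALUES (?, ?, ?, ?)",
            (ts, node, metric, value),
        )

def window(conn, metric, start, end):
    """Return all samples of one metric in [start, end], ordered by time."""
    cur = conn.execute(
        "SELECT ts, node, value FROM samples "
        "WHERE metric = ? AND ts BETWEEN ? AND ? ORDER BY ts, node",
        (metric, start, end),
    )
    return cur.fetchall()

# Made-up samples from two nodes.
store = open_store()
record(store, 100.0, "node-01", "step_time_s", 1.9)
record(store, 100.0, "node-02", "step_time_s", 2.0)
record(store, 160.0, "node-02", "step_time_s", 3.4)

# Everything recorded for this metric around t = 100.
rows = window(store, "step_time_s", 90.0, 120.0)
```

Keeping every source in one table keyed by time, node, and metric is what lets a single query answer “what was happening around the moment of failure” across GPUs, hosts, and the network.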
This approach is useful in itself: instead of manually collecting data from various sources, an engineer gets a single snapshot of everything happening at the right moment in time.
The Agent as an Analyst
But the key idea here is not just to store data, but to analyze it automatically. For this, an LLM agent is connected to the repository: a language model that can not only answer questions but also purposefully investigate a problem.
How does this work in practice? The agent receives a description of the incident – for example, “training slowed down by 40% during a specific interval” – and starts querying the data. It formulates its own queries to the metrics database, looks at the results, refines its hypotheses, and moves on. This isn't a linear search through a checklist, but an iterative process: the agent narrows down the list of suspects step by step.
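The iterative narrowing described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the metrics are made-up numbers, the outlier rule (more than 1.5 times the group median) is an arbitrary choice, and choose_next_metric is a placeholder for the step where a real agent would ask a language model what to look at next. None of these names come from AMD's system.

```python
from statistics import median

# Hypothetical data: metric -> {node: value averaged over the incident window}.
METRICS = {
    "step_time_s":    {"node-01": 2.0, "node-02": 2.1, "node-03": 3.4, "node-04": 2.0},
    "gpu_temp_c":     {"node-01": 71.0, "node-02": 70.0, "node-03": 72.0, "node-04": 69.0},
    "net_latency_ms": {"node-01": 0.9, "node-02": 1.0, "node-03": 6.5, "node-04": 1.1},
}

def query(metric, nodes):
    """Stand-in for a query against the metrics store."""
    return {n: METRICS[metric][n] for n in nodes}

def outliers(values, factor=1.5):
    """Nodes whose value exceeds `factor` times the group median."""
    mid = median(values.values())
    return sorted(n for n, v in values.items() if v > factor * mid)

def choose_next_metric(remaining, evidence):
    """Placeholder for the language model's decision. A real agent would
    prompt an LLM with the evidence so far; here we take the next metric."""
    return remaining[0] if remaining else None

def investigate(nodes, metrics):
    """Query one metric at a time and narrow the list of suspect nodes."""
    suspects, evidence, remaining = list(nodes), [], list(metrics)
    while remaining:
        metric = choose_next_metric(remaining, evidence)
        remaining.remove(metric)
        flagged = [n for n in outliers(query(metric, nodes)) if n in suspects]
        if flagged:
            # Keep only nodes that are still suspicious under this metric.
            suspects = flagged
            evidence.append((metric, flagged))
    return suspects, evidence

suspects, evidence = investigate(
    ["node-01", "node-02", "node-03", "node-04"],
    ["step_time_s", "gpu_temp_c", "net_latency_ms"],
)
```

With this data the loop ends with node-03 as the only suspect, supported by slow steps and high network latency, while the temperature query adds nothing. The point of the sketch is the shape of the loop (query, inspect, narrow, repeat), not the specific heuristic.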
Ultimately, the system provides an explanation: what most likely happened, on which node or component, and what data confirms it.
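One way to picture such an explanation is as a small structured record rather than free-form text. The sketch below is an assumption about what a report might contain, based only on the three elements named above (likely cause, component, supporting data); the Diagnosis and Evidence classes and all values in the example are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    metric: str        # which metric the observation comes from
    observation: str   # what was seen in the data

@dataclass
class Diagnosis:
    likely_cause: str
    component: str
    evidence: list = field(default_factory=list)

    def render(self):
        """Format the diagnosis as plain text for a human reviewer."""
        lines = [
            f"Likely cause: {self.likely_cause}",
            f"Component: {self.component}",
            "Evidence:",
        ]
        lines += [f"  - {e.metric}: {e.observation}" for e in self.evidence]
        return "\n".join(lines)

# Illustrative values only, not taken from a real incident.
diagnosis = Diagnosis(
    likely_cause="degraded network link",
    component="node-03",
    evidence=[
        Evidence("net_latency_ms", "about 6x the cluster median"),
        Evidence("step_time_s", "slowest node in the window"),
    ],
)
report = diagnosis.render()
```

Keeping the evidence attached to the conclusion matters: it gives the engineer something concrete to verify instead of a verdict to take on trust.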
Not a Replacement for an Engineer, but Significant Help
It's important to understand that the agent doesn't make decisions or apply fixes on its own. It diagnoses and presents its findings to a human. But even this changes the situation dramatically: instead of spending several hours on data collection and initial analysis, an engineer receives a structured hypothesis and can immediately move on to verification and fixing the problem.
This is especially relevant for teams working with large clusters. The more nodes there are, the higher the probability of failure at any given moment – and the more costly every delay in diagnostics becomes.
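A quick calculation shows why cluster size matters so much. The sketch assumes, as a simplification, that nodes fail independently and with the same probability; the figure of 0.1% per node per day is invented for illustration and is not a measured value.

```python
def p_any_failure(p_node, n_nodes):
    """Probability that at least one of n_nodes fails, assuming each
    fails independently with probability p_node in the same period."""
    return 1.0 - (1.0 - p_node) ** n_nodes

# Hypothetical per-node daily failure probability of 0.1%.
small = p_any_failure(0.001, 10)     # roughly 1%
large = p_any_failure(0.001, 1000)   # roughly 63%
```

Under these assumptions, a failure that is rare on a 10-node cluster becomes more likely than not on any given day at 1,000 nodes. Real failures are often correlated (shared switches, power, or software), so the formula is a rough guide rather than a prediction.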
Scale as a Separate Challenge
Scaling up training is not just about “adding more GPUs.” As the cluster size grows, so does the complexity: more points of failure, more interdependencies, and more data to analyze. No human can keep track of all of it by hand.
This is precisely where the agent-based approach shows its value. The system doesn't get tired, doesn't miss metrics, and doesn't lose its train of thought when switching between data sources. It follows the same logic for a failure on 10 nodes as it does for a failure on a thousand.
What's Behind This
AMD demonstrates this approach on training runs that use MaxText and the Slurm workload manager – tools widely used in research and industrial clusters. But what's more important than the specific tech stack is the idea itself: using a language model not for text generation, but for solving operational tasks like diagnostics, analysis, and finding the root causes of failures.
This is part of a broader trend where LLMs are beginning to function not as chatbots, but as automation tools within an infrastructure. Such systems are called “agentic” precisely because they don't just answer a question, but independently go through several steps to achieve a goal.
Open Questions
Like any approach, this one has unresolved issues. How well does the agent cope with non-standard failures that weren't in its “experience”? How does it handle situations where data is insufficient or contradictory? To what extent can its conclusions be trusted without additional verification?
These questions remain open for now. Agent-based diagnostics is not a silver bullet, but a tool that must be used deliberately and with an understanding of its limitations. Still, as a first move toward automating the routine parts of an engineer's job, it's a step in a sensible direction.