Training language models through feedback, whether from humans or from other AI systems, has long been standard practice. This is how models learn to give helpful, safe, and accurate responses. Put simply: a model produces an output, receives feedback, and is updated to do slightly better. Repeat this thousands of times, and the result is an aligned, "well-behaved" model.
But now, the industry is entering a new phase. Models are increasingly working not as chatbots answering a single question, but as agents – systems that perform long chains of actions: they search for information, run tools, make intermediate decisions, and only then produce a result. And this is where the old training scheme starts to fail.
When a Single Step Becomes a Marathon
In a classic scenario, a model generates a response and immediately receives a signal: good or bad. Everything is quick and clear. In an agentic scenario, there can be dozens of steps between the first action and the final result. The model calls an external service, gets data, processes it, calls another service, processes it again – and only then does it become clear whether it has completed the task.
This changes the economics of training. It becomes significantly more expensive: the context of the entire chain has to be stored, and the whole reasoning path has to be evaluated, not just a single response. The computational load grows faster than linearly with the length of the chain, since each new step is processed together with the growing history before it. This means researchers need new, more efficient approaches that don't require vast resources for each training step.
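A rough back-of-the-envelope sketch shows why the load is not linear. It assumes, purely for illustration, that every step adds a fixed number of tokens and that the full history is re-read at each step with no caching; the function name and the numbers are hypothetical.

```python
def total_tokens_processed(num_steps: int, tokens_per_step: int) -> int:
    """Tokens read across an episode if every step re-reads the full history."""
    total = 0
    context = 0
    for _ in range(num_steps):
        context += tokens_per_step  # the history grows by one step's worth of tokens
        total += context            # the whole history is processed again
    return total


single_response = total_tokens_processed(1, 500)
agent_episode = total_tokens_processed(20, 500)
print(single_response, agent_episode, agent_episode / single_response)
```

Under these assumptions a 20-step episode reads 210 times as many tokens as a single response, not 20 times, which is why more efficient approaches are needed.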
This is precisely what the Salesforce AI Research team has tackled. They described how they are redesigning the model training process for an agentic reality – and the specific problems they had to solve.
Three Bottlenecks That Slow Down Agent Training
The researchers identified three key challenges that reinforcement learning faces in an agentic context.
First, context length. An agent operates with a long history of interactions. The longer the chain, the more information it needs to keep "in mind" at each training step. This directly affects memory usage and processing speed.
Second, the sparsity and delay of the reward signal. In typical tasks, the model gets feedback almost immediately. In agentic ones, the final result might only appear after many steps. This makes it hard to work out which specific actions led to success or failure, a difficulty known as the credit assignment problem. Imagine teaching someone to cook a dish but only telling them whether it was "tasty or not" after the guests have already left the table.
Third, the cost of a single training example. To train a model on a single agentic episode, you need to run the entire chain of actions, collect signals, and calculate gradients. This is significantly more expensive than training on a single response. At an industrial scale, such costs become a major limitation.
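The second bottleneck, the delayed reward, can be made concrete with a small sketch. It uses the textbook discounted return, not anything specific to the Salesforce work, and the episode below is made up: four steps with no feedback followed by one final success signal.

```python
def discounted_returns(rewards, gamma=0.99):
    """Propagate rewards backwards so each step sees the discounted future reward."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical episode: only the last step receives a reward.
rewards = [0.0, 0.0, 0.0, 0.0, 1.0]
print(discounted_returns(rewards, gamma=0.9))
```

Every earlier step receives some credit, but the amount depends only on its distance from the end, not on whether that step actually helped. That is the heart of the difficulty with a single late signal.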
What Salesforce Proposes
The team is working on several fronts simultaneously, trying to make agent training more practical – without sacrificing quality for speed or breaking the bank on computation.
One idea is to manage more intelligently which of the agent's steps are used in training. Not every intermediate step contributes equally to the feedback signal. If the most informative moments can be selected, the load can be reduced significantly without losing training quality.
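As a minimal sketch of what such selection could look like, the snippet below keeps only the steps with the largest absolute score. The scoring criterion is an assumption made for illustration (for example, an advantage estimate per step); the article does not say which criterion the team uses, and the function name is hypothetical.

```python
def select_informative_steps(scores, keep_fraction=0.5):
    """Return the indices (in episode order) of the steps with the largest |score|."""
    if not scores:
        return []
    keep = max(1, round(len(scores) * keep_fraction))
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return sorted(ranked[:keep])


# Hypothetical per-step scores for a six-step episode.
step_scores = [0.02, -0.90, 0.05, 0.60, -0.01, 0.40]
print(select_informative_steps(step_scores, keep_fraction=0.5))  # [1, 3, 5]
```

Only the selected steps would then contribute to the training update, so the cost per episode drops roughly in proportion to the fraction kept.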
Another direction concerns how the reward signal is formed and delivered. Instead of waiting for a signal at the very end of an agentic task, it's possible to construct intermediate evaluations – "checkpoints" of a sort – that give the model more frequent and more precise feedback at each stage of the journey.
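A toy version of the checkpoint idea might look like the sketch below. The `Step` record, the `checkpoint_passed` flag, and the bonus values are all hypothetical; in practice the flag would come from some intermediate evaluator, which the article does not describe.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str
    checkpoint_passed: bool  # hypothetical flag set by an intermediate evaluator


def dense_rewards(steps, final_success, checkpoint_bonus=0.2, final_reward=1.0):
    """Give a small bonus at each passed checkpoint, plus the final outcome reward."""
    rewards = [checkpoint_bonus if s.checkpoint_passed else 0.0 for s in steps]
    if steps and final_success:
        rewards[-1] += final_reward
    return rewards


episode = [
    Step("search_docs", True),
    Step("call_api", False),
    Step("parse_result", True),
    Step("write_answer", False),
]
print(dense_rewards(episode, final_success=True))  # [0.2, 0.0, 0.2, 1.0]
```

Compared with a single reward at the end, the model now gets a signal at steps one and three as well, so useful intermediate actions are easier to tell apart from neutral ones.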
In parallel, they are exploring how to better distribute computation across multiple agents or runs, so that more of the learning can happen concurrently without creating bottlenecks.
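The general pattern of running several episodes at once can be sketched with Python's standard thread pool. This is only an illustration of concurrent rollout collection, not the team's infrastructure: `run_episode` is a stand-in that simulates an outcome instead of calling a model or real tools.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def run_episode(seed: int) -> dict:
    """Stand-in for one agent rollout; a real one would call a model and tools."""
    rng = random.Random(seed)  # a per-episode RNG keeps runs reproducible
    num_steps = rng.randint(3, 8)
    success = rng.random() < 0.5
    return {"seed": seed, "steps": num_steps, "reward": 1.0 if success else 0.0}


def collect_rollouts(seeds, max_workers=4):
    # Threads suit rollouts that mostly wait on external services.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_episode, seeds))  # results keep the input order


batch = collect_rollouts(range(8))
print(len(batch), sum(ep["reward"] for ep in batch) / len(batch))
```

Because agent steps often spend their time waiting on external services, overlapping many episodes keeps the training hardware from sitting idle between them.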
All of this sounds like engineering optimization – and in a way, it is. But behind it lies a fundamental question: Can we even train agents on realistic tasks if we don't solve the efficiency problem? Without this, agentic AI risks remaining the domain of companies with unlimited computing budgets.
Why This Matters Beyond Salesforce
Agentic reinforcement learning is currently a hot topic across the industry. Major labs – from OpenAI to DeepMind – are all facing the same limitations one way or another. Agents based on language models are already being used in business process automation, coding, and research tasks. And the more complex the task, the longer the chain of actions, and thus the more acute the problem of efficient training becomes.
At the same time, safety hasn't taken a back seat. When an agent performs dozens of actions in a row, the cost of an error increases, because a single wrong decision early on can trigger a whole chain of consequences. This makes careful tuning of training signals not just a technical issue but a substantive one. Incidentally, this very problem – how to prevent an agent from "breaking something" in its pursuit of a result – is the subject of a separate field known in academia as Safe Reinforcement Learning. Its essence is to define constraints alongside the training objective: the agent must not only achieve the goal but do so within the bounds of acceptable behavior.
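One common way to express "goal plus constraints" in code is a Lagrangian penalty: the task return is reduced by a weighted measure of how far a safety cost exceeds its budget, and the weight grows while the budget is violated. The sketch below shows that general technique from the constrained RL literature; the article does not say that Salesforce uses it, and all names and numbers are hypothetical.

```python
def lagrangian_objective(task_return, safety_cost, cost_limit, multiplier):
    """Task return minus a penalty for exceeding the safety cost budget."""
    return task_return - multiplier * (safety_cost - cost_limit)


def update_multiplier(multiplier, safety_cost, cost_limit, step_size=0.1):
    """Dual ascent: raise the penalty weight while the constraint is violated."""
    return max(0.0, multiplier + step_size * (safety_cost - cost_limit))


# Hypothetical episode: high return, but the safety cost is over budget.
print(lagrangian_objective(10.0, safety_cost=3.0, cost_limit=1.0, multiplier=0.5))  # 9.0
print(update_multiplier(0.0, safety_cost=3.0, cost_limit=1.0))                      # 0.2
```

The multiplier never drops below zero, so an agent that stays within its budget is simply optimizing the original task return.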
The work by Salesforce AI Research is one of the public examples of how research teams are trying to make agentic training scalable. It's not a revolution, but it's an important step toward making AI agents practically applicable tools – not just impressive conference demos.
What Remains an Open Question
Despite the progress, there are still more questions than answers. How can we evaluate an agent's quality on tasks where there's no single "correct" answer? How can we ensure training stability when the external environment is unpredictable? How do we transfer approaches that work in lab conditions to real-world products?
These questions aren't unique to Salesforce – they face the entire industry. And the fact that major companies are starting to speak openly about their approaches to solving these problems is itself a signal: the agentic era is coming, and preparations for it are beginning in earnest.