LinkedIn shared its experience training its GPT-OSS model – a system that helps developers work with open source. In short: they applied a Reinforcement Learning (RL) method, but not the classic one; rather, a so-called "agentic" one. This means the model doesn't just learn to generate text, but performs a sequence of actions in a real environment – for example, it reads repositories, suggests edits, and tests code.
The team published a detailed breakdown of how all this worked in practice. And importantly – they talk not only about what worked but also about where difficulties arose. This is useful because reinforcement learning in real-world tasks remains a complex process, especially when it comes to code.
What Agentic RL Is and Why It's Needed Here
Usually, large language models are trained like this: they are shown a lot of text and then fine-tuned on labeled examples (supervised fine-tuning). This works, but the method has a limitation – the model learns from ready-made answers, not from what truly works in the environment.
Reinforcement learning allows the model to try different options, receive feedback (a reward or penalty), and improve based on the result. In the case of code, this is especially useful: you can run the code, check if it works, and adjust the model's behavior based on that.
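The "run the code, check if it works" feedback signal can be sketched as a tiny reward function. This is a minimal illustration, not LinkedIn's actual harness: it writes a candidate script to a temp file, executes it in a subprocess, and returns 1.0 on a clean exit, 0.0 on an error or timeout.

```python
import os
import subprocess
import sys
import tempfile


def run_candidate(code: str, timeout: float = 5.0) -> float:
    """Execute a candidate solution and return a scalar reward.

    Reward scheme (hypothetical, for illustration): 1.0 if the script
    runs and exits cleanly, 0.0 on any error or timeout.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout,
        )
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
    finally:
        os.unlink(path)
```

In practice the check would run an entire test suite rather than a single script, but the shape of the signal – code in, scalar reward out – is the same.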
In GPT-OSS, the model acts as an agent: it analyzes the repository, generates code or suggestions for changes, and then receives feedback – for example, whether the tests passed or the project built. This is the agentic approach: the model interacts with the environment rather than just predicting the next token.
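The observe → act → reward cycle described above can be made concrete with a toy stand-in for the environment. Everything here (the `RepoEnv` class, the "fix" heuristic) is illustrative and not part of LinkedIn's published system:

```python
from dataclasses import dataclass, field


@dataclass
class RepoEnv:
    """Toy stand-in for a repository environment (names are illustrative)."""
    tests_pass: bool = False
    history: list = field(default_factory=list)

    def observe(self) -> str:
        # A real environment would expose file contents, test logs, etc.
        return "tests passing" if self.tests_pass else "tests failing"

    def apply_patch(self, patch: str) -> float:
        # A real setup would run the project's test suite here;
        # this toy version passes whenever the patch mentions "fix".
        self.history.append(patch)
        self.tests_pass = "fix" in patch
        return 1.0 if self.tests_pass else 0.0


def agent_step(env: RepoEnv, propose) -> float:
    """One observe -> act -> reward cycle of the agentic loop."""
    observation = env.observe()
    patch = propose(observation)
    return env.apply_patch(patch)


reward = agent_step(RepoEnv(), propose=lambda obs: "fix: handle None input")
```

The key difference from next-token prediction is that the reward comes from the environment's reaction to the action, not from comparison with a reference answer.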
What They Encountered in Practice 🛠️
LinkedIn honestly describes the difficulties. One of the main ones is training instability. Reinforcement learning processes can behave unpredictably: the model might start yielding good results, then suddenly "forget" what it learned. This is related to how weight updates work: if the environment changes or the reward is formulated imprecisely, the model can stray from the right path.
Another problem is computational cost. Reinforcement learning requires running the model many times in a real environment. In the case of code, this means constantly running tests, building projects, and checking dependencies. All this takes time and resources. The team writes that they had to seriously optimize the infrastructure so that all this would work within reasonable timeframes.
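One common way to keep this cost manageable is to run environment rollouts concurrently instead of one at a time. The sketch below is a generic pattern, not LinkedIn's infrastructure; `evaluate_rollout` is a hypothetical placeholder for a test or build run:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def evaluate_rollout(candidate_id: int) -> float:
    """Stand-in for running tests or a build for one candidate (hypothetical)."""
    time.sleep(0.1)  # simulate an expensive test run
    return float(candidate_id % 2)  # fake pass/fail signal


def evaluate_batch(candidates, max_workers: int = 8) -> list:
    # Evaluating many candidates concurrently amortizes the per-rollout
    # latency, which is what keeps RL-for-code training within reasonable
    # wall-clock time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate_rollout, candidates))


rewards = evaluate_batch(range(8))
```

Real systems add sandboxing, caching of build artifacts, and dependency isolation on top of this, since each rollout may touch the same repository.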
The third point is the formulation of the reward function. It is necessary to clearly define what is considered "good" model behavior. If you simply check whether the code works, you can get solutions that are formally correct but useless or too simple. Therefore, LinkedIn added several criteria: code quality, readability, and compliance with the project style. But every new criterion complicates the training.
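Combining several criteria usually means blending them into a single scalar. The weights and criteria below are illustrative – LinkedIn's exact scheme is not public – but they show why each added criterion complicates tuning: every weight is another knob the training can exploit.

```python
def composite_reward(
    passes_tests: bool,
    lint_score: float,
    style_score: float,
    weights: tuple = (0.6, 0.2, 0.2),
) -> float:
    """Blend several criteria into one scalar reward.

    Hypothetical scheme: lint_score and style_score are assumed to be
    normalized to [0, 1], and correctness dominates via its weight.
    """
    w_test, w_lint, w_style = weights
    return (
        w_test * float(passes_tests)
        + w_lint * lint_score
        + w_style * style_score
    )
```

A risk worth noting: if the style terms are weighted too heavily, the policy can learn to produce pretty code that fails tests, which is one way imprecise rewards lead the model astray.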
What the Result Was
Despite the difficulties, there is a result. The model has learned to better understand the context of repositories, suggest more relevant changes, and break existing code less often. LinkedIn writes that agentic reinforcement learning yielded a noticeable improvement compared to the baseline model trained only on labeled data.
The effect is especially noticeable in situations where complex dependencies within the project need to be accounted for – for example, when changing one module affects others. The model learned to track this because it received feedback not only on the local section of code but on the entire project as a whole.
What Remains Open
The team honestly admits: the method works, but it doesn't yet scale easily. To apply it to a new type of task, you need to reconfigure the environment, reward function, and infrastructure from scratch. This is not a universal solution that you can just pick up and apply anywhere.
Another question is how to make training more stable. LinkedIn experimented with different approaches: changing update frequency, trying different algorithms (PPO, REINFORCE), and adding regularization. But there is no ideal recipe yet. Every project requires tuning.
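To make the algorithm names concrete: REINFORCE, the simplest of the policy-gradient methods mentioned, updates parameters in proportion to the log-probability gradient of sampled actions, weighted by reward minus a baseline. The baseline subtraction is one of the standard variance-reduction (stabilization) tricks. This is a textbook one-parameter sketch, not LinkedIn's training code:

```python
import math
import random


def reinforce_step(
    theta: float,
    episodes: int = 200,
    lr: float = 0.1,
    seed: int = 0,
) -> float:
    """One batch of REINFORCE updates for a 1-parameter Bernoulli policy.

    Toy environment: action 1 earns reward 1, action 0 earns 0.
    Subtracting the batch-mean reward as a baseline reduces gradient
    variance, which helps stabilize training.
    """
    rng = random.Random(seed)
    grads, rewards = [], []
    for _ in range(episodes):
        p = 1.0 / (1.0 + math.exp(-theta))  # sigmoid policy: P(action=1)
        action = 1 if rng.random() < p else 0
        reward = float(action)
        # d/dtheta log pi(action) for a Bernoulli(sigmoid(theta)) policy
        grads.append(action - p)
        rewards.append(reward)
    baseline = sum(rewards) / len(rewards)
    grad = sum(g * (r - b_ if (b_ := baseline) is not None else 0)
               for g, r in zip(grads, rewards)) / len(rewards)
    return theta + lr * grad
```

PPO builds on the same idea but clips the policy update to keep each step close to the previous policy, which is exactly the kind of stabilization the instability problems above call for.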
And finally, the question of interpretability (understandability) remains. When a model trains with reinforcement, it's not always clear why it made a specific decision. This is normal for RL, but in the case of code, one would like more transparency – especially if the model is to be used in production.
Why This Matters
LinkedIn's publication is interesting not for the results as such, but because they openly share their experience. Agentic reinforcement learning for code is not a new idea, but there are few real cases where it is applied on an industrial scale. And there are even fewer who are ready to tell what exactly doesn't work and why.
For those working with code generation or trying to teach models to interact with complex environments, this experience can be useful. It shows that reinforcement learning is not a magic pill, but with the right tuning, it can yield a result that is difficult to achieve by other means.
Overall, the LinkedIn material is a good example of how technical breakdowns should look: without unnecessary hype, but with specifics and honesty. If you work with models that must act in a real environment, it is worth studying their approach.