When a company that develops some of the most powerful AI systems in the world publishes an update to its internal safety policy, it's not just a corporate document. It's a signal about how the industry as a whole perceives the risks posed by increasingly powerful models.
Anthropic, the company behind Claude, recently released the third version of its Responsible Scaling Policy (RSP). The document governs the conditions under which the company continues to develop more powerful models – and when it must pause.
Simply put, the RSP is an internal set of rules that answers the question, “How do we know if we've gone too far?” Anthropic operates on the assumption that as AI systems' capabilities grow, they can become a source of serious risks – primarily in areas such as creating weapons of mass destruction or independently influencing critical systems.
The idea isn't to stop progress. The idea is for progress to advance alongside the development of safeguards, not outpace them. The RSP sets specific thresholds: if a model reaches a certain level of capability, the company must either implement corresponding protective measures or refrain from further development.
The first version of the policy appeared in 2023, and the second in 2024. The third version is the result of accumulated experience and a clearer understanding of where the boundaries lie.
At the core of the RSP is a system of so-called AI Safety Levels (ASL) – a scale that describes how dangerous a specific model's capabilities could be.
Two key levels are currently relevant:
- ASL-2 – The current level of most of Anthropic's existing models. Models at this level can provide information on sensitive topics, but not with enough detail or accuracy to serve as a real “accelerant” for those seeking to cause large-scale harm.
- ASL-3 – The next threshold. These are models capable of significantly assisting in the creation of weapons of mass destruction, or of demonstrating enough autonomy to act against their creators' interests.
If testing reveals that a new model is approaching ASL-3, strict security requirements come into force governing how the model's weights are stored, who has access to them, and how the model is deployed.
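To make the mechanism concrete, here is a minimal sketch of how such threshold gating might look in code. It is purely illustrative: the level names come from the RSP, but the scores, thresholds, and function names are invented for this example – the actual policy draws its lines with specific capability tests, not single numbers.

```python
from dataclasses import dataclass
from enum import Enum


class ASL(Enum):
    """AI Safety Levels from the RSP (only the two discussed above)."""
    ASL_2 = 2  # current level of most deployed models
    ASL_3 = 3  # meaningful WMD uplift or concerning autonomy


@dataclass
class EvalResults:
    """Hypothetical pre-deployment evaluation outcomes; names are invented."""
    wmd_uplift_score: float   # measured assistance on weapons-related tasks
    autonomy_score: float     # measured ability to act without oversight


# Invented numeric thresholds -- placeholders for the RSP's qualitative tests.
WMD_THRESHOLD = 0.5
AUTONOMY_THRESHOLD = 0.5


def classify(results: EvalResults) -> ASL:
    """Map evaluation results to a safety level."""
    if (results.wmd_uplift_score >= WMD_THRESHOLD
            or results.autonomy_score >= AUTONOMY_THRESHOLD):
        return ASL.ASL_3
    return ASL.ASL_2


def may_deploy(level: ASL, asl3_safeguards_in_place: bool) -> bool:
    """The core RSP rule: at ASL-3, deployment requires the matching
    safeguards (weight security, access controls) -- otherwise pause."""
    return level != ASL.ASL_3 or asl3_safeguards_in_place
```

The shape of the rule is the point, not the numbers: evaluations gate deployment, and at ASL-3 missing safeguards translate into a pause rather than a launch.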
Clearer Criteria for Transitioning Between Levels
One of the main changes is that Anthropic has clarified exactly how a model's capabilities are evaluated before it is assigned a certain level. Previously, the wording was more vague. Now, the company describes specific indicators: what exactly a model must or must not be able to do to fall into a particular category.
This is important because without clear criteria, any policy becomes a mere declaration of intent. Specificity makes it a working tool.
Autonomous Systems Brought into Separate Focus
The new version pays special attention to so-called agentic systems – AI that doesn't just operate in a “question-and-answer” mode but performs multi-step tasks, interacts with external tools, and makes decisions during its operation.
This reflects reality: agentic capabilities are developing rapidly, and the risks here are somewhat different from those of standard chat models. If a model can independently run code, manage files, or interact with services, the question of what it is doing and how controllable it is becomes fundamental.
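As a rough sketch of what “controllable” can mean in practice – hypothetical code with invented tool names, an invented model interface, and an invented approval rule, not a description of Anthropic's or anyone's actual agent stack – consider a loop that checks every tool call the model proposes:

```python
# A hypothetical agent control loop; everything here is illustrative.

ALLOWED_TOOLS = {"read_file", "search_web"}   # low-risk, allowed by default
NEEDS_APPROVAL = {"run_code", "write_file"}   # sensitive, gated behind a human


def human_approves(action: dict) -> bool:
    """Stub: in practice a person reviews the proposed action."""
    return input(f"Allow {action['tool']}? [y/N] ").strip().lower() == "y"


def execute(action: dict) -> str:
    """Stub: dispatch the tool call and return its result."""
    return f"ran {action['tool']}"


def run_agent(model, task: str, max_steps: int = 10) -> str:
    """Run a multi-step task, checking every tool call the model proposes."""
    history = [task]
    for _ in range(max_steps):
        action = model.next_action(history)   # e.g. {"tool": "...", "args": {...}}
        if action["tool"] == "finish":
            return action["args"]["answer"]
        if action["tool"] in NEEDS_APPROVAL and not human_approves(action):
            history.append(f"denied: {action['tool']}")       # human said no
        elif action["tool"] not in ALLOWED_TOOLS | NEEDS_APPROVAL:
            history.append(f"blocked: {action['tool']}")      # unknown tool: refuse
        else:
            history.append(execute(action))                   # run and log the result
    return "stopped: step limit reached"                      # hard cap on autonomy
```

The design choice worth noticing is the default: anything not explicitly allowed is refused, sensitive actions require a human in the loop, and a step limit caps how long the agent can run unattended.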
Cybersecurity Requirements
The third version of the RSP introduces explicit requirements related to cybersecurity for the first time. It specifies that ASL-3 and higher models must not actively assist in carrying out cyberattacks on critical infrastructure, and this prohibition is now enshrined as a separate requirement, not just a consequence of general principles.
Additionally, standards are introduced for protecting the model's weights themselves – its “core”: the learned parameters that determine how it behaves. A leak of a powerful model's weights is a distinct risk vector that the company now explicitly regulates.
Independent Auditing
Perhaps one of the most significant changes in version 3.0 is the emphasis on external verification. Anthropic has stated its intention to engage independent auditors to verify whether the RSP requirements are being met in practice.
This is an attempt to move away from a situation where the company audits itself. Self-regulation is better than nothing, but independent auditing brings a different level of trust. This is especially true as the industry increasingly debates the need for external regulatory mechanisms to oversee the development of powerful AI systems.
Anthropic's RSP is one of the few publicly available, detailed documents of its kind in the industry. The company is deliberately making it public as part of a broader strategy: to show that responsible AI development is not just talk, but a set of concrete commitments with measurable parameters.
Of course, this approach has its limitations. The policy remains an internal company document – it has no legal force in the traditional sense, and compliance rests primarily on reputational and ethical incentives. Anthropic itself decides when and how to update the RSP, and it alone determines whether a model has passed the threshold tests.
The independent auditing mentioned in version 3.0 is a step towards greater transparency. But the question of what comprehensive external oversight for the development of powerful AI systems should look like remains open – and not just for Anthropic.
The RSP didn't appear in a vacuum. It's a response to a real sentiment in the industry: models are becoming more powerful faster than safety practices for working with them can be established.
Anthropic positions itself as a company that is aware of the risks of its own work – and continues that work precisely for this reason, on the rationale that it's better for powerful systems to be created by those who think about safety than by those who don't.
This is a controversial position, and it contains an internal contradiction that the company itself acknowledges. But the RSP is one of the tools Anthropic is using to try to resolve this contradiction: not by halting development, but by building concrete barriers against what could become truly dangerous.
Version 3.0 is not the final word. The document itself assumes that the policy will continue to be updated as new capabilities and new knowledge about risks emerge. This is perhaps one of the most honest things written in it: the admission that no one has the final answers yet, only more or less well-thought-out approaches to finding them.