When a company trains a powerful language model, it invests enormous resources: computing power, data, and expert time. A competitor, however, can achieve a similar result at minimal expense by simply asking the original model a vast number of questions and training a new model on its answers. This is what is known as a distillation attack.
Anthropic, the company developing the AI assistant Claude, has studied this threat and shared what is already being done and what still needs to be done to combat it.
What Is Distillation and Why Is It a Problem?
Distillation in itself is a perfectly legitimate technique in machine learning. Simply put, it's when a large, intelligent model "teaches" a smaller one: the smaller model observes the larger one's answers and learns to reproduce them. This allows for the creation of a compact model that behaves almost like the large one but requires fewer resources to run.
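The teacher-student mechanic can be shown in miniature. This is a toy sketch, not any real model: the "teacher" here is just a fixed linear classifier, and all sizes, step counts, and learning rates are illustrative assumptions. The student never sees the teacher's weights, only its answers.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical "teacher": a fixed linear classifier over 3 classes.
W_teacher = rng.normal(size=(4, 3))

def teacher(x):
    return softmax(x @ W_teacher)

# Step 1: query the teacher on many inputs and record its soft answers.
X = rng.normal(size=(2000, 4))
soft_labels = teacher(X)

# Step 2: fit a student by gradient descent on cross-entropy against
# the teacher's soft labels -- the classic distillation loss.
W_student = np.zeros((4, 3))
for _ in range(500):
    probs = softmax(X @ W_student)
    grad = X.T @ (probs - soft_labels) / len(X)
    W_student -= 0.5 * grad

# The student now imitates the teacher on inputs it has never seen.
X_test = rng.normal(size=(500, 4))
agreement = (teacher(X_test).argmax(axis=1)
             == softmax(X_test @ W_student).argmax(axis=1)).mean()
print(f"test-set agreement: {agreement:.1%}")
```

The same loop is what makes the attack cheap: step 1 is just API calls, and step 2 costs far less than training the teacher from scratch.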
The problem arises when this is done without permission: when someone intentionally "feeds" another's model thousands or millions of queries to collect data and train their own system on it. This is a distillation attack.
This approach violates the terms of service of most AI services. However, it's not just about the legal side. If models can be replicated this way, it undermines the economics of AI development: why invest resources in research if the result can be copied for pennies?
It's Already Happening
One of the most discussed examples is the DeepSeek R1 model, which, according to available information, may have been partially trained using output from other models, including OpenAI's. OpenAI subsequently announced that it had detected suspicious activity and was investigating the incident.
This isn't a hypothetical threat – it's already a real-world practice, and the industry is just beginning to develop countermeasures.
How Can It Be Detected?
Anthropic describes several lines of work for detecting distillation attacks.
First is the analysis of query patterns. When someone systematically tries to "siphon" knowledge from a model, it looks different from normal usage. The queries might be unnaturally uniform, cover an overly broad range of topics, or repeat certain structures. This can be tracked.
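One crude way to measure "unnaturally uniform" is lexical overlap between queries: scripted harvesting often fills in a fixed template, while organic queries share little vocabulary. A minimal sketch, with the threshold an invented placeholder rather than anything a real provider uses:

```python
from itertools import combinations

def avg_jaccard(queries):
    """Mean pairwise Jaccard similarity of token sets.
    Templated, systematically generated queries tend to score high."""
    sets = [set(q.lower().split()) for q in queries]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

def looks_templated(queries, threshold=0.5):
    # 0.5 is an illustrative guess, not a production value.
    return avg_jaccard(queries) >= threshold

organic = [
    "how do I bake sourdough bread",
    "python list comprehension examples",
    "what's the capital of Peru",
]
scripted = [
    "explain the concept of entropy in detail",
    "explain the concept of gravity in detail",
    "explain the concept of inflation in detail",
]
print(looks_templated(organic), looks_templated(scripted))  # → False True
```

Real systems would combine many such signals; no single heuristic is decisive on its own.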
Second are so-called "watermarks". The idea is to embed signals into the model's responses that are imperceptible to humans but algorithmically detectable. If a competing model with similar behavior is later discovered, it can be checked for traces of these signals. This is technically challenging and not yet an industry standard, but research is actively underway.
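One scheme from the research literature is a "green-list" bias: a secret key pseudorandomly splits the vocabulary at each step, the generator slightly prefers "green" tokens, and a detector holding the key counts how many tokens landed green. The sketch below is deliberately simplified; the key, tokenization, and sampling are all stand-ins for illustration, not any vendor's actual scheme.

```python
import hashlib
import random

def is_green(prev_token, token, key="demo-secret"):
    """Pseudorandomly assign each token to a 'green' or 'red' half
    of the vocabulary, seeded by the previous token and a secret key."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return digest[0] % 2 == 0

def green_fraction(tokens, key="demo-secret"):
    """Detector: fraction of tokens falling in the green half.
    Unwatermarked text sits near 0.5; watermarked text runs high."""
    pairs = list(zip(tokens, tokens[1:]))
    return sum(is_green(p, t, key) for p, t in pairs) / len(pairs)

# Stand-in for a model's sampler: prefer green tokens when generating.
random.seed(0)
vocab = [f"tok{i}" for i in range(200)]
plain = [random.choice(vocab) for _ in range(400)]  # no watermark
marked = [random.choice(vocab)]
for _ in range(400):
    candidates = random.sample(vocab, 8)
    greens = [c for c in candidates if is_green(marked[-1], c)]
    marked.append(greens[0] if greens else candidates[0])

print(f"plain: {green_fraction(plain):.2f}  marked: {green_fraction(marked):.2f}")
```

If a competitor trains on watermarked outputs, some of this statistical bias can survive into the copy, which is what makes the signal useful as evidence, though paraphrasing and mixing data sources weaken it.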
Third is the detection of anomalous behavior at the API level. If a single account or source generates an unusually high volume of queries with targeted topic coverage, it is grounds for additional review.
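The volume-plus-breadth idea can be expressed as a simple log filter. Everything here is hypothetical: the field names, the limits, and the account IDs are invented for illustration, and real providers would use far richer signals.

```python
from collections import defaultdict

def flag_suspicious_accounts(events, volume_limit=10_000, topic_limit=50):
    """events: iterable of (account_id, topic) pairs from API logs.
    Flags accounts that combine unusually high query volume with
    unusually broad topic coverage. Both limits are illustrative
    placeholders, not real operational thresholds."""
    volume = defaultdict(int)
    topics = defaultdict(set)
    for account, topic in events:
        volume[account] += 1
        topics[account].add(topic)
    return {acct for acct, n in volume.items()
            if n > volume_limit and len(topics[acct]) > topic_limit}

# A scripted harvester hits every topic at high volume; a heavy but
# legitimate user stays within one domain.
harvester = [("acct_a", f"topic_{i % 100}") for i in range(20_000)]
power_user = [("acct_b", f"topic_{i % 5}") for i in range(20_000)]
print(flag_suspicious_accounts(harvester + power_user))  # → {'acct_a'}
```

Note that the heavy single-domain user is not flagged: requiring both conditions is what keeps intensive-but-legitimate usage out of the net.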
How Can It Be Prevented?
Detection is one thing, but it's more important to prevent the attack itself or, at the very least, make it significantly more difficult.
One approach involves restrictions at the policy and monitoring level: usage limits and prohibitions written into the terms of service, backed by monitoring and enforcement. This is not a technical solution, but it establishes a legal and procedural framework for response.
Another approach is to intentionally alter responses when the system suspects automated data collection. This doesn't mean the model starts lying to users; it's about providing less "distillable" answers in suspicious contexts. This is a fine line to walk, because any degradation in quality also affects legitimate users.
Finally, collaboration between companies plays a crucial role. If multiple AI developers share information on attack patterns, it enables them to more quickly identify and block malicious actors – even if those actors switch from one service to another.
There's No Perfect Solution Yet
Anthropic frankly admits that no foolproof method exists to defend against distillation attacks. It's a cat-and-mouse game where one side devises protection methods and the other finds ways to bypass them.
Part of the problem lies in the very nature of language models: they are designed to be helpful and provide high-quality answers. Any limitation that reduces a model's "distillability" also potentially reduces its utility.
Another open question is the boundary between legitimate distillation and an attack. Researchers, developers, and students might all use models intensively and systematically without any malicious intent. Overly aggressive protective measures risk penalizing these very users.
Nevertheless, the very fact that major players like Anthropic have started to publicly discuss this threat and outline specific approaches to addressing it is a clear sign the industry is taking the problem seriously. This is not just a technical task, but a question of the sustainability of the entire AI development economy.