Imagine this: an unknown number calls, and you hear the voice of a loved one pleading for help. Your heart stops, but a minute later you find out it was an artificially generated voice. Welcome to the age of audio deepfakes, where speech synthesis technology is developing so fast that it's becoming increasingly difficult to distinguish a fake from the real thing.
As an engineer working on this problem, I see it from a specific perspective. It’s not just a technical challenge – it's a race between those who create increasingly convincing fakes and us, who develop detection systems. And in this race, the winner is the one who better understands the nature of sound and human speech.
Why Traditional Methods No Longer Work
Until recently, systems for detecting fake audio were trained on carefully curated datasets – like students preparing for an exam, knowing all the possible questions in advance. They performed excellently in laboratory conditions but failed in the real world.
The problem is that fraudsters don't limit themselves to one language or one technology. They use dozens of different speech synthesis systems, apply compression, add noise, and transmit audio via messengers – in short, they do everything to confuse detectors. And our systems, trained only on English and clean recordings, turned out to be useless.
This is why we decided to fundamentally change our approach. Instead of teaching a computer to recognize one type of fake, we created a real «school of hard knocks» for it – a multilingual database with the most diverse types of synthetic speech.
What is the SAFE Challenge and Why It’s Important
The SAFE (Synthetic Audio Forensics Evaluation) Challenge is like the Olympics for audio deepfake detection systems. But unlike a school exam, the test material isn't known in advance. The organizers created three tasks, each simulating a real-world scenario:
Task 1 tests whether a system can recognize a «pristine» fake – audio that has just come out of a generator, without any additional processing. This is the basic level, but it’s not as simple as it seems.
Task 2 complicates the situation: the fake audio is compressed, re-compressed, and distorted – exactly as it happens when transmitted over the internet or through messengers. This tests the system's resilience to real-world conditions.
Task 3 is the toughest challenge. This is «laundered» audio, specially processed to deceive detection systems. Fraudsters play recordings through car speakers, add reverberation, and use various acoustic tricks. It's a true nightmare for detectors.
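The article doesn't describe the exact processing pipelines behind these tasks, but the basic building block of such degradations — mixing noise into a recording at a controlled signal-to-noise ratio — can be sketched in a few lines. Everything here (the function name, the 16 kHz toy waveform, the 10 dB target) is an illustrative assumption, not the challenge's actual tooling:

```python
import math
import random

def add_noise_at_snr(signal, noise, snr_db):
    """Mix noise into a signal at a target signal-to-noise ratio (dB)."""
    sig_power = sum(s * s for s in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Scale the noise so that sig_power / scaled_noise_power == 10^(snr_db/10)
    target_noise_power = sig_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(signal, noise)]

# Toy example: one second of a 440 Hz tone at 16 kHz, degraded at 10 dB SNR
random.seed(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [random.uniform(-1.0, 1.0) for _ in range(16000)]
noisy = add_noise_at_snr(clean, noise, snr_db=10.0)
```

Real «laundering» chains stack several such steps — codec compression, reverberation, playback and re-recording — but each is, at heart, a controlled distortion like this one.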
Our Experiment: Four Iterations to Success
We approached the problem methodically, as engineers should. We conducted four sequential experiments, each one improving on the last.
Iteration 1: A Modest Beginning
We started with the standard ASVspoof 2019 LA dataset – 25,380 samples in English. This is like teaching a child to recognize animals by showing them only pictures of dogs. The result was predictable: the system worked decently in controlled conditions but quickly got lost when it encountered something new. The accuracy was only 53% for pristine audio and 49% for laundered – virtually a coin toss.
Iteration 2: The Multilingual Revolution
This is where we made a qualitative leap. We added recordings in eight other languages to the English data – from German to Hindi. We used five different sources:
- M-AILABS: 20,000 real human voices in different languages
- MLAAD: 47,200 synthetic samples created by 91 different systems
- CodecFake: specialized fakes based on neural codecs
- Famous Figures: voices of well-known individuals – this is especially important given recent incidents with fake recordings of politicians
- SpoofCeleb: data from real, noisy conditions
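The article names the corpora but not the mechanics of combining them. One simple way to build such a mixed training set is to draw a fixed quota from each source and shuffle the union; the helper below, and its file names and quotas, are invented purely for illustration:

```python
import random

def build_training_mix(sources, quotas, seed=0):
    """Draw a fixed number of samples from each corpus and shuffle the union."""
    rng = random.Random(seed)
    mix = []
    for name, items in sources.items():
        take = min(quotas.get(name, 0), len(items))
        mix.extend((name, item) for item in rng.sample(items, take))
    rng.shuffle(mix)  # interleave sources so no batch is single-corpus
    return mix

# Toy stand-ins for the real corpora (file IDs and quotas are invented)
sources = {
    "MLAAD": [f"mlaad_{i}.wav" for i in range(100)],
    "SpoofCeleb": [f"sc_{i}.wav" for i in range(100)],
    "CodecFake": [f"cf_{i}.wav" for i in range(100)],
}
quotas = {"MLAAD": 50, "SpoofCeleb": 80, "CodecFake": 40}
mix = build_training_mix(sources, quotas)  # 170 (source, file) pairs
```

The per-source quotas are exactly the knob the later iterations turn: rebalancing the «cocktail» without touching the model itself.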
The results exceeded our expectations – accuracy for pristine audio jumped to 74.5%. It was as if a student who had only studied math suddenly mastered physics, chemistry, and biology. The knowledge became deeper and more universal.
Iteration 3: Time Matters
In the third experiment, we focused on the time factor. We increased the duration of the analyzed audio fragments from 4 to 12 seconds. Why is this important? Some synthesis artifacts don't appear immediately but accumulate over time – like a person's voice getting tired during a long conversation.
The results were immediate: the accuracy of detecting processed audio increased by 30% – to 76.5%. It turned out that a longer context really helps «catch» complex processing artifacts.
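In practice, detection models usually consume fixed-length inputs, so moving from 4 to 12 seconds means cropping long recordings and padding short ones. The sketch below assumes a 16 kHz sample rate and repeat-padding; both are illustrative assumptions, not the authors' exact pipeline:

```python
def to_fixed_length(waveform, target_sec=12.0, sample_rate=16000):
    """Crop or repeat-pad a waveform to a fixed duration for the detector."""
    target_len = int(target_sec * sample_rate)
    if len(waveform) >= target_len:
        return waveform[:target_len]           # crop long recordings
    # Repeat short recordings until they fill the window, then trim
    repeats = -(-target_len // len(waveform))  # ceiling division
    return (waveform * repeats)[:target_len]

short = [0.1, -0.2, 0.3]  # toy 3-sample "recording"
fixed = to_fixed_length(short, target_sec=12.0, sample_rate=16000)
assert len(fixed) == 192000  # 12 s at 16 kHz
```

One design caveat with repeat-padding: it tiles the same artifacts, so the extra context only genuinely helps when the source recording is itself long enough.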
Iteration 4: Strategic Optimization
In the final iteration, we carefully balanced our data «cocktail». We added 100,000 samples from SpoofCeleb – a dataset with real, noisy recording conditions. We increased the number of multilingual examples to 60,000. We paid special attention to political figures – in our era, this is critically important.
The final system was trained on 200,000 samples from nine languages, created by more than 70 different synthesis systems. It’s like assembling a team of experts from all over the world – each bringing their unique experience.
The Technical Details: How It Works Internally
Our system is built on the principle that «two brains are better than one.» The first component consists of self-supervised neural networks, WavLM and MAE-AST, which learned to understand the structure of sound by analyzing vast amounts of unannotated audio. They act like experienced sound engineers who can hear the slightest unnaturalness in a recording.
The second component is a specialized AASIST network, which analyzes the spectro-temporal characteristics of the sound. If WavLM is the «ear», then AASIST is the «brain» that makes the final decision.
Interestingly, different models «hear» different types of fakes. WavLM is great at catching modern, high-tech synthesizers, while MAE-AST handles laundered audio better – apparently, its «trained ear» is less sensitive to acoustic distortions.
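The article doesn't specify how the components' opinions are combined. A common approach in such ensembles is score-level fusion: averaging each subsystem's «this is fake» probability, optionally with weights. The scores and equal weights below are invented for illustration:

```python
def fuse_scores(scores, weights=None):
    """Weighted average of per-subsystem 'fake' probabilities."""
    if weights is None:
        weights = [1.0 / len(scores)] * len(scores)
    return sum(w * s for w, s in zip(weights, scores))

# Hypothetical per-subsystem scores for one clip (probability it is fake)
wavlm_score = 0.92    # hypothetical: strong on modern neural synthesizers
mae_ast_score = 0.55  # hypothetical: more robust to «laundered» audio
fused = fuse_scores([wavlm_score, mae_ast_score])
is_fake = fused >= 0.5
```

The appeal of fusion is exactly the complementarity described above: when one model's «ear» is fooled by acoustic distortion, the other's score can still pull the combined decision to the right side of the threshold.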
Results That Speak for Themselves
Our four iterations yielded impressive results. The detection accuracy for pristine synthetic audio grew from 53% to 81% – a relative improvement of more than 50%. For processed audio, the result was even better – 82%.
In the international SAFE Challenge, we took second place in two of the three categories among teams from all over the world. But an even more important test was the «In-The-Wild» dataset, which contains real deepfakes from social networks. Here the gain was even larger: the error rate dropped more than fourfold, from 35.6% to 8.4%.
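Error figures like 35.6% and 8.4% in this field are typically reported as the equal error rate (EER) – the operating point where false alarms on real audio balance misses on fakes. Whether the authors used exactly this metric is an assumption; below is a minimal sketch with toy scores invented for illustration:

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """EER: threshold where the false-alarm rate on genuine audio
    equals the miss rate on spoofed audio. Higher score = more likely fake."""
    best = (1.0, 1.0)  # (rate gap, eer candidate)
    for t in sorted(set(genuine_scores + spoof_scores)):
        # genuine clip flagged as fake -> false alarm
        far = sum(s >= t for s in genuine_scores) / len(genuine_scores)
        # spoofed clip passed as genuine -> miss
        frr = sum(s < t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]

genuine = [0.10, 0.20, 0.30, 0.80]  # toy scores for real recordings
spoofed = [0.40, 0.70, 0.90, 0.95]  # toy scores for synthetic recordings
eer = equal_error_rate(genuine, spoofed)  # 0.25 on this toy data
```

Reporting EER rather than raw accuracy matters because it is independent of the threshold a deployed system happens to pick.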
Where the System Fails
Honesty requires us to admit: no system is perfect. Our analysis showed several problem areas:
High-quality audio from unexpected sources sometimes stumped the system. For example, Japanese audiobooks or low-quality Russian recordings were often mistakenly flagged as fakes. This is similar to a person who has only heard their native language all their life and suddenly encounters an unfamiliar accent.
«Laundered» audio remains the biggest challenge. Fraudsters have learned to process fakes in very sophisticated ways – playing them through car speakers, adding road noise and reverberation. In such conditions, even our advanced system performs barely better than chance, at around 50% accuracy.
Some modern generators, especially Cartesia and Metavox, create fakes of such high quality that they are difficult to distinguish even for our system. This shows that the arms race between deepfake creators and detectors continues.
Practical Conclusions for the Real World
What do these results mean for ordinary people and companies?
First, multilingual data is critically important. If you are developing a defense system, don't limit yourself to one language. Fraudsters certainly don’t.
Second, the length of the audio fragment matters. Short recordings are harder to analyze because synthesis artifacts need time to accumulate. If you can choose between a 3-second and a 10-second fragment, choose the longer one.
Third, be especially careful with audio that has undergone multiple stages of processing. «Laundered» deepfakes are a serious threat that requires additional verification by other methods.
Fourth, technology evolves quickly. A system that worked great six months ago might be outdated today. You need to constantly update your training data and retrain your models.
Looking to the Future
Our research has shown that a properly built system can achieve high accuracy in detecting audio deepfakes, but absolute protection does not exist. It is a constant race between offense and defense.
The next steps for development that I foresee are:
- Specialization by attack type: different systems for different use cases. A detector for bank calls will be configured differently than a system for verifying recordings in court.
- Integration with other modalities: combining audio analysis with video, text, and metadata. A fake often gives itself away not only by its sound but also by its context.
- Adaptive learning: systems that learn on the fly, automatically adapting to new types of attacks.
- Better «laundering» methods: yes, this sounds paradoxical, but to defend against laundered deepfakes, you need to better understand how they are created.
Conclusion
The fight against audio deepfakes is not just a technical challenge but also a matter of trust in the digital age. Our research showed that a multilingual approach and a variety of training data provide a significant advantage in this fight.
We have achieved good results – second place in an international competition and a four-fold reduction in errors on real-world data. But this is just the beginning. Speech synthesis technologies are evolving, and our defense methods must evolve with them.
Ultimately, the goal is not to create absolute protection – that's impossible. The goal is to make the creation of convincing fakes so difficult and expensive that most malicious actors will choose to abandon the idea. And judging by our results, we are moving in the right direction.
As they say in my profession: the best system is one that works not only in the lab but also in the real world. And our system has already proven its effectiveness in the most challenging conditions.