March was a busy month for the Sony AI team. Researchers made progress on several fronts: a book explaining the mathematical foundations of generative models was released, more than ten papers were accepted at a key conference on audio and speech processing, and the head of AI ethics was named to a list of the one hundred most influential women in AI.
Below is more detail about each of these events.
The Book That Needed to Be Written
Diffusion models are one of the key tools in modern content generation. They form the basis of systems that create images, audio, and much more from text descriptions. However, despite their widespread use, navigating this field can be difficult: different research communities have arrived at similar ideas via their own paths, leading to an accumulation of overlapping terminology and competing formulations.
The book The Principles of Diffusion Models is an attempt to bring order to this field. It was written by Sony AI researcher Chieh-Hsin "Jesse" Lai in collaboration with Yang Song, Dongjun Kim, and Stefano Ermon. The authors demonstrate that behind various approaches – such as DDPMs, score-based models, and flow-based methods – there is a unified mathematical logic. Simply put, they are not several different technologies, but different ways of describing the same underlying concept.
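To give a flavor of that unification: in the score-based view that is standard in the literature (a sketch of the general idea, not a summary of the book's own notation), all three families can be read off a single stochastic differential equation.

```latex
\text{Forward (noising) SDE:}\quad
  dx_t = f(x_t, t)\,dt + g(t)\,dw_t
\qquad
\text{Reverse (generating) SDE:}\quad
  dx_t = \left[f(x_t, t) - g(t)^2\,\nabla_x \log p_t(x_t)\right]dt + g(t)\,d\bar{w}_t
\qquad
\text{Probability-flow ODE:}\quad
  \frac{dx_t}{dt} = f(x_t, t) - \tfrac{1}{2}\,g(t)^2\,\nabla_x \log p_t(x_t)
```

DDPMs correspond to a particular discretization of the reverse SDE, score-based models learn the score term \(\nabla_x \log p_t\) directly, and the deterministic probability-flow ODE is the bridge to flow-based methods.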
In an interview, Jesse explained that he wants readers to be able to navigate the field after reading the book, not just reproduce specific techniques. He believes that the fundamental ideas in this area have a longer lifespan than the specific methods built upon them.
Over Ten Papers at ICASSP 2026
The IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) is one of the premier venues for research in audio, speech, and signal processing. This year, the conference will take place from May 4–8, 2026, in Barcelona, and Sony AI will be presenting an impressive slate of accepted papers.
The accepted papers span a broad range of topics, from music analysis to speech recognition.
How Models "Hear" Musical Structure
One paper investigates how well pre-trained audio models can analyze musical structure – for example, whether they can distinguish between a verse and a chorus. The findings show that self-supervised learning on music data using so-called masked language modeling is particularly effective for this task.
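The masked-modeling objective mentioned here is, in outline, BERT-style masking applied to audio frames: hide random time frames and train the model to reconstruct them, scoring the loss only at the hidden positions. A minimal illustrative sketch of that setup (not the paper's actual pipeline; the "model" is a placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_frames(frames, mask_ratio=0.3):
    """Randomly mask a fraction of time frames (masked-language-modeling style)."""
    n_frames = frames.shape[0]
    n_mask = int(n_frames * mask_ratio)
    idx = rng.choice(n_frames, size=n_mask, replace=False)
    corrupted = frames.copy()
    corrupted[idx] = 0.0  # replace masked frames with a zero "mask token"
    return corrupted, idx

# Toy "spectrogram": 100 time frames x 64 frequency bins.
frames = rng.standard_normal((100, 64))
corrupted, masked_idx = mask_frames(frames)

# A real model would predict the masked frames from context; here the
# prediction is just the corrupted input. The key point is that the
# training loss is computed only at the masked positions:
predictions = corrupted  # placeholder for a model's output
loss = np.mean((predictions[masked_idx] - frames[masked_idx]) ** 2)
```

Representations learned this way encode enough long-range context to tell, for example, a verse apart from a chorus.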
Sound and Picture – Together and in the Right Place
Another paper addresses a problem that is easy to overlook: in systems that generate audio and video simultaneously, the sound and image are often not spatially aligned. The researchers proposed a new method for measuring this discrepancy and created a dedicated benchmark, SAVGBench.
Blind Data Cleaning
The quality of training data is critical for any model. In music separation – the task of splitting a mixed audio track into its individual sources – datasets often contain hidden artifacts. The authors have proposed two cleaning methods that work without knowing the specific type of contamination present in the data.
Separating Sounds Using Video or Text
MMAudioSep is a generative model capable of extracting a desired sound from a mix, guided by either a video or a text description. It is based on a pre-trained video-to-audio generation model that has been adapted for this new task.
Real-Time Foley
FlashFoley is the first open-source, accelerated model for sketch-to-audio generation. In filmmaking, Foley refers to sound effects (like footsteps, a creaking door, or rain) that are artificially created and added to a film in post-production. FlashFoley enables this process to be done interactively and in real time.
Finding Samples in Music
Another paper tackles the task of automatically identifying whether a track contains a sample from another piece of music and, if so, which one. The approach is based on self-supervised learning and, according to the authors, significantly surpasses previous methods.
Automatic Mixing
MEGAMI (Multitrack Embedding Generative Auto Mixing) is a generative framework for the automatic mixing of multitrack music. Unlike deterministic methods, it takes into account the subjectivity of creative choices: professional sound engineers can mix the same recording in various ways, and this is perfectly normal.
Controllable Drums
Break-the-Beat! is a tool for rendering drum parts (MIDI) using the timbre from a reference audio file. In simpler terms, you can set a drum pattern and specify which "sound" to perform it with – for example, a sound taken from a specific recording.
A Benchmark for Evaluating Foley Models
FoleyBench is the first large-scale benchmark designed specifically for evaluating models that generate Foley-style sounds from video. It contains 5,000 video-audio-text triplets, providing broad coverage of typical Foley sounds.
Synchronizing Lyrics and Audio
WEALY is a pipeline for aligning lyrics with an audio recording. It utilizes embeddings from the Whisper model, and the approach is designed to be reproducible: the authors have intentionally provided transparent and open baseline comparisons.
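Under the hood, lyrics alignment generally reduces to finding a monotonic path that matches text tokens to audio frames. A generic dynamic-time-warping sketch of that idea (illustrative only, not WEALY's actual algorithm; the cost matrix here is random, where a real system would use distances between Whisper-derived audio embeddings and lyric-token embeddings):

```python
import numpy as np

def dtw_path(cost):
    """Monotonic alignment path minimizing cumulative cost (classic DTW)."""
    n_frames, n_tokens = cost.shape
    acc = np.full((n_frames, n_tokens), np.inf)
    acc[0, 0] = cost[0, 0]
    for t in range(n_frames):
        for u in range(n_tokens):
            if t == 0 and u == 0:
                continue
            best = min(
                acc[t - 1, u] if t > 0 else np.inf,          # stay on token
                acc[t, u - 1] if u > 0 else np.inf,          # advance token
                acc[t - 1, u - 1] if t > 0 and u > 0 else np.inf,
            )
            acc[t, u] = cost[t, u] + best
    # Backtrack from the end to recover the alignment path.
    path = [(n_frames - 1, n_tokens - 1)]
    t, u = n_frames - 1, n_tokens - 1
    while (t, u) != (0, 0):
        candidates = [(t - 1, u - 1), (t - 1, u), (t, u - 1)]
        candidates = [(a, b) for a, b in candidates if a >= 0 and b >= 0]
        t, u = min(candidates, key=lambda p: acc[p])
        path.append((t, u))
    return path[::-1]

# Toy example: 6 audio frames aligned against 3 lyric tokens.
rng = np.random.default_rng(1)
cost = rng.random((6, 3))
path = dtw_path(cost)
```

The monotonicity constraint is what makes the result usable as timestamps: lyrics never jump backward in time.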
Speech Recognition with Limited Data
The last of the accepted papers improves the SummaryMixing approach for speech recognition in low-data scenarios. The new variant reduces peak GPU memory usage by 40%, making model training more resource-friendly.
Alice Xiang: Ethics as a Tool, Not a Declaration
Sony AI researcher Alice Xiang, head of AI ethics at Sony Group, was named to AI Magazine's Top 100 Women in AI for 2026 list. She also appeared on the «Me, Myself and AI» podcast from MIT Sloan Management Review, where she discussed the FHIBE project.
FHIBE (Fair Human-Centric Image Benchmark) is the first publicly available, fully consented dataset designed specifically for evaluating bias in computer vision systems. The dataset is globally diverse and covers a broad spectrum of human recognition tasks. It has been published in Nature, is freely available, and is already in use across the industry.
The problem FHIBE solves is quite specific: to verify how fairly a computer vision system works, one needs representative data collected with the consent of the people included in it. For a long time, such data was simply not publicly available. FHIBE fills this gap.
In a conversation with podcast host Sam Ransbotham, Alice explained why the absence of ethically sourced datasets is not an abstract problem but a practical obstacle to the fair evaluation of AI systems.