Published on April 2, 2026

Sony AI March Highlights: Diffusion Models, ICASSP Papers, and AI Ethics Recognition

Sony AI's March recap includes a new book on generative models, over ten research papers accepted for ICASSP 2026, and recognition for Alice Xiang on a top-100 list.

Research / Technical context · Source: Sony AI · 5–8 minute read

March was a busy month for the Sony AI team. Researchers made progress on several fronts: a book explaining the mathematical foundations of generative models was released, more than ten papers were accepted at a key conference on audio and speech processing, and the head of AI ethics was named to a list of the one hundred most influential women in AI.

Below is more detail about each of these events.

The Principles of Diffusion Models: The Book That Needed to Be Written

Diffusion models are one of the key tools in modern content generation. They form the basis of systems that create images, audio, and much more from text descriptions. However, despite their widespread use, navigating this field can be difficult: different research communities have arrived at similar ideas via their own paths, leading to an accumulation of overlapping terminology and competing formulations.

The book The Principles of Diffusion Models is an attempt to bring order to this field. It was written by Sony AI researcher Chieh-Hsin "Jesse" Lai in collaboration with Yang Song, Dongjun Kim, and Stefano Ermon. The authors demonstrate that behind various approaches – such as DDPMs, score-based models, and flow-based methods – there is a unified mathematical logic. Simply put, they are not several different technologies, but different ways of describing the same underlying concept.
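To give a flavor of that shared backbone, here is a sketch of the standard score-based formulation (our illustration, not a passage from the book). A forward process gradually noises the data, generation simulates its time reversal, and the quantities the different schools estimate turn out to be the same object:

```latex
% Forward noising process (an SDE; our sketch of the standard view):
dx = f(x,t)\,dt + g(t)\,dw
% Generation simulates the time reversal, which is driven by the score:
dx = \left[ f(x,t) - g(t)^2 \nabla_x \log p_t(x) \right] dt + g(t)\,d\bar{w}
% A DDPM's noise predictor \epsilon_\theta is that score in disguise:
\nabla_x \log p_t(x) \approx -\epsilon_\theta(x,t)/\sigma_t
% Removing the stochastic term yields the probability-flow ODE, the
% deterministic view that flow-based methods start from:
dx = \left[ f(x,t) - \tfrac{1}{2}\, g(t)^2 \nabla_x \log p_t(x) \right] dt
```

In this reading, choosing DDPM, score-based, or flow-based machinery means choosing a parameterization and an integrator, not a different model family.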

In an interview, Jesse explained that he wants readers to be able to navigate the field after reading the book, not just reproduce specific techniques. He believes that the fundamental ideas in this area have a longer lifespan than the specific methods built upon them.

Over Ten Papers Accepted at ICASSP 2026

The IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) is one of the premier venues for research in audio, speech, and signal processing. This year the conference takes place May 4–8, 2026, in Barcelona, and Sony AI will be presenting an impressive slate of accepted papers.

The topics cover a fairly broad spectrum.

How Models "Hear" Musical Structure

One paper investigates how well pre-trained audio models can analyze musical structure – for example, whether they can distinguish between a verse and a chorus. The findings show that self-supervised learning on music data using so-called masked language modeling is particularly effective for this task.
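As a rough illustration of what masked modeling on audio looks like, here is a toy PyTorch sketch (our example; the paper's actual model, input features, and loss are not specified in the announcement). A fraction of frame-level features is hidden, and an encoder is trained to reconstruct them:

```python
# Toy masked modeling on audio frames: hide ~15% of frame features and
# train a transformer encoder to reconstruct them (illustrative only).
import torch
import torch.nn as nn

class MaskedFrameModel(nn.Module):
    def __init__(self, dim=256, heads=4, layers=4):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(dim))
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, dim)   # reconstruct the hidden frame

    def forward(self, frames, mask):
        # frames: (batch, time, dim) audio features; mask: (batch, time) bool
        x = frames.clone()
        x[mask] = self.mask_token         # replace hidden frames with a token
        pred = self.head(self.encoder(x))
        # compute the loss only where frames were hidden, as in masked LM
        return nn.functional.mse_loss(pred[mask], frames[mask])

model = MaskedFrameModel()
frames = torch.randn(2, 100, 256)         # stand-in for real audio features
mask = torch.rand(2, 100) < 0.15
loss = model(frames, mask)
```

A model trained this way has to infer a hidden frame from its musical context, which is plausibly why the resulting representations carry structural information such as verse/chorus boundaries.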

Sound and Picture – Together and in the Right Place

Another paper addresses a problem that is easy to overlook: in systems that generate audio and video simultaneously, the sound and image are often not spatially aligned. The researchers proposed a new method for measuring this discrepancy and created a dedicated benchmark, SAVGBench.
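For intuition, spatial alignment can be reduced to comparing directions: the bearing a listener would assign to a sound versus the on-screen bearing of its visual source. The toy metric below is our illustration of that idea, not SAVGBench's actual protocol:

```python
# Hypothetical spatial-alignment check: compare where a sound appears to
# come from with where its source sits in the frame (not SAVGBench's metric).
def azimuth_error(audio_azimuth_deg: float, object_azimuth_deg: float) -> float:
    """Smallest absolute angle between the audio and visual directions."""
    diff = (audio_azimuth_deg - object_azimuth_deg + 180.0) % 360.0 - 180.0
    return abs(diff)

# e.g. a dog barks 30 degrees left of center in the stereo image, but the
# dog is rendered 10 degrees right of center in the video frame:
print(azimuth_error(-30.0, 10.0))  # -> 40.0 degrees of spatial mismatch
```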

Blind Data Cleaning

The quality of training data is critical for any model. In music separation – the task of splitting a mixed audio track into its individual sources – datasets often contain hidden artifacts. The authors have proposed two cleaning methods that work without knowing the specific type of contamination present in the data.

Separating Sounds Using Video or Text

MMAudioSep is a generative model capable of extracting a desired sound from a mix, guided by either a video or a text description. It is based on a pre-trained video-to-audio generation model that has been adapted for this new task.

Real-Time Foley

FlashFoley is the first open-source, accelerated model for sketch-to-audio generation. In filmmaking, Foley refers to sound effects (like footsteps, a creaking door, or rain) that are artificially created and added to a film in post-production. FlashFoley enables this process to be done interactively and in real time.

Finding Samples in Music

Another paper tackles the task of automatically identifying whether a track contains a sample from another piece of music and, if so, which one. The approach is based on self-supervised learning and, according to the authors, significantly surpasses previous methods.
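One common way to frame such a task is retrieval over learned embeddings: slice both tracks into short windows, embed them, and look for suspiciously similar pairs. The sketch below (with a placeholder encoder) is our simplification, not the paper's actual method:

```python
# Sample identification framed as embedding retrieval (illustrative only;
# embed() is a placeholder for a learned self-supervised encoder).
import numpy as np

def embed(windows):
    """Stand-in encoder: each audio window -> a 128-d feature vector."""
    return np.random.randn(len(windows), 128)

def find_samples(query_windows, reference_library, threshold=0.85):
    """Return reference tracks whose windows closely match the query."""
    q = embed(query_windows)
    q /= np.linalg.norm(q, axis=1, keepdims=True)
    hits = []
    for name, ref_windows in reference_library.items():
        r = embed(ref_windows)
        r /= np.linalg.norm(r, axis=1, keepdims=True)
        sims = q @ r.T                    # cosine similarity, all window pairs
        if sims.max() >= threshold:       # any pair similar enough?
            hits.append((name, float(sims.max())))
    return sorted(hits, key=lambda h: -h[1])
```

The hard part, and presumably where the self-supervised training matters, is making the embeddings robust to pitch-shifting, time-stretching, and layering, which is how samples are typically transformed in practice.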

Automatic Mixing

MEGAMI (Multitrack Embedding Generative Auto Mixing) is a generative framework for the automatic mixing of multitrack music. Unlike deterministic methods, it takes into account the subjectivity of creative choices: professional sound engineers can mix the same recording in various ways, and this is perfectly normal.

Controllable Drums

Break-the-Beat! is a tool for rendering drum parts (MIDI) using the timbre from a reference audio file. In simpler terms, you can set a drum pattern and specify which "sound" to perform it with – for example, a sound taken from a specific recording.

A Benchmark for Evaluating Foley Models

FoleyBench is the first large-scale benchmark designed specifically for evaluating models that generate Foley-style sounds from video. It contains 5,000 video-audio-text triplets, providing broad coverage of typical Foley sounds.

Synchronizing Lyrics and Audio

WEALY is a pipeline for aligning lyrics with an audio recording. It utilizes embeddings from the Whisper model, and the approach is designed to be reproducible: the authors have intentionally provided transparent and open baseline comparisons.
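Alignment problems of this kind are often solved with dynamic time warping over frame-level features; the generic mechanic is sketched below with cosine costs. This is illustrative only, not WEALY's actual pipeline:

```python
# Generic DTW alignment between text-side and audio-side vectors
# (our illustration; WEALY's pipeline built on Whisper embeddings differs).
import numpy as np

def dtw_align(word_emb, frame_emb):
    """word_emb: (n_words, d); frame_emb: (n_frames, d).
    Returns a monotonic list of (word_index, frame_index) pairs."""
    w = word_emb / np.linalg.norm(word_emb, axis=1, keepdims=True)
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    cost = 1.0 - w @ f.T                      # cosine distance matrix
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # advance to the next word
                acc[i, j - 1],      # hold this word for another frame
                acc[i - 1, j - 1],  # advance both at once
            )
    path, i, j = [], n, m                     # backtrack the cheapest path
    while i > 1 or j > 1:
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)],
                   key=lambda ij: acc[ij])
    path.append((0, 0))
    return path[::-1]

words = np.random.randn(5, 64)     # stand-in for per-word text embeddings
frames = np.random.randn(200, 64)  # stand-in for pooled audio features
alignment = dtw_align(words, frames)
```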

Speech Recognition with Limited Data

The last of the accepted papers enhances the Summary Mixing approach for speech recognition in low-data scenarios. The new variant reduces peak GPU memory usage by 40%, making model training more resource-friendly.
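For context, the base Summary Mixing idea, as we understand the prior work this paper builds on, replaces quadratic self-attention with a per-frame transform plus a single time-averaged "summary" vector, which is what makes it cheap. The PyTorch sketch below shows that core mechanism in simplified form; the new paper's enhancements are not reflected here:

```python
# Simplified SummaryMixing-style layer: cost is linear in sequence length,
# so peak memory grows with T rather than T^2 as in self-attention.
import torch
import torch.nn as nn

class SummaryMixing(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.local = nn.Linear(dim, dim)      # per-frame branch
        self.summary = nn.Linear(dim, dim)    # branch that gets averaged
        self.combine = nn.Linear(2 * dim, dim)

    def forward(self, x):                     # x: (batch, time, dim)
        s = self.summary(x).mean(dim=1, keepdim=True)  # one global vector
        s = s.expand(-1, x.size(1), -1)       # shared across all frames
        return self.combine(torch.cat([self.local(x), s], dim=-1))

layer = SummaryMixing(256)
out = layer(torch.randn(2, 1000, 256))        # -> (2, 1000, 256)
```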

Alice Xiang: AI Ethics as a Practical Tool, Not a Declaration

Sony AI researcher Alice Xiang, head of AI ethics at Sony Group, was named to AI Magazine's Top 100 Women in AI for 2026 list. She also appeared on the "Me, Myself and AI" podcast from MIT Sloan Management Review, where she discussed the FHIBE project.

FHIBE (Fair Human-Centric Image Benchmark) is the first publicly available, fully consented dataset designed specifically for evaluating bias in computer vision systems. The dataset is globally diverse and covers a broad spectrum of human recognition tasks. It has been published in Nature, is freely available, and is already in use across the industry.

The problem FHIBE solves is quite specific: to verify how fairly a computer vision system works, one needs representative data collected with the consent of the people included in it. For a long time, such data was simply not publicly available. FHIBE fills this gap.
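Concretely, a consented and demographically annotated benchmark enables checks like the one sketched below: measure a model's error rate per group and look at the spread. The field names and data here are hypothetical, not FHIBE's actual schema:

```python
# Toy per-group error-rate comparison (hypothetical data and labels;
# a real audit on FHIBE would use its own annotations and task metrics).
from collections import defaultdict

def error_rate_by_group(predictions, labels, groups):
    stats = defaultdict(lambda: [0, 0])       # group -> [errors, total]
    for pred, label, group in zip(predictions, labels, groups):
        stats[group][0] += pred != label
        stats[group][1] += 1
    return {g: errs / total for g, (errs, total) in stats.items()}

rates = error_rate_by_group(
    predictions=[1, 0, 1, 1, 0, 1],
    labels=[1, 0, 0, 1, 1, 1],
    groups=["A", "A", "B", "B", "B", "A"],
)
gap = max(rates.values()) - min(rates.values())  # a simple bias signal
```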

In a conversation with podcast host Sam Ransbotham, Alice explained why the absence of ethically sourced datasets is not an abstract problem but a practical obstacle to the fair evaluation of AI systems.

Original Title: Advancing AI: Highlights from March
Publication Date: Apr 1, 2026
Sony AI (ai.sony) – a Japanese research division of Sony focused on developing AI technologies for creativity, robotics, image processing, and data analysis.

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Analyzing the Original Publication and Writing the Text – Claude Sonnet 4.6 (Anthropic). The neural network studies the original material and generates a coherent text.

2. Translation into English – Gemini 2.5 Pro (Google DeepMind).

3. Text Review and Editing – Gemini 2.5 Flash (Google DeepMind). Correction of errors, inaccuracies, and ambiguous phrasing.

4. Preparing the Illustration Description – DeepSeek-V3.2 (DeepSeek). Generating a textual prompt for the visual model.

5. Creating the Illustration – FLUX.2 Pro (Black Forest Labs). Generating an image based on the prepared prompt.
