How to Teach a Neural Network to Shred: From Clean Tone to Full Distortion in 5 Seconds

An Engineer's Take on Morphing Guitar Effects with Neural Networks: From the Math of Spherical Interpolation to Real-World Application at -40°C.

Electrical Engineering & System Sciences
Author: Dr. Alexey Petrov

Theoretical depth: 81% · Resistance to hype: 85% · Practical applicability: 93%
Original title: Guitar Tone Morphing by Diffusion-based Model
Publication date: Oct 9, 2025

When Physics Meets Rock

Imagine turning the volume knob on your amplifier. The sound builds smoothly, without any jumps, from silence to a deafening roar. Now, imagine you could turn a similar knob between completely different sounds – from the crystal-clear tone of an acoustic guitar to the aggressive distortion of a metal band. And not just switching between presets, as we're used to, but seamlessly morphing from one state to another.

Sound like science fiction? Five years ago, it was. Today, it's a real technology that I've held in my hands and tested in our lab at -35°C (because yes, electronics have to work in any conditions, otherwise they're just toys).

Why Is This So Hard, Anyway?

Let's start with a simple example. You have two colors – red and blue. Mixing them to get purple is elementary. Now, you have two sounds: a clean guitar and a guitar with a distortion effect. Try to «mix» them.

What do you get? That's right – a mess. Because sound isn't paint. It's an incredibly complex wave where every millisecond contains thousands of parameters: frequencies, amplitudes, phases, overtones. And when you simply layer one sound on top of another, you don't get a new tone; you get two sounds playing at the same time. It’s like having two guitarists play in unison – you can hear both, but no new instrument has been created.

Traditional audio processing methods work like a construction set: they take the signal apart into pieces (frequencies), tweak something, and put it back together. Think of the old vocoders from the '80s – that robotic voice was made exactly this way. The signal was split into frequency bands, as if you sliced a rainbow into segments, processed each one separately, and then glued them back together. Does it work? Yes. Does it sound natural? About as natural as a robot with a head cold.

Neural Networks Learn to Listen

And this is where neural networks enter the stage. But not the ones that generate cat pictures or write poetry. These are special architectures that have learned to understand the very essence of sound – its deep structure.

Imagine sound isn't just a wave, but a complex recipe for a dish. You have ingredients (frequencies), a cooking method (dynamics), spices (effects), and presentation (overall timbre). Traditional audio processing tries to change the finished dish – adding salt to already cooked soup. A neural network, however, learns to understand the recipe itself and can cook up an intermediate version – a soup that's 30% borscht and 70% shchi.

Diffusion Models: Chaos as a Method

The most interesting approach is diffusion models. The name sounds intimidating, but the principle is simple. Remember how as a kid you’d draw with a pencil and then smudge it with your finger to create smooth transitions? A diffusion model does something similar, but in reverse.

First, it takes a clean sound and, step by step, adds noise to it. It's like taking a perfect photograph and gradually covering it with sand – a little at first, then more and more, until all that's left is a pile of sand. The model memorizes every step of this process.

And then the magic happens: the model learns to go in the reverse direction. From pure noise, from chaos, it reconstructs the sound step by step. But here's the trick: it can restore not the original sound, but something in between two versions it was trained on. Like a sculptor who can carve both David and Venus from a block of marble, or perhaps something in between (though that would be strange).
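The forward «covering with sand» step can be sketched in a few lines. This is a toy illustration with a fixed noise level per step, not the schedule any real model uses; actual diffusion models vary the noise per step and train a network to run the chain backwards:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_diffusion(x, n_steps=10, beta=0.1):
    """Toy forward diffusion: at each step, shrink the signal slightly
    and mix in fresh Gaussian noise. The fixed beta is a simplification;
    real models use a per-step schedule and learn the reverse process."""
    steps = [x]
    for _ in range(n_steps):
        x = np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.normal(size=x.shape)
        steps.append(x)
    return steps  # steps[0] is the clean signal, steps[-1] is close to pure noise
```

After ten such steps only about (1 − β)^(10/2) ≈ 59% of the original amplitude survives; keep going, and the «photograph» is all sand. The trained model's job is to walk this chain in reverse.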

The Siberian Approach to Latent Space

Now for the most interesting part – latent space. It sounds like something from quantum physics, but it's really just a «compressed description» of a sound.

Imagine you have a detailed map of Novosibirsk with every house, tree, and manhole cover. That's our original sound – a lot of data, all very detailed. Now, you create a subway map – just the stations and the lines. That's the latent representation – the very essence, without the extra details.

The Music2Latent neural network does exactly this. It takes a five-second guitar sample (which is about 220,000 individual measurements at a 44.1 kHz sample rate) and compresses it into a compact vector – a set of a few hundred numbers. This is like the DNA of the sound – all the information about its timbre in a compact form.

And this is where the engineering magic begins: with two of these «DNAs» – one from a clean sound and one from a distorted one – we can create intermediate versions. But not by simple averaging (remember the mess?), but by using spherical interpolation.

Why Spherical?

Standard linear interpolation is like walking in a straight line from Novosibirsk to Tomsk. The shortest path? On a map, yes. But the Earth is round! If you were to actually walk in a «straight line», you'd have to dig a tunnel.

Spherical interpolation (SLERP) accounts for the «curvature» of the latent space. In neural networks, vectors don't live on a flat plane; they exist in a multidimensional space where straight lines often lead to nowhere. SLERP moves along the arc of a great circle – like an airplane flying the most efficient route, taking the Earth's curvature into account.

Mathematically, it looks terrifying:

SLERP(v₁, v₂, t) = sin((1-t)θ)/sin(θ) × v₁ + sin(tθ)/sin(θ) × v₂

Here θ is the angle between the vectors v₁ and v₂, and t runs from 0 (clean) to 1 (distorted). The essence is simple: we move along an optimal curve, not a straight line, keeping the signal's «energy» constant. This is critically important for sound – the volume and saturation remain natural throughout the entire transition from clean to distortion.
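In code, that formula is only a few lines. Here is a minimal NumPy sketch; the fallback to plain linear interpolation for nearly parallel vectors is a standard numerical guard, added for illustration:

```python
import numpy as np

def slerp(v1, v2, t):
    """Spherical linear interpolation between two latent vectors.
    t=0 returns v1, t=1 returns v2; in between we travel along the arc."""
    u1 = v1 / np.linalg.norm(v1)
    u2 = v2 / np.linalg.norm(v2)
    dot = np.clip(np.dot(u1, u2), -1.0, 1.0)
    theta = np.arccos(dot)                    # angle between the vectors
    if theta < 1e-7:                          # nearly parallel: plain lerp is safe
        return (1 - t) * v1 + t * v2
    return (np.sin((1 - t) * theta) * v1 + np.sin(t * theta) * v2) / np.sin(theta)
```

When the two latents have equal length, the norm stays constant along the whole path – which is exactly the «energy preservation» described above, and exactly what naive averaging fails to do.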

LoRA: For When the Model Is Too Stubborn

Pre-trained models are like an experienced craftsman who has spent his entire life making stools. He makes them perfectly, but ask him to make a chair, and you'll run into problems. The habits he's developed over the years get in the way of learning something new.

Low-Rank Adaptation (LoRA) is a way to «retrain» a model without breaking what it already knows. Instead of changing all the millions of parameters in the network, we add small «adapters» – additional layers with a small number of parameters.

Imagine a lathe. Instead of buying a new machine for every different part, you just change the cutting tools. The lathe is the same, but its capabilities have expanded. LoRA works similarly – the core model remains unchanged, but adapters are added for new tasks.

In the study, three approaches were tested:

  1. No LoRA – Using the model as-is. This is like forcing the master stool-maker to build chairs without any retraining. It works, but the result is wonky.

  2. One-Sided LoRA – Training an adapter only for the final processing. This is like teaching the craftsman only how to polish chairs, while he still tries to assemble them like stools.

  3. Two-Sided LoRA – Creating two adapters (one for the clean sound and one for distortion) and interpolating between them. This is like having two master craftsmen – a stool specialist and a chair specialist – and asking them to work together, gradually handing over control from one to the other.
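The two-sided setup can be sketched in PyTorch. The class below is a generic LoRA wrapper, not the authors' code; the rank, the scaling, and the way the two adapters are blended by t are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus a trainable low-rank adapter:
    y = W·x + (alpha / rank) · B·A·x."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # the craftsman's habits stay intact
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as a no-op
        self.scale = alpha / rank

    def delta(self, x):
        return self.scale * (x @ self.A.T) @ self.B.T   # the adapter's correction only

    def forward(self, x):
        return self.base(x) + self.delta(x)

def two_sided(base, lora_clean, lora_dist, x, t):
    """Blend two adapters: t=0 is the «clean» specialist, t=1 the «distortion» one."""
    return base(x) + (1 - t) * lora_clean.delta(x) + t * lora_dist.delta(x)
```

Because B is initialized to zero, a freshly attached adapter changes nothing – training then moves it away from the no-op, while the frozen base model never forgets how to make stools.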

Real-World Tests: From Theory to Practice

A beautiful theory is nice, but does it work in the real world? We took five types of transitions that any guitarist would recognize with their eyes closed:

  1. Clean Tone → Heavy Distortion – Like going from The Beatles to Metallica in 5 seconds.
  2. Clean Tone → Light Overdrive – The classic blues transition.
  3. Light Overdrive → Heavy Distortion – Building up aggression in a rock song.
  4. Clean Tone → Chorus/Flanger – Adding that «spacey» sound.
  5. Modulation → Distortion – From psychedelia to metal.

Each transition was recorded using real equipment. No synthetic examples – only real guitars through real amps. Because a neural network trained on synthetic data performs in the real world like a cheap ten-dollar Chinese effects pedal – you get a sound, but the soul is gone.

Metrics: How Do You Measure «Sound Quality»?

This is where it gets tricky. How do you objectively assess that one sound is «better» than another? It's like asking which painting is more beautiful – everyone has their own opinion.

We used three approaches:

CDPAM (a contrastive-learning-based deep perceptual audio metric) – This is like an artificial ear trained on millions of examples. It «listens» to two sounds and determines how similar they are perceptually (that is, to human hearing, not mathematically).

MOS (Mean Opinion Score) – A simple but effective method. We gathered 20 people (musicians and casual music lovers), had them listen to the transitions, and asked them to rate each one from 1 to 5. Like a wine tasting, but for sound.

Spectral Convergence – Pure mathematics. We compare the spectrograms (visual representations of sound) and calculate how well they match. It's like comparing fingerprints – the more matches, the better.
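Of the three metrics, spectral convergence is the easiest to write down: the Frobenius-norm distance between two magnitude spectrograms, normalized by the reference. The sketch below is the generic formula; the FFT size and hop length are arbitrary choices, not the paper's settings:

```python
import numpy as np

def spectrogram(x, n_fft=512, hop=256):
    """Magnitude spectrogram via a simple Hann-windowed STFT (sketch)."""
    window = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * window
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.asarray(frames), axis=1))

def spectral_convergence(reference, estimate):
    """Frobenius distance between spectrograms, relative to the reference.
    0 means a perfect match; the larger the value, the more the spectra diverge."""
    S_ref = spectrogram(reference)
    S_est = spectrogram(estimate)
    return np.linalg.norm(S_ref - S_est) / np.linalg.norm(S_ref)
```

Like fingerprint matching, it is blunt but objective: identical signals score exactly zero, and any spectral mismatch pushes the score up.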

The Results: Who Won?

Drumroll, please... 🥁

The clear winner was the Music2Latent method with spherical interpolation. It scored a 4.3 out of 5 on the MOS scale – that's on the level of «sounds like an expensive studio processor.»

Why this one? Three reasons:

  1. Simplicity – No complex diffusions, text prompts, or multi-step transformations. Encode → interpolate → decode. It’s like a good old Soviet radio receiver – minimal parts, maximum reliability.

  2. Quality – It operates at a 44.1 kHz sample rate (CD quality), whereas the diffusion models had to be downsampled to 16 kHz. That's the difference between vinyl and a phone call.

  3. Stability – There's no randomness in the generation. The same input always produces the same output. This is critical for live performances – a musician needs to be sure the effect will work exactly as it did in rehearsal.

The diffusion models with LoRA showed interesting results in terms of flexibility, but they lost out on naturalness of sound. It's like comparing a tube amp to a digital one – the second one can do more, but the first one sounds «warmer.»

Practical Applications: Not Just for Guitars

Where can this be used today?

Studio Work

Imagine a producer saying, «Make the guitar a bit heavier, but not full-on metal.» In the past, that meant an hour of cycling through presets. Now, you just turn a virtual morphing knob.

Live Performances

A guitarist can smoothly transition between different parts of a song without tap-dancing on their pedalboard. One controller – infinite sound variations.

Source Separation

This is a whole other topic. A model that understands the structure of a guitar sound can «extract» the guitar from a full mix. It's as if you could take a fully cooked borscht and pull out only the potatoes. Sounds incredible, but it works.

Education

Beginner musicians can hear exactly how a sound changes as effects are added. Not an abrupt «before/after» switch, but a smooth transition where you can stop at any point. «See, this is where the breakup starts, can you feel it?»

Technical Limitations: An Honest Look at the Problems

No technology is perfect. Here's what doesn't work well yet:

  1. Real-time on weak hardware – The model requires significant computational resources. On my field-testing laptop (the one that can withstand -40°C), the latency is about 100 milliseconds. For the studio, that's fine. For a live performance, it's a bit too much.

  2. Extreme effects – Transitioning from a clean sound to an ambient wash or from distortion to reverb still sounds unnatural. The model was trained on more «classic» effects.

  3. Long clips – It works optimally on segments of 5–10 seconds. For a whole song, you'd need to do multiple passes and stitch them together, which can cause artifacts at the seams.

  4. Sonic individuality – The model tends to average things out. If you have a unique vintage 1960s amp with its own character, the model will turn it into «just a good amp.»

A Look to the Future: What's Next?

Timbre morphing is just the beginning. The next step is full real-time control over the entire timbre. Imagine:

  • Adaptive effects – A pedal that adjusts to your playing style. Play softly, you get a light chorus. Strum hard, and the distortion kicks in.

  • Timbral autopilot – A system that analyzes an entire composition and automatically creates the timbral dynamics.

  • Vintage restoration – Taking a recording from the '50s and «building out» a modern sound while preserving its authenticity.

  • Cross-instrument morphing – Seamlessly transitioning from a guitar to a synthesizer or a violin. Creating new instruments that don't exist in nature.
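The first idea on this list – adaptive effects – is already easy to prototype: track the playing dynamics and map them to the morph position t. The thresholds below are made-up values you would tune to your instrument by ear:

```python
import numpy as np

def dynamics_to_morph(block, floor=0.01, ceil=0.3):
    """Map the RMS level of an audio block to a morph position t in [0, 1].
    Quiet playing (RMS <= floor) gives t = 0 (clean); hard strumming
    (RMS >= ceil) gives t = 1 (full distortion). floor/ceil are
    illustrative thresholds, not values from the study."""
    rms = np.sqrt(np.mean(block ** 2))
    return float(np.clip((rms - floor) / (ceil - floor), 0.0, 1.0))
```

Feed t into the latent interpolation, and the pedal «listens» to how hard you play: soft picking stays clean, digging in pushes the tone toward distortion.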

But the most important thing is that it has to work reliably. At -40°C, in 100% humidity, after being dropped from a two-meter height. Because a musician on stage can't just say, «Sorry, the neural network froze, we're rebooting.»

Why This Matters Right Now

We are living in an era where the lines between «real» and «artificial» sound are blurring. And that's not good or bad – it's just a fact. People used to argue that tube amps sounded «warmer» than solid-state ones. Now, a neural network can emulate both and create something entirely new.

But the focus isn't on replacing humans with machines. It's about expanding our capabilities. Just as the electric guitar didn't kill the acoustic guitar but instead created new genres of music, neural network audio processing is creating new tools for creativity.

My lab in Novosibirsk isn't the most obvious place for a revolution in music technology. But it's precisely here, where we have to test equipment in extreme winter conditions, that truly robust solutions are born. If a technology works in the Siberian frost, it will work anywhere.

One Last Piece of Practical Advice

If you're a musician and want to try these technologies, start simple. You don't need to immediately buy expensive hardware or learn to code. Many DAWs (Digital Audio Workstations) already include basic morphing algorithms. Give them a try, experiment, and find your sound.

And if you're an engineer who wants to dive deeper into the topic, the code for most of these models is open-source. You can run Music2Latent on a mid-range laptop (though not in real time). Diffusion models require a GPU, but Google Colab is sufficient for experimentation.

Just remember: technology is a tool. A hammer can drive a nail or create a sculpture. It all depends on whose hands are holding it.


P.S. All experiments described in this article were conducted on real equipment in the conditions of a Siberian winter. No neural networks were harmed by the frost – they turned out to be tougher than I expected. Although, my good old analog distortion pedal is still more reliable at -40°C. For now.

Original authors: Kuan-Yu Chen, Kuan-Lin Chen, Yu-Chieh Yu, Jian-Jiun Ding