A better way to break up a sound wave
Sound source separation has long fascinated scientists. In 1953, the British cognitive scientist Colin Cherry coined the phrase “cocktail party effect” to describe the human ability to zero in on a single conversation in a crowded, noisy room. Engineers first tried to isolate a song’s vocals or guitars by adjusting the left and right channels in a stereo recording or fiddling with the equalizer settings to boost or cut certain frequencies. They began experimenting with AI to separate sounds, including those in musical recordings, in the early 2000s.
Today, the most commonly used AI-powered music source-separation techniques work by analyzing spectrograms, which are heat-map-like visualizations of a song’s different audio frequencies. “They are made by humans for other humans, so they are technically easy to create and visually easy to understand,” says Defossez. Spectrograms may be nice to look at, but the AI models that rely on them have several important limitations. They struggle in particular to separate drum and bass tracks, and they tend to omit important information about the original multitrack recording, such as the phase relationships that determine when the frequencies of a saxophone and a guitar cancel each other out. This is principally because these models attempt to corral sounds into a predetermined matrix of frequency and time rather than dealing with the sound waves as they actually are.
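To make the approach concrete, here is a minimal sketch of the general spectrogram-masking pipeline, assuming a mono waveform as input. It is illustrative only, not the method of any particular system; `predict_vocal_mask` is a hypothetical stand-in for a trained neural network (the placeholder below just returns a crude magnitude ratio).

```python
# Minimal sketch of spectrogram-mask source separation (illustrative only).
import numpy as np
from scipy.signal import stft, istft

def predict_vocal_mask(magnitude: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in for a trained network that estimates, per
    # time-frequency cell, the fraction of energy belonging to the target
    # source. This placeholder just returns a crude ratio in [0, 1].
    return magnitude / (magnitude.max() + 1e-8)

def separate_vocals(mixture: np.ndarray, sr: int = 44100) -> np.ndarray:
    # 1. Chop the waveform into overlapping windows and transform each one
    #    into frequency bins: the predetermined grid of frequency and time.
    _, _, Z = stft(mixture, fs=sr, nperseg=4096)
    magnitude, phase = np.abs(Z), np.angle(Z)

    # 2. Estimate a soft mask over the magnitude spectrogram.
    mask = predict_vocal_mask(magnitude)

    # 3. Apply the mask, then reuse the *mixture's* phase -- the phase of
    #    the isolated source is never recovered, which is one place the
    #    omitted information shows up.
    Z_est = mask * magnitude * np.exp(1j * phase)
    _, vocals = istft(Z_est, fs=sr, nperseg=4096)
    return vocals
```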
Spectrogram-based AI systems are relatively effective at separating out the notes of instruments that ring or resonate at a single, stable frequency at any given point in time, such as mezzo piano or legato violin melodies. These show up on a spectrogram as distinct, unbroken horizontal lines running from left to right. But isolating percussive sounds that produce residual noise, such as a drum kit, a slapped bass, or even staccato piano, is a much tougher task. Like a flash of lightning, a drumbeat feels like a single, whole event in real time, but it actually comprises several parts: for a drum, an initial attack that covers a broad range of higher frequencies, followed by a pitchless decay confined to a smaller range of low frequencies. The average snare drum “is all over the place in terms of frequency,” says Defossez.
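The contrast is easy to demonstrate with synthetic audio. The sketch below (made-up, illustrative signals, not any real recording) builds a steady tone and a crude drum-like hit, then counts how many frequency bins each one needs on a spectrogram: the tone stays in a handful of bins, while the hit’s attack smears across hundreds.

```python
# Sketch: a sustained tone vs. a drum-like hit on a spectrogram.
import numpy as np
from scipy.signal import stft

sr = 22050
t = np.arange(sr) / sr  # one second of audio
rng = np.random.default_rng(0)

# A steady 440 Hz tone: a single horizontal line on the spectrogram.
tone = np.sin(2 * np.pi * 440 * t)

# A drum-like hit: a broadband noise burst (the attack) decaying into a
# quieter low-frequency "body".
decay = np.exp(-40 * t)
hit = rng.standard_normal(sr) * decay + np.sin(2 * np.pi * 80 * t) * decay

for name, x in [("tone", tone), ("hit ", hit)]:
    f, frames, Z = stft(x, fs=sr, nperseg=1024)
    mag = np.abs(Z)
    # In the loudest frame, how many bins hold 95% of the energy?
    frame = mag[:, np.argmax(mag.sum(axis=0))]
    energy = np.sort(frame**2)[::-1]
    bins = np.searchsorted(np.cumsum(energy), 0.95 * energy.sum()) + 1
    print(f"{name}: 95% of peak-frame energy sits in {bins} of {len(f)} bins")
```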
Spectrograms, which can represent sound waves only as a grid of time and frequency, cannot capture such nuances. A drumbeat or a slapped bass note therefore shows up as several noncontiguous vertical streaks rather than as one neat, seamless sound, and models working from that representation reproduce the fragmentation. That is why drum and bass tracks that have been separated via spectrogram often sound muddy and indistinct.
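Part of what makes the result muddy is the fixed trade-off between time and frequency resolution baked into that grid. The sketch below, under an idealized assumption (the “drum hit” is a single-sample impulse), shows that widening the analysis window sharpens pitch resolution but smears the hit’s timing across neighboring frames:

```python
# Sketch of the time-frequency trade-off that blurs percussive sounds.
# Idealized: the "drum hit" is a single-sample impulse.
import numpy as np
from scipy.signal import stft

sr = 22050
x = np.zeros(sr)
x[sr // 2] = 1.0  # the hit: one sample, one instant in time

for nperseg in (256, 4096):
    f, times, Z = stft(x, fs=sr, nperseg=nperseg)
    # Frames whose analysis window overlaps the impulse carry its energy.
    active = np.where(np.abs(Z).sum(axis=0) > 1e-9)[0]
    span_ms = (times[active[-1]] - times[active[0]]) * 1000
    print(f"window={nperseg:4d}: hit spread over {span_ms:5.1f} ms "
          f"and across all {len(f)} frequency bins")
```

A real drum hit is not a perfect impulse, but the constraint is the same: the grid cannot be sharp in both time and pitch at once, so whichever window size a system picks, percussive detail gets blurred in one dimension or the other.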