Sound Separation Research

From October 2021 to March 2022, I worked on a Machine Learning project supervised by Google’s Sound Separation research team under John Hershey, Kevin Wilson and Scott Wisdom. The goal was to separate sound originating within a threshold distance of a microphone from sound originating further away.

The resulting paper was accepted at the Interspeech 2022 conference, was nominated for best student paper (perhaps a backhanded compliment as I wasn’t a student :) ), and the idea itself is being patented by Google.

  • I approached the Sound Separation team because I am interested in ML applications in hearing aid technology. My father uses hearing aids and is frustrated by the experience, and from my time at Waymo it seemed like something AI could really help with. I was surprised at how little ML is used in today’s devices.

    I spoke to John Hershey’s team about open AI+Sound problems that could be directly applicable to hearing aids. We discussed how hard it is for hearing aid users to distinguish the voice(s) of the people they are talking to from background noises and voices, especially in crowded places.

    We settled on the idea of separating nearby sound from sound that comes from further away, as that seemed like it could be immediately useful. If my dad could twist a dial that changed a threshold such that everything beyond that threshold was silenced, he could use it in places like restaurants to block out background noise.

  • My day job at Waymo was entirely ‘classic’ Computer Science. My focus was optimization algorithms, and making them work in a reasonably online, low latency fashion. I didn’t use any Machine Learning, and what I knew about AI was learned via osmosis from hanging around the ML-heavy teams at Waymo.

    This meant that I was pretty intimidated by the field of ML, and decided I needed to catch up. I took Andrew Ng’s famous Machine Learning Coursera course in mid-2021 and surprised myself by enjoying it and finding it far more approachable than I had anticipated.

    Through the course, and then applying my new skills in the Sound Separation project, I became comfortable with the fundamentals and common techniques of AI. I now have a good sense of which problems are well suited to ML and where its limitations are, and most importantly I lost my apprehension of AI. As with many tools, you don’t need a deep technical knowledge of the maths involved to use it well, and it is often the ‘second order’ skills that make or break a project: problem scoping, efficient infrastructure, data preparation, debugging intuition, etc.

  • You can find the paper on arXiv here, or on the Google Research GitHub page here. I recommend the latter because it includes demos made from real recordings in my living room!

 

What’s the Deal in Plain English?

One of my main takeaways from the experience of publishing this paper was how dense and inaccessible academic writing is. Of course, rigour and specificity are extremely important when writing about science, and I understand the need for precise language. Having said that, I think it is a real pity that anyone who hasn’t learned the skill of parsing that language is kept from engaging with science at the source of new discoveries.

Worse, I think the writing is so intimidating that many potential researchers are probably frightened away, thinking they need to be comfortable reading existing research before doing their own. I certainly felt like it would be a long time before I could contribute anything meaningful, and was settling in for a strenuous uphill climb to become a researcher. To my surprise, my supervisors assured me I was more than equipped to understand the necessary concepts, and encouraged me to jump in. I was really lucky to find supervisors who were so welcoming, and sure enough, everything I did in this research has a “plain English” explanation that anyone could understand. You wouldn’t know that from reading my paper, though!

So, in the interest of being the change I want to see in the world, I’m including a plain English explanation of my paper here, and I hope that doing so will one day be an expected accompaniment to all academic papers.

 

Why Distance-Based Separation?

The problem we set out to solve

Today’s hearing aids don’t include much Machine Learning, for the simple reason that many ML models are far too computationally expensive to run on hardware as small as a hearing aid. However, it is possible to make smaller ML models, as long as they don’t need to do anything too complicated. Cutting-edge sound research is often very complex (think semantic understanding of language), but if we could come up with a project that only needed superficial characteristics of sound, it would stand a better chance of being usable on devices.

Thus the idea of separating sound purely by distance was born. We hoped that the way someone’s voice changes as they move closer to or further from you would be learnable by a deep network in a general way. It is already well understood that properties of sound change with distance. A particularly interesting property to us was the “Direct to Reverberant Ratio” (DRR) of sound, which is a measure of how much of someone’s voice you’re hearing directly from them (direct), versus sound that leaves them, bounces off walls and other objects, and only then arrives at your ears (reverberant). As someone moves away, you hear a higher proportion of reverberant sound.
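
If you’re curious what DRR looks like in code, here is a minimal sketch of how you might estimate it from a room impulse response (a recording of how a single “click” at the speaker’s position arrives at the microphone). The function name and the 5 ms direct-path window are illustrative choices of mine, not something from the paper:

```python
import numpy as np

def direct_to_reverberant_ratio(rir, fs, direct_window_ms=5.0):
    """Rough DRR estimate (in dB) from a room impulse response (RIR).

    Energy within a few milliseconds of the strongest peak is treated as the
    direct path; everything after that window is treated as reverberation.
    """
    peak = int(np.argmax(np.abs(rir)))
    half_window = int(fs * direct_window_ms / 1000 / 2)
    direct = rir[max(0, peak - half_window): peak + half_window + 1]
    reverberant = rir[peak + half_window + 1:]
    direct_energy = np.sum(direct ** 2)
    reverberant_energy = np.sum(reverberant ** 2) + 1e-12  # avoid divide-by-zero
    return 10.0 * np.log10(direct_energy / reverberant_energy)
```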

Knowing about DRR and other cues for distance (like volume level) was enough motivation to try an experiment, because we knew that in theory there would be information about distance embedded in the sound that a model could learn. Because this is Machine Learning, we didn’t tell the model about any of these cues, instead letting it figure out how to separate the sound however it liked, but we can make an educated guess that it was probably relying on things like DRR to do so.

Experiment Setup

My supervisors already had several models they had designed for various sound separation problems, and to a certain extent an ML model can be considered “general purpose” within its genre of machine learning. For example, they already had a model that would take in a mixture of sound and split it into two clips (called “estimates”), and they had been training it to separate speech from non-speech.

This meant they would create a mixture by combining a clip of clean speech with one of background noise and feed it to the model, which would output an estimate of the speech in the clip and an estimate of the background noise. Then the model would measure how accurate each estimate was, adjust itself to do better, and try again.

To measure the accuracy, the model would compare its estimate of the speech to the real “answer”: the clean speech clip from before it was mixed with noise. It would then do the same for the noise estimate, comparing it to the noise clip from before mixing.

By changing the “answer” for the comparison, you can change what the model trains itself to do. Instead of “speech” and “non-speech”, I could provide all the “nearby” sounds as one estimate’s answer, and all the “far away” sounds as the other.
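
As a toy sketch of that “just change the answers” idea (the variable names here are hypothetical placeholders, not the real pipeline):

```python
import numpy as np

def make_example(task, speech_clip, noise_clip, near_sources, far_sources):
    """Build (mixture, references) for a generic two-output separation model.

    The model and training loop stay exactly the same for both tasks;
    only the reference signals (the "answers") change.
    """
    if task == "speech_vs_nonspeech":
        references = [speech_clip, noise_clip]
    elif task == "near_vs_far":
        references = [np.sum(near_sources, axis=0), np.sum(far_sources, axis=0)]
    else:
        raise ValueError(f"unknown task: {task}")
    mixture = references[0] + references[1]  # what the model actually hears
    return mixture, references
```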

This meant that I didn’t have to design or create a whole ML model from scratch, and in fact it would have been a waste of time to do so, because we were trying to show that a new type of problem (distance-based sound separation) could be solved with ML, and any decent learning network should be able to perform well enough to show that. If distance-based sound separation became a field of its own and the goal instead was to show how well you could separate sound by distance, then you might want to create highly specialized models to gain some extra performance.

So, armed with my team’s models, my job became preparing all the training data for this new task.

Creating Training Data

Creating training data was the bulk of this project. There aren’t big datasets of recorded speakers tagged with their distance from the microphone for us to use: you can imagine the time it would take to record the millions of examples you need to train a model. If we did use a real dataset of speakers, it would be for the testing phase of the project, where even a few hundred real examples are very useful for seeing how well the trained model performs.

So, instead we created “synthetic” training examples by taking a big dataset of individual speakers reading clips of audiobooks (a canonical audio dataset called Libri-light), and modifying the clips to sound like they were coming from a certain distance from the microphone. There is an accepted way to do this: model a “room” of varied size with varied acoustic properties (picture soft walls vs echoey walls), and add reverberation to the clip consistent with a location in that synthetic room.
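
To make that concrete, here is roughly what rendering a dry clip at a chosen distance looks like using the open-source pyroomacoustics library. This is a sketch of the general technique with made-up numbers, not the exact tooling or parameters we used:

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000
speech = np.random.randn(fs * 3)  # stand-in for a 3-second dry speech clip

# A shoebox room with chosen size and wall absorption ("soft" vs "echoey").
room = pra.ShoeBox(
    [6.0, 4.0, 3.0],               # room dimensions in metres
    fs=fs,
    materials=pra.Material(0.35),  # higher absorption = less reverberant
    max_order=10,                  # how many wall reflections to simulate
)

room.add_microphone([3.0, 2.0, 1.5])             # microphone position
room.add_source([3.0, 3.5, 1.5], signal=speech)  # a speaker 1.5 m away

room.simulate()
reverberant_clip = room.mic_array.signals[0]     # speech as "heard" at the mic
```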

We started by generating loads of these synthetic rooms, with randomized sizes and properties. Then we generated locations within those rooms: one for the microphone and five potential speaker locations. We generated locations such that the distribution of speaker distances from the microphone was uniform, because that way, when we moved the threshold distance between “near” and “far”, we affected the split of speakers between “near” and “far” in an understandable way. This did have a tradeoff: the 2D distribution of speaker locations wasn’t very realistic. As you can see below, it would be strange to have a high concentration of speakers right next to you.

Distribution of Generated Speaker Sources
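
For the curious, sampling positions so that distance (rather than location) is uniform might look like the sketch below. Sampling a uniform radius and a uniform angle is exactly what piles speakers up near the microphone, since the area available at a given radius grows with that radius (this is illustrative, not our exact sampling code, and it ignores room walls):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_speaker_position(mic_xy, max_distance):
    """Sample a 2D speaker position whose distance from the mic is uniform in [0, max_distance].

    Uniform *distance* is not uniform over *area*: the area at radius r grows
    with r, so this concentrates speakers close to the microphone.
    """
    distance = rng.uniform(0.0, max_distance)
    angle = rng.uniform(0.0, 2.0 * np.pi)
    offset = distance * np.array([np.cos(angle), np.sin(angle)])
    return np.asarray(mic_xy) + offset
```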

It was important that we kept things orderly and understandable like this across the experiment because this was a new task, and we needed to be able to make statements about which parameters mattered to the model and the task. If, for example, changing the distance threshold had also had a major, unaccounted-for effect on the distribution of speakers, we might have attributed results of the experiment to the threshold location when in fact the speaker distribution was responsible.

We waited until training time to combine the clean speech clips with the generated rooms. This was so that we didn’t have to store every single example in a database, and could instead load up batches of rooms and randomly assign speakers to them, create clips, and then reuse the rooms and speakers in different configurations to inflate the number of examples.
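
A sketch of what that on-the-fly recombination might look like, as a Python generator. The room objects here are assumed to carry their candidate speaker positions, and `render_in_room` / `make_targets` are placeholder helpers (a variant of the bucketing step is sketched further down):

```python
import random

def example_stream(rooms, speech_clips, render_in_room, make_targets):
    """Endless generator of training examples built on the fly.

    `rooms` is a finite pool of pre-generated synthetic rooms and
    `speech_clips` a pool of dry speech recordings. Re-pairing rooms with
    different speakers multiplies the number of distinct examples without
    ever storing the mixtures themselves.
    """
    while True:
        room = random.choice(rooms)
        clips = random.sample(speech_clips, k=len(room.speaker_positions))
        rendered = [render_in_room(room, position, clip)
                    for position, clip in zip(room.speaker_positions, clips)]
        yield make_targets(room, rendered)
```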

Varying Number of Speakers

For each example, we used an important parameter called “Source Presence Probability” (SPP) to govern how many speakers would be in each room. Each of the 5 locations would either have a speaker or not depending on the value of this parameter: for an SPP of 1 there would always be 5 speakers present, and for an SPP of 0.5 each speaker would have a 50% chance of being present. Each model was trained entirely with a fixed SPP, and even though an SPP of 0.5 was more realistic (a varied number of speakers), having a model trained with an SPP of 1 provided another way to remove variance and understand the performance.
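
In code, the presence decision is just a coin flip per location; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_presence(source_presence_probability, num_locations=5):
    """Decide which candidate speaker locations are occupied in this example.

    SPP = 1.0 means all five locations always hold a speaker; SPP = 0.5 gives
    each location an independent 50% chance.
    """
    return rng.random(num_locations) < source_presence_probability
```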

Finished Examples and Silent Targets

When we created the training examples on the fly at training time, for the locations that were assigned speakers, we would combine the clean speech sources with reverberation consistent with their location in the synthetic room. Then we would bucket the speakers into “near” and “far” within their room, relative to the distance threshold, which would then create two “answers” (known as targets): one for near and one for far. Finally we would add the two answers together to create the “mixture” that we would feed to the model.
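
Here is a minimal sketch of that bucketing step (again illustrative; the function and variable names are mine, not from the real pipeline):

```python
import numpy as np

def bucket_into_targets(reverberant_sources, distances, threshold):
    """Bucket rendered sources into near/far targets and build the input mixture.

    `reverberant_sources` are speech clips already rendered at their positions
    in the synthetic room; `distances` are their distances from the mic in metres.
    """
    num_samples = len(reverberant_sources[0])
    near_target = np.zeros(num_samples)
    far_target = np.zeros(num_samples)
    for source, distance in zip(reverberant_sources, distances):
        if distance <= threshold:
            near_target += source
        else:
            far_target += source
    mixture = near_target + far_target  # what the model receives as input
    return mixture, near_target, far_target
```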

You can imagine that changing the distance threshold would have implications for the distribution of examples. For example, a very nearby threshold would mean that usually the majority of speakers would end up in the “far” bucket. Sometimes all the speakers would end up in one bucket, which would then mean one of the targets would be totally silent. The number of these silent targets was dependent on the distance threshold, but also on the SPP, as a lower SPP meant fewer speakers present in total.

Metrics

Once we had all our training data, we needed a way to assess the performance of the model. During training the model measures how similar its estimates are to the targets using a “loss function”, which is a measure of distance between two sound clips in spectrogram form (a visual representation of a sound). The model can then make adjustments to try and minimize that loss. Loss functions tend to also be standardized across a research area, because a good one will work well for many problems.

The loss function helps the model train, but as evaluators of its performance we don’t only want to know how similar the spectrograms are; we want to know how “good” the output sounds to us as humans. A common way to measure that is improvement in Signal-to-Distortion Ratio (SDR). The SDR measures how “buried” the sound you’re interested in (e.g. the nearby speakers) is in a clip. To measure improvement, you compute the SDR of the input mixture and of the estimate, and take the difference: how much clearer is the estimate than the mixture? If you were to do a perfect job, the SDR of your estimate would approach infinity, because you had removed all the unwanted sounds. This also allows you to see whether you perform better on examples that are “clearer” to begin with; it could be a bit unfair to expect the model to perform well on examples where the nearby sounds are very muffled and quiet and the faraway sounds are loud and clear (low input SDR).
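
In its simplest form (ignoring the scale-invariance and other refinements that the formal metric definitions include), the calculation looks like this sketch:

```python
import numpy as np

def sdr_db(reference, estimate):
    """Signal-to-distortion ratio of `estimate` against the clean `reference`, in dB."""
    error = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / (np.sum(error ** 2) + 1e-12))

def sdr_improvement(reference, mixture, estimate):
    """How much clearer the estimate is than the raw input mixture.

    A perfect estimate drives the error energy to zero, sending its SDR (and
    therefore the improvement) towards +infinity.
    """
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)
```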

One thing to note is that this metric doesn’t work if you have a silent target (like those we discussed above). This is because your input mixture will be either all signal or all noise depending on whether you’re looking at the near or far target, which pushes the metric to +/- infinity. So, for examples with a silent target we instead measured how well the model reduced the sound towards silence (basically, how well it did at putting everything in one estimate).

The Model

We now had training data and means to evaluate our performance- it was time to run the experiment!

I’m going to stay very light on the technical details here and instead describe the phases at a high level. This is more reasonable than it sounds, because in today’s world of ML, model components act more like “drag and drop” sections of Rube Goldberg machines, and as an ML scientist you build an intuition for which components are likely to help your problem, then use existing libraries to attach them in an order that suits you. Of course this is an oversimplification, but you’re about as unlikely to code your own LSTM as you are to code your own hashmap in regular programming. Instead, you’d use TensorFlow or similar to instantiate one and then adjust its settings/parameters to suit you.

In essence, the model did the following:

  1. Split the clips up into 32-millisecond windows.

  2. Turned each window into a spectrogram, which creates an image showing the intensity of the different frequencies of sound in that window.

  3. Fed the spectrograms into an “LSTM”, which stands for “Long Short-Term Memory” and is a type of recurrent neural network that deals well with sequential data (like sound, where it’s good to take into account sound you’ve heard earlier when deciding what the current sound you’re listening to means). Read more here if you like.

  4. Used the output of the LSTM to create “masks”, one for near and one for far. These are images that you can multiply, pixel by pixel, with the spectrogram from step 2 to create an output. A pixel’s value in the mask dictates how much of the input mixture’s pixel intensity to let through, in other words how much of that frequency to keep.

  5. Created estimates using the masks. The near mask multiplied with the mixture spectrogram creates the near estimate, and likewise with the far.

  6. Computed the training loss. This is a measure of how different the estimate spectrograms are from the target spectrograms. You can imagine finding the difference in intensity between each corresponding pixel and measuring that on average. Then the model can adjust weights inside the LSTM to create slightly different masks next time, hopefully with smaller loss.

  7. Turned the spectrograms back into real sounds.

  8. Repeat, a lot.
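
Putting those steps together, here is a heavily simplified TensorFlow sketch of the masking pipeline, with comments mapping back to the numbered steps. The layer sizes, window settings and loss here are placeholder choices of mine, not the configuration from the paper:

```python
import tensorflow as tf

FRAME_LENGTH = 512   # 32 ms windows at 16 kHz (step 1)
FRAME_STEP = 256
NUM_BINS = FRAME_LENGTH // 2 + 1
NUM_OUTPUTS = 2      # one estimate for "near", one for "far"

lstm = tf.keras.layers.LSTM(256, return_sequences=True)                           # step 3
mask_layer = tf.keras.layers.Dense(NUM_OUTPUTS * NUM_BINS, activation="sigmoid")  # step 4

def separate(mixture):
    """mixture: float tensor [batch, samples] -> estimates [batch, 2, samples]."""
    stft = tf.signal.stft(mixture, FRAME_LENGTH, FRAME_STEP)               # steps 1-2
    spectrogram = tf.abs(stft)
    masks = mask_layer(lstm(spectrogram))                                  # steps 3-4
    masks = tf.reshape(masks, [tf.shape(stft)[0], -1, NUM_OUTPUTS, NUM_BINS])
    masked = tf.cast(masks, tf.complex64) * stft[:, :, tf.newaxis, :]      # step 5
    masked = tf.transpose(masked, [0, 2, 1, 3])                            # [batch, output, frames, bins]
    return tf.signal.inverse_stft(masked, FRAME_LENGTH, FRAME_STEP)        # step 7

def training_loss(estimates, targets):
    """Step 6: how different are the estimate and target spectrograms, on average?"""
    def spec(x):
        return tf.abs(tf.signal.stft(x, FRAME_LENGTH, FRAME_STEP))
    return tf.reduce_mean(tf.abs(spec(estimates) - spec(targets)))

# Step 8: compute training_loss on batch after batch, letting an optimizer
# nudge the LSTM and mask weights to shrink it each time.
```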

Results

TLDR: It works! We found that we could reliably improve SDR for our test examples. We did a lot of slicing of our results data, because as I said before this was a new task and we wanted to isolate the effects of various parameters and understand what the levers of performance were. Here are the main areas we investigated:

Effect of Distance Threshold

We found that performance generally degrades as you increase the distance threshold. This was a compelling result because increasing the distance threshold also decreased the number of silent targets the training saw, and when we isolated the effect of decreasing the number of silent targets we saw performance improve. That means the degradation with distance is likely a genuine effect of distance, not of the change in silent targets.

The decrease in silent targets happened because increasing the distance threshold moved it closer to the mean speaker distance from the microphone, which made it more likely that speakers would land in both the near and far categories.

Effect of Silent Targets

As stated above, the more silent targets the model saw during training, the poorer it did on examples where both targets were non-silent. In some training runs, a majority of examples actually had a silent target, because of the combination of a low SPP and a far-from-mean distance threshold.

Examples with a silent target were more of a “classification” problem than a sound separation problem, because the model basically had to decide whether the whole clip was near or far, not separate it. For those examples it performed well, usually getting it right and putting most of the sound in the correct category, but it isn’t really what we were investigating.

Effect of Source Presence Probability

We isolated the effect of SPP in order to assess whether the model was somehow relying on there being a consistent number of speakers present. You can imagine that if the model knows there are always going to be 5 people, it could use that to perform the separation better. We didn’t find a big difference in performance with an SPP of 1, which reassured us that this wasn’t the case.

Effect of Model Size

This one was very straightforward. You can adjust the number of layers in the LSTM, and the number of units in each layer. The bigger we made the model the better it performed, but it also made it more expensive to run, which isn’t something we want for our applications.

Conclusions and Audio Examples

Overall we managed to achieve an SDR improvement of >4 dB for examples with 1 nearby speaker in our default model, which used an SPP of 0.5.

Here you can see the results broken down by the number of speakers in the near bucket vs the far bucket, and also by input SDR. For examples with good input SDR, we can reliably improve the output.

The 4 quadrants break down examples by how many speakers were in the “near” target. Although performance gets worse the more speakers are in the near target, you can also see that the number of examples drops off, which is a possible explanation for the worsening performance.