The Time Lord brought home the latest issue of Scientific American last night, just so I could peruse Graham Collins' article on "Solving the Cocktail Party Problem" — namely, studying how our brains separate various auditory streams when in a crowded room, like a restaurant or a cocktail party. (Personally, my brain has never been especially good at this. I find myself having to really concentrate when the noise levels reach a certain critical threshold.) Scientists have been pretty successful at studying how the brain accomplishes this amazing feat. They've been less successful at devising computer algorithms to do the same thing.
We've all experienced this phenomenon. Walk into a crowded bar, with music blaring, and your first impression is likely to be a shudder at the sudden wall of sound — which you will interpret at first as a single loud noise. But very quickly, you adjust, and different sounds begin to emerge. We navigate by tuning our neurons to specific voices, thereby tuning out others — like that irritating, leering would-be Lothario at the other end of the bar, or all that ambient noise.
A few years ago, at an acoustics conference, I chatted with Shihab Shamma, a researcher at the University of Maryland, College Park. He believes this ability arises from auditory nerve cells in the brain that re-tune themselves to specific sounds as part of the adaptive process. It's kind of an auditory feedback loop that enables us to sort out confusing incoming acoustical stimuli.
He's surprised, however, by how quickly this process happens: auditory neurons in adult mammal brains make the adjustment in a few seconds. To Shamma, this suggests that the developed brain is even more "plastic" or adaptable than previously realized. We're literally changing our minds.
Scientists are still a bit in the dark in terms of understanding the mechanisms that cause this rapid tuning, but Shamma says that if we can mimic those abilities, it could lead to the development of more effective hearing aids and cochlear implants. In the shorter term, it might help improve automatic speech recognition systems by teaching them to filter out moderate levels of background noise and other acoustical "clutter."
And that brings us to the latest SciAm article (subscription required, sorry). Apparently a team of researchers at IBM's T.J. Watson Research Center has managed to create an algorithm for the "cocktail party problem" that outperforms human beings. Why is it so hard, and therefore such a big deal? It comes down to the number of possible sound combinations, which quickly becomes unwieldy. Here's how Collins phrases it:
"Whether one person is talking or many, the sound contains a spectrum of frequencies, and the intensity of each frequency changes on a millisecond timescale; spectrograms display data of this kind. Standard single-talker speech recognition analyzes the data at the level of phonemes, the individual units of sound that make up words… Each spoken phoneme produces a variable but recognizable pattern in the spectrogram. Statistical models … [specify] the expected probability that, for instance, an "oh" sound will be followed by an "n". The recognition engine looks for the most likely sequences of phonemes and tries to build up whole words and plausible sentences."
In other words, speech recognition works a bit like Auto-Correct — and we all know what can happen when Auto-Correct goes horribly, horribly wrong.
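For the code-inclined, here is a toy sketch of that "most likely sequence" idea in Python. The probabilities in the table are made up for illustration (they don't come from Collins' article or any real recognizer), and the function name is hypothetical. It scores each candidate phoneme sequence by adding up the log-probabilities of its neighboring pairs and keeps the highest-scoring one.

```python
import math

# Made-up probabilities that the second phoneme follows the first.
BIGRAM = {
    ("oh", "n"): 0.30,
    ("oh", "m"): 0.05,
    ("n", "ee"): 0.20,
    ("m", "ee"): 0.10,
}


def sequence_log_prob(phonemes, bigram, floor=1e-6):
    """Sum the log-probabilities of each adjacent phoneme pair.

    Pairs missing from the table get a small floor probability
    instead of zero, so the logarithm stays defined.
    """
    pairs = zip(phonemes, phonemes[1:])
    return sum(math.log(bigram.get(pair, floor)) for pair in pairs)


candidates = [["oh", "n", "ee"], ["oh", "m", "ee"]]
best = max(candidates, key=lambda seq: sequence_log_prob(seq, BIGRAM))
print(best)  # ['oh', 'n', 'ee']
```

A real engine searches far more than two candidates, and it also weighs how well each phoneme matches the spectrogram, but the ranking principle is the same.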
"When two people talk at once, the number of possibilities explodes. The frequency spectrum at each moment could come from any two phonemes, enunciated in any of the ways each person might use them in a word. Each additional talker makes the problem exponentially worse."
The good news is that such algorithms can simplify the search by focusing on the dominant speaker (c'mon, we all know there's at least one Loud Talker in any given crowd). A number of shortcuts that exploit this have been devised in recent years. A "bottom-up" approach looks for segments in a spectrogram without a dominant speaker and sets those segments aside, literally removing them from the equation so the algorithm can focus on finding phoneme sequences in the "clean regions," i.e., where there is a dominant speaker. That approach has been adopted by scientists at the University of Sheffield in England, apparently.
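Here is a minimal Python sketch of the "clean regions" idea. It is not the Sheffield group's actual method. It cheats by assuming we already know each talker's energy (in decibels) in every time-frequency cell of the spectrogram, which a real system would have to estimate from the mixture. The 6 dB margin, the function name, and the numbers are all assumptions made for illustration.

```python
def dominance_mask(energy_a, energy_b, margin_db=6.0):
    """Label each time-frequency cell 'A', 'B', or None (no dominant talker).

    energy_a and energy_b are equal-sized 2D lists of per-cell energy in dB.
    Knowing them separately is an assumption made for this toy example.
    """
    mask = []
    for row_a, row_b in zip(energy_a, energy_b):
        mask_row = []
        for a, b in zip(row_a, row_b):
            if a - b >= margin_db:
                mask_row.append("A")
            elif b - a >= margin_db:
                mask_row.append("B")
            else:
                mask_row.append(None)  # contested cell: set it aside
        mask.append(mask_row)
    return mask


# Rows are frequency bands, columns are time frames (made-up numbers).
talker_a = [[60, 40], [55, 52]]
talker_b = [[45, 58], [54, 30]]
mask = dominance_mask(talker_a, talker_b)
print(mask)  # [['A', 'B'], [None, 'A']]
```

Cells marked `None` are the ones without a dominant speaker; a recognizer following this approach would ignore them and work only from the cells labeled for a single talker.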
Alternatively, you can use a "top-down" approach, devising an algorithm that analyzes trial sequences of the most likely phonemes for all speakers in a given spectrogram. Finnish researchers at Tampere University of Technology exploit this approach by alternating between the two speakers. As Collins explains, "Given the current best estimate of talker A's speech, search for talker B's speech that best explains the total sound." Context is everything, baby. The IBM team achieved their "superhuman" automated speech separation by tweaking a "top-down" approach and devising an algorithm to seek out areas on the spectrogram where one talker was bellowing so loudly s/he masked the voices of the other(s).
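To make that back-and-forth concrete, here is a toy Python sketch of the alternating search Collins describes. It is not the Tampere or IBM algorithm. It assumes each phoneme has a fixed three-number "spectrum" (made-up values) and that the sound of two talkers is simply the sum of their spectra; both are simplifying assumptions, and the names are hypothetical.

```python
def sq_error(observed, spec_a, spec_b):
    """Squared error between the observed mix and the sum of two guesses."""
    return sum((o - (a + b)) ** 2 for o, a, b in zip(observed, spec_a, spec_b))


def alternate_search(observed, templates_a, templates_b, max_rounds=10):
    """Hold one talker's guess fixed, improve the other, and repeat."""
    guess_a = next(iter(templates_a))  # arbitrary starting guesses
    guess_b = next(iter(templates_b))
    for _ in range(max_rounds):
        # Given the current guess for A, find the B that best explains the mix.
        new_b = min(templates_b, key=lambda name: sq_error(
            observed, templates_a[guess_a], templates_b[name]))
        # Given that B, revisit the guess for A.
        new_a = min(templates_a, key=lambda name: sq_error(
            observed, templates_a[name], templates_b[new_b]))
        if (new_a, new_b) == (guess_a, guess_b):
            break  # nothing changed, so stop
        guess_a, guess_b = new_a, new_b
    return guess_a, guess_b


# Made-up phoneme "spectra" for each talker.
TEMPLATES_A = {"oh": [4, 1, 0], "ee": [0, 1, 4]}
TEMPLATES_B = {"n": [1, 2, 0], "s": [0, 0, 2]}
observed = [1, 3, 4]  # talker A's "ee" plus talker B's "n"

print(alternate_search(observed, TEMPLATES_A, TEMPLATES_B))  # ('ee', 'n')
```

Each round checks the candidates for one talker at a time instead of every possible pairing, which is where the savings come from. The catch is that this kind of back-and-forth can settle on a wrong pair if it starts from an unlucky guess.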
But you really shouldn't worry too much just yet about secret agents eavesdropping on your party guests: the new algorithms aren't that good. Maybe someday. In the meantime, please to enjoy this classic party scene from Breakfast at Tiffany's to illustrate just how tough the cocktail party problem is likely to be. As one of the YouTube commenters remarked, "It's not a party until someone is laughing and crying at themselves in the mirror."