"ClearBuds" is the code-name of the first "end-to-end hardware-software neural-network based binaural system using wireless synchronized earbuds," according to hardware engineer Maruchi Kim at the University of Washington.
Kim and his colleagues demonstrated a prototype of their speech-enhancing/noise-reducing devices at the ACM International Conference on Mobile Systems, Applications, and Services (ACM MobiSys 2022), held in Portland, OR, June 27-July 1.
The "first" claimed by the researchers is the pairing of binaural (dual) microphones—one in each ear's ClearBud—with two neural networks in an app on a smartphone, resulting in a superior user experience of voice isolation and noise cancellation during telephone conversations, according to test subjects.
"While neither dual mics nor neural network software is unique or innovative, the combination has value since it reportedly provides an experience that the users liked," said Fan-Gang Zeng, a professor of otolaryngology and director of the Hearing and Speech Laboratory at the University of California, Irvine. Zeng, a researcher in auditory science and technology who was not involved with the research, added, "Also, there is no technical barrier for others to develop or use the same combo."
To assist other researchers, and even commercial telephony equipment providers, in using the ClearBud approach, the researchers open-sourced their hardware, software, and neural network architectures. Details are provided in their paper, as well as in their audio demonstrations (which also contain links to the open-source hardware, including the printed circuit board layout, the software code for binaural transmission over Bluetooth, and the code and architectures of the neural networks).
How It Works
ClearBuds leverage binaural sound processing of simultaneous audio streams from microphones in each earbud using deep machine learning (ML) software running in real time. The result demonstrated by the researchers at the conference showed that background noise was cut dramatically (as in this video demonstration), as were nearby competing voices, when making routine phone calls on a ClearBud-empowered iPhone.
Unlike Apple's AirPods, which pair two microphones on each earbud, ClearBuds dedicate one microphone in each earbud for binaural processing. The AirPods use both microphones on a single earbud to perform beam-forming/steering (thus requiring only a single AirPod), while ClearBuds require the use of microphones on both earbuds simultaneously. The increased spatial distance between the binaural mics is credited by the researchers for the enhanced listening experience of more noise-free speech compared to AirPods.
"The use of the dual microphones across the two ears allows the system to localize and isolate the sound of interest, which is the wearer's voice since it is proximal to the sensors, as well as centered to the left and right positioning of the earbuds on the ears," said Mounya Elhilali, a professor of electrical and computer engineering at Johns Hopkins Whiting School of Engineering, who was not involved with the research.
This binaural approach requires both data streams from each earbud microphone to be synchronized to within 64 microseconds of each other. The required hardware in each ClearBud runs on a coin battery, which can last as long as 40 hours. The neural network algorithms, on the other hand, run on the smartphone. The researchers claim their more spatially separated microphones and real-time neural networks result in higher-resolution data than when using Apple AirPods and built-in iPhone apps (even though the iPhone accesses superior high-speed processing on graphics processing units resident on Apple's cloud).
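In practice, the 64-microsecond requirement amounts to keeping the timestamp skew between corresponding packets from the two earbuds under a fixed budget. The sketch below is an illustration of that bookkeeping with hypothetical timestamps, not the authors' firmware:

```python
# Illustrative sketch (not the ClearBuds firmware): verifying that
# left/right earbud packet timestamps stay within the sync budget.

SYNC_BUDGET_US = 64  # maximum allowed skew between the two mic streams

def max_skew_us(left_ts, right_ts):
    """Return the largest absolute timestamp skew (in microseconds)
    between corresponding packets from the left and right earbuds."""
    return max(abs(l - r) for l, r in zip(left_ts, right_ts))

# Hypothetical per-packet capture timestamps, in microseconds.
left = [0, 22_400, 44_800, 67_200]
right = [10, 22_430, 44_790, 67_250]

skew = max_skew_us(left, right)   # worst-case skew across packets: 50 us
in_sync = skew <= SYNC_BUDGET_US  # True: within the 64-us budget
```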
However, the researchers had to use non-standardized home-brewed Bluetooth algorithms in order to process the binaural dual-channel audio streams. "Right now, there is no Bluetooth standard set up for binaural sound processing, but Bluetooth version 5.2 will help by at least supporting dual-channel streaming," explained Kim. "We made do by time-multiplexing the two channels into one Bluetooth channel, which resulted in our primary achievements: both improved speech enhancement and improved noise cancellation."
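The time-multiplexing workaround Kim describes can be pictured as interleaving the two mono streams into a single channel on the earbud side and splitting them back apart on the phone. This is a simplified illustration of the idea, not the team's actual Bluetooth code:

```python
# Illustrative sketch (an assumption, not the ClearBuds transport code):
# time-multiplexing two mono sample streams into one channel, then
# recovering both streams on the receiving side.

def interleave(left, right):
    """Merge left/right samples into one stream: L0, R0, L1, R1, ..."""
    merged = []
    for l, r in zip(left, right):
        merged.extend((l, r))
    return merged

def deinterleave(merged):
    """Recover the left and right streams from the multiplexed channel."""
    return merged[0::2], merged[1::2]

left = [1, 2, 3]
right = [9, 8, 7]
channel = interleave(left, right)              # [1, 9, 2, 8, 3, 7]
assert deinterleave(channel) == (left, right)  # round-trip is lossless
```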
Traditional earbuds send a single audio stream to a smartphone app, even though AirPods and their competitors use multiple microphones. ClearBuds are designed to mine more information from multiple microphones by streaming dual (binaural) audio data streams to the smartphone app. Deep machine learning algorithms use neural networks to process the dual audio streams, identifying noise and reducing it to barely perceptible levels, according to the researchers. The real-time dual neural networks—one for sound source separation and a second for artifact reduction—processed ClearBuds' 22.4-millisecond-long packets on an iPhone 12 Pro in 21.4 milliseconds, resulting in a total latency of less than 50 milliseconds (well below the maximum of 400 milliseconds recommended by International Telecommunication Union Recommendation ITU-T G.114).
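As a back-of-the-envelope check, the reported figures can be added up in a few lines. The packet length, inference time, and ITU-T G.114 ceiling come from the article; treating capture and inference as strictly sequential (and ignoring transport overhead) is a simplifying assumption:

```python
# Latency budget from the figures reported by the researchers.
PACKET_MS = 22.4        # audio captured per packet
PROCESSING_MS = 21.4    # dual-network inference time on an iPhone 12 Pro
ITU_G114_MAX_MS = 400   # ITU-T G.114 recommended one-way latency ceiling

# Worst case: a sample waits a full packet, then one inference pass.
total_ms = PACKET_MS + PROCESSING_MS   # 43.8 ms, before transport overhead

real_time = PROCESSING_MS < PACKET_MS  # inference keeps pace with capture
within_budget = total_ms < 50          # matches the "<50 ms" claim
```

Note that inference (21.4 ms) finishing faster than a packet arrives (22.4 ms) is what makes the pipeline sustainable in real time; if it were slower, packets would queue and latency would grow without bound.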
"We also utilize the iPhone's Neural Engine chip to both reduce the run-time and lower the power consumption of our neural-network software," said machine learning researcher Vivek Jayaram.
The neural networks first isolate the speaker's voice (which comes into both earbuds' microphones at approximately the same level), then they use algorithms that the researchers liken to the way the brain calculates the direction from which a sound is coming (by comparing the arrival time of signals reaching each ear). The iPhone app also displayed a graph of the raw data, as well as the extent to which noise was suppressed.
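The arrival-time comparison described above is, in classical signal-processing terms, a time-difference-of-arrival estimate, often computed by cross-correlating the two microphone signals. The toy sketch below illustrates that cue on an impulse; the actual ClearBuds system learns this behavior with neural networks rather than using this explicit method:

```python
# Toy sketch (not the authors' networks): estimating the inter-ear delay
# of a sound by cross-correlating the left and right microphone signals.

def estimate_delay(left, right, max_lag):
    """Return the lag (in samples) at which `right` best matches `left`.
    A positive lag means the sound reached the left ear first."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = sum(
            left[i] * right[i + lag]
            for i in range(len(left))
            if 0 <= i + lag < len(right)
        )
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# A toy impulse that reaches the right ear 3 samples after the left.
left  = [0, 0, 1, 0, 0, 0, 0, 0]
right = [0, 0, 0, 0, 0, 1, 0, 0]
lag = estimate_delay(left, right, max_lag=4)   # 3
```

The wearer's own voice is roughly equidistant from both ears, so it shows up near zero lag at similar levels in both streams, which is exactly the cue the system exploits to separate it from off-center noise sources.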
"This is very similar to how human hearing works. Our brain compares the signals arriving to our right and left ears to judge from where a sound is coming. By homing in on the person's voice, the system is able to filter out background noises, as well as external voices that are further away from the microphones," said Elhilali. "The software is a hybrid system that combines a temporal convolutional neural network for powerful speech separation performance, with a masking network that corrects artifacts introduced by the lighter footprint of the original system, all implemented in real time on a mobile device. The algorithm is able to operate in near-real time [<50ms latency]. As such, the algorithm is able to scale complex heavy computational models into a nimble implementation achievable on a mobile device."
The researchers tested their ClearBuds in the wild by having 37 people rate 1,041 clips, each 10 to 60 seconds long, of eight people reading the same text in different noisy environments, including a coffee shop and a busy street. The participants rated the ClearBud hardware and its neural network software as providing better noise suppression and overall listening experience compared to single-channel solutions such as Apple's AirPods, as well as software-only systems such as Facebook's Denoiser, Google's Meet, and online meeting assistant Krisp.
"Other researchers do all their testing in the lab on synthetic databases, but we wanted to prove that the combining of our [the researchers'] different backgrounds culminated in a system that could excel in the wild even though it was born in the lab," said Microsoft HoloLens Senior Hardware Architect Ishan Chatterjee, who also is a postgraduate researcher at the University of Washington's Ubiquitous Computing Lab.
One limitation of the design is that both earbuds must be charged and operating in order for the ClearBud system to work. Apple's AirPods, for instance, use beam forming and steering that works with single earbuds.
To improve ClearBuds in the future, the team is attempting to shoehorn the neural network algorithms into the earbuds themselves, so they will work on any telephony device compatible with wireless earbuds. The team is also attempting to use two microphones on each ClearBud, so that AirPod-like beam forming/steering can be used before the binaural voice separation and noise suppression algorithms.
Other application areas being pursued by the researchers include smart watches, augmented reality glasses, smart speakers, and acoustic activity recognition for swarm-robot localization and control.
R. Colin Johnson is a Kyoto Prize Fellow who has worked as a technology journalist for two decades.