Earlier this year, a new song purporting to be by singers and songwriters Drake and The Weeknd went viral on social media. However, it was soon discovered that the artists were not involved and that the track was a deepfake: their voices had been synthesized using generative artificial intelligence (AI), algorithms that learn patterns from the data they are trained on to produce similar output.
Similarly, fraudsters have recently used this technology to create sophisticated phone scams involving the cloned voice of a family member or friend urgently asking for money to help with an emergency situation.
"Recent synthetic voice models can create realistic human voices with just a half a minute or one minute sample of somebody's voice," says Siwei Lyu, a computer science professor and head of the Media Forensic Lab at the University at Buffalo, State University of New York. "This technology is finally drawing people's attention."
Computer-generated voices have positive uses, too. People who have lost their voices due to conditions such as motor neuron disease (MND) could now have them recreated from a few past speaking samples. The technology could also help singers speed up the process of producing a new song, for example by allowing them to clone their voices and quickly generate a demo track instead of having to record it.
However, as AI-synthesized voices improve in quality and become more pervasive in our daily lives, ethical issues and techniques to fight their misuse will need to be considered.
Jin Ha Lee, a professor at the University of Washington's Information School in Seattle, became interested in researching the ethics of voice cloning technology after seeing it used in innovative ways. In 2021, for example, the late Korean rock star Shin Hae-chul was recreated as a hologram with a synthesized voice to perform alongside the South Korean boy band BTS. "It was this interesting collaboration between living and deceased artists overcoming the boundary of time," she says.
However, Lee became aware of deeper issues that need to be addressed in such scenarios. For example, even if a deceased artist's family has given permission for the artist's voice to be synthesized and has been compensated, is it really ethical to use the voice without the actual person's permission? "Going forward, I think we need to think about not just ways to protect all artists that are living now, but also those who have passed away," says Lee.
In recent work, Lee and her colleagues investigated how the general public, as well as synthesized speech developers and researchers, perceive AI-generated singing voices. To gather opinions from the public, they analyzed more than 3,000 user comments on online videos of Korean television shows that presented use cases such as recreating the voices of living and dead artists using AI, and using the technology to manipulate their voices or make them sing in a different language. The team also interviewed six researchers who were developing voice synthesis technology about topics such as the ethical issues they take into account and the precautions that should be implemented.
Lee and her colleagues found the public often has a negative view of AI-synthesized singing voices, with commenters questioning whether the technology should be developed at all. Lee thinks that public rejection of the technology stems from dystopian representations of AI in movies and popular culture. On the other hand, developers largely seemed to be more optimistic, partly because they thought current technology was not as advanced as it might seem, and that countermeasures were being developed in tandem. "They [also] really focused on the idea that it is going to support people rather than replace them," says Lee.
Other research groups are more focused on developing methods to detect deepfake voices. One strategy is to look for artifacts that are generated when AI-synthesized voices are produced. These are largely produced in the final step, when a specialized type of neural network called a neural vocoder is used to reconstruct a voice from a time-frequency representation. In the past, artifacts could be hissing sounds, but those have become less perceptible as vocoders have improved. "It's [now] very hard to hear them just with our ears," says Lyu. "On the other hand, when we plot them as a two-dimensional time-frequency representation, they become more obvious."
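To make the idea of a two-dimensional time-frequency representation concrete, here is a minimal sketch using NumPy. It is not the researchers' pipeline: the `spectrogram` helper, the frame sizes, and the test signal are all assumptions chosen for illustration. The signal is a loud 1,000 Hz tone plus a component 100 times weaker at 6,000 Hz, a synthetic stand-in for a faint trace that would be hard to hear but is easy to see once the audio is laid out by frequency.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram with shape (frequency bins, time frames)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # One FFT per frame; transpose so rows are frequencies, columns are time.
    return np.abs(np.fft.rfft(frames, axis=1)).T

sample_rate = 16000                       # samples per second (assumed)
t = np.arange(sample_rate) / sample_rate  # one second of audio

# Illustrative signal only: a loud tone plus a much fainter high component.
signal = np.sin(2 * np.pi * 1000 * t) + 0.01 * np.sin(2 * np.pi * 6000 * t)

spec = spectrogram(signal)
profile = spec.mean(axis=1)               # average magnitude per frequency bin
hz_per_bin = sample_rate / 256

peak_bin = int(profile.argmax())          # bin of the loud tone
faint_bin = int(round(6000 / hz_per_bin)) # bin of the faint component

print("loud tone near", peak_bin * hz_per_bin, "Hz")
print("faint component stands out from its neighborhood by a factor of",
      profile[faint_bin] / profile[faint_bin - 6])
```

In a plot of `spec`, the faint component would appear as a thin horizontal line well above the main tone. A detector trained on such representations, or on the raw waveform, can pick up on patterns of this kind that listeners miss.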
In recent work, Lyu and his colleagues used a deep learning model called RawNet2 to classify voices as real or synthetic based on neural vocoder artifacts. To train and test their model, they created a new dataset from more than 13,000 real audio samples, generating over 79,000 fake voice samples from those originals using six different state-of-the-art vocoders. Over 55,000 samples from the dataset were used for training, while more than 18,000 were set aside for testing.
Lyu and his team found the model performed well at classifying a voice as real or fake. However, clear audio is needed so that artifacts are not masked by background noise. The system also performed less well when tested on fake audio from vocoders that were not represented in the dataset. Lyu is also concerned that crafty attackers could defeat the technique by processing the audio to remove traces of vocoder artifacts. "We're fully aware of the limitations," he says. "To a certain extent, we can [improve the performance] by enlarging the datasets and by designing network model architectures to handle subtler artifacts."
Another team is taking a different approach to detecting deepfakes, which exploits the fact that synthesized speech tends to be slightly more predictable than natural speech. Hafiz Malik, a professor of electrical and computer engineering at the University of Michigan in Dearborn, hypothesized that real voices have more variability than their synthesized counterparts in terms of how quickly someone speaks, pauses, or changes pitch, for example. The differences would be subtle, however, and not always apparent to the human ear.
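One simple way to put a number on "variability" is the coefficient of variation: the standard deviation of a set of measurements divided by their mean. The sketch below is an illustration of that statistic only, not the team's method, which relies on deep learning. The `coefficient_of_variation` helper and both lists of pause durations are hypothetical and invented for the example; they are not measurements from real or synthesized speech.

```python
import numpy as np

def coefficient_of_variation(values):
    """Standard deviation divided by the mean (unitless spread)."""
    values = np.asarray(values, dtype=float)
    return values.std() / values.mean()

# Hypothetical pause durations in seconds, made up for illustration.
varied_pauses = [0.20, 0.55, 0.31, 0.90, 0.12]  # irregular rhythm
steady_pauses = [0.30, 0.32, 0.29, 0.31, 0.30]  # very regular rhythm

cv_varied = coefficient_of_variation(varied_pauses)
cv_steady = coefficient_of_variation(steady_pauses)

print("irregular rhythm:", round(cv_varied, 3))
print("regular rhythm:  ", round(cv_steady, 3))
```

Under the hypothesis described above, natural speech would tend to resemble the irregular sequence more often than cloned speech does. The same statistic could be applied to speaking rate or pitch, though a single number like this is far cruder than a trained model.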
Malik and his colleagues are now testing the hypothesis using deep learning algorithms. They have been creating a huge dataset for training and testing purposes using audio of well-known people giving speeches, talks, and interviews. Using commercially available tools, they also are synthesizing the voices of those people so the resulting two-dimensional waveforms can be compared to the originals. "So far, [our hypothesis] is pretty solid," says Malik. "When we do an analysis, the [differences] are distinct."
Malik acknowledges that the goalposts keep moving: current strategies may stop working as cloned audio improves in quality. However, he expects more proactive measures to be implemented in the future, such as embedding some type of watermark in synthetic content or tracking its provenance. He is passionate about fighting misinformation and hopes the tools he is developing will play a part.
"Deepfakes have been out of control for the last 10 years or so," says Malik. "Contributing to letting people see the truth is very close to me."
Sandrine Ceurstemont is a freelance science writer based in London, U.K.