The research team used a text-to-speech algorithm trained on two data sets to generate 50 deepfake speech samples. The researchers used both English and Mandarin speech "to understand if listeners used language-specific attributes to detect deepfakes."
The speech samples were then tested on 529 people who were asked if they believed a sample was an actual human speaking or if the speech was computer-generated.
Participants correctly identified deepfake speech only 73% of the time, and results improved only "slightly" after participants were trained on how to recognize computer-generated audio, according to the study.
"Our findings confirm that humans are unable to reliably detect deepfake speech, whether or not they have received training to help them spot artificial content," Kimberly Mai, an author of the study, said in a statement.
"It’s also worth noting that the samples that we used in this study were created with algorithms that are relatively old, which raises the question whether humans would be less able to detect deepfake speech created using the most sophisticated technology available now and in the future."
The study is considered to be the first of its kind to investigate how humans detect deepfake audio in a language other than English.
English- and Mandarin-speaking participants showed roughly the same rate of detection. English speakers said they relied on listening for breathing to help determine if the audio was real or computer-generated, while Mandarin speakers said they paid attention to a speaker’s cadence and word pacing to help correctly identify audio.
"Although there are some differences in the features that English and Mandarin speakers use to detect deepfakes, the two groups share many similarities. Therefore, the threat potential of speech deepfakes is consistent despite the language involved," the researchers wrote.
The study comes as a "warning" that "humans cannot reliably detect speech deepfakes," with researchers highlighting that "adversaries are already using speech deepfakes to commit fraud," and the tech will only become more convincing with the recent advancements in AI.
"With generative artificial intelligence technology getting more sophisticated and many of these tools openly available, we’re on the verge of seeing numerous benefits as well as risks. It would be prudent for governments and organizations to develop strategies to deal with abuse of these tools, certainly, but we should also recognize the positive possibilities that are on the horizon," study author and University of London computer science professor Lewis D. Griffin said in a statement published by the university.
Audio deepfakes have already been used repeatedly across the U.S. and Europe to carry out crimes.
The study pointed to a scam in 2019, for example, that left a U.K.-based energy firm roughly $243,000 in the red after a fraudster hopped on the phone with the firm's CEO and pretended to be the boss of the organization's Germany-based parent company.
The scammer was able to use AI technology to capture the boss's slight German accent and the "melody" of his voice while demanding the CEO immediately transfer money to a bank account, the Wall Street Journal reported at the time.
Stateside, victims are sounding the alarm on phone scams that often target elderly Americans. The Federal Trade Commission warned last month that scammers are increasingly relying on voice cloning technology to convince unsuspecting victims to fork over money. The criminals can take a soundbite or video of a person that’s posted online, clone the voice and call the person’s loved ones while pretending to be in a dire situation and in need of fast money.
Many victims later tell police that the cloned voice sounded so similar to their loved one that they didn’t immediately suspect it was a scam.
Mai told Fox News Digital that the research shows training people to spot AI-generated speech is unlikely to "improve detection capabilities, so we should focus on other approaches," pointing to a handful of other avenues to potentially mitigate risks associated with the tech.
"Crowdsourcing and aggregating responses as a fact-checking measure could be helpful for now. We also demonstrate even though humans are not reliable individually, detection performance increases when you aggregate responses (collect lots of decisions together and make a majority decision)," Mai explained.
"In addition, efforts should focus on improving automated detectors by making them more robust to differences in test audio. In addition, organizations should prioritize implementing other strategies like regulations and policies."