According to a recent study in AI and crime science, humans are not all that good at detecting speech deepfakes. Read the sciencedaily.com summary here, and the full paper here.
A few brief comments:
First off, if you’re not familiar with the concept of a deepfake, now would be a good time to learn.
Second, I’ll observe that while the online world has lately spilled a lot of digital ink over AI-generated writing and imagery, I haven’t heard as much about AI-generated speech.
One scary result from the study is that even after training, people were still not great at the identification task:
“Participants were only able to identify fake speech 73% of the time, which improved only slightly after they received training to recognise aspects of deepfake speech.” (Emphasis mine.)
According to the ScienceDaily report, this investigation is the first to “assess human ability to detect artificially generated speech in a language other than English”; researchers looked at both English and Mandarin. I believe this should serve as a reminder that we need more research on AI in a variety of languages – especially if we’re trying to make generalizations and predictions about how AI tools can best serve (or potentially harm) people around the world.
Towards the end of the ScienceDaily summary:
“[…] there are growing fears that such technology could be used by criminals and nation states to cause significant harm to individuals and societies. Documented cases of deepfake speech being used by criminals include one 2019 incident where the CEO of a British energy company was convinced to transfer hundreds of thousands of pounds to a false supplier by a deepfake recording of his boss’s voice.”
As terrifying as this real-life scenario is, it made me chuckle in spite of myself, because I instantly thought of a scene from “Star Trek: The Next Generation” in which Wesley Crusher builds a small device that impersonates Captain Picard’s voice, causing comedic confusion and dismay among the crew (in “The Naked Now,” season 1, episode 3). It’s curious how the show’s creators envisioned our technological trajectory more than 30 years ago. (Even though I’m a child of the ‘80s and ‘90s, I didn’t really watch Star Trek growing up…but I’ve recently begun watching “The Next Generation,” and its lighthearted nerdiness is refreshing at the end of a long day.)
The deepfake speech crime above also made me think that forensic linguists (probably working in conjunction with computer scientists) may have their work cut out for them.
The full paper has a bunch of interesting points, and I recommend reading it if you have the time/inclination. The word cloud graphics (of study participants’ freeform text responses) and corresponding discussion are particularly thought-provoking.
Playing catch-up
Here’s yet another, more recent article on the topic of audio deepfakes: AI Audio Deepfakes Are Quickly Outpacing Detection.
One of the interviewee’s big takeaways is that it’s now incredibly easy to create “convincing audio deepfakes,” yet the skills and technology needed to identify AI-generated speech lag far behind. Such a mismatch between creation and detection is problematic for many reasons. Crucially, it contributes to our growing difficulty in distinguishing what is real and worthy of trust.
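For the curious, here’s a rough sketch of what machine-assisted detection might look like in practice today. It’s a minimal example assuming the Hugging Face `transformers` library and some pretrained audio-classification model; the model name below is a placeholder, not a reference to any specific detector:

```python
# A minimal sketch of machine-assisted deepfake-speech screening.
# Assumes the Hugging Face `transformers` library is installed;
# "example/deepfake-speech-detector" is a placeholder model name,
# not a real model on the Hub.
from transformers import pipeline

detector = pipeline(
    "audio-classification",
    model="example/deepfake-speech-detector",
)

# Score a local audio clip; the pipeline returns candidate labels
# (e.g., real vs. synthetic) with confidence scores.
results = detector("suspicious_voicemail.wav")
for r in results:
    print(f"{r['label']}: {r['score']:.2f}")
```

Of course, even with tooling like this, the article’s point stands: a detector trained on yesterday’s synthesis methods may well falter on tomorrow’s.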
Photo attribution: BandLab