Samsung Labs’ AI team just released videos demonstrating that they can emulate someone talking from a single still image. Add that to the recent Joe Rogan voice emulation released by AI startup Dessa and it’s easy to imagine a near future of incredibly authentic-looking (and authentic-sounding) forgeries in disinformation campaigns.
The Samsung video below starts about four minutes in and uses single photos of historical figures, and then paintings, to demonstrate the technology’s ability to generate realistic speech gestures. You can rewind to the beginning to get the full science and technology behind the machine learning used to create the effect, along with more sophisticated emulations.1
Now here’s the Joe Rogan voice simulation:
Technology’s going to do what technology’s going to do: rifle through Pandora’s box to figure out the possible.
This is all fun and games when integrated into a phone app or game, or used to generate characters in a future film. But we know enough about state and non-state actors to fear what this might mean for society, security and political futures.
Here’s Dessa in a blog post accompanying the release of their Rogan voice simulation (emphasis mine):
As AI practitioners building real-world applications, we’re especially cognizant of the fact that we need to be talking about the implications of this.
Because clearly, the societal implications for technologies like speech synthesis are massive. And the implications will affect everyone. Poor consumers and rich consumers. Enterprises and governments.
Right now, technical expertise, ingenuity, computing power and data are required to make models like RealTalk perform well. So not just anyone can go out and do it. But in the next few years (or even sooner), we’ll see the technology advance to the point where only a few seconds of audio are needed to create a life-like replica of anyone’s voice on the planet.
It’s pretty f*cking scary.
I write this as a garden-variety manipulated video of Nancy Pelosi is making the rounds. In this case, the video’s simply slowed to make her appear to drunkenly slur her words. Give it time. A few months, a few years, and disinformation campaigns will generate words never spoken over video never shot.
F*cking scary, indeed.
Venture Beat, Samsung’s AI animates paintings and photos without 3D modeling
ArXiv (PDF), Few-Shot Adversarial Learning of Realistic Neural Talking Head Models
Motherboard, This AI-Generated Joe Rogan Voice Sounds So Real It’s Scary
Lawfare, Deep Fakes: A Looming Crisis for National Security, Democracy and Privacy?
Lead Image: Volcan Fuego, Antigua, Guatemala by Ben Turnbull.