President Joe Biden had an announcement to make to his fellow Americans. It was Feb. 19, and the audio of the speech told a tale of government mismanagement.
Biden had been scrolling through Disney+ and came across the 2011 Matt Damon movie We Bought a Zoo. Inspired by the story, he bought a zoo of his own. But now he had regrets. “Owning a zoo sucks,” Biden says in the two-minute audio clip, which is layered over static images of the president. “This shit is so hard. It looked much easier in the movie.”
The video, viewed over a million times, isn’t likely to fool anyone — even Biden’s most ardent opponents. But the eerily accurate cadence of the deepfaked version of the president does highlight the ability of AI-generated audio tools to mimic well-known individuals. It’s far from the only example: TikTok has been taken over by videos showing what would happen if a squad made up of current and former presidents gathered on Discord to play games together.
Such scenes — which seem too good to be true because they are — are becoming more and more common. The widespread availability of generative AI tools that can deepfake audio of people based on a small sample of their voice has been utilized by a number of everyday users. The examples mentioned in this story are benign, but the tech has already been deployed by 4chan users for more insidious means, like making Emma Watson read aloud a section of Mein Kampf.
The Biden zoo video was created by Zach Silberberg, 28, a digital content producer from New York, using software developed by ElevenLabs. But it took him some time to come around to the idea. “The implications of it are pretty scary,” Silberberg said. He’s seen the headlines about deepfaked nudes and AI-generated artwork. “At first I didn’t want to touch this with a 100-foot pole. This is deeply, deeply evil stuff.”
That all changed when he saw a video by YouTube creator ZimoNitrome purporting to be Joe Rogan talking to Jordan Peterson about Lego Bionicles. He loved it.
Silberberg started sending videos like it to a colleague, who discovered that they were being made with an app from a startup called ElevenLabs. Silberberg’s colleague bought the two of them a basic ElevenLabs account to use for a month at the low price of $1. “We just spent the day generating funny things that Joe Biden was saying, and sending them back and forth to each other,” Silberberg said. Within the day, they’d used up the month’s credits.
In the end, the pair decided to split the cost of moving up to the next subscription level, which set them back $10 each. The first thing Silberberg used it for was to make a video of Joe Biden addressing the nation while being trapped in the house from the 2022 indie horror film Skinamarink. (Silberberg used other software for the video that the audio is overlaid onto.)
Silberberg has since made dozens of videos that use ElevenLabs tech — a process that isn’t always easy, as the app is often, as he put it, “pretty janky.” Many of the clips involve Biden getting into improbable situations, while others poke fun at the vapidity of Joe Rogan and Ben Shapiro, creating conversations between them that reference the movies Ratatouille and Old.
The videos have gained notoriety. Shapiro himself even quote-tweeted one approvingly — a fact that Silberberg did not like. “It’s so absurd and so funny,” Silberberg said. “I’m making fun of you. I don’t want you to enjoy this.”
Silberberg thinks Shapiro likely didn’t watch the Ratatouille video — which ends with a dig at the controversial personality being a failed screenwriter — all the way through. “With Ben Shapiro and Joe Rogan, at least for me, I have no respect for these men,” Silberberg said. “So imitating their speech patterns and mimicking what is ridiculous about them and heightening that is what’s really funny.”
The same video about Lego Bionicles that Silberberg saw inspired Pade, a video game YouTuber and Twitch streamer, to test out the deepfake audio tool. (Pade declined to share his last name with BuzzFeed News.) “I figured it would be a matter of time before people start making memes of the presidents doing ‘let’s play’–style YouTube videos,” the 27-year-old wrote via Twitter DM between college classes in Oklahoma.
He registered for a trial and got to work. “All you need is a few good voice clips isolated of anyone speaking for a couple minutes,” he said, “and it generates a voice.”
Pade chose to create videos focusing on recent and current US presidents in part because he found it funny when he tried getting Barack Obama to speak about Halo 3. He was also being pragmatic. “The jokes are more widely accessible, as [are] their voice samples to make the voice,” he said.
It takes around two hours for Pade to make each of the videos, which range from 30 seconds to one minute in length. He records gameplay and then generates the voices using AI. “I actually start with the TikTok or vertical video then I make the YouTube video after, and maybe add a couple more jokes that go on that video, since YouTube likes longer content,” he said.
He has some theories as to why the videos work. “I think the absurdity of seeing famous figures in any random gaming session is genuinely hilarious,” he said. “Surprisingly I see a lot of love for the ones where figures that might not get along in real life — like Trump and Biden — are having wholesome moments together. I think part of it is a relief from seeing figures that are always embroiled in controversy in a different light, even if it’s fake.”
It’s not just the leaders of the free world and the leading lights in podcasting who are being mimicked using the power of AI. Joe Marotta has been using the tech to give new life to his personal interest: professional wrestling. Marotta, a 37-year-old podcaster from New Jersey, came across a use of AI-generated audio in early February on Twitter. He thought the tech would be a fun way to promote his retro pop culture podcast, Acid Washed Memories. The idea was to get 1980s pro wrestling commentators Gorilla Monsoon and Bobby “The Brain” Heenan to hawk it.
“I signed up for [ElevenLabs] and put Gorilla and Bobby’s voices in there to do a promo for Acid Washed Memories, and was happy with how it came out,” he wrote via Twitter DM. The success of the podcast promo skit pushed Marotta to test the tech further. “I figured, ‘Okay, well, what if Gorilla Monsoon had a podcast? What would he say?’”
The resulting foul-mouthed AI parody, first posted on Twitter on Feb. 6, has since been viewed nearly 320,000 times. It works because it pokes fun at the gap between Monsoon’s friendly, laid-back onscreen demeanor and his short-tempered tendencies when the camera is off, and because of a recent trend of faded names from the history of wrestling launching their own podcasts.
Marotta is now on Part 25 of the Monsoon podcast clip series, with each two-minute video taking around an hour to produce. “I try to take real-life situations and play upon them with fiction, or just make up stories that sound plausible,” he said. “I think the fact that so many wrestling fans know the character and mannerisms of Gorilla, Bobby, and Gene [Okerlund], for example, makes it easy to imagine it’s them really saying these things.”
But both Marotta and Silberberg draw the line at using the power of these fake AI audio tools for nefarious means. “While ElevenLabs has really driven an explosion of this kind of memeable content, it’s not particularly new,” said Henry Ajder, a UK-based deepfake expert. But what the software has done is put high-quality AI audio generation into the hands of ordinary users. “It’s led to this trend of certain people being targeted,” Ajder said.
Ajder believes that for as long as the conceits of the audio snippets remain obviously outlandish, and target well-known people, the slippery slope toward extreme disinformation content — for instance, the kind of deepfakes that could spark a war — can largely be avoided.
“This is targeting Joe Biden, one of the most famous people in the world,” he said. “This kind of content is quite clearly fake, based on the context and how well-known the individual is. What interests me is when we think about slightly less well-known politicians or private individuals.”
Silberberg said he hopes that these audio deepfakes stay “in the realm of harmless bullshit, and it doesn’t deviate from that.” But he’s also a realist: “I know that’s not going to happen and already isn’t happening.”