Earlier this month, Miles Zimmerman, a 31-year-old programmer from San Francisco, was messing around with Midjourney, the AI-powered tool that generates images with a simple text prompt, and having his mind blown.
One of his prompts, which he created with the help of ChatGPT, was extremely detailed: “A candid photo of some happy 20-something year-olds in 2018 dressed up for a night out, enjoying themselves mid-dance at a house party in some apartment in the city, photographed by Nan Goldin, taken with a Fujifilm Instax Mini 9, flash, candid, natural, spontaneous, youthful, lively, carefree, --ar 3:2.”
Within seconds, Midjourney spat out image after made-up image of attractive young people letting their hair down at a party.
At first, Zimmerman was astonished at the level of detail. Faces, skin, hair, and clothes looked photorealistic — although slightly plastic, as later pointed out by some observers — and the expressions were exactly what he had asked for. But the closer he looked, the weirder the pictures seemed. A smiling woman posing for a picture with a friend and holding a point-and-shoot camera had a bunch of extra fingers on her left hand. There were a total of nine, to be exact. Another one had the correct number of digits, except that they were freakishly long. Nearly everyone had too many teeth.
He posted the pictures to Twitter, where they quickly went viral.
“As I kept looking, it was hard not to laugh out loud at the absurdity of those hands and teeth,” Zimmerman told BuzzFeed News over Twitter DMs. “It didn’t cause a visceral reaction in me the way I think it did for a lot of others reacting in the tweet thread. To me, it was so in character of the AI to create these near flawless renders with such silly flaws that I found it funny.”
Over the last few months, services like Midjourney, Stable Diffusion, and DALL-E 2 have exploded in popularity. Using simple text prompts, these apps, powered by a radically new type of artificial intelligence known as generative AI, let anyone create nearly any kind of image they want, sparking excitement and backlash in equal measure.
The programs work because they are “trained” to recognize the relationships between billions of images scraped from across the internet and the text descriptions that accompany them, until eventually, the program “understands” that the word “dog,” for instance, relates to the picture of a canine. These images and their descriptions are known as “datasets.”
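As a rough illustration of that idea, and not a description of how Midjourney, Stable Diffusion, or DALL-E 2 are actually built, here is a toy Python sketch. Everything in it is invented for the example: the five captions, the three made-up "image features" (furriness, wheels, leaves), and the helper names `train`, `cosine`, and `best_match`. Real systems learn from pixels with large neural networks. This sketch only shows how pairing words with pictures over and over can link the word "dog" to dog-like images.

```python
import math
from collections import defaultdict

# Toy "dataset": (caption, image features). The three numbers are invented
# stand-ins for what a real system would compute from pixels:
# (furriness, has_wheels, has_leaves).
DATASET = [
    ("a dog in the park", (0.9, 0.0, 0.3)),
    ("my dog on the sofa", (0.8, 0.0, 0.0)),
    ("a red car", (0.0, 1.0, 0.0)),
    ("car parked by a tree", (0.0, 0.9, 0.4)),
    ("a tall tree", (0.1, 0.0, 1.0)),
]


def train(dataset):
    """Average the image features seen alongside each caption word."""
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])
    counts = defaultdict(int)
    for caption, features in dataset:
        for word in set(caption.split()):
            counts[word] += 1
            for i, value in enumerate(features):
                sums[word][i] += value
    return {w: tuple(v / counts[w] for v in sums[w]) for w in sums}


def cosine(a, b):
    """Cosine similarity between two feature tuples (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(word, model, images):
    """Return the name of the image whose features best fit the word."""
    return max(images, key=lambda name: cosine(model[word], images[name]))


model = train(DATASET)

# Two hypothetical new images the toy model has never seen.
NEW_IMAGES = {
    "photo_of_puppy": (0.95, 0.0, 0.1),
    "photo_of_truck": (0.0, 0.95, 0.0),
}
print(best_match("dog", model, NEW_IMAGES))
```

Note what the toy model never learns: what a dog is. It only knows which features tend to show up next to the word, which is the same limitation that comes up with hands.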
Art created using AI trained on such datasets is now winning competitions and being used by creators to illustrate articles and newsletters, among other things.
But despite rapid advances, AI-powered image generators still suck royally at one thing in particular: generating realistic-looking human hands.
Here’s what Stable Diffusion, DALL-E 2, and Midjourney, the world’s leading AI-powered image generators, churned out when I fed them all a simple prompt: human hands.
These sorts of outputs have inspired memes.
But why do these programs mess up hands (not to mention bare feet) so badly? It’s a question that many people have asked.
To find out, I emailed Midjourney; Stability AI, which makes Stable Diffusion; and OpenAI, which created DALL-E 2. Only Stability AI responded to my questions.
“It’s generally understood that within AI datasets, human images display hands less visibly than they do faces,” a Stability AI spokesperson told BuzzFeed News. “Hands also tend to be much smaller in the source images, as they are relatively rarely visible in large form.”
To understand more, I got in touch with Amelia Winger-Bearskin, an artist and an associate professor of AI and the arts at the University of Florida, who has been analyzing the aesthetics of AI art on her blog. “I am obsessed with this question!” Winger-Bearskin exclaimed on our video call.
Generative artificial intelligence that’s trained on billions of images scraped from the internet, Winger-Bearskin explained, does not really understand what a “hand” is, at least not in the way it connects anatomically to a human body.
“It’s just looking at how hands are represented” in the images that it has been trained on, she said. “Hands, in images, are quite nuanced,” she added. “They’re usually holding on to something. Or sometimes, they’re holding on to another person.”
In the photographs, paintings, and screenshots that AI learns from, hands may be holding on to drapery or clutching a microphone. They may be waving or facing the camera in a way that leaves just a few fingers visible. Or they may be balled up into fists, with no fingers visible.
“In images, hands are rarely like this,” Winger-Bearskin said, holding up her hands with fingers spread apart. “If they were like this in all images, the AI would be able to reproduce them perfectly.” AI, she said, needs to understand what it is to have a human body, how exactly hands are connected to it, and what their constraints are.
Hands have a fundamental place within the art world — imprints of hands on cave walls are the very first kind of art that Homo sapiens created that we know of — and are considered to be some of the most difficult objects to draw or paint. In paintings from Ancient Greece and medieval Europe, representations of human hands were still flat and lacked intricacies.
It was only in the era of Renaissance art, between the 14th and the 16th centuries in Europe — when artists like Leonardo da Vinci started studying and sketching hands, including their structural elements like bones and ligaments — that human hands began to be represented in all their complexity. (This era also gave us one of the most recognizable frescos involving two hands — Michelangelo’s The Creation of Adam, which depicts God as a bearded man stretching out his right arm to touch Adam’s outstretched left.)
“Da Vinci was actually quite obsessed with hands and did many, many studies of hands,” Winger-Bearskin said. Meanwhile, when AI is trained on an image, “it’s just looking at that and saying, ‘Oh, in this case, there’s only half of a thumb,’ because the rest of it is hidden under fabric or grabbing on to something, and so when it reproduces it, it’s somewhat deformed.”
One day, though, generative AI will get significantly better at rendering pictures of hands and feet and teeth. “It has to,” Winger-Bearskin said. “For AI to become a useful tool for humanity, it has to understand what it is to be human, and the anatomical reality of being human.”