In 1964, director Stanley Kubrick, fresh from the success of Dr. Strangelove and not yet quite 36, set out to make what he hoped would be “the proverbial ‘really good’ science fiction movie.” What has endured in the annals of popular culture since then — even for those who aren’t familiar with his resulting 1968 masterpiece, 2001: A Space Odyssey — is the talking computer named HAL, which quietly murders most of the human crew under its care. When the astronaut-protagonist Dave Bowman, played by Keir Dullea, asks HAL to open the pod bay doors to let him back into the spacecraft, it refuses, instead intoning the now-infamous phrase, “I’m sorry, Dave, I’m afraid I can’t do that.”
In 2018, HAL has real-life analogs that don’t defy or destroy their human interlocutors. They are the voice-controlled, AI-powered digital assistants from tech heavyweights Google, Amazon, and Apple, inspired by speculative fiction and made feasible by the technological shifts that have occurred in the five decades since 2001: A Space Odyssey premiered. These shifts include the invention of the internet, dramatic gains in computational power at diminishing cost, and the relatively recent renaissance of artificial intelligence research following periods of “AI winter” in the 1970s and ’80s when the field fell into disrepute and defunding. Now, the computer systems that power voice assistants — systems distributed across server farms around the world and only partly residing on our personal devices — are capable of learning through algorithms designed to mimic human neural networks, accurately recognizing what we say and generating natural-sounding speech in response. Just today, Google previewed an experimental version of the Google Assistant with a feature called Google Duplex, which allows the assistant to call a specific business like a restaurant or a hair salon and make an appointment, talking naturally — with hesitation and vocal fillers to boot.
And so it seems a little like kismet that in these early months of 2018 — months leading up to this spring’s 50th-anniversary theatrical rerelease of 2001: A Space Odyssey — Google, Amazon, and Apple have each unleashed big-budget ad campaigns for their respective voice assistants. The object of these commercials is presumably to convince non-users, amounting to slightly over half of Americans, of the myriad ways in which the peppy, by-default female-voiced Google Assistant, Alexa, and Siri might each make modern living a little less anarchic, and a little less stressful.
I’m admittedly part of this demographic of non-users — too self-conscious to issue commands at my phone in public, too cheap to splurge on a device that will listen attentively to my requests in the physical privacy of my living room. And yet, I don’t exactly fit the typical profile of a technological laggard: I worked for Google for many years on its web browser and cloud-based operating system; I’m the kind of house guest who peppers my friends’ Amazon Echos, Google Homes, and Apple HomePods with syntactically crafty questions. My personal passions and ambivalences about voice AI are probably not widely shared by the demographic that these ads are likely targeted toward: the 61% of holdouts who told Pew Research Center last May that they don’t use a voice assistant because they’re simply not interested.
For the tech companies invested in the long game of getting all of us to embrace conversational AI, consumer apathy may only be a near-term challenge. In the long run, the techno-utopian vision that these companies have crafted for voice AI — one that revolves around machines that will help us become our better selves — must counteract a legacy of pop culture narratives populated by evil, or chillingly value-neutral, computers, in addition to broader societal misgivings about AI. (Already scientists, entrepreneurs, and think tanks have begun to worry about the use of AI in autonomous weaponry and its Strangelovian potential for precipitating nuclear war.) While HAL was just one of the many plot elements previewed in the original 1968 trailer for 2001: A Space Odyssey, in 2018 it is, tellingly, the thrust of the newly recut trailer for the rerelease. To begin rewriting the dystopian script of voice AI’s emergence in ordinary consumers’ everyday lives, its products and advertising oeuvre must negotiate the slippery terrain between the performance of human-like conversation and human-like agency, and make a case for voice AI amid growing unease about how it may forever alter our relationships with other people.
The idea of the conversational computer has made a home in popular imagination for so long that it does not occur to us how peculiar it is to talk to machines in the ways that we now do. Once, on a Lyft ride from the airport through rush-hour traffic, my driver — a friendly thirtysomething who flipped the driver-side sun visor down to reveal a photo of his 9-month-old twins — argued with the female navigation voice on Google Maps for the length of time it took to traverse several intersections. When the voice reminded him to turn right at the next red light as he eased to a stop behind several cars, he retorted, “Yes, lady, I know.”
Small children are particularly useful for bringing the strangeness of this dynamic into sharp relief. Friends have told me numerous anecdotes about their very young daughters, sons, nieces, or nephews who carry on earnest conversations with Alexa or engage Siri in Beckettian dialogue. (In fact, the Google Assistant is getting a Pretty Please feature in hopes of teaching kids to be polite, while Amazon’s Echo Dot Kids Edition rewards kids for saying “please” with an easter egg built into the product.) In the telling of these stories, my friends are more bewildered by the technological turn of events in their households than the children themselves, who are usually unfazed. As adults, we tend to attribute these alien yet naturalistic interactions to the naivete of children, rather than to the inherently bizarre nature of relations between computers and people.
In their seminal 1994 paper, Computers Are Social Actors, Stanford researchers Clifford Nass, Jonathan Steuer, and Ellen R. Tauber capture the contradiction that underlies our dynamic with conversational computers: It is a paradox, in which we as human users “exhibit behaviors and make attributions toward computers that are nonsensical when applied to computers but appropriate when directed at other humans.” But we, as human users, are generally aware of the fact that computers are not sentient beings. What causes us to express social behaviors toward machines, according to robotics researcher Leila Takayama, is not the “absolute status of an entity’s agency,” but our perceptions of its agency. “As objects become increasingly endowed with computational ‘smarts,’” Takayama writes, “they become increasingly perceived in-the-moment as agents in their own right.”
Whether or not computers can have agency is arguably the root of our dystopian fears about voice AI. Kubrick’s HAL is terrifying not only because it personifies our fear of surveillance (HAL spies on its two astronaut minders by reading their lips) and our fear of clinical, non-negotiable logic that is incompatible with our humanity; HAL is terrifying because of the way that Kubrick and his collaborator, the science fiction writer Arthur C. Clarke, wrote its capabilities and its conversations with the astronauts Dave and Frank — we perceive HAL to have personal agency, and a sense of self-preservation so fiercely human that it uses its omniscience and logic to commit cold-blooded murder.
Unlike HAL and its computational smarts, voice assistants today take on relatively rudimentary and innocuous tasks — turning off the bedroom lights, reading our email aloud, or helping to assemble our favorite playlist. Not only do the recent celebrity-graced commercials for the Google Assistant, Alexa, and Apple’s Siri-powered HomePod emphasize the simultaneous banality and convenience of these interactions, they also communicate each company’s worldview about how human users ought to relate to intelligent machines.
In Google’s TV commercial that aired during the Oscars broadcast in March, John Legend — who will be one of the six new voices that users can choose for the Google Assistant — makes a long mental to-do list; a woman sitting alone behind the wheel in rush-hour gridlock hears a cascade of inbox notifications and considers all the emails she can’t respond to; Kevin Durant contemplates ordering the athletic tape that the out-of-frame medic is wrapping around his foot; Sia, in a comically large, face-obscuring wig, fumbles for her phone so that she can record a melodic strain that popped into her brain while getting her makeup done backstage. These scenarios and several others — each played out in interior monologue and capped off with the refrain “Make Google do it,” superimposed on the scene in frame-filling text — are presented as situations in which the Google Assistant could be summoned: to make lists, read emails aloud when you have your hands full, remind you to turn off the kitchen stove so that your house doesn’t burn down.
At its most fully realized, a virtual assistant is not the occasional scribe or butler who is called upon to help clear the clutter that crowds our modern multitasking brains; it is an omnipresent augmentation of our limited selves. Google’s commercial operates as a 90-second instruction manual wrapped in high-production sheen, designed to address what a tech company might perceive to be the weakest link in our inexorable march to a cyborg future: us — which is to say if only our 21st-century selves would learn to outsource our most pressing tasks to a superintelligence.
In Amazon’s recent Super Bowl commercial, it’s not AI’s human users who will derail us from the promise of a technological future; rather, it’s the human surrogates who are asked to do a job that a voice assistant would otherwise perform. The commercial opens with the clever conceit of Alexa losing her voice, as a pretext for ushering in celebrity substitutes. Her stand-ins — Gordon Ramsay, Rebel Wilson, Cardi B, and Anthony Hopkins — manage to insult, embarrass, unsettle, and ultimately fail their users. When Alexa’s familiar voice returns during the commercial’s final beat to save the day, we are reminded that unlike her chaos-creating human replacements, Alexa is helpful and reliable; she’s also programmed with carefully calibrated elements of personality. The commercial taps into a familiar trope, in which a competent computer intelligence picks up after human fallibility. But could a voice AI ever be as charmingly mocking as Cardi B? Could its underlying mathematical models possibly detect the appropriate moment and subtext in a conversation in which to invoke Hannibal Lecter, and with just the right alchemy of creepy and funny?
Perhaps what’s most refreshing about Spike Jonze’s commercial for Apple’s Siri-powered HomePod is the absence of any comparative dynamic, overt or implied; the world that we’re invited to inhabit is one in which human beings and their voice assistants already live in consonance with each other. In fact, for most of the ad, which was released this past March, the device and its technology aren’t explicitly foregrounded. Our attention is focused on the talented multihyphenate FKA Twigs, who plays a young urbanite in an unnamed metropolis, commuting home through the pouring rain, exhausted from a long day at work. As she settles into her small city apartment, she speaks to Siri with the natural understatedness of someone asking a friend for a glass of water on a hot day: “Hey Siri, play me something I’d like.” The modal verb Twigs uses is a subtle but critical detail; she’s not asking her voice assistant to play a song that she likes — no, not just a song randomly chosen from her previously saved favorites — but a particular song that her voice assistant algorithmically predicts she would like at that particular moment, given the day she just had, given how she feels. Siri computes, says okay, and starts playing a new Anderson .Paak track. This brief exchange between human and machine — implying a kind of quiet understanding, a mind meld between old friends — kicks off the ad’s subsequent 3 minutes and 10 seconds: a Jonzian dance-journey into magical realism, in which Twigs manipulates and mutates her surrounding environment into something expansive enough to hold her joy.
As with many of Apple’s most iconic ads, this one doesn’t just sell a device; it sells the possibility of self-transcendence. And here’s the genius of Apple’s simplifying conceit: In a hero’s journey to self-transcendence, the individual in all her uniqueness is its star; technology is ostensibly secondary — and so are other people. In the commercial, other people’s faces are perpetually obscured by umbrellas, blurred in the background, or turned away from the camera such that you see them only in the slightest intimation of a side profile or the backs of their heads. When Twigs finds herself confronted with her reflection in the mirror while in the midst of her technological trance, she literally steps through the looking glass and dances with a parallel-universe version of herself. The suggestion here is almost a kind of Emersonian self-reliance, a closing off from the world. There is no society, and thus no larger societal consequences, to consider.
In their commercials, Google, Amazon, and Apple’s marketers have chosen to sell the public on the usefulness of voice assistants not by enacting the verbal exchanges that we might have with our devices, but by drawing our attention either to the absence of them — what life is like without the Google Assistant or the “real” Alexa — or to the ecliptic power of another mode of expression — music, as curated by Siri. We are spared the interactions between humans and computers that might inspire us to attribute agency to these systems and thus imagine a HAL-like hell. As future generations of voice AIs gain new and vastly improved capabilities, their creators and marketers may find themselves reckoning with how much agency a machine is allowed to externally perform. I’m reminded of a question that the industrial designer Dieter Rams posed in 1980, as part of his list of 15 questions that a designer should ask when designing a product: “Is it so accomplished and perfect that it perhaps incapacitates or humiliates you?”
We’re not staring down the pit of humiliation — yet. For now, we’re rapidly wending our way out of the uncanny valley — a conceptual space described in 1970 by Japanese roboticist Masahiro Mori to refer to the shift in a person’s response away from empathy or affinity to a feeling of unease as a robot approaches, but doesn’t entirely achieve, a lifelike human appearance. For voice AI, the equivalent of lifelike human appearance is natural-sounding machine-generated speech.
In Star Trek: The Original Series, set in the 23rd century but aired from 1966 to 1969, the onboard computer has a stilted, monotone voice. A century later in Trek time — or two decades later in real life — the actor and producer Majel Barrett-Roddenberry, who voiced the onboard computer in the original series (as well as in many subsequent series and movies in the franchise), gave the computer in Star Trek: The Next Generation a more natural flow. In real life, the technology columnist Christopher Mims lamented as recently as 2010 that synthesized speech still “sounds so awful,” even as engineers had made it possible for devices like the Kindle and iPad to read e-books aloud.
“To deliver truly human-like voice,” Google researcher Yuxuan Wang and software engineer RJ Skerry-Ryan explained in a blog post this March, a computer “must learn to model prosody, the collection of expressive factors of speech, such as intonation, stress, and rhythm.” Their teams developed a system that models and transfers stress, intonation, and timing from a reference audio file to different machine-synthesized voices. Building on this approach, they also developed a method to synthesize any sentence — for example, “United Airlines 563 from Los Angeles to New Orleans has landed” — in expressive, machine-generated voices that range in emotion from anger to melancholy. What these systems can’t do yet is automatically decide on the appropriate speaking style, intonation, and flow based on the meaning and context of a given sentence.
“What’s the point?” asked a commenter on Wang and Skerry-Ryan’s post. “Will we have sex with computers, too? Or have them write our books for us? Or sit in a room and talk with us for hours about life and love?” The recent voice assistant ads don’t speculate on a potential endgame for voice AI. When Google previewed Google Duplex this afternoon, commentators and users were simultaneously awed and incredulous, some deeming its capabilities “eerily lifelike” or “terrifying” and anticipating unintended dystopian consequences, others calling it “mind-blowing.” And earlier this year, a few days before the commercial for Alexa aired during the Super Bowl, Amazon announced the eight university teams that will be competing for this year’s Alexa Prize, which awards $1 million to the team that successfully builds a socialbot that can converse cogently with human evaluators on topics ranging from current events to popular culture — for at least 20 minutes. Since the challenge began last year, the million-dollar bounty hasn’t been claimed yet. But if 20 minutes of unconstrained small talk is the proverbial four-minute mile of conversational AI, it may take a little while before our voice assistants can achieve the mileage needed to talk to us meaningfully, for hours on end, about life and love.
For some, the potential for digital companionship quickly devolves into anxiety over our reliance on technology and, more pointedly, the role that this dependence might play in the dissolution of intimate relations with family and community. “This video is tragically prophetic,” says a YouTube user in response to Kate Bush’s 2011 music video for “Deeper Understanding,” in which a man, played by the actor Robbie Coltrane, is bewitched by a digital entity named “Voice Console,” visualized as a gigantic pair of lips on a computer screen, to the point of neglecting his family and friends (and later, committing murder). Another YouTube user declares that the video “makes me want to throw my phone in a river. It’s like KB has captured everything that makes me worry about the future, the way culture is progressing, and provides some comfort in letting us know that yes, it is a legitimate concern and not just some ‘first-world problem.’” Bush first wrote and released the song in 1989 and resurrected a modified version 22 years later as the lead single to her album Director’s Cut. She sings, “As the people here grow colder / I turn to my computer / And spend my evenings with it / Like a friend. / I was loading a new program / I had ordered from a magazine. / Are you lonely, are you lost? / This voice console is a must.” When Bush released the music video, Siri — the first of the major voice assistants — was just a few months away from being integrated into the iPhone.
The makers and marketers of voice assistants do not want us to worry about the future. Whether or not our turning to technology is a source, or symptom, of the alienation we might feel from ourselves and our communities, whether or not children growing up with voice assistants tailored for kids will have a wholly different relationship with voice AI than we now do, they’d like us to embrace the uncomplicated upside of participating in our technological lives. For now, their commercials will be firmly rooted in the present — in the benign utility of asking the Google Assistant to take a selfie for us, Alexa to narrate the recipe for a grilled cheese sandwich, and Siri to play us something we’d like.