You're Going To Talk To All Your Gadgets. And They're Going To Talk Back.
Amazon's Echo speaker made chatty, at-home bots mainstream — and now Apple, Google, Microsoft, and Samsung are all in on the battle for your voice.
It’s a familiar scene: a crowd of people poking slabs of illuminated glass, completely enraptured by their pocketable computers. They tap, tap, swipe while waiting for the bus, walking down the street, or slouched on a couch at a party made boring by all that inattention. Ever since the introduction of the iPhone a decade ago, touchscreens have kept our eyes cast downward instead of on the world around us.
Today, there are some 2 billion devices running Android, and another 700 million or so iPhones are in use, based on analyst estimates. A generation of people, especially in markets like India and China, has come online with smartphones, bypassing mouse and keyboard–based desktop PCs altogether. Tap, tap, swipe is now more ubiquitous than type, type, click ever was.
That kind of growth has left device manufacturers anxious for another hit. But so far the touchscreen smartphone has proved too neat a trick to repeat. No matter what Next Big Thing comes its way — Google Glass, Apple Watch, Oculus Rift — people just seem to keep their heads down. Swipe, tap, poke, pinch.
But while we were paying attention to the things on our wrists and (ugh) faces, another major technological shift took hold. And this new interface, just now becoming mainstream, is the next era of computing, just as surely as the punch card, command line, mouse-and-keyboard graphical interface, and touch interface that came before it. It’s also the oldest interface in the world, our first method to communicate with each other, and even with other animals — one that predates letters or language itself.
It’s your voice. And it’s the biggest shift in human-computer interaction since the smartphone.
It was hard to see, initially, just how transformative the portable touchscreen computer was, precisely because it lacked a keyboard. But it was the ability to hold the world’s information in our hand, in a visually accessible way that we could manipulate with our fingers, that turned out to be so powerful. Pinch to zoom, pull to refresh, press and hold to record and share.
Right now, something similar is underway with voice. When voice-powered devices come up, the conversation often turns to what’s missing: the screen. Especially because for most of us, our first experience with a voice-based interface took place on a touchscreen phone. (Hey, Siri!)
But when you take away the keyboard, something very interesting happens, and computing becomes more personal than it ever has been. Soon, human-computer interaction will be defined by input methods that require little know-how: speaking, pointing, gesturing, turning your head to look at something, even the very expressions on your face. If computers can reliably translate these methods of person-to-person communications, they can understand not just what we say in a literal sense, but what we mean and, ultimately, what we are thinking.
In the not-too-distant future between now and Black Mirror, voice-based computing will be everywhere — in cars, furniture, immersion blenders, subway ticket counters — listening to what we say, learning from what we ask. Advanced supercomputers will hide under the guise of everyday objects. You’ll ask your router, “Hey Wi-Fi, what the hell is wrong with you?” Or your fridge, “What’s a recipe that uses all of the vegetables about to go bad?” Or just, to the room, aloud, “Do I need a jacket?”
Best of all, most people will be able to use this new species of gadgets, not just those with technological proficiency. Proxy devices, like keyboards and mice, require training and practice. But in this vision of the future, you’ll be able to use natural language — the kind of speech you’d use with a date, your kids, your colleagues — to access the same functions, the same information that typing and tapping can.
Make no mistake: The touchscreen isn’t going anywhere. But increasingly we’re going to live in a world that’s defined by cameras and screens, microphones and speakers, all powered by cloud services that are with us everywhere we go, interpreting all our intents, be they spoken, gestured, or input via a touchscreen keypad.
Welcome to the age of ubiquitous computing, or the ability to access an omnipresent, highly knowledgeable, looking, listening, talking, joke-making computer at any time, from anywhere, in any format. In many ways, we’re already living with it, and it all starts with your voice.
The Rise of Alexa
Dotsy lives in Palm Beach, Florida, and she won’t tell me exactly how old she is. “You can google that!” she says, laughing. “It frightens me to say it out loud.”
The octogenarian is Really Cool. About a month ago, she picked up a new instrument — the drums — and recently “jammed” at a friend’s “crib” (her words). She’s an avid reader who tries to keep her schedule open and in flux. But there’s one thing she needs a little help with: seeing.
Dotsy owns not one but two of Amazon’s Echo devices: She keeps a smaller Dot in her bedroom (her daily alarm) and the larger flagship Echo speaker on the porch (where its AI personal assistant, Alexa, reads her Kindle books aloud). Alexa’s primary role in Dotsy’s life, though, is providing information that her eyes have trouble discerning. “I find it mostly helpful for telling the time! I don’t see very well, so it’s a nuisance for me to find out what time it is.”
Her vision isn’t good enough for her to use a smartphone (though she wishes she could) and she doesn’t use a computer. But Dotsy loves being able to ask Alexa questions, or have it read her books.
Overall, “I think it’s wonderful!” Dotsy proclaims. She sometimes even thanks Alexa after a particularly good response, which prompts the AI bot to say, “It was my pleasure” — a nice, human touch.
It turns out that the internet-connected, artificially intelligent Echo is more accessible and more powerful than a mobile device or laptop for someone whom touchscreens have left behind.
Alexa and the Echo have lots of room for improvement. Dotsy still can’t change settings or add certain new “skills” on her own, for example, because some require the Alexa mobile or web app. ("Skills" is Amazon's term for capabilities developers can add to Alexa, like summoning an Uber or announcing what time the next bus will arrive.) But what Amazon showed with the Echo is where voice computing is best positioned to prevail: your private spaces.
Google and Apple had voice assistants long before Alexa came along, but those were tied to your phone. Not only does that mean you have to pull it out of your pocket to use it, but it’s also prone to running out of battery, or being left behind (even if behind is just the other room). The Echo, on the other hand, is plugged into a wall, always on, always at attention, always listening. It responds only to queries that begin with one of its so-called wake words such as “Alexa” or “Computer,” and is designed to perform simple tasks while you’re busy with your own. So when your hands are tied up with chopping vegetables, folding laundry, or getting dressed in the morning, you can play a podcast, set a timer, turn on the lights, even order a car.
“A lot of personal technology today involves friction,” said Toni Reid Thomelin, VP of Alexa experience and Echo devices. “We envision a future where that friction goes away, the technology disappears into the background, and customers can be more present in their daily lives.”
Amazon’s Alexa-powered speaker was a sleeper hit. The company, which rarely issues public numbers, won’t say officially how many Echo speakers are active or have been sold (just that it receives “several millions of queries every day” from “millions” of customers). A recent survey shows that sales have more than doubled since the product first launched in late 2014 and estimates that around 10.7 million customers own an Echo device, but that number, which doesn’t account for those with multiple devices, likely does not reflect how many Echoes have actually been sold.
The Echo’s sales are still small compared with Siri and Google Assistant’s reach, but the device has garnered mainstream popularity (even the Kardashians have one) and legions of superfans in a way that other assistants simply haven’t. The flagship Echo has over 29,000 reviews on Amazon from “verified purchases” (people who actually bought their Echo through Amazon) and among those, nearly 24,000 are positive. The most telling numbers, though, are the ones on Reddit. The number of /r/amazonecho subreddit subscriptions (about 37,000) eclipses that of Google Assistant’s subreddit (280) and Siri’s (1,502). It’s also worth noting that /r/SiriFail has more subscribers (3,800) than Siri’s own subreddit.
The Echo is, for many of its highly satisfied users, the ideal at-home smart device. It’s so easy to use that toddlers who can’t read yet, and seniors who have never used a smartphone, can immediately pick up a conversation with Amazon’s AI-powered robot.
For years, the online bookstore turned e-commerce giant had been unintentionally working on the infrastructure for a voice-enabled AI bot. “We were using machine-learning algorithms internally at Amazon for a long period of time,” said Thomelin, who has been with the company for nearly two decades. “Mostly in the early days, for our Amazon.com recommendations engines. And seeing the success of our recommendations engines, we thought, How could you use those similar techniques in other areas throughout Amazon? That was a big piece of what helped bring the Echo to life.”
Leaps in cloud computing — or the ability to process data on a remote, internet-hosted server instead of a local computer — were also crucial to the Echo’s development. “About five years ago,” Thomelin said, “we saw internally how fast cloud computing was growing with AWS...so we wanted to capitalize on all that computing power being in our own backyard and bring it into a new device category like Echo.” (AWS, or Amazon Web Services, is a platform originally built to run Amazon’s own website, but now handles the traffic for some of the biggest internet companies in the world, including Netflix, Spotify, and Instagram.) When people talk about the cloud what they really mean is a bunch of Amazon server farms. It’s those machines that host Alexa’s knowledge and instantly sling its responses to millions of Echo devices simultaneously.
But the magic of the Echo isn’t that it’s particularly smart — it’s that it is an exceptionally good listener. The Echo can hear a command from across a room, even with a TV or side conversation running in the background. Alexa has a far better ear than anything that has come before, and stands out among all those other “Sorry, I didn’t catch that” assistants.
Rohit Prasad, VP of Alexa machine learning, said that a voice-first user interface was a far-fetched idea when the team began developing the technology. “Most people, including industry experts, were skeptical that state-of-the-art speech recognition could deliver high enough accuracy for conversational AI,” Prasad said. There are a lot of challenges when it comes to recognizing that “far-field” (or faraway) speech, and a particular one for the Echo is being able to hear the wake word, “Alexa,” while the device is playing music loudly. Advancements in highly technical areas — such as deep learning for modeling noisy speech and a uniquely designed seven-microphone array — made that far-field voice recognition possible.
And Amazon is now handing out that technology — that special sauce. For free! To anyone who wants to build it into a device of their own! One of Amazon’s priorities is getting its assistant in as many places as possible, and it is doing that by providing an API and a number of reference designs to developers, so that the Echo is just one of the many places Alexa can be found.
Amazon is moving fast, with thousands of people in the Alexa organization alone (up from one thousand at this time last year). The company is also investing huge sums of money in companies interested in building Alexa into their products, like the smart thermostat maker Ecobee, which got $35 million in a funding round led by Amazon. In April, Steve Rabuchin, VP of Alexa voice services, told me the team is focused on integrating the voice assistant with a breadth of devices, including wearables, automobiles, and appliances, in addition to smart home products. Amazon wants to make sure that users can ask, demand, and (most importantly) buy things from Alexa from anywhere, at any time.
This massive, almost desperate effort isn’t surprising. Amazon at last made the AI assistant people love to talk to. But it had a late start compared to Google, Apple, and even Microsoft. And what’s more, a big hurdle still stands in Alexa’s way to becoming the go-to assistant we access from anywhere, anytime. Because there’s already a device we carry with us everywhere, all the time: the smartphone.
At the end of 2016, Apple beat out Samsung to become the number one smartphone maker in the world, with 78.3 million iPhones sold that holiday quarter alone, compared with 77.5 million Samsung handsets in the same period. It’s a massive device advantage over not just Samsung but everyone. And what’s more, every one of those devices shipped with Siri.
Apple acquired Siri, a voice-command app company, in 2010 and introduced an assistant with the same name built into the iPhone 4S in 2011. When Siri hit the market, it instantly became the first widely used voice assistant. Google Now, the predecessor to Google Assistant, wouldn’t ship for another year, and Alexa and Microsoft Cortana for another four.
The only problem? It sucked.
The voice assistant’s high error rate at launch has dogged Siri’s reputation to this day, even though its recognition capability has improved significantly (by a factor of two, thanks to deep learning and hardware optimization).
After the fifth anniversary of Siri’s launch last October, app developer Julian Lepinski nailed why users can’t get into the assistant: because they just can’t trust it. “Apple doesn’t seem to be factoring in the cost of a failed query, which erodes a user’s confidence in the system (and makes them less likely to ask another complex question),” Lepinski writes. Instead of asking clarifying questions or requesting more context, “Apple has a bias towards failing silently when errors occur, which can be effective when the error rate is low.”
Siri is by far the most widely deployed assistant, with a global reach that spans 21 different languages in 34 countries. Google Assistant supports seven languages, while Alexa supports just two. Still, Siri usage isn’t what it could be. Apple says that it receives 2 billion non-accidental requests a week. If the estimate of 700 million active iPhones is correct, that works out to only about three queries per phone every seven days.
Meanwhile, it’s under assault on its own devices. There are now multiple options for assistants on the iPhone, all vying to be the AI of choice for iOS users: Amazon baked Alexa into its main shopping app for iPhone and Google launched an iOS version of Assistant this year.
Of course, neither is as deeply integrated or accessible on the iPhone as Siri. So Google and Amazon are racing to prove that their assistants are worth the extra taps. They’re also trying to do end runs around the phone altogether, by releasing tools that let developers build their assistants into all the devices we surround ourselves with. It starts with speakers, but cars and thermostats and all manner of other things are on deck.
Apple, meanwhile, has been trying to change Siri’s image — so that no matter what’s around you, you’ll say “Siri” and not “Alexa” or “Okay, Google.” In an August 2016 interview with the Washington Post, CEO Tim Cook was asked whether Apple can catch up with Facebook, Google, and Amazon’s AI capabilities, to which Cook responded, “Let me take exception to your question. Your question seems to imply that we’re behind.”
The CEO went on to tout the fact that Siri is with you all of the time (a dig at the Echo, perhaps?), and its prioritization of privacy, given Siri’s ability to perform tasks on the phone itself instead of passing them to the cloud (another dig, this time at Google’s data collection practices). A few months later, in an earnings call in January 2017, Cook highlighted that, thanks to Apple’s HomeKit platform, Siri is already the smart home hub that Echo wants to be. Cook noted that he says “good morning” to Siri to turn the house lights on and start brewing coffee, and, in the evening, summons Siri to turn on the fireplace and adjust the lighting.
But Siri is in some ways playing catch-up with Alexa, especially with third-party integrations. In 2015, Amazon allowed developers to create custom voice capabilities for Alexa and now has over 12,000 “skills” available, from being able to request an Uber to listening to daily Oprah affirmations. Apple is now slowly, cautiously opening its assistant to developers, in an attempt to change people’s minds about how useful Siri can be. A year after Amazon unveiled the Alexa Skills Kit and half a decade after shipping Siri, Apple finally began allowing third parties to create non-Apple app voice integrations. If you were one of WhatsApp’s 1 billion users, for example, you couldn’t use Siri to dictate a message until September 2016.
Siri has also grown to Apple TV, the Mac desktop operating system, CarPlay, and the AirPods, Apple’s new wireless earbuds, which is about the closest thing to the voice-based future in Spike Jonze’s movie Her that currently exists.
Apple is even reportedly planning to expand Siri’s reach with the announcement of its own internet-connected smart speaker soon, maybe even at the WWDC developers conference keynote on Monday. It's possible that, like with the iPhone, Apple won’t need to totally re-invent the wheel to create the Most Revolutionary, Most Beautiful, Most Popular Wheel ever made. (Watch your back, Alexa.)
In a conference room at Google’s corporate headquarters in Mountain View, California, Gummi Hafsteinsson, the product director of Assistant and former VP of product at Siri (the same company acquired by Apple), was waving his hands around to make a point.
“Conversation isn’t just voice. If I wanted to, I could use my hand now,” he said, gesturing in the air, “or point at something. It’s a way to have a back-and-forth exchange to have common understanding of an intent. And that’s what we’re trying to build,” he explained.
Google had just made a whirlwind of announcements onstage at the Google I/O developers conference, including two new updates for Assistant for mobile. The first update is the ability to type a query when you can’t talk out loud (the “gesturing” Hafsteinsson was talking about), giving users more flexibility in the ways they can interact with Assistant. The second is Google Lens, or the ability to use your phone’s camera to help Assistant “see” — in other words, Hafsteinsson’s “pointing” — and automatically join a Wi-Fi network by scanning the credentials on a router, or get tickets for an upcoming show featuring whatever band is on the marquee you just pointed your phone at.
The team is taking an approach that’s markedly different from Apple’s and Amazon’s. It’s focused on keeping the conversation going, Hafsteinsson told me, to keep people engaging across devices, all day long. Google doesn’t just want Assistant, which it claims is “available on more than 100 million devices,” to work in private spaces like your kitchen or your car; it wants the AI to adapt to whatever situation you’re in.
“The key is that, even if it’s really good in your home or in your car, it still has to work in all of the other places,” he said. “Once you leave the car, the benefit [of the assistant] tremendously goes down, it just drops off. It’s a really powerful thing to have the same thing follow you around all day and be with you.” And so Assistant, which is built into Android phones, wearables, Google Home, and, soon, Android TV, is designed in such a way that you can type “Remind me to take out the trash when I get home” in a meeting, and then your Google Home speaker will do just that.
Google has been working on an intelligent personal assistant (originally called Google Now) since 2012 and processing searches by voice — which Hafsteinsson launched — since 2008. Last fall, the company introduced its own smart speaker, Google Home. The small, portly device works similarly to the Echo, responding to commands that start with “Okay, Google” or “Hey Google” and powered by an AI bot that lives in a server farm.
This means Amazon and Google are now playing a cat-and-mouse, Instagram-and-Snapchat-style game for control of your living room. In February, Time reported that Amazon was developing a voice recognition feature called “Voice ID” that lets Alexa recognize different individuals. Then, two months later, Google announced Home would begin supporting voice recognition. Earlier this month, Amazon launched Alexa voice and video calling. A week later, Google announced that Home would support hands-free calling, too.
The main difference between Home and the Echo? Google’s Home is much smarter. The wealth of data and machine learning behind Assistant makes it the most powerful voice AI available, with a word accuracy rate of 95%. Artificial intelligence is what makes a bot seem natural, personable — and since 2001, Google has published over 750 papers on this exact topic. The tech behind Assistant is the same one that turns Hindi into German in Google Translate, that recognizes your family’s faces in Google Photos, that helps Android phones save battery and data. And now it’s in your living room, listening to you.
“Google has a unique blend of expertise in natural language understanding, deep learning, computer vision, and understanding user context, so we think we can make good headway. We can understand intent behind words to handle follow-up questions and complex, multi-step tasks,” said Hafsteinsson, touching on the crucial Siri flaw that Lepinski identified.
Unlike other assistants, Google Assistant can understand pronouns in the context of a line of questioning. So, if you ask, “Who is the president of India?” and then “When is his birthday?,” Google Assistant will know exactly who “his” is referring to. Alexa can’t do that yet.
The prowess of Google’s machine learning and neural-net capabilities extends beyond just trivia. Assistant can tell you how to say “thank you” in Vietnamese or “where’s the train station?” in Italian. Using Lens, you can even translate signs with Japanese or Chinese characters. Neither Alexa nor Siri is capable of translation.
Home can also analyze and identify individual voices, and use that identification to deliver a more personalized experience. A person's voice acts as sort of a password for their accounts. So, for example, when you ask Home to play your Discover Weekly Spotify playlist, it’ll play yours, and when your roommate asks, it’ll play hers.
Home can continue and start conversations, too. Later this year, you’ll be able to ask Assistant to give you visual responses, like showing your calendar or playing a YouTube video, on a TV connected to the Chromecast, Google’s media streamer. And speaking of Chromecast devices, you can group together multiple Homes and Chromecast Audio-connected speakers to play music or podcasts all over the house. Alexa can’t do this yet, though Sonos integration is apparently coming soon. Siri can’t either.
Google has the AI smarts, plus the services (email, calendar, YouTube, and maps) that people actually use. The company also has shopping, music, and media streaming services that are less popular, but may find an audience in those who can’t be bothered to sign up for Yet Another Thing.
The key to winning the battle for your voice, however, is thoughtful implementation, and there Home has its work cut out for it. For example, Google Home is limited to the main calendar associated with your Google profile. So if you have a shared Google calendar with a spouse, or if your employer uses Google Apps for Work, you won’t be able to add those calendars to Google Home. And if you live in a household with an Android phone then you may be in for an “Okay, Google” nightmare. Your Home and your phone won’t be able to tell which one you’re talking to, even though Google promised a solution to that at launch.
On top of all that, even though both devices are powered by Assistant, there isn’t feature parity between the two. Home can’t create calendar appointments or reminders, but Assistant on your phone can. Which means that while it’s the same AI following you around, it doesn’t seem like the same experience; it’s like dealing with multiple people.
And then there are more subtle things at play too. Saying “Okay, Google” all the time just kind of feels overly commercial — and it’s possible that Google Home will lose out on a piece of the smart speaker pie because it forces you to say “Google” so much. Alexa and Siri roll off the tongue much more easily, and don’t feel icky in the way that having to start every sentence with a ~brand~ does.
When we talk about things like Amazon Echo or Google Home or Apple AirPods or bots from Samsung, Microsoft, IBM, and Baidu, the conversation often turns to what powers them — artificial intelligence, microphone arrays, cloud services. That’s understandable, because these are the technologies that make voice computing possible. But these things are also just new versions of speeds and feeds; it’s the effect of those things in combination that matters, far more than whatever assistant can give the most accurate answer to any given question. And it is because all of these developments came together that we can, finally, have the Star Trek computer right in our own homes.
These assistants will only become much more competent, intelligent, and in tune with our needs. They’ll become more human — and that means that they’ll not only learn how to speak our language, they’ll eventually learn how to replace us. Today, voice personalities can make controlling your lights and reordering shampoo feel effortless, which is useful and great! But it’s possible that these increasingly human interfaces will one day become smart enough to be baristas, accountants, journalists, and insurance agents too.
Meanwhile, precisely because these things are so easy to talk to, our children are growing up with them. In homes all over the US, and increasingly throughout the world, children are talking to Alexa and Siri and Google from the time they can speak. And those AIs are in turn learning from them, adapting to them, both in aggregate and personally. If that makes you feel uneasy...you’re not alone.
Amazon CEO Jeff Bezos, referring to the potential of voice-based artificial intelligence at last year’s Code Conference, said, “It’s hard to overstate how big of an impact it’s going to have on society over the next 20 years.”
That impact could mean no more car accidents due to touchscreen fiddling. It could mean never leaving a debate unsettled at the dinner table again. It could mean typing becoming obsolete in favor of voice dictation. Or it could mean taking jobs from humans and giving them to robots. A lot of what’s to come will be unexpected applications we can’t see yet.
Bezos also acknowledged that voice won’t replace touchscreens; it will merely take time away from them. “You know, people have eyes and as long as people have eyes they are going to want screens, and we have fingers and we want to touch things and so on. But it has been a dream ever since people started in the early days of sci-fi that you can have a computer that you can actually talk to in natural ways and ask it to do things for you. And that is coming true.”
It starts with speech, but it won’t end there. Rather than using a mouse to meticulously design a new 3D-printed thingamabob with AutoCAD, a second-grader will put on a VR headset and shape it with her bare hands. Instead of typing a word cloud of adjectives into a search bar, you’ll be able to point a camera at an animal and prove to your dad that that thing is the kind of catfish that kills and eats freaking pigeons.
The Next Big Shift in human-computer interaction is already here. You can’t see it, but you can certainly hear its effects. Voice is up first, and the rest is coming. We’ll call it the natural interface revolution. Our kids will just call it Alexa.