Living with (and building for) the Amazon Echo

We’ve had our Amazon Echo for just 15 days but it feels like we’ve jumped 15 years into the future.

Feb 07, 2016

Voice interfaces have been steadily improving for years — Siri being the mainstream’s first taste — but having an always-on, ambient, personal assistant around the house totally transforms their usefulness.

For those who’re yet to experience Alexa, she’s the entity that emanates from the Amazon Echo, a wine-bottle-sized cylinder of internet-connected speakers and microphones sent directly from the future. It’s a much more dramatic product than I’d expected.

The Echo is always listening, but it only springs into life when you ask Alexa for something:

"Alexa, will it rain today?"
"Alexa, remind me when it's 2:30."
"Alexa, what’s the meaning of life?"

42. Obviously.

In this short time, my wife and I have come to rely on the Echo more and more:

It’s replaced AirPlay for music: The convenience of being able to have the Miles Davis Pandora station waft into your living room after a short incantation is some seductive magic. As is being able to say “Alexa, turn it up!”, “Alexa, skip this” or “Alexa, thumbs up”. Even if you get your music from someplace other than Spotify, Pandora, iHeart, TuneIn Radio or, er, Amazon Prime Music, the Echo still has a trick up its sleeve. I can just say “Alexa, connect to my phone” and I’m instantly streaming from any app on my phone. The Echo is, on top of everything else, one of the best Bluetooth speakers you can buy.
Hands-free Timers: I’m a messy cook; the kitchen is not a safe place for our smartphones. Being able to say “Alexa, set a timer for 15 minutes” or “Alexa, what’s 200 degrees Centigrade in Fahrenheit” is saving both our palate and our expensive devices from certain destruction.
Leaving the House: It might be because we’re British and thus genetically predisposed to care about the weather, but being able to ask “Alexa, will it rain today?” or “Alexa, will it be hot this afternoon?” effortlessly satisfies this stereotype.

However, the “Alexa” invocation isn’t foolproof. Last week my wife and I were watching the Iowa caucuses on CNN when suddenly, unexpectedly, Alexa gushed:

“How much wood would a woodchuck chuck if a woodchuck could chuck wood?”

I have no idea what Wolf Blitzer said to offend her.

But as an engineer and product person, the Echo really whets my appetite because you can build apps for it. Well, they’re called “Skills”.

Amazon have quietly been building a solid library of third-party skills. There are now over 200, including new additions from Uber, Spotify and Domino’s. And they’re clearly taking their new ecosystem seriously: on the developer/platform side there’s a new VP and a $100m Alexa Fund. On the consumer side there’s, well, a Super Bowl ad.

So to learn more, I built a Skill.

Here are some observations from my experience:

Getting Started

Building voice interfaces is no easy task, let alone a framework for arbitrary commands for third-party apps, but the Alexa Skills Kit (as the programming interface is called) is an impressive bit of software.

As a developer you specify the ‘intents’ your Skill supports (think of these like controllers in a Rails app or Activities in an Android app), then specify the various phrases people might use to invoke that intent. You also specify any variables that you expect as part of the incantation. These can be standard types (dates, numbers, places), enums you specify, or arbitrary literals (not recommended, but sometimes necessary). Your code then gets passed clean structured data to act upon.
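To make that model concrete, here is a minimal toy sketch in Python of how intents, sample phrases and typed variables could turn a spoken sentence into clean structured data. This is not the Alexa Skills Kit schema or API: the intent names (`GetLineStatus`, `SetTimer`), the slot helpers (`number`, `enum`) and the `resolve` function are all made up for illustration, and the real service does the speech recognition and matching for you.

```python
import re
from typing import Callable, Dict, Optional


def number(text: str) -> int:
    """Hypothetical 'standard type': digits only, raises ValueError otherwise."""
    return int(text)


def enum(*values: str) -> Callable[[str], str]:
    """Hypothetical custom type: accept only one of the listed values."""
    allowed = {v.lower() for v in values}

    def parse(text: str) -> str:
        if text.lower() not in allowed:
            raise ValueError(text)
        return text.lower()

    return parse


# Illustrative intents: each has typed variables and the phrases that invoke it.
INTENTS: Dict[str, dict] = {
    "GetLineStatus": {
        "slots": {"line": enum("victoria", "central", "jubilee")},
        "utterances": [
            "are there any delays on the {line} line",
            "how is the {line} line running",
        ],
    },
    "SetTimer": {
        "slots": {"minutes": number},
        "utterances": ["set a timer for {minutes} minutes"],
    },
}


def to_regex(template: str) -> "re.Pattern":
    """Turn 'set a timer for {minutes} minutes' into a named-group regex."""
    parts = re.split(r"\{(\w+)\}", template)  # literal, slot, literal, ...
    pattern = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+?)"
        for i, part in enumerate(parts)
    )
    return re.compile(f"^{pattern}$", re.IGNORECASE)


def resolve(utterance: str) -> Optional[dict]:
    """Return {'intent': ..., 'slots': {...}} or None if nothing matches."""
    text = utterance.strip().rstrip("?.!")
    for name, spec in INTENTS.items():
        for template in spec["utterances"]:
            match = to_regex(template).match(text)
            if not match:
                continue
            try:
                slots = {k: spec["slots"][k](v) for k, v in match.groupdict().items()}
            except ValueError:
                continue  # phrase matched, but a variable had the wrong type
            return {"intent": name, "slots": slots}
    return None
```

Calling `resolve("Set a timer for 15 minutes")` hands back the intent name plus an integer, which is roughly the shape of data a Skill's code gets to act upon.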

This programming model is flexible enough to make most things possible, but there are a few limitations. It’d be great to have a programmatic way of updating the enum values of custom slot types, either via an API or by having the Skills Kit read and cache the values from JSON served at a URL. I’d also like to see an expanded list of built-in types: it currently only supports US cities, for example.

These nits aside, it’s clear a lot of work has gone into the ASK, and it’s super-easy to build pretty complex voice-driven interfaces really quickly.

Needs better support for asynchronous tasks

Not everything happens in an instant. Today, when you ask Alexa something, she can only reply with one block of speech. This works great if the Skill you’re interacting with has the answers ready in an instant, but that’s not always the case.

Imagine your Skill calls an API which takes 5 seconds to respond — not all that uncommon for complex operations. There’ll be an awkward 5-second pause after you pose the question before you hear a response. Granted, you know something’s happening as the Echo’s blue lights pulse in the meantime. But it’d be a much better experience if Alexa offered Skills the ability to respond immediately with something like “OK, let me look that up for you”, and then a few seconds later with the actual response.

A great use case is a hypothetical Lyft app. When you order a ride, it might take 10–60 seconds for real drivers in the real world to accept the job. In Lyft’s app, this latency is covered by a spinner. But to make this experience work on the Echo, a Skill needs to be able to reply instantly (“OK, let me get you a ride”), then keep you updated (“I’m still trying to connect you with a driver…”), before letting you know: “I got you a ride. Your car will arrive in 4 minutes”. That experience isn’t possible today, and it desperately needs to be if we want a whole class of semi-asynchronous or long-running Skills.
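The interaction I’m wishing for could be sketched like this in Python. To be clear, nothing here is an Alexa API: `speak_with_updates` and `find_driver` are hypothetical names, and the slow ride-matching call is faked with a short sleep. The point is only the shape: acknowledge immediately, speak a holding phrase at intervals while the work is pending, then deliver the real answer.

```python
import asyncio
from typing import AsyncIterator, Awaitable, List


async def speak_with_updates(
    work: Awaitable[str], ack: str, waiting: str, interval: float
) -> AsyncIterator[str]:
    """Yield an instant acknowledgement, periodic updates, then the result."""
    yield ack
    pending = asyncio.ensure_future(work)  # start the slow operation
    while True:
        done, _ = await asyncio.wait({pending}, timeout=interval)
        if done:
            break
        yield waiting  # still running after `interval` seconds
    yield pending.result()


async def find_driver(delay: float) -> str:
    """Stand-in for a slow, real-world operation (hypothetical)."""
    await asyncio.sleep(delay)
    return "I got you a ride. Your car will arrive in 4 minutes."


async def demo(delay: float = 0.05, interval: float = 0.02) -> List[str]:
    """Collect everything the assistant would say, in order."""
    return [
        line
        async for line in speak_with_updates(
            find_driver(delay),
            ack="OK, let me get you a ride.",
            waiting="I'm still trying to connect you with a driver...",
            interval=interval,
        )
    ]


if __name__ == "__main__":
    for line in asyncio.run(demo()):
        print(line)
```

How many “still trying” lines you hear depends on how long the work takes relative to the interval; a fast response produces none at all, which is exactly what you’d want.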

Notifications Notifications Notifications

Today, Alexa can only respond to commands you utter. There’s no way the Echo can notify you that something happened — you always have to ask. But events and alerts are critical agents in some of the most useful experiences.

Take that Lyft example again — wouldn’t it be useful if Alexa could tell you when your ride was one minute away? What if Alexa could let you know that the pizza you ordered had been dispatched, or for that matter, remind you that your latest Amazon order would be delivered sometime this afternoon? None of that’s possible today.

Now, I can totally understand why this isn’t in v1 (tasteful notifications are hard to get right), but the issues are all solvable. Access to notifications needs to be tightly controlled to prevent abuse, but Amazon already has a certification scheme in place for Skills. It’d also make sense to have a low per-Skill quota to prevent over-use. As a user, I’d also want to be able to set do-not-disturb periods to prevent interruptions.
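Two of those controls, the per-Skill quota and the do-not-disturb window, are simple enough to sketch. The Python below is purely illustrative: `NotificationGate`, the default of three notifications per Skill per day and the 22:00 to 07:00 quiet period are all my own assumptions, not anything Amazon has announced.

```python
from dataclasses import dataclass, field
from datetime import date, datetime, time
from typing import Dict, Tuple


@dataclass
class NotificationGate:
    """Hypothetical policy: low daily quota per Skill plus quiet hours."""

    daily_quota: int = 3              # assumed limit per Skill per day
    quiet_start: time = time(22, 0)   # assumed do-not-disturb start
    quiet_end: time = time(7, 0)      # assumed do-not-disturb end
    _sent: Dict[Tuple[str, date], int] = field(default_factory=dict)

    def in_quiet_hours(self, now: datetime) -> bool:
        t = now.time()
        if self.quiet_start <= self.quiet_end:
            return self.quiet_start <= t < self.quiet_end
        # The window wraps past midnight, e.g. 22:00 to 07:00.
        return t >= self.quiet_start or t < self.quiet_end

    def allow(self, skill: str, now: datetime) -> bool:
        """Return True (and count it) if this Skill may speak up right now."""
        if self.in_quiet_hours(now):
            return False
        key = (skill, now.date())
        if self._sent.get(key, 0) >= self.daily_quota:
            return False
        self._sent[key] = self._sent.get(key, 0) + 1
        return True
```

With `daily_quota=2`, a Skill’s third notification on the same day is refused, while a different Skill still gets through; anything attempted at 23:00 is dropped regardless of quota.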

I really hope the folks at Amazon are actively working on notifications right now. They’d dramatically expand the universe of what’s possible.

Access to long-form & streaming audio

Right now, Skills are able to play short (<30 sec) audio clips. This is really designed for audio branding — perhaps a sound trademark. But I can imagine whole classes of experiences that become possible if Skills are able to access live audio streams or play long files.

For example, I’d love to be able to ask Alexa to start streaming the sound from our baby monitor when we put our daughter to sleep. I’d love people to build Skills which access long-form audio content beyond podcasts — for example LBC’s back catalogue of programming stretching back nearly 10 years.

The built-in apps (TuneIn, Spotify, Pandora, Audible etc) are all able to play >30 sec audio files, and connect to live audio streams. It’d be great to see the same abilities made available to third-party Skills too.

Multiroom

Perhaps this is the ultimate first-world problem: I’d like my ambient voice-activated virtual assistant to be in every room of my home. Yes, I know, I’m lucky enough to have a home with enough distance between rooms that I can’t be heard properly between them, let alone lucky enough to have an ambient voice-activated virtual assistant. But I’ve come to expect (no, rely on) Alexa’s presence, so much so that I’m confused when I walk into the bedroom and can’t verbally add diapers to our shopping list.

First, it’d be great if, in a multiple-Echo home, Alexa were smart enough that only the nearest device responded, like the Echo’s beam-forming mic on steroids. We only have one Echo, so I can’t check, but that’s probably not the case today.

It’d be even better if multiple Echos could work together. I’d love to be able to say, from the kitchen: “Alexa, play a lullaby in the Nursery”. Yes, I’m that good a dad.

A more natural invocation model for third-party Skills

While built-in apps like Amazon’s own or Pandora can be invoked with natural phrases like “Alexa, is it going to rain today?”, or “Alexa, play some Gregory Porter”, third-party Skills have a more rigid invocation format:

"Alexa, ask Tube Status if there are any delays"
"Alexa, ask Automatic where my car is"
"Alexa, ask TV Shows when is American Idol on?"
"Alexa, ask|tell|open {skill name} to|for|about|if|whether {some command}"

This results in some pretty awkward sentences, and the formal structure interrupts the illusion that you’re talking to a truly smart assistant. To really make Skills shine, Alexa needs to be clever enough to figure out what you’re asking, and delegate to the right Skill. The commands above should be as simple as:

"Alexa, are there any delays on the Tube?"
"Alexa, where’s my car?"
"Alexa, when is American Idol on?"

Now, again, I totally get why this is the state today — the formal structure makes it much easier for Alexa’s brain to invoke the right Skill and pass your command to it in a structured way. But we’re shooting for amazing here — and being able to invoke Skills using natural language and arbitrary sentence structure is critical to the illusion Alexa purveys.
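To see why the rigid format is easier for Alexa’s brain, here is a rough Python sketch of routing under that format. It is an assumption-laden toy, not how Alexa actually works: `parse_invocation` and the list of installed Skill names are hypothetical, and I’ve treated the connecting word (to, for, about, if, whether) as optional so that phrasings like “where my car is” still parse.

```python
import re
from typing import NamedTuple, Optional, Sequence

CONNECTORS = ("to", "for", "about", "if", "whether")


class Invocation(NamedTuple):
    verb: str
    skill: str
    command: str


def parse_invocation(utterance: str, known_skills: Sequence[str]) -> Optional[Invocation]:
    """Split 'Alexa, ask <skill> [connector] <command>' into its parts."""
    text = utterance.strip().rstrip("?.!")
    match = re.match(r"^alexa,?\s+(ask|tell|open)\s+(.*)$", text, re.IGNORECASE)
    if not match:
        return None  # natural phrasing: nothing says which Skill to use
    verb, rest = match.group(1).lower(), match.group(2)
    # Try longer names first so a longer Skill name beats a shorter prefix.
    for skill in sorted(known_skills, key=len, reverse=True):
        if not rest.lower().startswith(skill.lower()):
            continue
        remainder = rest[len(skill):]
        if remainder and not remainder[0].isspace():
            continue  # matched only part of a longer word
        command = remainder.strip()
        words = command.split(maxsplit=1)
        if words and words[0].lower() in CONNECTORS:
            command = words[1] if len(words) > 1 else ""
        return Invocation(verb, skill, command)
    return None
```

Given the three Skills named above, “Alexa, ask Tube Status if there are any delays” routes to Tube Status with the command “there are any delays”, whereas “Alexa, are there any delays on the Tube?” returns nothing: the sentence itself no longer carries the Skill’s name, so something much smarter has to work out where to send it.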

Audio Out

The Echo is a really great little speaker — at least as good as the other Bluetooth speakers in its price range, and they just stream Bluetooth audio. But it’s not Hi-Fi. For me, the Echo is missing a line out jack that I can wire into a proper set of speakers to play back the streaming audio.

Of course, I could still use a laptop/phone/AirPlay to stream Spotify to my Hi-Fi but it’s testament to how awesome Alexa’s interaction model is that I want to use the Echo to control everything. Given that the current hardware doesn’t have an audio jack, a quick fix would be to let Alexa control another Spotify client — kind of like Spotify Connect in reverse. Or I could hack it....

That’s quite a list, and I feel a little guilty pointing these things out: I’m aware we’re at the very very early stages of what’s possible. But these suggestions aren’t born out of frustration; they’re driven by the wide-eyed excitement of what could be possible.

While the door is ajar for great new experiences, it’s not yet been fully opened. But the expressiveness, flexibility and sheer novelty of this new platform are incredibly exciting. Much like the promise of VR, the advent of natural, ambient and pervasive voice interfaces like Alexa already feels like a new frontier.

The Echo is not a product I expected from Amazon, but like the Kindle, I think we’re all about to realise they have a hit on their hands — and their success is well deserved. The Echo is badass.
