There’s a lot of interest these days in reducing friction and improving efficiency in user experiences. Streamlined experiences are being explored as part of IoT automation and checkout-free shopping, and voice-based interfaces feature prominently in these interactions.
Many of us are familiar with the Alexa/Echo product line from Amazon and the Google Home speaker. In addition to those front runners, Microsoft and Apple also offer voice assistants and APIs for building voice interfaces in their desktop and mobile products. There are also third party APIs which can be used to embed a voice interface in your own custom hardware or application.
In many ways all of these options are comparable. They all have a “wake” word (and sometimes alternatives as well) which triggers the assistant to listen to the user’s commands. They all also use speech recognition and Natural Language Understanding (NLU) technology to capture the speech, translate it into text, and map that text to instructions for the system (the user’s “intents”). Developers can create applications on these platforms to add their own custom behaviors. The AI and machine learning space has really matured, and these assistants often work surprisingly well at figuring out what the user said and what they meant, opening a whole new world for developers.
If you’ve been on the sidelines waiting for consolidation or for the dust to settle with these voice APIs and products, I’d recommend that you start getting your feet wet now. Although each platform offers its own tools, they all use very similar conceptual models to set up and design the conversational interface, and much of this experience transfers well between vendors and hardware platforms.
Let’s explore the “how” of setting up voice interfaces and gather some background on where the remaining challenges in this space lie.
- For your initial voice-based skill, pick something that provides a lot of value to the user and is simple to use. You can add to and extend the experience as you and your users get more experienced.
- I recommend you design the conversation on paper in a fair amount of detail before writing any code. Decision trees are one way to think about and design the conversational interaction. Keep it obvious and provide “outs” that let users get help or start over.
- Expect to iteratively develop and refine the voice interaction model and the skill as you test and gather feedback.
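A paper decision tree like the one described above translates fairly directly into plain data. The following is a minimal sketch in TypeScript, not something taken from any assistant SDK: the `DialogNode` and `DialogTree` shapes, the `step` function, and the drink-ordering prompts are all hypothetical and exist only to show how “help” and “start over” outs can be available at every step.

```typescript
// Hypothetical shapes for illustration; not part of any assistant SDK.
interface DialogNode {
  prompt: string;
  // Maps a recognized user choice to the id of the next node.
  next: Record<string, string>;
}

type DialogTree = Record<string, DialogNode>;

const drinkOrder: DialogTree = {
  start: { prompt: "Would you like coffee or tea?", next: { coffee: "size", tea: "size" } },
  size: { prompt: "Small or large?", next: { small: "done", large: "done" } },
  done: { prompt: "Great, your order is placed.", next: {} },
};

// Returns the id of the node to move to for a given user choice.
// "help" and "start over" are global outs that work at every step.
function step(tree: DialogTree, current: string, choice: string): string {
  const said = choice.trim().toLowerCase();
  if (said === "start over") return "start";
  if (said === "help") return current; // stay put; the caller repeats the prompt
  const node = tree[current];
  if (node === undefined) return "start"; // unknown state: recover at the root
  return node.next[said] ?? current; // unrecognized input: stay and re-prompt
}
```

Keeping the tree as data makes it cheap to revise as you gather feedback, since changing the conversation means editing `drinkOrder` rather than the traversal logic.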
Example Voice/Skill Interaction
Setting Up the Model
Hopefully you spent some time designing the voice interactions during your planning exercise. We’ll now use those to configure your skill with the model you expect: the intents your skill can handle and any extra spoken variables (entities) it needs to fulfill each intent.
Here’s an example of how you configure those for Alexa Skills:
LUIS (Language Understanding Intelligent Service) is Microsoft’s equivalent service to do NLU processing. Here’s an example of the UI for setting up intents in LUIS:
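Independent of any vendor’s UI, the model boils down to a list of intents, each with the spoken variables it needs. The sketch below expresses that idea in TypeScript. It is an assumption-laden illustration, not any platform’s actual schema: the `Intent` and `Slot` shapes, the `weatherModel` example, and the `missingSlots` helper are all hypothetical names.

```typescript
// Hypothetical, vendor-neutral shapes; each platform defines its own schema.
interface Slot {
  name: string;      // the spoken variable, e.g. "city"
  type: string;      // the kind of value expected, e.g. "CITY" or "DATE"
  required: boolean; // whether the intent can be fulfilled without it
}

interface Intent {
  name: string;
  slots: Slot[];
}

const weatherModel: Intent[] = [
  {
    name: "GetForecast",
    slots: [
      { name: "city", type: "CITY", required: true },
      { name: "day", type: "DATE", required: false },
    ],
  },
  { name: "Help", slots: [] },
];

// Lists the required variables a request has not filled yet,
// so the skill knows what to ask the user for next.
function missingSlots(
  model: Intent[],
  intentName: string,
  filled: Record<string, string>,
): string[] {
  const intent = model.find((i) => i.name === intentName);
  if (intent === undefined) throw new Error(`Unknown intent: ${intentName}`);
  return intent.slots
    .filter((s) => s.required && !(s.name in filled))
    .map((s) => s.name);
}
```

For example, `missingSlots(weatherModel, "GetForecast", { day: "tomorrow" })` reports that `city` is still needed, which is the cue to ask a follow-up question.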
Training the Model
After you have specified the model you’ll need to train it. Training consists of entering a variety of phrases similar to what you expect users will speak when interacting with your skill. For each phrase you’ll identify the intent and any entities included. Variety is the spice of life and the path to a good experience with your NLU tool. Providing and categorizing a diverse set of example utterances will help the machine learning inside the NLU tool get better at identifying the intent and the corresponding entities you expect.
For Alexa, you enter sample utterances and tag them using a pure textual interface. It’s a bit more manual but if you get fancy and generate variations with a script, you can upload that data into your skill easily. Here’s what the interface looks like:
Alexa also has a new Skill Builder service, currently in beta, which promises a less bare-metal approach to defining and configuring your model.
Microsoft’s LUIS already offers an interface with a few more bells and whistles. The UI tries to walk you through the setup process. In my own experience it was not easier to use than writing sentences and labelling them by hand, but the tool is quickly evolving.
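The idea of generating utterance variations with a script, mentioned above, can be sketched in a few lines. This is a minimal example under stated assumptions: the `(a|b)` alternation syntax and the `expandTemplate` function are invented for this sketch, and the curly-brace `{city}` placeholder is assumed to be how your target platform marks a spoken variable, so it is passed through untouched.

```typescript
// Expands a template such as "(what is|tell me) the weather" into every
// combination of its alternatives. The (a|b) syntax is invented for this
// sketch; text outside parentheses, including {placeholders}, is kept as-is.
function expandTemplate(template: string): string[] {
  // Find the first innermost parenthesized group, if any.
  const match = /\(([^()]*)\)/.exec(template);
  if (match === null) return [template];
  const before = template.slice(0, match.index);
  const after = template.slice(match.index + match[0].length);
  const results: string[] = [];
  for (const option of (match[1] ?? "").split("|")) {
    // Substitute one alternative, then expand any remaining groups.
    results.push(...expandTemplate(before + option + after));
  }
  return results;
}

const utterances = expandTemplate("(what is|tell me) the (weather|forecast) in {city}");
```

Here `utterances` holds four phrases, one per combination of the two groups. A handful of templates like this can produce a diverse training set without typing every phrase by hand, though generated phrases are no substitute for utterances collected from real users.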
Build and Deploy an Endpoint for the Skill
A bunch of work goes into picking a good skill and defining the model, but the rubber really hits the road when you grab the SDK and build the REST endpoint which handles the intents. Next, we will get a bit technical and take a look at some of the actual code that goes into creating a skill.
Cortana skills look similar but utilize the bot framework SDK. You can find a series of examples to review and get you going here.
Google examples are worth reviewing and tend to use Firebase as a foundation. You can find examples here.
The emergence of serverless architectures has made launching and testing skills a lot easier. A popular approach is building a Node.js application using the appropriate skill SDK and deploying it to the hosted function services offered by cloud providers like Amazon’s AWS and Microsoft’s Azure.
Each voice platform can work with any hosting solution, so you can make whatever hosting choice is most comfortable for you or your organization, whether that’s AWS or servers in your own datacenter. Your skill code just has to honor the API style expected by the voice assistant. Having said that, the serverless options work well and are a great way to get started (and often to continue).
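Whatever the hosting choice, the core of the endpoint is a dispatcher that routes each incoming intent to a handler. The sketch below shows that shape in TypeScript. It deliberately avoids any real SDK: `SkillRequest`, `SkillResponse`, `handlers`, and `handleRequest` are hypothetical names, and a real skill would translate the assistant’s own JSON request and response formats to and from something like them.

```typescript
// Hypothetical request/response shapes; real platforms define their own JSON.
interface SkillRequest {
  userId: string;
  intent: string;
  slots: Record<string, string>;
}

interface SkillResponse {
  speech: string;
  endSession: boolean;
}

type IntentHandler = (req: SkillRequest) => SkillResponse;

const handlers: Record<string, IntentHandler> = {
  GetForecast: (req) => {
    const city = req.slots["city"];
    // Ask a follow-up question when the required variable is missing.
    return city
      ? { speech: `Looking up the forecast for ${city}.`, endSession: true }
      : { speech: "Which city would you like the forecast for?", endSession: false };
  },
  Help: () => ({ speech: "You can ask me for a forecast in any city.", endSession: false }),
};

// Single entry point: a REST route or serverless function would parse the
// platform's JSON into a SkillRequest and pass it here.
function handleRequest(req: SkillRequest): SkillResponse {
  const handler = handlers[req.intent];
  if (handler === undefined) {
    return { speech: "Sorry, I didn't catch that. Say help for options.", endSession: false };
  }
  return handler(req);
}
```

Keeping the dispatcher free of hosting and platform details means the same handlers can sit behind a serverless function or a route on your own servers.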
Once deployed, you will be ready to start testing out the commands or intents you have created in your skill application from your voice enabled device.
Each assistant also provides tools for testing your skill endpoint with any utterances you can imagine. As usual in our industry, test early and often. Keep in mind that each assistant also has an approval process to pass before your skill is released into the wild.
With that, I’ve walked you through the basics of building a voice skill conceptually. As I said at the outset, while the specifics differ among the assistant platforms, they are remarkably similar in style and model, allowing for a lot of shared conceptual understanding when you move between platforms.
These are a few things that you’ll run into as you start to build real world conversational products:
The assistant platforms all offer a consistent user identifier across invocations. If your skill doesn’t require any extra authentication or authorization, it’s possible to just use that identifier and save any preferences in an appropriate persistent data store (e.g., a NoSQL database) keyed with that ID.
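As a rough illustration of keying preferences by that identifier, here is a minimal sketch. Everything in it is hypothetical: the `Preferences` fields, the `PreferenceStore` interface, and the `InMemoryStore` class, which stands in for a real persistent store. The methods are synchronous to keep the example short; a real database client would almost certainly be asynchronous.

```typescript
// Hypothetical preference fields for a weather-style skill.
interface Preferences {
  homeCity?: string;
  units?: "metric" | "imperial";
}

// Minimal store interface; a NoSQL table keyed by the assistant's user id
// could sit behind the same two methods (most likely in async form).
interface PreferenceStore {
  get(userId: string): Preferences;
  save(userId: string, prefs: Preferences): void;
}

// In-memory stand-in for a persistent store; data is lost on restart.
class InMemoryStore implements PreferenceStore {
  private readonly data = new Map<string, Preferences>();

  get(userId: string): Preferences {
    return this.data.get(userId) ?? {};
  }

  save(userId: string, prefs: Preferences): void {
    // Merge so that saving one preference does not erase the others.
    this.data.set(userId, { ...this.get(userId), ...prefs });
  }
}

const store = new InMemoryStore();
store.save("assistant-user-123", { homeCity: "Denver" });
store.save("assistant-user-123", { units: "metric" });
```

After the two calls above, `store.get("assistant-user-123")` returns both `homeCity` and `units`, while an unknown user id yields an empty object.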
However, many skills can provide a much richer experience if they can link the voice user to a new or existing account on their site. The obvious use case is a purchase. Simpler cases, such as making more relevant suggestions, also become possible with just a bit more knowledge about the user.
All three major platforms utilize the OAuth 2.0 grant flow to request and allow access to user profile data and third party APIs via their voice assistant. When setting up the skill you can choose to supply an OAuth 2.0 provider endpoint and connection details allowing the assistant to ask the user for permission and, if granted, interact with your service securely on behalf of the user.
The authorization flow that a user follows when linking their account to the voice skill is not possible over a purely voice interface today; however, with minimal friction, users can hop over to the related app to supply their account information and approve the link. After the accounts are linked (or connected, as Microsoft calls it), the assistant will maintain the OAuth refresh token on behalf of the user and provide your skill with the access token for each request.
Obviously each of these assistants is different and has its own sweet spot(s). They vary in the maturity of the natural language processing engine, the documentation around skill development, and the hardware and software used to interact with them. Some of the things we’ve run into that you may want to keep an eye out for are:
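On the skill side, handling a linked account mostly means checking whether the token arrived with the request and, if so, passing it along to your own service. The sketch below assumes a hypothetical request shape with an optional `accessToken` field; the `LinkedRequest` and `AuthResult` types and the `authorize` function are illustrative names, not any platform’s API. The `Authorization: Bearer` header is the standard way to present an OAuth 2.0 bearer token.

```typescript
// Hypothetical request shape; the field name accessToken is an assumption.
interface LinkedRequest {
  userId: string;
  accessToken?: string;
}

// Either the headers needed to call your service, or a spoken prompt
// telling the user to link their account first.
type AuthResult =
  | { linked: true; headers: Record<string, string> }
  | { linked: false; speech: string };

function authorize(req: LinkedRequest): AuthResult {
  if (req.accessToken === undefined || req.accessToken === "") {
    return {
      linked: false,
      speech: "To continue, please link your account in the companion app.",
    };
  }
  // Present the token to your own API as a standard OAuth 2.0 bearer token.
  return { linked: true, headers: { Authorization: `Bearer ${req.accessToken}` } };
}
```

A handler would call `authorize` first and branch on `linked`: when it is `false`, speak the prompt and stop; when it is `true`, use `headers` for the request to your backend.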
- Cheaper microphone hardware may yield poor speech recognition results. In the same way you would test your mobile apps across devices and hardware generations, you should test your voice skill.
- Speech recognition results can vary between platforms for different voices depending on things like accent and gender. Account for that and test early with voices that are representative of your audience.
- The speed of the interaction varies across platforms. Even with solid hardware sometimes the responsiveness of the conversation is less than you would hope for and is sometimes outside of your control.
- The Echo does a good job generally with speech recognition and has the most complete documentation. It also seems to have captured the most mind share at this point, although Cortana and Google are working hard to catch up.
- The Microsoft assistant feels the roughest of the three in terms of maturity of the offering. The tooling and UI for skill building have continued to change, making it difficult to discover and stabilize best practices. However, Microsoft does cover a larger variety of use cases (with the exception of purpose-built speakers), spanning desktops, laptops, and various mobile devices.
- The Google Assistant provides possibly the best natural language understanding engine as well as a more streamlined setup process for linking user accounts.
Voice interfaces are here to stay and are improving rapidly. The three voice assistants we’ve talked about here are all mature enough to use in real products today. I hope this inspires you to jump in and experiment with your own voice skills. While you’re doing that, we’ll be following up with a blog post where we’ll dive in a little deeper on how best to design conversational interfaces.
Comparison of Platform Terms:
Feature image courtesy of Arif Wahid for Unsplash.