[Webinale] Voice: The New Frontend

Voice: The New Frontier Frontend
Nara Kasbergen (@xiehan)
#webinale
Thursday, December 10, 2020

What is NPR? a nationwide network of public radio stations

Why voice? Then: Now:

What we're covering today
1. Vision of the future
2. How we got here
3. Where we are today
4. How to build for voice today
5. Lessons learned

Vision of the future 1

Prediction: In the future, we will all be voice developers.

How we got here 2

Natural Language Processing
definition: the ability of a computer program to understand human language as it is spoken. NLP is a component of artificial intelligence (AI). (source)
• circa 1960
• first major advances in the 1980s

Machine Learning (ML)
definition: Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves. (source)
• circa 1950s
• first major advances in the 1980s

https://hackernoon.com/moores-law-is-alive-and-well-adc010ea7a63

1987 Apple's Knowledge Navigator
1994 IBM Simon is the world's first smartphone
2009 Microsoft begins work on Cortana
2010 Siri launches as a standalone iOS app, later acquired by Apple
2012 Google Now

2014 Amazon Echo launches November 6
2015 Amazon Alexa Skills Kit launches (June)
2016 Google Assistant (May) + Google Home (November)
2017 Cortana Skills Kit (May) + Samsung Bixby (August)
2018 Apple HomePod (February)

Rise of multimodal devices

Where we are today 3

The major players

The state at the end of 2019

Predictions for 2021

The request/response flow
[diagram: request → Your code → response]
P.S. all the NLP and ML happens here

Voice for the web
Web Speech API:
• Speech recognition
• Speech synthesis (TTS)
W3C Community specification was published in 2012
SpeechRecognition interface currently only supported in Chrome, experimental feature
Uses Google's servers to convert speech to text (requires Internet connection)
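A minimal feature-detection sketch for the SpeechRecognition interface described above. Chrome exposes the constructor under the prefixed name `webkitSpeechRecognition`, so the lookup tries both names. The structural types and the helper names `findSpeechRecognition` and `listenOnce` are assumptions made for this example so that it compiles without DOM typings; they are not part of the API.

```typescript
// Structural stand-ins for the browser types, declared here so the sketch
// compiles without DOM typings. Only the members used below are listed.
interface RecognitionResultEvent {
  results: ArrayLike<ArrayLike<{ transcript: string }>>;
}
interface Recognition {
  lang: string;
  onresult: ((event: RecognitionResultEvent) => void) | null;
  start(): void;
}
type RecognitionCtor = new () => Recognition;

// Hypothetical helper: look up the constructor under its standard name first,
// then under the webkit-prefixed name.
function findSpeechRecognition(host: Record<string, unknown>): RecognitionCtor | undefined {
  const candidate = host["SpeechRecognition"] ?? host["webkitSpeechRecognition"];
  return typeof candidate === "function" ? (candidate as RecognitionCtor) : undefined;
}

// Hypothetical helper: start one recognition session and pass the first
// transcript to the callback. Returns false when the API is unavailable.
function listenOnce(host: Record<string, unknown>, onText: (text: string) => void): boolean {
  const Ctor = findSpeechRecognition(host);
  if (!Ctor) return false;
  const recognition = new Ctor();
  recognition.lang = "en-US";
  recognition.onresult = (event) => onText(event.results[0][0].transcript);
  recognition.start();
  return true;
}
```

In a browser, the host object would be the global object, for example `listenOnce(globalThis as unknown as Record<string, unknown>, console.log)`. A `false` return value is the cue to fall back to a non-voice interface.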

Mozilla Common Voice
"Common Voice is Mozilla's initiative to help teach machines how real people speak." (source)
• Publicly open dataset
• Upload recordings of your voice
• Help reduce bias in Natural Language Processing (NLP) & ML
* In maintenance mode since August 2020

Open source alternatives
• Mozilla Scout
• Mycroft

The enterprise voice space NLP/ML/AI Consumer applications

How to build for voice today 4

Disclaimer: it's not about the code

The stack
• node.js
• serverless (AWS Lambda or Google Cloud Functions)
• lightweight database (e.g. Amazon DynamoDB)
• CI server of choice (e.g. Travis, Jenkins, etc.)
• unit testing framework of choice (e.g. Jest)
• TypeScript…?

Use the official SDKs
Alexa node.js SDK: github.com/alexa/alexa-skills-kit-sdk-for-nodejs
Actions on Google node.js SDK: github.com/actions-on-google/actions-on-google-nodejs

Glossary
Alexa               Google / Dialogflow
skill               action / agent
invocation name     invocation name
intent              intent
slot                entity
sample utterance    training phrase

“Alexa, ask NPR to play the latest news”
• invocation name: “NPR”

“Alexa, ask NPR what's playing”
“Alexa, ask NPR what am I listening to?”
• one intent, two sample utterances

“Alexa, ask NPR to play Planet Money”
“Alexa, ask NPR to play Hidden Brain”
• one intent; the show name is a slot (or entity)
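The phrases above could be described to Alexa with a language model along these lines. This is a sketch, not NPR's actual model: the intent names `WhatsPlayingIntent` and `PlayShowIntent`, the slot name `show`, and the slot type `SHOW_NAME` are hypothetical. In a real skill this object sits under `interactionModel.languageModel` in the skill's JSON model file. The `expandSamples` helper is also an assumption, added only to show how sample utterances and slot values combine into full phrases.

```typescript
interface ModelSlotType { name: string; values: { name: { value: string } }[] }
interface ModelIntent { name: string; slots?: { name: string; type: string }[]; samples: string[] }
interface LanguageModel { invocationName: string; intents: ModelIntent[]; types: ModelSlotType[] }

// Hypothetical model for the example phrases on the slides.
const languageModel: LanguageModel = {
  invocationName: "npr",
  intents: [
    // One intent, two sample utterances.
    { name: "WhatsPlayingIntent", samples: ["what's playing", "what am i listening to"] },
    // One intent whose sample utterance contains a slot placeholder.
    { name: "PlayShowIntent", slots: [{ name: "show", type: "SHOW_NAME" }], samples: ["to play {show}"] },
  ],
  types: [
    { name: "SHOW_NAME", values: [{ name: { value: "Planet Money" } }, { name: { value: "Hidden Brain" } }] },
  ],
};

// Illustration only: list the full phrases an intent covers by substituting
// every known slot value into every sample utterance.
function expandSamples(model: LanguageModel, intentName: string): string[] {
  const intent = model.intents.find((candidate) => candidate.name === intentName);
  if (!intent) return [];
  let phrases: string[] = intent.samples.slice();
  for (const slot of intent.slots ?? []) {
    const placeholder = `{${slot.name}}`;
    const slotType = model.types.find((t) => t.name === slot.type);
    const values = slotType ? slotType.values.map((v) => v.name.value) : [];
    const expanded: string[] = [];
    for (const phrase of phrases) {
      if (phrase.indexOf(placeholder) === -1) { expanded.push(phrase); continue; }
      for (const value of values) expanded.push(phrase.split(placeholder).join(value));
    }
    phrases = expanded;
  }
  return phrases.map((phrase) => `Alexa, ask ${model.invocationName} ${phrase}`);
}
```

Calling `expandSamples(languageModel, "PlayShowIntent")` yields one phrase per show. On the platform itself the listed slot values are training data rather than an exhaustive list, so the expansion here is only a way to read the model.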

Basic code architecture
1. JSON event with intent name and optional slot(s)
2. Handler function mapped to that intent
3. Use SDK to produce a response with speech, audio, etc.
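The architecture above can be sketched without any SDK as a lookup from intent name to handler function. Everything here is a simplified assumption: the `VoiceEvent` and `VoiceResponse` shapes are platform-neutral stand-ins for the real request and response JSON, and the intent names are hypothetical. The official SDKs provide their own versions of this routing.

```typescript
// Hypothetical, platform-neutral shapes for the incoming event and the reply.
interface VoiceEvent { intent: string; slots?: Record<string, string> }
interface VoiceResponse { speech: string; endSession: boolean }
type IntentHandler = (slots: Record<string, string>) => VoiceResponse;

// One handler function per intent name.
const intentHandlers: Record<string, IntentHandler> = {
  WhatsPlayingIntent: () => ({ speech: "You're listening to the latest news.", endSession: true }),
  PlayShowIntent: (slots) => {
    const show = slots["show"];
    // If the slot was not filled, ask a follow-up question and keep the session open.
    return show
      ? { speech: `Playing ${show}.`, endSession: true }
      : { speech: "Which show would you like to hear?", endSession: false };
  },
};

// Route the event to its handler, with a fallback for unmapped intents.
function handleEvent(event: VoiceEvent): VoiceResponse {
  const known = Object.prototype.hasOwnProperty.call(intentHandlers, event.intent);
  if (!known) return { speech: "Sorry, I can't help with that yet.", endSession: true };
  return intentHandlers[event.intent](event.slots ?? {});
}
```

Keeping handlers as plain functions of their slots makes them easy to cover with a unit testing framework such as Jest, without a device or a deployed function.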

Alexa "Hello World" skill
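The code shown on this slide is not part of the transcript. As a stand-in, here is a dependency-free sketch of a "Hello World" skill that reads the Alexa request JSON and returns the response JSON directly, without the official SDK. The interfaces list only the fields the sketch uses, and the intent name `HelloWorldIntent` is an assumption about how the skill's interaction model is set up.

```typescript
// Only the fields used below; real Alexa requests carry more (session, context, ...).
interface AlexaRequestEnvelope {
  version: string;
  request: { type: string; intent?: { name: string } };
}
interface AlexaResponseEnvelope {
  version: string;
  response: { outputSpeech: { type: "PlainText"; text: string }; shouldEndSession: boolean };
}

// Build a plain-text spoken response.
function say(text: string, shouldEndSession: boolean): AlexaResponseEnvelope {
  return { version: "1.0", response: { outputSpeech: { type: "PlainText", text }, shouldEndSession } };
}

function alexaHelloWorld(event: AlexaRequestEnvelope): AlexaResponseEnvelope {
  const { request } = event;
  // "Alexa, open <invocation name>" arrives as a LaunchRequest.
  if (request.type === "LaunchRequest") return say("Welcome. You can say hello.", false);
  // Spoken phrases that matched an intent arrive as an IntentRequest.
  if (request.type === "IntentRequest" && request.intent?.name === "HelloWorldIntent") {
    return say("Hello world!", true);
  }
  return say("Sorry, I didn't understand that.", true);
}
```

On AWS Lambda, a function like this could be exported as the handler, for example `export const handler = async (event: AlexaRequestEnvelope) => alexaHelloWorld(event)`. The official SDK wraps the same request and response shapes in handler objects and a response builder.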

Google "Hello World" action
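The code shown on this slide is not part of the transcript either. The sketch below assumes the action is fulfilled through a Dialogflow webhook and answers with the webhook's `fulfillmentText` field; it does not use the Actions on Google SDK. `Default Welcome Intent` is the name Dialogflow gives its built-in welcome intent, while the intent name `HelloWorld` is hypothetical.

```typescript
// Only the fields used below; real webhook requests carry more.
interface DialogflowWebhookRequest {
  queryResult: { intent: { displayName: string }; parameters?: Record<string, unknown> };
}
interface DialogflowWebhookResponse { fulfillmentText: string }

function dialogflowHelloWorld(body: DialogflowWebhookRequest): DialogflowWebhookResponse {
  // The matched intent is identified by its display name.
  const intentName = body.queryResult.intent.displayName;
  if (intentName === "Default Welcome Intent") return { fulfillmentText: "Welcome. You can say hello." };
  if (intentName === "HelloWorld") return { fulfillmentText: "Hello world!" };
  return { fulfillmentText: "Sorry, I didn't understand that." };
}
```

In a Google Cloud Function with an Express-style signature, this could be wired up as `(req, res) => res.json(dialogflowHelloWorld(req.body))`. Compared with the Alexa version, the structure is the same and mostly the vocabulary changes, as in the glossary.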

Next, add more features
Simple
• Launch requests
• Add intents
• Wait for response
• Handle slots
• Add images to cards
• Play simple audio
• Use SSML
Medium
• User login (OAuth2)
• Persistent data
• State management
• Contexts (Google)
• Dialog management (Alexa)
• Advanced audio (Alexa)
Advanced
• Customize display on visual devices
• Monetization / transactions
• Request user permissions
• Notifications
• Internationalization
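As one example from the "Simple" tier, SSML lets a response control how text is spoken, for instance by inserting pauses. The sketch below builds an SSML string with a `<break>` between sentences. The helper names `escapeForSsml` and `buildSsml` are assumptions for this example, not SDK functions.

```typescript
// SSML is XML, so characters with special meaning in text content must be escaped.
function escapeForSsml(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Join sentences with a pause of the given length and wrap them in <speak>.
function buildSsml(sentences: string[], pauseMs: number): string {
  const pause = `<break time="${pauseMs}ms"/>`;
  return `<speak>${sentences.map(escapeForSsml).join(pause)}</speak>`;
}
```

For example, `buildSsml(["Here is the latest news", "Up next: Planet Money"], 500)` inserts a half-second pause between the two sentences. On Alexa, a string like this is sent as output speech of type `SSML` instead of `PlainText`.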

Publishing a skill/action
• By default, skills in development are only available to you
• Certification process similar to mobile app store submissions
  ◦ ~48-hour turnaround on average
  ◦ Feedback is unpredictable!
  ◦ Respect existing brands
• Can share a "beta" version with co-workers, friends, etc.
  ◦ Great for QA as well as hobby projects

Get started without code
Alexa:
• Alexa Skill Blueprints
Google:
• Actions on Google Templates

Lessons learned 5

The challenges
• The code is not hard
• What is hard:
  ◦ Learning about the platform limitations
  ◦ Managing stakeholder expectations
  ◦ Understanding & changing user behaviors
  ◦ QA
• It's not just an engineering challenge!

Prediction: In the future, we will all be voice developers.

Thank you! Keep in touch: @xiehan
