What we're covering today
1. Vision of the future
2. How we got here
3. Where we are today
4. How to build for voice today
5. Lessons learned
6. AMA (Ask Me Anything)
Natural Language Processing
Definition: the ability of a computer program to understand human language as it is spoken. NLP is a component of artificial intelligence (AI). (source)
Circa 1960; first major advances in the 1980s
Machine Learning (ML)
Definition: machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves. (source)
Circa the 1950s; first major advances in the 1980s
1987: Apple's Knowledge Navigator (concept video)
1994: IBM Simon, the world's first smartphone
2009: Microsoft begins work on Cortana
2010: Siri launches as a standalone iOS app, later acquired by Apple
2012: Google Now
Chatbots
Definition: a computer program or an artificial intelligence which conducts a conversation via auditory or textual methods. Such programs are often designed to convincingly simulate how a human would behave as a conversational partner, thereby passing the Turing test. (source)
Circa 1966 (ELIZA); first major advances in the early 2000s
Voice for the web
Web Speech API:
● Speech recognition
● Speech synthesis (TTS)
The W3C Community Group specification was published in 2012
The SpeechRecognition interface is currently only supported in Chrome, as an experimental feature
Chrome uses Google's servers to convert speech to text (requires an Internet connection)
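For context, a minimal browser sketch of both halves of the API; the prefixed constructor is a Chrome-specific fallback, and in most browsers start() must be triggered by a user gesture:

// Minimal Web Speech API sketch: recognition plus synthesis.
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.lang = 'en-US';
recognition.interimResults = false;

// Fires once Google's servers return a transcription of the user's speech.
recognition.onresult = (event) => {
  const transcript = event.results[0][0].transcript;
  console.log('Heard:', transcript);

  // Speech synthesis (TTS) is more widely supported and runs locally.
  window.speechSynthesis.speak(new SpeechSynthesisUtterance(`You said: ${transcript}`));
};

recognition.onerror = (event) => console.error('Recognition error:', event.error);

// Call from a user gesture (e.g. a button's click handler).
recognition.start();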
Mozilla Common Voice
"Common Voice is Mozilla's initiative to help teach machines how real people speak." (source)
● Publicly open dataset
● Upload recordings of your voice
● Help reduce bias in Natural Language Processing (NLP) & machine learning
The stack
● node.js
● serverless (AWS Lambda or Google Cloud Functions)
● lightweight database (e.g. Amazon DynamoDB)
● CI server of choice (e.g. Travis, Jenkins, etc.)
● unit testing framework of choice (e.g. Jest)
● TypeScript…?
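To show how the testing piece fits, a minimal Jest sketch; the helloIntentHandler function and its response shape are hypothetical stand-ins, not part of any SDK:

// handler.test.js, runnable with `npx jest`.
// helloIntentHandler is a hypothetical handler defined inline for the test.
function helloIntentHandler(event) {
  const name = (event.slots && event.slots.name) || 'friend';
  return { speech: `Hello, ${name}!` };
}

test('greets the user by name', () => {
  const response = helloIntentHandler({ intent: 'HelloIntent', slots: { name: 'Ada' } });
  expect(response.speech).toBe('Hello, Ada!');
});

test('falls back to a generic greeting when the slot is missing', () => {
  const response = helloIntentHandler({ intent: 'HelloIntent', slots: {} });
  expect(response.speech).toBe('Hello, friend!');
});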
Use the official SDKs
Alexa node.js SDK: github.com/alexa/alexa-skills-kit-sdk-for-nodejs
Actions on Google node.js SDK: github.com/actions-on-google/actions-on-google-nodejs
Basic code architecture
● JSON event with intent name and slot(s)
● Handler function mapped to that intent
● Use the SDK to produce a response with speech, audio, etc.
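Putting those three pieces together, a minimal sketch using the Alexa node.js SDK (ask-sdk-core); the HelloIntent name and the name slot are illustrative assumptions:

// index.js, deployable as an AWS Lambda handler.
const Alexa = require('ask-sdk-core');

const HelloIntentHandler = {
  // Map this handler to the intent name in the incoming JSON event.
  canHandle(handlerInput) {
    const { request } = handlerInput.requestEnvelope;
    return request.type === 'IntentRequest' && request.intent.name === 'HelloIntent';
  },
  // Use the SDK's response builder to produce the speech response.
  handle(handlerInput) {
    const slots = handlerInput.requestEnvelope.request.intent.slots || {};
    const name = (slots.name && slots.name.value) || 'friend';
    return handlerInput.responseBuilder
      .speak(`Hello, ${name}!`)
      .getResponse();
  },
};

exports.handler = Alexa.SkillBuilders.custom()
  .addRequestHandlers(HelloIntentHandler)
  .lambda();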
Publishing a skill/action
● By default, skills in development are only available to you
● Certification process similar to mobile app store submissions
  ○ ~48-hour turnaround on average
  ○ Feedback is unpredictable!
  ○ Respect existing brands
● Can share a "beta" version with co-workers, friends, etc.
  ○ Great for QA as well as hobby projects
The challenges
● The code is not hard
● What is hard:
  ○ Learning about the platform limitations
  ○ Managing stakeholder expectations
  ○ Understanding & changing user behaviors
  ○ QA
● It's not just an engineering challenge!
Audio-first development
● You would think these platforms are ideal for audio… but they're not!
● It's clear the companies designing these platforms are still focused primarily on Text-to-Speech (TTS)
● The Actions on Google audio player is almost unusable
● The Alexa audio player has many features but is very unintuitive when you're first working with it
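For a sense of what working with the Alexa audio player looks like, a minimal sketch; the PlayStreamIntent name and stream URL are assumptions:

const PlayStreamIntentHandler = {
  canHandle(handlerInput) {
    const { request } = handlerInput.requestEnvelope;
    return request.type === 'IntentRequest' && request.intent.name === 'PlayStreamIntent';
  },
  handle(handlerInput) {
    return handlerInput.responseBuilder
      .speak('Starting playback.')
      // Arguments: play behavior, stream URL (must be HTTPS), an opaque
      // token identifying this stream, and the start offset in milliseconds.
      .addAudioPlayerPlayDirective('REPLACE_ALL', 'https://example.com/episode-1.mp3', 'episode-1-token', 0)
      .getResponse();
  },
};

Part of the learning curve: once playback begins, Alexa sends separate AudioPlayer lifecycle requests (PlaybackStarted, PlaybackNearlyFinished, and so on) outside the normal session, and each needs its own handler.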
Error handling is hard
● Invisible errors:
  ○ The Alexa service / Google / Dialogflow can reject a user's request
  ○ If that happens, your app is not notified at all!
  ○ Logs/analytics can't tell the whole story
  ○ Users often don't understand why it failed
● Real user testing is critical!
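For the errors that do reach your code, the Alexa node.js SDK supports catch-all error handlers; a minimal sketch (the fallback wording is illustrative):

// Catches only errors thrown inside your own handlers; it cannot see
// requests the Alexa service rejected before they reached the skill.
const CatchAllErrorHandler = {
  canHandle() {
    return true; // handle every error that reaches the skill code
  },
  handle(handlerInput, error) {
    console.error('Unhandled error:', error.message);
    return handlerInput.responseBuilder
      .speak('Sorry, something went wrong. Please try again.')
      .reprompt('What would you like to do?')
      .getResponse();
  },
};

// Registered alongside the request handlers:
// Alexa.SkillBuilders.custom()
//   .addRequestHandlers(/* ... */)
//   .addErrorHandlers(CatchAllErrorHandler)
//   .lambda();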