[WITSMA] Voice: The New Frontend

Nara Kasbergen

March 28, 2019

Transcript

  1. Voice: The New Frontier Frontend. Nara Kasbergen (@xiehan), #WITSMA19, Thursday, March 28, 2019
  2. (no text on slide)
  3. me

  4. Now: Why voice? Then:

  5. (no text on slide)
  6. What we're covering today

    1. Vision of the future
    2. How we got here
    3. Where we are today
    4. How to build for voice today
    5. Lessons learned
  7. Part 1: Vision of the future

  8. (no text on slide)
  9. (no text on slide)
  10. (no text on slide)
  11. (no text on slide)
  12. Prediction: In the future, we will all be voice developers.

  13. Part 2: How we got here

  14. Natural Language Processing (NLP)

    Definition: the ability of a computer program to understand human language as it is spoken. NLP is a component of artificial intelligence (AI). (source) Circa 1960; first major advances in the 1980s.
  15. Machine Learning (ML)

    Definition: Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves. (source) Circa 1950s; first major advances in the 1980s.
  16. Moore's law is alive and well: https://hackernoon.com/moores-law-is-alive-and-well-adc010ea7a63

  17. Timeline:

    • 1987: Apple's Knowledge Navigator
    • 1994: IBM Simon is the world's first smartphone
    • 2009: Microsoft begins work on Cortana
    • 2010: Siri launches as a standalone iOS app, later acquired by Apple
    • 2012: Google Now
  18. Timeline, continued:

    • 2014: Amazon Echo launches (November 6)
    • 2015: Amazon Alexa Skills Kit launches (June)
    • 2016: Google Assistant (May) + Google Home (November)
    • 2017: Samsung Bixby (August) + Microsoft Cortana (October)
    • 2018: Apple HomePod (February)
  19. “smart speakers”

  20. “smart speakers”

  21. Rise of multimodal devices

  22. Chatbots

    Definition: a computer program or an artificial intelligence which conducts a conversation via auditory or textual methods. Such programs are often designed to convincingly simulate how a human would behave as a conversational partner, thereby passing the Turing test. (source) Circa 1966 (ELIZA); first major advances in the early 2000s.
  23. Google all-in on chatbot tech

  24. (no text on slide)
  25. Part 3: Where we are today

  26. The major players

  27. (no text on slide)
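  28. The request/response flow (diagram): a request goes to your code, and your code sends back a response

  29. The request/response flow, annotated: "P.S. all the NLP and ML happens here", i.e. on the platform's side, before the request ever reaches your code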
  30. (no text on slide)
  31. Voice for the web: the Web Speech API

    • Speech recognition
    • Speech synthesis (text-to-speech, TTS)
    • W3C Community Group specification published in 2012
    • The SpeechRecognition interface is currently supported only in Chrome, as an experimental feature
    • Chrome uses Google's servers to convert speech to text (requires an Internet connection)
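
The two halves of the Web Speech API can be sketched in TypeScript as follows. This is a minimal sketch, assuming a browser that exposes `SpeechRecognition` (or Chrome's prefixed `webkitSpeechRecognition`) and `speechSynthesis`. The helper names `detectSpeechSupport`, `speak`, and `listenOnce` are hypothetical, and the `scope` parameter (normally `window`) exists only so the code can also run outside a browser.

```typescript
// Sketch of the Web Speech API against an injectable `scope` (normally `window`).
type SpeechSupport = { recognition: boolean; synthesis: boolean };

function detectSpeechSupport(scope: object): SpeechSupport {
  return {
    // Chrome exposes the recognition constructor under a vendor prefix.
    recognition: "SpeechRecognition" in scope || "webkitSpeechRecognition" in scope,
    synthesis: "speechSynthesis" in scope && "SpeechSynthesisUtterance" in scope,
  };
}

// Text-to-speech: queue one utterance on the synthesizer.
// Returns false instead of throwing when the API is unavailable (e.g. in Node.js).
function speak(text: string, scope: any = globalThis): boolean {
  if (!detectSpeechSupport(scope).synthesis) return false;
  const utterance = new scope.SpeechSynthesisUtterance(text);
  scope.speechSynthesis.speak(utterance);
  return true;
}

// Speech recognition: capture one final result and hand its transcript on.
// In Chrome the audio is sent to Google's servers, so this needs a network connection.
function listenOnce(onTranscript: (text: string) => void, scope: any = globalThis): boolean {
  const Recognition = scope.SpeechRecognition ?? scope.webkitSpeechRecognition;
  if (!Recognition) return false;
  const recognition = new Recognition();
  recognition.lang = "en-US";
  recognition.interimResults = false; // only deliver final results
  recognition.onresult = (event: any) => {
    // results[0] is the first recognized phrase; [0] is its top alternative.
    onTranscript(event.results[0][0].transcript);
  };
  recognition.start();
  return true;
}
```

In a page, `listenOnce((text) => speak("You said " + text))` would echo back what the user said. Both helpers return `false` rather than throwing when the API is missing, which matters given how uneven browser support is.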
  32. Mozilla Common Voice: "Common Voice is Mozilla's initiative to help teach machines how real people speak." (source)

    • Publicly open dataset
    • Upload recordings of your voice
    • Help reduce bias in Natural Language Processing (NLP) & machine learning
  33. Open source alternatives: Mozilla Scout, Mycroft

  34. The enterprise voice space (diagram labels: NLP/ML/AI; consumer applications)

  35. Part 4: How to build for voice today

  36. Disclaimer: it's not about the code

  37. The stack

    • Node.js
    • serverless (AWS Lambda or Google Cloud Functions)
    • lightweight database (e.g. Amazon DynamoDB)
    • CI server of choice (e.g. Travis, Jenkins, etc.)
    • unit testing framework of choice (e.g. Jest)
    • TypeScript…?
  38. Use the official SDKs

    • Alexa Node.js SDK: github.com/alexa/alexa-skills-kit-sdk-for-nodejs
    • Actions on Google Node.js SDK: github.com/actions-on-google/actions-on-google-nodejs
  39. Glossary

    Alexa            | Google / Dialogflow
    skill            | action / agent
    invocation name  | invocation name
    intent           | intent
    slot             | entity
    sample utterance | training phrase
  40. “Alexa, ask NPR One to play the latest news”

  41. “Alexa, ask NPR One to play the latest news” (annotated: “NPR One” is the invocation name)
  42. “Alexa, ask NPR One what's playing”

  43. “Alexa, ask NPR One what's playing” / “Alexa, ask NPR One what am I listening to?”
  44. “Alexa, ask NPR One what's playing” / “Alexa, ask NPR One what am I listening to?” (annotated: both phrasings trigger the same intent, and each one is a sample utterance)
  45. “Alexa, ask NPR One to play Hidden Brain”

  46. “Alexa, ask NPR One to play Hidden Brain” (annotated: the request to play is the intent, and “Hidden Brain” is a slot, or entity)
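
The vocabulary above maps onto a skill's interaction model, which is configuration rather than code. Below is a sketch of such a model as a TypeScript object, using field names from Alexa's interaction-model JSON (`invocationName`, `intents`, `samples`, `slots`, `types`), but only a small subset of them. The intent names, the `podcast` slot, and the `PODCAST_NAME` slot type are hypothetical, not NPR One's actual model, and `findUndeclaredSlots` is a hypothetical pre-deployment check.

```typescript
// Simplified shapes mirroring Alexa's interaction-model JSON (subset only).
interface SlotDef { name: string; type: string }
interface IntentDef { name: string; slots?: SlotDef[]; samples: string[] }
interface InteractionModel {
  interactionModel: {
    languageModel: {
      invocationName: string;
      intents: IntentDef[];
      types?: { name: string; values: { name: { value: string } }[] }[];
    };
  };
}

// Hypothetical model covering the utterances on the slides.
const model: InteractionModel = {
  interactionModel: {
    languageModel: {
      invocationName: "npr one", // invocation names are lowercase
      intents: [
        {
          name: "WhatsPlayingIntent", // one intent, several sample utterances
          samples: ["what's playing", "what am I listening to"],
        },
        {
          name: "PlayPodcastIntent",
          slots: [{ name: "podcast", type: "PODCAST_NAME" }],
          samples: ["play {podcast}", "play the podcast {podcast}"],
        },
      ],
      types: [{ name: "PODCAST_NAME", values: [{ name: { value: "Hidden Brain" } }] }],
    },
  },
};

// Report "{slot}" references in sample utterances that the intent never
// declares, formatted as "IntentName.slotName".
function findUndeclaredSlots(m: InteractionModel): string[] {
  const problems: string[] = [];
  for (const intent of m.interactionModel.languageModel.intents) {
    const declared = (intent.slots ?? []).map((s) => s.name);
    for (const sample of intent.samples) {
      const slotRef = /\{(\w+)\}/g; // matches "{podcast}" and captures "podcast"
      let match: RegExpExecArray | null;
      while ((match = slotRef.exec(sample)) !== null) {
        if (declared.indexOf(match[1]) === -1) problems.push(`${intent.name}.${match[1]}`);
      }
    }
  }
  return problems;
}
```

Note that both WhatsPlayingIntent samples map to one intent, exactly as on the slides. The platform, not this code, decides which intent and slot value a spoken phrase matches.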
  47. Basic code architecture

    • A JSON event arrives with the intent name and slot(s)
    • A handler function mapped to that intent runs
    • Use the SDK to produce a response with speech, audio, etc.
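
The three steps can be sketched without the SDK, which makes the JSON visible. This is a minimal sketch, assuming a simplified subset of Alexa's request and response JSON (`request.type`, `request.intent.name`, `request.intent.slots`, `outputSpeech`, `shouldEndSession`). The intent names and the `say`/`route` helpers are hypothetical, only `LaunchRequest` and `IntentRequest` are handled, and in practice the official SDK does this routing for you.

```typescript
// Simplified subset of Alexa's request/response JSON (only the fields used here).
interface AlexaSlot { name: string; value?: string }
interface AlexaRequestEnvelope {
  request: {
    type: string; // e.g. "LaunchRequest" or "IntentRequest"
    intent?: { name: string; slots?: { [slotName: string]: AlexaSlot } };
  };
}
interface AlexaResponseEnvelope {
  version: string;
  response: {
    outputSpeech: { type: "PlainText"; text: string };
    shouldEndSession: boolean;
  };
}

type IntentHandler = (slots: { [slotName: string]: AlexaSlot }) => AlexaResponseEnvelope;

// Step 3 helper: wrap plain text in a response envelope.
function say(text: string, endSession = true): AlexaResponseEnvelope {
  return {
    version: "1.0",
    response: { outputSpeech: { type: "PlainText", text }, shouldEndSession: endSession },
  };
}

// Step 2: one handler function per intent name (hypothetical intents).
const intentHandlers: { [intentName: string]: IntentHandler } = {
  WhatsPlayingIntent: () => say("Here's what's playing right now."),
  PlayPodcastIntent: (slots) => {
    const podcast = slots.podcast ? slots.podcast.value : undefined;
    // Without a podcast name, ask for one and keep the session open for the answer.
    return podcast ? say(`Playing ${podcast}.`) : say("Which podcast would you like to hear?", false);
  },
};

// Step 1: the platform has already run the NLP, so routing is just a lookup
// on the intent name; no language understanding happens in this code.
function route(event: AlexaRequestEnvelope): AlexaResponseEnvelope {
  const { type, intent } = event.request;
  if (type === "LaunchRequest") return say("Welcome! What would you like to hear?", false);
  if (type !== "IntentRequest" || !intent) return say("Sorry, I can't help with that yet.");
  const intentHandler = intentHandlers[intent.name];
  if (!intentHandler) return say("Sorry, I can't help with that yet.");
  return intentHandler(intent.slots ?? {});
}
```

In a serverless deployment, the exported Lambda (or Cloud Functions) handler would simply return `route(event)`. Keeping the routing in a plain function also makes it easy to unit test with a framework such as Jest.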
  48. Alexa "Hello World" skill

  49. Alexa "Hello World" skill

  50. Alexa "Hello World" skill

  51. Alexa "Hello World" skill

  52. Google "Hello World" action

  53. Google "Hello World" action
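
On the Google side, an action built with Dialogflow calls a webhook for fulfillment. Here is a sketch of such a webhook, assuming the Dialogflow v2 webhook format (`queryResult.intent.displayName`, `queryResult.parameters`, and a `fulfillmentText` reply) rather than the actions-on-google library. The `play_podcast` intent and `podcast` parameter are hypothetical, and the entry point types only the parts of the Express-style request/response objects it uses.

```typescript
// Simplified subset of the Dialogflow (v2 API) webhook JSON. Real parameter
// values can also be numbers, lists, or objects; strings suffice here.
interface WebhookRequest {
  queryResult: {
    queryText: string; // what the user said
    intent: { displayName: string };
    parameters: { [paramName: string]: string };
  };
}
interface WebhookResponse { fulfillmentText: string }

// Route on the matched intent's display name. "Default Welcome Intent" is the
// name of Dialogflow's built-in welcome intent; "play_podcast" is hypothetical.
function fulfill(req: WebhookRequest): WebhookResponse {
  const { intent, parameters } = req.queryResult;
  switch (intent.displayName) {
    case "Default Welcome Intent":
      return { fulfillmentText: "Hello World! What would you like to hear?" };
    case "play_podcast": {
      const podcast = parameters.podcast; // entity value extracted by Dialogflow
      return { fulfillmentText: podcast ? `Playing ${podcast}.` : "Which podcast would you like to hear?" };
    }
    default:
      return { fulfillmentText: "Sorry, I didn't get that." };
  }
}

// HTTP entry point in the Express-style (req, res) shape that Google Cloud
// Functions uses; only the members used here are typed.
function dialogflowWebhook(
  req: { body: WebhookRequest },
  res: { json: (body: WebhookResponse) => void },
): void {
  res.json(fulfill(req.body));
}
```

The structure mirrors the Alexa side: the platform has already matched the intent and extracted the entity, and your code only maps an intent name to a reply.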

  54. Next, add more features

    Simple: • Launch requests • Add intents • Wait for response • Handle slots • Add images to cards • Play simple audio • Use SSML
    Medium: • User login (OAuth2) • Persistent data • State management • Contexts (Google) • Dialog management (Alexa) • Advanced audio (Alexa)
    Advanced: • Customize display on visual devices • Monetization / transactions • Request user permissions • Notifications • Internationalization
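
Two of the "Simple" features, "Use SSML" and "Wait for response", both come down to the shape of the response. This sketch targets Alexa's response JSON and assumes `outputSpeech` of type `"SSML"`, a `reprompt`, and `shouldEndSession: false` to keep listening for the user's answer. `<speak>`, `<break>`, and `<audio>` are standard SSML tags that Alexa supports, while `toSsml`, `ask`, and any audio URLs are hypothetical. Because SSML is XML, user-facing text has to be escaped before it is embedded.

```typescript
// Escape the five XML special characters so text can't break the SSML markup.
// "&" must be replaced first so the other entities aren't double-escaped.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Hypothetical mini-builder: each part is plain text, a pause, or an audio clip.
type SsmlPart = { text: string } | { pauseMs: number } | { audioUrl: string };

function toSsml(parts: SsmlPart[]): string {
  const body = parts
    .map((part) => {
      if ("text" in part) return escapeXml(part.text);
      if ("pauseMs" in part) return `<break time="${part.pauseMs}ms"/>`;
      return `<audio src="${escapeXml(part.audioUrl)}"/>`;
    })
    .join(" ");
  return `<speak>${body}</speak>`;
}

// "Wait for response": ask a question, keep the session open, and supply a
// reprompt that is spoken if the user doesn't answer.
function ask(questionSsml: string, repromptSsml: string) {
  return {
    version: "1.0",
    response: {
      outputSpeech: { type: "SSML", ssml: questionSsml },
      reprompt: { outputSpeech: { type: "SSML", ssml: repromptSsml } },
      shouldEndSession: false,
    },
  };
}
```

For example, `toSsml([{ text: "Tom & Jerry" }, { pauseMs: 500 }, { text: "Ready?" }])` produces `<speak>Tom &amp; Jerry <break time="500ms"/> Ready?</speak>`. Clips referenced with `<audio>` must be served over HTTPS and meet the platform's audio format requirements.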
  55. Publishing a skill/action

    • By default, skills in development are only available to you
    • Certification process similar to mobile app store submissions
        ◦ ~48-hour turnaround on average
        ◦ Feedback is unpredictable!
        ◦ Respect existing brands
    • Can share a "beta" version with co-workers, friends, etc.
        ◦ Great for QA as well as hobby projects
  56. Get started without code

    • Alexa: Alexa Skill Blueprints
    • Google: Actions on Google Templates
  57. Part 5: Lessons learned

  58. The challenges

    • The code is not hard
    • What is hard:
        ◦ Learning about the platform limitations
        ◦ Managing stakeholder expectations
        ◦ Understanding & changing user behaviors
        ◦ QA
    • It's not just an engineering challenge!
  59. Audio-first development

    • You would think these platforms are ideal for audio… but they're not!
    • It's clear the companies designing these platforms are still focused primarily on text-to-speech (TTS)
    • The Actions on Google audio player is almost unusable
    • The Alexa audio player has many features but is very unintuitive when you first work with it
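
Part of what makes the Alexa audio player unintuitive is that playback is driven by directives rather than by speech. The sketch below builds an `AudioPlayer.Play` directive and assumes the field names from Alexa's AudioPlayer interface (`playBehavior`, `audioItem.stream.url`, `token`, `expectedPreviousToken`, `offsetInMilliseconds`). The `playDirective` helper, its options, and the example URLs and tokens are hypothetical.

```typescript
type PlayBehavior = "REPLACE_ALL" | "ENQUEUE" | "REPLACE_ENQUEUED";

interface AudioPlayerPlayDirective {
  type: "AudioPlayer.Play";
  playBehavior: PlayBehavior;
  audioItem: {
    stream: {
      url: string; // must be HTTPS
      token: string; // identifies this stream in later AudioPlayer requests
      expectedPreviousToken?: string; // token of the stream this one follows (for ENQUEUE)
      offsetInMilliseconds: number; // where to start playback
    };
  };
}

// Hypothetical helper that validates the two rules that most often trip people up.
function playDirective(
  url: string,
  token: string,
  options: { behavior?: PlayBehavior; previousToken?: string; offsetMs?: number } = {},
): AudioPlayerPlayDirective {
  const behavior = options.behavior ?? "REPLACE_ALL";
  if (url.indexOf("https://") !== 0) throw new Error("Alexa only streams audio over HTTPS");
  if (behavior === "ENQUEUE" && !options.previousToken) {
    throw new Error("ENQUEUE needs the token of the currently playing stream");
  }
  const stream: AudioPlayerPlayDirective["audioItem"]["stream"] = {
    url,
    token,
    offsetInMilliseconds: options.offsetMs ?? 0,
  };
  if (options.previousToken) stream.expectedPreviousToken = options.previousToken;
  return { type: "AudioPlayer.Play", playBehavior: behavior, audioItem: { stream } };
}
```

The directive goes in the response's `directives` array. Its `token` comes back to your code in later AudioPlayer requests such as `AudioPlayer.PlaybackNearlyFinished`, which is typically where you would enqueue the next item with `ENQUEUE` and `previousToken` set to that token.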
  60. Error handling is hard

    • Invisible errors:
        ◦ The Alexa service / Google / Dialogflow can reject a user's request
        ◦ If that happens, your app is not notified at all!
        ◦ Logs/analytics can't tell the whole story
        ◦ Users often don't understand why it failed
    • Real user testing is critical!
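
Errors where the platform rejects the request never reach your code, so no code can log them. What can be captured is the narrower case Alexa does report: when a session ends, your skill receives a `SessionEndedRequest` whose `reason` is `USER_INITIATED`, `ERROR`, or `EXCEEDED_MAX_REPROMPTS`, with an `error` object (`type`, `message`) when the reason is `ERROR`. Here is a minimal sketch that turns that request into a structured log line. The `describeSessionEnd` name and the log format are hypothetical.

```typescript
// Subset of Alexa's SessionEndedRequest (only the fields used here).
interface SessionEndedRequest {
  type: "SessionEndedRequest";
  reason: "USER_INITIATED" | "ERROR" | "EXCEEDED_MAX_REPROMPTS";
  error?: { type: string; message: string };
}

// Log one JSON line per ended session. Errors go to stderr so they are easy to
// filter; the line is also returned so it could be forwarded to analytics.
function describeSessionEnd(req: SessionEndedRequest, sessionId: string): string {
  const entry = {
    event: "session_ended",
    sessionId,
    reason: req.reason,
    errorType: req.error ? req.error.type : null,
    errorMessage: req.error ? req.error.message : null,
  };
  const line = JSON.stringify(entry);
  if (req.reason === "ERROR") console.error(line);
  else console.log(line);
  return line;
}
```

On AWS Lambda, console output ends up in CloudWatch Logs. This still covers only the failures the platform chooses to report, which is why real user testing remains necessary.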
  61. Prediction: In the future, we will all be voice developers.

  62. The ideal voice developer

    • Resilience
    • Patience
    • Honesty
    • Openness
    • Empathy
        ◦ Front-end developers have an edge!
  63. Women in Voice: womeninvoice.wordpress.com, @WomenInVoice, womeninvoice.slack.com, linkedin.com/groups/13612934

  64. NPR's Voice team is hiring!

    • Mid-to-senior software engineer: direct link
    • Product manager: direct link
    • n.pr/tech-jobs
  65. Thank you! Keep in touch: @xiehan nara@nara.codes