Implementing a custom AI voice Assistant by streaming WebRTC to Dialogflow & Cloud Speech

Lee Boonstra
February 18, 2020

A best practice for streaming audio from a browser microphone to Dialogflow or Google Cloud STT by using websockets.

Transcript

  1. Implementing a custom AI voice Assistant
    by streaming WebRTC to Dialogflow & Cloud Speech
    Lee Boonstra
    Developer Advocate, Google
    Twitter: @ladysign

  2. Lee Boonstra
    Developer Advocate, Conversational AI
    Ex- JavaScript Technical Trainer
    Public Speaker (since 2013)
    Writer/Blogger for Techzine, .Net Magazine,
    Marketingfacts.nl, CustomerTalk.nl and
    Google Cloud Blog
    @ladysign
    http://www.leeboonstra.com

  3. What is an AI voice assistant?
    An intelligent software agent that can perform tasks for an
    individual based on spoken commands or questions, and
    respond with a synthesized voice.

  4. Why would you create your own voice AI?
    This is not a talk about the Google Assistant!
    It’s about streaming your own microphone to
    a back-end that can give smart answers with
    machine learning.

  5. Google Assistant facts:
    ● Publicly available
    ● Runs on all Assistant-powered devices
    ● Native Google Assistant features
    ● Invoke actions (Hey Google, talk to ...)
    ● Customer Terms & Conditions
    ● Special technical requirements
    Diagram (PUBLIC): Your business action, Action from someone else, Weather Action, Recipes Action, Nest Thermostat, Alarm Clock
    You might want to build your own voice AI instead because of
    technical requirements, overkill, or enterprise usage.
    Lee Boonstra | @ladysign

  6. Use Cases for custom AI voice assistants

  7. Short Utterance vs. Streaming
    ● Short utterance (e.g. "Turn on the lights") -> match to 1 intent.
    ● Long utterance -> many possible intent matches.

  8. The idea:
    Airport Self Service Kiosk Demo

  9. App Demo

  11. A best practice for streaming audio from a browser
    microphone to Dialogflow or Google Cloud STT by
    using websockets.

  12. Architecture
    Web Client (Docker container): Angular app with WebRTC
    ● User records voice with browser microphone
    ● Browser plays audio

  13. Architecture
    Web Client (Docker container): Angular app with WebRTC
    ● User records voice with browser microphone
    ● Browser plays audio
    Questions:
    ● How to get a microphone audio stream that works across all browsers?
    ● How to make sure the audio stream can be handled as an ArrayBuffer in the back-end?

  14. RecordRTC
    RecordRTC is a WebRTC JavaScript library
    for recording audio, video, and screen
    activity.
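
To make the recording step more concrete, here is a minimal sketch of a RecordRTC configuration that asks for mono, 16 kHz audio delivered in time slices. The option names follow RecordRTC's documented configuration; the helper name `buildRecorderOptions` and the chosen values are assumptions for illustration, not settings taken from the talk.

```typescript
// Subset of RecordRTC's configuration options used in this sketch.
interface RecorderOptions {
  type: "audio";
  mimeType: string;
  desiredSampRate: number;      // target sample rate (honored by StereoAudioRecorder)
  numberOfAudioChannels: 1 | 2; // 1 = mono
  timeSlice: number;            // milliseconds between ondataavailable calls
  ondataavailable: (blob: unknown) => void;
}

// Hypothetical helper: returns options for mono 16 kHz audio in 4-second slices.
function buildRecorderOptions(onChunk: (blob: unknown) => void): RecorderOptions {
  return {
    type: "audio",
    mimeType: "audio/wav",
    desiredSampRate: 16000,   // assumption: must match the rate told to the speech API
    numberOfAudioChannels: 1,
    timeSlice: 4000,          // assumption: any slice length would do
    ondataavailable: onChunk,
  };
}

// In the browser (not executed here) the options would be used roughly like this:
//   const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
//   const recorder = new RecordRTC(stream, {
//     ...buildRecorderOptions(sendToServer),
//     recorderType: RecordRTC.StereoAudioRecorder,
//   });
//   recorder.startRecording();
```

Each time slice is handed to the callback as a Blob, which can then be forwarded to the server.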

  15. Architecture
    Web Client (Docker container): Angular app with WebRTC
    ● User records voice with browser microphone
    ● Browser plays audio
    Server (Docker container): Node.js codebase
    Questions:
    ● How to stream from front-end to back-end?
    ● How to stream bidirectional binary data?

  16. ● Socket.IO - enables real-time, bidirectional,
    event-based communication.
    ● Socket.io-Stream - for binary
    stream transfers through Socket.IO
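
As a rough illustration of event-based binary transfer, the sketch below sends an audio buffer to the server as a series of chunk events. It deliberately does not use Socket.io-Stream: it only relies on an `emit(event, ...args)` method, which Socket.IO sockets provide and which accepts binary payloads such as `ArrayBuffer`. The event names and the helper `sendInChunks` are hypothetical.

```typescript
// Minimal shape of the socket used here; a Socket.IO client socket fits it.
interface EmitterLike {
  emit(event: string, ...args: unknown[]): unknown;
}

// Hypothetical event names, not taken from the talk.
const AUDIO_CHUNK = "audio-chunk";
const AUDIO_END = "audio-end";

// Emits the buffer in slices of at most chunkBytes and returns the chunk count.
function sendInChunks(socket: EmitterLike, audio: ArrayBuffer, chunkBytes: number): number {
  if (chunkBytes <= 0) throw new RangeError("chunkBytes must be positive");
  let sent = 0;
  for (let offset = 0; offset < audio.byteLength; offset += chunkBytes) {
    // slice() copies the byte range; the end index is clamped to the buffer length
    socket.emit(AUDIO_CHUNK, audio.slice(offset, offset + chunkBytes));
    sent++;
  }
  // Tell the receiver that no more chunks will follow.
  socket.emit(AUDIO_END, { chunks: sent });
  return sent;
}
```

On the server, a matching listener for each event would append the chunks to whatever consumes the audio.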

  17. RecordRTC (web)

  18. Socket Stream (web)

  19. Socket Stream (node)

  20. Google Speech-to-Text enables developers to convert
    audio to text by applying powerful neural network models
    in an easy-to-use API. The API recognizes 120 languages
    and variants to support your global user base. You can
    enable voice command-and-control, transcribe audio
    from call centers, and more.

  21. STT (node)

  22. Architecture
    Web Client (Docker container): Angular app with WebRTC
    ● User records voice with browser microphone
    ● Browser plays audio
    Server (Docker container): Node.js codebase
    ● Cloud STT: voice to text
    Questions:
    ● How to get text from the HTML5 browser microphone stream?
    ● AudioBuffer? Encoding? Sample rate? Number of channels?
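
The encoding, sample rate and channel questions come down to the audio bytes and the recognition config agreeing with each other. The sketch below shows one possible combination: converting Web Audio float samples to 16-bit PCM and describing that format with the `encoding`, `sampleRateHertz`, `audioChannelCount` and `languageCode` fields of the Speech-to-Text recognition config. The function name and the concrete values are assumptions, not the settings used in the talk.

```typescript
// Convert Web Audio samples (floats in [-1, 1]) to 16-bit signed PCM,
// the sample format that Cloud Speech-to-Text calls LINEAR16.
function floatTo16BitPcm(samples: Float32Array): Int16Array {
  return Int16Array.from(samples, (value) => {
    const s = Math.max(-1, Math.min(1, value)); // clamp out-of-range samples
    // Negative and positive ranges differ by one step in two's complement.
    return s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  });
}

// Recognition config describing the audio above. Values are assumptions and
// have to match what the browser actually recorded.
const recognitionConfig = {
  encoding: "LINEAR16",
  sampleRateHertz: 16000,
  audioChannelCount: 1,
  languageCode: "en-US",
};
```

If the recorder already produces 16-bit PCM (for example WAV output), the conversion step is unnecessary and only the config has to match.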

  23. STT Node Demo

  24. Dialogflow: development suite for building conversational UIs.
    ● Formerly known as API.AI
      ○ (Sept 2016, acquired by Google)
    ● Powered by Machine Learning:
      ○ Natural Language Understanding (NLU)
      ○ Intent Matching
      ○ Conversation Training
    ● Cross platform
    ● Build faster with the Web UI
    ● Scalable: Separate your conversation from code
    ● Speech / Voice Integration
    ● Multi-lingual bot support (20+ languages)
    ● Direct integration with 15+ channels like Google
    Assistant, Slack, Twilio, Facebook...

  29. Architecture
    Web Client (Docker container): Angular app with WebRTC
    ● User records voice with browser microphone
    ● Browser plays audio
    Server (Docker container): Node.js codebase
    ● Cloud STT: voice to text
    ● Dialogflow: intent matching
    ● Translate: lang to English (shown twice in the diagram)
    Notes:
    ● Dialogflow to detect intents.
    ● Translate text to base language, and translate back.
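
The translate round trip can be sketched as a small pipeline: translate the user's text to the agent's base language, detect the intent, then translate the reply back. The `translate` and `detectIntent` parameters are hypothetical stand-ins for wrappers around the Translation and Dialogflow clients; their signatures are assumptions made for this example.

```typescript
// Hypothetical dependency signatures; real code would wrap the cloud clients.
type Translate = (text: string, from: string, to: string) => Promise<string>;
type DetectIntent = (text: string) => Promise<{ intent: string; reply: string }>;

interface AgentAnswer {
  intent: string;
  reply: string; // in the user's language
}

async function answerInUserLanguage(
  userText: string,
  userLang: string,
  baseLang: string,
  translate: Translate,
  detectIntent: DetectIntent
): Promise<AgentAnswer> {
  // Skip both translation calls when the user already speaks the base language.
  const sameLang = userLang === baseLang;
  const query = sameLang ? userText : await translate(userText, userLang, baseLang);
  const result = await detectIntent(query);
  const reply = sameLang ? result.reply : await translate(result.reply, baseLang, userLang);
  return { intent: result.intent, reply };
}
```

Keeping the agent in a single base language means intents only have to be trained once, at the cost of two extra translation calls per turn.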

  30. Dialogflow (node)

  31. TTS - Making use of DeepMind's WaveNet Technology
    When you sound like a computer, people treat you like a computer. This
    is where WaveNet technology comes in.
    ● Voices sound natural and unique
    ● Capture subtleties like pitch, pace, and all the pauses that convey meaning
    https://deepmind.com/blog/wavenet-generative-model-raw-audio/
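
Selecting a WaveNet voice is a matter of the request sent to Cloud Text-to-Speech. The sketch below builds such a request with the `input`, `voice` and `audioConfig` fields of the synthesize call; the helper name, the default voice `en-US-Wavenet-D` and the `LINEAR16` output encoding are assumptions, not the settings shown in the talk.

```typescript
// Subset of the Text-to-Speech synthesize request used in this sketch.
interface SynthesisRequest {
  input: { text: string };
  voice: { languageCode: string; name: string };
  audioConfig: { audioEncoding: "LINEAR16" | "MP3" | "OGG_OPUS" };
}

// Hypothetical helper: builds a request for a WaveNet voice.
function buildWaveNetRequest(
  text: string,
  languageCode: string = "en-US",
  voiceName: string = "en-US-Wavenet-D" // assumption: any WaveNet voice name works here
): SynthesisRequest {
  return {
    input: { text },
    voice: { languageCode, name: voiceName },
    audioConfig: { audioEncoding: "LINEAR16" },
  };
}
```

The response to such a request carries the synthesized audio bytes, which the server can forward to the browser.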

  32. TTS (node)

  33. Architecture
    Web Client (Docker container): Angular app with WebRTC
    ● User records voice with browser microphone
    ● Browser plays audio
    Server (Docker container): Node.js codebase
    ● Cloud STT: voice to text
    ● Dialogflow: intent matching
    ● Translate: lang to English (shown twice in the diagram)
    ● Cloud TTS: spoken voice
    Questions:
    ● How to bring an AudioBuffer to the web client?
    ● How to make it autoplay in all browsers?
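
One way to play audio bytes received from the server is the Web Audio API: decode the bytes into an AudioBuffer and play it through an AudioBufferSourceNode. The sketch below declares a small structural subset of that API locally, so it type-checks outside a browser; the interface names and `playAudio` are hypothetical, while `decodeAudioData`, `createBufferSource`, `connect` and `start` are the standard Web Audio methods.

```typescript
// Structural subset of the Web Audio API used below. A browser AudioContext
// is expected to fit this shape; the interface names are made up for the sketch.
interface SourceNodeLike {
  buffer: unknown;
  connect(destination: unknown): unknown;
  start(when?: number): void;
}
interface AudioContextLike {
  destination: unknown;
  decodeAudioData(data: ArrayBuffer): Promise<unknown>;
  createBufferSource(): SourceNodeLike;
}

async function playAudio(ctx: AudioContextLike, encoded: ArrayBuffer): Promise<void> {
  // Decode the encoded audio (for example WAV bytes) into an AudioBuffer.
  const audioBuffer = await ctx.decodeAudioData(encoded);
  // A source node plays a buffer once; create a new one for every playback.
  const source = ctx.createBufferSource();
  source.buffer = audioBuffer;
  source.connect(ctx.destination);
  source.start(0);
}
```

Browser autoplay policies generally require a user gesture before an AudioContext may produce sound, so creating or resuming the context from the same click that starts the recording is one way to keep later playback allowed.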

  34. AudioBufferSourceNode (web)

  35. Architecture
    Web Client (Docker container): Angular app with WebRTC
    ● User records voice with browser microphone
    ● Browser plays audio
    Server (Docker container): Node.js codebase
    ● Cloud STT: voice to text
    ● Dialogflow: intent matching
    ● Translate: lang to English (shown twice in the diagram)
    ● Cloud TTS: spoken voice
    Questions:
    ● Need HTTPS?
    ● SSL certificates?

  36. I open sourced my code:
    https://github.com/dialogflow/selfservicekiosk-audio-streaming

  37. Thanks!
    GitHub: https://github.com/dialogflow/selfservicekiosk-audio-streaming
    Slides: https://bit.ly/2OZN9yg

  38. Hey, why not a Single Page App?
    Anti-pattern
    I've seen solutions online where the microphone is streamed directly to Dialogflow, without a server part. The REST
    calls are made directly in the web client with JavaScript. I consider this an anti-pattern: you will likely expose
    your service account / private key in your client-side code. Anyone who is handy with Chrome DevTools could steal your
    key and make (paid) API calls via your account. It's a better approach to always let a server handle the Google Cloud
    authentication. This way the service account won't be exposed to the public.
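
The server-side alternative can be sketched as a handler that keeps the credential in a closure on the server and returns only the answer to the browser. Everything here is hypothetical: `createServerHandler`, the `callApi` parameter and the string credential are stand-ins that illustrate the boundary, not the repository's code.

```typescript
// Hypothetical API call signature: the credential is passed in by the server only.
type ApiCall = (key: string, text: string) => Promise<{ reply: string }>;

function createServerHandler(serviceAccountKey: string, callApi: ApiCall) {
  // The returned function is what a socket or HTTP route would invoke per client message.
  return async function handleClientMessage(clientText: string): Promise<string> {
    // The credential is read from the enclosing server-side scope, never from the client.
    const { reply } = await callApi(serviceAccountKey, clientText);
    // Only the answer is serialized back to the browser.
    return JSON.stringify({ reply });
  };
}
```

With this shape, the browser code contains no key at all, so inspecting it in the developer tools reveals nothing that can be used to make API calls.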
