Slide 1

Slide 1 text

Implementing a custom AI Voice Assistant by streaming WebRTC to Dialogflow & Cloud Speech
Lee Boonstra, Developer Advocate, Google
Twitter: @ladysign

Slide 2

Slide 2 text

Lee Boonstra, Developer Advocate, Conversational AI
Ex-JavaScript Technical Trainer
Public Speaker (since 2013)
Writer/Blogger for Techzine, .Net Magazine, Marketingfacts.nl, CustomerTalk.nl and the Google Cloud Blog
@ladysign
http://www.leeboonstra.com

Slide 3

Slide 3 text

What is an AI voice assistant? An intelligent software agent that can perform tasks for an individual based on commands or questions, understands human speech, and responds with a synthesized voice.

Slide 4

Slide 4 text

Why would you create your own voice AI? This is not a talk about the Google Assistant! It's about streaming your own microphone audio to a back-end that can give smart answers with machine learning.

Slide 5

Slide 5 text

Google Assistant facts:
● Publicly available
● Runs on all Assistant-powered devices
● Native Google Assistant features
● Invoke actions ("Hey Google, talk to …")
● Customer Terms & Conditions
● Special technical requirements
(Diagram: your business action alongside actions from someone else, such as a Weather Action, a Recipes Action, Nest Thermostat, Alarm Clock.)
You might want to build your own voice AI instead because of technical requirements, overkill, or enterprise usage.
Lee Boonstra | @ladysign

Slide 6

Slide 6 text

Use Cases for custom AI voice assistants Lee Boonstra | @ladysign

Slide 7

Slide 7 text

Short Utterance vs. Streaming
"Turn on the lights" is a short utterance -> match to 1 intent.
A long utterance -> many possible intent matches.
Lee Boonstra | @ladysign

Slide 8

Slide 8 text

The idea: Airport Self Service Kiosk Demo

Slide 9

Slide 9 text

App Demo

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

A best practice for streaming audio from a browser microphone to Dialogflow or Google Cloud STT using WebSockets.

Slide 12

Slide 12 text

Architecture
User records voice with the browser microphone.
Web Client: Angular App with WebRTC (Docker Container).
Browser plays audio.
Lee Boonstra | @ladysign

Slide 13

Slide 13 text

Architecture
User records voice with the browser microphone.
Web Client: Angular App with WebRTC (Docker Container).
Browser plays audio.
How to get a microphone audio stream which works across all browsers?
How to make sure the audio stream can be handled as an ArrayBuffer in the back-end?

Slide 14

Slide 14 text

RecordRTC: a WebRTC JavaScript library for audio/video recording, as well as screen activity recording. Lee Boonstra | @ladysign

Slide 15

Slide 15 text

Architecture
User records voice with the browser microphone.
Web Client: Angular App with WebRTC (Docker Container).
Server: Node.js Codebase (Docker Container).
Browser plays audio.
How to stream from front-end to back-end?
How to stream bidirectional binary data?
Lee Boonstra | @ladysign

Slide 16

Slide 16 text

● Socket.io: enables real-time bidirectional event-based communication.
● Socket.io-stream: for binary stream transfers through Socket.io.
Lee Boonstra | @ladysign

Slide 17

Slide 17 text

RecordRTC (web) Lee Boonstra | @ladysign
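The extracted slide text only keeps the title, so here is a minimal sketch of what the RecordRTC setup might look like for this architecture (mono 16 kHz WAV via StereoAudioRecorder); the exact options on the original slide may differ.

```js
// Sketch: capture microphone audio in the browser with RecordRTC.
import RecordRTC from 'recordrtc';

let recorder;

navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {
  recorder = RecordRTC(stream, {
    type: 'audio',
    mimeType: 'audio/wav',
    recorderType: RecordRTC.StereoAudioRecorder, // WAV output, consistent across browsers
    numberOfAudioChannels: 1,                    // mono keeps the STT config simple
    desiredSampRate: 16000                       // match the sample rate sent to Cloud STT
  });
  recorder.startRecording();
});

// Call this (e.g. from a "stop" button) to get the recorded audio as a Blob.
function stopRecording(onBlob) {
  recorder.stopRecording(() => onBlob(recorder.getBlob()));
}
```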

Slide 18

Slide 18 text

Socket Stream (web) Lee Boonstra | @ladysign
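Again only the title survives in the slide text. A sketch of the client side, assuming a Socket.io connection to the Node server and a hypothetical 'stream-audio' event name:

```js
// Sketch: send the recorded Blob to the server as a binary stream.
import io from 'socket.io-client';
import ss from 'socket.io-stream';

const socket = io('https://localhost:3000'); // assumed server URL

function sendBlob(blob) {
  const stream = ss.createStream();
  // 'stream-audio' is a hypothetical event name; the server listens for the same one.
  ss(socket).emit('stream-audio', stream, { name: 'audio.wav' });
  ss.createBlobReadStream(blob).pipe(stream);
}
```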

Slide 19

Slide 19 text

Socket Stream (node) Lee Boonstra | @ladysign
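The matching server side, sketched under the same assumptions (Express + Socket.io + socket.io-stream, hypothetical 'stream-audio' event):

```js
// Sketch: receive the binary audio stream and collect it into a Buffer.
const express = require('express');
const http = require('http');
const socketIo = require('socket.io');
const ss = require('socket.io-stream');

const app = express();
const server = http.createServer(app);
const io = socketIo(server);

io.on('connection', (socket) => {
  ss(socket).on('stream-audio', (stream, data) => {
    const chunks = [];
    stream.on('data', (chunk) => chunks.push(chunk));
    stream.on('end', () => {
      const audioBuffer = Buffer.concat(chunks); // the full WAV recording as a Buffer
      // hand audioBuffer to Cloud STT or Dialogflow here
    });
  });
});

server.listen(3000);
```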

Slide 20

Slide 20 text

Lee Boonstra | @ladysign Google Speech-to-Text enables developers to convert audio to text by applying powerful neural network models in an easy-to-use API. The API recognizes 120 languages and variants to support your global user base. You can enable voice command-and-control, transcribe audio from call centers, and more.

Slide 21

Slide 21 text

STT (node) Lee Boonstra | @ladysign
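A sketch of the Cloud Speech-to-Text call on the server, answering the encoding / sample-rate / channel questions from the architecture slide; it assumes the mono 16 kHz LINEAR16 audio produced by the RecordRTC sketch above.

```js
// Sketch: transcribe the received audio Buffer with Cloud Speech-to-Text.
const speech = require('@google-cloud/speech');
const sttClient = new speech.SpeechClient();

async function transcribe(audioBuffer) {
  const request = {
    config: {
      encoding: 'LINEAR16',      // uncompressed WAV/PCM from StereoAudioRecorder
      sampleRateHertz: 16000,    // must match the browser recorder settings
      audioChannelCount: 1,
      languageCode: 'en-US'
    },
    audio: { content: audioBuffer.toString('base64') }
  };

  const [response] = await sttClient.recognize(request);
  return response.results
    .map((result) => result.alternatives[0].transcript)
    .join('\n');
}
```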

Slide 22

Slide 22 text

Architecture
User records voice with the browser microphone.
Web Client: Angular App with WebRTC (Docker Container).
Server: Node.js Codebase (Docker Container).
Cloud STT: Voice to text.
Browser plays audio.
AudioBuffer? Encoding? Sample rate? Number of channels?
How to get text from the HTML5 browser microphone stream?
Lee Boonstra | @ladysign

Slide 23

Slide 23 text

STT Node Demo

Slide 24

Slide 24 text

Lee Boonstra | @ladysign
Dialogflow: a development suite for building Conversational UIs.
● Formerly known as API.AI
○ (September 2016, acquired by Google)
● Powered by Machine Learning:
○ Natural Language Understanding (NLU)
○ Intent Matching
○ Conversation Training
● Cross platform
● Build faster with the Web UI
● Scalable: separate your conversation from code
● Speech / Voice integration
● Multi-lingual bot support (20+ languages)
● Direct integration with 15+ channels like Google Assistant, Slack, Twilio, Facebook...

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

Architecture
User records voice with the browser microphone.
Web Client: Angular App with WebRTC (Docker Container).
Server: Node.js Codebase (Docker Container).
Cloud STT: Voice to text.
Dialogflow: Intent matching.
Browser plays audio.
Translate the user's language to English so Dialogflow can detect intents in the base language, then translate the response back.
Lee Boonstra | @ladysign

Slide 30

Slide 30 text

Dialogflow (node) Lee Boonstra | @ladysign
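A sketch of the detectIntent call that sends the audio Buffer straight to Dialogflow on the server; the project ID, session ID and language code are placeholders.

```js
// Sketch: detect an intent from the audio Buffer with the Dialogflow SDK.
const dialogflow = require('dialogflow'); // newer releases: @google-cloud/dialogflow
const sessionClient = new dialogflow.SessionsClient();

async function detectIntent(projectId, sessionId, audioBuffer) {
  const sessionPath = sessionClient.sessionPath(projectId, sessionId);

  const request = {
    session: sessionPath,
    queryInput: {
      audioConfig: {
        audioEncoding: 'AUDIO_ENCODING_LINEAR_16',
        sampleRateHertz: 16000,
        languageCode: 'en-US'
      }
    },
    inputAudio: audioBuffer
  };

  const [response] = await sessionClient.detectIntent(request);
  return response.queryResult; // matched intent, parameters and fulfillment text
}
```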

Slide 31

Slide 31 text

Lee Boonstra | @ladysign
TTS: making use of DeepMind's WaveNet technology
When you sound like a computer, people treat you like a computer. This is where WaveNet technology comes in.
● Voices sound natural and unique
● Captures subtleties like pitch, pace, and all the pauses that convey meaning
https://deepmind.com/blog/wavenet-generative-model-raw-audio/

Slide 32

Slide 32 text

TTS (node) Lee Boonstra | @ladysign
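A sketch of the Text-to-Speech call that turns the Dialogflow fulfillment text into a WaveNet voice; the voice name is just one example of an available WaveNet voice.

```js
// Sketch: synthesize the bot response with a WaveNet voice.
const textToSpeech = require('@google-cloud/text-to-speech');
const ttsClient = new textToSpeech.TextToSpeechClient();

async function synthesize(text) {
  const request = {
    input: { text },
    voice: {
      languageCode: 'en-US',
      name: 'en-US-Wavenet-F' // example WaveNet voice name
    },
    audioConfig: {
      audioEncoding: 'LINEAR16',
      sampleRateHertz: 16000
    }
  };

  const [response] = await ttsClient.synthesizeSpeech(request);
  return response.audioContent; // Buffer, streamed back to the web client over the socket
}
```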

Slide 33

Slide 33 text

Architecture
User records voice with the browser microphone.
Web Client: Angular App with WebRTC (Docker Container).
Server: Node.js Codebase (Docker Container).
Cloud STT: Voice to text. Dialogflow: Intent matching. Cloud TTS: Spoken voice.
Translate the language to English and back.
Browser plays audio.
How to bring an AudioBuffer to the web client?
How to make it autoplay in all browsers?
Lee Boonstra | @ladysign

Slide 34

Slide 34 text

AudioBufferSourceNode (web) Lee Boonstra | @ladysign
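A sketch of playing the returned audio in the browser with the Web Audio API; resuming the AudioContext on a user gesture is one common workaround for autoplay restrictions, not necessarily the one on the original slide.

```js
// Sketch: decode the ArrayBuffer from the server and play it with an AudioBufferSourceNode.
const audioContext = new (window.AudioContext || window.webkitAudioContext)();

function playAudio(arrayBuffer) {
  audioContext.decodeAudioData(arrayBuffer, (audioBuffer) => {
    const source = audioContext.createBufferSource();
    source.buffer = audioBuffer;
    source.connect(audioContext.destination);
    source.start(0);
  });
}

// Most browsers block audio until a user gesture; resume the context on the first click/tap.
document.addEventListener('click', () => audioContext.resume(), { once: true });
```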

Slide 35

Slide 35 text

Architecture
User records voice with the browser microphone.
Web Client: Angular App with WebRTC (Docker Container).
Server: Node.js Codebase (Docker Container).
Cloud STT: Voice to text. Dialogflow: Intent matching. Cloud TTS: Spoken voice.
Translate the language to English and back.
Browser plays audio.
Need HTTPS? SSL certificates?
Lee Boonstra | @ladysign
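getUserMedia only works in a secure context (HTTPS or localhost), so the web client has to be served over SSL. A sketch using a self-signed certificate for local development; the certificate file paths are placeholders, and in production SSL would typically be terminated by a load balancer or a real certificate.

```js
// Sketch: serve the Express app over HTTPS so getUserMedia is allowed.
const fs = require('fs');
const https = require('https');
const express = require('express');

const app = express();

const options = {
  key: fs.readFileSync('ssl/server.key'),  // placeholder paths to a (self-signed) certificate
  cert: fs.readFileSync('ssl/server.crt')
};

https.createServer(options, app).listen(8443);
```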

Slide 36

Slide 36 text

I open sourced my code: https://github.com/dialogflow/selfservicekiosk-audio-streaming Lee Boonstra | @ladysign

Slide 37

Slide 37 text

Thanks! Lee Boonstra | @ladysign Github: https://github.com/dialogflow/selfservicekiosk-audio-streaming Slides: https://bit.ly/2OZN9yg

Slide 38

Slide 38 text

Hey, why not a Single Page App? Anti-pattern: I've seen solutions online where the microphone stream was sent directly to Dialogflow, without a server component; the REST calls were made directly in the web client with JavaScript. I would consider this an anti-pattern. You will likely expose your service account / private key in your client-side code. Anyone who is handy with Chrome DevTools could steal your key and make (paid) API calls via your account. It's a better approach to always let a server handle the Google Cloud authentication. This way the service account won't be exposed to the public.