Introduction to Speech Interfaces for Web Applications

Slide 1

Slide 1 text

Introduction to Speech Interfaces for Web Applications Kevin Hakanson April 23, 2016 #minnebar @hakanson

Slide 2

Slide 2 text

Speaking with your computing device is becoming commonplace. Most of us have used Apple's Siri, Google Now, Microsoft's Cortana, or Amazon's Alexa - but how can you speak with your web application? The Web Speech API can enable a voice interface by adding both Speech Synthesis (Text to Speech) and Speech Recognition (Speech to Text) functionality. This session will introduce the core concepts of Speech Synthesis and Speech Recognition. We will evaluate the current browser support and review alternative options. See the JavaScript code and UX design considerations required to add a speech interface to your web application. Come hear if it's as easy as it sounds? @hakanson 2

Slide 3

Slide 3 text

@hakanson 3 “As businesses create their roadmaps for technology adoption, companies that serve customers should be planning for, if not already implementing, both messaging-based and voice-based Conversational UIs.” Source: “How Voice Plays into the Rise of the Conversational UI”

Slide 4

Slide 4 text

User Interfaces (UIs) • GUI – Graphical User Interface • NUI – Natural User Interface • “invisible” as the user continuously learns increasingly complex interactions • NLUI – Natural Language User Interface • linguistic phenomena such as verbs, phrases and clauses act as UI controls • VUI – Voice User Interface • voice/speech for hands-free/eyes-free interface @hakanson 4

Slide 5

Slide 5 text

Multimodal Interfaces Provides multiple modes for user to interact with system • Multimodal Input • Keyboard/Mouse • Touch • Gesture (Camera) • Voice (Microphone) • Multimodal Output • Screen • Audio Cues or Recordings • Synthesized Speech @hakanson 5

Slide 6

Slide 6 text

Design for Voice Interfaces Voice Interface • Voice Input • Recognition • Understanding • Audio Output "voice design should serve the needs of the user and solve a specific problem" @hakanson 6

Slide 7

Slide 7 text

@hakanson 7 “Normal people, when they think about speech recognition, they want the whole thing. They want recognition, they want understanding and they want an action to be taken.” Hsiao-Wuen Hon Microsoft Research Source: “Speak, hear, talk: The long quest for technology that understands speech as well as a human”

Slide 8

Slide 8 text

@hakanson 8

Slide 9

Slide 9 text

Types of Interactions • The Secretary • Recognize what is being said and record it • The Bouncer • Recognize who is speaking • The Gopher • Execute simple orders • The Assistant • Intelligently respond to natural language input @hakanson 9 Source: “Evangelizing and Designing Voice User Interface: Adopting VUI in a GUI world” Stephen Gay & Susan Hura

Slide 10

Slide 10 text

Opportunities • Hands Free • Extra Hand • Shortcuts • Humanize @hakanson 10 Source: “Evangelizing and Designing Voice User Interface: Adopting VUI in a GUI world” Stephen Gay & Susan Hura

Slide 11

Slide 11 text

Personality • Create a consistent personality • Conversational experience • Take turns • Be tolerant • Functional vs. Anthropomorphic • The more “human” the interface, the more user frustration when it doesn’t understand. @hakanson 11

Slide 12

Slide 12 text

@hakanson 12

Slide 13

Slide 13 text

Intelligent Personal Assistant An intelligent personal assistant (or simply IPA) is a software agent that can perform tasks or services for an individual. These tasks or services are based on user input, location awareness, and the ability to access information from a variety of online sources (such as weather or traffic conditions, news, stock prices, user schedules, retail prices, etc.). Source: Wikipedia @hakanson 13

Slide 14

Slide 14 text

Apple’s Siri • Speech Interpretation and Recognition Interface • Norwegian name that means "beautiful victory" • Integral part of Apple’s iOS since iOS 5 • Also integrated into Apple’s watchOS, tvOS and CarPlay • Rumored for OS X 10.12 • “Hey, Siri” @hakanson 14

Slide 15

Slide 15 text

@hakanson 15

Slide 16

Slide 16 text

Google Now • First included in Android 4.1 (Jelly Bean) • Available within Google Search mobile apps (Android, iOS) and Google Chrome desktop browser • Android TV, Android Wear, etc. • “OK, Google” @hakanson 16

Slide 17

Slide 17 text

Microsoft’s Cortana • Named after a synthetic intelligence character from Halo • Created for Windows Phone 8.1 • Available on Windows 10, XBOX, and iOS/Android mobile apps • Integration with Universal Windows Platform (UWP) apps • “Hey, Cortana” @hakanson 17

Slide 18

Slide 18 text

Cortana’s Chit Chat • Cortana has a team of writers which includes a screenwriter, a playwright, a novelist, and an essayist. • Their job is to come up with human-like dialogue that makes Cortana seem like more than just a series of clever algorithms. Microsoft calls this brand of quasi-human responsiveness “chit chat.” @hakanson 18 Source: “Inside Windows Cortana: The Most Human AI Ever Built”

Slide 19

Slide 19 text

Amazon Alexa • Short for Alexandria, an homage to the ancient library • Available on Amazon Echo and Fire TV • Companion web app for iOS/Android mobile app • Alexa Skills Kit • Alexa Voice Service • “Alexa” or “Amazon” or “Echo” @hakanson 19

Slide 20

Slide 20 text

@hakanson 20

Slide 21

Slide 21 text

Web Speech API •Enables you to incorporate voice data into web applications •Consists of two parts: • SpeechSynthesis (Text-to-Speech) • SpeechRecognition (Asynchronous Speech Recognition) @hakanson 21 https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API

Slide 22

Slide 22 text

Web Speech API Specification Defines a JavaScript API to enable web developers to incorporate speech recognition and synthesis into their web pages. It enables developers to use scripting to generate text-to-speech output and to use speech recognition as an input for forms, continuous dictation and control. Published by the Speech API Community Group. It is not a W3C Standard nor is it on the W3C Standards Track. @hakanson 22 https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html

Slide 23

Slide 23 text

Browser Support @hakanson 23 http://caniuse.com/#search=speech

Slide 24

Slide 24 text

Chrome @hakanson 24

Slide 25

Slide 25 text

Firefox @hakanson 25 disabled by default, go to about:config to enable

Slide 26

Slide 26 text

Edge @hakanson 26

Slide 27

Slide 27 text

Speech Synthesis Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech computer or speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech. @hakanson 27 Source: Wikipedia

Slide 28

Slide 28 text

Utterance The SpeechSynthesisUtterance interface represents a speech request. Properties: • lang – if unset, the document's (<html>) lang value will be used • pitch – range between 0 (lowest) and 2 (highest) • rate – range between 0.1 (lowest) and 10 (highest) • text – plain text (or well formed SSML)* • voice – SpeechSynthesisVoice object • volume – range between 0 (lowest) and 1 (highest) @hakanson 28
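The property ranges above can be enforced before an utterance is queued. The following is a minimal sketch, not part of the original deck: `clampUtteranceSettings` and `configureUtterance` are hypothetical helper names, and the fallback value of 1 for pitch, rate and volume is assumed to match the browser defaults.

```javascript
// Ranges taken from the slide above; fallback values are assumed defaults.
const RANGES = {
  pitch: { min: 0, max: 2, fallback: 1 },
  rate: { min: 0.1, max: 10, fallback: 1 },
  volume: { min: 0, max: 1, fallback: 1 },
};

// Hypothetical helper: force each numeric setting into its documented range.
function clampUtteranceSettings(settings = {}) {
  const result = {};
  for (const [name, { min, max, fallback }] of Object.entries(RANGES)) {
    const value = Number(settings[name]);
    result[name] = Number.isFinite(value)
      ? Math.min(max, Math.max(min, value))
      : fallback; // missing or non-numeric input falls back
  }
  return result;
}

// Hypothetical helper: copy text, lang and clamped settings onto an
// utterance (or any utterance-like object, which keeps it testable).
function configureUtterance(utterance, text, settings = {}) {
  utterance.text = text;
  if (settings.lang) utterance.lang = settings.lang;
  Object.assign(utterance, clampUtteranceSettings(settings));
  return utterance;
}

// Browser-only usage, guarded so the file also loads outside a browser.
if (typeof window !== 'undefined' && 'speechSynthesis' in window) {
  const msg = configureUtterance(
    new SpeechSynthesisUtterance(),
    'Hello from the Web Speech API',
    { lang: 'en-US', rate: 1.2, pitch: 3 } // pitch 3 is clamped to 2
  );
  window.speechSynthesis.speak(msg);
}
```

Clamping up front keeps out-of-range values from reaching the speech service, where handling may differ by browser.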

Slide 29

Slide 29 text

Utterance Events • onboundary – fired when the spoken utterance reaches a word or sentence boundary • onend – fired when the utterance has finished being spoken • onerror – fired when an error occurs that prevents the utterance from being successfully spoken • onmark – fired when the spoken utterance reaches a named SSML "mark" tag • onpause – fired when the utterance is paused part way through • onresume – fired when a paused utterance is resumed • onstart – fired when the utterance has begun to be spoken @hakanson 29
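To see when each of these events fires, one handler per event can be attached to an utterance. This is a rough sketch under stated assumptions: `traceUtterance` is a hypothetical helper, and the logger receives the event name plus the event's `charIndex` field.

```javascript
// The seven utterance events listed on the slide above.
const UTTERANCE_EVENTS = [
  'start', 'end', 'error', 'pause', 'resume', 'mark', 'boundary',
];

// Hypothetical helper: assign an on<event> handler for every event so the
// order of events can be observed while an utterance is spoken.
function traceUtterance(utterance, log = console.log) {
  for (const type of UTTERANCE_EVENTS) {
    utterance['on' + type] = (event) => {
      // charIndex is the position in the text the event refers to.
      log(type, event && event.charIndex);
    };
  }
  return utterance;
}

// Browser-only usage, guarded so the file also loads outside a browser.
if (typeof window !== 'undefined' && 'speechSynthesis' in window) {
  const msg = traceUtterance(new SpeechSynthesisUtterance('Testing events'));
  window.speechSynthesis.speak(msg);
}
```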

Slide 30

Slide 30 text

SpeechSynthesis Controller interface for the speech service • speak() – add utterance to queue • speaking – utterance in process of being spoken • pending – queue contains as-yet-unspoken utterances • cancel() – remove all utterances from queue • pause(), resume(), paused – control and indicate pause state • getVoices() – returns list of SpeechSynthesisVoices @hakanson 30
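Because speak() appends to a queue, several sentences can be queued at once and are spoken in order. A minimal sketch, assuming a hypothetical `speakAll` helper that takes the controller and an utterance factory as parameters so it can be exercised without a browser:

```javascript
// Hypothetical helper: clear the queue, then enqueue one utterance per text.
function speakAll(synth, texts, createUtterance) {
  synth.cancel(); // remove anything still waiting in the queue
  return texts.map((text) => {
    const utterance = createUtterance(text);
    synth.speak(utterance); // appended to the queue, spoken in order
    return utterance;
  });
}

// Browser-only usage, guarded so the file also loads outside a browser.
if (typeof window !== 'undefined' && 'speechSynthesis' in window) {
  speakAll(
    window.speechSynthesis,
    ['First sentence.', 'Second sentence.'],
    (text) => new SpeechSynthesisUtterance(text)
  );
}
```

Calling cancel() first is a design choice for this sketch; an application that wants new speech to wait behind existing speech would skip it.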

Slide 31

Slide 31 text

JavaScript Example var msg = new SpeechSynthesisUtterance(); msg.text = "I'm sorry Dave, I'm afraid I can't do that"; window.speechSynthesis.speak(msg); @hakanson 31

Slide 32

Slide 32 text

"I'm sorry Dave, I'm afraid I can't do that" @hakanson 32 Source

Slide 33

Slide 33 text

Voices The SpeechSynthesisVoice interface represents a voice that the system supports. Properties: • default – indicates default voice for current app language • lang – BCP 47 language tag • localService – indicates if voice supplied by local speech synthesizer service • name – human-readable name that represents voice • voiceURI – location of speech synthesis service @hakanson 33
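These properties are enough to choose a voice for a given language. The sketch below is illustrative only: `pickVoice` is a hypothetical helper, and the preference order (named voice, then the default voice, then any voice for the language) is an assumption rather than anything the API prescribes.

```javascript
// Hypothetical helper: choose a voice for a BCP 47 language tag.
// Works on any array of objects with name, lang and default properties.
function pickVoice(voices, lang, preferredName) {
  const sameLang = voices.filter((voice) => voice.lang === lang);
  return (
    sameLang.find((voice) => voice.name === preferredName) ||
    sameLang.find((voice) => voice.default) ||
    sameLang[0] ||
    null // no voice available for this language
  );
}

// Browser-only usage, guarded so the file also loads outside a browser.
// In some browsers getVoices() returns an empty list until the
// voiceschanged event has fired, so the lookup is repeated there.
if (typeof window !== 'undefined' && 'speechSynthesis' in window) {
  const synth = window.speechSynthesis;
  const report = () =>
    console.log(pickVoice(synth.getVoices(), 'en-US', 'Samantha'));
  report();
  synth.addEventListener('voiceschanged', report);
}
```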

Slide 34

Slide 34 text

Voices by Platform • Chrome • Google US English • … • Mac • Samantha • Alex • … • Windows 10 • Microsoft David Desktop • Microsoft Zira Desktop @hakanson 34

Slide 35

Slide 35 text

SpeechSynthesisVoice default:true lang:"en-US" localService:true name:"Samantha" voiceURI:"Samantha" default:false lang:"en-US" localService:false name:"Google US English" voiceURI:"Google US English" @hakanson 35 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36

Slide 36

Slide 36 text

“Samantha” voiceURI • Chrome/Opera • Samantha • Safari • com.apple.speech.synthesis.voice.samantha • com.apple.speech.synthesis.voice.samantha.premium • Firefox • urn:moz-tts:osx:com.apple.speech.synthesis.voice.samantha.premium @hakanson 36

Slide 37

Slide 37 text

Google App’s New Voice Team included a Voice Coach and Linguist working in a recording studio @hakanson 37 Source: “The Google App’s New Voice - #NatAndLoEp 12”

Slide 38

Slide 38 text

@hakanson 38 Demo http://mdn.github.io/web-speech-api/speak-easy-synthesis/ https://github.com/mdn/web-speech-api/tree/master/speak-easy-synthesis

Slide 39

Slide 39 text

@hakanson 39 Demo https://jsbin.com/tinaso/edit?js,console,output

Slide 40

Slide 40 text

SSML • Speech Synthesis Markup Language (SSML) • Version 1.0; W3C Recommendation 7 September 2004 • XML-based markup language for assisting the generation of synthetic speech • Standard way to control aspects of speech such as pronunciation, volume, pitch, rate, etc. @hakanson 40 https://www.w3.org/TR/speech-synthesis/

Slide 41

Slide 41 text

SSML Example 1st request was for 1 room on 10/19/2010 , with early arrival at 12:35pm .

@hakanson 41

Slide 42

Slide 42 text

Spoken Output and Accessibility “It’s important to understand that adding synthesized speech to an application and making an application accessible to all users (a process called access enabling) are different processes with different goals.” @hakanson 42 Source: “Speech Synthesis in OS X”

Slide 43

Slide 43 text

Speech Recognition Speech recognition (SR) is the inter-disciplinary sub-field of computational linguistics which incorporates knowledge and research in the linguistics, computer science, and electrical engineering fields to develop methodologies and technologies that enables the recognition and translation of spoken language into text by computers and computerized devices such as those categorized as smart technologies and robotics. It is also known as "automatic speech recognition" (ASR), "computer speech recognition", or just "speech to text" (STT). @hakanson 43 Source: Wikipedia

Slide 44

Slide 44 text

SpeechRecognition The SpeechRecognition interface is the controller interface for the recognition service; this also handles the SpeechRecognitionEvent sent from the recognition service. @hakanson 44

Slide 45

Slide 45 text

Properties • grammars – returns and sets a collection of SpeechGrammar objects that represent the grammars that will be understood by the current SpeechRecognition • lang – returns and sets the language of the current SpeechRecognition. If not specified, this defaults to the HTML lang attribute value, or the user agent's language setting if that isn't set either • continuous – controls whether continuous results are returned for each recognition, or only a single result. Defaults to single (false) • interimResults – controls whether interim results should be returned (true) or not (false.) Interim results are results that are not yet final (e.g. the isFinal property is false.) • maxAlternatives – sets the maximum number of SpeechRecognitionAlternatives provided per result (default value is 1) • serviceURI – specifies the location of the speech recognition service used by the current SpeechRecognition to handle the actual recognition (default is the user agent's default speech service) @hakanson 45

Slide 46

Slide 46 text

Events • onaudiostart – fired when the user agent has started to capture audio • onaudioend – fired when the user agent has finished capturing audio • onend – fired when the speech recognition service has disconnected • onerror – fired when a speech recognition error occurs • onnomatch – fired when the speech recognition service returns a final result with no significant recognition. This may involve some degree of recognition, which doesn't meet or exceed the confidence threshold • onresult – fired when the speech recognition service returns a result: a word or phrase has been positively recognized and this has been communicated back to the app @hakanson 46

Slide 47

Slide 47 text

Events • onsoundstart – fired when any sound — recognisable speech or not — has been detected • onsoundend – fired when any sound — recognisable speech or not — has stopped being detected • onspeechstart – fired when sound that is recognised by the speech recognition service as speech has been detected • onspeechend – fired when speech recognised by the speech recognition service has stopped being detected • onstart – fired when the speech recognition service has begun listening to incoming audio with intent to recognize grammars associated with the current SpeechRecognition @hakanson 47

Slide 48

Slide 48 text

Methods • abort() – stops the speech recognition service from listening to incoming audio, and doesn't attempt to return a SpeechRecognitionResult • start() – starts the speech recognition service listening to incoming audio with intent to recognize grammars associated with the current SpeechRecognition • stop() – stops the speech recognition service from listening to incoming audio, and attempts to return a SpeechRecognitionResult using the audio captured so far @hakanson 48

Slide 49

Slide 49 text

JavaScript Example var recognition = new SpeechRecognition(); recognition.lang = 'en-US'; recognition.interimResults = false; recognition.maxAlternatives = 1; recognition.start(); @hakanson 49
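The constructor in the example above is not available under that name in every browser; Chrome exposes it with a `webkit` prefix. A minimal feature-detection sketch, where `getSpeechRecognition` and `createRecognizer` are hypothetical helper names and the 'en-US' fallback is an assumption:

```javascript
// Hypothetical helper: find the constructor, prefixed or not.
function getSpeechRecognition(scope) {
  return scope.SpeechRecognition || scope.webkitSpeechRecognition || null;
}

// Hypothetical helper: create and configure a recognizer, or return null
// when the browser has no speech recognition support.
function createRecognizer(scope, options = {}) {
  const Recognition = getSpeechRecognition(scope);
  if (!Recognition) return null;
  const recognition = new Recognition();
  recognition.lang = options.lang || 'en-US';
  recognition.continuous = Boolean(options.continuous);
  recognition.interimResults = Boolean(options.interimResults);
  recognition.maxAlternatives = options.maxAlternatives || 1;
  return recognition;
}

// Browser-only usage, guarded so the file also loads outside a browser.
if (typeof window !== 'undefined') {
  const recognition = createRecognizer(window, { interimResults: true });
  if (recognition) {
    recognition.onerror = (event) => console.log('error:', event.error);
    // recognition.start(); // typically called from a click handler
  } else {
    console.log('Speech recognition is not supported in this browser.');
  }
}
```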

Slide 50

Slide 50 text

SpeechRecognitionResult The SpeechRecognitionResult interface represents a single recognition match, which may contain multiple SpeechRecognitionAlternative objects. • isFinal – a Boolean that states whether this result is final (true) or not (false); if so, then this is the final time this result will be returned; if not, then this result is an interim result, and may be updated later on • length – returns the length of the "array", that is, the number of SpeechRecognitionAlternative objects contained in the result (also referred to as "n-best alternatives") • item – a standard getter that allows SpeechRecognitionAlternative objects within the result to be accessed via array syntax @hakanson 50

Slide 51

Slide 51 text

SpeechRecognitionAlternative The SpeechRecognitionAlternative interface represents a single word that has been recognised by the speech recognition service • transcript – returns a string containing the transcript of the recognised word • confidence – returns a numeric estimate of how confident the speech recognition system is that the recognition is correct @hakanson 51
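The result list is two levels deep: each result holds one or more alternatives. The sketch below walks both levels; `bestAlternative` and `summarizeResults` are hypothetical helpers, and picking the alternative with the highest confidence is an assumption about what an application would want.

```javascript
// Hypothetical helper: return the alternative with the highest confidence.
function bestAlternative(result) {
  let best = result[0];
  for (let i = 1; i < result.length; i++) {
    if (result[i].confidence > best.confidence) best = result[i];
  }
  return best;
}

// Hypothetical helper: split transcripts into final and interim text.
// `results` is array-like, as in event.results inside onresult.
function summarizeResults(results) {
  const finalParts = [];
  const interimParts = [];
  for (let i = 0; i < results.length; i++) {
    const result = results[i];
    const target = result.isFinal ? finalParts : interimParts;
    target.push(bestAlternative(result).transcript);
  }
  return {
    finalText: finalParts.join(' '),
    interimText: interimParts.join(' '),
  };
}

// Intended use inside a handler (recognition is assumed to exist already):
// recognition.onresult = (event) => {
//   console.log(summarizeResults(event.results));
// };
```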

Slide 52

Slide 52 text

JavaScript Example recognition.onresult = function(event) { var color = event.results[0][0].transcript; diagnostic.textContent = 'Result received: ' + color + '.'; bg.style.backgroundColor = color; } @hakanson 52

Slide 53

Slide 53 text

@hakanson 53 Demo http://mdn.github.io/web-speech-api/speech-color-changer/ https://github.com/mdn/web-speech-api/tree/master/speech-color-changer

Slide 54

Slide 54 text

Grammars • A speech recognition grammar is a container of language rules that define a set of constraints that a speech recognizer can use to perform recognition. • A grammar helps in the following ways: • Limits Vocabulary • Customizes Vocabulary • Filters Recognized Results • Identifies Rules • Defines Semantics @hakanson 54 https://msdn.microsoft.com/en-us/library/hh378342(v=office.14).aspx

Slide 55

Slide 55 text

SRGS • Speech Recognition Grammar Specification (SRGS) • Version 1.0; W3C Recommendation 16 March 2004 • Grammars are used so that developers can specify the words and patterns of words to be listened for by a speech recognizer • Augmented BNF (ABNF) or XML syntax • Modelled on the JSpeech Grammar Format specification [JSGF] @hakanson 55 https://www.w3.org/TR/speech-grammar/

Slide 56

Slide 56 text

JSGF • JSpeech Grammar Format (JSGF) • W3C Note 05 June 2000 • Platform-independent, vendor-independent textual representation of grammars for use in speech recognition • Derived from the Java™ Speech API Grammar Format (Version 1.0, October, 1998) @hakanson 56

Slide 57

Slide 57 text

SpeechGrammar The SpeechGrammar interface represents a set of words or patterns of words that we want the recognition service to recognize. Defined using JSpeech Grammar Format (JSGF). Other formats may also be supported in the future. • src – sets and returns a string containing the grammar from within the SpeechGrammar object instance • weight – sets and returns the weight of the SpeechGrammar object @hakanson 57
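A simple JSGF grammar is a header, a grammar name and one public rule listing alternatives. The sketch below builds such a string and attaches it to a recognizer. `buildWordListGrammar` and `attachGrammar` are hypothetical helpers; the grammar and rule names are placeholders, and whether a given browser actually honors the grammar is not something this sketch can guarantee.

```javascript
// Hypothetical helper: build a one-rule JSGF grammar from a word list.
function buildWordListGrammar(grammarName, ruleName, words) {
  return (
    '#JSGF V1.0; grammar ' + grammarName + '; ' +
    'public <' + ruleName + '> = ' + words.join(' | ') + ' ;'
  );
}

// Hypothetical helper: wrap the grammar text in a SpeechGrammarList
// (prefixed in Chrome) and assign it to the recognizer's grammars property.
function attachGrammar(scope, recognition, grammarText, weight = 1) {
  const GrammarList = scope.SpeechGrammarList || scope.webkitSpeechGrammarList;
  if (!GrammarList) return false; // grammars not supported here
  const list = new GrammarList();
  list.addFromString(grammarText, weight);
  recognition.grammars = list;
  return true;
}

// Intended browser use (recognition is assumed to exist already):
// const grammar = buildWordListGrammar('colors', 'color', ['red', 'green']);
// attachGrammar(window, recognition, grammar);
```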

Slide 58

Slide 58 text

Slide 59

Slide 59 text

Sample “OK, Google” Commands • Remind me to [do a task]. Ex.: "Remind me to get dog food at Target," will create a location-based reminder. "Remind me to take out the trash tomorrow morning," will give you a time-based reminder. • When's my next meeting? • How do I [task]? Ex.: "How do I make an Old Fashioned cocktail?" or "How do I fix a hole in my wall?" • If a song is playing, ask questions about the artist. For instance, "Where is she from?" (Android 6.0 Marshmallow) • To learn more about your surroundings, you can ask things like "What is the name of this place?" or "Show me movies at this place" or "Who built this bridge?" @hakanson 59 Source: “The complete list of 'OK, Google' commands”

Slide 60

Slide 60 text

Natural Language Understanding •Speech to Text •Text to Meaning @hakanson 60

Slide 61

Slide 61 text

NLP vs. FSM • Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. • A finite-state machine (FSM) is a mathematical model of computation used to design both computer programs and sequential logic circuits. @hakanson 61
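For simple command-style interfaces, a finite-state machine can be enough to drive the dialog. A toy sketch follows; the states (`idle`, `listening`, `confirming`) and inputs (`wake`, `command`, `yes`, `no`, `cancel`) are invented for illustration and do not come from the talk.

```javascript
// Transition table: current state -> recognized input -> next state.
const TRANSITIONS = {
  idle: { wake: 'listening' },
  listening: { command: 'confirming', cancel: 'idle' },
  confirming: { yes: 'idle', no: 'listening' },
};

// Return the next state; input that is not expected in the current
// state leaves the machine where it is.
function nextState(state, input) {
  const row = TRANSITIONS[state] || {};
  return row[input] || state;
}

// Feed a sequence of recognized inputs through the machine.
function runDialog(inputs, start = 'idle') {
  return inputs.reduce((state, input) => nextState(state, input), start);
}

console.log(runDialog(['wake', 'command', 'yes']));
```

The trade-off is visible in the table: every phrase the user may say has to be anticipated as an input, which is what natural language processing tries to relax.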

Slide 62

Slide 62 text

KITT vs Samsung smart home @hakanson 62 Source

Slide 63

Slide 63 text

Other Speech APIs • Why? • Browser doesn’t support Web Speech API • Consistent experience across all browsers • How? • Web Audio API • JavaScript in browser • WebSocket connection directly from browser • HTTP API proxied through server @hakanson 63

Slide 64

Slide 64 text

Web Audio API The Web Audio API provides a powerful and versatile system for controlling audio on the Web, allowing developers to choose audio sources, add effects to audio, create audio visualizations, apply spatial effects (such as panning) and much more. @hakanson 64 https://developer.mozilla.org/en-US/docs/Web/API/Web_Audio_API

Slide 65

Slide 65 text

Pocketsphinx.js Speech recognition in JavaScript • PocketSphinx.js is a speech recognizer that runs entirely in the web browser. It is built on: • a speech recognizer written in C (PocketSphinx) converted into JavaScript using Emscripten, • an audio recorder using the Web Audio API. @hakanson 65 https://syl22-00.github.io/pocketsphinx.js/live-demo.html

Slide 66

Slide 66 text

IBM Watson Developer Cloud • Text to Speech • Watson Text to Speech provides a REST API to synthesize speech audio from an input of plain text. • Once synthesized in real-time, the audio is streamed back to the client with minimal delay. • Speech to Text • Uses machine intelligence to combine information about grammar and language structure with knowledge of the composition of an audio signal to generate an accurate transcription. • Accessed via a WebSocket connection or REST API. @hakanson 66 http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/services-catalog.html

Slide 67

Slide 67 text

@hakanson 67 Demo https://text-to-speech-demo.mybluemix.net/ https://speech-to-text-demo.mybluemix.net/

Slide 68

Slide 68 text

Microsoft Cognitive Services • Speech API • Convert audio to text, understand intent, and convert text back to speech for natural responsiveness (rebranding of Bing and Project Oxford APIs) • Microsoft has used Speech API for Windows applications like Cortana and Skype Translator @hakanson 68 https://www.microsoft.com/cognitive-services/en-us/speech-api

Slide 69

Slide 69 text

Microsoft Cognitive Services • Speech Recognition • Convert spoken audio to text. • Text to Speech • Convert text to spoken audio • Speech Intent Recognition • Convert spoken audio to intent • In addition to returning recognized text, includes structured information about the incoming speech @hakanson 69

Slide 70

Slide 70 text

@hakanson 70 Demo https://www.microsoft.com/cognitive-services/en-us/speech-api

Slide 71

Slide 71 text

Google Cloud Speech API Enables developers to convert audio to text by applying powerful neural network models in an easy to use API • Over 80 Languages • Return Text Results In Real-Time • Accurate In Noisy Environments • Powered by Machine Learning @hakanson 71 https://cloud.google.com/speech/

Slide 72

Slide 72 text

Summary • Speech Interfaces are the future • and they have been for a long time • and don’t believe everything you see on TV • Know your customer and application • More UI/UX effort than JavaScript code • and time to leverage those writing and speaking skill sets • Web technology lags behind mobile, but is evolving @hakanson 72

Slide 73

Slide 73 text

Questions? @hakanson 73 Source