Designing for Voice Interactions - UX Australia: Designing for Mobile

Designing for Voice Interactions
UX Australia: Designing for Mobility, Melbourne, March 1 2013
Jonny Schneider, Lead Consultant, Mobile Experience Design & Strategy
Source: http://www.flickr.com/photos/altemark/304079314

When you think of voice recognition, you probably think of...
‘Understanding Moira’, AAMI TV Commercial, http://www.youtube.com/watch?v=EY_jL38HMy8
✦ inaccurate
✦ too slow
✦ never works
✦ it’s a gimmick
✦ too tedious for me
✦ “I won’t use it until it’s faster and more accurate than typing”
✦ it can’t handle my accent
A lot of those things might be true, but this is default thinking, likely based on many bad experiences. However, there are two sides to every story.

https://twitter.com/bennyg/status/167192535305945088

Day of Voice
IDEA: Experience first-hand what it's like to interact with digital devices using predominantly your voice.
METHOD: A group of colleagues committed to using voice wherever possible, for an entire day.
Source: http://www.flickr.com/photos/av_hire_london/5579125851
Let’s take a more objective look at what it’s like to use voice in our everyday interactions. Today.

Day of Voice: what didn’t work
✦ Controlling the device is tedious
✦ I’m sorry, I can’t do that for you
✦ Comprehension/recognition
✦ Expression
✦ Privacy
✦ Loss of context/paradigm
Control: “Dictation itself was fine, but getting to where notes are taken was very tedious.” “I couldn’t navigate to where I needed to be. It heard the command correctly, but didn’t know what to do with it.”
Limitations: Generally, it’s not pervasive enough to be relied upon. “I can’t...”
- “play games with voice”!
- attach to email
- dictate an email address: "schneider dot jonny at gmail com"
- edit an address
Recognition: i.e. Pam’s clips.
Expression: Exclamation marks, commas, full stops, slang etc. are possible, but not natural. As a result, “I found that everything tends to run together.”
Privacy: “On several occasions, I found myself wandering off to a small room or closet so that others couldn’t hear what I was talking about.”
Loss of context: Chat client. Using voice means I have to break out of the normal short-messaging paradigm that I’m used to. It changes to asynchronous audible communication. Without those visual cues, I’m not sure where I’m up to, or what I want to say next.
A lot of this could just be that we’re not used to it.

Day of Voice: what worked
✦ Google search with auto-suggest
✦ Dictation
✦ Accessibility*
✦ Control by command (XBox Kinect; Dragon for desktop)
Examples of some useful and surprising experiences with voice:
Google search: “brilliant for rarely used words like 'oesophagus' or 'onomatopoeia', and much faster than guessing letters and typing.”
Dictation: “Recording of notes is easy and I've done it on a number of occasions as I'd much prefer to talk than to type.” It can make light work of the tedious task of typing on a mobile device. Even at 80% accuracy, this is way faster than typing for longer messages.
Accessibility: Blind person using Instagram [video]

‘How Blind People Use Instagram’, Tommy Edison, 2012. http://bit.ly/YBmBmb
Blind man uses Instagram (video): http://www.youtube.com/watch?v=P1e7ZCKQfMA

Why is this so relevant for mobile?
✦ On-board hardware (microphone and speaker)
✦ ‘hands busy + eyes busy’ context of use
✦ Personal and ‘always with you’ nature of device suits idea of ‘virtual assistants’
Source: http://www.google.com/nexus/4/

[Timeline chart, ‘83 to ‘11 in two-year steps]
Data: http://isc.org; http://amta.org.au; http://wikipedia.org and various websites
Networks: AMPS Analogue, GSM, 2G/WAP/WML/i-mode, 3G UMTS, NextG
Devices: Telecom ‘Walkabout’, Palm Treo, Motorola Brick, Nokia 5110, Motorola RAZR, HTC Dream (1st Android), iPhone 3
Voice and text input: Kurzweil Reading Machine (1976), 1st commercial large vocabulary speech recogniser, SMS is born, Predictive Text (T9), Telephone Banking, 1st dial-in IVR (DTMF), Dragon Dictate v1 for PC, Google voice search app, Voice control (iOS3), Voice actions (Froyo), Swype, SIRI & Google Now, Visual IVR
The beginnings of speech recognition technology predate mobile telephony. It goes back to the 50s, but let’s look at the last 30 years.
• Ray Kurzweil’s reading machine: speech synthesiser for blind people.
• +10 years: the first commercial speech recogniser is created. It’s enormous, and very expensive.
• The next decade: mobile devices get smaller and more prolific. Internet starts to take off.
• (early 90s) SMS, then T9 later that decade
• (’95-2000) Dragon dictation, 1st IVR over DTMF, Telephone banking
• Touch devices happen
• Google voice search (2008)
• Voice Control for iOS, then Voice Actions a year later
• Swype text input
• Voice controlled virtual assistants (SIRI and Google Now) 2012
• Visual IVR
Ray Kurzweil is now a Director of Engineering at Google, leading a search AI program. http://techcrunch.com/2013/01/06/googles-director-of-engineering-ray-kurzweil-is-building-your-cybernetic-friend/

What do people want?
“If I had asked people what they wanted, they would have said faster horses.” Henry Ford, nineteen twenty never
Source: http://www.flickr.com/photos/carnamah/5859235859
Henry didn’t actually say this... Someone at Harvard Business Review went looking, and got a response from the Henry Ford Museum, who had researched the topic before and found no satisfactory evidence that Ford in fact said it!
The point is... I believe there’s a misconception that people don’t like voice as an interaction method. I would argue that people will use whatever input method gets the job done quickly and with minimum fuss - that can be ‘voice’.
I wonder what people said about:
• T9
• Touch
• Mobile telephony
• or even computers

Imagine the future... if machines could understand.
✦ All the robots!
✦ Google Glass
Image used with permission by Kenneth Johnson. http://kennethjohnson.us/
A few examples:
- HAL 9000 (2001: A Space Odyssey)
- T-800 (Terminator)
- Johnny 5 (Short Circuit)
- Data (Star Trek)
- Robocop and ED-209 (Robocop)
Not just movies: CSI and other such shows are riddled with intelligent, understanding, all singing, all dancing, talking computers. Sci-Fi movies have been spruiking the possibilities for decades. In reality, we’re moving at a much slower pace, but things like Google Glass are coming - in fact, you can participate in the trial study right now if you like.

Let’s talk about Voice
✦ Voice recognition technology
✦ Main types of voice interaction
✦ Design principles

A (very) quick look at the technology
✦ SPEECH RECOGNITION & SYNTHESIS SERVICE: voice-to-text, text-to-speech
✦ search engine, customer database, private APIs, transaction gateway, 3rd party APIs
Diagram labels: A, B, C
This is one configuration, which we used on a recent project. There are many other ways this could be done.
• sound clip recorded
• clip sent to VTT (voice-to-text)
• VTT interprets/translates
• sent back as text
• device sends text to other services (e.g. search engine)
• data sent back to the device (often multiple results, each with a confidence rating)
• device sends text to be voiced over (e.g. a summary of the data presented to user)
• TTS (text-to-speech) creates a voice clip and sends it back to the device
• device presents the data and plays the voice clip
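The round trip in the notes above can be sketched in a few lines. This is only an illustration of the sequence of hops, not the project's code: the function names, the `Result` type and the stub services are all assumptions, and the real services would sit behind network calls.

```python
from dataclasses import dataclass


@dataclass
class Result:
    text: str
    confidence: float  # assumed to be 0.0 to 1.0, as returned by the service


def voice_round_trip(clip, voice_to_text, lookup, text_to_speech):
    """Run one recorded clip through the hops described in the notes."""
    # 1. Clip goes to the voice-to-text service and comes back as text.
    heard = voice_to_text(clip)
    # 2. Text goes to another service; several results may come back,
    #    each with a confidence rating, so put the most confident first.
    results = sorted(lookup(heard), key=lambda r: r.confidence, reverse=True)
    # 3. A short summary is sent to text-to-speech to be voiced over.
    if results:
        summary = f"Top result for {heard}: {results[0].text}"
    else:
        summary = f"Sorry, nothing found for {heard}"
    audio = text_to_speech(summary)
    # 4. The device presents the data and plays the clip.
    return heard, results, audio


# Hypothetical stand-ins so the sketch runs without any network access.
def fake_voice_to_text(clip):
    return clip.decode("utf-8")


def fake_lookup(text):
    return [Result("Cafe voucher", 0.62), Result("Two-for-one pizza", 0.91)]


def fake_text_to_speech(text):
    return text.encode("utf-8")


heard, results, audio = voice_round_trip(
    b"good deals nearby", fake_voice_to_text, fake_lookup, fake_text_to_speech
)
```

Each hop is a separate request, which is why the latency of the three steps adds up.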

Let’s Benchmark!
PURPOSE: Measure accuracy and latency of current voice recognition solutions
METHOD:
✦ 4 vendor solutions
✦ 14 test phrases for translation
✦ 12 participants
✦ phrases recorded ‘fast’ and ‘slow’
Source: http://www.flickr.com/photos/citychiccountrymouse/3856797711

Test phrase: “Are there any good deals nearby”
Recognised as:
- I’ll get any goodies nearby
- Are there any deals near me
- Adding any deals any of me
- Are there any good deals nearby
Marks shown on the slide: ✔ ✘ ✔ ✘
Objective (exact) and subjective matching.
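The two kinds of matching could be scored along these lines. Exact matching is easy to automate. Subjective matching in the benchmark was a human judgement, so the word-overlap score below is only a rough stand-in, and the function names are assumptions made for this sketch.

```python
import re


def normalise(phrase):
    """Lower-case, drop punctuation, and split into words."""
    return re.sub(r"[^a-z0-9\s]", "", phrase.lower()).split()


def exact_match(expected, heard):
    """Objective match: the same words in the same order."""
    return normalise(expected) == normalise(heard)


def word_overlap(expected, heard):
    """Share of the expected words that appear in what was heard."""
    expected_words = set(normalise(expected))
    heard_words = set(normalise(heard))
    if not expected_words:
        return 0.0
    return len(expected_words & heard_words) / len(expected_words)


EXPECTED = "Are there any good deals nearby"
HEARD = [
    "I’ll get any goodies nearby",
    "Are there any deals near me",
    "Adding any deals any of me",
    "Are there any good deals nearby",
]

for phrase in HEARD:
    print(phrase, exact_match(EXPECTED, phrase), round(word_overlap(EXPECTED, phrase), 2))
```

A human rater would still be needed to decide whether a partial match keeps the meaning, which is what the subjective score was for.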

Average accuracy of voice solutions
Vendor; average accuracy; number of people tested; comments
iSpeech; 10%; 4; Discarded after initial testing
Google; 47%; 12; Non supported API
Nuance - high quality audio; 56%; 12; 10x file size
Nuance - low quality audio; 50%; 12; 1x file size
Siri; 64%; 12; Not a reusable product
Average accuracy. It’s a small number of participants. I’m sure you could find much more comprehensive test results from other sources. Knock yourself out!

Accuracy of voice recognition by participant
[Chart: accuracy from 0 to 100 for participants P1 to P12. Series: Google Voice, Nuance Wav, Nuance Speex, SIRI.]
Accuracy by participant. Here’s Google Voice in pink. And now Nuance. And the other two vendors tested. This tells us there is significant variation in accuracy, from person to person.

Average accuracy of voice recognition by accent
[Chart: accuracy from 0 to 100 by accent, with number of participants: Australian (2), Indian (3), Singaporean (3), American (1), Hong Kong (1), Malaysian (1), Chinese (1). Series: Google Voice, Nuance Wav, Nuance Speex, SIRI.]
It’s a similar story across the different accents.

A (very) quick look at the technology
✦ SPEECH RECOGNITION & SYNTHESIS SERVICE: voice-to-text, text-to-speech
✦ search engine, customer database, private APIs, transaction gateway, 3rd party APIs
Diagram labels: A, B, C
Remember A, B, C? We’re going to measure latency now: 2 weeks, sampling every 30 mins.
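A per-step latency measurement might look something like the sketch below. It times each stage of one run and averages across runs. The stage names and stub stages are assumptions, and the scheduling (every 30 minutes for two weeks) is left out.

```python
import time
from statistics import mean


def time_stage(stage, payload):
    """Return (result, elapsed_seconds) for one call to a pipeline stage."""
    start = time.perf_counter()
    result = stage(payload)
    return result, time.perf_counter() - start


def sample_pipeline(stages, payload):
    """Pass the payload through each named stage in order, timing each one."""
    timings = {}
    for name, stage in stages:
        payload, timings[name] = time_stage(stage, payload)
    return timings


def average_timings(samples):
    """Average each stage's latency across many samples."""
    return {name: mean(s[name] for s in samples) for name in samples[0]}


# Hypothetical stages standing in for the real network calls.
stages = [
    ("voice_to_text", lambda clip: clip.decode("utf-8")),
    ("cloud_services", lambda text: [text.upper()]),
    ("text_to_speech", lambda results: " ".join(results).encode("utf-8")),
]
timings = sample_pipeline(stages, b"good deals nearby")
```

Keeping the stages separate in the timings is what makes it possible to see which hop dominates on a slow network.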

Comparison of latency performance (seconds)
[Two bar charts, Nuance and Google, each on a 0 to 60 second axis, for 3G (in Asia) and WiFi (private). Series: Voice-to-Text, ‘Stuff’ in the cloud, Text-to-speech. Data labels as extracted: 3, 16, 10, 21, 2, 4 and 3, 18, 10, 22, 4, 16.]
Let’s measure the latency of each of those steps. Enormous latency! Over 40 seconds over 3G. Absurd.
One important note is that these times represent a whole phrase. The phrases are not broken down and processed synchronously, as is the case with products like the Google voice search app.

Comparison of latency performance (seconds)
[Two bar charts, Nuance and Google, each on a 0 to 60 second axis, for 3G (in Asia) and WiFi (private). Series: Voice-to-Text, Text-to-speech. Data labels as extracted: 3, 16, 2, 4 and 3, 18, 4, 16.]
Even when we cut out the ‘other stuff’ and measure only the VTT and TTS services, it’s still very slow. Some of this can be improved with colocation of servers and services. This test involved servers that were geographically spread over the globe. However, colocation isn’t always feasible, depending on the services you are connecting with, and where they are served from.

Conclusions
✦ Even the best recognisers struggle to achieve higher than 60% accuracy
✦ Latency is a problem, especially over slower networks
Source: http://www.flickr.com/photos/lisovy/5415681393/
Consider the effect when these compound. It takes ages to get the result, and there’s a high likelihood it will be incorrect. Not ideal.
My friend Rod Farmer kindly pointed out that it is possible to run concurrent requests - translating a few words at a time - in order to reduce latency significantly. For our limited prototype, this kind of engineering wasn’t feasible. Nonetheless, the recommendations that follow are helpful regardless of latency.
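The concurrent-requests idea could be sketched as follows. It is an illustration under assumptions, not what the prototype did: the clip is taken to be already split into short chunks, `recognise` is a hypothetical function that sends one chunk to a recogniser, and splitting audio cleanly between words is a separate problem this sketch ignores.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def recognise_in_chunks(chunks, recognise, max_workers=4):
    """Send each audio chunk to the recogniser concurrently, keeping the order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the order the chunks were submitted,
        # even if a later chunk finishes first.
        return " ".join(pool.map(recognise, chunks))


def fake_recognise(chunk):
    """Hypothetical recogniser: pretends each request takes a moment."""
    time.sleep(0.05)
    return chunk.decode("utf-8")


chunks = [b"are there", b"any good", b"deals nearby"]
text = recognise_in_chunks(chunks, fake_recognise)
```

With the requests in flight at the same time, the wait is closer to the slowest single chunk than to the sum of all of them.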

Main ways of interacting with voice
✦ Commands
✦ Dictation
✦ Natural Language
✦ Identification

Command-based interactions
think of: Selective hearing.
✦ System only hears what it is listening for
✦ Structured/scripted
Source: http://www.flickr.com/photos/bengrogan/2147048247
Command-based systems are like ‘selective hearing’. The system only knows how to understand things that it is listening for. It’s a structured, generally tedious way of interacting. It often feels scripted and impersonal, which are the kind of attributes that typically offend customers. This was typically the backbone of the early IVRs (late 90s-2000s). AAMI, the Australian insurance company, has built its unique market position on exactly that. You might be familiar with the ‘Moira’ campaign.
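‘Selective hearing’ can be shown with a tiny command matcher. The phrases and action names below are made up for illustration. The point is that anything outside the list is simply not understood, even when it was recognised correctly.

```python
import re

# Hypothetical commands for an IVR-style system: spoken phrase -> action.
COMMANDS = {
    "pay my bill": "PAY_BILL",
    "check my balance": "CHECK_BALANCE",
    "report a fault": "REPORT_FAULT",
}


def match_command(heard):
    """Return the action for the first known phrase found, or None."""
    cleaned = re.sub(r"[^a-z\s]", "", heard.lower())
    cleaned = " ".join(cleaned.split())
    for phrase, action in COMMANDS.items():
        # Match whole words only, so a phrase cannot match inside a longer word.
        if re.search(rf"\b{re.escape(phrase)}\b", cleaned):
            return action
    return None


print(match_command("I'd like to pay my bill please"))
print(match_command("My internet is slow"))
```

The second call returns None: the words were heard, but the system was not listening for them.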

Think about any time you’ve called your mobile provider. I know it feels tedious, but ask yourself - would it be any better if you spoke with a person?
Customers hate:
1. repeating themselves (usually because of a routing issue)
2. waiting in queue
Telstra has the 2nd biggest call centre, with:
- 600 unique reasons to call
- 200,000 inbound calls per day
- 1M transfers per month
I’d like to argue that speaking with a real agent may well be a poorer experience than a machine. Why? Humans aren’t perfect either:
- Attitude
- Accents
- Understanding
- Consistency
There are also times when we might simply prefer a machine. I can think of one or two times when I’ve really hoped to get to voicemail, because the person I was calling is difficult to talk with. Or perhaps you’re five weeks overdue on your invoice, and would prefer not to explain yourself, but instead get it paid through an IVR.
We’re talking about command-based interactions. Strictly, most IVRs today have moved beyond simple ‘commands’. They usually begin with an open prompt, before moving to menu mode. We’ll discuss that in more detail in a moment.

A very clever use of simple voice commands to control an interface - entirely appropriate for the context of use you’d expect for this scenario (sticky fingers etc.).
Other noteworthy examples:
- XBOX Kinect
- Dragon for desktop

Dictation
think of: Hearing, but not understanding.
✦ Great as a text-input replacement, particularly for mobile, where keypads are tedious
✦ It doesn’t need to ‘understand’
✦ Predictive dictation, based on data
Source: http://www.flickr.com/photos/vivax_imago/5603582392
The machine hears what you tell it, but can’t make meaning from it. I think we all understand how dictation works. The user says something, their speech is ‘recognised’ and then usually converted from voice to text. If it is reasonably accurate, it’s easy to see how this can be helpful. Driving or walking down the street while composing an SMS on a touch screen is hideously difficult. Dangerous, and possibly illegal. Dictation frees you up to focus on other things.
Complex vocabulary also often benefits from dictation. A word like oesophagus is difficult to spell, and you could be left guessing what letter it starts with a few times before T9 kicks in to save the day. Dictating it is likely to be quicker and easier. Nuance’s PowerScribe 360, for medical practitioners, is a great example of that in action.

It’s no coincidence that major mobile operating systems have this embedded right at the core. Just as it’s no coincidence that Google have just employed Ray Kurzweil as Director of Engineering. Are they building Skynet?

Example of predictive dictation: “What does onomatopoeia mean?”
Recognised text shown as the slide builds: “on a mac”, then “on a mat”, then “onomatopoeia mean?”
The machine still doesn’t “understand” in the way we mean it. But just like search engines, it can predict what we mean based on statistical modeling. Think of how many billions of search queries Google has at hand, that are used to inform these statistical models.
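A toy version of that prediction: given several transcriptions that sound alike, prefer the one seen most often in past queries. The counts below are invented purely for illustration, and real recognisers combine acoustic scores with far richer language models than a lookup table.

```python
# Invented counts of how often each phrase appeared in past queries.
PHRASE_COUNTS = {
    "what does on a mac mean": 3,
    "what does on a mat mean": 1,
    "what does onomatopoeia mean": 950,
}


def pick_transcription(candidates, counts):
    """Return (best_candidate, share_of_observed_uses) from past usage counts."""
    total = sum(counts.get(c, 0) for c in candidates)
    if total == 0:
        # No data to go on: fall back to the first candidate.
        return candidates[0], 0.0
    best = max(candidates, key=lambda c: counts.get(c, 0))
    return best, counts.get(best, 0) / total


candidates = list(PHRASE_COUNTS)
best, share = pick_transcription(candidates, PHRASE_COUNTS)
```

Nothing here understands what onomatopoeia is. The phrase wins only because people have asked it far more often than the alternatives.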

Virtual Assistants
think of: hearing and understanding*
✦ ‘natural language’ interactions
✦ The machine understands* meaning, and can then respond in a helpful, meaningful and personal way
Source: http://bit.ly/XPJ7DC
This is like hearing and understanding. ‘Understanding’ has an asterisk next to it, and you’ll see why over the next few slides. Machines have a really hard time trying to understand meaning. Why?

Meaning is nuanced
The cooking teacher said the students made good snacks.
The cannibal said the students made good snacks.
‘Subliminal: How your unconscious mind rules your behaviour’, p. 34. Leonard Mlodinow, 2012.
It’s because human communication is complex and nuanced, and it can’t easily be automated or codified. Herein lies one of the biggest challenges for ‘intelligent’ or ‘understanding’ voice systems. “Teachers and Cannibals” is a basic example. As humans, we easily understand the meaning of these two statements that differ by only a single word. And you’re probably alarmed - I hope you’re alarmed - by the latter. Machines don’t understand this as easily.

A common homily, when programmatically translated
The spirit is willing, but the flesh is weak
The vodka is strong, but the meat is rotten
Leonard Mlodinow, 2012.
Here’s another example...

What is machine ‘understanding’?
✦ Semantic classification
✦ Statistical probability modeling
✦ Creating a perception of understanding
Source: http://www.flickr.com/photos/lifementalhealthpics/8384573785
Documents, conversations, or any kind of content can be manually classified or coded for meaning, and this becomes a model that the machine can use for matching. Statistical algorithms similar to those used in search engines are also used to help the machine perform better, based on past behaviour of other people. This creates a perception of understanding or intelligence. You might call that ‘Artificial Intelligence’.
Vocabulary is an important factor in the accuracy of probability modeling. The radiography reader was an early speech recognition system that was ultimately successful because the vocabulary in radiography is constrained, and the acoustic signatures of the words are quite different. Therefore the algorithms are more successful.

http://www.flickr.com/photos/lifementalhealthpics/8384573785 System awareness ✦ Can you access data to help do the thinking on behalf of your users? ✦ Prediction of customer needs ✦ Personalisation. When a customer interacts with a service, various bits of data may be available: identity, account status, location of call, time of day, device being used. This can be used to predict customer needs. Example: an engineer cuts a cable that wipes out internet for all of Brunswick. 30,000 customers are affected. For customers calling in from that geographic area, the system plays an automated response telling them about the problem. The customer hangs up. Lots of money saved. Doing this gives a 20% improvement in routing and/or task completion, compared with 2% from ‘tuning’ the semantic and statistical modeling.
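The outage example can be sketched as a simple routing rule. This is an illustration under stated assumptions: the `Caller` fields, the `KNOWN_OUTAGES` table and the action names are hypothetical, and a real system would look up location and outage data from its own back-end services.

```python
from dataclasses import dataclass


@dataclass
class Caller:
    account_id: str
    suburb: str  # assumed to be derived from the call location or account record


# Hypothetical table of known outages, keyed by suburb.
KNOWN_OUTAGES = {
    "Brunswick": "We are aware of an internet outage in Brunswick and are working on it.",
}


def route_call(caller, outages):
    """Announce a known outage up front; otherwise fall through to the normal menu."""
    message = outages.get(caller.suburb)
    if message is not None:
        return ("announce", message)
    return ("main_menu", "How can we help you today?")


print(route_call(Caller("A123", "Brunswick"), KNOWN_OUTAGES)[0])
```

The point of the design is that the system uses what it already knows before asking the customer anything.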

Blade Runner, 1982. Warner Bros. img: http://replicant976.tumblr.com/image/12757032749 The Uncanny Valley is not something we need worry about. Yet. The Uncanny Valley is a hypothesis in robotics that suggests that as robots approach human likeness, they provoke feelings of revulsion in humans. It doesn’t really apply to virtual agents, and so far, our experience has been that there is a long way to go before voice synthesis approaches human likeness, so it’s really nothing to worry about yet.

‘Sneakers’, Universal Studios, 1992. img: http://lat.ms/ZlHtN0 ✦ Voice biometrics. Identification: think of “My voice is my passport, verify me!” Who remembers the film Sneakers? One of my favourites. A team of security specialists steals the keycard and vocal codes of Werner Brandes, an unsuspecting employee of the ‘front’ company operated by a bad guy who intends to become wealthy by using a decryption device to defraud companies for his own benefit. In the end, the good guys win, and in a postscript, they use the Janek decryption device to steal from the rich and give to the poor. A modern day Robin Hood story. This is a nice example of using voice biometrics for multi-factor authentication. There are obvious applications for this, particularly for things like banking, where the second factor is often SMS, which has several limitations. Some 20 years later, we’re starting to see this kind of security for real.

We’ve seen opportunities for humans to interact with computers in helpful ways, constraints in the capabilities of technology to deliver against this promise, and objectives in business to optimise operating costs and improve customer service. These are essentially the same ingredients of any design problem, aren’t they? So let’s look at some principles that apply specifically to voice...

AT&T Visual IVR Project http://www.att.com/gen/press-room?pid=23362 Design for failure ✦ High latency, low accuracy... ✦ Help users recover by offering alternatives. This could be a multi-modal interface, or it could be a translated interface like this example of visual IVR, which lets users traverse the IVR tree using a touch menu.
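One way to read ‘design for failure’ is as an explicit recovery path when recognition is uncertain. The sketch below is hypothetical: the confidence score, the threshold of 0.7, the limit of two attempts and the `touch_menu` fallback are all assumptions chosen for illustration, not values from any particular recogniser.

```python
def next_step(transcript, confidence, attempts, threshold=0.7, max_attempts=2):
    """Decide what to do with one recognition result.

    confidence: recogniser score between 0 and 1 (assumed to be available).
    attempts:   how many times the user has already been asked to repeat.
    """
    if confidence >= threshold:
        return ("accept", transcript)
    if attempts < max_attempts:
        return ("reprompt", "Sorry, I didn't catch that. Could you repeat it?")
    # Stop asking and offer another modality, such as a touch menu.
    return ("fallback", "touch_menu")


print(next_step("check my balance", 0.4, attempts=2))
```

Capping the number of reprompts keeps the user from getting stuck in a loop and hands control to the alternative interface early.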

Would you like voice with that? ✦ Don’t treat voice as a ‘me too’ feature (will your product or your customers actually benefit from voice... really?) ✦ Think twice before introducing redundancy. Voice is the hot new thing right now, but resist the hype. It’s not trivial to implement, and even if it were, does that validate it as a ‘must have’ feature for your product? Voice is integrated into the OS of modern devices. Their technology is mature. It can be used with any input field, any interface. The interaction design is polished, and extensively tested. Use that! If you can.

Know when and how to use voice ✦ Understand the various modes of voice interaction ✦ Be careful about mixing modes (is that a command or a conversation?) When you are designing for voice, understand the modes: command, dictate, natural language, identity.

Let users decide how to interact ✦ Support multi-modal interactions and make it as seamless as possible (voice, gesture, type, other) ✦ Test, iterate, test, iterate... Don Norman, 2003: “I believe that voice interfaces hold their greatest promise as an additional component to a multi-modal dialogue, rather than as the only interface channel.” Dictate and edit is a prime example of this. It’s beautifully crafted. Voice -> type gesture -> voice. Test and iterate. Voice still isn’t a common/normal interaction, so you will likely get it a bit wrong the first few times.

Don’t make me think “A simple voice interface can only be as good as what the customer thinks they want. A better system is one that understands what their needs are likely to be, based on what’s known about them.”

Create a perception of understanding ✦ Personalisation ✦ Work on making the system ‘smarter’. The speech recognition and synthesis tools have become commodities. Focus your energies on helping the system seem smarter.

Jonny Schneider Lead Consultant Mobile Experience Design & Strategy
@jonnyschneider au.linkedin.com/in/jonnyschneider/ All images used by permission
