
Designing for Voice Interactions - UX Australia: Designing for Mobile

Voice-controlled interfaces and natural language processing are becoming normal. Major mobile platforms now have voice integrated into the operating system, and Google has recently improved its voice search experience in leaps and bounds.

Voice interactions will transform how we interact with computers and – just like the touch interfaces that came before them – mobile devices are driving the change.

Jonny Schneider

March 01, 2013

Transcript

  1. source: http://www.flickr.com/photos/altemark/304079314
    Designing for
    Voice Interactions
    UX Australia
    Designing for Mobility
    Melbourne, March 1 2013
    Jonny Schneider
    Lead Consultant
    Mobile Experience Design & Strategy


  2. When you think of voice
    recognition, you probably
    think of...
    ‘Understanding Moira’, AAMI TV Commercial, http://www.youtube.com/watch?v=EY_jL38HMy8
    inaccurate
    too slow
    never works
    it’s a gimmick
    too tedious for me
    “I won’t use it until it’s
    faster and more
    accurate than typing”
    it can’t handle
    my accent
    A lot of those things might be true, but this is default thinking, likely based on many bad experiences. However, there are two sides to every story.


  3. https://twitter.com/bennyg/status/167192535305945088

  4. http://www.flickr.com/photos/av_hire_london/5579125851
    IDEA: Experience first-hand what it's like to
    interact with digital devices using
    predominantly your voice.
    METHOD: A group of colleagues committed
    to using voice wherever possible, for an
    entire day.
    Day of Voice
    Let’s take a more objective look at what it’s like to use voice in our everyday interactions. Today.


  5. ✦ Controlling the device is tedious
    ✦ I’m sorry, I can’t do that for you
    ✦ Comprehension/recognition
    ✦ Expression
    ✦ Privacy
    ✦ Loss of context/paradigm
    Day of Voice: what didn’t work
    Control:
    “Dictation itself was fine, but getting to where notes are taken was very tedious.”
    “I couldn’t navigate to where I needed to be. It heard the command correctly, but didn’t know what to do with it”
    Limitations: Generally, it’s not pervasive enough to be relied upon
    “I can’t...”
    - “play games with voice”!
    - Attach to email
    - dictate an email address "schneider dot jonny at gmail com".
    - edit an address
    Recognition. e.g. Pam’s clips.
    Expression. Exclamation marks, commas, full stops, slang etc. are possible, but not natural.
    As a result “I found that everything tends to run together”
    Privacy. “On several occasions, I found myself wandering off to a small room or closet so that others couldn’t hear what I was
    talking about.”
    Loss of context. Chat client. Using voice means I have to break out of the normal short-messaging paradigm that I’m used to. It
    changes to asynchronous audible communication. Without those visual cues, I’m not sure where I’m up to, or what I want to say
    next.
    A lot of this could just be that we’re not used to it.


  6. ✦ Google search with auto-suggest
    ✦ Dictation
    ✦ Accessibility*
    ✦ Control by command
    (XBox Kinect; Dragon for desktop)
    Day of Voice: what worked
    Examples of some useful and surprising experiences with voice
    Google search. “brilliant for rarely used words like 'oesophagus' or 'onomatopoeia', and much faster than guessing letters and
    typing.”
    Dictation.
    “Recording of notes is easy and I've done it on a number of occasions as I'd much prefer to talk than to type.”
    Can make light work of the tedious task of typing on a mobile device.
    Even at 80% accuracy, this is way faster than typing, for longer messages
    Accessibility.
    Blind person using Instagram [video]


  7. ‘How Blind People Use Instagram’, Tommy Edison, 2012. http://bit.ly/YBmBmb
    blind man uses Instagram
    (video)
    http://www.youtube.com/watch?v=P1e7ZCKQfMA


  8. http://www.google.com/nexus/4/
    ✦ On-board hardware
    (microphone and speaker)
    ✦ hands busy + eyes busy
    context of use
    ✦ Personal and ‘always with you’
    nature of device suits idea of
    ‘virtual assistants’
    Why is this so relevant for mobile?


  9. Data: http://isc.org; http://amta.org.au; http://wikipedia.org and various websites
    [Timeline, 1983 to 2011]
    Networks: AMPS (analogue), GSM, 2G/WAP/WML/i-mode, 3G UMTS, NextG
    The beginnings of speech recognition technology predate mobile telephony.
    It goes back to the 50s, but let’s look at the last 30 years.
    •Ray Kurzweil’s reading machine: a speech synthesiser for blind people.
    •About 10 years later, the first commercial speech recogniser is created.
    It’s enormous, and very expensive.
    •The next decade: mobile devices get smaller and more prolific. The Internet starts to take off.
    •(early 90s) SMS, then T9 later that decade
    •(’95-2000) Dragon dictation, 1st IVR over DTMF, telephone banking
    •Touch devices happen
    •Google voice search (2008)
    •Voice Control for iOS, then Voice Actions a year later
    •Swype text input
    •Voice-controlled virtual assistants: Siri (2011) and Google Now (2012)
    •Visual IVR
    Ray Kurzweil is now a Director of Engineering at Google, leading a search AI program.
    http://techcrunch.com/2013/01/06/googles-director-of-engineering-ray-kurzweil-is-building-your-cybernetic-friend/


  10. Data: http://isc.org; http://amta.org.au; http://wikipedia.org and various websites
    [Timeline, 1983 to 2011]
    Networks: AMPS (analogue), GSM, 2G/WAP/WML/i-mode, 3G UMTS, NextG
    Devices: Telecom ‘Walkabout’
    Milestones: Kurzweil Reading Machine ←(1976); 1st commercial large vocabulary speech recogniser

  11. Data: http://isc.org; http://amta.org.au; http://wikipedia.org and various websites
    [Timeline, 1983 to 2011]
    Networks: AMPS (analogue), GSM, 2G/WAP/WML/i-mode, 3G UMTS, NextG
    Devices: Telecom ‘Walkabout’, Palm Treo, Motorola Brick, Nokia 5110, Motorola RAZR
    Milestones: Kurzweil Reading Machine ←(1976); 1st commercial large vocabulary speech recogniser

  12. Data: http://isc.org; http://amta.org.au; http://wikipedia.org and various websites
    [Timeline, 1983 to 2011]
    Networks: AMPS (analogue), GSM, 2G/WAP/WML/i-mode, 3G UMTS, NextG
    Devices: Telecom ‘Walkabout’, Palm Treo, Motorola Brick, Nokia 5110, Motorola RAZR
    Milestones: Kurzweil Reading Machine ←(1976); 1st commercial large vocabulary speech recogniser; SMS is born; Predictive Text (T9)

  13. Data: http://isc.org; http://amta.org.au; http://wikipedia.org and various websites
    [Timeline, 1983 to 2011]
    Networks: AMPS (analogue), GSM, 2G/WAP/WML/i-mode, 3G UMTS, NextG
    Devices: Telecom ‘Walkabout’, Palm Treo, Motorola Brick, Nokia 5110, Motorola RAZR
    Milestones: Kurzweil Reading Machine ←(1976); 1st commercial large vocabulary speech recogniser; SMS is born; Predictive Text (T9); Dragon Dictate v1 for PC; 1st dial-in IVR (DTMF); Telephone Banking

  14. Data: http://isc.org; http://amta.org.au; http://wikipedia.org and various websites
    [Timeline, 1983 to 2011]
    Networks: AMPS (analogue), GSM, 2G/WAP/WML/i-mode, 3G UMTS, NextG
    Devices: Telecom ‘Walkabout’, Palm Treo, Motorola Brick, Nokia 5110, Motorola RAZR, HTC Dream (1st Android), iPhone 3
    Milestones: Kurzweil Reading Machine ←(1976); 1st commercial large vocabulary speech recogniser; SMS is born; Predictive Text (T9); Dragon Dictate v1 for PC; 1st dial-in IVR (DTMF); Telephone Banking

  15. Data: http://isc.org; http://amta.org.au; http://wikipedia.org and various websites
    [Timeline, 1983 to 2011]
    Networks: AMPS (analogue), GSM, 2G/WAP/WML/i-mode, 3G UMTS, NextG
    Devices: Telecom ‘Walkabout’, Palm Treo, Motorola Brick, Nokia 5110, Motorola RAZR, HTC Dream (1st Android), iPhone 3
    Milestones: Kurzweil Reading Machine ←(1976); 1st commercial large vocabulary speech recogniser; SMS is born; Predictive Text (T9); Dragon Dictate v1 for PC; 1st dial-in IVR (DTMF); Telephone Banking; Google voice search app

  16. Data: http://isc.org; http://amta.org.au; http://wikipedia.org and various websites
    [Timeline, 1983 to 2011]
    Networks: AMPS (analogue), GSM, 2G/WAP/WML/i-mode, 3G UMTS, NextG
    Devices: Telecom ‘Walkabout’, Palm Treo, Motorola Brick, Nokia 5110, Motorola RAZR, HTC Dream (1st Android), iPhone 3
    Milestones: Kurzweil Reading Machine ←(1976); 1st commercial large vocabulary speech recogniser; SMS is born; Predictive Text (T9); Dragon Dictate v1 for PC; 1st dial-in IVR (DTMF); Telephone Banking; Google voice search app; Voice control (iOS3); Voice actions (Froyo)

  17. Data: http://isc.org; http://amta.org.au; http://wikipedia.org and various websites
    [Timeline, 1983 to 2011]
    Networks: AMPS (analogue), GSM, 2G/WAP/WML/i-mode, 3G UMTS, NextG
    Devices: Telecom ‘Walkabout’, Palm Treo, Motorola Brick, Nokia 5110, Motorola RAZR, HTC Dream (1st Android), iPhone 3
    Milestones: Kurzweil Reading Machine ←(1976); 1st commercial large vocabulary speech recogniser; SMS is born; Predictive Text (T9); Dragon Dictate v1 for PC; 1st dial-in IVR (DTMF); Telephone Banking; Google voice search app; Voice control (iOS3); Voice actions (Froyo); Swype

  18. Data: http://isc.org; http://amta.org.au; http://wikipedia.org and various websites
    [Timeline, 1983 to 2011]
    Networks: AMPS (analogue), GSM, 2G/WAP/WML/i-mode, 3G UMTS, NextG
    Devices: Telecom ‘Walkabout’, Palm Treo, Motorola Brick, Nokia 5110, Motorola RAZR, HTC Dream (1st Android), iPhone 3
    Milestones: Kurzweil Reading Machine ←(1976); 1st commercial large vocabulary speech recogniser; SMS is born; Predictive Text (T9); Dragon Dictate v1 for PC; 1st dial-in IVR (DTMF); Telephone Banking; Google voice search app; Voice control (iOS3); Voice actions (Froyo); Swype; SIRI & Google Now

  19. Data: http://isc.org; http://amta.org.au; http://wikipedia.org and various websites
    [Timeline, 1983 to 2011]
    Networks: AMPS (analogue), GSM, 2G/WAP/WML/i-mode, 3G UMTS, NextG
    Devices: Telecom ‘Walkabout’, Palm Treo, Motorola Brick, Nokia 5110, Motorola RAZR, HTC Dream (1st Android), iPhone 3
    Milestones: Kurzweil Reading Machine ←(1976); 1st commercial large vocabulary speech recogniser; SMS is born; Predictive Text (T9); Dragon Dictate v1 for PC; 1st dial-in IVR (DTMF); Telephone Banking; Google voice search app; Voice control (iOS3); Voice actions (Froyo); Swype; SIRI & Google Now; Visual IVR

  20. http://www.flickr.com/photos/carnamah/5859235859
    What do people want?
    If I had asked people what they wanted,
    they would have said faster horses.
    Henry Ford, nineteen twenty never
    Henry didn’t actually say this. Someone at Harvard Business Review went looking, and got a response from the Henry Ford
    Museum, who had researched the topic before and found no satisfactory evidence that Ford in fact said it!
    The point is...
    I believe there’s a misconception that people don’t like voice as an interaction method.
    I would argue that people will use whatever input method gets the job done quickly and with minimum fuss - that can be
    ‘voice’.
    I wonder what people said about:
    •T9
    •Touch
    •Mobile telephony
    •or even computers


  21. Used with permission by Kenneth Johnson. http://kennethjohnson.us/
    ✦ All the robots!
    ✦ Google glass
    Imagine the future...
    if machines could understand.
    A few examples:
    - HAL 9000 (2001: A Space Odyssey)
    - T-800 (Terminator)
    - Johnny 5 (Short Circuit)
    - Data (Star Trek)
    - Robocop ED-209 (Robocop)
    Not just movies.
    ...CSI and other such shows are riddled with intelligent, understanding, all singing, all dancing, talking computers.
    Sci-Fi movies have been spruiking the possibilities for decades. In reality, we’re moving at a much slower pace, but things like
    Google Glass are coming. In fact, you can apply to participate in the trial study right now if you like.


  22. Voice recognition technology
    Main types of voice interaction
    Design principles
    Let’s talk about Voice


  23. Voice recognition technology
    Main types of voice interaction
    Design principles


  24. A (very) quick look at the technology
    [Diagram: speech recognition & synthesis service (voice-to-text, text-to-speech) and back-end services (search engine, customer database, private APIs, transaction gateway, 3rd party APIs)]
    This is one configuration, which we used on a recent project.
    There are many other ways this could be done.
    •sound clip recorded on the device
    •clip sent to the voice-to-text (VTT) service
    •VTT interprets/transcribes the clip
    •result sent back as text
    •device sends text to other services (e.g. search engine)
    •data sent back to the device (often multiple results, with a confidence rating)
    •device sends text to be voiced over (e.g. a summary of the data presented to user)
    •text-to-speech (TTS) service creates a voice clip and sends it back to the device
    •device presents the data and plays the voice clip
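
    The round trip above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual code: the three service functions (`voice_to_text`, `search`, `text_to_speech`) are hypothetical stand-ins that return canned data instead of calling a network service, and the result titles and confidence ratings are invented for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Result:
    """One item returned by a back-end service, with its confidence (0 to 1)."""
    title: str
    confidence: float


def voice_to_text(clip: bytes) -> str:
    # Hypothetical stand-in for the remote VTT service: returns canned text.
    return "are there any good deals nearby"


def search(query: str) -> list[Result]:
    # Hypothetical stand-in for a back-end service (e.g. a search engine).
    # The results and confidence ratings are invented for the example.
    return [Result("Half-price coffee", 0.62), Result("2-for-1 pizza", 0.81)]


def text_to_speech(text: str) -> bytes:
    # Hypothetical stand-in for the remote TTS service: returns fake audio bytes.
    return text.encode("utf-8")


def handle_utterance(clip: bytes) -> tuple[list[Result], bytes]:
    """Run one recorded clip through VTT, a back-end service, then TTS."""
    text = voice_to_text(clip)                       # clip -> text
    results = search(text)                           # text -> data with confidence
    best = max(results, key=lambda r: r.confidence)  # pick the most confident item
    summary = f"I found {len(results)} results. The best match is {best.title}."
    voice_clip = text_to_speech(summary)             # summary -> voice clip
    return results, voice_clip                       # device shows data, plays clip
```

    Each of the three calls is a separate network round trip in the configuration described, which is why latency is measured per step later in the talk.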


  25. A (very) quick look at the technology
    [Diagram: speech recognition & synthesis service (voice-to-text, text-to-speech) and back-end services (search engine, customer database, private APIs, transaction gateway, 3rd party APIs), with marker A]

  26. A (very) quick look at the technology
    [Diagram: speech recognition & synthesis service (voice-to-text, text-to-speech) and back-end services (search engine, customer database, private APIs, transaction gateway, 3rd party APIs), with markers A and B]

  27. A (very) quick look at the technology
    [Diagram: speech recognition & synthesis service (voice-to-text, text-to-speech) and back-end services (search engine, customer database, private APIs, transaction gateway, 3rd party APIs), with markers A, B and C]

  28. http://www.flickr.com/photos/citychiccountrymouse/3856797711
    PURPOSE: Measure accuracy and latency of
    current voice recognition solutions
    METHOD:
    ✦ 4 vendor solutions
    ✦ 14 test phrases for translation
    ✦ 12 participants
    ✦ phrases recorded ‘fast’ and ‘slow’
    Let’s Benchmark!


  29. “Are there any good deals nearby”
    I’ll get any goodies nearby
    Are there any deals near me
    Adding any deals any of me
    Are there any good deals nearby ✔



    Objective (exact) and subjective matching.
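
    The two kinds of matching can be sketched as follows, using the target phrase and the four transcriptions from this slide. Exact matching is computed after normalising case and punctuation. Subjective matching in the benchmark was presumably a human judgement, so the word-level similarity score here is only an assumed stand-in for it, and the function names are hypothetical.

```python
import re
from difflib import SequenceMatcher

EXPECTED = "Are there any good deals nearby"
TRANSCRIPTS = [
    "I’ll get any goodies nearby",
    "Are there any deals near me",
    "Adding any deals any of me",
    "Are there any good deals nearby",
]


def normalise(text: str) -> list[str]:
    """Lower-case the text and split it into words, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_match(expected: str, actual: str) -> bool:
    """Objective matching: every word must be identical and in the same order."""
    return normalise(expected) == normalise(actual)


def word_similarity(expected: str, actual: str) -> float:
    """Assumed proxy for subjective matching: word-sequence similarity, 0 to 1."""
    return SequenceMatcher(None, normalise(expected), normalise(actual)).ratio()


def exact_accuracy(expected: str, transcripts: list[str]) -> float:
    """Share of transcriptions that match the expected phrase exactly."""
    hits = sum(exact_match(expected, t) for t in transcripts)
    return hits / len(transcripts)
```

    With exact matching only the last transcription counts as correct, whereas a similarity score gives partial credit to a result such as "Are there any deals near me", which a person might well accept.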


  30. Average accuracy of voice solutions
    Solution                      Average accuracy   People tested   Comments
    iSpeech                       10%                4               Discarded after initial testing
    Google                        47%                12              Non supported API
    Nuance - high quality audio   56%                12              10x file size
    Nuance - low quality audio    50%                12              1x file size
    Siri                          64%                12              Not a reusable product
    Average accuracy.
    It’s a small number of participants.
    I’m sure you could find much more comprehensive test results from other sources. Knock yourself out!


  31. Accuracy of voice recognition by participant
    [Chart: accuracy (axis 0 to 100) for participants P1 to P12; series: Google Voice, Nuance Wav, Nuance Speex, SIRI]
    Accuracy by participant.
    Here’s Google Voice in pink.
    and now Nuance.
    and the other two vendors tested.
    This tells us there is significant variation in accuracy, from person to person.


  32. Average accuracy of voice recognition by accent
    [Chart: accuracy (axis 0 to 100) by accent, with number of participants: Australian (2), Indian (3), Singaporean (3), American (1), Hong Kong (1), Malaysian (1), Chinese (1); series: Google Voice, Nuance Wav, Nuance Speex, SIRI]
    It’s a similar story across the different accents.


  33. A (very) quick look at the technology
    [Diagram: speech recognition & synthesis service (voice-to-text, text-to-speech) and back-end services (search engine, customer database, private APIs, transaction gateway, 3rd party APIs), with markers A, B and C]
    Remember A, B, C?
    We’re going to measure latency now.
    2 weeks, sampling every 30 mins.


  34. Comparison of latency performance (seconds)
    [Two stacked-bar charts, labelled Nuance and Google; axis 0 to 60 seconds; bars: 3G (in Asia) and WiFi (private); segments: Voice-to-Text, ‘Stuff’ in the cloud, Text-to-speech]
    Data labels, first chart: 3, 16, 10, 21, 2, 4
    Data labels, second chart: 3, 18, 10, 22, 4, 16
    Let’s measure the latency of each of those steps.
    Enormous latency!
    Over 40 seconds over 3G. Absurd.
    One important note is that these times represent a whole phrase: the phrases are not broken down and processed
    synchronously, as is the case with products like the Google voice search app.
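
    The end-to-end figure is just the sum of the three steps. The sketch below adds them up using the data labels from this slide. Two things are assumptions rather than facts from the talk: that the larger value in each pair of labels belongs to the 3G bar (which is consistent with "over 40 seconds over 3G"), and the neutral names `chart_1` and `chart_2`, since the transcript does not make clear which chart belongs to which vendor.

```python
# Stage latencies in seconds, as read from the extracted chart labels.
# Assumption: in each pair of labels, the larger value is the 3G measurement.
LATENCY = {
    "chart_1": {
        "3G": {"voice_to_text": 16, "cloud": 21, "text_to_speech": 4},
        "WiFi": {"voice_to_text": 3, "cloud": 10, "text_to_speech": 2},
    },
    "chart_2": {
        "3G": {"voice_to_text": 18, "cloud": 22, "text_to_speech": 16},
        "WiFi": {"voice_to_text": 3, "cloud": 10, "text_to_speech": 4},
    },
}


def total_latency(stages: dict) -> int:
    """End-to-end latency is the sum of the sequential stages."""
    return sum(stages.values())


def totals(latency: dict) -> dict:
    """Total latency per chart and network type."""
    return {
        chart: {network: total_latency(stages) for network, stages in networks.items()}
        for chart, networks in latency.items()
    }


if __name__ == "__main__":
    for chart, networks in totals(LATENCY).items():
        for network, seconds in networks.items():
            print(f"{chart} over {network}: {seconds} s")
```

    Because the steps run one after another, a slow network is paid for three times per utterance, once for each round trip.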


  35. Comparison of latency performance (seconds)
    [Two stacked-bar charts, labelled Nuance and Google; axis 0 to 60 seconds; bars: 3G (in Asia) and WiFi (private); segments: Voice-to-Text, Text-to-speech]
    Data labels, first chart: 3, 16, 2, 4
    Data labels, second chart: 3, 18, 4, 16
    Even when we cut out the ‘other stuff’ and measure only the VTT and TTS services, it’s still very slow.
    Some of this can be improved with colocation of servers and services. This test involved servers that were geographically spread
    over the globe. However, that isn’t always feasible, depending on the services you are connecting with, and where they are served
    from.


  36. http://www.flickr.com/photos/lisovy/5415681393/
    ✦ Even the best recognisers struggle
    to achieve much higher than 60% accuracy
    ✦ Latency is a problem,
    especially over slower networks
    Conclusions
    Consider the effect when these compound.
    It takes ages to get the result, and there’s a high likelihood it will be incorrect.
    Not ideal.
    My friend Rod Farmer kindly pointed out that it is possible to run concurrent requests, translating a few words at a time, in
    order to reduce latency significantly. For our limited prototype, this kind of engineering wasn’t feasible. Nonetheless, the
    recommendations that follow are helpful regardless of latency.
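
    The idea of concurrent requests can be sketched with Python's standard thread pool. This is a simplified illustration, not what the prototype did: `recognise` is a hypothetical stand-in that sleeps to simulate a network round trip and echoes its input, and the phrase is split into word chunks where a real system would split the audio.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def recognise(chunk: str, delay: float = 0.05) -> str:
    # Hypothetical stand-in for a network call to a recogniser: it waits, then
    # echoes the chunk back. A real service would take audio and return text.
    time.sleep(delay)
    return chunk


def split_into_chunks(words: list, size: int) -> list:
    """Group the words into chunks of at most `size` words each."""
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def recognise_sequentially(chunks: list) -> str:
    """One request after another: total time grows with the number of chunks."""
    return " ".join(recognise(chunk) for chunk in chunks)


def recognise_concurrently(chunks: list) -> str:
    """All requests in flight at once: total time is roughly one round trip."""
    with ThreadPoolExecutor(max_workers=max(1, len(chunks))) as pool:
        # Executor.map returns results in the order of the inputs,
        # so the words come back in the order they were spoken.
        return " ".join(pool.map(recognise, chunks))


if __name__ == "__main__":
    chunks = split_into_chunks("are there any good deals nearby".split(), 2)
    for fn in (recognise_sequentially, recognise_concurrently):
        start = time.perf_counter()
        text = fn(chunks)
        print(f"{fn.__name__}: '{text}' in {time.perf_counter() - start:.2f} s")
```

    Splitting real speech this way may cost some accuracy at the chunk boundaries, which is part of why it was out of scope for a limited prototype.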


  37. Voice recognition technology
    Main types of voice interaction
    Design principles


  38. Main ways of interacting with voice
    Commands Dictation
    Natural Language Identification

  39. http://www.flickr.com/photos/bengrogan/2147048247
    Command-based interactions
    think of: Selective hearing.
    ✦ System only hears
    what it is listening for
    ✦ Structured/scripted
    Command-based systems are like ‘selective hearing’.
    The system only knows how to understand things that it is listening for.
    It’s a structured, generally tedious way of interacting. It often feels scripted and impersonal, which are the kind of attributes that
    typically offend customers.
    This was typically the backbone of the early IVRs (late 90s-2000s).
    AAMI, the Australian insurance company, has built its unique market position on exactly that.
    You might be familiar with the ‘Moira’ campaign.
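    The ‘selective hearing’ of a command-based system can be sketched in a few lines. The command phrases and handlers below are made up for illustration; the point is that anything outside the fixed list is simply not heard.

```python
from typing import Callable, Dict, Optional


def make_command_listener(
    commands: Dict[str, Callable[[], str]]
) -> Callable[[str], Optional[str]]:
    """Build a listener that reacts only to an exact, known set of phrases."""

    def listen(utterance: str) -> Optional[str]:
        # Normalise case and whitespace, then look for an exact match
        phrase = " ".join(utterance.lower().split())
        handler = commands.get(phrase)
        return handler() if handler else None

    return listen


# Hypothetical commands for a hands-free recipe app
listen = make_command_listener({
    "next step": lambda: "showing next step",
    "go back": lambda: "showing previous step",
})

print(listen("Next step"))                    # a phrase the system is listening for
print(listen("what should I do after this"))  # not recognised, returns None
```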

  40. Think about any time you’ve called your mobile provider.
    I know it feels tedious, but ask yourself - would it be any better if you spoke with a person?
    Customers hate:
    1. repeating themselves (usually because of a routing issue)
    2. waiting in queue
    Telstra has 2nd biggest call centre
    with 600 unique reasons to call
    200,000 inbound calls per day
    handling 1M transfers per month
    I’d like to argue that speaking with a real agent may well be a poorer experience than a machine.
    Why? Humans aren’t perfect either:
    - Attitude
    - Accents
    - Understanding
    - Consistency
    There are also times when we might simply prefer a machine.
    I can think of one or two times when I’ve really hoped to get to voicemail, because the person I was calling is difficult to talk with. Or perhaps you’re five weeks overdue
    on your invoice, and would prefer not to explain yourself, but instead get it paid through an IVR.
    We’re talking about command-based interactions. Strictly, most IVRs today have moved beyond simple ‘commands’. They usually begin with an open prompt, before
    moving to menu mode. We’ll discuss that in more detail in a moment.

  41. A very clever use of simple voice commands to control an interface, entirely appropriate for the context of use you’d expect for
    this scenario (sticky fingers etc.)
    Other noteworthy examples:
    - XBOX Kinect
    - Dragon for desktop

  42. ✦ Great as a text-input
    replacement, particularly for
    mobile, where keypads
    are tedious
    ✦ It doesn’t need to
    ‘understand’
    ✦ Predictive dictation,
    based on data
    http://www.flickr.com/photos/vivax_imago/5603582392
    Dictation
    Dictation
    think of: Hearing, but not understanding.
    The machine hears what you tell it, but can’t make meaning from it.
    I think we all understand how dictation works. The user says something, their speech is ‘recognised’ and then usually converted
    from voice to text.
    If it is reasonably accurate, it’s easy to see how this can be helpful.
    Driving or walking down the street while composing SMS on a touch screen is hideously difficult. Dangerous, and possibly
    illegal. Dictation frees you up to focus on other things.
    Complex vocabulary often also benefits from dictation. A word like oesophagus is difficult to spell, and you could be left
    guessing what letter it starts with a few times before T9 kicks in to save the day. Dictating it is likely to be quicker and easier.
    Nuance’s Powerscribe360, for medical practitioners, is a great example of that in action.

  43. It’s no coincidence that major mobile operating systems have this embedded right at the core.
    Just as it’s no coincidence that Google has just hired Ray Kurzweil as a Director of Engineering.
    Are they building SkyNet?

  44. on a mac
    Example of predictive dictation:
    “What does onomatopoeia mean?”
    The machine still doesn’t “understand” in the way we mean it.
    But just like search engines, it can predict what we mean based on statistical modeling.
    Think of how many billions of search queries Google has at hand, that are used to inform these statistical models.
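    As a rough illustration of that kind of prediction, the sketch below picks between acoustically similar candidates by checking which completed phrase appears most often in a query log. The candidates come from the slide; the counts and the function name are invented for the example and do not describe how any real recogniser works.

```python
from typing import Dict, List


def pick_transcription(
    candidates: List[str], template: str, query_counts: Dict[str, int]
) -> str:
    """Return the candidate whose completed phrase is most frequent in the log."""

    def count(candidate: str) -> int:
        return query_counts.get(template.format(candidate), 0)

    return max(candidates, key=count)


# Made-up frequencies standing in for a very large query log
query_counts = {
    "what does onomatopoeia mean": 900,
    "what does on a mat mean": 2,
}

best = pick_transcription(
    ["on a mac", "on a mat", "onomatopoeia"],
    "what does {} mean",
    query_counts,
)
print(best)
```

    Nothing here ‘understands’ the question. The unlikely candidates lose only because very few people have asked them before.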

  45. on a mac on a mat

  46. on a mac on a mat onomatopoeia
    mean?

  48. http://bit.ly/XPJ7DC
    ✦ ‘natural language’ interactions
    ✦ The machine understands*
    meaning, and can then respond in
    a helpful, meaningful and personal
    way
    Virtual Assistants
    think of: hearing and understanding*
    This is like hearing and understanding.
    ‘Understanding’ has an asterisk next to it, and you’ll see why over the next few slides.
    Machines have a really hard time trying to understand meaning - Why...

  49. ‘Subliminal: How your unconscious mind rules your behaviour’, p. 34. Leonard Mlodinow, 2012.
    The cooking teacher said the students
    made good snacks.
    Meaning is nuanced
    The cannibal said the students made
    good snacks.
    It’s because human communication is complex and nuanced,
    and it can’t easily be automated or codified.
    Herein lies one of the biggest challenges for ‘intelligent’ or ‘understanding’ voice systems.
    “Teachers and Cannibals” is a basic example.
    As humans, we easily understand the meaning of these two statements that are only different by a single word.
    And you’re probably alarmed - I hope you’re alarmed - by the latter.
    Machines don’t understand this as easily.

  50. ‘Subliminal: How your unconscious mind rules your behaviour’, p. 34. Leonard Mlodinow, 2012.
    A common homily
    The spirit is willing, but the flesh is weak
    Here’s another example...

  51. ‘Subliminal: How your unconscious mind rules your behaviour’, p. 34. Leonard Mlodinow, 2012.
    The spirit is willing, but the flesh is weak
    A common homily,
    when programmatically translated
    Here’s another example...

  52. ‘Subliminal: How your unconscious mind rules your behaviour’, p. 34. Leonard Mlodinow, 2012.
    The spirit is willing, but the flesh is weak
    The vodka is strong, but the meat is rotten
    A common homily,
    when programmatically translated
    Here’s another example...

  53. http://www.flickr.com/photos/lifementalhealthpics/8384573785
    ✦ Semantic classification
    ✦ Statistical probability
    modeling
    ✦ Creating a perception of
    understanding
    What is machine ‘understanding’
    Documents, conversations, or any kind of content can be manually classified or coded for meaning, and this becomes a model that the
    machine can use for matching.
    Statistical algorithms similar to those used in search engines are also used to help the machine perform better, based on past behaviour of
    other people.
    This creates a perception of understanding or intelligence. You might call that ‘Artificial Intelligence’.
    Vocabulary is an important factor in accuracy of probability modeling.
    Radiography reader was an early speech recognition system that was ultimately successful because the vocabulary in radiography
    is constrained, and the acoustic signatures of the words are quite different. Therefore the algorithms are more successful.
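    A toy version of the classification idea: a handful of manually labelled example phrases become a word-count model, and new utterances are matched against it. The intents and example phrases are invented, and real systems use far richer statistical models than this word-overlap score.

```python
from collections import Counter
from typing import Dict, List, Optional


def build_model(labelled: Dict[str, List[str]]) -> Dict[str, Counter]:
    """Count the words seen in the manually labelled examples for each intent."""
    return {
        intent: Counter(word for example in examples for word in example.lower().split())
        for intent, examples in labelled.items()
    }


def classify(utterance: str, model: Dict[str, Counter]) -> Optional[str]:
    """Return the intent whose examples share the most words with the utterance."""
    if not model:
        return None
    words = utterance.lower().split()
    scores = {intent: sum(counts[w] for w in words) for intent, counts in model.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Hypothetical hand-labelled training phrases
model = build_model({
    "billing": ["pay my bill", "my invoice is overdue"],
    "fault": ["my internet is down", "no connection"],
})

print(classify("I want to pay an invoice", model))
print(classify("internet down", model))
```

    A constrained vocabulary helps here for the same reason it helped in radiography: the fewer words the intents share, the less often scores tie or mislead.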

  54. http://www.flickr.com/photos/lifementalhealthpics/8384573785
    ✦ Can you access data to
    help do the thinking on
    behalf of your users?
    ✦ prediction of customer
    needs
    ✦ Personalisation
    System awareness
    When a customer interacts with a service, various bits of data may be available:
    - identity
    - account status
    - location of call
    - time of day
    - device being used
    This can be used to predict customer needs.
    Example:
    An engineer cuts a cable that wipes out internet for all of Brunswick. 30,000 customers are affected. For customers calling in from that
    geographic area, the system has an automated response telling them about the problem. The customer hangs up. Lots of money saved.
    Doing this gives a 20% improvement in routing and/or task completion, compared with 2% from ‘tuning’ the semantic and
    statistical modeling.
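    The outage example could be expressed as a simple rule that runs before any speech recognition happens. The field names, prompts, and the overdue-account rule are assumptions made for this sketch, not details of a real call-centre system.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class CallContext:
    customer_id: str
    suburb: str
    account_overdue: bool


def opening_prompt(ctx: CallContext, outage_suburbs: Set[str]) -> str:
    """Choose the first thing the caller hears, based on what is already known."""
    if ctx.suburb in outage_suburbs:
        return (
            f"We know about an internet outage in {ctx.suburb} "
            "and we are working to fix it."
        )
    if ctx.account_overdue:
        return "Would you like to pay your overdue invoice now?"
    return "How can we help you today?"


caller = CallContext(customer_id="c-001", suburb="Brunswick", account_overdue=False)
print(opening_prompt(caller, outage_suburbs={"Brunswick"}))
```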

  55. Blade Runner, 1982. Warner Bros. img: http://replicant976.tumblr.com/image/12757032749
    The Uncanny Valley
    is not something we need
    worry about.
    Yet.
    The Uncanny Valley is a hypothesis in robotics that suggests that as robots approach human likeness, they provoke feelings of
    revulsion in humans.
    It doesn’t really apply to virtual agents, and so far, our experience has been that there is a long way to go before voice synthesis
    approaches human likeness - so it’s really nothing to worry about yet.

  56. ‘Sneakers’, Universal Studios, 1992. img: http://lat.ms/ZlHtN0
    ✦ Voice biometrics
    Identification
    think of: “My voice is my passport, verify me!”
    Who remembers the film Sneakers? One of my favourites.
    A team of security specialists steal the keycard and vocal codes of Werner Brandes, an unsuspecting employee of the ‘front’
    company operated by a bad guy who intends to become wealthy by using a decryption device to defraud companies for his own
    benefit.
    In the end, the good guys win, and in a postscript, they use the Janek decryption device to steal from the rich and give to the
    poor. A modern day Robin Hood story.
    This is a nice example of using voice biometrics for multi-factor authentication. There are obvious applications for this,
    particularly for things like banking, where the 2nd factor is often SMS, which has several limitations.
    Some 20 years later, we’re starting to see this kind of security for real.

  57. Voice recognition technology
    Main types of voice interaction
    Design principles
    We’ve seen opportunities for humans to interact with computers in helpful ways,
    constraints in the capabilities of technology to deliver against this promise,
    and objectives in business to optimise operating costs and improve customer service.
    These are essentially the same ingredients as any design problem, aren’t they?
    So let’s look at some principles that apply specifically to voice...

  58. AT&T Visual IVR Project http://www.att.com/gen/press-room?pid=23362
    ✦ High latency, low accuracy...
    ✦ Help users recover by
    offering alternatives
    Design for failure
    This could be as a multi-modal interface, or it could be a translated interface like this example of visual IVR, which lets users
    traverse the IVR tree using a touch menu.
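    One way to build that recovery path into an interface is to decide, after every voice attempt, whether to carry on or hand over to another mode. This is a sketch under assumptions: the thresholds and action names are arbitrary placeholders, and a real recogniser may report confidence differently or not at all.

```python
from typing import Optional


def next_step(
    transcript: Optional[str],
    confidence: float,
    elapsed_seconds: float,
    min_confidence: float = 0.6,
    max_wait_seconds: float = 5.0,
) -> str:
    """Decide whether to trust the voice result or offer an alternative."""
    if elapsed_seconds > max_wait_seconds:
        # Too slow: let the user continue by touch instead of waiting
        return "show_touch_menu"
    if transcript is None or confidence < min_confidence:
        # Unsure what was said: ask the user to confirm or pick from a list
        return "confirm_or_choose"
    return "proceed"


print(next_step("pay my bill", confidence=0.9, elapsed_seconds=1.2))
print(next_step("pay my bill", confidence=0.3, elapsed_seconds=1.2))
print(next_step(None, confidence=0.0, elapsed_seconds=12.0))
```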

  59. ✦ Don’t treat voice as a ‘me too’ feature
    (will your product or your customers
    actually benefit from voice... really?)
    ✦ Think twice before introducing
    redundancy
    Would you like voice with that?
    Voice is the hot new thing right now, but resist the hype. It’s not trivial to implement, and even if it were, does that validate it
    as a ‘must have’ feature for your product?
    Voice is integrated into the OS of modern devices.
    Their technology is mature. It can be used with any input field, any interface.
    The interaction design is polished, and extensively tested.
    Use that! If you can.

  60. ✦ Understand the various modes of voice
    interaction
    ✦ Be careful about mixing modes
    (is that a command or a conversation?)
    Know when and how to use voice
    When you are designing for voice, understand the modes.
    Command, dictate, natural language, identity.

  61. ✦ Support multi-modal
    interactions and make it as
    seamless as possible
    (voice, gesture, type, other)
    ✦ test, iterate, test, iterate...
    Let users decide how to interact
    Don Norman, 2003
    “I believe that voice interfaces hold their greatest promise as an additional component to a multi-modal dialogue, rather than as
    the only interface channel.”
    Dictate and edit is a prime example of this. It’s beautifully crafted.
    Voice -> type
    gesture -> voice
    Test and iterate.
    Voice still isn’t a common/normal interaction, so you will likely get it a bit wrong the first few times.

  62. Don’t make me think
    “A simple voice interface can only be as good as
    what the customer thinks they want. A better
    system is one that understands what their needs
    are likely to be, based on what’s known about
    them.”

  63. ✦ Personalisation
    ✦ Work on making the system ‘smarter’
    Create a perception of understanding
    The speech recognition and synthesis tools have become commodities. Focus your energies on helping the system seem smarter.

  64. Jonny Schneider
    Lead Consultant
    Mobile Experience Design & Strategy
    @jonnyschneider
    au.linkedin.com/in/jonnyschneider/
    All images used by permission
