The History and Future of Speaking with Machines

An overview of the centuries-long history of humanity's effort to give machines voice, from the earliest myths to the Web Speech API.

Matt Buck

May 24, 2017

Transcript

  1. Matt Buck
    CTO & Co-founder, Voxable
    The History and Future of Speaking with Machines
    #givevoice

    View Slide

  2. Overview
    • The history of speaking with machines
    • Voice on the web
    • The future of speaking with machines
    • Demonstration of building a voice interface

    View Slide

  3. THE HISTORY OF SPEAKING WITH MACHINES
    #givevoice

    View Slide

  4. Talking with Machines
    • Speech recognition
    • Speech synthesis

    View Slide

  5. Roger Bacon
    • c.1219 - c.1292
    • English philosopher and Franciscan friar
    • Proponent of the scientific method

    View Slide

  6. Brazen Heads
    • Automatons
    • Not real (obvs)
    • Could answer any question put to them
    https://en.wikipedia.org/wiki/Speech_synthesis#History

    View Slide

  7. 1700s: Early speech synthesis
    • 1779: Christian Gottlieb Kratzenstein models vocal tract
    • 1791: Wolfgang von Kempelen’s “acoustic-mechanical speech machine”
    https://en.wikipedia.org/wiki/Wolfgang_von_Kempelen's_Speaking_Machine

    View Slide

  8. View Slide

  9. 1846: Euphonia
    • Created by Joseph Faber
    • Also played like an organ
    • Modeled entire head
    • Spoke three languages
    https://irrationalgeographic.wordpress.com/2009/06/24/joseph-fabers-talking-euphonia/

    View Slide

  10. https://irrationalgeographic.wordpress.com/2009/06/24/joseph-fabers-talking-euphonia/
    Alexander Graham Bell

    View Slide

  11. https://dood.al/pinktrombone/

    View Slide

  12. https://www.flickr.com/photos/64416865@N00/5451405887/in/photostream/

    View Slide

  13. https://www.flickr.com/photos/64416865@N00/5451405887/in/photostream/

    View Slide

  14. https://www.flickr.com/photos/64416865@N00/5451405887/in/photostream/

    View Slide

  15. 99% Invisible, Ep. 208: “Vox Ex Machina”
    http://99percentinvisible.org/episode/vox-ex-machina/

    View Slide

  16. https://www.youtube.com/watch?v=0rAyrmm7vv0
    1940: Voder World’s Fair Demo
    http://hackaday.com/2014/08/12/retrotechtacular-the-voder-from-bell-labs/

    View Slide

  17. https://www.youtube.com/watch?v=41U78QP8nBk
    1961: The First Machine to Sing
    http://speechstones.com/milestones.html

    View Slide

  18. https://www.youtube.com/watch?v=OuEN5TjYRCE
    2001: Still Singing
    http://speechstones.com/milestones.html

    View Slide

  19. 1969: John Robinson Pierce
    • One of the most influential engineers of the 20th century
    • Inventor of the term “transistor”
    • “There are strong reasons for believing that spoken English is… not recognizable phoneme by phoneme or word by word.”
    https://en.wikipedia.org/wiki/Speech_recognition#History

    View Slide

  20. 1978: The Texas Instruments Speak & Spell
    • Used TI’s Solid State Speech
    • First toy to use synthesized speech
    https://commons.wikimedia.org/wiki/File:TI_SpeakSpell_no_shadow.jpg

    View Slide

  21. 1978: The TI Speak & Spell
    https://www.youtube.com/watch?v=qM8FcN0aAvU

    View Slide

  22. 1986: IBM Tangora
    • Interest in speech recognition reignited by DARPA grants in the early 1970s
    • Tangora: A voice-activated word processor with a 20,000-word vocabulary

    View Slide

  23. View Slide

  24. IVR

    View Slide

  25. Interactive Voice Response

    View Slide

  26. 2011: Siri
    • Integrated with core Apple apps + Wolfram Alpha
    • Solved real problems

    View Slide

  27. 2016: Google WaveNet
    • Neural network capable of generating speech
    • Can mimic any human voice
    • Reduces the gap between earlier text-to-speech systems and human performance by over 50%
    https://deepmind.com/blog/wavenet-generative-model-raw-audio/

    View Slide

  28. https://www.youtube.com/watch?v=qM8FcN0aAvU

    View Slide

  29. https://lyrebird.ai

    View Slide

  30. 2014-2016: Amazon Echo & Google Home
    • Ubiquitous voice interfaces
    • Always-on, react to “wake words”
    • Siri-like functionality
    • Control smart-home devices

    View Slide

  31. Tea.
    Earl Grey.
    Hot.

    View Slide

  32. THE HISTORY OF SPEAKING WITH MACHINES
    • Roger Bacon’s Brazen Head
    • Early attempts at speech synthesis
    • Voice in the 20th century
    • Ubiquitous voice

    View Slide

  33. VOICE ON THE WEB
    #givevoice

    View Slide

  34. The Core Technology
    The Voice User Interface
    • Automatic Speech Recognition (ASR)
    • Natural Language Understanding (NLU)
    • Bot Intelligence
    • Text to Speech (TTS)

    View Slide

  35. The Core Technology
    Automatic Speech Recognition (ASR)
    Takes the spoken word and turns it into text.

    View Slide

  36. View Slide

  37. The Core Technology
    Natural Language Understanding (NLU)
    Gives the text meaning by turning it into structured data.

    View Slide

  38. The Core Technology
    Bot Intelligence
    Bot Intelligence manages the context of the user, the application, and the conversation.

    View Slide

  39. The Core Technology
    Text to Speech (TTS)
    The result generated by the intelligence is spoken back to the user.

    View Slide
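
The four stages on slides 34-39 can be read as a pipeline. Below is a rough sketch of one conversational turn, assuming each stage is an async function supplied by the caller. The names `handleTurn`, `recognize`, `understand`, `decide`, and `synthesize` are hypothetical, and the stubs stand in for real services; nothing here is taken from the talk's demo.

```javascript
// Run one turn through ASR -> NLU -> bot intelligence -> TTS.
// Every stage is injected, so real services can replace the stubs below.
async function handleTurn(audio, context, { recognize, understand, decide, synthesize }) {
  const text = await recognize(audio);                            // ASR: audio to text
  const meaning = await understand(text);                         // NLU: text to structured data
  const { reply, nextContext } = await decide(meaning, context);  // bot intelligence
  const speech = await synthesize(reply);                         // TTS: text to speech
  return { speech, context: nextContext };
}

// Hypothetical stand-ins for real services, for illustration only.
const stubs = {
  recognize: async (audio) => 'show me the schedule',
  understand: async (text) => ({ intent: 'SHOW_SCHEDULE', text }),
  decide: async (meaning, context) => ({
    reply: 'Here is the schedule.',
    nextContext: { ...context, lastIntent: meaning.intent },
  }),
  synthesize: async (reply) => `[spoken] ${reply}`,
};

handleTurn('raw-audio', {}, stubs).then((result) => console.log(result));
```

Passing the stages in as arguments keeps the turn logic independent of any particular ASR, NLU, or TTS provider.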

  40. Designing a Voice Interface
    • Determine intents of the system
    • Define entities for those intents
    • Model the conversational flow
    • Scripting and read-through

    View Slide

  41. Determine Intents
    • Intents: goals the user wishes to accomplish
    • I need to run this report
    • I need to find this person’s phone number
    • Create a list of your users’ goals

    View Slide

  42. Determine Entities
    “Show me flights from Austin to Atlanta leaving next Friday after 4pm.”
    INTENT: FLIGHTSEARCH
    • departure city: Austin
    • destination city: Atlanta
    • date: next Friday
    • time: after 4pm

    View Slide
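
As a sketch of the "structured data" an NLU step might hand back for the utterance on slide 42: the field names (`intent`, `entities`, `departureCity`, and so on) and the `missingEntities` helper are illustrative assumptions, not the schema of any specific NLU platform.

```javascript
// Hypothetical NLU result for the slide's utterance.
// Field names are illustrative, not a real platform's schema.
const utterance =
  'Show me flights from Austin to Atlanta leaving next Friday after 4pm.';

const nluResult = {
  intent: 'FLIGHTSEARCH',
  entities: {
    departureCity: 'Austin',
    destinationCity: 'Atlanta',
    date: 'next Friday',
    time: 'after 4pm',
  },
};

// List the required entities that the utterance did not supply.
function missingEntities(result, required) {
  return required.filter((name) => !(name in result.entities));
}

console.log(missingEntities(nluResult, ['departureCity', 'destinationCity', 'date', 'time']));
```

A check like `missingEntities` is one way to decide which follow-up question the interface should ask next.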

  43. Model the Conversation Flow
    User: “Show me flights to Atlanta leaving next Friday after 4pm.”
    Bot: Sure thing! Would you like to fly first class, business class, or coach?
    User: First class
    Bot: Classy! And would you like a meal on this flight?
    User: Yes, of course.
    Bot: Perfect! Here is a list of flights that match your criteria.
    Context tracked across turns:
    • find flights
    • first class
    • first class, yes to meal
    • departure city: Austin

    View Slide
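
One way to picture the context that builds up over the conversation on slide 43 is a running object that each turn adds to. This is a minimal sketch; the property names and the choice to treat the departure city as already known are assumptions made for illustration.

```javascript
// Merge what a turn contributes into the running conversation context.
function updateContext(context, turn) {
  return { ...context, ...turn };
}

// What each user turn on the slide contributes (names are illustrative).
const turns = [
  { intent: 'find flights', destinationCity: 'Atlanta', date: 'next Friday', time: 'after 4pm' },
  { cabinClass: 'first class' },
  { meal: true },
];

// Assumption: the departure city is known before the conversation starts.
const context = turns.reduce(updateContext, { departureCity: 'Austin' });

console.log(context);
```

Because `updateContext` returns a new object each time, earlier states of the conversation stay available for debugging or for undoing a turn.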

  44. The Web Speech API
    • Comes in two flavors:
    • SpeechSynthesis
    • SpeechRecognition

    View Slide

  45. SpeechSynthesis
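    // Grab the browser's speech synthesizer.
    var synth = window.speechSynthesis;
    // Wrap the text to be spoken in an utterance.
    var utterance = new SpeechSynthesisUtterance('Hello, world!');
    // Queue the utterance for speaking.
    synth.speak(utterance);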

    View Slide

  46. SpeechSynthesis
    • Controls for:
    • Voice - utterance.voice
    • Pitch - utterance.pitch
    • Rate - utterance.rate

    View Slide
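
A short sketch of setting the three controls from slide 46. `configureUtterance` is a hypothetical helper, not part of the Web Speech API. The ranges in the comments are the ones given in the Web Speech API specification, and the browser-only part is guarded so the snippet also loads outside a browser.

```javascript
// Apply voice, pitch, and rate to an utterance (or any utterance-like object).
function configureUtterance(utterance, { voice = null, pitch = 1, rate = 1 } = {}) {
  if (voice) utterance.voice = voice; // a SpeechSynthesisVoice from getVoices()
  utterance.pitch = pitch;            // 0 to 2, default 1
  utterance.rate = rate;              // 0.1 to 10, default 1
  return utterance;
}

// Only touch the browser API when it is present.
if (typeof window !== 'undefined' && 'speechSynthesis' in window) {
  const utterance = new SpeechSynthesisUtterance('Hello, world!');
  const voices = window.speechSynthesis.getVoices(); // may be empty until voices have loaded
  configureUtterance(utterance, { voice: voices[0], pitch: 1.5, rate: 0.8 });
  window.speechSynthesis.speak(utterance);
}
```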

  47. SSML
    • Speech Synthesis Markup Language
    • https://github.com/nickclaw/alexa-ssml

    I'm a mouse.

    I'm a house.


    That is huge news!

    That is a hefty fine!



    View Slide
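
The markup around the example sentences on slide 47 did not survive in this transcript, so the following is not a reconstruction of the slide. It is only a small illustration of what SSML looks like, built as a string with the standard `<speak>` and `<prosody>` elements. The helpers `escapeXml` and `withPitch` are hypothetical, and SSML support varies between speech engines.

```javascript
// Escape characters that are special in XML so the text is safe to embed.
function escapeXml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Wrap text in a <prosody> element with the given pitch value.
function withPitch(text, pitch) {
  return `<prosody pitch="${pitch}">${escapeXml(text)}</prosody>`;
}

const ssml =
  `<speak>${withPitch("I'm a mouse.", 'high')} ${withPitch("I'm a house.", 'low')}</speak>`;

console.log(ssml);
```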

  48. source: caniuse.com

    View Slide

  49. source: caniuse.com

    View Slide
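
Since browser support differs for the two halves of the API, it can help to feature-detect before using either. The sketch below assumes only that recognition may be exposed as `SpeechRecognition` or as the prefixed `webkitSpeechRecognition`; the helper names are hypothetical.

```javascript
// Return the SpeechRecognition constructor if the environment provides one.
function getSpeechRecognition(globalObject) {
  return globalObject.SpeechRecognition || globalObject.webkitSpeechRecognition || null;
}

// Report which halves of the Web Speech API are available.
function detectSpeechSupport(globalObject) {
  return {
    synthesis: 'speechSynthesis' in globalObject,
    recognition: getSpeechRecognition(globalObject) !== null,
  };
}

// In a browser, globalThis is the window object.
console.log(detectSpeechSupport(globalThis));
```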

  50. SpeechRecognition
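    // Chrome exposes the constructor with a webkit prefix.
    var recognition = new webkitSpeechRecognition();

    // Start listening when the user clicks anywhere on the page.
    document.body.onclick = function() {
      console.log('clicked');
      recognition.start();
    };

    // Show the top transcript of the first result.
    recognition.onresult = function(event) {
      var utterance = event.results[0][0].transcript;
      alert(utterance);
    };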

    View Slide

  51. View Slide

  52. View Slide

  53. The pitfalls of keyword matching
    Bot: Hello there! Welcome to the conference. You can say: “show me a map”, “schedule”, or “contact organizers.”
    User: Well, I don’t need help with the schedule, but where are the restrooms?
    Bot: Great! Here’s the schedule…

    View Slide
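
The failure on slide 53 is easy to reproduce with a naive matcher. This sketch assumes the interface simply looks for any of its three prompts inside the transcript; `KEYWORDS` and `matchKeyword` are hypothetical names.

```javascript
// The phrases the conference bot says it understands.
const KEYWORDS = ['show me a map', 'schedule', 'contact organizers'];

// Return the first keyword found anywhere in the utterance, or null.
function matchKeyword(utterance) {
  const text = utterance.toLowerCase();
  return KEYWORDS.find((keyword) => text.includes(keyword)) || null;
}

const reply = "Well, I don't need help with the schedule, but where are the restrooms?";

// Matches "schedule" even though the user asked about restrooms.
console.log(matchKeyword(reply));
```

The matcher has no notion of negation or of the actual question, which is the gap an NLU layer is meant to fill.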

  54. Natural Language Understanding
    • Can be built from scratch, but it’s expensive
    • Existing platforms
    • API.AI (Google)
    • Wit.ai (Facebook)
    • LUIS (Microsoft)
    • Watson Cognitive Services (IBM)

    View Slide

  55. View Slide

  56. View Slide

  57. Demo Resources
    • https://github.com/voxable-labs/spree-api-ai-demo
    • https://github.com/TalAter/SpeechKITT

    View Slide

  58. Expando
    https://github.com/voxable-labs/expando
    (is it possible to|can I|how do I) return (something|an item)
    is it possible to return something
    is it possible to return an item
    can I return something
    can I return an item
    how do I return something
    how do I return an item

    View Slide
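
The expansion shown on slide 58 can be sketched in a few lines. This is an illustration of the idea only, not Expando's own implementation or its full syntax; `expand` is a hypothetical function that handles non-nested `(a|b)` groups.

```javascript
// Expand "(a|b) text (c|d)" into every combination of the alternatives.
function expand(template) {
  // Find the first parenthesised group that contains no nested parentheses.
  const match = template.match(/\(([^()]*)\)/);
  if (!match) return [template];

  const [group, body] = match;
  const before = template.slice(0, match.index);
  const after = template.slice(match.index + group.length);

  // Substitute each option, then expand whatever groups remain.
  return body.split('|').flatMap((option) => expand(before + option + after));
}

const phrases = expand('(is it possible to|can I|how do I) return (something|an item)');
phrases.forEach((phrase) => console.log(phrase));
```

Generating training utterances this way keeps the phrase list short to write while still giving the NLU platform every variation.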

  59. VOICE ON THE WEB
    • Core technology
    • Designing a voice interface
    • The Web Speech API
    • Natural Language Understanding

    View Slide

  60. THE FUTURE OF SPEAKING WITH MACHINES
    #givevoice

    View Slide

  61. View Slide

  62. Sub-vocal Recognition
    • 2008: Audeo demos sub-vocal recognition
    • 2009-2017: ????
    • 2017: Facebook announces SVR research

    View Slide

  63. HOW TO BUILD A VOICE INTERFACE IN TEN MINUTES
    #givevoice

    View Slide