The History and Future of Speaking with Machines

An overview of the centuries-long history of humanity's effort to give machines voice, from the earliest myths to the Web Speech API.

Matt Buck

May 24, 2017


  1. Matt Buck CTO & Co-founder, Voxable The History and Future

    of Speaking with Machines #givevoice
  2. Overview • The history of speaking with machines • Voice

    on the web • The future of speaking with machines • Demonstration of building a voice interface

  4. Talking with Machines Speech recognition Speech synthesis

  5. • c.1219 - c.1292 • English philosopher and Franciscan friar

    • Proponent of the scientific method Roger Bacon
  6. Brazen Heads • Automatons • Not real (obvs) • Could

    answer any question put to them https://en.wikipedia.org/wiki/Speech_synthesis#History
  7. 1700s: Early speech synthesis • 1779: Christian Gottlieb Kratzenstein models

    vocal tract • 1791: Wolfgang von Kempelen’s “acoustic-mechanical speech machine” https://en.wikipedia.org/wiki/Wolfgang_von_Kempelen's_Speaking_Machine
  8. None
  9. 1846: Euphonia • Created by Joseph Faber • Also played

    like an organ • Modeled entire head • Spoke three languages https://irrationalgeographic.wordpress.com/2009/06/24/joseph-fabers-talking-euphonia/
  10. https://irrationalgeographic.wordpress.com/2009/06/24/joseph-fabers-talking-euphonia/ Alexander Graham Bell

  11. https://dood.al/pinktrombone/

  12. https://www.flickr.com/photos/ [email protected]/5451405887/in/photostream/

  13. https://www.flickr.com/photos/ [email protected]/5451405887/in/photostream/

  14. https://www.flickr.com/photos/ [email protected]/5451405887/in/photostream/

  15. 99% Invisible EP. 208 “VOX EX MACHINA” http://99percentinvisible.org/episode/vox-ex-machina/

  16. https://www.youtube.com/watch?v=0rAyrmm7vv0 1940: Voder World’s Fair Demo http://hackaday.com/2014/08/12/retrotechtacular-the-voder-from-bell-labs/

  17. https://www.youtube.com/watch?v=41U78QP8nBk 1961: The First Machine to Sing http://speechstones.com/milestones.html

  18. https://www.youtube.com/watch?v=OuEN5TjYRCE 2001: Still Singing http://speechstones.com/milestones.html

  19. • One of the most influential engineers of the 20th

    century • Inventor of the term “transistor” • “There are strong reasons for believing that spoken English is… not recognizable phoneme by phoneme or word by word.” 1969: John Robinson Pierce https://en.wikipedia.org/wiki/Speech_recognition#History
  20. • Used TI’s Solid State Speech • First toy to

    use speech that was synthesized 1978: The Texas Instruments Speak & Spell https://commons.wikimedia.org/wiki/File:TI_SpeakSpell_no_shadow.jpg
  21. 1978: The TI Speak & Spell https://www.youtube.com/watch?v=qM8FcN0aAvU

  22. • Interest in speech recognition reignited by DARPA grants in

    early 70s • Tangora: A voice-activated word processor with a 20,000 word vocabulary 1986: IBM Tangora
  23. None
  24. I V R

  25. Interactive Voice Response

  26. • Integrated with core Apple apps + Wolfram Alpha •

    Solved real problems 2011: Siri
  27. • Neural network capable of generating speech • Can mimic

    any human voice • Reduces gap in performance by over 50% 2016: Google WaveNet https://deepmind.com/blog/wavenet-generative-model-raw-audio/
  28. https://www.youtube.com/watch?v=qM8FcN0aAvU

  29. https://lyrebird.ai

  30. • Ubiquitous voice interfaces • Always-on, react to “wake words”

    • Siri-like functionality • Control smart-home devices 2014-2016: Amazon Echo & Google Home
  31. Tea. Earl Grey. Hot.

  32. • Roger Bacon’s Brazen Head • Early attempts at speech

    synthesis • Voice in the 20th century • Ubiquitous voice THE HISTORY OF SPEAKING WITH MACHINES
  33. VOICE ON THE WEB #givevoice

  34. The Core Technology The Voice User Interface Automatic Speech Recognition

    (ASR) Natural Language Understanding (NLU) Bot Intelligence Text to Speech (TTS)
  35. The Core Technology Automatic Speech Recognition (ASR) Takes the spoken

    word and turns it into text.
  36. None
  37. The Core Technology Natural Language Understanding (NLU) Gives the text

    meaning by turning it into structured data
  38. The Core Technology Bot Intelligence Bot Intelligence manages the context

    of the user, the application, and the conversation.
  39. The Core Technology Text to Speech (TTS) The result generated

    by the intelligence is spoken back to the user.
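
The four stages on slides 34 to 39 can be read as a chain of functions, each feeding the next. The sketch below is illustrative only: every function is a hypothetical stand-in (no real ASR, NLU, or TTS engine is called), and the intent name `show_schedule` is an assumption made for the example.

```javascript
// Hypothetical stand-ins for each stage. A real system would call an ASR
// engine, an NLU service, and a TTS engine in these places.

// ASR: spoken word -> text (here the "audio" already carries its transcript).
const recognize = (audio) => audio.transcript;

// NLU: text -> structured data. A toy keyword test stands in for a real model.
const understand = (text) =>
  /\bschedule\b/i.test(text) ? { intent: "show_schedule" } : { intent: "unknown" };

// Bot intelligence: tracks conversation context and picks a reply.
const decide = (meaning, context) => {
  context.lastIntent = meaning.intent;
  return meaning.intent === "show_schedule"
    ? "Here is the schedule."
    : "Sorry, I did not catch that.";
};

// TTS: reply text -> a request to speak it back to the user.
const synthesize = (reply) => ({ say: reply });

// One conversational turn: ASR -> NLU -> bot intelligence -> TTS.
function handleTurn(audio, context) {
  return synthesize(decide(understand(recognize(audio)), context));
}
```

Keeping each stage behind its own function means any one of them can be swapped for a hosted service without touching the others.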
  40. Designing a Voice Interface • Determine intents of the system

    • Define entities for those intents • Model the conversational flow • Scripting and read-through
  41. Determine Intents • Intents: goals the user wishes to accomplish

    • I need to run this report • I need to find this person’s phone number • Create a list of your user’s goals
  42. Determine Entities “Show me flights from Austin to Atlanta leaving

    next Friday after 4pm.” departure city destination city date time INTENT: FLIGHTSEARCH
  43. Model the Conversation Flow “Show me flights Atlanta leaving next

    Friday after 4pm.” Sure thing! Would you like to fly first class, business class, or coach? First class Classy! And would you like a meal on this flight? Perfect! Here is a list of flights that match your criteria. Yes, of course. find flights first class first class, yes to meal departure city: Austin
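
One way to picture slides 42 and 43 together: the NLU step yields an intent plus whatever entities it found, and the conversation flow asks for the ones still missing. This is a minimal sketch under assumptions. The entity names (`departureCity`, `cabinClass`, and so on) and all prompts except the cabin-class question are invented for illustration.

```javascript
// Entities a hypothetical FLIGHTSEARCH intent needs, each with a prompt to
// ask when the entity is still missing. Names and most prompts are assumed.
const REQUIRED_ENTITIES = [
  ["departureCity", "Where are you flying from?"],
  ["destinationCity", "Where are you flying to?"],
  ["date", "What day would you like to leave?"],
  ["cabinClass", "Would you like to fly first class, business class, or coach?"],
];

// Return the prompt for the first missing entity, or null when all are filled.
function nextPrompt(entities) {
  const missing = REQUIRED_ENTITIES.find(([name]) => entities[name] === undefined);
  return missing ? missing[1] : null;
}

// Structured data for the utterance on slide 42.
const state = {
  intent: "FLIGHTSEARCH",
  entities: {
    departureCity: "Austin",
    destinationCity: "Atlanta",
    date: "next Friday",
    time: "after 4pm",
  },
};
```

With this state, `nextPrompt(state.entities)` returns the cabin-class question; once `cabinClass` is set it returns `null` and the search can run.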
  44. The Web Speech API • Comes in two flavors: •

    SpeechSynthesis • SpeechRecognition
  45. SpeechSynthesis var synth = window.speechSynthesis;
 var utterance = new SpeechSynthesisUtterance('Hello,

  46. SpeechSynthesis • Controls for: • Voice - utterance.voice • Pitch

    - utterance.pitch • Rate - utterance.rate
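
A small helper can set the three controls from slide 46 before speaking. This is a sketch, not the talk's demo code: the `speak` helper and its `options` argument are assumptions, and the synthesizer and utterance constructor are passed in as parameters so the function can also be exercised outside a browser.

```javascript
// Speak `text` with optional voice, pitch, and rate.
// `synth` and `Utterance` default to the browser's Web Speech API globals.
function speak(
  text,
  options = {},
  synth = globalThis.speechSynthesis,
  Utterance = globalThis.SpeechSynthesisUtterance
) {
  // Speech synthesis is not available everywhere; bail out quietly if missing.
  if (!synth || !Utterance) return null;

  const utterance = new Utterance(text);
  if (options.voice) utterance.voice = options.voice; // a voice from synth.getVoices()
  utterance.pitch = options.pitch ?? 1; // 1 is the default pitch
  utterance.rate = options.rate ?? 1;   // 1 is the default speaking rate
  synth.speak(utterance);               // queue the utterance for playback
  return utterance;
}
```

In a browser, `speak("Hello", { pitch: 1.5, rate: 0.8 })` would queue the phrase at a higher pitch and slower rate.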
  47. SSML • Speech Synthesis Markup Language • https://github.com/nickclaw/alexa-ssml <speak version="1.0"

    xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
 <s><prosody pitch="x-high">I'm a mouse.</prosody></s>
 <s><prosody pitch="x-low">I'm a house.</prosody></s>
 <s>That is <emphasis>huge</emphasis> news!</s>
 <s>That is a <emphasis level="strong">hefty</emphasis> fine!</s>
  48. source: caniuse.com

  49. source: caniuse.com

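  50. SpeechRecognition
    var recognition = new webkitSpeechRecognition();
    // Start listening when the page is clicked.
    document.body.onclick = function() { recognition.start(); };
    // Read the top transcript of the first result.
    recognition.onresult = function(event) {
      var utterance = event.results[0][0].transcript;
    };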
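
Because browser support for recognition is uneven, it can help to feature-detect the constructor before using it. The sketch below assumes two hypothetical helpers, `getSpeechRecognition` and `listenOnce`; the `scope` parameter stands in for `window` so the logic can be checked outside a browser.

```javascript
// Find the recognition constructor, prefixed or not; null if unsupported.
function getSpeechRecognition(scope = globalThis) {
  return scope.SpeechRecognition || scope.webkitSpeechRecognition || null;
}

// Start recognition and pass the top transcript of the first result to
// `onTranscript`. Returns the recognition object, or null if unsupported.
function listenOnce(onTranscript, scope = globalThis) {
  const Recognition = getSpeechRecognition(scope);
  if (!Recognition) return null;

  const recognition = new Recognition();
  recognition.onresult = (event) => onTranscript(event.results[0][0].transcript);
  recognition.start();
  return recognition;
}
```

A page could fall back to a text input whenever `listenOnce` returns `null`.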
  51. None
  52. None
  53. The pitfalls of keyword matching Well, I don’t need help

    with the schedule, but where are the restrooms? Hello there! Welcome to the conference. You can say: “show me a map”, “schedule”, or “contact organizers.” Great! Here’s the schedule…
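
The failure on slide 53 is easy to reproduce. The naive matcher below is an assumed implementation written for illustration (the talk does not show one): it returns the first known phrase that appears anywhere in the utterance, so a sentence that merely mentions the schedule triggers it.

```javascript
// The phrases the greeting on slide 53 offers.
const KEYWORDS = ["show me a map", "schedule", "contact organizers"];

// Naive matching: return the first keyword found anywhere in the utterance.
function matchKeyword(utterance) {
  const text = utterance.toLowerCase();
  return KEYWORDS.find((keyword) => text.includes(keyword)) ?? null;
}

// The user asks about restrooms, but the matcher only sees "schedule".
const matched = matchKeyword(
  "Well, I don't need help with the schedule, but where are the restrooms?"
);
console.log(matched); // prints "schedule"
```

The negation and the actual question are both ignored, which is the gap natural language understanding is meant to close.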
  54. Natural Language Understanding • Can be built from scratch, but

    it’s expensive • Existing platforms • API.AI (Google) • Wit.ai (Facebook) • LUIS (Microsoft) • Watson Cognitive Services (IBM)
  55. None
  56. None
  57. Demo Resources • https://github.com/voxable-labs/spree-api-ai-demo • https://github.com/TalAter/SpeechKITT

  58. https://github.com/voxable-labs/expando (is it possible to|can I|how do I) return (something|an

    item) is it possible to return something is it possible to return an item can I return something can I return an item how do I return something how do I return an item Expando
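
The expansion shown on slide 58 can be sketched in a few lines. This is not Expando's actual implementation, only an illustration of the idea: the `expand` function is hypothetical and handles just the flat `(a|b|c)` groups seen on the slide, with no nesting.

```javascript
// Expand a pattern with "(a|b|c)" alternation groups into every phrase.
function expand(pattern) {
  // Split into literal text and groups. The capture group keeps each group's
  // contents, so odd indexes are alternations and even indexes are literals.
  const parts = pattern.split(/\(([^()]*)\)/);

  return parts.reduce(
    (phrases, part, index) => {
      const options = index % 2 === 1 ? part.split("|") : [part];
      // Append every option to every phrase built so far.
      return phrases.flatMap((phrase) => options.map((option) => phrase + option));
    },
    [""]
  );
}

const phrases = expand(
  "(is it possible to|can I|how do I) return (something|an item)"
);
console.log(phrases.length); // prints 6
```

Generating phrases this way is one approach to producing the many training utterances an NLU platform needs for a single intent.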
  59. • Core technology • Designing a voice interface • The

    Web Speech API • Natural Language Understanding VOICE ON THE WEB

  61. None
  62. Sub-vocal Recognition • 2008: Audeo demos sub-vocal recognition • 2009-2017:

    ???? • 2017: Facebook announces SVR research