The History and Future of Speaking with Machines

An overview of the centuries-long history of humanity's effort to give machines voice, from the earliest myths to the Web Speech API.

Matt Buck

May 24, 2017


  1. The History and Future of Speaking with Machines • Matt Buck, CTO & Co-founder, Voxable • #givevoice
  2. Overview • The history of speaking with machines • Voice on the web • The future of speaking with machines • Demonstration of building a voice interface
  3. Roger Bacon • c. 1219 - c. 1292 • English philosopher and Franciscan friar • Proponent of the scientific method
  4. Brazen Heads • Automatons • Not real (obvs) • Could answer any question put to them • https://en.wikipedia.org/wiki/Speech_synthesis#History
  5. 1700s: Early speech synthesis • 1779: Christian Gottlieb Kratzenstein models the vocal tract • 1791: Wolfgang von Kempelen’s “acoustic-mechanical speech machine” • https://en.wikipedia.org/wiki/Wolfgang_von_Kempelen's_Speaking_Machine
  6. 1846: Euphonia • Created by Joseph Faber • Played like an organ • Modeled an entire head • Spoke three languages • https://irrationalgeographic.wordpress.com/2009/06/24/joseph-fabers-talking-euphonia/
  7. 1969: John Robinson Pierce • One of the most influential engineers of the 20th century • Coined the term “transistor” • “There are strong reasons for believing that spoken English is… not recognizable phoneme by phoneme or word by word.” • https://en.wikipedia.org/wiki/Speech_recognition#History
  8. 1978: The Texas Instruments Speak & Spell • Used TI’s Solid State Speech • First toy to use synthesized speech • https://commons.wikimedia.org/wiki/File:TI_SpeakSpell_no_shadow.jpg
  9. 1986: IBM Tangora • Interest in speech recognition reignited by DARPA grants in the early 70s • Tangora: a voice-activated word processor with a 20,000-word vocabulary
  10. 2016: Google WaveNet • Neural network capable of generating speech • Can mimic any human voice • Reduces the gap between state-of-the-art text to speech and human-level performance by over 50% • https://deepmind.com/blog/wavenet-generative-model-raw-audio/
  11. 2014-2016: Amazon Echo & Google Home • Ubiquitous voice interfaces • Always on, react to “wake words” • Siri-like functionality • Control smart-home devices
  12. THE HISTORY OF SPEAKING WITH MACHINES • Roger Bacon’s Brazen Head • Early attempts at speech synthesis • Voice in the 20th century • Ubiquitous voice
  13. The Core Technology • The Voice User Interface • Automatic Speech Recognition (ASR) • Natural Language Understanding (NLU) • Bot Intelligence • Text to Speech (TTS)
  14. The Core Technology: Bot Intelligence • Bot Intelligence manages the context of the user, the application, and the conversation.
  15. The Core Technology: Text to Speech (TTS) • The result generated by the intelligence is spoken back to the user.
  16. Designing a Voice Interface • Determine intents of the system • Define entities for those intents • Model the conversational flow • Scripting and read-through
  17. Determine Intents • Intents: goals the user wishes to accomplish • “I need to run this report” • “I need to find this person’s phone number” • Create a list of your users’ goals
  18. Determine Entities • “Show me flights from Austin to Atlanta leaving next Friday after 4pm.” • INTENT: FLIGHTSEARCH • Entities: departure city (Austin), destination city (Atlanta), date (next Friday), time (after 4pm)
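The outcome of this step can be pictured as a plain data structure. The sketch below is only an illustration: the field names (`intent`, `entities`, `departureCity`, and so on) and the helper `missingEntities` are assumptions made for this example, not the schema of any particular NLU platform.

```javascript
// Hypothetical shape of a parsed utterance; all field names are assumptions.
const parsedUtterance = {
  text: "Show me flights from Austin to Atlanta leaving next Friday after 4pm.",
  intent: "FLIGHTSEARCH",
  entities: {
    departureCity: "Austin",
    destinationCity: "Atlanta",
    date: "next Friday",
    time: "after 4pm",
  },
};

// List which of the entities an intent requires have not been supplied yet.
function missingEntities(parsed, required) {
  return required.filter((name) => !(name in parsed.entities));
}

// "cabinClass" is a made-up extra requirement, used to show a missing entity.
console.log(missingEntities(parsedUtterance, ["departureCity", "cabinClass"]));
```

Knowing which entities are still missing is what lets the conversation ask a follow-up question instead of failing.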
  19. Model the Conversation Flow • User: “Show me flights to Atlanta leaving next Friday after 4pm.” (context: find flights; departure city: Austin) • Bot: “Sure thing! Would you like to fly first class, business class, or coach?” • User: “First class.” (context: first class) • Bot: “Classy! And would you like a meal on this flight?” • User: “Yes, of course.” (context: first class, yes to meal) • Bot: “Perfect! Here is a list of flights that match your criteria.”
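A flow like this one is often modeled as slot filling: the bot keeps a context object and asks for whatever is still missing. The following is a minimal sketch under that assumption; the slot names (`cabinClass`, `meal`) and the function `nextResponse` are hypothetical, and only the prompts are taken from the slide.

```javascript
// Slots the flight search still needs, each paired with a prompt from the slide.
// Slot names are assumptions made for this sketch.
const slots = [
  { name: "cabinClass", prompt: "Sure thing! Would you like to fly first class, business class, or coach?" },
  { name: "meal", prompt: "Classy! And would you like a meal on this flight?" },
];

// Return the prompt for the first unfilled slot, or the closing response.
function nextResponse(context) {
  const missing = slots.find((slot) => !(slot.name in context));
  return missing
    ? missing.prompt
    : "Perfect! Here is a list of flights that match your criteria.";
}

// The context grows turn by turn, as in the slide.
const context = { intent: "findFlights", departureCity: "Austin" };
console.log(nextResponse(context)); // asks about the cabin class
context.cabinClass = "first class";
console.log(nextResponse(context)); // asks about the meal
context.meal = true;
console.log(nextResponse(context)); // lists the matching flights
```

Checking for the presence of a slot with `in`, instead of its truthiness, means an answer of "no meal" (`meal: false`) still counts as filled.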
  20. The Web Speech API • Comes in two flavors: • SpeechSynthesis • SpeechRecognition
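For the SpeechSynthesis half, a minimal sketch might look like the following. It uses the standard `speechSynthesis` object and `SpeechSynthesisUtterance` constructor; the wrapper function `speak`, the chosen language, and the sample sentence are assumptions for illustration. The guard makes the function return `false` where the API is missing, such as outside a browser.

```javascript
// Speak a string with the browser's SpeechSynthesis interface.
// Returns false when the API is unavailable (for example, outside a browser).
function speak(text) {
  if (typeof speechSynthesis === "undefined" ||
      typeof SpeechSynthesisUtterance === "undefined") {
    return false;
  }
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.lang = "en-US"; // assumed language for this example
  utterance.rate = 1;       // default speaking rate
  speechSynthesis.speak(utterance); // queue the utterance for playback
  return true;
}

speak("Hello there! Welcome to the conference.");
```

Browser support varies, and some browsers only allow speech after a user gesture, so the call may need to live inside a click handler.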
  21. SSML • Speech Synthesis Markup Language • https://github.com/nickclaw/alexa-ssml
    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
      <s><prosody pitch="x-high">I'm a mouse.</prosody></s>
      <s><prosody pitch="x-low">I'm a house.</prosody></s>
      <s>That is <emphasis>huge</emphasis> news!</s>
      <s>That is a <emphasis level="strong">hefty</emphasis> fine!</s>
    </speak>
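  22. SpeechRecognition
    var recognition = new webkitSpeechRecognition();
    // Start listening on click (handler body assumed; it is cut off in the slide).
    document.body.onclick = function() { recognition.start(); };
    recognition.onresult = function(event) {
      // Top alternative of the first result.
      var utterance = event.results[0][0].transcript;
    };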
  23. The pitfalls of keyword matching • Bot: “Hello there! Welcome to the conference. You can say: ‘show me a map’, ‘schedule’, or ‘contact organizers.’” • User: “Well, I don’t need help with the schedule, but where are the restrooms?” • Bot: “Great! Here’s the schedule…”
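The failure in this exchange is easy to reproduce. The sketch below is a deliberately naive matcher, written for illustration only; the function name `matchKeyword` and the first-match-wins rule are assumptions, not the code behind the slide.

```javascript
// Naive keyword matcher: the first keyword found in the utterance wins.
const keywords = ["show me a map", "schedule", "contact organizers"];

function matchKeyword(utterance) {
  const text = utterance.toLowerCase();
  return keywords.find((keyword) => text.includes(keyword)) || null;
}

const said = "Well, I don't need help with the schedule, but where are the restrooms?";
console.log(matchKeyword(said)); // "schedule", although the user asked about restrooms
```

The matcher sees the word "schedule" and ignores both the negation and the actual question, which is the gap natural language understanding is meant to close.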
  24. Natural Language Understanding • Can be built from scratch, but it’s expensive • Existing platforms • API.AI (Google) • Wit.ai (Facebook) • LUIS (Microsoft) • Watson Cognitive Services (IBM)
  25. Expando • https://github.com/voxable-labs/expando • (is it possible to|can I|how do I) return (something|an item) • is it possible to return something • is it possible to return an item • can I return something • can I return an item • how do I return something • how do I return an item
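The expansion shown on this slide can be sketched in a few lines. This is not Expando's own implementation, only an illustration of the idea; the function `expand` is hypothetical and handles flat `(a|b|c)` groups without nesting.

```javascript
// Expand every "(a|b|c)" group in a template into all combinations.
// Handles flat (non-nested) groups only.
function expand(template) {
  const match = /\(([^()]*)\)/.exec(template);
  if (!match) return [template]; // no groups left: nothing to expand
  const before = template.slice(0, match.index);
  const after = template.slice(match.index + match[0].length);
  // Expand the remainder once, then prefix it with each alternative.
  const rest = expand(after);
  return match[1]
    .split("|")
    .flatMap((option) => rest.map((tail) => before + option + tail));
}

const phrases = expand("(is it possible to|can I|how do I) return (something|an item)");
console.log(phrases.length); // 6
console.log(phrases.join("\n"));
```

Three openings times two objects yields the six training phrases listed on the slide, in the same order.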
  26. VOICE ON THE WEB • Core technology • Designing a voice interface • The Web Speech API • Natural Language Understanding