The History and Future of Speaking with Machines

Matt Buck, CTO & Co-founder, Voxable
#givevoice

Overview
• The history of speaking with machines
• Voice on the web
• The future of speaking with machines
• Demonstration of building a voice interface

THE HISTORY OF SPEAKING WITH MACHINES #givevoice

Talking with Machines
Speech recognition
Speech synthesis

Roger Bacon
• c.1219 - c.1292
• English philosopher and Franciscan friar
• Proponent of the scientific method

Brazen Heads
• Automatons
• Not real (obvs)
• Could answer any question put to them
https://en.wikipedia.org/wiki/Speech_synthesis#History

1700s: Early speech synthesis
• 1779: Christian Gottlieb Kratzenstein models vocal tract
• 1791: Wolfgang von Kempelen’s “acoustic-mechanical speech machine”
https://en.wikipedia.org/wiki/Wolfgang_von_Kempelen's_Speaking_Machine

1846: Euphonia
• Created by Joseph Faber
• Also played like an organ
• Modeled entire head
• Spoke three languages
https://irrationalgeographic.wordpress.com/2009/06/24/joseph-fabers-talking-euphonia/

Alexander Graham Bell
https://irrationalgeographic.wordpress.com/2009/06/24/joseph-fabers-talking-euphonia/

https://dood.al/pinktrombone/

https://www.flickr.com/photos/64416865@N00/5451405887/in/photostream/

99% Invisible EP. 208 “VOX EX MACHINA” http://99percentinvisible.org/episode/vox-ex-machina/

1940: Voder World’s Fair Demo
https://www.youtube.com/watch?v=0rAyrmm7vv0
http://hackaday.com/2014/08/12/retrotechtacular-the-voder-from-bell-labs/

1961: The First Machine to Sing
https://www.youtube.com/watch?v=41U78QP8nBk
http://speechstones.com/milestones.html

2001: Still Singing
https://www.youtube.com/watch?v=OuEN5TjYRCE
http://speechstones.com/milestones.html

1969: John Robinson Pierce
• One of the most influential engineers of the 20th century
• Coined the term “transistor”
• “There are strong reasons for believing that spoken English is… not recognizable phoneme by phoneme or word by word.”
https://en.wikipedia.org/wiki/Speech_recognition#History

1978: The Texas Instruments Speak & Spell
• Used TI’s Solid State Speech
• First toy to use synthesized speech
https://commons.wikimedia.org/wiki/File:TI_SpeakSpell_no_shadow.jpg

1978: The TI Speak & Spell https://www.youtube.com/watch?v=qM8FcN0aAvU

1986: IBM Tangora
• Interest in speech recognition reignited by DARPA grants in the early 70s
• Tangora: A voice-activated word processor with a 20,000 word vocabulary

Interactive Voice Response

2011: Siri
• Integrated with core Apple apps + Wolfram Alpha
• Solved real problems

2016: Google WaveNet
• Neural network capable of generating speech
• Can mimic any human voice
• Reduces the gap between state-of-the-art speech synthesis and human-level performance by over 50%
https://deepmind.com/blog/wavenet-generative-model-raw-audio/

https://www.youtube.com/watch?v=qM8FcN0aAvU

https://lyrebird.ai

2014-2016: Amazon Echo & Google Home
• Ubiquitous voice interfaces
• Always-on, react to “wake words”
• Siri-like functionality
• Control smart-home devices

Tea. Earl Grey. Hot.

THE HISTORY OF SPEAKING WITH MACHINES
• Roger Bacon’s Brazen Head
• Early attempts at speech synthesis
• Voice in the 20th century
• Ubiquitous voice

VOICE ON THE WEB #givevoice

The Core Technology: The Voice User Interface
• Automatic Speech Recognition (ASR)
• Natural Language Understanding (NLU)
• Bot Intelligence
• Text to Speech (TTS)

The Core Technology: Automatic Speech Recognition (ASR)
Takes the spoken word and turns it into text.

The Core Technology: Natural Language Understanding (NLU)
Gives the text meaning by turning it into structured data.

The Core Technology: Bot Intelligence
Bot Intelligence manages the context of the user, the application, and the conversation.

The Core Technology: Text to Speech (TTS)
The result generated by the intelligence is spoken back to the user.
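
The four stages chain into a single turn of conversation. What follows is only a rough sketch of that chain, not anything shown in the talk: every function name is hypothetical, the "audio" is already a transcript, and the weather intent is an invented example. A real system would call an ASR service, an NLU platform, and a TTS engine at these points.

```javascript
// Hypothetical stand-ins for each stage of the voice pipeline.

function recognizeSpeech(audio) {
  // ASR: spoken word -> text. The fake "audio" already carries a transcript.
  return audio.transcript;
}

function understand(text) {
  // NLU: text -> structured data (an intent plus entities).
  const match = /weather in (\w+)/i.exec(text);
  return match
    ? { intent: 'GetWeather', entities: { city: match[1] } }
    : { intent: 'Unknown', entities: {} };
}

function decide(parse, context) {
  // Bot intelligence: combine the parse with the conversation context.
  context.lastIntent = parse.intent;
  if (parse.intent === 'GetWeather') {
    return 'Looking up the weather in ' + parse.entities.city + '.';
  }
  return "Sorry, I didn't catch that.";
}

function synthesize(text) {
  // TTS: reply text -> speech. Returned as a plain object here.
  return { spoken: text };
}

// One conversational turn: ASR -> NLU -> bot intelligence -> TTS.
function handleTurn(audio, context) {
  const text = recognizeSpeech(audio);
  const parse = understand(text);
  const reply = decide(parse, context);
  return synthesize(reply);
}
```

Keeping each stage behind its own function means any one of them can be swapped for a hosted service without touching the others.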

Designing a Voice Interface
• Determine intents of the system
• Define entities for those intents
• Model the conversational flow
• Scripting and read-through

Determine Intents
• Intents: goals the user wishes to accomplish
  • “I need to run this report”
  • “I need to find this person’s phone number”
• Create a list of your users’ goals

Determine Entities
“Show me flights from Austin to Atlanta leaving next Friday after 4pm.”
INTENT: FLIGHTSEARCH
• departure city: Austin
• destination city: Atlanta
• date: next Friday
• time: after 4pm
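
As a rough illustration of what "structured data" could look like for this utterance, the sketch below pulls the four entities out with a single regular expression. The function name and the shape of the returned object are assumptions for illustration. NLU platforms use trained models rather than one hard-coded pattern, so this handles only this exact sentence shape.

```javascript
// Toy, regex-based parser for one utterance shape (hypothetical helper).
function parseFlightSearch(utterance) {
  const pattern =
    /flights from (\w+) to (\w+) leaving (.+?) (after \d{1,2}(?::\d{2})?\s?[ap]m)/i;
  const match = pattern.exec(utterance);
  if (!match) return null; // The utterance does not fit this pattern.
  return {
    intent: 'FLIGHTSEARCH',
    entities: {
      departureCity: match[1],
      destinationCity: match[2],
      date: match[3].trim(),
      time: match[4],
    },
  };
}

const parsed = parseFlightSearch(
  'Show me flights from Austin to Atlanta leaving next Friday after 4pm.'
);
console.log(parsed);
```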

Model the Conversation Flow
User: “Show me flights Atlanta leaving next Friday after 4pm.”
Bot: Sure thing! Would you like to fly first class, business class, or coach?
User: First class
Bot: Classy! And would you like a meal on this flight?
User: Yes, of course.
Bot: Perfect! Here is a list of flights that match your criteria.
Context labels on the slide: find flights; first class; first class, yes to meal; departure city: Austin
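
One common way to model a flow like this is slot filling: the bot keeps a context object and asks for whichever required piece is still missing. The sketch below is a minimal version under that assumption; the slot names, prompts, and helper functions are hypothetical and not taken from the talk.

```javascript
// Slots the hypothetical flight-search intent still needs, in prompt order.
const REQUIRED_SLOTS = ['cabinClass', 'meal'];

const PROMPTS = {
  cabinClass: 'Would you like to fly first class, business class, or coach?',
  meal: 'Would you like a meal on this flight?',
};

// Finds the first required slot that the context has not filled yet.
function firstMissingSlot(context) {
  return REQUIRED_SLOTS.find((slot) => !(slot in context));
}

// Returns the next prompt, or null once every slot is filled.
function nextPrompt(context) {
  const missing = firstMissingSlot(context);
  return missing ? PROMPTS[missing] : null;
}

// Records an answer against the first unfilled slot; returns a new context.
function fillNextSlot(context, answer) {
  const missing = firstMissingSlot(context);
  return missing ? { ...context, [missing]: answer } : context;
}
```

Starting from a context such as `{ intent: 'find flights', departureCity: 'Austin' }`, alternating `nextPrompt` and `fillNextSlot` walks through the cabin-class and meal questions until `nextPrompt` returns `null`.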

The Web Speech API
• Comes in two flavors:
  • SpeechSynthesis
  • SpeechRecognition

SpeechSynthesis
var synth = window.speechSynthesis;
// Wrap the text to speak in an utterance, then queue it.
var utterance = new SpeechSynthesisUtterance('Hello, world!');
synth.speak(utterance);

SpeechSynthesis
• Controls for:
  • Voice - utterance.voice
  • Pitch - utterance.pitch
  • Rate - utterance.rate
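
A small sketch of setting those three controls follows. The `buildUtterance` helper and its `options` object are hypothetical; the constructor is passed in as an argument only so the function can be exercised outside a browser. The clamping reflects the ranges given in the Web Speech API specification (pitch 0 to 2, rate 0.1 to 10, both defaulting to 1).

```javascript
// Builds an utterance with voice, pitch, and rate applied (hypothetical helper).
function buildUtterance(text, options, UtteranceCtor) {
  const utterance = new UtteranceCtor(text);
  if (options.voice) utterance.voice = options.voice;
  // Clamp to the ranges the specification allows.
  utterance.pitch = Math.min(2, Math.max(0, options.pitch ?? 1));
  utterance.rate = Math.min(10, Math.max(0.1, options.rate ?? 1));
  return utterance;
}

// Browser usage, guarded so it is skipped where the API is missing.
if (typeof window !== 'undefined' && 'speechSynthesis' in window) {
  // getVoices() can return an empty list until voices have loaded,
  // in which case the default voice is used.
  const voices = window.speechSynthesis.getVoices();
  const utterance = buildUtterance(
    'Hello, world!',
    { voice: voices[0], pitch: 1.5, rate: 0.9 },
    window.SpeechSynthesisUtterance
  );
  window.speechSynthesis.speak(utterance);
}
```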

SSML
• Speech Synthesis Markup Language
• https://github.com/nickclaw/alexa-ssml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <s><prosody pitch="x-high">I'm a mouse.</prosody></s>
  <s><prosody pitch="x-low">I'm a house.</prosody></s>
  <p>
    <s>That is <emphasis>huge</emphasis> news!</s>
    <s>That is a <emphasis level="strong">hefty</emphasis> fine!</s>
  </p>
</speak>

source: caniuse.com

SpeechRecognition
var recognition = new webkitSpeechRecognition();
// Start listening when the page is clicked.
document.body.onclick = function() {
  console.log('clicked');
  recognition.start();
};
// Show the top transcript of the first result.
recognition.onresult = function(event) {
  var utterance = event.results[0][0].transcript;
  alert(utterance);
};
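
Browser support for recognition is uneven, and some browsers expose the constructor only under the `webkit` prefix. The sketch below shows one way to feature-detect it and attach an error handler. `getRecognitionCtor` and `createRecognizer` are hypothetical helper names; the properties they set (`lang`, `interimResults`, `onresult`, `onerror`) come from the Web Speech API.

```javascript
// Returns the recognition constructor if this environment has one, else null.
function getRecognitionCtor(globalObject) {
  return (
    globalObject.SpeechRecognition ||
    globalObject.webkitSpeechRecognition ||
    null
  );
}

// Wires up a recognizer; onTranscript receives the top alternative's text.
function createRecognizer(globalObject, onTranscript, onError) {
  const Ctor = getRecognitionCtor(globalObject);
  if (!Ctor) return null; // Recognition is not available here.
  const recognition = new Ctor();
  recognition.lang = 'en-US';
  recognition.interimResults = false; // Deliver final results only.
  recognition.onresult = function (event) {
    onTranscript(event.results[0][0].transcript);
  };
  recognition.onerror = function (event) {
    onError(event.error); // For example 'not-allowed' or 'no-speech'.
  };
  return recognition;
}
```

In a page, `createRecognizer(window, console.log, console.error)` returns either a recognizer to call `start()` on or `null`, which is the cue to fall back to a text input.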

The pitfalls of keyword matching
Bot: Hello there! Welcome to the conference. You can say: “show me a map”, “schedule”, or “contact organizers.”
User: Well, I don’t need help with the schedule, but where are the restrooms?
Bot: Great! Here’s the schedule…
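
The failure is easy to reproduce with a naive matcher. This is a minimal sketch of keyword matching as it might be written, not code from the talk; `KEYWORDS` and `matchKeyword` are hypothetical names.

```javascript
const KEYWORDS = ['show me a map', 'schedule', 'contact organizers'];

// Naive matcher: returns the first keyword found anywhere in the utterance.
function matchKeyword(utterance) {
  const text = utterance.toLowerCase();
  return KEYWORDS.find((keyword) => text.includes(keyword)) || null;
}

// The user explicitly does not want the schedule, but the keyword is present.
console.log(
  matchKeyword(
    "Well, I don't need help with the schedule, but where are the restrooms?"
  )
);
```

The matcher sees only that the word "schedule" occurs. It has no notion of negation or of the actual question about restrooms, which is the gap natural language understanding is meant to close.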

Natural Language Understanding
• Can be built from scratch, but it’s expensive
• Existing platforms
  • API.AI (Google)
  • Wit.ai (Facebook)
  • LUIS (Microsoft)
  • Watson Cognitive Services (IBM)

Demo Resources
• https://github.com/voxable-labs/spree-api-ai-demo
• https://github.com/TalAter/SpeechKITT

Expando
https://github.com/voxable-labs/expando
(is it possible to|can I|how do I) return (something|an item)
• is it possible to return something
• is it possible to return an item
• can I return something
• can I return an item
• how do I return something
• how do I return an item
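
The expansion itself is a Cartesian product over the parenthesized groups. The sketch below illustrates that idea in JavaScript. It is not Expando's implementation or syntax beyond the `(a|b)` form shown on the slide, and the `expand` function is a hypothetical name.

```javascript
// Expands every "(a|b|c)" group in a template into all combinations.
function expand(template) {
  const group = /\(([^()]+)\)/.exec(template);
  if (!group) return [template]; // No groups left: nothing to expand.
  const before = template.slice(0, group.index);
  const after = template.slice(group.index + group[0].length);
  // Expand the remainder once, then prefix each alternative in this group.
  const tails = expand(after);
  return group[1]
    .split('|')
    .flatMap((option) => tails.map((tail) => before + option + tail));
}

const phrases = expand(
  '(is it possible to|can I|how do I) return (something|an item)'
);
console.log(phrases.length); // 6
```

Generating the variations this way keeps NLU training phrases in one compact template instead of a hand-maintained list.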

VOICE ON THE WEB
• Core technology
• Designing a voice interface
• The Web Speech API
• Natural Language Understanding

THE FUTURE OF SPEAKING WITH MACHINES #givevoice

Sub-vocal Recognition
• 2008: Audeo demos sub-vocal recognition
• 2009-2017: ????
• 2017: Facebook announces SVR research

HOW TO BUILD A VOICE INTERFACE IN TEN MINUTES #givevoice
