The life of an utterance

From your phone to your Xbox Kinect: what happens when you speak to your devices and they actually listen.

Who am I?

Look Mom I’m Famous!

The beginning...

My speech reco adventures

Industries

Where is speech found in everyday life?

Professions and Schooling for SR
• Speech Scientist - usually a PhD in CS or EE
• Software Engineer - BS CS or CompEng, MS a plus, C++ or C
• Speech UX Designer - varies: anthropology, BS CS, linguistics...
• Speech Product Manager - varies: BS CS, MS CS, Linguistics, BS/MS EE

Speech is HARD

The Life of an Utterance

What is an utterance?
An utterance is the vocalization (speaking) of a word or words that represent a single meaning to the computer. An utterance can be a single word, a few words, a sentence, or even multiple sentences.

Hooked on Phonics
Phoneme - any of the perceptually distinct units of sound in a specified language that distinguish one word from another.
44 in American English
The word "dog" is made up of 3 phonemes: /d/-/o/-/g/
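To make the phoneme idea concrete, here is a minimal sketch of a pronunciation lexicon. The `LEXICON` table and the `differing_phonemes` helper are hypothetical names made up for illustration, and the phoneme symbols are informal, following the /d/-/o/-/g/ notation rather than a standard phone set.

```python
# Hypothetical pronunciation lexicon: each word maps to its phoneme sequence.
# Symbols are informal and follow the slide's /d/-/o/-/g/ notation.
LEXICON = {
    "dog": ["d", "o", "g"],
    "dig": ["d", "i", "g"],
    "log": ["l", "o", "g"],
}


def differing_phonemes(word_a, word_b):
    """Return (position, phoneme_a, phoneme_b) wherever two pronunciations differ."""
    a, b = LEXICON[word_a], LEXICON[word_b]
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length pronunciations")
    return [(i, pa, pb) for i, (pa, pb) in enumerate(zip(a, b)) if pa != pb]


print(len(LEXICON["dog"]))               # 3
print(differing_phonemes("dog", "dig"))  # [(1, 'o', 'i')]
```

A single differing phoneme is enough to distinguish one word from another, which is why "dog" and "dig" are easy to confuse acoustically.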

How we do speech recognition

What is automated speech recognition?
1. The speech engine (SE) loads a list of words to recognize
2. Utterances are received in waveform
3. The SE looks at features and compares them against the acoustic model, using grammars to guide the comparison
4. The SE determines which words match best and returns a result
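The four steps can be sketched as a toy recognizer. This is an illustration under heavy assumptions, not how a production engine works: the `TEMPLATES` table stands in for an acoustic model with one made-up feature vector per word, the grammar is just a list of allowed words, and matching uses plain squared distance.

```python
# Hypothetical stand-in for an acoustic model: one reference feature
# vector per word. The numbers are invented for illustration.
TEMPLATES = {
    "play":  [0.9, 0.1, 0.3],
    "pause": [0.8, 0.6, 0.2],
    "stop":  [0.1, 0.9, 0.7],
    "rate":  [0.4, 0.2, 0.9],
}


def recognize(features, grammar):
    """Return (best_word, distance) among the words the grammar allows."""
    # Step 1: load the list of words to recognize.
    candidates = [word for word in grammar if word in TEMPLATES]
    if not candidates:
        raise ValueError("grammar contains no known words")

    # Step 3: compare the features against the model for each candidate.
    def distance(word):
        return sum((f - t) ** 2 for f, t in zip(features, TEMPLATES[word]))

    # Step 4: pick the word that matches best and return it.
    best = min(candidates, key=distance)
    return best, distance(best)


# Step 2 stand-in: features that would be extracted from the waveform.
utterance_features = [0.85, 0.2, 0.25]
word, score = recognize(utterance_features, ["play", "stop"])
print(word)  # play
```

Restricting the grammar changes the answer: the same features scored against only "stop" and "rate" return "rate", because the closer words are no longer allowed.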

Building the Acoustic Model
Goal: Model the likelihood of sounds given spectral features, pronunciation models, and prior context
Collect lots of speech and transcribe all the words
Train the model on the labeled speech
How much speech was needed for one language for the Xbox Kinect?
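As a rough sketch of "train the model on the labeled speech", the example below fits one Gaussian per phoneme to labeled feature values and then scores a new value. It is a deliberate simplification: real acoustic models use many-dimensional features and far richer models, and the data, the one-dimensional feature, and the `train` and `log_likelihood` names are all assumptions for illustration.

```python
import math
from collections import defaultdict


def train(labeled_frames):
    """Estimate a (mean, variance) pair per phoneme from (phoneme, feature) pairs."""
    grouped = defaultdict(list)
    for phoneme, value in labeled_frames:
        grouped[phoneme].append(value)

    model = {}
    for phoneme, values in grouped.items():
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        # Floor the variance so a phoneme with identical samples stays usable.
        model[phoneme] = (mean, max(variance, 1e-3))
    return model


def log_likelihood(model, phoneme, value):
    """Log of the Gaussian density of `value` under the phoneme's model."""
    mean, variance = model[phoneme]
    return -0.5 * (math.log(2 * math.pi * variance) + (value - mean) ** 2 / variance)


# Made-up "transcribed speech": each frame is labeled with its phoneme.
data = [("d", 1.0), ("d", 1.2), ("d", 0.8), ("o", 3.0), ("o", 3.4), ("o", 2.6)]
model = train(data)

# Score an unseen frame against every phoneme and keep the most likely one.
best = max(model, key=lambda p: log_likelihood(model, p, 2.9))
print(best)  # o
```

The more labeled speech is collected, the better these per-sound estimates become, which is why the amount of data matters so much.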

What’s happening under the hood?
[Diagram labels: Input Speech, Acoustic Front End, Acoustic Models, Search, Language Model, Recognized Utterance]
• Signal is converted to a sequence of feature vectors based on spectral and temporal measurements
• Acoustic models represent sub-word units, such as phonemes
• Language model predicts the next set of words, and controls which models are hypothesized
• Search is crucial to the system, since many combinations of words must be investigated to find the most probable word sequence
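The front-end step, turning a signal into a sequence of feature vectors, can be sketched with two simple temporal measurements per frame: log energy and zero-crossing rate. This is an assumption-laden stand-in, since real front ends typically compute spectral features instead; the synthetic tone, the 8 kHz rate, and the 25 ms window with a 10 ms step are illustrative choices, not values from the talk.

```python
import math


def frames(signal, size, step):
    """Split a signal into overlapping fixed-size frames."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]


def feature_vector(frame):
    """Two simple temporal measurements: log energy and zero-crossing rate."""
    energy = sum(s * s for s in frame) / len(frame)
    log_energy = math.log(energy + 1e-10)  # small offset avoids log(0) on silence
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [log_energy, crossings / (len(frame) - 1)]


# Synthetic input: 0.1 s of a 200 Hz tone sampled at 8 kHz (assumed values).
rate = 8000
signal = [math.sin(2 * math.pi * 200 * n / rate) for n in range(800)]

# 200-sample frames (25 ms) taken every 80 samples (10 ms).
features = [feature_vector(f) for f in frames(signal, size=200, step=80)]
print(len(features))  # 8
```

Each frame becomes one short vector, and the sequence of vectors is what the acoustic models and the search operate on.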

Search Algorithms and Data Structures
• Basic search algorithms
• Time synchronous search
• Stack decoding
• Lexical trees
• Efficient trees
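Of the items above, the lexical tree is the easiest to show in a few lines. The sketch below builds a prefix tree over phoneme sequences so that words sharing their first sounds share nodes, which lets a search evaluate a common prefix once. The lexicon, the `"#words"` marker key, and both function names are hypothetical and chosen only for this example.

```python
def build_lexical_tree(lexicon):
    """Build a prefix tree (trie) keyed by phoneme; words hang off their last node."""
    root = {}
    for word, phonemes in lexicon.items():
        node = root
        for phoneme in phonemes:
            node = node.setdefault(phoneme, {})
        node.setdefault("#words", []).append(word)  # "#words" marks word ends
    return root


def words_with_prefix(tree, prefix):
    """Return every word whose pronunciation starts with the given phonemes."""
    node = tree
    for phoneme in prefix:
        if phoneme not in node:
            return []
        node = node[phoneme]

    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "#words":
                found.extend(child)
            else:
                stack.append(child)
    return sorted(found)


# Hypothetical lexicon with informal phoneme symbols.
LEXICON = {
    "dog": ["d", "o", "g"],
    "dot": ["d", "o", "t"],
    "dig": ["d", "i", "g"],
    "log": ["l", "o", "g"],
}
tree = build_lexical_tree(LEXICON)
print(words_with_prefix(tree, ["d", "o"]))  # ['dog', 'dot']
```

Here "dog", "dot", and "dig" all share the /d/ node, so the root has only two branches for four words.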

Interactive Demo of SR
http://www.match-project.org.uk/resources/tutorial/Speech_Language/Speech_Recognition/Rec_4.html

Performance
• Vocabulary size and confusability
• Speaker dependence vs. independence
• Isolated, discontinuous, or continuous speech
• Task and language constraints
• Read vs. spontaneous speech
• Adverse conditions

Designing a Voice User Experience
Case study: A movie player
Design 3 screens with visual/audio menu and screen grammar
1. Pick a movie
2. Play the movie
3. Rate the movie
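One way to think about a screen grammar is as a table of the phrases each screen will accept and where each phrase leads. The sketch below is a hypothetical design for the three movie-player screens; the phrases, screen names, and the `handle` function are invented for illustration and are not the case study's actual answer.

```python
# Hypothetical screen grammars: for each screen, the phrases it accepts
# and the screen each phrase leads to.
SCREEN_GRAMMARS = {
    "pick": {"select movie": "play", "next page": "pick"},
    "play": {"pause": "play", "resume": "play", "stop": "rate"},
    "rate": {"thumbs up": "pick", "thumbs down": "pick", "skip": "pick"},
}


def handle(screen, phrase):
    """Return (next_screen, understood) for a phrase spoken on a screen."""
    grammar = SCREEN_GRAMMARS[screen]
    normalized = " ".join(phrase.lower().split())  # ignore case and extra spaces
    if normalized not in grammar:
        return screen, False  # out of grammar: stay put and re-prompt
    return grammar[normalized], True


screen = "pick"
for spoken in ["Select movie", "thumbs up", "stop", "Thumbs up"]:
    screen, understood = handle(screen, spoken)
    print(spoken, "->", screen, understood)
```

Keeping each screen's grammar small means fewer confusable phrases are active at once, and an out-of-grammar phrase such as "thumbs up" during playback is rejected rather than misrecognized.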

Now you do it
• Language Learning App - Going to a restaurant
• Vehicle Navigation App - Getting Directions
• Video Game - Finding a treasure
• Robotics - Your Own Jarvis from Iron Man

Want to play?
Google Speech Demo
https://www.google.com/intl/en/chrome/demos/speech.html

[email protected] @engijen Thank you!

Jennifer Arguello
