The life of an utterance

An introduction to speech recognition technology. Originally presented to the 2014 Intel Girls Who Code program at Stanford University.

Jennifer Arguello

July 15, 2014

Transcript

  1. The life of an utterance

    From your phone to your Xbox Kinect: what happens when you speak to your devices and they actually listen.
  2. Professions and Schooling for SR

    • Speech Scientist - usually a PhD in CS or EE
    • Software Engineer - BS CS or CompEng, MS a plus, C++ or C
    • Speech UX Designer - varies: anthropology, BS CS, linguistics...
    • Speech Product Manager - varies: BS CS, MS CS, Linguistics, BS/MS EE
  3. What is an utterance?

    An utterance is the vocalization (speaking) of a word or words that represent a single meaning to the computer.
    Utterances can be a single word, a few words, a sentence, or even multiple sentences.
  4. Hooked on Phonics

    Phoneme - any of the perceptually distinct units of sound in a specified language that distinguish one word from another.
    44 in American English.
    The word "dog" is made up of 3 phonemes: /d/-/o/-/g/
  5. What is automated speech recognition?

    1. The speech engine (SE) loads a list of words to recognize
    2. Utterances are received in waveform
    3. The SE extracts features and compares them against the acoustic model, using grammars as a guide
    4. The SE determines which words match best and returns a result
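The four steps above can be sketched as a toy recognizer. This is only an illustration under heavy assumptions: pure tones stand in for recorded speech, per-frame zero-crossing counts stand in for real spectral features, and the "acoustic model" is one stored template per word. The names `synth`, `features`, `distance`, and `recognize`, and the three-word grammar, are hypothetical and not from the talk.

```python
import math

def synth(freq, n=800, rate=8000):
    """Toy stand-in for a recorded utterance: a pure tone (assumption)."""
    return [math.sin(2 * math.pi * freq * t / rate) for t in range(n)]

def features(wave, frame=200):
    """Step 3a: turn the waveform into one crude feature per frame.
    Here the feature is the number of zero crossings in the frame."""
    feats = []
    for start in range(0, len(wave) - frame + 1, frame):
        chunk = wave[start:start + frame]
        crossings = sum(1 for a, b in zip(chunk, chunk[1:]) if (a < 0) != (b < 0))
        feats.append(crossings)
    return feats

def distance(a, b):
    """Squared distance between two equal-length feature sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def recognize(wave, grammar, acoustic_model):
    """Steps 3b and 4: score only the words the grammar allows,
    then return the best-matching word."""
    feats = features(wave)
    scores = {word: distance(feats, acoustic_model[word]) for word in grammar}
    return min(scores, key=scores.get)

# Step 1: load the list of words to recognize (hypothetical grammar).
grammar = ["play", "pause", "stop"]

# Hypothetical acoustic model: one template per word, built from toy tones.
acoustic_model = {
    "play": features(synth(300)),
    "pause": features(synth(600)),
    "stop": features(synth(1200)),
}

# Step 2: an utterance arrives as a waveform (a slightly off-pitch "pause").
result = recognize(synth(620), grammar, acoustic_model)
print(result)
```

A real engine replaces each piece with something far richer, but the flow is the same: restrict the candidates with a grammar, extract features, score against models, return the best match.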
  6. Building the Acoustic Model

    Goal: Model likelihood of sounds given spectral features, pronunciation models, and prior context
    • Collect lots of speech and transcribe all the words
    • Train the model on the labeled speech
    How much speech was needed for one language for the Xbox Kinect?
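As a rough illustration of "train the model on the labeled speech", the sketch below fits one Gaussian per phoneme to a single made-up feature value and then asks which phoneme makes a new value most likely. Real acoustic models use many-dimensional features and much richer models; the training data, the one-dimensional feature, and the function names here are all assumptions for illustration.

```python
import math
from collections import defaultdict

def train(labeled):
    """Estimate a (mean, variance) pair per phoneme from labeled feature values."""
    grouped = defaultdict(list)
    for phoneme, value in labeled:
        grouped[phoneme].append(value)
    model = {}
    for phoneme, values in grouped.items():
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        model[phoneme] = (mean, max(var, 1e-3))  # floor avoids a zero variance
    return model

def log_likelihood(model, phoneme, value):
    """Log of the Gaussian density of `value` under the phoneme's model."""
    mean, var = model[phoneme]
    return -0.5 * (math.log(2 * math.pi * var) + (value - mean) ** 2 / var)

def most_likely(model, value):
    """Phoneme whose model assigns the highest likelihood to the feature value."""
    return max(model, key=lambda p: log_likelihood(model, p, value))

# Hypothetical labeled data: (phoneme, feature value) pairs from transcribed speech.
labeled_speech = [
    ("s", 7.8), ("s", 8.1), ("s", 8.4),
    ("o", 1.9), ("o", 2.2), ("o", 2.5),
]

model = train(labeled_speech)
print(most_likely(model, 7.5))
print(most_likely(model, 2.0))
```

The more transcribed speech goes in, the better these estimates get, which is why collecting and labeling data is the bulk of the work.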
  7. What’s happening under the hood?

    [Diagram components: Input Speech, Acoustic Front End, Acoustic Models, Language Model, Search, Recognized Utterance]
    • Signal is converted to a sequence of feature vectors based on spectral and temporal measurements
    • Acoustic models represent sub-word units, such as phonemes
    • Language model predicts the next set of words, and controls which models are hypothesized
    • Search is crucial to the system, since many combinations of words must be investigated to find the most probable word sequence
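One way to picture how the language model and the search interact: each candidate word sequence gets an acoustic score and a language-model score, and the search keeps the sequence with the best combined score. The sketch below uses a tiny bigram language model and two competing hypotheses. Every probability, score, and name in it is invented for illustration, and a real decoder would not enumerate hypotheses exhaustively like this.

```python
import math

# Hypothetical bigram probabilities P(word | previous word); "<s>" marks the start.
BIGRAM = {
    ("<s>", "recognize"): 0.02,
    ("recognize", "speech"): 0.30,
    ("<s>", "wreck"): 0.01,
    ("wreck", "a"): 0.20,
    ("a", "nice"): 0.05,
    ("nice", "beach"): 0.02,
}
UNSEEN = 1e-6  # small probability for word pairs the model has never seen

def lm_logprob(words):
    """Log probability of a word sequence under the bigram model."""
    total = 0.0
    prev = "<s>"
    for word in words:
        total += math.log(BIGRAM.get((prev, word), UNSEEN))
        prev = word
    return total

# Hypothetical acoustic log-likelihoods: the two phrases sound almost alike,
# and the acoustics alone slightly prefer the wrong one.
HYPOTHESES = {
    ("recognize", "speech"): -12.0,
    ("wreck", "a", "nice", "beach"): -11.5,
}

def best_hypothesis(hyps, lm_weight=1.0):
    """Pick the sequence with the best acoustic + weighted language-model score."""
    return max(hyps, key=lambda words: hyps[words] + lm_weight * lm_logprob(words))

print(" ".join(best_hypothesis(HYPOTHESES)))                 # with the language model
print(" ".join(best_hypothesis(HYPOTHESES, lm_weight=0.0)))  # acoustics only
```

With the language model switched off, the acoustically closer but less plausible phrase wins, which is the kind of error the language model is there to prevent.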
  8. Search Algorithms and Data Structures

    • Basic search algorithms
    • Time-synchronous search
    • Stack decoding
    • Lexical trees
    • Efficient trees
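Of the items above, the lexical tree is the easiest to show in a few lines. It is a prefix tree over pronunciations, so words that begin with the same phonemes share nodes and the search evaluates those shared sounds once. The sketch below is a minimal version under assumptions: the four-word lexicon and its simplified phoneme labels are hypothetical, and the `"#words"` key is just this sketch's way of marking where a word ends.

```python
def build_lexical_tree(lexicon):
    """Build a prefix tree (nested dicts) keyed by phoneme."""
    root = {}
    for word, phonemes in lexicon.items():
        node = root
        for p in phonemes:
            node = node.setdefault(p, {})
        node.setdefault("#words", []).append(word)  # word ends at this node
    return root

def count_nodes(tree):
    """Number of phoneme nodes in the tree (word markers are not counted)."""
    return sum(1 + count_nodes(child)
               for key, child in tree.items() if key != "#words")

def words_with_prefix(tree, phonemes):
    """All words whose pronunciation starts with the given phonemes."""
    node = tree
    for p in phonemes:
        if p not in node:
            return []
        node = node[p]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        found.extend(current.get("#words", []))
        stack.extend(child for key, child in current.items() if key != "#words")
    return sorted(found)

# Hypothetical lexicon with simplified phoneme labels.
LEXICON = {
    "dog": ["d", "o", "g"],
    "dot": ["d", "o", "t"],
    "doll": ["d", "o", "l"],
    "cat": ["k", "a", "t"],
}

tree = build_lexical_tree(LEXICON)
flat_size = sum(len(p) for p in LEXICON.values())
print(flat_size, count_nodes(tree))
print(words_with_prefix(tree, ["d", "o"]))
```

Stored flat, the four pronunciations take 12 phonemes. The tree needs 8 nodes because "dog", "dot", and "doll" share their first two sounds.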
  9. Performance

    • Vocabulary size and confusability
    • Speaker dependence vs. independence
    • Isolated, discontinuous, or continuous speech
    • Task and language constraints
    • Read vs. spontaneous speech
    • Adverse conditions
  10. Designing a Voice User Experience

    Case study: A movie player
    Design 3 screens with visual/audio menu and screen grammar
    1. Pick a movie
    2. Play the movie
    3. Rate the movie
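One possible way to write down the three screens is as a small table: each screen has an audio prompt and a grammar listing the phrases it accepts, and each accepted phrase leads to the next screen. This is only a sketch of the idea. The movie titles, prompts, phrases, and the `handle` function are all made up for illustration and are not the design from the talk.

```python
# Hypothetical movie catalog for the "pick a movie" screen.
MOVIES = ["frozen", "gravity"]

# Each screen: an audio prompt plus a grammar mapping accepted phrases
# to the screen that should come next.
SCREENS = {
    "pick": {
        "prompt": "Which movie would you like to watch?",
        "grammar": {movie: "play" for movie in MOVIES},
    },
    "play": {
        "prompt": "Say pause, resume, or stop.",
        "grammar": {"pause": "play", "resume": "play", "stop": "rate"},
    },
    "rate": {
        "prompt": "How many stars, one to five?",
        "grammar": {n: "pick" for n in ("one", "two", "three", "four", "five")},
    },
}

def handle(screen, utterance):
    """Return (next_screen, understood). Phrases outside the current
    screen's grammar are rejected and the user stays on the same screen."""
    grammar = SCREENS[screen]["grammar"]
    phrase = utterance.strip().lower()
    if phrase in grammar:
        return grammar[phrase], True
    return screen, False

# Walk through one session: pick, stop playback, then rate.
screen = "pick"
for said in ["Gravity", "stop", "four"]:
    print(SCREENS[screen]["prompt"], "->", said)
    screen, understood = handle(screen, said)
```

Keeping each screen's grammar small is a design choice: the fewer phrases the recognizer has to tell apart at any moment, the fewer mistakes it makes.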
  11. Now you do it

    • Language Learning App - Going to a restaurant
    • Vehicle Navigation App - Getting directions
    • Video Game - Finding a treasure
    • Robotics - Your own Jarvis from Iron Man