Applied ASR

Ben Langfeld
December 05, 2013

Delivered at AdhearsionConf 2013

Transcript

  1. WHAT IS ASR?

    Automatic Speech Recognition: giving a computer system with an audio-based user interface the ability to “understand” human speech.
  3. WHERE IS ASR USED?

    • IVR
    • Dictation
    • Accessibility systems
    • Artificial intelligence
    • The Google search bar
  7. ANATOMY OF AN ASR ENGINE

    Endpointing -> Feature Extraction -> Recognition -> Natural Language Understanding -> Dialog Management
  11. HOW RECOGNITION WORKS

    • Searches recognition model
      • Acoustic models
      • Dictionary
      • Grammar
        • Rule-based
        • Statistical Language Model
    • Matches feature vectors to similarly represented models
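
To make "matches feature vectors to similarly represented models" concrete, here is a minimal sketch in Python. It is not from the talk: it uses dynamic time warping against stored templates, a much simpler stand-in for the statistical acoustic models a production engine searches. The two-dimensional feature values and the `MODELS` table are invented for illustration.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b):
    """Dynamic time warping cost between two sequences of feature vectors."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three allowed alignment moves.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognise(features, models):
    """Return the label of the stored model closest to the input features."""
    return min(models, key=lambda label: dtw_distance(features, models[label]))


# Hypothetical templates: one short sequence of feature vectors per word.
MODELS = {
    "yes": [[1.0, 0.0], [0.8, 0.2], [0.1, 0.9]],
    "no": [[0.0, 1.0], [0.1, 0.8], [0.9, 0.1]],
}

# An utterance that is longer than either template but shaped like "yes".
utterance = [[0.9, 0.1], [0.9, 0.1], [0.7, 0.3], [0.2, 0.8]]
print(recognise(utterance, MODELS))  # prints: yes
```

The time warping matters because the same word is rarely spoken at the same speed twice, so input and model seldom have the same number of frames.
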
  18. NATURAL LANGUAGE UNDERSTANDING

    • Assign meaning to human speech
    • Unstructured -> structured data
      • Key-value
      • Maybe nesting
    • Either statistical or rule-based
      • Rule-based = SISR (Semantic Interpretation in Speech Recognition) - http://www.w3.org/TR/semantic-interpretation/
    • Can get very complex very quickly
    • One of the main components of so-called AI
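
The "unstructured -> structured data" step can be sketched with a single hand-written rule. This is an illustrative assumption, not SISR itself and not code from the talk: the banking phrase, the `interpret` function and the lookup tables are all hypothetical. It shows recognised text becoming key-value data with one level of nesting.

```python
import re

# Hypothetical vocabulary; a real application would cover far more.
NUMBER_WORDS = {"ten": 10, "twenty": 20, "fifty": 50, "hundred": 100}
CURRENCY_CODES = {"dollars": "USD", "euros": "EUR"}

# One rule: "transfer <amount> <currency> to [my] <account>".
PATTERN = re.compile(
    r"transfer (?P<amount>\w+) (?P<currency>dollars|euros) to (?:my )?(?P<account>\w+)"
)


def interpret(utterance):
    """Map recognised text to nested key-value data, or None if no rule matches."""
    match = PATTERN.fullmatch(utterance.lower().strip())
    if match is None or match.group("amount") not in NUMBER_WORDS:
        return None
    return {
        "action": "transfer",
        "amount": {
            "value": NUMBER_WORDS[match.group("amount")],
            "currency": CURRENCY_CODES[match.group("currency")],
        },
        "destination": match.group("account"),
    }


print(interpret("Transfer fifty dollars to my savings"))
print(interpret("What is the weather like"))  # prints: None
```

Even this one rule needs a vocabulary table, optional words and a fallback, which hints at why rule sets get complex very quickly.
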
  19. WHAT SOLUTIONS ARE AVAILABLE?

    • Lumenvox (UniMRCP & C API)
    • Nuance - complex and confusing product range
    • Vestec
    • AT&T Speech API (HTTP)
    • AT&T Watson
    • CMU Sphinx & PocketSphinx - open source
  20. HOW TO GET STARTED QUICKLY?

    • Telephony Dev Box
    • http://mojolingo.github.io/Telephony-Dev-Box
    • Asterisk & Lumenvox (MRCP)
    • FreeSWITCH & PocketSphinx
    • Adhearsion
  28. GRAMMAR DEVELOPMENT MASTERCLASS

    • SRGS - http://www.w3.org/TR/speech-grammar/
    • RubySpeech - github.com/benlangfeld/ruby_speech
    • Rules
    • Tokens
    • Sequences
    • Alternatives
    • Repeats
    • Tags
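
The six building blocks listed above map onto a small SRGS XML grammar roughly as follows. This sketch is not from the talk: the pizza-ordering phrase is invented, and the Python wrapper (the `alternatives` helper) exists only to check that the document is well-formed and to list the alternatives with their tags. The element and attribute names follow the SRGS 1.0 XML form.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"

GRAMMAR = """\
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         xml:lang="en-US" mode="voice" root="order" tag-format="semantics/1.0">
  <!-- Rule: a named unit of the grammar -->
  <rule id="order" scope="public">
    <!-- Sequence: children must be spoken in document order -->
    <item repeat="0-1">please</item>  <!-- Repeat: zero or one times -->
    <item>I would like a</item>       <!-- Tokens: literal words -->
    <one-of>                          <!-- Alternatives: exactly one item -->
      <item>small <tag>out.size = "S";</tag></item>
      <item>medium <tag>out.size = "M";</tag></item>
      <item>large <tag>out.size = "L";</tag></item>
    </one-of>
    <item>pizza</item>
  </rule>
</grammar>
"""


def alternatives(grammar_xml, rule_id):
    """Return (token text, tag text) pairs for each alternative in a rule."""
    root = ET.fromstring(grammar_xml)
    rule = root.find(f"{{{SRGS_NS}}}rule[@id='{rule_id}']")
    pairs = []
    for item in rule.findall(f"{{{SRGS_NS}}}one-of/{{{SRGS_NS}}}item"):
        tag = item.find(f"{{{SRGS_NS}}}tag")
        pairs.append((item.text.strip(), tag.text))
    return pairs


for token, tag in alternatives(GRAMMAR, "order"):
    print(token, "->", tag)
```

The tags carry the semantic result: whichever size the caller says, the application receives a short code rather than the raw words.
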
  29. TESTING AND TUNING

    • Recognition testing - scripted and small volume
    • Usability testing
    • Tuning
    • Project pilot phase
    • Call monitoring, call log analysis or UX research
    • In and out of grammar examples
    • Compare recognition with utterance
    • Dictionary tuning
    • Grammar probabilities and weighting
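
One common way to "compare recognition with utterance" is word error rate: the word-level edit distance between what was said and what was recognised, divided by the number of words said. The talk does not name a metric, so treat this Python sketch and its sample phrases as an assumption.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# (what the caller said, what the engine returned)
samples = [
    ("pay my bill", "pay my bill"),
    ("pay my bill", "pay the bill"),
    ("check my balance please", "check balance"),
]
for said, recognised in samples:
    print(f"{said!r} vs {recognised!r}: {word_error_rate(said, recognised):.2f}")
```

Tracking this figure across a pilot, separately for in-grammar and out-of-grammar utterances, gives a baseline against which dictionary and weighting changes can be judged.
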
  30. SPEECH RECOGNITION ON THE WEB

    • bit.ly/HTML5_Speech_Input_API
    • talater.com/annyang
  32. SPEAKER VERIFICATION

    • Enrolment -> Speech Model
    • Impostor model: combination of models of other speakers
    • Identity claim -> compared to speech model and impostor model
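
The comparison against both models can be sketched as a log-likelihood ratio. Everything specific here is an assumption for illustration and not from the talk: each model is a single one-dimensional Gaussian given as (mean, standard deviation), the impostor model is an equal-weight mixture of the other speakers' models, and the feature values and threshold are invented.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def log_likelihood(features, models):
    """Log-likelihood of the features under an equal-weight mixture of models."""
    total = 0.0
    for x in features:
        mixture = sum(gaussian_pdf(x, mean, std) for mean, std in models) / len(models)
        total += math.log(mixture)
    return total


def verify(features, speaker_model, other_speakers, threshold=0.0):
    """Accept the identity claim if the speaker model explains the features
    better than the impostor model by more than the threshold."""
    score = log_likelihood(features, [speaker_model]) - log_likelihood(features, other_speakers)
    return score > threshold


# Hypothetical models produced at enrolment: (mean, standard deviation).
claimed_speaker = (120.0, 10.0)
other_speakers = [(180.0, 15.0), (220.0, 20.0)]

print(verify([118.0, 125.0, 121.0], claimed_speaker, other_speakers))  # prints: True
print(verify([200.0, 210.0, 190.0], claimed_speaker, other_speakers))  # prints: False
```

Scoring against the impostor model as well as the speaker model normalises the result, so audio that merely sounds like speech in general does not pass as the claimed speaker.
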