
Introduction to Speech Recognition in Deep Learning

Kajal Puri
October 06, 2018


Transcript

1. DIY guide to Speech Recognition in ML
   Kajal Puri (@Agirlhasnofame, @kajal-puri, @kajalp, @kajalpuri, @KajalP)
   kajal-puri.github.io
   PyCon France, 6th October 2018

2. Speech Recognition
   • Speech recognition: converting speech (spoken data) into text. The spoken data can be a word, syllable, character or sequence of words.
   • Applications:
     1. Automatically generated captions (subtitles) in YouTube videos
     2. Voicemail transcription
     3. Voice assistants such as Siri, Cortana, Echo and Google Home

3. Automatic Speech Recognition
   • ASR (Automatic Speech Recogniser): the workflow of an ASR system is:
     ◦ Speech segment extraction
     ◦ Speech processing and modelling
     ◦ Pattern recognition and training

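To make those three stages concrete, here is a minimal, hypothetical sketch of an isolated-word pipeline in Python. The use of librosa for segment extraction and feature processing, and a scikit-learn classifier as the pattern-recognition stage, are illustrative choices of mine rather than anything prescribed in the talk; a real continuous-speech ASR would use an acoustic model plus a language model instead of a per-utterance classifier.

```python
import librosa                                        # assumed: audio loading / feature extraction
import numpy as np
from sklearn.neural_network import MLPClassifier      # stand-in "pattern recognition" model

def extract_features(wav_path, sr=16000, n_mfcc=13):
    """Speech segment extraction + processing: load audio, trim silence, compute MFCCs."""
    audio, _ = librosa.load(wav_path, sr=sr)
    audio, _ = librosa.effects.trim(audio)            # crude speech-segment extraction
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)                          # one fixed-size vector per utterance

def train_recognizer(wav_paths, labels):
    """Pattern recognition and training: fit a classifier on labelled utterances."""
    X = np.stack([extract_features(p) for p in wav_paths])
    model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
    model.fit(X, labels)
    return model
```
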
4. Challenges of ASR
   1. Human accent: most datasets available on the internet are in English, and roughly 80%-90% of the speakers are native speakers of American English. The training data is therefore heavily skewed towards one accent, so the system often fails to recognise the same word when uttered by people who are not of American origin and have different accents, whereas humans handle this very well.

5. Problems of ASR
   2. Background noise: most publicly available datasets have little to no background noise, which makes it easy for the algorithm to pick out the words, but this is rarely the case in real-life situations.

6. Problems of ASR
   3. Misinterpretation of homophones: homophones are words that sound almost or exactly the same, and may even have similar spellings, but whose meanings are completely different. A person might say "We've won the match" but the system might hear "We've one the match", which makes the sentence meaningless. A few more examples commonly misinterpreted by ASR: "We went to see the sea", "Their book is lying there", "I hear a lot of rides can be enjoyed here", "I always write, right".

7. Problems of ASR
   4. Overlapping sounds: the available datasets share one common feature: there is no overlap between multiple speakers in the same audio stream, whereas in real life speakers overlap almost all the time and humans still understand them. Public data is mostly single-stream and single-speaker: when one person speaks, the other is silent and vice versa, which is almost never the case in real life. A good ASR must be able to segment the audio and distinguish who is speaking, and it should also be able to make sense of audio with overlapping speakers.

8. Problems of ASR
   5. Other minor issues:
   • Reverberation from varying environments
   • Artefacts from the hardware
   • Audio and compression artefacts
   • The sample rate
   • The age of the speaker
   • The need for low latency
   • The heavy computation required

9. Need for an ASR
   • Helps with multi-tasking, e.g. while driving.
   • A boon for people with disabilities.
   • Can be used by both literate and illiterate people.
   • Many endangered languages close to extinction can be used in technology, promoted and preserved with ASR.
   • Speech is a primary means of communication.
   • A faster way of giving instructions.

10. Types of ASR
   1. Template based (traditional method):
      a. The system already has a default set of templates and finds the closest match (see the DTW sketch below).
      b. Works fine for discrete utterances and a single user.
   2. Knowledge or rule based (traditional method):
      a. Pre-defined rules.
      b. If something falls outside the rules, the system cannot respond.
      c. E.g. a company's IVRS system.
   3. Statistical approach (today's standard method):
      a. A lot of data is collected and used to train machine/deep learning models.
      b. At run time, statistical processes are applied.
      c. Much more accurate and human-like.

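To make the template-based idea concrete, here is a small sketch of template matching with dynamic time warping (DTW) over per-utterance feature sequences. The plain-NumPy DTW and the names templates and query are illustrative assumptions, not part of the talk.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences (frames x dims)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])              # frame-to-frame distance
            cost[i, j] = d + min(cost[i - 1, j],                  # insertion
                                 cost[i, j - 1],                  # deletion
                                 cost[i - 1, j - 1])              # match
    return cost[n, m]

def recognise(query, templates):
    """Template-based recognition: return the label of the closest stored template."""
    return min(templates, key=lambda label: dtw_distance(query, templates[label]))
```

A "template" here is just a stored feature sequence (e.g. MFCC frames) for each known word; recognition picks the label whose template is cheapest to warp onto the query, which is why this works for discrete utterances from a single user but scales poorly beyond that.
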
11. Existing libraries/APIs for ASR
   • Google Speech API: the best ASR available right now; generates English transcripts with more than 80% accuracy, depending on the quality of the data. It has separate APIs for Android and a JavaScript API for Chrome. (A calling sketch follows below.)
   • IBM Watson API: the oldest speech-to-text API; very limited options for customising the result.
   • Microsoft Cognitive Services (Bing Speech API): offers unique features such as voice authentication and counting the number of distinct speakers, and works for many different languages and dialects.
   • DeepSpeech: Mozilla's open-source speech-to-text engine, based on Baidu's research paper, available mainly as a Python/TensorFlow library.

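As a rough illustration of how such cloud ASR services are commonly called from Python, here is a sketch using the third-party SpeechRecognition package as a wrapper around the Google Web Speech API. The package choice, the file name and the language code are assumptions for the example, not recommendations from the talk.

```python
import speech_recognition as sr     # pip install SpeechRecognition (third-party wrapper)

recognizer = sr.Recognizer()
with sr.AudioFile("utterance.wav") as source:          # hypothetical input file
    audio = recognizer.record(source)                  # read the whole file into memory

try:
    text = recognizer.recognize_google(audio, language="en-US")
    print("Transcript:", text)
except sr.UnknownValueError:
    print("Speech was unintelligible")
except sr.RequestError as err:
    print("API request failed:", err)
```
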
12. Why DeepSpeech?
   • Open source
   • Hyperparameter tuning of the models
   • Cost effective
   • Batch processing
   • Train your own model
   • Distributed training
   • Pre-trained models (see the usage sketch below)
   • Gives better results with real-time speech data

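A rough sketch of transcribing a file with a pre-trained DeepSpeech model from Python is shown below. The constructor arguments and model file names have changed across DeepSpeech releases, so treat the exact calls as an approximation rather than the definitive API; the audio is assumed to be 16 kHz, 16-bit mono, which is what the released models expect.

```python
import wave
import numpy as np
from deepspeech import Model        # pip install deepspeech

# The model file name is a placeholder for the released pre-trained graph.
ds = Model("deepspeech-model.pbmm")

with wave.open("utterance.wav", "rb") as wav:          # assumed: 16 kHz, 16-bit mono WAV
    frames = wav.readframes(wav.getnframes())
audio = np.frombuffer(frames, dtype=np.int16)

print(ds.stt(audio))                                   # speech-to-text inference
```
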
13. Text-to-Speech
   Two major parameters on which machine TTS is judged:
   1. Intelligibility
      • Quality of the audio
      • Clean and clear
      • Listenable
   2. How human it sounds (naturalness)
      • Emotions
      • Timing and structure of sentences
      • Pronunciation

14. WaveNet
   • A generative model for raw audio.
   • Autoregressive generation.
   • Built from a stack of dilated causal convolutional layers with residual and skip connections.
   • The network takes the raw audio waveform as input and, after processing it, outputs the next waveform sample.
   • If the model is not conditioned, i.e. it is not fed the structure of the speech, it does not generate meaningful audio.
   • Experiment: trained on recordings of human speech, the unconditioned output sounded human but was really just babbling, humming and pausing at random times.
   • After being fed a lot of data it produced high-quality voices and sentences that make sense, but it is very expensive to train.
   • It took around 4 minutes to generate one second of audio.

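The "stack of convolutional layers with residual and skip connections" can be sketched in a few lines of Keras. This is a toy approximation of a WaveNet-style block (dilated causal convolutions with a gated activation), not DeepMind's implementation; the layer widths, dilation schedule and 256 quantisation levels are illustrative choices.

```python
import tensorflow as tf
from tensorflow.keras import layers

def wavenet_block(x, filters=64, kernel_size=2, dilation_rate=1):
    """One WaveNet-style block: gated dilated causal convolution + residual and skip paths."""
    tanh_out = layers.Conv1D(filters, kernel_size, padding="causal",
                             dilation_rate=dilation_rate, activation="tanh")(x)
    sigm_out = layers.Conv1D(filters, kernel_size, padding="causal",
                             dilation_rate=dilation_rate, activation="sigmoid")(x)
    gated = layers.Multiply()([tanh_out, sigm_out])     # gated activation unit
    skip = layers.Conv1D(filters, 1)(gated)             # skip connection to the output stack
    residual = layers.Add()([x, skip])                  # residual connection to the next block
    return residual, skip

# Tiny model: waveform in, categorical distribution over the next (quantised) sample out.
inputs = layers.Input(shape=(None, 1))
x = layers.Conv1D(64, 1)(inputs)                        # project to the block width
skips = []
for d in (1, 2, 4, 8):                                  # exponentially growing dilations
    x, s = wavenet_block(x, dilation_rate=d)
    skips.append(s)
out = layers.Activation("relu")(layers.Add()(skips))
out = layers.Conv1D(256, 1, activation="softmax")(out)  # 256 quantisation levels (mu-law)
model = tf.keras.Model(inputs, out)
```

Generation is what makes this slow: each new sample requires a full forward pass conditioned on all previously generated samples, which is why producing one second of audio took minutes.
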
15. SampleRNN
   • Another approach to generating audio samples.
   • Multiple RNNs are connected in a hierarchy.
   • The top level takes large chunks of data, processes them and passes the result to the next level, and so on down to the bottom-most level, where a single sample is generated.
   • This is also an autoregressive generative model.
   • Computationally about 500 times faster than WaveNet.
   • It is an unconditional audio generator.
   • The generated audio samples are about as good as WaveNet's, but not much experimentation has been done on this model, so little can be said about its future scope and the kinds of problems it can solve.

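To give a feel for the RNN hierarchy, below is a deliberately simplified two-tier sketch in Keras. The real SampleRNN uses more tiers, feeds previously generated samples back into the lower tiers, and trains with teacher forcing, so this only illustrates the "coarse frames conditioning finer predictions" idea under my own toy frame size and layer widths; it is not the published architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

FRAME = 16          # samples per top-tier frame (toy value)
Q_LEVELS = 256      # quantisation levels for the output samples

def build_two_tier_sketch(frames_per_seq=32, hidden=128):
    # Top tier: an RNN that consumes whole frames of samples and emits one
    # conditioning vector per frame (the "large chunks" level).
    frame_in = layers.Input(shape=(frames_per_seq, FRAME))
    top = layers.GRU(hidden, return_sequences=True)(frame_in)

    # Bottom tier: upsample the conditioning to one vector per sample and
    # predict a categorical distribution over each next sample.  (The real
    # SampleRNN also feeds the previously generated samples back in here.)
    per_sample = layers.Dense(FRAME * hidden)(top)
    per_sample = layers.Reshape((frames_per_seq * FRAME, hidden))(per_sample)
    probs = layers.Dense(Q_LEVELS, activation="softmax")(per_sample)
    return Model(frame_in, probs)

model = build_two_tier_sketch()
```

Because the expensive recurrent computation runs once per frame rather than once per sample, the hierarchy is what buys the large speed-up over WaveNet at generation time.
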
16. Play take me to the John I'm not allowing the case without drop out and maybe I can look after the dog we can tick her I had to go to begin may be run without dropping.

17. So remember the days chords to classify a new upcoming picture so what you will do as you will train split your data between train and test so that you can train a mother on some training data and you just keep aside as I said that would represent represent the new upcoming day diet that you have never seen before and you want this that you want to classify.

18. References
   • Deep Speech (Baidu research paper)
   • DeepSpeech (Mozilla)
   • WaveNet (Google DeepMind paper)
   • Google Speech API
   • IBM Watson Speech API
   • Microsoft Bing Speech API
   • Speech synthesis using deep neural networks
   • SampleRNN