
Introduction to Speech Recognition in Deep Learning

Kajal Puri
October 06, 2018


Transcript

1. DIY guide to Speech Recognition in ML
   Kajal Puri (@Agirlhasnofame, @kajal-puri, @kajalp, @kajalpuri, @KajalP)
   kajal-puri.github.io
   PyCon France, 6th October 2018

2. Speech Recognition
   • Speech recognition: converting speech (spoken data) into text. The spoken data can be a word, syllable, character or sequence of words.
   • Applications:
     1. Automatically generated captions (subtitles) in YouTube videos
     2. Voicemail transcription
     3. Voice assistants such as Siri, Cortana, Echo and Google Home

3. Automatic Speech Recognition
   • ASR (Automatic Speech Recogniser): the workflow of an ASR system is:
     ◦ Speech segment extraction
     ◦ Speech processing and modelling
     ◦ Pattern recognition and training

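To make those three stages concrete, here is a minimal, hypothetical sketch of an isolated-word pipeline in Python. The use of librosa for segment extraction and feature processing, and a scikit-learn classifier as the pattern-recognition stage, are illustrative choices of mine rather than anything prescribed in the talk; a real continuous-speech ASR would use an acoustic model plus a language model instead of a per-utterance classifier.

```python
import librosa                                        # assumed: audio loading / feature extraction
import numpy as np
from sklearn.neural_network import MLPClassifier      # stand-in "pattern recognition" model

def extract_features(wav_path, sr=16000, n_mfcc=13):
    """Speech segment extraction + processing: load audio, trim silence, compute MFCCs."""
    audio, _ = librosa.load(wav_path, sr=sr)
    audio, _ = librosa.effects.trim(audio)            # crude speech-segment extraction
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)                          # one fixed-size vector per utterance

def train_recognizer(wav_paths, labels):
    """Pattern recognition and training: fit a classifier on labelled utterances."""
    X = np.stack([extract_features(p) for p in wav_paths])
    model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
    model.fit(X, labels)
    return model
```
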
4. Challenges of ASR
   1. Human accent: most datasets available on the internet are in English, and roughly 80%-90% of the speakers are native speakers of American English. The training data is therefore heavily skewed towards one accent, so the system often fails to recognise the same word when uttered by people who are not of American origin and have different accents, whereas humans handle this very well.

5. Problems of ASR
   2. Background noise: most publicly available datasets have little to no background noise, which makes it easy for the algorithm to pick out the words, but this is rarely the case in real-life situations.

6. Problems of ASR
   3. Misinterpretation of homophones: homophones are words that sound almost or exactly the same, and may even have similar spellings, but whose meanings are completely different. A person might say "We've won the match" but the system might hear "We've one the match", which makes the sentence meaningless. A few more examples commonly misinterpreted by ASR: "We went to see the sea", "Their book is lying there", "I hear a lot of rides can be enjoyed here", "I always write, right".

7. Problems of ASR
   4. Overlapping sounds: the available datasets share one common feature: there is no overlap between multiple speakers in the same audio stream, whereas in real life speakers overlap almost all the time and humans still understand them. Public data is mostly single-stream and single-speaker: when one person speaks, the other is silent and vice versa, which is almost never the case in real life. A good ASR must be able to segment the audio and distinguish who is speaking, and it should also be able to make sense of audio with overlapping speakers.

8. Problems of ASR
   5. Other minor issues:
   • Reverberation from varying environments
   • Artefacts from the hardware
   • Audio and compression artefacts
   • The sample rate
   • The age of the speaker
   • The need for low latency
   • The heavy computation required

9. Need for an ASR
   • Helps with multi-tasking, e.g. while driving.
   • A boon for people with disabilities.
   • Can be used by both literate and illiterate people.
   • Many endangered languages close to extinction can be used in technology, promoted and preserved with ASR.
   • Speech is a primary means of communication.
   • A faster way of giving instructions.

10. Types of ASR
   1. Template based (traditional method):
      a. The system already has a default set of templates and finds the closest match (see the DTW sketch below).
      b. Works fine for discrete utterances and a single user.
   2. Knowledge or rule based (traditional method):
      a. Pre-defined rules.
      b. If something falls outside the rules, the system cannot respond.
      c. E.g. a company's IVRS system.
   3. Statistical approach (today's standard method):
      a. A lot of data is collected and used to train machine/deep learning models.
      b. At run time, statistical processes are applied.
      c. Much more accurate and human-like.

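To make the template-based idea concrete, here is a small sketch of template matching with dynamic time warping (DTW) over per-utterance feature sequences. The plain-NumPy DTW and the names templates and query are illustrative assumptions, not part of the talk.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences (frames x dims)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])              # frame-to-frame distance
            cost[i, j] = d + min(cost[i - 1, j],                  # insertion
                                 cost[i, j - 1],                  # deletion
                                 cost[i - 1, j - 1])              # match
    return cost[n, m]

def recognise(query, templates):
    """Template-based recognition: return the label of the closest stored template."""
    return min(templates, key=lambda label: dtw_distance(query, templates[label]))
```

A "template" here is just a stored feature sequence (e.g. MFCC frames) for each known word; recognition picks the label whose template is cheapest to warp onto the query, which is why this works for discrete utterances from a single user but scales poorly beyond that.
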
11. Existing libraries/APIs for ASR
   • Google Speech API: the best ASR available right now; generates English transcripts with more than 80% accuracy, depending on the quality of the data. It has separate APIs for Android and a JavaScript API for Chrome. (A calling sketch follows below.)
   • IBM Watson API: the oldest speech-to-text API; very limited options for customising the result.
   • Microsoft Cognitive Services (Bing Speech API): offers unique features such as voice authentication and counting the number of distinct speakers, and works for many different languages and dialects.
   • DeepSpeech: Mozilla's open-source speech-to-text engine, based on Baidu's research paper, available mainly as a Python/TensorFlow library.

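As a rough illustration of how such cloud ASR services are commonly called from Python, here is a sketch using the third-party SpeechRecognition package as a wrapper around the Google Web Speech API. The package choice, the file name and the language code are assumptions for the example, not recommendations from the talk.

```python
import speech_recognition as sr     # pip install SpeechRecognition (third-party wrapper)

recognizer = sr.Recognizer()
with sr.AudioFile("utterance.wav") as source:          # hypothetical input file
    audio = recognizer.record(source)                  # read the whole file into memory

try:
    text = recognizer.recognize_google(audio, language="en-US")
    print("Transcript:", text)
except sr.UnknownValueError:
    print("Speech was unintelligible")
except sr.RequestError as err:
    print("API request failed:", err)
```
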
12. Why DeepSpeech?
   • Open source
   • Hyperparameter tuning of the models
   • Cost effective
   • Batch processing
   • Train your own model
   • Distributed training
   • Pre-trained models (see the usage sketch below)
   • Gives better results with real-time speech data

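A rough sketch of transcribing a file with a pre-trained DeepSpeech model from Python is shown below. The constructor arguments and model file names have changed across DeepSpeech releases, so treat the exact calls as an approximation rather than the definitive API; the audio is assumed to be 16 kHz, 16-bit mono, which is what the released models expect.

```python
import wave
import numpy as np
from deepspeech import Model        # pip install deepspeech

# The model file name is a placeholder for the released pre-trained graph.
ds = Model("deepspeech-model.pbmm")

with wave.open("utterance.wav", "rb") as wav:          # assumed: 16 kHz, 16-bit mono WAV
    frames = wav.readframes(wav.getnframes())
audio = np.frombuffer(frames, dtype=np.int16)

print(ds.stt(audio))                                   # speech-to-text inference
```
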
13. Text-to-Speech
   Two major parameters on which machine TTS is judged:
   1. Intelligibility
      • Quality of the audio
      • Clean and clear
      • Listenable
   2. How human it sounds (naturalness)
      • Emotions
      • Timing and structure of sentences
      • Pronunciation

14. WaveNet
   • A generative model for raw audio.
   • Autoregressive generation.
   • Built from a stack of dilated causal convolutional layers with residual and skip connections.
   • The network takes the raw audio waveform as input and, after processing it, outputs the next waveform sample.
   • If the model is not conditioned, i.e. it is not fed the structure of the speech, it does not generate meaningful audio.
   • Experiment: trained on recordings of human speech, the unconditioned output sounded human but was really just babbling, humming and pausing at random times.
   • After being fed a lot of data it produced high-quality voices and sentences that make sense, but it is very expensive to train.
   • It took around 4 minutes to generate one second of audio.

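The "stack of convolutional layers with residual and skip connections" can be sketched in a few lines of Keras. This is a toy approximation of a WaveNet-style block (dilated causal convolutions with a gated activation), not DeepMind's implementation; the layer widths, dilation schedule and 256 quantisation levels are illustrative choices.

```python
import tensorflow as tf
from tensorflow.keras import layers

def wavenet_block(x, filters=64, kernel_size=2, dilation_rate=1):
    """One WaveNet-style block: gated dilated causal convolution + residual and skip paths."""
    tanh_out = layers.Conv1D(filters, kernel_size, padding="causal",
                             dilation_rate=dilation_rate, activation="tanh")(x)
    sigm_out = layers.Conv1D(filters, kernel_size, padding="causal",
                             dilation_rate=dilation_rate, activation="sigmoid")(x)
    gated = layers.Multiply()([tanh_out, sigm_out])     # gated activation unit
    skip = layers.Conv1D(filters, 1)(gated)             # skip connection to the output stack
    residual = layers.Add()([x, skip])                  # residual connection to the next block
    return residual, skip

# Tiny model: waveform in, categorical distribution over the next (quantised) sample out.
inputs = layers.Input(shape=(None, 1))
x = layers.Conv1D(64, 1)(inputs)                        # project to the block width
skips = []
for d in (1, 2, 4, 8):                                  # exponentially growing dilations
    x, s = wavenet_block(x, dilation_rate=d)
    skips.append(s)
out = layers.Activation("relu")(layers.Add()(skips))
out = layers.Conv1D(256, 1, activation="softmax")(out)  # 256 quantisation levels (mu-law)
model = tf.keras.Model(inputs, out)
```

Generation is what makes this slow: each new sample requires a full forward pass conditioned on all previously generated samples, which is why producing one second of audio took minutes.
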
15. SampleRNN
   • Another approach to generating audio samples.
   • Multiple RNNs are connected in a hierarchy.
   • The top level takes large chunks of data, processes them and passes the result to the next level, and so on down to the bottom-most level, where a single sample is generated.
   • This is also an autoregressive generative model.
   • Computationally about 500 times faster than WaveNet.
   • It is an unconditional audio generator.
   • The generated audio samples are about as good as WaveNet's, but not much experimentation has been done on this model, so little can be said about its future scope and the kinds of problems it can solve.

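To give a feel for the RNN hierarchy, below is a deliberately simplified two-tier sketch in Keras. The real SampleRNN uses more tiers, feeds previously generated samples back into the lower tiers, and trains with teacher forcing, so this only illustrates the "coarse frames conditioning finer predictions" idea under my own toy frame size and layer widths; it is not the published architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

FRAME = 16          # samples per top-tier frame (toy value)
Q_LEVELS = 256      # quantisation levels for the output samples

def build_two_tier_sketch(frames_per_seq=32, hidden=128):
    # Top tier: an RNN that consumes whole frames of samples and emits one
    # conditioning vector per frame (the "large chunks" level).
    frame_in = layers.Input(shape=(frames_per_seq, FRAME))
    top = layers.GRU(hidden, return_sequences=True)(frame_in)

    # Bottom tier: upsample the conditioning to one vector per sample and
    # predict a categorical distribution over each next sample.  (The real
    # SampleRNN also feeds the previously generated samples back in here.)
    per_sample = layers.Dense(FRAME * hidden)(top)
    per_sample = layers.Reshape((frames_per_seq * FRAME, hidden))(per_sample)
    probs = layers.Dense(Q_LEVELS, activation="softmax")(per_sample)
    return Model(frame_in, probs)

model = build_two_tier_sketch()
```

Because the expensive recurrent computation runs once per frame rather than once per sample, the hierarchy is what buys the large speed-up over WaveNet at generation time.
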
16. Play take me to the John I'm not allowing the case without drop out and maybe I can look after the dog we can tick her I had to go to begin may be run without dropping.

17. So remember the days chords to classify a new upcoming picture so what you will do as you will train split your data between train and test so that you can train a mother on some training data and you just keep aside as I said that would represent represent the new upcoming day diet that you have never seen before and you want this that you want to classify.

18. References
   • Deep Speech (Baidu research paper)
   • DeepSpeech (Mozilla)
   • WaveNet (Google DeepMind paper)
   • Google Speech API
   • IBM Watson Speech API
   • Microsoft Bing Speech API
   • Speech synthesis using deep neural networks
   • SampleRNN