Slide 1

Toward Deep Learning on Speech Recognition for Khmer Language
Author: Chanmann Lim
Faculty Advisor: Dr. Yunxin Zhao
05/04/2016, University of Missouri-Columbia

Slide 2

Agenda
‣ ASR Architecture
‣ Motivation and Khmer Dataset
‣ Data Preprocessing
‣ Acoustic Modeling and Results
‣ Conclusion

Slide 3

Agenda
‣ ASR Architecture
‣ Motivation and Khmer Dataset
‣ Data Preprocessing
‣ Acoustic Modeling and Results
‣ Conclusion

Slide 4

ASR Architecture

Slide 5

ASR Architecture

Slide 6

ASR Architecture
ក្រុងភ្នំពេញ (Phnom Penh City)

Slide 7

Agenda
‣ ASR Architecture
‣ Motivation and Khmer Dataset
‣ Data Preprocessing
‣ Acoustic Modeling and Results
‣ Conclusion

Slide 8

Motivation
‣ Building an ASR system for a new language remains challenging
  – Lack of training data
  – Interdisciplinary field of research (linguistics, signal processing, and machine learning)
‣ To build an ASR system for my own language, Khmer (ខ្មែរ)
‣ To preserve the Khmer language in this modern digital age

Slide 9

Dataset
‣ The "Khmer keywords" dataset was collected by the Institute of Technology of Cambodia
‣ 15 speakers (9 males and 6 females)
‣ 194 words per speaker
‣ Recorded with mobile phones

Slide 10

Agenda
‣ ASR Architecture
‣ Motivation and Khmer Dataset
‣ Data Preprocessing
‣ Acoustic Modeling and Results
‣ Conclusion

Slide 11

Voice Activity Detection
‣ To segment long audio files into short files
  – One short file for each spoken word
‣ 2,711 files from 14 speakers

Slide 12

Training and Test Sets
‣ 10 training speakers
  – 1,934 utterances (254,458 frames)
‣ 4 test speakers
  – 777 utterances

Slide 13

Agenda
‣ ASR Architecture
‣ Motivation and Khmer Dataset
‣ Data Preprocessing
‣ Acoustic Modeling and Results
‣ Conclusion

Slide 14

1. Gaussian Mixture Model HMM
‣ Each HMM state is represented by a GMM (the standard emission density is recalled below)
(Figure: HMM for the phone /p/)
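For reference, this is the standard GMM state-emission density assumed here; the formula itself is not spelled out on the slide:

$$
b_j(\mathbf{x}_t) \;=\; \sum_{m=1}^{M} c_{jm}\,\mathcal{N}\!\left(\mathbf{x}_t;\,\boldsymbol{\mu}_{jm},\,\boldsymbol{\Sigma}_{jm}\right),
\qquad \sum_{m=1}^{M} c_{jm} = 1,
$$

where state $j$ has $M$ mixture components with weights $c_{jm}$, means $\boldsymbol{\mu}_{jm}$, and covariances $\boldsymbol{\Sigma}_{jm}$.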

Slide 15

1. Context-Dependent Models
‣ 60 monophones produce ~200k triphones
‣ Tie triphone states using Phonetic Decision Tree (PDT) clustering (splitting criterion recalled below)
‣ Khmer phonetic question set
(Figure: Phonetic Decision Tree Clustering)
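As background on how PDT clustering chooses splits (the standard HTK-style criterion, not taken from the slide): the states pooled at a node S are modeled by a single Gaussian, and the question q that maximizes the log-likelihood gain is selected:

$$
\Delta\mathcal{L}(q) = \mathcal{L}\big(S_{\text{yes}}(q)\big) + \mathcal{L}\big(S_{\text{no}}(q)\big) - \mathcal{L}(S),
\qquad
\mathcal{L}(S) \approx -\tfrac{1}{2}\, n_S \left( d\,\log(2\pi) + \log\left|\boldsymbol{\Sigma}_S\right| + d \right),
$$

where $n_S$ is the occupancy (frame count) of node $S$, $d$ the feature dimension, and $\boldsymbol{\Sigma}_S$ the pooled covariance. Splitting stops when the gain falls below the likelihood gain threshold (the TB values on the next slide) or a child falls below the state occupancy threshold.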

Slide 16

1. CD-GMM-HMM Results
‣ State occupancy threshold = 30
‣ Different likelihood gain thresholds (TB) were compared
‣ TB_480 with 6 Gaussians gives 97.17% word accuracy

Slide 17

2. Deep Neural Network HMM
‣ The DNN estimates the posterior probability of each phone state
‣ Emission likelihood (see the formula below)
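The emission-likelihood expression on this slide did not survive extraction; the standard hybrid DNN-HMM "scaled likelihood" used in this setting is:

$$
p(\mathbf{x}_t \mid s) \;=\; \frac{P(s \mid \mathbf{x}_t)\, p(\mathbf{x}_t)}{P(s)} \;\propto\; \frac{P(s \mid \mathbf{x}_t)}{P(s)},
$$

where $P(s \mid \mathbf{x}_t)$ is the DNN posterior for tied state $s$, $P(s)$ is the state prior estimated from the training alignments, and $p(\mathbf{x}_t)$ is constant across states during decoding.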

Slide 18

2. Neural Network Architecture
‣ Layered structure of neurons
‣ Activation function f(·)
‣ Backpropagation with SGD to minimize the training objective (sketched below)
(Figure: layered network with output layer)
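A compact sketch of the computations the slide alludes to, assuming the usual frame-level cross-entropy objective (the slide's own formula was lost in extraction):

$$
\mathbf{h}^{(l)} = f\!\left(\mathbf{W}^{(l)}\mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\right),
\qquad
\hat{\mathbf{y}}_t = \operatorname{softmax}\!\left(\mathbf{W}^{(L)}\mathbf{h}^{(L-1)} + \mathbf{b}^{(L)}\right),
$$

$$
E = -\sum_{t}\log \hat{y}_{t,s_t},
\qquad
\theta \leftarrow \theta - \eta\,\nabla_{\theta} E_{\text{minibatch}},
$$

where $s_t$ is the aligned state label for frame $t$ and $\eta$ is the learning rate.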

Slide 19

2. Gradient Refinement
‣ Prevent vanishing or exploding gradients
‣ Popular techniques (a code sketch follows this list)
  – Gradient clipping: clip the gradient once it exceeds a threshold
  – Weight decay: penalize the objective function with a scaled L2-norm of the weights
  – Momentum: speed up convergence by adding a velocity term built from the previous gradients
  – Max-norm: set a maximum L2-norm bound and rescale the weights once they exceed the bound
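A minimal NumPy sketch of how these four refinements might combine into a single SGD weight update; this is my illustration using the hyperparameters listed on slide 22 (the clip threshold is a made-up value), not the presentation's actual training code.

```python
import numpy as np

def sgd_update(W, grad, velocity, lr=1e-4, clip_threshold=5.0,
               weight_decay=1e-3, momentum=0.99, max_norm=1.0):
    """One refined SGD step for a single weight matrix; returns (W, velocity)."""
    # Gradient clipping: rescale the gradient if its L2-norm exceeds the threshold.
    g_norm = np.linalg.norm(grad)
    if g_norm > clip_threshold:
        grad = grad * (clip_threshold / g_norm)

    # Weight decay: add the gradient of the 0.5 * weight_decay * ||W||^2 penalty.
    grad = grad + weight_decay * W

    # Momentum: accumulate a velocity from previous gradients, then apply it.
    velocity = momentum * velocity - lr * grad
    W = W + velocity

    # Max-norm: rescale each unit's incoming weight vector that exceeds the bound.
    col_norms = np.linalg.norm(W, axis=0, keepdims=True)
    W = W * np.minimum(1.0, max_norm / np.maximum(col_norms, 1e-12))
    return W, velocity

# Example usage with random placeholder values.
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((512, 1024))
velocity = np.zeros_like(W)
grad = rng.standard_normal(W.shape)
W, velocity = sgd_update(W, grad, velocity)
```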

Slide 20

2. Dropout
‣ Prevent overfitting in large neural networks
‣ Randomly omit some hidden nodes during training (see the sketch below)
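A minimal "inverted dropout" sketch (my illustration, not code from the presentation): activations are zeroed with probability p at training time and the survivors are scaled by 1/(1-p), so no rescaling is needed at test time.

```python
import numpy as np

def dropout(activations, p, rng, training=True):
    """Apply inverted dropout with drop probability p to a batch of activations."""
    if not training or p <= 0.0:
        return activations                        # test time: use the full network
    keep_prob = 1.0 - p
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob         # keep expected activations unchanged

rng = np.random.default_rng(0)
hidden = rng.standard_normal((200, 1024))         # e.g. a minibatch of 200 frames
hidden = dropout(hidden, p=0.02, rng=rng)         # hidden-layer dropout rate from slide 22
```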

Slide 21

2. Cross-lingual Model Transfer
‣ Leverage auxiliary data from other languages (see the sketch below)
(Figure: input layer and hidden layers trained on other languages; output layer of the target language)
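An illustrative sketch of the transfer idea (hypothetical shapes and names, not the actual recipe): hidden-layer weights trained on a source language are reused as-is, and only the output layer is re-initialized for the target-language tied states before fine-tuning.

```python
import numpy as np

rng = np.random.default_rng(0)
n_khmer_states = 600                               # hypothetical number of Khmer tied states

# Pretend these hidden layers (W, b) were trained on a source language such as English.
source_hidden = [(0.01 * rng.standard_normal((1024, 1024)), np.zeros(1024))
                 for _ in range(5)]

khmer_net = {
    "hidden": source_hidden,                       # transferred, then fine-tuned on Khmer
    "output": (0.01 * rng.standard_normal((1024, n_khmer_states)),   # fresh softmax layer
               np.zeros(n_khmer_states)),
}
```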

Slide 22

2. CD-DNN-HMM Configuration
‣ Input: 15 frames (7-1-7) of MFCCs, normalized to zero mean and unit variance
‣ Number of hidden layers: 1–8
‣ Hidden units per layer: 512, 1024, 2048
‣ Activation function: ReLU
‣ Initialization: supervised layer-wise pre-training
‣ Minibatch size: 200
‣ Learning rate: 0.0001 (with Newbob decay scheduler)
‣ Weight decay: 0.001
‣ Momentum: 0.99
‣ Max-norm: 1
‣ Dropout rate: input = 0.5; hidden = 0.02
‣ Minimum number of epochs: 24 (fine-tuning)
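For convenience, a hypothetical way to collect these hyperparameters in one place (names and structure are my own, not from the actual training scripts):

```python
dnn_config = {
    "input_context": (7, 1, 7),        # 15 MFCC frames, mean/variance normalized
    "num_hidden_layers": 5,            # swept over 1-8 in the experiments
    "hidden_units": 1024,              # swept over 512, 1024, 2048
    "activation": "relu",
    "init": "supervised_layerwise_pretraining",
    "minibatch_size": 200,
    "learning_rate": 1e-4,             # with Newbob decay scheduling
    "weight_decay": 1e-3,
    "momentum": 0.99,
    "max_norm": 1.0,
    "dropout": {"input": 0.5, "hidden": 0.02},
    "min_epochs": 24,                  # fine-tuning
}
```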

Slide 23

2. CD-DNN-HMM Results
‣ Adding hidden layers helps up to 5 hidden layers
‣ 5-hidden-layer networks (512 and 1024 hidden nodes) achieve 93.31% word accuracy

Slide 24

2. CD-DNN-HMM Results
‣ Initialization: hidden layers taken from an English DNN
‣ Recognition performance degraded for all 5-hidden-layer networks
‣ More investigation is needed

Slide 25

Agenda
‣ ASR Architecture
‣ Motivation and Khmer Dataset
‣ Data Preprocessing
‣ Acoustic Modeling and Results
‣ Conclusion

Slide 26

Conclusion
‣ A pronunciation lexicon and phonetic question set for Khmer were built from scratch
‣ A GMM-HMM baseline was established
‣ The first DNN-HMM training recipe for Khmer ASR
‣ Future work
  – Use unsupervised pre-training for transfer learning
  – Investigate other types of DNNs, e.g., RNNs and CNNs
  – Apply DNNs to continuous speech recognition for Khmer

Slide 27

Thank You! Questions?