3rd Place Solution of AutoSpeech 2019

Kon
November 17, 2019


  1. 3rd Place Solution of AutoSpeech 2019
    AutoDL workshop @NeurIPS 2019
    Dec 14, 2019
    Team Kon
    https://github.com/Y-oHr-N/autospeech19
    Copyright (C) 2019 NS Solutions Corporation, All Rights Reserved.


  2. Dataset
    Raw speech data
    ● different speech classification domains
    ○ speaker identification
    ○ emotion classification
    ○ accent recognition
    ○ music genre classification
    ○ language identification
    ● single channel
    ● 16 kHz sampling rate
    Metadata
    ● number of classes (greater than 2 and less than 100)
    ● number of training instances (from hundreds to thousands)
    ● time budget (1800 seconds for every dataset)


  3. Evaluation
    Metric
    ● area under the learning curve (ALC): the normalized score 2 * AUROC - 1 (y-axis) plotted against transformed time (x-axis)
    How to get a high ALC
    ● speed up feature extraction
    ● converge the loss quickly
    ● keep a high AUROC
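    The metric above can be sketched numerically. A minimal illustration, assuming checkpoints recorded as (transformed time, AUROC) pairs and simple trapezoidal integration (the challenge's official scorer uses a step-function variant, so this is only an approximation):

```python
import numpy as np

def alc(transformed_times, aurocs):
    """Approximate area under the learning curve: integrate the
    normalized score 2 * AUROC - 1 over transformed time in [0, 1]."""
    t = np.asarray(transformed_times, dtype=float)
    s = 2.0 * np.asarray(aurocs, dtype=float) - 1.0
    # trapezoidal rule over consecutive checkpoints
    return float(np.sum((t[1:] - t[:-1]) * (s[1:] + s[:-1]) / 2.0))

# A model that reaches a high AUROC early gets a higher ALC,
# which is why fast feature extraction and quick convergence matter.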


  4. Preprocessing
    Pipeline: raw speech data → repeat the clip if it is shorter than 5 seconds → crop the first 5 seconds
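    The cropping step above can be sketched in a few lines. A minimal sketch, assuming 16 kHz audio as stated on slide 2 (the function name `preprocess` is illustrative, not from the repository):

```python
import numpy as np

SAMPLE_RATE = 16_000           # 16 kHz sampling rate (slide 2)
TARGET_LEN = 5 * SAMPLE_RATE   # first 5 seconds = 80,000 samples

def preprocess(waveform: np.ndarray) -> np.ndarray:
    """Repeat a short clip until it covers 5 seconds, then crop."""
    if len(waveform) < TARGET_LEN:
        reps = int(np.ceil(TARGET_LEN / len(waveform)))
        waveform = np.tile(waveform, reps)
    return waveform[:TARGET_LEN]
```

    Every clip then has a fixed length, so all spectrograms share the same shape and can be batched directly.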


  5. Feature extraction
    Pipeline: preprocessed speech data → compute a log mel spectrogram → standardize
    ● parameters: n_dft=1024, n_hop=512, n_mels=64, fmin=20, fmax=8000
    ● use kapre [Choi+, '17] for fast on-GPU transformation
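    The team uses kapre to run this transform on the GPU inside the Keras graph. To keep this sketch free of a TensorFlow dependency, here is a NumPy-only version with the same parameters; the function names and the mean/std standardization are illustrative, not the repository's exact code:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=1024, n_mels=64, fmin=20.0, fmax=8000.0):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fb[i, c:r] = (r - np.arange(c, r)) / (r - c)
    return fb

def log_mel_spectrogram(y, sr=16000, n_fft=1024, hop=512, n_mels=64):
    """Windowed power spectrogram -> mel bands -> dB -> standardize."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    logmel = 10.0 * np.log10(np.maximum(mel, 1e-10))
    return (logmel - logmel.mean()) / (logmel.std() + 1e-8)
```

    On a 5-second 16 kHz clip this yields roughly 155 frames of 64 mel bands (kapre's padding can change the exact frame count).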


  6. Data augmentation
    ● random crop
    ● mixup [Zhang+, '17]
    ● cutout [DeVries & Taylor, '17]
    [Figure: examples of each augmentation on a 64 px × 157 px log mel spectrogram]
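    Mixup and cutout can both be sketched in a few lines of NumPy; the parameter values below (mixup alpha, cutout patch size) are illustrative defaults, not the team's tuned settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, y1, x2, y2, alpha=0.2):
    # Convexly blend two examples and their one-hot labels (Zhang+, '17).
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def cutout(spec, size=16):
    # Zero out a random square patch of the spectrogram
    # (DeVries & Taylor, '17).
    spec = spec.copy()
    h, w = spec.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    spec[top:top + size, left:left + size] = 0.0
    return spec
```

    Both operate on the standardized spectrogram, so they compose freely with the random crop.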


  7. Network architecture
    Use the same architecture as the sample code:
    ● Input → Conv block × 6 → Flatten → Dense-64 → ReLU → Dropout-0.5 → Dense-n → Softmax
    ● Conv block: Conv2D-64 → ReLU → BN → MP2D


  8. Training
    Train with the following settings
    ● initialization: parameters from a model pretrained on data01
    ● optimizer: SGD
    ○ lr: 0.01
    ○ momentum: 0.9
    ○ decay: 1e-06
    ● batch size: 32
    Compute the validation score every 5 epochs and resume training if it is not the highest so far
    ● the ratio of training to validation data is 9:1
    Stop training when the remaining time is less than 0.125 * time_budget
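    The schedule above can be sketched as a budget-aware loop. A minimal sketch, where `train_epoch`, `validate`, and `snapshot` are assumed caller-supplied callables standing in for the Keras training step, validation scoring, and best-model checkpointing; the actual repository code differs:

```python
import time

def train_loop(train_epoch, validate, snapshot, time_budget, reserve=0.125):
    """Train in 5-epoch chunks, snapshot whenever the validation score
    is the highest so far, and stop once less than reserve * time_budget
    of wall-clock time remains."""
    start = time.monotonic()
    best = float("-inf")
    while time.monotonic() - start < (1.0 - reserve) * time_budget:
        for _ in range(5):
            train_epoch()
        score = validate()
        if score > best:      # highest so far -> keep this model
            best = score
            snapshot()
    return best
```

    Reserving 12.5% of the budget leaves time for final inference, which matters because the ALC metric penalizes runs that exhaust the clock before submitting predictions.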


  9. Inference
    Make 10 inferences for each speech clip, each on a random crop, and arithmetically average the results
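    This test-time augmentation can be sketched as follows; `model_fn` and `predict_tta` are illustrative names, with `model_fn` standing in for the trained network's predict call:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_tta(model_fn, spec, crop_width=64, n_crops=10):
    """Average predictions over n_crops random time-crops of a
    (n_mels, time) spectrogram."""
    h, w = spec.shape
    preds = []
    for _ in range(n_crops):
        left = rng.integers(0, w - crop_width + 1)
        preds.append(model_fn(spec[:, left:left + crop_width]))
    return np.mean(preds, axis=0)
```

    Averaging over crops matches the random-crop augmentation used at training time and smooths out crop-dependent variance in the predictions.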


  10. Results of the feedback phase
    Data 11 12 13 14 15
    Domain speaker emotion accent music language
    Rank 2 2 1 3 3
    ALC 0.8997 0.6013 0.8420 0.5620 0.8703
    2 * AUROC - 1 0.9916 0.7154 0.9775 0.5972 0.9980
    [Figure: learning curve for each dataset]


  11. Results of the final phase
    Data 21 22 23 24 25
    Domain speaker emotion accent music language
    Rank 3 4 1 4 4
    ALC 0.9477 0.7597 0.9132 0.6588 0.8572
    2 * AUROC - 1 1.0000 0.8273 0.9776 0.7166 0.9835
    [Figure: learning curve for each dataset]


  12. Conclusion
    Successful work
    ● feature extraction using kapre
    ● pretrained model
    ● data augmentation
    Unsuccessful work
    ● complex network architectures such as EfficientNet [Tan & Le, '19]
    Future work
    ● automated data augmentation


  13. References
    Choi, K., Joo, D., and Kim, J.,
    "Kapre: On-GPU Audio Preprocessing Layers for a Quick Implementation of Deep Neural Network Models with Keras."
    In arXiv, 2017.
    DeVries, T., and Taylor, G. W.,
    "Improved Regularization of Convolutional Neural Networks with Cutout."
    In arXiv, 2017.
    Tan, M., and Le, Q. V.,
    "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks."
    In Proceedings of ICML, pp. 6105-6114, 2019.
    Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D.,
    "mixup: Beyond Empirical Risk Minimization."
    In Proceedings of ICLR, 2018.


  14. NS Solutions and NS logo are registered trademarks of NS Solutions Corporation