3rd Place Solution of AutoSpeech 2019

Kon
November 17, 2019


  1. 3rd Place Solution of AutoSpeech 2019
    AutoDL workshop @NeurIPS 2019
    Dec 14, 2019
    Team Kon
    https://github.com/Y-oHr-N/autospeech19
    Copyright (C) 2019 NS Solutions Corporation, All Rights Reserved.


  2. Dataset
    Raw speech data
    ● different speech classification domains
    ○ speaker identification
    ○ emotion classification
    ○ accent recognition
    ○ music genre classification
    ○ language identification
    ● single channel
    ● 16 kHz sampling rate
    Metadata
    ● number of classes (greater than 2 and less than 100)
    ● number of training instances (from hundreds to thousands)
    ● time budget (1800 seconds for every dataset)


  3. Evaluation
    Metric
    ● area under the learning curve (ALC): the normalized score 2 * AUROC - 1 (y-axis) plotted against transformed time (x-axis)
    How to get a high ALC
    ● speed up feature extraction
    ● converge the loss quickly
    ● keep a high AUROC
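    The metric above can be sketched numerically. A minimal illustration, assuming checkpoints recorded as (transformed time, AUROC) pairs and simple trapezoidal integration (the challenge's official scorer uses a step-function variant, so this is only an approximation):

```python
import numpy as np

def alc(transformed_times, aurocs):
    """Approximate area under the learning curve: integrate the
    normalized score 2 * AUROC - 1 over transformed time in [0, 1]."""
    t = np.asarray(transformed_times, dtype=float)
    s = 2.0 * np.asarray(aurocs, dtype=float) - 1.0
    # trapezoidal rule over consecutive checkpoints
    return float(np.sum((t[1:] - t[:-1]) * (s[1:] + s[:-1]) / 2.0))

# A model that reaches a high AUROC early gets a higher ALC,
# which is why fast feature extraction and quick convergence matter.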


  4. Preprocessing
    Pipeline: raw speech data → repeat the clip if it is shorter than 5 seconds → crop the first 5 seconds
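    The cropping step above can be sketched in a few lines. A minimal sketch, assuming 16 kHz audio as stated on slide 2 (the function name `preprocess` is illustrative, not from the repository):

```python
import numpy as np

SAMPLE_RATE = 16_000           # 16 kHz sampling rate (slide 2)
TARGET_LEN = 5 * SAMPLE_RATE   # first 5 seconds = 80,000 samples

def preprocess(waveform: np.ndarray) -> np.ndarray:
    """Repeat a short clip until it covers 5 seconds, then crop."""
    if len(waveform) < TARGET_LEN:
        reps = int(np.ceil(TARGET_LEN / len(waveform)))
        waveform = np.tile(waveform, reps)
    return waveform[:TARGET_LEN]
```

    Every clip then has a fixed length, so all spectrograms share the same shape and can be batched directly.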


  5. Feature extraction
    Pipeline: preprocessed speech data → compute a log mel spectrogram → standardize
    ● parameters: n_dft=1024, n_hop=512, n_mels=64, fmin=20, fmax=8000
    ● use kapre [Choi+, '17] for fast on-GPU transformation
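    The team uses kapre to run this transform on the GPU inside the Keras graph. To keep this sketch free of a TensorFlow dependency, here is a NumPy-only version with the same parameters; the function names and the mean/std standardization are illustrative, not the repository's exact code:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=1024, n_mels=64, fmin=20.0, fmax=8000.0):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fb[i, c:r] = (r - np.arange(c, r)) / (r - c)
    return fb

def log_mel_spectrogram(y, sr=16000, n_fft=1024, hop=512, n_mels=64):
    """Windowed power spectrogram -> mel bands -> dB -> standardize."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    logmel = 10.0 * np.log10(np.maximum(mel, 1e-10))
    return (logmel - logmel.mean()) / (logmel.std() + 1e-8)
```

    On a 5-second 16 kHz clip this yields roughly 155 frames of 64 mel bands (kapre's padding can change the exact frame count).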


  6. Data augmentation
    ● random crop
    ● mixup [Zhang+, '17]
    ● cutout [DeVries & Taylor, '17]
    [Figure: examples of each augmentation on a 64 px × 157 px log mel spectrogram]
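    Mixup and cutout can both be sketched in a few lines of NumPy; the parameter values below (mixup alpha, cutout patch size) are illustrative defaults, not the team's tuned settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, y1, x2, y2, alpha=0.2):
    # Convexly blend two examples and their one-hot labels (Zhang+, '17).
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def cutout(spec, size=16):
    # Zero out a random square patch of the spectrogram
    # (DeVries & Taylor, '17).
    spec = spec.copy()
    h, w = spec.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    spec[top:top + size, left:left + size] = 0.0
    return spec
```

    Both operate on the standardized spectrogram, so they compose freely with the random crop.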


  7. Network architecture
    Use the same architecture as the sample code:
    ● Input → Conv block × 6 → Flatten → Dense-64 → ReLU → Dropout-0.5 → Dense-n → Softmax
    ● Conv block: Conv2D-64 → ReLU → BN → MP2D


  8. Training
    Train with the following settings
    ● initialization: parameters from a model pretrained on data01
    ● optimizer: SGD
    ○ lr: 0.01
    ○ momentum: 0.9
    ○ decay: 1e-06
    ● batch size: 32
    Compute the validation score every 5 epochs and resume training if it is not the highest so far
    ● the ratio of training to validation data is 9:1
    Stop training when the remaining time is less than 0.125 * time_budget
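    The schedule above can be sketched as a budget-aware loop. A minimal sketch, where `train_epoch`, `validate`, and `snapshot` are assumed caller-supplied callables standing in for the Keras training step, validation scoring, and best-model checkpointing; the actual repository code differs:

```python
import time

def train_loop(train_epoch, validate, snapshot, time_budget, reserve=0.125):
    """Train in 5-epoch chunks, snapshot whenever the validation score
    is the highest so far, and stop once less than reserve * time_budget
    of wall-clock time remains."""
    start = time.monotonic()
    best = float("-inf")
    while time.monotonic() - start < (1.0 - reserve) * time_budget:
        for _ in range(5):
            train_epoch()
        score = validate()
        if score > best:      # highest so far -> keep this model
            best = score
            snapshot()
    return best
```

    Reserving 12.5% of the budget leaves time for final inference, which matters because the ALC metric penalizes runs that exhaust the clock before submitting predictions.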


  9. Inference
    Make 10 inferences for each speech clip, each on a random crop, and arithmetically average the results
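    This test-time augmentation can be sketched as follows; `model_fn` and `predict_tta` are illustrative names, with `model_fn` standing in for the trained network's predict call:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_tta(model_fn, spec, crop_width=64, n_crops=10):
    """Average predictions over n_crops random time-crops of a
    (n_mels, time) spectrogram."""
    h, w = spec.shape
    preds = []
    for _ in range(n_crops):
        left = rng.integers(0, w - crop_width + 1)
        preds.append(model_fn(spec[:, left:left + crop_width]))
    return np.mean(preds, axis=0)
```

    Averaging over crops matches the random-crop augmentation used at training time and smooths out crop-dependent variance in the predictions.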


  10. Results of the feedback phase
    Data 11 12 13 14 15
    Domain speaker emotion accent music language
    Rank 2 2 1 3 3
    ALC 0.8997 0.6013 0.8420 0.5620 0.8703
    2 * AUROC - 1 0.9916 0.7154 0.9775 0.5972 0.9980
    [Figure: learning curve for each dataset]


  11. Results of the final phase
    Data 21 22 23 24 25
    Domain speaker emotion accent music language
    Rank 3 4 1 4 4
    ALC 0.9477 0.7597 0.9132 0.6588 0.8572
    2 * AUROC - 1 1.0000 0.8273 0.9776 0.7166 0.9835
    [Figure: learning curve for each dataset]


  12. Conclusion
    Successful work
    ● feature extraction using kapre
    ● pretrained model
    ● data augmentation
    Unsuccessful work
    ● complex network architectures such as EfficientNet [Tan & Le, '19]
    Future work
    ● automated data augmentation


  13. References
    Choi, K., Joo, D., and Kim, J.,
    "Kapre: On-GPU Audio Preprocessing Layers for a Quick Implementation of Deep Neural Network Models with Keras."
    In arXiv, 2017.
    DeVries, T., and Taylor, G. W.,
    "Improved Regularization of Convolutional Neural Networks with Cutout."
    In arXiv, 2017.
    Tan, M., and Le, Q. V.,
    "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks."
    In Proceedings of ICML, pp. 6105-6114, 2019.
    Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D.,
    "mixup: Beyond Empirical Risk Minimization."
    In Proceedings of ICLR, 2018.


  14. NS Solutions and NS logo are registered trademarks of NS Solutions Corporation