7th Place Solution for Freesound Audio Tagging 2019

uratatsu
April 07, 2020

Tokyo BISH Bash #01
(https://tokyo-bish-bash.connpass.com/event/171564/)

Freesound Audio Tagging 2019 in Kaggle
(https://www.kaggle.com/c/freesound-audio-tagging-2019)
7th place solution

Transcript

  1. 7th Place Solution for Freesound Audio Tagging 2019

    Tokyo BISH Bash #01, 2/26/2020
    Tatsuya Uratani, Hitachi, Ltd., Social Infrastructure Information Systems Division
  2. 7th Place Solution for Freesound Audio Tagging 2019

    © Hitachi, Ltd. 2020. All rights reserved.
  3. Hello! I am Tatsuya Uratani

    You can find me at @uratatsu on Kaggle.
  4. Uratatsu

    Major: Evolutionary Psychology, Evolutionary Ecology
    Occupation: Data Scientist
  5. Team Shirogane

    @kaerururu
    • NLP ML engineer
    • Kaggle Master

    @Hidehisa Arai
    • The University of Tokyo
    • Kaggle Master
  6. Section 1: Competition Overview

    Welcome to the Kaggle competition.
  7. Freesound Audio Tagging 2019

    Task
    • 80-class multi-label classification of audio data
    • Clean data and noisy data from different sources
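In a multi-label task like this one, each clip can carry several of the 80 classes at once, so the target is a multi-hot vector rather than a single class index. The following is a minimal sketch, assuming labels arrive as a comma-separated string; the function name `multi_hot` and the three-class list are illustrative and not taken from the slides.

```python
import numpy as np


def multi_hot(label_str, classes):
    """Encode a comma-separated label string as a multi-hot float vector."""
    index = {name: i for i, name in enumerate(classes)}
    target = np.zeros(len(classes), dtype=np.float32)
    for name in label_str.split(","):
        target[index[name]] = 1.0
    return target


# Illustrative subset of class names; the competition uses 80 classes.
classes = ["Accordion", "Meow", "Sigh"]
print(multi_hot("Meow,Sigh", classes))
```

A vector of this form pairs naturally with a per-class sigmoid output and a binary cross-entropy loss, although the slides do not state which loss the team used.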
  8. Competition Rules

    • GPU Kernel with < 1 hour run-time (inference only)
    • External data and pre-trained models are not allowed
    • 2-stage kernel competition

    Light-model ensemble or heavy single model; TTA or ensemble; faster preprocessing
  9. Competition Points

    • Becomes image classification once the audio is converted to spectrograms
    • Many preprocessing parameters
    • Variable-length inputs
    • Multi-class, multi-label
    • Clean (4970) and Noisy (19815) datasets
  10. Basic Solution

    [Pipeline diagram] Various Length input → Convert to Melspectrogram → 128 * 128 Random Crop → Convolutional Neural Network → 80 classes
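The random-crop step of the basic solution can be sketched as follows. This is a minimal NumPy sketch, assuming the spectrogram is an array of shape (n_mels, n_frames) with 128 mel bins; the function name `random_crop` and the choice to pad short clips with the minimum value are assumptions, not details from the slides.

```python
import numpy as np


def random_crop(spec, width=128, rng=None):
    """Return a random (n_mels, width) window from a (n_mels, n_frames) array."""
    rng = rng or np.random.default_rng()
    n_mels, n_frames = spec.shape
    if n_frames < width:
        # Short clip: pad on the right with the quietest value (an assumption;
        # tiling the clip is another common choice).
        pad = width - n_frames
        spec = np.pad(spec, ((0, 0), (0, pad)), mode="constant",
                      constant_values=float(spec.min()))
        n_frames = width
    # Uniformly sample the first frame of the window.
    start = rng.integers(0, n_frames - width + 1)
    return spec[:, start:start + width]


spec = np.random.default_rng(0).normal(size=(128, 300))
print(random_crop(spec).shape)
```

Because every crop has the same shape, clips of different lengths can be batched together for a standard image CNN.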
  11. Sound Sample

    Accordion, Church_bell, Female_singing, Sigh, Motorcycle, Traffic_noise_and_roadway_noise, Meow, Bathtub_(filling_or_washing)
  12. Section 2: 7th Place Solution ~Shirogane Solution~
  13. Shirogane Solution

    [Pipeline diagram; labels as extracted]
    • Data: Noisy, Curated
    • Features: Log Mel Spectrogram, 128 * 128 Strength Crop
    • Data Augmentation (first list): pitch
    • Data Augmentation (second list): fade, treble & bass, pitch, equalize, reverb
    • Augmentation in batch (first list): mix up, Random Resized Crop (Inception only), HorizontalFlip
    • Augmentation in batch (second list): mix up, Random Erasing, Coarse Dropout, Random Resized Crop (Inception only), HorizontalFlip
    • Pretrain: Inception 1ch, Inception 3ch, CustomCNN weight
    • Models: Inception 1ch (LB 0.720↑), CustomCNN (LB 0.720↑), Inception 3ch (LB 0.724↑)
    • Predict: 20 TTA per model, then Blending (CV: 0.01 up)
    • Annotated gain: CV: 0.01 up, LB: 0.005 up
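Mix up appears among the in-batch augmentations. As a rough illustration of the general technique (not the team's exact code), the sketch below blends each sample in a batch with a randomly chosen partner, using a weight drawn from a Beta distribution. The function name `mixup` and the default `alpha=1.0` are assumptions.

```python
import numpy as np


def mixup(x, y, alpha=1.0, rng=None):
    """Blend a batch with a shuffled copy of itself.

    x: (batch, ...) inputs, y: (batch, n_classes) multi-hot targets.
    Returns the mixed inputs, the mixed targets, and the mixing weight.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)      # mixing weight in [0, 1]
    perm = rng.permutation(len(x))    # partner index for each sample
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    y_mixed = lam * y + (1.0 - lam) * y[perm]
    return x_mixed, y_mixed, lam


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1, 128, 128))
y = rng.integers(0, 2, size=(4, 80)).astype(np.float32)
x_mixed, y_mixed, lam = mixup(x, y, rng=rng)
print(x_mixed.shape, y_mixed.shape)
```

Mixing the targets with the same weight as the inputs yields soft labels, which fits multi-label audio where overlapping sounds are plausible.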
  14. Point 1: Data Augmentation with SoX

    • Fade
    • Pitch
    • Reverb
    • Treble & Bass
    • Equalize
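One way to apply these effects is to call the `sox` command-line tool once per augmented file. The sketch below only builds the argument lists; `fade`, `pitch`, `reverb`, `treble`, `bass`, and `equalizer` are SoX effect names, but every parameter value here is a made-up placeholder, since the slides do not give the settings the team used. The function name `sox_command` is also hypothetical.

```python
# Effect arguments for the sox CLI. All numeric values are illustrative
# placeholders, not the settings used in the competition.
EFFECTS = {
    "fade": ["fade", "0.5"],                       # 0.5 s fade-in
    "pitch": ["pitch", "200"],                     # shift up by 200 cents
    "reverb": ["reverb", "50"],                    # reverberance of 50%
    "treble_bass": ["treble", "6", "bass", "-3"],  # +6 dB treble, -3 dB bass
    "equalize": ["equalizer", "1000", "1q", "6"],  # +6 dB peak at 1 kHz, Q = 1
}


def sox_command(src, dst, effect):
    """Build the argument list for `sox src dst <effect ...>`."""
    return ["sox", src, dst, *EFFECTS[effect]]


print(sox_command("in.wav", "out_pitch.wav", "pitch"))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where SoX is installed.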
  15. Point 2: Strength Adaptive Crop

    • Crop according to the distribution of signal strength (dB)
    • Why did we notice this?
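The slides do not spell out the cropping rule, so the following is only one possible reading: sample the crop position with probability proportional to how strong (in dB) each candidate window is, so that loud regions are chosen more often than near-silent ones. The function name `strength_adaptive_crop` and the weighting scheme are assumptions.

```python
import numpy as np


def strength_adaptive_crop(spec_db, width=128, rng=None):
    """Sample a (n_mels, width) window, favouring windows with higher mean dB."""
    rng = rng or np.random.default_rng()
    n_mels, n_frames = spec_db.shape
    if n_frames <= width:
        return spec_db  # nothing to choose from; padding is left to the caller
    # Mean strength of every candidate window, computed with a cumulative sum.
    frame_strength = spec_db.mean(axis=0)
    csum = np.concatenate([[0.0], np.cumsum(frame_strength)])
    window_strength = (csum[width:] - csum[:-width]) / width
    # Shift so the weakest window gets a tiny positive weight, then normalise.
    weights = window_strength - window_strength.min() + 1e-6
    probs = weights / weights.sum()
    start = rng.choice(len(probs), p=probs)
    return spec_db[:, start:start + width]


# Toy clip: silence at -80 dB with one loud region at 0 dB.
spec = np.full((128, 400), -80.0)
spec[:, 200:328] = 0.0
print(strength_adaptive_crop(spec).shape)
```

Compared with a uniform random crop, this weighting makes it unlikely that a training crop contains only silence.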
  16. Point 3: Custom Convolution Network

    • Custom pooling layer

        # Assumes: import torch.nn.functional as F
        def forward(self, x):
            # Add a channel dimension: (batch, d1, d2) -> (batch, 1, d1, d2)
            x = x.view(x.size(0), 1, x.size(1), x.size(2))
            x = self.conv1(x, pool_size=(1, 1), pool_type="both")
            x = self.conv2(x, pool_size=(4, 1), pool_type="both")
            x = self.conv3(x, pool_size=(1, 3), pool_type="both")
            x = self.conv4(x, pool_size=(4, 1), pool_type="both")
            x = self.conv5(x, pool_size=(1, 3), pool_type="both")
        ………………………………………………………………………………
            elif pool_type == "both":
                # "both" sums the max-pooled and average-pooled feature maps
                x1 = F.max_pool2d(x, kernel_size=pool_size)
                x2 = F.avg_pool2d(x, kernel_size=pool_size)
                x = x1 + x2
  17. Point 4: Inception with Random Resized Crop

    • We used the Inception-V3 model
    • Random Resized Crop with its default parameters
    • Low score on a single fold → the score jumped with 5 folds

        torchvision.transforms.RandomResizedCrop(
            size, scale=(0.08, 1.0), ratio=(0.75, 1.3333333333333333), interpolation=2)
  18. Section 3: Other Teams' Solutions ~Other Top Solutions~
  19. Other Teams' Solutions

    • Melspectrogram Layer (2nd place): they used it to search the log-mel hyperparameters end to end
    • SpecMix (8th place) [figure labels: SpecAugment, SpecMix]
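For readers unfamiliar with SpecAugment, its masking step can be sketched as below: one random band of mel bins and one random band of time frames are overwritten with a constant. This is a simplified illustration of SpecAugment-style masking only; it omits time warping and is not the 8th-place team's SpecMix. The function name `mask_spectrogram`, the mask sizes, and the use of the mean as the fill value are assumptions.

```python
import numpy as np


def mask_spectrogram(spec, max_freq_mask=16, max_time_mask=16, rng=None):
    """Apply one frequency mask and one time mask to a (n_mels, n_frames) array."""
    rng = rng or np.random.default_rng()
    out = spec.copy()          # leave the input untouched
    n_mels, n_frames = out.shape
    fill = out.mean()
    # Frequency mask: f consecutive mel bins starting at f0.
    f = rng.integers(0, max_freq_mask + 1)
    f0 = rng.integers(0, n_mels - f + 1)
    out[f0:f0 + f, :] = fill
    # Time mask: t consecutive frames starting at t0.
    t = rng.integers(0, max_time_mask + 1)
    t0 = rng.integers(0, n_frames - t + 1)
    out[:, t0:t0 + t] = fill
    return out


spec = np.random.default_rng(0).normal(size=(128, 128))
print(mask_spectrogram(spec).shape)
```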
  20. Other Teams' Solutions

    • Multitask learning with noisy labels (4th place)
    • Semi-supervised learning (SSL) with noisy data (4th place)
    • ResNet34 + EnvNet-v2 (4th place)
  21. Section 4: How We Work ~Unsophisticated Work~
  22. Resources

    Tools
    • Discussion: Slack
    • To do: Trello
    • Code: Kaggle Kernel

    Our Experience
    • First pure image competition for all members
    • All members were Kaggle Experts

    Machine Resources
    • GPU: P100 * 7 → P100 * 4 (Kaggle Kernel)
  23. Our Policy

    Try every method in Discussion / Public Kernels
    We went through every discussion thread and high-scoring kernel, and tried most of the methods.

    Number of experiments first (not a job, not research)
    Understanding the methods is also important, but it is more important to keep the GPUs running throughout the competition.
  24. Unsophisticated Work

    View and listen to a lot of data
    Some sounds are very dirty. Viewing and listening to the data is the most important step in forming hypotheses.

    Try many well-known architectures
    We tried many well-known architectures (e.g. ResNet, DenseNet, WideResNet, ResNeXt, SE-ResNeXt, ...).
  25. Kaggle is a trademark of Google LLC. Freesound is a trademark of Beijing Xiaoniao Tingting Technology Co., LTD.

    SoX: http://sox.sourceforge.net/