Kaggle Freesound Audio Tagging 2019 Model Pipeline

Slide 1

Freesound Audio Tagging 2019

Resources: GeForce GTX 1080 Ti x 2

Feature Extraction
- Inputs: train curated (~5k audios), train noisy (~20k audios), test
- Remove silent audios
- Trim silent parts: librosa.effects.trim
- Feature 1: LogMel Spectrogram (librosa.feature.melspectrogram, librosa.core.power_to_db)
  - Sampling Rate: 44.1 kHz
  - FFT Window Size: 80 ms
  - Hop Length: 10 ms
  - Mel Bands: 64
  - Shape: 64 x (depends on each audio length)
- Frequency-wise 25 statistical features (mean, std, mean grad, ...)
  - stat 1, stat 2, ..., stat 25 for each of the 64 mel bands: 64 x 25 features
- Normalized with constant range
- Feature 2: Clustered Features (distances to each cluster center)
  - Flatten => Standardized
  - MiniBatchKMeans, n_clusters: 200

Model
- Backbones: MobileNet V2, ResNet 50, DenseNet 121
- Input 1: BN, Conv2D point-wise (10 filters), Conv2D point-wise (3 filters)
- Input 2: BN
- Variant: w/o clustered features
- Head blocks: GAP 2D, concat, Dense (1536) + BN, Dense (384) + BN, Dense (80)
- Losses shown on the slide: BCE, SoftMax, BCE, BCE, BCE, BCE

Training
- 5 folds using iterative-stratification
- Augmentation shift range: width = 0.6, height = 12/64
- Random LogMel length extraction: 263 <= Length <= 763, padding with zero
- MixUp: alpha = 0.5
- 3-stage learning with 3 LR schedules (Cyclic, ReduceLROnPlateau x 2)
  1. train w/o Feature 2, train-curated ONLY
  2. train w/ Feature 2, train-curated and ALL train-noisy using MODIFIED BCE (ignoring audios with high BCE)
  3. train w/ Feature 2, train-curated and SELECTED train-noisy (low BCE audios)

Prediction
- TTA (x3): width = 0.2, height = 6/64
- 5 length ensemble: 263, 388, 513, 638, 763
- 5 fold ensemble

Geometric Blending
- 6 models
- Blending coefficients optimized using 5 fold OOF
- Blending coefficients:
  - MobileNet V2: 0.24 / 0.22
  - ResNet 50: 0.15 / 0.07
  - DenseNet 121: 0.12 / 0.20

Results
- Local: 0.876 on train-curated
- Public LB: 0.719 (31st / 880 teams)
- Private LB: 0.72820 (28th / 880 teams)

Copyright 2019 @ Maxwell_110
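
The LogMel settings listed on the slide (44.1 kHz, 80 ms window, 10 ms hop, 64 mel bands) can be turned into librosa arguments roughly as follows. This is a minimal sketch, not the author's code: the helper names `ms_to_samples` and `extract_logmel` are hypothetical, the 80 ms window is assumed to map to `n_fft`, and the trim threshold and dB reference are left at librosa's defaults because the slide does not state them.

```python
SAMPLE_RATE = 44_100  # 44.1 kHz, from the slide
N_MELS = 64           # mel bands, from the slide
WINDOW_MS = 80        # FFT window size, from the slide
HOP_MS = 10           # hop length, from the slide


def ms_to_samples(duration_ms: float, sample_rate: int = SAMPLE_RATE) -> int:
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(sample_rate * duration_ms / 1000))


# Assumption: the slide's "FFT Window Size" is used as n_fft.
N_FFT = ms_to_samples(WINDOW_MS)    # 3528 samples
HOP_LENGTH = ms_to_samples(HOP_MS)  # 441 samples


def extract_logmel(path: str):
    """Hypothetical helper: log-mel spectrogram of shape (64, n_frames)."""
    # Imported here so the constants above can be checked without librosa.
    import librosa

    # Load and resample to 44.1 kHz.
    y, sr = librosa.load(path, sr=SAMPLE_RATE)
    # Trim leading and trailing silence (librosa's default threshold).
    y, _ = librosa.effects.trim(y)
    # Mel-scaled power spectrogram with the slide's window, hop and band count.
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=N_FFT, hop_length=HOP_LENGTH, n_mels=N_MELS
    )
    # Convert power to decibels (default reference).
    return librosa.power_to_db(mel)
```

With a 10 ms hop, each spectrogram column covers 10 ms of audio, so the 263 to 763 frame crops used in training and prediction correspond to roughly 2.6 to 7.6 seconds.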