Slide 1
Freesound Audio Tagging 2019
Remove Silent Audios
train curated : ~ 5k audios
train noisy : ~ 20k audios
test
Trim Silent Parts
librosa.effects.trim
Sampling Rate : 44.1 kHz
FFT Window Size : 80 ms
Hop Length : 10 ms
Mel Bands : 64
librosa.feature.melspectrogram
librosa.core.power_to_db
LogMel Spectrogram
64 x X ( X depends on each audio length )
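The settings above are given in milliseconds, while librosa takes sample counts. The sketch below shows one way the conversion and the three librosa calls might fit together. It assumes the "FFT Window Size" maps to `n_fft`, uses librosa's default trim threshold, and calls `librosa.power_to_db` (the top-level name for `librosa.core.power_to_db`). The helper names `ms_to_samples` and `logmel_from_file` are hypothetical.

```python
SAMPLE_RATE = 44_100  # Hz, from the slide
WINDOW_MS = 80        # FFT window size, from the slide
HOP_MS = 10           # hop length, from the slide
N_MELS = 64           # mel bands, from the slide


def ms_to_samples(ms, sr=SAMPLE_RATE):
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(sr * ms / 1000))


N_FFT = ms_to_samples(WINDOW_MS)    # 3528 samples at 44.1 kHz
HOP_LENGTH = ms_to_samples(HOP_MS)  # 441 samples at 44.1 kHz


def logmel_from_file(path):
    """Load, trim silence, and return a (64, X) log-mel spectrogram.

    Hypothetical helper: trim threshold and dB reference are librosa
    defaults, which the slide does not specify.
    """
    import librosa  # imported here so the constants above work without it

    y, _ = librosa.load(path, sr=SAMPLE_RATE)
    y, _ = librosa.effects.trim(y)  # drop leading/trailing silence
    mel = librosa.feature.melspectrogram(
        y=y, sr=SAMPLE_RATE, n_fft=N_FFT, hop_length=HOP_LENGTH, n_mels=N_MELS
    )
    return librosa.power_to_db(mel)
```

With a 10 ms hop, one second of audio corresponds to about 100 spectrogram frames.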
1
Frequency-wise 25 statistical features ( mean, std, mean grad, ... )
64 mel bands x 25 stats ( stat 1, stat 2, ..., stat 25 ) = 64 x 25 features
Normalized with a constant range
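A minimal sketch of frequency-wise statistics, covering only the three that the slide names (mean, std, mean grad); the remaining 22 are not listed and are not guessed here. "Mean grad" is assumed to be the mean frame-to-frame difference, and "constant range" is assumed to mean min-max scaling with fixed bounds `lo` and `hi`, which are hypothetical parameters.

```python
import numpy as np


def frequency_wise_stats(logmel):
    """Summarize each mel band over time.

    logmel: array of shape (n_mels, n_frames), n_frames >= 2.
    Returns an array of shape (n_mels, 3): mean, std, mean gradient.
    """
    mean = logmel.mean(axis=1)
    std = logmel.std(axis=1)
    # Assumption: "mean grad" is the average difference between adjacent frames.
    mean_grad = np.diff(logmel, axis=1).mean(axis=1)
    return np.stack([mean, std, mean_grad], axis=1)


def normalize_constant_range(x, lo, hi):
    """Scale x into [0, 1] using fixed bounds (assumed meaning of 'constant range')."""
    return (np.clip(x, lo, hi) - lo) / (hi - lo)
```

Using fixed bounds rather than per-clip minima and maxima keeps the scaling identical across train and test clips.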
2
Clustered Features
Distances to each cluster center
Flatten => Standardized
n_clusters : 200
MiniBatchKMeans
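The slide does not spell out exactly what is clustered. The sketch below assumes the per-clip 64 x 25 statistics are flattened and standardized, then clustered with scikit-learn's `MiniBatchKMeans`, and that the distance from each clip to every cluster center becomes the new feature vector. The function name and the `seed` parameter are hypothetical.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import StandardScaler


def fit_cluster_distance_features(stats, n_clusters=200, seed=0):
    """Turn per-clip statistics into distances to k-means cluster centers.

    stats: array of shape (n_clips, n_mels, n_stats).
    Returns (features, scaler, kmeans), where features has shape
    (n_clips, n_clusters).
    """
    flat = stats.reshape(len(stats), -1)          # flatten each clip
    scaler = StandardScaler()
    flat = scaler.fit_transform(flat)             # standardize (assumed order)
    kmeans = MiniBatchKMeans(n_clusters=n_clusters, random_state=seed, n_init=3)
    features = kmeans.fit_transform(flat)         # distances to each center
    return features, scaler, kmeans
```

Keeping the fitted `scaler` and `kmeans` lets the same transform be applied to test clips with `kmeans.transform(scaler.transform(...))`.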
Resources:
Geforce GTX 1080 Ti x 2
Training and Prediction
MobileNet V2
ResNet 50
DenseNet 121
1 BN
Conv2D ( point-wise, 10 filters )
Conv2D ( point-wise, 3 filters )
Dense (1536) + BN
Dense (384) + BN
BCE
SoftMax
BCE
BCE
BCE
BCE
GAP 2D
Dense (80)
- TTA (x3)
width = 0.2
height = 6/64
- 5 length ensemble
263, 388, 513, 638, 763
- 5 fold ensemble
5 folds using iterative-stratification
Augmentation
shift range: width = 0.6, height = 12/64
Random LogMel Length Extraction
263 <= Length <= 763, padding with zero
MixUp: alpha = 0.5
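Two of the augmentations above can be sketched with NumPy alone. The crop draws a target length between 263 and 763 frames (roughly 2.6 s to 7.6 s at a 10 ms hop) and zero-pads shorter clips; padding on the right is an assumption, since the slide does not say where the zeros go. MixUp follows the usual formulation with a Beta(alpha, alpha) mixing weight. Function names are hypothetical.

```python
import numpy as np


def random_length_crop(logmel, rng, min_len=263, max_len=763):
    """Extract a random-length window, zero-padding clips that are too short.

    logmel: array of shape (n_mels, n_frames).
    """
    n_mels, n_frames = logmel.shape
    length = int(rng.integers(min_len, max_len + 1))
    if n_frames >= length:
        start = int(rng.integers(0, n_frames - length + 1))
        return logmel[:, start:start + length]
    # Assumption: pad with zeros on the right.
    out = np.zeros((n_mels, length), dtype=logmel.dtype)
    out[:, :n_frames] = logmel
    return out


def mixup(x_a, y_a, x_b, y_b, rng, alpha=0.5):
    """Blend two examples and their labels with a Beta(alpha, alpha) weight."""
    lam = rng.beta(alpha, alpha)
    return lam * x_a + (1 - lam) * x_b, lam * y_a + (1 - lam) * y_b
```

With alpha = 0.5 the Beta distribution is U-shaped, so most mixed examples stay close to one of the two originals.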
3-stage learning with 3 LR schedules (Cyclic, ReduceLROnPlateau x 2)
1. train w/o Feature 2, train-curated ONLY
2. train w/ Feature 2, train-curated and ALL train-noisy, using MODIFIED BCE (ignoring audios with high BCE)
3. train w/ Feature 2, train-curated and SELECTED train-noisy (low BCE audios)
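The idea behind stages 2 and 3 can be illustrated in NumPy: compute the BCE per audio clip, leave high-loss clips out of the batch loss (stage 2), and later keep only low-loss noisy clips (stage 3). This is a sketch, not the author's loss; the cutoff `threshold` is a hypothetical parameter, since the slide does not say how "high" and "low" BCE were defined.

```python
import numpy as np


def per_sample_bce(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy averaged over classes, one value per clip."""
    p = np.clip(y_pred, eps, 1 - eps)
    losses = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return losses.mean(axis=1)


def modified_bce(y_true, y_pred, threshold):
    """Mean BCE over clips whose loss does not exceed the (assumed) threshold."""
    losses = per_sample_bce(y_true, y_pred)
    keep = losses <= threshold
    if not keep.any():
        return 0.0
    return float(losses[keep].mean())


def select_low_bce(y_true, y_pred, threshold):
    """Indices of clips with low BCE, e.g. to pick noisy clips for stage 3."""
    return np.flatnonzero(per_sample_bce(y_true, y_pred) <= threshold)
```

Clips with a very high loss under a model trained on curated data are the ones most likely to carry wrong labels, which is the motivation for ignoring them.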
Training
Public : 31st / 880 teams
Local 0.876 on train-curated
Public LB 0.719
Private : 28th / 880 teams
Private LB 0.72820
Feature Extraction
Copyright 2019 @ Maxwell_110
concat
2
BN
w/o clustered features
2
BN
2
BN
Model Backbones
prediction
Geometric Blending
6 Models
Blending Coefficients
- Optimized using 5 fold OOF
- Blending coefficients
MobileNet V2 : 0.24/0.22
ResNet 50 : 0.15/0.07
DenseNet 121 : 0.12/0.20
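Geometric blending takes a weighted geometric mean of the models' predicted probabilities, which is the same as a weighted arithmetic mean in log space. The sketch below uses the six coefficients listed above, which sum to 1. The slide does not say what the two values per backbone correspond to, so the ordering in `WEIGHTS` is only an assumption, as is the function name.

```python
import numpy as np

# Coefficients from the slide, two per backbone (ordering assumed).
WEIGHTS = np.array([0.24, 0.22,   # MobileNet V2
                    0.15, 0.07,   # ResNet 50
                    0.12, 0.20])  # DenseNet 121


def geometric_blend(preds, weights, eps=1e-7):
    """Weighted geometric mean of model predictions.

    preds: array of shape (n_models, n_samples, n_classes), values in [0, 1].
    weights: array of shape (n_models,); normalized to sum to 1 here.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    log_p = np.log(np.clip(preds, eps, 1.0))       # avoid log(0)
    return np.exp(np.tensordot(w, log_p, axes=1))  # sum over the model axis
```

Compared with an arithmetic mean, the geometric mean pulls the blend toward the lowest prediction, so a class scores high only when all models agree.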