Slide 1

Slide 1 text

Cornell Birdcall Identification
Copyright 2020 @ Maxwell_110

Resources: TITAN RTX, 1080Ti x 2, 2080Ti x 2 *1

Data
• 264 birds: Kaggle, Xeno-Canto, Extended Xeno-Canto *2
• nocall: BirdVox, ff1010, fs2019, ESC50

Feature Extraction
• Load Audio (librosa.load)
  - Resample to 22.05 kHz
• Trim silent parts (librosa.effects.trim)
  - remove only silent start/end parts
• Audio Data (1D)
  - 5 s constant audio length
  - Event aware extraction (1)
  - shape: 5 x 22050
• Log Mel Spectrogram (2D) (librosa.feature.melspectrogram, librosa.core.power_to_db)
  - 5 - 10 s variable audio length
  - 64 nmel, 10 ms hop, 80 ms STFT window
  - Event aware extraction (1)
  - shape: 64 x 500 - 1000

Training / Prediction
• 2D Features (to 2D Models)
  - Augmentation (2): p: 0.5, width / height shift: 0.2 / 0.1, scale: -0.05 / +0.05
  - Random Eraser: p: 0.5, erase num: 1, width: [0, 0.1], height: [0.1, 0.3], fill with -1
  - Standardize [-1, +1]
• 1D Features (to 1D Models)
  - Augmentation: p: 0.5, width shift: 0.2, NoiseInjection
  - Standardize [-1, +1]
  - cut out at random

Models
• Binary Model (call / nocall) (3)
  - 2D Features => ResNet 18 (2D Models)
  - Output: nocall, or call => Multi-Label Model
• Multi-Label Model (264 types) (4)
  - Multi-Task-Learning (MTL) for primary and noisy background labels
  - Backbone: ResNet 18 (2D Models)
  - Primary branch: 2D: GAP / 1D: MAP *4 => 2048 nodes + BN + ReLU => 512 nodes + BN + ReLU => 264 nodes (primary labels)
  - Background branch: 2D: GAP / 1D: MAP *4 => 2048 nodes + BN + ReLU => 512 nodes + BN + ReLU => 264 nodes (background labels)
  - MTL Loss

3-stage training from scratch
1. primary only
  - , = [, ]
  - 2D: 200 epochs + Early Stopping (ES); Adam, CyclicLR 1e-4 ~ e-3
  - 1D: 100 epochs + ES; SGD, CosineAnnealing 1e-1 ~ e-6 => Adam, ReduceLROnPlateau 1e-4
2. + background
  - , = [, ]
  - Adam (5e-5)
  - ReduceLROnPlateau (x 0.25)
3. + Pseudo Labeling
  - , = [, ]
  - Adam (5e-5)
  - ReduceLROnPlateau (x 0.25)
  - Background predictions above 0.15 are added to the primary labels as soft labels in the primary branch. All values are clipped between 0 and 1.

*1 My room became tropical.
*2 Credit to Vopani
*3 Kerneler-kun had become a notebook expert! Check his profile.
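The pseudo-labeling rule in stage 3 (background predictions above 0.15 added to the primary labels, then clipped to [0, 1]) can be sketched as below. This is a minimal illustration, not the author's code: the function name `add_pseudo_labels`, the array layout (samples x classes), and the example values are all assumptions.

```python
import numpy as np

def add_pseudo_labels(primary, background_pred, threshold=0.15):
    """Add confident background predictions to the primary targets as soft labels.

    primary:         (n_samples, n_classes) primary targets in [0, 1]
    background_pred: (n_samples, n_classes) background-branch predictions
    threshold:       0.15 is the value quoted on the slide
    """
    primary = np.asarray(primary, dtype=float)
    background_pred = np.asarray(background_pred, dtype=float)
    # Keep only background predictions strictly above the threshold.
    soft = np.where(background_pred > threshold, background_pred, 0.0)
    # Add them to the primary targets and clip everything to [0, 1].
    return np.clip(primary + soft, 0.0, 1.0)

# Hypothetical example with 4 classes: class 0 is the primary label.
primary = np.array([[1.0, 0.0, 0.0, 0.0]])
background_pred = np.array([[0.30, 0.10, 0.60, 0.05]])
targets = add_pseudo_labels(primary, background_pred)
print(targets)
```

In this example class 0 stays at 1.0 after clipping, class 2 becomes a soft label of 0.6, and the two predictions below the threshold are ignored.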
Model Architecture
• 1D Model: PANNs 1D
• primary branch / background branch
*4 MAP: mean on time axis, max and mean on freq axis

Blending
Public: 36th (0.623), Private: 39th (0.580) *3
• Class-Wise Blending
  - For each bird class
  - Optimize blend coefficients with BCE loss
• Class-Wise threshold optimization
  - For each bird class
  - Maximize macro-F1 (not sample-wise)
• 3 Epoch ensemble for PANNs 1D
• 5 Fold ensemble for ResNet18, PANNs 1D
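The class-wise threshold optimization can be sketched as follows. Because macro-F1 is the mean of the per-class F1 scores, picking the best threshold for each class independently also maximizes macro-F1. The grid search, the grid itself, and all function names here are assumptions for illustration; the slide does not say which optimizer was used.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 score for one class; returns 0.0 when there are no positives at all."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

def optimize_class_thresholds(y_true, y_prob, grid=None):
    """For each class, pick the grid threshold with the highest per-class F1."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)  # hypothetical search grid
    n_classes = y_true.shape[1]
    thresholds = np.empty(n_classes)
    for c in range(n_classes):
        scores = [f1_binary(y_true[:, c], (y_prob[:, c] >= t).astype(int))
                  for t in grid]
        thresholds[c] = grid[int(np.argmax(scores))]
    return thresholds

def macro_f1(y_true, y_prob, thresholds):
    """Mean of per-class F1 after applying one threshold per class."""
    y_pred = (y_prob >= thresholds).astype(int)
    return float(np.mean([f1_binary(y_true[:, c], y_pred[:, c])
                          for c in range(y_true.shape[1])]))

# Hypothetical validation predictions: 4 clips, 2 classes.
y_true = np.array([[1, 0], [1, 0], [0, 1], [0, 0]])
y_prob = np.array([[0.90, 0.20], [0.42, 0.10], [0.33, 0.80], [0.10, 0.62]])

thresholds = optimize_class_thresholds(y_true, y_prob)
baseline = macro_f1(y_true, y_prob, np.full(2, 0.5))
tuned = macro_f1(y_true, y_prob, thresholds)
print(thresholds, baseline, tuned)
```

In practice the thresholds would be fitted on out-of-fold predictions rather than on the data used to report the score, since per-class tuning over 264 classes can overfit easily.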

Slide 2

Slide 2 text

1 Event Aware Extraction

Example: XC341516.mp3 (ebird_code: brespa, 60 sec)

Figure: waveform and logmel of the example, with the brespa call, a bug noise event, and the threshold on the logmel freq-wise max marked.
• Naïve Random Extraction (5 sec): nocall, no event => mislabeling
• logmel freq-wise max aware extraction: noisy event / call => sometimes works, other times fails
• logmel freq-wise max aware extraction with removing lower frequencies

• Using the logmel spectrum, extract audio chunks that contain a logmel frequency-wise max intensity over the threshold.
• Because almost all bird call frequencies range from 1 ~ 8 kHz, get better chunks for training by removing lower frequencies (~ 300 Hz).
• Variable extraction length, 5 - 10 sec, also helps to get a good signal; meanwhile, 2D models generalize well to variable length.
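A minimal sketch of the extraction idea described above, assuming the log-mel spectrogram is already available as a NumPy array of shape (n_mels, n_frames). The function name `event_aware_chunk`, the threshold value, the number of low mel bins dropped (`n_low_bins`), and the random placement of the chunk around the event are all hypothetical choices, not taken from the slides.

```python
import numpy as np

def event_aware_chunk(logmel, chunk_frames, threshold, n_low_bins=0, rng=None):
    """Return (start, end) frame indices of a chunk that contains an event.

    An "event" frame is one whose max over frequency, after dropping the
    lowest n_low_bins mel bins, exceeds the threshold.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_frames = logmel.shape[1]
    if n_frames <= chunk_frames:
        return 0, n_frames
    # Drop the lowest mel bins, then take the max over frequency per frame.
    freq_max = logmel[n_low_bins:].max(axis=0)
    event_frames = np.flatnonzero(freq_max > threshold)
    if event_frames.size == 0:
        # No event found: fall back to naive random extraction.
        start = int(rng.integers(0, n_frames - chunk_frames + 1))
    else:
        center = int(rng.choice(event_frames))
        # Place the chunk at a random offset that still covers the event frame.
        lo = max(0, center - chunk_frames + 1)
        hi = min(center, n_frames - chunk_frames)
        start = int(rng.integers(lo, hi + 1))
    return start, start + chunk_frames

# Synthetic 10 s clip at a 10 ms hop: 64 mel bins x 1000 frames of silence.
logmel = np.full((64, 1000), -80.0)
logmel[30:40, 600:620] = -10.0   # bird call in the mid frequencies
logmel[0:4, 100:120] = -5.0      # low-frequency "bug" noise

start, end = event_aware_chunk(
    logmel, chunk_frames=500, threshold=-20.0, n_low_bins=4,
    rng=np.random.default_rng(0),
)
print(start, end)
```

With `n_low_bins=4` the low-frequency noise is ignored, so the returned 5 s chunk always overlaps the call at frames 600 to 619. Setting `n_low_bins=0` would let the noise event be picked as well. For variable-length extraction, `chunk_frames` could be drawn between 500 and 1000 per sample.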