Cornell Birdcall 36th place solution

Resources: TITAN RTX, 1080Ti x 2, 2080Ti x 2 *1
Copyright 2020 @ Maxwell_110

*1 My room became tropical.
*2 Credit to Vopani
*3 Kerneler-kun had become a notebook expert! Check his profile.
*4 mean on time axis, max and mean on freq axis

Data
- Cornell Birdcall Identification (Kaggle)
- Xeno-canto, Extended Xeno-canto *2
- BirdVox, ff1010, fs2019, ESC50
- Labels: 264 birds, nocall

Feature Extraction
- Load Audio (librosa.load), resample with 22.05 kHz
- Trim silent parts (librosa.effects.trim): remove only silent start/end parts
- 2D Features: Log Mel Spectrogram (librosa.feature.melspectrogram, librosa.core.power_to_db)
  - 5 - 10 (s) variable audio length
  - 64 nmel, 10ms hop, 80ms sfft
  - Event aware extraction (callout 1)
  - Shape: 64 x 500 - 1000
- 1D Features: Audio Data
  - 5 (s) constant audio length
  - Event aware extraction (callout 1)
  - Shape: 5 x 22050

Augmentation (To 2D Models)
- p: 0.5, width / height shift: 0.2 / 0.1, Scale: -0.05 / +0.05
- Random Eraser (callout 2): p: 0.5, erase num: 1, width: [0, 0.1], height: [0.1, 0.3], fill with -1
- Standardize [-1, +1]

Augmentation (To 1D Models)
- p: 0.5, width shift: 0.2, NoiseInjection
- Standardize [-1, +1]
- cut out at random

Training / Prediction (callouts 3, 4)
- nocall Binary Model (call / nocall): ResNet 18
- call -> Multi-Label Model (264 types): Multi-Task-Learning (MTL) for primary and noisy background labels
  - 2D Models: ResNet 18
  - 1D Models: PANNs 1D

Model Architecture
- Two branches, primary branch and background branch, trained with the MTL Loss
- Each branch: 2048 nodes + BN + ReLU, 512 nodes + BN + ReLU, 2D: GAP / 1D: MAP *4
- Outputs: 264 nodes (primary labels), 264 nodes (background labels)

3 stage scratch learning
1. primary only
   - , = [, ]
   - 2D: 200 epochs + Early Stopping (ES), Adam, CyclicLR 1e-4 ~ e-3
   - 1D: 100 epochs + ES, SGD, CosineAnnealing 1e-1 ~ e-6 => Adam, ReduceLROnP 1e-4
2. + background
   - , = [, ]
   - Adam (5e-5)
   - ReduceLROnP (x 0.25)
3. + Pseudo Labeling
   - , = [, ]
   - Adam (5e-5)
   - ReduceLROnP (x 0.25)
   - Background predictions above 0.15 are added to the primary labels as soft labels in the primary branch. All values are clipped between 0 and 1.

Model Blending
• Class-Wise Blending
  - For each bird class
  - Optimize blend coefficients with BCE loss
• Class-Wise threshold optimization
  - For each bird class
  - Maximize macro-F1 (not sample-wise)
• 3 Epoch ensemble for PANNs 1D
• 5 Fold ensemble for ResNet18, PANNs 1D

Public: 36th (0.623)
Private: 39th (0.580) *3
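
The pseudo-labeling rule and the class-wise threshold optimization above can be sketched in a few lines of NumPy. This is only an illustration under assumptions, not the author's code: the function names (`add_pseudo_labels`, `f1_binary`, `optimize_class_thresholds`) and the threshold grid are hypothetical; only the 0.15 cutoff, the clipping to [0, 1], and the per-class F1 objective come from the slide.

```python
import numpy as np


def add_pseudo_labels(primary, background_pred, cutoff=0.15):
    """Add background predictions above `cutoff` to the primary labels
    as soft labels, then clip everything to [0, 1]."""
    soft = np.where(background_pred > cutoff, background_pred, 0.0)
    return np.clip(primary + soft, 0.0, 1.0)


def f1_binary(y_true, y_pred):
    """F1 score for one class; returns 0.0 when there is nothing to score."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


def optimize_class_thresholds(y_true, y_prob, grid=None):
    """Pick one threshold per class by grid search on that class's F1.

    Macro-F1 is the mean of the per-class F1 scores, so with one threshold
    per class it can be maximized class by class.
    The grid below is an assumption, not taken from the slides.
    """
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    n_classes = y_true.shape[1]
    thresholds = np.empty(n_classes)
    for c in range(n_classes):
        scores = [
            f1_binary(y_true[:, c], (y_prob[:, c] >= t).astype(int))
            for t in grid
        ]
        thresholds[c] = grid[int(np.argmax(scores))]
    return thresholds
```

In practice the thresholds would be fitted on out-of-fold predictions and then applied to the blended test predictions with `y_prob >= thresholds`, which broadcasts over the class axis.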

1 Event Aware Extraction
Example: XC341516.mp3 (ebird_code: brespa, 60 sec)

Figure (waveform and logmel of the example; annotations: brespa, bug, threshold):
- Naïve Random Extraction (5 sec): nocall, no event => mislabeling
- logmel freq-wise max aware extraction: noisy event / call => sometimes works, other times fails.
- logmel freq-wise max aware extraction with removing lower frequencies

- Using the logmel spectrum, extract audio chunks that contain a logmel frequency-wise max intensity over the threshold.
- Because almost all bird call frequencies range from 1 to 8 kHz, removing the lower frequencies (~ 300 Hz) gives better chunks for training.
- Variable extraction length, 5 - 10 sec, also helps to get a good signal; meanwhile, 2D models generalize well to variable length.
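
A minimal sketch of the event aware extraction idea, assuming the log-mel spectrogram has already been computed (for example with `librosa.feature.melspectrogram` and `power_to_db`, as in the feature extraction step). The function name `event_aware_chunk`, its parameters, and the random placement of the chunk around an event frame are assumptions for illustration; the slide only specifies the frequency-wise max, the threshold, and the removal of frequencies below about 300 Hz.

```python
import numpy as np


def event_aware_chunk(logmel, mel_freqs, threshold, chunk_frames,
                      min_freq=300.0, rng=None):
    """Return (start, end) frame indices of a chunk containing an event.

    logmel:       array of shape (n_mels, n_frames), in dB
    mel_freqs:    center frequency in Hz of each mel bin, shape (n_mels,)
    threshold:    dB level the frequency-wise max must exceed
    chunk_frames: chunk length in frames (500 for 5 s at a 10 ms hop)
    Returns None when no frame exceeds the threshold.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_frames = logmel.shape[1]

    # Drop the low-frequency bins so that low-pitched noise is not
    # mistaken for a bird call.
    band = logmel[np.asarray(mel_freqs) >= min_freq]
    if band.shape[0] == 0:
        return None

    # Frequency-wise max intensity per frame.
    envelope = band.max(axis=0)
    event_frames = np.flatnonzero(envelope > threshold)
    if event_frames.size == 0:
        return None
    if n_frames <= chunk_frames:
        return 0, n_frames

    # Pick one event frame, then place the chunk at random around it
    # while keeping the chunk inside the recording.
    center = int(rng.choice(event_frames))
    lo = max(0, center - chunk_frames + 1)
    hi = min(center, n_frames - chunk_frames)
    start = int(rng.integers(lo, hi + 1))
    return start, start + chunk_frames
```

Drawing `chunk_frames` at random between 500 and 1000 for each sample would give the variable 5 - 10 sec length used for the 2D models, while the 1D models would keep it fixed at the equivalent of 5 sec.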

Maxwell
