Slide 1

Cornell Birdcall Identification 6th place solution
Team「Deepでポン」
Hidehisa Arai

Slide 2

About the competition
https://www.kaggle.com/c/birdsong-recognition

The task: build a system that can detect and classify birdcall events in audio clips.

Train dataset
• Around 20k audio clips
• Additional ~25k audio clips (not officially provided but allowed to use)
• 264 classes (bird species)
• Each audio clip has a primary label, and 1/3 of them also have secondary labels
• Can be treated as multi-label, but it is basically a multi-class classification problem
• Annotated at clip level
• Data comes from Xeno Canto, i.e. user-uploaded recordings; the annotations are also done by the uploaders
• File format, sampling rate, audio quality, audio duration, etc. vary a lot

Test dataset
• Around 150 audio clips
• Each clip is 10 min long
• 265 classes (bird species + `nocall` class)
• Predictions must be submitted at the 5s-chunk level
• The data is soundscape recordings, which contain birdcall events of multiple species (sometimes overlapping)
• Multi-label classification problem
• Annotation was done in a controlled environment with trained annotators

Slide 3

About the competition
https://www.kaggle.com/c/birdsong-recognition

Large difference between train and test

Slide 4

How to use audio data with a NN

1D CNN approach
• Operates on the raw waveform (e.g. EnvNet, WaveNet encoder)
• Slow and usually vulnerable to noise

2D CNN approach
• Convert the raw waveform to a (log-)(mel-)spectrogram and treat it as an image
• Use the rich pool of image recognition models to classify the spectrogram
• Learns fast, often used
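For concreteness, a minimal sketch of the 2D CNN preprocessing step with librosa; the file name and all spectrogram parameters below are illustrative assumptions, not values from the solution.

import librosa
import numpy as np

# Raw waveform -> log-mel spectrogram "image" (parameters illustrative).
y, sr = librosa.load("example.wav", sr=32000)  # resample to a fixed rate
melspec = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128
)                                              # (n_mels, n_frames) power spectrogram
logmel = librosa.power_to_db(melspec).astype(np.float32)
# `logmel` can now be fed to any image-recognition CNN as a 1-channel image.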

Slide 5

Result: Public 2nd (0.628) → Private 6th (0.668)

Slide 6

Abstract
• 3-stage training to gradually remove the noise in the labels
• Sound Event Detection (SED) style training and inference
• Weighted blending of 11 models with no-fold training + EMA

Slide 7

Major challenges of the competition

Discrepancy between train/test (domain shift)
1. Source of the audio clips
  • Train: uploaded to Xeno Canto by the recordist
  • Test: (possibly) taken by microphones set up outdoors
2. Annotation
  • Train: clip-wise labels annotated by the uploaders of those clips
  • Test: chunk-level labels annotated by trained annotators

Weak and noisy labels
• Labels for the train data are clip-level (weak labels), and each clip has only one primary label
• Secondary labels may exist but are not that trustworthy (noisy labels)
• Sometimes uploaders just don't put secondary labels even though multiple species are present in the clip (missing labels)
• The class `nocall` only exists in the test dataset

Slide 8

Motivation ‒ domain shift

Decompose the domain shift. Notation: x = input, y = label, y* = true label (unobservable). Estimating p_test(y|x), i.e. the mapping x_test → y_test, is the task.

p_train(x) ≠ p_test(x): the distribution of the input is different between train/test
• Comes from the sound collection environment
• The SNR of the test data is comparatively small because birds often sing far away
• The frequency and type of non-birdcall events are also different

p_train(y*) ≠ p_test(y*): the distribution of the true label is different between train/test
• Comes from the difference of the recording locations
• The distribution of bird species and call types differs
• Test recording is passive, so the nocall class occurs more often than in train

p_train(y|x) ≠ p_test(y|x): the input-output relationship is different between train/test
• Comes from the difference of the annotation methods
• In the train dataset, annotation quality varies a lot since the uploaders are the annotators
• In the test dataset, annotation is controlled and done at chunk level
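In standard domain-adaptation terminology, these three components are covariate shift, label (prior) shift, and concept shift; a compact restatement of the decomposition (my notation, following the slide's p(x), p(y*), p(y|x)):

\begin{aligned}
p(x, y) &= p(y \mid x)\, p(x) \\
\text{covariate shift:}\quad & p_{\mathrm{train}}(x) \neq p_{\mathrm{test}}(x) \\
\text{label shift:}\quad & p_{\mathrm{train}}(y^{*}) \neq p_{\mathrm{test}}(y^{*}) \\
\text{concept shift:}\quad & p_{\mathrm{train}}(y \mid x) \neq p_{\mathrm{test}}(y \mid x)
\end{aligned}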

Slide 9

Motivation ‒ domain shift: mitigation for each component

p_train(x) ≠ p_test(x)
→ Provide all possible variation with Data Augmentation

p_train(y*) ≠ p_test(y*)
→ Ensemble of models trained on datasets with different class distributions

p_train(y|x) ≠ p_test(y|x)
→ Make y_train closer to y_test by label correction

Slide 10

Motivation ‒ label noise

Decompose the label noise.

Weak labels are noisy labels
• The fact that labels exist only at clip level can itself be treated as label noise
• There are cases where birdcall events that are in the labels don't actually exist
• We can trust the primary labels, but the corresponding events are distributed over the whole clip
• Birds in the secondary labels are often only scarcely audible

Missing labels
• There are cases where birdcall events exist that aren't in the labels
• 2/3 of the train data don't have secondary labels
• It is up to the uploaders whether to put secondary labels or not, so missing labels exist

(Example clips on the slide: one annotated with the labels aldfly, leafly, amered, canwar; one with the single label warvir.)

Slide 11

Motivation ‒ label noise: mitigation for each component

Weak labels
→ Make sure birdcall events are in the chunk by training with long chunks

Noisy labels
→ Eliminate low-confidence labels by pseudo-labeling

Missing labels
→ Find missing labels in the audio clips that don't have secondary labels, using trained models

Slide 12

Sound Event Detection (SED)

Two major approaches to tackle the task:

Audio Tagging
• Clip-level labeling of the audio input
• Pipeline: input (waveform, melspec, …) → feature extraction (CNN, etc.) → feature map → aggregate over the time axis (max, mean, attention, …) → classifier → clip-level prediction

Sound Event Detection
• Segment-level (time-annotated) labeling of the audio
• Pipeline: input (waveform, melspec, …) → feature extraction (CNN, etc.) → feature map → pointwise classifier → frame-level prediction, then aggregate over the time axis (max, mean, attention, …) for the clip level
• There are two outputs: a clip-wise prediction and a segment-wise prediction
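A minimal PyTorch sketch of an SED-style head that produces both outputs, loosely following PANNs' attention block; the class name and layer choices here are my own, not taken from the slides.

import torch
import torch.nn as nn

class SEDHead(nn.Module):
    # Frame-wise predictions plus an attention-weighted clip-wise
    # prediction on top of a CNN encoder's feature map.
    def __init__(self, in_features: int, num_classes: int):
        super().__init__()
        self.att = nn.Conv1d(in_features, num_classes, kernel_size=1)
        self.cla = nn.Conv1d(in_features, num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor):
        # x: (batch, in_features, n_frames) feature map
        att = torch.softmax(torch.tanh(self.att(x)), dim=-1)  # attention over time
        cla = torch.sigmoid(self.cla(x))                      # frame-wise probabilities
        clipwise = (att * cla).sum(dim=-1)                    # (batch, num_classes)
        return clipwise, cla                                  # clip-level, frame-level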

Slide 13

Stage 1: find missing labels with 5-fold PANNs

• Cnn14DecisionLevelAtt of PANNs (an SED model)
• Input: 30s randomly cropped chunk ← for the weak labels
• Data Augmentation ← for the shift in p(x)
  • Gaussian Noise
  • Pitch Shift
  • Volume Up/Down
• The loss is applied to both the clip-wise and the frame-wise output: L = L_att + 0.5 · L_max, where the attention map is multiplied element-wise with the segment-wise predictions and aggregated over the time axis to give the attention-based clip-level prediction, and max aggregation gives the second clip-level prediction
• Create predictions for the whole train dataset by 5-fold training
• Add a class to the secondary labels when its predicted probability is > 0.9 and it is not the primary label (only for the clips without secondary labels) ← for the missing labels

→ 0.578 (public) / 0.619 (private), silver medal zone
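A hedged sketch of the two mechanics above, assuming probability tensors/arrays of shape (batch or n_clips, 264); the function names are hypothetical.

import numpy as np
import torch.nn.functional as F

def stage1_loss(clip_att, clip_max, target):
    # L = L_att + 0.5 * L_max: BCE on the attention-aggregated clip
    # prediction plus 0.5x BCE on the max-aggregated one.
    return (F.binary_cross_entropy(clip_att, target)
            + 0.5 * F.binary_cross_entropy(clip_max, target))

def mine_secondary_labels(oof_probs, primary_onehot, threshold=0.9):
    # Add a class as a secondary label when its out-of-fold probability
    # exceeds 0.9 and it is not the primary label (applied only to clips
    # that have no secondary labels). Arrays: (n_clips, 264).
    return np.logical_and(oof_probs > threshold, primary_onehot == 0)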

Slide 14

Stage 2: pseudo-labeling with a 5-fold SED model

• Inherit the PANNs architecture, changing the CNN encoder to ResNeSt50
• Use the additional labels found in stage 1 ← for the missing labels
• Input: 20s randomly cropped chunk ← for the weak labels
• Data Augmentation ← for the shift in p(x)
  • Gaussian Noise
  • Pitch Shift
  • Volume Up/Down
• 3-channel input (melspec, pcen, melspec ** 1.5)
• The loss is applied to both the clip-wise and the frame-wise output
• Create predictions for the whole train dataset by 5-fold training
• Correct the original labels with the oof predictions ← for the noisy labels: take the frame-wise prediction of each cropped chunk (shape n_frame × 264, where n_frame ≠ len(y)), aggregate it over the time axis with max, apply a threshold, and combine it with the original clip-level label via np.logical_and to obtain the corrected label
• Alter the correction level by changing the threshold, for diversity

→ 0.601 (public) / 0.655 (private), gold medal zone
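A minimal sketch of this correction rule; the function name, threshold default, and exact shapes are assumptions.

import numpy as np

def correct_chunk_label(framewise_probs, clip_label, threshold=0.5):
    # framewise_probs: (n_frame, 264) oof frame-wise probabilities
    #                  for one cropped chunk (n_frame != len(y))
    # clip_label:      (264,) original clip-level multi-hot label
    # Aggregate over the time axis with max, threshold, then keep only
    # classes that the original label also contains (np.logical_and).
    chunk_pred = framewise_probs.max(axis=0) > threshold
    return np.logical_and(chunk_pred, clip_label.astype(bool))

Raising or lowering the threshold changes how aggressively labels are removed, which is the "correction level" used for diversity.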

Slide 15

Stage 3: train several models with no-fold training + EMA, changing the correction level of the labels

• PANNs architecture + ResNeSt50 / EfficientNet-B0
• No-fold training with Train Dataset / Train Extended
• Use the additional labels found in stage 1 ← for the missing labels
• Input: 20s randomly cropped chunk ← for the weak labels
• Data Augmentation ← for the shift in p(x)
  • Gaussian Noise
  • Pitch Shift
  • Volume Up/Down
• 3-channel input (melspec, pcen, melspec ** 1.5):
  • librosa.power_to_db(librosa.feature.melspectrogram(y))
  • librosa.pcen(librosa.feature.melspectrogram(y))
  • librosa.power_to_db(librosa.feature.melspectrogram(y) ** 1.5)
• The loss is applied to both the clip-wise and the frame-wise output (FocalLoss for EfficientNet)
• Correct the original labels with the oof predictions ← for the noisy labels
• Alter the label correction level by changing the threshold on the oof predictions from 0.3 to 0.7 ← for the shift in p(y*)
• Weighted average of 11 models (the weights were decided based on the public LB)

※ no-fold + EMA: use the whole dataset for training and apply an Exponential Moving Average to the weights

→ 0.628 (public) / 0.668 (private), 6th place
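Two short sketches for the pieces above: building the 3-channel input from the librosa calls on the slide, and a generic weight-EMA helper for the no-fold training (the decay value and class name are assumptions, not from the slide).

import copy
import librosa
import numpy as np
import torch

def three_channel_input(y, sr):
    # The 3-channel input; melspectrogram parameters are left at
    # librosa defaults here for brevity.
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    ch1 = librosa.power_to_db(mel)            # log-mel spectrogram
    ch2 = librosa.pcen(mel)                   # per-channel energy normalization
    ch3 = librosa.power_to_db(mel ** 1.5)     # log of the 1.5-powered spectrogram
    return np.stack([ch1, ch2, ch3], axis=0)  # (3, n_mels, n_frames)

class ModelEMA:
    # "no-fold + EMA": keep an exponential moving average of the weights
    # while training on the whole dataset.
    def __init__(self, model, decay=0.999):
        self.ema = copy.deepcopy(model).eval()
        self.decay = decay

    @torch.no_grad()
    def update(self, model):
        for e, p in zip(self.ema.parameters(), model.parameters()):
            e.mul_(self.decay).add_(p, alpha=1.0 - self.decay)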

Slide 16

Miscellaneous

Reflections
• There was a bug in the seed-fixing function; I found it while training the third stage
• Loss of reproducibility due to code modifications (caused by the above)
• "1 experiment, 1 script" is the best approach after all
• I kept the Data Augmentation configuration that worked at first, but pitch shift is actually super slow; I could have run more experiments if I had decided not to use it
• Hard voting seems to work very well in the winner's solution, but I didn't even try it

What didn't work
• Mixup
  • Mixing before taking the log (2nd, 36th) and taking the union of the labels (3rd, 36th) are reported to have worked
• Call-type classification
  • In fact, there are many call types even within a single species; calls and songs are quite different
  • I tried 871-class classification using call-type labels, but it didn't work
• Larger models
  • According to Oleg (5th), too large a receptive field wasn't good for this dataset

What I want to do further
• Add more models
• Continue with further label correction
• Use location and elevation for label correction
• Use co-occurrence information of bird calls for post-processing
• Mix in background noise