6th place solution to Cornell Birdcall Identification Challenge

Hidehisa Arai

October 24, 2020

Transcript

  1. Cornell Birdcall Identification: 6th place solution. Team「Deepでポン」, Hidehisa Arai

  2. About the competition (https://www.kaggle.com/c/birdsong-recognition)

    The task: build a system that can detect and classify birdcall events in audio clips.

    Train dataset
    • Around 20k audio clips
    • Additional ~25k audio clips (not officially provided, but allowed to use)
    • 264 classes (bird species)
    • Each audio clip has a primary label, and about 1/3 of the clips also have secondary labels
    • Can be treated as multi-label, but it is basically a multi-class classification problem
    • Annotated at clip level
    • Data come from Xeno Canto, which means user-uploaded recordings; the annotations are also made by the uploaders
    • File format, sampling rate, audio quality, audio duration, etc. vary a lot

    Test dataset
    • Around 150 audio clips
    • Each clip is 10 min long
    • 265 classes (bird species + `nocall` class)
    • Predictions must be submitted for each 5 s chunk
    • The dataset consists of soundscapes, so it contains birdcall events of multiple species (sometimes overlapping)
    • Multi-label classification problem
    • Annotation is done in a controlled environment by annotators
  3. About the competition: train vs. test

    Train clips are user-uploaded Xeno Canto recordings with clip-level labels set by the uploaders, while test clips are 10 min soundscapes annotated per 5 s chunk in a controlled environment. There is a large difference between train and test.
  4. How to use audio data with NN

    1D CNN approach
    • Slow to learn and usually vulnerable to noise
    • Examples: EnvNet, WaveNet encoder

    2D CNN approach
    • Learns fast and is often used
    • Convert the raw waveform to a (log)(mel)spectrogram and treat it as an image
    • Use the rich pool of image recognition models to classify the spectrogram
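The waveform-to-image step of the 2D CNN approach can be sketched as follows. This is a minimal NumPy-only illustration, not the code used in the solution: it computes a plain log power spectrogram and leaves out the mel filterbank, and the function name, `n_fft`, `hop_length`, and the 32 kHz sampling rate are all assumptions chosen for the example.

```python
import numpy as np

def log_spectrogram(y, n_fft=1024, hop_length=512, eps=1e-10):
    """Return a log power spectrogram of shape (n_fft // 2 + 1, n_frames).

    Hypothetical helper: a real pipeline would usually add a mel filterbank.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop_length
    # Slice the waveform into overlapping frames and apply the window.
    frames = np.stack([
        y[i * hop_length: i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, n_bins)
    # Convert to decibels and put frequency on the first axis, like an image.
    return 10.0 * np.log10(power + eps).T

# One second of a 4 kHz sine at an assumed sampling rate of 32 kHz.
sr = 32000
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 4000 * t)
spec = log_spectrogram(y)
```

The resulting 2D array can be fed to an image model like any single-channel image.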
  5. Result: Public 2nd (0.628) → Private 6th (0.668)

  6. Abstract

    • 3-stage training to gradually remove the noise in the labels
    • Sound Event Detection (SED) style training and inference
    • Weighted blending of 11 models trained with no-fold training + EMA
  7. Major challenges of the competition

    Discrepancy between train and test (domain shift)
    1. Source of the audio clips
       • Train: uploaded to Xeno Canto by the recorder
       • Test: (possibly) recorded by microphones set up outdoors
    2. Annotation
       • Train: clip-wise labels annotated by the uploaders of those clips
       • Test: chunk-level labels annotated by trained annotators

    Weak and noisy labels
    • Labels for the train data are clip-level (weak labels), with only one primary label per clip
    • Secondary labels may exist but are not that trustworthy (noisy labels)
    • Sometimes uploaders simply do not add secondary labels even though multiple species are present in the clip (missing labels)
    • The `nocall` class exists only in the test dataset
  8. Motivation: domain shift

    Decompose the domain shift. Notation: $X$: input, $Y$: label, $Y^*$: true label (unobservable). Estimating $f_{\text{test}}: X_{\text{test}} \to Y_{\text{test}}$ is the task.

    $p(X_{\text{train}}) \neq p(X_{\text{test}})$: the distribution of the input is different between train and test
    • Comes from the sound collection environment
    • The SNR of the test data is comparatively low because birds often sing far away
    • The frequency and type of non-birdcall events are also different

    $p(Y^*_{\text{train}}) \neq p(Y^*_{\text{test}})$: the distribution of the true label is different between train and test
    • Comes from the difference in location
    • The distribution of bird species and call types is shifted
    • Test recordings are passive, so the nocall class occurs more often than in train

    $f_{\text{train}} \neq f_{\text{test}}$: the input-output relationship is different between train and test
    • Comes from the difference in annotation methods
    • In the train dataset, the annotation level varies a lot since the uploaders are the annotators
    • In the test dataset, the annotation level is controlled and annotation is chunk-level
  9. Motivation: domain shift (countermeasures)

    • $p(X_{\text{train}}) \neq p(X_{\text{test}})$ (the input distribution differs): provide all possible variation with data augmentation
    • $p(Y^*_{\text{train}}) \neq p(Y^*_{\text{test}})$ (the true label distribution differs): ensemble models trained on datasets with different class distributions
    • $f_{\text{train}} \neq f_{\text{test}}$ (the input-output relationship differs): make $f_{\text{train}}$ closer to $f_{\text{test}}$ by label correction
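Two of the augmentations named in the later stages (Gaussian noise and volume up/down) can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function names, the SNR-based parameterization of the noise, and the decibel gain are assumptions, not the author's configuration. Pitch shift is omitted because it needs a resampling library.

```python
import numpy as np

def add_gaussian_noise(y, snr_db, rng):
    """Add white Gaussian noise so that the result has roughly the given SNR (dB)."""
    signal_power = np.mean(y ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=y.shape)
    return y + noise

def change_volume(y, gain_db):
    """Scale the waveform amplitude by a gain given in decibels."""
    return y * (10.0 ** (gain_db / 20.0))

rng = np.random.default_rng(0)
t = np.arange(32000) / 32000
clean = np.sin(2 * np.pi * 2000 * t)

noisy = add_gaussian_noise(clean, snr_db=10.0, rng=rng)
louder = change_volume(clean, gain_db=20.0)   # amplitude x10
quieter = change_volume(clean, gain_db=-20.0)  # amplitude x0.1
```

Lowering `snr_db` moves the training clips toward the low-SNR conditions described for the test soundscapes.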
  10. Motivation: label noise

    Decompose the label noise:

    Weak labels are noisy labels
    • The fact that labels exist only at clip level is itself treated as label noise

    Some birdcall events that are in the labels do not exist in the audio
    • Primary labels can be trusted, but the calls can occur anywhere in the clip
    • Birds in the secondary labels are only barely audible

    Some birdcall events that are not in the labels exist in the audio
    • 2/3 of the train data do not have secondary labels
    • It is up to the uploaders whether to add secondary labels, so some labels are missing

    Example (figure): labels: aldfly, leafly, amered, canwar; label: warvir
  11. Motivation: label noise (countermeasures)

    • Weak labels (labels exist only at clip level): make sure birdcall events are in the chunk by training with long chunks
    • Birdcall events that are in the labels but do not exist in the audio: eliminate low-confidence labels by pseudo-labeling
    • Birdcall events that exist in the audio but are not in the labels: find the missing labels in audio clips without secondary labels using trained models
  12. Sound Event Detection (SED)

    Two major approaches to tackle the task:

    Audio Tagging
    • Clip-level labeling of the audio input
    • Pipeline: input (waveform, melspec, …) → feature extractor (CNN, etc.) → feature map → aggregate over the time axis (max, mean, attention, …) → classifier → clip-level prediction

    Sound Event Detection
    • Segment-level (time-annotated) labeling of the audio input
    • Pipeline: input (waveform, melspec, …) → feature extractor (CNN, etc.) → feature map → pointwise classifier → frame-level prediction → aggregate over the time axis (max, mean, attention, …) → clip-level prediction
    • There are two outputs: the clip-wise prediction and the segment-wise prediction
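The SED pipeline above (pointwise classifier, then aggregation over the time axis) can be sketched with NumPy. This is a simplified illustration rather than the PANNs implementation: the linear "pointwise classifier", the softmax attention over time, and the names `sed_head`, `w_cls`, and `w_att` are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sed_head(features, w_cls, w_att):
    """features: (n_frames, n_features); w_cls, w_att: (n_features, n_classes)."""
    # Pointwise classifier: one probability per frame and class.
    framewise = sigmoid(features @ w_cls)              # (n_frames, n_classes)
    # Attention map: softmax over the time axis, separately for each class.
    att_logits = features @ w_att
    att_logits = att_logits - att_logits.max(axis=0, keepdims=True)
    att = np.exp(att_logits)
    att /= att.sum(axis=0, keepdims=True)
    # Three ways to aggregate frame-level predictions into a clip-level one.
    clip_att = (framewise * att).sum(axis=0)
    clip_max = framewise.max(axis=0)
    clip_mean = framewise.mean(axis=0)
    return framewise, clip_att, clip_max, clip_mean

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 16))   # stand-in for a CNN feature map
w_cls = rng.normal(size=(16, 264))
w_att = rng.normal(size=(16, 264))
framewise, clip_att, clip_max, clip_mean = sed_head(features, w_cls, w_att)
```

Because the attention weights sum to one over time, the attention-pooled clip prediction is a weighted average of the frame predictions and can never exceed the max-pooled one.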
  13. Stage 1: find missing labels with 5-fold PANNs

    • Cnn14DecisionLevelAtt of PANNs (SED model)
    • Input: 30 s randomly cropped chunk ← for weak labels
    • Data augmentation ← for shift in $p(X)$
      • Gaussian noise
      • Pitch shift
      • Volume up/down
    • Loss is applied to both the clipwise output and the framewise output: $\mathcal{L} = \mathrm{BCE}(y, \hat{y}_{\mathrm{att}}) + 0.5\,\mathrm{BCE}(y, \hat{y}_{\mathrm{max}})$
    • Create predictions for the whole train dataset by 5-fold training
    • Add a class to the secondary labels if its probability is > 0.9 and it is not included in the primary label (only for clips without secondary labels) ← for missing labels

    Model (figure): feature extractor → segment-wise prediction and attention map. The clip-level prediction $\hat{y}_{\mathrm{att}}$ is aggregated over the time axis after an element-wise product with the attention map; the clip-level prediction $\hat{y}_{\mathrm{max}}$ is aggregated with max.

    → 0.578 (public) / 0.619 (private) (silver medal zone)
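The two-part loss and the missing-label rule of stage 1 can be sketched as below. This is a hedged illustration: binary cross-entropy is assumed as the base loss, and the helper names (`bce`, `stage1_loss`, `add_missing_labels`) and the dict-based data layout are hypothetical, not taken from the author's code.

```python
import numpy as np

def bce(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over classes."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_prob)
                          + (1.0 - y_true) * np.log(1.0 - y_prob)))

def stage1_loss(y_true, clip_att, framewise):
    """Loss on the attention-pooled output plus 0.5 x loss on the max-pooled output."""
    clip_max = framewise.max(axis=0)
    return bce(y_true, clip_att) + 0.5 * bce(y_true, clip_max)

def add_missing_labels(probs, primary, secondary, threshold=0.9):
    """probs: {class_name: out-of-fold clip-level probability} for one clip."""
    if secondary:
        # The rule is applied only to clips without secondary labels.
        return list(secondary)
    return sorted(c for c, p in probs.items() if p > threshold and c != primary)

y_true = np.array([1.0, 0.0])
clip_att = np.array([0.9, 0.1])
framewise = np.array([[0.8, 0.2], [0.6, 0.1]])
loss = stage1_loss(y_true, clip_att, framewise)

probs = {"aldfly": 0.97, "leafly": 0.95, "amered": 0.40}
found = add_missing_labels(probs, primary="aldfly", secondary=[])
kept = add_missing_labels(probs, primary="aldfly", secondary=["canwar"])
```

Here `found` contains only `leafly`: `aldfly` is already the primary label and `amered` is below the 0.9 threshold.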
  14. Stage 2: pseudo-labeling with a 5-fold SED model

    • Inherit the PANNs architecture
    • Change the CNN encoder to ResNeSt50
    • Use the additional labels found in stage 1 ← for missing labels
    • Input: 20 s randomly cropped chunk ← for weak labels
    • Data augmentation ← for shift in $p(X)$
      • Gaussian noise
      • Pitch shift
      • Volume up/down
    • 3-channel input (melspec, pcen, melspec ** 1.5)
    • Loss is applied to both the clipwise output and the framewise output
    • Create predictions for the whole train dataset by 5-fold training
    • Correct the original labels with the out-of-fold (oof) prediction ← for noisy labels

    Label correction (figure): the frame-wise prediction for the cropped chunk has shape n_frame (≠ len(y)) × 264. Aggregate the prediction within the chunk over the time axis with max and apply a threshold, then combine the result with the original clip-level label using np.logical_and to get the corrected label. The correction level is altered by changing the threshold, for diversity.

    → 0.601 (public) / 0.655 (private) (gold medal zone)
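The label correction step can be written in a few lines of NumPy. This is a minimal sketch under assumptions: the function name, the array layout (frames × classes), and the use of `>=` for the threshold comparison are illustrative choices, and a toy 4-class example stands in for the 264 classes.

```python
import numpy as np

def correct_labels(framewise_pred, clip_label, threshold=0.5):
    """Keep a clip-level label only if the model also hears it in this chunk.

    framewise_pred: (n_frames, n_classes) out-of-fold probabilities for the chunk.
    clip_label:     (n_classes,) original binary clip-level label.
    """
    # Aggregate over the time axis with max, then binarize with the threshold.
    chunk_pred = framewise_pred.max(axis=0) >= threshold
    # A label survives only if it is in the original label AND predicted.
    return np.logical_and(chunk_pred, clip_label.astype(bool)).astype(np.float32)

framewise_pred = np.array([
    [0.8, 0.1, 0.9, 0.0],
    [0.3, 0.2, 0.4, 0.1],
    [0.1, 0.0, 0.2, 0.1],
])
clip_label = np.array([1, 1, 0, 0])

strict = correct_labels(framewise_pred, clip_label, threshold=0.5)
loose = correct_labels(framewise_pred, clip_label, threshold=0.1)
```

With the stricter threshold the second label is dropped for this chunk, while the third class is never added even though it is predicted confidently, because `np.logical_and` can only remove labels.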
  15. Stage 3: train several models with no-fold training + EMA, changing the correction level of the labels

    • PANNs architecture + ResNeSt50 / EfficientNet-B0
    • No-fold training with Train Dataset / Train Extended
    • Use the additional labels found in stage 1 ← for missing labels
    • Input: 20 s randomly cropped chunk ← for weak labels
    • Data augmentation ← for shift in $p(X)$
      • Gaussian noise
      • Pitch shift
      • Volume up/down
    • 3-channel input (melspec, pcen, melspec ** 1.5)
      • melspec: `librosa.power_to_db(librosa.feature.melspectrogram(y=y))`
      • pcen: `librosa.pcen(librosa.feature.melspectrogram(y=y))`
      • melspec ** 1.5: `librosa.power_to_db(librosa.feature.melspectrogram(y=y) ** 1.5)`
    • Loss is applied to both the clipwise output and the framewise output (FocalLoss for EfficientNet)
    • Correct the original labels with the out-of-fold (oof) prediction ← for noisy labels
    • Alter the label correction level by changing the threshold on the oof prediction from 0.3 to 0.7 ← for shift in $p(Y^*)$
    • Weighted average of 11 models (the weights are decided based on the public LB)

    ※ no-fold + EMA: use the whole dataset for training and apply an Exponential Moving Average to the weights

    → 0.628 (public) / 0.668 (private) (6th)
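The two ensembling ingredients of stage 3, EMA of the weights and weighted blending of model predictions, can be sketched as follows. This is an illustration under assumptions: weights are represented as plain dicts of NumPy arrays, and the decay value, the function names, and the example blend weights are hypothetical (the slides do not state them).

```python
import numpy as np

def ema_update(ema_weights, model_weights, decay=0.995):
    """One EMA step: ema <- decay * ema + (1 - decay) * current weights."""
    return {
        name: decay * ema_weights[name] + (1.0 - decay) * model_weights[name]
        for name in ema_weights
    }

def weighted_blend(predictions, weights):
    """predictions: list of (n_chunks, n_classes) arrays, one per model."""
    weights = np.asarray(weights, dtype=np.float64)
    weights = weights / weights.sum()            # normalize so the weights sum to 1
    return np.tensordot(weights, np.stack(predictions), axes=1)

# EMA: called after every optimizer step during training.
ema = {"w": np.zeros(2)}
current = {"w": np.ones(2)}
ema = ema_update(ema, current, decay=0.9)

# Blending: two toy models with blend weights 1 and 3.
pred_a = np.full((2, 3), 0.2)
pred_b = np.full((2, 3), 0.8)
blended = weighted_blend([pred_a, pred_b], weights=[1.0, 3.0])
```

At inference time the EMA copy of the weights is used instead of the last training step, which partly compensates for the lack of a validation fold to pick a checkpoint.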
  16. Miscellaneous

    Reflections
    • There was a bug in the seed fixing function. I found it while training the third stage
    • Reproducibility was lost through code modifications
      • One experiment, one script is the best after all
    • I kept using the data augmentation configuration that worked in the first place, but pitch shift is actually very slow, and I could have run more experiments if I had decided not to use it
    • Hard voting seems to work very well in the winnerʼs solution, but I did not even try it

    What did not work
    • Mixup
      • Mixing before taking the log (2nd, 36th) and taking the union of the labels (3rd, 36th) are reported to have worked
    • Calltype classification
      • In fact, there are many calltypes even within a single species
      • Calls and songs are quite different
      • I tried 871-class classification using calltype labels, but it did not work
    • Larger models
      • According to Oleg (5th), too large a receptive field was not good for this dataset

    What I want to do further
    • Add more models
    • Continue with further label correction
    • Use location and elevation for label correction
    • Use co-occurrence information of bird calls for post-processing
    • Mix in background noise
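The mixup variant reported by other teams (mix before taking the log, union of the labels) might look like the sketch below. This is only an interpretation of those two phrases, not any team's actual implementation: the function name, the fixed mixing ratio `lam`, and the decibel conversion are assumptions.

```python
import numpy as np

def mixup_before_log(power_a, power_b, label_a, label_b, lam=0.5, eps=1e-10):
    """Mix two power spectrograms in the linear domain, then take the log."""
    mixed_power = lam * power_a + (1.0 - lam) * power_b
    mixed_db = 10.0 * np.log10(mixed_power + eps)
    # Union of the labels: a species present in either clip stays positive.
    mixed_label = np.maximum(label_a, label_b)
    return mixed_db, mixed_label

power_a = np.full((4, 5), 1.0)     # toy power spectrograms (freq x time)
power_b = np.full((4, 5), 100.0)
label_a = np.array([1.0, 0.0, 0.0])
label_b = np.array([0.0, 1.0, 0.0])

mixed_db, mixed_label = mixup_before_log(power_a, power_b, label_a, label_b)
```

Mixing in the power domain corresponds to physically overlapping two recordings, which is closer to a soundscape with several birds than averaging two log spectrograms.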