BirdCLEF2021 Summary

Competition overview and top solutions
Japanese version also available
(searching for a good icon...) start (@startjapan)
(You can jump to each link from the Speaker Deck description section.)

Introduction

Medical student in Hiroshima.
While studying for the national examination, I am interning at MNES, a medical venture company, and studying at LAIME, a machine learning study group for students sponsored by MNES.
I won this competition and became a Kaggle Master. (Thanks to my teammates, though...)
Also, I'm looking for a good icon.
kaggle: @startjapan
twitter: @startjapanml

Competition Overview

• A competition to identify which bird species are singing in 5-second audio segments.
  (The same organizer held a similar competition in 2020. → We will call it "the 2020 birdcall competition" here.)
• The training data is audio taken from a birdcall sharing site called xeno-canto. (train_short_audio)
• The test data is 80 audio files of 10 minutes each, divided into 5-second segments for prediction. (test_soundscapes)
• In addition to the above, 20 audio files (10 minutes x 20) are given for validation. (train_soundscapes)

[test_soundscapes]
Test data. 80 soundscapes of 10 minutes each, which cannot be accessed without submission. The data was recorded at 4 locations.
[train_short_audio]
Training data. The audio is organized by bird species. 62874 audio files in total.
[train_soundscapes]
The acoustic domain is similar to test_soundscapes. There are 20 soundscapes of 10 minutes each, recorded at two of the four locations where test_soundscapes was recorded.
[train_metadata.csv]
Metadata for train_short_audio. The shape is (62874, 14).
[train_soundscape_labels.csv / test.csv]
Provide a framework for dividing a 10-minute file into 5-second segments. train_soundscape_labels.csv corresponds to train_soundscapes and test.csv corresponds to test_soundscapes; only the former has the answer labels.
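To make the segment framework concrete, here is a minimal sketch of the arithmetic: a 10-minute file yields 120 five-second segments, so 80 test files give 9600 prediction rows. The helper name `segment_end_times` and the choice to identify each segment by its end time are assumptions for illustration, not something stated on the slide.

```python
def segment_end_times(duration_sec=600, segment_sec=5):
    """End time (in seconds) of every segment in one soundscape.

    Hypothetical helper: identifying segments by end time is an assumption.
    """
    return list(range(segment_sec, duration_sec + 1, segment_sec))


ends = segment_end_times()
segments_per_file = len(ends)       # 600 / 5 segments
n_test_rows = 80 * segments_per_file  # 80 test soundscapes
print(segments_per_file, n_test_rows)
```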

Submission & Evaluation
• Multiple bird species can be submitted as predictions for a single segment.
• For segments where no birds are singing, the string "nocall" should be submitted.
• Submissions are evaluated by their row-wise micro-averaged F1 score.
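As a rough illustration of the metric, the sketch below follows one common reading of "row-wise" F1: compute F1 between the true and predicted label sets of each row, then average over rows. This reading, the function names, and the example species codes are assumptions; the official implementation may differ in details.

```python
def row_f1(true_labels, pred_labels):
    """F1 between two space-separated label strings for a single row."""
    t, p = set(true_labels.split()), set(pred_labels.split())
    if not t and not p:
        return 0.0  # guard only; real rows always carry at least "nocall"
    tp = len(t & p)
    return 2.0 * tp / (len(t) + len(p))  # same as 2tp / (2tp + fp + fn)


def row_wise_f1(y_true, y_pred):
    """Mean of the per-row F1 scores."""
    return sum(row_f1(t, p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative rows; "nocall" is treated like any other label.
y_true = ["nocall", "amerob", "amerob bkcchi"]
y_pred = ["nocall", "nocall", "amerob"]
score = row_wise_f1(y_true, y_pred)
all_nocall_score = row_wise_f1(y_true, ["nocall"] * len(y_true))
```

Under this reading, a submission that predicts "nocall" everywhere scores exactly the fraction of rows that are truly "nocall" (here 1 of 3).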

Differences from the 2020 birdcall competition

• Existence of train_soundscapes
  - There is a large difference in the acoustic domain between train_short_audio and the test data.
  - In this competition, train_soundscapes with an acoustic domain closer to the test data are given.
  - They were mostly used for validation, but some people devised ways to use them for training.
• We had access to the location information in the test data.
  - It was guaranteed that each file name in the test data contained location info (and date info).
  - So we needed to incorporate this information into the pipeline in some way.
  (cf. Starter and some thoughts by @hidehisaarai1213)
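A minimal sketch of pulling location and date out of a file name. The pattern `<id>_<site>_<yyyymmdd>.ogg` and the example name below are assumptions for illustration; the slide only says that the file names contain location and date info.

```python
from datetime import datetime
from pathlib import Path


def parse_soundscape_name(filename):
    """Split an assumed '<id>_<site>_<yyyymmdd>.ogg' name into its parts."""
    file_id, site, day = Path(filename).stem.split("_")
    return {
        "file_id": file_id,
        "site": site,
        "date": datetime.strptime(day, "%Y%m%d").date(),
    }


# Made-up file name, used only to exercise the parser.
info = parse_soundscape_name("12345_SSW_20190704.ogg")
print(info["site"], info["date"].month)
```

The site code and the month can then be fed into later stages, for example to filter out species that are unlikely at that place and season.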

EDA (train_short_audio ver.)

Time length of audio per file (train_short_audio)
※ Random sampling of 1000 (/ 62874) audio files from train_short_audio
※ Horizontal axis: time length of audio per file [sec]
※ Vertical axis: frequency (1000 files in total)

How many audio files are there for one species of bird? (train_short_audio)
※ Horizontal axis: number of audio files for each bird (in train_short_audio)
※ Vertical axis: frequency (397 species in total)

There are some secondary labels that are missing. (train_short_audio) (Cited
from BirdCLEF2021: Exploring the data)

EDA (soundscapes ver.)

We can find out the nocall rate of the Public LB by submitting "nocall" for all rows.
(Chart panels: BirdCLEF2021 and, for comparison, the 2020 birdcall competition; Public and Private for each.)

On the other hand, train_soundscapes has a higher nocall rate.

Distribution of the target variable in train_soundscapes (nocall included)
• The overwhelming majority of them are "nocall."
• Some 5-second segments have two or more species singing.

Distribution of the target variable in train_soundscapes (nocall removed)
• Some combinations of bird species are frequently observed.

Number of birds singing in the same 5-second segment in
train_soundscapes.

Basic approach for audio recognition

The audio data can be converted into an image (called a spectrogram) where the horizontal axis is time, the vertical axis is frequency, and each pixel represents the intensity of the signal component. By applying a CNN to it, the task can be treated as conventional image recognition.
※ In this competition, the Mel spectrogram, which uses the Mel scale for the vertical axis (frequency), was mainly used.
※ What is the Mel scale?: Equal differences in frequency on this scale are perceived by the human ear as equal differences in pitch.
(Images are cited from BirdCLEF2021: Processing audio data)
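The conversion can be sketched with NumPy alone. This is a simplified illustration, not the pipeline of any team: it uses the common formula m = 2595 * log10(1 + f / 700) for the Mel scale, and the sample rate (32 kHz), `n_fft`, `hop`, and `n_mels` values are assumptions chosen for the example. In practice a library such as librosa or torchaudio would usually be used.

```python
import numpy as np


def hz_to_mel(f):
    """Hz -> Mel using m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters whose centers are equally spaced on the Mel scale."""
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = hz_points[i:i + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def mel_spectrogram(wave, sr, n_fft=1024, hop=320, n_mels=64):
    """Log-power Mel spectrogram with shape (n_mels, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (n_frames, n_fft // 2 + 1)
    mel = mel_filterbank(sr, n_fft, n_mels) @ power.T  # (n_mels, n_frames)
    return 10.0 * np.log10(mel + 1e-10)                # decibel-like scale


sr = 32000                                  # assumed sample rate
t = np.arange(5 * sr) / sr                  # one 5-second segment
wave = np.sin(2.0 * np.pi * 3000.0 * t)     # synthetic 3 kHz tone instead of a bird
spec = mel_spectrogram(wave, sr)
print(spec.shape)
```

The resulting 2-D array is what gets passed to the CNN as an "image"; the synthetic tone shows up as one bright horizontal band.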

Uniqueness of this competition

• train_short_audio has only weak labels. (weak label problem)
  - Labels are assigned to the entire audio file, which is several tens of seconds long.
  - It is impossible to know which bird is singing at the level of 5-second segments.
• Labels are missing in some parts of train_short_audio. (noisy label problem)
  - In particular, some of the secondary_labels (※) are clearly missing.
• Need to incorporate metadata such as recording date and location in some way. (metadata incorporation)
• Information on whether birds are singing in the segments before or after the prediction target may also be meaningful. (incorporating before/after segment information)
• Significant difference in nocall rate between train_soundscapes and test_soundscapes. (difficulty in establishing a CV strategy)
※ There are two types of labels for train_short_audio: primary_label and secondary_labels.

Top solutions

top solutions and approaches
The above discussion contains links to the top solutions.

1st place (ours!)

[1st Place] Quick Solution
[1st Place] Detailed Solution

tl;dr
1st stage: Create a binary nocall detector using external data (freefield1010). (1: some bird is singing / 0: nocall)
2nd stage: Create a 397-dimensional multi-label classifier, after using the nocall detector to reduce the weight of the nocall parts of train_short_audio.
3rd stage: Recast the problem as a new tabular task built from the results of the nocall detector, the metadata, and the results of the 2nd stage.
By creating this tabular task, we dealt with the weak label problem, noisy label problem, metadata incorporation, and incorporation of pre/post segment information simultaneously!
※ This is a summary of the Inference Part only, and the 1st stage is omitted.

Why does the table solve the weak label problem & noisy label problem?
• The target variable for the 3rd stage (0: incorrect / 1: correct) is determined in the following way.
• The weak label problem is solved by combining the segment-wise predictions of the multi-label classifier and the nocall detector with the primary and secondary labels assigned to the entire audio file of several tens of seconds.
• If secondary labels are missing, label 0 will be assigned, but the number of samples for label 0 is relatively large, so the noise will be buried nicely. (alleviating the noisy label problem)
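A heavily simplified sketch of how such a table could be assembled, to make the idea concrete. This is not our actual code: taking the top-k species per segment as candidates, the column names, and the function name `build_candidate_rows` are all assumptions for illustration. Each row is one (segment, candidate species) pair, and the target is 1 when the candidate appears in the file-level labels.

```python
import numpy as np


def build_candidate_rows(seg_probs, call_probs, file_labels, species, top_k=3):
    """Turn segment-wise predictions into rows for a tabular model.

    seg_probs:   (n_segments, n_species) outputs of the multi-label classifier
    call_probs:  (n_segments,) outputs of the nocall detector
    file_labels: set of primary + secondary labels of the whole file
    """
    rows = []
    n_segments = seg_probs.shape[0]
    for s in range(n_segments):
        top = np.argsort(seg_probs[s])[::-1][:top_k]  # most probable species first
        for rank, j in enumerate(top):
            prev_p = seg_probs[s - 1, j] if s > 0 else 0.0
            next_p = seg_probs[s + 1, j] if s < n_segments - 1 else 0.0
            rows.append({
                "segment": s,
                "candidate": species[j],
                "rank": rank,
                "prob": float(seg_probs[s, j]),
                "prev_prob": float(prev_p),   # before/after segment information
                "next_prob": float(next_p),
                "call_prob": float(call_probs[s]),
                "target": int(species[j] in file_labels),  # 1: correct / 0: incorrect
            })
    return rows


species = ["amerob", "bkcchi", "norcar"]          # illustrative species codes
seg_probs = np.array([[0.9, 0.1, 0.2],
                      [0.3, 0.6, 0.1]])
call_probs = np.array([0.95, 0.40])
rows = build_candidate_rows(seg_probs, call_probs, {"amerob"}, species, top_k=2)
```

Metadata columns (site, month, and so on) would be appended to each row in the same way, and any tabular model can then be trained on the result.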

2nd place

2nd place solution

(Cited from 2nd place solution)

• Extract 30-second segments from train_short_audio, then split them into 5-second segments and apply mixup. (dealing with the weak label problem)
• Exclude 3 audio files from train_soundscapes that have no birdsong for the whole 10 minutes & bootstrap sampling. (robust CV strategy)
• Weighting using the rating column in the metadata / Label smoothing (dealing with the noisy label problem)
• Tips for threshold selection
  - The LB has a lower nocall rate than CV, so lower the threshold to predict more birds.
  - A single fixed probability value makes little sense as a threshold because the distribution of probability values differs between models, so use a percentile-based threshold.
• Other tips (post-processing)
  - Modify individual probability values based on the average predicted probability of each bird.
  - Use information from the segments before and after.
  - Take into account the results of the nocall detector.
  - Remove predictions for unlikely bird species based on time and location information. (metadata incorporation)
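The percentile idea can be sketched as follows. The function name, the synthetic score distributions, and the 10% figure are assumptions for illustration, not values from the 2nd place solution.

```python
import numpy as np


def percentile_threshold(probs, keep_percent):
    """Threshold above which roughly keep_percent % of all scores fall."""
    return float(np.percentile(probs, 100.0 - keep_percent))


# Two synthetic "models" that rank the same items identically
# but output differently distributed probabilities.
model_a = np.linspace(0.0, 1.0, 101)
model_b = model_a ** 3            # same ranking, much lower values

thr_a = percentile_threshold(model_a, keep_percent=10)
thr_b = percentile_threshold(model_b, keep_percent=10)
n_a = int((model_a > thr_a).sum())
n_b = int((model_b > thr_b).sum())
print(thr_a, thr_b, n_a, n_b)
```

A fixed threshold would keep very different numbers of predictions for the two models, while the percentile-based thresholds differ in value but keep the same share of predictions.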

4th place

4th place solution

• An SED model is used, and the input is 10-30 seconds of audio. (dealing with the weak label problem)
  (cf. Introduction to Sound Event Detection by @hidehisaarai1213)
• Use mixup.
• Pseudo labeling (dealing with the noisy label problem)
• The final output is a combination of the SED output for the 5-second segment and for the 30-second segment around the 5-second segment. (incorporating before/after segment information)
  - A small threshold for the 5-second segment and a large threshold for the 30-second segment
• As in the 2nd place solution, bird species judged unlikely to be observed based on time and location information are removed. (metadata incorporation)
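Mixup appears in several of the top solutions, so here is a minimal sketch of the basic form: blend two inputs and their label vectors with the same coefficient. The function name and the fixed `lam` are assumptions for illustration; in training `lam` is usually drawn from a Beta distribution, and the slides do not say which variant each team used.

```python
import numpy as np


def mixup(x1, y1, x2, y2, lam):
    """Blend two examples and their multi-hot labels with weight lam."""
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y


x1, y1 = np.ones(4), np.array([1.0, 0.0, 0.0])   # clip containing species 0
x2, y2 = np.zeros(4), np.array([0.0, 1.0, 0.0])  # clip containing species 1
x_mix, y_mix = mixup(x1, y1, x2, y2, lam=0.7)
print(x_mix, y_mix)
```

For audio, some pipelines instead take the element-wise maximum of the two label vectors, since both birds remain audible in the blended clip.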

5th place

5th place solution

• He was the 2nd place winner in the 2020 birdcall competition, and he used that solution as the basis for this competition.
• Improvements from last time:
  +1% by changing to SED (※1) / +1% by adjusting the threshold / +1% by improving the ensemble method
  (+ He also narrowed down the predicted labels based on regional information. (※2))
• Augmentation is characteristic:
  raising the image to a power of 0.5-3 / playing N times faster / adding sounds such as rain or conversation / adding noise / adjusting frequency with a probability of 0.5
  (1st-4th place seem to use only mixup and/or noise addition.)
• Primary labels are given label 1, secondary labels are given label 0.3.
• Birds observed in one segment are adjusted to be more easily picked up across the whole 10-minute audio file. (※3)
※1: dealing with the weak label problem
※2: metadata incorporation
※3: incorporating before/after segment information
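The soft-label scheme (1 for the primary label, 0.3 for secondary labels) can be sketched like this. The function name and the three-species list are assumptions for illustration; the real target vector would have 397 entries.

```python
import numpy as np


def soft_label_vector(primary, secondary, species, secondary_weight=0.3):
    """Target vector with 1.0 for the primary label and 0.3 for secondary labels."""
    index = {name: i for i, name in enumerate(species)}
    y = np.zeros(len(species), dtype=np.float32)
    for name in secondary:
        if name in index:              # ignore labels outside the species list
            y[index[name]] = secondary_weight
    y[index[primary]] = 1.0            # primary always wins over secondary
    return y


species = ["amerob", "bkcchi", "norcar"]   # illustrative species codes
y = soft_label_vector("amerob", ["norcar", "not_in_list"], species)
print(y)
```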

8th place

8th place writeup

• 6th place winner in the 2020 birdcall competition, again using SED. (dealing with the weak label problem)
• Use 5- or 20-second segments for training and 40-second segments for inference; the longer the better.
  In addition, inference is done with overlap, e.g. 0-40 seconds followed by 20-60 seconds. (incorporating before/after segment information)
• Augmentation: Gaussian noise, pink noise, volume adjustment, pitch shift
  (Mixup also worked well, but could not be included in the final submission due to computational resource issues.)
• The loss function is characteristic. (BCEFocal2WayLoss)
• Primary labels and secondary labels are treated the same way.
• Pseudo labeling (dealing with the noisy label problem)
• There are two thresholds, a call threshold and a nocall threshold. Bird species that exceed the call threshold are considered positive, and segments where no bird species exceeds the nocall threshold are also given "nocall." (Bird labels and "nocall" can coexist.)
• Exclude bird species that should not exist based on regional information, even if they are predicted. (metadata incorporation)
• Calculate the F1 score for bird call rows and nocall rows separately and derive CV as 0.54 * nocall_f1 + 0.46 * call_f1. (robust CV strategy)
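The weighted CV score can be sketched as below. The per-row F1 helper follows one common reading of the competition metric, and the function names and toy rows are assumptions for illustration. The weights presumably mirror the estimated nocall share of the leaderboard, although the slide does not state this explicitly.

```python
def row_f1(true_labels, pred_labels):
    """F1 between two space-separated label strings for a single row."""
    t, p = set(true_labels.split()), set(pred_labels.split())
    if not t and not p:
        return 0.0  # guard only; real rows always carry at least "nocall"
    return 2.0 * len(t & p) / (len(t) + len(p))


def weighted_cv(y_true, y_pred, nocall_weight=0.54):
    """0.54 * nocall_f1 + 0.46 * call_f1, each averaged over its own rows."""
    nocall_scores, call_scores = [], []
    for t, p in zip(y_true, y_pred):
        bucket = nocall_scores if t == "nocall" else call_scores
        bucket.append(row_f1(t, p))
    nocall_f1 = sum(nocall_scores) / len(nocall_scores)
    call_f1 = sum(call_scores) / len(call_scores)
    return nocall_weight * nocall_f1 + (1.0 - nocall_weight) * call_f1


# Toy validation set dominated by "nocall" rows, like train_soundscapes.
y_true = ["nocall", "nocall", "nocall", "amerob"]
y_pred = ["nocall", "nocall", "amerob", "amerob"]
cv = weighted_cv(y_true, y_pred)
print(cv)
```

Because the two groups are averaged separately, the score no longer depends on the nocall rate of the validation files themselves.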

9th place

9th Place solution

• 5-7 second segments for training input
• The weights of secondary labels are reduced.
• Use mixup.
• As much diversity as possible.
  - Melspectrograms with different temporal resolutions, hop_lengths of 200 and 320
  - Various backbones
  - Augmentation: white noise, pink noise, band noise, nocall clips, raising melspectrograms to the power of N
• Post-processing
  - Post-process using the maximum or average probability of each bird singing in the entire 10 minutes of audio data. (incorporating before/after segment information)
  - Evaluate how likely each bird is to sing based on regional information and post-process using the result. (metadata incorporation)
  - Post-process using the maximum probability of each bird singing in a day. (metadata incorporation)
• Squeeze the width of test soundscapes by 2-5% (mostly to reverse far field effects)
• He also set two thresholds (for call & nocall) and allowed bird labels and "nocall" to coexist.
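The two-threshold decoding can be sketched as follows. The threshold values, the function name, and the fallback to "nocall" when nothing else is predicted are assumptions for illustration, not the settings of the 8th or 9th place solutions.

```python
import numpy as np


def decode_segment(probs, species, call_thr=0.5, nocall_thr=0.7):
    """Turn per-species probabilities into a submission string.

    Species above call_thr are predicted. "nocall" is added when no species
    exceeds nocall_thr, so bird labels and "nocall" can coexist.
    """
    labels = [name for name, p in zip(species, probs) if p > call_thr]
    if float(np.max(probs)) <= nocall_thr or not labels:
        labels.append("nocall")  # the "not labels" fallback is an assumption
    return " ".join(labels)


species = ["amerob", "bkcchi", "norcar"]   # illustrative species codes
confident = decode_segment(np.array([0.9, 0.1, 0.2]), species)
uncertain = decode_segment(np.array([0.6, 0.1, 0.2]), species)
silent = decode_segment(np.array([0.1, 0.1, 0.2]), species)
print(confident, "|", uncertain, "|", silent)
```

With a per-row F1, hedging an uncertain segment with both a bird label and "nocall" earns partial credit whichever of the two is correct.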

11th place

My journey (11th solution)

• CPMP, who had dominated the Public LB for a long time.
• He took 18th place in the 2020 birdcall competition and 11th place in the Rainforest competition, and his solution is based on a mixture of both.
  - The 2020 birdcall competition solution: 18th place solution: efficientnet b3
  - Rainforest competition solution: 11th place, The 0.931 Magic Explained: Image Classification
• CV is calculated by 0.54 * nocall_f1 + 0.46 * call_f1 as in the 8th place solution. (robust CV strategy)
• He said his main mistake was not submitting the ensemble version until a few days before the end of the competition, because he assumed that the ensemble would surely increase the score. (In fact, it did not work for his pipeline.)

That's it, thank you!
