Slide 1

Slide 1 text

BirdCLEF2021 Summary Competition overview and top solutions Japanese version also available (searching for a good icon...) start 
 (@startjapan) (You can jump to each link from the Speaker Deck description section.)

Slide 2

Slide 2 text

Introduction

Slide 3

Slide 3 text

Medical student in Hiroshima. 
 While studying for the national examination, 
 I am interning at MNES, a medical venture company, and studying at LAIME, a machine learning study group for students sponsored by MNES. 
 I won this competition and became a Kaggle Master. 
 (Thanks to my teammates, though...) 
 Also, I'm looking for a good icon. 
 Kaggle: @startjapan 
 Twitter: @startjapanml

Slide 4

Slide 4 text

Competition Overview

Slide 5

Slide 5 text

Competition Overview 
• A competition to identify which bird species are singing in 5-second audio segments. 
 (The same organizer held a similar competition in 2020. 
 → We will call it "the 2020 birdcall competition" here.) 
• The training data is audio taken from xeno-canto, a birdcall sharing site. (train_short_audio) 
• The test data is 80 audio files of 10 minutes each, divided into 5-second segments for prediction. (test_soundscapes) 
• In addition to the above, 20 audio files (10 minutes x 20) are given for validation. (train_soundscapes)

Slide 6

Slide 6 text

[test_soundscapes] 
 Test data. The test data is 80 10-minute soundscapes, 
 which cannot be accessed without submission. 
 The data is recorded in 4 locations. 
 [train_short_audio] 
 Training data. The audio is organized by bird species. 
 62874 audio files in total. 
 [train_soundscapes] 
 The acoustic domain is similar to test_soundscapes. 
 There are 20 soundscapes of 10 minutes each, which were recorded at two of the four locations 
 where test_soundscapes was recorded. 
 [train_metadata.csv] 
 Metadata for train_short_audio. The shape is (62874, 14). 
 [train_soundscape_labels.csv / test.csv] 
 Provide a framework for dividing a 10-minute file 
 into 5-second segments. 
 train_soundscape_labels.csv corresponds to train_soundscapes, test.csv corresponds to test_soundscapes, and only the former has the answer label.
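As a rough illustration of the 5-second framework, the sketch below enumerates one row per segment of a 10-minute file. The row id layout (`<audio_id>_<site>_<end_second>`) and the example values `7019` and `COR` are assumptions made for illustration; the slide does not spell out the exact format.

```python
def segment_row_ids(audio_id, site, duration_sec=600, segment_sec=5):
    """Return one row id per segment, keyed by the segment's end time in seconds.

    The id layout "<audio_id>_<site>_<end_second>" is an assumed format.
    """
    end_times = range(segment_sec, duration_sec + 1, segment_sec)
    return [f"{audio_id}_{site}_{end}" for end in end_times]


# Hypothetical file id and site code, used only to show the shape of the output.
rows = segment_row_ids(7019, "COR")
print(len(rows), rows[0], rows[-1])  # 120 7019_COR_5 7019_COR_600
```

A 10-minute file therefore yields 120 rows to predict.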

Slide 7

Slide 7 text

Submission & Evaluation • Multiple bird species can be submitted as predictions for a single segment. • For segments where no birds are singing, the string "nocall" should be submitted. • Submissions will be evaluated based on their row-wise micro averaged F1 score.
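To make the metric concrete, here is a minimal sketch of one reading of a row-wise averaged F1: compute F1 for each row from its set of space-separated labels, then average over rows. This is an illustrative assumption, not the organizers' implementation, and the species codes are example values.

```python
def row_f1(true_labels, pred_labels):
    """F1 for a single row; labels are space-separated strings such as 'amerob bkcchi'."""
    true_set = set(true_labels.split())
    pred_set = set(pred_labels.split())
    true_positives = len(true_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(true_set)
    return 2 * precision * recall / (precision + recall)


def mean_row_f1(y_true, y_pred):
    """Average of the per-row F1 scores."""
    scores = [row_f1(t, p) for t, p in zip(y_true, y_pred)]
    return sum(scores) / len(scores)


# Row 1 is an exact match; row 2 finds one of two birds (precision 1, recall 0.5).
score = mean_row_f1(["nocall", "amerob bkcchi"], ["nocall", "amerob"])
```

Under this reading, "nocall" is treated as just another label, so a correctly predicted silent segment scores 1.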

Slide 8

Slide 8 text

Differences from the 2020 birdcall competition

Slide 9

Slide 9 text

Differences from the 2020 birdcall competition • Existence of train_soundscapes 
 - There is a large difference in the acoustic domain between train_short_audio and the test data. 
 - In this competition, train_soundscapes with an acoustic domain closer to the test data are given. 
 - They were mostly used for validation purposes, but some people devised ways to use them for training. • We had access to the location information in the test data. 
 - It was guaranteed that each file name in the test data contained location info (and date info). 
 - So we needed to incorporate this information into the pipeline in some way. (cf. Starter and some thoughts by @hidehisaarai1213)
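One simple way to get this information into a pipeline is to parse it out of the file name. The sketch below assumes names of the form `<audio_id>_<site>_<yyyymmdd>.ogg`; that layout, the function name, and the example file name are assumptions for illustration only.

```python
from datetime import datetime
from pathlib import Path


def parse_soundscape_name(filename):
    """Split an assumed '<audio_id>_<site>_<yyyymmdd>.ogg' name into metadata fields."""
    stem = Path(filename).stem
    audio_id, site, yyyymmdd = stem.split("_")
    recorded = datetime.strptime(yyyymmdd, "%Y%m%d").date()
    return {
        "audio_id": audio_id,
        "site": site,
        "date": recorded,
        "month": recorded.month,
    }


# Hypothetical file name, used only as an example.
info = parse_soundscape_name("7019_COR_20190904.ogg")
```

The resulting `site` and `month` fields can then be used as features or as filters in post-processing.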

Slide 10

Slide 10 text

EDA (train_short_audio ver.)

Slide 11

Slide 11 text

Time length of audio per file (train_short_audio) ※ Random sampling of 1000 ( / 62874) audio files from train_short_audio ※ Horizontal axis : Time length of audio per file [sec] ※ Vertical axis : Frequency (1000 files in total)

Slide 12

Slide 12 text

How many audio files are there for one species of bird? (train_short_audio) ※ Horizontal Axis : Number of audio files for each bird (in train_short_audio) ※ Vertical axis : Frequency (397 species in total)

Slide 13

Slide 13 text

There are some secondary labels that are missing. (train_short_audio) (Cited from BirdCLEF2021: Exploring the data)

Slide 14

Slide 14 text

EDA (soundscapes ver.)

Slide 15

Slide 15 text

We can learn the nocall rate of the Public LB by submitting "nocall" for all rows. 
[Figure panels: BirdCLEF2021 and, for comparison, the 2020 birdcall competition; each panel labeled Public and Private]
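Why an all-"nocall" submission reveals the nocall rate: assuming the score is a per-row F1 averaged over rows, a "nocall" prediction scores 1 on rows whose truth is "nocall" and 0 on every other row, so the average equals the fraction of nocall rows. A small sketch with made-up labels:

```python
def all_nocall_score(true_rows):
    """Mean per-row score when every prediction is the single label 'nocall'.

    Assumes a row scores 1.0 on an exact 'nocall' match and 0.0 when the
    truth contains only bird labels (no overlap with the prediction).
    """
    scores = [1.0 if row == "nocall" else 0.0 for row in true_rows]
    return sum(scores) / len(scores)


# Made-up ground truth: 3 of 4 rows are nocall.
truth = ["nocall", "nocall", "amerob", "nocall"]
print(all_nocall_score(truth))  # 0.75
```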

Slide 16

Slide 16 text

On the other hand, train_soundscapes has a higher nocall rate.

Slide 17

Slide 17 text

Distribution of the target variable in train_soundscapes (nocall included) • The overwhelming majority of segments are "nocall." • Some 5-second segments contain two or more species singing.

Slide 18

Slide 18 text

Distribution of the target variable in train_soundscapes (nocall removed) • Some combinations of bird species are frequently observed.

Slide 19

Slide 19 text

Number of birds singing in the same 5-second segment in train_soundscapes.

Slide 20

Slide 20 text

Basic approach for audio recognition

Slide 21

Slide 21 text

Basic approach for audio recognition 
 Audio data can be converted into an image (called a spectrogram) in which the horizontal axis is time, the vertical axis is frequency, and each pixel represents the intensity of the signal component. Applying a CNN to this image lets us treat the task as conventional image recognition. 
 ※ In this competition, the Mel spectrogram, which uses the Mel scale for the vertical axis (frequency), was mainly used. 
 ※ What is the Mel scale?: A frequency scale on which equal differences correspond to equal differences in pitch as perceived by the human ear. 
 (Images are cited from BirdCLEF2021: Processing audio data)
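To give a feel for the Mel scale, the sketch below uses one common definition (the HTK-style formula mel = 2595 * log10(1 + f / 700)). Audio libraries may use other variants, so treat the exact constants as an assumption rather than what any particular solution used.

```python
import math


def hz_to_mel(freq_hz):
    """Convert frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min, f_max, n_bands):
    """Band edges equally spaced on the Mel scale, returned in Hz."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / n_bands
    return [mel_to_hz(m_min + i * step) for i in range(n_bands + 1)]


edges = mel_band_edges(0.0, 16000.0, 8)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
```

Bands that are equally wide in mels become wider in Hz as frequency rises, which is why a Mel spectrogram devotes more vertical resolution to low frequencies.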

Slide 22

Slide 22 text

Uniqueness of this competition

Slide 23

Slide 23 text

Uniqueness of this competition 
• train_short_audio has only weak labels. (weak label problem) 
 - Labels are assigned to an entire audio clip of several tens of seconds. 
 - It is impossible to know which bird is singing at the level of 5-second segments. 
• Labels are missing in some parts of train_short_audio. (noisy label problem) 
 - In particular, some secondary_labels (※) are clearly missing. 
• Metadata such as recording date and location needs to be incorporated in some way. (metadata incorporation) 
• Whether birds are singing in the segments before or after the prediction target may also be informative. (incorporating before/after segment information) 
• The nocall rate differs significantly between train_soundscapes and test_soundscapes. (difficulty in establishing CV strategy) 
※ There are two types of labels for train_short_audio: primary_label and secondary_labels.

Slide 24

Slide 24 text

Top solutions 
 top solutions and approaches 
 The above discussion contains links to the top solutions.

Slide 25

Slide 25 text

1st place (ours!) [1st Place] Quick Solution [1st Place] Detailed Solution

Slide 26

Slide 26 text

tl;dr 
 1st stage : 
 Create a binary nocall detector using external data (freefield1010). (1 : some bird is singing / 0 : nocall) 
 2nd stage : 
 Create a 397-dimensional multi-label classifier, using the nocall detector to reduce the weight of the nocall parts of train_short_audio. 
 3rd stage : 
 Recast the task as a tabular problem (as in a table competition) built from the results of the nocall detector, the metadata, and the results of the 2nd stage. 
 By recasting it as a tabular problem, we dealt with the weak label problem, the noisy label problem, metadata incorporation, 
 and the incorporation of before/after segment information simultaneously! 
 ※ This is a summary of the Inference Part only, and the 1st stage is omitted.

Slide 27

Slide 27 text

Why does the table solve the weak label problem & noisy label problem? 
• The target variable for the 3rd stage (0: incorrect / 1: correct) is determined in the following way. 
• The weak label problem is addressed by combining the outputs of the multi-label classifier and the nocall detector, both of which make segment-wise predictions, with the primary and secondary labels assigned to the entire audio clip of several tens of seconds. 
• If secondary labels are missing, label 0 is assigned, but because the number of label-0 samples is relatively large, this noise is diluted. (alleviating the noisy label problem)
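The sketch below shows one way such a table could be built for a single 5-second segment. It is only an illustration under stated assumptions: the candidate selection (top-k species), the feature names, and the target rule (1 if the candidate appears among the clip's primary or secondary labels, else 0) are hypothetical, and the exact rule used in the solution is not reproduced here.

```python
def build_candidate_rows(segment_probs, call_prob, clip_labels, top_k=2):
    """Turn one segment's predictions into rows of a tabular dataset.

    segment_probs: dict mapping species -> 2nd-stage probability for this segment.
    call_prob:     nocall detector output for this segment (1 = some bird singing).
    clip_labels:   set of primary and secondary labels of the whole clip.
    """
    ranked = sorted(segment_probs.items(), key=lambda item: item[1], reverse=True)
    rows = []
    for rank, (species, prob) in enumerate(ranked[:top_k]):
        rows.append({
            "species": species,
            "stage2_prob": prob,
            "rank": rank,
            "call_prob": call_prob,
            # Assumed target rule: a candidate missing from the clip labels gets 0.
            "target": int(species in clip_labels),
        })
    return rows


# Made-up probabilities and labels for illustration.
rows = build_candidate_rows(
    {"amerob": 0.7, "bkcchi": 0.2, "norcar": 0.1}, call_prob=0.9, clip_labels={"amerob"}
)
```

Each row can then be extended with metadata and with features from neighboring segments before being fed to a binary classifier.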

Slide 28

Slide 28 text

2nd place 2nd place solution

Slide 29

Slide 29 text

(Cited from 2nd place solution) 2nd place

Slide 30

Slide 30 text

2nd place 
• Extract 30-second segments from train_short_audio, then split them into 5-second segments and apply mixup. 
 (dealing with weak label problem) 
• Exclude 3 audio files from train_soundscapes that contain no birdsong for the full 10 minutes, and use bootstrap sampling. 
 (robust CV strategy) 
• Weighting using the rating column in the metadata / Label smoothing 
 (dealing with noisy label problem) 
• Tips for threshold selection 
 - LB has a lower nocall rate than CV, so lower the threshold to predict more birds. 
 - A single probability value makes little sense as a threshold 
 because the distribution of probability values differs between models, so use a percentile-based threshold. 
• Other tips (post-processing) 
 - Modify individual probability values based on the average predicted probability of each bird. 
 - Use anterior-posterior segment information. 
 - Take into account the results of the nocall detector. 
 - Remove predictions for unlikely bird species based on time and location information. (metadata incorporation)
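The percentile-based threshold idea can be sketched as follows: instead of fixing a probability value, fix the fraction of scores to treat as positive and read the threshold off each model's own score distribution. The function name and the `keep_fraction` parameter are assumptions for illustration, not the 2nd place team's code.

```python
def percentile_threshold(probs, keep_fraction):
    """Return the score such that roughly `keep_fraction` of `probs` are >= it."""
    ordered = sorted(probs, reverse=True)
    n_keep = max(1, round(len(ordered) * keep_fraction))
    return ordered[n_keep - 1]


# Two made-up models whose outputs live on very different scales.
model_a = [0.9, 0.8, 0.3, 0.2, 0.1]
model_b = [0.09, 0.08, 0.03, 0.02, 0.01]

thr_a = percentile_threshold(model_a, keep_fraction=0.4)
thr_b = percentile_threshold(model_b, keep_fraction=0.4)
positives_a = [p >= thr_a for p in model_a]
positives_b = [p >= thr_b for p in model_b]
```

Both models end up flagging the same number of predictions, even though a single fixed value such as 0.5 would flag two for the first model and none for the second.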

Slide 31

Slide 31 text

4th place 4th place solution

Slide 32

Slide 32 text

4th place 
• An SED model is used, and the input is 10-30 seconds of audio. (dealing with weak label problem) 
 (cf. Introduction to Sound Event Detection by @hidehisaarai1213) 
• Use mixup. 
• Pseudo labeling (dealing with noisy label problem) 
• The final output is a combination of the SED output for the 5-second segment 
 and the 30-second segment around the 5-second segment. 
 (incorporating anterior-posterior segment information) 
 - A small threshold for the 5-second segment and a large threshold for the 30-second segment 
• As in the 2nd place solution, bird species judged unlikely to be observed based on time and location information are removed. (metadata incorporation)

Slide 33

Slide 33 text

5th place 5th place solution

Slide 34

Slide 34 text

5th place 
• He was the 2nd place winner in the 2020 birdcall competition and used that solution as the basis for this one. 
• Improvements over last time: 
 +1% by changing to SED (※1) / +1% by adjusting the threshold / +1% by improving the ensemble method 
 (+ He also narrowed down the predicted labels based on regional information. (※2)) 
• The augmentation is distinctive: 
 raise the image to a power of 0.5-3 / speed up N times / add sounds such as rain or conversation / 
 add noise / adjust frequency with probability 0.5 
 (1st-4th place seem to use only mixup and/or noise addition.) 
• Primary labels are given label 1, and secondary labels are given label 0.3. 
• Birds detected in one segment are made easier to pick up throughout the whole 10-minute audio file. (※3) 
※1 : dealing with weak label problem 
※2 : metadata incorporation 
※3 : incorporating anterior-posterior segment information
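The primary/secondary labeling scheme above (1 for the primary label, 0.3 for secondary labels) can be sketched as a soft target vector. The function name and the short species list are assumptions for illustration; the real target has one entry per competition species.

```python
def soft_target(primary_label, secondary_labels, species_list, secondary_weight=0.3):
    """Build a soft multi-label target: 1.0 for primary, `secondary_weight` for secondary."""
    index = {species: i for i, species in enumerate(species_list)}
    target = [0.0] * len(species_list)
    for species in secondary_labels:
        # Secondary labels outside the species list are skipped.
        if species in index:
            target[index[species]] = secondary_weight
    # Written last so the primary label wins if it is also listed as secondary.
    target[index[primary_label]] = 1.0
    return target


# Toy three-species list for illustration.
target = soft_target("amerob", ["bkcchi", "unknown"], ["amerob", "bkcchi", "norcar"])
print(target)  # [1.0, 0.3, 0.0]
```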

Slide 35

Slide 35 text

8th place 8th place writeup

Slide 36

Slide 36 text

8th place 
• 6th place winner in the 2020 birdcall competition, again using SED. (dealing with the weak label problem) 
• Use 5- or 20-second segments for training and 40-second segments for inference; the longer the better. 
 In addition, inference is done with overlap, e.g. 0-40 seconds followed by 20-60 seconds. 
 (incorporating anterior-posterior segment information) 
• Augmentation: Gaussian noise, pink noise, volume adjustment, pitch shift 
 (Mixup also worked well, but could not be included in the final submission due to computational resource issues.) 
• The loss function is distinctive. (BCEFocal2WayLoss) 
• Primary labels and secondary labels are treated the same way. 
• Pseudo labeling (dealing with noisy label problem) 
• There are two thresholds, a call threshold and a nocall threshold. Bird species that exceed the call threshold are considered positive, and segments in which no bird species exceeds the nocall threshold are additionally given "nocall." 
 (Bird labels and "nocall" can coexist.) 
• Exclude bird species that should not exist based on regional information, even if they are predicted. 
 (metadata incorporation) 
• Calculate the F1 score for bird call rows and nocall rows separately, and derive CV as 0.54 * nocall_f1 + 0.46 * call_f1. 
 (robust CV strategy)
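A minimal sketch of the two-threshold decoding and the weighted CV score described above. The threshold values and function names are assumptions for illustration; only the 0.54 / 0.46 weights come from the slide.

```python
def decode_segment(probs, call_thr=0.3, nocall_thr=0.5):
    """Decode one segment with separate call and nocall thresholds.

    probs: dict mapping species -> predicted probability.
    Species above `call_thr` are predicted; 'nocall' is added when no species
    exceeds `nocall_thr`, so bird labels and 'nocall' can coexist.
    """
    labels = sorted(species for species, p in probs.items() if p > call_thr)
    if not any(p > nocall_thr for p in probs.values()):
        labels.append("nocall")
    return " ".join(labels)


def weighted_cv(nocall_f1, call_f1):
    """Combine the two F1 scores with the weights quoted on the slide."""
    return 0.54 * nocall_f1 + 0.46 * call_f1


print(decode_segment({"amerob": 0.4, "bkcchi": 0.1}))  # amerob nocall
print(decode_segment({"amerob": 0.8}))                 # amerob
print(decode_segment({"amerob": 0.1}))                 # nocall
```

With the nocall threshold set above the call threshold, a moderately confident prediction hedges by emitting both the bird and "nocall".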

Slide 37

Slide 37 text

9th place 9th Place solution

Slide 38

Slide 38 text

9th place 
• 5-7 second segments for training input 
• The weights of secondary labels are reduced. 
• Use mixup. 
• As much diversity as possible. 
 - Melspectrograms with different temporal resolutions (hop_lengths of 200 and 320) 
 - Various backbones 
 - Augmentation: white noise, pink noise, band noise, nocall clips, raising melspectrograms to the power of N 
• Post-processing 
 - Post-process using the maximum or average probability of each bird singing in the entire 10 minutes of audio data. 
 (incorporating anterior-posterior segment information) 
 - Evaluate how likely each bird is to sing based on regional information and post-process using the result. 
 (metadata incorporation) 
 - Post-process using the maximum probability of each bird singing in a day. 
 (metadata incorporation) 
• Squeeze the width of test soundscapes by 2-5% (mostly to reverse far field effects) 
• He also set two thresholds (for call & nocall) and allowed bird labels and "nocall" to coexist.
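The file-level post-processing can be sketched as blending each segment's probability with the species' maximum over the whole file. The linear blend and the `weight` parameter are assumptions made for illustration; the slide does not give the actual formula.

```python
def blend_with_file_max(segment_probs, weight=0.3):
    """Blend per-segment probabilities with each species' maximum over the file.

    segment_probs: list of dicts (one per 5-second segment), species -> probability.
    The linear blend is an assumed form, not the 9th place team's formula.
    """
    species = set().union(*segment_probs)
    file_max = {s: max(seg.get(s, 0.0) for seg in segment_probs) for s in species}
    return [
        {s: (1.0 - weight) * seg.get(s, 0.0) + weight * file_max[s] for s in species}
        for seg in segment_probs
    ]


# Made-up example: a confident detection in one segment lifts a weak one elsewhere.
segments = [{"amerob": 0.9}, {"amerob": 0.2}]
blended = blend_with_file_max(segments, weight=0.5)
```

A bird heard clearly once in a recording becomes easier to pick up in the other segments of the same file, which is the intuition behind using file-level statistics.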

Slide 39

Slide 39 text

11th place My journey (11th solution)

Slide 40

Slide 40 text

11th place • CPMP, who had dominated the Public LB for a long time. • He took 18th place in the 2020 birdcall competition and 11th place in the Rainforest competition, 
 and his solution is based on a mixture of both. 
 - The 2020 birdcall competition solution : 18th place solution: efficientnet b3 
 - Rainforest competition solution : 11th place, The 0.931 Magic Explained: Image Classification • CV is calculated by 0.54 * nocall_f1 + 0.46 * call_f1 as in the 8th place solution. (robust CV strategy) • He said it was his main mistake that he did not submit the ensemble version until a few days before the end of the competition, because he assumed that the ensemble would surely increase the score. 
 (In fact, it did not work for his pipeline.)

Slide 41

Slide 41 text

That's it, thank you!