
BirdCLEF2021 Summary


June 13, 2021



  1. BirdCLEF2021 Summary: Competition overview and top solutions
    Japanese version also available
    (searching for a good icon...) start (@startjapan)
    (You can jump to each link from the Speaker Deck description section.)
  2. Medical student in Hiroshima. 
 While studying for the national

 I am interning at MNES, a medical venture company, and studying at LAIME, a machine learning study group for students sponsored by MNES. 
 I won this competition and became a Kaggle Master. 
 (Thanks to my teammates, though...) 
 Also, I'm looking for a good icon. 
 Kaggle: @startjapan 
 Twitter: @startjapanml 
 Introduction
  3. Competition Overview
    • A competition to identify the bird species singing in 5-second audio segments.
      (The same organizer held a similar competition in 2020. → We will call it "the 2020 birdcall competition" here.)
    • The train data is audio taken from a birdcall sharing site called xeno-canto. (train_short_audio)
    • The test data is 80 audio files of 10 minutes each, divided into 5-second segments for prediction. (test_soundscapes)
    • In addition to the above, 20 audio files (10 minutes x 20) are given for validation. (train_soundscapes)
  4. [test_soundscapes]
    Test data. The test data is 80 10-minute audio files, which cannot be accessed without submission.
    The data is recorded in 4 locations.
    [train_short_audio]
    Training data. The audio is organized by bird species.
    Total of 62874 audio files.
    [train_soundscapes]
    The acoustic domain is similar to test_soundscapes.
    There are 20 soundscapes of 10 minutes each, which were recorded at two of the four locations where test_soundscapes was recorded.
    [train_metadata.csv]
    Metadata for train_short_audio. The shape is (62874, 14).
    [train_soundscape_labels.csv / test.csv]
    Provide a framework for dividing a 10-minute file into 5-second segments.
    train_soundscape_labels.csv corresponds to train_soundscapes, test.csv corresponds to test_soundscapes, and only the former has the answer label.
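As a rough illustration of how one 10-minute soundscape maps onto 5-second prediction rows, the sketch below enumerates the segments of a single file. The function name, the example values "7019" and "COR", and the row_id layout (audio id, site, and segment end second joined by underscores) are assumptions made for illustration; the actual CSVs should be checked for the real format.

```python
def segment_rows(audio_id, site, duration_sec=600, segment_sec=5):
    """Enumerate (row_id, start_sec, end_sec) for each fixed-length segment."""
    rows = []
    for end in range(segment_sec, duration_sec + 1, segment_sec):
        # Assumed row_id layout: <audio_id>_<site>_<end second of the segment>
        rows.append((f"{audio_id}_{site}_{end}", end - segment_sec, end))
    return rows


rows = segment_rows("7019", "COR")
print(len(rows))   # number of 5-second segments in a 10-minute file
print(rows[0])     # first segment
print(rows[-1])    # last segment
```

A 600-second file split into 5-second pieces yields 120 rows, so the 80 test files correspond to 9600 predictions under these assumptions.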
  5. Submission & Evaluation
    • Multiple bird species can be submitted as predictions for a single segment.
    • For segments where no birds are singing, the string "nocall" should be submitted.
    • Submissions are evaluated by their row-wise micro-averaged F1 score.
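To make the metric concrete, here is a minimal sketch of one common reading of it: compute F1 per row from the sets of true and predicted labels, then average over rows. The official implementation may differ in details; the function names and the species codes in the example are illustrative assumptions.

```python
def row_f1(true_labels, pred_labels):
    """F1 for one row; labels are space-separated strings, "nocall" is an ordinary label."""
    true_set = set(true_labels.split())
    pred_set = set(pred_labels.split())
    tp = len(true_set & pred_set)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_set)
    recall = tp / len(true_set)
    return 2 * precision * recall / (precision + recall)


def mean_row_f1(true_rows, pred_rows):
    """Average of the per-row F1 scores (assumed reading of the metric)."""
    scores = [row_f1(t, p) for t, p in zip(true_rows, pred_rows)]
    return sum(scores) / len(scores)


# Illustrative labels only
true_rows = ["nocall", "amerob", "amerob bkcchi"]
pred_rows = ["nocall", "nocall", "amerob"]
print(mean_row_f1(true_rows, pred_rows))
```

Under this reading, a row predicted as "nocall" scores 1 only when the row is truly "nocall", which is why the nocall rate matters so much for threshold choices.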
  6. Differences from the 2020 birdcall competition
    • Existence of train_soundscapes
      - There is a large difference in the acoustic domain between train_short_audio and the test data.
      - In this competition, train_soundscapes with an acoustic domain closer to the test data are given.
      - They were mostly used for validation purposes, but some people devised ways to use them for training.
    • We had access to the location information in the test data.
      - It was guaranteed that each file name in the test data contained location info (and date info).
      - So we needed to incorporate this information into the pipeline in some way. (cf. Starter and some thoughts by @hidehisaarai1213)
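A small sketch of pulling location and date out of a soundscape file name. The layout assumed here (id, site code, and a YYYYMMDD date joined by underscores, as in the example name below) is an assumption, not something stated on the slide, and the function name is hypothetical.

```python
from datetime import date


def parse_soundscape_name(filename):
    """Split an assumed '<id>_<site>_<YYYYMMDD>.ogg' name into its parts."""
    stem = filename.rsplit(".", 1)[0]          # drop the extension
    audio_id, site, yyyymmdd = stem.split("_")
    recorded = date(int(yyyymmdd[:4]), int(yyyymmdd[4:6]), int(yyyymmdd[6:8]))
    return {"audio_id": audio_id, "site": site, "date": recorded}


info = parse_soundscape_name("10534_SSW_20170429.ogg")
print(info["site"], info["date"].month)
```

The site and month obtained this way are the kind of metadata the later solutions use to rule out species that are unlikely at a given place and time.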
  7. Time length of audio per file (train_short_audio)
    ※ Random sample of 1000 (out of 62874) audio files from train_short_audio
    ※ Horizontal axis: Time length of audio per file [sec]
    ※ Vertical axis: Frequency (1000 files in total)
  8. How many audio files are there for one species of bird? (train_short_audio)
    ※ Horizontal axis: Number of audio files for each bird (in train_short_audio)
    ※ Vertical axis: Frequency (397 species in total)
  9. We can know the nocall rate of the Public LB by submitting "nocall" for all rows.
    [Figure: Public / Private panels for BirdCLEF2021 and, for comparison, the 2020 birdcall competition]
  10. Distribution of the target variable in train_soundscapes (nocall included)
    • The overwhelming majority of them are "nocall."
    • Some 5-second segments contain two or more species singing.
  11. Distribution of the target variable in train_soundscapes (nocall removed)
    • Some combinations of bird species are frequently observed.
  12. Basic approach for audio recognition
    The audio data can be converted into an image (called a spectrogram) where the horizontal axis is time, the vertical axis is frequency, and each pixel represents the intensity of the signal component. By applying a CNN to it, the task can be treated as conventional image recognition.
    ※ In this competition, the Mel spectrogram, which uses the Mel scale for the vertical axis (frequency), was mainly used.
    ※ What is the Mel scale?: A scale on which equal differences correspond to equal differences in pitch as perceived by the human ear.
    (Images are cited from BirdCLEF2021: Processing audio data)
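To illustrate what "using the Mel scale for the vertical axis" means, the sketch below converts between Hz and mel with one widely used formula (the HTK-style variant; audio libraries may default to a different variant) and lists band edges that are equally spaced in mel. The function names and the 0 to 16000 Hz range are assumptions for illustration.

```python
import math


def hz_to_mel(freq_hz):
    """HTK-style Hz -> mel conversion."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min, f_max, n_bands):
    """Band edges equally spaced on the mel scale, returned in Hz."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / n_bands
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 1)]


edges = mel_band_edges(0.0, 16000.0, 4)
print([round(e) for e in edges])
```

The bands come out narrow at low frequencies and wide at high frequencies, so a Mel spectrogram spends more of its rows on the range where pitch differences are easier to hear.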
  13. Uniqueness of this competition
    • train_short_audio has only weak labels. (weak label problem)
      - Labels are assigned to the entire audio data of several tens of seconds.
      - It is impossible to know which bird is singing at the level of 5-second segments.
    • Missing labels in some parts of train_short_audio. (noisy label problem)
      - In particular, some of secondary_labels (※) are clearly marked as missing.
    • Need to incorporate metadata such as recording date and location in some way. (metadata incorporation)
    • Information on whether birds are singing in the segment before or after the prediction target may also be meaningful. (incorporating before/after segment information)
    • Significant difference in nocall rate between train_soundscapes and test_soundscapes. (difficulty in establishing CV strategy)
    ※ There are two types of labels for train_short_audio: primary_label and secondary_labels.
  14. tl;dr
    1st stage:
      Create a binary nocall detector using external data. (freefield1010) (1: some bird is singing / 0: nocall)
    2nd stage:
      Create a 397-dimensional multi-label classifier after down-weighting the nocall parts of train_short_audio with the nocall detector.
    3rd stage:
      Create another table competition (a tabular task) from the results of the nocall detector, metadata, and the results of the 2nd stage.
    By creating the table competition, we dealt with the weak label problem, noisy label problem, metadata incorporation, and incorporation of pre/post segment information simultaneously!
    ※ This is a summary of the Inference Part only, and the 1st stage is omitted.
  15. Why does the table solve the weak label problem & noisy label problem?
    • The target variable for the 3rd stage (0: incorrect / 1: correct) is determined in the following way
    • Solve the weak label problem by combining the output of a multi-label classifier and a nocall detector that can make segment-wise predictions with primary and secondary labels assigned to the entire audio data of several tens of seconds.
    • If secondary labels are missing, label 0 will be assigned, but the number of samples for label 0 is relatively large, so noise will be buried nicely. (alleviating the noisy label problem)
  16. 2nd place
    • Extract 30-second segments from train_short_audio, then separate them into 5-second segments and mixup. (dealing with weak label problem)
    • Exclude 3 audio files from train_soundscapes that have no birdsong for 10 minutes & bootstrap sampling. (robust CV strategy)
    • Weighting using rating column in metadata / Label smoothing (dealing with noisy label problem)
    • Tips for threshold selection
      - LB has a lower nocall rate than CV, so lower the threshold to predict more birds.
      - A single fixed probability value makes a poor threshold because the distribution of probability values differs between models, so use a percentile-based threshold.
    • Other tips (post-processing)
      - Modify individual probability values based on average predicted probability of each bird.
      - Use anterior-posterior segment information.
      - Take into account results of nocall detector.
      - Remove predictions for unlikely bird species based on time and location information. (metadata incorporation)
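The percentile-based threshold tip can be sketched as follows: instead of fixing a probability cut-off, fix the fraction of scores that should count as positive and read the cut-off from each model's own score distribution. This is a minimal illustration with made-up scores and a hypothetical function name, not the 2nd place team's actual code.

```python
def percentile_threshold(probs, keep_fraction):
    """Cut-off such that roughly `keep_fraction` of the scores are at or above it."""
    ranked = sorted(probs, reverse=True)
    k = max(1, min(len(ranked), round(keep_fraction * len(ranked))))
    return ranked[k - 1]


# Two hypothetical models with very different score scales
model_a = [0.05, 0.10, 0.20, 0.60, 0.90]
model_b = [0.001, 0.002, 0.004, 0.012, 0.018]

for name, scores in (("a", model_a), ("b", model_b)):
    thr = percentile_threshold(scores, keep_fraction=0.4)
    positives = [p for p in scores if p >= thr]
    print(name, thr, len(positives))
```

Both models end up flagging the same share of scores even though a fixed cut-off such as 0.5 would flag two scores for the first model and none for the second.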
  17. 4th place
    • SED model is used, and input is 10-30 seconds of audio. (dealing with weak label problem)
      (cf. Introduction to Sound Event Detection by @hidehisaarai1213)
    • Use mixup.
    • Pseudo labeling (dealing with noisy label problem)
    • The final output is a combination of the SED output for the 5-second segment and the 30-second segment around the 5-second segment. (incorporating anterior-posterior segment information)
      - A small threshold for the 5-second segment and a large threshold for the 30-second segment
    • As in the second-place solution, bird species judged unlikely to be observed based on time and location information are removed. (metadata incorporation)
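Mixup appears in several of the solutions. A minimal sketch of the standard formulation is shown below: two inputs and their label vectors are blended with a weight drawn from a Beta distribution. The alpha value and the plain-list representation are assumptions for illustration, and the individual solutions may have used a different variant (for example, different label handling).

```python
import random


def mixup(x1, y1, x2, y2, alpha=0.4, rng=random):
    """Blend two samples and their label vectors with a Beta(alpha, alpha) weight."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


rng = random.Random(0)  # seeded only to make the example repeatable
x, y, lam = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], rng=rng)
print(lam, x, y)
```

In practice x1 and x2 would be waveforms or spectrograms rather than two-element lists; the blending rule is the same.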
  18. 5th place
    • He was the 2nd place winner in the 2020 birdcall competition, and he used that solution as the basis for this competition.
    • Improvements from the last time:
      +1% by changing to SED (※1) / +1% by adjusting threshold / +1% by improving ensemble method
      (+ He also narrowed down predictive labels based on regional information. (※2))
    • Augmentation is characteristic:
      0.5-3 power of image / N times faster / add sounds such as rain or conversation / add noise / adjust frequency with probability of 0.5
      (1st-4th place seem to use only mixup and/or noise addition.)
    • Primary labels are given label 1, secondary labels are given label 0.3.
    • Birds observed in one segment are adjusted to be easily picked up in the whole 10-minute audio file. (※3)
    ※1: dealing with weak label problem
    ※2: metadata incorporation
    ※3: incorporating anterior-posterior segment information
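The soft-label scheme (1 for the primary label, 0.3 for secondary labels) can be sketched as a target-vector builder. The function name, the three-class list, and the species codes are illustrative assumptions; the real class list has 397 entries.

```python
def build_target(primary, secondaries, classes, secondary_weight=0.3):
    """Target vector with 1.0 for the primary label and a reduced weight for secondaries."""
    index = {name: i for i, name in enumerate(classes)}
    target = [0.0] * len(classes)
    for name in secondaries:
        if name in index:              # skip secondary labels outside the class list
            target[index[name]] = secondary_weight
    target[index[primary]] = 1.0       # the primary label always wins
    return target


classes = ["amerob", "bkcchi", "norcar"]   # illustrative subset
print(build_target("amerob", ["bkcchi"], classes))
```

The reduced weight reflects that secondary labels are less reliable than the primary label, which ties back to the noisy label problem.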
  19. 8th place
    • 6th place winner in the 2020 birdcall competition, again using SED. (dealing with the weak label problem)
    • Use 5 or 20 second segments for training and 40 second segments for inference, the longer the better.
      In addition, inference is done with overlap, e.g. 0-40 seconds followed by 20-60 seconds. (incorporating anterior-posterior segment information)
    • Augmentation: Gaussian noise, pink noise, volume adjustment, pitch shift
      (Mixup also worked well, but could not be included in the final submission due to computational resource issues.)
    • Loss function is characteristic. (BCEFocal2WayLoss)
    • Primary labels and secondary labels are treated the same way.
    • Pseudo labeling (dealing with noisy label problem)
    • There are two thresholds, a call threshold and a nocall threshold: bird species that exceed the call threshold are considered positive, while segments in which no bird species exceeds the nocall threshold are also given "nocall." (Bird labels and "nocall" can coexist.)
    • Exclude bird species that should not exist based on regional information, even if they are predicted. (metadata incorporation)
    • Calculate F1 score for bird call and nocall rows separately and derive CV as 0.54 * nocall_f1 + 0.46 * call_f1 (robust CV strategy)
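The two-threshold decoding and the weighted CV score described for the 8th place solution can be sketched as below. The function names, threshold values, and probabilities are assumptions for illustration; only the 0.54 / 0.46 weights come from the slide.

```python
def decode_segment(probs, call_thr, nocall_thr):
    """Turn per-species probabilities for one segment into a label string.

    Assumes call_thr <= nocall_thr, so every segment receives at least one label.
    """
    labels = sorted(s for s, p in probs.items() if p > call_thr)
    if not any(p > nocall_thr for p in probs.values()):
        labels.append("nocall")        # can coexist with bird labels
    return " ".join(labels)


def weighted_cv(nocall_f1, call_f1):
    """CV score as 0.54 * nocall_f1 + 0.46 * call_f1."""
    return 0.54 * nocall_f1 + 0.46 * call_f1


print(decode_segment({"amerob": 0.4, "bkcchi": 0.1}, call_thr=0.3, nocall_thr=0.5))
print(decode_segment({"amerob": 0.8, "bkcchi": 0.1}, call_thr=0.3, nocall_thr=0.5))
print(weighted_cv(nocall_f1=1.0, call_f1=0.5))
```

A species scoring between the two thresholds produces a hedged prediction of the form "species nocall", which earns partial credit under a per-row F1 whichever way the truth falls.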
  20. 9th place
    • 5-7 second segments for training input
    • The weights of secondary labels are reduced.
    • Use mixup.
    • As much diversity as possible.
      - Melspectrograms with different temporal resolutions, hop_lengths of 200 and 320
      - Various backbones
      - Augmentation: white noise, pink noise, band noise, nocall clips, raising melspectrograms to the power of N
    • Post-processing
      - Post-process using the maximum or average probability of each bird singing in the entire 10 minutes of audio data. (incorporating anterior-posterior segment information)
      - Evaluate how likely each bird is to sing based on regional information and post-process using the result. (metadata incorporation)
      - Post-process using the maximum probability of each bird singing in a day. (metadata incorporation)
    • Squeeze width of test soundscapes by 2-5% (mostly to reverse far field effects)
    • He also set two thresholds (for call & nocall) and allowed bird labels and "nocall" to coexist.
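To see how hop_length changes the temporal resolution of a melspectrogram, the sketch below counts the approximate number of time frames in a 5-second clip. The 32 kHz sample rate is an assumption (it is not stated on the slide), and the count ignores the padding that some libraries add at the clip edges.

```python
def n_frames(duration_sec, sample_rate, hop_length):
    """Approximate number of spectrogram frames, ignoring edge padding."""
    return int(duration_sec * sample_rate) // hop_length


SAMPLE_RATE = 32000  # assumed sample rate, not taken from the slide

for hop in (200, 320):
    print(hop, n_frames(5, SAMPLE_RATE, hop))
```

Under this assumption the same 5 seconds becomes roughly 800 frames at hop_length 200 and 500 frames at hop_length 320, so models trained on the two settings see the audio at different time scales, which adds diversity to the ensemble.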
  21. 11th place
    • CPMP, who had dominated the Public LB for a long time.
    • He took 18th place in the 2020 birdcall competition and 11th place in the Rainforest competition, and his solution is based on a mixture of both.
      - The 2020 birdcall competition solution: 18th place solution: efficientnet b3
      - Rainforest competition solution: 11th place, The 0.931 Magic Explained: Image Classification
    • CV is calculated by 0.54 * nocall_f1 + 0.46 * call_f1 as in the 8th place solution. (robust CV strategy)
    • He said his main mistake was not submitting the ensemble version until a few days before the end of the competition, because he had assumed that the ensemble would surely increase the score. (In fact, it did not work for his pipeline.)