
BirdCLEF2021 Summary



June 13, 2021



  1. BirdCLEF2021 Summary
    Competition overview and top solutions

    Japanese version also available
    (searching for a good icon...)

    (You can jump to each link from the Speaker Deck description section.)


  2. Introduction


  3. Medical student in Hiroshima.

    While studying for the national examination,

    I am interning at MNES, a medical venture company,
    and studying at LAIME, a machine learning study
    group for students sponsored by MNES.

    I won this competition and became a Kaggle Master.

    (Thanks to my teammates, though...)

    Also, I'm looking for a good icon.

    Twitter: @startjapanml


  4. Competition Overview


  5. Competition Overview
    • A competition to identify the bird species singing from 5-second audio segments.

    (The same organizer held a similar competition in 2020.
    → We will call it "the 2020 birdcall competition" here.)

    • The train data is audio taken from a birdcall sharing site called xeno-canto.


    • The test data is 80 audio files of 10 minutes each, divided into 5 second segments for prediction.

    • In addition to the above, 20 audio files (10 minutes x 20) are given for validation.
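
    The segmentation described above can be sketched as follows. This is only an illustration: the function name and the (start, end) tuple representation are assumptions, not part of the competition's tooling.

```python
def five_second_segments(duration_sec=600, segment_sec=5):
    """Return (start, end) offsets in seconds for consecutive fixed-length segments."""
    return [
        (start, start + segment_sec)
        for start in range(0, duration_sec, segment_sec)
        if start + segment_sec <= duration_sec
    ]


# A 10-minute (600-second) soundscape yields 120 segments of 5 seconds each.
segments = five_second_segments()
# With 80 test files, that is 80 * 120 rows to predict.
total_test_rows = 80 * len(segments)
```

    Under these assumptions, each test file contributes 120 prediction rows.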


  6. [test_soundscapes]

    Test data: 80 10-minute soundscapes, which cannot be accessed
    without submission. The data was recorded at 4 locations.

    [train_short_audio]

    Training data. The audio is organized by bird species.
    62874 audio files in total.

    [train_soundscapes]

    The acoustic domain is similar to test_soundscapes.
    There are 20 soundscapes of 10 minutes each, recorded at two of
    the four locations where test_soundscapes was recorded.

    [train_metadata.csv]

    Metadata for train_short_audio. The shape is (62874, 14).

    [train_soundscape_labels.csv / test.csv]

    Provide a framework for dividing a 10-minute file into 5-second segments.
    train_soundscape_labels.csv corresponds to train_soundscapes and
    test.csv corresponds to test_soundscapes; only the former has the answer labels.


  7. Submission & Evaluation
    • Multiple bird species can be submitted as predictions for a single segment.

    • For segments where no birds are singing, the string "nocall" should be submitted.

    • Submissions will be evaluated based on their row-wise micro averaged F1 score.
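
    One common reading of this metric is to compute an F1 score for each row (segment) from its set of labels and then average over rows. The sketch below follows that reading; the organizers' exact implementation is not given in these slides, and the space-separated label format and the bird codes are assumptions used for illustration.

```python
def row_f1(true_labels, pred_labels):
    """F1 score for a single row, treating the labels as sets."""
    true_set, pred_set = set(true_labels), set(pred_labels)
    tp = len(true_set & pred_set)   # labels predicted and correct
    fp = len(pred_set - true_set)   # labels predicted but wrong
    fn = len(true_set - pred_set)   # labels missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def row_wise_f1(y_true, y_pred):
    """Average the per-row F1 over all rows (labels are space-separated strings)."""
    scores = [row_f1(t.split(), p.split()) for t, p in zip(y_true, y_pred)]
    return sum(scores) / len(scores)


# Hypothetical example: three segments, "nocall" is treated like any other label.
y_true = ["nocall", "amerob", "amerob bkcchi"]
y_pred = ["nocall", "nocall", "amerob"]
score = row_wise_f1(y_true, y_pred)  # rows score 1, 0, and 2/3
```

    Because "nocall" is scored like any other label, a wrong "nocall" on a segment with birds costs the whole row.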


  8. Differences from the 2020 birdcall competition


  9. Differences from the 2020 birdcall competition
    • Existence of train_soundscapes

    - There is a large difference in the acoustic domain between train_short_audio and the test data.

    - In this competition, train_soundscapes with an acoustic domain closer to the test data are given.

    - They were mostly used for validation purposes, but some people devised ways to use them for training.

    • We had access to the location information in the test data.

    - It was guaranteed that each file name in the test data contained location info (and date info).

    - So we needed to incorporate this information into the pipeline in some way.
    (cf. Starter and some thoughts by @hidehisaarai1213)
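
    As a minimal sketch of pulling location and date out of a file name: the pattern `<id>_<site>_<yyyymmdd>.ogg`, the site code, and the example name below are assumptions for illustration and should be checked against the actual files.

```python
from datetime import datetime


def parse_soundscape_name(file_name):
    """Split an assumed '<id>_<site>_<yyyymmdd>.ogg' name into its parts."""
    stem = file_name.rsplit(".", 1)[0]          # drop the extension
    audio_id, site, date_str = stem.split("_")  # assumed three underscore-separated fields
    recorded = datetime.strptime(date_str, "%Y%m%d").date()
    return {"audio_id": audio_id, "site": site, "date": recorded}


# Hypothetical file name, not taken from the dataset.
info = parse_soundscape_name("12345_SSW_20170429.ogg")
```

    The parsed site and month can then be fed into the pipeline as features or used to filter out unlikely species.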


  10. EDA (train_short_audio ver.)


  11. Time length of audio per file (train_short_audio)
    ※ Random sampling of 1000 ( / 62874) audio files from train_short_audio

    ※ Horizontal axis : Time length of audio per file [sec]

    ※ Vertical axis : Frequency (1000 files in total)


  12. How many audio files are there for one species of bird? (train_short_audio)
    ※ Horizontal Axis : Number of audio files for each bird (in train_short_audio)

    ※ Vertical axis : Frequency (397 species in total)


  13. There are some secondary labels that are missing. (train_short_audio)
    (Cited from BirdCLEF2021: Exploring the data)


  14. EDA (soundscapes ver.)


  15. We can find out the nocall rate of the Public LB by submitting "nocall" for all rows.
    (cf. the 2020 birdcall competition)
    (Figure panels: Private / Public)


  16. On the other hand, train_soundscapes has a higher nocall rate.


  17. Distribution of the target variable in train_soundscapes (nocall included)
    • The overwhelming majority of them are "nocall."

    • Some 5-second segments have two or more species singing.


  18. Distribution of the target variable in train_soundscapes (nocall removed)
    • Some combinations of bird species are frequently observed.


  19. Number of birds singing in the same 5-second segment in train_soundscapes.


  20. Basic approach for audio recognition


  21. Basic approach for audio recognition
    The audio data can be converted into an image (called a spectrogram)
    where the horizontal axis is time, the vertical axis is frequency, and
    each pixel represents the intensity of the signal component.
    By applying a CNN to it, the task can be treated as conventional image recognition.
    ※ In this competition, the Mel spectrogram, which uses the Mel scale for the vertical axis (frequency), was mainly used.

    ※ What is the Mel scale?: A frequency scale on which equal distances correspond to equal differences in pitch
    as perceived by the human ear.
    (Images are cited from BirdCLEF2021: Processing audio data)
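
    To make the Mel scale concrete, the sketch below uses one widely used conversion formula, m = 2595 * log10(1 + f / 700). Several variants of the Mel scale exist, and the slides do not say which one the participants used, so treat this as an illustration only.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (one common formula among several variants)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# The same 100 Hz step covers far more mels at low frequencies than at high ones,
# which mirrors how the ear resolves low pitches more finely.
low_step = hz_to_mel(200.0) - hz_to_mel(100.0)
high_step = hz_to_mel(8100.0) - hz_to_mel(8000.0)
```

    A Mel spectrogram spaces its frequency bins evenly on this scale, so low frequencies get more vertical resolution than high ones.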


  22. Uniqueness of this competition


  23. Uniqueness of this competition
    • train_short_audio has only weak labels. (weak label problem)

    - Labels are assigned to the entire audio recording of several tens of seconds.

    - It is impossible to know which bird is singing at the level of 5-second segments.

    • Labels are missing in some parts of train_short_audio. (noisy label problem)

    - In particular, some secondary_labels (※) are clearly missing.

    • Need to incorporate metadata such as recording date and location in some way. (metadata incorporation)

    • Information on whether birds are singing in the segment before or after the prediction target
    may also be meaningful. (incorporating before/after segment information)

    • Significant difference in nocall rate between train_soundscapes and test_soundscapes.
    (difficulty in establishing CV strategy)
    ※ There are two types of labels for train_short_audio: primary_label and secondary_labels.


  24. Top solutions
    top solutions and approaches

    The above discussion contains links to the top solutions.


  25. 1st place (ours!)
    [1st Place] Quick Solution

    [1st Place] Detailed Solution


  26. tl;dr
    1st stage :

    Create a binary nocall detector using external data (freefield1010). (1: some bird is singing / 0: nocall)

    2nd stage :

    Create a 397-dimensional multi-label classifier, using the nocall detector to down-weight the nocall parts of train_short_audio.

    3rd stage :

    Create a new tabular problem (a "table competition") from the results of the nocall detector, the metadata, and the results of the 2nd stage.

    By creating the table competition, we dealt with the weak label problem, the noisy label problem, metadata incorporation,
    and the incorporation of before/after segment information simultaneously!
    ※ This is a summary of the inference part only, and the 1st stage is omitted.


  27. Why does the table solve the weak label problem & noisy label problem?
    • The target variable for the 3rd stage (0: incorrect / 1: correct) is determined in the following way
    • The weak label problem is addressed by combining segment-wise predictions (from the multi-label classifier and
    the nocall detector) with the primary and secondary labels assigned to the whole recording of several
    tens of seconds.

    • If secondary labels are missing, label 0 will be assigned, but the number of samples for label 0 is relatively large,
    so noise will be buried nicely. (alleviating the noisy label problem)
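
    The slide's figure is not reproduced here, so the following is only a hypothetical sketch of how such a table could be built, not the team's actual code. It assumes one row per (segment, candidate bird) pair, takes the top-k birds from the 2nd stage as candidates, and sets the target to 1 when the candidate appears in the file-level primary or secondary labels. All names, features, and numbers are illustrative.

```python
def build_candidate_rows(segment_probs, call_prob, file_labels, top_k=3):
    """Turn one segment's 2nd-stage output into rows of a tabular training set.

    segment_probs: dict of bird -> probability from the multi-label classifier
    call_prob:     probability from the nocall detector that some bird is singing
    file_labels:   set of primary and secondary labels of the whole recording
    """
    ranked = sorted(segment_probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    rows = []
    for rank, (bird, prob) in enumerate(ranked):
        rows.append({
            "bird": bird,
            "prob": prob,            # segment-wise classifier score
            "rank": rank,            # position among the candidates
            "call_prob": call_prob,  # segment-wise nocall detector score
            # A missing secondary label silently becomes 0 here (label noise).
            "target": int(bird in file_labels),
        })
    return rows


# Hypothetical segment: two candidates, only one is in the file-level labels.
rows = build_candidate_rows(
    {"amerob": 0.7, "bkcchi": 0.2, "houspa": 0.05},
    call_prob=0.9,
    file_labels={"amerob"},
    top_k=2,
)
```

    A gradient-boosting model or similar could then be trained on such rows together with metadata and neighboring-segment features.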


  28. 2nd place
    2nd place solution


  29. 2nd place
    (Cited from 2nd place solution)


  30. 2nd place
    • Extract 30-second segments from train_short_audio, split them into 5-second segments, and apply mixup to them.

    (dealing with weak label problem)

    • Exclude 3 audio files from train_soundscapes that have no birdsong for the whole 10 minutes, and use bootstrap sampling.

    (robust CV strategy)

    • Weighting using rating column in metadata / Label smoothing

    (dealing with noisy label problem)

    • Tips for threshold selection

    - LB has a lower nocall rate than CV, so lower the threshold to predict more birds.

    - A single fixed probability value makes little sense as a threshold because the distribution
    of probability values differs between models, so use a percentile-based threshold.

    • Other tips (post-processing)

    - Modify individual probability values based on average predicted probability of each bird.

    - Use anterior-posterior segment information.

    - Take into account results of nocall detector.

    - Remove predictions for unlikely bird species based on time and location information. (metadata incorporation)
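
    The percentile-based threshold tip above can be sketched as follows. This is a minimal illustration using the nearest-rank percentile; the function name, the chosen percentile, and the scores are assumptions, not values from the 2nd place solution.

```python
import math


def percentile_threshold(probabilities, percentile):
    """Nearest-rank percentile: the smallest score with at least `percentile` % of scores at or below it."""
    ordered = sorted(probabilities)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


# Two hypothetical models whose scores live on very different ranges.
model_a = [0.01, 0.02, 0.05, 0.10, 0.90]
model_b = [0.20, 0.30, 0.40, 0.50, 0.95]

# The same percentile gives each model its own threshold.
thr_a = percentile_threshold(model_a, 80)
thr_b = percentile_threshold(model_b, 80)

# Scores strictly above the threshold are kept as positive predictions.
kept_a = [p for p in model_a if p > thr_a]
kept_b = [p for p in model_b if p > thr_b]
```

    Both models end up predicting the same fraction of positives even though a fixed cutoff such as 0.3 would treat them very differently.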


  31. 4th place
    4th place solution


  32. 4th place
    • SED model is used, and input is 10-30 seconds of audio. (dealing with weak label problem)

    (cf. Introduction to Sound Event Detection by @hidehisaarai1213)

    • Use mixup.

    • Pseudo labeling (dealing with noisy label problem)

    • The final output is a combination of the SED output for the 5-second segment

    and the 30-second segment around the 5-second segment.

    (incorporating anterior-posterior segment information)

    - A small threshold for the 5-second segment and a large threshold for the 30-second segment

    • As in the second-place solution, bird species judged unlikely to be observed based on time and
    location information are removed. (metadata incorporation)
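
    One way to picture the two-scale thresholding above: the slides only say that a small threshold is used for the 5-second output and a large one for the 30-second output, so the rule below (a bird must pass both) and the threshold values are assumptions for illustration.

```python
def combine_scales(probs_5s, probs_30s, thr_5s=0.1, thr_30s=0.5):
    """Keep birds that pass a small 5-second threshold and a large 30-second threshold."""
    birds = sorted(
        bird
        for bird, prob in probs_5s.items()
        if prob >= thr_5s and probs_30s.get(bird, 0.0) >= thr_30s
    )
    return birds or ["nocall"]


# Hypothetical scores: bkcchi looks plausible in the 5-second window
# but is not supported by the surrounding 30 seconds.
labels = combine_scales(
    {"amerob": 0.2, "bkcchi": 0.3},
    {"amerob": 0.7, "bkcchi": 0.1},
)
```

    In this sketch the wider window acts as a confident "this bird is around" check, while the narrow window localizes it.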


  33. 5th place
    5th place solution


  34. 5th place
    • He was the 2nd place winner in the 2020 birdcall competition, and he used that as the basis for this competition.

    • Improvements from the last time:

    +1% by changing to SED (※1) / +1% by adjusting threshold / +1% by improving ensemble method

    (+ He also narrowed down predictive labels based on regional information. (※2) )

    • Augmentation is characteristic:

    raise the image to a power of 0.5-3 / play N times faster / add sounds such as rain or conversation /
    add noise / adjust frequency with probability of 0.5

    (1st-4th place seem to use only mixup and/or noise addition.)

    • Primary labels are given label 1, secondary labels are given label 0.3.

    • Birds observed in one segment are adjusted to be easily picked up in the whole 10-minute audio file. (※3)
    ※1 : dealing with weak label problem

    ※2 : metadata incorporation

    ※3 : incorporating anterior-posterior segment information
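
    The soft-label scheme above (1 for the primary label, 0.3 for secondary labels) can be sketched like this. The function name and the three-class list are illustrative assumptions; the real task has 397 classes.

```python
def soft_label_vector(primary, secondaries, classes, secondary_weight=0.3):
    """Build a target vector with 1.0 for the primary label and a smaller weight for secondary labels."""
    index = {bird: i for i, bird in enumerate(classes)}
    target = [0.0] * len(classes)
    for bird in secondaries:
        if bird in index:  # ignore secondary labels outside the class list
            target[index[bird]] = secondary_weight
    target[index[primary]] = 1.0  # the primary label always wins
    return target


# Hypothetical three-class example.
classes = ["amerob", "bkcchi", "houspa"]
target = soft_label_vector("bkcchi", ["houspa"], classes)
```

    Such a vector can be used directly as the target of a binary cross-entropy loss.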


  35. 8th place
    8th place writeup


  36. 8th place
    • 6th place winner in the 2020 birdcall competition, again using SED. (dealing with the weak label problem)

    • Use 5- or 20-second segments for training and 40-second segments for inference (the longer, the better).

    In addition, the inference is done with overlap, e.g., 0-40 seconds followed by 20-60 seconds.

    (incorporating anterior-posterior segment information)

    • Augmentation: Gaussian noise, pink noise, volume adjustment, pitch shift

    (Mixup also worked well, but could not be included in the final submission due to computational resource issues.)

    • Loss function is characteristic. (BCEFocal2WayLoss)

    • Primary labels and secondary labels are treated the same way.

    • Pseudo labeling (dealing with noisy label problem)

    • There are two thresholds, a call threshold and a nocall threshold. Bird species that exceed the call threshold are considered
    positive, while segments in which no bird species exceeds the nocall threshold are also given "nocall."

    (Bird labels and "nocall" can coexist.)

    • Exclude bird species that should not exist based on regional information, even if they are predicted.

    (metadata incorporation)

    • Calculate F1 score for bird call and nocall lines separately and derive CV as 0.54 * nocall_f1 + 0.46 * call_f1

    (robust CV strategy)
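
    The two-threshold decoding and the weighted CV score above can be sketched as follows. The threshold values and bird codes are placeholders chosen for illustration; only the 0.54 / 0.46 weights come from the slide.

```python
def decode_segment(probs, call_thr=0.3, nocall_thr=0.5):
    """Apply separate call and nocall thresholds to one segment's probabilities."""
    labels = sorted(bird for bird, p in probs.items() if p > call_thr)
    # "nocall" is added whenever no bird is confident enough,
    # so it can coexist with bird labels that only passed the call threshold.
    if not any(p > nocall_thr for p in probs.values()):
        labels.append("nocall")
    return labels


def weighted_cv(nocall_f1, call_f1, nocall_weight=0.54):
    """CV score as a weighted sum of the F1 on nocall rows and on call rows."""
    return nocall_weight * nocall_f1 + (1 - nocall_weight) * call_f1


both = decode_segment({"amerob": 0.4, "bkcchi": 0.1})  # bird label plus "nocall"
bird_only = decode_segment({"amerob": 0.8})
nocall_only = decode_segment({"amerob": 0.1})
cv = weighted_cv(nocall_f1=0.9, call_f1=0.6)
```

    Hedging with both labels pays off under a row-wise F1 because a partially correct row still earns partial credit.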


  37. 9th place
    9th Place solution


  38. 9th place
    • 5-7 second segments for training input

    • The weights of secondary labels are reduced.

    • Use mixup.

    • As much diversity as possible.

    - Melspectrograms with different temporal resolutions, hop_lengths of 200 and 320

    - Various backbones

    - Augmentation: white noise, pink noise, band noise, nocall clips, raising melspectrograms to the power of N

    • Post-processing

    - Post-process using the maximum or average probability of each bird singing in the entire 10 minutes of audio data.

    (incorporating anterior-posterior segment information)

    - Evaluate how likely each bird is to sing based on regional information and post-process using the result.

    (metadata incorporation)

    - Post-process using the maximum probability of each bird singing in a day.

    (metadata incorporation)

    • Squeeze width of test soundscapes by 2-5% (mostly to reverse far field effects)

    • He also set two thresholds (for call & nocall) and allowed bird labels and "nocall" to coexist.
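
    The file-level post-processing above could look something like the sketch below. The slides do not say how the 10-minute maximum is combined with each segment's probability, so the linear blend and the value of alpha are assumptions made purely for illustration.

```python
def blend_with_file_max(segment_probs, alpha=0.5):
    """Pull each segment's probability toward the bird's maximum over the whole file.

    segment_probs: list of dicts (one per 5-second segment) of bird -> probability
    alpha:         weight of the file-level maximum in the blend (hypothetical value)
    """
    birds = {bird for seg in segment_probs for bird in seg}
    file_max = {
        bird: max(seg.get(bird, 0.0) for seg in segment_probs) for bird in birds
    }
    return [
        {bird: (1 - alpha) * seg.get(bird, 0.0) + alpha * file_max[bird] for bird in birds}
        for seg in segment_probs
    ]


# Hypothetical file with two segments: the weak detection is lifted
# because the same bird is heard clearly elsewhere in the file.
adjusted = blend_with_file_max([{"amerob": 0.2}, {"amerob": 0.8}])
```

    The intuition is that a bird heard clearly once in a recording is more likely to be present in its other segments too.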


  39. 11th place
    My journey (11th solution)


  40. 11th place
    • CPMP, who had dominated the Public LB for a long time.

    • He took 18th place in the 2020 birdcall competition and 11th place in the Rainforest competition,

    and his solution is based on a mixture of both.

    - The 2020 birdcall competition solution : 18th place solution: efficientnet b3

    - Rainforest competition solution : 11th place, The 0.931 Magic Explained: Image Classification

    • CV is calculated by 0.54 * nocall_f1 + 0.46 * call_f1 as in the 8th place solution. (robust CV strategy)

    • He said it was his main mistake that he did not submit the ensemble version until a few days before the end of the
    competition, because he assumed that the ensemble would surely increase the score.

    (In fact, it did not work for his pipeline.)


  41. That's it, thank you!
