6th place solution to Cornell Birdcall Identification Challenge


Hidehisa Arai

October 24, 2020

Transcript

  1. Cornell Birdcall
    Identification
    6th place solution
    Team「Deepでポン」
    Hidehisa Arai
    1


  2. About the competition
    2
    https://www.kaggle.com/c/birdsong-recognition
    The task: Build a system that can detect and classify birdcall events in audio clips
    Train dataset
    • Around 20k audio clips
    • Additional ~25k audio clips (not officially provided, but allowed to use)
    • 264 classes (bird species)
    • Each audio clip has a primary label, and about 1/3 of the clips also have secondary labels
    • Can be treated as multi-label, but it is essentially a multi-class classification problem
    • Annotated at clip level
    • Data comes from Xeno Canto, i.e. user-uploaded recordings; the annotations are also done by the uploaders
    • File format, sampling rate, audio quality, audio duration, etc. vary a lot
    Test dataset
    • Around 150 audio clips
    • Each clip is 10 min long
    • 265 classes (bird species + `nocall` class)
    • We need to submit predictions at the 5 s chunk level
    • The dataset is soundscape data, i.e. it contains birdcall events of multiple species (sometimes overlapping)
    • Multi-label classification problem
    • Annotation is done in a controlled environment with trained annotators


  3. About the competition
    3
    Large difference between train and test


  4. How to use audio data with NN
    4
    1D CNN approach
    • Examples: EnvNet, WaveNet encoder
    • Slow and usually vulnerable to noise
    2D CNN approach
    • Convert the raw waveform to a (log)(mel)spectrogram and treat it as an image
    • Use the rich pool of image recognition models to classify the spectrogram
    • Learns fast, often used
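    As a minimal sketch of the 2D approach (not from the slides; the sampling rate and mel parameters are illustrative assumptions), the waveform-to-log-mel-spectrogram conversion with librosa could look like this:

    import numpy as np
    import librosa

    # Load a clip, resampling to a fixed rate (value chosen only for illustration).
    y, sr = librosa.load("example_clip.wav", sr=32000)

    # Mel spectrogram, then power-to-dB: a 2D "image" that an image-recognition CNN can classify.
    melspec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, fmin=20, fmax=16000)
    logmel = librosa.power_to_db(melspec).astype(np.float32)

    print(logmel.shape)  # (n_mels, n_frames)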


  5. Result
    Public 2nd (0.628) → Private 6th (0.668)
    5


  6. Abstract
    6
    • Three-stage training to gradually remove the noise in the labels
    • Sound Event Detection (SED)-style training and inference
    • Weighted blending of 11 models trained with no-fold training + EMA


  7. Major challenges of the competition
    Discrepancy between train/test (domain shift)
    1. Source of the audio clips
    • Train: uploaded to Xeno Canto by the recordist
    • Test: (possibly) taken by microphones set up outdoors
    2. Annotation
    • Train: clip-wise labels annotated by the uploaders of those clips
    • Test: chunk-level labels annotated by trained annotators
    Weak and noisy labels
    • Labels for the train data are clip-level (weak labels) and there is only one primary label per clip
    • Secondary labels may exist but are not that trustworthy (noisy labels)
    • Sometimes uploaders simply don't add secondary labels even though multiple species are present in the clip (missing labels)
    • The class 'nocall' only exists in the test dataset
    7


  8. Motivation ‒ domain shift
    Decompose the domain shift
    x: input, y: label, y*: true label (unobservable)
    The task is to estimate p(y|x): x → y
    p_train(x) ≠ p_test(x)
    The distribution of the input is different between train/test
    • Comes from the sound collection environment
    • The SNR of the test data is comparatively low because birds often sing far away
    • The frequency and type of non-birdcall events are also different
    p_train(y*) ≠ p_test(y*)
    The distribution of the true label is different between train/test
    • Comes from the difference in location
    • The distribution of bird species and call types is different
    • Test recording is passive, so the nocall class occurs more often than in train
    p_train(y|x) ≠ p_test(y|x)
    The input-output relationship is different between train/test
    • Comes from the difference in annotation methods
    • In the train dataset, annotation quality varies a lot since the uploaders are the annotators
    • In the test dataset, annotation is controlled and done at chunk level
    8


  9. Motivation ‒ domain shift
    9
    Countermeasures for each component of the shift:
    • Provide all possible variation with Data Augmentation (sketch below)
    • Ensemble of models trained on datasets with different class distributions
    • Make p_train(y|x) closer to p_test(y|x) by label correction
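    A hedged sketch of the Data Augmentation countermeasure: the three augmentations below (Gaussian noise, pitch shift, volume up/down) are the ones listed in the later training slides, while the probabilities and parameter ranges here are illustrative assumptions, not the settings actually used.

    import numpy as np
    import librosa

    def augment_waveform(y: np.ndarray, sr: int) -> np.ndarray:
        """Randomly apply Gaussian noise, pitch shift, and volume change to a waveform."""
        if np.random.rand() < 0.5:  # Gaussian noise
            y = y + np.random.normal(0.0, 0.005, size=len(y))
        if np.random.rand() < 0.5:  # pitch shift (relatively slow)
            y = librosa.effects.pitch_shift(y, sr=sr, n_steps=np.random.uniform(-2, 2))
        if np.random.rand() < 0.5:  # volume up/down
            y = y * np.random.uniform(0.5, 1.5)
        return y.astype(np.float32)

    Pitch shift in particular is slow, which the Miscellaneous slide also points out.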


  10. Motivation ‒ label noise
    Decompose the label noise
    Weak labels are noisy labels
    • The fact that labels only exist at clip level can itself be treated as label noise
    Some birdcall events that are in the labels do not actually occur
    • We can trust the primary labels, but they are distributed over the whole clip
    • Birds in the secondary labels are often barely audible
    Some birdcall events that do occur are not in the labels
    • 2/3 of the train data don't have secondary labels
    • It is up to the uploaders whether to add secondary labels or not, so there are missing labels
    10
    Example clips shown in the slide: labels = aldfly, leafly, amered, canwar; label = warvir


  11. Motivation ‒ label noise
    11
    Countermeasures for each type of label noise:
    • Make sure birdcall events are inside the chunk by training on long chunks
    • Eliminate low-confidence labels by pseudo-labeling
    • Find missing labels in audio clips that don't have secondary labels, using trained models


  12. Sound Event Detection (SED)
    12
    Two major approaches to tackle the task
    Audio Tagging
    • Clip-level labeling of the audio input
    • Pipeline: input (waveform, melspec, …) → feature extraction (CNN, etc.) → feature map → aggregate over the time axis (max, mean, attention, …) → classifier → clip-level prediction
    Sound Event Detection
    • Segment-level (time-annotated) labeling of the audio
    • Pipeline: input (waveform, melspec, …) → feature extraction (CNN, etc.) → feature map → pointwise classifier → frame-level prediction → aggregate over the time axis (max, mean, attention, …) → clip-level prediction
    • The outputs are two: a clip-wise prediction and a segment-wise prediction
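    A minimal PyTorch-style sketch of the SED pipeline above (illustrative only; layer sizes, names, and the choice of max aggregation are assumptions, not the PANNs implementation): a pointwise classifier produces frame-level predictions, which are then aggregated over the time axis into a clip-level prediction.

    import torch
    import torch.nn as nn

    class SimpleSEDHead(nn.Module):
        """Frame-wise classifier + time aggregation on top of a CNN feature map."""
        def __init__(self, in_features: int, num_classes: int):
            super().__init__()
            # Pointwise (1x1) classifier applied to every time frame.
            self.frame_classifier = nn.Conv1d(in_features, num_classes, kernel_size=1)

        def forward(self, feature_map: torch.Tensor):
            # feature_map: (batch, channels, time), e.g. after pooling the frequency axis.
            frame_logits = self.frame_classifier(feature_map)  # (batch, classes, time)
            framewise = torch.sigmoid(frame_logits)            # frame-level prediction
            clipwise, _ = framewise.max(dim=2)                 # aggregate over time with max
            return clipwise, framewise

    # Usage: a feature map from a CNN encoder, e.g. (batch=2, channels=2048, time=100)
    head = SimpleSEDHead(in_features=2048, num_classes=264)
    clipwise, framewise = head(torch.randn(2, 2048, 100))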


  13. Stage 1
    13
    Find missing labels with 5-fold PANNs
    • Cnn14DecisionLevelAtt of PANNs (SED model)
    • Input: 30 s randomly cropped chunk ← for weak labels
    • Data Augmentation ← for the shift in p(x)
      • Gaussian Noise
      • Pitch Shift
      • Volume Up/Down
    • Loss is applied to both the clipwise output and the framewise output:
      ℒ = L(y, ŷ_att) + 0.5 · L(y, ŷ_max)
      where ŷ_att is the clip-level prediction aggregated with attention and ŷ_max is the one aggregated with max
    • Create predictions for the whole train dataset by 5-fold training
    • Add to the secondary labels any class whose predicted probability is > 0.9 and that is not the primary label (only for clips without secondary labels) ← for missing labels (sketch below)
    Model head: the feature extractor yields segment-wise predictions and an attention map; aggregating over the time axis after the element-wise product gives the clip-level prediction with attention, and aggregating the segment-wise predictions with max gives the other clip-level prediction
    → 0.578 (public) / 0.619 (private) (silver medal zone)
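    A small sketch of the missing-label rule above; the array names, shapes, and label formats (`oof_probs` as out-of-fold clipwise probabilities, one primary label string per clip, secondary labels as lists) are assumptions for illustration:

    import numpy as np

    THRESHOLD = 0.9

    def add_missing_secondary_labels(oof_probs, primary_labels, secondary_labels, class_names):
        """oof_probs: (n_clips, n_classes) out-of-fold clipwise probabilities from 5-fold training."""
        new_secondary = []
        for probs, primary, secondary in zip(oof_probs, primary_labels, secondary_labels):
            if secondary:  # only clips without secondary labels are touched
                new_secondary.append(secondary)
                continue
            # Classes predicted with probability > 0.9 that are not the primary label.
            found = [class_names[i] for i in np.where(probs > THRESHOLD)[0]
                     if class_names[i] != primary]
            new_secondary.append(found)
        return new_secondary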


  14. Stage 2
    14
    Pseudo labeling with a 5-fold SED model
    • Inherit the PANNs architecture
    • Change the CNN encoder to ResNeSt50
    • Use the additional labels found in Stage 1 ← for missing labels
    • Input: 20 s randomly cropped chunk ← for weak labels
    • Data Augmentation ← for the shift in p(x)
      • Gaussian Noise
      • Pitch Shift
      • Volume Up/Down
    • 3-channel input (melspec, pcen, melspec ** 1.5)
    • Loss is applied to both the clipwise output and the framewise output
    • Create predictions for the whole train dataset by 5-fold training
    • Correct the original labels with the oof predictions ← for noisy labels (sketch below)
    Label correction: take the frame-wise prediction for the cropped chunk (264 classes × n_frame, where n_frame ≠ len(y)), aggregate it over the time axis with max, apply a threshold, and combine it with the original clip-level label via np.logical_and to get the corrected label
    • Alter the correction level by changing the threshold, for diversity
    → 0.601 (public) / 0.655 (private) (gold medal zone)
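    A minimal sketch of the label-correction step above (names and shapes are assumptions: `framewise` is the oof frame-wise prediction for the cropped chunk, `clip_label` is the original multi-hot clip-level label):

    import numpy as np

    def correct_label(framewise, clip_label, threshold=0.5):
        """framewise: (n_classes, n_frame) oof probabilities; clip_label: (n_classes,) multi-hot."""
        # Aggregate the prediction over the time axis with max and apply the threshold.
        chunk_pred = framewise.max(axis=1) > threshold
        # Keep only classes that are both in the original label and detected in the chunk.
        corrected = np.logical_and(clip_label.astype(bool), chunk_pred)
        return corrected.astype(np.float32)

    Raising or lowering `threshold` (0.3 to 0.7 in Stage 3) changes how aggressively labels are removed, which is what "alter the correction level" refers to.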


  15. Stage 3
    15
    Train several models with no-fold training + EMA, changing the correction level of the labels
    • PANNs architecture + ResNeSt50 / EfficientNet-B0
    • No-fold training with Train Dataset / Train Extended
    • Use the additional labels found in Stage 1 ← for missing labels
    • Input: 20 s randomly cropped chunk ← for weak labels
    • Data Augmentation ← for the shift in p(x)
      • Gaussian Noise
      • Pitch Shift
      • Volume Up/Down
    • 3-channel input (melspec, pcen, melspec ** 1.5):
      • melspec: librosa.power_to_db(librosa.feature.melspectrogram(y=y))
      • pcen: librosa.pcen(librosa.feature.melspectrogram(y=y))
      • melspec ** 1.5: librosa.power_to_db(librosa.feature.melspectrogram(y=y) ** 1.5)
    • Loss is applied to both the clipwise output and the framewise output (FocalLoss for EfficientNet)
    • Correct the original labels with the oof predictions ← for noisy labels
    • Alter the label correction level by changing the threshold on the oof predictions from 0.3 to 0.7 ← for the shift in p(y*)
    • Weighted average of 11 models (weights decided based on the public LB)
    ※ no-fold + EMA: use the whole dataset for training and apply an Exponential Moving Average to the weights (sketch below)
    → 0.628 (public) / 0.668 (private) (6th place)
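    A minimal sketch of the no-fold + EMA idea from the footnote above (PyTorch-style; the decay value and update timing are illustrative assumptions, not the authors' settings):

    import copy
    import torch

    class WeightEMA:
        """Keep an exponential moving average of a model's parameters for inference."""
        def __init__(self, model: torch.nn.Module, decay: float = 0.999):
            self.decay = decay
            self.ema_model = copy.deepcopy(model).eval()  # EMA copy used at inference time
            for p in self.ema_model.parameters():
                p.requires_grad_(False)

        @torch.no_grad()
        def update(self, model: torch.nn.Module):
            # ema = decay * ema + (1 - decay) * current, typically called after each optimizer step.
            for ema_p, p in zip(self.ema_model.parameters(), model.parameters()):
                ema_p.mul_(self.decay).add_(p, alpha=1.0 - self.decay)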


  16. Miscellaneous
    16
    Reflections
    • There was a bug in the seed-fixing function; I found it only while training the third stage
    • Reproducibility was lost because of code modifications
    • One experiment, one script is the best policy after all
    • I kept the Data Augmentation configuration that worked at first, but pitch shift is actually very slow and I could have run more experiments if I had decided not to use it
    • Hard voting seems to work very well in the winner's solution, but I didn't even try it
    What didn't work
    • Mixup
      • Mixing before taking the log (2nd, 36th) and taking the union of the labels (3rd, 36th) are reported to have worked
    • Call-type classification
      • In fact, there are many call types even within a single species
      • Calls and songs are quite different
      • I tried 871-class classification using call-type labels, but it didn't work
    • Larger models
      • According to Oleg (5th), a too-large receptive field wasn't good for this dataset
    What I want to do further
    • Add more models
    • Continue with further label correction
    • Use location and elevation for label correction
    • Use co-occurrence information of bird calls for post-processing
    • Mix in background noise
