Cornell Birdcall 36th place solution

Maxwell
September 16, 2020

Transcript

  1. 1D Features
    Resources:
    TITAN RTX, 1080Ti x 2, 2080Ti x 2 *1
    Cornell Birdcall Identification
    Kaggle
    Xeno-Canto
    Extended
    Xeno-Canto*2
    Copyright 2020 @ Maxwell_110
    *1 My room became tropical.
    *2 Credit to Vopani
    *3 Kerneler-kun has become a notebook expert! Check his profile.
    Feature Extraction
    Training / Prediction
    BirdVox
    ff1010
    fs2019
    ESC50
    264 birds
    nocall
    Trim silent parts
    (librosa.effects.trim)
    Load Audio
    (librosa.load)
    Resample to
    22.05 kHz
    remove only
    silent start/end parts
    Log Mel Spectrogram (2D)
    ( librosa.feature.melspectrogram
    librosa.core.power_to_db )
    Audio Data (1D)
    • 5 - 10 (s) variable audio length
    • 64 mel bands, 10 ms hop, 80 ms STFT window
    • Event aware extraction
    • 5 (s) constant audio length
    • Event aware extraction
    64
    500 - 1000
    5 x 22050
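The shape labels above (64 x 500 - 1000 for the 2D features, 5 x 22050 for the 1D input) follow from the sampling rate, hop, and clip length. The sketch below only reproduces that arithmetic. It assumes the hop and window are rounded down to whole samples and approximates the frame count as samples divided by hop; the helper names are illustrative and not taken from the slides.

```python
SAMPLE_RATE = 22050   # Hz, resampling rate from the slide
HOP_MS = 10           # hop length from the slide
WINDOW_MS = 80        # STFT window length from the slide
N_MELS = 64           # number of mel bands from the slide


def samples_for(milliseconds, sample_rate=SAMPLE_RATE):
    """Convert a duration in milliseconds to whole samples (rounded down, an assumption)."""
    return int(sample_rate * milliseconds / 1000)


def approx_frames(seconds, sample_rate=SAMPLE_RATE):
    """Approximate number of spectrogram frames for a clip of the given length."""
    n_samples = int(seconds * sample_rate)
    return n_samples // samples_for(HOP_MS, sample_rate)


hop_length = samples_for(HOP_MS)          # hop in samples
win_length = samples_for(WINDOW_MS)       # STFT window in samples
shape_5s = (N_MELS, approx_frames(5))     # 2D feature shape for a 5 s chunk
shape_10s = (N_MELS, approx_frames(10))   # 2D feature shape for a 10 s chunk
n_samples_1d = 5 * SAMPLE_RATE            # 1D input length for a 5 s chunk
```

With these assumptions the 5 s and 10 s chunks come out at roughly 500 and 1000 frames. A centered STFT, which is the librosa default, adds one frame to the count.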
    1
    Augmentation
    • p: 0.5
    • width / height shift: 0.2 / 0.1
    • Scale: -0.05 / +0.05
    2
    Random Eraser
    • p: 0.5
    • erase num: 1
    • width: [0, 0.1]
    • height: [0.1, 0.3]
    • fill with -1
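The Random Eraser settings above can be sketched as follows. This is a minimal illustration rather than the author's implementation: it assumes the width and height ranges are fractions of the spectrogram's time and frequency dimensions, and it represents the spectrogram as a plain list of rows (`[freq][time]`). The function name and its arguments are hypothetical.

```python
import random


def random_erase(spec, p=0.5, width_range=(0.0, 0.1), height_range=(0.1, 0.3),
                 fill_value=-1.0, rng=random):
    """Return a copy of `spec` ([freq][time]) with at most one rectangle erased."""
    out = [row[:] for row in spec]
    if rng.random() >= p:
        return out  # augmentation skipped with probability 1 - p
    n_freq, n_time = len(out), len(out[0])
    # Rectangle size as a fraction of the spectrogram (assumed interpretation).
    erase_w = int(n_time * rng.uniform(*width_range))
    erase_h = int(n_freq * rng.uniform(*height_range))
    # Top-left corner chosen so the rectangle stays inside the spectrogram.
    t0 = rng.randint(0, n_time - erase_w)
    f0 = rng.randint(0, n_freq - erase_h)
    for f in range(f0, f0 + erase_h):
        for t in range(t0, t0 + erase_w):
            out[f][t] = fill_value
    return out
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible, which helps when debugging a data pipeline.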
    Standardize
    [-1, +1]
    To 2D Models
    To 1D Models
    • p: 0.5
    • width shift: 0.2
    • Noise injection
    Augmentation Standardize
    [-1, +1]
    cut out
    at random
    2D
    Features
    ResNet 18
    nocall
    Binary Model
    (call / nocall )
    3
    4 Multi-Label Model (264 types)
    Multi-Task-Learning (MTL) for primary and noisy background labels
    call
    2D Models
    Branch heads (same layout for the primary and background branches):
    • 2D: GAP / 1D: MAP*4
    • 2048 nodes + BN + ReLU
    • 512 nodes + BN + ReLU
    • 264 nodes (primary labels in one branch, background labels in the other)
    ResNet 18
    2D Models
    MTL
    Loss
    3 stage scratch learning
    1. primary only
    -
    ,
    = [, ]
    - 2D: 200 epochs + Early Stopping (ES)
    Adam, CyclicLR 1e-4 ~ e-3
    - 1D: 100 epochs + ES
    SGD, CosineAnnealing 1e-1 ~ e-6
    => Adam, ReduceLROnPlateau 1e-4
    2. + background
    -
    ,
    = [, ]
    - Adam (5e-5)
    - ReduceLROnPlateau (x 0.25)
    3. + Pseudo Labeling
    -
    ,
    = [, ]
    - Adam (5e-5)
    - ReduceLROnPlateau (x 0.25)
    - Background predictions above 0.15 are added to the primary labels
    as soft labels in the primary branch. All
    values are clipped to the range [0, 1].
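The pseudo-labeling rule in stage 3 can be written as a small target-building step. The sketch below is one reading of the slide, not a confirmed implementation: it assumes "added" means element-wise addition of the predicted background probability to the primary target before clipping. The function name is hypothetical.

```python
def add_pseudo_labels(primary, background_pred, threshold=0.15):
    """Merge confident background predictions into the primary targets as soft labels."""
    merged = []
    for y, p in zip(primary, background_pred):
        if p > threshold:
            y = y + p  # add the predicted probability as a soft label (assumed meaning of "added")
        merged.append(min(1.0, max(0.0, y)))  # clip to [0, 1]
    return merged


# One sample, three classes: class 0 is the primary label, class 1 is a confident background call.
targets = add_pseudo_labels([1.0, 0.0, 0.0], [0.5, 0.3, 0.1])
```

Here the primary class stays at 1.0 after clipping, the second class receives a soft label of 0.3, and the third stays at 0.0 because its prediction does not exceed the threshold.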
    primary branch
    background branch
    PANNs 1D
    1D Model
    Blending
    Public: 36th (0.623)
    Private: 39th (0.580)
    *3
    • Class-Wise Blending
    - For each bird class
    - Optimize blend coefficients with BCE loss
    • Class-Wise threshold optimization
    - For each bird class
    - Maximize macro-F1 (not sample-wise)
    • 3-epoch ensemble for PANNs 1D
    • 5-fold ensemble for ResNet18, PANNs 1D
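The class-wise blending and threshold steps listed above can be illustrated for a single bird class. The slides do not say which optimizer was used, so this sketch assumes a simple grid search over one blend weight for two models and over candidate thresholds. Because macro-F1 is the mean of per-class F1 scores, maximizing F1 for each class separately also maximizes macro-F1. All function names are hypothetical.

```python
import math


def bce(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy for one class."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(1.0 - eps, max(eps, p))  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def best_blend_weight(y_true, prob_a, prob_b, steps=20):
    """Grid-search w in [0, 1] minimizing the BCE of w * prob_a + (1 - w) * prob_b."""
    def loss(w):
        blended = [w * a + (1 - w) * b for a, b in zip(prob_a, prob_b)]
        return bce(y_true, blended)
    return min((i / steps for i in range(steps + 1)), key=loss)


def f1(y_true, y_pred):
    """F1 score for one class from binary labels and binary predictions."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, y_prob, steps=20):
    """Grid-search the decision threshold that maximizes F1 for one class."""
    def score(t):
        return f1(y_true, [1 if p >= t else 0 for p in y_prob])
    return max((i / steps for i in range(1, steps)), key=score)
```

In a full pipeline these two searches would run once per class on out-of-fold predictions, giving 264 blend weights and 264 thresholds.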
    Model Architecture
    *4 mean on time axis,
    max and mean on freq axis
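Footnote *4 describes the pooling used by the 1D models. The sketch below follows the axis order stated in the footnote for a single channel stored as `[freq][time]`. The slide does not say how the max and mean results are combined; summing them is an assumption here, and the function name is hypothetical.

```python
def map_pool(feature_map):
    """Pool one channel of shape [freq][time] to a single value.

    Step 1: mean over the time axis, giving one value per frequency bin.
    Step 2: max plus mean over the frequency axis (the sum is an assumption).
    """
    per_freq = [sum(row) / len(row) for row in feature_map]
    return max(per_freq) + sum(per_freq) / len(per_freq)


# Two frequency bins, two time steps.
pooled = map_pool([[1.0, 3.0], [0.0, 2.0]])
```

For this toy input the per-bin time means are 2.0 and 1.0, so the pooled value is 2.0 + 1.5 = 3.5.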

  2. 1 Event Aware Extraction
    Example: XC341516.mp3 (ebird_code: brespa, 60 sec)
    [Figure: waveform, logmel, and logmel freq-wise max of the example, with "brespa" and "bug" regions labeled]
    Naïve Random Extraction (5 sec)
    nocall, no event => mislabeling
    logmel freq-wise max aware extraction
    noisy event / call => sometimes works, other times fails.
    logmel freq-wise max aware extraction with removing lower frequencies
    • Using the logmel spectrum, extract audio chunks whose frequency-wise max intensity exceeds the threshold.
    • Because almost all bird call frequencies range from 1 to 8 kHz, removing lower frequencies (up to ~300 Hz) yields better chunks for training.
    • Variable extraction length (5 - 10 sec) also helps to capture a good signal, and 2D models generalize well to variable length.
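The extraction idea in the bullets above can be sketched on a logmel array stored as `[mel][time]` with the lowest mel bins first. This is one possible reading, not the author's code: it assumes the chunk is centered on the loudest frame of the frequency-wise max envelope and that low frequencies are removed by dropping the first few mel bins. The function names and the `drop_low_bins` parameter are hypothetical.

```python
def freqwise_max(logmel, drop_low_bins=0):
    """Max over mel bins for every frame, ignoring the lowest `drop_low_bins` bins."""
    kept = logmel[drop_low_bins:]
    n_frames = len(kept[0])
    return [max(row[t] for row in kept) for t in range(n_frames)]


def event_chunk(logmel, threshold, chunk_frames, drop_low_bins=0):
    """Return (start, end) frame indices of a chunk around the loudest frame.

    Returns None when no frame exceeds the threshold, i.e. no event was found.
    """
    envelope = freqwise_max(logmel, drop_low_bins)
    n_frames = len(envelope)
    peak = max(range(n_frames), key=envelope.__getitem__)
    if envelope[peak] <= threshold:
        return None
    # Center the chunk on the peak, then shift it to stay inside the clip.
    start = min(max(0, peak - chunk_frames // 2), max(0, n_frames - chunk_frames))
    return start, min(n_frames, start + chunk_frames)


# Toy clip: loud low-frequency noise at frame 1, a weaker call at frame 7.
toy = [
    [0, 9, 0, 0, 0, 0, 0, 0, 0, 0],  # lowest mel bin (noise)
    [0, 0, 0, 0, 0, 0, 0, 5, 0, 0],  # call band
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]
naive = event_chunk(toy, threshold=3, chunk_frames=4)
filtered = event_chunk(toy, threshold=3, chunk_frames=4, drop_low_bins=1)
```

Without dropping the lowest bin the chunk locks onto the noise at the start of the clip. With it removed, the chunk covers the call, which mirrors the comparison made on the slide.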