A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network

Hendrik Schreiber
September 24, 2018

We present a musical tempo estimation system based solely on a convolutional neural network (CNN). Contrary to existing systems, our system estimates the tempo directly from a conventional mel-spectrogram in a single step. This is achieved by framing tempo estimation as a multi-class classification problem using a network architecture that is inspired by conventional approaches. The system’s CNN has been trained with the union of three datasets covering a large variety of genres and tempi using problem-specific data augmentation techniques. As input, the system requires only 11.9 s of audio and is therefore suitable for local as well as global tempo estimation. When used as a global estimator, it performs as well as or better than other state-of-the-art algorithms. In particular, the exact estimation of tempo without tempo octave confusion is significantly improved. As a local estimator, it can be used to identify and visualize tempo drift in musical performances.
https://www.youtube.com/watch?v=w-fsuRbAVuo&t=1h21m55s

Transcript

  1. INTERNATIONAL AUDIO LABORATORIES ERLANGEN
    A joint institution of Fraunhofer IIS and Universität Erlangen-Nürnberg
    A Single-Step Approach to Musical Tempo Estimation
    Using a Convolutional Neural Network
    Hendrik Schreiber
    tagtraum industries incorporated
    [email protected] / @h_schreiber

    Meinard Müller
    International AudioLabs Erlangen
    [email protected]

  2. Tempo Estimation System
    Traditional System: Spectrogram → Onset/Beat Detection → Tempo Estimation

  3. Tempo Estimation System
    Proposed System: Mel-Spectrogram → CNN-based Tempo Classification
    “eliminating the middle-man”
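A minimal sketch of what treating tempo estimation as multi-class classification can look like: each of the 256 network outputs (see "Output, 256" on the architecture slide) stands for one tempo class, and the estimate is the class with the highest softmax probability. The mapping of class 0 to 30 BPM in steps of 1 BPM (`MIN_BPM`) and all function names are assumptions made for illustration; they are not stated on the slides.

```python
import math

N_CLASSES = 256  # from the architecture slide: "Output, 256"
MIN_BPM = 30     # assumption: class 0 stands for 30 BPM, one class per integer BPM


def bpm_to_class(bpm):
    """Map a tempo annotation in BPM to a class index (training label)."""
    idx = round(bpm) - MIN_BPM
    if not 0 <= idx < N_CLASSES:
        raise ValueError(f"{bpm} BPM is outside the supported tempo range")
    return idx


def class_to_bpm(idx):
    """Map a class index back to a tempo in BPM."""
    return idx + MIN_BPM


def softmax(logits):
    """Numerically stable softmax over a list of raw network outputs."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_bpm(logits):
    """Pick the tempo class with the highest probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_to_bpm(best)
```

Under these assumptions the classifier covers 30 to 285 BPM, and no onset or beat detection stage is needed between the spectrogram and the tempo value.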

  4. CNN Architecture
    Input, 1x40x256: mel-spectrogram
    Short pattern matching: short filters along time axis
    Long pattern matching: four mf_mod modules with avg-pooling 5x1, 2x1, 2x1 and 2x1 (stepwise pooling along the frequency axis, long filters along time axis)
    Classification: dense layers, softmax
    Output, 256
    Legend: Batch Normalization, Convolutional Layer, Fully Connected Layer, DO = Dropout
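The input and pooling sizes on this slide can be checked with a little arithmetic. The sketch below assumes a sample rate of 11,025 Hz and a hop of 512 samples between spectrogram frames; neither value is stated on the slides. Under these assumptions, 256 frames cover roughly the 11.9 s quoted in the abstract, and the four pooling steps reduce the 40 mel bands to a single band while leaving the time axis untouched.

```python
SAMPLE_RATE = 11025          # Hz; assumption, not stated on the slides
HOP_LENGTH = 512             # samples between frames; assumption
N_MELS, N_FRAMES = 40, 256   # from the slide: "Input, 1x40x256"
POOL_SIZES = (5, 2, 2, 2)    # from the slide: avg-pooling 5x1, 2x1, 2x1, 2x1


def window_seconds(n_frames=N_FRAMES, hop=HOP_LENGTH, sample_rate=SAMPLE_RATE):
    """Duration of audio covered by one network input window."""
    return n_frames * hop / sample_rate


def freq_bins_after_pooling(n_bins=N_MELS, pools=POOL_SIZES):
    """Number of frequency bins before and after each pooling step."""
    bins = [n_bins]
    for pool in pools:
        if bins[-1] % pool:
            raise ValueError(f"{bins[-1]} bins are not divisible by {pool}")
        bins.append(bins[-1] // pool)
    return bins


print(f"{window_seconds():.1f} s per window")
print(freq_bins_after_pooling())
```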

  5. CNN Architecture
    Architecture diagram as on slide 4, plus a detailed view of mf_mod:
    previous layer → avg-pooling → BN → parallel CONV 24x1x32, 24x1x64, 24x1x96, 24x1x128, 24x1x192, 24x1x256 → concatenation → CONV 36x1x1 → next layer
    Inspired by Pons and Serra:
    Pons, Jordi, and Xavier Serra. "Designing efficient architectures for modeling temporal features with convolutional neural networks." IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, USA, March 2017.

  6. CNN Architecture
    Same diagram and mf_mod detail as slide 5 (inspired by Pons and Serra), with two annotations:
    avg-pooling: pooling along frequency axis
    CONV layers: convolution along time axis
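The mf_mod building block shown on slides 5 and 6 can be traced at the level of tensor shapes with a small NumPy sketch. It reads "24x1x32" as 24 filters spanning 1 frequency bin and 32 frames, which is an interpretation of the slide notation. The weights are random, activations are omitted, batch normalization is replaced by a per-channel normalization of a single example, and the 4 input channels in the usage note are hypothetical, so this illustrates the data flow only and is not the trained model.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

FILTER_LENGTHS = (32, 64, 96, 128, 192, 256)  # from the slide: 24x1x32 ... 24x1x256
N_FILTERS = 24                                # filters per parallel branch
N_BOTTLENECK = 36                             # from the slide: CONV 36x1x1


def avg_pool_freq(x, pool):
    """Average-pool along the frequency axis; x has shape (channels, freq, time)."""
    c, f, t = x.shape
    return x.reshape(c, f // pool, pool, t).mean(axis=2)


def conv_time_same(x, w):
    """Zero-padded convolution along time that keeps the number of frames.

    x: (c_in, freq, time), w: (c_out, c_in, length) -> (c_out, freq, time)
    """
    length = w.shape[2]
    left = (length - 1) // 2
    padded = np.pad(x, ((0, 0), (0, 0), (left, length - 1 - left)))
    windows = sliding_window_view(padded, length, axis=2)  # (c_in, freq, time, length)
    out = np.tensordot(windows, w, axes=([0, 3], [1, 2]))  # (freq, time, c_out)
    return np.moveaxis(out, -1, 0)


def mf_mod(x, pool, rng):
    """avg-pooling -> normalization -> parallel convs -> concatenation -> 1x1 conv."""
    x = avg_pool_freq(x, pool)
    # Stand-in for batch normalization: normalize each channel of this one example.
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    x = (x - mean) / (std + 1e-5)
    branches = [
        conv_time_same(x, rng.standard_normal((N_FILTERS, x.shape[0], n)))
        for n in FILTER_LENGTHS
    ]
    stacked = np.concatenate(branches, axis=0)  # 6 branches * 24 filters = 144 channels
    bottleneck = rng.standard_normal((N_BOTTLENECK, stacked.shape[0], 1))
    return conv_time_same(stacked, bottleneck)  # back down to 36 channels
```

For example, `mf_mod(np.random.default_rng(0).standard_normal((4, 40, 256)), pool=5, rng=np.random.default_rng(1))` returns an array of shape `(36, 8, 256)`: pooling shrinks the frequency axis from 40 to 8 bins, while the convolutions run along the 256 frames without shortening them.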

  7. Training Datasets
    • LMD Tempo—derived from Lakh MIDI Dataset [1]
    • MTG Tempo—tempo annotations for MTG Key [2]
    • Eball—Extended Ballroom [3] sans Ballroom
    [1] Colin Raffel. Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching. PhD Thesis, 2016.
    [2] Ángel Faraldo, Sergi Jordà, and Perfecto Herrera. A multi-profile method for key estimation in EDM. In Proceedings of the AES International Conference on Semantic Audio, Erlangen, Germany, June 2017.
    [3] Ugo Marchand and Geoffroy Peeters. The extended ballroom dataset. In Late Breaking Demo of the International Conference on Music Information Retrieval (ISMIR), New York, NY, USA, 2016.
    NEW!
    NEW!
    Combined: 8,596 samples

  8. Scale & Crop Data Augmentation
    Requires label adjustment!
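One way to picture scale and crop augmentation, and why it requires label adjustment: stretching a spectrogram along the time axis by a factor makes the music slower by the same factor, so the tempo label has to be divided by it. The sketch below uses linear interpolation for the scaling and a random crop position; both choices, as well as the function name and parameters, are assumptions and not details given on the slide.

```python
import numpy as np


def scale_and_crop(spec, bpm, factor, n_frames=256, rng=None):
    """Stretch a mel-spectrogram along time, crop a window, and adjust the label.

    spec:   array of shape (n_mels, n_frames_total)
    factor: > 1 stretches the time axis (slower music), < 1 compresses it (faster)
    Returns (cropped spectrogram of shape (n_mels, n_frames), adjusted BPM).
    """
    if rng is None:
        rng = np.random.default_rng()
    n_in = spec.shape[1]
    n_out = int(round(n_in * factor))
    if n_out < n_frames:
        raise ValueError("scaled spectrogram is shorter than the crop length")
    # Resample every mel band onto the new, stretched or compressed time grid.
    positions = np.linspace(0, n_in - 1, n_out)
    frames = np.arange(n_in)
    scaled = np.stack([np.interp(positions, frames, band) for band in spec])
    # Crop a window of the network's input length at a random position.
    start = int(rng.integers(0, n_out - n_frames + 1))
    crop = scaled[:, start:start + n_frames]
    # The same beats now span `factor` times as many frames, so the tempo drops.
    return crop, bpm / factor
```

With `factor=1.2`, a 120 BPM excerpt becomes a 100 BPM training sample; with `factor=0.8` it becomes a 150 BPM sample.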

  9. Local Tempo Estimation
    “Honky Tonk Women” by The Rolling Stones

  10. Local Tempo Estimation
    “Honky Tonk Women” by The Rolling Stones

  11. Local Tempo Estimation
    “Typhoon” by Foreign Beggars/Chasing Shadows

  12. Local Tempo Estimation
    “Typhoon” by Foreign Beggars/Chasing Shadows
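Because the network only needs one short window of audio, local tempo estimates can be obtained by sliding that window across a track, which is how tempo drift becomes visible. The sketch below is one possible way to organize this. `predict_proba` is a hypothetical callable standing in for the trained CNN, the half-overlapping hop of 128 frames and the 30 BPM class offset are assumptions, and deriving a global tempo by averaging the per-window class probabilities is likewise an assumption, not something the slides specify.

```python
import numpy as np

N_FRAMES = 256   # from the architecture slide: "Input, 1x40x256"
MIN_BPM = 30     # assumption: class 0 stands for 30 BPM


def estimate_tempi(spec, predict_proba, hop=128):
    """Return (local BPM per window, global BPM) for a mel-spectrogram.

    spec:          array of shape (n_mels, n_frames_total)
    predict_proba: hypothetical model; maps an (n_mels, N_FRAMES) window
                   to a vector of 256 class probabilities
    """
    starts = range(0, spec.shape[1] - N_FRAMES + 1, hop)
    probs = np.array([predict_proba(spec[:, s:s + N_FRAMES]) for s in starts])
    if probs.size == 0:
        raise ValueError("spectrogram is shorter than one input window")
    # Local estimate: most probable tempo class in each window.
    local_bpms = (probs.argmax(axis=1) + MIN_BPM).tolist()
    # Global estimate: most probable class after averaging over all windows.
    global_bpm = int(probs.mean(axis=0).argmax()) + MIN_BPM
    return local_bpms, global_bpm
```

Plotting the local estimates against the window start times gives a tempo curve of the kind shown for the two example tracks.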

  13. Global Tempo Estimation
    On ACM Mirum, ISMIR04, Ballroom, Hainsworth, GTZAN, SMC, GiantSteps
    Combined: 74.2% (up from 69.5%)
    Dataset Avg: 69.3% (up from 66.7%)
    Accuracy1

  14. Global Tempo Estimation
    On ACM Mirum, ISMIR04, Ballroom, Hainsworth, GTZAN, SMC, GiantSteps
    Combined: 92.1% (down from 93.6%)
    Dataset Avg: 86.4% (down from 89.9%)
    Accuracy2
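For readers unfamiliar with the two metrics, the sketch below follows their usual definitions in the tempo estimation literature: Accuracy1 counts an estimate as correct if it lies within 4% of the annotated tempo, and Accuracy2 additionally accepts estimates within 4% of the annotation multiplied by 2, 3, 1/2 or 1/3 (so-called octave errors). The slides do not spell these definitions out, so the tolerance and the factor set here are assumptions based on common practice, and the function names are illustrative.

```python
OCTAVE_FACTORS = (1, 2, 3, 1 / 2, 1 / 3)  # assumption: commonly accepted tempo multiples


def within_tolerance(estimate, reference, tolerance=0.04):
    """True if the estimate is within the relative tolerance of the reference."""
    return abs(estimate - reference) <= tolerance * reference


def accuracy1(estimates, references, tolerance=0.04):
    """Fraction of estimates within the tolerance of the annotated tempo."""
    hits = sum(
        within_tolerance(e, r, tolerance) for e, r in zip(estimates, references)
    )
    return hits / len(references)


def accuracy2(estimates, references, tolerance=0.04):
    """Like accuracy1, but octave errors (factors 2, 3, 1/2, 1/3) also count."""
    hits = sum(
        any(within_tolerance(e, r * f, tolerance) for f in OCTAVE_FACTORS)
        for e, r in zip(estimates, references)
    )
    return hits / len(references)
```

For estimates of 120, 60 and 100 BPM against an annotation of 120 BPM each, Accuracy1 is 1/3 (only the exact match counts) and Accuracy2 is 2/3 (the half-tempo estimate is forgiven). This is why a system that reduces octave errors gains more on Accuracy1 than on Accuracy2.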

  15. Summary

  16. Summary
    • Consolidates multi-component approach
    • Completely data-driven, no heuristics
    • Fewer octave-errors
    • Suitable for global and local tempo estimation

  17. Thank you.
    New datasets are available at:
    http://www.tagtraum.com/tempo_estimation.html
    Tempo estimation code is available at:
    https://github.com/hendriks73/tempo-cnn
