A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network

We present a musical tempo estimation system based solely on a convolutional neural network (CNN). In contrast to existing systems, it estimates the tempo directly from a conventional mel-spectrogram in a single step. This is achieved by framing tempo estimation as a multi-class classification problem and using a network architecture inspired by conventional approaches. The system's CNN was trained on the union of three datasets covering a large variety of genres and tempi, using problem-specific data augmentation techniques. The system requires only 11.9 s of audio as input and is therefore suitable for local as well as global tempo estimation. When used as a global estimator, it performs as well as or better than other state-of-the-art algorithms. In particular, exact tempo estimation without tempo octave confusion is significantly improved. As a local estimator, it can be used to identify and visualize tempo drift in musical performances.
https://www.youtube.com/watch?v=w-fsuRbAVuo&t=1h21m55s

Hendrik Schreiber

September 24, 2018
Transcript

  1. INTERNATIONAL AUDIO LABORATORIES ERLANGEN, a joint institution of Fraunhofer IIS and Universität Erlangen-Nürnberg
    A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network
    Hendrik Schreiber, tagtraum industries incorporated, hs@tagtraum.com / @h_schreiber
    Meinard Müller, International AudioLabs Erlangen, meinard.mueller@audiolabs-erlangen.de
  2. Tempo Estimation System (Traditional System): Spectrogram → Onset/Beat Detection → Tempo Estimation
  3. Tempo Estimation System (Proposed System): Mel-Spectrogram → CNN-based Tempo Classification, “eliminating the middle-man”
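Treating tempo as a classification target means the network emits one score per tempo class, and the estimate is the class with the highest probability. The following is a minimal sketch of that decoding step only. The 256 classes come from the architecture slide; the mapping of class 0 to 30 BPM in 1 BPM steps is an assumption for illustration and is not stated on the slides.

```python
import math

NUM_CLASSES = 256   # from the architecture slide: "Output, 256"
MIN_BPM = 30        # ASSUMPTION: tempo represented by class 0 (not in the slides)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_to_bpm(index, min_bpm=MIN_BPM):
    """Map a class index to a tempo, assuming 1 BPM per class."""
    return min_bpm + index


def predict_bpm(logits):
    """Return (tempo in BPM, probability) of the most likely class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_to_bpm(best), probs[best]
```

Under the assumed mapping, a score vector that peaks at index 90 decodes to 120 BPM.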
  4. CNN Architecture
    Input: mel-spectrogram, 1x40x256
    Short pattern matching: short filters along the time axis
    Long pattern matching: four mf_mod modules with avg-pooling 5x1, 2x1, 2x1, 2x1 (stepwise pooling along the frequency axis, long filters along the time axis)
    Classification: dense layers, softmax
    Output: 256
    Legend: Batch Normalization, Convolutional Layer, Fully Connected Layer, DO = Dropout
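The stepwise pooling can be checked with simple shape arithmetic: four pooling stages of 5, 2, 2 and 2 along the frequency axis reduce the 40 mel bands to a single band, while the 256 time frames are left alone. The sketch below assumes non-overlapping pooling (stride equal to pool size) and convolutions that preserve the time length; neither detail is stated on the slide. It also derives the frame duration implied by 11.9 s of audio spread over 256 frames.

```python
INPUT_SHAPE = (1, 40, 256)      # (channels, mel bands, time frames), from the slide
FREQ_POOL_SIZES = [5, 2, 2, 2]  # avg-pooling 5x1, 2x1, 2x1, 2x1
WINDOW_SECONDS = 11.9           # input length given in the abstract


def pooled_shapes(shape, pool_sizes):
    """Return (bands, frames) before pooling and after each pooling stage.

    ASSUMPTION: stride equals pool size and the time axis is never shortened.
    """
    _channels, bands, frames = shape
    shapes = [(bands, frames)]
    for size in pool_sizes:
        bands //= size
        shapes.append((bands, frames))
    return shapes


def seconds_per_frame(window_seconds, frames):
    """Duration covered by one spectrogram frame."""
    return window_seconds / frames


if __name__ == "__main__":
    for bands, frames in pooled_shapes(INPUT_SHAPE, FREQ_POOL_SIZES):
        print(f"{bands:>2} bands x {frames} frames")
    print(f"{seconds_per_frame(WINDOW_SECONDS, INPUT_SHAPE[2]) * 1000:.1f} ms per frame")
```

Because the frequency axis collapses to one band by the last stage, the later filters only have to match patterns along time.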
  5. CNN Architecture: multi-filter module (mf_mod), inspired by Pons and Serra
    previous layer → avg-pooling → BN → CONV 24x1x32, CONV 24x1x64, CONV 24x1x96, CONV 24x1x128, CONV 24x1x192, CONV 24x1x256 → concatenation → CONV 36x1x1 → next layer
    Pons, Jordi, and Xavier Serra. "Designing efficient architectures for modeling temporal features with convolutional neural networks." IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, USA, March 2017.
  6. CNN Architecture: multi-filter module (mf_mod) as on slide 5, annotated: pooling along the frequency axis, convolution along the time axis
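The bookkeeping for one mf_mod can be sketched as follows. It reads "CONV 24x1x32" as 24 filters of shape 1x32 (one frequency bin by 32 time frames) and "CONV 36x1x1" as 36 filters of shape 1x1; this reading is an assumption, as is the number of input channels used in the usage note. Bias terms and batch normalization parameters are left out.

```python
BRANCH_FILTERS = 24                           # filters per parallel convolution
BRANCH_LENGTHS = [32, 64, 96, 128, 192, 256]  # filter lengths along the time axis
BOTTLENECK_FILTERS = 36                       # "CONV 36x1x1" after the concatenation


def mf_mod_summary(in_channels):
    """Channel counts and convolution weights of one mf_mod (biases excluded).

    ASSUMPTION: "CONV NxHxW" denotes N filters of shape HxW.
    """
    branch_weights = [
        BRANCH_FILTERS * in_channels * length for length in BRANCH_LENGTHS
    ]
    concat_channels = BRANCH_FILTERS * len(BRANCH_LENGTHS)
    bottleneck_weights = BOTTLENECK_FILTERS * concat_channels
    return {
        "concat_channels": concat_channels,
        "out_channels": BOTTLENECK_FILTERS,
        "weights": sum(branch_weights) + bottleneck_weights,
    }
```

With a hypothetical 36 input channels, `mf_mod_summary(36)` reports 144 concatenated channels that the 1x1 convolution reduces back to 36, which keeps the input of the next module small despite the six parallel branches.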
  7. Training Datasets
    • LMD Tempo (NEW!): derived from Lakh MIDI Dataset [1]
    • MTG Tempo (NEW!): tempo annotations for MTG Key [2]
    • Eball: Extended Ballroom [3] sans Ballroom
    Combined: 8,596 samples
    [1] Colin Raffel. Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching. PhD Thesis, 2016.
    [2] Ángel Faraldo, Sergi Jordà, and Perfecto Herrera. A multi-profile method for key estimation in EDM. In Proceedings of the AES International Conference on Semantic Audio, Erlangen, Germany, June 2017.
    [3] Ugo Marchand and Geoffroy Peeters. The extended ballroom dataset. In Late Breaking Demo of the International Conference on Music Information Retrieval (ISMIR), New York, NY, USA, 2016.
  8. Data Augmentation: Scale & Crop. Requires label adjustment!
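The label adjustment follows from the scaling itself: stretching a spectrogram along the time axis by a factor makes the music slower by the same factor, so the tempo label has to be divided by it. The sketch below uses linear interpolation between neighbouring frames, which is an assumption (the slide does not name the scaling method), and all function and parameter names are hypothetical.

```python
def scale_time_axis(frames, factor):
    """Resample a list of frames (each a list of mel values) along time.

    factor > 1 stretches the excerpt (slower), factor < 1 compresses it (faster).
    ASSUMPTION: linear interpolation; the talk does not specify the method.
    """
    if factor <= 0 or len(frames) < 2:
        raise ValueError("need a positive factor and at least two frames")
    new_len = max(2, round(len(frames) * factor))
    step = (len(frames) - 1) / (new_len - 1)
    scaled = []
    for i in range(new_len):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, len(frames) - 1)
        w = pos - lo
        scaled.append([(1 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return scaled


def scale_and_crop(frames, bpm, factor, length, offset=0):
    """Scale along time, crop `length` frames, and adjust the tempo label."""
    scaled = scale_time_axis(frames, factor)
    if offset < 0 or offset + length > len(scaled):
        raise ValueError("crop window does not fit into the scaled excerpt")
    return scaled[offset:offset + length], bpm / factor
```

For example, a 120 BPM excerpt scaled by 1.2 and cropped to 256 frames yields a training sample labeled 100 BPM.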
  9. Local Tempo Estimation: “Honky Tonk Women” by The Rolling Stones
  10. Local Tempo Estimation: “Honky Tonk Women” by The Rolling Stones
  11. Local Tempo Estimation: “Typhoon” by Foreign Beggars/Chasing Shadows
  12. Local Tempo Estimation: “Typhoon” by Foreign Beggars/Chasing Shadows
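Since the network only needs 256 frames (11.9 s) per estimate, a local tempo curve such as the ones on these slides can be obtained by sliding that window across a track. The following is a sketch under assumptions: `estimate_bpm` is a hypothetical callable standing in for the trained CNN, and the hop size of 32 frames is arbitrary.

```python
WINDOW_FRAMES = 256    # time frames per network input, from the architecture slide
WINDOW_SECONDS = 11.9  # duration of one input window, from the abstract


def local_tempo_curve(frames, estimate_bpm, hop_frames=32):
    """Return a list of (window center in seconds, estimated BPM) pairs.

    `estimate_bpm` is a HYPOTHETICAL stand-in for the trained model: it takes
    WINDOW_FRAMES frames and returns a tempo in BPM.
    """
    seconds_per_frame = WINDOW_SECONDS / WINDOW_FRAMES
    curve = []
    for start in range(0, len(frames) - WINDOW_FRAMES + 1, hop_frames):
        window = frames[start:start + WINDOW_FRAMES]
        center = (start + WINDOW_FRAMES / 2) * seconds_per_frame
        curve.append((center, estimate_bpm(window)))
    return curve
```

Plotting the resulting pairs over time makes tempo drift visible; a smaller hop gives a denser curve at the cost of more model evaluations.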
  13. Global Tempo Estimation (Accuracy1) on ACM Mirum, ISMIR04, Ballroom, Hainsworth, GTZAN, SMC, GiantSteps. Combined: 74.2% (up from 69.5%). Dataset Avg: 69.3% (up from 66.7%).
  14. Global Tempo Estimation (Accuracy2) on ACM Mirum, ISMIR04, Ballroom, Hainsworth, GTZAN, SMC, GiantSteps. Combined: 92.1% (down from 93.6%). Dataset Avg: 86.4% (down from 89.9%).
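The slides do not define the two metrics. The sketch below uses the definitions that are common in tempo estimation evaluations, which is an assumption about what the slides report: Accuracy1 counts an estimate as correct if it lies within 4% of the reference tempo, and Accuracy2 additionally accepts estimates within 4% of the reference multiplied by 2, 3, 1/2 or 1/3 (so-called octave errors).

```python
TOLERANCE = 0.04                                       # ASSUMPTION: 4% tolerance
OCTAVE_FACTORS = (1.0, 2.0, 3.0, 1.0 / 2.0, 1.0 / 3.0)  # ASSUMPTION: allowed factors


def is_correct(estimate, reference, factors=(1.0,), tolerance=TOLERANCE):
    """True if the estimate is within tolerance of any scaled reference."""
    return any(
        abs(estimate - reference * f) <= tolerance * reference * f for f in factors
    )


def accuracy1(estimates, references):
    """Fraction of estimates within tolerance of the reference tempo."""
    if not references or len(estimates) != len(references):
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(is_correct(e, r) for e, r in zip(estimates, references))
    return hits / len(references)


def accuracy2(estimates, references):
    """Like accuracy1, but octave errors are also counted as correct."""
    if not references or len(estimates) != len(references):
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(is_correct(e, r, OCTAVE_FACTORS) for e, r in zip(estimates, references))
    return hits / len(references)
```

Under these definitions, a system that halves or doubles the tempo less often gains Accuracy1 without necessarily gaining Accuracy2, which matches the pattern on the two result slides.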
  15. Summary
  16. Summary
    • Consolidates multi-component approach
    • Completely data-driven, no heuristics
    • Fewer octave errors
    • Suitable for global and local tempo estimation
  17. Thank you.
    New datasets are available at: http://www.tagtraum.com/tempo_estimation.html
    Tempo estimation code is available at: https://github.com/hendriks73/tempo-cnn