A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network

INTERNATIONAL AUDIO LABORATORIES ERLANGEN
A joint institution of Fraunhofer IIS and Universität Erlangen-Nürnberg

Hendrik Schreiber, tagtraum industries incorporated, [email protected] / @h_schreiber
Meinard Müller, International AudioLabs Erlangen, [email protected]

Tempo Estimation System
Traditional System: Spectrogram → Onset/Beat Detection → Tempo Estimation

Tempo Estimation System
Proposed System: Mel-Spectrogram → CNN-based Tempo Classification
“eliminating the middle-man”

CNN Architecture
• Input: mel-spectrogram, 1x40x256
• Short pattern matching: short filters along time axis
• Long pattern matching: four mf_mod modules with avg-pooling 5x1, 2x1, 2x1, 2x1 (stepwise pooling along the frequency axis, long filters along time axis)
• Classification: dense layers, softmax
• Output: 256
Legend: BN = Batch Normalization, CONV = Convolutional Layer, Fully Connected Layer, DO = Dropout

mf_mod (inspired by Pons and Serra)
• previous layer → avg-pooling → BN
• parallel: CONV 24x1x32, CONV 24x1x64, CONV 24x1x96, CONV 24x1x128, CONV 24x1x192, CONV 24x1x256
• concatenation → CONV 36x1x1 → next layer
• Pooling along frequency axis, convolution along time axis

Pons, Jordi, and Xavier Serra. "Designing efficient architectures for modeling temporal features with convolutional neural networks." IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, USA, March 2017.
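The stepwise pooling along the frequency axis can be checked with a little arithmetic: pool sizes 5, 2, 2, 2 reduce the 40 mel bands to a single band while the 256 time frames stay untouched. The sketch below applies only the pooling steps with NumPy. The helper name `avg_pool_freq` is made up for illustration, the convolutions between the pooling steps are left out, and it is an assumption that those convolutions keep the time axis at 256 frames.

```python
import numpy as np

def avg_pool_freq(x, size):
    """Average-pool a (freq, time) array along the frequency axis only."""
    freq, time = x.shape
    if freq % size != 0:
        raise ValueError("number of frequency bins must be divisible by size")
    # Group `size` neighbouring frequency rows and average each group.
    return x.reshape(freq // size, size, time).mean(axis=1)

# Input shape from the slide: 40 mel bands x 256 frames (channel axis omitted).
x = np.random.default_rng(0).random((40, 256))

shapes = [x.shape]
for size in (5, 2, 2, 2):  # avg-pooling 5x1, 2x1, 2x1, 2x1
    x = avg_pool_freq(x, size)
    shapes.append(x.shape)

print(shapes)
# [(40, 256), (8, 256), (4, 256), (2, 256), (1, 256)]
```

After the last pooling step only the time axis carries information, which fits the slide's description of long filters along the time axis.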

Training Datasets
• LMD Tempo (NEW!): derived from Lakh MIDI Dataset [1]
• MTG Tempo (NEW!): tempo annotations for MTG Key [2]
• Eball: Extended Ballroom [3] sans Ballroom
Combined: 8,596 samples

[1] Colin Raffel. Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching. PhD thesis, 2016.
[2] Ángel Faraldo, Sergi Jordà, and Perfecto Herrera. A multi-profile method for key estimation in EDM. In Proceedings of the AES International Conference on Semantic Audio, Erlangen, Germany, June 2017.
[3] Ugo Marchand and Geoffroy Peeters. The extended ballroom dataset. In Late Breaking Demo of the International Conference on Music Information Retrieval (ISMIR), New York, NY, USA, 2016.

Scale & Crop Data Augmentation
Requires label adjustment!
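Why the label needs adjusting is easiest to see in code. The following is a minimal sketch, not the authors' implementation: it stretches a spectrogram along the time axis by linear interpolation, crops a random 256-frame window, and rescales the tempo label. The function name, the interpolation method, and the convention that `factor > 1` means "stretched, therefore slower" are all assumptions made for illustration.

```python
import numpy as np

def scale_and_crop(spec, bpm, factor, n_frames=256, rng=None):
    """Stretch `spec` (mel bands x frames) along time by `factor`, crop a
    random window of `n_frames`, and return (crop, adjusted tempo label)."""
    if rng is None:
        rng = np.random.default_rng()
    n_in = spec.shape[1]
    n_out = int(round(n_in * factor))
    if n_out < n_frames:
        raise ValueError("scaled spectrogram is shorter than the crop length")
    # Linearly interpolate every mel band onto the new, scaled time grid.
    positions = np.linspace(0, n_in - 1, n_out)
    frames = np.arange(n_in)
    scaled = np.stack([np.interp(positions, frames, band) for band in spec])
    # Pick a random crop of the network's input length.
    start = int(rng.integers(0, n_out - n_frames + 1))
    crop = scaled[:, start:start + n_frames]
    # Stretching by `factor` moves beats further apart, so the tempo
    # label shrinks by the same factor.
    return crop, bpm / factor

# Hypothetical example: a 120 BPM excerpt stretched by 20%.
spec = np.random.default_rng(0).random((40, 400))
crop, new_bpm = scale_and_crop(spec, 120.0, 1.2, rng=np.random.default_rng(1))
print(crop.shape, round(new_bpm, 1))
```

With these inputs the crop has the network's 40x256 input shape and the label moves from 120 to about 100 BPM. Training on the scaled crop with the old label would teach the network the wrong tempo.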

Local Tempo Estimation
“Honky Tonk Women” by The Rolling Stones

Local Tempo Estimation
“Typhoon” by Foreign Beggars/Chasing Shadows
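Because the network classifies a short spectrogram window, a local tempo curve can be read off by classifying successive windows, and a global tempo can be derived from the same outputs. The sketch below shows one plausible way to do both, given softmax outputs that are already computed. The class-to-BPM mapping (`MIN_BPM = 30`, one class per BPM), the function names, and the choice to average activations for the global estimate are assumptions, not taken from the slides.

```python
import numpy as np

MIN_BPM = 30  # assumption: class k stands for (MIN_BPM + k) BPM

def local_tempi(window_probs, min_bpm=MIN_BPM):
    """Most likely tempo in BPM for each row of a (windows x classes) array."""
    return min_bpm + np.argmax(window_probs, axis=1)

def global_tempo(window_probs, min_bpm=MIN_BPM):
    """One tempo for the whole track: average the class activations over
    all windows, then pick the strongest class."""
    return min_bpm + int(np.argmax(window_probs.mean(axis=0)))

# Hypothetical softmax outputs for three windows and 256 tempo classes:
# two windows favour 120 BPM, one window favours 60 BPM.
probs = np.full((3, 256), 0.1 / 255)
probs[0, 120 - MIN_BPM] = 0.9
probs[1, 120 - MIN_BPM] = 0.9
probs[2, 60 - MIN_BPM] = 0.9

print(local_tempi(probs).tolist())  # [120, 120, 60]
print(global_tempo(probs))          # 120
```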

Global Tempo Estimation: Accuracy1
On ACM Mirum, ISMIR04, Ballroom, Hainsworth, GTzan, SMC, GiantSteps
• Combined: 74.2% (up from 69.5%)
• Dataset Avg: 69.3% (up from 66.7%)

Global Tempo Estimation: Accuracy2
On ACM Mirum, ISMIR04, Ballroom, Hainsworth, GTzan, SMC, GiantSteps
• Combined: 92.1% (down from 93.6%)
• Dataset Avg: 86.4% (down from 89.9%)
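The slides do not define the two metrics, so the sketch below uses the definitions common in the tempo estimation literature: Accuracy1 counts an estimate as correct if it lies within 4% of the reference tempo, and Accuracy2 additionally accepts estimates that are off by a factor of 2, 3, 1/2 or 1/3 (octave errors). The function names and the example values are made up for illustration.

```python
def accuracy1(estimate, reference, tolerance=0.04):
    """True if the estimate is within +/- tolerance of the reference tempo."""
    return abs(estimate - reference) <= tolerance * reference

def accuracy2(estimate, reference, tolerance=0.04):
    """Like accuracy1, but also accepts octave errors (factors 2, 3, 1/2, 1/3)."""
    factors = (1.0, 2.0, 3.0, 1.0 / 2.0, 1.0 / 3.0)
    return any(accuracy1(estimate, f * reference, tolerance) for f in factors)

def score(estimates, references, metric):
    """Percentage of tracks for which `metric` accepts the estimate."""
    hits = sum(metric(e, r) for e, r in zip(estimates, references))
    return 100.0 * hits / len(references)

# Hypothetical estimates and reference tempi in BPM.
estimates = [120.0, 60.0, 100.0]
references = [121.0, 120.0, 140.0]

print(round(score(estimates, references, accuracy1), 1))  # 33.3
print(round(score(estimates, references, accuracy2), 1))  # 66.7
```

The second track (60 BPM estimated for a 120 BPM reference) is an octave error: it fails Accuracy1 but passes Accuracy2. A system that makes fewer octave errors therefore gains mainly on Accuracy1.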

Summary
• Consolidates multi-component approach
• Completely data-driven, no heuristics
• Fewer octave-errors
• Suitable for global and local tempo estimation

Thank you.
New datasets are available at: http://www.tagtraum.com/tempo_estimation.html
Tempo estimation code is available at: https://github.com/hendriks73/tempo-cnn
