Musical Tempo Estimation with Convolutional Neural Networks

INTERNATIONAL AUDIO LABORATORIES ERLANGEN
A joint institution of Fraunhofer IIS and Universität Erlangen-Nürnberg

Hendrik Schreiber
tagtraum industries incorporated
[email protected]
@h_schreiber

About Me
๏ Doing business as tagtraum industries since 2004
๏ Author of beaTunes
๏ Ph.D. candidate at Meinard Müller’s lab

www.beaTunes.com (shameless product placement)
๏ Music Analysis
๏ Wikipedia Integration
๏ Segmentation
๏ Key, Tempo, Mood, Color
๏ Metadata Correction
๏ Matching
๏ Windows & Mac

This Talk
๏ About: Tempo Estimation with CNNs
๏ Yes, we are going to get close to or exceed the state of the art (SOTA)
๏ But: Just showing how that’s done won’t take long and we won’t learn much

Let’s take the scenic route!

This Talk
๏ What is tempo and tempo estimation?
๏ Simple digital signal processing baseline
๏ Translate classic approach to the CNN world
๏ Note issues, attempt to solve them
๏ Question why and how we did all this

Tempo Estimation
๏ Usually global tempo estimation
๏ Works pretty well (MIREX)
๏ Remaining challenge: Octave errors

Tempo Estimation System
Traditional System: Spectrogram ➔ Onset/Beat Detection ➔ Tempo Estimation (often peak picking or some other heuristic)

Tempo Estimation System
Proposed System: Mel-Spectrogram ➔ CNN-based Tempo Classification (“eliminating the middle-man”)

Case Study: A Traditional Tempo Estimation System

What is Tempo?
Beats per Minute

What is Tempo?
Periodic (3) Increase in Volume (1), in Certain Frequency Bands (2)

Periodic Increase in Volume (1), in Certain Frequency Bands
๏ Framewise signal energy (sliding window)
๏ Short-term pattern: Compute difference between adjacent frames
๏ Increase: Keep only positive values (half-wave rectification)
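
The three steps for part (1) can be sketched in a few lines of plain Python. This is only an illustration: the helper name `energy_oss`, the frame length, the hop size, and the toy signal are assumptions, not the settings used in the talk.

```python
import math

def energy_oss(samples, frame_len=1024, hop=512):
    """Half-wave rectified differences of framewise energy (hypothetical helper)."""
    energies = []
    # Sliding window: sum of squared samples per frame.
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame))
    # Difference between adjacent frames, keeping only increases.
    return [max(0.0, cur - prev) for prev, cur in zip(energies, energies[1:])]

# Toy signal: silence with a short burst (a "beat") every 4096 samples.
signal = [0.0] * 16384
for beat_start in range(2048, 16384, 4096):
    for i in range(beat_start, beat_start + 256):
        signal[i] = math.sin(2 * math.pi * 0.05 * i)

oss = energy_oss(signal)
onset_frames = [i for i, v in enumerate(oss) if v > 0]
print(onset_frames)
```

Each burst produces exactly one positive value in the resulting signal, so the spacing between the reported frames reflects the beat period.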

Increase in Volume
First 2 seconds: How many beats do you hear? 4 Beats?

Periodic Increase in Volume, in Certain Frequency Bands (2)
๏ Log-compressed power spectrum (de-emphasize low freqs)
๏ Bandwise differentiation (compare apples with apples)
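
A minimal sketch of part (2), assuming the power spectrum is already available as a list of frames with one power value per band. The helper name `bandwise_oss` and the compression constant 1000 are illustrative assumptions.

```python
import math

def bandwise_oss(power_spec):
    """power_spec[m][k]: non-negative power of band k at frame m (hypothetical layout)."""
    # Log compression; the factor 1000 is an assumed scaling constant.
    log_spec = [[math.log(1000.0 * p + 1.0) for p in frame] for frame in power_spec]
    oss = []
    for prev, cur in zip(log_spec, log_spec[1:]):
        # Compare each band only with itself, keep increases, then sum over bands.
        oss.append(sum(max(0.0, c - p) for p, c in zip(prev, cur)))
    return oss

# Toy example: a loud, steady low band and a quiet high band with one onset.
power_spec = [
    [10.0, 0.0],
    [10.0, 0.0],
    [10.0, 0.01],
    [10.0, 0.0],
]
oss = bandwise_oss(power_spec)
print(oss)
```

Because every band is differenced on its own after log compression, the quiet onset in the high band is not drowned out by the loud but steady low band.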

In Certain Frequency Bands
First 2 seconds: 4 Beats!

Periodic (3) Increase in Volume, in Certain Frequency Bands
๏ Long-term pattern: Fourier transform the Onset Strength Signal (OSS) from time domain to frequency domain
๏ Peak picking

Periodic
Peak at 153.3 BPM
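
Part (3) can be sketched as follows: evaluate the Fourier magnitude of the OSS at each candidate tempo and pick the largest. The helper name `estimate_tempo`, the integer BPM grid, and the toy OSS are assumptions made for illustration. A real system would use an FFT with finer resolution, which is how a value such as 153.3 BPM can come out.

```python
import cmath
import math

def estimate_tempo(oss, frame_rate, bpm_min=30, bpm_max=285):
    """Return the integer BPM with the largest Fourier magnitude (hypothetical helper)."""
    mean = sum(oss) / len(oss)
    centered = [v - mean for v in oss]  # remove the constant offset
    best_bpm, best_mag = None, -1.0
    for bpm in range(bpm_min, bpm_max + 1):
        freq_hz = bpm / 60.0
        # Fourier transform of the OSS, evaluated at a single frequency.
        coeff = sum(v * cmath.exp(-2j * math.pi * freq_hz * n / frame_rate)
                    for n, v in enumerate(centered))
        if abs(coeff) > best_mag:  # peak picking
            best_bpm, best_mag = bpm, abs(coeff)
    return best_bpm

# Toy OSS: one smooth pulse every 10 frames at 20 frames/s, i.e. 2 beats/s.
frame_rate = 20.0
oss = [max(0.0, math.cos(2 * math.pi * n / 10)) for n in range(400)]
tempo = estimate_tempo(oss, frame_rate)
print(tempo)  # 120
```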

Thanks. Fun signal processing intro.
But wasn’t this talk supposed to be about Deep Learning?

Reminder: Convolutional Neural Network (CNN)
๏ Usually used for image recognition
๏ Each neuron has a limited receptive field
๏ Parameters are shared between neurons
Image source: aphex34, https://en.wikipedia.org/wiki/Convolutional_neural_network

Reminder: Mel-Spectrogram
๏ The Mel scale is a perceptual scale of pitches judged by listeners to be equal in distance from each other.
๏ A Mel-spectrogram is a regular spectrogram rescaled along the Y axis using the Mel scale.
๏ Compared to a regular linear frequency scale, this reduces the number of dimensions (i.e., pooling).
๏ Contrary to regular images, the two axes have completely different meanings.
[Figure: Mel-spectrogram; Y axis: frequency in Mel, X axis: time in frames]
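
To make the rescaling concrete, here is a small sketch that spaces 40 band centers evenly on the Mel scale between 20 Hz and 5,000 Hz. The conversion formula is one common variant of the Mel scale. Whether the talk's spectrograms use exactly this variant is an assumption, and the helper names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    # One widely used Mel formula; other variants exist.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_centers(f_min=20.0, f_max=5000.0, n_bands=40):
    """Band centers in Hz, equally spaced on the Mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bands)]

centers = mel_band_centers()
print([round(c) for c in centers[:3]], [round(c) for c in centers[-3:]])
```

The centers are dense at low frequencies and sparse at high frequencies, which is the pooling effect mentioned above.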

General Approach
๏ Input: 11.9s (256 frames), 40-band Mel-spectrogram (20–5,000 Hz)
  ➔ Possible range: 0–645 BPM [Nyquist frequency = 60*Fs/2 = 60*(256/11.9s)/2 = 645 BPM]
๏ Output: Treat as classification problem (256 classes ↦ 30–286 BPM)
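
The Nyquist arithmetic can be checked directly. The variable names below are only for illustration.

```python
frames = 256
duration_s = 11.9

frame_rate_hz = frames / duration_s        # Fs, about 21.5 frames per second
frame_period_ms = 1000.0 / frame_rate_hz   # about 46 ms between frames
nyquist_hz = frame_rate_hz / 2.0           # fastest periodicity that can be represented
nyquist_bpm = 60.0 * nyquist_hz            # the same limit expressed in beats per minute

print(round(frame_rate_hz, 2), round(frame_period_ms, 1), round(nyquist_bpm))
```

The output classes (30 to 286 BPM) sit well below this limit.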

Translation to CNN-Speak
1. Increase in Volume: Convolutional layer with narrow receptive field
2. In Certain Freq. Bands: Use rectangular filters (along time axis)
3. Periodic: Convolutional layer with wide receptive field
4. Classification based on extracted features

Increase in Volume, in Certain Frequency Bands
Short, rectangular kernels (Bands: 1, Frames: 3)
[Figure: Mel-spectrogram; axes: Frequency/Band (0–40), Time/Frame (0–256)]

Summarize Frames
Rectangular average pooling (Bands: 40, Frames: 1)
[Figure: Mel-spectrogram; axes: Frequency/Band (0–40), Time/Frame (0–256)]

Periodic
Long, rectangular kernels (Bands: 1, Frames: 256)
[Figure: Mel-spectrogram; axes: Frequency/Band (0–40), Time/Frame (0–256)]

Tempo CNN
Stages: Feature Extraction ➔ Classification
1. (40x256) Mel-Spectrogram
2. 4 (1x3) Conv2D, ReLU
3. (40x1) AvgPooling2D
4. 256 (256) Conv1D, ReLU
5. 256 (1) Conv1D, ReLU
6. GlobalAvgPooling1D
7. (256) softmax
8. argmax mapped to interval [30, 286]
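
The last two steps (softmax, then argmax mapped to a BPM value) can be sketched without any deep learning library. The 1 BPM-per-class mapping starting at 30 BPM is an assumption inferred from "256 classes" and the stated interval, and the function names are hypothetical.

```python
import math

BPM_MIN = 30      # lower end of the interval on the slide
N_CLASSES = 256

def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def class_to_bpm(index):
    # Assumption: class i stands for BPM_MIN + i, i.e. 1 BPM resolution.
    return BPM_MIN + index

def predict_bpm(logits):
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return class_to_bpm(best)

# Toy network output that strongly favours class 98.
logits = [0.0] * N_CLASSES
logits[98] = 5.0
print(predict_bpm(logits))  # 128
```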

Nice idea. Does it work?

EDM Training
๏ EDM Training dataset
  - MTG Tempo [1], based on MTG Key [2]
  - 1,156 tracks, 2min duration each
  - 80/20 train/validation split
๏ Randomly picked 256-frame segments
๏ Optimizer Adam, lr=0.001
๏ Batch size 32
๏ Early stopping with patience 100
[1] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, Sept. 2018.
[2] Ángel Faraldo, Sergi Jordà, and Perfecto Herrera. A multi-profile method for key estimation in EDM. In Proceedings of the AES International Conference on Semantic Audio, Erlangen, Germany, June 2017.

EDM Evaluation
EDM Test Dataset
๏ GiantSteps Tempo
๏ 661 tracks
๏ New annotations from [1]
Metrics
๏ Acc1: Correct, allowing a 4% tolerance
๏ Acc2: Correct, allowing a 4% tolerance and factors 2, 3, ½, ⅓
[1] Hendrik Schreiber and Meinard Müller. A Crowdsourced Experiment for Tempo Estimation of Electronic Dance Music. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, Sept. 2018.
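
A minimal sketch of the two metrics as described above. Applying the 4% tolerance relative to the (scaled) reference tempo is an assumption, and the function names and toy data are made up for illustration.

```python
def within_tolerance(estimate, reference, tolerance=0.04):
    return abs(estimate - reference) <= tolerance * reference

def acc1(estimates, references):
    """Fraction of estimates within 4% of the reference tempo."""
    hits = sum(within_tolerance(e, r) for e, r in zip(estimates, references))
    return hits / len(references)

# Acc2 additionally accepts these multiples of the reference (octave errors).
OCTAVE_FACTORS = (1.0, 2.0, 3.0, 1.0 / 2.0, 1.0 / 3.0)

def acc2(estimates, references):
    hits = sum(any(within_tolerance(e, r * f) for f in OCTAVE_FACTORS)
               for e, r in zip(estimates, references))
    return hits / len(references)

references = [120.0, 140.0, 90.0, 128.0]
estimates = [121.0, 70.0, 180.0, 100.0]
print(acc1(estimates, references), acc2(estimates, references))  # 0.25 0.75
```

A large gap between Acc2 and Acc1, as in this toy example, indicates many octave errors.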

DSP Baseline
[Bar chart: accuracy in % (Acc1, Acc2) for six configurations: Energy (50-210, 46ms), Energy (90-180, 46ms), Energy (90-180, 23ms), Bandwise (50-210, 46ms), Bandwise (90-180, 46ms), Bandwise (90-180, 23ms)]
Values in extraction order:
๏ Acc2: 65.8%, 58.4%, 52.7%, 62.6%, 62.5%, 64.8%
๏ Acc1: 59.0%, 51.3%, 39.8%, 57.9%, 57.8%, 54.8%

Evaluation (accuracy in %)
๏ DSP Baseline: Acc1 59.0%, Acc2 65.8%
๏ Simple FCN (average of 3 runs): Acc1 71.1% (+12.1pp), Acc2 80.9% (+15.1pp)
Dirty little secret: Substantial standard deviations (σ=6.4, σ=5.7)
Overfitting?!

Let’s Drop Some!
Tempo CNN as before: (40x256) Mel-Spectrogram ➔ 4 (1x3) Conv2D, ReLU ➔ (40x1) AvgPooling2D ➔ 256 (256) Conv1D, ReLU ➔ 256 (1) Conv1D, ReLU ➔ GlobalAvgPooling1D ➔ (256) softmax ➔ argmax mapped to interval [30, 286]
Added: two Dropout layers, p=0.5 each

Scale & Crop Data Augmentation
๏ Quantized scaling to 80%, 84%, 88%, …, 116%, 120%
๏ Requires label adjustment!
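
A rough sketch of scale & crop on a list of frames. Several things here are assumptions: the helper names, the nearest-neighbour resampling (a stand-in for proper interpolation), and the direction of the label adjustment (stretching the time axis by a factor spreads the beats out, so the tempo label is divided by that factor).

```python
def scale_factors():
    # 80%, 84%, ..., 120% in 4% steps.
    return [round(0.80 + 0.04 * i, 2) for i in range(11)]

def scale_and_crop(frames, bpm, factor, n_out=256, offset=0):
    """Stretch the time axis by `factor`, crop n_out frames, adjust the tempo label."""
    n_scaled = int(round(len(frames) * factor))
    # Nearest-neighbour resampling along the time axis.
    scaled = [frames[min(len(frames) - 1, int(i / factor))] for i in range(n_scaled)]
    cropped = scaled[offset:offset + n_out]
    # Assumed label adjustment: a stretched excerpt sounds slower.
    return cropped, bpm / factor

# Toy "spectrogram": 400 frames, each represented by its own index.
frames = list(range(400))
segment, new_bpm = scale_and_crop(frames, bpm=120.0, factor=1.2)
print(len(segment), new_bpm)
```

In training, the factor and the crop offset would be drawn at random for each sample.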

Evaluation (accuracy in %; gains relative to Simple FCN)
๏ DSP Baseline: Acc1 59.0%, Acc2 65.8%
๏ Simple FCN: Acc1 71.1%, Acc2 80.9% (σ=6.4, σ=5.7)
๏ +Dropout 0.5: Acc1 81.0% (+9.9pp), Acc2 92.9% (+12.0pp) (σ=2.1, σ=0.3)
๏ +Dropout 0.5 & Augmentation: Acc1 86.8% (+15.7pp), Acc2 97.2% (+16.3pp) (σ=0.8, σ=0.2)

EDM Benchmarking (accuracy in %)
๏ DSP Baseline: Acc1 59.0%, Acc2 65.8%
๏ Schreiber17 [1] (Multi-Step): Acc1 63.1%, Acc2 95.2%
๏ Best FCN (Single-Step, CNN): Acc1 86.8%, Acc2 97.2% (σ=0.8, σ=0.2)
๏ Schreiber18 [2]: Acc1 82.5%, Acc2 97.6%
Fewer octave errors!
Best FCN beats SOTA Acc1 by +4.3pp!
[1] Hendrik Schreiber and Meinard Müller. A post-processing procedure for improving music tempo estimates using supervised learning. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 235–242, Suzhou, China, October 2017.
[2] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, Sept. 2018.

Sure, but…

Does this generalize? To other genres?

GTZAN [1] Benchmarking (accuracy in %)
1,000 tracks from 10 genres, balanced
๏ DSP Baseline: Acc1 53.0%, Acc2 68.8%
๏ SOTA [2][3]: Acc1 71.0%, Acc2 95.0%
๏ Best FCN: Acc1 50.5%, Acc2 86.7% (σ=2.2, σ=0.2)
Lots of octave errors! (36.2pp between Acc1 and Acc2 for the Best FCN)
[1] George Tzanetakis and Perry Cook. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5):293–302, 2002.
[2] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.
[3] Hendrik Schreiber and Meinard Müller. A post-processing procedure for improving music tempo estimates using supervised learning. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 235–242, Suzhou, China, October 2017.

Ballroom [1] Benchmarking (accuracy in %)
698 tracks from ballroom dance genres
๏ DSP Baseline: Acc1 46.1%, Acc2 78.7%
๏ SOTA [2][3]: Acc1 92.0%, Acc2 98.7%
๏ Best FCN: Acc1 56.5%, Acc2 81.2% (σ=1.8, σ=0.5)
Lots of octave errors! (24.7pp between Acc1 and Acc2 for the Best FCN)
[1] Fabien Gouyon, Anssi P. Klapuri, Simon Dixon, Miguel Alonso, George Tzanetakis, Christian Uhle, and Pedro Cano. An experimental comparison of audio tempo induction algorithms. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1832–1844, 2006.
[2] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.
[3] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, Sept. 2018.

Testset Tempo Distributions
[Figure: tempo distributions for the test datasets; x-axis: tempo intervals in BPM, y-axis: % of tracks]
๏ GiantSteps: μ=136.66, σ=28.33, N=664
๏ GTZAN: μ=94.55, σ=24.39, N=999
๏ Ballroom: μ=129.77, σ=39.61, N=698

Limits of Augmentation
Sheep will always stay sheep, no matter how you scale, rotate, crop, or shear.

More (diverse) training data probably wouldn’t hurt…

Diverse Training
๏ MTG Tempo (EDM): tempo annotations for MTG Key [1], N=1,159
๏ LMD Tempo (Rock/Pop): derived from Lakh MIDI Dataset [2], N=3,611
๏ Eball (Ballroom): Extended Ballroom [3] sans Ballroom, N=3,826
Combined: 8,596 samples
Still MIA: Jazz, World, Classical, Reggae, …
[1] Ángel Faraldo, Sergi Jordà, and Perfecto Herrera. A multi-profile method for key estimation in EDM. In Proceedings of the AES International Conference on Semantic Audio, Erlangen, Germany, June 2017.
[2] Colin Raffel. "Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching". PhD Thesis, 2016.
[3] Ugo Marchand and Geoffroy Peeters. The extended ballroom dataset. In Late Breaking Demo of the International Conference on Music Information Retrieval (ISMIR), New York, NY, USA, 2016.

Diverse Training
๏ Randomly picked 256-frame segments
๏ Optimizer Adam, lr=0.001
๏ Batch size 32
๏ 90/10 train/validation split
๏ Early stopping with patience 150

GTZAN Benchmarking (accuracy in %)
๏ Trained on EDM: Acc1 50.5%, Acc2 86.7% (σ=2.2, σ=0.2)
๏ Trained on Diverse: Acc1 58.2% (+7.7pp), Acc2 91.0% (σ=1.4, σ=0.4)
๏ SOTA [1][2]: Acc1 71.0%, Acc2 95.0%
Clearly an improvement, but not SOTA
[1] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.
[2] Hendrik Schreiber and Meinard Müller. A post-processing procedure for improving music tempo estimates using supervised learning. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 235–242, Suzhou, China, October 2017.

Tempo Distributions
[Figure: tempo distributions; x-axis: tempo intervals in BPM, y-axis: % of tracks]
๏ Diverse (Train dataset consisting of LMD Tempo, MTG Tempo, and EBall): μ=121.3, σ=30.5, N=8,596
๏ GTZAN: μ=94.6, σ=24.4, N=999

Ballroom Benchmarking (accuracy in %)
๏ Trained on EDM: Acc1 56.5%, Acc2 81.2% (σ=1.8, σ=0.5)
๏ Trained on Diverse: Acc1 85.9% (+29.4pp), Acc2 95.0% (σ=2.7, σ=0.2)
๏ SOTA [1][2]: Acc1 92.0%, Acc2 98.7%
Huge improvement, but not quite SOTA
[1] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.
[2] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, Sept. 2018.

EDM Benchmarking (accuracy in %)
๏ Trained on EDM: Acc1 86.8%, Acc2 97.2% (σ=0.8, σ=0.2)
๏ Trained on Diverse: Acc1 87.9%, Acc2 96.5% (σ=0.8, σ=0.5)
๏ SOTA [1]: Acc1 82.5%, Acc2 97.6%
Results mostly unchanged
[1] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, Sept. 2018.

Adding more samples made a difference.
Can we squeeze more out of the data by improving the network architecture?

Going Deeper
๏ Shallow network (Input, 4 (1x3) Conv2D, AvgPooling2D, 256 (256) Conv1D, 256 (1) Conv1D, GlobalAvgPooling, Softmax, plus two Dropout layers): 328,208 parameters
๏ With one additional 256 (256) Conv1D layer: 17,105,680 parameters
๏ With two additional 256 (256) Conv1D layers: 33,883,152 parameters
Parameter Overkill!
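
The parameter counts can be reproduced with a few lines of arithmetic. This sketch assumes one bias per filter and that the average pooling over all 40 bands leaves 4 channels for the first Conv1D. The helper names are hypothetical.

```python
def conv_params(n_filters, kernel_size, in_channels):
    # Weights (kernel_size * in_channels per filter) plus one bias per filter.
    return n_filters * (kernel_size * in_channels + 1)

def shallow_net_params(extra_long_layers=0):
    total = conv_params(4, 1 * 3, 1)       # 4 (1x3) Conv2D on a 1-channel input
    total += conv_params(256, 256, 4)      # 256 (256) Conv1D over 4 channels
    # Each additional 256 (256) Conv1D sees 256 input channels.
    total += extra_long_layers * conv_params(256, 256, 256)
    total += conv_params(256, 1, 256)      # 256 (1) Conv1D
    return total  # pooling, dropout and softmax add no parameters

for extra in (0, 1, 2):
    print(extra, shallow_net_params(extra))
```

Each extra long-filter layer costs about 16.8 million parameters, because it has to connect 256 input channels to 256 filters of length 256.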

Going Deeper
What if we combined ideas from [1] and [2]?
๏ Bottleneck layers (1x1 conv) ➔ dimensionality reduction
๏ Filter bank (not every filter needs to be 256 frames long)
๏ Stepwise pooling along frequency axis possible
๏ BatchNormalization to avoid covariate shift
๏ Add layers with short filters
[1] Christian Szegedy, et al. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[2] Jordi Pons and Xavier Serra. Designing efficient architectures for modeling temporal features with convolutional neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, USA, March 2017.

Going Deeper
Network (2,319,956 parameters): Input, BN + 16 (1x5) Conv2D, BN + 16 (1x5) Conv2D, BN + 16 (1x5) Conv2D, mf_mod K=5, mf_mod K=2, mf_mod K=2, mf_mod K=2, BN + Dropout, 256 (1x1) Conv2D, GlobalAvgPooling2D, Softmax
mf_mod [1]:
๏ (Kx1) AvgPooling2D: stepwise pooling along frequency axis
๏ BatchNormalization
๏ Filterbank: 24 (1x32), 24 (1x64), 24 (1x96), 24 (1x128), 24 (1x192) and 24 (1x256) Conv2D, followed by Concatenation
๏ Bottleneck: 36 (1x1) Conv2D
[1] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, Sept. 2018.
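To see how much the bottleneck saves, the stated total can be rebuilt layer by layer. This is a sketch under assumptions, not code from the talk: a single-channel input, 4 parameters per channel for each BatchNormalization (as Keras counts them), BatchNormalization applied to the input of each mf_mod, and 36 filters in every 1x1 bottleneck. The stated total of 2,319,956 is only reproduced with 36 bottleneck filters. All function names are hypothetical.

```python
FILTER_LENGTHS = (32, 64, 96, 128, 192, 256)

def conv(filters, kernel_len, in_ch):
    """Weights plus one bias per filter."""
    return filters * kernel_len * in_ch + filters

def bn(channels):
    """Scale, shift, moving mean and moving variance per channel."""
    return 4 * channels

def mf_mod(in_ch, bottleneck, bank=24):
    # Pooling has no parameters; BN, filterbank and 1x1 bottleneck do.
    filterbank = sum(conv(bank, n, in_ch) for n in FILTER_LENGTHS)
    return bn(in_ch) + filterbank + conv(bottleneck, 1, bank * len(FILTER_LENGTHS))

def mf_net(bottleneck=36):
    total, ch = 0, 1
    for _ in range(3):                  # three "BN + 16 (1x5) Conv2D" layers
        total += bn(ch) + conv(16, 5, ch)
        ch = 16
    for _ in range(4):                  # four mf_mod blocks
        total += mf_mod(ch, bottleneck)
        ch = bottleneck
    return total + bn(ch) + conv(256, 1, ch)   # BN + 256 (1x1) Conv2D head

total = mf_net()
print(total)  # 2319956

# Filterbank cost of one module with and without the preceding bottleneck:
reduced = sum(conv(24, n, 36) for n in FILTER_LENGTHS)
unreduced = sum(conv(24, n, 144) for n in FILTER_LENGTHS)
print(reduced, unreduced)
```

Without the 1x1 bottleneck, each following filterbank would see all 144 concatenated channels instead of 36, which quadruples its weight count.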

GTZAN Benchmarking
Accuracy in % (Acc1 / Acc2):
๏ Shallow Net: 58.2% / 91.0%
๏ Mf_mod Net: 63.9% / 92.7% (Acc1 +5.7pp)
๏ SOTA [1][2]: 71.0% / 95.0%
σ as listed on the slide: Shallow Net σ=1.4, σ=0.4; Mf_mod Net σ=1.9, σ=0.7
Again an improvement, but not SOTA
[1] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.
[2] Hendrik Schreiber and Meinard Müller. A post-processing procedure for improving music tempo estimates using supervised learning. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 235–242, Suzhou, China, October 2017.

Ballroom Benchmarking
Accuracy in % (Acc1 / Acc2):
๏ Shallow Net: 85.9% / 95.0%
๏ Mf_mod Net: 88.0% / 94.7% (Acc1 +2.1pp)
๏ SOTA [1][2]: 92.0% / 98.7%
σ as listed on the slide: Shallow Net σ=2.7, σ=0.2; Mf_mod Net σ=1.7, σ=0.4
Slight Acc1 improvement, but not quite SOTA
[1] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.
[2] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, Sept. 2018.

EDM Benchmarking
Accuracy in % (Acc1 / Acc2):
๏ Shallow Net: 87.9% / 96.5%
๏ Mf_mod Net: 89.2% / 97.4% (Acc1 +1.3pp, Acc2 +0.9pp)
๏ SOTA [1]: 82.5% / 97.6%
σ as listed on the slide: Shallow Net σ=0.8, σ=0.5; Mf_mod Net σ=0.3, σ=0.3
Results slightly improved
[1] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, Sept. 2018.

All results got a little better: There seems to be room for improvement. But how? Let’s try something old!

VGG-Style Net [1]
Network (1,198,704 parameters): Input, vgg_mod K=0, vgg_mod K=1, vgg_mod K=2, vgg_mod K=2, vgg_mod K=3, vgg_mod K=3, 256 (1x1) Conv2D, GlobalAvgPooling2D, Softmax
vgg_mod:
๏ 16·2^K (5x5) Conv2D
๏ BatchNormalization
๏ 16·2^K (3x3) Conv2D
๏ BatchNormalization
๏ (2x2) MaxPooling2D*
๏ Dropout 0.3
* pooling along frequency axis with size 2 only as long as there is something left to pool
[1] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

GTZAN Benchmarking
Accuracy in % (Acc1 / Acc2):
๏ Mf_mod Net: 63.9% / 92.7%
๏ VGG Net: 63.5% / 91.7%
๏ SOTA [1][2]: 71.0% / 95.0%
σ as listed on the slide: Mf_mod Net σ=1.9, σ=0.7; VGG Net σ=3.2, σ=0.4
No improvement, but as good as specialized network
[1] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.
[2] Hendrik Schreiber and Meinard Müller. A post-processing procedure for improving music tempo estimates using supervised learning. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 235–242, Suzhou, China, October 2017.

Ballroom Benchmarking
Accuracy in % (Acc1 / Acc2):
๏ Mf_mod Net: 88.0% / 94.7%
๏ VGG Net: 91.6% / 95.1% (Acc1 +3.6pp)
๏ SOTA [1][2]: 92.0% / 98.7%
σ as listed on the slide: Mf_mod Net σ=1.7, σ=0.4; VGG Net σ=1.7, σ=0.1
Acc1 now SOTA! Why? [3]
[1] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.
[2] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, Sept. 2018.
[3] Bob L. Sturm. A simple method to determine if a music information retrieval system is a “horse”. IEEE Transactions on Multimedia, 16(6):1636–1644, 2014.

EDM Benchmarking
Accuracy in % (Acc1 / Acc2):
๏ Mf_mod Net: 89.2% / 97.4%
๏ VGG Net: 88.8% / 97.1%
๏ SOTA [1]: 82.5% / 97.6%
σ as listed on the slide: Mf_mod Net σ=0.3, σ=0.3; VGG Net σ=1.2, σ=0.1
Similar results as specialized approaches
[1] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, Sept. 2018.

What if we use a VGG-style network, but with only rectangular filters?

Rect VGG-Style Net
Network (320,304 parameters): Input, rect_vgg_mod K=0, rect_vgg_mod K=1, rect_vgg_mod K=2, rect_vgg_mod K=2, rect_vgg_mod K=3, rect_vgg_mod K=3, 256 (1x1) Conv2D, GlobalAvgPooling2D, Softmax
rect_vgg_mod (same as before, but with rectangular (1x5) and (1x3) filters):
๏ 16·2^K (1x5) Conv2D
๏ BatchNormalization
๏ 16·2^K (1x3) Conv2D
๏ BatchNormalization
๏ (2x2) MaxPooling2D
๏ Dropout 0.3
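The parameter counts of the square and the rectangular VGG-style networks can be checked against each other with a few lines of arithmetic. The sketch below is an illustration, not the author's code. It assumes a single-channel input, 16·2^K filters per convolution, 4 parameters per channel for each BatchNormalization, and no parameters in pooling or dropout. The function names are hypothetical.

```python
def conv(filters, kernel_h, kernel_w, in_ch):
    """Weights plus one bias per filter."""
    return filters * kernel_h * kernel_w * in_ch + filters

def bn(channels):
    """Scale, shift, moving mean and moving variance per channel."""
    return 4 * channels

def vgg_style(first_kernel, second_kernel, ks=(0, 1, 2, 2, 3, 3)):
    total, ch = 0, 1                                  # assumed 1-channel input
    for k in ks:
        filters = 16 * 2 ** k
        total += conv(filters, *first_kernel, ch) + bn(filters)
        total += conv(filters, *second_kernel, filters) + bn(filters)
        ch = filters
    return total + conv(256, 1, 1, ch)                # 256 (1x1) Conv2D head

square = vgg_style((5, 5), (3, 3))
rect = vgg_style((1, 5), (1, 3))
print(square, rect)  # 1198704 320304
```

Restricting the filters to the time axis cuts the model to roughly a quarter of its size, since a 1x5 kernel has a fifth of the weights of a 5x5 kernel and a 1x3 kernel a third of a 3x3 kernel.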

GTZAN Benchmarking
Accuracy in % (Acc1 / Acc2):
๏ VGG Net: 63.5% / 91.7%
๏ Rect VGG Net: 65.7% / 91.6% (Acc1 +2.2pp)
๏ SOTA [1][2]: 71.0% / 95.0%
σ as listed on the slide: VGG Net σ=3.2, σ=0.4; Rect VGG Net σ=0.9, σ=0.1
Slight Acc1 improvement
[1] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.
[2] Hendrik Schreiber and Meinard Müller. A post-processing procedure for improving music tempo estimates using supervised learning. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 235–242, Suzhou, China, October 2017.

Ballroom Benchmarking
Accuracy in % (Acc1 / Acc2):
๏ VGG Net: 91.6% / 95.1%
๏ Rect VGG Net: 86.2% / 94.6% (Acc1 -5.4pp)
๏ SOTA [1][2]: 92.0% / 98.7%
σ as listed on the slide: VGG Net σ=1.7, σ=0.4; Rect VGG Net σ=1.6, σ=0.1
Below SOTA again. Timbral information must add clues about genre and therefore tempo!
[1] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.
[2] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, Sept. 2018.

EDM Benchmarking
Accuracy in % (Acc1 / Acc2):
๏ VGG Net: 88.8% / 97.1%
๏ Rect VGG Net: 87.6% / 96.9% (Acc1 -1.2pp)
๏ SOTA [1]: 82.5% / 97.6%
σ as listed on the slide: VGG Net σ=1.2, σ=0.1; Rect VGG Net σ=0.8, σ=0.2
Slightly lower than square VGG
[1] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, Sept. 2018.

That was a lot of graphs and architectures. Let’s summarize.

Dataset Averages: Acc1
Accuracy 1 in %:
๏ Shallow Net: 77.3%
๏ MF_mod Net: 80.4%
๏ VGG Net: 81.3%
๏ Rect VGG Net: 79.8%
๏ Schreiber17 [1] (multi-step, DSP + RandomForest): 68.6%
๏ Böck/Madmom [2] (multi-step, BLSTM + Comb Filters): 72.8%
๏ Schreiber18 [3] (CNN, similar to Mf_mod): 81.3%
Max: 81.3%
[1] Hendrik Schreiber and Meinard Müller. A post-processing procedure for improving music tempo estimates using supervised learning. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 235–242, Suzhou, China, October 2017.
[2] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.
[3] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, Sept. 2018.

Dataset Averages: Acc2
Accuracy 2 in %:
๏ Shallow Net: 94.2%
๏ MF_mod Net: 94.9%
๏ VGG Net: 94.6%
๏ Rect VGG Net: 94.4%
๏ Schreiber17 [1] (DSP + RandomForest): 95.2%
๏ Böck/Madmom [2] (BLSTM + Comb Filters): 96.2%
๏ Schreiber18 [3] (CNN, similar to Mf_mod): 95.9%
Max: 96.2%
[1] Hendrik Schreiber and Meinard Müller. A post-processing procedure for improving music tempo estimates using supervised learning. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 235–242, Suzhou, China, October 2017.
[2] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.
[3] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, Sept. 2018.
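For the four networks from this talk, the averages appear to be plain unweighted means over the GTZAN, Ballroom and EDM benchmarks shown earlier. The sketch below checks that reading. The per-dataset numbers are read off the benchmark slides, and the unweighted mean is an assumption rather than something the slides state.

```python
# (GTZAN, Ballroom, EDM) accuracies in %, read off the benchmark slides
acc1_by_net = {
    "Shallow Net": (58.2, 85.9, 87.9),
    "MF_mod Net": (63.9, 88.0, 89.2),
    "VGG Net": (63.5, 91.6, 88.8),
    "Rect VGG Net": (65.7, 86.2, 87.6),
}
acc2_by_net = {
    "Shallow Net": (91.0, 95.0, 96.5),
    "MF_mod Net": (92.7, 94.7, 97.4),
    "VGG Net": (91.7, 95.1, 97.1),
    "Rect VGG Net": (91.6, 94.6, 96.9),
}

def dataset_average(values):
    """Unweighted mean over datasets, rounded to one decimal."""
    return round(sum(values) / len(values), 1)

acc1_avg = {net: dataset_average(v) for net, v in acc1_by_net.items()}
acc2_avg = {net: dataset_average(v) for net, v in acc2_by_net.items()}
print(acc1_avg)  # Shallow 77.3, MF_mod 80.4, VGG 81.3, Rect VGG 79.8
print(acc2_avg)  # Shallow 94.2, MF_mod 94.9, VGG 94.6, Rect VGG 94.4
```

Because the mean is unweighted, the small Ballroom set counts as much as the larger sets, so the large Ballroom gap between the square and the rectangular VGG nets has a visible effect on the Acc1 ranking.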

Apparently we have a couple of systems that work pretty well. What can we do with them?

Applications
๏ DJ apps
๏ Annotation supported browsing
๏ Content-based recommender systems
๏ …
๏ Anything else that’s remotely cool?

Local Tempo Estimation
“Honky Tonk Women” by The Rolling Stones

Local Tempo Estimation
“Typhoon” by Foreign Beggars/Chasing Shadows
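One plausible way to obtain a local tempo curve like the ones shown for these tracks is to run the same classifier on overlapping excerpts instead of once per track. The sketch below is an assumption about how this could be done, not the author's implementation: the window of 256 frames, the hop of 128 frames, and the `estimate_bpm` callback (standing in for a trained network) are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

def local_tempo_curve(
    frames: Sequence,
    estimate_bpm: Callable[[Sequence], float],
    window: int = 256,   # assumed excerpt length in spectrogram frames
    hop: int = 128,      # assumed step between excerpts
) -> List[Tuple[int, float]]:
    """Return (start_frame, bpm) pairs, one per full window."""
    curve = []
    for start in range(0, len(frames) - window + 1, hop):
        curve.append((start, estimate_bpm(frames[start:start + window])))
    return curve

# Usage with a stand-in estimator that always answers 120 BPM.
frames = list(range(1024))
curve = local_tempo_curve(frames, lambda excerpt: 120.0)
print(len(curve), curve[0], curve[-1])
```

A shorter hop gives a smoother curve at the cost of more network evaluations, while the window length bounds how quickly a tempo change can show up in the curve.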

Conclusions I
๏ Consolidates multi-component approach
๏ Completely data-driven, no heuristics
๏ Fewer octave errors
๏ Suitable for global and local tempo estimation

Conclusions II
๏ Great tempo estimation results are possible with simple, shallow networks
๏ Training data and data augmentation are key
๏ Rectangular filters are sufficient for tempo estimation
๏ Square filters can improve results further, but beware of horses
๏ Yes, spectrograms differ from images, but don’t re-invent the wheel

Questions
๏ If the DSP-ignorant VGG approach works so well, why bother with trying to design a DSP-informed network architecture?
๏ Should tempo estimators use timbral or genre information?
๏ What’s with this 4% tolerance? Doesn’t it make results useless for DJs?
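To make the tolerance question concrete, here is a minimal sketch of the two metrics as they are commonly defined in the tempo estimation literature: Acc1 accepts an estimate within ±4% of the reference tempo, and Acc2 additionally accepts estimates within ±4% of double, triple, half or a third of it. The function names are illustrative, and the exact evaluation code used for the benchmarks is not shown in the slides.

```python
def acc1(estimate, reference, tolerance=0.04):
    """True if the estimate is within the tolerance of the reference tempo."""
    return abs(estimate - reference) <= tolerance * reference

def acc2(estimate, reference, tolerance=0.04):
    """Like acc1, but octave errors by factor 2 or 3 also count as correct."""
    factors = (1, 2, 3, 1 / 2, 1 / 3)
    return any(acc1(estimate, reference * f, tolerance) for f in factors)

# At 128 BPM a 4% tolerance spans 128 * 0.04 = 5.12 BPM in each direction.
print(acc1(133, 128))  # True: off by 5 BPM, still counted as correct
print(acc1(64, 128))   # False: half-tempo octave error
print(acc2(64, 128))   # True: Acc2 forgives the octave error
```

An estimate of 133 BPM for a 128 BPM track passes Acc1, yet a DJ would hear the drift within a few bars, which is the concern behind the last question.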

Thank you.
Hendrik Schreiber, tagtraum industries incorporated
[email protected] @h_schreiber
