Musical Tempo Estimation with Convolutional Neural Networks

Hendrik Schreiber

November 19, 2018

Transcript

1. INTERNATIONAL AUDIO LABORATORIES ERLANGEN, a joint institution of Fraunhofer IIS and Universität Erlangen-Nürnberg
Musical Tempo Estimation with Convolutional Neural Networks
Hendrik Schreiber, tagtraum industries incorporated, [email protected], @h_schreiber
2. About Me ๏ Doing business as tagtraum industries since 2004
3. About Me ๏ Doing business as tagtraum industries since 2004 ๏ Author of beaTunes
4. About Me ๏ Doing business as tagtraum industries since 2004 ๏ Author of beaTunes ๏ Ph.D. candidate at Meinard Müller’s lab
5. www.beaTunes.com: Music Analysis, Wikipedia Integration, Segmentation, Key, Tempo, Mood, Color, Metadata Correction, Matching, Windows & Mac (shameless product placement)
7. This Talk ๏ About: Tempo Estimation with CNNs ๏ Yes, we are going to get close to or exceed the state of the art (SOTA) ๏ But: just showing how that’s done won’t take long, and we won’t learn much
8. Let’s take the scenic route!
9. This Talk ๏ What is tempo and tempo estimation? ๏ Simple digital signal processing baseline ๏ Translate the classic approach to the CNN world ๏ Note issues, attempt to solve them ๏ Question why and how we did all this
10. Tempo Estimation ๏ Usually global tempo estimation ๏ Works pretty well (MIREX) ๏ Remaining challenge: octave errors
11. Tempo Estimation System ๏ Traditional system: Spectrogram ➔ Onset/Beat Detection ➔ Tempo Estimation ๏ Often peak picking or some other heuristic
12. Tempo Estimation System ๏ Proposed system: Mel-Spectrogram ➔ CNN-based Tempo Classification (“eliminating the middle-man”)
13. Case Study: A Traditional Tempo Estimation System
15. What is Tempo? Beats per Minute
16. What is Tempo? Periodic Increase in Volume, in Certain Frequency Bands
18. Periodic Increase in Volume, in Certain Frequency Bands ๏ Framewise signal energy (sliding window) ๏ Short-term pattern: compute the difference between adjacent frames ๏ Increase: keep only positive values (half-wave rectification)
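The three energy-based steps just listed (framewise energy, adjacent-frame difference, half-wave rectification) can be sketched in a few lines. This is a minimal sketch; the function name, frame sizes, and the toy click signal are illustrative, not from the talk.

```python
import numpy as np

def energy_oss(x, frame_len=1024, hop=512):
    """Framewise energy -> adjacent-frame difference -> half-wave rectification."""
    n_frames = 1 + (len(x) - frame_len) // hop
    energy = np.array([np.sum(x[i * hop:i * hop + frame_len] ** 2)
                       for i in range(n_frames)])
    diff = np.diff(energy)        # short-term pattern between adjacent frames
    return np.maximum(diff, 0.0)  # keep only increases in volume

# Toy signal: silence with two short bursts -> two clear rises in the OSS.
x = np.zeros(8192)
x[2000:2100] = 1.0
x[6000:6100] = 1.0
oss = energy_oss(x)
```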
20. Increase in Volume
23. Increase in Volume. First 2 seconds: How many beats do you hear?
25. Increase in Volume. First 2 seconds: 4 Beats?
26. Periodic Increase in Volume, in Certain Frequency Bands ๏ Log-compressed power spectrum (de-emphasize low freqs) ๏ Bandwise differentiation (compare apples with apples)
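The bandwise variant described above can be sketched with plain NumPy: log-compress a power spectrogram, differentiate each frequency band separately, rectify, and sum over bands. Window type, FFT size, and the compression constant are assumptions for illustration.

```python
import numpy as np

def bandwise_oss(x, n_fft=1024, hop=512):
    """Log-compressed power spectrogram -> per-band difference ->
    half-wave rectification -> sum over bands (spectral-flux style)."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    log_power = np.log1p(1000.0 * power)       # log compression
    flux = np.diff(log_power, axis=0)          # bandwise difference
    return np.maximum(flux, 0.0).sum(axis=1)   # rectify, then sum the bands

# Same toy signal as before: two bursts -> two rises in the OSS.
x = np.zeros(8192)
x[2000:2100] = 1.0
x[6000:6100] = 1.0
oss = bandwise_oss(x)
```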
28. In Certain Frequency Bands
33. In Certain Frequency Bands. First 2 seconds: 4 Beats!
34. Periodic Increase in Volume, in Certain Frequency Bands ๏ Long-term pattern: Fourier transform the Onset Strength Signal (OSS) from the time domain to the frequency domain ๏ Peak picking
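The long-term step above can be sketched as a DFT of the OSS followed by picking the strongest bin in a plausible tempo range. The DFT length, the BPM limits, and the synthetic OSS are assumptions for illustration.

```python
import numpy as np

def tempo_from_oss(oss, frame_rate, n_dft=8192):
    """DFT the OSS, pick the strongest peak in a plausible BPM range."""
    spectrum = np.abs(np.fft.rfft(oss, n_dft))
    bpm = np.fft.rfftfreq(n_dft, d=1.0 / frame_rate) * 60.0
    valid = (bpm >= 30.0) & (bpm <= 300.0)      # ignore DC and absurd tempi
    idx = np.flatnonzero(valid)[np.argmax(spectrum[valid])]
    return bpm[idx]

# Synthetic OSS: half-wave rectified 2.5 Hz pulse -> expected tempo 150 BPM.
frame_rate = 256 / 11.9                          # ~21.5 frames per second
t = np.arange(512) / frame_rate
oss = np.maximum(np.sin(2 * np.pi * 2.5 * t), 0.0)
estimate = tempo_from_oss(oss, frame_rate)
```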
36. Periodic. Peak at 153.3 BPM
37. Thanks. Fun signal processing intro.
38. Thanks. Fun signal processing intro. But wasn’t this talk supposed to be about Deep Learning?
39. Reminder: Convolutional Neural Network (CNN) ๏ Usually used for image recognition ๏ Each neuron has a limited receptive field ๏ Parameters are shared between neurons (image source: aphex34, https://en.wikipedia.org/wiki/Convolutional_neural_network)
40. Reminder: Mel-Spectrogram ๏ The mel scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another ๏ A mel-spectrogram is a regular spectrogram rescaled along the Y axis using the mel scale ๏ Compared to a regular linear scale, this reduces the number of dimensions (i.e., pooling) ๏ Contrary to regular images, the two axes have completely different meanings (axes: frequency in mel, time in frames)
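For concreteness, the mel mapping can be sketched as follows. The HTK-style formula and the band-edge construction are common conventions, assumed here since the talk does not pin down a specific variant.

```python
import math

def hz_to_mel(f):
    """HTK-style mel conversion (an assumed convention)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(f_min=20.0, f_max=5000.0, n_bands=40):
    """n_bands triangular filters need n_bands + 2 equally spaced mel edges."""
    m_lo, m_hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_hi - m_lo) / (n_bands + 1)
    return [mel_to_hz(m_lo + i * step) for i in range(n_bands + 2)]

edges = mel_band_edges()
```

Because the edges are equally spaced in mel, they get progressively wider in Hz, which is exactly the dimensionality reduction mentioned above.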
41. General Approach ๏ Input: 11.9 s (256 frames), 40-band mel-spectrogram (20–5,000 Hz) ➔ possible range: 0–645 BPM [Nyquist frequency = 60·Fs/2 = 60·(256/11.9 s)/2 ≈ 645 BPM] ๏ Output: treat as a classification problem (256 classes ↦ 30–286 BPM)
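The BPM bookkeeping above checks out with a little arithmetic. The linear class-to-BPM mapping shown here is the simplest reading of “256 classes ↦ 30–286 BPM” and is an assumption.

```python
frames = 256
duration_s = 11.9
frame_rate = frames / duration_s           # ~21.5 frames per second
nyquist_bpm = 60.0 * frame_rate / 2.0      # fastest representable tempo

def class_to_bpm(i, lo=30.0, hi=286.0, n_classes=256):
    """Map a class index (softmax argmax) onto the supported BPM interval."""
    return lo + i * (hi - lo) / (n_classes - 1)
```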
42. Translation to CNN-Speak 1. Increase in Volume: Convolutional layer with narrow receptive field
43. Translation to CNN-Speak 1. Increase in Volume: Convolutional layer with narrow receptive field 2. In Certain Freq. Bands: Use rectangular filters (along time axis)
44. Translation to CNN-Speak 1. Increase in Volume: Convolutional layer with narrow receptive field 2. In Certain Freq. Bands: Use rectangular filters (along time axis) 3. Periodic: Convolutional layer with wide receptive field
45. Translation to CNN-Speak 1. Increase in Volume: Convolutional layer with narrow receptive field 2. In Certain Freq. Bands: Use rectangular filters (along time axis) 3. Periodic: Convolutional layer with wide receptive field 4. Classification based on extracted features
46. Increase in Volume, in Certain Frequency Bands: short, rectangular kernels (bands: 1, frames: 3) applied to the 40×256 input (frequency/band × time/frame)
48. Summarize Frames: rectangular average pooling (bands: 40, frames: 1) over the 40×256 representation
49. Periodic: long, rectangular kernels (bands: 1, frames: 256)
50. Tempo CNN
Feature Extraction: (40×256) Mel-Spectrogram ➔ 4 (1×3) Conv2D, ReLU ➔ (40×1) AvgPooling2D ➔ 256 (256) Conv1D, ReLU
Classification: 256 (1) Conv1D, ReLU ➔ GlobalAvgPooling1D ➔ (256) softmax ➔ argmax mapped to the interval [30, 286]
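A tiny NumPy forward pass with random weights confirms the shape flow of the network above. “Same” padding and the weight scales are assumptions; this is shape bookkeeping, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_same(x, w):
    """x: (T, C_in), w: (K, C_in, C_out); zero 'same' padding, then ReLU."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, k - 1 - pad), (0, 0)))
    out = np.stack([np.tensordot(xp[t:t + k], w, axes=([0, 1], [0, 1]))
                    for t in range(x.shape[0])])
    return np.maximum(out, 0.0)

mel = rng.random((40, 256))                       # 40 bands x 256 frames

# 4x (1x3) Conv2D over time == a 1-D conv applied within each band.
w1 = rng.standard_normal((3, 1, 4)) * 0.1
per_band = np.stack([conv1d_same(mel[b][:, None], w1) for b in range(40)])

# (40x1) average pooling collapses the band axis -> (256 frames, 4 channels).
x = per_band.mean(axis=0)

# Long Conv1D (kernel 256) for periodicity, then a 1x1 Conv1D.
w2 = rng.standard_normal((256, 4, 256)) * 0.01
x = conv1d_same(x, w2)
w3 = rng.standard_normal((1, 256, 256)) * 0.01
x = conv1d_same(x, w3)

# Global average pooling, softmax over 256 tempo classes, argmax -> BPM.
logits = x.mean(axis=0)
probs = np.exp(logits - logits.max())
probs /= probs.sum()
bpm = 30 + int(np.argmax(probs))
```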
51. Nice idea. Does it work?
53. EDM Training ๏ EDM Training dataset: MTG Tempo [1], based on MTG Key [2]; 1,156 tracks, 2 min duration each; 80/20 train/validation split ๏ Randomly picked 256-frame segments ๏ Optimizer: Adam, lr=0.001 ๏ Batch size 32 ๏ Early stopping with patience 100
[1] Hendrik Schreiber, Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, September 2018.
[2] Ángel Faraldo, Sergi Jordà, and Perfecto Herrera. A multi-profile method for key estimation in EDM. In Proceedings of the AES International Conference on Semantic Audio, Erlangen, Germany, June 2017.
54. EDM Evaluation
EDM test dataset: ๏ GiantSteps Tempo ๏ 661 tracks ๏ New annotations from [1]
Metrics: ๏ Acc1: correct, allowing a 4% tolerance ๏ Acc2: correct, allowing a 4% tolerance and factors 2, 3, ½, ⅓
[1] Hendrik Schreiber, Meinard Müller. A Crowdsourced Experiment for Tempo Estimation of Electronic Dance Music. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, September 2018.
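The two metrics above are easy to state as code; a minimal sketch (the function names are ours):

```python
def acc1(estimate, reference, tol=0.04):
    """Correct if within +/- 4% of the reference tempo."""
    return abs(estimate - reference) <= tol * reference

def acc2(estimate, reference, tol=0.04):
    """Also correct if off by a factor of 2, 3, 1/2, or 1/3 (octave errors)."""
    return any(acc1(estimate, f * reference, tol)
               for f in (1.0, 2.0, 3.0, 0.5, 1.0 / 3.0))
```

So a 200 BPM estimate for a 100 BPM track fails Acc1 but passes Acc2, while a 150 BPM estimate (a 3/2 error) fails both.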
55. DSP Baseline (GiantSteps accuracy per variant):
Variant | Acc1 | Acc2
Energy (50–210, 46 ms) | 59.0% | 65.8%
Energy (90–180, 46 ms) | 51.3% | 58.4%
Energy (90–180, 23 ms) | 39.8% | 52.7%
Bandwise (50–210, 46 ms) | 57.9% | 62.6%
Bandwise (90–180, 46 ms) | 57.8% | 62.5%
Bandwise (90–180, 23 ms) | 54.8% | 64.8%
56. Evaluation: DSP Baseline vs. Simple FCN. Acc1: 59.0% ➔ 71.1% (+12.1pp); Acc2: 65.8% ➔ 80.9% (+15.1pp)
57. Evaluation: the Simple FCN numbers are the average of 3 runs. Dirty little secret: substantial standard deviations (σ=6.4, σ=5.7).
58. Evaluation: substantial standard deviations across runs. Overfitting?!
59. Let’s Drop Some! The same Tempo CNN, now with two Dropout layers (p=0.5) added to the architecture.
60. Scale & Crop Data Augmentation: quantized scaling to 80%, 84%, 88%, …, 116%, 120%. Requires label adjustment!
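A sketch of what scale-and-crop with label adjustment can look like. The per-band np.interp resampling and the exact crop/pad policy are assumptions; the key point is the label: shrinking the time axis by factor s packs more beats into the window, so the BPM label becomes bpm / s.

```python
import numpy as np

def scale_time_axis(spec, bpm, scale, out_frames=256):
    """Resample the time axis of a (bands, frames) spectrogram by `scale`,
    crop or zero-pad back to `out_frames`, and adjust the tempo label."""
    bands, frames = spec.shape
    new_frames = max(1, int(round(frames * scale)))
    old_t = np.arange(frames)
    new_t = np.linspace(0, frames - 1, new_frames)
    scaled = np.stack([np.interp(new_t, old_t, spec[b]) for b in range(bands)])
    if new_frames >= out_frames:
        scaled = scaled[:, :out_frames]          # crop
    else:
        scaled = np.pad(scaled, ((0, 0), (0, out_frames - new_frames)))
    return scaled, bpm / scale                   # faster playback, higher BPM
```

For example, scaling a 120 BPM excerpt to 80% of its length yields a 150 BPM training example.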
61. Evaluation, Acc1 focus: DSP Baseline 59.0%, Simple FCN 71.1%, +Dropout 0.5 81.0% (+9.9pp), +Dropout 0.5 & Augmentation 86.8% (+15.7pp vs. Simple FCN). Acc2 values: 65.8%, 80.9%, 92.9%, 97.2%. Standard deviations shrink as well (σ=6.4/5.7 ➔ 2.1/0.3 ➔ 0.8/0.2).
62. Evaluation, Acc2 focus: DSP Baseline 65.8%, Simple FCN 80.9%, +Dropout 0.5 92.9% (+12.0pp), +Dropout 0.5 & Augmentation 97.2% (+16.3pp vs. Simple FCN).
63. EDM Benchmarking: DSP Baseline (Acc1 59.0%, Acc2 65.8%), Schreiber17 [1] (multi-step; Acc1 63.1%, Acc2 95.2%), Best FCN (single-step CNN; Acc1 86.8%, Acc2 97.2%, σ=0.8/0.2). Fewer octave errors!
[1] Hendrik Schreiber and Meinard Müller. A post-processing procedure for improving music tempo estimates using supervised learning. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 235–242, Suzhou, China, October 2017.
[2] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, September 2018.
64. EDM Benchmarking, adding Schreiber18 [2] (single-step CNN; Acc1 82.5%, Acc2 97.6%): the Best FCN beats the SOTA Acc1 by +4.3pp!
[1] Hendrik Schreiber and Meinard Müller. A post-processing procedure for improving music tempo estimates using supervised learning. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 235–242, Suzhou, China, October 2017.
[2] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, September 2018.
65. Sure, but…
66. Does this generalize?
67. Does this generalize? To other genres?
68. GTZAN [1] Benchmarking (1,000 tracks from 10 genres, balanced): DSP Baseline (Acc1 53.0%, Acc2 68.8%), SOTA [2][3] (Acc1 71.0%, Acc2 95.0%), Best FCN (Acc1 50.5%, Acc2 86.7%, σ=2.2/0.2). A 36.2pp gap between Acc1 and Acc2: lots of octave errors!
[1] George Tzanetakis and Perry Cook. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5):293–302, 2002.
[2] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.
[3] Hendrik Schreiber and Meinard Müller. A post-processing procedure for improving music tempo estimates using supervised learning. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 235–242, Suzhou, China, October 2017.
69. Ballroom [1] Benchmarking (698 tracks from ballroom dance genres): DSP Baseline (Acc1 46.1%, Acc2 78.7%), SOTA [2][3] (Acc1 92.0%, Acc2 98.7%), Best FCN (Acc1 56.5%, Acc2 81.2%, σ=1.8/0.5). A 24.7pp gap between Acc1 and Acc2: lots of octave errors!
[1] Fabien Gouyon, Anssi P. Klapuri, Simon Dixon, Miguel Alonso, George Tzanetakis, Christian Uhle, and Pedro Cano. An experimental comparison of audio tempo induction algorithms. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1832–1844, 2006.
[2] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.
[3] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, September 2018.
70. Testset Tempo Distributions [figure: tempo histograms of the test datasets; ISMIR2004 Songs μ=89.80, σ=27.83, N=464; GTZAN μ=94.55, σ=24.39, N=999; ACM MIRUM μ=102.72, σ=32.58, N=1410; Hainsworth μ=113.30, σ=28.78, N=222; Ballroom μ=129.77, σ=39.61, N=698; GiantSteps μ=136.66, σ=28.33, N=664]. Highlighted: GiantSteps (μ=136.7, σ=28.3) vs. GTZAN (μ=94.6, σ=24.4).
71. Testset Tempo Distributions [figure: tempo histograms of the test datasets]. Highlighted: GiantSteps (μ=136.7, σ=28.3) vs. Ballroom (μ=129.8, σ=39.6).
72. Limits of Augmentation: sheep will always stay sheep, no matter how you scale, rotate, crop, or shear.
73. More (diverse) training data probably wouldn’t hurt…
74. Diverse Training ๏ MTG Tempo (EDM): tempo annotations for MTG Key [1], N=1,159 ๏ LMD Tempo (Rock/Pop): derived from the Lakh MIDI Dataset [2], N=3,611 ๏ EBall (Ballroom): Extended Ballroom [3] sans Ballroom, N=3,826. Combined: 8,596 samples.
[1] Ángel Faraldo, Sergi Jordà, and Perfecto Herrera. A multi-profile method for key estimation in EDM. In Proceedings of the AES International Conference on Semantic Audio, Erlangen, Germany, June 2017.
[2] Colin Raffel. Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching. PhD thesis, 2016.
[3] Ugo Marchand and Geoffroy Peeters. The extended ballroom dataset. In Late Breaking Demo of the International Conference on Music Information Retrieval (ISMIR), New York, NY, USA, 2016.
75. Diverse Training ๏ Still MIA: Jazz, World, Classical, Reggae, …
76. Diverse Training ๏ Randomly picked 256-frame segments ๏ Optimizer: Adam, lr=0.001 ๏ Batch size 32 ๏ 90/10 train/validation split ๏ Early stopping with patience 150
77. GTZAN Benchmarking: Trained on EDM (Acc1 50.5%, Acc2 86.7%, σ=2.2/0.2), Trained on Diverse (Acc1 58.2%, Acc2 91.0%, σ=1.4/0.4), SOTA [1][2] (Acc1 71.0%, Acc2 95.0%). Clearly an improvement (+7.7pp Acc1), but not SOTA.
[1] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.
[2] Hendrik Schreiber and Meinard Müller. A post-processing procedure for improving music tempo estimates using supervised learning. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 235–242, Suzhou, China, October 2017.
78. Tempo Distributions
[Figure] Tempo distribution of the GTZAN test set (μ=94.6, σ=24.4, N=999) vs. the Diverse training set consisting of LMD Tempo, MTG Tempo, and EBall (μ=121.3, σ=30.5, N=8,596).
79. Ballroom Benchmarking
[Bar chart] Trained on EDM: Acc1 56.5% (σ=1.8), Acc2 81.2% (σ=0.5); Trained on Diverse: Acc1 85.9% (σ=2.7), Acc2 95.0% (σ=0.2); SOTA [1][2]: Acc1 92.0%, Acc2 98.7%. Huge improvement (+29.4pp Acc1), but not quite SOTA.
[1] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.
[2] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, September 2018.
80. EDM Benchmarking
[Bar chart] Trained on EDM: Acc1 86.8% (σ=0.8), Acc2 97.2% (σ=0.2); Trained on Diverse: Acc1 87.9% (σ=0.8), Acc2 96.5% (σ=0.5); SOTA [1]: Acc1 82.5%, Acc2 97.6%. Results mostly unchanged.
[1] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, September 2018.
81. Adding more samples made a difference. Can we squeeze more out of the data by improving the network architecture?
82. Going Deeper
Shallow net: Input, 4 (1x3) Conv2D, Dropout, 256 (256) Conv1D, Dropout, 256 (1) Conv1D, AvgPooling2D, GlobalAvgPooling, Softmax. 328,208 parameters.
83. Going Deeper
Shallow net (328,208 parameters) vs. the same net with one additional 256 (256) Conv1D layer: 17,105,680 parameters.
84. Going Deeper
Shallow net (328,208 parameters); with one additional 256 (256) Conv1D layer: 17,105,680 parameters; with two additional 256 (256) Conv1D layers: 33,883,152 parameters. Parameter overkill!
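The parameter explosion is plain convolution arithmetic: each added Conv1D with 256 filters of length 256 over 256 input channels contributes 256·256·256 weights plus 256 biases. A quick check, assuming standard Keras-style parameter counting:

```python
def conv1d_params(filters, kernel_size, in_channels):
    """Weights + biases of a standard Conv1D layer."""
    return filters * kernel_size * in_channels + filters

extra = conv1d_params(256, 256, 256)  # one added 256 (256) Conv1D layer
print(extra)                  # 16,777,472 extra parameters per layer
print(328_208 + extra)        # 17,105,680 -- matches the slide
print(328_208 + 2 * extra)    # 33,883,152 -- matches the slide
```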
85. Going Deeper
What if we combined ideas from [1] and [2]:
๏ Bottleneck layers (1x1 conv) ⇒ dimensionality reduction
๏ Filter bank (not every filter needs to be 256 frames long)
๏ Stepwise pooling along frequency axis possible
๏ BatchNormalization to avoid covariate shift
๏ Add layers with short filters
[1] Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
[2] Pons, Jordi, and Xavier Serra. "Designing efficient architectures for modeling temporal features with convolutional neural networks." IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, USA, March 2017.
86. Going Deeper
mf_mod [1]: (Kx1) AvgPooling2D (stepwise pooling along the frequency axis), then a filterbank of parallel 24 (1x32), 24 (1x64), 24 (1x96), 24 (1x128), 24 (1x192), and 24 (1x256) Conv2D layers, Concatenation, and a 34 (1x1) Conv2D bottleneck with BatchNormalization.
Full net: Input, BN + 16 (1x5) Conv2D, BN + 16 (1x5) Conv2D, BN + 16 (1x5) Conv2D, mf_mod K=5, mf_mod K=2, mf_mod K=2, mf_mod K=2, BN + Dropout, 256 (1x1) Conv2D, GlobalAvgPooling2D, Softmax. 2,319,956 parameters.
[1] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, September 2018.
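The mf_mod module from [1] can be sketched in Keras roughly like this (kernel sizes, filter counts, and pooling follow the slide; the ELU activations and 'same' padding are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

def mf_mod(x, pool_size):
    """Multi-filter module: pool along frequency, run a bank of 1-D
    temporal filters of different lengths in parallel, then reduce the
    channel dimension with a 1x1 bottleneck convolution."""
    x = layers.AveragePooling2D(pool_size=(pool_size, 1))(x)
    branches = [layers.Conv2D(24, (1, k), padding="same", activation="elu")(x)
                for k in (32, 64, 96, 128, 192, 256)]
    x = layers.Concatenate()(branches)                    # 6 * 24 = 144 channels
    x = layers.Conv2D(34, (1, 1), activation="elu")(x)    # bottleneck
    return layers.BatchNormalization()(x)
```

Stacking mf_mod with pool sizes 5, 2, 2, 2 reduces e.g. 40 mel bands to a single row, while the 1x1 bottleneck keeps the parameter count near the slide's 2.3M instead of letting the filterbank channels multiply.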
87. GTZAN Benchmarking
[Bar chart] Shallow Net: Acc1 58.2% (σ=1.4), Acc2 91.0% (σ=0.4); Mf_mod Net: Acc1 63.9% (σ=1.9), Acc2 92.7% (σ=0.7); SOTA [1][2]: Acc1 71.0%, Acc2 95.0%. Again an improvement (+5.7pp Acc1), but not SOTA.
[1] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.
[2] Hendrik Schreiber and Meinard Müller. A post-processing procedure for improving music tempo estimates using supervised learning. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 235–242, Suzhou, China, October 2017.
88. Ballroom Benchmarking
[Bar chart] Shallow Net: Acc1 85.9% (σ=2.7), Acc2 95.0% (σ=0.2); Mf_mod Net: Acc1 88.0% (σ=1.7), Acc2 94.7% (σ=0.4); SOTA [1][2]: Acc1 92.0%, Acc2 98.7%. Slight Acc1 improvement (+2.1pp), but not quite SOTA.
[1] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.
[2] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, September 2018.
89. EDM Benchmarking
[Bar chart] Shallow Net: Acc1 87.9% (σ=0.8), Acc2 96.5% (σ=0.5); Mf_mod Net: Acc1 89.2% (σ=0.3), Acc2 97.4% (σ=0.3); SOTA [1]: Acc1 82.5%, Acc2 97.6%. Results slightly improved (+1.3pp Acc1, +0.9pp Acc2).
[1] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, September 2018.
90. All results got a little better: there seems to be room for improvement.
91. All results got a little better: there seems to be room for improvement. But how? Let's try something old!
92. VGG-Style Net [1]
vgg_mod: 16·2^K (5x5) Conv2D, BatchNormalization, 16·2^K (3x3) Conv2D, BatchNormalization, (2x2) MaxPooling2D*, Dropout 0.3.
Full net: Input, vgg_mod K=0, vgg_mod K=1, vgg_mod K=2, vgg_mod K=2, vgg_mod K=3, vgg_mod K=3, 256 (1x1) Conv2D, GlobalAvgPooling2D, Softmax. 1,198,704 parameters.
* pooling along the frequency axis with size 2 only as long as there is something left to pool
[1] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
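A vgg_mod block translates to Keras almost verbatim (a sketch; the ReLU activations and 'same' padding are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

def vgg_mod(x, k, pool_frequency=True):
    """VGG-style block: two convolutions with BatchNormalization,
    max pooling, and dropout; the filter count doubles with k."""
    n = 16 * 2 ** k
    x = layers.Conv2D(n, (5, 5), padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(n, (3, 3), padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    # pool along the frequency axis only while there is something left to pool
    x = layers.MaxPooling2D(pool_size=(2 if pool_frequency else 1, 2))(x)
    return layers.Dropout(0.3)(x)
```

With six such blocks (K = 0, 1, 2, 2, 3, 3), both frequency and time dimensions are halved repeatedly before the 1x1 convolution and global average pooling produce the tempo class distribution.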
93. GTZAN Benchmarking
[Bar chart] Mf_mod Net: Acc1 63.9% (σ=1.9), Acc2 92.7% (σ=0.7); VGG Net: Acc1 63.5% (σ=3.2), Acc2 91.7% (σ=0.4); SOTA [1][2]: Acc1 71.0%, Acc2 95.0%. No improvement, but as good as the specialized network.
[1] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.
[2] Hendrik Schreiber and Meinard Müller. A post-processing procedure for improving music tempo estimates using supervised learning. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 235–242, Suzhou, China, October 2017.
94. Ballroom Benchmarking
[Bar chart] Mf_mod Net: Acc1 88.0% (σ=1.7), Acc2 94.7% (σ=0.4); VGG Net: Acc1 91.6% (σ=1.7), Acc2 95.1% (σ=0.1); SOTA [1][2]: Acc1 92.0%, Acc2 98.7%. Acc1 now SOTA (+3.6pp)! Why? [3]
[1] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.
[2] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, September 2018.
[3] Sturm, Bob L. "A simple method to determine if a music information retrieval system is a 'horse'." IEEE Transactions on Multimedia 16.6 (2014): 1636-1644.
95. EDM Benchmarking
[Bar chart] Mf_mod Net: Acc1 89.2% (σ=0.3), Acc2 97.4% (σ=0.3); VGG Net: Acc1 88.8% (σ=1.2), Acc2 97.1% (σ=0.1); SOTA [1]: Acc1 82.5%, Acc2 97.6%. Similar results as the specialized approaches.
[1] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, September 2018.
96. What if we use a VGG-style network, but only rectangular filters?
97. Rect VGG-Style Net
Same as before, but with rectangular (1x5) and (1x3) filters. rect_vgg_mod: 16·2^K (1x5) Conv2D, BatchNormalization, 16·2^K (1x3) Conv2D, BatchNormalization, (2x2) MaxPooling2D, Dropout 0.3. Full net: Input, rect_vgg_mod K=0, K=1, K=2, K=2, K=3, K=3, 256 (1x1) Conv2D, GlobalAvgPooling2D, Softmax. 320,304 parameters.
98. GTZAN Benchmarking
[Bar chart] VGG Net: Acc1 63.5% (σ=3.2), Acc2 91.7% (σ=0.4); Rect VGG Net: Acc1 65.7% (σ=0.9), Acc2 91.6% (σ=0.1); SOTA [1][2]: Acc1 71.0%, Acc2 95.0%. Slight Acc1 improvement (+2.2pp).
[1] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.
[2] Hendrik Schreiber and Meinard Müller. A post-processing procedure for improving music tempo estimates using supervised learning. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 235–242, Suzhou, China, October 2017.
99. Ballroom Benchmarking
[Bar chart] VGG Net: Acc1 91.6% (σ=1.7), Acc2 95.1% (σ=0.4); Rect VGG Net: Acc1 86.2% (σ=1.6), Acc2 94.6% (σ=0.1); SOTA [1][2]: Acc1 92.0%, Acc2 98.7%. Below SOTA again (-5.4pp Acc1). Timbral information must add clues about genre and therefore tempo!
[1] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.
[2] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, September 2018.
100. EDM Benchmarking
[Bar chart] VGG Net: Acc1 88.8% (σ=1.2), Acc2 97.1% (σ=0.1); Rect VGG Net: Acc1 87.6% (σ=0.8), Acc2 96.9% (σ=0.2); SOTA [1]: Acc1 82.5%, Acc2 97.6%. Slightly lower than the square-filter VGG (-1.2pp Acc1).
[1] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, September 2018.
101. That was a lot of graphs and architectures. Let's summarize.
102. Dataset Averages: Acc1
[Bar chart] Average Acc1 across datasets for Shallow Net, MF_mod Net, VGG Net, Rect VGG Net, and the multi-step baselines Schreiber17 [1] (DSP + random forest) and Böck/madmom [2] (BLSTM + comb filters), plus Schreiber18 [3] (CNN, similar to mf_mod). Values shown: 81.3%, 72.8%, 68.6%, 79.8%, 81.3%, 80.4%, 77.3%; maximum 81.3%.
[1] Hendrik Schreiber and Meinard Müller. A post-processing procedure for improving music tempo estimates using supervised learning. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 235–242, Suzhou, China, October 2017.
[2] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.
[3] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, September 2018.
103. Dataset Averages: Acc2
[Bar chart] Average Acc2 across datasets for Shallow Net, MF_mod Net, VGG Net, Rect VGG Net, Schreiber17 [1] (DSP + random forest), Böck/madmom [2] (BLSTM + comb filters), and Schreiber18 [3] (CNN, similar to mf_mod). Values shown: 95.9%, 96.2%, 95.2%, 94.4%, 94.6%, 94.9%, 94.2%; maximum 96.2%.
[1] Hendrik Schreiber and Meinard Müller. A post-processing procedure for improving music tempo estimates using supervised learning. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 235–242, Suzhou, China, October 2017.
[2] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.
[3] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, September 2018.
104. Apparently we have a couple of systems that work pretty well. What can we do with them?
105. Applications ๏ DJ apps ๏ Annotation-supported browsing ๏ Content-based recommender systems ๏ … ๏ Anything else that's remotely cool?
106. Local Tempo Estimation "Honky Tonk Women" by The Rolling Stones
107. Local Tempo Estimation "Honky Tonk Women" by The Rolling Stones
108. Local Tempo Estimation "Typhoon" by Foreign Beggars/Chasing Shadows
109. Local Tempo Estimation "Typhoon" by Foreign Beggars/Chasing Shadows
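Local tempo curves like the ones shown for these tracks can be produced by sliding the global estimator across the recording (a sketch; `estimate_tempo` is a placeholder for any of the trained networks, and the window/hop sizes are illustrative):

```python
import numpy as np

def local_tempo(mel_spec, estimate_tempo, window=256, hop=32):
    """Apply a global tempo estimator to overlapping windows along the
    time axis, yielding (start_frame, bpm) pairs, i.e. a tempo curve."""
    n_frames = mel_spec.shape[1]
    return [(start, estimate_tempo(mel_spec[:, start:start + window]))
            for start in range(0, n_frames - window + 1, hop)]
```

Because the network was trained on 256-frame excerpts anyway, no retraining is needed; the same model serves both global and local estimation.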
110. Conclusions I ๏ Consolidates multi-component approach
111. Conclusions I ๏ Consolidates multi-component approach ๏ Completely data-driven, no heuristics
112. Conclusions I ๏ Consolidates multi-component approach ๏ Completely data-driven, no heuristics ๏ Fewer octave errors
113. Conclusions I ๏ Consolidates multi-component approach ๏ Completely data-driven, no heuristics ๏ Fewer octave errors ๏ Suitable for global and local tempo estimation
114. Conclusions II ๏ Great tempo estimation results are possible with simple, shallow networks
115. Conclusions II ๏ Great tempo estimation results are possible with simple, shallow networks ๏ Training data and data augmentation are key
116. Conclusions II ๏ Great tempo estimation results are possible with simple, shallow networks ๏ Training data and data augmentation are key ๏ Rectangular filters are sufficient for tempo estimation
117. Conclusions II ๏ Great tempo estimation results are possible with simple, shallow networks ๏ Training data and data augmentation are key ๏ Rectangular filters are sufficient for tempo estimation ๏ Square filters can improve results further, but beware of horses
118. Conclusions II ๏ Great tempo estimation results are possible with simple, shallow networks ๏ Training data and data augmentation are key ๏ Rectangular filters are sufficient for tempo estimation ๏ Square filters can improve results further, but beware of horses ๏ Yes, spectrograms differ from images, but don't re-invent the wheel
119. Questions ๏ If the DSP-ignorant VGG approach works so well, why bother with trying to design a DSP-informed network architecture? ๏ Should tempo estimators use timbral or genre information? ๏ What's with this 4% tolerance? Doesn't it make results useless for DJs?
120. Thank you. Hendrik Schreiber, tagtraum industries incorporated, [email protected], @h_schreiber