Musical Tempo Estimation with Convolutional Neural Networks
Hendrik Schreiber
November 19, 2018
Transcript

  1. INTERNATIONAL AUDIO LABORATORIES ERLANGEN
    A joint institution of Fraunhofer IIS and Universität Erlangen-Nürnberg
    Musical Tempo Estimation with Convolutional Neural Networks
    Hendrik Schreiber

    tagtraum industries incorporated
    [email protected] @h_schreiber

    About Me
    ๏ Doing business as tagtraum industries since 2004
    ๏ Author of beaTunes
    ๏ Ph.D. candidate at Meinard Müller’s lab

    www.beaTunes.com: Music Analysis (Key, Tempo, Mood, Color, Segmentation), Wikipedia Integration, Metadata Correction, Matching. Windows & Mac. (shameless product placement)

    This Talk
    ๏ About: Tempo Estimation with CNNs
    ๏ Yes, we are going to get close to or exceed the
    state of the art (SOTA)
    ๏ But: Just showing how that’s done won’t take
    long and we won’t learn much

    Let’s take the scenic route!

    This Talk
    ๏ What is tempo and tempo estimation?
    ๏ Simple digital signal processing baseline
    ๏ Translate classic approach to the CNN world
    ๏ Note issues, attempt to solve them
    ๏ Question why and how we did all this

    Tempo Estimation
    ๏ Usually global tempo estimation
    ๏ Works pretty well (MIREX)
    ๏ Remaining challenge: Octave errors

    Tempo Estimation System
    Traditional System: Spectrogram → Onset/Beat Detection → Tempo Estimation (often peak picking or some other heuristic)

    Tempo Estimation System
    Proposed System: Mel-Spectrogram → CNN-based Tempo Classification ("eliminating the middle-man")

    Case Study: A Traditional Tempo Estimation System

    What is Tempo?
    Beats per Minute

    What is Tempo?
    Periodic Increase in Volume, in Certain Frequency Bands

    Periodic Increase in Volume,
    in Certain Frequency Bands
    ๏ Framewise signal energy (sliding window)
    ๏ Short-term pattern: Compute difference between adjacent frames
    ๏ Increase: Keep only positive values (half-wave rectification)
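The three bullets above can be sketched in a few lines of plain Python (a toy sketch; splitting the signal into sliding-window frames is assumed to have happened already):

```python
def onset_strength(frames):
    """frames: list of frames, each a list of audio samples."""
    # 1. framewise signal energy (one value per sliding window)
    energy = [sum(s * s for s in frame) for frame in frames]
    # 2. difference between adjacent frames,
    # 3. keeping only positive values (half-wave rectification)
    return [max(energy[m] - energy[m - 1], 0.0) for m in range(1, len(energy))]
```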

    Increase in Volume

    First 2 seconds:
    How many beats
    do you hear?

    First 2 seconds: 4 Beats?

    Periodic Increase in Volume,
    in Certain Frequency Bands
    ๏ Log-compressed power spectrum (de-emphasize low freqs)
    ๏ Bandwise differentiation (compare apples with apples)
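A sketch of these two refinements; the compression constant follows the definition Y_ln(m, k) = ln(1000 · Y(m, k) + 1) used later in this deck, and "bandwise" means differencing each band separately before summing:

```python
import math

def log_compress(power_spec):
    """power_spec: list of frames, each a list of band powers."""
    return [[math.log(1000.0 * p + 1.0) for p in frame] for frame in power_spec]

def bandwise_novelty(spec):
    # difference each band on its own, half-wave rectify, then sum over bands
    return [sum(max(spec[m][k] - spec[m - 1][k], 0.0) for k in range(len(spec[m])))
            for m in range(1, len(spec))]
```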

    In Certain Frequency Bands

    First 2 seconds: 4 Beats!

    Periodic Increase in Volume,
    in Certain Frequency Bands
    ๏ Long-term pattern:

    Fourier transform the Onset Strength Signal (OSS)

    from time domain to frequency domain
    ๏ Peak picking
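The long-term step can be sketched as a plain DFT of the OSS with a naive peak pick (a toy sketch; `fs_oss`, the OSS sample rate in Hz, is an assumption that depends on the earlier hop size, and the BPM search range is arbitrary):

```python
import cmath
import math

def tempo_from_oss(oss, fs_oss, bpm_min=30.0, bpm_max=300.0):
    """Pick the strongest DFT bin of the onset strength signal, in BPM."""
    n = len(oss)
    best_bpm, best_mag = 0.0, -1.0
    for k in range(1, n // 2):
        bpm = 60.0 * k * fs_oss / n  # DFT bin frequency, converted to BPM
        if bpm_min <= bpm <= bpm_max:
            mag = abs(sum(oss[m] * cmath.exp(-2j * math.pi * k * m / n)
                          for m in range(n)))
            if mag > best_mag:
                best_mag, best_bpm = mag, bpm
    return best_bpm
```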

    Periodic: peak at 153.3 BPM

    Thanks, fun signal processing intro. But wasn’t this talk supposed to be about Deep Learning?

    Reminder: Convolutional
    Neural Network (CNN)
    ๏ Usually used for image
    recognition
    ๏ Each neuron has a limited
    receptive field
    ๏ Parameters are shared
    between neurons
    Image source: aphex34, https://en.wikipedia.org/wiki/Convolutional_neural_network

    Reminder: Mel-Spectrogram
    ๏ Mel scale is a perceptual scale of pitches
    judged by listeners to be equal in distance
    to each other.
    ๏ A Mel-spectrogram is a regular
    spectrogram rescaled along the Y axis
    using the Mel scale.
    ๏ Compared to a regular linear scale, it reduces the number of dimensions (akin to pooling).
    ๏ Contrary to regular images, the two axes have completely different meanings.
    (Figure axes: frequency in mel, time in frames)
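The mel mapping itself is easy to sketch, using the common 2595 · log10(1 + f/700) formula; the 40-band, 20–5,000 Hz defaults below match the setup used elsewhere in this talk (a triangular filterbank over these edges is the usual next step, omitted here):

```python
import math

def hz_to_mel(f):
    # common O'Shaughnessy-style formula
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_band_edges(fmin=20.0, fmax=5000.0, n_bands=40):
    """n_bands + 2 frequencies (Hz), equally spaced on the mel scale."""
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    mels = [lo + i * (hi - lo) / (n_bands + 1) for i in range(n_bands + 2)]
    return [700.0 * (10.0 ** (m / 2595.0) - 1.0) for m in mels]
```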

    General Approach
    ๏ Input: 11.9 s (256 frames), 40-band mel-spectrogram (20–5,000 Hz)
    ➔ Possible range: 0–645 BPM
    [ Nyquist tempo = 60·Fs/2 = 60·(256/11.9 s)/2 ≈ 645 BPM ]
    ๏ Output: Treat as classification problem (256 classes ↦ 30–286 BPM)
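The output encoding can be sketched as one class per integer BPM (a guess at the exact binning; the class count and BPM range follow the slide):

```python
def bpm_to_class(bpm, bpm_min=30, n_classes=256):
    # nearest integer BPM, clipped into [bpm_min, bpm_min + n_classes - 1]
    return min(max(round(bpm) - bpm_min, 0), n_classes - 1)

def class_to_bpm(c, bpm_min=30):
    return bpm_min + c
```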

    Translation to CNN-Speak
    1. Increase in Volume: Convolutional layer with narrow receptive field
    2. In Certain Freq. Bands: Use rectangular filters (along time axis)
    3. Periodic: Convolutional layer with wide receptive field
    4. Classification based on extracted features

    Increase in Volume, in Certain Frequency Bands: short, rectangular kernels (1 band × 3 frames) sliding over the 40-band × 256-frame mel-spectrogram

    Summarize Frames: rectangular average pooling (40 bands × 1 frame) collapses the frequency axis

    Periodic: long, rectangular kernels (1 band × 256 frames) spanning the full time axis

    Tempo CNN
    Feature Extraction: Mel-Spectrogram (40×256) → Conv2D, 4 filters (1×3), ReLU → AvgPooling2D (40×1) → Conv1D, 256 filters (kernel 256), ReLU
    Classification: Conv1D, 256 filters (kernel 1), ReLU → GlobalAvgPooling1D → softmax (256) → argmax, mapped to the interval [30, 286]
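A minimal sketch of this architecture in Keras (TensorFlow 2.x assumed). Filter counts and kernel shapes follow the slide; the `padding` choices, the `Reshape` to a 1-D sequence, and wiring the softmax directly after global pooling are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_tempo_cnn(bands=40, frames=256, classes=256):
    inp = layers.Input(shape=(bands, frames, 1))             # 40x256 mel-spectrogram
    # "increase in volume": short rectangular kernels along time
    x = layers.Conv2D(4, (1, 3), padding="same", activation="relu")(inp)
    # "summarize frames": average-pool the frequency axis away
    x = layers.AveragePooling2D(pool_size=(bands, 1))(x)
    x = layers.Reshape((frames, 4))(x)                       # to a 1-D sequence
    # "periodic": long kernels spanning the full time axis
    x = layers.Conv1D(256, frames, padding="same", activation="relu")(x)
    # classification head
    x = layers.Conv1D(256, 1, activation="relu")(x)
    x = layers.GlobalAveragePooling1D()(x)
    out = layers.Softmax()(x)                                # 256 tempo classes
    return models.Model(inp, out)
```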

    Nice idea.
    Does it work?

    EDM Training
    ๏ EDM training dataset
    - MTG Tempo [1], based on MTG Key [2]
    - 1,156 tracks, 2 min duration each
    - 80/20 train/validation split
    ๏ Randomly picked 256-frame segments
    ๏ Optimizer: Adam, lr=0.001
    ๏ Batch size: 32
    ๏ Early stopping with patience 100
    [1] Hendrik Schreiber, Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, Sept. 2018.
    [2] Ángel Faraldo, Sergi Jordà, and Perfecto Herrera. A multi-profile method for key estimation in EDM. In Proceedings of the AES International Conference on Semantic Audio, Erlangen, Germany, June 2017.

    EDM Evaluation
    EDM Test Dataset
    ๏ GiantSteps Tempo
    ๏ 661 tracks
    ๏ New annotations from [1]

    Metrics
    ๏ Acc1: Correct, allowing a 4% tolerance
    ๏ Acc2: Correct, allowing a 4% tolerance and factors 2, 3, ½, ⅓
    [1] Hendrik Schreiber, Meinard Müller. A Crowdsourced Experiment for Tempo Estimation of Electronic Dance Music. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, Sept. 2018.
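The two metrics above are simple to state in code (a sketch; `est` and `ref` are tempi in BPM):

```python
def acc1(est, ref, tol=0.04):
    # correct within a +-4% tolerance of the reference tempo
    return abs(est - ref) <= tol * ref

def acc2(est, ref, tol=0.04):
    # additionally accept the octave-related factors 2, 3, 1/2, 1/3
    return any(acc1(est, f * ref, tol) for f in (1.0, 2.0, 3.0, 0.5, 1.0 / 3.0))
```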

    DSP Baseline
    (Acc1 / Acc2)
    Energy (50-210, 46ms): 59.0% / 65.8%
    Energy (90-180, 46ms): 51.3% / 58.4%
    Energy (90-180, 23ms): 39.8% / 52.7%
    Bandwise (50-210, 46ms): 57.9% / 62.6%
    Bandwise (90-180, 46ms): 57.8% / 62.5%
    Bandwise (90-180, 23ms): 54.8% / 64.8%

    Evaluation (Acc1 / Acc2)
    DSP Baseline: 59.0% / 65.8%
    Simple FCN: 71.1% (+12.1pp) / 80.9% (+15.1pp)
    Evaluation (Acc1 / Acc2, average of 3 runs)
    DSP Baseline: 59.0% / 65.8%
    Simple FCN: 71.1% (σ=5.7) / 80.9% (σ=6.4)
    Dirty little secret: substantial standard deviations. Overfitting?!

    Let’s Drop Some!
    Tempo CNN as before: Mel-Spectrogram (40×256) → Conv2D, 4 filters (1×3), ReLU → AvgPooling2D (40×1) → Conv1D, 256 filters (kernel 256), ReLU → Conv1D, 256 filters (kernel 1), ReLU → GlobalAvgPooling1D → softmax (256) → argmax mapped to [30, 286], now with two Dropout layers (p=0.5) added.

    Scale & Crop Data Augmentation: quantized scaling to 80%, 84%, 88%, …, 116%, 120%. Requires label adjustment!
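A hedged sketch of this augmentation: time-stretch the spectrogram by one of the quantized factors (nearest-neighbour resampling along the time axis is an assumption), crop to the input length, and scale the tempo label accordingly:

```python
def scale_and_crop(spec, bpm, factor, out_frames=256):
    """spec: list of bands, each a list of frame values; factor e.g. 0.80..1.20."""
    n = len(spec[0])
    new_len = int(n / factor)  # faster playback -> fewer frames
    # nearest-neighbour resampling along the time axis
    scaled = [[band[min(int(i * factor), n - 1)] for i in range(new_len)]
              for band in spec]
    cropped = [band[:out_frames] for band in scaled]
    return cropped, bpm * factor  # the required label adjustment
```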

    Evaluation (Acc1 / Acc2, average of 3 runs)
    DSP Baseline: 59.0% / 65.8%
    Simple FCN: 71.1% (σ=5.7) / 80.9% (σ=6.4)
    +Dropout 0.5: 81.0% (σ=0.3) / 92.9% (σ=2.1)
    +Dropout 0.5 & Augmentation: 86.8% (σ=0.2) / 97.2% (σ=0.8)
    Gains over the plain FCN: +9.9pp Acc1 / +12.0pp Acc2 with dropout; +15.7pp Acc1 / +16.3pp Acc2 with dropout & augmentation

    EDM Benchmarking (Acc1 / Acc2)
    DSP Baseline (multi-step): 59.0% / 65.8%
    Schreiber17 [1] (multi-step): 63.1% / 95.2%
    Best FCN (single-step CNN): 86.8% (σ=0.2) / 97.2% (σ=0.8); fewer octave-errors!
    Schreiber18 [2] (single-step CNN): 82.5% / 97.6%
    +4.3pp over Schreiber18: beats SOTA Acc1!
    [1] Hendrik Schreiber and Meinard Müller. A post-processing procedure for improving music tempo estimates using supervised learning. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 235–242, Suzhou, China, October 2017.
    [2] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, Sept. 2018.
    Sure, but…

    Does this generalize?
    To other genres?

    GTZAN [1] Benchmarking (Acc1 / Acc2)
    1,000 tracks from 10 genres, balanced
    DSP Baseline: 53.0% / 68.8%
    SOTA [2][3]: 71.0% / 95.0%
    Best FCN: 50.5% / 86.7% (σ = 2.2, 0.2): a 36.2pp Acc1/Acc2 gap, lots of octave-errors!
    [1] George Tzanetakis and Perry Cook. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5):293–302, 2002.
    [2] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.
    [3] Hendrik Schreiber and Meinard Müller. A post-processing procedure for improving music tempo estimates using supervised learning. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 235–242, Suzhou, China, October 2017.

    Ballroom [1] Benchmarking (Acc1 / Acc2)
    698 tracks from ballroom dance genres
    DSP Baseline: 46.1% / 78.7%
    SOTA [2][3]: 92.0% / 98.7%
    Best FCN: 56.5% / 81.2% (σ = 1.8, 0.5): a 24.7pp Acc1/Acc2 gap, lots of octave-errors!
    [1] Fabien Gouyon, Anssi P. Klapuri, Simon Dixon, Miguel Alonso, George Tzanetakis, Christian Uhle, and Pedro Cano. An experimental comparison of audio tempo induction algorithms. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1832–1844, 2006.
    [2] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.
    [3] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, Sept. 2018.

    Testset Tempo Distributions
    [Figure: histograms of the tempo distributions (% of tracks per 10-BPM interval, 20–260 BPM) for the test datasets: ISMIR2004 Songs (µ=89.80, σ=27.83, N=464), GTZAN (µ=94.55, σ=24.39, N=999), ACM MIRUM (µ=102.72, σ=32.58, N=1410), Hainsworth (µ=113.30, σ=28.78, N=222), Ballroom (µ=129.77, σ=39.61, N=698), GiantSteps (µ=136.66, σ=28.33, N=664)]
    GiantSteps: µ=136.7, σ=28.3 vs GTZAN: µ=95.6, σ=24.4

    Testset Tempo Distributions
    [Same tempo-distribution figure]
    GiantSteps: µ=136.7, σ=28.3 vs Ballroom: µ=129.8, σ=39.6
    N = 698
    20 – 30
    30 – 40
    40 – 50
    50 – 60
    60 – 70
    70 – 80
    80 – 90
    90 – 100
    100 – 110
    110 – 120
    120 – 130
    130 – 140
    140 – 150
    150 – 160
    160 – 170
    170 – 180
    180 – 190
    190 – 200
    200 – 210
    210 – 220
    220 – 230
    230 – 240
    240 – 250
    250 – 260
    0
    10
    20
    30
    Tempo intervals in BPM
    % of tracks
    GiantSteps µ = 136.66, = 28.33
    N = 664
    Figure 1. Tempo distributions for the test datasets.


  74. INTERNATIONAL AUDIO LABORATORIES ERLANGEN
    A joint institution of Fraunhofer IIS and Universität Erlangen-Nürnberg
    Limits of Augmentation

    Sheep will always stay sheep,

    no matter how you scale,

    rotate, crop, or shear.

  75.
    More (diverse) training data
    probably wouldn’t hurt…

  76.
    Diverse Training
    ๏ MTG Tempo—tempo annotations for MTG Key [1], N=1,159
    ๏ LMD Tempo—derived from Lakh MIDI Dataset [2], N=3,611
    ๏ Eball—Extended Ballroom [3] sans Ballroom, N=3,826
    [1] Ángel Faraldo, Sergi Jordà, and Perfecto Herrera. A multi-profile method for key estimation in EDM. In Proceedings of the AES International Conference on Semantic Audio, Erlangen, Germany, June 2017.

    [2] Colin Raffel. "Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching". PhD Thesis, 2016.

    [3] Ugo Marchand and Geoffroy Peeters. The extended ballroom dataset. In Late Breaking Demo of the International Conference on Music Information Retrieval (ISMIR), New York, NY, USA, 2016.
    Combined: 8,596 samples
    EDM
    Rock/Pop
    Ballroom

  77.
    Diverse Training
    ๏ MTG Tempo—tempo annotations for MTG Key [1], N=1,159
    ๏ LMD Tempo—derived from Lakh MIDI Dataset [2], N=3,611
    ๏ Eball—Extended Ballroom [3] sans Ballroom, N=3,826
    [1] Ángel Faraldo, Sergi Jordà, and Perfecto Herrera. A multi-profile method for key estimation in EDM. In Proceedings of the AES International Conference on Semantic Audio, Erlangen, Germany, June 2017.

    [2] Colin Raffel. "Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching". PhD Thesis, 2016.

    [3] Ugo Marchand and Geoffroy Peeters. The extended ballroom dataset. In Late Breaking Demo of the International Conference on Music Information Retrieval (ISMIR), New York, NY, USA, 2016.
    Combined: 8,596 samples
    EDM
    Rock/Pop
    Ballroom
    Still MIA:

    Jazz, World,
    Classical,
    Reggae, …

  78.
    Diverse Training
๏ Randomly picked 256-frame segments
    ๏ Optimizer Adam, lr=0.001
    ๏ Batch size 32
    ๏ 90/10 train/validation split
    ๏ Early stopping with patience 150
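The early-stopping rule from the list above can be sketched in framework-agnostic Python (a minimal sketch; the slide only specifies the patience value, the actual training framework is not stated):

```python
def early_stop_epoch(val_losses, patience=150):
    """Return the epoch training stops at: the first epoch after which the
    best validation loss has not improved for `patience` epochs."""
    best = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch  # stop here; best weights are those from best_epoch
    return len(val_losses) - 1  # ran out of epochs without triggering

# toy run with patience=2: best loss at epoch 1, no improvement for 2 epochs
print(early_stop_epoch([1.0, 0.8, 0.9, 0.85], patience=2))  # -> 3
```

With patience 150, training continues for up to 150 epochs past the best validation result before giving up.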

  79.
    GTZAN Benchmarking
    [1] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters.

    In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.

    [2] Hendrik Schreiber and Meinard Müller. A post-processing procedure for improving music tempo estimates using supervised learning.

    In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 235–242, Suzhou, China, October 2017.

    Trained on EDM
    Trained on Diverse
    SOTA [1][2]
    Accuracy in %
    0% 20% 40% 60% 80% 100%
    95.0%
    91.0%
    86.7%
    71.0%
    58.2%
    50.5%
    Acc1 Acc2
    σ=2.2
    σ=0.2
    σ=1.4
    σ=0.4
    Clearly an
    improvement,
    but not SOTA
    +7.7pp
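Acc1 and Acc2 in these charts follow the usual tempo-evaluation convention, which the deck does not restate: Acc1 counts an estimate as correct when it lies within ±4% of the annotated tempo, Acc2 additionally accepts so-called octave errors by factors 2, 3, 1/2, and 1/3. A sketch of these metrics:

```python
def acc1(est_bpm, ref_bpm, tol=0.04):
    """True if the estimate lies within ±4% of the reference tempo."""
    return abs(est_bpm - ref_bpm) <= tol * ref_bpm

def acc2(est_bpm, ref_bpm, tol=0.04):
    """Like acc1, but octave errors (factors 2, 3, 1/2, 1/3) also count."""
    return any(acc1(est_bpm, ref_bpm * f, tol) for f in (1, 2, 3, 1/2, 1/3))

# 118 vs. 120 BPM passes Acc1; 60 vs. 120 BPM only passes Acc2
print(acc1(120, 118), acc2(60, 120))  # -> True True
```

The gap between the Acc1 and Acc2 bars is therefore a direct measure of how many octave errors a system makes.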

  80.
    Tempo Distributions
[Screenshot of paper pages. One excerpt introduces the baseline tempo estimation algorithm (mono conversion, downsampling to 11025 Hz, onset signal strength, enhanced beat spectrum) and Figure 1, "Tempo distributions for the test datasets": ISMIR2004 Songs (µ=89.80, σ=27.83, N=464), GTZAN (µ=94.55, σ=24.39, N=999), ACM MIRUM (µ=102.72, σ=32.58, N=1410), Hainsworth (µ=113.30, σ=28.78, N=222), Ballroom (µ=129.77, σ=39.61, N=698), and GiantSteps (µ=136.66, σ=28.33, N=664). A second excerpt, from the ISMIR 2018 paper, shows Figure 1, "Tempo distribution for the Train dataset consisting of LMD Tempo, MTG Tempo, and EBall" (µ=121.32, σ=30.52, N=8,596), and explains how LMD Tempo was built: the tempo of the matched audio previews was estimated with the algorithm from [31], the associated MIDI files were parsed for tempo change messages, and an estimate was kept when more than half of a preview's tempo messages lay within 2% of it, yielding 3,611 tracks. Of the 76% matched to Million Song Dataset genre annotations [29], 29% were labeled rock, 27% pop, 5% r&b, 5% dance, 5% country, 4% latin, and 3% electronic, with less than 2% labeled jazz, soundtrack, world, and others, making LMD Tempo a good cross-section of popular music.]
Slide overlay: Diverse µ=121.3 σ=30.5 vs. GTZAN µ=94.6 σ=24.4

  81.
    Ballroom Benchmarking
    Trained on EDM
    Trained on Diverse
    SOTA [1][2]
    Accuracy in %
    0% 20% 40% 60% 80% 100%
    98.7%
    95.0%
    81.2%
    92.0%
    85.9%
    56.5%
    Acc1 Acc2
    σ=1.8
    σ=0.5
    σ=2.7
    σ=0.2
    Huge
    improvement,
    but not quite
    SOTA
    [1] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters.

    In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.

    [2] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network.

    In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, Sept. 2018.

    +29.4pp

  82.
    EDM Benchmarking
    Trained on EDM
    Trained on Diverse
    SOTA [1]
    Accuracy in %
    0% 20% 40% 60% 80% 100%
    97.6%
    96.5%
    97.2%
    82.5%
    87.9%
    86.8%
    Acc1 Acc2
    σ=0.8
    σ=0.2
    σ=0.8
    σ=0.5
    [1] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network.

    In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, Sept. 2018.

    Results mostly
    unchanged

  83.
    Adding more samples made a difference.
    Can we squeeze more out of the data by
    improving the network architecture?

  84.

  85.
    Going Deeper
    Input
    4 (1x3) Conv2D
    Dropout
    256 (256) Conv1D
    Dropout
    256 (1) Conv1D
    AvgPooling2D
    GlobalAvgPooling
    Softmax
328,208 parameters

  86.
    Going Deeper
    Input
    4 (1x3) Conv2D
    Dropout
    256 (256) Conv1D
    Dropout
    256 (1) Conv1D
    AvgPooling2D
    GlobalAvgPooling
    Softmax
328,208 parameters
    Input
    4 (1x3) Conv2D
    Dropout
    256 (256) Conv1D
    Dropout
    256 (1) Conv1D
    AvgPooling2D
    GlobalAvgPooling
    Softmax
    256 (256) Conv1D
17,105,680 parameters

  87.
    Going Deeper
    Input
    4 (1x3) Conv2D
    Dropout
    256 (256) Conv1D
    Dropout
    256 (1) Conv1D
    AvgPooling2D
    GlobalAvgPooling
    Softmax
328,208 parameters
    Input
    4 (1x3) Conv2D
    Dropout
    256 (256) Conv1D
    Dropout
    256 (1) Conv1D
    AvgPooling2D
    GlobalAvgPooling
    Softmax
    256 (256) Conv1D
17,105,680 parameters
    Input
    4 (1x3) Conv2D
    Dropout
    256 (256) Conv1D
    Dropout
    256 (1) Conv1D
    AvgPooling2D
    GlobalAvgPooling
    Softmax
    256 (256) Conv1D
    256 (256) Conv1D
33,883,152 parameters
    Parameter

    Overkill!
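The parameter counts on these three slides can be reproduced with simple arithmetic (a sketch assuming single-channel input, biased convolutions, and that pooling, dropout, and softmax layers contribute no parameters; under those assumptions the totals match exactly):

```python
def conv_params(n_filters, kernel, in_channels):
    # weights plus one bias per filter
    return n_filters * (kernel * in_channels) + n_filters

# 4 (1x3) Conv2D filters on a 1-channel spectrogram
c0 = conv_params(4, 1 * 3, 1)        # 16
# 256 Conv1D filters of length 256 over 4 channels
c1 = conv_params(256, 256, 4)        # 262,400
# 256 Conv1D filters of length 1 over 256 channels
c2 = conv_params(256, 1, 256)        # 65,792
shallow = c0 + c1 + c2
# each extra 256 (256) Conv1D layer sees 256 input channels
extra = conv_params(256, 256, 256)   # 16,777,472
print(shallow, shallow + extra, shallow + 2 * extra)
# -> 328208 17105680 33883152, matching the three diagrams
```

The arithmetic makes the problem obvious: each additional full-length Conv1D layer alone costs roughly 16.8 million parameters.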

  88.
    Going Deeper
What if we combined ideas from [1] and [2]?
๏ Bottleneck layers (1x1 conv) ⇒ dimensionality reduction
๏ Filter bank (not every filter needs to be 256 frames long)
๏ Stepwise pooling along frequency axis possible
๏ BatchNormalization to avoid covariate shift
๏ Add layers with short filters
    [1] Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

    [2] Pons, Jordi, and Xavier Serra. "Designing efficient architectures for modeling temporal features with convolutional neural networks.”

    IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, USA, March 2017.
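To see why the 1x1 bottleneck pays off, compare the cost of one filter bank with and without it (illustrative arithmetic using the numbers from the mf_mod diagram on the next slide: six banks of 24 filters with lengths 32 to 256, the concatenated 6 x 24 = 144 channels squeezed to 34 before the next module; the real module also pools, so this is only a back-of-the-envelope comparison):

```python
# one mf_mod filter bank: 24 filters each of length 32, 64, 96, 128, 192, 256
lengths = (32, 64, 96, 128, 192, 256)

def bank_params(in_channels, n=24):
    return sum(n * (length * in_channels) + n for length in lengths)

wide = bank_params(144)              # bank applied directly to 144 channels
bottleneck = 34 * (1 * 144) + 34     # 1x1 conv squeezing 144 -> 34 channels
narrow = bottleneck + bank_params(34)
print(wide, narrow, round(wide / narrow, 1))  # -> 2654352 631762 4.2
```

Even including the bottleneck's own weights, squeezing the channels first makes the following filter bank roughly four times cheaper.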

  89.
    Going Deeper
    (Kx1) AvgPooling2D
    24 (1x32) C2D 24 (1x64) C2D 24 (1x96) C2D 24 (1x128) C2D 24 (1x192) C2D 24 (1x256) C2D
    Concatenation
    34 (1x1) Conv2D
    mf_mod[1]
    BatchNormalization
    stepwise pooling
    along frequency axis
    filterbank
    bottleneck
    Input
    BN + 16 (1x5) Conv2D
    mf_mod K=5
    BN + Dropout
    256 (1x1) Conv2D
    GlobalAvgPooling2D
    Softmax
2,319,956 parameters
    BN + 16 (1x5) Conv2D
    BN + 16 (1x5) Conv2D
    mf_mod K=2
    mf_mod K=2
    mf_mod K=2
    [1] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network.

    In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, Sept. 2018.

  90.
    GTZAN Benchmarking
    [1] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters.

    In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.

    [2] Hendrik Schreiber and Meinard Müller. A post-processing procedure for improving music tempo estimates using supervised learning.

    In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 235–242, Suzhou, China, October 2017.

    Shallow Net
    Mf_mod Net
    SOTA [1][2]
    Accuracy in %
    0% 20% 40% 60% 80% 100%
    95.0%
    92.7%
    91.0%
    71.0%
    63.9%
    58.2%
    Acc1 Acc2
    σ=1.4
    σ=0.4
    σ=1.9
    σ=0.7
    Again an
    improvement,
    but not SOTA
    +5.7pp

  91.
    Ballroom Benchmarking
    Shallow Net
    Mf_mod Net
    SOTA [1][2]
    Accuracy in %
    0% 20% 40% 60% 80% 100%
    98.7%
    94.7%
    95.0%
    92.0%
    88.0%
    85.9%
    Acc1 Acc2
    σ=2.7
    σ=0.2
    σ=1.7
    σ=0.4
    Slight Acc1
    improvement,
    but not quite
    SOTA
    [1] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters.

    In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.

    [2] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network.

    In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, Sept. 2018.

    +2.1pp

  92.
    EDM Benchmarking
    Shallow Net
    Mf_mod Net
    SOTA [1]
    Accuracy in %
    0% 20% 40% 60% 80% 100%
    97.6%
    97.4%
    96.5%
    82.5%
    89.2%
    87.9%
    Acc1 Acc2
    σ=0.8
    σ=0.5
    σ=0.3
    σ=0.3
    [1] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network.

    In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, Sept. 2018.

    Results slightly
    improved
    +1.3pp
    +0.9pp

  93.
    All results got a little better:

    There seems to be room for improvement.

  94.
    All results got a little better:

    There seems to be room for improvement.
    But how?


    Let’s try something old!

  95.
    VGG-Style Net[1]
1,198,704 parameters
    16*2K (5x5) Conv2D
    BatchNormalization
    16*2K (3x3) Conv2D
    BatchNormalization
    (2x2) MaxPooling2D*
    vgg_mod
    Dropout 0.3
    Input
    vgg_mod K=0
    256 (1x1) Conv2D
    GlobalAvgPooling2D
    Softmax
    vgg_mod K=1
    vgg_mod K=2
    vgg_mod K=2
    vgg_mod K=3
    vgg_mod K=3
    * pooling along frequency axis with size 2 only as long as there is something left to pool
    [1] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).

  96.
    GTZAN Benchmarking
    [1] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters.

    In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.

    [2] Hendrik Schreiber and Meinard Müller. A post-processing procedure for improving music tempo estimates using supervised learning.

    In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 235–242, Suzhou, China, October 2017.

    Mf_mod Net
    VGG Net
    SOTA [1][2]
    Accuracy in %
    0% 20% 40% 60% 80% 100%
    95.0%
    91.7%
    92.7%
    71.0%
    63.5%
    63.9%
    Acc1 Acc2
    σ=1.9
    σ=0.7
    σ=3.2
    σ=0.4
    No improvement,
    but as good as
    specialized network

  97.
    Ballroom Benchmarking
    Mf_mod Net
    VGG Net
    SOTA [1][2]
    Accuracy in %
    0% 20% 40% 60% 80% 100%
    98.7%
    95.1%
    94.7%
    92.0%
    91.6%
    88.0%
    Acc1 Acc2
    σ=1.7
    σ=0.4
    σ=1.7
    σ=0.1
    Acc1 now SOTA!
Why? A “horse”? [3]
    [1] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters.

    In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.

    [2] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network.

    In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, Sept. 2018.

    [3] Sturm, Bob L. "A simple method to determine if a music information retrieval system is a “horse”." IEEE Transactions on Multimedia 16.6 (2014): 1636-1644.
    +3.6pp

  98.
    EDM Benchmarking
    Mf_mod Net
    VGG Net
    SOTA [1]
    Accuracy in %
    0% 20% 40% 60% 80% 100%
    97.6%
    97.1%
    97.4%
    82.5%
    88.8%
    89.2%
    Acc1 Acc2
    σ=0.3
    σ=0.3
    σ=1.2
    σ=0.1
    [1] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network.

    In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, Sept. 2018.

Results similar to specialized approaches

  99.
    What if we use a VGG style network,

    but only rectangular filters?

  100.
    Rect VGG-Style Net
320,304 parameters
    16*2K (1x5) Conv2D
    BatchNormalization
    16*2K (1x3) Conv2D
    BatchNormalization
    (2x2) MaxPooling2D
    rect_vgg_mod
    Dropout 0.3
    Input
    rect_vgg_mod K=0
    256 (1x1) Conv2D
    GlobalAvgPooling2D
    Softmax
    rect_vgg_mod K=1
    rect_vgg_mod K=2
    rect_vgg_mod K=2
    rect_vgg_mod K=3
    rect_vgg_mod K=3
    same as before,
    but with rectangular
    (1x5) and (1x3) filters
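Both VGG-style parameter counts can be checked with the same arithmetic (a sketch assuming single-channel input, biased convolutions, and BatchNormalization contributing four parameters per channel, the way Keras reports it; under those assumptions both slide totals match exactly):

```python
def conv(n_filters, kh, kw, in_channels):
    return n_filters * (kh * kw * in_channels) + n_filters

def bn(channels):
    return 4 * channels  # gamma, beta, moving mean, moving variance

def vgg_net(kernel1, kernel2):
    """Stack of vgg_mod blocks with 16*2^K filters for K = 0,1,2,2,3,3,
    each block: conv(kernel1) + BN + conv(kernel2) + BN; then 256 (1x1) conv."""
    total, cin = 0, 1
    for K in (0, 1, 2, 2, 3, 3):
        f = 16 * 2 ** K
        total += conv(f, *kernel1, cin) + bn(f)
        total += conv(f, *kernel2, f) + bn(f)
        cin = f
    return total + conv(256, 1, 1, cin)

print(vgg_net((5, 5), (3, 3)), vgg_net((1, 5), (1, 3)))
# -> 1198704 320304, matching the square and rectangular variants
```

Swapping the square (5x5)/(3x3) kernels for rectangular (1x5)/(1x3) kernels is what shrinks the network from about 1.2M to about 0.32M parameters.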

  101.
    GTZAN Benchmarking
    [1] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters.

    In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.

    [2] Hendrik Schreiber and Meinard Müller. A post-processing procedure for improving music tempo estimates using supervised learning.

    In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 235–242, Suzhou, China, October 2017.

    VGG Net
    Rect VGG Net
    SOTA [1][2]
    Accuracy in %
    0% 20% 40% 60% 80% 100%
    95.0%
    91.6%
    91.7%
    71.0%
    65.7%
    63.5%
    Acc1 Acc2
    σ=3.2
    σ=0.4
    σ=0.9
    σ=0.1
    Slight Acc1
    improvement
    +2.2pp

  102.
    Ballroom Benchmarking
    VGG Net
    Rect VGG Net
    SOTA [1][2]
    Accuracy in %
    0% 20% 40% 60% 80% 100%
    98.7%
    94.6%
    95.1%
    92.0%
    86.2%
    91.6%
    Acc1 Acc2
    σ=1.7
    σ=0.4
    σ=1.6
    σ=0.1
    Below SOTA again.

    Timbral information must add clues
    about genre and therefore tempo!
    [1] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters.

    In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.

    [2] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network.

    In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, Sept. 2018.

    -5.4pp

  103.
    EDM Benchmarking
    VGG Net
    Rect VGG Net
    SOTA [1]
    Accuracy in %
    0% 20% 40% 60% 80% 100%
    97.6%
    96.9%
    97.1%
    82.5%
    87.6%
    88.8%
    Acc1 Acc2
    σ=1.2
    σ=0.1
    σ=0.8
    σ=0.2
    [1] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network.

    In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, Sept. 2018.

    Slightly lower than square VGG
    -1.2pp

  104.
    That was a lot of graphs and
    architectures.
    Let’s summarize.

  105.
    Multi-Step
    Dataset Averages: Acc1
    [1] Hendrik Schreiber and Meinard Müller. A post-processing procedure for improving music tempo estimates using supervised learning.

    In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 235–242, Suzhou, China, October 2017.

    [2] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters.

    In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.

    [3] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network.

    In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, Sept. 2018.
    DSP + RandomForest
    CNN - similar to Mf_mod
    BLSTM + Comb Filters
    Shallow Net
    MF_mod Net
    VGG Net
    Rect VGG Net
    Schreiber17 [1]
    Böck/Madmom [2]
    Schreiber18 [3]
    Accuracy 1 in %
    0% 20% 40% 60% 80% 100%
    81.3%
    72.8%
    68.6%
    79.8%
    81.3%
    80.4%
    77.3%
    81.3%
    Max

  106.
    Dataset Averages: Acc2
    Shallow Net
    MF_mod Net
    VGG Net
    Rect VGG Net
    Schreiber17 [1]
    Böck/Madmom [2]
    Schreiber18 [3]
    Accuracy 2 in %
    0% 20% 40% 60% 80% 100%
    95.9%
    96.2%
    95.2%
    94.4%
    94.6%
    94.9%
    94.2%
    96.2%
    Max
    [1] Hendrik Schreiber and Meinard Müller. A post-processing procedure for improving music tempo estimates using supervised learning.

    In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 235–242, Suzhou, China, October 2017.

    [2] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters.

    In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.

    [3] Hendrik Schreiber and Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network.

    In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 98–105, Paris, France, Sept. 2018.
    DSP + RandomForest
    CNN - similar to Mf_mod
    BLSTM + Comb Filters

  107.
    Apparently we have a couple of
    systems that work pretty well.
    What can we do with them?

  108.
    Applications
    ๏ DJ apps
    ๏ Annotation supported browsing
    ๏ Content-based recommender systems
    ๏ …
    ๏ Anything else that’s remotely cool?

  109.
    Local Tempo Estimation
    “Honky Tonk Women” by The Rolling Stones

  110.
    Local Tempo Estimation
    “Honky Tonk Women” by The Rolling Stones

  111.
    Local Tempo Estimation
    “Typhoon” by Foreign Beggars/Chasing Shadows

  112.
    Local Tempo Estimation
    “Typhoon” by Foreign Beggars/Chasing Shadows
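The local tempo curves on these slides come from running a global estimator over short, overlapping excerpts of the track. A minimal sketch of that sliding-window idea, with a hypothetical `estimate_tempo` placeholder standing in for the CNN (window and hop lengths are illustrative assumptions, not the talk's exact values):

```python
# Sketch of local tempo estimation by sliding a global estimator over
# short overlapping windows. `estimate_tempo` is a hypothetical stand-in
# for any global estimator (e.g., a CNN run on a mel spectrogram).

def estimate_tempo(window):
    # Hypothetical placeholder: a real system would compute a mel
    # spectrogram of `window`, run the CNN, and return a BPM value.
    return 120.0

def local_tempo(samples, sr, win_seconds=11.9, hop_seconds=1.0):
    """Return (time_in_seconds, bpm) pairs for overlapping windows."""
    win = int(win_seconds * sr)
    hop = int(hop_seconds * sr)
    curve = []
    for start in range(0, max(1, len(samples) - win + 1), hop):
        window = samples[start:start + win]
        curve.append((start / sr, estimate_tempo(window)))
    return curve

# Toy usage: 30 s of silence at 44.1 kHz yields one estimate per second.
sr = 44100
curve = local_tempo([0.0] * (30 * sr), sr)
```

Plotting the resulting curve over time is what reveals tempo changes such as the drift in "Honky Tonk Women" or the half-time sections in "Typhoon".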

  113.
    Conclusions I
    ๏ Consolidates multi-component approach
    ๏ Completely data-driven, no heuristics
    ๏ Fewer octave errors
    ๏ Suitable for global and local tempo estimation

  117.
    Conclusions II
    ๏ Great tempo estimation results are possible with simple, shallow networks
    ๏ Training data and data augmentation are key
    ๏ Rectangular filters are sufficient for tempo estimation
    ๏ Square filters can improve results further, but beware of horses
    ๏ Yes, spectrograms differ from images, but don’t re-invent the wheel
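To make the filter-shape distinction concrete, here is a toy sketch (shapes and values invented for illustration, not taken from the paper) of a "rectangular" time-only kernel versus a square kernel applied to a tiny spectrogram-like array. The rectangular 1×W kernel only looks along the time axis, which is enough to pick up periodicities; the square K×K kernel additionally mixes neighboring frequency bands, which can capture timbre:

```python
# Naive 2-D "valid" convolution (strictly: cross-correlation) over a
# (bands x frames) spectrogram, written out explicitly for clarity.

def conv_valid(spec, kernel):
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(spec), len(spec[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += spec[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

# A 4-band, 8-frame toy "spectrogram" with an onset every 2 frames.
spec = [[1.0 if t % 2 == 0 else 0.0 for t in range(8)] for _ in range(4)]

rect = [[1.0, 0.0, 1.0]]                 # 1x3 rectangular (time-only) kernel
square = [[1.0] * 3 for _ in range(3)]   # 3x3 square kernel

r = conv_valid(spec, rect)    # 4 x 6: responds to the 2-frame periodicity
s = conv_valid(spec, square)  # 2 x 6: also sums across frequency bands
```

The rectangular kernel's output peaks exactly where the onsets align with its 2-frame spacing, which is the kind of periodicity evidence a tempo estimator needs.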

  122.
    Questions
    ๏ If the DSP-ignorant VGG approach works so well, why bother designing a DSP-informed network architecture?
    ๏ Should tempo estimators use timbral or genre information?
    ๏ What’s with the 4% tolerance? Doesn’t it make results useless for DJs?
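For context on the last question: the 4% tolerance is the standard Accuracy1 criterion used in tempo-estimation evaluation, and Accuracy2 additionally forgives "octave errors", i.e., estimates off by a factor of 2, 3, 1/2, or 1/3. A minimal sketch of both metrics:

```python
# Standard tempo evaluation metrics: Accuracy1 counts an estimate as
# correct if it lies within ±4% of the reference tempo; Accuracy2 also
# accepts estimates off by the common octave-error factors.

def accuracy1(estimate, reference, tolerance=0.04):
    return abs(estimate - reference) <= tolerance * reference

def accuracy2(estimate, reference, tolerance=0.04):
    factors = (1.0, 2.0, 3.0, 1.0 / 2.0, 1.0 / 3.0)
    return any(accuracy1(estimate, reference * f, tolerance) for f in factors)

# 124 BPM vs. a 120 BPM reference is within 4% (Accuracy1 holds);
# 60 BPM is a half-tempo octave error (only Accuracy2 holds).
```

For a DJ, an Accuracy1 hit with a 4% deviation may indeed be too coarse for beatmatching, which is exactly why the question is worth raising.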

  123.
    Thank you.
    Hendrik Schreiber

    tagtraum industries incorporated
    [email protected] @h_schreiber
