INTERNATIONAL AUDIO LABORATORIES ERLANGEN

A joint institution of Fraunhofer IIS and Universität Erlangen-Nürnberg

Tempo Distributions

tures, and then compare our results with those from other

methods. Finally, in Section 5, we present our conclusions.

2. TEMPO ESTIMATION

To lay the groundwork for our error correction method, we first describe a simple tempo estimation algorithm, then introduce several test datasets and discuss common pitfalls. In Section 2.5, we introduce performance metrics and describe observed errors.

2.1 Algorithm

To estimate the dominant pulse we follow the approach taken in [24], which is similar to [23, 28]: We first convert the signal to mono and downsample to 11025 Hz. Then we compute the power spectrum Y of 93 ms windows with half overlap, by applying a Hamming window and performing an STFT. The power for each bin k ∈ [0 : K] := {0, 1, 2, . . . , K} at time m ∈ [0 : M] := {0, 1, 2, . . . , M} is given by Y(m, k), its positive logarithmic power by Y_ln(m, k) := ln(1000 · Y(m, k) + 1), and its frequency by F(k), given in Hz. We define the onset signal strength OSS(m) as the sum of the bandwise differences between the logarithmic powers Y_ln(m, k) and Y_ln(m − 1, k) for those k where the frequency F(k) ∈ [30, 720]

[Figure 1 panels: tempo distributions (tempo intervals in BPM vs. % of tracks) for ISMIR2004 Songs (µ = 89.80, σ = 27.83, N = 464), GTZAN (µ = 94.55, σ = 24.39, N = 999), ACM MIRUM (µ = 102.72, σ = 32.58, N = 1410), Hainsworth (µ = 113.30, σ = 28.78, N = 222), and Ballroom (µ = 129.77, σ = 39.61, N = 698).]

and Y(m, k) is greater than αY(m − 1, k) (see [16]):

I(m, k) = 1 if Y(m, k) > αY(m − 1, k) and F(k) ∈ [30, 720], and 0 otherwise   (1)

OSS(m) = Σ_k (Y_ln(m, k) − Y_ln(m − 1, k)) · I(m, k)

Both the factor α = 1.76 and the frequency range were

found experimentally [24].
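As an illustrative sketch of the steps up to Eq. (1), the OSS computation can be written as follows. This is our own NumPy reconstruction, not the authors' code; the 1024-sample frame (≈ 93 ms at 11025 Hz), function name, and framing details are assumptions.

```python
import numpy as np

ALPHA = 1.76          # rise factor, found experimentally [24]
F_LO, F_HI = 30, 720  # analysis band in Hz

def onset_signal_strength(x, sr=11025, n_win=1024):
    """Sketch of OSS(m): summed bandwise log-power rises between frames."""
    hop = n_win // 2                      # half-overlapping windows
    win = np.hamming(n_win)
    n_frames = 1 + (len(x) - n_win) // hop
    frames = np.stack([x[m * hop : m * hop + n_win] * win
                       for m in range(n_frames)])
    Y = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # power spectrum Y(m, k)
    Y_ln = np.log(1000.0 * Y + 1.0)                # positive logarithmic power
    F = np.fft.rfftfreq(n_win, d=1.0 / sr)         # bin frequencies F(k)
    band = (F >= F_LO) & (F <= F_HI)               # F(k) in [30, 720]
    oss = np.zeros(n_frames)
    for m in range(1, n_frames):
        I = band & (Y[m] > ALPHA * Y[m - 1])       # indicator of Eq. (1)
        oss[m] = np.sum((Y_ln[m] - Y_ln[m - 1])[I])
    return oss
```

Because α > 1, the indicator only fires where the log power strictly increases, so OSS(m) is non-negative by construction.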

The OSS(m) is transformed using a DFT with length 8192. At the given sample rate, this ensures a resolution of 0.156 BPM. The peaks of the resulting beat spectrum B represent the strength of BPM values in the signal [7], but do not take harmonics into account [10, 21]. Therefore we derive an enhanced beat spectrum B_E that boosts frequencies supported by harmonics:

B_E(k) = Σ_{i=0}^{2} |B(⌊k/2^i + 0.5⌋)|   (2)
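A minimal sketch of the harmonic enhancement in Eq. (2), assuming B is obtained as the magnitude of a zero-padded real DFT of the OSS (our own illustration; the function name and return convention are not from the paper):

```python
import numpy as np

def enhanced_beat_spectrum(oss, n_dft=8192):
    """Compute beat spectrum B and its enhanced version B_E per Eq. (2):
    B_E(k) = sum_{i=0}^{2} |B(floor(k / 2^i + 0.5))|, so bin k is boosted
    when its subharmonic positions k/2 and k/4 also carry energy."""
    B = np.abs(np.fft.rfft(oss, n=n_dft))   # zero-padded DFT magnitude
    k = np.arange(len(B))
    BE = sum(B[np.floor(k / 2**i + 0.5).astype(int)] for i in range(3))
    return B, BE
```

Note that under this reading, B_E peaks near four times the fundamental pulse frequency, which is consistent with the subsequent step of dividing the peak frequency by 4.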


[Figure 1 panel: GiantSteps (µ = 136.66, σ = 28.33, N = 664); tempo intervals in BPM vs. % of tracks.]

Figure 1. Tempo distributions for the test datasets.

highest value of B_E, divide its frequency by 4 to find the first harmonic, and finally convert the resulting frequency to BPM by multiplying by 60.
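The peak-picking and conversion step just described can be sketched as follows (our own illustration; the function name and the oss_rate parameter are assumptions, and the factor 60 simply converts Hz to beats per minute):

```python
import numpy as np

def tempo_from_beat_spectrum(BE, oss_rate, n_dft=8192):
    """Pick the strongest enhanced-beat-spectrum peak, step down to the
    first harmonic, and convert its frequency from Hz to BPM."""
    k_peak = int(np.argmax(BE))           # highest value of B_E
    f_peak = k_peak * oss_rate / n_dft    # DFT bin index -> frequency in Hz
    return 60.0 * (f_peak / 4.0)          # /4 to first harmonic, *60 to BPM
```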

Figure 1: Tempo distribution for the Train dataset consisting of LMD Tempo, MTG Tempo, and EBall (µ = 121.32, σ = 30.52, N = 8,596).

sisting of multiple components (“layers”) that has evolved naturally. But to the best of our knowledge, no one has yet replaced the traditional multi-component architecture with a single deep neural network (DNN). In this paper we describe a CNN-based approach that estimates the local

end, we estimated the tempo of the matched audio pre-

views using the algorithm from [31]. Then the associated

MIDI files were parsed for tempo change messages. If the values of more than half of the tempo messages for a given preview were within 2% of the estimated tempo, we assumed the estimated tempo of the audio excerpt to be correct and added it to LMD Tempo. This resulted in 3,611

audio tracks. We were able to match more than 76% of the

tracks to the Million Song Dataset (MSD) genre annota-

tions from [29]. Of the matched tracks, 29% were labeled

rock, 27% pop, 5% r&b, 5% dance, 5% country,

4% latin, and 3% electronic. Less than 2% of the

tracks were labeled jazz, soundtrack, world and

others. Thus it is fair to characterize LMD Tempo as a

good cross-section of popular music.
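The selection rule described above can be sketched as follows (a minimal illustration; the function name and the representation of the MIDI tempo messages as a plain list of BPM values are our assumptions):

```python
def accept_track(estimated_bpm, midi_tempi, tolerance=0.02):
    """Accept an audio tempo estimate if more than half of the MIDI
    tempo change values lie within `tolerance` (2%) of it."""
    if not midi_tempi:
        return False
    close = [t for t in midi_tempi
             if abs(t - estimated_bpm) <= tolerance * estimated_bpm]
    return len(close) > len(midi_tempi) / 2
```

For example, an estimate of 120 BPM would be accepted against MIDI tempi [119, 120, 121, 150] (three of four agree) but rejected against [90, 100, 120].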

2.2 MTG Tempo

The MTG Key dataset was created by Faraldo [8] as a

Proceedings of the 19th ISMIR Conference, Paris, France, September 23-27, 2018
