Dictionary Learning for Music Genre Recognition

Slide 1

Slide 1 text

Music Genre Recognition using Dictionary Learning
Interdisciplinary Project
Miguel Cabrera, Thomas Pieronczyk
Research Group for Geometric Optimization and Machine Learning
October 25, 2013
Slide 1/36 | Interdisciplinary Project | Music Genre Recognition using Dictionary Learning | July 2013

Slide 2

Slide 2 text

Table of contents
- Introduction
- Framework
- Experiments
- Results
- Conclusion

Slide 3

Slide 3 text

Introduction
Music Information Retrieval (MIR)
- Artist, instrument and chord recognition
- Music annotation (tagging)
- Mood and genre classification
Music Genre Recognition (MGR)

Slide 4

Slide 4 text

MGR Challenges
- High dimensional
- No formal definition
- Highly subjective
- One song → many genres
- New genres appear constantly

Slide 5

Slide 5 text

Objectives
Main objective: a system that predicts the musical genre of a piece of music.
Combination of:
- Yeh and Yang's dictionary learning framework for Music Genre Recognition (MGR) [1]
- the K-SVD algorithm of Aharon et al. [2]

Slide 6

Slide 6 text

Framework
Figure: Framework (Source: [1])
- Training: training songs → audio-signal transformation & feature extraction → codebook generation → encoding → code word encoding aggregation → training
- Testing: test song → audio-signal transformation & feature extraction → encoding → code word encoding aggregation → prediction
- The codebook learned during training is reused for encoding at test time; ground truth is used if the method is supervised.

Slide 7

Slide 7 text

Framework
Figure: Framework - Audio Feature Extraction (Source: [1])

Slide 8

Slide 8 text

Features for musical genre recognition
- Meta-Data Features
- Short-Time Audio Features

Slide 9

Slide 9 text

Short-Time Audio Features - Spectrogram
Figure: Spectrogram - Classical
Figure: Spectrogram - Rock

Slide 10

Slide 10 text

Short-Time Audio Features - Constant Q Transformation
Figure: Constant Q Transform - Classical
Figure: Constant Q Transform - Rock

Slide 11

Slide 11 text

Framework
Figure: Framework - Codebook Generation (Source: [1])

Slide 12

Slide 12 text

Dictionary Learning
Given an input signal vector $y \in \mathbb{R}^n$, the sparse representation problem can be formulated mathematically as:
$$x^* = \arg\min_x \frac{1}{2} \|y - Dx\|_2^2 + \lambda \|x\|_1 \qquad (1)$$
Figure (Source: [5])
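As a rough illustration of problem (1), the sketch below solves it with ISTA (iterative soft-thresholding). ISTA does not appear in the slides, which use OMP for the sparsity-constrained variant; it is swapped in here only because it targets the l1-penalized objective directly. The function name `ista`, the iteration count, and the toy identity dictionary are assumptions made for this example.

```python
import numpy as np

def ista(D, y, lam, n_iter=200):
    """Approximate argmin_x 0.5 * ||y - D x||_2^2 + lam * ||x||_1 with ISTA."""
    # Lipschitz constant of the gradient of the quadratic term.
    L = np.linalg.norm(D, 2) ** 2
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ x - y)          # gradient of 0.5 * ||y - D x||^2
        z = x - grad / L                  # gradient step
        # Soft-thresholding: proximal operator of (lam / L) * ||.||_1
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return x

# Toy example: with an identity dictionary the solution is the
# soft-thresholded input, so small entries are set exactly to zero.
D_toy = np.eye(3)
y_toy = np.array([3.0, -0.5, 1.0])
x_star = ista(D_toy, y_toy, lam=1.0)
```

With the identity dictionary, entries of `y_toy` whose magnitude does not exceed `lam` are zeroed, which shows how the l1 penalty produces sparsity.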

Slide 13

Slide 13 text

K-SVD Dictionary Learning Algorithm
$$\min_{D,X} \|Y - DX\|_F^2 \quad \text{subject to} \quad \forall i,\ \|x_i\|_0 \le T_0 \qquad (2)$$
⇓
K-SVD is a generalization of the K-Means algorithm.

Slide 14

Slide 14 text

K-SVD - Algorithm
- Initialization: initialization of the dictionary $D \in \mathbb{R}^{n \times K}$
- Sparse Coding Step: sparse coding of the examples based on the current dictionary
- Codebook Update Step: updating the dictionary atoms to better fit the data

Slide 15

Slide 15 text

Dictionary Learning for MGR
- Dictionaries are trained separately per genre.
- The resulting dictionary is the concatenation of all the separately trained dictionaries:
$$D = [D_1, D_2, D_3, \ldots, D_c] \qquad (3)$$
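The concatenation in (3) can be sketched in a few lines. In this illustration the per-genre sub-dictionaries are random placeholders rather than K-SVD results, and the names `random_unit_dictionary` and `atom_genre`, the three genres, and the sizes are assumptions chosen only to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, atoms_per_genre = 8, 5
genres = ["blues", "classical", "rock"]

def random_unit_dictionary(n, k, rng):
    """Placeholder for a trained sub-dictionary: k random atoms with unit l2 norm."""
    D = rng.standard_normal((n, k))
    return D / np.linalg.norm(D, axis=0, keepdims=True)

# One sub-dictionary D_g per genre (in practice each would be trained on that genre).
sub_dicts = [random_unit_dictionary(n_features, atoms_per_genre, rng) for _ in genres]

# D = [D_1, D_2, ..., D_c]: stack the atoms column-wise.
D = np.hstack(sub_dicts)

# Remember which genre every atom came from; useful when inspecting atom usage.
atom_genre = np.repeat(np.arange(len(genres)), atoms_per_genre)
```

Keeping an index such as `atom_genre` next to the dictionary makes it possible to count, per genre, how often a song activates atoms from each sub-dictionary.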

Slide 17

Slide 17 text

Framework
Figure: Framework - Encoding (Source: [1])

Slide 18

Slide 18 text

Encoding
After codebook generation, the training data is re-encoded with the concatenated dictionary $D = [D_1, D_2, D_3, \ldots, D_c]$, resulting in a sparse representation.

Slide 19

Slide 19 text

Framework
Figure: Framework - Aggregation (Source: [1])

Slide 20

Slide 20 text

Histogram Aggregation
- We aggregate the encoded frames into "texture windows" [3].
- Texture window: the minimum amount of time necessary to identify a particular music "texture".
- Window size: 3-5 seconds
- Each song is represented as a bag of histograms, i.e. 6 histograms per song.
- Each histogram inherits the label of its song.
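One possible reading of this aggregation step is sketched below. The slides do not say how the histograms are computed, so the example assumes that a histogram counts how often each atom is active (non-zero) within a window and is then normalized to sum to one. The function name `texture_window_histograms` and the parameter `frames_per_window` are hypothetical.

```python
import numpy as np

def texture_window_histograms(X, frames_per_window):
    """Aggregate sparse codes X (n_atoms x n_frames) into one histogram per window.

    Assumption: a histogram counts, per atom, the frames in which the atom
    has a non-zero coefficient, normalized to sum to 1.
    """
    n_atoms, n_frames = X.shape
    n_windows = n_frames // frames_per_window   # an incomplete tail is dropped
    hists = np.zeros((n_windows, n_atoms))
    for w in range(n_windows):
        block = X[:, w * frames_per_window:(w + 1) * frames_per_window]
        counts = np.count_nonzero(block, axis=1).astype(float)
        total = counts.sum()
        hists[w] = counts / total if total > 0 else counts
    return hists

# Toy codes: 3 atoms, 6 frames, two windows of 3 frames each.
X_codes = np.array([[1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
                    [0.0, 2.0, 3.0, 0.0, 0.0, 0.0],
                    [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]])
H = texture_window_histograms(X_codes, frames_per_window=3)
```

Each row of `H` would then be treated as one training example carrying the genre label of the song it came from.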

Slide 25

Slide 25 text

Framework
Figure: Framework - Aggregation (Source: [1])

Slide 26

Slide 26 text

SVM with Histogram Intersection Kernel
- We use a Support Vector Machine (SVM) for the classification step.
- Histogram Intersection Kernel:
$$K_{HI}(h_a, h_b) = \sum_{j=1}^{k} \min(h_a(j), h_b(j))$$
- Measures the degree of similarity between two histograms.
- Computational cost comparable to that of a linear SVM.
- Works better than linear and non-linear SVMs for histogram features.
- Implementation (footnote 1) based on the popular LIBSVM for Matlab.
Footnote 1: http://www.cs.berkeley.edu/~smaji/projects/fiksvm/
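The kernel formula is short enough to sketch directly. The snippet below is a plain NumPy illustration, not the Matlab toolbox used in the project; the function name `histogram_intersection_kernel` and the two toy histograms are assumptions.

```python
import numpy as np

def histogram_intersection_kernel(A, B):
    """Gram matrix with K[a, b] = sum_j min(A[a, j], B[b, j])."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Broadcast to shape (n_a, n_b, k), take bin-wise minima, sum over the bins.
    return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)

# Two normalized toy histograms over k = 3 bins.
hists = np.array([[0.5, 0.5, 0.0],
                  [0.2, 0.3, 0.5]])
K = histogram_intersection_kernel(hists, hists)
```

For normalized histograms the kernel value lies between 0 (no overlap) and 1 (identical histograms). A Gram matrix like `K` could be handed to any SVM implementation that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`.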

Slide 32

Slide 32 text

Experiments

Slide 33

Slide 33 text

Experiments Summary I
Data
- GTZAN dataset comprising 1000 songs of 30 seconds each, equally divided into 10 genres: blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, and rock.
- One of the most frequently used datasets in MGR.
- But: it exhibits several problems, such as replications, mislabelings, and distortions [6].
Features
- CQT
- Spectrogram
- Features are normalized

Slide 41

Slide 41 text

Experiments Summary II
Dictionary Learning
- Initialization: random and from data
- Dictionary Size: 50-400
- Target Sparsity: 1-3
- Pursuit Algorithm: Orthogonal Matching Pursuit (OMP)
Classification
- Parameter Selection: experimentally, using 10-fold cross-validation
- Performance measures: accuracy at histogram and clip level

Slide 49

Slide 49 text

Data Partitioning
Two partitioning schemes:
90-10
- 90%: used for dictionary and SVM training
- 10%: encoded with the learned dictionary and used as the test set
Full data
- 100%: used for dictionary and SVM training
- Performance evaluation with 10-fold cross-validation
- This scheme is the one used in the literature
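The 10-fold cross-validation used in both schemes can be sketched as follows. The slides do not describe how the folds were drawn, so the shuffling, the seed, and the helper name `k_fold_indices` are assumptions; the sketch also splits at the song level so that all histograms of one song would land in the same fold.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)        # shuffle the song indices once
    folds = np.array_split(perm, n_folds)    # n_folds nearly equal parts
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, test_idx

# 1000 songs, as in GTZAN: every fold trains on 900 songs and tests on 100.
splits = list(k_fold_indices(1000, n_folds=10))
```

Taking only the first pair from `splits` gives a single 90-10 partition, while averaging a score over all ten pairs corresponds to the cross-validated evaluation.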

Slide 57

Slide 57 text

Results

Slide 58

Slide 58 text

Results - Dictionary Update Iterations
$$\|Y - DX\|_F^2 \quad \text{for } J = 1, 2, \ldots, k \qquad (4)$$

Slide 59

Slide 59 text

Encoding - Atom Usage: Blues
Figure: Atom usage counts (x-axis: Dictionaries, one per genre: blues, classical, country, disco, hip hop, jazz, metal, pop, reggae, rock; y-axis: Atom counts)

Slide 60

Slide 60 text

Encoding - Atom Usage: Rock
Figure: Atom usage counts (x-axis: Dictionaries, one per genre: blues, classical, country, disco, hip hop, jazz, metal, pop, reggae, rock; y-axis: Atom counts)

Slide 61

Slide 61 text

Results - Histogram Level
Table: Results summary using 90-10% split and normalized spectrogram

Dictionary size   Target sparsity   Cross-validation performance   Test set performance
500               1                 75.96                          62.83
1000              1                 80.20                          63.40
2000              1                 81.69                          65.67
3000              1                 84.40                          60.00
4000              1                 77.61                          65.16
500               2                 65.10                          53.66
1000              2                 68.88                          60.00
2000              2                 74.55                          61.00
3000              2                 76.29                          62.3
4000              2                 76.29                          62.3
500               3                 65.16                          54.87
1000              3                 54.87                          56.16
2000              3                 71.81                          58.83
3000              3                 72.10                          59.0
4000              3                 74.22                          59.3

Slide 62

Slide 62 text

Results - Clip Level
Table: Results using full data and cross-validation with normalized spectrogram

Dictionary size   Target sparsity   Cross-validation performance   Cross-validation performance, clip level
500               1                 75.02                          79.53
1000              1                 78.05                          81.50
2000              1                 82.11                          84.26
3000              1                 83.40                          85.10
4000              1                 83.23                          85.12
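The slides report clip-level figures but do not state how the per-histogram predictions of a clip are combined. A common choice, assumed here purely for illustration, is a majority vote over the histograms of the clip; the function name `clip_level_prediction` and the example labels are hypothetical.

```python
from collections import Counter

def clip_level_prediction(histogram_predictions):
    """Return the genre predicted most often among a clip's histograms.

    Assumption: majority vote; on a tie the label seen first wins.
    """
    return Counter(histogram_predictions).most_common(1)[0][0]

# Six histogram-level predictions for one clip.
votes = ["rock", "rock", "blues", "rock", "metal", "blues"]
clip_label = clip_level_prediction(votes)
```

Under such a scheme a clip can be classified correctly even when some of its histograms are not, which would be consistent with the clip-level column being higher than the histogram-level one.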

Slide 63

Slide 63 text

Results - Dictionary Size vs Performance
Figure: Performance with different dictionary sizes (x-axis: Dictionary Size, 500 to 4000; y-axis: Performance, 74 to 86; curves: frame-level performance and clip-level performance)

Slide 64

Slide 64 text

Results - Confusion Matrix
Figure: Confusion matrix from the test runs

            blues  classical  country  disco  hiphop  jazz  metal  pop  reggae  rock
blues       8      0          0        1      0       0     0      0    0       1
classical   0      10         0        0      0       0     0      0    0       0
country     0      0          10       0      0       0     0      0    0       1
disco       0      0          0        8      0       0     1      2    0       0
hiphop      0      0          0        0      10      0     0      0    0       0
jazz        0      0          0        0      0       10    0      0    0       0
metal       0      0          0        0      0       0     8      0    0       0
pop         0      0          0        1      0       0     0      8    1       0
reggae      2      0          0        0      0       0     0      0    9       1
rock        0      0          0        0      0       0     1      0    0       7

Slide 65

Slide 65 text

Results - State-of-the-art accuracies
Figure: State-of-the-art accuracies (x-axis: Accuracy (%), 0 to 100; methods compared: Tzanetakis et al. [TC02b], Panagakis et al. [PBK08], Yeh et al. [YY12], K-SVD + Histogram SVM, Panagakis et al. [PKIA09], Chang et al. [CsRJI10])

Slide 66

Slide 66 text

Conclusion
- K-SVD in combination with an SVM using the histogram intersection kernel (HIK) performs comparably to other state-of-the-art techniques.
- A target sparsity of 1 is the best set-up for this particular task.
- Learning separate sub-dictionaries, one per class, enhances the discriminative power of the encoding system.
- The technique works better when the dictionary is initialized with frames from the data.

Slide 67

Slide 67 text

Thank you.

Slide 68

Slide 68 text

Sources I
[1] Yeh, Chin-Chia Michael and Yang, Yi-Hsuan. Supervised dictionary learning for music genre classification. ACM, 2012. ISBN: 978-1-4503-1329-2
[2] Aharon, M., Elad, M. and Bruckstein, A. K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation. IEEE Transactions on Signal Processing.
[3] Tzanetakis, G. and Cook, P. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing.

Slide 69

Slide 69 text

Sources II
[4] Maji, Subhransu, Berg, Alexander C. and Malik, Jitendra. Fast Intersection / Additive Kernel SVM Toolbox.
[5] Course slides: Information retrieval in high dimensional data, WS1213 (image).
[6] Sturm, Bob L. An Analysis of the GTZAN Music Genre Dataset. Proceedings of the Second International ACM Workshop on Music Information Retrieval with User-centered and Multimodal Strategies.

Slide 70

Slide 70 text

Backup Slides

Slide 71

Slide 71 text

Graphical User Interface

Slide 72

Slide 72 text

K-SVD - Initialization Phase
Initialization: set the dictionary matrix $D^{(0)} \in \mathbb{R}^{n \times K}$ with $\ell_2$-normalized columns. Set $J = 1$.

Slide 73

Slide 73 text

K-SVD - Sparse Coding Step
Sparse Coding Step: use any pursuit algorithm to compute the representation vectors $x_i$ for each example $y_i$, by approximating the solution of
$$\min_{x_i} \|y_i - D x_i\|_2^2 \quad \text{subject to} \quad \|x_i\|_0 \le T_0, \qquad i = 1, 2, \ldots, N \qquad (5)$$
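Since the experiments name Orthogonal Matching Pursuit as the pursuit algorithm, a minimal OMP sketch for problem (5) may help. It is a textbook-style illustration rather than the project's implementation; the function name `omp`, the stopping tolerance, and the toy identity dictionary are assumptions, and the dictionary columns are assumed to have unit norm.

```python
import numpy as np

def omp(D, y, T0):
    """Greedy approximation of min ||y - D x||_2^2 subject to ||x||_0 <= T0.

    Assumes the columns (atoms) of D have unit l2 norm.
    """
    y = np.asarray(y, dtype=float)
    x = np.zeros(D.shape[1])
    support = []                 # indices of the atoms selected so far
    coeffs = np.zeros(0)
    residual = y.copy()
    for _ in range(T0):
        correlations = np.abs(D.T @ residual)
        if support:
            correlations[support] = 0.0      # never select an atom twice
        k = int(np.argmax(correlations))
        if correlations[k] < 1e-12:          # residual already explained
            break
        support.append(k)
        # Least-squares fit of y on all selected atoms (the "orthogonal" step).
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coeffs
    if support:
        x[support] = coeffs
    return x

# Toy example with an identity dictionary: OMP keeps the largest entries.
D_eye = np.eye(3)
y_sig = np.array([0.0, 3.0, 1.0])
x_t1 = omp(D_eye, y_sig, T0=1)
x_t2 = omp(D_eye, y_sig, T0=2)
```

With `T0=1` each frame is represented by a single atom, which is the setting the conclusion reports as working best for this task.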

Slide 74

Slide 74 text

K-SVD - Codebook Update Step
Codebook Update Step: for each column $k = 1, 2, \ldots, K$ in $D^{(J-1)}$, update it as follows:
- Define the group of examples that use this atom, $\omega_k = \{\, i \mid 1 \le i \le N,\ x_T^k(i) \neq 0 \,\}$.
- Compute the overall representation error matrix $E_k$ by
$$E_k = Y - \sum_{j \neq k} d_j x_T^j \qquad (6)$$
- Restrict $E_k$ by choosing only the columns corresponding to $\omega_k$, and obtain $E_k^R$.
- Apply the SVD decomposition $E_k^R = U \Delta V^T$. Choose the updated dictionary column $d_k$ to be the first column of $U$. Update the coefficient vector $x_R^k$ to be the first column of $V$ multiplied by $\Delta(1, 1)$.
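The update of a single atom can be sketched with NumPy as below. This is an illustrative reading of the steps on this slide, not the project's code; the function name `ksvd_update_atom` and the random toy data are assumptions.

```python
import numpy as np

def ksvd_update_atom(Y, D, X, k):
    """Update atom k of D and row k of X in place (one K-SVD codebook update)."""
    omega = np.flatnonzero(X[k, :])       # omega_k: examples that use atom k
    if omega.size == 0:
        return                            # unused atom: left unchanged here
    X[k, omega] = 0.0                     # remove the contribution of atom k
    # Restricted error matrix E_k^R: error without atom k, columns in omega_k only.
    E_R = Y[:, omega] - D @ X[:, omega]
    U, S, Vt = np.linalg.svd(E_R, full_matrices=False)
    D[:, k] = U[:, 0]                     # new atom: first column of U
    X[k, omega] = S[0] * Vt[0, :]         # first column of V times Delta(1, 1)

# Toy data: 20 examples of dimension 6, 4 atoms, sparse random codes.
rng = np.random.default_rng(0)
Y = rng.standard_normal((6, 20))
D = rng.standard_normal((6, 4))
D /= np.linalg.norm(D, axis=0, keepdims=True)
X = rng.standard_normal((4, 20)) * (rng.random((4, 20)) < 0.3)

err_before = np.linalg.norm(Y - D @ X)
ksvd_update_atom(Y, D, X, k=0)
err_after = np.linalg.norm(Y - D @ X)
```

Because the best rank-one approximation of the restricted error replaces the old atom and its coefficients, the reconstruction error cannot increase, and restricting the update to $\omega_k$ keeps the sparsity pattern of the codes intact.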

Slide 75

Slide 75 text

K-SVD - Update Iteration Step
Increase Iteration Step: set $J = J + 1$.