Source Code: https://github.com/mfcabrera/wtg

As a mandatory interdisciplinary project in my M.Sc. at TUM, we worked with the Research Group for Geometric Optimization and Machine Learning on a system for music genre recognition using K-SVD dictionary learning and an SVM classifier.
Music Genre Recognition using Dictionary Learning
Interdisciplinary Project
Miguel Cabrera, Thomas Pieronczyk
Research Group for Geometric Optimization and Machine Learning
October 25, 2013
Music Information Retrieval (MIR) covers tasks such as:
- Artist, instrument, and chord recognition
- Music annotation (tagging)
- Mood and genre classification
- Music Genre Recognition (MGR), the focus of this project
Why genre recognition is hard:
- High-dimensional data
- No formal definition of a genre
- Highly subjective
- One song can belong to many genres
- New genres are constantly appearing
Project objective: a system that predicts the musical genre of a piece of music. The approach combines Yeh and Yang's dictionary learning framework for Music Genre Recognition (MGR) [1] with the K-SVD algorithm of Aharon et al. [2].
Figure: The dictionary-learning framework for MGR: audio transformation & feature extraction, codebook generation, encoding, and code word aggregation, followed by classifier training (with ground truth) and prediction on test songs (Source: [1]).
Figure: Framework - Audio Feature Extraction (Source: [1]).
Features for musical genre recognition:
- Meta-data features
- Short-time audio features
Figure: Framework - Codebook Generation (Source: [1]).
Given an input signal vector $y \in \mathbb{R}^n$, the sparse representation problem can be formulated as:

$$x^* = \underset{x}{\arg\min}\ \tfrac{1}{2}\|y - Dx\|_2^2 + \lambda\|x\|_1 \qquad (1)$$

(Figure source: [5])
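As a hedged illustration of equation (1), the sketch below solves the l1-regularized problem with scikit-learn's Lasso on random placeholder data; the dictionary D, signal y, and lambda value are arbitrary stand-ins, not the project's actual features or settings.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, K = 64, 256                       # signal dimension, number of atoms
D = rng.standard_normal((n, K))      # placeholder dictionary (columns = atoms)
D /= np.linalg.norm(D, axis=0)       # l2-normalized columns
y = rng.standard_normal(n)           # placeholder input signal

# Solve x* = argmin_x 0.5*||y - Dx||_2^2 + lam*||x||_1  (Eq. 1).
# scikit-learn's Lasso minimizes (1/(2n))||y - Dx||^2 + alpha*||x||_1,
# so alpha = lam / n reproduces the objective above.
lam = 0.1
lasso = Lasso(alpha=lam / n, fit_intercept=False, max_iter=10_000)
x_star = lasso.fit(D, y).coef_
print("non-zero coefficients:", np.count_nonzero(x_star))
```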
The dictionary learning problem:

$$\min_{D,X} \|Y - DX\|_F^2 \quad \text{subject to} \quad \forall i,\ \|x_i\|_0 \le T_0 \qquad (2)$$

K-SVD is a generalization of the K-Means algorithm.
The K-SVD algorithm proceeds in three phases (a sketch follows below):
- Initialization: initialize the dictionary $D \in \mathbb{R}^{n \times K}$.
- Sparse coding step: sparse-code the examples using the current dictionary.
- Codebook update step: update the dictionary atoms to better fit the data.
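Below is a minimal NumPy/scikit-learn sketch of this loop, assuming OMP as the pursuit algorithm and random frames from the data as initial atoms. The function name `ksvd` and all default sizes are illustrative; this is not the project's actual implementation.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def ksvd(Y, K, T0, n_iter=10, seed=0):
    """Minimal K-SVD sketch: Y is an (n x N) data matrix, K the number of atoms,
    T0 the target sparsity per example."""
    rng = np.random.default_rng(seed)
    n, N = Y.shape
    # Initialization: K random training frames as initial atoms, l2-normalized
    D = Y[:, rng.choice(N, size=K, replace=False)].astype(float)
    D /= np.linalg.norm(D, axis=0) + 1e-12
    X = np.zeros((K, N))
    for _ in range(n_iter):
        # Sparse coding step: OMP with at most T0 non-zero coefficients per example
        omp = OrthogonalMatchingPursuit(n_nonzero_coefs=T0)
        X = omp.fit(D, Y).coef_.T                  # (K x N) coefficient matrix
        # Codebook update step: update each atom and its coefficients in turn
        for k in range(K):
            omega = np.flatnonzero(X[k, :])        # examples that use atom k
            if omega.size == 0:
                continue
            X[k, omega] = 0.0                      # remove atom k's contribution
            E_R = Y[:, omega] - D @ X[:, omega]    # restricted representation error
            U, s, Vt = np.linalg.svd(E_R, full_matrices=False)
            D[:, k] = U[:, 0]                      # new atom: first left singular vector
            X[k, omega] = s[0] * Vt[0, :]          # updated coefficients
    return D, X
```

Called as `D, X = ksvd(Y, K=200, T0=1)` on an (n x N) matrix of feature frames.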
Dictionary learning for MGR: dictionaries are trained separately per genre, and the resulting dictionary is the concatenation of all separately trained dictionaries (see the sketch below):

$$D = [D_1, D_2, D_3, \ldots, D_c] \qquad (3)$$
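A sketch of the per-genre training and concatenation in (3), reusing the `ksvd` function sketched above. The genre subset, frame matrices, and dictionary sizes are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
genres = ["blues", "classical", "country"]                 # subset, for illustration
# Placeholder feature frames per genre, shape (n_features x n_frames);
# the real input would be normalized CQT frames of that genre's clips.
frames_by_genre = {g: rng.standard_normal((64, 500)) for g in genres}

# One sub-dictionary per genre, then column-wise concatenation (Eq. 3)
sub_dicts = [ksvd(frames_by_genre[g], K=100, T0=1)[0] for g in genres]
D = np.hstack(sub_dicts)
print(D.shape)   # (64, 300): c * K atoms in total
```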
Figure: Framework - Encoding (Source: [1]).
After codebook generation, the training data is re-encoded with the concatenated dictionary $D = [D_1, D_2, \ldots, D_c]$, resulting in a sparse representation (see the sketch below).
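A sketch of this re-encoding step, assuming OMP as the pursuit algorithm (as in the experiments below). The dictionary and frames are random placeholders standing in for the concatenated dictionary and the CQT frames.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(1)
# D would be the concatenated per-genre dictionary from the previous sketch;
# a random placeholder of the same shape keeps this snippet self-contained.
D = rng.standard_normal((64, 300))
D /= np.linalg.norm(D, axis=0)
Y_frames = rng.standard_normal((64, 1000))             # placeholder feature frames (n x N)

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=1)     # target sparsity 1, as in the experiments
codes = omp.fit(D, Y_frames).coef_.T                   # (c*K x N) sparse codes, one column per frame
```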
Figure: Framework - Aggregation (Source: [1]).
We aggregate the encoded frames into "texture windows" [3]:
- Texture window: the minimum amount of time necessary to identify a particular musical "texture"
- Window size: 3-5 seconds
- Each song is represented as a bag of histograms, i.e. 6 histograms per song
- The histograms inherit the song's genre label

A sketch of this aggregation follows below.
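This is a hedged sketch of the aggregation step. The exact pooling used in the project is not spelled out on the slide; here each texture window is summarized by counting how often each atom is activated, which is one plausible choice.

```python
import numpy as np

def aggregate_histograms(codes, frames_per_window):
    """codes: (n_atoms x n_frames) sparse codes for one song.
    Returns one normalized atom-activation histogram per texture window."""
    n_atoms, n_frames = codes.shape
    n_windows = n_frames // frames_per_window
    hists = []
    for w in range(n_windows):
        block = codes[:, w * frames_per_window:(w + 1) * frames_per_window]
        h = np.count_nonzero(block, axis=1).astype(float)   # atom usage counts in this window
        hists.append(h / max(h.sum(), 1.0))                 # normalize to a histogram
    return np.array(hists)                                  # (n_windows x n_atoms)

# e.g. a 30 s clip with 5 s texture windows yields 6 histograms per song
```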
Classification: SVM with a Histogram Intersection Kernel

We use a Support Vector Machine for the classification step, with the histogram intersection kernel

$$K_{HI}(h_a, h_b) = \sum_{j=1}^{k} \min\bigl(h_a(j), h_b(j)\bigr)$$

- Measures the degree of similarity between two histograms
- Computational cost comparable to a linear SVM
- Works better than linear and other non-linear kernels for histogram features
- Implementation based on the fast intersection-kernel SVM toolbox [4] for the popular LIBSVM for Matlab: http://www.cs.berkeley.edu/~smaji/projects/fiksvm/

A hedged Python sketch of the kernel follows below.
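The sketch below computes the histogram intersection kernel explicitly and feeds it to scikit-learn's SVC in precomputed-kernel mode. The project itself used the fiksvm/LIBSVM Matlab toolbox [4]; the histograms and labels here are placeholders.

```python
import numpy as np
from sklearn.svm import SVC

def hik(A, B):
    """Histogram intersection kernel matrix: K[i, j] = sum_d min(A[i, d], B[j, d])."""
    return np.array([[np.minimum(a, b).sum() for b in B] for a in A])

rng = np.random.default_rng(2)
H_train = rng.random((60, 100))            # placeholder texture-window histograms
y_train = np.repeat(np.arange(10), 6)      # placeholder genre labels
H_test = rng.random((12, 100))

svm = SVC(kernel="precomputed", C=1.0)
svm.fit(hik(H_train, H_train), y_train)            # Gram matrix of training histograms
pred = svm.predict(hik(H_test, H_train))           # kernel between test and training histograms
```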
Experiments I: Data and Features

Data:
- GTZAN dataset: 1000 songs of 30 s length, equally divided into 10 genres: blues, classical, country, disco, hip hop, jazz, metal, pop, reggae, and rock
- One of the most frequently used datasets in MGR
- But: it exhibits several problems, such as replications, mislabelings, and distortions [6]

Features:
- CQT spectrogram (a feature-extraction sketch follows below)
- Features are normalized
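A hedged sketch of the feature extraction using librosa's constant-Q transform. The file path, hop length, bin counts, and per-frame normalization are illustrative assumptions, not the project's exact settings.

```python
import numpy as np
import librosa

# Load a 30 s clip (path is a placeholder for a GTZAN file) and compute a CQT spectrogram.
y, sr = librosa.load("path/to/blues.00000.au", duration=30.0)
C = np.abs(librosa.cqt(y, sr=sr, hop_length=512, n_bins=84, bins_per_octave=12))

# Per-frame feature vectors, normalized (here: unit l2 norm per frame)
frames = C / (np.linalg.norm(C, axis=0, keepdims=True) + 1e-12)   # shape (n_bins, n_frames)
```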
Experiments II: Setup

Dictionary learning:
- Initialization: random and from data
- Dictionary size: 50-400
- Target sparsity: 1-3
- Pursuit algorithm: Orthogonal Matching Pursuit (OMP)

Classification:
- Parameter selection: experimental, using 10-fold cross-validation (see the sketch below)
- Performance measures: accuracy at histogram and clip level
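An illustrative parameter-selection loop with 10-fold cross-validation over dictionary size and target sparsity. `build_histograms` is a hypothetical helper wrapping the earlier feature-extraction, dictionary-learning, and aggregation steps; `hik` is the kernel sketched above. This is not the project's actual experiment script.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def cv_accuracy(H, y, n_splits=10):
    """Mean 10-fold cross-validated accuracy of the HIK-SVM on histograms H with labels y."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    accs = []
    for tr, te in skf.split(H, y):
        svm = SVC(kernel="precomputed", C=1.0)
        svm.fit(hik(H[tr], H[tr]), y[tr])
        accs.append(np.mean(svm.predict(hik(H[te], H[tr])) == y[te]))
    return float(np.mean(accs))

for dict_size in (50, 100, 200, 400):
    for sparsity in (1, 2, 3):
        # build_histograms is a hypothetical helper: features -> per-genre K-SVD -> encoding -> histograms
        H, y = build_histograms(dict_size, sparsity)
        print(dict_size, sparsity, cv_accuracy(H, y))
```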
Two partitioning schemes:

90-10:
- 90% of the data used for dictionary and SVM training
- 10% encoded with the learned dictionary and used as the test set

Full data:
- 100% used for dictionary and SVM training
- Performance evaluated with 10-fold cross-validation
- This is the scheme used in the literature

A sketch of the 90-10 split follows below.
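A small sketch of the 90-10 scheme, splitting at the clip level so that histograms of the same clip never appear on both sides. The stratified split and the GTZAN layout below are assumptions about how the split could be done, not the project's exact script.

```python
import numpy as np
from sklearn.model_selection import train_test_split

clip_ids = np.arange(1000)
labels = np.repeat(np.arange(10), 100)          # 10 genres x 100 clips (GTZAN layout)

# 90-10 split at the clip level; stratify keeps the genre balance in both sets.
train_clips, test_clips = train_test_split(clip_ids, test_size=0.1,
                                           stratify=labels, random_state=0)
# Dictionaries and the SVM are trained only on train_clips; test_clips are encoded
# with the learned dictionary and used exclusively for evaluation.
```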
Figure: Reconstruction error $\|Y - DX\|_F^2$ over the dictionary update iterations $J = 1, 2, \ldots, k$ (4).
Figure: Atom usage counts when encoding blues clips, broken down by genre sub-dictionary (blues, classical, country, disco, hip hop, jazz, metal, pop, reggae, rock); y-axis: atom counts.
Figure: Atom usage counts when encoding rock clips, broken down by genre sub-dictionary; y-axis: atom counts.
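For reference, a small sketch of how such atom-usage counts could be computed from the sparse codes; the function and its arguments are illustrative, not taken from the project code.

```python
import numpy as np

def atom_usage_per_subdictionary(codes, atoms_per_genre, genres):
    """codes: (c*K x n_frames) sparse codes of all frames of one genre's clips.
    Returns, per genre sub-dictionary, how often its atoms are activated."""
    usage = np.count_nonzero(codes, axis=1)                  # per-atom activation counts
    return {g: int(usage[i * atoms_per_genre:(i + 1) * atoms_per_genre].sum())
            for i, g in enumerate(genres)}
```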
Conclusions:
- K-SVD in combination with an SVM using the histogram intersection kernel performs comparably to other state-of-the-art techniques
- A target sparsity of 1 is the best setup for this particular task
- Learning a sub-dictionary for each class enhances the discriminative power of the encoding system
- The technique works better when the dictionary is initialized with frames from the data
References

[1] C.-C. M. Yeh and Y.-H. Yang. Supervised dictionary learning for music genre classification. ACM, 2012. ISBN 978-1-4503-1329-2.
[2] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation. IEEE Transactions on Signal Processing.
[3] G. Tzanetakis and P. Cook. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing.
[4] S. Maji, A. C. Berg, and J. Malik. Fast Intersection / Additive Kernel SVM Toolbox.
[5] Course slides, Information Retrieval in High Dimensional Data, WS 2012/13 (image source).
[6] B. L. Sturm. An Analysis of the GTZAN Music Genre Dataset. Proceedings of the Second International ACM Workshop on Music Information Retrieval with User-centered and Multimodal Strategies.
Appendix: the K-SVD algorithm in detail.

Initialization: set the dictionary matrix $D^{(0)} \in \mathbb{R}^{n \times K}$ with $\ell_2$-normalized columns. Set $J = 1$.
Sparse coding step: use any pursuit algorithm to compute the representation vectors $x_i$ for each example $y_i$ by approximating the solution of

$$\min_{x_i} \|y_i - Dx_i\|_2^2 \quad \text{subject to} \quad \|x_i\|_0 \le T_0, \qquad i = 1, 2, \ldots, N. \qquad (5)$$
Codebook update step: for each column $k = 1, 2, \ldots, K$ of $D^{(J-1)}$, update it as follows:
- Define the group of examples that use this atom: $\omega_k = \{\, i \mid 1 \le i \le N,\ x_T^k(i) \ne 0 \,\}$.
- Compute the overall representation error matrix

$$E_k = Y - \sum_{j \ne k} d_j x_T^j \qquad (6)$$

- Restrict $E_k$ to the columns corresponding to $\omega_k$, obtaining $E_k^R$.
- Apply the SVD decomposition $E_k^R = U \Delta V^T$. Choose the updated dictionary column $d_k$ to be the first column of $U$, and update the coefficient vector $x_R^k$ to be the first column of $V$ multiplied by $\Delta(1,1)$.
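A small numeric check of this update: the rank-1 SVD update is the best rank-1 approximation of the restricted error matrix, so the Frobenius error cannot increase and the new atom stays l2-normalized. The matrix below is a random placeholder.

```python
import numpy as np

rng = np.random.default_rng(3)
E_R = rng.standard_normal((64, 40))              # placeholder restricted error matrix E_k^R

U, s, Vt = np.linalg.svd(E_R, full_matrices=False)
d_k = U[:, 0]                                    # updated atom: first left singular vector
x_R = s[0] * Vt[0, :]                            # updated coefficients for the examples in omega_k

err_before = np.linalg.norm(E_R, "fro")
err_after = np.linalg.norm(E_R - np.outer(d_k, x_R), "fro")
print(err_before, err_after)                     # err_after <= err_before
assert np.isclose(np.linalg.norm(d_k), 1.0)      # the atom remains l2-normalized
```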