Slide 1

Slide 1 text

Matrix and Tensor Factorization for Machine Learning, Special Topics in Mechano-Informatics II @ Tokyo University, 2024. Kazu Ghalamkari, RIKEN AIP, @KazuGhalamkari

Slide 2

Slide 2 text

Announcements. Please click the raise-hand button if you can see the slides and hear me clearly. Please refrain from redistributing the slides. Feel free to ask questions in the chat window during the lecture. The assignment is given in the final section. Some parts of the slides will be skipped due to time limitations.

Slide 3

Slide 3 text

Overview ■ Introduction: Why do we decompose data? ■ A quick review of matrix rank ■ Singular value decomposition (SVD) and low-rank approximation ■ Kernel subspace method and its applications (denoising, anomaly detection) ■ Non-negative matrix factorization ■ Tensor low-rank decomposition and many-body approximation

Slide 4

Slide 4 text

Diverse real-world data ■ Purchase history ■ Tabular data, e.g., the Iris dataset:
Sepal Length [cm] | Sepal Width [cm] | Petal Length [cm] | Petal Width [cm] | Species
5.1 | 3.5 | 1.4 | 0.2 | setosa
7 | 3.2 | 4.7 | 1.4 | versicolor
6.4 | 3.2 | 4.5 | 1.5 | versicolor
4.7 | 3.2 | 1.3 | 0.2 | setosa
4.6 | 3.1 | 1.5 | 0.2 | setosa
6.5 | 2.8 | 4.6 | 1.5 | versicolor
6.6 | 2.9 | 4.6 | 1.3 | versicolor
4.9 | 3 | 1.4 | 0.2 | setosa
5.2 | 2.7 | 3.9 | 1.4 | versicolor
■ Grayscale image ■ Spectrum information. Decomposing the data as a matrix is beneficial. Image from https://www.mathworks.com/help/images/image-types-in-the-toolbox_ja_JP.html; image from https://sigview.com/help/Time-FFTSpectrogram.html; image from Mithy, S. A., et al. "Classification of Iris Flower Dataset using Different Algorithms." Int. J. Sci. Res. In (2022).

Slide 5

Slide 5 text

Learning: gaining insights from data decomposition by extracting recurring patterns from the data.

Slide 6

Slide 6 text

Learning: gaining insights from data decomposition; here the purchase data decompose into a Beverages pattern and a Snacks pattern.

Slide 7

Slide 7 text

Inference: using the obtained knowledge, e.g., recommending items to a user based on the extracted Beverages and Snacks patterns.

Slide 8

Slide 8 text

What is a good decomposition? ・Is the decomposed representation interpretable? ・Is the decomposition scalable to large data? ・Is the decomposed representation unique? ・Can the decomposition be done even with missing values? There are many choices: choose an appropriate decomposition method by considering such real-world constraints.

Slide 9

Slide 9 text

Overview ■ Introduction: Why do we decompose data? ■ A quick review of matrix rank ■ Singular value decomposition (SVD) and low-rank approximation ■ Kernel subspace method and its applications (denoising, anomaly detection) ■ Non-negative matrix factorization ■ Tensor low-rank decomposition and many-body approximation

Slide 10

Slide 10 text

Rank in linear algebra. Each column of this matrix is a constant multiple of a single vector; we call such a matrix a rank-1 matrix. (Each column is a vector in a 6-dimensional vector space.)

Slide 11

Slide 11 text

Rank in linear algebra. Each row of this matrix is a constant multiple of a single vector; we call such a matrix a rank-1 matrix. (Each row is a vector in a 6-dimensional vector space.)

Slide 12

Slide 12 text

Rank in linear algebra 14

Slide 13

Slide 13 text

Rank in linear algebra. Each column of this matrix can be written as a linear combination of two basis vectors. A rank-2 matrix is a matrix whose columns are linear combinations of two linearly independent vectors: each column vector lies in a 2-dimensional plane (the plane spanned by the basis) inside the 6-dimensional vector space.

Slide 14

Slide 14 text

Rank in linear algebra. Def.: When every column vector of A can be written as a linear combination of r linearly independent vectors (a basis), the rank of matrix A is the minimum such integer r. A matrix whose rank is r is called a rank-r matrix.
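To make the definition concrete, here is a minimal NumPy sketch (the matrices are illustrative, not the ones on the slide): a matrix built from one outer product has rank 1, and one built from two linearly independent outer products has rank 2.

```python
import numpy as np

u = np.array([1., 2., 3., 4., 5., 6.])       # 6-dimensional basis vector
v = np.array([2., 0., 1.])
A1 = np.outer(u, v)                           # every column is a multiple of u
print(np.linalg.matrix_rank(A1))              # -> 1

w = np.array([0., 1., 0., 1., 0., 1.])        # linearly independent of u
A2 = np.outer(u, v) + np.outer(w, np.array([1., 1., 0.]))
print(np.linalg.matrix_rank(A2))              # -> 2
```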

Slide 15

Slide 15 text

Properties of the matrix rank: basis, rank, and rank properties.

Slide 16

Slide 16 text

Properties of the matrix rank. A is full rank when its rank equals min(n, m) for an n×m matrix; A is rank deficient if its rank is smaller than min(n, m).

Slide 17

Slide 17 text

Summary: matrix rank. Def.: When every column vector of A can be written as a linear combination of r linearly independent basis vectors, the rank of matrix A is the minimum such integer r.

Slide 18

Slide 18 text

Overview ■ Introduction: Why do we decompose data? ■ A quick review of matrix rank ■ Singular value decomposition (SVD) and low-rank approximation ■ Kernel subspace method and its applications (denoising, anomaly detection) ■ Non-negative matrix factorization ■ Tensor low-rank decomposition and many-body approximation

Slide 19

Slide 19 text

Singular value decomposition (SVD). So far, the basis could be any set of linearly independent vectors; SVD additionally imposes orthonormality on the basis.

Slide 20

Slide 20 text

Singular value decomposition (SVD). SVD imposes orthonormality on the basis and writes the matrix as a sum of rank-1 factors ordered from the most significant to the least significant: each singular value is the weight (importance) of the corresponding rank-1 matrix, separating important terms from unimportant terms.

Slide 21

Slide 21 text

Singular value decomposition (SVD). SVD imposes orthonormality on the basis; each singular value is the weight (importance) of the corresponding rank-1 matrix.
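In standard notation, the decomposition described here is a weighted sum of rank-1 matrices, with the singular values as the (non-negative, sorted) weights and the columns of U and V orthonormal:

```latex
A = U \Sigma V^\top = \sum_{i=1}^{r} \sigma_i \, \boldsymbol{u}_i \boldsymbol{v}_i^\top,
\qquad \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0 .
```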

Slide 22

Slide 22 text

Singular value decomposition (SVD). ■ Any matrix A, whether complex, singular, or rectangular, can be decomposed into the product of U, a diagonal matrix Σ, and V.

Slide 23

Slide 23 text

Singular value decomposition (SVD). ■ Any matrix A, whether complex, singular, or rectangular, can be decomposed into the product of U, a diagonal matrix Σ, and V.

Slide 24

Slide 24 text

Singular value decomposition (SVD). ■ When A is a rank-r matrix, the singular values beyond the r-th are 0, so the corresponding terms are unnecessary.

Slide 25

Slide 25 text

Singular value decomposition (SVD): rank and singular values. ■ When A is a rank-r matrix, the singular values beyond the r-th are 0 and those terms are unnecessary: the rank of matrix A is the number of non-zero singular values of A.

Slide 26

Slide 26 text

Low-rank approximation by SVD. ■ Decompose the rank-r matrix A into orthogonal U and V and the diagonal matrix Σ; each singular value is the weight of the corresponding rank-1 matrix. The terms with small singular values are not as important, so let's set them to 0 (≒ 0) and ignore them.

Slide 27

Slide 27 text

Low-rank approximation by SVD. ■ Decompose the rank-r matrix A into orthogonal U and V and the diagonal matrix Σ; each singular value is the weight of the corresponding rank-1 matrix. The terms with small singular values are not as important, so we set them to 0 and ignore them. Approximating a matrix by a matrix of smaller rank is called low-rank approximation.

Slide 28

Slide 28 text

The best low-rank approximation by SVD: Eckart-Young theorem (1936). ■ SVD provides the best low-rank approximation minimizing the Frobenius norm: the rank-k truncated SVD is the best rank-k approximation of A, i.e., it minimizes the Frobenius norm of the difference, which evaluates how close the two matrices are.

Slide 29

Slide 29 text

The best low-rank approximation by SVD: Eckart-Young-Mirsky theorem (1960). ■ The truncated SVD is the best rank-r approximation of A under any unitarily invariant norm, which evaluates how similar the two matrices are. (* A unitarily invariant norm ||·||* satisfies ||P||* = ||XPY||* for any matrix P and any unitary matrices X and Y.)

Slide 30

Slide 30 text

Low-rank approximation of matrices saves memory. Example: an n×m matrix requires n·m stored values before the approximation; after a rank-k approximation, only the two factors with (n+m)·k values are needed, which is far less when k is sufficiently smaller than n and m. Low-rank approximation reduces the required memory storage.

Slide 31

Slide 31 text

Singular values and eigenvalues. The i-th singular value σi of matrix A is the square root of the i-th eigenvalue λi of matrix AA⊤.

Slide 32

Slide 32 text

Singular values and eigenvalues. Each column vector of U and V is orthonormal; substituting the SVD into AA⊤ shows that the i-th singular value σi of matrix A is the square root of the i-th eigenvalue λi of AA⊤.

Slide 33

Slide 33 text

Singular values and eigenvalues. Each column vector of U and V is orthonormal; substituting the SVD into AA⊤ shows that the i-th singular value σi of matrix A is the square root of the i-th eigenvalue λi of AA⊤.

Slide 34

Slide 34 text

Singular values and eigenvalues. Each column vector of U and V is orthonormal; substituting the SVD into AA⊤ shows that the i-th singular value σi of matrix A is the square root of the i-th eigenvalue λi of AA⊤.

Slide 35

Slide 35 text

The ๐‘–-th singular value ๐œŽ๐‘– of matrix ๐€ is the square root of the ๐‘–-th eigenvalue ฮป๐‘– of matrix ๐€๐€โŠค Eigenvalue of Eigenvector of Singular values and eigenvalues Each column vector is orthonormal Put into ๐‘™ -th column of U ๐‘™ -th column of V (๐‘™, ๐‘™)-entry of โˆ‘

Slide 36

Slide 36 text

Singular values and eigenvalues. Substituting the SVD into AA⊤ and A⊤A shows that the l-th column of U is an eigenvector of AA⊤ and the l-th column of V is an eigenvector of A⊤A, each with eigenvalue equal to the square of the (l, l)-entry of Σ: the i-th singular value σi of A is the square root of the i-th eigenvalue λi of AA⊤.

Slide 37

Slide 37 text

Singular values and eigenvalues (summary). The SVD of A is an eigenvalue decomposition of AA⊤ and A⊤A: each column vector of U (left singular vector) is an eigenvector of AA⊤, each column vector of V (right singular vector) is an eigenvector of A⊤A, the i-th singular value σi of A is the square root of the i-th eigenvalue λi of AA⊤, and the rank of matrix A equals the number of non-zero singular values of A.
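A minimal NumPy check of this relationship on a small random matrix (the matrix itself is illustrative):

```python
import numpy as np

A = np.random.default_rng(0).standard_normal((5, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

lam_AAt = np.linalg.eigvalsh(A @ A.T)    # eigenvalues of A A^T, ascending
lam_AtA = np.linalg.eigvalsh(A.T @ A)    # eigenvalues of A^T A, ascending

print(np.allclose(np.sort(s**2), lam_AAt[-3:]))  # sigma_i^2 = nonzero eigenvalues of A A^T
print(np.allclose(np.sort(s**2), lam_AtA))       # sigma_i^2 = eigenvalues of A^T A
print(np.allclose((A @ A.T) @ U[:, 0], (s[0]**2) * U[:, 0]))  # u_1 is an eigenvector of A A^T
```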

Slide 38

Slide 38 text

SVD examples in Python 42

Slide 39

Slide 39 text

SVD example in Python 43

Slide 40

Slide 40 text

SVD example in Python The reconstruction equals the original matrix. 44
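The code from this slide is not included in the transcript; a minimal sketch of what it likely demonstrates (the matrix values are illustrative):

```python
import numpy as np

A = np.array([[3., 1., 4.],
              [1., 5., 9.],
              [2., 6., 5.],
              [3., 5., 8.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt
A_rec = U @ np.diag(s) @ Vt

print(np.allclose(A, A_rec))   # True: the reconstruction equals the original matrix
```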

Slide 41

Slide 41 text

SVD example in Python The orthonormality of the basis was confirmed. 45
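Again as a sketch of the missing code, the orthonormality check likely looks like this:

```python
import numpy as np

A = np.random.default_rng(0).standard_normal((4, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(np.allclose(U.T @ U, np.eye(3)))    # columns of U are orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(3)))  # rows of Vt (columns of V) are orthonormal
```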

Slide 42

Slide 42 text

Example of low-rank approximation by SVD in Python. A smaller rank k worsens the approximation performance.
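A minimal sketch of this experiment, assuming a synthetic low-rank-plus-noise matrix in place of the slide's data; it shows how the error grows as k shrinks while the storage ratio (n+m)k/(nm) drops:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 2000, 1500
A = rng.standard_normal((n, 20)) @ rng.standard_normal((20, m))  # low-rank part
A += 0.1 * rng.standard_normal((n, m))                           # plus noise

U, s, Vt = np.linalg.svd(A, full_matrices=False)
for k in (5, 20, 100):
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k truncated SVD
    err = np.linalg.norm(A - A_k) / np.linalg.norm(A)  # relative Frobenius error
    storage = (n + m) * k / (n * m)                    # memory ratio vs. storing A
    print(f"k={k:3d}  error={err:.3f}  storage={storage:.2%}")
```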

Slide 43

Slide 43 text

Image reconstruction by SVD: how large a rank is needed for reconstruction? For a 2000×1500 image, a rank-k approximation stores (2000+1500)×k values: k=5 gives 0.57% of the original storage, k=20 gives 2.33%, and k=100 gives 11.67%. Image from Steven L. Brunton, J. Nathan Kutz, "Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control".

Slide 44

Slide 44 text

Hyperparameter tuning in SVD. A larger rank (e.g., k=100, 11.67% storage) is accurate but needs more memory and is slow; a smaller rank (e.g., k=5, 0.57% storage) is rough but memory-saving and fast. Selecting the appropriate rank is a trade-off problem that requires repeated trial and error: a typical hyperparameter tuning problem. Image from Steven L. Brunton, J. Nathan Kutz, "Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control".

Slide 45

Slide 45 text

■ Low-rank approximation can approximate real-world data well. Most real-world data have an approximately low-rank structure: real data ≈ low-rank matrix + noise, so low-rank approximation is good enough for many datasets.

Slide 46

Slide 46 text

Summary: low-rank approximation and SVD. ■ SVD can be applied to any matrix; the singular vectors in U and V are orthonormal. ■ SVD finds the best low-rank matrix minimizing the Frobenius norm: the truncated SVD provides the best rank-k approximation (Eckart-Young theorem). ■ Low-rank approximation reduces data memory requirements.

Slide 47

Slide 47 text

Overview ■ Introduction: Why do we decompose data? ■ A quick review of matrix rank ■ Singular value decomposition (SVD) and low-rank approximation ■ Kernel subspace method and its applications (denoising, anomaly detection) ■ Non-negative matrix factorization ■ Tensor low-rank decomposition and many-body approximation

Slide 48

Slide 48 text

Classification of Iris data. ■ Iris dataset [1]: we estimate the species of iris data points based on their sepal and petal measurements.
Sepal Length [cm] | Sepal Width [cm] | Petal Length [cm] | Petal Width [cm] | Species
5.1 | 3.5 | 1.4 | 0.2 | setosa
7 | 3.2 | 4.7 | 1.4 | versicolor
6.4 | 3.2 | 4.5 | 1.5 | versicolor
4.7 | 3.2 | 1.3 | 0.2 | setosa
4.6 | 3.1 | 1.5 | 0.2 | setosa
6.5 | 2.8 | 4.6 | 1.5 | versicolor
6.3 | 3.3 | 4.7 | 1.6 | versicolor
6.6 | 2.9 | 4.6 | 1.3 | versicolor
4.9 | 3 | 1.4 | 0.2 | setosa
5.2 | 2.7 | 3.9 | 1.4 | versicolor
5.9 | 3 | 4.2 | 1.5 | ???
5.6 | 3 | 4.5 | 1.5 | ???
4.7 | 3.2 | 1.6 | 0.2 | ???
…
[1] Fisher, Ronald A. "The use of multiple measurements in taxonomic problems." Annals of Eugenics 7.2 (1936): 179-188. Image from Mithy, S. A., et al. "Classification of Iris Flower Dataset using Different Algorithms." Int. J. Sci. Res. In (2022).

Slide 49

Slide 49 text

Classification by the CLAFIC method. Training data: samples of class A (setosa) and class B (versicolor) from the Iris table above. Classification: estimate the class of each unlabeled sample.

Slide 50

Slide 50 text

Classification by the CLAFIC method. (Scatter plot of the training data with axes Sepal Length [cm], Sepal Width [cm], and Petal Width [cm].) Samples of class A (setosa) and class B (versicolor); estimate the class of each sample.

Slide 51

Slide 51 text

Classification by the CLAFIC method. (Scatter plot of the training data with axes Sepal Length [cm], Sepal Width [cm], and Petal Width [cm].) Samples of class A (setosa) and class B (versicolor); estimate the class of each sample.

Slide 52

Slide 52 text

Classification by the CLAFIC method Samples of class A Samples of class B Estimate class of each sample Classification Training data setosa versicolor 57

Slide 53

Slide 53 text

Classification by the CLAFIC method Samples of class A Samples of class B Estimate class of each sample Classification Training data setosa versicolor 58

Slide 54

Slide 54 text

Classification by the CLAFIC method Samples of class A Samples of class B Estimate class of each sample Classification Training data setosa versicolor 59

Slide 55

Slide 55 text

Classification by the CLAFIC method Samples of class A Samples of class B Estimate class of each sample Classification Training data setosa versicolor 60

Slide 56

Slide 56 text

Classification by the CLAFIC method. (We need to normalize the data so that both class subspaces pass through the origin.) Training data: samples of class A (setosa) and class B (versicolor); estimate the class of each sample.

Slide 57

Slide 57 text

Classification by the CLAFIC method. Each class is represented by a subspace spanned by an orthogonal basis, whose dimension is a hyperparameter. (We need to normalize the data so that both subspaces pass through the origin.) Training data: class A (setosa), class B (versicolor); estimate the class of each sample.

Slide 58

Slide 58 text

Classification by the CLAFIC method. ■ Learning: run SVD on the samples of each class to obtain a class subspace. ■ Inference: find the subspace closest to the test sample c; if d_a < d_b, c belongs to class A, and if d_a > d_b, c belongs to class B. (We need to normalize the data so that both subspaces pass through the origin.)

Slide 59

Slide 59 text

1-nearest neighbor (1-NN) and CLAFIC. 1-NN: classification by the nearest training data point. CLAFIC: classification by the nearest subspace; if d_a < d_b, c belongs to class A, and if d_a > d_b, c belongs to class B.

Slide 60

Slide 60 text

Summary: classification by the subspace method. ■ Learning: SVD for each class. ■ Inference: find the closest subspace; if d_a < d_b, c belongs to class A, and if d_a > d_b, c belongs to class B. (We need to normalize the data so that both subspaces pass through the origin.)
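A minimal NumPy sketch of the subspace (CLAFIC) classifier just summarized, assuming the data have already been normalized so that the class subspaces pass through the origin; the subspace dimension k and the toy class clouds are illustrative choices.

```python
import numpy as np

def class_subspace(X, k):
    """Orthonormal basis (k columns) of a class subspace, from the SVD of the
    matrix whose rows are the training samples of that class."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k].T                              # shape: (n_features, k)

def distance_to_subspace(c, B):
    """Distance from sample c to the subspace spanned by the columns of B."""
    return np.linalg.norm(c - B @ (B.T @ c))

def clafic_predict(c, B_a, B_b):
    d_a, d_b = distance_to_subspace(c, B_a), distance_to_subspace(c, B_b)
    return "A" if d_a < d_b else "B"

# toy usage with random class clouds (illustrative only)
rng = np.random.default_rng(0)
X_a = rng.standard_normal((50, 4)) * [3, 1, 0.1, 0.1]   # class A spreads along features 1-2
X_b = rng.standard_normal((50, 4)) * [0.1, 0.1, 3, 1]   # class B spreads along features 3-4
B_a, B_b = class_subspace(X_a, k=2), class_subspace(X_b, k=2)
print(clafic_predict(np.array([2.0, 1.0, 0.0, 0.0]), B_a, B_b))   # -> "A"
```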

Slide 61

Slide 61 text

Datasets that cannot be linearly separated. ■ In this two-dimensional data space there is no plane (straight line) that reduces the dimension properly. ■ CLAFIC is ineffective for such a dataset; instead, project the data to a higher-dimensional space, making the data linearly separable.

Slide 62

Slide 62 text

Kernel CLAFIC, learning: map the L-dimensional data space into an M (≫ L)-dimensional feature space.

Slide 63

Slide 63 text

Kernel CLAFIC, learning: map the L-dimensional data space into an M (≫ L)-dimensional feature space.

Slide 64

Slide 64 text

Kernel CLAFIC, learning. Projection onto the higher-dimensional M (≫ L)-dimensional space makes the data linearly separable.

Slide 65

Slide 65 text

Kernel CLAFIC, learning. Perform SVD for each class in the high-dimensional M (≫ L)-dimensional feature space.

Slide 66

Slide 66 text

Kernel CLAFIC, inference. Project the test data c to the high-dimensional M (≫ L)-dimensional space and predict the class by the closest subspace: if d_a < d_b, c belongs to class A, and if d_a > d_b, c belongs to class B.

Slide 67

Slide 67 text

Kernel trick: there is no need to compute the high-dimensional mapping explicitly, even though M is large. The distance to each subspace can be written using the eigenvalues and eigenvectors of the kernel matrix.

Slide 68

Slide 68 text

Kernel trick: there is no need to compute the high-dimensional mapping explicitly, even though M is large; we only need the similarity (inner product) between samples. The distance can be written using the eigenvalues and eigenvectors of the kernel matrix. A typical example of such an inner product is the RBF kernel.
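A minimal sketch of the kernel-matrix construction implied here, with an RBF (Gaussian) kernel; gamma is an assumed bandwidth parameter, and the eigenvalues/eigenvectors of K are what the kernel methods below work with.

```python
import numpy as np

def rbf_kernel_matrix(X, Y, gamma=0.1):
    """K[i, j] = exp(-gamma * ||x_i - y_j||^2): similarity without computing the mapping."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

X = np.random.default_rng(0).standard_normal((100, 4))
K = rbf_kernel_matrix(X, X)
eigvals, eigvecs = np.linalg.eigh(K)     # used in kernel PCA / kernel CLAFIC
print(K.shape, eigvals[-3:])             # (100, 100) and the three largest eigenvalues
```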

Slide 69

Slide 69 text

Denoising by kernel PCA [1]. ■ Denoising: remove noise from noisy data, e.g., 28×28 images treated as 28×28 = 784-dimensional vectors. [1] Mika, Sebastian, et al. "Kernel PCA and de-noising in feature spaces." Advances in Neural Information Processing Systems 11 (1998).

Slide 70

Slide 70 text

Denoising by kernel PCA [1]. Each image is a point in the L-dimensional data space (L = 784). [1] Mika, Sebastian, et al. "Kernel PCA and de-noising in feature spaces." Advances in Neural Information Processing Systems 11 (1998).

Slide 71

Slide 71 text

Denoising by kernel PCA [1]: map the L-dimensional data space to the M (≫ L)-dimensional feature space.

Slide 72

Slide 72 text

Denoising by kernel PCA [1]: map the L-dimensional data space to the M (≫ L)-dimensional feature space.

Slide 73

Slide 73 text

Denoising by kernel PCA [1]. Project the noisy data onto a low-dimensional subspace in the M (≫ L)-dimensional feature space.

Slide 74

Slide 74 text

Denoising by kernel PCA [1]. The denoised image is obtained by the inverse mapping from the feature space back to the L-dimensional data space.
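A minimal scikit-learn sketch of this denoising pipeline on 784-dimensional vectors; the kernel, gamma, alpha, number of components, and the use of clean training data are illustrative assumptions, and scikit-learn's fit_inverse_transform learns the approximate inverse (pre-image) mapping used in the last step.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X_clean = rng.random((300, 784))              # stand-in for clean 28x28 images
X_noisy = X_clean + 0.3 * rng.standard_normal(X_clean.shape)

kpca = KernelPCA(n_components=32, kernel="rbf", gamma=1e-3,
                 fit_inverse_transform=True, alpha=1e-3)
kpca.fit(X_clean)                              # learn the subspace in feature space

Z = kpca.transform(X_noisy)                    # project noisy data onto the subspace
X_denoised = kpca.inverse_transform(Z)         # map back to the 784-dim data space
print(X_denoised.shape)                        # (300, 784)
```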

Slide 75

Slide 75 text

Anomaly detection by kernel PCA [2]. ■ Detect anomalous samples automatically among mostly normal samples. [2] Hoffmann, Heiko. "Kernel PCA for novelty detection." Pattern Recognition 40.3 (2007): 863-874.

Slide 76

Slide 76 text

Anomaly detection by kernel PCA [2]. ■ Training data: normal samples only. ■ Test data: normal and anomalous samples; determine whether each sample is normal or anomalous. [2] Hoffmann, Heiko. "Kernel PCA for novelty detection." Pattern Recognition 40.3 (2007): 863-874.

Slide 77

Slide 77 text

Anomaly detection by kernel PCA [2]. Perform kernel PCA with the normal samples only, mapping the L-dimensional (784) data space to the M (≫ L)-dimensional feature space.

Slide 78

Slide 78 text

Anomaly detection by kernel PCA [2]. Next, we project the test data into the M (≫ L)-dimensional feature space.

Slide 79

Slide 79 text

Anomaly detection by kernel PCA [2]. Next, we project the test data into the M (≫ L)-dimensional feature space.

Slide 80

Slide 80 text

Anomaly detection by kernel PCA [2]. If the distance from the projected test sample to the subspace learned from the normal samples exceeds a threshold (a hyperparameter), the sample is predicted to be anomalous.
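A simplified scikit-learn sketch, assuming the anomaly score is the input-space reconstruction error after projecting onto the subspace learned from normal samples only ([2] scores the distance in feature space; this variant is a common approximation), with the threshold as the hyperparameter.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X_normal = rng.standard_normal((500, 784)) * 0.5            # training: normal samples only
X_test = np.vstack([rng.standard_normal((10, 784)) * 0.5,    # normal test samples
                    rng.standard_normal((10, 784)) * 3.0])   # anomalous test samples

kpca = KernelPCA(n_components=20, kernel="rbf", gamma=1e-3,
                 fit_inverse_transform=True, alpha=1e-3).fit(X_normal)

recon = kpca.inverse_transform(kpca.transform(X_test))
score = np.linalg.norm(X_test - recon, axis=1)               # anomaly score per sample
threshold = 30.0                                             # hyperparameter to tune
print(score > threshold)                                     # True -> predicted anomalous
```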

Slide 81

Slide 81 text

Summary: kernel CLAFIC and its various applications. Kernel CLAFIC: the data become linearly separable after projection from the data space to a higher-dimensional feature space. Application 1, denoising: come back to the data space with the inverse mapping. Application 2, anomaly detection: detect anomalies by the distance to the subspace.

Slide 82

Slide 82 text

Grassmannian learning. Training data Y_A, Y_B, Y_C: each data point is itself a linear subspace span(u_1, ..., u_r) of R^N, where N is the size of each image (e.g., a set of face images per person; face images from the Yale Face Database http://cvc.cs.yale.edu/cvc/projects/yalefaces/yalefaces.html). Classify subspaces by a distance between subspaces, using the principal angles θ_i obtained by SVD of Y_A⊤Y_B: Projection metric d(Y_A, Y_B) = (1/√2) ||Y_A Y_A⊤ − Y_B Y_B⊤||_F = sqrt(Σ_i sin²θ_i); Binet-Cauchy metric d(Y_A, Y_B) = sqrt(1 − Π_i cos²θ_i). Grassmannian kernel method: projection kernel k(Y_A, Y_B) = ||Y_A⊤Y_B||_F², Binet-Cauchy kernel k(Y_A, Y_B) = det(Y_A⊤Y_B)². Useful when the number of data points used for inference is not fixed, e.g., classification that estimates a label for each set of multiple facial images. [17] Hamm, Jihun, and Daniel D. Lee. "Grassmann discriminant analysis: a unifying view on subspace-based learning." ICML 2008.

Slide 83

Slide 83 text

Overview ■ Introduction: Why do we decompose data? ■ A quick review of matrix rank ■ Singular value decomposition (SVD) and low-rank approximation ■ Kernel subspace method and its applications (denoising, anomaly detection) ■ Non-negative matrix factorization ■ Tensor low-rank decomposition and many-body approximation

Slide 84

Slide 84 text

Matrix factorization with non-negative constraints. ■ SVD includes negative values in the representation: SVD imposes that the factors are orthogonal. ■ NMF instead imposes that the factors are non-negative; non-negativity improves interpretability.

Slide 85

Slide 85 text

Example of non-negative matrix factorization [3]. A set of 2429 face images of 19×19 pixels is arranged as a matrix and factorized; the rank k is a hyperparameter, and in this experiment k = 49. [3] Lee, Daniel D., and H. Sebastian Seung. "Learning the parts of objects by non-negative matrix factorization." Nature 401.6755 (1999): 788-791.

Slide 86

Slide 86 text

Example of non-negative matrix factorization [3]. The data matrix has n = 19×19 = 361 pixels and m = 2429 face images; the rank k is a hyperparameter, and in this experiment k = 49. [3] Lee, Daniel D., and H. Sebastian Seung. "Learning the parts of objects by non-negative matrix factorization." Nature 401.6755 (1999): 788-791.

Slide 87

Slide 87 text

PCA and NMF [3]. Representation by NMF: each basis is a facial part such as eyes, mouth, or nose, and a face is reconstructed by adding up the parts. Representation by SVD (PCA): each basis is face-like (positive values drawn in black, negative values in red), and a face is reconstructed by adding up these whole-face bases. [3] Lee, Daniel D., and H. Sebastian Seung. "Learning the parts of objects by non-negative matrix factorization." Nature 401.6755 (1999): 788-791.

Slide 88

Slide 88 text

PCA and NMF [3]. Representation by NMF: each basis is a facial part such as eyes, mouth, or nose, and a face is reconstructed by adding up the parts. Representation by SVD (PCA): each basis is face-like (positive values drawn in black, negative values in red), and a face is reconstructed by adding up these whole-face bases. [3] Lee, Daniel D., and H. Sebastian Seung. "Learning the parts of objects by non-negative matrix factorization." Nature 401.6755 (1999): 788-791.

Slide 89

Slide 89 text

Challenges of NMF. ・Finding the best factorization is an NP-hard problem (※ in terms of minimizing the Frobenius error), so in practice we rely on gradient-based optimization of a non-convex objective, and the approximate solution depends on the initial values (contrast convex vs. non-convex functions: the NMF error for a matrix is non-convex). ・The rank hyperparameter k needs to be tuned.
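A minimal NumPy sketch of NMF with Lee-Seung multiplicative updates for the Frobenius error; it is only a local-search method, so the result depends on the random initialization, exactly as discussed above.

```python
import numpy as np

def nmf_mu(A, k, n_iter=500, eps=1e-9, seed=0):
    """Non-negative factorization A (n x m) ~ W (n x k) @ H (k x m),
    Lee-Seung multiplicative updates minimizing the Frobenius error."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)   # update H with W fixed
        W *= (A @ H.T) / (W @ H @ H.T + eps)   # update W with H fixed
    return W, H

A = np.abs(np.random.default_rng(1).standard_normal((30, 20)))
W, H = nmf_mu(A, k=5)
print(np.linalg.norm(A - W @ H) / np.linalg.norm(A))   # relative reconstruction error
```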

Slide 90

Slide 90 text

Non-negative multiple matrix factorization [6]. ■ Extract patterns from playback history for a music recommendation system. The main user×singer matrix (who listened to which singer) is factorized jointly with auxiliary matrices: user×tag (tags for artists: pop, rock, jazz, ...) and user×user friendship (a nonzero entry if user i and user j are friends), with weights on the auxiliary information. The KL-divergence objective prevents over-training and is robust to outliers. [6] Takeuchi, Koh, et al. "Non-negative multiple matrix factorization." Twenty-third International Joint Conference on Artificial Intelligence. 2013.

Slide 91

Slide 91 text

Non-negative multiple matrix factorization [6]. ■ Extract patterns from playback history for a music recommendation system: patterns between users and singers and between users and genres are extracted simultaneously. [6] Takeuchi, Koh, et al. "Non-negative multiple matrix factorization." Twenty-third International Joint Conference on Artificial Intelligence. 2013.

Slide 92

Slide 92 text

Summary: non-negative matrix factorization. ■ Non-negativity improves interpretability (PCA vs. NMF). ■ Many task-specific NMFs have been developed, e.g., Group NMF [5] and NMMF [6]. ■ The best solution cannot be obtained: the best decomposition (※ in terms of minimizing the Frobenius error) is an NP-hard problem, and gradient-based optimization of the non-convex objective depends on the initial values. [5] Lee, Hyekyoung, and Seungjin Choi. "Group nonnegative matrix factorization for EEG classification." Artificial Intelligence and Statistics. PMLR, 2009. [6] Takeuchi, Koh, et al. "Non-negative multiple matrix factorization." Twenty-third International Joint Conference on Artificial Intelligence. 2013.

Slide 93

Slide 93 text

Overview ■ Introduction: Why do we decompose data? ■ A quick review of matrix rank ■ Singular value decomposition (SVD) and low-rank approximation ■ Kernel subspace method and its applications (denoising, anomaly detection) ■ Non-negative matrix factorization ■ Tensor low-rank decomposition and many-body approximation

Slide 94

Slide 94 text

NMF with missing values. ■ Data often include missing values. Simply replacing missing values with averages or zeros does not work well; instead, estimate the missing values using the assumption that the data have a low-rank structure. (Only one value is shown as missing for simplicity; in practice, many missing values are possible.)

Slide 95

Slide 95 text

NMF with missing values: low-rank matrix completion, EM-WNMF [4]. Consider the space in which each point is a matrix: the set of matrices that agree with the observations on the observed entries is a subspace whose dimension equals the number of missing values, and we assume A has a low-rank structure plus noise. (Only one value is missing for simplicity; in practice, many missing values are possible.) [4] Zhang, Sheng, et al. "Learning from incomplete ratings using non-negative matrix factorization." Proceedings of the 2006 SIAM International Conference on Data Mining, 2006.

Slide 96

Slide 96 text

NMF with missing values: low-rank matrix completion, EM-WNMF [4]. In the space where each point is a matrix, consider the set of rank-k matrices (k is a hyperparameter). (Only one value is missing for simplicity; in practice, many missing values are possible.) [4] Zhang, Sheng, et al. "Learning from incomplete ratings using non-negative matrix factorization." Proceedings of the 2006 SIAM International Conference on Data Mining, 2006.

Slide 97

Slide 97 text

NMF with missing values: low-rank matrix completion, EM-WNMF [4]. Step 0: initialize the missing values to obtain a complete matrix. (k: hyperparameter; the model is the set of rank-k matrices.) [4] Zhang, Sheng, et al. "Learning from incomplete ratings using non-negative matrix factorization." Proceedings of the 2006 SIAM International Conference on Data Mining, 2006.

Slide 98

Slide 98 text

NMF with missing values: low-rank matrix completion, EM-WNMF [4]. Initialize the missing values to obtain a complete matrix; M step: perform NMF on it to obtain a rank-k matrix. [4] Zhang, Sheng, et al. "Learning from incomplete ratings using non-negative matrix factorization." Proceedings of the 2006 SIAM International Conference on Data Mining, 2006.

Slide 99

Slide 99 text

NMF with missing values: low-rank matrix completion, EM-WNMF [4]. Initialize the missing values; M step: perform NMF on the current matrix to obtain a rank-k matrix; E step: overwrite the entries at the observed indices with the original observations. [4] Zhang, Sheng, et al. "Learning from incomplete ratings using non-negative matrix factorization." Proceedings of the 2006 SIAM International Conference on Data Mining, 2006.

Slide 100

Slide 100 text

NMF with missing values: low-rank matrix completion, EM-WNMF [4]. Initialize the missing values, then repeat: M step, perform NMF on the current matrix to obtain a rank-k matrix; E step, overwrite the entries at the observed indices with the original observations. (k: hyperparameter; only one value is missing here for simplicity.) [4] Zhang, Sheng, et al. "Learning from incomplete ratings using non-negative matrix factorization." Proceedings of the 2006 SIAM International Conference on Data Mining, 2006.
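A minimal sketch of the EM-WNMF loop just described, assuming `mask` marks the observed entries and using scikit-learn's NMF as the inner M step; the rank k, the initialization, and the number of outer iterations are illustrative choices.

```python
import numpy as np
from sklearn.decomposition import NMF

def em_wnmf(A_obs, mask, k=4, n_outer=30, seed=0):
    """Complete the missing entries of a non-negative matrix.
    A_obs: data with arbitrary values at missing entries; mask: True where observed."""
    Q = A_obs.copy()
    Q[~mask] = A_obs[mask].mean()                  # initialize missing values
    for _ in range(n_outer):
        model = NMF(n_components=k, init="random", random_state=seed, max_iter=500)
        W = model.fit_transform(Q)                 # M step: rank-k NMF of current matrix
        Q = W @ model.components_
        Q[mask] = A_obs[mask]                      # E step: overwrite observed entries
    return Q

rng = np.random.default_rng(0)
A_true = rng.random((20, 4)) @ rng.random((4, 15))
mask = rng.random(A_true.shape) > 0.3              # ~70% of entries observed
A_hat = em_wnmf(A_true, mask, k=4)
print(np.abs(A_hat - A_true)[~mask].mean())        # error on the missing entries
```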

Slide 101

Slide 101 text

Issues in EM-WNMF [4]: the rank k needs to be chosen appropriately, the algorithm depends on the initial values, and convergence is slow. [4] Zhang, Sheng, et al. "Learning from incomplete ratings using non-negative matrix factorization." Proceedings of the 2006 SIAM International Conference on Data Mining, 2006.

Slide 102

Slide 102 text

Further topics of matrix factorization. ・Probabilistic principal component analysis: each element of the matrix is sampled from a probability distribution and we assume a prior distribution, so confidence intervals can be obtained. ・Binary matrix factorization and Boolean matrix factorization [18]: decomposition under the algebra with 1×1 = 1 and 1+1 = 1. ・Autoencoder and principal component analysis: the error function of a linear autoencoder (activation f(x) = x) with parameter sharing between the encoder W and the decoder W⊤ is equivalent to PCA; the width of the middle layer corresponds to the matrix rank. Such factorizations are often used in recommendation engines. [18] Hashemi, Soheil, Hokchhay Tann, and Sherief Reda. "BLASYS: Approximate logic synthesis using Boolean matrix factorization." Proceedings of the 55th Annual Design Automation Conference. 2018.

Slide 103

Slide 103 text

Further topics of matrix factorization. ・Deep matrix factorization [19]: X ∈ R^{n×m} is approximated by compositions of factor matrices with activation functions, generalizing X ≃ W H with W ∈ R^{n×r} and H ∈ R^{r×m}. ・Double-descent phenomena in NMF [20]: the behavior of overfitting as the rank r grows larger. [19] Fan, Jicong. "Multi-mode deep matrix and tensor factorization." ICLR, 2021. [20] Kawakami, Y., et al. "Investigating Overparameterization for Non-Negative Matrix Factorization in Collaborative Filtering." RecSys 2021, Late-Breaking Results track, 2021.

Slide 104

Slide 104 text

Overview ■ Introduction: Why do we decompose data? ■ A quick review of matrix rank ■ Singular value decomposition (SVD) and low-rank approximation ■ Kernel subspace method and its applications (denoising, anomaly detection) ■ Non-negative matrix factorization ■ Tensor low-rank decomposition and many-body approximation

Slide 105

Slide 105 text

Various data are stored in computers as tensors: RGB images, video, hyperspectral images, microarray data, EEG data, and time-series data and signals. Image sources: NASA, https://sdo.gsfc.nasa.gov/data/; [14] M. Mørup, Data Mining and Knowledge Discovery 2011; [15] Kosmas Dimitropoulos, et al., IEEE Transactions on Circuits and Systems for Video Technology 2018; [16] A. Cichocki, et al., Nonnegative Matrix and Tensor Factorizations 2009.

Slide 106

Slide 106 text

Tensors in modern machine learning. ・Relational learning: relations among entities (A, B, C, ...) are expressed as tensors, e.g., graph link prediction by the RESCAL model [7]. ・Deep learning: e.g., lightweighting neural networks with tensor decomposition [8][9]. [7] Nickel, Maximilian, Volker Tresp, and Hans-Peter Kriegel. "A three-way model for collective learning on multi-relational data." ICML 2011. [8] T. Bezdan, N. Bačanin Džakula, International Scientific Conference on Information Technology and Data Related Research, 2019. [9] Y. Liu et al., Tensor Computation for Data Analysis, 2022.

Slide 107

Slide 107 text

Tensor hierarchy: vectors of tensors are also tensors. Numbers are zeroth-order tensors a; vectors are first-order tensors a = (a_1, a_2, ..., a_n)⊤; matrices are second-order tensors X = (a_ij); third-order tensors T stack matrices; fourth-order tensors U = (T^(1), T^(2), ..., T^(n)) stack third-order tensors.

Slide 108

Slide 108 text

Extracting features from tensors. ■ Tensor decomposition extracts features from data: we want to extract patterns or features in tensor-formatted data (e.g., a purchase tensor with a "who" mode and a "shop" mode) by minimizing ||P − P̄||_F, where P̄ is a sum of rank-1 terms. Applications include density estimation, pattern extraction, data mining, denoising, data compression, anomaly detection, ...

Slide 109

Slide 109 text

Difficulties in tensor factorization. ■ We need to decide both a structure and a rank when minimizing ||P − P̄||_F: CP decomposition (a sum of rank-1 terms), Tucker decomposition, or decompositions with tensor networks such as the tensor train decomposition and the tensor ring decomposition. In a tensor network diagram each node is a tensor and each edge is a mode. [11] Wang, Wenqi, et al. CVPR 2018. [12] Zheng, Yu-Bang, et al. AAAI 2021. [13] A. Cichocki, et al., Tensor Networks for Dimensionality Reduction and Large-scale Optimization, 2016. https://tensornetwork.org

Slide 110

Slide 110 text

Difficulties in tensor factorization. ■ We need to decide both a structure (CP, Tucker, tensor train, tensor ring, ...) and a rank. ■ Optimization is difficult: tensor decomposition is often ill-posed or NP-hard (e.g., finding the best rank-1 CP decomposition minimizing the Frobenius norm is NP-hard), the objective ||P − P̄||_F is typically non-convex, the solution may be indeterminate, and the result depends on the initial values with no guarantee of reaching the best solution. A convex, stable, and intuitively designable tensor decomposition is desired.
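To make the CP case concrete, here is a minimal alternating-least-squares sketch for a third-order tensor in NumPy (unregularized, so it inherits the difficulties listed above: non-convexity and initial-value dependence); libraries such as TensorLy provide production implementations.

```python
import numpy as np

def khatri_rao(X, Y):
    """Column-wise Kronecker product; row index runs over the pairs of row indices."""
    return (X[:, None, :] * Y[None, :, :]).reshape(-1, X.shape[1])

def cp_als(T, rank, n_iter=200, seed=0):
    """Rank-`rank` CP decomposition of a 3rd-order tensor by alternating least squares."""
    rng = np.random.default_rng(seed)
    I, J, K = T.shape
    A = rng.random((I, rank)); B = rng.random((J, rank)); C = rng.random((K, rank))
    T1 = T.reshape(I, -1)                          # mode-1 unfolding
    T2 = np.moveaxis(T, 1, 0).reshape(J, -1)       # mode-2 unfolding
    T3 = np.moveaxis(T, 2, 0).reshape(K, -1)       # mode-3 unfolding
    for _ in range(n_iter):
        A = T1 @ np.linalg.pinv(khatri_rao(B, C).T)
        B = T2 @ np.linalg.pinv(khatri_rao(A, C).T)
        C = T3 @ np.linalg.pinv(khatri_rao(A, B).T)
    return A, B, C

rng = np.random.default_rng(1)
T = np.einsum("ir,jr,kr->ijk", rng.random((6, 3)), rng.random((7, 3)), rng.random((8, 3)))
A, B, C = cp_als(T, rank=3)
T_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
print(np.linalg.norm(T - T_hat) / np.linalg.norm(T))   # small relative error
```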

Slide 111

Slide 111 text

Many-body approximation for non-negative tensors [10]. Each entry of the tensor is modeled through an energy function whose coefficients are natural parameters of an exponential family of distributions.

Slide 112

Slide 112 text

Many-body approximation for non-negative tensors [10]. Each entry of the tensor is modeled through an energy function whose coefficients are natural parameters of an exponential family of distributions.

Slide 113

Slide 113 text

Many-body approximation for non-negative tensors [10]. The energy function is written with natural parameters of an exponential family; two-body terms control the relation between mode-k and mode-l, and three-body terms control the relation among mode-j, mode-k, and mode-l.

Slide 114

Slide 114 text

Many-body approximation for non-negative tensors [10]. One-body approximation = rank-1 approximation (mean-field approximation).
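A minimal NumPy sketch of the one-body (mean-field) case: treating the normalized tensor as a joint distribution, the best rank-1 approximation under the KL divergence is the outer product of its mode-wise marginals.

```python
import numpy as np

def one_body_approx(T):
    """One-body (mean-field / rank-1) approximation of a non-negative tensor:
    the outer product of the mode-wise marginals of T / T.sum(), rescaled to
    the original total mass. This minimizes KL(T/T.sum() || Q) over rank-1 Q."""
    total = T.sum()
    P = T / total
    marginals = [P.sum(axis=tuple(ax for ax in range(P.ndim) if ax != m))
                 for m in range(P.ndim)]
    Q = marginals[0]
    for m in marginals[1:]:
        Q = np.multiply.outer(Q, m)
    return total * Q

T = np.random.default_rng(0).random((4, 5, 6))
T1 = one_body_approx(T)
print(T1.shape, np.isclose(T1.sum(), T.sum()))   # same shape and same total mass
```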

Slide 115

Slide 115 text

Many-body approximation for non-negative tensors [10]. One-body approximation: rank-1 approximation (mean-field approximation). Two-body approximation: adds two-body interactions that control the relation between mode-k and mode-l, giving larger capability.

Slide 116

Slide 116 text

Many-body approximation for non-negative tensors [10]. One-body (rank-1, mean-field), two-body, and three-body approximations form a hierarchy of increasing capability, where two-body interactions control the relation between mode-k and mode-l and three-body interactions control the relation among mode-j, mode-k, and mode-l. The global optimal solution minimizing the KL divergence from the input tensor can be obtained by convex optimization, and the interactions allow intuitive modeling focused on the relations between modes.

Slide 117

Slide 117 text

Theoretical idea behind the many-body approximation. We regard a normalized tensor as a discrete joint probability distribution whose sample space is the index set (each index is a discrete random variable), and transform coordinates to the θ-representation, i.e., the natural parameters of the exponential family. In the geometry of θ-space the low-rank space is not flat and is difficult to optimize over, whereas the subspace constructed by restricting the interactions among tensor modes is flat; describing tensor factorization in the θ-coordinate system makes it a convex problem. We use information geometry to formulate factorization as a convex problem.

Slide 118

Slide 118 text

Example of tensor reconstruction by many-body approximation: reconstruction of a 40×40×3×10 tensor (width, height, colors, # images). With weaker interactions the color is uniform within each image and depends only on the image index; with the three-body approximation the color depends on the pixel, capturing the shape of each image. Larger interaction sets give larger capability, and the model is intuitively designable to capture the relationships between modes.

Slide 119

Slide 119 text

A color image is decomposed into shape × color by the approximation.

Slide 120

Slide 120 text

A color image is decomposed into shape × color: the tensor is approximated by the product of the shape of each image and the color of each image.

Slide 121

Slide 121 text

Data completion by many-body approximation and the em-algorithm. Low-rank tensor completion alternates an m-step, Q ← low_rank_approx(Q) (m-projection), and an e-step, Q_τ ← T_τ for the observed indices τ (e-projection). The low-rank space is not flat, which causes non-uniqueness of the projection, and we still have to decide which structure is better for the decomposition and how to choose the rank.

Slide 122

Slide 122 text

Data completion by many-body approximation and the em-algorithm. Replacing the low-rank approximation with a low-body (many-body) approximation, m-step Q ← many_body_approx(Q) and e-step Q_τ ← T_τ for the observed indices τ, the low-body space is flat, which ensures uniqueness of the projection, and intuitive modeling is possible.
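A minimal sketch of this e/m alternation, using the one-body (mean-field) approximation as a stand-in for the flat-space projection step (the paper's many-body projection is more general); `mask` marks the observed entries and the iteration count is an illustrative choice.

```python
import numpy as np

def mean_field_approx(T):
    """Rank-1 (one-body) approximation: outer product of normalized marginals."""
    P = T / T.sum()
    Q = np.ones(())
    for m in range(T.ndim):
        Q = np.multiply.outer(Q, P.sum(axis=tuple(a for a in range(T.ndim) if a != m)))
    return T.sum() * Q

def em_completion(T_obs, mask, n_iter=50):
    Q = T_obs.copy()
    Q[~mask] = T_obs[mask].mean()          # initialize missing entries
    for _ in range(n_iter):
        Q = mean_field_approx(Q)           # m-step: project onto the flat model space
        Q[mask] = T_obs[mask]              # e-step: overwrite observed entries
    return Q

rng = np.random.default_rng(0)
T = np.multiply.outer(rng.random(5), np.multiply.outer(rng.random(6), rng.random(7)))
mask = rng.random(T.shape) > 0.2
print(np.abs(em_completion(T, mask) - T)[~mask].mean())   # error on the missing entries
```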

Slide 123

Slide 123 text

Data completion by many-body approximation and the em-algorithm. Reconstructing traffic speed data, a 28×24×12×4 tensor (days, hours, minutes, lanes) with missing values, using interactions over correlated modes; the figure compares the ground truth and the prediction on the missing area.

Slide 124

Slide 124 text

Data completion by many-body approximation and the em-algorithm. Reconstructing traffic speed data, a 28×24×12×4 tensor (days, hours, minutes, lanes) with missing values, comparing interactions over correlated modes with interactions over non-correlated modes (fit score: 0.82); the figure shows the ground truth and the prediction on the missing area.

Slide 125

Slide 125 text

Summary: many-body approximation for tensors (https://arxiv.org/abs/2209.15338). ・More intuitive model design than rank tuning: one-body, two-body, and three-body approximations, with bias (magnetic field) and weight (electron-electron interaction) parameters as in many-body physics. ・Convex optimization always provides the unique solution.

Slide 126

Slide 126 text

Report assignments. There are four assignments; choose one of them and write your report, indicating which assignment you have chosen. The format is flexible but should be less than 2 sheets of A4-size paper; handwriting is also acceptable. 1. Prove the Eckart-Young or Eckart-Young-Mirsky theorem. 2. There are many algorithms for non-negative matrix factorization, e.g., HALS-NMF, MU-NMF, etc. Choose one of them and derive the algorithm; you do not have to show the convergence guarantee. Note: there are many variations of NMF, such as NMF optimizing the Frobenius norm, NMF optimizing the KL divergence, and NMF optimizing the Frobenius norm with a regularization term. Please indicate which kind of NMF you are discussing.

Slide 127

Slide 127 text

161 Report assignments 3. Visit Google Scholar (https://scholar.google.com/) and search for a paper related to SVD, NMF, or tensor factorization. Summarize the motivation, method, and novelty of the paper. Note: Please clearly indicate the paper title you chose. Example papers: - Lee, Daniel D., and H. Sebastian Seung. "Learning the parts of objects by non-negative matrix factorization." nature 401.6755 (1999): 788-791. - Takeuchi, Koh, et al. "Non-negative multiple matrix factorization." Twenty-third international joint conference on artificial intelligence. 2013. - Lee, Hyekyoung, and Seungjin Choi. "Group nonnegative matrix factorization for EEG classification." Artificial Intelligence and Statistics. PMLR, 2009. - Ding, Chris, Xiaofeng He, and Horst D. Simon. "On the equivalence of nonnegative matrix factorization and spectral clustering." Proceedings of the 2005 SIAM international conference on data mining. Society for Industrial and Applied Mathematics, 2005. - Lee, Daniel, and H. Sebastian Seung. "Algorithms for non-negative matrix factorization." Advances in neural information processing systems 13 (2000). - Zhang, Jun, Yong Yan, and Martin Lades. "Face recognition: eigenface, elastic matching, and neural nets." Proceedings of the IEEE 85.9 (1997): 1423-1435. Recommended keywords: - Non-negative matrix factorization, SVD, Tensor decomposition, Eigen face, weighted NMF, low-rank completion

Slide 128

Slide 128 text

Report assignments. 4. Follow the steps below using Python, Matlab, R, C/C++, Fortran, or Julia: (1) Convert a grayscale image into a two-dimensional array (matrix). (2) Perform the rank-r approximation on it using SVD. (3) Plot a figure with the rank on the horizontal axis and the reconstruction error on the vertical axis, and observe how the reconstructed image changes as the rank is varied. (4) Plot the singular value distribution obtained by SVD for the image: the vertical axis is the n-th largest singular value and the horizontal axis is n. Note: please paste a part of the source code you have written into the report; you do not have to paste all of the source code.

Slide 129

Slide 129 text

Overview ■ Introduction: Why do we decompose data? ■ A quick review of matrix rank ■ Singular value decomposition (SVD) and low-rank approximation ■ Kernel subspace method and its applications (denoising, anomaly detection) ■ Non-negative matrix factorization ■ Tensor low-rank decomposition and many-body approximation

Slide 130

Slide 130 text

For further study. ■ Textbooks on matrix and tensor decomposition, and a lecture series by Dr. Steve Brunton covering: speeding up SVD, SVD for linear regression, SVD for face recognition, and how to choose ranks. ■ Tensor libraries, e.g., iTensor.

Slide 131

Slide 131 text

References
[1] Mika, Sebastian, et al. "Kernel PCA and de-noising in feature spaces." Advances in Neural Information Processing Systems 11 (1998).
[2] Hoffmann, Heiko. "Kernel PCA for novelty detection." Pattern Recognition 40.3 (2007): 863-874.
[3] Lee, Daniel D., and H. Sebastian Seung. "Learning the parts of objects by non-negative matrix factorization." Nature 401.6755 (1999): 788-791.
[4] Zhang, Sheng, et al. "Learning from incomplete ratings using non-negative matrix factorization." Proceedings of the 2006 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 2006.
[5] Lee, Hyekyoung, and Seungjin Choi. "Group nonnegative matrix factorization for EEG classification." Artificial Intelligence and Statistics. PMLR, 2009.
[6] Takeuchi, Koh, et al. "Non-negative multiple matrix factorization." Twenty-third International Joint Conference on Artificial Intelligence. 2013.
[7] Nickel, Maximilian, Volker Tresp, and Hans-Peter Kriegel. "A three-way model for collective learning on multi-relational data." ICML 2011.
[8] T. Bezdan, N. Bačanin Džakula, International Scientific Conference on Information Technology and Data Related Research, 2019.
[9] Y. Liu et al., "Tensor Computation for Data Analysis", 2022.
[10] Ghalamkari, Kazu, Mahito Sugiyama, and Yoshinobu Kawahara. "Many-body approximation for non-negative tensors." Advances in Neural Information Processing Systems 36 (2024).
[11] Wang, Wenqi, et al. "Wide compression: Tensor ring nets." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
[12] Zheng, Yu-Bang, et al. "Fully-connected tensor network decomposition and its application to higher-order tensor completion." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. No. 12. 2021.
[13] Cichocki, Andrzej, et al. "Tensor networks for dimensionality reduction and large-scale optimization: Part 1 low-rank tensor decompositions." Foundations and Trends® in Machine Learning 9.4-5 (2016): 249-429.

Slide 132

Slide 132 text

References
[14] Mørup, Morten. "Applications of tensor (multiway array) factorizations and decompositions in data mining." Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1.1 (2011): 24-40.
[15] Dimitropoulos, Kosmas, et al. "Classification of multidimensional time-evolving data using histograms of grassmannian points." IEEE Transactions on Circuits and Systems for Video Technology 28.4 (2016): 892-905.
[16] Andrzej Cichocki, Rafal Zdunek, Anh Huy Phan, and Shun-ichi Amari. Non-negative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. John Wiley & Sons, 2009.
[17] Hamm, Jihun, and Daniel D. Lee. "Grassmann discriminant analysis: a unifying view on subspace-based learning." ICML. 2008.
[18] Hashemi, Soheil, Hokchhay Tann, and Sherief Reda. "BLASYS: Approximate logic synthesis using Boolean matrix factorization." Proceedings of the 55th Annual Design Automation Conference. 2018.
[19] Fan, Jicong. "Multi-mode deep matrix and tensor factorization." ICLR, 2021.
[20] Kawakami, Y., et al. "Investigating Overparameterization for Non-Negative Matrix Factorization in Collaborative Filtering." RecSys 2021, Late-Breaking Results track, 2021.
Acknowledgements: I would like to express my gratitude to Dr. Profir-Petru Pârțachi from the National Institute of Informatics for his valuable comments on the structure and content of these lecture slides.
Used datasets: MNIST https://yann.lecun.com/exdb/mnist/ ; COIL100 https://www1.cs.columbia.edu/CAVE/software/softlib/coil-100.php ; IRIS https://archive.ics.uci.edu/dataset/53/iris ; Yale Face Database http://cvc.cs.yale.edu/cvc/projects/yalefaces/yalefaces.html