Slide 1

Slide 1 text

Rajesh Singh Gaussian Processes The Institute of Mathematical Sciences-HBNI

Slide 2

Slide 2 text

Machine learning A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E. — T. M. Mitchell

Slide 3

Slide 3 text

Machine learning A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E. — T. M. Mitchell select the best model | code it up | update with new data | learn from data

Slide 4

Slide 4 text

Machine learning A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E. — T. M. Mitchell select the best model | code it up | update with new data | learn from data Supervised learning: infer from training data having input (x) & output (y). Unsupervised learning: infer from training data having input (x) alone.

Slide 5

Slide 5 text

Machine learning A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E. — T. M. Mitchell select the best model | code it up | update with new data | learn from data Supervised learning: infer from training data having input (x) & output (y). Unsupervised learning: infer from training data having input (x) alone. Classification: discrete output, e.g. SVM, GP. Regression: continuous output, e.g. SVR, GP.

Slide 6

Slide 6 text

Machine learning A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E. — T. M. Mitchell select the best model | code it up | update with new data | learn from data Supervised learning: infer from training data having input (x) & output (y). Unsupervised learning: infer from training data having input (x) alone. Classification: discrete output, e.g. SVM, GP. Regression: continuous output, e.g. SVR, GP. Clustering: discover groups, e.g. k-means. Dimensionality reduction: reduce variables, e.g. GP-LVM.

Slide 7

Slide 7 text

kdnuggets.com/2014/09/most-viewed-machine-learning-talks-videolectures

Slide 8

Slide 8 text

kdnuggets.com/2014/09/most-viewed-machine-learning-talks-videolectures

Slide 9

Slide 9 text

kdnuggets.com/2014/09/most-viewed-machine-learning-talks-videolectures

Slide 10

Slide 10 text

kdnuggets.com/2014/09/most-viewed-machine-learning-talks-videolectures

Slide 11

Slide 11 text

kdnuggets.com/2014/09/most-viewed-machine-learning-talks-videolectures

Slide 12

Slide 12 text

kdnuggets.com/2014/09/most-viewed-machine-learning-talks-videolectures

Slide 13

Slide 13 text

Bayes theorem betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-theorem/ $P(\theta \mid X) = \dfrac{P(X \mid \theta)\, P(\theta)}{P(X)}$

Slide 14

Slide 14 text

Bayes theorem betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-theorem/ Posterior = (Likelihood × Prior) / Evidence: $P(\theta \mid X) = \dfrac{P(X \mid \theta)\, P(\theta)}{P(X)}$

Slide 15

Slide 15 text

Bayes theorem betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-theorem/ Chances of a disease (D) if the test (T) is positive: $P(D \mid T) = ?$ Posterior = (Likelihood × Prior) / Evidence: $P(\theta \mid X) = \dfrac{P(X \mid \theta)\, P(\theta)}{P(X)}$

Slide 16

Slide 16 text

Bayes theorem betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-theorem/ Chances of a disease (D) if the test (T) is positive: $P(D \mid T) = ?$ Knowns: $P(D) = 0.01$, $P(T \mid D) = 0.8$, $P(T \mid \tilde{D}) = 0.096$. Posterior = (Likelihood × Prior) / Evidence: $P(\theta \mid X) = \dfrac{P(X \mid \theta)\, P(\theta)}{P(X)}$

Slide 17

Slide 17 text

Bayes theorem betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-theorem/ Chances of a disease (D) if the test (T) is positive: $P(D \mid T) = ?$ Knowns: $P(D) = 0.01$, $P(T \mid D) = 0.8$, $P(T \mid \tilde{D}) = 0.096$. $P(D \mid T) = \dfrac{P(T \mid D)\, P(D)}{P(T \mid D)\, P(D) + P(T \mid \tilde{D})\, P(\tilde{D})}$ Posterior = (Likelihood × Prior) / Evidence: $P(\theta \mid X) = \dfrac{P(X \mid \theta)\, P(\theta)}{P(X)}$

Slide 18

Slide 18 text

Bayes theorem betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-theorem/ Chances of a disease (D) if the test (T) is positive: $P(D \mid T) = ?$ Knowns: $P(D) = 0.01$, $P(T \mid D) = 0.8$, $P(T \mid \tilde{D}) = 0.096$. $P(D \mid T) = \dfrac{P(T \mid D)\, P(D)}{P(T \mid D)\, P(D) + P(T \mid \tilde{D})\, P(\tilde{D})}$ Plugging in the numbers gives a probability of about 0.078. Posterior = (Likelihood × Prior) / Evidence: $P(\theta \mid X) = \dfrac{P(X \mid \theta)\, P(\theta)}{P(X)}$
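A quick check of that arithmetic in Python (a minimal sketch; the variable names are ours, not from the slides):

```python
# Bayes theorem for the disease/test example on this slide.
p_D = 0.01              # prior probability of disease, P(D)
p_T_given_D = 0.8       # test sensitivity, P(T | D)
p_T_given_notD = 0.096  # false-positive rate, P(T | ~D)

# Evidence: total probability of a positive test.
p_T = p_T_given_D * p_D + p_T_given_notD * (1 - p_D)

# Posterior: probability of disease given a positive test.
p_D_given_T = p_T_given_D * p_D / p_T
print(round(p_D_given_T, 3))   # -> 0.078
```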

Slide 19

Slide 19 text

Regression Given noisy data, make predictions! Which parameters for this?

Slide 20

Slide 20 text

Regression Given noisy data, make predictions! Which parameters for this?

Slide 21

Slide 21 text

Regression Given noisy data, make predictions! Which parameters for this?

Slide 22

Slide 22 text

Regression Given noisy data, make predictions! Which parameters for this?

Slide 23

Slide 23 text

Regression Given noisy data, make predictions! Which parameters for this? Use a non-parametric approach: fit all possible functions!

Slide 24

Slide 24 text

Gaussian Processes At the core of GP is the multivariate Gaussian distribution http://www.gaussianprocess.org/gpml/

Slide 25

Slide 25 text

Gaussian Processes At the core of GP is the multivariate Gaussian distribution ‣ Any n points, (x1, …, xn), in a data set are jointly a sample from an n-variate Gaussian distribution ‣ A GP defines a prior over functions, which is then used to compute the posterior using the training points http://www.gaussianprocess.org/gpml/
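To make "a prior over functions" concrete, here is a minimal sketch (our own, not from the talk) that draws a few functions from a GP prior with a squared-exponential kernel, using exactly the finite-dimensional Gaussian view described above:

```python
import numpy as np

# Finite-dimensional view of a GP prior: the function values at a grid of
# inputs are jointly Gaussian, with covariance given by a kernel.
x = np.linspace(-5, 5, 200)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)     # squared-exponential kernel
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))     # jitter for numerical stability

rng = np.random.default_rng(0)
prior_samples = L @ rng.standard_normal((len(x), 5))  # five functions drawn from the prior
```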

Slide 26

Slide 26 text

Gaussian Processes arXiv:1803.03296

Slide 27

Slide 27 text

Gaussian Processes Regression Let's assume that we have a training set and want to predict the output at test points. Observed (training set): $x \to y$. To estimate (test): $x_* \to y_*$

Slide 28

Slide 28 text

Gaussian Processes Regression Let's assume that we have a training set and want to predict the output at test points. Observed (training set): $x \to y$. To estimate (test): $x_* \to y_*$. The joint distribution of the training and test outputs is a multivariate Gaussian.

Slide 29

Slide 29 text

Gaussian Processes Regression Let's assume that we have a training set and want to predict the output at test points. Observed (training set): $x \to y$. To estimate (test): $x_* \to y_*$. The joint distribution of the training and test outputs is a multivariate Gaussian. So we have a prior distribution over the training set and we seek to obtain the posterior. It is given as $y_* \mid y \sim \mathcal{N}(\mu_*, \Sigma_*)$.
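For reference, the standard result being invoked here (assuming a zero-mean GP prior with kernel $K$ and noise-free training outputs; the slide's own equation image is not in the transcript) is

$$\begin{bmatrix} y \\ y_* \end{bmatrix} \sim \mathcal{N}\!\left(0,\ \begin{bmatrix} K(X,X) & K(X,X_*) \\ K(X_*,X) & K(X_*,X_*) \end{bmatrix}\right), \qquad \mu_* = K(X_*,X)\,K(X,X)^{-1} y, \quad \Sigma_* = K(X_*,X_*) - K(X_*,X)\,K(X,X)^{-1}K(X,X_*).$$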

Slide 30

Slide 30 text

Gaussian Processes Regression For the univariate case, we know that $x = \mu + \sqrt{\sigma^2}\, \mathcal{N}(0, 1)$.

Slide 31

Slide 31 text

Gaussian Processes Regression For the univariate case, we know that $x = \mu + \sqrt{\sigma^2}\, \mathcal{N}(0, 1)$. Similarly, for the multivariate case, we have $y_* = \mu_* + L\, \mathcal{N}(0, I)$, where $L L^{T} = \Sigma_*$. Here $L$ is the Cholesky factor of the covariance matrix.
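A small numerical check of that sampling identity (our own example, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Univariate: x = mu + sqrt(sigma^2) * N(0, 1)
mu, sigma2 = 2.0, 0.25
x = mu + np.sqrt(sigma2) * rng.standard_normal(10000)
print(x.mean(), x.var())        # close to 2.0 and 0.25

# Multivariate: y = mu + L N(0, I), with L L^T = Sigma (Cholesky factor)
mu_vec = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
L = np.linalg.cholesky(Sigma)
y = mu_vec + (L @ rng.standard_normal((2, 10000))).T
print(np.cov(y.T))              # close to Sigma
```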

Slide 32

Slide 32 text

Gaussian Processes Regression For the univariate case, we know that $x = \mu + \sqrt{\sigma^2}\, \mathcal{N}(0, 1)$. Similarly, for the multivariate case, we have $y_* = \mu_* + L\, \mathcal{N}(0, I)$, where $L L^{T} = \Sigma_*$. Here $L$ is the Cholesky factor of the covariance matrix. So the steps are summarized as: specify a prior to fix the functions to be considered for inference; compute the covariance matrix using a specified kernel K. Let's have a look at the code now.
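The code shown in the talk is not part of this transcript; as a stand-in, here is a minimal NumPy sketch of the steps just listed, assuming a squared-exponential kernel, noise-free observations, and toy inputs of our own choosing:

```python
import numpy as np

def kernel(a, b, ell=1.0, sigma_f=1.0):
    """Squared-exponential (RBF) covariance between 1-D input arrays a and b."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return sigma_f**2 * np.exp(-0.5 * d2 / ell**2)

# Training data (toy example) and test inputs.
x = np.array([-4.0, -3.0, -1.0, 0.0, 2.0])
y = np.sin(x)
x_star = np.linspace(-5, 5, 100)

jitter = 1e-8                        # small diagonal term for numerical stability
K    = kernel(x, x) + jitter * np.eye(len(x))
K_s  = kernel(x, x_star)             # K(X, X*)
K_ss = kernel(x_star, x_star)        # K(X*, X*)

# Posterior mean and covariance: mu* = K_s^T K^{-1} y, Sigma* = K_ss - K_s^T K^{-1} K_s
K_inv_y  = np.linalg.solve(K, y)
K_inv_Ks = np.linalg.solve(K, K_s)
mu_star    = K_s.T @ K_inv_y
Sigma_star = K_ss - K_s.T @ K_inv_Ks

# Draw posterior samples: y* = mu* + L N(0, I), with L the Cholesky factor of Sigma*.
L = np.linalg.cholesky(Sigma_star + 1e-6 * np.eye(len(x_star)))
samples = mu_star[:, None] + L @ np.random.default_rng(1).standard_normal((len(x_star), 3))
```

The posterior mean `mu_star` is the prediction and the diagonal of `Sigma_star` gives the pointwise uncertainty; the three sampled columns of `samples` are plausible functions consistent with the data.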

Slide 33

Slide 33 text

Unsupervised Learning

Slide 34

Slide 34 text

Dimension reduction http://people.cs.uchicago.edu/~dinoj/manifold/swissroll.html Often, observed high-dimensional data lies on a manifold of lower dimension

Slide 35

Slide 35 text

Dimension reduction http://people.cs.uchicago.edu/~dinoj/manifold/swissroll.html Often, observed high-dimensional data lies on a manifold of lower dimension

Slide 36

Slide 36 text

Dimension reduction http://people.cs.uchicago.edu/~dinoj/manifold/swissroll.html Often, observed high-dimensional data lies on a manifold of lower dimension

Slide 37

Slide 37 text

Dimension reduction http://people.cs.uchicago.edu/~dinoj/manifold/swissroll.html Often, observed high-dimensional data lies on a manifold of lower dimension

Slide 38

Slide 38 text

Principal component analysis PCA gives orthogonal directions of maximum variance: $\Sigma_{ij} = \frac{1}{N} \sum_n (x_{ni} - \mu_i)(x_{nj} - \mu_j)$ https://en.wikipedia.org/wiki/Principal_component_analysis

Slide 39

Slide 39 text

Principal component analysis PCA gives orthogonal directions of maximum variance: $\Sigma_{ij} = \frac{1}{N} \sum_n (x_{ni} - \mu_i)(x_{nj} - \mu_j)$, $\Sigma = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}$ https://en.wikipedia.org/wiki/Principal_component_analysis

Slide 40

Slide 40 text

Principal component analysis PCA gives orthogonal directions of maximum variance: $\Sigma_{ij} = \frac{1}{N} \sum_n (x_{ni} - \mu_i)(x_{nj} - \mu_j)$, $\Sigma = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix} \to \begin{pmatrix} \Sigma_{xx} & 0 \\ 0 & \Sigma_{yy} \end{pmatrix}$ https://en.wikipedia.org/wiki/Principal_component_analysis

Slide 41

Slide 41 text

Principal component analysis PCA gives orthogonal directions of maximum variance: $\Sigma_{ij} = \frac{1}{N} \sum_n (x_{ni} - \mu_i)(x_{nj} - \mu_j)$, $\Sigma = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix} \to \begin{pmatrix} \Sigma_{xx} & 0 \\ 0 & \Sigma_{yy} \end{pmatrix}$ Find the eigenvalues $\lambda_k$ and eigenvectors $v_k$ of $\Sigma$; $\lambda_k$ is the variance along the $k$-th direction. https://en.wikipedia.org/wiki/Principal_component_analysis

Slide 42

Slide 42 text

Principal component analysis PCA gives orthogonal directions of maximum variance: $\Sigma_{ij} = \frac{1}{N} \sum_n (x_{ni} - \mu_i)(x_{nj} - \mu_j)$, $\Sigma = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix} \to \begin{pmatrix} \Sigma_{xx} & 0 \\ 0 & \Sigma_{yy} \end{pmatrix}$ Find the eigenvalues $\lambda_k$ and eigenvectors $v_k$ of $\Sigma$; $\lambda_k$ is the variance along the $k$-th direction. Choose a subset of the eigenvectors, based on the problem at hand, to obtain a smaller set of orthogonal directions. https://en.wikipedia.org/wiki/Principal_component_analysis
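A minimal NumPy sketch of these PCA steps (toy data of our own, not the slide's example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D data that is strongly correlated, i.e. effectively close to 1-D.
t = rng.standard_normal(200)
X = np.c_[t, 0.9 * t + 0.1 * rng.standard_normal(200)]

# Covariance matrix Sigma_ij = (1/N) sum_n (x_ni - mu_i)(x_nj - mu_j)
mu = X.mean(axis=0)
Xc = X - mu
Sigma = Xc.T @ Xc / len(X)

# Eigenvalues (variance along each principal direction) and eigenvectors.
lam, V = np.linalg.eigh(Sigma)          # returned in ascending order
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]

# Keep the leading eigenvector(s) and project the data onto them.
k = 1
X_reduced = Xc @ V[:, :k]
print(lam, X_reduced.shape)             # most of the variance lies along the first direction
```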

Slide 43

Slide 43 text

Gaussian Process Latent Variable Model GPLVM, introduced by Lawrence (2004), generalizes PCA in a GP framework: the latent inputs X are obtained from the data Y once the mapping W is integrated out.

Slide 44

Slide 44 text

Gaussian Process Latent Variable Model GPLVM, introduced by Lawrence (2004), generalizes PCA in a GP framework: the latent inputs X are obtained from the data Y once the mapping W is integrated out. $Y = W X + \eta$, where $X$ is the latent variable.

Slide 45

Slide 45 text

Gaussian Process Latent Variable Model GPLVM, introduced by Lawrence (2004), generalizes PCA in a GP framework: the latent inputs X are obtained from the data Y once the mapping W is integrated out. $Y = W X + \eta$, where $X$ is the latent variable. Define the conditional (marginal) probability relating the observed data-space Y to the latent data-space X.

Slide 46

Slide 46 text

Gaussian Process Latent Variable Model GPLVM, introduced by Lawrence (2004), generalizes PCA in a GP framework: the latent inputs X are obtained from the data Y once the mapping W is integrated out. $Y = W X + \eta$, where $X$ is the latent variable. Define the conditional (marginal) probability relating the observed data-space Y to the latent data-space X. Extremize the corresponding log-likelihood to obtain X.
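A minimal GPLVM sketch under these equations (our own toy data, an RBF kernel over the latents, fixed hyperparameters, no sparse approximation): the latent X is found by maximizing the GP marginal likelihood of Y.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.optimize import minimize

def rbf(X, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel over latent points X (N x q)."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def neg_log_marginal(x_flat, Y, q, noise=1e-2):
    """Negative GP log marginal likelihood of Y, with latent inputs X."""
    N, D = Y.shape
    X = x_flat.reshape(N, q)
    K = rbf(X) + noise * np.eye(N)
    c, low = cho_factor(K)
    logdet = 2.0 * np.sum(np.log(np.diag(c)))
    alpha = cho_solve((c, low), Y)              # K^{-1} Y
    return 0.5 * D * logdet + 0.5 * np.sum(Y * alpha)

# Toy data: 3-D observations that actually live on a 1-D curve.
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 30)
Y = np.c_[np.sin(t), np.cos(t), t] + 0.05 * rng.standard_normal((30, 3))
Y = Y - Y.mean(axis=0)

q = 1                                            # latent dimension
X0 = 0.1 * rng.standard_normal((Y.shape[0], q))  # random initialisation (PCA init is also common)
res = minimize(neg_log_marginal, X0.ravel(), args=(Y, q), method="L-BFGS-B")
X_latent = res.x.reshape(-1, q)                  # recovered latent coordinates
```

In practice one would also optimize the kernel hyperparameters and use a dedicated library implementation rather than this bare-bones optimizer.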

Slide 47

Slide 47 text

Model selection Gaussian Processes in Machine Learning, C. E. Rasmussen, Advanced Lectures on Machine Learning (figure: region of interest)

Slide 48

Slide 48 text

Model selection Gaussian Processes in Machine Learning, C. E. Rasmussen, Advanced Lectures on Machine Learning (figure: region of interest) $P(Y \mid M_i) = \int P(Y \mid \theta, M_i)\, P(\theta \mid M_i)\, d\theta$

Slide 49

Slide 49 text

Model selection Gaussian Processes in Machine Learning, C. E. Rasmussen, Advanced Lectures on Machine Learning (figure: region of interest) $P(Y \mid M_i) = \int P(Y \mid \theta, M_i)\, P(\theta \mid M_i)\, d\theta$ The integrand is peaked at the most probable parameter value, and thus the integral becomes

Slide 50

Slide 50 text

Model selection Gaussian Processes in Machine Learning, C. E. Rasmussen, Advanced Lectures on Machine Learning (figure: region of interest) $P(Y \mid M_i) = \int P(Y \mid \theta, M_i)\, P(\theta \mid M_i)\, d\theta$ The integrand is peaked at the most probable parameter value, and thus the integral becomes $P(Y \mid M_i) \simeq P(Y \mid \theta^*, M_i) \times P(\theta^* \mid M_i)\, \Delta\theta$, i.e. Evidence = best-fit likelihood × Ockham factor.

Slide 51

Slide 51 text

Model selection Gaussian Processes in Machine Learning, C. E. Rasmussen, Advanced Lectures on Machine Learning ‣ Ockham's razor: the simpler model is usually the true one! (figure: region of interest) $P(Y \mid M_i) = \int P(Y \mid \theta, M_i)\, P(\theta \mid M_i)\, d\theta$ The integrand is peaked at the most probable parameter value, and thus the integral becomes $P(Y \mid M_i) \simeq P(Y \mid \theta^*, M_i) \times P(\theta^* \mid M_i)\, \Delta\theta$, i.e. Evidence = best-fit likelihood × Ockham factor.
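As an illustrative sketch of evidence-based model selection in the GP setting (our own toy example, with fixed hyperparameters): compute the log marginal likelihood of the data under two candidate kernels and prefer the one with the higher value.

```python
import numpy as np

def log_marginal_likelihood(K, y):
    """log p(y) for y ~ N(0, K): -1/2 y^T K^{-1} y - 1/2 log|K| - n/2 log(2*pi)."""
    n = len(y)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))     # K^{-1} y via Cholesky
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2 * np.pi)

def rbf(x, ell=1.0):
    return np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / ell**2)

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 40)
y = np.sin(x) + 0.1 * rng.standard_normal(40)

noise = 0.1**2 * np.eye(len(x))
# Candidate M1: smooth RBF covariance; candidate M2: essentially white-noise covariance.
K1 = rbf(x, ell=1.0) + noise
K2 = np.eye(len(x)) + noise

print(log_marginal_likelihood(K1, y), log_marginal_likelihood(K2, y))
# The smooth model has the higher evidence for this data and would be selected.
```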

Slide 52

Slide 52 text

Summary GP provides a systematic and tractable framework for various machine learning problems. Complete probabilistic predictions in GPs: confidence intervals, etc. GP is extendable to a hierarchical formulation. In GP the covariance matrices are not sparse: $O(N^3)$ complexity for matrix inversion, so approximate methods are essential for large data-sets.

Slide 53

Slide 53 text

Summary GP provides a systematic and tractable framework for various machine learning problems. Complete probabilistic predictions in GPs: confidence intervals, etc. GP is extendable to a hierarchical formulation. In GP the covariance matrices are not sparse: $O(N^3)$ complexity for matrix inversion, so approximate methods are essential for large data-sets.

Slide 54

Slide 54 text

Summary GP provides a systematic and tractable framework for various machine learning problems. Complete probabilistic predictions in GPs: confidence intervals, etc. GP is extendable to a hierarchical formulation. In GP the covariance matrices are not sparse: $O(N^3)$ complexity for matrix inversion, so approximate methods are essential for large data-sets. C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, MIT Press, Cambridge, 2006. K. P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, Cambridge, 2012. N. D. Lawrence, Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data, 2004. D. J. C. MacKay, Information Theory, Inference and Learning Algorithms, Cambridge University Press, 2003. http://www.gaussianprocess.org/

Slide 55

Slide 55 text

Summary GP provides a systematic and tractable framework for various machine learning problems. Complete probabilistic predictions in GPs: confidence intervals, etc. GP is extendable to a hierarchical formulation. In GP the covariance matrices are not sparse: $O(N^3)$ complexity for matrix inversion, so approximate methods are essential for large data-sets. Thank You! C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, MIT Press, Cambridge, 2006. K. P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, Cambridge, 2012. N. D. Lawrence, Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data, 2004. D. J. C. MacKay, Information Theory, Inference and Learning Algorithms, Cambridge University Press, 2003. http://www.gaussianprocess.org/

Slide 56

Slide 56 text

Support Vector Machines SVM is a classifier which divides the data by a hyperplane, using labeled training datasets (supervised learning).

Slide 57

Slide 57 text

Support Vector Machines SVM is a classifier which divides the data by a hyperplane, using labeled training datasets (supervised learning).

Slide 58

Slide 58 text

Support Vector Machines SVM is a classifier which divides the data by a hyperplane, using labeled training datasets (supervised learning). What if the data is not so neat?

Slide 59

Slide 59 text

Support Vector Machines SVM is a classifier which divides the data by a hyperplane, using labeled training datasets (supervised learning). What if the data is not so neat? Transform the data to higher dimensions!
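A minimal scikit-learn sketch of that idea (our own toy data): a linear SVM cannot separate circularly arranged classes, while an RBF-kernel SVM, which implicitly maps the data into a higher-dimensional space, can.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: class 1 inside a circle, class 0 outside; not linearly separable.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
y = (np.sum(X**2, axis=1) < 0.4).astype(int)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm    = SVC(kernel="rbf").fit(X, y)      # kernel trick: implicit high-dimensional mapping

print("linear:", linear_svm.score(X, y))      # poor: no linear boundary separates the classes
print("rbf:   ", rbf_svm.score(X, y))         # high: the circular boundary is captured
```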

Slide 60

Slide 60 text

Support Vector Machines https://en.wikipedia.org/wiki/Kernel_method#/media/File:Kernel_trick_idea.svg