Integration of Biological Knowledge in the Mixture-of-Gaussians Analysis of Genomic Clustering
S. Sfakianakis (1,2), M. Zervakis (2), M. Tsiknakis (1), D. Kafetzopoulos (3)
(1) Institute of Computer Science, Foundation for Research and Technology - Hellas
(2) Department of Electronic and Computer Engineering, Technical University of Crete
(3) Institute of Molecular Biology, Foundation for Research and Technology - Hellas
November 3, 2010
Sfakianakis et al. (ECE) Integration of Biological Knowledge.. November 3, 2010 1 / 15
Earlier work in bioinformatics focused on the identification of a small number of informative genes to discriminate between phenotypes or experimental conditions.
Instead, we now see a shift to a more integrated analysis of gene expression data, leading toward systems biology.
In this work we try to use existing biological knowledge to guide the analysis of gene expression data.
Finite Mixture Models
Mixture models [McLachlan, 2000] provide a probabilistic framework both for building complex probability distributions as linear combinations of simpler ones (e.g. for density estimation) and for clustering data (unsupervised learning).
They are "generative" models: a sample $\mathbf{x}_j$ is assumed to have been generated by one of the $g$ clusters or groups:
\[ f(\mathbf{x}_j; \boldsymbol{\Theta}) = \sum_{i=1}^{g} \pi_i \, f_i(\mathbf{x}_j; \boldsymbol{\theta}_i) \tag{1} \]
where $\boldsymbol{\Theta}$ is the collection of the unknown parameters: the $\pi_i$, usually referred to as "mixing coefficients", and the $\boldsymbol{\theta}_i$, the parameters of the component densities $f_i$. The mixing coefficients satisfy $0 \le \pi_i \le 1$ and $\sum_{i=1}^{g} \pi_i = 1$.
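As a small illustration of equation (1), the sketch below evaluates a mixture density as a weighted sum of component densities. It assumes univariate normal components for simplicity; the helper names (`normal_pdf`, `mixture_pdf`) and all parameter values are hypothetical and chosen only for the example.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, weights, components):
    """Evaluate f(x) = sum_i pi_i * f_i(x) for a finite mixture."""
    if abs(sum(weights) - 1.0) > 1e-9 or any(w < 0 for w in weights):
        raise ValueError("mixing coefficients must be non-negative and sum to 1")
    return sum(w * f(x) for w, f in zip(weights, components))

# Hypothetical two-component mixture (values are illustrative only).
weights = [0.3, 0.7]
components = [
    lambda x: normal_pdf(x, mean=-2.0, std=1.0),
    lambda x: normal_pdf(x, mean=3.0, std=0.5),
]
density_at_zero = mixture_pdf(0.0, weights, components)
```

Because the mixing coefficients sum to one and each component integrates to one, the mixture is itself a valid density.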
Gaussian Mixture Models
In Gaussian Mixture Models (GMM) each of the component probability distributions is Gaussian, but with different parameters:
\[ f_i(\mathbf{x}_j; \boldsymbol{\theta}_i) = N(\mathbf{x}_j; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) \equiv \frac{1}{\sqrt{(2\pi)^p \, |\boldsymbol{\Sigma}_i|}} \exp\!\left( -\tfrac{1}{2} (\mathbf{x}_j - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x}_j - \boldsymbol{\mu}_i) \right) \tag{2} \]
\[ f(\mathbf{x}_j; \boldsymbol{\Theta}) = \sum_{i=1}^{g} \pi_i \, N(\mathbf{x}_j; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) \tag{3} \]
where $p$ is the dimension of $\mathbf{x}_j$.
There is an efficient iterative algorithm, Expectation Maximization (EM), for computing the parameters of a GMM [Dempster, 1977].
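EM for GMM
Initialize the model parameters $\boldsymbol{\mu}_i$, $\boldsymbol{\Sigma}_i$, $\pi_i$.
E-step. Compute the support (or "responsibility") each sample provides to a given component density as the conditional probability
\[ \tau_{ji} \equiv \Pr(z_{ji} = 1 \mid \mathbf{x}_j; \boldsymbol{\Theta}^{cur}) = \frac{\pi_i^{cur} \, N(\mathbf{x}_j; \boldsymbol{\mu}_i^{cur}, \boldsymbol{\Sigma}_i^{cur})}{\sum_{c=1}^{g} \pi_c^{cur} \, N(\mathbf{x}_j; \boldsymbol{\mu}_c^{cur}, \boldsymbol{\Sigma}_c^{cur})} \tag{4} \]
where $z_{ji} = 1$ indicates that sample $j$ was generated by component $i$.
M-step. Re-estimate the parameters:
\[ \boldsymbol{\mu}_i^{new} = \frac{\sum_{j=1}^{N} \tau_{ji} \, \mathbf{x}_j}{\sum_{j=1}^{N} \tau_{ji}}, \qquad \boldsymbol{\Sigma}_i^{new} = \frac{\sum_{j=1}^{N} \tau_{ji} (\mathbf{x}_j - \boldsymbol{\mu}_i^{new})(\mathbf{x}_j - \boldsymbol{\mu}_i^{new})^T}{\sum_{j=1}^{N} \tau_{ji}}, \qquad \pi_i^{new} = \frac{1}{N} \sum_{j=1}^{N} \tau_{ji} \tag{5} \]
The E- and M-steps are repeated until convergence.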
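The E- and M-steps above can be sketched in NumPy as follows. This is a minimal sketch, not the authors' implementation: the function names, the random initialization from data points, the fixed iteration count, and the small `ridge` term added to each covariance for numerical stability are all assumptions made for the example.

```python
import numpy as np

def gaussian_logpdf(X, mean, cov):
    """Log of N(x; mean, cov) for every row x of X (shape N x p)."""
    p = X.shape[1]
    diff = X - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
    return -0.5 * (p * np.log(2.0 * np.pi) + logdet + maha)

def e_step(X, weights, means, covs):
    """Responsibilities tau[j, i] of equation (4); each row sums to 1."""
    log_w = np.stack(
        [np.log(weights[i]) + gaussian_logpdf(X, means[i], covs[i])
         for i in range(len(weights))], axis=1)
    log_w -= log_w.max(axis=1, keepdims=True)  # stabilise before exponentiating
    tau = np.exp(log_w)
    return tau / tau.sum(axis=1, keepdims=True)

def m_step(X, tau, ridge=1e-6):
    """Weighted updates of equation (5); `ridge` is an added assumption."""
    n, p = X.shape
    nk = tau.sum(axis=0)                      # effective cluster sizes
    weights = nk / n
    means = (tau.T @ X) / nk[:, None]
    covs = []
    for i in range(tau.shape[1]):
        diff = X - means[i]
        cov = (tau[:, i, None] * diff).T @ diff / nk[i]
        covs.append(cov + ridge * np.eye(p))
    return weights, means, np.array(covs)

def fit_gmm(X, g, n_iter=50, seed=0):
    """Run a fixed number of EM iterations from a random initialization."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    means = X[rng.choice(n, size=g, replace=False)]
    pooled = np.atleast_2d(np.cov(X, rowvar=False)) + 1e-6 * np.eye(p)
    covs = np.array([pooled] * g)
    weights = np.full(g, 1.0 / g)
    for _ in range(n_iter):
        tau = e_step(X, weights, means, covs)
        weights, means, covs = m_step(X, tau)
    return weights, means, covs, e_step(X, weights, means, covs)
```

Working with log-densities in the E-step avoids underflow when the dimension $p$ is large, which is the usual situation for gene expression data.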
Biological Knowledge: the Stratified model
The genes can be classified into $K$ "functional groups", e.g. based on the Gene Ontology or on pathways.
We assume that genes categorized into the same functional group are dependent, whereas genes in different groups are independent.
For Gaussian distributions independence is equivalent to uncorrelatedness, and therefore we introduce the following "stratified" model for the covariance matrix:
\[ \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}^{(1)} & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \boldsymbol{\Sigma}^{(2)} & \cdots & \mathbf{0} & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \boldsymbol{\Sigma}^{(K)} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{D}^{(r)} \end{pmatrix} \tag{6} \]
where each $\boldsymbol{\Sigma}^{(k)}$ is the (unconstrained) covariance (sub)matrix for the genes belonging to the $k$-th group, and $\mathbf{D}^{(r)}$ is the diagonal covariance matrix of the $r$ genes that do not belong to any group.
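To make the block structure of equation (6) concrete, the following sketch masks a full covariance estimate so that only within-group entries and the variances of ungrouped genes survive. The function name and the toy grouping are hypothetical, and the groups are assumed to be disjoint; the result matches equation (6) up to a reordering of the genes.

```python
import numpy as np

def stratified_covariance(S, groups):
    """Impose the stratified structure on a full covariance estimate S.

    S      : (p x p) covariance matrix
    groups : list of index lists, one per functional group (assumed disjoint)
    Genes that appear in no group keep only their variance (diagonal part D).
    """
    out = np.diag(np.diag(S)).astype(float)          # start from the diagonal
    for idx in groups:
        idx = np.asarray(idx)
        out[np.ix_(idx, idx)] = S[np.ix_(idx, idx)]  # unconstrained block Sigma^(k)
    return out

# Toy example: 5 genes, two functional groups, gene 4 is ungrouped.
rng = np.random.default_rng(0)
data = rng.normal(size=(30, 5))
S = np.cov(data, rowvar=False)
Sigma = stratified_covariance(S, groups=[[0, 1], [2, 3]])
```

With $K$ groups of sizes $p_1, \dots, p_K$ and $r$ ungrouped genes, the number of free covariance parameters drops from $p(p+1)/2$ to $\sum_k p_k(p_k+1)/2 + r$.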
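The sparse structure of the covariance matrix is imposed on every component of the mixture model, so that the component densities are written as
\[ f_i(\mathbf{x}_j; \boldsymbol{\theta}_i) = N(\mathbf{x}_j; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) \tag{7} \]
and, taking into account the block diagonal structure, we get a factorization of the form
\[ f_i(\mathbf{x}_j; \boldsymbol{\theta}_i) = N(\mathbf{x}_j^{(r)}; \boldsymbol{\mu}_i^{(r)}, \mathbf{D}_i^{(r)}) \prod_{k=1}^{K} N(\mathbf{x}_j^{(k)}; \boldsymbol{\mu}_i^{(k)}, \boldsymbol{\Sigma}_i^{(k)}) \tag{8} \]
The mixture density becomes
\[ f(\mathbf{x}_j; \boldsymbol{\Theta}) = \sum_{i=1}^{g} \pi_i \, N(\mathbf{x}_j^{(r)}; \boldsymbol{\mu}_i^{(r)}, \mathbf{D}_i^{(r)}) \prod_{k=1}^{K} N(\mathbf{x}_j^{(k)}; \boldsymbol{\mu}_i^{(k)}, \boldsymbol{\Sigma}_i^{(k)}) = \sum_{i=1}^{g} \pi_i \prod_{k=1}^{K+1} N(\mathbf{x}_j^{(k)}; \boldsymbol{\mu}_i^{(k)}, \boldsymbol{\Sigma}_i^{(k)}) \tag{9} \]
where, in the last expression, the $(K+1)$-th factor denotes the diagonal block of the ungrouped genes, i.e. $\boldsymbol{\Sigma}_i^{(K+1)} = \mathbf{D}_i^{(r)}$.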
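The factorization in equation (8) can be checked numerically: for a block diagonal covariance, the log-density of the full Gaussian equals the sum of the per-block log-densities. The sketch below is illustrative only; the helper names, the grouping, and all numeric values are assumptions.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log of N(x; mean, cov) for a single vector x."""
    p = x.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (p * np.log(2.0 * np.pi) + logdet + maha)

def stratified_logpdf(x, mean, blocks, groups, ungrouped, variances):
    """Sum of per-group log-densities plus the diagonal (ungrouped) part."""
    total = 0.0
    for idx, cov in zip(groups, blocks):
        total += gaussian_logpdf(x[idx], mean[idx], cov)
    d = x[ungrouped] - mean[ungrouped]
    total += -0.5 * np.sum(np.log(2.0 * np.pi * variances) + d ** 2 / variances)
    return total

# Hypothetical layout: genes 0-1 and 2-3 form two groups, gene 4 is ungrouped.
groups = [[0, 1], [2, 3]]
ungrouped = [4]
blocks = [np.array([[2.0, 0.5], [0.5, 1.0]]),
          np.array([[1.0, -0.3], [-0.3, 1.5]])]
variances = np.array([0.7])

# Assemble the full block diagonal covariance of equation (6).
full_cov = np.zeros((5, 5))
for idx, cov in zip(groups, blocks):
    full_cov[np.ix_(idx, idx)] = cov
full_cov[ungrouped, ungrouped] = variances

x = np.array([0.1, -0.2, 0.3, 0.4, -0.5])
mean = np.zeros(5)
full_value = gaussian_logpdf(x, mean, full_cov)
factored_value = stratified_logpdf(x, mean, blocks, groups, ungrouped, variances)
```

The factored form only ever inverts the small per-group blocks, which is the computational benefit of the stratified model.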
EM for the stratified model
In the E-step the "responsibilities" are updated based on the current model parameters as
\[ \tau_{ji} = \frac{\pi_i^{cur} \prod_{k=1}^{K+1} N(\mathbf{x}_j^{(k)}; \boldsymbol{\mu}_i^{(k),cur}, \boldsymbol{\Sigma}_i^{(k),cur})}{\sum_{s=1}^{g} \pi_s^{cur} \prod_{k=1}^{K+1} N(\mathbf{x}_j^{(k)}; \boldsymbol{\mu}_s^{(k),cur}, \boldsymbol{\Sigma}_s^{(k),cur})} \tag{10} \]
In the M-step the new model parameters can be computed separately per functional group as
\[ \boldsymbol{\mu}_i^{(k)} = \frac{\sum_{j=1}^{N} \tau_{ji} \, \mathbf{x}_j^{(k)}}{\sum_{j=1}^{N} \tau_{ji}} \tag{11} \]
\[ \boldsymbol{\Sigma}_i^{(k)} = \frac{\sum_{j=1}^{N} \tau_{ji} (\mathbf{x}_j^{(k)} - \boldsymbol{\mu}_i^{(k)})(\mathbf{x}_j^{(k)} - \boldsymbol{\mu}_i^{(k)})^T}{\sum_{j=1}^{N} \tau_{ji}} \tag{12} \]
\[ \pi_i = \frac{\sum_{j=1}^{N} \tau_{ji}}{N} \tag{13} \]
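A possible NumPy sketch of the per-group M-step of equations (11) to (13) is given below. It is not the authors' code: the function name, the return layout, and the handling of the ungrouped genes as per-gene weighted variances (the diagonal block) are assumptions made for illustration.

```python
import numpy as np

def stratified_m_step(X, tau, groups, ungrouped):
    """M-step for the stratified model.

    X         : (N x p) data matrix, one sample per row
    tau       : (N x g) responsibilities from the E-step
    groups    : list of gene-index lists, one per functional group
    ungrouped : list of gene indices that belong to no group
    """
    n = X.shape[0]
    nk = tau.sum(axis=0)                  # effective cluster sizes
    weights = nk / n                      # equation (13)
    means = (tau.T @ X) / nk[:, None]     # equation (11), all blocks at once
    block_covs, diag_vars = [], []
    for i in range(tau.shape[1]):
        diff = X - means[i]
        w = tau[:, i, None]
        # Equation (12): one unconstrained covariance per functional group.
        block_covs.append(
            [(w * diff[:, idx]).T @ diff[:, idx] / nk[i] for idx in groups])
        # Diagonal block: weighted variance of each ungrouped gene.
        diag_vars.append((w * diff[:, ungrouped] ** 2).sum(axis=0) / nk[i])
    return weights, means, block_covs, diag_vars
```

With hard (0/1) responsibilities these updates reduce to the ordinary per-cluster sample means and maximum-likelihood covariances of each block, which gives a convenient sanity check.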
To evaluate our method, two data sets are used:
A Breast Cancer data set [Huang, 2003] with 52 samples, 18 of which exhibit tumor recurrence and 34 of which do not.
A Prostate Cancer data set [Singh, 2002] with 52 tumor samples and 50 normal samples.
Both are based on the Affymetrix hgu95av2 platform, containing 12625 probesets, and are preprocessed using the GCRMA normalization and summarization methods.
Comparison of clustering results
Algorithms: k-means, PAM, and our stratified EM.
"Hard" clustering can be done by assigning each sample to the cluster it supports most, i.e. the component $i$ with the largest responsibility $\tau_{ji}$.
The "true" underlying clusters are unknown, so we use the class labels to validate and evaluate the clustering results.
Because both data sets have a binary classification, we request the identification of $g = 2$ clusters.
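The hard assignment rule amounts to an argmax over each row of the responsibility matrix. A minimal sketch, with a hypothetical function name and made-up responsibilities:

```python
def hard_assign(tau):
    """Map each sample j to the component i with the largest tau[j][i].

    Ties are broken in favour of the lowest component index.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in tau]

# Illustrative responsibilities for three samples and g = 2 components.
tau = [[0.9, 0.1],
       [0.2, 0.8],
       [0.5, 0.5]]
labels = hard_assign(tau)
```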
Biological Homogeneity Index [Datta, 2006]
\[ \mathrm{BHI} = \frac{1}{g} \sum_{i=1}^{g} \frac{1}{N_i (N_i - 1)} \sum_{\substack{x \neq y \\ x, y \in D_i}} I\big(C(x) = C(y)\big) \tag{14} \]
where $D_i$ is the $i$-th cluster, $N_i$ its number of samples, $C(x)$ the class label of sample $x$, and $I(\cdot)$ the indicator function.
It checks the homogeneity of the clusters based on the class labels.
Ideally BHI = 1, e.g. if all the tumor samples are assigned to one cluster and all the normal ones to the other.

Table: BHI results
Algorithm   Breast Cancer   Prostate Cancer
k-means     0.55            0.52
PAM         0.56            0.51
our EM      0.56            0.49
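Equation (14) can be computed directly from the cluster assignments and the class labels. The sketch below is one straightforward reading of the formula; the function name is hypothetical, and the convention that a singleton cluster contributes zero is an assumption, since the formula is undefined for $N_i = 1$.

```python
from collections import defaultdict

def bhi(cluster_ids, class_labels):
    """Average, over clusters, of the fraction of ordered sample pairs
    within a cluster that share the same class label."""
    clusters = defaultdict(list)
    for cluster, label in zip(cluster_ids, class_labels):
        clusters[cluster].append(label)
    total = 0.0
    for labels in clusters.values():
        n = len(labels)
        if n < 2:
            continue  # assumed convention: a singleton cluster contributes 0
        same = sum(1 for a in range(n) for b in range(n)
                   if a != b and labels[a] == labels[b])
        total += same / (n * (n - 1))
    return total / len(clusters)

# Toy example: a perfectly homogeneous clustering and a fully mixed one.
perfect = bhi([0, 0, 1, 1], ["tumor", "tumor", "normal", "normal"])
mixed = bhi([0, 0, 1, 1], ["tumor", "normal", "tumor", "normal"])
```

For two balanced classes, a clustering unrelated to the labels scores close to 0.5, which is a useful baseline when reading BHI values.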
Integrating biological knowledge is a "hot" area.
Such knowledge can be used to overcome computational deficiencies and also to improve the results and the validity of the methods.
The stratified model can be seen as a middle ground between the full sample covariance matrix, which can lead to an ill-posed inverse problem (it is singular when the number of genes exceeds the number of samples), and a lower dimensional diagonal covariance matrix.
Future work
The outcomes of the experiments are not very informative about the validity of the described approach, and further testing will be conducted in the future.
The stratified model that we defined assumes the independence of the uncategorized genes, an assumption that is definitely far from the truth.
Certain genes can have more than one functional annotation or participate in more than one category or pathway.
We also plan to improve the performance and robustness of the method and to study its convergence properties.