Slide 1

Slide 1 text

Outline Motivation Correlation mining principles Network analysis SPARC predictor design Conclusions
Correlation mining in high dimension with limited samples
Alfred Hero, University of Michigan - Ann Arbor
Jan. 30, 2015

Slide 2

Slide 2 text

1 Motivation
2 Correlation mining principles
3 Application: network analysis
4 Application: SPARC∗ predictor design
5 Conclusions

Slide 3

Slide 3 text

Acknowledgements
• Bala Rajaratnam, Stanford Statistics
• Hamed Firouzi, UM EECS (doctoral student)
• Rob Brown, UCLA Bioinformatics (doctoral student)
• Yongsheng Huang, Merck Labs (former UM-PIBS student)
• Geoffrey Ginsburg, Amy Zaas, Chris Woods: Duke Medicine
Sponsors
• AFOSR Complex Networks Program
• NSF Theoretical Foundations Program
• ARO Social Informatics Program
• ARO MURI Value of Information Program
• NIH P01 program NIBIB - Meyer PI
• DARPA Predicting Health and Disease Program - Ginsburg PI

Slide 4

Slide 4 text

Outline
1 Motivation
2 Correlation mining principles
3 Application: network analysis
4 Application: SPARC∗ predictor design
5 Conclusions
∗SPARC = Screening, Prediction, and Regression via Correlation

Slide 5

Slide 5 text

Why mine for high sample correlation vs sample mean?
Mining for treatment effects: p = 12023 biomarkers, n = 130 samples/treatment
Figure panels: HMBOX1 vs NRLP2; JARID1D vs SNX19
Legend: Blue: treatment 1 (Sx). Green: treatment 2 (Asx). Solid: women. Hollow: men. Size: hours elapsed since inoculation.

Slide 6

Slide 6 text

Correlation mining pipeline

Slide 7

Slide 7 text

Network discovery from correlation
O/I correlation
• $p = 1.5 \times 10^9$ vertices
• $6 \times 10^9 \leq 10^{-8} p^2$ edges
• n = 365 samples
gene correlation
• p = 23,000 vertices
• $1.5 \times 10^5 \leq 10^{-3} p^2$ edges
• n = 270 samples
mutual correlation
• p = 7000 vertices
• $7 \times 10^5 \leq 10^{-2} p^2$ edges
• n = 6 samples

Slide 8

Slide 8 text

Network discovery from correlation
Figure panels: O/I correlation; gene correlation; mutual correlation
• "Big data" aspects
• Large number of unknowns (hubs, edges, subgraphs)
• Small number of samples for inference on unknowns
• Crucial need to manage uncertainty (false positives)

Slide 9

Slide 9 text

Sample correlation: p = 2 variables, n = 50 samples
Sample correlation:
$\mathrm{corr}_{X,Y} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \in [-1, 1]$
Figure panels: positive correlation (= 1); negative correlation (= −1)
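The sample correlation formula on this slide can be sketched directly in NumPy. The function name, the random seed, and the use of independent Gaussian data are illustrative assumptions, not part of the talk.

```python
import numpy as np

def sample_correlation(x, y):
    """Pearson sample correlation of two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()          # X_i - Xbar
    dy = y - y.mean()          # Y_i - Ybar
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2)))

rng = np.random.default_rng(0)   # arbitrary seed (assumption)
x = rng.standard_normal(50)      # n = 50 samples, as on the slide
y = rng.standard_normal(50)
r = sample_correlation(x, y)
print(f"sample correlation: {r:+.4f}")
```

The result should agree with `np.corrcoef(x, y)[0, 1]`, which computes the same quantity.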

Slide 10

Slide 10 text

Sample correlation for two sequences: p = 2, n = 50
Q: Are the two time sequences $X_i$ and $Y_j$ correlated, e.g. $|\mathrm{corr}_{XY}| > 0.5$?

Slide 11

Slide 11 text

Sample correlation for two sequences: p = 2, n = 50
Q: Are the two time sequences $X_i$ and $Y_j$ correlated?
A: No. Computed over range i = 1, ..., 50: $\mathrm{corr}_{XY} = -0.0809$

Slide 12

Slide 12 text

Sample correlation for two sequences: p = 2, n < 15
Q: Are the two time sequences $X_i$ and $Y_j$ correlated?
A: Yes. $\mathrm{corr}_{XY} > 0.5$ over range i = 3, ..., 12 and $\mathrm{corr}_{XY} < -0.5$ over range i = 29, ..., 42.
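The contrast between the full-range and short-range answers can be explored with a sliding window. This is a sketch under assumptions: the sequences are independent Gaussian draws, and the window width of 10 and the seed are arbitrary choices, not the data shown in the talk.

```python
import numpy as np

def windowed_correlations(x, y, width):
    """Sample correlation of x and y over every contiguous window of `width` samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.array([
        np.corrcoef(x[s:s + width], y[s:s + width])[0, 1]
        for s in range(len(x) - width + 1)
    ])

rng = np.random.default_rng(1)          # arbitrary seed (assumption)
x = rng.standard_normal(50)
y = rng.standard_normal(50)             # independent of x by construction
full = np.corrcoef(x, y)[0, 1]          # correlation over i = 1, ..., 50
local = windowed_correlations(x, y, width=10)
print(f"full-range corr: {full:+.3f}")
print(f"largest |corr| over length-10 windows: {np.abs(local).max():.3f}")
```

Comparing the two printed numbers shows how much a short window can inflate the apparent correlation of unrelated sequences.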

Slide 13

Slide 13 text

Correlating a set of p = 20 sequences
Q: Are any pairs of sequences correlated? Are there patterns of correlation?

Slide 14

Slide 14 text

Sample correlation R w/ correlation thresholding (0.5)
Figure panels: correlation matrix; thresholded matrix
Apparent patterns emerge after thresholding each pairwise correlation at ±0.5 (12 cross-correlations).

Slide 15

Slide 15 text

Associated sample correlation graph
The graph has an edge between nodes (variables) i and j if the ij-th entry of the thresholded correlation matrix is non-zero.
The sequences are actually uncorrelated Gaussian.
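The construction of the sample correlation graph can be sketched as follows. The number of variables p = 20 and the threshold 0.5 come from the slides; the sample size n = 15, the seed, and the function name are assumptions made for illustration.

```python
import numpy as np

def correlation_graph_edges(X, rho):
    """Edges (i, j), i < j, where the |sample correlation| of columns i and j is >= rho."""
    R = np.corrcoef(X, rowvar=False)            # p x p sample correlation matrix
    i, j = np.triu_indices(R.shape[0], k=1)     # each unordered pair once
    keep = np.abs(R[i, j]) >= rho
    return list(zip(i[keep].tolist(), j[keep].tolist()))

rng = np.random.default_rng(2)      # arbitrary seed (assumption)
n, p = 15, 20                       # n is an assumption; p = 20 as on the slides
X = rng.standard_normal((n, p))     # independent columns: no true correlation
edges = correlation_graph_edges(X, rho=0.5)
print(f"{len(edges)} of {p * (p - 1) // 2} pairs exceed |corr| >= 0.5")
```

Every edge reported for this input is a false discovery, since the columns are independent by construction.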

Slide 16

Slide 16 text

Misreporting of correlations is a real problem
Source: Young and Karr, Significance, Sept. 2011

Slide 17

Slide 17 text

The problem of false discoveries: phase transitions
• The number of discoveries exhibits a phase transition phenomenon
• This phenomenon gets worse as p/n increases.
• Example: false discoveries of high correlation for uncorrelated Gaussian variables
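A small simulation, assuming independent standard Gaussian variables, can show how the number of false discoveries depends on the threshold and on n. The values of p, n, the threshold grid, and the seed are illustrative choices rather than the settings used for the figure in the talk.

```python
import numpy as np

def count_discoveries(X, thresholds):
    """Number of variable pairs whose |sample correlation| is >= each threshold."""
    R = np.corrcoef(X, rowvar=False)
    abs_r = np.abs(R[np.triu_indices(R.shape[0], k=1)])
    return np.array([int(np.sum(abs_r >= rho)) for rho in thresholds])

rng = np.random.default_rng(3)           # arbitrary seed (assumption)
p = 200                                  # illustrative number of variables
thresholds = np.linspace(0.0, 1.0, 11)
for n in (10, 50):                       # two illustrative sample sizes
    X = rng.standard_normal((n, p))      # uncorrelated Gaussian variables
    counts = count_discoveries(X, thresholds)
    print(f"n={n:3d}:", counts.tolist())
```

All counted pairs are false discoveries here; printing the counts for both sample sizes makes it easy to see where the count collapses as the threshold grows.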

Slide 20

Slide 20 text

Principled design of correlation mining algorithms
Design objective: estimate or detect patterns of correlation in high dimensional sample-poor environments with low error rates
Fundamental design question: What are the fundamental properties of a network of p interacting variables that can be accurately estimated from a small number n of measurements?
Regimes
• n/p → ∞: sample rich regime (CLT, LLNs)
• n/p → c: sample critical regime (Semi-circle, Marchenko-Pastur)
• n/p → 0: sample starved regime (Chen-Stein)
It is important to design the procedure for the regime one is in.

Slide 21

Slide 21 text

Fundamental sampling regimes
• Classical asymptotics: n → ∞, p fixed ("small data")
• Mixed asymptotics: n → ∞, p → ∞ ("medium sized data")
• Purely high dimensional: n fixed, p → ∞ ("big data")

Slide 23

Slide 23 text

Why is correlation important?
• Network modeling: learning/simulating descriptive models
• Empirical prediction: forecast a response variable Y
• Classification: estimate type of correlation from samples
• Anomaly detection: localize unusual activity in a sample
Each application requires an estimate of the covariance matrix $\Sigma_X$ or its inverse.
Prediction: linear minimum MSE predictor of q variables Y from X:
$\hat{Y} = \Sigma_{YX} \Sigma_X^{-1} X$
Covariance matrix related to inter-dependency structure.
Classification: QDA test of $H_0: \Sigma_X = \Sigma_0$ vs $H_1: \Sigma_X = \Sigma_1$:
$X^T (\Sigma_0^{-1} - \Sigma_1^{-1}) X \underset{H_0}{\overset{H_1}{\gtrless}} \eta$
Anomaly detection: Mahalanobis test of $H_0: \Sigma_X = \Sigma_0$ vs $H_1: \Sigma_X \neq \Sigma_0$:
$\frac{X^T \Sigma_0^{-1} X}{X^T X} \underset{H_0}{\overset{H_1}{\gtrless}} \eta$
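The predictor and the anomaly statistic can be sketched with plug-in estimates. The slide states them in terms of population covariances; substituting sample covariances and a pseudo-inverse is an assumption of this sketch, as are the function names, sizes, and the noiseless toy model. The sizes are deliberately sample-rich so that the sample covariance is invertible.

```python
import numpy as np

def linear_mmse_predictor(X, Y):
    """Return W such that Y_hat = W @ x, with W = S_yx @ pinv(S_x) from centered samples.

    X: (n, p) predictor samples; Y: (n, q) response samples.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    S_x = Xc.T @ Xc / (n - 1)       # sample covariance of X (plug-in for Sigma_X)
    S_yx = Yc.T @ Xc / (n - 1)      # sample cross-covariance (plug-in for Sigma_YX)
    return S_yx @ np.linalg.pinv(S_x)

def normalized_mahalanobis(x, Sigma0):
    """Statistic x^T Sigma0^{-1} x / (x^T x) from the anomaly-detection test."""
    return float(x @ np.linalg.solve(Sigma0, x) / (x @ x))

rng = np.random.default_rng(4)      # arbitrary seed (assumption)
n, p, q = 500, 5, 2                 # sample-rich toy sizes (assumption)
A = rng.standard_normal((q, p))     # hypothetical true linear map
X = rng.standard_normal((n, p))
Y = X @ A.T                         # noiseless model, for illustration only
W = linear_mmse_predictor(X, Y)
print("recovered the linear map:", np.allclose(W, A))
print("statistic under Sigma0 = I:", normalized_mahalanobis(np.ones(p), np.eye(p)))
```

When n is smaller than p the sample covariance is rank deficient and the pseudo-inverse no longer inverts it, which is the situation the rest of the talk addresses.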

Slide 24

Slide 24 text

Estimation, selection, testing, screening
• Regularized $\ell_2$ or $\ell_F$ covariance estimation
  • Banded covariance model: Bickel-Levina (2008)
  • Sparse eigendecomposition model: Johnstone-Lu (2007)
  • Stein shrinkage estimator: Ledoit-Wolf (2005), Chen-Wiesel-Eldar-H (2010)
• Gaussian graphical model selection
  • $\ell_1$ regularized GGM: Meinshausen-Bühlmann (2006), Wiesel-Eldar-H (2010)
  • Sparse Kronecker GGM (Matrix Normal): Allen-Tibshirani (2010), Tsiligkaridis-Zhou-H (2012)
• Independence testing
  • Sphericity test for multivariate Gaussian: Wilks (1935)
  • Maximal correlation test: Moran (1980), Eagleson (1983), Jiang (2004), Zhou (2007), Cai and Jiang (2011)
• Correlation screening (H, Rajaratnam 2011, 2012)
  • Find variables having high correlation wrt other variables
  • Find hubs of degree ≥ k ≡ test maximal k-NN

Slide 25

Slide 25 text

Sample complexity regimes for different tasks
Hero and Rajaratnam, submitted 2015
• Sample complexity regime specified by # available samples
• Some of these regimes require knowledge of sparsity factor
• From left to right, regimes require progressively larger sample size

Slide 26

Slide 26 text

Sample complexity regimes for different tasks
Hero and Rajaratnam, submitted 2015
• There are niche regimes for reliable screening, detection, ..., performance estimation
• Smallest amount of data needed to screen for high correlations
• Largest amount of data needed to quantify uncertainty

Slide 27

Slide 27 text

Implication: adapt inference task to sample size
Dichotomous sampling regimes have motivated (Firouzi-H-R 2014):
• Progressive correlation mining ⇒ match the mining task to the available sample size.
• Multistage correlation mining for budget limited applications ⇒ screen a small exploratory sample prior to big collection

Slide 28

Slide 28 text

Screening edges and hubs (H-Rajaratnam 2011, 2012)
After applying threshold ρ obtain a graph G having edges E · · ·
• Number of hub nodes in G: $N_{\delta,\rho} = \sum_{i=1}^{p} I(d_i \geq \delta)$
$I(d_i \geq \delta) = \begin{cases} 1, & \mathrm{card}\{j : j \neq i, |C_{ij}| \geq \rho\} \geq \delta \\ 0, & \text{o.w.} \end{cases}$
C is either the sample correlation matrix
$R = \mathrm{diag}(S_n)^{-1/2} S_n \mathrm{diag}(S_n)^{-1/2}$
or the sample partial correlation matrix
$\hat{\Omega} = \mathrm{diag}(S_n^\dagger)^{-1/2} S_n^\dagger \mathrm{diag}(S_n^\dagger)^{-1/2}$
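The hub count and the two choices of C can be sketched as below. The function names and the 3 × 3 example matrix are illustrative assumptions; the pseudo-inverse is taken with `np.linalg.pinv`, which is one way to realize the dagger notation on the slide.

```python
import numpy as np

def hub_count(C, rho, delta):
    """Number of nodes i with at least `delta` indices j != i such that |C_ij| >= rho."""
    A = np.abs(np.asarray(C, dtype=float)) >= rho
    np.fill_diagonal(A, False)              # exclude j = i
    degrees = A.sum(axis=1)                 # d_i
    return int(np.sum(degrees >= delta))

def sample_correlation_matrix(X):
    """R = diag(S_n)^{-1/2} S_n diag(S_n)^{-1/2} for an (n, p) data matrix X."""
    S = np.cov(X, rowvar=False)
    d = 1.0 / np.sqrt(np.diag(S))
    return d[:, None] * S * d[None, :]

def sample_partial_correlation_matrix(X):
    """Same normalization applied to the pseudo-inverse of S_n."""
    S_pinv = np.linalg.pinv(np.cov(X, rowvar=False))
    d = 1.0 / np.sqrt(np.diag(S_pinv))
    return d[:, None] * S_pinv * d[None, :]

# Hypothetical correlation matrix: node 1 is linked to both other nodes.
C = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.8],
              [0.1, 0.8, 1.0]])
for delta in (1, 2, 3):
    print(f"delta={delta}: {hub_count(C, rho=0.5, delta=delta)} hub node(s)")
```

With δ = 1 the count is the number of nodes touched by at least one edge, so edge screening is the special case δ = 1 of hub screening.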

Slide 29

Slide 29 text

Asymptotics for fixed sample size n, p → ∞, and ρ → 1
Asymptotics of hub screening (Rajaratnam and H 2011, 2012):
Assume that rows of the n × p matrix X are i.i.d. circular complex random variables with bounded elliptically contoured density and block sparse covariance.
Theorem: Let p and $\rho = \rho_p$ satisfy $\lim_{p \to \infty} p^{1/\delta} (p-1) (1 - \rho_p^2)^{(n-2)/2} = e_{n,\delta}$. Then
$P(N_{\delta,\rho} > 0) \to \begin{cases} 1 - \exp(-\lambda_{\delta,\rho,n}/2), & \delta = 1 \\ 1 - \exp(-\lambda_{\delta,\rho,n}), & \delta > 1 \end{cases}$
$\lambda_{\delta,\rho,n} = p \binom{p-1}{\delta} (P_0(\rho, n))^{\delta}$
$P_0(\rho, n) = 2B((n-2)/2, 1/2) \int_{\rho}^{1} (1 - u^2)^{\frac{n-4}{2}} du$

Slide 30

Slide 30 text

False positive rate as function of ρ (δ = 1)
n      550    500    450    150    100    50     10     8      6
ρc     0.188  0.197  0.207  0.344  0.413  0.559  0.961  0.988  0.9997
Critical threshold (δ = 1): $\rho_c \approx \max\{\rho : dE[N_{\delta,\rho}]/d\rho = -1\}$
$\rho_c = \sqrt{1 - c_n (p-1)^{-2/(n-4)}}$
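A numerical sketch of the δ = 1 critical threshold, read here as $\rho_c = \sqrt{1 - c_n (p-1)^{-2/(n-4)}}$. Three things are assumptions not stated on the slide: the square-root reading, the constant $c_n = a_n^{-2/(n-4)}$ with $a_n = 2\Gamma((n-1)/2) / (\sqrt{\pi}\,\Gamma((n-2)/2))$, and the dimension p = 1000. They are used because together they give values close to the tabulated ρc.

```python
import math

def a_n(n):
    """Assumed constant a_n = 2*Gamma((n-1)/2) / (sqrt(pi)*Gamma((n-2)/2))."""
    log_ratio = math.lgamma((n - 1) / 2) - math.lgamma((n - 2) / 2)
    return 2.0 * math.exp(log_ratio) / math.sqrt(math.pi)

def critical_threshold(n, p):
    """rho_c = sqrt(1 - (a_n (p - 1))^(-2/(n-4))) for delta = 1 (assumed form)."""
    return math.sqrt(1.0 - (a_n(n) * (p - 1)) ** (-2.0 / (n - 4)))

p = 1000  # assumption: p is not stated on the slide
for n in (550, 500, 450, 150, 100, 50, 10, 8, 6):
    print(f"n={n:4d}  rho_c={critical_threshold(n, p):.4f}")
```

The threshold approaches 1 rapidly as n drops toward single digits, which matches the trend in the table.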

Slide 32

Slide 32 text

False positive rate as function of ρ and n
Figure panels: (δ = 1) p = 10; (δ = 1) p = 10000
Critical threshold for any δ > 0: $\rho_c = \sqrt{1 - c_{\delta,n} (p-1)^{-2\delta/(\delta(n-2)-2)}}$

Slide 33

Slide 33 text

Critical threshold ρc as function of n (H-Rajaratnam 2012)

Slide 34

Slide 34 text

Critical threshold ρc as function of n (H-Rajaratnam 2012)

Slide 36

Slide 36 text

Respiratory virus challenge study: experimental design
• Zaas et al, Cell Host & Microbe, 2009
• Chen et al, IEEE Trans. Biomedical Engineering, 2010
• Chen et al, BMC Bioinformatics, 2011
• Puig et al, IEEE Trans. Signal Processing, 2011
• Huang et al, PLoS Genetics, 2011
• Woods et al, PLoS One, 2012
• Bazot et al, BMC Bioinformatics, 2013
• Zaas et al, Science Translational Medicine, 2014

Slide 37

Slide 37 text

Data collected from seven challenge studies
Challenge  Virus  Year  Location           Duration (hrs)  # Subjects
DEE1       RSV    2008  Retroscreen        166             20
DEE2       H3N2   2009  Retroscreen        166             17
DEE3       H1N1   2009  Retroscreen        166             24
DEE4       H1N1   2010  Retroscreen        166             19
DEE5       H3N2   2011  Retroscreen        680             21
HRV UVA    HRV    2008  Univ. of Virginia  120             20
HRV Duke   HRV    2010  Duke Univ.         136             30

Slide 38

Slide 38 text

Details on H3N2 DEE2 challenge study
• 17 subjects inoculated and sampled over 7 days
• 373 samples collected
• 21 Affymetrix gene chips assayed for each subject
• p = 12023 genes recorded for each sample
• 10 symptoms scored from {0, 1, 2, 3} for each sample
[Huang et al, PLoS Genetics, 2011]

Slide 39

Slide 39 text

Critical threshold ρc for H3N2 DEE2
Samples fall into 3 categories
• Pre-inoculation samples
  • Number of Pre-inoc. samples: n = 34
  • Critical threshold: ρc = 0.70
  • $10^{-6}$ FWER threshold: ρ = 0.92
• Post-inoculation symptomatic samples
  • Number of Post-inoc. Sx samples: n = 170
  • Critical threshold: ρc = 0.36
  • $10^{-6}$ FWER threshold: ρ = 0.55
• Post-inoculation asymptomatic samples
  • Number of Post-inoc. Asx samples: n = 152
  • Critical threshold: ρc = 0.37
  • $10^{-6}$ FWER threshold: ρ = 0.57

Slide 40

Slide 40 text

Correlation-mining the pre-inoc. samples
• Screen correlation at FWER $10^{-6}$: 1658 genes, 8718 edges
• Screen partial correlation at FWER $10^{-6}$: 39 genes, 111 edges

Slide 41

Slide 41 text

P-value waterfall analysis (Pre-inoc. parcor)

Slide 43

Slide 43 text

Correlation mining for predictor design: bipartite graph
Q: What genes are predictive of certain symptom combinations?
Firouzi, Rajaratnam and H, "Two-stage variable selection for molecular prediction of disease," CAMSAP 2013

Slide 44

Slide 44 text

Cost considerations
Experiments are costly (>$250K per challenge study)
Figure: Pricing per slide for Agilent Custom Microarrays G2309F, G2513F, G4503A, G4502A (Feb 2013). Source: BMC RNA Profiling Core

Slide 45

Slide 45 text

Single stage learning of a predictor
Q: What genes are predictive of certain symptom combinations?

Slide 46

Slide 46 text

Two-stage learning of a predictor (SPARC/SIS)

Slide 47

Slide 47 text

Related work
Falls in the general framework of adaptive support set recovery
Some related work
• Compressive sensing approaches
  • Distilled sampling (DS) (Haupt, Castro and Nowak 2010)
  • Sequentially designed compressive sensing (Haupt, Baraniuk, Castro and Nowak 2011)
• Sparse multivariate regression approaches
  • Lasso recovery (Wainwright 2006, Zhao and Yu 2007)
  • Group lasso recovery (Obozinski, Wainwright, and Jordan 2008)
• Sure independence screening (SIS) (Fan and Lv, 2008)
  • Screens cross-correlation $S_{yx}$ for good predictor variables
• SPARC approach
  • Screens predictor coefficients $S_{yx} S_x^\dagger$
  • Only requires n = log t full-size samples in SPARC stage 1

Slide 48

Slide 48 text

SPARC recovery of support of active variables
Theorem (Firouzi, H, R, 2013, 2014): Assume that the response Y satisfies the following noiseless ground truth model:
$Y = a_{i_1} X_{i_1} + a_{i_2} X_{i_2} + \cdots + a_{i_k} X_{i_k}$
If $n \geq \Theta(\log p)$ then, with probability at least $1 - 1/p$, PCS recovers the support of active variables $\pi_0$.
• Analogous to condition for LASSO support recovery (Obozinski, Wainwright, Jordan 2008).
• The constant in $\Theta(\log p)$ is increasing in the dynamic range coefficient
$\frac{|\pi_0|^{-1} \sum_{l \in \pi_0} |a_l|}{\min_{j \in \pi_0} |a_j|} \in [1, \infty)$
• Worst case: high dynamic range in active regression coefficients.
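A rough sketch of two-stage support recovery under the noiseless model, assuming the screening statistic is the magnitude of the predictor coefficients $S_{yx} S_x^\dagger$ computed from n first-stage samples. This is not the authors' implementation: the function names, the problem sizes, the coefficient values, and the use of ordinary least squares in stage 2 are all assumptions made for illustration.

```python
import numpy as np

def screen_by_predictor_coefficients(X, y, k):
    """Stage 1: rank variables by |S_yx pinv(S_x)| and keep the k largest."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # S_yx pinv(S_x) equals the minimum-norm least-squares coefficients of yc on Xc.
    coef = np.linalg.pinv(Xc) @ yc
    return np.sort(np.argsort(np.abs(coef))[-k:])

def fit_on_support(X, y, support):
    """Stage 2: ordinary least squares restricted to the selected variables."""
    A = np.column_stack([np.ones(len(y)), X[:, support]])   # intercept + selected columns
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[0], beta[1:]

rng = np.random.default_rng(5)            # arbitrary seed (assumption)
p, t, n, k = 300, 400, 150, 3             # illustrative sizes, not from the talk
true_support = np.array([10, 120, 250])   # hypothetical active variables
a = np.array([2.0, -2.0, 2.0])            # equal magnitudes: dynamic range coefficient 1
X = rng.standard_normal((t, p))
y = X[:, true_support] @ a                # noiseless ground-truth model

support = screen_by_predictor_coefficients(X[:n], y[:n], k)   # n samples, all p variables
intercept, coef = fit_on_support(X, y, support)               # all t samples, k variables
print("selected variables:", support.tolist())
print("fitted coefficients:", np.round(coef, 3).tolist())
```

Only the k selected variables are needed for the remaining t − n samples, which is where the cost saving of a two-stage design comes from.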

Slide 49

Slide 49 text

Optimal pre-screening allocation under budget µ
Assume that: cost(acquisition of 1 sample of 1 variable) = 1.
Define
• Total budget for two-stage experiment: µ.
• Number of selected variables: k. Total number of samples: t.
To meet the budget, t, n, k, p must satisfy: $np + (t - n)k \leq \mu$
Theorem: MSE optimal pre-screening allocation rule for two-stage predictor
$n = \begin{cases} O(\log t), & c(p-k)\log t + kt \leq \mu \\ 0, & \text{o.w.} \end{cases}$
When budget is tight, skip stage 1 (n = 0).
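The budget arithmetic can be sketched as follows. Substituting n = c log t into np + (t − n)k gives c(p − k) log t + kt, which is the left-hand side of the condition in the theorem. The constant c is not specified on the slide, so it is a hypothetical parameter here, and reading O(log t) as ceil(c log t) is an assumption; the numeric values in the example are illustrative.

```python
import math

def two_stage_cost(n, t, k, p):
    """Cost of stage 1 (n samples of all p variables) plus stage 2 (t - n samples of k variables)."""
    return n * p + (t - n) * k

def prescreening_samples(mu, t, k, p, c=1.0):
    """Stage-1 sample size under the allocation rule; `c` is a hypothetical constant.

    Returns ceil(c * log t) when c (p - k) log t + k t <= mu, otherwise 0 (skip stage 1).
    """
    if c * (p - k) * math.log(t) + k * t <= mu:
        return math.ceil(c * math.log(t))
    return 0

p, k, t, c = 10_000, 50, 200, 25.0        # illustrative values (assumption)
for mu in (1.5e6, 1.0e6):
    n = prescreening_samples(mu, t, k, p, c)
    print(f"budget {mu:.1e}: n = {n}, cost = {two_stage_cost(n, t, k, p)}")
```

Rounding n up to an integer makes the realized cost slightly larger than c(p − k) log t + kt, so a careful implementation would recheck the budget after rounding.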

Slide 50

Slide 50 text

Simulation comparing SPARC to LASSO/SIS
Figure: Avg mis-selection for two-stage predictor under AR(1) model. n = 25 log t samples are used for the first stage and all t samples are used in the second stage. p = 10,000.

Slide 51

Slide 51 text

Comparison of prediction accuracy and computation
Figure: Prediction accuracy (L) and avg. CPU time (Matlab) (R) for AR(1) model. SPARC compared to SIS and active set implementation of LASSO. p = 10,000.

Slide 52

Slide 52 text

Prediction of Symptoms of H3N2 Based on Gene Expression Levels
Figure: Prediction accuracy for symptom prediction in H3N2 DEE2

Slide 53

Slide 53 text

Top 20 predictive biomarkers selected

Slide 54

Slide 54 text

Variability comparisons between PCS and lasso
Figure panels: PCS genes for subj 25; lasso genes for subj 25

Slide 57

Slide 57 text

Conclusions
Correlation mining requires care when n ≪ p
• "Classical" low dimensional ("CLT") setting inadequate.
• "Ultra-high" dimensional setting inadequate when n fixed.
• "Purely high" dimensional ("big data") setting well suited
• Universal phase transition thresholds can be predicted
• Phase transitions useful for properly sample-sizing experiments
Correlation mining topics not covered here
• Individualized predictors: reference-aided classifiers (Liu et al 2013)
• Structured covariance: Kronecker, Toeplitz, low rank+sparse, etc (Tsiligkaridis and H 2013), (Greenewald and H 2014)
• Non-linear correlation mining (Todros and H, 2011, 2012)
• Spectral correlation mining: stationary time series (Firouzi and H, 2014)