
Alfred Hero

S³ Seminar
January 30, 2015

Alfred Hero

(University of Michigan, Ann Arbor, MI, USA)

https://s3-seminar.github.io/seminars/alfred-hero

Title — Correlation mining in high dimension with limited samples

Abstract — Correlation mining arises in many areas of engineering, social sciences, and natural sciences. Correlation mining discovers columns of a random matrix that are highly correlated with other columns of the matrix and can be used to construct a dependency network over columns. However, when the number n of samples is finite and the number p of columns increases such exploration becomes futile due to a phase transition phenomenon: spurious discoveries will eventually dominate. In this presentation I will present theory for predicting these phase transitions and present Poisson limit theorems that can be used to determine finite sample behavior of correlation structure. The theory has application to areas including gene expression analysis, network security, remote sensing, and portfolio selection.

Biography — Alfred O. Hero III received the B.S. (summa cum laude) from Boston University (1980) and the Ph.D. from Princeton University (1984), both in Electrical Engineering. Since 1984 he has been with the University of Michigan, Ann Arbor, where he is the R. Jamison and Betty Williams Professor of Engineering. His primary appointment is in the Department of Electrical Engineering and Computer Science and he also has appointments, by courtesy, in the Department of Biomedical Engineering and the Department of Statistics. From 2008 to 2013 he held the Digiteo Chaire d'Excellence at the Ecole Superieure d'Electricite, Gif-sur-Yvette, France. He is a Fellow of the Institute of Electrical and Electronics Engineers (IEEE) and several of his research articles have received best paper awards. Alfred Hero was awarded the University of Michigan Distinguished Faculty Achievement Award (2011). He received the IEEE Signal Processing Society Meritorious Service Award (1998), the IEEE Third Millennium Medal (2000), and the IEEE Signal Processing Society Technical Achievement Award (2014). Alfred Hero was President of the IEEE Signal Processing Society (2006-2008) and was on the Board of Directors of the IEEE (2009-2011), where he served as Director of Division IX (Signals and Applications). Alfred Hero's recent research interests are in statistical signal processing, machine learning and the analysis of high dimensional spatio-temporal data. Of particular interest are applications to networks, including social networks, multi-modal sensing and tracking, database indexing and retrieval, imaging, and genomic signal processing.


Transcript

  1. Outline Motivation Correlation mining principles Network analysis SPARC predictor design

    Conclusions Correlation mining in high dimension with limited samples. Alfred Hero, University of Michigan - Ann Arbor, Jan. 30, 2015
  2. Outline: 1 Motivation, 2 Correlation mining principles, 3 Application: network analysis, 4 Application: SPARC∗ predictor design, 5 Conclusions
  3. Acknowledgements • Bala Rajaratnam, Stanford Statistics • Hamed Firouzi, UM EECS (doctoral student) • Rob Brown, UCLA Bioinformatics (doctoral student) • Yongsheng Huang, Merck Labs (former UM-PIBS student) • Geoffrey Ginsburg, Amy Zaas, Chris Woods: Duke Medicine. Sponsors • AFOSR Complex Networks Program • NSF Theoretical Foundations Program • ARO Social Informatics Program • ARO MURI Value of Information Program • NIH P01 program NIBIB - Meyer PI • DARPA Predicting Health and Disease Program - Ginsburg PI
  4. Outline: 1 Motivation, 2 Correlation mining principles, 3 Application: network analysis, 4 Application: SPARC∗ predictor design, 5 Conclusions. ∗SPARC = Screening, Prediction, and Regression via Correlation
  5. Why mine for high sample correlation vs sample mean? Mining for treatment effects: p = 12023 biomarkers, n = 130 samples/treatment. [Figure panels: HMBOX1 vs NRLP2; JARID1D vs SNX19.] Blue: treatment 1 (Sx). Green: treatment 2 (Asx). Solid: women. Hollow: men. Size: hours elapsed since inoculation.
  6. Network discovery from correlation
     • O/I correlation: $p = 1.5 \times 10^{9}$ vertices, $6 \times 10^{9} \le 10^{-8}\binom{p}{2}$ edges, $n = 365$ samples
     • gene correlation: $p = 23{,}000$ vertices, $1.5 \times 10^{5} \le 10^{-3}\binom{p}{2}$ edges, $n = 270$ samples
     • mutual correlation: $p = 7000$ vertices, $7 \times 10^{5} \le 10^{-2}\binom{p}{2}$ edges, $n = 6$ samples
  7. Network discovery from correlation (O/I correlation, gene correlation, mutual correlation) • "Big data" aspects • Large number of unknowns (hubs, edges, subgraphs) • Small number of samples for inference on unknowns • Crucial need to manage uncertainty (false positives)
  8. Sample correlation: p = 2 variables, n = 50 samples. Sample correlation: $\mathrm{corr}_{X,Y} = \dfrac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2\,\sum_{i=1}^{n}(Y_i-\bar{Y})^2}} \in [-1, 1]$. [Figure panels: positive correlation = 1; negative correlation = −1.]
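The sample correlation formula on this slide can be sketched in a few lines. This is a minimal illustration in plain Python; the function name `sample_corr` and the toy data are assumptions made for the example, not part of the talk.

```python
import math

def sample_corr(x, y):
    """Sample (Pearson) correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / n
    my = sum(y) / n
    # Numerator: sum of cross-products of the centered samples.
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    # Denominator: square root of the product of the centered sums of squares.
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)  # undefined if either sequence is constant

x = [1.0, 2.0, 3.0, 4.0]
print(sample_corr(x, [2.0, 4.0, 6.0, 8.0]))  # exactly linear, increasing
print(sample_corr(x, [8.0, 6.0, 4.0, 2.0]))  # exactly linear, decreasing
```

The two toy calls correspond to the extreme cases named on the slide (correlation +1 and −1).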
  9. Sample correlation for two sequences: p = 2, n = 50. Q: Are the two time sequences $X_i$ and $Y_j$ correlated, e.g. $|\mathrm{corr}_{XY}| > 0.5$?
  10. Sample correlation for two sequences: p = 2, n = 50. Q: Are the two time sequences $X_i$ and $Y_j$ correlated? A: No. Computed over range $i = 1, \ldots, 50$: $\mathrm{corr}_{XY} = -0.0809$
  11. Sample correlation for two sequences: p = 2, n < 15. Q: Are the two time sequences $X_i$ and $Y_j$ correlated? A: Yes. $\mathrm{corr}_{XY} > 0.5$ over range $i = 3, \ldots, 12$ and $\mathrm{corr}_{XY} < -0.5$ over range $i = 29, \ldots, 42$.
  12. Correlating a set of p = 20 sequences. Q: Are any pairs of sequences correlated? Are there patterns of correlation?
  13. Sample correlation R w/ correlation thresholding (0.5). [Figure panels: correlation matrix; thresholded matrix.] Apparent patterns emerge after thresholding each pairwise correlation at ±0.5 (12 cross-correlations).
  14. Associated sample correlation graph. The graph has an edge between nodes (variables) i and j if the ij-th entry of the thresholded correlation matrix is non-zero. The sequences are actually uncorrelated Gaussian.
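The experiment behind these slides can be imitated with a small simulation: draw p independent Gaussian sequences, threshold every pairwise sample correlation, and list the resulting (spurious) edges. This is only a sketch. The sample sizes tried, the seed, and the helper names `sample_corr` and `spurious_edges` are assumptions for illustration; the slides do not state the n used for the p = 20 example.

```python
import math
import random

def sample_corr(x, y):
    """Sample (Pearson) correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spurious_edges(p, n, rho, seed=0):
    """Edges (i, j), i < j, with |sample corr| >= rho among p independent N(0,1) sequences."""
    rng = random.Random(seed)
    seqs = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
    return [(i, j)
            for i in range(p)
            for j in range(i + 1, p)
            if abs(sample_corr(seqs[i], seqs[j])) >= rho]

# Every edge found here is a false discovery, because the sequences are independent.
for n in (10, 50, 500):
    print(n, len(spurious_edges(p=20, n=n, rho=0.5)))
```

With few samples per sequence the thresholded graph tends to contain edges even though no true correlation exists; with many samples it empties out.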
  15. Misreporting of correlations is a real problem. Source: Young and Karr, Significance, Sept. 2011
  16. The problem of false discoveries: phase transitions • The number of discoveries exhibits a phase transition phenomenon • This phenomenon gets worse as p/n increases. • Example: false discoveries of high correlation for uncorrelated Gaussian variables
  17. The problem of false discoveries: phase transitions • The number of discoveries exhibits a phase transition phenomenon • This phenomenon gets worse as p/n increases. • Example: false discoveries of high correlation for uncorrelated Gaussian variables
  18. Outline: 1 Motivation, 2 Correlation mining principles, 3 Application: network analysis, 4 Application: SPARC∗ predictor design, 5 Conclusions. ∗SPARC = Screening, Prediction, and Regression via Correlation
  19. Principled design of correlation mining algorithms. Design objective: estimate or detect patterns of correlation in high dimensional sample-poor environments with low error rates. Fundamental design question: What are the fundamental properties of a network of p interacting variables that can be accurately estimated from a small number n of measurements? Regimes • $n/p \to \infty$: sample rich regime (CLT, LLNs) • $n/p \to c$: sample critical regime (Semi-circle, Marchenko-Pastur) • $n/p \to 0$: sample starved regime (Chen-Stein). It is important to design the procedure for the regime one is in.
  20. Fundamental sampling regimes • Classical asymptotics: $n \to \infty$, p fixed ("small data") • Mixed asymptotics: $n \to \infty$, $p \to \infty$ ("medium sized data") • Purely high dimensional: n fixed, $p \to \infty$ ("big data")
  21. Why is correlation important? • Network modeling: learning/simulating descriptive models • Empirical prediction: forecast a response variable Y • Classification: estimate type of correlation from samples • Anomaly detection: localize unusual activity in a sample
  22. Why is correlation important? • Network modeling: learning/simulating descriptive models • Empirical prediction: forecast a response variable Y • Classification: estimate type of correlation from samples • Anomaly detection: localize unusual activity in a sample. Each application requires an estimate of the covariance matrix $\Sigma_X$ or its inverse. The covariance matrix is related to the inter-dependency structure.
      Prediction: linear minimum MSE predictor of q variables Y from X: $\hat{Y} = \Sigma_{YX}\Sigma_X^{-1}X$
      Classification: QDA test of $H_0: \Sigma_X = \Sigma_0$ vs $H_1: \Sigma_X = \Sigma_1$: $X^T(\Sigma_0^{-1} - \Sigma_1^{-1})X \underset{H_0}{\overset{H_1}{\gtrless}} \eta$
      Anomaly detection: Mahalanobis test of $H_0: \Sigma_X = \Sigma_0$ vs $H_1: \Sigma_X \ne \Sigma_0$: $\dfrac{X^T\Sigma_0^{-1}X}{X^TX} \underset{H_0}{\overset{H_1}{\gtrless}} \eta$
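To make the role of the inverse covariance concrete, here is a small sketch of the linear minimum MSE predictor and the Mahalanobis-type anomaly statistic from this slide, written in plain Python. It assumes zero-mean variables and a known, invertible covariance; the helper names (`solve`, `lmmse_predict`, `mahalanobis_ratio`) and the toy matrices are hypothetical.

```python
def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting (a is square)."""
    m = len(a)
    aug = [[float(v) for v in a[i]] + [float(b[i])] for i in range(m)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(aug[r][col]))
        if abs(aug[piv][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, m):
            f = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= f * aug[col][c]
    z = [0.0] * m
    for i in reversed(range(m)):
        s = sum(aug[i][j] * z[j] for j in range(i + 1, m))
        z[i] = (aug[i][m] - s) / aug[i][i]
    return z

def lmmse_predict(sigma_yx, sigma_x, x):
    """Y_hat = Sigma_YX Sigma_X^{-1} x for zero-mean X and Y."""
    z = solve(sigma_x, x)  # z = Sigma_X^{-1} x without forming the inverse
    return [sum(row[j] * z[j] for j in range(len(z))) for row in sigma_yx]

def mahalanobis_ratio(sigma_0, x):
    """Test statistic (x^T Sigma_0^{-1} x) / (x^T x), to be compared with a threshold eta."""
    z = solve(sigma_0, x)
    return sum(a * b for a, b in zip(x, z)) / sum(a * a for a in x)

sigma_x = [[2.0, 0.0], [0.0, 4.0]]
sigma_yx = [[1.0, 2.0]]           # q = 1 response, 2 predictors
x = [2.0, 4.0]
print(lmmse_predict(sigma_yx, sigma_x, x))
print(mahalanobis_ratio(sigma_x, x))
```

In practice both quantities are computed from an estimated covariance, which is exactly where the sample-starved regime causes trouble.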
  23. Estimation, selection, testing, screening
      • Regularized $\ell_2$ or $\ell_F$ covariance estimation
        - Banded covariance model: Bickel-Levina (2008)
        - Sparse eigendecomposition model: Johnstone-Lu (2007)
        - Stein shrinkage estimator: Ledoit-Wolf (2005), Chen-Wiesel-Eldar-H (2010)
      • Gaussian graphical model selection
        - $\ell_1$ regularized GGM: Meinshausen-Bühlmann (2006), Wiesel-Eldar-H (2010)
        - Sparse Kronecker GGM (Matrix Normal): Allen-Tibshirani (2010), Tsiligkaridis-Zhou-H (2012)
      • Independence testing
        - Sphericity test for multivariate Gaussian: Wilks (1935)
        - Maximal correlation test: Moran (1980), Eagleson (1983), Jiang (2004), Zhou (2007), Cai and Jiang (2011)
      • Correlation screening (H, Rajaratnam 2011, 2012)
        - Find variables having high correlation wrt other variables
        - Find hubs of degree ≥ k ≡ test maximal k-NN
  24. Sample complexity regimes for different tasks (Hero and Rajaratnam, submitted 2015) • Sample complexity regime is specified by the number of available samples • Some of these regimes require knowledge of the sparsity factor • From left to right, regimes require progressively larger sample size
  25. Sample complexity regimes for different tasks (Hero and Rajaratnam, submitted 2015) • There are niche regimes for reliable screening, detection, . . . , performance estimation • Smallest amount of data needed to screen for high correlations • Largest amount of data needed to quantify uncertainty
  26. Implication: adapt inference task to sample size. Dichotomous sampling regimes have motivated (Firouzi-H-R 2014): • Progressive correlation mining ⇒ match the mining task to the available sample size. • Multistage correlation mining for budget limited applications ⇒ screen a small exploratory sample prior to big collection
  27. Screening edges and hubs (H-Rajaratnam 2011, 2012). After applying threshold ρ obtain a graph G having edges E · · ·
      • Number of hub nodes in G: $N_{\delta,\rho} = \sum_{i=1}^{p} I(d_i \ge \delta)$, where $I(d_i \ge \delta) = 1$ if $\mathrm{card}\{j : j \ne i,\ |C_{ij}| \ge \rho\} \ge \delta$ and $0$ otherwise.
      C is either the sample correlation matrix $R = \mathrm{diag}(S_n)^{-1/2}\, S_n\, \mathrm{diag}(S_n)^{-1/2}$ or the sample partial correlation matrix $\hat{\Omega} = \mathrm{diag}(S_n^{\dagger})^{-1/2}\, S_n^{\dagger}\, \mathrm{diag}(S_n^{\dagger})^{-1/2}$
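The hub count defined on this slide can be sketched directly from its definition: threshold the off-diagonal entries of C at ρ, compute each vertex degree, and count vertices of degree at least δ. The function name `hub_count` and the 4 × 4 toy matrix are assumptions for illustration only.

```python
def hub_count(c, rho, delta):
    """Number of vertices whose degree in the rho-thresholded graph is at least delta."""
    p = len(c)
    # d_i = number of j != i with |C_ij| >= rho
    degrees = [sum(1 for j in range(p) if j != i and abs(c[i][j]) >= rho)
               for i in range(p)]
    return sum(1 for d in degrees if d >= delta)

c = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.2, 0.0],
    [0.8, 0.2, 1.0, -0.3],
    [0.1, 0.0, -0.3, 1.0],
]
print(hub_count(c, rho=0.7, delta=1))  # vertices 0, 1 and 2 each touch an edge
print(hub_count(c, rho=0.7, delta=2))  # only vertex 0 has two edges
```

The same function applies whether C is a sample correlation matrix or a sample partial correlation matrix, since only the thresholded magnitudes are used.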
  28. Asymptotics for fixed sample size n, $p \to \infty$, and $\rho \to 1$. Asymptotics of hub screening (Rajaratnam and H 2011, 2012): Assume that rows of the $n \times p$ matrix X are i.i.d. circular complex random variables with bounded elliptically contoured density and block sparse covariance.
      Theorem: Let p and $\rho = \rho_p$ satisfy $\lim_{p\to\infty} p^{1/\delta}(p-1)(1-\rho_p^2)^{(n-2)/2} = e_{n,\delta}$. Then
      $$P(N_{\delta,\rho} > 0) \to \begin{cases} 1 - \exp(-\lambda_{\delta,\rho,n}/2), & \delta = 1 \\ 1 - \exp(-\lambda_{\delta,\rho,n}), & \delta > 1 \end{cases}$$
      where
      $$\lambda_{\delta,\rho,n} = p\binom{p-1}{\delta}\bigl(P_0(\rho,n)\bigr)^{\delta}, \qquad P_0(\rho,n) = 2B((n-2)/2,\, 1/2)\int_{\rho}^{1}(1-u^2)^{\frac{n-4}{2}}\,du$$
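The limit in the theorem can be evaluated numerically. The sketch below computes $P_0(\rho, n)$ by Simpson integration and then the Poisson-type approximation to $P(N_{\delta,\rho} > 0)$. One assumption should be flagged loudly: the integral is normalized by dividing by the beta function $B((n-2)/2, 1/2)$ so that $P_0(0, n) = 1$, i.e. $P_0$ is treated as the two-sided tail probability of a null sample correlation. That reading of the constant is an assumption and may differ from the constant as printed on the slide. Function names and the values of p, n and ρ tried are illustrative.

```python
import math

def p0(rho, n, steps=2000):
    """Assumed form of P0: P(|r| >= rho) for a null sample correlation, n >= 4."""
    if n < 4:
        raise ValueError("this sketch assumes n >= 4")
    m = (n - 4) / 2.0
    # a_n = 2 / B((n-2)/2, 1/2), via log-gamma for numerical stability (assumed normalization)
    a_n = 2.0 * math.exp(math.lgamma((n - 1) / 2.0) - math.lgamma((n - 2) / 2.0))
    a_n /= math.sqrt(math.pi)

    def f(u):
        return max(0.0, 1.0 - u * u) ** m

    # Composite Simpson rule on [rho, 1]; steps must be even.
    h = (1.0 - rho) / steps
    total = f(rho) + f(1.0)
    for k in range(1, steps):
        total += (4.0 if k % 2 else 2.0) * f(rho + k * h)
    return a_n * total * h / 3.0

def prob_any_hub(p, n, rho, delta=1):
    """Limiting approximation to P(N_{delta,rho} > 0) from the theorem."""
    lam = p * math.comb(p - 1, delta) * p0(rho, n) ** delta
    return -math.expm1(-lam / 2.0) if delta == 1 else -math.expm1(-lam)

# Probability of at least one false discovery among p = 1000 independent variables, n = 10.
for rho in (0.90, 0.95, 0.99, 0.999):
    print(rho, prob_any_hub(p=1000, n=10, rho=rho))
```

Scanning ρ in this way shows how abruptly the false discovery probability moves from near one to near zero, which is the phase transition the talk refers to.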
  29. False positive rate as function of ρ (δ = 1).
      n:    550    500    450    150    100    50     10     8      6
      ρc:   0.188  0.197  0.207  0.344  0.413  0.559  0.961  0.988  0.9997
      Critical threshold (δ = 1): $\rho_c \approx \max\{\rho : dE[N_{\delta,\rho}]/d\rho = -1\}$, with $\rho_c = \sqrt{1 - c_n (p-1)^{-2/(n-4)}}$
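A closed-form critical threshold can be sketched as follows. The slide does not give $c_n$ or the value of p behind the table, so both are assumptions here: the code takes the threshold to have the form $\rho_c = \sqrt{1 - c_n (p-1)^{-2/(n-4)}}$ with $c_n = a_n^{-2/(n-4)}$ and $a_n = 2/B((n-2)/2, 1/2)$, and uses p = 1000. Under these assumed choices the printed values come out close to the tabulated ones, which is some reassurance but not a confirmation of the author's exact definition.

```python
import math

def critical_threshold(n, p):
    """Assumed closed form: sqrt(1 - c_n (p-1)^(-2/(n-4))), c_n = a_n^(-2/(n-4))."""
    if n <= 4 or p < 2:
        raise ValueError("this sketch assumes n > 4 and p >= 2")
    # log a_n with a_n = 2 / B((n-2)/2, 1/2)  (assumed constant)
    log_a_n = (math.log(2.0)
               + math.lgamma((n - 1) / 2.0)
               - math.lgamma((n - 2) / 2.0)
               - 0.5 * math.log(math.pi))
    c_n = math.exp(-2.0 * log_a_n / (n - 4))
    return math.sqrt(1.0 - c_n * (p - 1) ** (-2.0 / (n - 4)))

# p = 1000 is an assumption; it is not stated on the slide.
for n in (550, 500, 450, 150, 100, 50, 10, 8, 6):
    print(n, round(critical_threshold(n, 1000), 4))
```

The threshold climbs toward 1 as n shrinks: with only a handful of samples, almost any correlation short of perfect can arise by chance among a thousand variables.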
  30. False positive rate as function of ρ and n. [Figure panels: (δ = 1) p = 10; (δ = 1) p = 10000.]
  31. False positive rate as function of ρ and n. [Figure panels: (δ = 1) p = 10; (δ = 1) p = 10000.] Critical threshold for any δ > 0: $\rho_c = \sqrt{1 - c_{\delta,n}(p-1)^{-2\delta/(\delta(n-2)-2)}}$
  32. Critical threshold ρc as function of n (H-Rajaratnam 2012)
  33. Critical threshold ρc as function of n (H-Rajaratnam 2012)
  34. Outline: 1 Motivation, 2 Correlation mining principles, 3 Application: network analysis, 4 Application: SPARC∗ predictor design, 5 Conclusions. ∗SPARC = Screening, Prediction, and Regression via Correlation
  35. Respiratory virus challenge study: experimental design. Zaas et al, Cell Host & Microbe, 2009; Chen et al, IEEE Trans. Biomedical Engineering, 2010; Chen et al, BMC Bioinformatics, 2011; Puig et al, IEEE Trans. Signal Processing, 2011; Huang et al, PLoS Genetics, 2011; Woods et al, PLoS One, 2012; Bazot et al, BMC Bioinformatics, 2013; Zaas et al, Science Translational Medicine, 2014
  36. Data collected from seven challenge studies
      Challenge  Virus  Year  Location           Duration (hrs)  # Subjects
      DEE1       RSV    2008  Retroscreen        166             20
      DEE2       H3N2   2009  Retroscreen        166             17
      DEE3       H1N1   2009  Retroscreen        166             24
      DEE4       H1N1   2010  Retroscreen        166             19
      DEE5       H3N2   2011  Retroscreen        680             21
      HRV UVA    HRV    2008  Univ. of Virginia  120             20
      HRV Duke   HRV    2010  Duke Univ.         136             30
  37. Details on H3N2 DEE2 challenge study • 17 subjects inoculated and sampled over 7 days • 373 samples collected • 21 Affymetrix gene chips assayed for each subject • p = 12023 genes recorded for each sample • 10 symptoms scored from {0, 1, 2, 3} for each sample [Huang et al, PLoS Genetics, 2011]
  38. Critical threshold ρc for H3N2 DEE2. Samples fall into 3 categories:
      • Pre-inoculation samples
        - Number of pre-inoc. samples: n = 34
        - Critical threshold: ρc = 0.70
        - $10^{-6}$ FWER threshold: ρ = 0.92
      • Post-inoculation symptomatic samples
        - Number of post-inoc. Sx samples: n = 170
        - Critical threshold: ρc = 0.36
        - $10^{-6}$ FWER threshold: ρ = 0.55
      • Post-inoculation asymptomatic samples
        - Number of post-inoc. Asx samples: n = 152
        - Critical threshold: ρc = 0.37
        - $10^{-6}$ FWER threshold: ρ = 0.57
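One way to arrive at an FWER threshold of this kind is to invert the δ = 1 limit $P(N > 0) \approx 1 - \exp(-p(p-1)P_0(\rho, n)/2)$ for ρ. The sketch below does this by bisection. It rests on assumptions that are not stated on the slide: $P_0$ is taken to be the two-sided null tail probability normalized by $1/B((n-2)/2, 1/2)$, the screening is on plain sample correlation with p = 12023, and the function names are hypothetical. Under those assumptions the results land near the slide's 0.92, 0.55 and 0.57, but the author's exact procedure may differ.

```python
import math

def p0(rho, n, steps=2000):
    """Assumed P0: two-sided null tail probability of a sample correlation (n >= 4)."""
    m = (n - 4) / 2.0
    a_n = 2.0 * math.exp(math.lgamma((n - 1) / 2.0) - math.lgamma((n - 2) / 2.0))
    a_n /= math.sqrt(math.pi)
    h = (1.0 - rho) / steps
    total = 0.0
    for k in range(steps + 1):
        u = rho + k * h
        w = 1.0 if k in (0, steps) else (4.0 if k % 2 else 2.0)  # Simpson weights
        total += w * max(0.0, 1.0 - u * u) ** m
    return a_n * total * h / 3.0

def fwer_threshold(p, n, alpha=1e-6, iters=60):
    """Bisection for the rho at which 1 - exp(-p(p-1) P0(rho, n) / 2) drops to alpha."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        fwer = -math.expm1(-0.5 * p * (p - 1) * p0(mid, n))
        if fwer > alpha:
            lo = mid   # too many false discoveries expected: raise the threshold
        else:
            hi = mid
    return hi

for n in (34, 170, 152):
    print(n, round(fwer_threshold(12023, n), 3))
```

The small pre-inoculation sample forces a much higher threshold than the two post-inoculation groups, which is why far fewer edges survive screening there.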
  39. Correlation-mining the pre-inoc. samples • Screen correlation at FWER $10^{-6}$: 1658 genes, 8718 edges • Screen partial correlation at FWER $10^{-6}$: 39 genes, 111 edges
  40. P-value waterfall analysis (Pre-inoc. parcor)
  41. Outline: 1 Motivation, 2 Correlation mining principles, 3 Application: network analysis, 4 Application: SPARC∗ predictor design, 5 Conclusions. ∗SPARC = Screening, Prediction, and Regression via Correlation
  42. Correlation mining for predictor design: bipartite graph. Q: What genes are predictive of certain symptom combinations? Firouzi, Rajaratnam and H, "Two-stage variable selection for molecular prediction of disease," CAMSAP 2013
  43. Cost considerations. Experiments are costly (>$250K per challenge study). Figure: Pricing per slide for Agilent Custom Microarrays G2309F, G2513F, G4503A, G4502A (Feb 2013). Source: BMC RNA Profiling Core
  44. Single stage learning of a predictor. Q: What genes are predictive of certain symptom combinations?
  45. Two-stage learning of a predictor (SPARC/SIS)
  46. Related work. Falls in the general framework of adaptive support set recovery. Some related work:
      • Compressive sensing approaches
        - Distilled sensing (DS) (Haupt, Castro and Nowak 2010)
        - Sequentially designed compressive sensing (Haupt, Baraniuk, Castro and Nowak 2011)
      • Sparse multivariate regression approaches
        - Lasso recovery (Wainwright 2006, Zhao and Yu 2007)
        - Group lasso recovery (Obozinski, Wainwright, and Jordan 2008)
      • Sure independence screening (SIS) (Fan and Lv, 2008)
        - Screens cross-correlation $S_{yx}$ for good predictor variables
      • SPARC approach
        - Screens predictor coefficients $S_{yx} S_x^{\dagger}$
        - Only requires $n = \log t$ full-size samples in SPARC stage 1
  47. SPARC recovery of support of active variables. Theorem (Firouzi, H, R, 2013, 2014): Assume that the response Y satisfies the following noiseless ground truth model: $Y = a_{i_1}X_{i_1} + a_{i_2}X_{i_2} + \cdots + a_{i_k}X_{i_k}$. If $n \ge \Theta(\log p)$ then, with probability at least $1 - 1/p$, PCS recovers the support of active variables $\pi_0$. • Analogous to the condition for LASSO support recovery (Obozinski, Wainwright, Jordan 2008). • The constant in $\Theta(\log p)$ is increasing in the dynamic range coefficient $\dfrac{|\pi_0|^{-1}\sum_{l\in\pi_0}|a_l|}{\min_{j\in\pi_0}|a_j|} \in [1, \infty)$ • Worst case: high dynamic range in active regression coefficients.
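The dynamic range coefficient in the theorem is simply the mean magnitude of the active coefficients divided by the smallest magnitude. A minimal sketch, with a hypothetical function name and toy coefficient vectors:

```python
def dynamic_range(active_coeffs):
    """(mean of |a_l| over the active set) / (min of |a_j| over the active set)."""
    mags = [abs(a) for a in active_coeffs]
    if not mags or min(mags) == 0.0:
        raise ValueError("active coefficients must be non-empty and non-zero")
    return (sum(mags) / len(mags)) / min(mags)

print(dynamic_range([2.0, -2.0, 2.0]))   # equal magnitudes: the lower bound of the range
print(dynamic_range([1.0, 3.0]))         # mean 2, min 1
print(dynamic_range([0.01, 5.0, -5.0]))  # one weak coefficient inflates the ratio
```

A single weak active coefficient is enough to drive the ratio up, which matches the slide's remark that high dynamic range is the worst case for support recovery.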
  48. Optimal pre-screening allocation under budget µ. Assume that cost(acquisition of 1 sample of 1 variable) = 1. Define • Total budget for two-stage experiment: µ. • Number of selected variables: k. Total number of samples: t. To meet the budget, t, n, k, p must satisfy $np + (t-n)k \le \mu$. Theorem: MSE optimal pre-screening allocation rule for two-stage predictor: $n = \begin{cases} O(\log t), & c(p-k)\log t + kt \le \mu \\ 0, & \text{o.w.} \end{cases}$ When the budget is tight, skip stage 1 (n = 0).
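The allocation rule can be read as a simple feasibility check: propose a stage-1 sample size proportional to log t, and fall back to n = 0 if the two-stage cost np + (t − n)k exceeds the budget. The sketch below assumes a natural logarithm, rounds the stage-1 size up to an integer, and checks the budget on that rounded value; the constant `c` and the function name are hypothetical, since the slide only specifies n = O(log t).

```python
import math

def stage1_samples(mu, p, k, t, c=1.0):
    """Stage-1 sample size n under budget mu; returns 0 when stage 1 should be skipped."""
    n = min(t, math.ceil(c * math.log(t)))   # assumed: n = ceil(c * ln t), capped at t
    cost = n * p + (t - n) * k               # n full-size samples, then t - n samples of k variables
    return n if cost <= mu else 0

# Illustrative numbers only: p = 10,000 variables, k = 100 kept, t = 1000 samples, c = 25.
print(stage1_samples(mu=2_000_000, p=10_000, k=100, t=1000, c=25))
print(stage1_samples(mu=1_000_000, p=10_000, k=100, t=1000, c=25))
```

Note that np + (t − n)k = n(p − k) + kt, so substituting n = c log t recovers the budget condition c(p − k) log t + kt ≤ µ stated in the theorem.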
  49. Simulation comparing SPARC to LASSO/SIS. Figure: Avg mis-selection for two-stage predictor under AR(1) model. $n = 25\log t$ samples are used for the first stage and all t samples are used in the second stage. p = 10,000.
  50. Comparison of prediction accuracy and computation. Figure: Prediction accuracy (left) and avg. CPU time (Matlab) (right) for AR(1) model. SPARC compared to SIS and active set implementation of LASSO. p = 10,000.
  51. Prediction of Symptoms of H3N2 Based on Gene Expression Levels. Figure: Prediction accuracy for symptom prediction in H3N2 DEE2
  52. Variability comparisons between PCS and lasso. Figure panels: PCS genes for subj 25; lasso genes for subj 25
  53. Outline: 1 Motivation, 2 Correlation mining principles, 3 Application: network analysis, 4 Application: SPARC∗ predictor design, 5 Conclusions. ∗SPARC = Screening, Prediction, and Regression via Correlation
  54. Conclusions. Correlation mining requires care when $n \ll p$ • "Classical" low dimensional ("CLT") setting inadequate. • "Ultra-high" dimensional setting inadequate when n fixed. • "Purely high" dimensional ("big data") setting well suited • Universal phase transition thresholds can be predicted • Phase transitions useful for properly sample-sizing experiments
  55. Conclusions. Correlation mining requires care when $n \ll p$ • "Classical" low dimensional ("CLT") setting inadequate. • "Ultra-high" dimensional setting inadequate when n fixed. • "Purely high" dimensional ("big data") setting well suited • Universal phase transition thresholds can be predicted • Phase transitions useful for properly sample-sizing experiments. Correlation mining topics not covered here • Individualized predictors: reference-aided classifiers (Liu et al 2013) • Structured covariance: Kronecker, Toeplitz, low rank+sparse, etc (Tsiligkaridis and H 2013), (Greenewald and H 2014) • Non-linear correlation mining (Todros and H, 2011, 2012) • Spectral correlation mining: stationary time series (Firouzi and H, 2014)