Slide 1

Big Data in the Social Sciences
Statistical methods for multi-source high-dimensional data
Katrijn Van Deun, Tilburg University

Slide 2

Outline
• Introduction & Motivation
• Methods
  • Model
  • Objective function
  • Estimation
• Model Selection
• Conclusion & Discussion

Slide 3

Introduction

Slide 4

Big Data in the Social Sciences
• Everything is measured
  • What we think and do: social media, web browsing behavior
  • Where we are and with whom: GPS tracking, cameras
  • At a very detailed level: neurons, DNA
• Data are shared
  • Open data: in science, in government (open government data)
• Data are linked
  • Government
  • Science: multi-disciplinary
  • Linked Data web architecture

Slide 5

• Illustration: Health & Retirement Study; traditional data
[Figure: survey data block of respondents 1 … 10 000 (rows) by survey variables 1 … 500 (columns), with well-being as the outcome]

Slide 6

• Illustration: Health & Retirement Study; traditional + novel type of data
• Multiple sources, heterogeneous in nature
• High-dimensional (p >> n)
[Figure: survey data (variables 1 … 500) extended with a novel kind of data (variables 501 … 50 000) for respondents 1 … 1000, with well-being as the outcome: Big Data]

Slide 7

• Illustration: ALSPAC household panel data
• Multiple sources / multi-block (heterogeneous)
• High-dimensional
[Figure: multiple data blocks sharing the same respondents]

Slide 8

Extremely information-rich data
• 1) Adds context and detail => deeper understanding + more accurate prediction
  E.g., same income, social network, and health, but a difference in well-being?
• 2) Gives insight into the interplay between multiple factors
  E.g., gene-environment interactions: find (epi-)genetic markers that make someone susceptible to obesity, together with the protective or risk-provoking environmental conditions associated with these markers
However, statistical tools fall short …

Slide 9


Slide 10

• First challenge = Data fusion?
• How to guarantee that the different sources of variation complement each other?
• Find common sources of structural variation?
• Yet the data sources are heterogeneous: often the common sources of variation are subtle while the source-specific variation is dominant

Slide 11

• Second challenge = Relevant variables?
• Find the relevant variables
  • Information may be hidden in a bulk of irrelevant variables, e.g., genetic markers for obesity hidden in a bulk of irrelevant markers
  • Interpretation based on many variables is not very insightful
  • Note: in general, we do not know which variables are relevant and which are not
• State of the art: penalties, e.g., the lasso
• However, the lasso selects only one variable out of a group of correlated variables; such penalties
  • (i) have a high risk of not selecting the most relevant variable,
  • (ii) are highly unstable, and
  • (iii) tend to also select variables of irrelevant groups

Slide 12

Method: Sparse common and specific components

Slide 13

Method: Sparse common & specific components
• Structured analysis
  • Data fusion: common components
  • Detection of relevant variables: penalties (e.g., lasso) => selection of linked variables (between blocks)
[Figure: survey data (variables 1 … 500) and a novel kind of data (variables 501 … 50 000) for the same respondents, with well-being as the outcome]

Slide 14

• Notation and naming conventions
  • Data block: one of the different data sources forming the multiblock data
  • Xk: data block k (with k = 1, …, K); the outcome(s) is denoted by Y (y if univariate)
  • Each of the data blocks holds the same set of observation units (respondents)
[Figure: X1 = survey data (variables 1 … 500), X2 = novel kind of data (variables 501 … 50 000), Y = well-being]

Slide 15

Method: Data fusion
• Point of departure = Simultaneous component analysis (SCA)
  – Extension of PCA to the multiblock case
  – Promising method for data integration

Slide 16

• Principal component models
• Weight-based variant:

  Xk = Xk Wk Pk^T + Ek   s.t. Wk^T Wk = I   (1)
     = Tk Pk^T + Ek

  with Wk (Jk × R) the component weights, Tk (I × R) the component scores, and Pk (Jk × R) the component loadings
• Interpretation of component t_rk is based on Jk (!) regression weights: t_irk = Σ_j w_jrk x_ijk

Slide 17

• Principal component models
• Loading-based variant:

  Xk = Tk Pk^T + Ek   s.t. Tk^T Tk = I   (2)

• Interpretation of component t_rk is based on Jk (!) correlations: r(x_jk, t_rk) = p_jrk
• Note: in a least squares approach subject to Pk^T Pk = I, we have Wk = Pk
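A minimal numpy sketch of the two variants on a toy centered data block (the data, sizes, and names are assumptions for illustration); for the plain least squares PCA solution both follow from the SVD, and they give the same rank-R reconstruction:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))   # toy I x Jk data block
X -= X.mean(axis=0)                  # column-center, as usual for component models
R = 3

U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Weight-based variant (1): Xk ~ Xk Wk Pk^T with Wk^T Wk = I
W = Vt[:R].T                                   # Jk x R component weights
T_w = X @ W                                    # scores t_irk = sum_j w_jrk x_ijk
P_w = X.T @ T_w @ np.linalg.inv(T_w.T @ T_w)   # loadings by regressing X on T_w

# Loading-based variant (2): Xk ~ Tk Pk^T with Tk^T Tk = I
T_l = U[:, :R]                                 # orthonormal component scores
P_l = Vt[:R].T * S[:R]                         # Jk x R component loadings

print(np.allclose(T_w @ P_w.T, T_l @ P_l.T))   # True: same rank-R reconstruction
```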

Slide 18

• Simultaneous component analysis
  For all k:  Xk = T Pk^T + Ek   s.t. T^T T = I   (3)
  -> the same component scores for all data blocks!
1. Weight-based variant
   [X1 ... XK] = T [P1^T ... PK^T] + [E1 ... EK]
               = [X1 ... XK] [W1^T ... WK^T]^T [P1^T ... PK^T] + [E1 ... EK]
   or, in shorthand notation:
   Xconc = Xconc Wconc Pconc^T + Econc   (4)
2. Loading-based variant
   Xconc = T Pconc^T + Econc   (5)
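A sketch of the loading-based SCA solution (5) for two toy blocks; the block sizes and data are placeholder assumptions. One SVD of the concatenated data yields component scores T shared by all blocks:

```python
import numpy as np

rng = np.random.default_rng(1)
I, J1, J2, R = 100, 15, 40, 3
X1 = rng.standard_normal((I, J1))              # block 1 (e.g., survey data)
X2 = rng.standard_normal((I, J2))              # block 2 (e.g., novel kind of data)
Xconc = np.hstack([X1 - X1.mean(0), X2 - X2.mean(0)])

U, S, Vt = np.linalg.svd(Xconc, full_matrices=False)
T = U[:, :R]                                   # same component scores for all blocks
Pconc = Vt[:R].T * S[:R]                       # concatenated loadings [P1; P2]
P1, P2 = Pconc[:J1], Pconc[J1:]                # block-specific loading matrices
```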

Slide 19

• Simultaneous components
  • Guarantee the same source of variation for the different data blocks
  • But … do not guarantee that the components are common!
  • Explanation: SC methods model the largest source of variation in the concatenated data; often this is source-specific variation, as common variation is subtle (e.g., response tendencies and general socio-behavioral or genetic processes vs. subtle gene-environment interactions)

Slide 20

• Common component analysis (CCA)
  – Accounts for dominating source-specific variation
  – Common component model: Xk = T Pk^T such that T^T T = I and Pk has a common/specific structure
  – The structure can be imposed (constrained analysis)
  – Or: Xk = Xconc Wconc Pk^T with Wconc having a common/specific structure

            P1         P2
  Common    x x x      x x x x
  Spec1     x x x      0 0 0 0
  Spec2     0 0 0      x x x x

Slide 21

• So far, so good …
• Yet:
  • Interest is in the relevant / important variables (social factors, genetic markers)
  • Interpretation of the components based on 1000s of loadings is infeasible
• We need to automatically select the "relevant" variables: sparse common components are needed.

Slide 22

• Structured sparsity:
  • Sparse common components: few non-zero loadings in each data block Xk
  • (Sparse) distinctive components: (few) non-zero loadings in only one/a few data blocks Xk

            X1         X2
  Common    x 0 0      x x 0 x
  Dist1     x x x      0 0 0 0
  Dist2     0 0 0      x 0 x x

Slide 23

Variable selection: How to?
• Sparse analysis
  – Impose a restriction on the loadings: many should become zero
  – State of the art: add penalties (e.g., lasso) to the objective function

Slide 24

• Sparse SCA: Objective function
  Add a penalty known to have variable selection properties to the SCA objective function.
  Minimize over T and Pconc, subject to T^T T = I:

  L(T, Pconc) = ||Xconc − T Pconc^T||²  +  Σ_{k,r} λ_{k,r} ||p_{k,r}||₁
                [fit / SCA]                [penalty]

  with ||p_{k,r}||₁ = Σ_j |p_{jkr}| the L1 penalty or lasso, tuned by λ_{k,r} ≥ 0 -> shrinks and selects variables

Slide 25

• (Adaptive) lasso
  • Oracle properties (under some conditions)
  • Estimation: soft thresholding operator

    b_LASSO = S(b_OLS, λ/2) = b_OLS − λ/2   if b_OLS > λ/2
                              0             if −λ/2 ≤ b_OLS ≤ λ/2
                              b_OLS + λ/2   if b_OLS < −λ/2
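A small sketch of this operator (the toy vector `b_ols` is an assumption for illustration):

```python
import numpy as np

def soft_threshold(b, thr):
    """S(b, thr): shrink b towards zero by thr; exactly zero when |b| <= thr."""
    return np.sign(b) * np.maximum(np.abs(b) - thr, 0.0)

b_ols = np.array([2.0, 0.3, -1.5, -0.1])
lam = 1.0
print(soft_threshold(b_ols, lam / 2))   # shrinks the large, zeroes the small coefficients
```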

Slide 26

• Sparse common and specific components: Objective function
  Add penalties and/or constraints to obtain structured sparsity:

  L = ||Xconc − T Pconc^T||² + Σ_{k,r} λ_{k,r} ||p_{k,r}||₁ + Σ_r Σ_k λ_G ||p_{k,r}||₂ + Σ_{k,r} λ_E ||p_{k,r}||₁²
      [fit / SCA]              [penalties]

  Penalties:
  • Lasso: ||p_{k,r}||₁ = Σ_j |p_{jkr}|, with λ_{k,r} ≥ 0 a tuning parameter
  • Group lasso: ||p_{k,r}||₂ = √(Σ_j p_jkr²), with λ_G ≥ 0 a tuning parameter -> kills blocks of variables
  • Elitist lasso: ||p_{k,r}||₁² = (Σ_j |p_{jkr}|)², with λ_E ≥ 0 a tuning parameter -> kills variables within each block
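A sketch of the three penalty terms for one component's concatenated loading vector; encoding the block structure through a list of index ranges `blocks` is an assumption of this illustration:

```python
import numpy as np

def lasso_pen(p):
    return np.abs(p).sum()                                    # selects individual loadings

def group_lasso_pen(p, blocks):
    return sum(np.sqrt((p[b] ** 2).sum()) for b in blocks)    # kills whole blocks

def elitist_lasso_pen(p, blocks):
    return sum(np.abs(p[b]).sum() ** 2 for b in blocks)       # kills variables within each block

p = np.array([0.9, 0.0, 0.4, 0.0, 0.0, 0.0, 0.2])   # one component's loadings
blocks = [slice(0, 3), slice(3, 7)]                 # block 1: 3 variables, block 2: 4
print(lasso_pen(p), group_lasso_pen(p, blocks), elitist_lasso_pen(p, blocks))
```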

Slide 27

Generic objective function:
• Allows for combinations of penalties, some of them known from the literature
• Penalties can be imposed per component
• Useful to find sparse common and specific components

Slide 28

• Algorithm: Alternating procedure
  Given fixed tuning parameters and the number of common and distinctive components, do:
  0. Initialize Pconc
  1. Update T conditional upon Pconc
     Closed form: T = UV^T with U and V from the SVD of Xconc Pconc (I × R -> small for high-dimensional data)
  2. Update Pconc conditional upon T
     MM + UST: introduce a surrogate objective function and apply univariate soft thresholding to the surrogate (see next)
  3. Check the stop criteria (convergence of the loss, maximum number of iterations) and return to step 1 or terminate
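A compact sketch of this alternating scheme for the plain lasso case (no group or elitist terms); the random initialization and tolerances are assumptions. Because T^T T = I, step 2 reduces to univariate soft thresholding of Xconc^T T, as derived on the later slides:

```python
import numpy as np

def sparse_sca(Xconc, R, lam, n_iter=200, tol=1e-6):
    """Alternating estimation of lasso-penalized SCA with orthonormal scores."""
    J = Xconc.shape[1]
    rng = np.random.default_rng(0)
    P = rng.standard_normal((J, R))                  # step 0: initialize Pconc
    loss_prev = np.inf
    for _ in range(n_iter):
        # step 1: T given P, closed form via the small I x R SVD (Procrustes)
        U, _, Vt = np.linalg.svd(Xconc @ P, full_matrices=False)
        T = U @ Vt
        # step 2: P given T, univariate soft thresholding of X'T
        B = Xconc.T @ T
        P = np.sign(B) * np.maximum(np.abs(B) - lam / 2, 0.0)
        # step 3: stop on convergence of the penalized loss
        loss = ((Xconc - T @ P.T) ** 2).sum() + lam * np.abs(P).sum()
        if loss_prev - loss < tol:
            break
        loss_prev = loss
    return T, P
```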

Slide 29

• Algorithm: Conditional estimation of Pconc given T
  This is a complicated optimization problem because of the penalties => majorization-minimization (MM) procedure

Slide 30

• MM in brief
  Find the minimum of f(x) via a surrogate function g(x,c) (c: constant):
  – g(x,c) is easy to minimize
  – g(x,c) ≥ f(x)
  – g(c,c) = f(c) (supporting point)
  Then f(x_min) ≤ g(x_min, c) ≤ g(c,c) = f(c)
  => the minimum of the previous iteration becomes the supporting point of the current iteration
  This yields a non-increasing series of loss values.
[Figure: original function f(x) with majorizing functions g(x,c) at the first and second supporting points]
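A toy numeric illustration of the MM idea (this one-dimensional example is an assumption, not the slides' figure): minimize f(x) = (x − b)² + λ|x| by majorizing |x| ≤ x²/(2|c|) + |c|/2, with equality at the supporting point c:

```python
b, lam = 3.0, 2.0
f = lambda x: (x - b) ** 2 + lam * abs(x)   # true minimum at S(b, lam/2) = 2.0

x = 1.0                                     # first supporting point
for _ in range(25):
    x_new = b / (1.0 + lam / (2.0 * abs(x)))   # minimizer of the quadratic surrogate
    assert f(x_new) <= f(x) + 1e-12            # non-increasing series of loss values
    x = x_new                                  # becomes the next supporting point
print(round(x, 4))                          # -> 2.0 (approximately)
```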

Slide 31

Example: constructing a majorizing function for the elitist lasso penalty

  (Σ_j |p_j|)²  ≤  Σ_j (p_j² / |p_j^(0)|) · Σ_j |p_j^(0)|  =  p^T D₁ p

with D₁ a diagonal matrix with elements Σ_j' |p_j'^(0)| / |p_j^(0)| and p^(0) the supporting point.

Applied to the group, elitist, and regular lasso, we obtain as a surrogate function

  L ≤ k + ||Xconc − T Pconc^T||² + vec(Pconc)^T D vec(Pconc)
    = k + ||vec(Xconc) − (I ⊗ T) vec(Pconc)||² + vec(Pconc)^T D vec(Pconc)

for which the root (of the gradient) can be found using standard techniques.

Slide 32

Hence, the problem is now to find the minimum of

  ||vec(Xconc) − (I ⊗ T) vec(Pconc)||² + vec(Pconc)^T D vec(Pconc)

which can be found using standard techniques:

  vec(Pconc) = (D + I)⁻¹ vec(Xconc^T T)

Note that D + I is diagonal, so the solve is elementwise.

Slide 33

Alternative approach: Coordinate descent

  L ≤ k + ||Xconc − T Pconc^T||² + vec(Pconc)^T D* vec(Pconc) + Σ_{k,r} λ_{k,r} ||p_{k,r}||₁
    = k + ||vec(Xconc) − (I ⊗ T) vec(Pconc)||² + vec(Pconc)^T D* vec(Pconc) + Σ_{k,r} λ_{k,r} ||p_{k,r}||₁
    = k + Σ_{j,r} [ (Σ_i x_ij t_ir − p_jr)² + d*_j p_jr² + λ |p_jr| ]

for which the root can be found using subgradient techniques.

Slide 34

• Hence, the following soft thresholding update of the loadings:

  p_jr = (Σ_i x_ij t_ir − λ/2) / (1 + d_j)   if Σ_i x_ij t_ir > λ/2
       = (Σ_i x_ij t_ir + λ/2) / (1 + d_j)   if Σ_i x_ij t_ir < −λ/2
       = 0                                   otherwise

  This can be calculated for all loadings of component r simultaneously using simple vector and matrix operations!
  => Highly efficient (time + memory), scalable to large data
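A sketch of this vectorized update, assuming `d` holds the diagonal MM weights (one per variable) from the preceding majorization:

```python
import numpy as np

def update_loadings(X, T, lam, d):
    """One vectorized pass over all loadings of all components: soft-threshold
    the numerators X'T by lam/2 and divide elementwise by 1 + d
    (D + I is diagonal, so no matrix inverse is needed)."""
    B = X.T @ T                                       # numerators sum_i x_ij t_ir
    P = np.sign(B) * np.maximum(np.abs(B) - lam / 2, 0.0)
    return P / (1.0 + d)[:, None]                     # scale per variable j
```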

Slide 35

• Algorithm: Weight-based variant (sparseness on the weights)
• Similar types of algorithms can be constructed
  • Standard MM:

    vec(Wconc) = (D + Pconc^T Pconc ⊗ Xconc^T Xconc)⁻¹ vec(Xconc^T Xconc Pconc)

  • Coordinate descent (cycling over all Σ_k Jk × R coefficients; sparse group lasso case): each weight w_jrk is updated by a soft thresholding expression of the same form as the loading update. The expression for an individual coefficient is not very expensive, but it has to be calculated many times.

Slide 36

• Algorithm
  • The coordinate-wise approach allows fixing coefficients
  • This can be used to define the specific components by fixing to zero the coefficients corresponding to the blocks not accounted for by the component

Slide 37

Model selection
• Number of components R, penalty tuning parameters, and/or status (common or specific): how to select suitable values?
• No definite answer, only suggestions:
  • R: inspect the VAF (variance accounted for)
  • Use stability selection to tune the variable selection (λ_L)
  • Cross-validation
  • An exhaustive strategy for the status if the number of blocks & components is limited
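A minimal sketch of the VAF inspection for choosing R (tuning λ_L via stability selection or cross-validation would wrap the full sparse algorithm and is omitted here); the scree-style reading is a suggestion, not a fixed rule:

```python
import numpy as np

def cumulative_vaf(Xconc, R_max):
    """Variance accounted for by the first R components, R = 1, ..., R_max;
    look for an elbow in this scree-type sequence when choosing R."""
    S = np.linalg.svd(Xconc, compute_uv=False)   # singular values only
    ev = S ** 2                                  # variance per component
    return np.cumsum(ev[:R_max]) / ev.sum()
```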

Slide 38

Some results

Slide 39

Weights or loadings: Simulation results
• Data were generated with either a sparse W or a sparse P
• All data were analyzed both with a model with sparsity imposed on W and with sparsity imposed on P
• Recovery was best when the data were generated with sparse loadings and analyzed with sparse weights
• More direct link between the reproduced data and the loadings

Slide 40

Structured sparsity needed or just lasso?
[Figure: Tucker congruence between W_TRUE and the estimated W]

Slide 41

Discussion

Slide 42

• We presented a generalization of sparse PCA to the multiblock case
• Sparsity can be imposed accounting for the block structure
• Sparsity can be imposed either on the weights or on the loadings
  Note: recovery was best for data generated with sparse loadings and estimated with sparse weights
• Different algorithmic strategies are possible: here, MM and coordinate descent were considered

                     Fit   Stability estimates   Comput. efficiency
  Sparse weights      +             -                     -
  Sparse loadings     -             +                     +

Slide 43

• Planned developments / in progress
  • Selection of relevant clusters of correlated variables (to deal with the issues of high-dimensionality that haunt the lasso, elastic net, and so on)
  • Prediction by extensions of principal covariates regression:

    L(W, p_y, P_X) = α ||y − X W p_y||² / ||y||² + (1 − α) ||X − X W P_X^T||² / ||X||² + λ₁ ||W||₁ + λ₂ ||W||₂²

    with 0 ≤ α ≤ 1 (α = 0 -> PCR; α = 1 -> MLR/RRR)
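A sketch evaluating this weighted loss for given estimates (all symbols as in the formula above; the shapes are placeholder assumptions):

```python
import numpy as np

def spcovr_loss(X, y, W, p_y, P_X, alpha, lam1, lam2):
    """Weighted PCovR loss with lasso (lam1) and ridge (lam2) penalties on W.
    X: I x J predictors, y: length-I outcome, W: J x R weights,
    p_y: length-R regression weights, P_X: J x R loadings."""
    fit_y = ((y - X @ W @ p_y) ** 2).sum() / (y ** 2).sum()
    fit_X = ((X - X @ W @ P_X.T) ** 2).sum() / (X ** 2).sum()
    return (alpha * fit_y + (1 - alpha) * fit_X
            + lam1 * np.abs(W).sum() + lam2 * (W ** 2).sum())
```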

Slide 44

• SPCovR application: find an early (day 3) genetic signature that predicts flu vaccine efficacy (day 28) (public data: Nakaya et al., 2011)
[Figure: data layout — predictor block X and outcome y for respondents 1 … 24]

Slide 45

• Comparison of results with sparse PLS

Slide 46

• SPCovR: Biological content of the selected transcripts
• Sparse PLS: no (significant) terms found

Slide 47

Thank you!

Slide 48

• I am looking for a PhD candidate!!