Katrijn Van Deun

Big Data in the Social Sciences: Statistical methods for multi-source high-dimensional data
Katrijn Van Deun, Tilburg University

Outline
• Introduction & Motivation
• Methods
• Model
• Objective function
• Estimation
• Model Selection
• Conclusion & Discussion

Introduction

Big Data in the Social Sciences
• Everything is measured
  – What we think and do: social media, web browsing behavior
  – Where we are and with whom: GPS tracking, cameras
  – At a very detailed level: neuron, DNA
• Data are shared
  – Open data: in science, governments (open government data)
• Data are linked
  – Government
  – Science: multi-disciplinary
  – Linked Data web architecture

• Illustration: Health & Retirement Study; traditional data
[Figure: data matrix with respondents 1 … 10 000 as rows and survey data (variables 1 … 500) as columns, next to the outcome well-being]

• Illustration: Health & Retirement Study; traditional + novel type of data
• Multiple sources, heterogeneous in nature
• High-dimensional (p >> n)
[Figure: "Big Data" matrix with respondents 1 … 1000 as rows; survey data in columns 1 … 500 and a novel kind of data in columns 501 … 50 000, next to the outcome well-being]

• Illustration: ALSPAC household panel data
• Multiple sources / multi-block (heterogeneous)
• High-dimensional
[Figure: multi-block data with respondents as rows]

Extremely information-rich data
• 1) Adds context and detail => deeper understanding + more accurate prediction
  E.g., same income, social network, and health, but a difference in well-being?
• 2) Gives insight into the interplay between multiple factors
  E.g., gene-environment interactions: find (epi-)genetic markers that make someone susceptible to obesity, together with the protective/risk-provoking environmental conditions associated with these markers
However, statistical tools fall short …


• First challenge = Data fusion?
• How to guarantee that the different sources of variation complement each other?
• Find common sources of structural variation?
• Yet, the data sources are heterogeneous; common sources of variation are often subtle, while source-specific variation is dominant

• Second challenge = Relevant variables?
• Find relevant variables
  – Information may be hidden in a bulk of irrelevant variables, e.g., genetic markers for obesity hidden in a bulk of irrelevant markers
  – Interpretation based on many variables is not very insightful
  – Note: in general, we do not know which variables are relevant and which are not
• State of the art: penalties, e.g., the lasso
• However, the lasso selects only one variable out of a group of correlated variables. Such methods
  – (i) have a high risk of not selecting the most relevant variable,
  – (ii) are highly unstable, and
  – (iii) tend to also select variables from irrelevant groups

Method: Sparse common and specific components

Method: Sparse common & specific components
• Structured analysis
• Data fusion: common components
• Detection of relevant variables: penalties (e.g., lasso) => selection of linked variables (between blocks)
[Figure: respondents 1 … 1000 as rows; survey data in columns 1 … 500 and a novel kind of data in columns 501 … 50 000, next to the outcome well-being]

• Notation and naming conventions
• Data block: denotes one of the different data sources forming the multiblock data
• $X_k$: data block $k$ (with $k = 1, \ldots, K$); the outcome(s) is denoted by $Y$ ($y$ if univariate)
• Each of the data blocks has the same set of observation units (respondents)
[Figure: respondents 1 … 1000 as rows; $X_1$: survey data (columns 1 … 500), $X_2$: novel kind of data (columns 501 … 50 000), $Y$: well-being]

Method: Data fusion
• Point of departure = Simultaneous component analysis (SCA)
  – Extension of PCA to the multiblock case
  – Promising method for data integration

• Principal component models
• Weight based variant:
$$X_k = X_k W_k P_k^T + E_k \quad \text{s.t. } W_k^T W_k = I, \qquad (1)$$
$$\phantom{X_k} = T_k P_k^T + E_k$$
with $W_k$ ($J_k \times R$) the component weights, $T_k$ ($I \times R$) the component scores, and $P_k$ ($J_k \times R$) the component loadings
• Interpretation of component $t_{rk}$ is based on $J_k$ (!) regression weights: $t_{irk} = \sum_j w_{jrk} x_{ijk}$
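
As a minimal sketch of the weight based variant (1) for a single block, the weights, scores and loadings can be obtained from a truncated SVD. The random data, the dimensions and the variable names are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))   # hypothetical block: I = 50 respondents, J = 8 variables
X -= X.mean(axis=0)                # column-center the block

R = 2                              # number of components (assumed)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:R].T                       # component weights (J x R), with W'W = I
T = X @ W                          # component scores (I x R): t_ir = sum_j w_jr x_ij
P = W                              # least-squares loadings coincide with the weights here
E = X - T @ P.T                    # residuals of the rank-R model
```

Each score is a weighted sum over all J variables, which is why interpretation becomes hard when J is large.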

• Principal component models
• Loading based variant:
$$X_k = T_k P_k^T + E_k \quad \text{s.t. } T_k^T T_k = I \qquad (2)$$
• Interpretation of component $t_{rk}$ is based on $J_k$ (!) correlations: $r(x_{jk}, t_{rk}) = p_{jrk}$
• Note: in a least squares approach subject to $P_k^T P_k = I$, we have $W_k = P_k$

• Simultaneous component analysis
For all $k$:
$$X_k = T P_k^T + E_k \quad \text{s.t. } T^T T = I \qquad (3)$$
-> same component scores for all data blocks!
1. Weight based variant
$$[X_1 \ldots X_K] = T\,[P_1^T \ldots P_K^T] + [E_1 \ldots E_K]$$
$$= [X_1 \ldots X_K]\,[W_1^T \ldots W_K^T]^T\,[P_1^T \ldots P_K^T] + [E_1 \ldots E_K]$$
or, in shorthand notation:
$$X_{conc} = X_{conc} W_{conc} P_{conc}^T + E_{conc} \qquad (4)$$
2. Loading based variant
$$X_{conc} = T P_{conc}^T + E_{conc} \qquad (5)$$
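
A small sketch of the loading based variant (5), assuming two hypothetical blocks measured on the same respondents: the blocks are concatenated column-wise and the shared scores are taken from the SVD of the concatenated data. Block sizes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
I, R = 40, 2                          # respondents and components (assumed)
X1 = rng.standard_normal((I, 5))      # hypothetical block 1 (J1 = 5)
X2 = rng.standard_normal((I, 7))      # hypothetical block 2 (J2 = 7)
X_conc = np.hstack([X1, X2])          # [X1 X2], same rows in every block
X_conc -= X_conc.mean(axis=0)

U, s, Vt = np.linalg.svd(X_conc, full_matrices=False)
T = U[:, :R]                          # shared component scores, T'T = I
P_conc = X_conc.T @ T                 # least-squares loadings given T
P1, P2 = P_conc[:5], P_conc[5:]       # loadings per block
```

The same `T` is used for every block; only the loadings `P1` and `P2` are block-specific.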

• Simultaneous components
• Guarantee the same source of variation for the different data blocks
• But … do not guarantee that the components are common!
• Explanation: SC methods model the largest source of variation in the concatenated data; often this is source-specific variation, as common variation is subtle (e.g., response tendencies and general socio-behavioral or genetic processes vs. subtle gene-environment interactions)

• Common component analysis (CCA)
  – Accounts for dominating source-specific variation
  – Common component model: $X_k = T P_k'$ such that $T'T = I$ and $P_k$ has common / specific structure
  – Structure can be imposed (constrained analysis)
  – Or: $X_k = X_{conc} W_{conc} P_k'$ with $W_{conc}$ having common / specific structure

            P1        P2
  Common    x x x     x x x x
  Spec1     x x x     0 0 0 0
  Spec2     0 0 0     x x x x

• So far, so good …
• Yet:
  – Interest in relevant / important variables (social factors, genetic markers)?
  – Interpretation of the components based on 1000s of loadings is infeasible
Need to automatically select the “relevant” variables! Sparse common components are needed.

• Structured sparsity:
  – Sparse common components: few non-zero loadings in each data block $X_k$
  – (Sparse) distinctive components: (few) non-zero loadings only in one/few data blocks $X_k$

            X1        X2
  Common    x 0 0     x x 0 x
  Dist1     x x x     0 0 0 0
  Dist2     0 0 0     x 0 x x

Variable selection: How to?
• Sparse analysis
  – Impose a restriction on the loadings: many should become zero
  – State of the art: add penalties (e.g., lasso) to the objective function

• Sparse SCA: Objective function
Add a penalty known to have variable selection properties to the SCA objective function:
Minimize over $T$ and $P_{conc}$, such that $T'T = I$,
$$L(T, P_{conc}) = \underbrace{\| X_{conc} - T P_{conc}' \|^2}_{\text{Fit / SCA}} + \underbrace{\sum_{r,k} \lambda_{r,k} \| p_{r,k} \|_1}_{\text{Penalty}}$$
with $\| p_{r,k} \|_1 = \sum_{j_k} | p_{j_k r} |$ the L1 penalty or lasso, tuned by $\lambda_{r,k} \ge 0$ -> shrinks and selects variables

• (Adaptive) lasso
• Oracle properties (under some conditions)
• Estimation: soft thresholding operator
$$b^{LASSO} = S(b^{OLS}, \lambda/2) = \begin{cases} b^{OLS} - \lambda/2 & \text{if } b^{OLS} > \lambda/2 \\ 0 & \text{if } -\lambda/2 \le b^{OLS} \le \lambda/2 \\ b^{OLS} + \lambda/2 & \text{if } b^{OLS} < -\lambda/2 \end{cases}$$
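
The soft thresholding operator can be written in one vectorized line. This is a small sketch; the function name and the example values are illustrative assumptions.

```python
import numpy as np

def soft_threshold(b, threshold):
    """Shrink each entry of b toward zero by `threshold`; entries within the threshold become zero."""
    return np.sign(b) * np.maximum(np.abs(b) - threshold, 0.0)

b_ols = np.array([-3.0, -0.4, 0.0, 0.5, 2.0])   # hypothetical OLS estimates
lam = 2.0                                        # lasso tuning parameter
b_lasso = soft_threshold(b_ols, lam / 2)         # S(b_OLS, lambda / 2)
```

Estimates with absolute value at most lambda/2 are set to zero; the others move toward zero by lambda/2.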

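• Sparse common and specific components: Objective function
Add penalties and/or constraints to obtain structured sparsity:
$$L(T, P_{conc}) = \underbrace{\| X_{conc} - T P_{conc}' \|^2}_{\text{Fit / SCA}} + \underbrace{\sum_{r,k} \lambda_{r,k} \| p_{r,k} \|_1 + \lambda_G \sum_{r,k} \| p_{r,k} \|_2 + \lambda_E \sum_{r,k} \| p_{r,k} \|_{1,2}}_{\text{Penalties}}$$
Penalties:
• Lasso: $\| p_{r,k} \|_1 = \sum_{j_k} | p_{j_k r} |$ and $\lambda_{r,k} \ge 0$ a tuning parameter
• Group lasso: $\| p_{r,k} \|_2 = \sqrt{\sum_{j_k} p_{j_k r}^2}$ and $\lambda_G \ge 0$ a tuning parameter -> kills blocks of variables
• Elitist lasso: $\| p_{r,k} \|_{1,2} = \big( \sum_{j_k} | p_{j_k r} | \big)^2$ and $\lambda_E \ge 0$ a tuning parameter -> kills variables within each block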
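
To make the three penalties concrete, the sketch below evaluates them on one hypothetical loading vector for a single block and component. It assumes the usual definitions: sum of absolute values (lasso), Euclidean norm (group lasso), and squared sum of absolute values (elitist lasso). The function names are illustrative.

```python
import numpy as np

def lasso_penalty(p):
    return np.sum(np.abs(p))           # sum_j |p_j|

def group_lasso_penalty(p):
    return np.sqrt(np.sum(p ** 2))     # sqrt(sum_j p_j^2)

def elitist_lasso_penalty(p):
    return np.sum(np.abs(p)) ** 2      # (sum_j |p_j|)^2

p = np.array([3.0, -4.0, 0.0])         # hypothetical loadings of one block on one component
values = (lasso_penalty(p), group_lasso_penalty(p), elitist_lasso_penalty(p))
```

The group lasso treats the block as a unit, so it tends to zero out whole blocks, while the elitist lasso induces competition among the variables inside a block.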

Generic objective function:
• Allows for combinations of penalties, some of which are known
• Penalties can be imposed per component
• Useful to find sparse common and specific components

• Algorithm: Alternating procedure
Given fixed tuning parameters and a fixed number of common and distinctive components, do
0. Initialize $P_{conc}$
1. Update $T$ conditional upon $P_{conc}$
   Closed form: $T = UV'$ with $U$ and $V$ from the SVD of $X_{conc} P_{conc}$ ($I \times R$ -> small for high-dimensional data)
2. Update $P_{conc}$ conditional upon $T$
   MM + UST: introduce a surrogate objective function and apply univariate soft thresholding to the surrogate (see next)
3. Check stop criteria (convergence of the loss, maximum number of iterations) and return to step 1 or terminate
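
The alternating procedure can be sketched as follows for the simplest case, a single lasso penalty with one tuning parameter `lam` for all loadings. In that case the update of the loadings given T needs no surrogate and reduces to soft thresholding of X'T. The function names, the SVD-based initialization and the random data are assumptions for illustration; group and elitist penalties are not covered.

```python
import numpy as np

def soft_threshold(a, threshold):
    return np.sign(a) * np.maximum(np.abs(a) - threshold, 0.0)

def loss(X, T, P, lam):
    return np.sum((X - T @ P.T) ** 2) + lam * np.sum(np.abs(P))

def sparse_sca(X, R, lam, max_iter=200, tol=1e-8):
    # Step 0: initialize loadings (and scores) from a truncated SVD
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    T, P = U[:, :R], Vt[:R].T * s[:R]
    history = [loss(X, T, P, lam)]
    for _ in range(max_iter):
        # Step 1: T = U V' from the SVD of X P (an I x R matrix)
        U, _, Vt = np.linalg.svd(X @ P, full_matrices=False)
        T = U @ Vt
        # Step 2: lasso update of the loadings given T (uses T'T = I)
        P = soft_threshold(X.T @ T, lam / 2)
        # Step 3: stop when the loss no longer decreases
        history.append(loss(X, T, P, lam))
        if history[-2] - history[-1] < tol:
            break
    return T, P, history

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 10))      # hypothetical concatenated data
X -= X.mean(axis=0)
T, P, history = sparse_sca(X, R=2, lam=4.0)
```

Both steps minimize the loss exactly in one set of parameters with the other fixed, so the values in `history` do not increase.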

• Algorithm: Conditional estimation of $P_{conc}$ given $T$
Complicated optimization problem because of the penalties => MM (majorization-minimization) procedure

• MM in brief
Find the minimum of $f(x)$ via a surrogate function $g(x, c)$ ($c$: constant) such that
  – $g(x, c)$ is easy to minimize
  – $g(x, c) \ge f(x)$
  – $g(c, c) = f(c)$ (supporting point)
Then $f(x_{min}) \le g(x_{min}, a) \le g(a, a) = f(a)$
=> the minimum of the previous iteration is the supporting point of the current iteration
Yields a non-increasing series of loss values
[Figure: loss as a function of $x$, showing the original function $f(x)$ and the MM function $g(x, c)$ with the first and second supporting points]
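
A toy illustration of the MM principle, under assumptions chosen for this sketch: the function f(x) = (x - b)^2 + lam |x| is majorized at the supporting point c by replacing |x| with x^2 / (2|c|) + |c| / 2, which is never below |x| and equals it at x = c. The surrogate is quadratic, so its minimizer has a closed form.

```python
import numpy as np

def f(x, b, lam):
    return (x - b) ** 2 + lam * abs(x)

def mm_minimize(b, lam, x0, n_iter=100):
    x = x0
    losses = [f(x, b, lam)]
    for _ in range(n_iter):
        c = x                                   # supporting point = previous minimizer
        # closed-form minimizer of g(x, c) = (x - b)^2 + lam * (x^2 / (2|c|) + |c| / 2)
        x = b / (1.0 + lam / (2.0 * abs(c)))
        losses.append(f(x, b, lam))
    return x, losses

x_min, losses = mm_minimize(b=3.0, lam=2.0, x0=3.0)
```

For these values the iterates approach b - lam/2 = 2, the soft thresholding solution, and the recorded losses never increase.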

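Example: constructing a majorizing function for the elitist lasso penalty
$$\Big( \sum_{j_k} | p_{j_k r} | \Big)^2 \le \Big( \sum_{j_k} | p_{j_k r}^{(0)} | \Big) \sum_{j_k} \frac{p_{j_k r}^2}{| p_{j_k r}^{(0)} |} = p' D_1 p$$
with $D_1$ a diagonal matrix with elements $\sum_{j_k'} | p_{j_k' r}^{(0)} | \, / \, | p_{j_k r}^{(0)} |$
Applied to the group, elitist and regular lasso, we obtain as a surrogate function
$$L \le k + \| X_{conc} - T P_{conc}' \|^2 + \mathrm{vec}(P_{conc}')' \, D \, \mathrm{vec}(P_{conc}')$$
$$= k + \| \mathrm{vec}(X_{conc}) - (I \otimes T) \, \mathrm{vec}(P_{conc}') \|^2 + \mathrm{vec}(P_{conc}')' \, D \, \mathrm{vec}(P_{conc}')$$
for which the root can be found using standard techniques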
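
The majorization of the elitist lasso penalty, (sum_j |p_j|)^2 <= p' D1 p with D1 built from a current estimate p0, follows from the Cauchy-Schwarz inequality. The sketch below checks it numerically on random vectors; the dimension and the random draws are assumptions for illustration, and p0 is assumed to have no exact zeros.

```python
import numpy as np

rng = np.random.default_rng(2)
p0 = rng.standard_normal(6)                      # current estimate (supporting point)
D1 = np.diag(np.sum(np.abs(p0)) / np.abs(p0))    # diagonal elements: sum_j' |p0_j'| / |p0_j|

def elitist(p):
    return np.sum(np.abs(p)) ** 2

def majorizer(p):
    return p @ D1 @ p

trials = rng.standard_normal((1000, 6))
gaps = np.array([majorizer(p) - elitist(p) for p in trials])   # should all be >= 0
```

The gap is zero at the supporting point itself, which is what makes `majorizer` a valid MM surrogate.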

Hence, the problem is now to find the minimum of
$$\| \mathrm{vec}(X_{conc}) - (I \otimes T) \, \mathrm{vec}(P_{conc}') \|^2 + \mathrm{vec}(P_{conc}')' \, D \, \mathrm{vec}(P_{conc}')$$
which can be found using standard techniques:
$$\mathrm{vec}(P_{conc}') = (D + I)^{-1} \, \mathrm{vec}(T' X_{conc})$$
Note that $D + I$ is diagonal
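
Because D + I is diagonal, the minimizer of the quadratic surrogate requires only an elementwise division. The sketch below compares a direct solve of the normal equations, using the explicit Kronecker product, with the elementwise shortcut. Sizes, the random D and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
I, J, R = 30, 6, 2
X = rng.standard_normal((I, J))
T, _ = np.linalg.qr(rng.standard_normal((I, R)))   # scores with T'T = I
d = rng.uniform(0.1, 2.0, size=J * R)              # diagonal of D, ordered as vec(P')

# Direct solve with the explicit Kronecker design matrix (costly for large J)
A = np.kron(np.eye(J), T)                          # (I*J) x (J*R)
x_vec = X.flatten(order="F")                       # vec(X): stack the columns
p_direct = np.linalg.solve(A.T @ A + np.diag(d), A.T @ x_vec)

# Same solution by elementwise division, since A'A = I and D + I is diagonal
p_fast = (T.T @ X).flatten(order="F") / (d + 1.0)
```

The shortcut never forms the Kronecker product, which matters when the number of variables is large.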

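Alternative approach: Coordinate descent
$$L \le k + \| X_{conc} - T P_{conc}' \|^2 + \mathrm{vec}(P_{conc}')' \, D^* \, \mathrm{vec}(P_{conc}') + \sum_{r,k} \lambda_{r,k} \| p_{r,k} \|_1$$
$$= k + \| \mathrm{vec}(X_{conc}) - (I \otimes T) \, \mathrm{vec}(P_{conc}') \|^2 + \mathrm{vec}(P_{conc}')' \, D^* \, \mathrm{vec}(P_{conc}') + \sum_{r,k} \lambda_{r,k} \| p_{r,k} \|_1$$
$$= k + \sum_{j_k, r} \Big[ \Big( \sum_i t_{ir} x_{i j_k} - p_{j_k r} \Big)^2 + d^*_{j_k r} \, p_{j_k r}^2 + \lambda_{r,k} | p_{j_k r} | \Big]$$
for which the root can be found using subgradient techniques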

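• Hence, the following soft thresholding update of the loadings:
$$p_{j_k r} = \begin{cases} \dfrac{\sum_i t_{ir} x_{i j_k} - \lambda_{r,k}/2}{1 + d^*_{j_k r}} & \text{if } \sum_i t_{ir} x_{i j_k} > \lambda_{r,k}/2 \ \ (\text{so that } p_{j_k r} > 0) \\[2ex] \dfrac{\sum_i t_{ir} x_{i j_k} + \lambda_{r,k}/2}{1 + d^*_{j_k r}} & \text{if } \sum_i t_{ir} x_{i j_k} < -\lambda_{r,k}/2 \ \ (\text{so that } p_{j_k r} < 0) \\[2ex] 0 & \text{else} \end{cases}$$
which can be calculated for all loadings of component $r$ simultaneously using simple vector and matrix operations!
=> Highly efficient (time + memory), scalable to large data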
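
The soft thresholding update of the loadings can be sketched with matrix operations only. The sketch assumes orthonormal scores (T'T = I), a scalar lasso parameter `lam`, and a matrix `d_star` holding the diagonal elements of the majorizing matrix per loading; these names and the random data are illustrative assumptions.

```python
import numpy as np

def update_loadings(X, T, lam, d_star):
    """Update all loadings at once: soft threshold X'T at lam/2, then rescale."""
    A = X.T @ T                                          # a_jr = sum_i t_ir x_ij
    shrunk = np.sign(A) * np.maximum(np.abs(A) - lam / 2, 0.0)
    return shrunk / (1.0 + d_star)

def surrogate_loss(X, T, P, lam, d_star):
    """Separable surrogate: sum over (j, r) of (a - p)^2 + d* p^2 + lam |p|."""
    A = X.T @ T
    return np.sum((A - P) ** 2 + d_star * P ** 2 + lam * np.abs(P))

rng = np.random.default_rng(5)
X = rng.standard_normal((20, 6))
T, _ = np.linalg.qr(rng.standard_normal((20, 2)))       # scores with T'T = I
d_star = rng.uniform(0.0, 1.0, size=(6, 2))
lam = 1.5
P = update_loadings(X, T, lam, d_star)
```

No loop over variables is needed, so the cost per update is dominated by the single product `X.T @ T`.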

• Algorithm: Weight based variant (sparseness on weights)
• Similar types of algorithms can be constructed
  – Standard MM: closed-form update of the vectorized weights, which requires the inverse of a matrix that involves a Kronecker product ($\otimes$)
  – Coordinate descent (cycle over all coefficients; sparse group lasso case): each coefficient $w_{j^* r^*}$ is updated with a soft thresholding expression of the same form as for the loadings: shrunk by $\lambda/2$ and rescaled if $w_{j^* r^*} > 0$ or $w_{j^* r^*} < 0$, and set to 0 otherwise
• The expression for an individual coefficient is not very expensive, but it has to be calculated many times.

• Algorithm
• The coordinate-wise approach allows coefficients to be fixed
• This can be used to define the specific components, by fixing to zero the coefficients corresponding to the block not accounted for by the component

Model selection
• Number of components $R$, penalty tuning parameters and/or status (common or specific):
  – How to select suitable values?
  – No definite answer, only suggestions
• $R$: inspect the variance accounted for (VAF)
• Use stability selection to tune variable selection ($\lambda_L$)
• Cross-validation
• Exhaustive strategy for the status if the number of blocks and components is limited
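
As a small sketch of inspecting the VAF to choose R, the cumulative proportion of variance accounted for can be read off the singular values of the (concatenated) data. The simulated rank-3 data and the noise level are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((60, 3)) @ rng.standard_normal((3, 12))   # hypothetical rank-3 signal
X += 0.1 * rng.standard_normal(X.shape)                           # small amount of noise
X -= X.mean(axis=0)

s = np.linalg.svd(X, compute_uv=False)
vaf = np.cumsum(s ** 2) / np.sum(s ** 2)    # VAF for R = 1, 2, ... components
```

In this simulated case the VAF curve levels off after three components, which is the kind of elbow one would look for.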

Some results

Weights or loadings: Simulation results
• Data generated with either sparse $W$ or sparse $P$
• All data analyzed both with a model with sparsity imposed on $W$ and with a model with sparsity imposed on $P$
• Recovery best when generated with sparse loadings and analyzed with sparse weights
• More direct link between reproduced data and loadings

Structured sparsity needed or just lasso?
[Figure: Tucker congruence between $W_{TRUE}$ and estimated $W$]
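
For reference, Tucker's congruence coefficient between two vectors is their inner product divided by the product of their norms. The sketch below computes it per component and averages over components; how signs and the order of the components were matched in the reported results is not stated here, so that part is left out. Function names and the example matrix are illustrative assumptions.

```python
import numpy as np

def tucker_congruence(a, b):
    """Congruence between two vectors: a'b / sqrt(a'a * b'b)."""
    a, b = np.ravel(a), np.ravel(b)
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def mean_congruence(W_true, W_est):
    """Average congruence over matching columns (components)."""
    return float(np.mean([tucker_congruence(W_true[:, r], W_est[:, r])
                          for r in range(W_true.shape[1])]))

W_true = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])   # hypothetical sparse weights
W_est = 3.0 * W_true                                       # same pattern, different scale
```

The coefficient is insensitive to rescaling of a component, so it measures agreement in the pattern of the weights rather than in their size.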

DISCUSSION

• We presented a generalization of sparse PCA to the multiblock case
• Sparsity can be imposed accounting for the block structure
• Sparsity can be imposed either on the weights or on the loadings
  Note: best recovery for data generated with sparse loadings and estimated with sparse weights
• Different algorithmic strategies are possible: here MM and coordinate descent were considered

                    Fit    Stability of estimates    Computational efficiency
  Sparse weights     +              -                           -
  Sparse loadings    -              +                           +

• Planned developments / in progress
• Selection of relevant clusters of correlated variables (to deal with issues of high-dimensionality that haunt lasso, elastic net and so on)
• Prediction by extensions of principal covariates regression
$$L(W, P_X, p_y) = \alpha \frac{\| y - X W p_y \|^2}{\| y \|^2} + (1 - \alpha) \frac{\| X - X W P_X^T \|^2}{\| X \|^2} + \lambda_1 \| W \|_1 + \lambda_2 \| W \|_2^2$$
with $0 \le \alpha \le 1$ ($\alpha = 0$ -> PCR; $\alpha = 1$ -> MLR/RRR)

• SPCovR application: find an early (day 3) genetic signature that predicts flu vaccine efficacy (day 28) (public data: Nakaya et al., 2011)
[Figure: predictor matrix $X$ and outcome $y$ for subjects 1 … 24]

• Comparison of results with sparse PLS

• SPCovR: Biological content of selected transcripts
• Sparse PLS: no (significant) terms found

Thank you!

• I am looking for a PhD candidate!
