
S³ Seminar
October 06, 2017

Katrijn Van Deun (Tilburg University, the Netherlands)

https://s3-seminar.github.io/seminars/katrijn-van-deun

Title — Big Data in the Social Sciences: Statistical methods for multi-source high-dimensional data

Abstract — Research in the behavioural and social sciences has entered the era of big data: Many detailed measurements are taken and multiple sources of information are used to unravel complex multivariate relations. For example, in studying obesity as the outcome of environmental and genetic influences, researchers increasingly collect survey, dietary, biomarker and genetic data from the same individuals. Although linked more-variables-than-samples (so-called high-dimensional) multi-source data form an extremely rich resource for research, extracting meaningful and integrated information is challenging and not appropriately addressed by current statistical methods. A first problem is that relevant information is hidden in a bulk of irrelevant variables, with a high risk of finding incidental associations. Second, the sources are often very heterogeneous, which may obscure the shared mechanisms linking them. In this presentation we will discuss the challenges associated with the analysis of large-scale multi-source data and present state-of-the-art statistical approaches to address them.

Transcript

  1. Big Data in the Social Sciences: Statistical methods for multi-source high-dimensional data. Katrijn Van Deun, Tilburg University.
  2. Outline
     • Introduction & Motivation
     • Methods
       • Model
       • Objective function
       • Estimation
       • Model Selection
     • Conclusion & Discussion
  3. Big Data in the Social Sciences
     • Everything is measured
       • What we think and do: social media, web browsing behavior
       • Where we are and with whom: GPS tracking, cameras
       • At a very detailed level: neuron, DNA
     • Data are shared
       • Open data: in science, governments (open government data)
     • Data are linked
       • Government
       • Science: multi-disciplinary
       • Linked Data web architecture
  4. Illustration: Health & Retirement Study; traditional data
     [Figure: data matrix with 10 000 respondents as rows, survey data in columns 1 … 500, and a well-being outcome]
  5. Illustration: Health & Retirement Study; traditional + novel type of data
     • Multiple sources, heterogeneous in nature
     • High-dimensional (p >> n)
     [Figure: Big Data layout with 1000 respondents as rows; survey data in columns 1 … 500, a novel kind of data in columns 501 … 50 000, and a well-being outcome]
  6. Illustration: ALSPAC household panel data
     • Multiple sources / multi-block (heterogeneous)
     • High-dimensional
     [Figure: multi-block data layout with respondents as rows]
  7. Extremely information-rich data
     • 1) Adds context and detail => deeper understanding + more accurate prediction
       E.g., same income, social network, and health, but a difference in well-being?
     • 2) Gives insight into the interplay between multiple factors
       E.g., gene-environment interactions: find (epi-)genetic markers that make someone susceptible to obesity, together with the protective/risk-provoking environmental conditions associated with these markers
     However, statistical tools fall short …
  8. First challenge = Data fusion
     • How to guarantee that the different sources of variation complement each other?
     • Find common sources of structural variation?
     • Yet the data sources are heterogeneous; common sources of variation are often subtle, while source-specific variation is dominant
  9. Second challenge = Relevant variables
     • Find relevant variables
       • Information may be hidden in a bulk of irrelevant variables, e.g., genetic markers for obesity hidden in a bulk of irrelevant markers
       • Interpretation based on many variables is not very insightful
       • Note: in general, we do not know which variables are relevant and which are not
     • State of the art: penalties, e.g., the lasso
       • However, the lasso selects only one variable out of a group of correlated variables; such selections (i) have a high risk of not selecting the most relevant variable, (ii) are highly unstable, and (iii) tend to also select variables from irrelevant groups
  10. Method: Sparse common & specific components
     • Structured analysis
       • Data fusion: common components
       • Detection of relevant variables: penalties (e.g., lasso) => selection of linked variables (between blocks)
     [Figure: 1000 respondents as rows; survey data in columns 1 … 500, a novel kind of data in columns 501 … 50 000, and a well-being outcome]
  11. Notation and naming conventions
     • Data block: denotes the different data sources forming the multi-block data
     • $X_k$: data block k (with k = 1, …, K); the outcome(s) is denoted by Y (y if univariate)
     • Each of the data blocks covers the same set of observation units (respondents)
     [Figure: 1000 respondents as rows; $X_1$: survey data in columns 1 … 500, $X_2$: novel kind of data in columns 501 … 50 000, Y: well-being]
  12. Method: Data fusion
     • Point of departure = simultaneous component analysis (SCA)
       – Extension of PCA to the multi-block case
       – Promising method for data integration
  13. Principal component models
     • Weight-based variant:
       $X_k = X_k W_k P_k^T + E_k$ s.t. $W_k^T W_k = I$, (1)
       $\quad\;\, = T_k P_k^T + E_k$
       with $W_k$ ($J_k \times R$) the component weights, $T_k$ ($I \times R$) the component scores, and $P_k$ ($J_k \times R$) the component loadings
     • Interpretation of component $t_{rk}$ based on $J_k$ (!) regression weights: $t_{irk} = \sum_j w_{jrk} x_{ijk}$
  14. Principal component models
     • Loading-based variant:
       $X_k = T_k P_k^T + E_k$ s.t. $T_k^T T_k = I$, (2)
     • Interpretation of component $t_{rk}$ based on $J_k$ (!) correlations: $r(x_{jk}, t_{rk}) = p_{jrk}$
     • Note: in a least-squares approach subject to $P_k^T P_k = I$, we have $W_k = P_k$
  15. Simultaneous component analysis
     For all k: $X_k = T P_k^T + E_k$ s.t. $T^T T = I$ (3)
     -> same component scores for all data blocks!
     1. Weight-based variant:
        $[X_1 \ldots X_K] = T [P_1^T \ldots P_K^T] + [E_1 \ldots E_K]$
        $\quad = [X_1 \ldots X_K][W_1^T \ldots W_K^T]^T [P_1^T \ldots P_K^T] + [E_1 \ldots E_K]$
        or, in shorthand notation: $X_{conc} = X_{conc} W_{conc} P_{conc}^T + E_{conc}$ (4)
     2. Loading-based variant: $X_{conc} = T P_{conc}^T + E_{conc}$ (5)
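As an illustration of equation (5), a minimal sketch (an assumed implementation, not the authors' code) that fits loading-based SCA through an SVD of the concatenated data:

```python
import numpy as np

def sca_loading_based(blocks, R):
    """Loading-based SCA: fit X_conc = T P_conc^T with T^T T = I.

    blocks: list of (I x J_k) arrays sharing the same rows (respondents).
    Returns common scores T and one loading matrix P_k per block.
    """
    X_conc = np.hstack(blocks)                      # I x sum(J_k)
    U, s, Vt = np.linalg.svd(X_conc, full_matrices=False)
    T = U[:, :R]                                    # orthonormal common scores
    P_conc = Vt[:R].T * s[:R]                       # concatenated loadings
    # Split the concatenated loadings back into one P_k per block.
    idx = np.cumsum([b.shape[1] for b in blocks])[:-1]
    return T, np.split(P_conc, idx, axis=0)

rng = np.random.default_rng(1)
X1, X2 = rng.standard_normal((20, 6)), rng.standard_normal((20, 300))
T, (P1, P2) = sca_loading_based([X1, X2], R=2)
```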
  16. Simultaneous components
     • Guarantee the same source of variation for the different data blocks
     • But … do not guarantee that the components are common!
     • Explanation: SC methods model the largest source of variation in the concatenated data; often this is source-specific variation, as common variation is subtle (e.g., response tendencies and general socio-behavioral or genetic processes vs. subtle gene-environment interactions)
  17. Common component analysis (CCA)
     – Accounts for dominating source-specific variation
     – Common component model: $X_k = T P_k^T$ such that $T^T T = I$ and $P_k$ has a common/specific structure
     – Structure can be imposed (constrained analysis)
     – Or: $X_k = X_{conc} W_{conc} P_k^T$ with $W_{conc}$ having a common/specific structure

                P1       P2
       Common   x x x    x x x x
       Spec1    x x x    0 0 0 0
       Spec2    0 0 0    x x x x
  18. So far, so good …
     • Yet: interest in relevant/important variables (social factors, genetic markers)?
     • Interpretation of the components based on 1000s of loadings is infeasible
     => Need to automatically select the "relevant" variables! Sparse common components are needed.
  19. Structured sparsity
     • Sparse common components: few non-zero loadings in each data block $X_k$
     • (Sparse) distinctive components: (few) non-zero loadings in only one/a few data blocks $X_k$

                X1       X2
       Common   x 0 0    x x 0 x
       Dist1    x x x    0 0 0 0
       Dist2    0 0 0    x 0 x x
  20. Variable selection: How to?
     • Sparse analysis
       – Impose a restriction on the loadings: many should become zero
       – State of the art: add penalties (e.g., the lasso) to the objective function
  21. Sparse SCA: Objective function
     Add a penalty known to have variable-selection properties to the SCA objective function. Minimize over $T$ and $P_{conc}$, subject to $T^T T = I$:
     $\|X_{conc} - T P_{conc}^T\|^2 + \sum_{r,k} \lambda_{r,k} \|p_{r,k}\|_1$
     (first term: fit / SCA; second term: penalty)
     with $\|p_{r,k}\|_1 = \sum_j |p_{jrk}|$ the L1 penalty or lasso, tuned by $\lambda_{r,k} \ge 0$
     -> shrinks and selects variables
  22. (Adaptive) lasso
     • Oracle properties (under some conditions)
     • Estimation: soft-thresholding operator $S(b^{OLS}, \lambda/2)$:
       $b^{LASSO} = b^{OLS} - \lambda/2$ if $b^{OLS} > \lambda/2$
       $b^{LASSO} = 0$ if $-\lambda/2 < b^{OLS} < \lambda/2$
       $b^{LASSO} = b^{OLS} + \lambda/2$ if $b^{OLS} < -\lambda/2$
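A one-line NumPy version of this operator (my own sketch; the half-penalty convention follows the slide):

```python
import numpy as np

def soft_threshold(b, lam):
    """Soft-thresholding operator S(b, lam/2): shrink toward zero by lam/2
    and set coefficients with |b| <= lam/2 exactly to zero."""
    return np.sign(b) * np.maximum(np.abs(b) - lam / 2, 0.0)

b_ols = np.array([-1.5, -0.2, 0.1, 0.9])
print(soft_threshold(b_ols, lam=1.0))  # [-1. -0.  0.  0.4]
```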
  23. Sparse common and specific components: Objective function
     Add penalties and/or constraints to obtain structured sparsity:
     $\|X_{conc} - T P_{conc}^T\|^2 + \sum_{r,k} \lambda_{r,k} \|p_{r,k}\|_1 + \sum_r \sum_k \lambda^G_{r,k} \|p_{r,k}\|_2 + \sum_{r,k} \lambda^E_{r,k} \|p_{r,k}\|_{1,2}$
     (first term: fit / SCA; remaining terms: penalties)
     Penalties:
     • Lasso: $\|p_{r,k}\|_1 = \sum_j |p_{jrk}|$, with $\lambda_{r,k} \ge 0$ a tuning parameter
     • Group lasso: $\|p_{r,k}\|_2 = \sqrt{\sum_j p_{jrk}^2}$, with a tuning parameter $\ge 0$; kills blocks of variables
     • Elitist lasso: $\|p_{r,k}\|_{1,2} = (\sum_j |p_{jrk}|)^2$, with a tuning parameter $\ge 0$; kills variables within each block
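A sketch of the three penalty terms for one component, using the standard definitions (the exact norms above are partly reconstructed, so treat these as assumptions):

```python
import numpy as np

def lasso_penalty(p_blocks, lams):
    # Sum of absolute loadings per block: kills individual variables.
    return sum(l * np.abs(p).sum() for p, l in zip(p_blocks, lams))

def group_lasso_penalty(p_blocks, lams):
    # L2 norm per block: can kill whole blocks of variables.
    return sum(l * np.linalg.norm(p) for p, l in zip(p_blocks, lams))

def elitist_lasso_penalty(p_blocks, lams):
    # Squared L1 norm per block: encourages sparsity within each block.
    return sum(l * np.abs(p).sum() ** 2 for p, l in zip(p_blocks, lams))

p1, p2 = np.array([0.5, 0.0, -0.3]), np.zeros(4)
print(group_lasso_penalty([p1, p2], [1.0, 1.0]))  # second block contributes 0
```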
  24. Generic objective function
     • Allows for combinations of penalties, some known
     • Penalties can be imposed per component
     • Useful to find sparse common and specific components
  25. Algorithm: Alternating procedure
     Given fixed tuning parameters and the number of common and distinctive components, do:
     0. Initialize $P_{conc}$
     1. Update T conditional upon $P_{conc}$
        Closed form: $T = UV^T$ with U and V from the SVD of $X_{conc} P_{conc}$ ($I \times R$ -> small for high-dimensional data)
     2. Update $P_{conc}$ conditional upon T
        MM + UST: introduce a surrogate objective function and apply univariate soft thresholding to the surrogate (see next)
     3. Check the stop criteria (convergence of the loss, maximum number of iterations) and return to step 1 or terminate
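Step 1 is the classical orthogonal Procrustes update; a minimal sketch (assumed implementation):

```python
import numpy as np

def update_scores(X_conc, P_conc):
    """Closed-form update of T under T^T T = I: T = U V^T,
    with U, V from the small (I x R) SVD of X_conc @ P_conc."""
    U, _, Vt = np.linalg.svd(X_conc @ P_conc, full_matrices=False)
    return U @ Vt
```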
  26. Algorithm: Conditional estimation of $P_{conc}$ given T
     Complicated optimization problem because of the penalties => MM procedure
  27. MM in brief
     Find the minimum of f(x) via a surrogate function g(x, c) (c: constant):
     – g(x, c) is easy to minimize
     – g(x, c) ≥ f(x)
     – g(c, c) = f(c) (supporting point)
     $f(x_{min}) \le g(x_{min}, a) \le g(a, a) = f(a)$ => the minimum of the previous iteration is the supporting point of the current iteration
     Yields a non-increasing series of loss values
     [Figure: original function f(x) and majorizing function g(x, c), with the first and second supporting points]
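A generic MM loop looks like this (a schematic sketch; `minimize_surrogate` is a hypothetical stand-in for the closed-form surrogate minimizers on the next slides):

```python
def mm_minimize(f, minimize_surrogate, x0, max_iter=100, tol=1e-8):
    """Generic MM scheme: each iteration minimizes a surrogate g(x, c)
    that majorizes f and touches it at the current point c."""
    x, loss = x0, f(x0)
    for _ in range(max_iter):
        x_new = minimize_surrogate(x)   # argmin_x g(x, c=x), closed form
        loss_new = f(x_new)
        if loss - loss_new < tol:       # non-increasing by construction
            break
        x, loss = x_new, loss_new
    return x
```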
  28. Example: constructing a majorizing function for the elitist lasso penalty
     $(\sum_j |p_j|)^2 \le (\sum_j p_j^2 / |p_j^{(0)}|)(\sum_j |p_j^{(0)}|) = p^T D_1 p$
     with $D_1$ a diagonal matrix with entries $\sum_{j'} |p_{j'}^{(0)}| / |p_j^{(0)}|$
     Applied to the group, elitist, and regular lasso, we obtain as a surrogate function:
     $L \le k + \|X - T P^T\|^2 + vec(P)^T D \, vec(P)$
     $\quad = k + \|vec(X) - (I \otimes T) vec(P)\|^2 + vec(P)^T D \, vec(P)$
     for which the root can be found using standard techniques
  29. Hence, the problem is now to find the minimum of
     $\|vec(X) - (I \otimes T) vec(P)\|^2 + vec(P)^T D \, vec(P)$
     which can be found using standard techniques:
     $vec(P) = (D + I)^{-1} vec(X^T T)$
     Note that $D + I$ is diagonal
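Because $D + I$ is diagonal, this update reduces to an element-wise scaling of the unpenalized least-squares loadings; a sketch (assumed implementation):

```python
import numpy as np

def update_loadings_mm(X, T, d):
    """Surrogate minimizer for P given T (with T^T T = I): since D + I is
    diagonal, vec(P) = (D + I)^{-1} vec(X^T T) is an element-wise division.

    d: diagonal entries of D, arranged in the same J x R shape as P.
    """
    P_ls = X.T @ T           # unpenalized least-squares loadings
    return P_ls / (d + 1.0)  # element-wise shrinkage from the MM surrogate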
  30. Alternative approach: Coordinate descent
     $L \le k + \|X - T P^T\|^2 + vec(P)^T D^* vec(P) + \sum_{r,k} \lambda_{r,k} \|p_{r,k}\|_1$
     $\quad = k + \|vec(X) - (I \otimes T) vec(P)\|^2 + vec(P)^T D^* vec(P) + \sum_{r,k} \lambda_{r,k} \|p_{r,k}\|_1$
     $\quad = k + \sum_{j,r} \left( \sum_i (x_{ij} - t_{ir} p_{jr})^2 + d^*_{jr} p_{jr}^2 + \lambda |p_{jr}| \right)$
     for which the root can be found using subgradient techniques
  31. Hence, the following soft-thresholding update of the loadings:
     $p_{jr} = \dfrac{\sum_i x_{ij} t_{ir} - \lambda/2}{1 + d^*_{jr}}$ if $\sum_i x_{ij} t_{ir} > \lambda/2$
     $p_{jr} = \dfrac{\sum_i x_{ij} t_{ir} + \lambda/2}{1 + d^*_{jr}}$ if $\sum_i x_{ij} t_{ir} < -\lambda/2$
     $p_{jr} = 0$ otherwise,
     which can be calculated for all loadings of component r simultaneously using simple vector and matrix operations!!!
     => Highly efficient (time + memory), scalable to large data
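The vectorized form of this update, for all loadings of one component at once (a sketch; `d_star_r` and the half-penalty convention follow the reconstruction above):

```python
import numpy as np

def update_loadings_cd(X, t_r, d_star_r, lam):
    """Vectorized soft-thresholding update of all loadings of component r.

    t_r: scores of component r (length I); d_star_r: surrogate weights
    (length J); lam: lasso tuning parameter.
    """
    b = X.T @ t_r                                   # sum_i x_ij t_ir, per variable j
    shrunk = np.sign(b) * np.maximum(np.abs(b) - lam / 2, 0.0)
    return shrunk / (1.0 + d_star_r)                # element-wise rescaling
```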
  32. Algorithm: Weight-based variant (sparseness on the weights)
     • Similar types of algorithms can be constructed
       • Standard MM: a closed-form update of vec(W) of the same form as before, but now involving a Kronecker product with X, so the system to solve is no longer diagonal
       • Expression to estimate an individual weight using coordinate descent (cycle over all coefficients); sparse group lasso case: a soft-thresholding update of the same form as for the loadings
     • The expression for an individual coefficient is not very expensive, but it has to be calculated many times.
  33. Algorithm
     • The coordinate-wise approach allows fixing coefficients
     • This can be used to define the specific components by fixing to zero the coefficients corresponding to the block not accounted for by the component
  34. Model selection
     • Number of components R, penalty tuning parameters, and/or status (common or specific): how to select suitable values?
     • No definitive answer, only suggestions:
       • R: inspect the VAF (variance accounted for)
       • Use stability selection to tune the variable selection ($\lambda_L$)
       • Cross-validation
       • Exhaustive strategy for the status if the number of blocks & components is limited
  35. Weights or loadings: Simulation results
     • Data generated with either a sparse W or a sparse P
     • All data analyzed both with a model imposing sparsity on W and with one imposing sparsity on P
     • Recovery is best when the data are generated with sparse loadings and analyzed with sparse weights
     • More direct link between the reproduced data and the loadings
  36. Structured sparsity needed or just lasso?
     [Figure: Tucker congruence between $W_{TRUE}$ and the estimated W]
  37. • We presented a generalization of sparse PCA to the multiblock case
     • Sparsity can be imposed accounting for the block structure
     • Sparsity can be imposed either on the weights or on the loadings
       Note: best recovery when generated with sparse loadings and estimated with sparse weights
     • Different algorithmic strategies are possible: here MM and coordinate descent were considered

                         Fit   Stability estimates   Comput. effic.
       Sparse weights    +     -                     -
       Sparse loadings   -     +                     +
  38. Planned developments / in progress
     • Selection of relevant clusters of correlated variables (to deal with the issues of high-dimensionality that haunt the lasso, elastic net, and so on)
     • Prediction by extensions of principal covariates regression:
       $L(W, P_X, p_y) = \alpha \dfrac{\|y - X W p_y\|^2}{\|y\|^2} + (1 - \alpha) \dfrac{\|X - X W P_X^T\|^2}{\|X\|^2} + \lambda_L \|W\|_1 + \lambda_R \|W\|_2^2$
       with $0 \le \alpha \le 1$ ($\alpha = 0$ -> PCR; $\alpha = 1$ -> MLR/RRR)
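For concreteness, the PCovR loss as a function (a sketch based on the reconstruction above; the exact penalty terms and their tuning parameters are assumptions on my part):

```python
import numpy as np

def spcovr_loss(W, P_X, p_y, X, y, alpha, lam_l=0.0, lam_r=0.0):
    """Weighted PCovR loss: alpha trades off predicting y against
    reconstructing X, plus (assumed) lasso and ridge penalties on W."""
    T = X @ W                                          # component scores
    pred = alpha * np.sum((y - T @ p_y) ** 2) / np.sum(y ** 2)
    rec = (1 - alpha) * np.sum((X - T @ P_X.T) ** 2) / np.sum(X ** 2)
    return pred + rec + lam_l * np.abs(W).sum() + lam_r * np.sum(W ** 2)
```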
  39. SPCovR application: Find an early (day 3) genetic signature that predicts flu vaccine efficacy (day 28) (public data: Nakaya et al., 2011)
     [Figure: gene-expression data matrix X for 24 subjects (rows) with outcome vector y]
  40. SPCovR: Biological content of the selected transcripts
     • Sparse PLS: no (significant) terms found