Upgrade to Pro — share decks privately, control downloads, hide ads and more …

A randomized pairwise likelihood method for complex statistical inferences

Andrea Rau
February 08, 2022

A randomized pairwise likelihood method for complex statistical inferences

Pairwise likelihood methods are commonly used for inference in parametric statistical models in cases where the full likelihood is too complex to be used, such as multivariate count data. Although pairwise likelihood methods represent a useful solution to perform inference for intractable likelihoods, several computational challenges remain, particularly in higher dimensions. To alleviate these issues, we consider a randomized pairwise likelihood approach, where only summands randomly sampled across observations and pairs are used for the estimation. In addition to the usual tradeoff between statistical and computational efficiency, we show that, under a condition on the sampling parameter, this two-way random sampling mechanism allows for the construction of less computationally expensive confidence intervals. The proposed approach, which is implemented in the rpl R package, is illustrated in tandem with copula-based models for multivariate count data in simulations and on a set of transcriptomic data.

(Joint work with Gildas Mazo and Dimitris Karlis)

Andrea Rau

February 08, 2022
Tweet

More Decks by Andrea Rau

Other Decks in Science

Transcript

  1. A randomized pairwise likelihood method
    for complex statistical inferences
    Gildas Mazo, Dimitris Karlis, and Andrea Rau
    February 7, 2022

    eminaire Parisien de Statistique @ IHP, Paris
    [email protected] 1 / 24

    View Slide

  2. Outline
    1 Introduction & context
    2 Randomized pairwise likelihood method
    3 Numerical illustrations
    4 RNA-seq data analysis
    5 rpl R package
    6 Discussion & conclusion
    [email protected] 2 / 24

    View Slide

  3. Introduction & context Multivariate count data in biology
    Multivariate counts in biology
    Multivariate count data are routinely collected by biologists:
    Animal traits relevant to production (number of eggs laid),
    reproduction (litter size), immunity (number of monocytes), ...
    Plant phenotypes relevant to physiology (number of leaves),
    development (number of flowering plants per parcel), yield (number
    of seeds per plant), ...
    Microbial ecology (counts of bacterial species in a given ecosystem)
    High-throughput sequencing data: counts of sequenced reads
    associated with a given biological entity
    [email protected] 3 / 24

    View Slide

  4. Introduction & context Multivariate count data in biology
    Motivating example: Varroa life cycle transcriptome
    Mondet et al. (2018)
    Varroa destructor is a parasitic mite that is a major threat to the
    western honeybee
    Impact on bees well-studied, but less is known about its functional
    biology during adult life cycle → collect transcriptome catalog
    Image adapted from Mondet et al. (2018) and https://flickr.com/photos/[email protected]/48697155713
    [email protected] 4 / 24

    View Slide

  5. Introduction & context Multivariate count data in biology
    Motivating example: Varroa life cycle transcriptome
    Counts associated with sequencing depth and gene length (offsets)
    Exploratory analyses → log(• + 1), differential analyses → NB
    GLMs, co-expression → (cond’l independent) Poisson mixtures, ...
    [email protected] 5 / 24

    View Slide

  6. Introduction & context Multivariate count data in biology
    Motivating example: Varroa life cycle transcriptome
    Counts associated with sequencing depth and gene length (offsets)
    Exploratory analyses → log(• + 1), differential analyses → NB
    GLMs, co-expression → (cond’l independent) Poisson mixtures, ...
    Our goal: estimate transcriptome-wide correlations among life stages in
    the observation space accounting for offsets
    [email protected] 5 / 24

    View Slide

  7. Introduction & context Multivariate count data in biology
    Logs or latent variables for multivariate count data
    Multivariate Poisson Log-Normal distribution (Aitchison & Ho, 1989)
    Variational probabilistic Poisson PCA (Chiquet et al. 2018)
    ...
    [email protected] 6 / 24

    View Slide

  8. Introduction & context Copulas
    What is a copula?
    A copula, among other things, is a multivariate c.d.f. C whose
    arguments are in (0, 1).
    Any multivariate joint distribution can be expressed as a function
    (copula) of its marginal c.d.f.’s (Sklar’s theorem):
    F(x1, . . . , xd ) = C(F1(x1), . . . , Fd (xd )),
    as if we were “adding a dependence structure” on top of sets of
    independent random variables
    → Build multivariate models with prescribed marginal distributions
    Many families of copulas available + easily analyze behavior of
    system under different (1) correlation structures or (2) marginals
    [email protected] 7 / 24

    View Slide

  9. Introduction & context Copulas
    Modeling multivariate count data with copulas
    The following choices, as a first approximation, may be reasonable to build
    a multivariate statistical model for counts:
    Poisson marginals: Fi (xi ) = Fµi
    (xi )
    A Gaussian copula for the dependence structure, e.g.
    C(u1, . . . , ud ; ρ) = Φd (Φ−1
    1
    (u1), . . . , Φ−1
    1
    (ud ); R(ρ)),
    where Φd (•; R(ρ)) is the distribution function of a standard d-variate
    Gaussian distribution with correlation matrix R(ρ)
    [email protected] 8 / 24

    View Slide

  10. Introduction & context Copulas
    The copula inference challenge
    If it is so easy to build multivariate count models, why is this
    approach not widespread? One reason is that inference is challenging.
    The probability density function associated with the above copula
    model is a sum of 2d terms, given by
    (v1,...,vd )
    sgn(v1, . . . , vd )C(Fµ1
    (v1), . . . , Fµd
    (vd ); ρ),
    where the sum is over all
    (v1, . . . , vd ) ∈ {x1 − 1, x1} × . . . × {xd − 1, xd }, and
    sgn(v1, . . . , vd ) ± 1 (according to the vector’s state)
    One solution: estimate θ = (µ1, . . . , µd , ρ) with pairwise likelihood
    methods (Varin et al., 2011)
    [email protected] 9 / 24

    View Slide

  11. Introduction & context Copulas
    Pairwise likelihood to approximate full likelihood
    Special case of composite likelihood using bivariate likelihoods → in large
    class of models, retains sufficient information for consistent estimators of
    parameters of interest
    But still some challenges:
    Computational burden (sum over all d2 pairs of variables for all n
    observations)
    Standard errors and confidence intervals (calculation requires double
    sum over pairs of pairs and all four-dimensional marginals)
    Some existing proposals: pre-select subsets of variable pairs (prior
    knowledge? regularization?), introduction of weights, ...
    [email protected] 10 / 24

    View Slide

  12. Randomized pairwise likelihood method Definition
    Randomized Pairwise Likelihood (RPL)
    To alleviate the computational burden, we propose the RPL:
    LRPL
    n
    (θ) =
    1
    nπn
    n
    i=1 a∈A
    W (a)
    ni
    log fa(X(a)
    i
    ; θ),
    where:
    A is the set of pairs,
    the functions fa(X(a)
    i
    ; θ) are the bivariate densities, and
    for each n, W (a)
    ni
    , i = 1, . . . , n, a ∈ A, are independent Bernoulli
    random variables with common parameter 0 < πn ≤ 1 that is allowed
    to decrease with n (more on this in a bit...)
    → controls the tradeoff between statistical efficiency and
    computational efficiency
    [email protected] 11 / 24

    View Slide

  13. Randomized pairwise likelihood method Definition
    A remark on statistical vs computational efficiency for RPL
    Average number of summands needed to calculate RPL is
    nπnd(d − 1)/2
    → Example: if πn = 1/6, d = 3, n = 10k, 5k terms to compute for
    RPL vs 30k for PL
    Letting πn → 0, the bivariate log density gradients become
    asymptotically independent ⇒ disappearance of term with
    computational complexity d4 in the estimator’s asymptotic variance
    → approximate confidence intervals with lower computational cost
    [email protected] 12 / 24

    View Slide

  14. Randomized pairwise likelihood method Asymptotic normality
    Asymptotic normality
    If θ denotes the maximum RPL estimator (MRPLE), then:
    If π is fixed and n → ∞, then

    n(ˆ
    θ − θ) → N(0, S−1CS−1 + π−1S−1),
    where C and S are matrices of cost O(d4) and O(d2), respectively
    S =
    a∈A
    E ˙
    a
    ˙
    a
    C =
    a=b∈A
    E ˙
    a
    ˙
    b
    [email protected] 13 / 24

    View Slide

  15. Randomized pairwise likelihood method Asymptotic normality
    Asymptotic normality
    If θ denotes the maximum RPL estimator (MRPLE), then:
    If π is fixed and n → ∞, then

    n(ˆ
    θ − θ) → N(0, S−1CS−1 + π−1S−1),
    where C and S are matrices of cost O(d4) and O(d2), respectively
    S =
    a∈A
    E ˙
    a
    ˙
    a
    C =
    a=b∈A
    E ˙
    a
    ˙
    b
    If πn → 0 and some condition that implies nπn → ∞, then

    nπn(ˆ
    θ − θ) → N(0, S−1),
    The second result suggests that an approximate standard error of the
    estimator can be computed with cost O(d2)
    [email protected] 13 / 24

    View Slide

  16. Randomized pairwise likelihood method Inflation factor of MRPLE
    A reasonable inflation of MRPLE asymptotic variance?
    Consider MRPLE ˆ
    θMRPL
    n
    (π) and V (π) its asymptotic variance
    Inflation factor from π to π given by:
    IF(π |π) :=
    V (π )
    V (π)
    =
    πS−1CS−1
    πS−1CS−1 + S−1
    +
    S−1
    πS−1CS−1 + S−1
    π
    π
    When is this inflation factor subhomogeneous, i.e.
    IF(k−1π|π) < k IF(π|π) = k for all π?
    [email protected] 14 / 24

    View Slide

  17. Randomized pairwise likelihood method Inflation factor of MRPLE
    A reasonable inflation of MRPLE asymptotic variance?
    Consider MRPLE ˆ
    θMRPL
    n
    (π) and V (π) its asymptotic variance
    Inflation factor from π to π given by:
    IF(π |π) :=
    V (π )
    V (π)
    =
    πS−1CS−1
    πS−1CS−1 + S−1
    +
    S−1
    πS−1CS−1 + S−1
    π
    π
    When is this inflation factor subhomogeneous, i.e.
    IF(k−1π|π) < k IF(π|π) = k for all π?
    → when S−1CS−1 is nonnegative definite, i.e. when ˙
    a, ˙
    b, a = b,
    tend to be positively correlated
    [email protected] 14 / 24

    View Slide

  18. Randomized pairwise likelihood method Inflation factor of MRPLE
    Large enough n, small but not too small πn
    ...aka large enough nπn
    So how to choose πn in practice?
    Under some conditions, choosing πn = n−α, 0 < α < 1/2, makes
    n(1−α)/2(ˆ
    θn − θ) go to a Gaussian limit ⇒ πn can decrease almost as
    fast as 1/

    n
    Simulations could help:
    Identify region in (π, nπ) ∈ (0, a] × [b, ∞) for some constants a and
    b, where the approximation seems to be “good enough”
    Choose π within this region and adjust according to acceptable
    standard errors S−1/(nπ)
    [email protected] 15 / 24

    View Slide

  19. Numerical illustrations Effect of π on performance & computational time
    Description of simulation study: impact of π
    d = 30, n ∈ {100, 500, 1000} simulated from unit Poisson marginals and
    Gaussian copula for two different correlation structures × 500 datasets:
    Factorized (30 distinct copula parameters)
    Blockwise exchangeable (3 blocks; 6 distinct copula parameters)
    → Parameter estimates obtained with RPL using π = 0.1, 0.3, 0.5
    [email protected] 16 / 24

    View Slide

  20. Numerical illustrations Effect of π on performance & computational time
    [email protected] 17 / 24

    View Slide

  21. Numerical illustrations Effect of π on performance & computational time
    [email protected] 17 / 24

    View Slide

  22. Numerical illustrations Effect of π on performance & computational time
    [email protected] 17 / 24

    View Slide

  23. Numerical illustrations Coverage of approximate confidence intervals
    Description of simulation study: approximate CI coverage
    aka How good is the approximated variance?
    d = 3, n ∈ {500, 100, 5000} simulated from unit Poisson marginals and
    Gaussian copula with unstructured correlation × 500 datasets:
    ρ12 = 0.3, ρ13 = 0.2, ρ23 = 0
    “Small π, large nπ” setting
    → Approximate confidence intervals constructed with RPL using
    π ∈ [0.01, 0.90] using S−1/(nπ) ⇒ proportion of simulated datasets for
    which true parameter values were within the 95% confidence intervals
    [email protected] 18 / 24

    View Slide

  24. Numerical illustrations Coverage of approximate confidence intervals
    All coverages exceeding 92% correspond to π ≤ 0.25 and nπ ≥ 125
    ⇒ reasonable rule of thumb for choosing π in practice?
    [email protected] 19 / 24

    View Slide

  25. RNA-seq data analysis Description
    Back to the Varroa life cycle RNA-seq data
    Read counts for n = 22, 372 contigs in d = 10 Varroa life cycle
    groups from a single colony → transcriptome-wide correlations among
    life stages
    Contig lengths ( i ) and TMM-normalized sequencing depth (sj )
    considered to be a known terms in offset for marginal Poisson GLMs:
    log E(Xij ) = log( i sj ) + µj
    for contig i in life stage j
    Poisson marginals coupled with an unstructured Gaussian copula
    → Parameters of interest are θ = (µj , ρj,j ), j = j ∈ {1, . . . , 10}
    [email protected] 20 / 24

    View Slide

  26. RNA-seq data analysis RPL results (π = 0.01, nπ = 224)
    Phor
    Pre−Lay
    Non−reproductive
    Laying
    Arrest
    Post−Lay
    Emerg
    Lab−reared
    Male
    Young
    Phor
    Pre−Lay
    Non−reproductive
    Laying
    Arrest
    Post−Lay
    Emerg
    Lab−reared
    Male
    Young
    Type
    Intercept
    Type
    Pre−reproductive
    Reproductive
    Post−reproductive
    Other
    Intercept
    −0.2 0 0.2 0.4
    Correlation
    −1 −0.5 0 0.5 1
    Significant gain in computational time: 25 min (π = 0.01) vs 8.5h (π = 1)
    [email protected] 21 / 24

    View Slide

  27. rpl R package
    Randomized pairwise likelihood in the rpl R package
    Beta version on GitHub: andreamrau/rpl
    rpl optim: optimize parameters for given correlation structure
    rpl se: calculate standard errors for model parameters for given
    correlation structure ("unstructured")
    init: obtain reasonable values to initialize marginal and correlation
    parameters for use in rpl optim
    simulate mvt poisson: simulate multivariate Poisson data with
    given correlation structure (using Gaussian copulas)
    → For now, only adapted to multivariate Poisson data but the aim is to
    further generalize the code
    [email protected] 22 / 24

    View Slide

  28. Discussion & conclusion
    Summary
    Copula models make it easy to build multivariate (count) models
    Computational issues in inference for pairwise likelihood can be
    mitigated by the RPL
    → tradeoff between statistical efficiency and computational efficiency
    → approximate confidence intervals obtained with lower
    computational cost
    → large enough n, small but not too small πn !
    [email protected] 23 / 24

    View Slide

  29. Discussion & conclusion
    Future work
    Many possible improvements for RPL: parallelization to reduce the
    loss of statistical efficiency, optimizing several RPLs in parallel and
    averaging, ...
    Alternative sampling schemes for specific situations (temporal or
    spatial autocorrelation), structural or sparsity constraints on
    parameters
    Theoretical results for d → ∞ still needed (beyond exchangeable
    Gaussian model)
    RPL could benefit situations well beyond copula-based models for
    counts (e.g. mixed-type data)
    Graphical models for network inference
    [email protected] 24 / 24

    View Slide

  30. Thanks for your attention!
    INRAE DIGIT-BIO metaprogramme funding: DINAMIC
    R package: https:/github.com/andreamrau/rpl
    Preprint: https://hal.archives-ouvertes.fr/hal-03126621

    View Slide