
Maxime Vono (IRIT — INP-ENSEEIHT)

S³ Seminar
December 11, 2020

https://s3-seminar.github.io/seminars/maxime-vono/

Title — Efficient MCMC sampling via asymptotically exact data augmentation

Abstract — Performing exact Bayesian inference for complex models is computationally intractable. Markov chain Monte Carlo (MCMC) algorithms can provide reliable approximations of the posterior distribution but are expensive for large datasets and high-dimensional models. A standard approach to mitigate this complexity consists of using subsampling techniques or distributing the data across a cluster. However, these approaches are typically unreliable in high-dimensional scenarios. We focus here on an alternative MCMC scheme, coined Split Gibbs Sampler (SGS), which exploits a splitting strategy akin to the one used by quadratic penalty approaches in optimisation. This MCMC algorithm targets an approximation of a given posterior distribution that stems from an approximate statistical framework called Asymptotically Exact Data Augmentation (AXDA). Under standard regularity conditions, we quantitatively assess the bias associated with AXDA, establish explicit convergence rates for SGS and provide practitioners with explicit guidelines to set the number of SGS iterations. We finally support our methodology with numerical illustrations on challenging Bayesian inference problems.

Biography — Maxime Vono recently obtained his Ph.D. in statistics from the University of Toulouse, where he was supervised by Nicolas Dobigeon (IRIT, INP-ENSEEIHT) and Pierre Chainais (CRIStAL, Centrale Lille). In spring 2019, he was a visiting research scholar at the Department of Statistics of the University of Oxford, working with Arnaud Doucet (Oxford & DeepMind). Prior to that, he graduated in 2017 from both Ecole Centrale de Lille, majoring in data science, and the University of Lille, majoring in probability and statistics. He is affiliated with the ORION-B project, which gathers astrophysicists, data scientists and statisticians in order to better understand the formation of stars in our universe. His research interests lie in Bayesian modelling, the development of efficient computational methods for inferring unknown objects in complex problems, and their applications to signal/image processing and machine learning.




Transcript

  1. 1 / 44
    Efficient MCMC sampling via asymptotically exact data augmentation
    Maxime Vono
    Joint work with P. Chainais, N. Dobigeon, A. Doucet and D. Paulin
    S³ Seminar - The Paris-Saclay Signal Seminar
    December 11, 2020

  2. 2 / 44
    Bayesian inference¹
    θ : unknown object of interest
    y : available data (observations)
    Prior × Likelihood −→ Posterior
    θ ∼ π(θ),   y|θ ∼ π(y|θ),   θ|y ∼ π(θ|y)
    ¹ Robert (2001), Gelman et al. (2003)

  3. 2 / 44
    Bayesian inference
    θ : unknown object of interest
    y : available data (observations)
    Bayesian estimators: argmin_δ ∫ L(δ, θ) π(θ|y) dθ
    Credibility regions: C_α such that ∫_{C_α} π(θ|y) dθ = 1 − α,  α ∈ (0, 1)
    [Figure: posterior density with posterior mean E(θ|y) and (1 − α) credibility region C_α]

  4. 3 / 44
    Monte Carlo integration⁴
    ∫ h(θ) π(θ|y) dθ ≈ (1/N) Σ_{n=1}^{N} h(θ^(n)),   θ^(n) ∼ π(θ|y)
    Numerous applications:
    inverse problems¹
    statistical machine learning²
    aggregation of estimators³
    ¹ Idier (2008)
    ² Andrieu et al. (2003)
    ³ Dalalyan and Tsybakov (2012)
    ⁴ Robert and Casella (2004)

  5. 3 / 44
    Monte Carlo integration¹
    ∫ h(θ) π(θ|y) dθ ≈ (1/N) Σ_{n=1}^{N} h(θ^(n)),   θ^(n) ∼ π(θ|y)
    Sampling challenges:
    − log π(θ|y) = Σ_{i=1}^{b} U_i(A_i θ)
    {U_i ; i ∈ [b]} non-smooth
    θ ∈ R^d with d ≫ 1
    ¹ Robert and Casella (2004)
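The estimator on this slide is straightforward to write down in code. Below is a minimal Python sketch, assuming a toy one-dimensional posterior from which exact draws are available; in the settings targeted by this talk, the draws θ^(n) would instead come from an MCMC sampler such as SGS.

```python
import numpy as np

# Monte Carlo estimate of ∫ h(θ) π(θ|y) dθ ≈ (1/N) Σ_n h(θ^(n)).
# Toy setting (an assumption, not from the slides): π(θ|y) = N(1, 0.5²),
# so that exact draws are available and the estimate can be checked.
rng = np.random.default_rng(0)
N = 10_000
theta = rng.normal(loc=1.0, scale=0.5, size=N)   # θ^(n) ~ π(θ|y)

h = lambda t: t**2                               # any integrand h(θ)
print(h(theta).mean())                           # ≈ E[θ²|y] = 1² + 0.5² = 1.25
```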

  6. 4 / 44
    Illustrative example: image inpainting
    y = Hθ + ε,   ε ∼ N(0_d, σ²I_d)
    π(θ) : TV prior¹
    Posterior π(θ|y) :
    not standard: no conjugacy
    not smooth: TV prior
    high-dimensional: d ∼ 10⁵
    ¹ Chambolle et al. (2010)

  7. 5 / 44
    Aim of this talk
    Present a general statistical framework to perform efficient sampling
    Seminal works: HMC¹, (MY)ULA²
    efficient & simple sampling
    scalability in high dimension
    scalability in big data settings
    [Figure: trace of a random-walk sampler (step size 1) over 15,000 iterations targeting N(0, 1)]
    ¹ Duane et al. (1987); Neal (2011); Maddison et al. (2018); Dang et al. (2019)
    ² Roberts and Tweedie (1996); Pereyra (2016); Durmus et al. (2018); Luu et al. (2020)

  8. 6 / 44
    Outline
    1. Asymptotically exact data augmentation (AXDA)
    2. Theoretical guarantees for Gaussian smoothing
    3. Split Gibbs sampler (SGS)
    4. Convergence and complexity analyses
    5. Experiments


  9. 7 / 44
    Outline
    1. Asymptotically exact data augmentation (AXDA)
    2. Theoretical guarantees for Gaussian smoothing
    3. Split Gibbs sampler (SGS)
    4. Convergence and complexity analyses
    5. Experiments


  10. 8 / 44
    Data augmentation (DA)
    ∫ π(θ, z|y) dz = π(θ|y)
    Numerous well-known advantages:
    at the core of EM¹
    π(θ, z|y): simpler full conditionals²
    better mixing³
    [Figure: slice-sampling illustration of the auxiliary variable z and the iterate θ^(t)]
    ¹ Hartley (1958); Dempster et al. (1977); Celeux et al. (2001)
    ² Neal (2003)
    ³ Edwards and Sokal (1988); Marnissi et al. (2018)

  11. 9 / 44
    Example: Bayesian LASSO¹
    y = Xθ + ε,   ε ∼ N(0_d, σ²I_d)
    π(θ) ∝ exp(−τ‖θ‖₁)
    ¹ Park and Casella (2008)

  12. 9 / 44
    Example: Bayesian LASSO¹
    y = Xθ + ε,   ε ∼ N(0_d, σ²I_d)
    π(θ) ∝ exp(−τ‖θ‖₁)
    Surrogate: DA by exploiting the mixture representation
    π(θ) = Π_{i=1}^{d} ∫_0^∞ N(θ_i | 0, z_i) E(z_i | τ²) dz_i
    ¹ Park and Casella (2008)

  13. 9 / 44
    Example: Bayesian LASSO¹
    y = Xθ + ε,   ε ∼ N(0_d, σ²I_d)
    π(θ) ∝ exp(−τ‖θ‖₁)
    Surrogate: DA by exploiting the mixture representation
    π(θ) = Π_{i=1}^{d} ∫_0^∞ N(θ_i | 0, z_i) E(z_i | τ²) dz_i
    Simple Gibbs sampling:
    π(θ | y, z_1, . . ., z_d) : Gaussian
    π(1/z_i | θ) : Inverse-Gaussian
    ¹ Park and Casella (2008)
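As a concrete illustration of how this DA scheme turns into a sampler, here is a minimal Python sketch of the resulting two-block Gibbs sampler. It assumes, since these details are not on the slide, that E(z_i | τ²) is the exponential density with rate τ²/2 (so the marginal prior on θ_i is Laplace with parameter τ) and that the noise level σ is known.

```python
import numpy as np

# Minimal sketch of the two-block Gibbs sampler implied by the mixture above.
# Assumptions (not stated on the slide): E(z_i | τ²) has rate τ²/2, σ is known.
rng = np.random.default_rng(1)
n, d, sigma, tau = 50, 10, 1.0, 2.0
X = rng.standard_normal((n, d))
y = X @ np.r_[np.ones(3), np.zeros(d - 3)] + sigma * rng.standard_normal(n)

theta, z = np.zeros(d), np.ones(d)
for t in range(1000):
    # θ | y, z_1, ..., z_d : Gaussian with precision Q = XᵀX/σ² + diag(1/z)
    Q = X.T @ X / sigma**2 + np.diag(1.0 / z)
    L = np.linalg.cholesky(Q)
    mean = np.linalg.solve(Q, X.T @ y / sigma**2)
    theta = mean + np.linalg.solve(L.T, rng.standard_normal(d))
    # 1/z_i | θ : Inverse-Gaussian with mean τ/|θ_i| and shape τ²
    z = 1.0 / rng.wald(tau / np.maximum(np.abs(theta), 1e-12), tau**2)
```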

  14. 10 / 44
    The art of data augmentation
    Unfortunately, satisfying
    ∫ π(θ, z|y) dz = π(θ|y)   (1)
    while leading to efficient algorithms is a matter of art.¹
    ¹ van Dyk and Meng (2001)

  15. 10 / 44
    The art of data augmentation
    Unfortunately, satisfying
    ∫ π(θ, z|y) dz = π(θ|y)   (1)
    while leading to efficient algorithms is a matter of art.¹
    Proposed approach:
    systematic & simple way to relax (1) leading to efficient algorithms
    ∫ π_ρ(θ, z|y) dz −→ π(θ|y)   as ρ → 0
    ¹ van Dyk and Meng (2001)

  16. 10 / 44
    The art of data augmentation
    Unfortunately, satisfying
    ∫ π(θ, z|y) dz = π(θ|y)   (1)
    while leading to efficient algorithms is a matter of art.¹
    Proposed approach:
    systematic & simple way to relax (1) leading to efficient algorithms
    ∫ π_ρ(θ, z|y) dz −→ π(θ|y)   as ρ → 0
    −→ How to build π_ρ(θ, z|y) ?
    ¹ van Dyk and Meng (2001)

  17. 11 / 44
    On building π_ρ
    A natural manner to construct π_ρ is to define
    π_ρ(θ, z|y) = π(z|y) κ_ρ(z, θ)
    where κ_ρ(·, θ) converges weakly to δ_θ(·) as ρ → 0
    [Figure: κ_ρ(z; θ) concentrating around θ as ρ decreases (ρ = 0.3, 0.2, 0.1, 0.05)]

  18. 11 / 44
    On building π_ρ
    A natural manner to construct π_ρ is to define
    π_ρ(θ, z|y) = π(z|y) κ_ρ(z, θ)
    where κ_ρ(·, θ) converges weakly to δ_θ(·) as ρ → 0
    [Figure: κ_ρ(z; θ) concentrating around θ as ρ decreases (ρ = 0.3, 0.2, 0.1, 0.05)]
    Divide-to-conquer generalization:
    π(θ|y) ∝ Π_{i=1}^{b} π_i(A_i θ | y_i)   −→   π_ρ(θ, z_{1:b}|y) ∝ Π_{i=1}^{b} π_i(z_i | y_i) κ_ρ(z_i, A_i θ)

  19. 12 / 44
    Split Gibbs sampler (SGS)
    π(θ|y) ∝ Π_{i=1}^{b} π_i(A_i θ | y_i)   −→   π_ρ(θ, z_{1:b}|y) ∝ Π_{i=1}^{b} π_i(z_i | y_i) κ_ρ(z_i, A_i θ)
    Algorithm 1: Split Gibbs Sampler (SGS)
    1  for t ← 1 to T do
    2      for i ← 1 to b do
    3          z_i^(t) ∼ π_i(z_i | y_i) κ_ρ(z_i, A_i θ^(t−1))
    4      end
    5      θ^(t) ∼ Π_{i=1}^{b} κ_ρ(z_i^(t), A_i θ)
    6  end
    Divide-to-conquer :
    b simpler steps −→ π_i(z_i | y_i)
    b local steps −→ y_i
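To make the structure of Algorithm 1 concrete, here is a minimal Python skeleton. The two conditional samplers are assumed to be supplied by the user (sample_z for line 3 and sample_theta for line 5); they are problem-specific, and the Gaussian-kernel case is detailed later in the deck.

```python
import numpy as np

# Skeleton of Algorithm 1. `sample_z(i, theta)` is assumed to draw z_i from
# π_i(z_i|y_i) κ_ρ(z_i, A_i θ) and `sample_theta(z)` to draw θ from a density
# proportional to Π_i κ_ρ(z_i, A_i θ); both are user-supplied routines.
def split_gibbs_sampler(theta0, sample_z, sample_theta, b, T):
    theta, chain = theta0, []
    for t in range(T):
        z = [sample_z(i, theta) for i in range(b)]   # b simpler, local steps
        theta = sample_theta(z)                      # refresh the master parameter
        chain.append(theta)
    return np.asarray(chain)
```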

  20. 13 / 44
    Kernel choice : κ_ρ(z, θ) ∝ ρ^{−d} K((z − θ)/ρ)
    name           support     K(u)
    Gaussian       R           (1/√(2π)) exp(−u²/2)
    Cauchy         R           1 / (π(1 + u²))
    Laplace        R           (1/2) exp(−|u|)
    Dirichlet      R           sin²(u) / (πu²)
    Uniform        [−1, 1]     (1/2) 1_{|u|≤1}
    Triangular     [−1, 1]     (1 − |u|) 1_{|u|≤1}
    Epanechnikov   [−1, 1]     (3/4)(1 − u²) 1_{|u|≤1}
    [Figure: normalized kernels K(u)/K(0) with support R (Gaussian, Cauchy, Laplace, Dirichlet) and with support [−1, 1] (Uniform, Triangular, Epanechnikov)]
    −→ Similarity with “noisy” ABC methods (Marin et al. 2012; Wilkinson 2013)


  22. 14 / 44
    Outline
    1. Asymptotically exact data augmentation (AXDA)
    2. Theoretical guarantees for Gaussian smoothing
    3. Split Gibbs sampler (SGS)
    4. Convergence and complexity analyses
    5. Experiments


  23. 15 / 44
    Densities with full domains
    We assume here that supp(π) = R^d.
    π(θ|y) ∝ Π_{i=1}^{b} exp(−U_i(A_i θ)),   each factor being π_i(A_i θ)
    Distance         Main assumptions                       Upper bound
    ‖π_ρ − π‖_TV     U_i L_i-Lipschitz                      ρ Σ_{i=1}^{b} √(2 d_i) L_i + o(ρ)
                     U_1 M_1-smooth, b = 1                  (1/2) ρ² M_1 d
                     U_i M_i-smooth & strongly convex       (1/2) ρ² Σ_{i=1}^{b} M_i d_i + o(ρ²)
    W_1(π_ρ, π)      U_1 M_1-smooth & strongly convex,      min( ρ√d , (1/2) ρ² √(M_1 d) )
                     b = 1, A_1 = I_d

  24. 15 / 44
    Densities with full domains
    π(θ|y) ∝ Π_{i=1}^{b} exp(−U_i(A_i θ)),   each factor being π_i(A_i θ)
    π_ρ(θ|y) ∝ Π_{i=1}^{b} ∫_{R^{d_i}} exp(−U_i(z_i)) N(z_i | A_i θ, ρ²I_d) dz_i,   the Gaussian factor being κ_ρ(z_i, A_i θ)
    Distance         Main assumptions                       Upper bound
    ‖π_ρ − π‖_TV     U_i L_i-Lipschitz                      ρ Σ_{i=1}^{b} √(2 d_i) L_i + o(ρ)
                     U_1 M_1-smooth, b = 1                  (1/2) ρ² M_1 d
                     U_i M_i-smooth & strongly convex       (1/2) ρ² Σ_{i=1}^{b} M_i d_i + o(ρ²)
    W_1(π_ρ, π)      U_1 M_1-smooth & strongly convex,      min( ρ√d , (1/2) ρ² √(M_1 d) )
                     b = 1, A = I

  25. 15 / 44
    Densities with full domains
    π_ρ(θ|y) ∝ Π_{i=1}^{b} ∫_{R^{d_i}} exp(−U_i(z_i)) N(z_i | A_i θ, ρ²I_d) dz_i,   the Gaussian factor being κ_ρ(z_i, A_i θ)
    Distance         Upper bound                              Main assumptions
    ‖π_ρ − π‖_TV     ρ Σ_{i=1}^{b} √(2 d_i) L_i + o(ρ)        U_i L_i-Lipschitz
                     (1/2) ρ² M_1 d                           U_1 M_1-smooth, b = 1
                     (1/2) ρ² Σ_{i=1}^{b} M_i d_i + o(ρ²)     U_i M_i-smooth & strongly convex
    W_1(π_ρ, π)      min( ρ√d , (1/2) ρ² √(M_1 d) )           U_1 M_1-smooth & strongly convex, b = 1, A_1 = I_d

  26. 15 / 44
    Densities with full domains
    Illustration of these bounds for a univariate Gaussian toy example with b = 1 and U_1(θ) = θ²/(2σ²):
    π(θ) = (1/√(2πσ²)) exp(−θ²/(2σ²)),   π_ρ(θ) = (1/√(2π(σ² + ρ²))) exp(−θ²/(2(σ² + ρ²)))
    [Figure: ‖π − π_ρ‖_TV and W_1(π, π_ρ) as functions of ρ, together with our bounds and the Cρ² reference slope]
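The toy curves are easy to reproduce numerically. The sketch below computes the total variation and 1-Wasserstein distances between N(0, σ²) and N(0, σ² + ρ²) by quadrature; it only reproduces the toy example, not the general bounds, and assumes SciPy is available.

```python
import numpy as np
from scipy.stats import norm

# π = N(0, σ²) and π_ρ = N(0, σ² + ρ²): TV is half the L1 distance between the
# densities, and (in one dimension) W1 is the L1 distance between the CDFs.
sigma = 1.0
grid = np.linspace(-20.0, 20.0, 200_001)
for rho in (0.05, 0.1, 0.5, 1.0):
    s_rho = np.sqrt(sigma**2 + rho**2)
    tv = 0.5 * np.trapz(np.abs(norm.pdf(grid, scale=sigma) - norm.pdf(grid, scale=s_rho)), grid)
    w1 = np.trapz(np.abs(norm.cdf(grid, scale=sigma) - norm.cdf(grid, scale=s_rho)), grid)
    print(f"rho={rho:4.2f}  TV={tv:.2e}  W1={w1:.2e}")   # both decay as O(ρ²) for small ρ
```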

  27. 16 / 44
    Uniform prior densities on compact convex sets¹
    π has bounded support: supp(π) = K, where K ⊂ R^d is a convex body.
    π(θ) ∝ exp(−ι_K(θ)),   ι_K(θ) = 0 if θ ∈ K,  ∞ if θ ∉ K
    π_ρ(θ) ∝ ∫_{R^d} exp(−ι_K(z) − ‖z − θ‖²/(2ρ²)) dz
    H1. There exists r > 0 such that B(0_d, r) ⊂ K.
    For ρ ∈ (0, r/(2√2 d)),   ‖π_ρ − π‖_TV ≤ 2√2 ρ d / r
    [Figure: convex body K containing the Euclidean ball B(0_d, r)]
    ¹ Gelfand et al. (1992); Bubeck et al. (2015); Brosse et al. (2017); Hsieh et al. (2018)

  28. 17 / 44
    Outline
    1. Asymptotically exact data augmentation (AXDA)
    2. Theoretical guarantees for Gaussian smoothing
    3. Split Gibbs sampler (SGS)
    4. Convergence and complexity analyses
    5. Experiments


  29. 18 / 44
    Sampling problem
    π(θ|y) ∝ Π_{i=1}^{b} exp(−U_i(A_i θ)),   each factor being π_i(A_i θ)
    A_i ∈ R^{d_i×d}
    U_i depends on y_i

  30. 18 / 44
    Sampling problem
    π(θ|y) ∝ Π_{i=1}^{b} exp(−U_i(A_i θ)),   each factor being π_i(A_i θ)
    A_i ∈ R^{d_i×d},   U_i depends on y_i
    Example: Bayesian logistic regression with Gaussian prior
    π(θ|y) ∝ exp(−τ‖θ‖²) × Π_{i=1}^{n} exp(−logistic(x_i^T θ; y_i))   (prior × likelihood)
    not standard
    high-dimensional
    possibly distributed data {y_i, x_i}_{i∈[n]}

  31. 19 / 44
    Split Gibbs sampler (SGS)
    π_ρ(θ, z_{1:b}|y) ∝ Π_{i=1}^{b} exp(−U_i(z_i)) κ_ρ(z_i, A_i θ)
    Algorithm 1: Split Gibbs Sampler (SGS)
    1  for t ← 1 to T do
    2      for i ← 1 to b do
    3          z_i^(t) ∼ π_ρ(z_i | θ^(t−1), y_i) ∝ exp(−U_i(z_i) − ‖z_i − A_i θ^(t−1)‖²/(2ρ²))
    4      end
    5      θ^(t) ∼ π_ρ(θ | z_{1:b}^(t)) ∝ Π_{i=1}^{b} N(z_i^(t) | A_i θ, ρ²I_d)
    6  end
    Divide-to-conquer :
    • b simpler steps −→ U_i(z_i)
    • b local steps −→ y_i

  32. 19 / 44
    Split Gibbs sampler (SGS)
    π_ρ(θ, z_{1:b}|y) ∝ Π_{i=1}^{b} exp(−U_i(z_i)) N(z_i | A_i θ, ρ²I_d)
    Algorithm 1: Split Gibbs Sampler (SGS)
    1  for t ← 1 to T do
    2      for i ← 1 to b do
    3          z_i^(t) ∼ π_ρ(z_i | θ^(t−1), y_i) ∝ exp(−U_i(z_i) − ‖z_i − A_i θ^(t−1)‖²/(2ρ²))
    4      end
    5      θ^(t) ∼ π_ρ(θ | z_{1:b}^(t)) ∝ Π_{i=1}^{b} N(z_i^(t) | A_i θ, ρ²I_d)
    6  end
    Divide-to-conquer :
    • b simpler steps −→ U_i(z_i)
    • b local steps −→ y_i
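With the Gaussian kernel, line 5 is a multivariate Gaussian draw whose parameters follow from completing the square. The sketch below shows this θ-step in Python; A_list and z_list are assumed to hold the matrices A_i and the freshly drawn z_i^(t), and a dense Cholesky factorisation is used for simplicity.

```python
import numpy as np

# θ-step of SGS with Gaussian kernel: the full conditional of θ is
# N(Q⁻¹ m, ρ² Q⁻¹) with Q = Σ_i A_iᵀA_i and m = Σ_i A_iᵀ z_i^(t).
def sample_theta(A_list, z_list, rho, rng):
    d = A_list[0].shape[1]
    Q = sum(A.T @ A for A in A_list)                   # Σ_i A_iᵀ A_i
    m = sum(A.T @ z for A, z in zip(A_list, z_list))   # Σ_i A_iᵀ z_i^(t)
    L = np.linalg.cholesky(Q)                          # Q = L Lᵀ
    mean = np.linalg.solve(Q, m)
    # adding ρ L⁻ᵀ ξ with ξ ~ N(0, I_d) yields covariance ρ² Q⁻¹
    return mean + rho * np.linalg.solve(L.T, rng.standard_normal(d))
```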


  34. 20 / 44
    On the scalability in distributed environments¹
    [Diagram: left, the original model where θ interacts with every observation y_i, i ∈ [n]; right, the AXDA model where θ only interacts with the auxiliary variables z_1, . . ., z_b, each attached to its local data y_i]
    → Each latent variable z_i depends on the subset of observations y_i
    ¹ For a comprehensive review, see the work by Rendell et al. (2020)

  35. 21 / 44
    Sampling efficiently the auxiliary variables {z_i}_{i∈[b]}
    Rejection sampling¹
    π_ρ(z_i | θ^(t−1), y_i) ∝ exp(−U_i(z_i) − ‖z_i − A_i θ^(t−1)‖²/(2ρ²))
    If U_i convex & smooth and ρ ≤ ρ̄ with ρ̄ = O(1/d_i):
    exact sampling by rejection sampling with O(1) evaluations of U_i and ∇U_i.
    ¹ Gilks and Wild (1992)

  36. 21 / 44
    Sampling efficiently the auxiliary variables {z_i}_{i∈[b]}
    Rejection sampling¹
    π_ρ(z_i | θ^(t−1), y_i) ∝ exp(−U_i(z_i) − ‖z_i − A_i θ^(t−1)‖²/(2ρ²))
    If U_i convex & smooth and ρ ≤ ρ̄ with ρ̄ = O(1/d_i):
    exact sampling by rejection sampling with O(1) evaluations of U_i and ∇U_i.
    [Figure: target conditional π_ρ(z_i|θ) and its Gaussian approximation for ρ = 4.49, 2.46 and 1.27]
    Here U_i(z_i) = y_i z_i + log(1 + exp(−z_i)) + α z_i²/2
    ¹ Gilks and Wild (1992)
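The sketch below illustrates the idea of the z_i-step for a scalar, convex, differentiable potential such as the logistic-type U_i on this slide. By convexity, exp(−U(z)) ≤ exp(−U(z₀) − U′(z₀)(z − z₀)), so the conditional is dominated by a shifted Gaussian proposal and the acceptance probability is always well defined. This is a simplified variant of the cited scheme, not the paper's exact O(1) construction.

```python
import numpy as np

# Rejection sampler for π_ρ(z|θ) ∝ exp(-U(z) - (z - m)²/(2ρ²)), U convex,
# using the tangent bound exp(-U(z)) <= exp(-U(z0) - U'(z0)(z - z0)).
def U(z, y, alpha):                       # logistic-type potential from the slide
    return y * z + np.log1p(np.exp(-z)) + 0.5 * alpha * z**2

def dU(z, y, alpha):
    return y - 1.0 / (1.0 + np.exp(z)) + alpha * z

def sample_z(m, rho, y, alpha, rng):
    z0 = m                                # linearization point (an arbitrary choice)
    g = dU(z0, y, alpha)
    while True:
        z = rng.normal(m - rho**2 * g, rho)              # shifted Gaussian proposal
        log_accept = -U(z, y, alpha) + U(z0, y, alpha) + g * (z - z0)
        if np.log(rng.uniform()) < log_accept:           # always <= 0 by convexity
            return z
```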

  37. 22 / 44
    Sampling efficiently the master parameter θ
    High-dimensional Gaussian sampling¹
    π_ρ(θ | z_{1:b}^(t)) ∝ Π_{i=1}^{b} N(z_i^(t) | A_i θ, ρ²I_d)
    Main difficulty: handling the high-dimensional matrix Q = Σ_{i=1}^{b} A_i^T A_i.
    [Figure: examples of large structured precision matrices arising in practice]
    −→ Numerous efficient sampling approaches have been proposed by building on numerical linear algebra techniques.
    ¹ For a recent overview, see the work by Vono et al. (2020)
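One generic way to avoid factorising Q is a perturbation-then-solve step: perturb each data block, then solve a single linear system in Q, for instance with conjugate gradients. The sketch below only illustrates this "numerical linear algebra" idea under the Gaussian-kernel model, not one of the specific samplers reviewed in Vono et al. (2020); it assumes SciPy is available.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

# Draw θ ~ N(Q⁻¹ m, ρ² Q⁻¹) with Q = Σ_i A_iᵀA_i and m = Σ_i A_iᵀ z_i, without
# forming or factorising Q: perturb each block and solve Q θ = Σ_i A_iᵀ(z_i + ρ ξ_i).
def sample_theta_cg(A_list, z_list, rho, rng):
    d = A_list[0].shape[1]
    Qmv = lambda v: sum(A.T @ (A @ v) for A in A_list)          # matvec with Q
    rhs = sum(A.T @ (z + rho * rng.standard_normal(A.shape[0]))
              for A, z in zip(A_list, z_list))                  # Σ_i A_iᵀ(z_i + ρ ξ_i)
    theta, _ = cg(LinearOperator((d, d), matvec=Qmv), rhs)
    return theta   # exact in distribution, up to the solver tolerance
```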

  38. 23 / 44
    Comparison with the Unadjusted Langevin Algorithm
    When SGS meets ULA
    In the single splitting case (b = 1) with A_1 = I_d, SGS writes
    1. Sample z^(t) ∼ π_ρ(z | θ^(t)) ∝ exp(−U(z) − ‖z − θ^(t)‖²/(2ρ²))
    2. Sample θ^(t+1) ∼ N(z^(t), ρ²I_d).
    By a Taylor expansion, when ρ → 0, we have
    θ^(t+1) ∼ N(θ^(t) − ρ²∇U(θ^(t)), 2ρ²I_d)
    −→ one ULA step with stepsize γ = ρ².
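For reference, here is the corresponding ULA update in code, for a generic smooth potential U with gradient grad_U (both assumed given); with γ = ρ² it matches the small-ρ limit of the two SGS steps above.

```python
import numpy as np

# One ULA step: θ ← θ − γ ∇U(θ) + √(2γ) ξ,  ξ ~ N(0, I_d),  with γ = ρ².
def ula_step(theta, grad_U, rho, rng):
    gamma = rho**2
    return theta - gamma * grad_U(theta) + np.sqrt(2.0 * gamma) * rng.standard_normal(theta.shape)
```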

  39. 24 / 44
    Comparison with the Unadjusted Langevin Algorithm
    Bias comparison¹
    Distance            Upper bound                        Main assumptions
    ‖π_ρ − π‖_TV        (1/2) ρ² M d                       M-smooth
    ‖π_ULA − π‖_TV      √(2Md) · ρ                         M-smooth & convex
    W_1(π_ρ, π)         min( ρ√d , (1/2) ρ² √(Md) )        M-smooth, m-strongly convex
    W_1(π_ULA, π)       2(M/m) √d · ρ                      M-smooth, m-strongly convex
    −→ Higher-order approximation based on the explicit form of π_ρ.
    ¹ Durmus et al. (2019)

  40. 25 / 44
    Relation with quadratic penalty methods
    MAP estimation under each full conditional of
    π_ρ(θ, z_{1:b}|y) ∝ Π_{i=1}^{b} exp(−U_i(z_i) − ‖z_i − A_i θ‖²/(2ρ²))   (2)
    yields at iteration t
    z_i^(t) = argmin_{z_i} { U_i(z_i) + ‖z_i − A_i θ^(t−1)‖²/(2ρ²) },   ∀i ∈ [b]
    θ^(t) = argmin_θ Σ_{i=1}^{b} ‖z_i^(t) − A_i θ‖²/(2ρ²)
    → Steps involved in quadratic penalty methods

  41. 26 / 44
    Outline
    1. Asymptotically exact data augmentation (AXDA)
    2. Theoretical guarantees for Gaussian smoothing
    3. Split Gibbs sampler (SGS)
    4. Convergence and complexity analyses
    5. Experiments


  42. 27 / 44
    Mixing times
    How many SGS iterations should I run to obtain a sample from π?
    This question can be answered by giving an upper bound on the so-called ε-mixing time:
    t_mix(ε; ν) = min{ t ≥ 0 : D(νP_SGS^t, π) ≤ ε },
    where ε > 0, ν is an initial distribution, P_SGS is the Markov kernel of SGS and D is a statistical distance.
    −→ We are here interested in providing explicit bounds w.r.t. d, ε and the regularity constants associated with U.

  43. 28 / 44
    Explicit convergence rates
    Assumptions : A_i = I_d and U_i m_i-strongly convex for i ∈ [b]
    W_1(νP_SGS^t, π_ρ) ≤ W_1(ν, π_ρ) · (1 − K_SGS)^t
    ‖νP_SGS^t − π_ρ‖_TV ≤ √(Var_{π_ρ}[dν/dπ_ρ]) · (1 − K_SGS)^t,
    where K_SGS = (ρ²/b) Σ_{i=1}^{b} m_i/(1 + m_i ρ²),  ν is the initial distribution and P_SGS the Markov kernel of SGS
    [Figure: log ‖νP_SGS^t − π_ρ‖_TV for ρ = 1 and log W_1(δ_{θ_0}P_SGS^t, π_ρ) for ρ = 0.1, versus the iteration t, together with our bounds]
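The rate above is simple enough to be used directly for planning runs. The sketch below computes K_SGS for given ρ and strong-convexity constants m_i, and the number of iterations after which the W1 bound drops below a tolerance ε; the numerical values are illustrative, not taken from the talk.

```python
import numpy as np

# K_SGS = (ρ²/b) Σ_i m_i / (1 + m_i ρ²) and the number of iterations t such that
# W1(ν, π_ρ) (1 − K_SGS)^t <= ε.
def k_sgs(rho, m):
    m = np.asarray(m, dtype=float)
    return (rho**2 / m.size) * np.sum(m / (1.0 + m * rho**2))

rho, m, eps, w1_init = 0.1, [1.0, 2.0, 5.0], 1e-3, 10.0   # illustrative values
K = k_sgs(rho, m)
t_needed = int(np.ceil(np.log(eps / w1_init) / np.log(1.0 - K)))
print(f"K_SGS = {K:.4f}, iterations for the W1 bound to reach {eps}: {t_needed}")
```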

  44. 29 / 44
    Explicit guidelines for practitioners
    Single splitting case
    π(θ|y) ∝ exp(−U(θ)),   π_ρ(θ, z|y) ∝ exp(−U(z)) N(z | θ, ρ²I_d)
    1. Set ε ∈ (0, 1)
    2. Set ρ² = ε/(dM),  M: Lipschitz constant of ∇U
    3. Start from ν(θ) = N(θ | θ⋆, M^{−1}I_d),  θ⋆: minimizer of U
    4. Run T iterations of SGS with T = C(ε, d)/K_SGS
    Then,  ‖νP_SGS^T − π‖_TV ≤ ε
    [Figure: histograms of the one-dimensional projection a^T θ/‖a‖ under the exact posterior and under SGS, d = 60]
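A small helper following this recipe is given below. The constant C(ε, d) in step 4 is left as a user-supplied callable, since its explicit expression is given in the paper rather than on the slide (the default shown is only a placeholder); the other quantities are read off the slide.

```python
import numpy as np

# Steps 2-4 of the guideline. C is a placeholder for the constant C(ε, d) of
# step 4 (its exact expression is in the paper, not on the slide).
def sgs_settings(eps, d, M, K_sgs, C=lambda eps, d: np.log(d / eps)):
    rho2 = eps / (d * M)                  # step 2: ρ² = ε/(dM)
    T = int(np.ceil(C(eps, d) / K_sgs))   # step 4: T = C(ε, d)/K_SGS
    return rho2, T

def initial_draw(theta_star, M, rng):
    # step 3: ν = N(θ⋆, M⁻¹ I_d), with θ⋆ a minimizer of U
    return theta_star + rng.standard_normal(theta_star.shape) / np.sqrt(M)
```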

  45. 30 / 44
    Complexity results for strongly log-concave densities
    1-Wasserstein distance
    Assumptions:
    U is M-smooth & m-strongly convex, κ = M/m
    starting point: minimizer θ⋆ of U
    Method                       Condition on ε        Evaluations for W_1 error ε√(d/m)
    Unadjusted Langevin          0 < ε ≤ 1             O*(κ²/ε²)
    Underdamped Langevin         0 < ε ≤ 1             O*(κ²/ε)
    Underdamped Langevin         0 < ε ≤ 1/√κ          O*(κ^{3/2}/ε)
    Hamiltonian Dynamics         0 < ε ≤ 1             O*(κ^{3/2}/ε)
    SGS with single splitting    0 < ε ≤ 1/(d√κ)       O*(κ^{1/2}/ε)
    → For small ε, the rate of SGS improves upon previous works.

  46. 31 / 44
    Complexity results for strongly log-concave densities
    Total variation distance
    Assumptions:
    U is M-smooth & m-strongly convex, κ = M/m
    starting point θ^(0) ∼ N(θ; θ⋆, I_d/M)
    Method    Validity       Evaluations
    ULA       0 < ε ≤ 1      O*(κ²d³/ε²)
    MALA      0 < ε ≤ 1      O(κ²d² log^{1.5}(κ^{1/d}/ε))
    SGS       0 < ε ≤ 1      O*(κd²/ε)
    [Figure: empirical mixing time t_mix(ε; ν) in total variation, versus d (SGS slope 1.04 ± 0.02, theoretical slope 2) and versus log(2/ε)/ε (SGS slope 1.06 ± 0.04, theoretical slope 1)]

  47. 32 / 44
    Complexity results for log-concave densities
    Total variation distance
    Idea: Replace each potential U_i by a strongly convex approximation:
    Ũ(θ) = Σ_{i=1}^{b} U_i(A_i θ) + (λ/2) ‖θ − θ⋆‖²
    By using results from Dalalyan (2017), the triangle inequality and our previous bounds, we have:
    Method    Validity       Evaluations
    ULA       0 < ε ≤ 1      O*(M²d³/ε⁴)
    MRW       0 < ε ≤ 1      O*(M²d³/ε²)
    SGS       0 < ε ≤ 1      O*(Md³/ε²)
    → Our complexity results improve upon those of ULA and MRW.

  48. 33 / 44
    Outline
    1. Asymptotically exact data augmentation (AXDA)
    2. Theoretical guarantees for Gaussian smoothing
    3. Split Gibbs sampler (SGS)
    4. Convergence and complexity analyses
    5. Experiments


  49. 34 / 44
    Unsupervised image deconvolution with a smooth prior
    Problem statement
    y = Bθ + ε,   ε ∼ N(0_d, Ω^{−1})
    π(θ) = N(0_d, (γ L^T L)^{−1})
    B: circulant matrix w.r.t. a blurring kernel
    Ω^{−1} = diag(σ_1², . . ., σ_d²),   σ_i ∼ (1 − β)δ_{κ_1} + βδ_{κ_2},  where κ_1 ≪ κ_2
    L: circulant matrix w.r.t. a Laplacian filter
    Posterior π(θ|y): Gaussian with precision Q = B^T ΩB + γ L^T L
    challenging: Q cannot be diagonalized easily
    high-dimensional: d ∼ 10⁵
    [Figure: original and observed (blurred, noisy) images]

  50. 35 / 44
    Unsupervised image deconvolution with a smooth prior
    AXDA approximation and SGS
    π(θ | y, σ, κ_1, κ_2, β, γ) ∝ exp( −(1/2) [ (Bθ − y)^T Ω(Bθ − y) + γ‖Lθ‖² ] )
    Approximate joint posterior :
    π_ρ(θ, z | y, σ, κ_1, κ_2, β, γ) ∝ exp( −(1/2) [ (z − y)^T Ω(z − y) + γ‖Lθ‖² + (1/ρ²)‖z − Bθ‖² ] )

  51. 35 / 44
    Unsupervised image deconvolution with a smooth prior
    AXDA approximation and SGS
    π(θ | y, σ, κ_1, κ_2, β, γ) ∝ exp( −(1/2) [ (Bθ − y)^T Ω(Bθ − y) + γ‖Lθ‖² ] )
    Approximate joint posterior :
    π_ρ(θ, z | y, σ, κ_1, κ_2, β, γ) ∝ exp( −(1/2) [ (z − y)^T Ω(z − y) + γ‖Lθ‖² + (1/ρ²)‖z − Bθ‖² ] )
    Sampling θ :
    π_ρ(θ | z, γ) : Gaussian distribution with precision Q_θ = (1/ρ²) B^T B + γ L^T L

  52. 35 / 44
    Unsupervised image deconvolution with a smooth prior
    AXDA approximation and SGS
    π(θ | y, σ, κ_1, κ_2, β, γ) ∝ exp( −(1/2) [ (Bθ − y)^T Ω(Bθ − y) + γ‖Lθ‖² ] )
    Approximate joint posterior :
    π_ρ(θ, z | y, σ, κ_1, κ_2, β, γ) ∝ exp( −(1/2) [ (z − y)^T Ω(z − y) + γ‖Lθ‖² + (1/ρ²)‖z − Bθ‖² ] )
    Sampling θ :
    π_ρ(θ | z, γ) : Gaussian distribution with precision Q_θ = (1/ρ²) B^T B + γ L^T L
    Sampling {z_i}_{i∈[d]} :
    π_ρ(z | θ, y, σ, κ_1, κ_2, β) : Gaussian distribution with precision Q_z = (1/ρ²) I_d + Ω
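Because B and L are circulant, Q_θ is diagonalised by the 2D DFT, so the θ-step can be carried out exactly with a few FFTs. The sketch below assumes b_psf and l_psf are the zero-padded convolution kernels of B and L and that z is the current auxiliary image; it is an illustrative sketch of this particular conditional, not the experiment's code.

```python
import numpy as np

# θ-step: draw from N(Q_θ⁻¹ Bᵀz/ρ², Q_θ⁻¹) with Q_θ = BᵀB/ρ² + γ LᵀL, both
# operators being circulant and hence diagonal in the 2D Fourier domain.
def sample_theta_fft(z, b_psf, l_psf, rho, gamma, rng):
    b_hat = np.fft.fft2(b_psf)
    l_hat = np.fft.fft2(l_psf)
    q_hat = np.abs(b_hat)**2 / rho**2 + gamma * np.abs(l_hat)**2   # eigenvalues of Q_θ
    xi_hat = np.fft.fft2(rng.standard_normal(z.shape))
    theta_hat = (np.conj(b_hat) * np.fft.fft2(z) / rho**2) / q_hat \
                + xi_hat / np.sqrt(q_hat)
    return np.real(np.fft.ifft2(theta_hat))
```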

  53. 36 / 44
    Unsupervised image deconvolution with a smooth prior
    Restoration results with the MMSE estimator
    [Figure: original image, observed image, MMSE estimate under π, and MMSE estimate under π_ρ (relative error = 0.004)]
    method    SNR (dB)    PSNR (dB)
    SGS       17.57       22.92
    AuxV1     17.57       22.92
    AuxV2     17.58       22.93
    RJPO      17.57       22.91

  54. 37 / 44
    Unsupervised image deconvolution with a smooth prior
    Convergence diagnostics of the MCMC samplers
    [Figure: trace plots of γ, β, κ_1 and κ_2 over 1,000 iterations for SGS, AuxV1, AuxV2 and RJPO, together with the true values of β, κ_1 and κ_2]

  55. 38 / 44
    Image inpainting with non-smooth prior
    Problem statement and challenges
    y = Hθ + ε,   ε ∼ N(0_d, σ²I_d)
    π(θ) : TV prior
    H: decimation matrix associated with a binary mask
    TV(θ) = τ Σ_{1≤i≤d} ‖D_i θ‖
    D_i θ: 2D discrete gradient applied at pixel i of θ
    Posterior π(θ|y) :
    not standard : no conjugacy
    not smooth : TV prior
    high-dimensional : d ∼ 10⁵

  56. 39 / 44
    Image inpainting with non-smooth prior
    Possible surrogate: MYULA¹
    π(θ|y) ∝ N(y | Hθ, σ²I_n) Π_{i=1}^{d} exp(−τ‖D_i θ‖)   (the product being the prior π(θ))
    Approximate posterior :
    π_λ(θ|y) ∝ N(y | Hθ, σ²I_n) exp(−M_λ[TV](θ)),   M_λ[TV]: Moreau-Yosida envelope of TV
    ¹ Durmus et al. (2018)

  57. 39 / 44
    Image inpainting with non-smooth prior
    Possible surrogate: MYULA¹
    π(θ|y) ∝ N(y | Hθ, σ²I_n) Π_{i=1}^{d} exp(−τ‖D_i θ‖)   (the product being the prior π(θ))
    Approximate posterior :
    π_λ(θ|y) ∝ N(y | Hθ, σ²I_n) exp(−M_λ[TV](θ)),   M_λ[TV]: Moreau-Yosida envelope of TV
    MYULA :
    θ^(t+1) ∼ N( (1 − γ/λ) θ^(t) + γ∇ log π(y|θ^(t)) + (γ/λ) prox_TV^λ(θ^(t)),  2γ I_d )
    ¹ Durmus et al. (2018)

  58. 39 / 44
    Image inpainting with non-smooth prior
    Possible surrogate: MYULA¹
    π(θ|y) ∝ N(y | Hθ, σ²I_n) Π_{i=1}^{d} exp(−τ‖D_i θ‖)   (the product being the prior π(θ))
    Approximate posterior :
    π_λ(θ|y) ∝ N(y | Hθ, σ²I_n) exp(−M_λ[TV](θ)),   M_λ[TV]: Moreau-Yosida envelope of TV
    MYULA :
    θ^(t+1) ∼ N( (1 − γ/λ) θ^(t) + γ∇ log π(y|θ^(t)) + (γ/λ) prox_TV^λ(θ^(t)),  2γ I_d )
    No closed form for prox_TV :
    −→ additional approximation
    −→ computational burden
    ¹ Durmus et al. (2018)

  59. 40 / 44
    Image inpainting with non-smooth prior
    AXDA approximation and SGS
    π(θ|y) ∝ N(y | Hθ, σ²I_n) Π_{i=1}^{d} exp(−τ‖D_i θ‖)   (the product being the prior π(θ))
    Approximate joint posterior :
    π_ρ(θ, z_{1:d}|y) ∝ N(y | Hθ, σ²I_n) Π_{i=1}^{d} exp(−τ‖z_i‖) N(z_i | D_i θ, ρ²I_2)

  60. 40 / 44
    Image inpainting with non-smooth prior
    AXDA approximation and SGS
    π(θ|y) ∝ N(y | Hθ, σ²I_n) Π_{i=1}^{d} exp(−τ‖D_i θ‖)   (the product being the prior π(θ))
    Approximate joint posterior :
    π_ρ(θ, z_{1:d}|y) ∝ N(y | Hθ, σ²I_n) Π_{i=1}^{d} exp(−τ‖z_i‖) N(z_i | D_i θ, ρ²I_2)
    Sampling θ :
    π_ρ(θ | z_{1:d}, y) ∝ N(y | Hθ, σ²I_n) Π_{i=1}^{d} N(z_i | D_i θ, ρ²I_2)

  61. 40 / 44
    Image inpainting with non-smooth prior
    AXDA approximation and SGS
    π(θ|y) ∝ N(y | Hθ, σ²I_n) Π_{i=1}^{d} exp(−τ‖D_i θ‖)   (the product being the prior π(θ))
    Approximate joint posterior :
    π_ρ(θ, z_{1:d}|y) ∝ N(y | Hθ, σ²I_n) Π_{i=1}^{d} exp(−τ‖z_i‖) N(z_i | D_i θ, ρ²I_2)
    Sampling θ :
    π_ρ(θ | z_{1:d}, y) ∝ N(y | Hθ, σ²I_n) Π_{i=1}^{d} N(z_i | D_i θ, ρ²I_2)
    Sampling {z_i}_{i∈[d]} :
    π_ρ(z_i | θ) ∝ exp(−τ‖z_i‖) N(z_i | D_i θ, ρ²I_2)
    −→ DA applied to exp(−τ‖z_i‖): Gaussian and Inverse-Gaussian conditionals

  62. 41 / 44
    Image inpainting with non-smooth prior
    Highest posterior density regions
    C_α = {θ ∈ R^d | π(θ|y) ≥ exp(−γ_α)},   ∫_{C_α} π(θ|y) dθ = 1 − α
    [Figure: estimated HPD thresholds γ_α versus α for SGS and MYULA, and their relative difference, of the order of 10⁻³]
    −→ Relative error of the order of 0.1%
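In practice the thresholds γ_α in such figures are estimated directly from the MCMC output. A minimal sketch, assuming U_chain stores U(θ^(t)) = −log π(θ^(t)|y) (up to a constant) along the chain:

```python
import numpy as np

# γ_α as the empirical (1 − α)-quantile of U(θ) along the chain, so that the
# region {θ : U(θ) <= γ_α} carries posterior mass ≈ 1 − α.
def hpd_threshold(U_chain, alpha):
    return np.quantile(U_chain, 1.0 - alpha)

# e.g. thresholds over a grid of credibility levels, as in the figure:
# gammas = np.array([hpd_threshold(U_chain, a) for a in np.linspace(0.01, 0.99, 99)])
```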

  63. 42 / 44
    Image inpainting with non-smooth prior
    Sampling efficiency
    U(θ) = − log π(θ|y)
    [Figure: U(θ^(t)) versus iteration t for SGS and MYULA, with a zoom on the last iterations]
    −→ Faster convergence towards high-probability regions

  64. 42 / 44
    Image inpainting with non-smooth prior
    Sampling efficiency
    ESSR: effective sample size ratio
    T_1: CPU time in seconds to draw one sample
    method    ESSR    T_1 [seconds]
    SGS       0.22    3.83
    MYULA     0.15    4.22
    −→ ESSR improved by 47%
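For completeness, here is a rough sketch of how an effective sample size ratio can be estimated from a scalar summary of the chain (for instance U(θ^(t))), using a standard truncated-autocorrelation estimator. The talk's exact estimator is not specified on the slide, so this is only indicative.

```python
import numpy as np

# ESSR ≈ ESS / n with ESS = n / (1 + 2 Σ_k acf(k)), the sum being truncated at
# the first negative autocorrelation.
def essr(x):
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = x.size
    acf = np.correlate(x, x, mode="full")[n - 1:] / (np.arange(n, 0, -1) * np.var(x))
    iat = 1.0
    for k in range(1, n):
        if acf[k] < 0.0:
            break
        iat += 2.0 * acf[k]
    return 1.0 / iat
```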

  65. 43 / 44
    Conclusion
    Broad and simple approximate statistical framework
    • Multiple convolutions with smoothing kernels
    • Related to Approximate Bayesian Computation
    • Rich interpretation from both simulation and optimisation
    • Theoretical guarantees
    Efficient sampling based on AXDA
    • Parallel and simple Gibbs sampling
    • Explicit convergence rates and mixing times
    • State-of-the-art performance on image processing problems
    Perspectives
    • Convergence and complexity analyses for Metropolis-within-SGS
    • Distributed and asynchronous sampling
    • Applications to challenging machine and deep learning problems
    • Comparison with stochastic gradient MCMC


  66. 44 / 44
    References
    Theory and methods
    M. Vono, N. Dobigeon and P. Chainais (2019), “Split-and-augmented Gibbs sampler - Application
    to large-scale inference problems,” IEEE Transactions on Signal Processing, vol. 67, no. 6, pp.
    1648-1661.
    M. Vono, N. Dobigeon and P. Chainais (2020), “Asymptotically exact data augmentation: models,
    properties and algorithms,” Journal of Computational and Graphical Statistics (in press).
    M. Vono, D. Paulin and A. Doucet (2019), “Efficient MCMC sampling with dimension-free
    convergence rate,” arXiv: 1905.11937.
    M. Vono, N. Dobigeon and P. Chainais (2020), “High-dimensional Gaussian sampling: A review
    and a unifying approach based on a stochastic proximal point algorithm,” arXiv: 2010.01510.
    L. J. Rendell et al. (2020), “Global consensus Monte Carlo,” Journal of Computational and
    Graphical Statistics (in press).
    Applications
    M. Vono et al. (2019), “Bayesian image restoration under Poisson noise and log-concave prior,” in
    Proc. ICASSP.
    M. Vono et al. (2018), “Sparse Bayesian binary logistic regression using the split-and-augmented
    Gibbs sampler,” in Proc. MLSP.
    Code
    https://github.com/mvono


  67. 44 / 44
    Thank you for your attention! Questions?
    Joint work with Pierre Chainais, Nicolas Dobigeon, Arnaud Doucet & Daniel Paulin


  68. 45 / 44
    Andrieu, C., de Freitas, N., Doucet, A., and Jordan, M. I. (2003), “An Introduction to MCMC for
    Machine Learning,” Machine Learning, 50, 5–43.
    Brosse, N., Durmus, A., Moulines, E., and Pereyra, M. (2017), “Sampling from a log-concave distribution
    with compact support with proximal Langevin Monte Carlo,” in Conference on Learning Theory,
    vol. 65, pp. 319–342.
    Bubeck, S., Eldan, R., and Lehec, J. (2015), “Finite-Time Analysis of Projected Langevin Monte Carlo,”
    in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume
    1, NIPS’15, p. 1243–1251.
    Celeux, G., Chretien, S., Forbes, F., and Mkhadri, A. (2001), “A Component-Wise EM Algorithm for
    Mixtures,” Journal of Computational and Graphical Statistics, 10, 697–712.
    Chambolle, A., Novaga, M., Cremers, D., and Pock, T. (2010), “An introduction to total variation for
    image analysis,” in Theoretical Foundations and Numerical Methods for Sparse Recovery, De Gruyter.
    Dalalyan, A. and Tsybakov, A. (2012), “Sparse regression learning by aggregation and Langevin
    Monte-Carlo,” Journal of Computer and System Sciences, 78, 1423–1443.
    Dalalyan, A. S. (2017), “Theoretical guarantees for approximate sampling from smooth and log-concave
    densities,” Journal of the Royal Statistical Society, Series B, 79, 651–676.
    Dang, K.-D., Quiroz, M., Kohn, R., Tran, M.-N., and Villani, M. (2019), “Hamiltonian Monte Carlo with
    Energy Conserving Subsampling,” Journal of Machine Learning Research, 20, 1–31.
    Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), “Maximum Likelihood from Incomplete Data via
    the EM Algorithm,” Journal of the Royal Statistical Society, Series B, 39, 1–38.
    Duane, S., Kennedy, A., Pendleton, B. J., and Roweth, D. (1987), “Hybrid Monte Carlo,” Physics Letters
    B, 195, 216–222.
    Durmus, A., Majewski, S., and Miasojedow, B. (2019), “Analysis of Langevin Monte Carlo via convex
    optimization,” Journal of Machine Learning Research, 20, 1–46, available at
    http://www.jmlr.org/papers/volume20/18-173/18-173.pdf.
    Durmus, A., Moulines, E., and Pereyra, M. (2018), “Efficient Bayesian Computation by Proximal Markov
    chain Monte Carlo: When Langevin Meets Moreau,” SIAM Journal on Imaging Sciences, 11, 473–506.
    Edwards, R. G. and Sokal, A. D. (1988), “Generalization of the Fortuin-Kasteleyn-Swendsen-Wang
    representation and Monte Carlo algorithm,” Physical Review D, 38, 2009–2012.


  69. 46 / 44
    Gelfand, A. E., Smith, A. F. M., and Lee, T.-M. (1992), “Bayesian Analysis of Constrained Parameter
    and Truncated Data Problems Using Gibbs Sampling,” Journal of the American Statistical Association,
    87, 523–532.
    Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2003), Bayesian Data Analysis, Chapman and
    Hall/CRC, 2nd ed.
    Gilks, W. R. and Wild, P. (1992), “Adaptive Rejection Sampling for Gibbs Sampling,” Journal of the
    Royal Statistical Society. Series C (Applied Statistics), 41, 337–348.
    Hanson, T. E., Branscum, A. J., and Johnson, W. O. (2014), “Informative g-Priors for Logistic
    Regression,” Bayesian Analysis, 9, 597–612.
    Hartley, H. O. (1958), “Maximum Likelihood Estimation from Incomplete Data,” Biometrics, 14, 174–194.
    Hsieh, Y.-P., Kavis, A., Rolland, P., and Cevher, V. (2018), “Mirrored Langevin Dynamics,” in Advances
    in Neural Information Processing Systems, vol. 31, pp. 2878–2887.
    Idier, J. (ed.) (2008), Bayesian Approach to Inverse Problems, Wiley.
    Luu, T., Fadili, J., and Chesneau, C. (2020), “Sampling from non-smooth distribution through Langevin
    diffusion,” Methodology and Computing in Applied Probability (in press).
    Maddison, C. J., Paulin, D., Teh, Y. W., O’Donoghue, B., and Doucet, A. (2018), “Hamiltonian
    Descent Methods,” [online]. Technical report. Available at https://arxiv.org/abs/1809.05042/.
    Marin, J.-M., Pudlo, P., Robert, C. P., and Ryder, R. J. (2012), “Approximate Bayesian computational
    methods,” Statistics and Computing, 22, 1167–1180.
    Marnissi, Y., Chouzenoux, E., Benazza-Benyahia, A., and Pesquet, J.-C. (2018), “An Auxiliary Variable
    Method for Markov Chain Monte Carlo Algorithms in High Dimension,” Entropy, 20.
    Neal, R. M. (2003), “Slice sampling,” Ann. Statist., 31, 705–767.
    — (2011), “MCMC Using Hamiltonian Dynamics,” Handbook of Markov Chain Monte Carlo, 54, 113–162.
    Park, T. and Casella, G. (2008), “The Bayesian Lasso,” Journal of the American Statistical Association,
    103, 681–686.
    Pereyra, M. (2016), “Proximal Markov chain Monte Carlo algorithms,” Statistics and Computing, 26,
    745–760.
    Rendell, L. J., Johansen, A. M., Lee, A., and Whiteley, N. (2020), “Global consensus Monte Carlo,”
    Journal of Computational and Graphical Statistics, to appear.


  70. 47 / 44
    Robert, C. P. (2001), The Bayesian Choice: from decision-theoretic foundations to computational
    implementation, New York: Springer, 2nd ed.
    Robert, C. P. and Casella, G. (2004), Monte Carlo Statistical Methods, Berlin: Springer, 2nd ed.
    Roberts, G. O. and Tweedie, R. L. (1996), “Exponential convergence of Langevin distributions and their
    discrete approximations,” Bernoulli, 2, 341–363.
    Sabanes Bove, D. and Held, L. (2011), “Hyper-g priors for generalized linear models,” Bayesian Analysis,
    6, 387–410.
    van Dyk, D. A. and Meng, X.-L. (2001), “The Art of Data Augmentation,” Journal of Computational and
    Graphical Statistics, 10, 1–50.
    Vono, M., Dobigeon, N., and Chainais, P. (2020), “High-dimensional Gaussian sampling: A review and a
    unifying approach based on a stochastic proximal point algorithm,” SIAM Review, in 1st round of
    review.
    Wilkinson, R. (2013), “Approximate Bayesian Computation (ABC) gives exact results under the
    assumption of model error,” Statistical applications in genetics and molecular biology, 12, 1–13.
