Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Maxime Vono

C3bc10b8a72ed3c3bfd843793b8a9868?s=47 S³ Seminar
December 11, 2020

Maxime Vono

(IRIT — INP-ENSEEIHT)

https://s3-seminar.github.io/seminars/maxime-vono/

Title — Efficient MCMC sampling via asymptotically exact data augmentation

Abstract — Performing exact Bayesian inference for complex models is computationally intractable. Markov chain Monte Carlo (MCMC) algorithms can provide reliable approximations of the posterior distribution but are expensive for large datasets and high-dimensional models. A standard approach to mitigate this complexity consists in using subsampling techniques or distributing the data across a cluster. However, these approaches are typically unreliable in high-dimensional scenarios. We focus here on an alternative MCMC scheme, coined Split Gibbs Sampler (SGS), exploiting a splitting strategy akin to the one used by quadratic penalty approaches in optimisation. This MCMC algorithm targets an approximation of a given posterior distribution which comes from an approximate statistical framework called Asymptotically Exact Data Augmentation (AXDA). Under standard regularity conditions, we quantitatively assess the bias associated to AXDA, establish explicit convergence rates for SGS and provide practitioners with explicit guidelines to fix the number of SGS steps. We finally support our methodology with numerical illustrations on challenging Bayesian inference problems.

Biography — Maxime Vono recently obtained his Ph.D. in statistics from the University of Toulouse where he was supervised by Nicolas Dobigeon (IRIT, INP-ENSEEIHT) and Pierre Chainais (CRIStAL, Centrale Lille). In spring 2019, he was a visiting research scholar at the Department of Statistics of the University of Oxford working with Arnaud Doucet (Oxford & Deepmind). Prior to that, he both graduated in 2017 from Ecole Centrale de Lille majoring in data sciences and University of Lille majoring in probability and statistics. He is affiliated to the ORION-B project which gathers astrophysicists, data scientists and statisticians in order to better understand the formation of stars in our universe. His research interests lie in Bayesian modelling, the development of efficient computational methods for inferring unknown objects in complex problems and their applications to signal/image processing and machine learning.

C3bc10b8a72ed3c3bfd843793b8a9868?s=128

S³ Seminar

December 11, 2020
Tweet

Transcript

  1. 1 / 44 Efficient MCMC sampling via asymptotically exact data

    augmentation Maxime Vono Joint work with P. Chainais, N. Dobigeon, A. Doucet and D. Paulin S3 Seminar - The Paris-Saclay Signal Seminar December 11, 2020
  2. 2 / 44 Bayesian inference1 θ : unknown object of

    interest y : available data (observations) Prior × Likelihood −→ Posterior θ ∼ π(θ) y|θ ∼ π(y|θ) θ|y ∼ π(θ|y) 1 Robert (2001) , Gelman et al. (2003)
  3. 2 / 44 Bayesian inference θ : unknown object of

    interest y : available data (observations) E(θ|y) Cα 1 − α Bayesian estimators Credibility regions arg min δ L(δ, θ)π(θ|y)dθ Cα π(θ|y)dθ = 1 − α, α ∈ (0, 1)
  4. 3 / 44 Monte Carlo integration4 h(θ)π(θ|y)dθ ≈ 1 N

    N n=1 h θ(n) , θ(n) ∼ π(θ|y) Numerous applications : inverse problems1 statistical machine learning2 aggregation of estimators3 1 Idier (2008) 2 Andrieu et al. (2003) 3 Dalalyan and Tsybakov (2012) 4 Robert and Casella (2004)
  5. 3 / 44 Monte Carlo integration1 h(θ)π(θ|y)dθ ≈ 1 N

    N n=1 h θ(n) , θ(n) ∼ π(θ|y) Sampling challenges: − log π(θ|y) = b i=1 Ui (Ai θ) {Ui ; i ∈ [k]} non-smooth θ ∈ Rd with d 1 1 Robert and Casella (2004)
  6. 4 / 44 Illustrative example: image inpainting y = Hθ

    + ε, ε ∼ N(0d , σ2Id ) π(θ) : TV prior1 Posterior π(θ|y) : not standard: no conjugacy not smooth: TV prior high-dimensional: d ∼ 105 1 Chambolle et al. (2010)
  7. 5 / 44 Aim of this talk Present a general

    statistical framework to perform efficient sampling Seminal works: HMC1, (MY)ULA2 efficient & simple sampling scalability in high dimension scalability in big data settings 0 5000 10000 15000 Iteration t 3 2 1 0 1 2 3 ✓ N(0, 1) Random walk with = 1 1 Duane et al. (1987); Neal (2011); Maddison et al. (2018); Dang et al. (2019) 2 Roberts and Tweedie (1996); Pereyra (2016); Durmus et al. (2018); Luu et al. (2020)
  8. 6 / 44 Outline 1. Asymptotically exact data augmentation (AXDA)

    2. Theoretical guarantees for Gaussian smoothing 3. Split Gibbs sampler (SGS) 4. Convergence and complexity analyses 5. Experiments
  9. 7 / 44 Outline 1. Asymptotically exact data augmentation (AXDA)

    2. Theoretical guarantees for Gaussian smoothing 3. Split Gibbs sampler (SGS) 4. Convergence and complexity analyses 5. Experiments
  10. 8 / 44 Data augmentation (DA) π(θ, z|y)dz = π(θ|y)

    Numerous well-known advantages: at the core of EM1 π(θ, z|y): simpler full conditionals2 better mixing3 z θ(t) π(θ(t)) Slice sampling 1 Hartley (1958); Dempster et al. (1977); Celeux et al. (2001) 2 Neal (2003) 3 Edwards and Sokal (1988); Marnissi et al. (2018)
  11. 9 / 44 Example: Bayesian LASSO1 y = Xθ +

    ε, ε ∼ N(0d , σ2Id ) π(θ) ∝ e−τ θ 1 1 Park and Casella (2008)
  12. 9 / 44 Example: Bayesian LASSO1 y = Xθ +

    ε, ε ∼ N(0d , σ2Id ) π(θ) ∝ e−τ θ 1 Surrogate: DA by exploiting the mixture representation π(θ) = d i=1 ∞ 0 N(θi |0, zi )E(zi |τ2)dzi 1 Park and Casella (2008)
  13. 9 / 44 Example: Bayesian LASSO1 y = Xθ +

    ε, ε ∼ N(0d , σ2Id ) π(θ) ∝ e−τ θ 1 Surrogate: DA by exploiting the mixture representation π(θ) = d i=1 ∞ 0 N(θi |0, zi )E(zi |τ2)dzi Simple Gibbs sampling: π(θ|y, z1 , . . ., zd ) : Gaussian π(1/zi |θ) : Inverse-Gaussian 1 Park and Casella (2008)
  14. 10 / 44 The art of data augmentation Unfortunately, satisfying

    π(θ, z|y)dz = π(θ|y) (1) while leading to efficient algorithms is a matter of art.1 1 van Dyk and Meng (2001)
  15. 10 / 44 The art of data augmentation Unfortunately, satisfying

    π(θ, z|y)dz = π(θ|y) (1) while leading to efficient algorithms is a matter of art.1 Proposed approach: systematic & simple way to relax (1) leading to efficient algorithms πρ (θ, z|y)dz − − − → ρ→0 π(θ|y) 1 van Dyk and Meng (2001)
  16. 10 / 44 The art of data augmentation Unfortunately, satisfying

    π(θ, z|y)dz = π(θ|y) (1) while leading to efficient algorithms is a matter of art.1 Proposed approach: systematic & simple way to relax (1) leading to efficient algorithms πρ (θ, z|y)dz − − − → ρ→0 π(θ|y) −→ How to build πρ (θ, z|y) ? 1 van Dyk and Meng (2001)
  17. 11 / 44 On building πρ A natural manner to

    construct πρ is to define πρ (θ, z|y) = π(z|y)κρ (z, θ) where κρ (·, θ) weak − − − → ρ−→0 δθ (·) θ z 0 2 4 6 8 κρ (z; θ) ρ = 0.05 ρ = 0.1 ρ = 0.2 ρ = 0.3
  18. 11 / 44 On building πρ A natural manner to

    construct πρ is to define πρ (θ, z|y) = π(z|y)κρ (z, θ) where κρ (·, θ) weak − − − → ρ−→0 δθ (·) θ z 0 2 4 6 8 κρ (z; θ) ρ = 0.05 ρ = 0.1 ρ = 0.2 ρ = 0.3 Divide to conquer generalization: π(θ|y) ∝ b i=1 πi (Ai θ|yi ) −→ πρ (θ, z1:b |y) ∝ b i=1 πi (zi |yi )κρ (zi , Ai θ)
  19. 12 / 44 Split Gibbs sampler (SGS) π(θ|y) ∝ b

    i=1 πi (Ai θ|yi ) −→ πρ (θ, z1:b |y) ∝ b i=1 πi (zi |yi )κρ (zi , Ai θ) Algorithm 1: Split Gibbs Sampler (SGS) 1 for t ← 1 to T do 2 for i ← 1 to b do 3 z(t) i ∼ πi (zi |yi )κρ (zi , Ai θ) 4 end 5 θ(t) ∼ b i=1 κρ (zi , Ai θ) 6 end Divide-to-conquer : b simpler steps −→ πi (zi |yi ) b local steps −→ yi
  20. 13 / 44 Kernel choice : κρ (z, θ) ∝

    ρ−dK(z−θ ρ ) name support K(u) Gaussian R 1 √ 2π exp −u2/2 Cauchy R 1 π(1+u2) Laplace R 1 2 exp (−|u|) Dirichlet R sin2(u) πu2 Uniform [−1, 1] 1 2 1 |u|≤1 Triangular [−1, 1] (1 − |u|)1 |u|≤1 Epanechnikov [−1, 1] 3 4 (1 − u2)1 |u|≤1 −4 0 4 u 0 1 K(u) K(0) Gaussian Cauchy Laplace Dirichlet −1 0 1 u 0 1 K(u) K(0) Uniform Triangular Epanechnikov −→ Similarity with “noisy” ABC methods (Marin et al. 2012; Wilkinson 2013)
  21. 13 / 44 Kernel choice : κρ (z, θ) ∝

    ρ−dK(z−θ ρ ) name support K(u) Gaussian R 1 √ 2π exp −u2/2 Cauchy R 1 π(1+u2) Laplace R 1 2 exp (−|u|) Dirichlet R sin2(u) πu2 Uniform [−1, 1] 1 2 1 |u|≤1 Triangular [−1, 1] (1 − |u|)1 |u|≤1 Epanechnikov [−1, 1] 3 4 (1 − u2)1 |u|≤1 −4 0 4 u 0 1 K(u) K(0) Gaussian Cauchy Laplace Dirichlet −1 0 1 u 0 1 K(u) K(0) Uniform Triangular Epanechnikov −→ Similarity with “noisy” ABC methods (Marin et al. 2012; Wilkinson 2013)
  22. 14 / 44 Outline 1. Asymptotically exact data augmentation (AXDA)

    2. Theoretical guarantees for Gaussian smoothing 3. Split Gibbs sampler (SGS) 4. Convergence and complexity analyses 5. Experiments
  23. 15 / 44 Densities with full domains We assume here

    that supp(π) = Rd. π(θ|y) ∝ b i=1 e−Ui(Aiθ) πi(Aiθ) Distance Main assumptions Upper bound πρ − π TV Ui Li -Lipschitz b i=1 2 di Li + o(ρ) U1 M1 -smooth, b = 1 1 2 M1 d Ui Mi -smooth & strongly convex 1 2 b i=1 Mi di + o(ρ2) W1 (πρ , π) U1 M1 -smooth & strongly convex min √ d, 1 2 √ M1 d b = 1, A1 = Id
  24. 15 / 44 Densities with full domains π(θ|y) ∝ b

    i=1 e−Ui(Aiθ) πi(Aiθ) πρ (θ|y) ∝ b i=1 Rdi e−Ui(zi) πi(zi) N zi |Ai θ, ρ2Id κρ(zi,Aiθ) dzi Distance Main assumptions Upper bound πρ − π TV Ui Li -Lipschitz b i=1 2 di Li + o(ρ) U1 M1 -smooth, b = 1 1 2 M1 d Ui Mi -smooth & strongly convex 1 2 b i=1 Mi di + o(ρ2) W1 (πρ , π) U1 M1 -smooth & strongly convex min √ d, 1 2 √ M1 d b = 1, A = I
  25. 15 / 44 Densities with full domains πρ (θ|y) ∝

    b i=1 Rdi e−Ui(zi) πi(zi) N zi |Ai θ, ρ2Id κρ(zi,Aiθ) dzi Distance Upper bound Main assumptions πρ − π TV ρ b i=1 2 di Li + o(ρ) Ui Li -Lipschitz 1 2 ρ2M1 d U1 M1 -smooth, b = 1 1 2 ρ2 b i=1 Mi di + o(ρ2) Ui Mi -smooth & strongly convex W1 (πρ , π) min ρ √ d, 1 2 ρ2 √ M1 d U1 M1 -smooth & strongly convex b = 1, A1 = Id
  26. 15 / 44 Densities with full domains Illustration of these

    bounds for an univariate Gaussian toy example with b = 1 and U1 (θ) = θ2 2σ2 : π(θ) = 1 √ 2πσ2 e− θ2 2σ2 πρ (θ) = 1 2π(σ2 + ρ2) e− θ2 2(σ2+ρ2) 10−2 10−1 100 101 102 103 ρ 10−7 10−6 10−5 10−4 10−3 10−2 10−1 100 π − πρ TV Our bound Cρ2 10−2 10−1 100 101 102 103 ρ 10−5 10−3 10−1 101 103 W1 (π, πρ ) Upper bound Cρ2
  27. 16 / 44 Uniform prior densities on compact convex sets1

    π has bounded support: supp(π) = K, where K ⊂ Rd is a convex body. π(θ) ∝ e−ιK(θ), ιK (θ) = 0 if θ ∈ K ∞ if θ / ∈ K πρ (θ) ∝ Rd e−ιK(z)− z−θ 2 2ρ2 dz H1. There exists r > 0 such that B(0d , r) ⊂ K. For ρ ∈ 0, r 2 √ 2d , πρ − π TV ≤ 2 √ 2ρd r r K B(0d , r) 1 Gelfand et al. (1992); Bubeck et al. (2015); Brosse et al. (2017); Hsieh et al. (2018)
  28. 17 / 44 Outline 1. Asymptotically exact data augmentation (AXDA)

    2. Theoretical guarantees for Gaussian smoothing 3. Split Gibbs sampler (SGS) 4. Convergence and complexity analyses 5. Experiments
  29. 18 / 44 Sampling problem π(θ|y) ∝ b i=1 e−Ui(Aiθ)

    πi(Aiθ) Ai ∈ Rdi×d Ui depends on yi
  30. 18 / 44 Sampling problem π(θ|y) ∝ b i=1 e−Ui(Aiθ)

    πi(Aiθ) Ai ∈ Rdi×d Ui depends on yi Example: Bayesian logistic regression with Gaussian prior π(θ|y) ∝ e−τ||θ||2 prior n i=1 e−logistic(xT i θ;yi) likelihood not standard high-dimensional possibly distributed data {yi , xi }i∈[n]
  31. 19 / 44 Split Gibbs sampler (SGS) πρ (θ, z1:b

    |y) ∝ b i=1 e−Ui(zi)κρ (zi , Ai θ) Algorithm 2: Split Gibbs Sampler (SGS) 1 for t ← 1 to T do 2 for i ← 1 to b do 3 z(t) i ∼ πρ (zi |θ(t−1), yi ) ∝ exp −Ui (zi ) − 1 2ρ2 zi − Ai θ 2 4 end 5 θ(t) ∼ πρ (θ|z(t) 1:b ) ∝ b i=1 N z(t) i |Ai θ, ρ2Id 6 end Divide-to-conquer : • b simpler steps −→ Ui (zi ) • b local steps −→ yi
  32. 19 / 44 Split Gibbs sampler (SGS) πρ (θ, z1:b

    |y) ∝ b i=1 e−Ui(zi)N zi |Ai θ, ρ2Id Algorithm 3: Split Gibbs Sampler (SGS) 1 for t ← 1 to T do 2 for i ← 1 to b do 3 z(t) i ∼ πρ (zi |θ(t−1), yi ) ∝ exp −Ui (zi ) − 1 2ρ2 zi − Ai θ 2 4 end 5 θ(t) ∼ πρ (θ|z(t) 1:b ) ∝ b i=1 N z(t) i |Ai θ, ρ2Id 6 end Divide-to-conquer : • b simpler steps −→ Ui (zi ) • b local steps −→ yi
  33. 19 / 44 Split Gibbs sampler (SGS) πρ (θ, z1:b

    |y) ∝ b i=1 e−Ui(zi)N zi |Ai θ, ρ2Id Algorithm 1: Split Gibbs Sampler (SGS) 1 for t ← 1 to T do 2 for i ← 1 to b do 3 z(t) i ∼ πρ (zi |θ(t−1), yi ) ∝ exp −Ui (zi ) − 1 2ρ2 zi − Ai θ 2 4 end 5 θ(t) ∼ πρ (θ|z(t) 1:b ) ∝ b i=1 N z(t) i |Ai θ, ρ2Id 6 end Divide-to-conquer : b simpler steps −→ Ui (zi ) b local steps −→ yi
  34. 20 / 44 On the scalability in distributed environments1 θ

    yi i ∈ [n] θ z1 y1 . . . zi yi . . . zb yb . . . . . . → Each latent variable zi depends on the subset of observations yi 1 For a comprehensive review, see the work by Rendell et al. (2020)
  35. 21 / 44 Sampling efficiently the auxiliary variables {zi }i∈[b]

    Rejection sampling1 πρ zi |θ(t−1), y i ∝ exp −Ui (zi ) − 1 2ρ2 zi − Ai θ 2 If Ui convex & smooth and ρ ≤ ¯ ρ with ¯ ρ = O(1/di ): exact sampling by rejection sampling with O(1) evals of Ui and ∇Ui . 1 Gilks and Wild (1992)
  36. 21 / 44 Sampling efficiently the auxiliary variables {zi }i∈[b]

    Rejection sampling1 πρ zi |θ(t−1), y i ∝ exp −Ui (zi ) − 1 2ρ2 zi − Ai θ 2 If Ui convex & smooth and ρ ≤ ¯ ρ with ¯ ρ = O(1/di ): exact sampling by rejection sampling with O(1) evals of Ui and ∇Ui . ρ = 4.49 πρ (zi |θ) qρ (zi |θ) ρ = 2.46 πρ (zi |θ) qρ (zi |θ) ρ = 1.27 πρ (zi |θ) qρ (zi |θ) Here Ui (zi ) = yi zi + log(1 + e−zi ) + αz2 i 2 1 Gilks and Wild (1992)
  37. 22 / 44 Sampling efficiently the master parameter θ High-dimensional

    Gaussian sampling1 πρ (θ|z(t) 1:b ) ∝ b i=1 N z(t) i |Ai θ, ρ2Id Main difficulty: Handling the high-dimensional matrix Q = b i=1 AT i Ai . 10−4 10−3 10−2 10−1 100 101 102 103 104 −15 −10 −5 0 5 10 15 −40 −20 0 20 40 −→ Numerous efficient sampling approaches have been proposed by building on numerical linear algebra techniques. 1 For a recent overview, see the work by Vono et al. (2020)
  38. 23 / 44 Comparison with the Unadjusted Langevin Algorithm When

    SGS meets ULA In the single splitting case (b = 1) with A1 = Id , SGS writes 1. Sample z(t) ∼ πρ (z|θ(t)) ∝ exp −U(z) − z−θ(t) 2 2ρ2 2. Sample θ(t+1) ∼ N(z(t), ρ2Id ). By Taylor’s expansion, when ρ → 0, we have: θ(t+1) ∼ N(θ(t)−ρ2∇U(θ(t)), 2ρ2Id ) −→ one ULA step wih stepsize γ = ρ2.
  39. 24 / 44 Comparison with the Unadjusted Langevin Algorithm Bias

    comparison1 Distance Upper bound Main assumptions πρ − π TV 1 2 ρ2M1 d M-smooth πULA − π TV √ 2Md · ρ M-smooth & convex W1 (πρ , π) min ρ √ d, 1 2 ρ2 √ Md M-smooth, m-strongly convex W1 (πULA , π) 2M m d · ρ M-smooth, m-strongly convex −→ Higher order approximation based on the explicit form of πρ . 1 Durmus et al. (2019)
  40. 25 / 44 Relation with quadratic penalty methods MAP estimation

    under each full conditional of πρ (θ, z1:b |y) ∝ b i=1 exp −Ui (zi ) − 1 2ρ2 zi − Ai θ 2 (2) yields at iteration t z(t) i = arg min zi Ui (zi ) + 1 2ρ2 zi − Ai θ(t−1) 2 , ∀i ∈ [b] θ(t) = arg min θ b i=1 1 2ρ2 z(t) i − Ai θ 2 → Steps involved in quadratic penalty methods
  41. 26 / 44 Outline 1. Asymptotically exact data augmentation (AXDA)

    2. Theoretical guarantees for Gaussian smoothing 3. Split Gibbs sampler (SGS) 4. Convergence and complexity analyses 5. Experiments
  42. 27 / 44 Mixing times How many SGS iterations should

    I run to obtain a sample from π? This question can be answered by giving an upper bound on the so-called -mixing time: tmix ( ; ν) = min t ≥ 0 D(νPt SGS , π) ≤ , where > 0, ν is an initial distibution, PSGS is the Markov kernel of SGS and D is a statistical distance. −→ We are here interested in providing explicit bounds w.r.t. d, and regularity constants associated to U.
  43. 28 / 44 Explicit convergence rates Assumptions : Ai =

    Id and Ui mi -strongly convex for i ∈ [b] W1 (νPt SGS , πρ ) ≤ W1 (ν, πρ ) · (1 − KSGS )t νPt SGS − πρ TV ≤ Varπρ dν dπρ · (1 − KSGS )t, where KSGS = ρ2 b b i=1 mi 1 + mi ρ2 , ν init., PSGS Markov kernel of SGS 0 20 40 60 80 100 Iteration t −12 −10 −8 −6 −4 −2 0 ρ = 1 log νPt SGS − πρ TV Our bound 0 200 400 600 800 1000 Iteration t −20 −15 −10 −5 0 ρ = 0.1 log W1 (δθ0 Pt SGS , πρ ) Our bound
  44. 29 / 44 Explicit guidelines for practitioners Single splitting case

    π(θ|y) ∝ e−U(θ) πρ (θ, z|y) ∝ e−U(z)N(z|θ, ρ2Id ) 1. Set ∈ (0, 1) 2. Set ρ2 = dM , M: Lipschitz constant of ∇U 3. Start from ν(θ) = N θ|θ , M−1Id 4. Run T iterations of SGS with T = C( , d) KSGS Then, νPT SGS − π TV ≤ −4 −2 0 2 4 aT θ a 0.00 0.05 0.10 0.15 0.20 0.25 0.30 Exact - d = 60 −4 −2 0 2 4 aT θ a 0.00 0.05 0.10 0.15 0.20 0.25 0.30 SGS - d = 60
  45. 30 / 44 Complexity results for strongly log-concave densities 1-Wasserstein

    distance Assumptions: U is M-smooth & m-strongly convex, κ = M/m starting point: minimizer θ of U Method Condition on Evals for W1 err. √ d √ m Unadjusted Langevin 0 < ≤ 1 O∗(κ2/ 2) Underdamped Langevin 0 < ≤ 1 O∗(κ2/ ) Underdamped Langevin 0 < ≤ 1/ √ κ O∗(κ3/2/ ) Hamiltonian Dynamics 0 < ≤ 1 O∗(κ3/2/ ) SGS with single splitting 0 < ≤ 1/(d √ κ) O∗(κ1/2/ ) → For small , the rate of the SGS improves upon previous works.
  46. 31 / 44 Complexity results for strongly log-concave densities Total

    variation distance Assumptions: U is M-smooth & m-strongly convex, κ = M/m starting point θ(0) ∼ N(θ; θ , Id /M) Method Validity Evals ULA 0 < ≤ 1 O∗(κ2d3/ 2) MALA 0 < ≤ 1 O(κ2d2 log1.5( κ 1/d )) SGS 0 < ≤ 1 O∗(κd2/ ) 101 102 103 d 103 104 105 106 107 tmix ( ; ν) Total variation distance theoretical slope = 2 SGS, slope = 1.04 (± 0.02) 102 103 log(2/ )/ 102 103 104 tmix ( ; ν) Total variation distance theoretical slope = 1 SGS, slope = 1.06 (± 0.04)
  47. 32 / 44 Complexity results for log-concave densities Total variation

    distance Idea: Replace each potential Ui by a strongly-convex approximation: ˜ U(θ) = b i=1 Ui (Ai θ) + λ 2 θ − θ 2 By using results from Dalalyan (2017) , the triangle inequality and our previous bounds, we have: Method Validity Evals ULA 0 < ≤ 1 O∗(M2d3/ 4) MRW 0 < ≤ 1 O∗(M2d3/ 2) SGS 0 < ≤ 1 O∗(Md3/ 2) → Our complexity results improve upon that of ULA and MRW.
  48. 33 / 44 Outline 1. Asymptotically exact data augmentation (AXDA)

    2. Theoretical guarantees for Gaussian smoothing 3. Split Gibbs sampler (SGS) 4. Convergence and complexity analyses 5. Experiments
  49. 34 / 44 Unsupervised image deconvolution with a smooth prior

    Problem statement y = Bθ + ε, ε ∼ N(0d , Ω−1) π(θ) = N 0d , γLT L −1 , B: circulant matrix w.r.t. a blurring kernel Ω−1 = diag(σ2 1 , . . . , σ2 d ), σi ∼ (1 − β)δκ1 + βδκ2 , where κ1 κ2 L: circulant matrix w.r.t. a Laplacian filter Posterior π(θ|y): Gaussian w/ Q = BT ΩB + γLT L challenging: Q cannot be diagonalized easily high-dimensional: d ∼ 105 Observed Original
  50. 35 / 44 Unsupervised image deconvolution with a smooth prior

    AXDA approximation and SGS π(θ|y, σ, κ1 , κ2 , β, γ) ∝ exp − 1 2 (Bθ − y)T Ω(Bθ − y) + γ Lθ 2 Approximate joint posterior : πρ (θ, z|y, σ, κ1 , κ2 , β, γ) ∝ exp − 1 2 (z − y)T Ω(z − y) + γ Lθ 2 + 1 ρ2 z − Bθ 2 .
  51. 35 / 44 Unsupervised image deconvolution with a smooth prior

    AXDA approximation and SGS π(θ|y, σ, κ1 , κ2 , β, γ) ∝ exp − 1 2 (Bθ − y)T Ω(Bθ − y) + γ Lθ 2 Approximate joint posterior : πρ (θ, z|y, σ, κ1 , κ2 , β, γ) ∝ exp − 1 2 (z − y)T Ω(z − y) + γ Lθ 2 + 1 ρ2 z − Bθ 2 . Sampling θ : πρ (θ|z, γ) : Gaussian distribution with Qθ = 1 ρ2 BT B + γLT L
  52. 35 / 44 Unsupervised image deconvolution with a smooth prior

    AXDA approximation and SGS π(θ|y, σ, κ1 , κ2 , β, γ) ∝ exp − 1 2 (Bθ − y)T Ω(Bθ − y) + γ Lθ 2 Approximate joint posterior : πρ (θ, z|y, σ, κ1 , κ2 , β, γ) ∝ exp − 1 2 (z − y)T Ω(z − y) + γ Lθ 2 + 1 ρ2 z − Bθ 2 . Sampling θ : πρ (θ|z, γ) : Gaussian distribution with Qθ = 1 ρ2 BT B + γLT L Sampling {zi }i∈[d] : πρ (z|θ, y, σ, κ1 , κ2 , β) : Gaussian distribution with Qz = 1 ρ2 Id + Ω
  53. 36 / 44 Unsupervised image deconvolution with a smooth prior

    Restoration results with the MMSE estimator Original Observed MMSE under π Rel. error = 0.004 MMSE under πρ method SNR (dB) PSNR (dB) SGS 17.57 22.92 AuxV1 17.57 22.92 AuxV2 17.58 22.93 RJPO 17.57 22.91
  54. 37 / 44 Unsupervised image deconvolution with a smooth prior

    Convergence diagnotics of the MCMC samplers 0 200 400 600 800 1000 Iteration t 0.000 0.001 0.002 0.003 0.004 0.005 0.006 γ[t] SGS AuxV1 AuxV2 RJPO 0 200 400 600 800 1000 Iteration t 0.30 0.32 0.34 0.36 0.38 0.40 β[t] SGS AuxV1 AuxV2 RJPO True value 0 200 400 600 800 1000 Iteration t 10 11 12 13 14 15 16 κ[t] 1 SGS AuxV1 AuxV2 RJPO True value 0 200 400 600 800 1000 Iteration t 36 38 40 42 44 κ[t] 2 SGS AuxV1 AuxV2 RJPO True value
  55. 38 / 44 Image inpainting with non-smooth prior Problem statement

    and challenges y = Hθ + ε, ε ∼ N(0d , σ2Id ) π(θ) : TV prior H: decimation matrix associated to a binary mask TV(θ) = τ 1≤i≤d Di θ Di θ: 2D discrete gradient applied at pixel i of θ Posterior π(θ|y) : not standard : no conjugacy not smooth : TV prior high-dimensional : d ∼ 105
  56. 39 / 44 Image inpainting with non-smooth prior Possible surrogate:

    MYULA1 π(θ|y) ∝ N y|Hθ, σ2In d i=1 e−τ Diθ π(θ) Approximate posterior : πλ (θ|y) ∝ N y|Hθ, σ2In e−Mλ[T V ](θ) 1 Durmus et al. (2018)
  57. 39 / 44 Image inpainting with non-smooth prior Possible surrogate:

    MYULA1 π(θ|y) ∝ N y|Hθ, σ2In d i=1 e−τ Diθ π(θ) Approximate posterior : πλ (θ|y) ∝ N y|Hθ, σ2In e−Mλ[T V ](θ) MYULA : θ(t+1) ∼ N 1 − λ γ θ(t) + λ∇ log π(y|θ(t)) + λ γ proxγ TV (θ(t)), 2γId 1 Durmus et al. (2018)
  58. 39 / 44 Image inpainting with non-smooth prior Possible surrogate:

    MYULA1 π(θ|y) ∝ N y|Hθ, σ2In d i=1 e−τ Diθ π(θ) Approximate posterior : πλ (θ|y) ∝ N y|Hθ, σ2In e−Mλ[T V ](θ) MYULA : θ(t+1) ∼ N 1 − λ γ θ(t) + λ∇ log π(y|θ(t)) + λ γ proxγ TV (θ(t)), 2γId No closed-form for proxTV : −→ additional approximation −→ computational burden 1 Durmus et al. (2018)
  59. 40 / 44 Image inpainting with non-smooth prior AXDA approximation

    and SGS π(θ|y) ∝ N y|Hθ, σ2In d i=1 e−τ Diθ π(θ) Approximate joint posterior : πρ (θ, z1:d |y) ∝ N y|Hθ, σ2In d i=1 e−τ zi N(zi |Di θ, ρ2Id )
  60. 40 / 44 Image inpainting with non-smooth prior AXDA approximation

    and SGS π(θ|y) ∝ N y|Hθ, σ2In d i=1 e−τ Diθ π(θ) Approximate joint posterior : πρ (θ, z1:d |y) ∝ N y|Hθ, σ2In d i=1 e−τ zi N(zi |Di θ, ρ2Id ) Sampling θ : πρ (θ|z1:d , y) ∝ N y|Hθ, σ2In d i=1 N(zi |Di θ, ρ2Id )
  61. 40 / 44 Image inpainting with non-smooth prior AXDA approximation

    and SGS π(θ|y) ∝ N y|Hθ, σ2In d i=1 e−τ Diθ π(θ) Approximate joint posterior : πρ (θ, z1:d |y) ∝ N y|Hθ, σ2In d i=1 e−τ zi N(zi |Di θ, ρ2Id ) Sampling θ : πρ (θ|z1:d , y) ∝ N y|Hθ, σ2In d i=1 N(zi |Di θ, ρ2Id ) Sampling {zi }i∈[d] : πρ (zi |θ) ∝ e−τ zi N(zi |Di θ, ρ2Id ) −→ DA applied to e−τ zi : Gaussian and Inverse Gaussian conditionals
  62. 41 / 44 Image inpainting with non-smooth prior Highest posterior

    density regions C α = {θ ∈ Rd | π(θ|y) ≥ e−γα }, Cα π(θ|y)dθ = 1 − α 0 10 20 30 40 50 60 70 80 0.0 0.2 0.4 0.6 0.8 1.0 α 0.5 1.0 1.5 2.0 2.5 3.0 |˜ γα −γα | γα ×10−3 SGS MYULA −→ Relative error of the order 0.1%
  63. 42 / 44 Image inpainting with non-smooth prior Sampling efficiency

    U(θ) = − log π(θ|y) 100 101 102 103 104 105 Iteration t 106 107 108 U(θ) SGS MYULA 104 105 Iteration t 1.65 × 105 1.66 × 105 1.67 × 105 1.68 × 105 1.69 × 105 1.7 × 105 U(θ) SGS MYULA −→ Faster convergence towards high probability regions
  64. 42 / 44 Image inpainting with non-smooth prior Sampling efficiency

    ESSR: effective sample size ratio T1 : CPU time in seconds to draw one sample method ESSR T1 [seconds] SGS 0.22 3.83 MYULA 0.15 4.22 −→ ESSR improved by 47%
  65. 43 / 44 Conclusion Broad and simple approximate statistical framework

    • Multiple convolutions with smoothing kernels • Related to Approximate Bayesian Computation • Rich interpretation from both simulation and optimisation • Theoretical guarantees Efficient sampling based on AXDA • Parallel and simple Gibbs sampling • Explicit convergence rates and mixing times • State-of-the-art performances on image processing problems Perspectives • Convergence and complexity analyses for Metropolis-within-SGS • Distributed and asynchronous sampling • Applications to challenging machine and deep learning problems • Comparison with stochastic gradient MCMC
  66. 44 / 44 References Theory and methods M. Vono, N.

    Dobigeon and P. Chainais (2019), “Split-and-augmented Gibbs sampler - Application to large-scale inference problems,” IEEE Transactions on Signal Processing, vol. 67, no. 6, pp. 1648-1661. M. Vono, N. Dobigeon and P. Chainais (2020), “Asymptotically exact data augmentation: models, properties and algorithms,” Journal of Computational and Graphical Statistics (in press). M. Vono, D. Paulin and A. Doucet (2019), “Efficient MCMC sampling with dimension-free convergence rate,” arXiv: 1905.11937. M. Vono, N. Dobigeon and P. Chainais (2020), “High-dimensional Gaussian sampling: A review and a unifying approach based on a stochastic proximal point algorithm,” arXiv: 2010.01510. L. J. Rendell et al. (2020), “Global consensus Monte Carlo,” Journal of Computational and Graphical Statistics (in press). Applications M. Vono et al. (2019), “Bayesian image restoration under Poisson noise and log-concave prior,” in Proc. ICASSP. M. Vono et al. (2018), “Sparse Bayesian binary logistic regression using the split-and-augmented Gibbs sampler,” in Proc. MLSP. Code https://github.com/mvono
  67. 44 / 44 Thank you for your attention! Questions? Joint

    work with Pierre Chainais, Nicolas Dobigeon, Arnaud Doucet & Daniel Paulin
  68. 45 / 44 Andrieu, C., de Freitas, N., Doucet, A.,

    and Jordan, M. I. (2003), “An Introduction to MCMC for Machine Learning,” Machine Learning, 50, 5–43. Brosse, N., Durmus, A., Moulines, E., and Pereyra, M. (2017), “Sampling from a log-concave distribution with compact support with proximal Langevin Monte Carlo,” in Conference on Learning Theory, vol. 65, pp. 319–342. Bubeck, S., Eldan, R., and Lehec, J. (2015), “Finite-Time Analysis of Projected Langevin Monte Carlo,” in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, p. 1243–1251. Celeux, G., Chretien, S., Forbes, F., and Mkhadri, A. (2001), “A Component-Wise EM Algorithm for Mixtures,” Journal of Computational and Graphical Statistics, 10, 697–712. Chambolle, A., Novaga, M., Cremers, D., and Pock, T. (2010), “An introduction to total variation for image analysis,” in Theoretical Foundations and Numerical Methods for Sparse Recovery, De Gruyter. Dalalyan, A. and Tsybakov, A. (2012), “Sparse regression learning by aggregation and Langevin Monte-Carlo,” Journal of Computer and System Sciences, 78, 1423–1443. Dalalyan, A. S. (2017), “Theoretical guarantees for approximate sampling from smooth and log-concave densities,” Journal of the Royal Statistical Society, Series B, 79, 651–676. Dang, K.-D., Quiroz, M., Kohn, R., Tran, M.-N., and Villani, M. (2019), “Hamiltonian Monte Carlo with Energy Conserving Subsampling,” Journal of Machine Learning Research, 20, 1–31. Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), “Maximum Likelihood from Incomplete Data via the EM Algorithm,” Journal of the Royal Statistical Society, Series B, 39, 1–38. Duane, S., Kennedy, A., Pendleton, B. J., and Roweth, D. (1987), “Hybrid Monte Carlo,” Physics Letters B, 195, 216–222. Durmus, A., Majewski, S., and Miasojedow, B. (2019), “Analysis of Langevin Monte Carlo via convex optimization,” Journal of Machine Learning Research, 20, 1–46, available at http://www.jmlr.org/papers/volume20/18-173/18-173.pdf. Durmus, A., Moulines, E., and Pereyra, M. (2018), “Efficient Bayesian Computation by Proximal Markov chain Monte Carlo: When Langevin Meets Moreau,” SIAM Journal on Imaging Sciences, 11, 473–506. Edwards, R. G. and Sokal, A. D. (1988), “Generalization of the Fortuin-Kasteleyn-Swendsen-Wang representation and Monte Carlo algorithm,” Physical Review D, 38, 2009–2012.
  69. 46 / 44 Gelfand, A. E., Smith, A. F. M.,

    and Lee, T.-M. (1992), “Bayesian Analysis of Constrained Parameter and Truncated Data Problems Using Gibbs Sampling,” Journal of the American Statistical Association, 87, 523–532. Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2003), Bayesian Data Analysis, Chapman and Hall/CRC, 2nd ed. Gilks, W. R. and Wild, P. (1992), “Adaptive Rejection Sampling for Gibbs Sampling,” Journal of the Royal Statistical Society. Series C (Applied Statistics), 41, 337–348. Hanson, T. E., Branscum, A. J., and Johnson, W. O. (2014), “Informative g -Priors for Logistic Regression,” Bayesian Analysis, 9, 597–612. Hartley, H. O. (1958), “Maximum Likelihood Estimation from Incomplete Data,” Biometrics, 14, 174–194. Hsieh, Y.-P., Kavis, A., Rolland, P., and Cevher, V. (2018), “Mirrored Langevin Dynamics,” in Advances in Neural Information Processing Systems, vol. 31, pp. 2878–2887. Idier, J. (ed.) (2008), Bayesian Approach to Inverse Problems, Wiley. Luu, T., Fadili, J., and Chesneau, C. (2020), “Sampling from non-smooth distribution through Langevin diffusion,” Methodology and Computing in Applied Probability (in press). Maddison, C. J., Paulin, D., Teh, Y. W., O’Donoghue, B. r., and Doucet, A. (2018), “Hamiltonian Descent Methods,” [online]. Technical report. Available at https://arxiv.org/abs/1809.05042/. Marin, J.-M., Pudlo, P., Robert, C. P., and Ryder, R. J. (2012), “Approximate Bayesian computational methods,” Statistics and Computing, 22, 1167–1180. Marnissi, Y., Chouzenoux, E., Benazza-Benyahia, A., and Pesquet, J.-C. (2018), “An Auxiliary Variable Method for Markov Chain Monte Carlo Algorithms in High Dimension,” Entropy, 20. Neal, R. M. (2003), “Slice sampling,” Ann. Statist., 31, 705–767. — (2011), “MCMC Using Hamiltonian Dynamics,” Handbook of Markov Chain Monte Carlo, 54, 113–162. Park, T. and Casella, G. (2008), “The Bayesian Lasso,” Journal of the American Statistical Association, 103, 681–686. Pereyra, M. (2016), “Proximal Markov chain Monte Carlo algorithms,” Statistics and Computing, 26, 745–760. Rendell, L. J., Johansen, A. M., Lee, A., and Whiteley, N. (2020), “Global consensus Monte Carlo,” Journal of Computational and Graphical Statistics, to appear.
  70. 47 / 44 Robert, C. P. (2001), The Bayesian Choice:

    from decision-theoretic foundations to computational implementation, New York: Springer, 2nd ed. Robert, C. P. and Casella, G. (2004), Monte Carlo Statistical Methods, Berlin: Springer, 2nd ed. Roberts, G. O. and Tweedie, R. L. (1996), “Exponential convergence of Langevin distributions and their discrete approximations,” Bernoulli, 2, 341–363. Sabanes Bove, D. and Held, L. (2011), “Hyper-g priors for generalized linear models,” Bayesian Analysis, 6, 387–410. van Dyk, D. A. and Meng, X.-L. (2001), “The Art of Data Augmentation,” Journal of Computational and Graphical Statistics, 10, 1–50. Vono, M., Dobigeon, N., and Chainais, P. (2020), “High-dimensional Gaussian sampling: A review and a unifying approach based on a stochastic proximal point algorithm,” SIAM Review, in 1st round of review. Wilkinson, R. (2013), “Approximate Bayesian Computation (ABC) gives exact results under the assumption of model error,” Statistical applications in genetics and molecular biology, 12, 1–13.