
Maxime Vono (IRIT — INP-ENSEEIHT)

S³ Seminar
December 11, 2020

https://s3-seminar.github.io/seminars/maxime-vono/

Title — Efficient MCMC sampling via asymptotically exact data augmentation

Abstract — Performing exact Bayesian inference for complex models is computationally intractable. Markov chain Monte Carlo (MCMC) algorithms can provide reliable approximations of the posterior distribution but are expensive for large datasets and high-dimensional models. A standard approach to mitigate this complexity consists of using subsampling techniques or distributing the data across a cluster. However, these approaches are typically unreliable in high-dimensional scenarios. We focus here on an alternative MCMC scheme, coined Split Gibbs Sampler (SGS), which exploits a splitting strategy akin to the one used by quadratic penalty approaches in optimisation. This MCMC algorithm targets an approximation of a given posterior distribution that stems from an approximate statistical framework called Asymptotically Exact Data Augmentation (AXDA). Under standard regularity conditions, we quantitatively assess the bias associated with AXDA, establish explicit convergence rates for SGS and provide practitioners with explicit guidelines to set the number of SGS iterations. We finally support our methodology with numerical illustrations on challenging Bayesian inference problems.

Biography — Maxime Vono recently obtained his Ph.D. in statistics from the University of Toulouse, where he was supervised by Nicolas Dobigeon (IRIT, INP-ENSEEIHT) and Pierre Chainais (CRIStAL, Centrale Lille). In spring 2019, he was a visiting research scholar at the Department of Statistics of the University of Oxford, working with Arnaud Doucet (Oxford & DeepMind). Prior to that, he graduated in 2017 from both Ecole Centrale de Lille, majoring in data science, and the University of Lille, majoring in probability and statistics. He is affiliated with the ORION-B project, which gathers astrophysicists, data scientists and statisticians in order to better understand the formation of stars in our universe. His research interests lie in Bayesian modelling, the development of efficient computational methods for inferring unknown objects in complex problems, and their applications to signal/image processing and machine learning.




Transcript

  1. 1 / 44
    Efficient MCMC sampling via asymptotically exact data augmentation
    Maxime Vono
    Joint work with P. Chainais, N. Dobigeon, A. Doucet and D. Paulin
    S³ Seminar - The Paris-Saclay Signal Seminar
    December 11, 2020

  2. 2 / 44
    Bayesian inference¹
    θ : unknown object of interest
    y : available data (observations)
    Prior × Likelihood −→ Posterior
    θ ∼ π(θ),   y|θ ∼ π(y|θ),   θ|y ∼ π(θ|y)
    ¹ Robert (2001), Gelman et al. (2003)

  3. 2 / 44
    Bayesian inference
    θ : unknown object of interest
    y : available data (observations)
    Bayesian estimators: argmin_δ ∫ L(δ, θ) π(θ|y) dθ
    Credibility regions: C_α such that ∫_{C_α} π(θ|y) dθ = 1 − α,  α ∈ (0, 1)
    [Figure: posterior density with posterior mean E(θ|y) and (1 − α) credibility region C_α]

  4. 3 / 44
    Monte Carlo integration⁴
    ∫ h(θ) π(θ|y) dθ ≈ (1/N) Σ_{n=1}^{N} h(θ^(n)),   θ^(n) ∼ π(θ|y)
    Numerous applications:
    inverse problems¹
    statistical machine learning²
    aggregation of estimators³
    ¹ Idier (2008)
    ² Andrieu et al. (2003)
    ³ Dalalyan and Tsybakov (2012)
    ⁴ Robert and Casella (2004)

  5. 3 / 44
    Monte Carlo integration¹
    ∫ h(θ) π(θ|y) dθ ≈ (1/N) Σ_{n=1}^{N} h(θ^(n)),   θ^(n) ∼ π(θ|y)
    Sampling challenges:
    − log π(θ|y) = Σ_{i=1}^{b} U_i(A_i θ)
    {U_i ; i ∈ [b]} non-smooth
    θ ∈ R^d with d ≫ 1
    ¹ Robert and Casella (2004)
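The estimator on this slide is straightforward to write down in code. Below is a minimal Python sketch, assuming a toy one-dimensional posterior from which exact draws are available; in the settings targeted by this talk, the draws θ^(n) would instead come from an MCMC sampler such as SGS.

```python
import numpy as np

# Monte Carlo estimate of ∫ h(θ) π(θ|y) dθ ≈ (1/N) Σ_n h(θ^(n)).
# Toy setting (an assumption, not from the slides): π(θ|y) = N(1, 0.5²),
# so that exact draws are available and the estimate can be checked.
rng = np.random.default_rng(0)
N = 10_000
theta = rng.normal(loc=1.0, scale=0.5, size=N)   # θ^(n) ~ π(θ|y)

h = lambda t: t**2                               # any integrand h(θ)
print(h(theta).mean())                           # ≈ E[θ²|y] = 1² + 0.5² = 1.25
```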

  6. 4 / 44
    Illustrative example: image inpainting
    y = Hθ + ε,   ε ∼ N(0_d, σ²I_d)
    π(θ) : TV prior¹
    Posterior π(θ|y) :
    not standard: no conjugacy
    not smooth: TV prior
    high-dimensional: d ∼ 10⁵
    ¹ Chambolle et al. (2010)

  7. 5 / 44
    Aim of this talk
    Present a general statistical framework to perform efficient sampling
    Seminal works: HMC¹, (MY)ULA²
    efficient & simple sampling
    scalability in high dimension
    scalability in big data settings
    [Figure: trace of a random-walk sampler (step size 1) over 15,000 iterations targeting N(0, 1)]
    ¹ Duane et al. (1987); Neal (2011); Maddison et al. (2018); Dang et al. (2019)
    ² Roberts and Tweedie (1996); Pereyra (2016); Durmus et al. (2018); Luu et al. (2020)

  8. 6 / 44
    Outline
    1. Asymptotically exact data augmentation (AXDA)
    2. Theoretical guarantees for Gaussian smoothing
    3. Split Gibbs sampler (SGS)
    4. Convergence and complexity analyses
    5. Experiments


  9. 7 / 44
    Outline
    1. Asymptotically exact data augmentation (AXDA)
    2. Theoretical guarantees for Gaussian smoothing
    3. Split Gibbs sampler (SGS)
    4. Convergence and complexity analyses
    5. Experiments


  10. 8 / 44
    Data augmentation (DA)
    ∫ π(θ, z|y) dz = π(θ|y)
    Numerous well-known advantages:
    at the core of EM¹
    π(θ, z|y): simpler full conditionals²
    better mixing³
    [Figure: slice-sampling illustration of the auxiliary variable z and the iterate θ^(t)]
    ¹ Hartley (1958); Dempster et al. (1977); Celeux et al. (2001)
    ² Neal (2003)
    ³ Edwards and Sokal (1988); Marnissi et al. (2018)

  11. 9 / 44
    Example: Bayesian LASSO¹
    y = Xθ + ε,   ε ∼ N(0_d, σ²I_d)
    π(θ) ∝ exp(−τ‖θ‖₁)
    ¹ Park and Casella (2008)

  12. 9 / 44
    Example: Bayesian LASSO¹
    y = Xθ + ε,   ε ∼ N(0_d, σ²I_d)
    π(θ) ∝ exp(−τ‖θ‖₁)
    Surrogate: DA by exploiting the mixture representation
    π(θ) = Π_{i=1}^{d} ∫_0^∞ N(θ_i | 0, z_i) E(z_i | τ²) dz_i
    ¹ Park and Casella (2008)

  13. 9 / 44
    Example: Bayesian LASSO¹
    y = Xθ + ε,   ε ∼ N(0_d, σ²I_d)
    π(θ) ∝ exp(−τ‖θ‖₁)
    Surrogate: DA by exploiting the mixture representation
    π(θ) = Π_{i=1}^{d} ∫_0^∞ N(θ_i | 0, z_i) E(z_i | τ²) dz_i
    Simple Gibbs sampling:
    π(θ | y, z_1, . . ., z_d) : Gaussian
    π(1/z_i | θ) : Inverse-Gaussian
    ¹ Park and Casella (2008)
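As a concrete illustration of how this DA scheme turns into a sampler, here is a minimal Python sketch of the resulting two-block Gibbs sampler. It assumes, since these details are not on the slide, that E(z_i | τ²) is the exponential density with rate τ²/2 (so the marginal prior on θ_i is Laplace with parameter τ) and that the noise level σ is known.

```python
import numpy as np

# Minimal sketch of the two-block Gibbs sampler implied by the mixture above.
# Assumptions (not stated on the slide): E(z_i | τ²) has rate τ²/2, σ is known.
rng = np.random.default_rng(1)
n, d, sigma, tau = 50, 10, 1.0, 2.0
X = rng.standard_normal((n, d))
y = X @ np.r_[np.ones(3), np.zeros(d - 3)] + sigma * rng.standard_normal(n)

theta, z = np.zeros(d), np.ones(d)
for t in range(1000):
    # θ | y, z_1, ..., z_d : Gaussian with precision Q = XᵀX/σ² + diag(1/z)
    Q = X.T @ X / sigma**2 + np.diag(1.0 / z)
    L = np.linalg.cholesky(Q)
    mean = np.linalg.solve(Q, X.T @ y / sigma**2)
    theta = mean + np.linalg.solve(L.T, rng.standard_normal(d))
    # 1/z_i | θ : Inverse-Gaussian with mean τ/|θ_i| and shape τ²
    z = 1.0 / rng.wald(tau / np.maximum(np.abs(theta), 1e-12), tau**2)
```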

  14. 10 / 44
    The art of data augmentation
    Unfortunately, satisfying
    ∫ π(θ, z|y) dz = π(θ|y)   (1)
    while leading to efficient algorithms is a matter of art.¹
    ¹ van Dyk and Meng (2001)

  15. 10 / 44
    The art of data augmentation
    Unfortunately, satisfying
    ∫ π(θ, z|y) dz = π(θ|y)   (1)
    while leading to efficient algorithms is a matter of art.¹
    Proposed approach:
    systematic & simple way to relax (1) leading to efficient algorithms
    ∫ π_ρ(θ, z|y) dz −→ π(θ|y)   as ρ → 0
    ¹ van Dyk and Meng (2001)

  16. 10 / 44
    The art of data augmentation
    Unfortunately, satisfying
    ∫ π(θ, z|y) dz = π(θ|y)   (1)
    while leading to efficient algorithms is a matter of art.¹
    Proposed approach:
    systematic & simple way to relax (1) leading to efficient algorithms
    ∫ π_ρ(θ, z|y) dz −→ π(θ|y)   as ρ → 0
    −→ How to build π_ρ(θ, z|y) ?
    ¹ van Dyk and Meng (2001)

  17. 11 / 44
    On building π_ρ
    A natural manner to construct π_ρ is to define
    π_ρ(θ, z|y) = π(z|y) κ_ρ(z, θ)
    where κ_ρ(·, θ) converges weakly to δ_θ(·) as ρ → 0
    [Figure: κ_ρ(z; θ) concentrating around θ as ρ decreases (ρ = 0.3, 0.2, 0.1, 0.05)]

  18. 11 / 44
    On building π_ρ
    A natural manner to construct π_ρ is to define
    π_ρ(θ, z|y) = π(z|y) κ_ρ(z, θ)
    where κ_ρ(·, θ) converges weakly to δ_θ(·) as ρ → 0
    [Figure: κ_ρ(z; θ) concentrating around θ as ρ decreases (ρ = 0.3, 0.2, 0.1, 0.05)]
    Divide-to-conquer generalization:
    π(θ|y) ∝ Π_{i=1}^{b} π_i(A_i θ | y_i)   −→   π_ρ(θ, z_{1:b}|y) ∝ Π_{i=1}^{b} π_i(z_i | y_i) κ_ρ(z_i, A_i θ)

  19. 12 / 44
    Split Gibbs sampler (SGS)
    π(θ|y) ∝ Π_{i=1}^{b} π_i(A_i θ | y_i)   −→   π_ρ(θ, z_{1:b}|y) ∝ Π_{i=1}^{b} π_i(z_i | y_i) κ_ρ(z_i, A_i θ)
    Algorithm 1: Split Gibbs Sampler (SGS)
    1  for t ← 1 to T do
    2      for i ← 1 to b do
    3          z_i^(t) ∼ π_i(z_i | y_i) κ_ρ(z_i, A_i θ^(t−1))
    4      end
    5      θ^(t) ∼ Π_{i=1}^{b} κ_ρ(z_i^(t), A_i θ)
    6  end
    Divide-to-conquer :
    b simpler steps −→ π_i(z_i | y_i)
    b local steps −→ y_i
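To make the structure of Algorithm 1 concrete, here is a minimal Python skeleton. The two conditional samplers are assumed to be supplied by the user (sample_z for line 3 and sample_theta for line 5); they are problem-specific, and the Gaussian-kernel case is detailed later in the deck.

```python
import numpy as np

# Skeleton of Algorithm 1. `sample_z(i, theta)` is assumed to draw z_i from
# π_i(z_i|y_i) κ_ρ(z_i, A_i θ) and `sample_theta(z)` to draw θ from a density
# proportional to Π_i κ_ρ(z_i, A_i θ); both are user-supplied routines.
def split_gibbs_sampler(theta0, sample_z, sample_theta, b, T):
    theta, chain = theta0, []
    for t in range(T):
        z = [sample_z(i, theta) for i in range(b)]   # b simpler, local steps
        theta = sample_theta(z)                      # refresh the master parameter
        chain.append(theta)
    return np.asarray(chain)
```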

  20. 13 / 44
    Kernel choice : κ_ρ(z, θ) ∝ ρ^{−d} K((z − θ)/ρ)
    name           support     K(u)
    Gaussian       R           (1/√(2π)) exp(−u²/2)
    Cauchy         R           1 / (π(1 + u²))
    Laplace        R           (1/2) exp(−|u|)
    Dirichlet      R           sin²(u) / (πu²)
    Uniform        [−1, 1]     (1/2) 1_{|u|≤1}
    Triangular     [−1, 1]     (1 − |u|) 1_{|u|≤1}
    Epanechnikov   [−1, 1]     (3/4)(1 − u²) 1_{|u|≤1}
    [Figure: normalized kernels K(u)/K(0) with support R (Gaussian, Cauchy, Laplace, Dirichlet) and with support [−1, 1] (Uniform, Triangular, Epanechnikov)]
    −→ Similarity with “noisy” ABC methods (Marin et al. 2012; Wilkinson 2013)


  22. 14 / 44
    Outline
    1. Asymptotically exact data augmentation (AXDA)
    2. Theoretical guarantees for Gaussian smoothing
    3. Split Gibbs sampler (SGS)
    4. Convergence and complexity analyses
    5. Experiments


  23. 15 / 44
    Densities with full domains
    We assume here that supp(π) = R^d.
    π(θ|y) ∝ Π_{i=1}^{b} exp(−U_i(A_i θ)),   each factor being π_i(A_i θ)
    Distance         Main assumptions                       Upper bound
    ‖π_ρ − π‖_TV     U_i L_i-Lipschitz                      ρ Σ_{i=1}^{b} √(2 d_i) L_i + o(ρ)
                     U_1 M_1-smooth, b = 1                  (1/2) ρ² M_1 d
                     U_i M_i-smooth & strongly convex       (1/2) ρ² Σ_{i=1}^{b} M_i d_i + o(ρ²)
    W_1(π_ρ, π)      U_1 M_1-smooth & strongly convex,      min( ρ√d , (1/2) ρ² √(M_1 d) )
                     b = 1, A_1 = I_d

  24. 15 / 44
    Densities with full domains
    π(θ|y) ∝ Π_{i=1}^{b} exp(−U_i(A_i θ)),   each factor being π_i(A_i θ)
    π_ρ(θ|y) ∝ Π_{i=1}^{b} ∫_{R^{d_i}} exp(−U_i(z_i)) N(z_i | A_i θ, ρ²I_d) dz_i,   the Gaussian factor being κ_ρ(z_i, A_i θ)
    Distance         Main assumptions                       Upper bound
    ‖π_ρ − π‖_TV     U_i L_i-Lipschitz                      ρ Σ_{i=1}^{b} √(2 d_i) L_i + o(ρ)
                     U_1 M_1-smooth, b = 1                  (1/2) ρ² M_1 d
                     U_i M_i-smooth & strongly convex       (1/2) ρ² Σ_{i=1}^{b} M_i d_i + o(ρ²)
    W_1(π_ρ, π)      U_1 M_1-smooth & strongly convex,      min( ρ√d , (1/2) ρ² √(M_1 d) )
                     b = 1, A = I

  25. 15 / 44
    Densities with full domains
    π_ρ(θ|y) ∝ Π_{i=1}^{b} ∫_{R^{d_i}} exp(−U_i(z_i)) N(z_i | A_i θ, ρ²I_d) dz_i,   the Gaussian factor being κ_ρ(z_i, A_i θ)
    Distance         Upper bound                              Main assumptions
    ‖π_ρ − π‖_TV     ρ Σ_{i=1}^{b} √(2 d_i) L_i + o(ρ)        U_i L_i-Lipschitz
                     (1/2) ρ² M_1 d                           U_1 M_1-smooth, b = 1
                     (1/2) ρ² Σ_{i=1}^{b} M_i d_i + o(ρ²)     U_i M_i-smooth & strongly convex
    W_1(π_ρ, π)      min( ρ√d , (1/2) ρ² √(M_1 d) )           U_1 M_1-smooth & strongly convex, b = 1, A_1 = I_d

  26. 15 / 44
    Densities with full domains
    Illustration of these bounds for a univariate Gaussian toy example with b = 1 and U_1(θ) = θ²/(2σ²):
    π(θ) = (1/√(2πσ²)) exp(−θ²/(2σ²)),   π_ρ(θ) = (1/√(2π(σ² + ρ²))) exp(−θ²/(2(σ² + ρ²)))
    [Figure: ‖π − π_ρ‖_TV and W_1(π, π_ρ) as functions of ρ, together with our bounds and the Cρ² reference slope]
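The toy curves are easy to reproduce numerically. The sketch below computes the total variation and 1-Wasserstein distances between N(0, σ²) and N(0, σ² + ρ²) by quadrature; it only reproduces the toy example, not the general bounds, and assumes SciPy is available.

```python
import numpy as np
from scipy.stats import norm

# π = N(0, σ²) and π_ρ = N(0, σ² + ρ²): TV is half the L1 distance between the
# densities, and (in one dimension) W1 is the L1 distance between the CDFs.
sigma = 1.0
grid = np.linspace(-20.0, 20.0, 200_001)
for rho in (0.05, 0.1, 0.5, 1.0):
    s_rho = np.sqrt(sigma**2 + rho**2)
    tv = 0.5 * np.trapz(np.abs(norm.pdf(grid, scale=sigma) - norm.pdf(grid, scale=s_rho)), grid)
    w1 = np.trapz(np.abs(norm.cdf(grid, scale=sigma) - norm.cdf(grid, scale=s_rho)), grid)
    print(f"rho={rho:4.2f}  TV={tv:.2e}  W1={w1:.2e}")   # both decay as O(ρ²) for small ρ
```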

  27. 16 / 44
    Uniform prior densities on compact convex sets¹
    π has bounded support: supp(π) = K, where K ⊂ R^d is a convex body.
    π(θ) ∝ exp(−ι_K(θ)),   ι_K(θ) = 0 if θ ∈ K,  ∞ if θ ∉ K
    π_ρ(θ) ∝ ∫_{R^d} exp(−ι_K(z) − ‖z − θ‖²/(2ρ²)) dz
    H1. There exists r > 0 such that B(0_d, r) ⊂ K.
    For ρ ∈ (0, r/(2√2 d)),   ‖π_ρ − π‖_TV ≤ 2√2 ρ d / r
    [Figure: convex body K containing the Euclidean ball B(0_d, r)]
    ¹ Gelfand et al. (1992); Bubeck et al. (2015); Brosse et al. (2017); Hsieh et al. (2018)

  28. 17 / 44
    Outline
    1. Asymptotically exact data augmentation (AXDA)
    2. Theoretical guarantees for Gaussian smoothing
    3. Split Gibbs sampler (SGS)
    4. Convergence and complexity analyses
    5. Experiments


  29. 18 / 44
    Sampling problem
    π(θ|y) ∝ Π_{i=1}^{b} exp(−U_i(A_i θ)),   each factor being π_i(A_i θ)
    A_i ∈ R^{d_i×d}
    U_i depends on y_i

  30. 18 / 44
    Sampling problem
    π(θ|y) ∝ Π_{i=1}^{b} exp(−U_i(A_i θ)),   each factor being π_i(A_i θ)
    A_i ∈ R^{d_i×d},   U_i depends on y_i
    Example: Bayesian logistic regression with Gaussian prior
    π(θ|y) ∝ exp(−τ‖θ‖²) × Π_{i=1}^{n} exp(−logistic(x_i^T θ; y_i))   (prior × likelihood)
    not standard
    high-dimensional
    possibly distributed data {y_i, x_i}_{i∈[n]}

  31. 19 / 44
    Split Gibbs sampler (SGS)
    π_ρ(θ, z_{1:b}|y) ∝ Π_{i=1}^{b} exp(−U_i(z_i)) κ_ρ(z_i, A_i θ)
    Algorithm 1: Split Gibbs Sampler (SGS)
    1  for t ← 1 to T do
    2      for i ← 1 to b do
    3          z_i^(t) ∼ π_ρ(z_i | θ^(t−1), y_i) ∝ exp(−U_i(z_i) − ‖z_i − A_i θ^(t−1)‖²/(2ρ²))
    4      end
    5      θ^(t) ∼ π_ρ(θ | z_{1:b}^(t)) ∝ Π_{i=1}^{b} N(z_i^(t) | A_i θ, ρ²I_d)
    6  end
    Divide-to-conquer :
    • b simpler steps −→ U_i(z_i)
    • b local steps −→ y_i

  32. 19 / 44
    Split Gibbs sampler (SGS)
    π_ρ(θ, z_{1:b}|y) ∝ Π_{i=1}^{b} exp(−U_i(z_i)) N(z_i | A_i θ, ρ²I_d)
    Algorithm 1: Split Gibbs Sampler (SGS)
    1  for t ← 1 to T do
    2      for i ← 1 to b do
    3          z_i^(t) ∼ π_ρ(z_i | θ^(t−1), y_i) ∝ exp(−U_i(z_i) − ‖z_i − A_i θ^(t−1)‖²/(2ρ²))
    4      end
    5      θ^(t) ∼ π_ρ(θ | z_{1:b}^(t)) ∝ Π_{i=1}^{b} N(z_i^(t) | A_i θ, ρ²I_d)
    6  end
    Divide-to-conquer :
    • b simpler steps −→ U_i(z_i)
    • b local steps −→ y_i
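With the Gaussian kernel, line 5 is a multivariate Gaussian draw whose parameters follow from completing the square. The sketch below shows this θ-step in Python; A_list and z_list are assumed to hold the matrices A_i and the freshly drawn z_i^(t), and a dense Cholesky factorisation is used for simplicity.

```python
import numpy as np

# θ-step of SGS with Gaussian kernel: the full conditional of θ is
# N(Q⁻¹ m, ρ² Q⁻¹) with Q = Σ_i A_iᵀA_i and m = Σ_i A_iᵀ z_i^(t).
def sample_theta(A_list, z_list, rho, rng):
    d = A_list[0].shape[1]
    Q = sum(A.T @ A for A in A_list)                   # Σ_i A_iᵀ A_i
    m = sum(A.T @ z for A, z in zip(A_list, z_list))   # Σ_i A_iᵀ z_i^(t)
    L = np.linalg.cholesky(Q)                          # Q = L Lᵀ
    mean = np.linalg.solve(Q, m)
    # adding ρ L⁻ᵀ ξ with ξ ~ N(0, I_d) yields covariance ρ² Q⁻¹
    return mean + rho * np.linalg.solve(L.T, rng.standard_normal(d))
```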


  34. 20 / 44
    On the scalability in distributed environments¹
    [Diagram: left, the original model where θ interacts with every observation y_i, i ∈ [n]; right, the AXDA model where θ only interacts with the auxiliary variables z_1, . . ., z_b, each attached to its local data y_i]
    → Each latent variable z_i depends on the subset of observations y_i
    ¹ For a comprehensive review, see the work by Rendell et al. (2020)

  35. 21 / 44
    Sampling efficiently the auxiliary variables {z_i}_{i∈[b]}
    Rejection sampling¹
    π_ρ(z_i | θ^(t−1), y_i) ∝ exp(−U_i(z_i) − ‖z_i − A_i θ^(t−1)‖²/(2ρ²))
    If U_i convex & smooth and ρ ≤ ρ̄ with ρ̄ = O(1/d_i):
    exact sampling by rejection sampling with O(1) evaluations of U_i and ∇U_i.
    ¹ Gilks and Wild (1992)

  36. 21 / 44
    Sampling efficiently the auxiliary variables {z_i}_{i∈[b]}
    Rejection sampling¹
    π_ρ(z_i | θ^(t−1), y_i) ∝ exp(−U_i(z_i) − ‖z_i − A_i θ^(t−1)‖²/(2ρ²))
    If U_i convex & smooth and ρ ≤ ρ̄ with ρ̄ = O(1/d_i):
    exact sampling by rejection sampling with O(1) evaluations of U_i and ∇U_i.
    [Figure: target conditional π_ρ(z_i|θ) and its Gaussian approximation for ρ = 4.49, 2.46 and 1.27]
    Here U_i(z_i) = y_i z_i + log(1 + exp(−z_i)) + α z_i²/2
    ¹ Gilks and Wild (1992)
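The sketch below illustrates the idea of the z_i-step for a scalar, convex, differentiable potential such as the logistic-type U_i on this slide. By convexity, exp(−U(z)) ≤ exp(−U(z₀) − U′(z₀)(z − z₀)), so the conditional is dominated by a shifted Gaussian proposal and the acceptance probability is always well defined. This is a simplified variant of the cited scheme, not the paper's exact O(1) construction.

```python
import numpy as np

# Rejection sampler for π_ρ(z|θ) ∝ exp(-U(z) - (z - m)²/(2ρ²)), U convex,
# using the tangent bound exp(-U(z)) <= exp(-U(z0) - U'(z0)(z - z0)).
def U(z, y, alpha):                       # logistic-type potential from the slide
    return y * z + np.log1p(np.exp(-z)) + 0.5 * alpha * z**2

def dU(z, y, alpha):
    return y - 1.0 / (1.0 + np.exp(z)) + alpha * z

def sample_z(m, rho, y, alpha, rng):
    z0 = m                                # linearization point (an arbitrary choice)
    g = dU(z0, y, alpha)
    while True:
        z = rng.normal(m - rho**2 * g, rho)              # shifted Gaussian proposal
        log_accept = -U(z, y, alpha) + U(z0, y, alpha) + g * (z - z0)
        if np.log(rng.uniform()) < log_accept:           # always <= 0 by convexity
            return z
```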

  37. 22 / 44
    Sampling efficiently the master parameter θ
    High-dimensional Gaussian sampling¹
    π_ρ(θ | z_{1:b}^(t)) ∝ Π_{i=1}^{b} N(z_i^(t) | A_i θ, ρ²I_d)
    Main difficulty: handling the high-dimensional matrix Q = Σ_{i=1}^{b} A_i^T A_i.
    [Figure: examples of large structured precision matrices arising in practice]
    −→ Numerous efficient sampling approaches have been proposed by building on numerical linear algebra techniques.
    ¹ For a recent overview, see the work by Vono et al. (2020)
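One generic way to avoid factorising Q is a perturbation-then-solve step: perturb each data block, then solve a single linear system in Q, for instance with conjugate gradients. The sketch below only illustrates this "numerical linear algebra" idea under the Gaussian-kernel model, not one of the specific samplers reviewed in Vono et al. (2020); it assumes SciPy is available.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

# Draw θ ~ N(Q⁻¹ m, ρ² Q⁻¹) with Q = Σ_i A_iᵀA_i and m = Σ_i A_iᵀ z_i, without
# forming or factorising Q: perturb each block and solve Q θ = Σ_i A_iᵀ(z_i + ρ ξ_i).
def sample_theta_cg(A_list, z_list, rho, rng):
    d = A_list[0].shape[1]
    Qmv = lambda v: sum(A.T @ (A @ v) for A in A_list)          # matvec with Q
    rhs = sum(A.T @ (z + rho * rng.standard_normal(A.shape[0]))
              for A, z in zip(A_list, z_list))                  # Σ_i A_iᵀ(z_i + ρ ξ_i)
    theta, _ = cg(LinearOperator((d, d), matvec=Qmv), rhs)
    return theta   # exact in distribution, up to the solver tolerance
```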

  38. 23 / 44
    Comparison with the Unadjusted Langevin Algorithm
    When SGS meets ULA
    In the single splitting case (b = 1) with A_1 = I_d, SGS writes
    1. Sample z^(t) ∼ π_ρ(z | θ^(t)) ∝ exp(−U(z) − ‖z − θ^(t)‖²/(2ρ²))
    2. Sample θ^(t+1) ∼ N(z^(t), ρ²I_d).
    By a Taylor expansion, when ρ → 0, we have
    θ^(t+1) ∼ N(θ^(t) − ρ²∇U(θ^(t)), 2ρ²I_d)
    −→ one ULA step with stepsize γ = ρ².
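For reference, here is the corresponding ULA update in code, for a generic smooth potential U with gradient grad_U (both assumed given); with γ = ρ² it matches the small-ρ limit of the two SGS steps above.

```python
import numpy as np

# One ULA step: θ ← θ − γ ∇U(θ) + √(2γ) ξ,  ξ ~ N(0, I_d),  with γ = ρ².
def ula_step(theta, grad_U, rho, rng):
    gamma = rho**2
    return theta - gamma * grad_U(theta) + np.sqrt(2.0 * gamma) * rng.standard_normal(theta.shape)
```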

  39. 24 / 44
    Comparison with the Unadjusted Langevin Algorithm
    Bias comparison¹
    Distance            Upper bound                        Main assumptions
    ‖π_ρ − π‖_TV        (1/2) ρ² M d                       M-smooth
    ‖π_ULA − π‖_TV      √(2Md) · ρ                         M-smooth & convex
    W_1(π_ρ, π)         min( ρ√d , (1/2) ρ² √(Md) )        M-smooth, m-strongly convex
    W_1(π_ULA, π)       2(M/m) √d · ρ                      M-smooth, m-strongly convex
    −→ Higher-order approximation based on the explicit form of π_ρ.
    ¹ Durmus et al. (2019)

  40. 25 / 44
    Relation with quadratic penalty methods
    MAP estimation under each full conditional of
    π_ρ(θ, z_{1:b}|y) ∝ Π_{i=1}^{b} exp(−U_i(z_i) − ‖z_i − A_i θ‖²/(2ρ²))   (2)
    yields at iteration t
    z_i^(t) = argmin_{z_i} { U_i(z_i) + ‖z_i − A_i θ^(t−1)‖²/(2ρ²) },   ∀i ∈ [b]
    θ^(t) = argmin_θ Σ_{i=1}^{b} ‖z_i^(t) − A_i θ‖²/(2ρ²)
    → Steps involved in quadratic penalty methods

  41. 26 / 44
    Outline
    1. Asymptotically exact data augmentation (AXDA)
    2. Theoretical guarantees for Gaussian smoothing
    3. Split Gibbs sampler (SGS)
    4. Convergence and complexity analyses
    5. Experiments


  42. 27 / 44
    Mixing times
    How many SGS iterations should I run to obtain a sample from π?
    This question can be answered by giving an upper bound on the so-called ε-mixing time:
    t_mix(ε; ν) = min{ t ≥ 0 : D(νP_SGS^t, π) ≤ ε },
    where ε > 0, ν is an initial distribution, P_SGS is the Markov kernel of SGS and D is a statistical distance.
    −→ We are here interested in providing explicit bounds w.r.t. d, ε and the regularity constants associated with U.

  43. 28 / 44
    Explicit convergence rates
    Assumptions : A_i = I_d and U_i m_i-strongly convex for i ∈ [b]
    W_1(νP_SGS^t, π_ρ) ≤ W_1(ν, π_ρ) · (1 − K_SGS)^t
    ‖νP_SGS^t − π_ρ‖_TV ≤ √(Var_{π_ρ}[dν/dπ_ρ]) · (1 − K_SGS)^t,
    where K_SGS = (ρ²/b) Σ_{i=1}^{b} m_i/(1 + m_i ρ²),  ν is the initial distribution and P_SGS the Markov kernel of SGS
    [Figure: log ‖νP_SGS^t − π_ρ‖_TV for ρ = 1 and log W_1(δ_{θ_0}P_SGS^t, π_ρ) for ρ = 0.1, versus the iteration t, together with our bounds]
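The rate above is simple enough to be used directly for planning runs. The sketch below computes K_SGS for given ρ and strong-convexity constants m_i, and the number of iterations after which the W1 bound drops below a tolerance ε; the numerical values are illustrative, not taken from the talk.

```python
import numpy as np

# K_SGS = (ρ²/b) Σ_i m_i / (1 + m_i ρ²) and the number of iterations t such that
# W1(ν, π_ρ) (1 − K_SGS)^t <= ε.
def k_sgs(rho, m):
    m = np.asarray(m, dtype=float)
    return (rho**2 / m.size) * np.sum(m / (1.0 + m * rho**2))

rho, m, eps, w1_init = 0.1, [1.0, 2.0, 5.0], 1e-3, 10.0   # illustrative values
K = k_sgs(rho, m)
t_needed = int(np.ceil(np.log(eps / w1_init) / np.log(1.0 - K)))
print(f"K_SGS = {K:.4f}, iterations for the W1 bound to reach {eps}: {t_needed}")
```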

  44. 29 / 44
    Explicit guidelines for practitioners
    Single splitting case
    π(θ|y) ∝ exp(−U(θ)),   π_ρ(θ, z|y) ∝ exp(−U(z)) N(z | θ, ρ²I_d)
    1. Set ε ∈ (0, 1)
    2. Set ρ² = ε/(dM),  M: Lipschitz constant of ∇U
    3. Start from ν(θ) = N(θ | θ⋆, M^{−1}I_d),  θ⋆: minimizer of U
    4. Run T iterations of SGS with T = C(ε, d)/K_SGS
    Then,  ‖νP_SGS^T − π‖_TV ≤ ε
    [Figure: histograms of the one-dimensional projection a^T θ/‖a‖ under the exact posterior and under SGS, d = 60]
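A small helper following this recipe is given below. The constant C(ε, d) in step 4 is left as a user-supplied callable, since its explicit expression is given in the paper rather than on the slide (the default shown is only a placeholder); the other quantities are read off the slide.

```python
import numpy as np

# Steps 2-4 of the guideline. C is a placeholder for the constant C(ε, d) of
# step 4 (its exact expression is in the paper, not on the slide).
def sgs_settings(eps, d, M, K_sgs, C=lambda eps, d: np.log(d / eps)):
    rho2 = eps / (d * M)                  # step 2: ρ² = ε/(dM)
    T = int(np.ceil(C(eps, d) / K_sgs))   # step 4: T = C(ε, d)/K_SGS
    return rho2, T

def initial_draw(theta_star, M, rng):
    # step 3: ν = N(θ⋆, M⁻¹ I_d), with θ⋆ a minimizer of U
    return theta_star + rng.standard_normal(theta_star.shape) / np.sqrt(M)
```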

  45. 30 / 44
    Complexity results for strongly log-concave densities
    1-Wasserstein distance
    Assumptions:
    U is M-smooth & m-strongly convex, κ = M/m
    starting point: minimizer θ⋆ of U
    Method                       Condition on ε        Evaluations for W_1 error ε√(d/m)
    Unadjusted Langevin          0 < ε ≤ 1             O*(κ²/ε²)
    Underdamped Langevin         0 < ε ≤ 1             O*(κ²/ε)
    Underdamped Langevin         0 < ε ≤ 1/√κ          O*(κ^{3/2}/ε)
    Hamiltonian Dynamics         0 < ε ≤ 1             O*(κ^{3/2}/ε)
    SGS with single splitting    0 < ε ≤ 1/(d√κ)       O*(κ^{1/2}/ε)
    → For small ε, the rate of SGS improves upon previous works.

  46. 31 / 44
    Complexity results for strongly log-concave densities
    Total variation distance
    Assumptions:
    U is M-smooth & m-strongly convex, κ = M/m
    starting point θ^(0) ∼ N(θ; θ⋆, I_d/M)
    Method    Validity       Evaluations
    ULA       0 < ε ≤ 1      O*(κ²d³/ε²)
    MALA      0 < ε ≤ 1      O(κ²d² log^{1.5}(κ^{1/d}/ε))
    SGS       0 < ε ≤ 1      O*(κd²/ε)
    [Figure: empirical mixing time t_mix(ε; ν) in total variation, versus d (SGS slope 1.04 ± 0.02, theoretical slope 2) and versus log(2/ε)/ε (SGS slope 1.06 ± 0.04, theoretical slope 1)]

  47. 32 / 44
    Complexity results for log-concave densities
    Total variation distance
    Idea: Replace each potential U_i by a strongly convex approximation:
    Ũ(θ) = Σ_{i=1}^{b} U_i(A_i θ) + (λ/2) ‖θ − θ⋆‖²
    By using results from Dalalyan (2017), the triangle inequality and our previous bounds, we have:
    Method    Validity       Evaluations
    ULA       0 < ε ≤ 1      O*(M²d³/ε⁴)
    MRW       0 < ε ≤ 1      O*(M²d³/ε²)
    SGS       0 < ε ≤ 1      O*(Md³/ε²)
    → Our complexity results improve upon those of ULA and MRW.

  48. 33 / 44
    Outline
    1. Asymptotically exact data augmentation (AXDA)
    2. Theoretical guarantees for Gaussian smoothing
    3. Split Gibbs sampler (SGS)
    4. Convergence and complexity analyses
    5. Experiments


  49. 34 / 44
    Unsupervised image deconvolution with a smooth prior
    Problem statement
    y = Bθ + ε,   ε ∼ N(0_d, Ω^{−1})
    π(θ) = N(0_d, (γ L^T L)^{−1})
    B: circulant matrix w.r.t. a blurring kernel
    Ω^{−1} = diag(σ_1², . . ., σ_d²),   σ_i ∼ (1 − β)δ_{κ_1} + βδ_{κ_2},  where κ_1 ≪ κ_2
    L: circulant matrix w.r.t. a Laplacian filter
    Posterior π(θ|y): Gaussian with precision Q = B^T ΩB + γ L^T L
    challenging: Q cannot be diagonalized easily
    high-dimensional: d ∼ 10⁵
    [Figure: original and observed (blurred, noisy) images]

  50. 35 / 44
    Unsupervised image deconvolution with a smooth prior
    AXDA approximation and SGS
    π(θ | y, σ, κ_1, κ_2, β, γ) ∝ exp( −(1/2) [ (Bθ − y)^T Ω(Bθ − y) + γ‖Lθ‖² ] )
    Approximate joint posterior :
    π_ρ(θ, z | y, σ, κ_1, κ_2, β, γ) ∝ exp( −(1/2) [ (z − y)^T Ω(z − y) + γ‖Lθ‖² + (1/ρ²)‖z − Bθ‖² ] )

  51. 35 / 44
    Unsupervised image deconvolution with a smooth prior
    AXDA approximation and SGS
    π(θ | y, σ, κ_1, κ_2, β, γ) ∝ exp( −(1/2) [ (Bθ − y)^T Ω(Bθ − y) + γ‖Lθ‖² ] )
    Approximate joint posterior :
    π_ρ(θ, z | y, σ, κ_1, κ_2, β, γ) ∝ exp( −(1/2) [ (z − y)^T Ω(z − y) + γ‖Lθ‖² + (1/ρ²)‖z − Bθ‖² ] )
    Sampling θ :
    π_ρ(θ | z, γ) : Gaussian distribution with precision Q_θ = (1/ρ²) B^T B + γ L^T L

  52. 35 / 44
    Unsupervised image deconvolution with a smooth prior
    AXDA approximation and SGS
    π(θ | y, σ, κ_1, κ_2, β, γ) ∝ exp( −(1/2) [ (Bθ − y)^T Ω(Bθ − y) + γ‖Lθ‖² ] )
    Approximate joint posterior :
    π_ρ(θ, z | y, σ, κ_1, κ_2, β, γ) ∝ exp( −(1/2) [ (z − y)^T Ω(z − y) + γ‖Lθ‖² + (1/ρ²)‖z − Bθ‖² ] )
    Sampling θ :
    π_ρ(θ | z, γ) : Gaussian distribution with precision Q_θ = (1/ρ²) B^T B + γ L^T L
    Sampling {z_i}_{i∈[d]} :
    π_ρ(z | θ, y, σ, κ_1, κ_2, β) : Gaussian distribution with precision Q_z = (1/ρ²) I_d + Ω
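Because B and L are circulant, Q_θ is diagonalised by the 2D DFT, so the θ-step can be carried out exactly with a few FFTs. The sketch below assumes b_psf and l_psf are the zero-padded convolution kernels of B and L and that z is the current auxiliary image; it is an illustrative sketch of this particular conditional, not the experiment's code.

```python
import numpy as np

# θ-step: draw from N(Q_θ⁻¹ Bᵀz/ρ², Q_θ⁻¹) with Q_θ = BᵀB/ρ² + γ LᵀL, both
# operators being circulant and hence diagonal in the 2D Fourier domain.
def sample_theta_fft(z, b_psf, l_psf, rho, gamma, rng):
    b_hat = np.fft.fft2(b_psf)
    l_hat = np.fft.fft2(l_psf)
    q_hat = np.abs(b_hat)**2 / rho**2 + gamma * np.abs(l_hat)**2   # eigenvalues of Q_θ
    xi_hat = np.fft.fft2(rng.standard_normal(z.shape))
    theta_hat = (np.conj(b_hat) * np.fft.fft2(z) / rho**2) / q_hat \
                + xi_hat / np.sqrt(q_hat)
    return np.real(np.fft.ifft2(theta_hat))
```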

  53. 36 / 44
    Unsupervised image deconvolution with a smooth prior
    Restoration results with the MMSE estimator
    [Figure: original image, observed image, MMSE estimate under π, and MMSE estimate under π_ρ (relative error = 0.004)]
    method    SNR (dB)    PSNR (dB)
    SGS       17.57       22.92
    AuxV1     17.57       22.92
    AuxV2     17.58       22.93
    RJPO      17.57       22.91

  54. 37 / 44
    Unsupervised image deconvolution with a smooth prior
    Convergence diagnostics of the MCMC samplers
    [Figure: trace plots of γ, β, κ_1 and κ_2 over 1,000 iterations for SGS, AuxV1, AuxV2 and RJPO, together with the true values of β, κ_1 and κ_2]

  55. 38 / 44
    Image inpainting with non-smooth prior
    Problem statement and challenges
    y = Hθ + ε,   ε ∼ N(0_d, σ²I_d)
    π(θ) : TV prior
    H: decimation matrix associated with a binary mask
    TV(θ) = τ Σ_{1≤i≤d} ‖D_i θ‖
    D_i θ: 2D discrete gradient applied at pixel i of θ
    Posterior π(θ|y) :
    not standard : no conjugacy
    not smooth : TV prior
    high-dimensional : d ∼ 10⁵

  56. 39 / 44
    Image inpainting with non-smooth prior
    Possible surrogate: MYULA¹
    π(θ|y) ∝ N(y | Hθ, σ²I_n) Π_{i=1}^{d} exp(−τ‖D_i θ‖)   (the product being the prior π(θ))
    Approximate posterior :
    π_λ(θ|y) ∝ N(y | Hθ, σ²I_n) exp(−M_λ[TV](θ)),   M_λ[TV]: Moreau-Yosida envelope of TV
    ¹ Durmus et al. (2018)

  57. 39 / 44
    Image inpainting with non-smooth prior
    Possible surrogate: MYULA¹
    π(θ|y) ∝ N(y | Hθ, σ²I_n) Π_{i=1}^{d} exp(−τ‖D_i θ‖)   (the product being the prior π(θ))
    Approximate posterior :
    π_λ(θ|y) ∝ N(y | Hθ, σ²I_n) exp(−M_λ[TV](θ)),   M_λ[TV]: Moreau-Yosida envelope of TV
    MYULA :
    θ^(t+1) ∼ N( (1 − γ/λ) θ^(t) + γ∇ log π(y|θ^(t)) + (γ/λ) prox_TV^λ(θ^(t)),  2γ I_d )
    ¹ Durmus et al. (2018)

  58. 39 / 44
    Image inpainting with non-smooth prior
    Possible surrogate: MYULA¹
    π(θ|y) ∝ N(y | Hθ, σ²I_n) Π_{i=1}^{d} exp(−τ‖D_i θ‖)   (the product being the prior π(θ))
    Approximate posterior :
    π_λ(θ|y) ∝ N(y | Hθ, σ²I_n) exp(−M_λ[TV](θ)),   M_λ[TV]: Moreau-Yosida envelope of TV
    MYULA :
    θ^(t+1) ∼ N( (1 − γ/λ) θ^(t) + γ∇ log π(y|θ^(t)) + (γ/λ) prox_TV^λ(θ^(t)),  2γ I_d )
    No closed form for prox_TV :
    −→ additional approximation
    −→ computational burden
    ¹ Durmus et al. (2018)

  59. 40 / 44
    Image inpainting with non-smooth prior
    AXDA approximation and SGS
    π(θ|y) ∝ N(y | Hθ, σ²I_n) Π_{i=1}^{d} exp(−τ‖D_i θ‖)   (the product being the prior π(θ))
    Approximate joint posterior :
    π_ρ(θ, z_{1:d}|y) ∝ N(y | Hθ, σ²I_n) Π_{i=1}^{d} exp(−τ‖z_i‖) N(z_i | D_i θ, ρ²I_2)

  60. 40 / 44
    Image inpainting with non-smooth prior
    AXDA approximation and SGS
    π(θ|y) ∝ N(y | Hθ, σ²I_n) Π_{i=1}^{d} exp(−τ‖D_i θ‖)   (the product being the prior π(θ))
    Approximate joint posterior :
    π_ρ(θ, z_{1:d}|y) ∝ N(y | Hθ, σ²I_n) Π_{i=1}^{d} exp(−τ‖z_i‖) N(z_i | D_i θ, ρ²I_2)
    Sampling θ :
    π_ρ(θ | z_{1:d}, y) ∝ N(y | Hθ, σ²I_n) Π_{i=1}^{d} N(z_i | D_i θ, ρ²I_2)

  61. 40 / 44
    Image inpainting with non-smooth prior
    AXDA approximation and SGS
    π(θ|y) ∝ N(y | Hθ, σ²I_n) Π_{i=1}^{d} exp(−τ‖D_i θ‖)   (the product being the prior π(θ))
    Approximate joint posterior :
    π_ρ(θ, z_{1:d}|y) ∝ N(y | Hθ, σ²I_n) Π_{i=1}^{d} exp(−τ‖z_i‖) N(z_i | D_i θ, ρ²I_2)
    Sampling θ :
    π_ρ(θ | z_{1:d}, y) ∝ N(y | Hθ, σ²I_n) Π_{i=1}^{d} N(z_i | D_i θ, ρ²I_2)
    Sampling {z_i}_{i∈[d]} :
    π_ρ(z_i | θ) ∝ exp(−τ‖z_i‖) N(z_i | D_i θ, ρ²I_2)
    −→ DA applied to exp(−τ‖z_i‖): Gaussian and Inverse-Gaussian conditionals

  62. 41 / 44
    Image inpainting with non-smooth prior
    Highest posterior density regions
    C_α = {θ ∈ R^d | π(θ|y) ≥ exp(−γ_α)},   ∫_{C_α} π(θ|y) dθ = 1 − α
    [Figure: estimated HPD thresholds γ_α versus α for SGS and MYULA, and their relative difference, of the order of 10⁻³]
    −→ Relative error of the order of 0.1%
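In practice the thresholds γ_α in such figures are estimated directly from the MCMC output. A minimal sketch, assuming U_chain stores U(θ^(t)) = −log π(θ^(t)|y) (up to a constant) along the chain:

```python
import numpy as np

# γ_α as the empirical (1 − α)-quantile of U(θ) along the chain, so that the
# region {θ : U(θ) <= γ_α} carries posterior mass ≈ 1 − α.
def hpd_threshold(U_chain, alpha):
    return np.quantile(U_chain, 1.0 - alpha)

# e.g. thresholds over a grid of credibility levels, as in the figure:
# gammas = np.array([hpd_threshold(U_chain, a) for a in np.linspace(0.01, 0.99, 99)])
```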

  63. 42 / 44
    Image inpainting with non-smooth prior
    Sampling efficiency
    U(θ) = − log π(θ|y)
    [Figure: U(θ^(t)) versus iteration t for SGS and MYULA, with a zoom on the last iterations]
    −→ Faster convergence towards high-probability regions

  64. 42 / 44
    Image inpainting with non-smooth prior
    Sampling efficiency
    ESSR: effective sample size ratio
    T_1: CPU time in seconds to draw one sample
    method    ESSR    T_1 [seconds]
    SGS       0.22    3.83
    MYULA     0.15    4.22
    −→ ESSR improved by 47%
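For completeness, here is a rough sketch of how an effective sample size ratio can be estimated from a scalar summary of the chain (for instance U(θ^(t))), using a standard truncated-autocorrelation estimator. The talk's exact estimator is not specified on the slide, so this is only indicative.

```python
import numpy as np

# ESSR ≈ ESS / n with ESS = n / (1 + 2 Σ_k acf(k)), the sum being truncated at
# the first negative autocorrelation.
def essr(x):
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = x.size
    acf = np.correlate(x, x, mode="full")[n - 1:] / (np.arange(n, 0, -1) * np.var(x))
    iat = 1.0
    for k in range(1, n):
        if acf[k] < 0.0:
            break
        iat += 2.0 * acf[k]
    return 1.0 / iat
```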

  65. 43 / 44
    Conclusion
    Broad and simple approximate statistical framework
    • Multiple convolutions with smoothing kernels
    • Related to Approximate Bayesian Computation
    • Rich interpretation from both simulation and optimisation
    • Theoretical guarantees
    Efficient sampling based on AXDA
    • Parallel and simple Gibbs sampling
    • Explicit convergence rates and mixing times
    • State-of-the-art performance on image processing problems
    Perspectives
    • Convergence and complexity analyses for Metropolis-within-SGS
    • Distributed and asynchronous sampling
    • Applications to challenging machine and deep learning problems
    • Comparison with stochastic gradient MCMC


  66. 44 / 44
    References
    Theory and methods
    M. Vono, N. Dobigeon and P. Chainais (2019), “Split-and-augmented Gibbs sampler - Application
    to large-scale inference problems,” IEEE Transactions on Signal Processing, vol. 67, no. 6, pp.
    1648-1661.
    M. Vono, N. Dobigeon and P. Chainais (2020), “Asymptotically exact data augmentation: models,
    properties and algorithms,” Journal of Computational and Graphical Statistics (in press).
    M. Vono, D. Paulin and A. Doucet (2019), “Efficient MCMC sampling with dimension-free
    convergence rate,” arXiv: 1905.11937.
    M. Vono, N. Dobigeon and P. Chainais (2020), “High-dimensional Gaussian sampling: A review
    and a unifying approach based on a stochastic proximal point algorithm,” arXiv: 2010.01510.
    L. J. Rendell et al. (2020), “Global consensus Monte Carlo,” Journal of Computational and
    Graphical Statistics (in press).
    Applications
    M. Vono et al. (2019), “Bayesian image restoration under Poisson noise and log-concave prior,” in
    Proc. ICASSP.
    M. Vono et al. (2018), “Sparse Bayesian binary logistic regression using the split-and-augmented
    Gibbs sampler,” in Proc. MLSP.
    Code
    https://github.com/mvono


  67. 44 / 44
    Thank you for your attention! Questions?
    Joint work with Pierre Chainais, Nicolas Dobigeon, Arnaud Doucet & Daniel Paulin


  68. 45 / 44
    Andrieu, C., de Freitas, N., Doucet, A., and Jordan, M. I. (2003), “An Introduction to MCMC for
    Machine Learning,” Machine Learning, 50, 5–43.
    Brosse, N., Durmus, A., Moulines, E., and Pereyra, M. (2017), “Sampling from a log-concave distribution
    with compact support with proximal Langevin Monte Carlo,” in Conference on Learning Theory,
    vol. 65, pp. 319–342.
    Bubeck, S., Eldan, R., and Lehec, J. (2015), “Finite-Time Analysis of Projected Langevin Monte Carlo,”
    in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume
    1, NIPS’15, p. 1243–1251.
    Celeux, G., Chretien, S., Forbes, F., and Mkhadri, A. (2001), “A Component-Wise EM Algorithm for
    Mixtures,” Journal of Computational and Graphical Statistics, 10, 697–712.
    Chambolle, A., Novaga, M., Cremers, D., and Pock, T. (2010), “An introduction to total variation for
    image analysis,” in Theoretical Foundations and Numerical Methods for Sparse Recovery, De Gruyter.
    Dalalyan, A. and Tsybakov, A. (2012), “Sparse regression learning by aggregation and Langevin
    Monte-Carlo,” Journal of Computer and System Sciences, 78, 1423–1443.
    Dalalyan, A. S. (2017), “Theoretical guarantees for approximate sampling from smooth and log-concave
    densities,” Journal of the Royal Statistical Society, Series B, 79, 651–676.
    Dang, K.-D., Quiroz, M., Kohn, R., Tran, M.-N., and Villani, M. (2019), “Hamiltonian Monte Carlo with
    Energy Conserving Subsampling,” Journal of Machine Learning Research, 20, 1–31.
    Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), “Maximum Likelihood from Incomplete Data via
    the EM Algorithm,” Journal of the Royal Statistical Society, Series B, 39, 1–38.
    Duane, S., Kennedy, A., Pendleton, B. J., and Roweth, D. (1987), “Hybrid Monte Carlo,” Physics Letters
    B, 195, 216–222.
    Durmus, A., Majewski, S., and Miasojedow, B. (2019), “Analysis of Langevin Monte Carlo via convex
    optimization,” Journal of Machine Learning Research, 20, 1–46, available at
    http://www.jmlr.org/papers/volume20/18-173/18-173.pdf.
    Durmus, A., Moulines, E., and Pereyra, M. (2018), “Efficient Bayesian Computation by Proximal Markov
    chain Monte Carlo: When Langevin Meets Moreau,” SIAM Journal on Imaging Sciences, 11, 473–506.
    Edwards, R. G. and Sokal, A. D. (1988), “Generalization of the Fortuin-Kasteleyn-Swendsen-Wang
    representation and Monte Carlo algorithm,” Physical Review D, 38, 2009–2012.


  69. 46 / 44
    Gelfand, A. E., Smith, A. F. M., and Lee, T.-M. (1992), “Bayesian Analysis of Constrained Parameter
    and Truncated Data Problems Using Gibbs Sampling,” Journal of the American Statistical Association,
    87, 523–532.
    Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2003), Bayesian Data Analysis, Chapman and
    Hall/CRC, 2nd ed.
    Gilks, W. R. and Wild, P. (1992), “Adaptive Rejection Sampling for Gibbs Sampling,” Journal of the
    Royal Statistical Society. Series C (Applied Statistics), 41, 337–348.
    Hanson, T. E., Branscum, A. J., and Johnson, W. O. (2014), “Informative g-Priors for Logistic
    Regression,” Bayesian Analysis, 9, 597–612.
    Hartley, H. O. (1958), “Maximum Likelihood Estimation from Incomplete Data,” Biometrics, 14, 174–194.
    Hsieh, Y.-P., Kavis, A., Rolland, P., and Cevher, V. (2018), “Mirrored Langevin Dynamics,” in Advances
    in Neural Information Processing Systems, vol. 31, pp. 2878–2887.
    Idier, J. (ed.) (2008), Bayesian Approach to Inverse Problems, Wiley.
    Luu, T., Fadili, J., and Chesneau, C. (2020), “Sampling from non-smooth distribution through Langevin
    diffusion,” Methodology and Computing in Applied Probability (in press).
    Maddison, C. J., Paulin, D., Teh, Y. W., O’Donoghue, B., and Doucet, A. (2018), “Hamiltonian
    Descent Methods,” [online]. Technical report. Available at https://arxiv.org/abs/1809.05042/.
    Marin, J.-M., Pudlo, P., Robert, C. P., and Ryder, R. J. (2012), “Approximate Bayesian computational
    methods,” Statistics and Computing, 22, 1167–1180.
    Marnissi, Y., Chouzenoux, E., Benazza-Benyahia, A., and Pesquet, J.-C. (2018), “An Auxiliary Variable
    Method for Markov Chain Monte Carlo Algorithms in High Dimension,” Entropy, 20.
    Neal, R. M. (2003), “Slice sampling,” Ann. Statist., 31, 705–767.
    — (2011), “MCMC Using Hamiltonian Dynamics,” Handbook of Markov Chain Monte Carlo, 54, 113–162.
    Park, T. and Casella, G. (2008), “The Bayesian Lasso,” Journal of the American Statistical Association,
    103, 681–686.
    Pereyra, M. (2016), “Proximal Markov chain Monte Carlo algorithms,” Statistics and Computing, 26,
    745–760.
    Rendell, L. J., Johansen, A. M., Lee, A., and Whiteley, N. (2020), “Global consensus Monte Carlo,”
    Journal of Computational and Graphical Statistics, to appear.


  70. 47 / 44
    Robert, C. P. (2001), The Bayesian Choice: from decision-theoretic foundations to computational
    implementation, New York: Springer, 2nd ed.
    Robert, C. P. and Casella, G. (2004), Monte Carlo Statistical Methods, Berlin: Springer, 2nd ed.
    Roberts, G. O. and Tweedie, R. L. (1996), “Exponential convergence of Langevin distributions and their
    discrete approximations,” Bernoulli, 2, 341–363.
    Sabanes Bove, D. and Held, L. (2011), “Hyper-g priors for generalized linear models,” Bayesian Analysis,
    6, 387–410.
    van Dyk, D. A. and Meng, X.-L. (2001), “The Art of Data Augmentation,” Journal of Computational and
    Graphical Statistics, 10, 1–50.
    Vono, M., Dobigeon, N., and Chainais, P. (2020), “High-dimensional Gaussian sampling: A review and a
    unifying approach based on a stochastic proximal point algorithm,” SIAM Review, in 1st round of
    review.
    Wilkinson, R. (2013), “Approximate Bayesian Computation (ABC) gives exact results under the
    assumption of model error,” Statistical applications in genetics and molecular biology, 12, 1–13.
