Slide 1

Slide 1 text

1 / 44 Efficient MCMC sampling via asymptotically exact data augmentation
Maxime Vono
Joint work with P. Chainais, N. Dobigeon, A. Doucet and D. Paulin
S3 Seminar - The Paris-Saclay Signal Seminar, December 11, 2020

Slide 2

Slide 2 text

2 / 44 Bayesian inference¹
θ: unknown object of interest; y: available data (observations)
Prior × Likelihood −→ Posterior:  θ ∼ π(θ),  y|θ ∼ π(y|θ),  θ|y ∼ π(θ|y)
¹ Robert (2001); Gelman et al. (2003)

Slide 3

Slide 3 text

2 / 44 Bayesian inference
θ: unknown object of interest; y: available data (observations)
Bayesian estimators: argmin_δ ∫ L(δ, θ) π(θ|y) dθ (e.g., the posterior mean E(θ|y))
Credibility regions: ∫_{C_α} π(θ|y) dθ = 1 − α, α ∈ (0, 1)

Slide 4

Slide 4 text

3 / 44 Monte Carlo integration⁴
∫ h(θ) π(θ|y) dθ ≈ (1/N) Σ_{n=1}^N h(θ^(n)),  θ^(n) ∼ π(θ|y)
Numerous applications: inverse problems¹, statistical machine learning², aggregation of estimators³
¹ Idier (2008)  ² Andrieu et al. (2003)  ³ Dalalyan and Tsybakov (2012)  ⁴ Robert and Casella (2004)
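A minimal numerical sketch of the estimator above, assuming a toy posterior that is standard normal and the test function h(θ) = θ² (both choices are illustrative, only meant to show the averaging step):

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E[h(theta) | y] by Monte Carlo when exact posterior samples are
# available. Toy setting: pi(theta | y) = N(0, 1) and h(theta) = theta**2,
# whose exact expectation equals 1.
def h(theta):
    return theta ** 2

N = 100_000
theta_samples = rng.standard_normal(N)   # theta^(n) ~ pi(theta | y)
mc_estimate = np.mean(h(theta_samples))  # (1/N) sum_n h(theta^(n))

print(f"Monte Carlo estimate: {mc_estimate:.4f} (exact value: 1.0)")
```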

Slide 5

Slide 5 text

3 / 44 Monte Carlo integration¹
∫ h(θ) π(θ|y) dθ ≈ (1/N) Σ_{n=1}^N h(θ^(n)),  θ^(n) ∼ π(θ|y)
Sampling challenges:
− log π(θ|y) = Σ_{i=1}^b U_i(A_i θ)
{U_i ; i ∈ [b]} non-smooth
θ ∈ R^d with d ≫ 1
¹ Robert and Casella (2004)

Slide 6

Slide 6 text

4 / 44 Illustrative example: image inpainting
y = Hθ + ε, ε ∼ N(0_d, σ²I_d)
π(θ): TV prior¹
Posterior π(θ|y):
not standard: no conjugacy
not smooth: TV prior
high-dimensional: d ∼ 10⁵
¹ Chambolle et al. (2010)

Slide 7

Slide 7 text

5 / 44 Aim of this talk
Present a general statistical framework to perform efficient sampling
Seminal works: HMC¹, (MY)ULA²
efficient & simple sampling
scalability in high dimension
scalability in big data settings
[Figure: N(0, 1) target and the trace of a random-walk sampler over 15 000 iterations]
¹ Duane et al. (1987); Neal (2011); Maddison et al. (2018); Dang et al. (2019)
² Roberts and Tweedie (1996); Pereyra (2016); Durmus et al. (2018); Luu et al. (2020)

Slide 8

Slide 8 text

6 / 44 Outline 1. Asymptotically exact data augmentation (AXDA) 2. Theoretical guarantees for Gaussian smoothing 3. Split Gibbs sampler (SGS) 4. Convergence and complexity analyses 5. Experiments

Slide 9

Slide 9 text

7 / 44 Outline 1. Asymptotically exact data augmentation (AXDA) 2. Theoretical guarantees for Gaussian smoothing 3. Split Gibbs sampler (SGS) 4. Convergence and complexity analyses 5. Experiments

Slide 10

Slide 10 text

8 / 44 Data augmentation (DA)
∫ π(θ, z|y) dz = π(θ|y)
Numerous well-known advantages:
at the core of EM¹
π(θ, z|y): simpler full conditionals²
better mixing³
[Figure: slice sampling illustration]
¹ Hartley (1958); Dempster et al. (1977); Celeux et al. (2001)
² Neal (2003)
³ Edwards and Sokal (1988); Marnissi et al. (2018)

Slide 11

Slide 11 text

9 / 44 Example: Bayesian LASSO¹
y = Xθ + ε, ε ∼ N(0_d, σ²I_d)
π(θ) ∝ e^{−τ‖θ‖₁}
¹ Park and Casella (2008)

Slide 12

Slide 12 text

9 / 44 Example: Bayesian LASSO¹
y = Xθ + ε, ε ∼ N(0_d, σ²I_d)
π(θ) ∝ e^{−τ‖θ‖₁}
Surrogate: DA by exploiting the mixture representation
π(θ) = ∏_{i=1}^d ∫_0^∞ N(θ_i|0, z_i) E(z_i|τ²) dz_i
¹ Park and Casella (2008)

Slide 13

Slide 13 text

9 / 44 Example: Bayesian LASSO¹
y = Xθ + ε, ε ∼ N(0_d, σ²I_d)
π(θ) ∝ e^{−τ‖θ‖₁}
Surrogate: DA by exploiting the mixture representation
π(θ) = ∏_{i=1}^d ∫_0^∞ N(θ_i|0, z_i) E(z_i|τ²) dz_i
Simple Gibbs sampling:
π(θ|y, z_1, . . . , z_d): Gaussian
π(1/z_i|θ): Inverse-Gaussian
¹ Park and Casella (2008)
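A hedged sketch of this data-augmented Gibbs sampler, assuming the scale-mixture stated on the slide (θ_i|z_i ∼ N(0, z_i), z_i exponential with rate τ²/2), a fixed noise variance σ², and synthetic data; Park and Casella use a slightly different parameterization conditioned on σ², so the constants below are only one common convention:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy Gibbs sampler for the Bayesian LASSO via the normal/exponential mixture:
#   theta_i | z_i ~ N(0, z_i),   z_i ~ Exp(rate = tau^2 / 2),   sigma^2 fixed.
n, d, sigma2, tau = 50, 10, 1.0, 2.0
X = rng.standard_normal((n, d))
theta_true = np.concatenate([np.ones(3), np.zeros(d - 3)])
y = X @ theta_true + np.sqrt(sigma2) * rng.standard_normal(n)

theta, z = np.zeros(d), np.ones(d)
samples = []
for t in range(2000):
    # theta | z, y : Gaussian with precision X'X / sigma^2 + diag(1/z)
    P = X.T @ X / sigma2 + np.diag(1.0 / z)
    L = np.linalg.cholesky(P)
    mu = np.linalg.solve(P, X.T @ y / sigma2)
    theta = mu + np.linalg.solve(L.T, rng.standard_normal(d))
    # 1/z_i | theta_i : Inverse-Gaussian(mean = tau / |theta_i|, shape = tau^2)
    inv_z = rng.wald(tau / np.abs(theta), tau ** 2)
    z = 1.0 / inv_z
    samples.append(theta.copy())

print(np.mean(samples[500:], axis=0).round(2))  # posterior mean estimate
```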

Slide 14

Slide 14 text

10 / 44 The art of data augmentation
Unfortunately, satisfying
∫ π(θ, z|y) dz = π(θ|y)   (1)
while leading to efficient algorithms is a matter of art.¹
¹ van Dyk and Meng (2001)

Slide 15

Slide 15 text

10 / 44 The art of data augmentation
Unfortunately, satisfying
∫ π(θ, z|y) dz = π(θ|y)   (1)
while leading to efficient algorithms is a matter of art.¹
Proposed approach: a systematic & simple way to relax (1) while leading to efficient algorithms
∫ πρ(θ, z|y) dz −→ π(θ|y) as ρ → 0
¹ van Dyk and Meng (2001)

Slide 16

Slide 16 text

10 / 44 The art of data augmentation
Unfortunately, satisfying
∫ π(θ, z|y) dz = π(θ|y)   (1)
while leading to efficient algorithms is a matter of art.¹
Proposed approach: a systematic & simple way to relax (1) while leading to efficient algorithms
∫ πρ(θ, z|y) dz −→ π(θ|y) as ρ → 0
−→ How to build πρ(θ, z|y)?
¹ van Dyk and Meng (2001)

Slide 17

Slide 17 text

11 / 44 On building πρ
A natural manner to construct πρ is to define
πρ(θ, z|y) = π(z|y) κρ(z, θ),  where κρ(·, θ) → δ_θ(·) weakly as ρ → 0
[Figure: Gaussian kernels κρ(z; θ) centred at θ for ρ ∈ {0.05, 0.1, 0.2, 0.3}]

Slide 18

Slide 18 text

11 / 44 On building πρ
A natural manner to construct πρ is to define
πρ(θ, z|y) = π(z|y) κρ(z, θ),  where κρ(·, θ) → δ_θ(·) weakly as ρ → 0
[Figure: Gaussian kernels κρ(z; θ) centred at θ for ρ ∈ {0.05, 0.1, 0.2, 0.3}]
Divide-to-conquer generalization:
π(θ|y) ∝ ∏_{i=1}^b π_i(A_iθ|y_i)  −→  πρ(θ, z_{1:b}|y) ∝ ∏_{i=1}^b π_i(z_i|y_i) κρ(z_i, A_iθ)

Slide 19

Slide 19 text

12 / 44 Split Gibbs sampler (SGS)
π(θ|y) ∝ ∏_{i=1}^b π_i(A_iθ|y_i)  −→  πρ(θ, z_{1:b}|y) ∝ ∏_{i=1}^b π_i(z_i|y_i) κρ(z_i, A_iθ)
Algorithm 1: Split Gibbs Sampler (SGS)
1 for t ← 1 to T do
2   for i ← 1 to b do
3     z_i^(t) ∼ π_i(z_i|y_i) κρ(z_i, A_iθ^(t−1))
4   end
5   θ^(t) ∼ ∏_{i=1}^b κρ(z_i^(t), A_iθ)
6 end
Divide-to-conquer: b simpler steps −→ π_i(z_i|y_i); b local steps −→ y_i

Slide 20

Slide 20 text

13 / 44 Kernel choice: κρ(z, θ) ∝ ρ^{−d} K((z − θ)/ρ)
name         | support | K(u)
Gaussian     | R       | (1/√(2π)) exp(−u²/2)
Cauchy       | R       | 1/(π(1 + u²))
Laplace      | R       | (1/2) exp(−|u|)
Dirichlet    | R       | sin²(u)/(πu²)
Uniform      | [−1, 1] | (1/2) 1_{|u|≤1}
Triangular   | [−1, 1] | (1 − |u|) 1_{|u|≤1}
Epanechnikov | [−1, 1] | (3/4)(1 − u²) 1_{|u|≤1}
[Figure: the normalized kernels K(u)/K(0) on R and on [−1, 1]]
−→ Similarity with “noisy” ABC methods (Marin et al. 2012; Wilkinson 2013)


Slide 22

Slide 22 text

14 / 44 Outline 1. Asymptotically exact data augmentation (AXDA) 2. Theoretical guarantees for Gaussian smoothing 3. Split Gibbs sampler (SGS) 4. Convergence and complexity analyses 5. Experiments

Slide 23

Slide 23 text

15 / 44 Densities with full domains
We assume here that supp(π) = R^d.
π(θ|y) ∝ ∏_{i=1}^b e^{−U_i(A_iθ)}, each factor playing the role of π_i(A_iθ)
Distance      | Main assumptions                                      | Upper bound
‖πρ − π‖_TV   | U_i L_i-Lipschitz                                     | ρ Σ_{i=1}^b √(2 d_i) L_i + o(ρ)
‖πρ − π‖_TV   | U_1 M_1-smooth, b = 1                                 | (1/2) ρ² M_1 d
‖πρ − π‖_TV   | U_i M_i-smooth & strongly convex                      | (1/2) ρ² Σ_{i=1}^b M_i d_i + o(ρ²)
W_1(πρ, π)    | U_1 M_1-smooth & strongly convex, b = 1, A_1 = I_d    | min(ρ√d, (1/2) ρ² √(M_1 d))

Slide 24

Slide 24 text

15 / 44 Densities with full domains
π(θ|y) ∝ ∏_{i=1}^b e^{−U_i(A_iθ)}, each factor playing the role of π_i(A_iθ)
πρ(θ|y) ∝ ∏_{i=1}^b ∫_{R^{d_i}} e^{−U_i(z_i)} N(z_i|A_iθ, ρ²I_{d_i}) dz_i, the Gaussian factor being the kernel κρ(z_i, A_iθ)
Distance      | Main assumptions                                      | Upper bound
‖πρ − π‖_TV   | U_i L_i-Lipschitz                                     | ρ Σ_{i=1}^b √(2 d_i) L_i + o(ρ)
‖πρ − π‖_TV   | U_1 M_1-smooth, b = 1                                 | (1/2) ρ² M_1 d
‖πρ − π‖_TV   | U_i M_i-smooth & strongly convex                      | (1/2) ρ² Σ_{i=1}^b M_i d_i + o(ρ²)
W_1(πρ, π)    | U_1 M_1-smooth & strongly convex, b = 1, A_1 = I_d    | min(ρ√d, (1/2) ρ² √(M_1 d))

Slide 25

Slide 25 text

15 / 44 Densities with full domains
πρ(θ|y) ∝ ∏_{i=1}^b ∫_{R^{d_i}} e^{−U_i(z_i)} N(z_i|A_iθ, ρ²I_{d_i}) dz_i, the Gaussian factor being the kernel κρ(z_i, A_iθ)
Distance      | Upper bound                           | Main assumptions
‖πρ − π‖_TV   | ρ Σ_{i=1}^b √(2 d_i) L_i + o(ρ)       | U_i L_i-Lipschitz
‖πρ − π‖_TV   | (1/2) ρ² M_1 d                        | U_1 M_1-smooth, b = 1
‖πρ − π‖_TV   | (1/2) ρ² Σ_{i=1}^b M_i d_i + o(ρ²)    | U_i M_i-smooth & strongly convex
W_1(πρ, π)    | min(ρ√d, (1/2) ρ² √(M_1 d))           | U_1 M_1-smooth & strongly convex, b = 1, A_1 = I_d

Slide 26

Slide 26 text

15 / 44 Densities with full domains
Illustration of these bounds for a univariate Gaussian toy example with b = 1 and U_1(θ) = θ²/(2σ²):
π(θ) = (1/√(2πσ²)) e^{−θ²/(2σ²)}
πρ(θ) = (1/√(2π(σ² + ρ²))) e^{−θ²/(2(σ²+ρ²))}
[Figure: log-log plots of ‖π − πρ‖_TV and W_1(π, πρ) versus ρ, compared with the upper bounds and the Cρ² rate]
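A quick numerical check of the quadratic bias in this toy example, computing the total variation distance between N(0, σ²) and N(0, σ² + ρ²) by numerical integration (the grid and the values of ρ are arbitrary choices):

```python
import numpy as np

# pi = N(0, sigma^2) and pi_rho = N(0, sigma^2 + rho^2): the TV distance
# should scale as C * rho^2 for small rho.
def gauss(x, var):
    return np.exp(-x ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

sigma2 = 1.0
x = np.linspace(-20, 20, 400_001)
for rho in [0.5, 0.25, 0.125]:
    tv = 0.5 * np.trapz(np.abs(gauss(x, sigma2) - gauss(x, sigma2 + rho ** 2)), x)
    print(f"rho = {rho:5.3f}   TV = {tv:.6f}   TV / rho^2 = {tv / rho ** 2:.4f}")
# The ratio TV / rho^2 stabilises, consistent with the quadratic rate.
```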

Slide 27

Slide 27 text

16 / 44 Uniform prior densities on compact convex sets¹
π has bounded support: supp(π) = K, where K ⊂ R^d is a convex body.
π(θ) ∝ e^{−ι_K(θ)},  with ι_K(θ) = 0 if θ ∈ K and ι_K(θ) = ∞ if θ ∉ K
πρ(θ) ∝ ∫_{R^d} e^{−ι_K(z) − ‖z−θ‖²/(2ρ²)} dz
H1. There exists r > 0 such that B(0_d, r) ⊂ K.
For ρ ∈ (0, r/(2√(2d))],  ‖πρ − π‖_TV ≤ 2√2 ρ d / r
[Figure: a ball B(0_d, r) inscribed in the convex body K]
¹ Gelfand et al. (1992); Bubeck et al. (2015); Brosse et al. (2017); Hsieh et al. (2018)

Slide 28

Slide 28 text

17 / 44 Outline 1. Asymptotically exact data augmentation (AXDA) 2. Theoretical guarantees for Gaussian smoothing 3. Split Gibbs sampler (SGS) 4. Convergence and complexity analyses 5. Experiments

Slide 29

Slide 29 text

18 / 44 Sampling problem
π(θ|y) ∝ ∏_{i=1}^b e^{−U_i(A_iθ)}, each factor playing the role of π_i(A_iθ)
A_i ∈ R^{d_i×d}
U_i depends on y_i

Slide 30

Slide 30 text

18 / 44 Sampling problem
π(θ|y) ∝ ∏_{i=1}^b e^{−U_i(A_iθ)}, each factor playing the role of π_i(A_iθ)
A_i ∈ R^{d_i×d}
U_i depends on y_i
Example: Bayesian logistic regression with Gaussian prior
π(θ|y) ∝ e^{−τ‖θ‖²} ∏_{i=1}^n e^{−logistic(x_i^T θ; y_i)}   (prior × likelihood)
not standard
high-dimensional
possibly distributed data {y_i, x_i}_{i∈[n]}

Slide 31

Slide 31 text

19 / 44 Split Gibbs sampler (SGS)
πρ(θ, z_{1:b}|y) ∝ ∏_{i=1}^b e^{−U_i(z_i)} κρ(z_i, A_iθ)
Algorithm 1: Split Gibbs Sampler (SGS)
1 for t ← 1 to T do
2   for i ← 1 to b do
3     z_i^(t) ∼ πρ(z_i|θ^(t−1), y_i) ∝ exp(−U_i(z_i) − ‖z_i − A_iθ^(t−1)‖²/(2ρ²))
4   end
5   θ^(t) ∼ πρ(θ|z_{1:b}^(t)) ∝ ∏_{i=1}^b N(z_i^(t)|A_iθ, ρ²I_{d_i})
6 end
Divide-to-conquer: • b simpler steps −→ U_i(z_i)  • b local steps −→ y_i

Slide 32

Slide 32 text

19 / 44 Split Gibbs sampler (SGS)
πρ(θ, z_{1:b}|y) ∝ ∏_{i=1}^b e^{−U_i(z_i)} N(z_i|A_iθ, ρ²I_{d_i})
Algorithm 1: Split Gibbs Sampler (SGS)
1 for t ← 1 to T do
2   for i ← 1 to b do
3     z_i^(t) ∼ πρ(z_i|θ^(t−1), y_i) ∝ exp(−U_i(z_i) − ‖z_i − A_iθ^(t−1)‖²/(2ρ²))
4   end
5   θ^(t) ∼ πρ(θ|z_{1:b}^(t)) ∝ ∏_{i=1}^b N(z_i^(t)|A_iθ, ρ²I_{d_i})
6 end
Divide-to-conquer: • b simpler steps −→ U_i(z_i)  • b local steps −→ y_i
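A minimal sketch of this two-block Gibbs structure, under the simplifying assumptions b = 1, A_1 = I_d and a Gaussian potential U(z) = ‖z‖²/(2σ²), so that both conditionals are exact Gaussians (all numerical values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

d, sigma2, rho2, T = 5, 1.0, 0.1, 20_000
theta = np.zeros(d)
thetas = np.empty((T, d))
for t in range(T):
    # z-step: pi_rho(z | theta) ∝ exp(-U(z) - ||z - theta||^2 / (2 rho^2))
    var_z = 1.0 / (1.0 / sigma2 + 1.0 / rho2)
    z = var_z * theta / rho2 + np.sqrt(var_z) * rng.standard_normal(d)
    # theta-step: pi_rho(theta | z) = N(z, rho^2 I_d)
    theta = z + np.sqrt(rho2) * rng.standard_normal(d)
    thetas[t] = theta

# In this toy case the theta-marginal of pi_rho is N(0, (sigma^2 + rho^2) I_d).
print(thetas[1000:].var(axis=0).round(3))  # should be close to 1.1
```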


Slide 34

Slide 34 text

20 / 44 On the scalability in distributed environments¹
[Figure: hierarchical model in which the master parameter θ is connected to the local pairs (z_1, y_1), . . . , (z_b, y_b)]
→ Each latent variable z_i depends on the subset of observations y_i
¹ For a comprehensive review, see the work by Rendell et al. (2020)

Slide 35

Slide 35 text

21 / 44 Sampling efficiently the auxiliary variables {z_i}_{i∈[b]}
Rejection sampling¹
πρ(z_i|θ^(t−1), y_i) ∝ exp(−U_i(z_i) − ‖z_i − A_iθ^(t−1)‖²/(2ρ²))
If U_i is convex & smooth and ρ ≤ ρ̄ with ρ̄ = O(1/d_i): exact sampling by rejection sampling with O(1) evaluations of U_i and ∇U_i.
¹ Gilks and Wild (1992)

Slide 36

Slide 36 text

21 / 44 Sampling efficiently the auxiliary variables {z_i}_{i∈[b]}
Rejection sampling¹
πρ(z_i|θ^(t−1), y_i) ∝ exp(−U_i(z_i) − ‖z_i − A_iθ^(t−1)‖²/(2ρ²))
If U_i is convex & smooth and ρ ≤ ρ̄ with ρ̄ = O(1/d_i): exact sampling by rejection sampling with O(1) evaluations of U_i and ∇U_i.
[Figure: target πρ(z_i|θ) and Gaussian proposal qρ(z_i|θ) for ρ = 4.49, 2.46 and 1.27]
Here U_i(z_i) = y_i z_i + log(1 + e^{−z_i}) + α z_i²/2
¹ Gilks and Wild (1992)
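A hedged sketch of one standard way to build such an exact rejection step under the convexity assumption: linearising U_i at m = A_iθ gives a Gaussian envelope and an acceptance probability bounded by 1. The potential is the one written on the slide; the proposal construction, y_i, α and ρ² are illustrative assumptions and may differ from the authors' qρ:

```python
import numpy as np

rng = np.random.default_rng(3)

y_i, alpha, rho2 = 1.0, 0.1, 0.5

def U(z):   # convex potential from the slide
    return y_i * z + np.log1p(np.exp(-z)) + 0.5 * alpha * z ** 2

def dU(z):
    return y_i - 1.0 / (1.0 + np.exp(z)) + alpha * z

def sample_z(m):
    # Target f(z) ∝ exp(-U(z) - (z - m)^2 / (2 rho^2)).
    # Convexity: U(z) >= U(m) + U'(m)(z - m), hence the Gaussian envelope below.
    while True:
        z = m - rho2 * dU(m) + np.sqrt(rho2) * rng.standard_normal()  # N(m - rho^2 U'(m), rho^2)
        log_acc = -(U(z) - U(m) - dU(m) * (z - m))                    # always <= 0
        if np.log(rng.uniform()) < log_acc:
            return z

samples = np.array([sample_z(m=0.5) for _ in range(5000)])
print(samples.mean().round(3), samples.std().round(3))
```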

Slide 37

Slide 37 text

22 / 44 Sampling efficiently the master parameter θ
High-dimensional Gaussian sampling¹
πρ(θ|z_{1:b}^(t)) ∝ ∏_{i=1}^b N(z_i^(t)|A_iθ, ρ²I_{d_i})
Main difficulty: handling the high-dimensional matrix Q = Σ_{i=1}^b A_i^T A_i.
−→ Numerous efficient sampling approaches have been proposed by building on numerical linear algebra techniques.
¹ For a recent overview, see the work by Vono et al. (2020)
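A small sketch of this Gaussian θ-step for moderate d, using a dense Cholesky factorisation of Q (the operators A_i and variables z_i below are random toy inputs; in high dimension one would rely on the structured or iterative solvers the slide alludes to):

```python
import numpy as np

rng = np.random.default_rng(4)

# Draw theta ~ N(mu, rho^2 Q^{-1}) with Q = sum_i A_i' A_i and
# mu = Q^{-1} sum_i A_i' z_i.
d, rho2 = 20, 0.5
A_list = [rng.standard_normal((15, d)) for _ in range(3)]   # toy operators A_i
z_list = [rng.standard_normal(15) for _ in range(3)]        # current z_i's

Q = sum(A.T @ A for A in A_list)
b = sum(A.T @ z for A, z in zip(A_list, z_list))
L = np.linalg.cholesky(Q)                                   # Q = L L'
mu = np.linalg.solve(Q, b)
# If x ~ N(0, I_d) then mu + sqrt(rho2) * L^{-T} x ~ N(mu, rho^2 Q^{-1}).
theta = mu + np.sqrt(rho2) * np.linalg.solve(L.T, rng.standard_normal(d))
print(theta[:5].round(3))
```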

Slide 38

Slide 38 text

23 / 44 Comparison with the Unadjusted Langevin Algorithm
When SGS meets ULA
In the single splitting case (b = 1) with A_1 = I_d, SGS writes
1. Sample z^(t) ∼ πρ(z|θ^(t)) ∝ exp(−U(z) − ‖z − θ^(t)‖²/(2ρ²))
2. Sample θ^(t+1) ∼ N(z^(t), ρ²I_d).
By a Taylor expansion, when ρ → 0, we have
θ^(t+1) ∼ N(θ^(t) − ρ²∇U(θ^(t)), 2ρ²I_d)  −→  one ULA step with stepsize γ = ρ².

Slide 39

Slide 39 text

24 / 44 Comparison with the Unadjusted Langevin Algorithm
Bias comparison¹
Distance          | Upper bound                   | Main assumptions
‖πρ − π‖_TV       | (1/2) ρ² M d                  | M-smooth
‖π_ULA − π‖_TV    | √(2Md) · ρ                    | M-smooth & convex
W_1(πρ, π)        | min(ρ√d, (1/2) ρ² √(Md))      | M-smooth, m-strongly convex
W_1(π_ULA, π)     | √((2M/m) d) · ρ               | M-smooth, m-strongly convex
−→ Higher-order approximation based on the explicit form of πρ.
¹ Durmus et al. (2019)

Slide 40

Slide 40 text

25 / 44 Relation with quadratic penalty methods
MAP estimation under each full conditional of
πρ(θ, z_{1:b}|y) ∝ ∏_{i=1}^b exp(−U_i(z_i) − ‖z_i − A_iθ‖²/(2ρ²))   (2)
yields at iteration t
z_i^(t) = argmin_{z_i} { U_i(z_i) + ‖z_i − A_iθ^(t−1)‖²/(2ρ²) },  ∀i ∈ [b]
θ^(t) = argmin_θ Σ_{i=1}^b ‖z_i^(t) − A_iθ‖²/(2ρ²)
→ Steps involved in quadratic penalty methods
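A toy sketch of this optimisation analogue for a scalar problem with b = 2 and A_1 = A_2 = 1, taking U_1(z) = (z − y)²/2 and U_2(z) = τ|z| (both potentials, and all numerical values, are illustrative choices, not the talk's experiments): the z-steps are proximal operators and the θ-step is a least-squares average, and for small ρ the iterates land close to the minimiser of (θ − y)²/2 + τ|θ|.

```python
import numpy as np

y, tau, rho2 = 2.0, 0.5, 0.05
theta = 0.0
for t in range(1000):
    z1 = (rho2 * y + theta) / (rho2 + 1.0)                   # prox of U_1 at theta
    z2 = np.sign(theta) * max(abs(theta) - rho2 * tau, 0.0)  # prox of U_2 (soft-threshold)
    theta = 0.5 * (z1 + z2)                                  # least-squares theta-step

exact = np.sign(y) * max(abs(y) - tau, 0.0)                  # minimiser of (theta-y)^2/2 + tau|theta|
print(f"quadratic-penalty iterate: {theta:.3f}   exact soft-threshold: {exact:.3f}")
```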

Slide 41

Slide 41 text

26 / 44 Outline 1. Asymptotically exact data augmentation (AXDA) 2. Theoretical guarantees for Gaussian smoothing 3. Split Gibbs sampler (SGS) 4. Convergence and complexity analyses 5. Experiments

Slide 42

Slide 42 text

27 / 44 Mixing times
How many SGS iterations should I run to obtain a sample from π?
This question can be answered by giving an upper bound on the so-called ε-mixing time:
t_mix(ε; ν) = min{t ≥ 0 : D(νP_SGS^t, π) ≤ ε},
where ε > 0, ν is an initial distribution, P_SGS is the Markov kernel of SGS and D is a statistical distance.
−→ We are here interested in providing explicit bounds w.r.t. d, ε and the regularity constants associated to U.

Slide 43

Slide 43 text

28 / 44 Explicit convergence rates
Assumptions: A_i = I_d and U_i m_i-strongly convex for i ∈ [b]
W_1(νP_SGS^t, πρ) ≤ W_1(ν, πρ) · (1 − K_SGS)^t
‖νP_SGS^t − πρ‖_TV ≤ Var_{πρ}[dν/dπρ] · (1 − K_SGS)^t,
where K_SGS = (ρ²/b) Σ_{i=1}^b m_i/(1 + m_i ρ²), ν is the initial distribution and P_SGS the Markov kernel of SGS
[Figure: log ‖νP_SGS^t − πρ‖_TV (ρ = 1) and log W_1(δ_{θ0}P_SGS^t, πρ) (ρ = 0.1) versus iteration t, compared with the bounds]
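A small numerical sketch of how this rate can be used: evaluate K_SGS for given strong-convexity constants and deduce an iteration count from the geometric W_1 contraction (the values of m_i, ρ, the initial distance W0 and the tolerance eps below are all assumed for illustration):

```python
import numpy as np

m = np.array([1.0, 0.5, 2.0])          # m_i, i = 1..b (illustrative)
rho, W0, eps = 0.1, 10.0, 1e-3         # rho, initial W1 distance, target accuracy
b = len(m)
K_sgs = rho ** 2 / b * np.sum(m / (1.0 + m * rho ** 2))
# Smallest t with W0 * (1 - K_sgs)^t <= eps
t_needed = int(np.ceil(np.log(eps / W0) / np.log(1.0 - K_sgs)))
print(f"K_SGS = {K_sgs:.4e},  iterations for W1 <= {eps}: {t_needed}")
```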

Slide 44

Slide 44 text

29 / 44 Explicit guidelines for practitioners
Single splitting case: π(θ|y) ∝ e^{−U(θ)},  πρ(θ, z|y) ∝ e^{−U(z)} N(z|θ, ρ²I_d)
1. Set ε ∈ (0, 1)
2. Set ρ² = ε/(dM), with M the Lipschitz constant of ∇U
3. Start from ν(θ) = N(θ|θ⋆, M^{−1}I_d)
4. Run T iterations of SGS with T = C(ε, d)/K_SGS
Then, ‖νP_SGS^T − π‖_TV ≤ ε
[Figure: histograms of a projection a^Tθ under the exact posterior and under SGS, d = 60]

Slide 45

Slide 45 text

30 / 44 Complexity results for strongly log-concave densities
1-Wasserstein distance
Assumptions: U is M-smooth & m-strongly convex, κ = M/m; starting point: the minimizer θ⋆ of U
Method                    | Condition on ε     | Evaluations for W_1 error ε√d/√m
Unadjusted Langevin       | 0 < ε ≤ 1          | O*(κ²/ε²)
Underdamped Langevin      | 0 < ε ≤ 1          | O*(κ²/ε)
Underdamped Langevin      | 0 < ε ≤ 1/√κ       | O*(κ^{3/2}/ε)
Hamiltonian dynamics      | 0 < ε ≤ 1          | O*(κ^{3/2}/ε)
SGS with single splitting | 0 < ε ≤ 1/(d√κ)    | O*(κ^{1/2}/ε)
→ For small ε, the rate of SGS improves upon previous works.

Slide 46

Slide 46 text

31 / 44 Complexity results for strongly log-concave densities
Total variation distance
Assumptions: U is M-smooth & m-strongly convex, κ = M/m; starting point θ^(0) ∼ N(θ; θ⋆, I_d/M)
Method | Validity   | Evaluations
ULA    | 0 < ε ≤ 1  | O*(κ²d³/ε²)
MALA   | 0 < ε ≤ 1  | O(κ²d² log^{1.5}(κ/ε^{1/d}))
SGS    | 0 < ε ≤ 1  | O*(κd²/ε)
[Figure: empirical ε-mixing time of SGS in total variation versus d (theoretical slope 2, fitted slope 1.04 ± 0.02) and versus log(2/ε)/ε (theoretical slope 1, fitted slope 1.06 ± 0.04)]

Slide 47

Slide 47 text

32 / 44 Complexity results for log-concave densities
Total variation distance
Idea: replace each potential U_i by a strongly convex approximation:
Ũ(θ) = Σ_{i=1}^b U_i(A_iθ) + (λ/2)‖θ − θ⋆‖²
By using results from Dalalyan (2017), the triangle inequality and our previous bounds, we have:
Method | Validity   | Evaluations
ULA    | 0 < ε ≤ 1  | O*(M²d³/ε⁴)
MRW    | 0 < ε ≤ 1  | O*(M²d³/ε²)
SGS    | 0 < ε ≤ 1  | O*(Md³/ε²)
→ Our complexity results improve upon those of ULA and MRW.

Slide 48

Slide 48 text

33 / 44 Outline 1. Asymptotically exact data augmentation (AXDA) 2. Theoretical guarantees for Gaussian smoothing 3. Split Gibbs sampler (SGS) 4. Convergence and complexity analyses 5. Experiments

Slide 49

Slide 49 text

34 / 44 Unsupervised image deconvolution with a smooth prior
Problem statement
y = Bθ + ε, ε ∼ N(0_d, Ω^{−1})
π(θ) = N(0_d, (γL^T L)^{−1})
B: circulant matrix w.r.t. a blurring kernel
Ω^{−1} = diag(σ²_1, . . . , σ²_d),  σ_i ∼ (1 − β)δ_{κ1} + β δ_{κ2},  where κ1 ≪ κ2
L: circulant matrix w.r.t. a Laplacian filter
Posterior π(θ|y): Gaussian with Q = B^T Ω B + γL^T L
challenging: Q cannot be diagonalized easily
high-dimensional: d ∼ 10⁵
[Figure: observed and original images]

Slide 50

Slide 50 text

35 / 44 Unsupervised image deconvolution with a smooth prior
AXDA approximation and SGS
π(θ|y, σ, κ1, κ2, β, γ) ∝ exp(−(1/2)[(Bθ − y)^T Ω (Bθ − y) + γ‖Lθ‖²])
Approximate joint posterior:
πρ(θ, z|y, σ, κ1, κ2, β, γ) ∝ exp(−(1/2)[(z − y)^T Ω (z − y) + γ‖Lθ‖² + (1/ρ²)‖z − Bθ‖²]).

Slide 51

Slide 51 text

35 / 44 Unsupervised image deconvolution with a smooth prior
AXDA approximation and SGS
π(θ|y, σ, κ1, κ2, β, γ) ∝ exp(−(1/2)[(Bθ − y)^T Ω (Bθ − y) + γ‖Lθ‖²])
Approximate joint posterior:
πρ(θ, z|y, σ, κ1, κ2, β, γ) ∝ exp(−(1/2)[(z − y)^T Ω (z − y) + γ‖Lθ‖² + (1/ρ²)‖z − Bθ‖²]).
Sampling θ: πρ(θ|z, γ) is a Gaussian distribution with precision Q_θ = (1/ρ²)B^T B + γL^T L
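Since B and L are circulant, Q_θ is diagonalised by the Fourier transform, so this θ-step can be drawn exactly with a few FFTs. A hedged 1D sketch (the moving-average blur and 1D Laplacian kernels, and all values, are illustrative stand-ins for the talk's 2D operators):

```python
import numpy as np

rng = np.random.default_rng(6)

d, rho2, gamma = 256, 0.5, 10.0
b_col = np.zeros(d); b_col[:3] = 1.0 / 3.0                   # first column of circulant B
l_col = np.zeros(d); l_col[[0, 1, -1]] = [2.0, -1.0, -1.0]   # first column of circulant L
z = rng.standard_normal(d)                                   # current auxiliary variable

b_hat, l_hat = np.fft.fft(b_col), np.fft.fft(l_col)
lam = np.abs(b_hat) ** 2 / rho2 + gamma * np.abs(l_hat) ** 2  # eigenvalues of Q_theta
rhs_hat = np.conj(b_hat) * np.fft.fft(z) / rho2               # FFT of B' z / rho^2
mean = np.fft.ifft(rhs_hat / lam).real                        # Q_theta^{-1} B' z / rho^2
noise = np.fft.ifft(np.fft.fft(rng.standard_normal(d)) / np.sqrt(lam)).real
theta = mean + noise                                          # one draw from N(mean, Q_theta^{-1})
print(theta[:5].round(3))
```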

Slide 52

Slide 52 text

35 / 44 Unsupervised image deconvolution with a smooth prior
AXDA approximation and SGS
π(θ|y, σ, κ1, κ2, β, γ) ∝ exp(−(1/2)[(Bθ − y)^T Ω (Bθ − y) + γ‖Lθ‖²])
Approximate joint posterior:
πρ(θ, z|y, σ, κ1, κ2, β, γ) ∝ exp(−(1/2)[(z − y)^T Ω (z − y) + γ‖Lθ‖² + (1/ρ²)‖z − Bθ‖²]).
Sampling θ: πρ(θ|z, γ) is a Gaussian distribution with precision Q_θ = (1/ρ²)B^T B + γL^T L
Sampling {z_i}_{i∈[d]}: πρ(z|θ, y, σ, κ1, κ2, β) is a Gaussian distribution with precision Q_z = (1/ρ²)I_d + Ω

Slide 53

Slide 53 text

36 / 44 Unsupervised image deconvolution with a smooth prior
Restoration results with the MMSE estimator
[Figure: original image, observed image, MMSE estimate under π, and MMSE estimate under πρ (relative error = 0.004)]
method | SNR (dB) | PSNR (dB)
SGS    | 17.57    | 22.92
AuxV1  | 17.57    | 22.92
AuxV2  | 17.58    | 22.93
RJPO   | 17.57    | 22.91

Slide 54

Slide 54 text

37 / 44 Unsupervised image deconvolution with a smooth prior
Convergence diagnostics of the MCMC samplers
[Figure: traces of γ^(t), β^(t), κ1^(t) and κ2^(t) over 1000 iterations for SGS, AuxV1, AuxV2 and RJPO, compared with the true values]

Slide 55

Slide 55 text

38 / 44 Image inpainting with non-smooth prior
Problem statement and challenges
y = Hθ + ε, ε ∼ N(0_d, σ²I_d)
π(θ): TV prior
H: decimation matrix associated to a binary mask
TV(θ) = τ Σ_{1≤i≤d} ‖D_iθ‖
D_iθ: 2D discrete gradient applied at pixel i of θ
Posterior π(θ|y):
not standard: no conjugacy
not smooth: TV prior
high-dimensional: d ∼ 10⁵

Slide 56

Slide 56 text

39 / 44 Image inpainting with non-smooth prior
Possible surrogate: MYULA¹
π(θ|y) ∝ N(y|Hθ, σ²I_n) ∏_{i=1}^d e^{−τ‖D_iθ‖}   (likelihood × prior π(θ))
Approximate posterior:
π_λ(θ|y) ∝ N(y|Hθ, σ²I_n) e^{−M_λ[TV](θ)}
¹ Durmus et al. (2018)

Slide 57

Slide 57 text

39 / 44 Image inpainting with non-smooth prior
Possible surrogate: MYULA¹
π(θ|y) ∝ N(y|Hθ, σ²I_n) ∏_{i=1}^d e^{−τ‖D_iθ‖}   (likelihood × prior π(θ))
Approximate posterior:
π_λ(θ|y) ∝ N(y|Hθ, σ²I_n) e^{−M_λ[TV](θ)}
MYULA:
θ^(t+1) ∼ N((1 − γ/λ)θ^(t) + γ∇ log π(y|θ^(t)) + (γ/λ) prox_{TV}^λ(θ^(t)), 2γI_d)
¹ Durmus et al. (2018)

Slide 58

Slide 58 text

39 / 44 Image inpainting with non-smooth prior
Possible surrogate: MYULA¹
π(θ|y) ∝ N(y|Hθ, σ²I_n) ∏_{i=1}^d e^{−τ‖D_iθ‖}   (likelihood × prior π(θ))
Approximate posterior:
π_λ(θ|y) ∝ N(y|Hθ, σ²I_n) e^{−M_λ[TV](θ)}
MYULA:
θ^(t+1) ∼ N((1 − γ/λ)θ^(t) + γ∇ log π(y|θ^(t)) + (γ/λ) prox_{TV}^λ(θ^(t)), 2γI_d)
No closed form for prox_TV:
−→ additional approximation
−→ computational burden
¹ Durmus et al. (2018)

Slide 59

Slide 59 text

40 / 44 Image inpainting with non-smooth prior
AXDA approximation and SGS
π(θ|y) ∝ N(y|Hθ, σ²I_n) ∏_{i=1}^d e^{−τ‖D_iθ‖}   (likelihood × prior π(θ))
Approximate joint posterior:
πρ(θ, z_{1:d}|y) ∝ N(y|Hθ, σ²I_n) ∏_{i=1}^d e^{−τ‖z_i‖} N(z_i|D_iθ, ρ²I_2)

Slide 60

Slide 60 text

40 / 44 Image inpainting with non-smooth prior
AXDA approximation and SGS
π(θ|y) ∝ N(y|Hθ, σ²I_n) ∏_{i=1}^d e^{−τ‖D_iθ‖}   (likelihood × prior π(θ))
Approximate joint posterior:
πρ(θ, z_{1:d}|y) ∝ N(y|Hθ, σ²I_n) ∏_{i=1}^d e^{−τ‖z_i‖} N(z_i|D_iθ, ρ²I_2)
Sampling θ: πρ(θ|z_{1:d}, y) ∝ N(y|Hθ, σ²I_n) ∏_{i=1}^d N(z_i|D_iθ, ρ²I_2)

Slide 61

Slide 61 text

40 / 44 Image inpainting with non-smooth prior
AXDA approximation and SGS
π(θ|y) ∝ N(y|Hθ, σ²I_n) ∏_{i=1}^d e^{−τ‖D_iθ‖}   (likelihood × prior π(θ))
Approximate joint posterior:
πρ(θ, z_{1:d}|y) ∝ N(y|Hθ, σ²I_n) ∏_{i=1}^d e^{−τ‖z_i‖} N(z_i|D_iθ, ρ²I_2)
Sampling θ: πρ(θ|z_{1:d}, y) ∝ N(y|Hθ, σ²I_n) ∏_{i=1}^d N(z_i|D_iθ, ρ²I_2)
Sampling {z_i}_{i∈[d]}: πρ(z_i|θ) ∝ e^{−τ‖z_i‖} N(z_i|D_iθ, ρ²I_2)
−→ DA applied to e^{−τ‖z_i‖}: Gaussian and Inverse-Gaussian conditionals

Slide 62

Slide 62 text

41 / 44 Image inpainting with non-smooth prior
Highest posterior density regions
C_α = {θ ∈ R^d | π(θ|y) ≥ e^{−γ_α}},   ∫_{C_α} π(θ|y) dθ = 1 − α
[Figure: relative error |γ̃_α − γ_α|/γ_α, of order 10⁻³, as a function of α, for SGS and MYULA]
−→ Relative error of the order 0.1%
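A hedged sketch of one common way such a threshold γ_α can be estimated from MCMC output: take the empirical (1 − α)-quantile of U(θ^(n)) = −log π(θ^(n)|y) over the chain. The check below uses a toy N(0, I_d) posterior (so the exact threshold follows from a chi-squared quantile); the dimension, α and the stand-in samples are assumptions:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(5)

d, alpha, N = 10, 0.1, 100_000
theta = rng.standard_normal((N, d))          # stand-in for MCMC samples
U = 0.5 * np.sum(theta ** 2, axis=1)         # -log pi up to an additive constant
gamma_hat = np.quantile(U, 1.0 - alpha)      # estimated HPD threshold
gamma_exact = 0.5 * chi2.ppf(1.0 - alpha, df=d)
print(f"estimated: {gamma_hat:.3f}   exact: {gamma_exact:.3f}")
```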

Slide 63

Slide 63 text

42 / 44 Image inpainting with non-smooth prior
Sampling efficiency
U(θ) = − log π(θ|y)
[Figure: U(θ^(t)) along iterations for SGS and MYULA, on full and zoomed scales]
−→ Faster convergence towards high-probability regions

Slide 64

Slide 64 text

42 / 44 Image inpainting with non-smooth prior
Sampling efficiency
ESSR: effective sample size ratio; T1: CPU time in seconds to draw one sample
method | ESSR | T1 [seconds]
SGS    | 0.22 | 3.83
MYULA  | 0.15 | 4.22
−→ ESSR improved by 47%

Slide 65

Slide 65 text

43 / 44 Conclusion
Broad and simple approximate statistical framework
• Multiple convolutions with smoothing kernels
• Related to Approximate Bayesian Computation
• Rich interpretation from both simulation and optimisation
• Theoretical guarantees
Efficient sampling based on AXDA
• Parallel and simple Gibbs sampling
• Explicit convergence rates and mixing times
• State-of-the-art performance on image processing problems
Perspectives
• Convergence and complexity analyses for Metropolis-within-SGS
• Distributed and asynchronous sampling
• Applications to challenging machine and deep learning problems
• Comparison with stochastic gradient MCMC

Slide 66

Slide 66 text

44 / 44 References
Theory and methods
M. Vono, N. Dobigeon and P. Chainais (2019), “Split-and-augmented Gibbs sampler - Application to large-scale inference problems,” IEEE Transactions on Signal Processing, vol. 67, no. 6, pp. 1648-1661.
M. Vono, N. Dobigeon and P. Chainais (2020), “Asymptotically exact data augmentation: models, properties and algorithms,” Journal of Computational and Graphical Statistics (in press).
M. Vono, D. Paulin and A. Doucet (2019), “Efficient MCMC sampling with dimension-free convergence rate,” arXiv:1905.11937.
M. Vono, N. Dobigeon and P. Chainais (2020), “High-dimensional Gaussian sampling: A review and a unifying approach based on a stochastic proximal point algorithm,” arXiv:2010.01510.
L. J. Rendell et al. (2020), “Global consensus Monte Carlo,” Journal of Computational and Graphical Statistics (in press).
Applications
M. Vono et al. (2019), “Bayesian image restoration under Poisson noise and log-concave prior,” in Proc. ICASSP.
M. Vono et al. (2018), “Sparse Bayesian binary logistic regression using the split-and-augmented Gibbs sampler,” in Proc. MLSP.
Code: https://github.com/mvono

Slide 67

Slide 67 text

44 / 44 Thank you for your attention! Questions? Joint work with Pierre Chainais, Nicolas Dobigeon, Arnaud Doucet & Daniel Paulin

Slide 68

Slide 68 text

45 / 44 Andrieu, C., de Freitas, N., Doucet, A., and Jordan, M. I. (2003), “An Introduction to MCMC for Machine Learning,” Machine Learning, 50, 5–43. Brosse, N., Durmus, A., Moulines, E., and Pereyra, M. (2017), “Sampling from a log-concave distribution with compact support with proximal Langevin Monte Carlo,” in Conference on Learning Theory, vol. 65, pp. 319–342. Bubeck, S., Eldan, R., and Lehec, J. (2015), “Finite-Time Analysis of Projected Langevin Monte Carlo,” in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, p. 1243–1251. Celeux, G., Chretien, S., Forbes, F., and Mkhadri, A. (2001), “A Component-Wise EM Algorithm for Mixtures,” Journal of Computational and Graphical Statistics, 10, 697–712. Chambolle, A., Novaga, M., Cremers, D., and Pock, T. (2010), “An introduction to total variation for image analysis,” in Theoretical Foundations and Numerical Methods for Sparse Recovery, De Gruyter. Dalalyan, A. and Tsybakov, A. (2012), “Sparse regression learning by aggregation and Langevin Monte-Carlo,” Journal of Computer and System Sciences, 78, 1423–1443. Dalalyan, A. S. (2017), “Theoretical guarantees for approximate sampling from smooth and log-concave densities,” Journal of the Royal Statistical Society, Series B, 79, 651–676. Dang, K.-D., Quiroz, M., Kohn, R., Tran, M.-N., and Villani, M. (2019), “Hamiltonian Monte Carlo with Energy Conserving Subsampling,” Journal of Machine Learning Research, 20, 1–31. Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), “Maximum Likelihood from Incomplete Data via the EM Algorithm,” Journal of the Royal Statistical Society, Series B, 39, 1–38. Duane, S., Kennedy, A., Pendleton, B. J., and Roweth, D. (1987), “Hybrid Monte Carlo,” Physics Letters B, 195, 216–222. Durmus, A., Majewski, S., and Miasojedow, B. (2019), “Analysis of Langevin Monte Carlo via convex optimization,” Journal of Machine Learning Research, 20, 1–46, available at http://www.jmlr.org/papers/volume20/18-173/18-173.pdf. Durmus, A., Moulines, E., and Pereyra, M. (2018), “Efficient Bayesian Computation by Proximal Markov chain Monte Carlo: When Langevin Meets Moreau,” SIAM Journal on Imaging Sciences, 11, 473–506. Edwards, R. G. and Sokal, A. D. (1988), “Generalization of the Fortuin-Kasteleyn-Swendsen-Wang representation and Monte Carlo algorithm,” Physical Review D, 38, 2009–2012.

Slide 69

Slide 69 text

46 / 44 Gelfand, A. E., Smith, A. F. M., and Lee, T.-M. (1992), “Bayesian Analysis of Constrained Parameter and Truncated Data Problems Using Gibbs Sampling,” Journal of the American Statistical Association, 87, 523–532. Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2003), Bayesian Data Analysis, Chapman and Hall/CRC, 2nd ed. Gilks, W. R. and Wild, P. (1992), “Adaptive Rejection Sampling for Gibbs Sampling,” Journal of the Royal Statistical Society. Series C (Applied Statistics), 41, 337–348. Hanson, T. E., Branscum, A. J., and Johnson, W. O. (2014), “Informative g -Priors for Logistic Regression,” Bayesian Analysis, 9, 597–612. Hartley, H. O. (1958), “Maximum Likelihood Estimation from Incomplete Data,” Biometrics, 14, 174–194. Hsieh, Y.-P., Kavis, A., Rolland, P., and Cevher, V. (2018), “Mirrored Langevin Dynamics,” in Advances in Neural Information Processing Systems, vol. 31, pp. 2878–2887. Idier, J. (ed.) (2008), Bayesian Approach to Inverse Problems, Wiley. Luu, T., Fadili, J., and Chesneau, C. (2020), “Sampling from non-smooth distribution through Langevin diffusion,” Methodology and Computing in Applied Probability (in press). Maddison, C. J., Paulin, D., Teh, Y. W., O’Donoghue, B. r., and Doucet, A. (2018), “Hamiltonian Descent Methods,” [online]. Technical report. Available at https://arxiv.org/abs/1809.05042/. Marin, J.-M., Pudlo, P., Robert, C. P., and Ryder, R. J. (2012), “Approximate Bayesian computational methods,” Statistics and Computing, 22, 1167–1180. Marnissi, Y., Chouzenoux, E., Benazza-Benyahia, A., and Pesquet, J.-C. (2018), “An Auxiliary Variable Method for Markov Chain Monte Carlo Algorithms in High Dimension,” Entropy, 20. Neal, R. M. (2003), “Slice sampling,” Ann. Statist., 31, 705–767. — (2011), “MCMC Using Hamiltonian Dynamics,” Handbook of Markov Chain Monte Carlo, 54, 113–162. Park, T. and Casella, G. (2008), “The Bayesian Lasso,” Journal of the American Statistical Association, 103, 681–686. Pereyra, M. (2016), “Proximal Markov chain Monte Carlo algorithms,” Statistics and Computing, 26, 745–760. Rendell, L. J., Johansen, A. M., Lee, A., and Whiteley, N. (2020), “Global consensus Monte Carlo,” Journal of Computational and Graphical Statistics, to appear.

Slide 70

Slide 70 text

47 / 44 Robert, C. P. (2001), The Bayesian Choice: from decision-theoretic foundations to computational implementation, New York: Springer, 2nd ed. Robert, C. P. and Casella, G. (2004), Monte Carlo Statistical Methods, Berlin: Springer, 2nd ed. Roberts, G. O. and Tweedie, R. L. (1996), “Exponential convergence of Langevin distributions and their discrete approximations,” Bernoulli, 2, 341–363. Sabanes Bove, D. and Held, L. (2011), “Hyper-g priors for generalized linear models,” Bayesian Analysis, 6, 387–410. van Dyk, D. A. and Meng, X.-L. (2001), “The Art of Data Augmentation,” Journal of Computational and Graphical Statistics, 10, 1–50. Vono, M., Dobigeon, N., and Chainais, P. (2020), “High-dimensional Gaussian sampling: A review and a unifying approach based on a stochastic proximal point algorithm,” SIAM Review, in 1st round of review. Wilkinson, R. (2013), “Approximate Bayesian Computation (ABC) gives exact results under the assumption of model error,” Statistical applications in genetics and molecular biology, 12, 1–13.