Slide 1

Slide 1 text

Non-asymptotic convergence bound for the Unadjusted Langevin Algorithm
Alain Durmus, Eric Moulines, Marcelo Pereyra
Telecom ParisTech, Ecole Polytechnique, Bristol University
November 23, 2016, LS3 seminar

Slide 2

Slide 2 text

Outline
1. Motivation
2. Framework
3. Sampling from strongly log-concave distribution
4. Computable bounds in total variation for super-exponential densities
5. Deviation inequalities
6. Non-smooth potentials

Slide 3

Slide 3 text

Introduction
Sampling distributions over high-dimensional state spaces has recently attracted a lot of research effort in the computational statistics and machine learning communities...
Applications (non-exhaustive):
1. Bayesian inference for high-dimensional models and Bayesian nonparametrics
2. Bayesian linear inverse problems (typically function-space problems)
3. Aggregation of estimators and of experts
Most of the sampling techniques known so far do not scale to high dimension... Challenges are numerous in this area...

Slide 4

Slide 4 text

Bayesian setting (I)
In a Bayesian setting, a parameter β ∈ R^d is endowed with a prior distribution p and the observations are given by a likelihood: Y ∼ L(·|β).
The inference is then based on the posterior distribution:
π(dβ|Y) = p(dβ) L(Y|β) / ∫ L(Y|u) p(du) .
In most cases the normalizing constant is not tractable:
π(dβ|Y) ∝ p(dβ) L(Y|β) .

Slide 5

Slide 5 text

Bayesian setting (II)
Bayesian decision theory relies on computing expectations of the form
∫_{R^d} f(β) L(Y|β) p(dβ) .
Generic problem: estimation of an expectation E_π[f], where
- π is known only up to a multiplicative factor;
- sampling directly from π is not an option;
- π is high dimensional.

Slide 6

Slide 6 text

Logistic and probit regression
Likelihood: binary regression set-up in which the binary observations (responses) (Y_1, ..., Y_n) are conditionally independent Bernoulli random variables with success probability F(β^T X_i), where
1. X_i is a d-dimensional vector of known covariates,
2. β is a d-dimensional vector of unknown regression coefficients,
3. F is a distribution function.
Two important special cases:
1. probit regression: F is the standard normal distribution function,
2. logistic regression: F is the standard logistic distribution function, F(t) = e^t / (1 + e^t).

Slide 7

Slide 7 text

A daunting problem?
The posterior density of β is given, up to a proportionality constant, by
π(β|(Y, X)) ∝ exp(−U(β)) ,
where the potential U(β) is given by
U(β) = − Σ_{i=1}^{p} { Y_i log F(β^T X_i) + (1 − Y_i) log(1 − F(β^T X_i)) } + g(β) ,
where g is the negative log density of the prior distribution. Two important cases:
- Gaussian prior g(β) = (1/2) β^T Σ β : ridge regression.
- Laplace prior g(β) = λ Σ_{i=1}^{d} |β_i| : lasso regression.
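To make the set-up concrete, here is a minimal sketch (not from the slides) of the potential U and its gradient for logistic regression with a Gaussian ridge prior; the names `X`, `Y`, `Sigma` and the helper functions are illustrative assumptions, with `X` a p × d design matrix and `Sigma` the prior precision matrix.

```python
import numpy as np

def sigmoid(t):
    # Logistic distribution function F(t) = e^t / (1 + e^t).
    return 1.0 / (1.0 + np.exp(-t))

def potential_U(beta, X, Y, Sigma):
    """Negative log-posterior (up to an additive constant) for logistic
    regression with Gaussian prior g(beta) = 0.5 * beta^T Sigma beta."""
    logits = X @ beta
    # sum_i { Y_i * x_i^T beta - log(1 + exp(x_i^T beta)) }
    log_lik = np.sum(Y * logits - np.log1p(np.exp(logits)))
    prior = 0.5 * beta @ Sigma @ beta
    return -log_lik + prior

def grad_U(beta, X, Y, Sigma):
    """Gradient of the potential, usable by the Langevin-based samplers below."""
    return X.T @ (sigmoid(X @ beta) - Y) + Sigma @ beta
```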

Slide 8

Slide 8 text

New challenges
Problem: the number of predictor variables d is large (10^4 and up). Examples: text categorization, genomics and proteomics (gene expression analysis)...
The most popular algorithms for Bayesian inference in binary regression models are based on data augmentation:
1. probit link: Albert and Chib (1993);
2. logistic link: Polya-Gamma sampler, Polson and Scott (2012)...

Slide 9

Slide 9 text

Data Augmentation algorithms (I)
Data augmentation: instead of sampling π(β|(X, Y)), sample π(β, W|(X, Y)), a probability measure on R^{d_1} × R^{d_2}, and take the marginal w.r.t. β.
Typical application of the Gibbs sampler: sample in turn π(β|(X, Y, W)) and π(W|(X, Y, β)). The Gibbs sampler consists in sampling a Markov chain (β_k, W_k)_{k≥0} defined as follows (see the sketch below):
1. Given (β_k, W_k),
2. Draw W_{k+1} ∼ π(·|(β_k, X, Y)), then β_{k+1} ∼ π(·|(W_{k+1}, X, Y)).
The target density π(β, W|(X, Y)) is invariant for the Markov chain (β_k, W_k)_{k≥0}. The choice of the DA should make these two steps reasonably easy...
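A generic sketch of the two-block data-augmentation Gibbs loop above; `sample_W_given_beta` and `sample_beta_given_W` are hypothetical placeholders for the model-specific conditional samplers (truncated Gaussians for Albert-Chib, Polya-Gamma draws for the logistic link).

```python
def da_gibbs(beta0, sample_W_given_beta, sample_beta_given_W, n_iter):
    """Two-block Gibbs sampler for the augmented target pi(beta, W | X, Y).
    The conditional samplers are assumed to encapsulate the data (X, Y)."""
    beta = beta0
    chain = []
    for _ in range(n_iter):
        W = sample_W_given_beta(beta)      # W_{k+1} ~ pi(. | beta_k, X, Y)
        beta = sample_beta_given_W(W)      # beta_{k+1} ~ pi(. | W_{k+1}, X, Y)
        chain.append(beta)
    return chain                           # marginal chain in beta
```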

Slide 10

Slide 10 text

Data Augmentation algorithms (II)
Question: control the distance between the law of (β_n, W_n) and the stationary distribution π(β, W|(X, Y))?
Definition (Geometric ergodicity). We say that the Markov kernel P on (R^d, B(R^d)) is geometrically ergodic if there exists κ ∈ (0, 1) such that for all n ≥ 0 and x ∈ R^d,
‖P^n(x, ·) − π‖_TV ≤ C(x) κ^n ,
where for µ, ν two probability measures on R^d,
‖µ − ν‖_TV = sup_{|f|≤1} |µ(f) − ν(f)| .

Slide 11

Slide 11 text

Data Augmentation algorithms (III)
The algorithm of Albert and Chib and the Polya-Gamma sampler have been shown to be uniformly geometrically ergodic, BUT
- the geometric rate of convergence deteriorates exponentially with the dimension;
- this does not allow one to construct honest confidence intervals or credible regions.
The algorithms are very demanding in terms of computational resources...
- applicable only when d is small (≈ 10) to moderate (≈ 100), but certainly not when d is large (10^4 or more);
- the convergence time becomes prohibitive as soon as d ≥ 10^2.

Slide 12

Slide 12 text

A daunting problem?
In the case of ridge regression, the potential U is smooth and strongly convex. In the case of lasso regression, the potential U is non-smooth but still convex...
A wealth of reasonably fast optimisation algorithms is available to solve this problem in high dimension...

Slide 13

Slide 13 text

Outline
1. Motivation
2. Framework
3. Sampling from strongly log-concave distribution
4. Computable bounds in total variation for super-exponential densities
5. Deviation inequalities
6. Non-smooth potentials

Slide 14

Slide 14 text

Framework
Denote by π a target density w.r.t. the Lebesgue measure on R^d, known up to a normalisation factor:
x ↦ e^{−U(x)} / ∫_{R^d} e^{−U(y)} dy .
Implicitly, d ≫ 1.
Assumption: U is L-smooth: twice continuously differentiable and there exists a constant L such that for all x, y ∈ R^d,
‖∇U(x) − ∇U(y)‖ ≤ L ‖x − y‖ .
This condition can be weakened.

Slide 15

Slide 15 text

Langevin diffusion
Langevin SDE:
dY_t = −∇U(Y_t) dt + √2 dB_t ,
where (B_t)_{t≥0} is a d-dimensional Brownian motion. (P_t)_{t≥0} is a Markov semigroup:
- aperiodic, strong Feller (all compact sets are small);
- reversible w.r.t. π (admits π as its unique invariant distribution).
π ∝ e^{−U} is reversible, hence the unique invariant probability measure. For all x ∈ R^d,
lim_{t→+∞} ‖δ_x P_t − π‖_TV = 0 .

Slide 16

Slide 16 text

Discretized Langevin diffusion
Idea: sample the diffusion paths, using for example the Euler-Maruyama (EM) scheme:
X_{k+1} = X_k − γ_{k+1} ∇U(X_k) + √(2 γ_{k+1}) Z_{k+1} ,
where
- (Z_k)_{k≥1} is i.i.d. N(0, I_d);
- (γ_k)_{k≥1} is a sequence of stepsizes, which can either be held constant or be chosen to decrease to 0 at a certain rate.
Closely related to the gradient descent algorithm.
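A minimal sketch of the EM recursion (the Unadjusted Langevin Algorithm) with a constant stepsize; `grad_U` is assumed to evaluate ∇U, for instance the logistic-regression gradient sketched earlier.

```python
import numpy as np

def ula(x0, grad_U, gamma, n_iter, rng=None):
    """Unadjusted Langevin Algorithm with constant stepsize gamma:
    X_{k+1} = X_k - gamma * grad U(X_k) + sqrt(2 * gamma) * Z_{k+1}."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_iter, x.size))
    for k in range(n_iter):
        x = x - gamma * grad_U(x) + np.sqrt(2.0 * gamma) * rng.standard_normal(x.size)
        samples[k] = x
    return samples
```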

Slide 17

Slide 17 text

Discretized Langevin diffusion: constant stepsize
When γ_k = γ, (X_k)_{k≥1} is a homogeneous Markov chain with Markov kernel R_γ.
Under some appropriate conditions, this Markov chain is irreducible and positive recurrent, hence admits a unique invariant distribution π_γ.
Problem: π_γ ≠ π.

Slide 18

Slide 18 text

The Euler-Maruyama (EM) Markov chain
When (γ_k)_{k≥1} is nonincreasing and non-constant, (X_k)_{k≥1} is an inhomogeneous Markov chain associated with the sequence of Markov kernels (R_{γ_k})_{k≥1}. We denote by δ_x Q^p_γ the law of X_p started at x.
The diffusion converges to the target distribution... Question: since the EM discretization approximates the diffusion, can we use it to sample from π?
- Control ‖δ_x Q^p_γ − π‖_TV?
- Obtain bounds with explicit dependence on the dimension d.
Previous works: Lamberton and Pagès, 2002; Dalalyan, 2014.

Slide 19

Slide 19 text

Metropolis-Adjusted Langevin Algorithm
To correct the target distribution, a Metropolis-Hastings step can be included, yielding the Metropolis-Adjusted Langevin Algorithm (MALA).
- Key reference: Roberts and Tweedie, 1996.
Algorithm:
1. Propose Y_{k+1} = X_k − γ ∇U(X_k) + √(2γ) Z_{k+1} , Z_{k+1} ∼ N(0, I_d).
2. Compute the acceptance ratio α_γ(X_k, Y_{k+1}), where
α_γ(x, y) = 1 ∧ [π(y) r_γ(y, x)] / [π(x) r_γ(x, y)] , r_γ(x, y) ∝ e^{−‖y − x − γ∇U(x)‖² / (4γ)} .
3. Accept/reject the proposal.
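A sketch of one way to implement the MALA loop; `U` and `grad_U` are assumed to evaluate the potential and its gradient, and the acceptance ratio is computed on the log scale for numerical stability (an implementation choice, not stated on the slide).

```python
import numpy as np

def mala(x0, U, grad_U, gamma, n_iter, rng=None):
    """Metropolis-Adjusted Langevin Algorithm with constant stepsize gamma."""
    rng = np.random.default_rng() if rng is None else rng

    def log_r(x, y):
        # log r_gamma(x, y): Gaussian proposal density of y given x, up to a constant
        d = y - x + gamma * grad_U(x)
        return -np.dot(d, d) / (4.0 * gamma)

    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_iter, x.size))
    for k in range(n_iter):
        y = x - gamma * grad_U(x) + np.sqrt(2.0 * gamma) * rng.standard_normal(x.size)
        # log alpha = log pi(y) + log r(y, x) - log pi(x) - log r(x, y)
        log_alpha = (-U(y) + log_r(y, x)) - (-U(x) + log_r(x, y))
        if np.log(rng.uniform()) < log_alpha:
            x = y                      # accept the proposal
        samples[k] = x
    return samples
```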

Slide 20

Slide 20 text

MALA: pros and cons
Requires one gradient computation and one evaluation of the objective function at each iteration.
Geometric convergence is established under the condition that, in the tail, the acceptance region is inwards in q:
lim_{‖x‖→∞} ∫_{A_γ(x) Δ I(x)} r_γ(x, y) dy = 0 ,
where I(x) = {y : ‖y‖ ≤ ‖x‖} and A_γ(x) is the acceptance region
A_γ(x) = {y : π(x) r_γ(x, y) ≤ π(y) r_γ(y, x)} .

Slide 21

Slide 21 text

Outline
1. Motivation
2. Framework
3. Sampling from strongly log-concave distribution
   - Convergence in Wasserstein distance
   - Convergence in Wasserstein distance
   - Numerical experiments
4. Computable bounds in total variation for super-exponential densities
5. Deviation inequalities
6. Non-smooth potentials

Slide 22

Slide 22 text

Strongly convex potential
Assume that U is strongly convex: there exists m > 0 such that for all x, y ∈ R^d,
⟨∇U(x) − ∇U(y), x − y⟩ ≥ m ‖x − y‖² .
Consequence: convergence in Wasserstein distance of the semigroup!

Slide 23

Slide 23 text

Wasserstein distance
Definition. For µ, ν two probability measures on R^d, define
W_2(µ, ν) = inf_{(X,Y) ∈ Π(µ,ν)} E^{1/2}[‖X − Y‖²] ,
where (X, Y) ∈ Π(µ, ν) if X ∼ µ and Y ∼ ν.
Note that by the Cauchy-Schwarz inequality, for all f : R^d → R with ‖f‖_Lip ≤ 1 and all (X, Y) ∈ Π(µ, ν),
|µ(f) − ν(f)| ≤ E^{1/2}[‖X − Y‖²] , hence |µ(f) − ν(f)| ≤ W_2(µ, ν) .

Slide 24

Slide 24 text

Wasserstein distance convergence
Theorem. Assume that U is L-smooth and m-strongly convex. Then, for all x, y ∈ R^d and t ≥ 0,
W_2(δ_x P_t, δ_y P_t) ≤ e^{−mt} ‖x − y‖ .
The mixing rate depends only on the strong convexity constant.

Slide 25

Slide 25 text

Elements of proof
dY_t = −∇U(Y_t) dt + √2 dB_t ,  dỸ_t = −∇U(Ỹ_t) dt + √2 dB_t ,
where (Y_0, Ỹ_0) = (x, y). This SDE has a unique strong solution (Y_t, Ỹ_t)_{t≥0}.

Slide 26

Slide 26 text

Elements of proof
dY_t = −∇U(Y_t) dt + √2 dB_t ,  dỸ_t = −∇U(Ỹ_t) dt + √2 dB_t ,
where (Y_0, Ỹ_0) = (x, y). This SDE has a unique strong solution (Y_t, Ỹ_t)_{t≥0}.
Consider V_t = ‖Y_t − Ỹ_t‖². Since d{Y_t − Ỹ_t} = −(∇U(Y_t) − ∇U(Ỹ_t)) dt, we get a very simple SDE for (V_t, t ≥ 0):
dV_t = −2 ⟨∇U(Y_t) − ∇U(Ỹ_t), Y_t − Ỹ_t⟩ dt .

Slide 27

Slide 27 text

Elements of proof
Integrating this SDE we get
‖Y_t − Ỹ_t‖² = ‖Y_0 − Ỹ_0‖² − 2 ∫_0^t ⟨∇U(Y_s) − ∇U(Ỹ_s), Y_s − Ỹ_s⟩ ds .
Since U is strongly convex,
⟨∇U(y) − ∇U(y'), y − y'⟩ ≥ m ‖y − y'‖² ,
which implies
‖Y_t − Ỹ_t‖² ≤ ‖Y_0 − Ỹ_0‖² − 2m ∫_0^t ‖Y_s − Ỹ_s‖² ds .

Slide 28

Slide 28 text

Elements of proof
‖Y_t − Ỹ_t‖² ≤ ‖Y_0 − Ỹ_0‖² − 2m ∫_0^t ‖Y_s − Ỹ_s‖² ds .
By Grönwall's inequality, we obtain
‖Y_t − Ỹ_t‖² ≤ ‖Y_0 − Ỹ_0‖² e^{−2mt} .
The proof follows since for all t ≥ 0, the law of (Y_t, Ỹ_t) is a coupling between δ_x P_t and δ_y P_t.

Slide 29

Slide 29 text

Theorem. Assume that U is L-smooth and m-strongly convex. Then, for any x ∈ R^d and t ≥ 0,
W_2(δ_x P_t, π) ≤ (‖x − x⋆‖ + √(d/m)) e^{−mt} ,
where x⋆ = arg min_{x ∈ R^d} U(x).
The constant depends only linearly on the dimension d.

Slide 30

Slide 30 text

Elements of proof
dY_t = −∇U(Y_t) dt + √2 dB_t ,  dỸ_t = −∇U(Ỹ_t) dt + √2 dB_t ,
where Y_0 = x and Ỹ_0 ∼ π. Since π is invariant for (P_t)_{t≥0}, Ỹ_t ∼ π for all t ≥ 0, showing that
W_2²(δ_x P_t, π) ≤ E[‖Y_t − Ỹ_t‖²] ≤ e^{−2mt} E[‖x − Ỹ_0‖²] .
The proof follows from E[‖x⋆ − Ỹ_0‖²] ≤ d/m.

Slide 31

Slide 31 text

A coupling proof
We now proceed to establish explicit bounds on W_2(µ_0 Q^n_γ, π) for µ_0 ∈ P_2(R^d).
Since π P_t = π for all t ≥ 0, it suffices to get some bounds on W_2(µ_0 Q^n_γ, ν_0 P_{Γ_n}), with ν_0 ∈ P_2(R^d) and Γ_n = Σ_{k=1}^n γ_k, and take ν_0 = π.
Idea: construct a coupling between the diffusion and the linear interpolation of the Euler discretization.

Slide 32

Slide 32 text

A coupling proof (II)
An obvious candidate is the synchronous coupling (Y_t, Ȳ_t)_{t≥0}, defined for all n ≥ 0 and t ∈ [Γ_n, Γ_{n+1}) by
Y_t = Y_{Γ_n} − ∫_{Γ_n}^t ∇U(Y_s) ds + √2 (B_t − B_{Γ_n}) ,
Ȳ_t = Ȳ_{Γ_n} − ∇U(Ȳ_{Γ_n}) (t − Γ_n) + √2 (B_t − B_{Γ_n}) .
For all n ≥ 0,
W_2²(µ_0 P_{Γ_n}, ν_0 Q^n_γ) ≤ E[‖Y_{Γ_n} − Ȳ_{Γ_n}‖²] .

Slide 33

Slide 33 text

Explicit bound in Wasserstein distance for the Euler discretisation
Theorem. Assume U is L-smooth and strongly convex. Let (γ_k)_{k≥1} be a nonincreasing sequence with γ_1 ≤ 1/(m + L).
(Optional assumption) U ∈ C³(R^d) and there exists L̃ such that for all x, y ∈ R^d,
‖∇²U(x) − ∇²U(y)‖ ≤ L̃ ‖x − y‖ .
Then there exist sequences {u_n^{(1)}(γ), n ∈ N} and {u_n^{(2)}(γ), n ∈ N} (explicit expressions are available) such that for all x ∈ R^d and n ≥ 1,
W_2(δ_x Q^n_γ, π) ≤ u_n^{(1)}(γ) ∫_{R^d} ‖y − x‖² π(dy) + u_n^{(2)}(γ) .

Slide 34

Slide 34 text

Decreasing step sizes
If lim_{k→+∞} γ_k = 0 and lim_{k→+∞} Γ_k = +∞, then
lim_{n→+∞} W_2(δ_x Q^n_γ, π) = 0 ,
with explicit control. Order of convergence for a decreasing stepsize:

α ∈ (0, 1) | order of convergence O(n^{−α})

Table: Order of convergence of W_2(δ_x Q^n_γ, π) for γ_k = γ_1 k^{−α}.

Slide 35

Slide 35 text

Constant step sizes
For any ε > 0, the minimal number of iterations to achieve W_2(δ_x Q^p_γ, π) ≤ ε is
p = O(√d ε^{−1}) .
For a fixed number of iterations p, we can choose the stepsize γ such that
W_2(δ_x Q^p_γ, π) ≤ C p^{−1} .
For a given stepsize γ, letting p → +∞, we get
W_2(π_γ, π) ≤ C γ .

Slide 36

Slide 36 text

From the Wasserstein distance to the TV
Theorem. If U is strongly convex, then for all x, y ∈ R^d,
‖P_t(x, ·) − P_t(y, ·)‖_TV ≤ 1 − 2Φ( −‖x − y‖ / √((4/m)(e^{2mt} − 1)) ) .
Proof: reflection coupling.
For all x, y ∈ R^d,
‖P_t(x, ·) − P_t(y, ·)‖_TV ≤ ‖x − y‖ / √((2π/m)(e^{2mt} − 1)) .

Slide 37

Slide 37 text

From the Wasserstein distance to the TV (II)
‖P_t(x, ·) − P_t(y, ·)‖_TV ≤ ‖x − y‖ / √((2π/m)(e^{2mt} − 1)) .
Consequences:
1. (P_t)_{t≥0} converges exponentially fast to π in total variation, at a rate e^{−mt}.
2. For all measurable f : R^d → R with sup|f| ≤ 1, the map x ↦ P_t f(x) is Lipschitz with Lipschitz constant smaller than 1 / √((2π/m)(e^{2mt} − 1)).

Slide 38

Slide 38 text

Explicit bound in total variation
Theorem. Assume U is L-smooth and strongly convex. Let (γ_k)_{k≥1} be a nonincreasing sequence with γ_1 ≤ 1/(m + L).
(Optional assumption) U ∈ C³(R^d) and there exists L̃ such that for all x, y ∈ R^d,
‖∇²U(x) − ∇²U(y)‖ ≤ L̃ ‖x − y‖ .
Then there exist sequences {ũ_n^{(1)}(γ), n ∈ N} and {ũ_n^{(2)}(γ), n ∈ N} such that for all x ∈ R^d and n ≥ 1,
‖δ_x Q^n_γ − π‖_TV ≤ ũ_n^{(1)}(γ) ∫_{R^d} ‖y − x‖² π(dy) + ũ_n^{(2)}(γ) .

Slide 39

Slide 39 text

Constant step sizes
For any ε > 0, the minimal number of iterations to achieve ‖δ_x Q^p_γ − π‖_TV ≤ ε is
p = O(√d log(d) ε^{−1} |log(ε)|) .
For a given stepsize γ, letting p → +∞, we get
‖π_γ − π‖_TV ≤ C γ |log(γ)| .

Slide 40

Slide 40 text

Comparison of MALA and ULA (I)
We compare MALA and ULA for logistic regression with a Gaussian prior on five real data sets.

Data set             | Observations p | Covariates d
German credit        | 1000           | 25
Heart disease        | 270            | 14
Australian credit    | 690            | 35
Pima Indian diabetes | 768            | 9
Musk                 | 476            | 167

Table: Dimension of the data sets.

Slide 41

Slide 41 text

Comparison of MALA and ULA (II)
Define the marginal accuracy between two probability measures µ, ν on (R, B(R)) by
MA(µ, ν) = 1 − (1/2) ‖µ − ν‖_TV .
We compare MALA and ULA on each data set by estimating, for each component i ∈ {1, ..., d}, the marginal accuracy between the d marginal empirical distributions and the d marginal posterior distributions.
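A sketch of one way to estimate the marginal accuracy from two sets of 1-dimensional samples, approximating the total variation distance (in the convention ‖µ − ν‖_TV = sup_{|f|≤1} |µ(f) − ν(f)| used above) by binning both samples on a common grid; the number of bins is an arbitrary choice.

```python
import numpy as np

def marginal_accuracy(samples_mu, samples_nu, n_bins=50):
    """Estimate MA(mu, nu) = 1 - 0.5 * ||mu - nu||_TV for two 1-d sample sets."""
    lo = min(samples_mu.min(), samples_nu.min())
    hi = max(samples_mu.max(), samples_nu.max())
    edges = np.linspace(lo, hi, n_bins + 1)
    p, _ = np.histogram(samples_mu, bins=edges)
    q, _ = np.histogram(samples_nu, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    tv = np.abs(p - q).sum()   # TV of the binned laws, sup_{|f|<=1} convention
    return 1.0 - 0.5 * tv
```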

Slide 42

Slide 42 text

Comparison of MALA and ULA (III)
To estimate the d marginal posterior distributions, we run 2·10^7 iterations of the Polya-Gamma Gibbs sampler.
Then 100 runs of MALA and ULA (10^6 iterations per run) have been performed.
For MALA, the step size is chosen so that the acceptance probability is ≈ 0.5. For ULA, we choose the same constant step size as for MALA.

Slide 43

Slide 43 text

Comparison of MALA and ULA (IV)
Figure: Marginal accuracy across all the dimensions. Upper left: German credit data set. Upper right: Australian credit data set. Lower left: Heart disease data set. Lower right: Pima Indian diabetes data set.

Slide 44

Slide 44 text

Comparison of MALA and ULA (V)
Figure: Marginal accuracy across all the dimensions for the Musk data set.

Slide 45

Slide 45 text

Comparison of MALA and ULA (VI)
Figure: 2-dimensional histogram for the Musk data set.

Slide 46

Slide 46 text

Outline
1. Motivation
2. Framework
3. Sampling from strongly log-concave distribution
4. Computable bounds in total variation for super-exponential densities
5. Deviation inequalities
6. Non-smooth potentials

Slide 47

Slide 47 text

Convergence of the Euler discretization
Assume one of the following conditions:
- There exist α > 1, ρ > 0 and M_ρ ≥ 0 such that for all y ∈ R^d with ‖y‖ ≥ M_ρ,
⟨∇U(y), y⟩ ≥ ρ ‖y‖^α .
- U is convex.
If lim_{k→+∞} γ_k = 0 and Σ_k γ_k = +∞, then
lim_{p→+∞} ‖δ_x Q^p_γ − π‖_TV = 0 .
Computable bounds for the convergence¹.
¹ D., Moulines, Annals of Applied Probability, 2016.

Slide 48

Slide 48 text

Target precision: the convex case
Setting: U is convex. Constant stepsize. Optimal stepsize γ and number of iterations p to achieve ε-accuracy in TV, ‖δ_x Q^p_γ − π‖_TV ≤ ε:

  | d         | ε                      | L
γ | O(d^{−3}) | O(ε² / log(ε^{−1}))     | O(L^{−2})
p | O(d^5)    | O(ε^{−2} log²(ε^{−1})) | O(L²)

In the strongly convex case, the convergence of the semigroup of the diffusion to π depends only on the strong convexity constant m. In the convex case, it depends on the dimension!

Slide 49

Slide 49 text

Strongly convex outside a ball potential
U is convex everywhere and strongly convex outside a ball, i.e. there exist R ≥ 0 and m > 0 such that for all x, y ∈ R^d with ‖x − y‖ ≥ R,
⟨∇U(x) − ∇U(y), x − y⟩ ≥ m ‖x − y‖² .
Eberle (2015) established that the convergence in the Wasserstein distance does not depend on the dimension.
D., Moulines (2016) established that the convergence of the semigroup in TV to π does not depend on the dimension but only on R, which yields new bounds that scale nicely with the dimension.

Slide 50

Slide 50 text

Dependence on the dimension
Setting: U is convex and strongly convex outside a ball. Constant stepsize. Optimal stepsize γ and number of iterations p to achieve ε-accuracy in TV, ‖δ_x Q^p_γ − π‖_TV ≤ ε:

  | d           | ε                      | L         | m         | R
γ | O(d^{−1})   | O(ε² / log(ε^{−1}))     | O(L^{−2}) | O(m)      | O(R^{−4})
p | O(d log(d)) | O(ε^{−2} log²(ε^{−1})) | O(L²)     | O(m^{−2}) | O(R^8)

Slide 51

Slide 51 text

Outline
1. Motivation
2. Framework
3. Sampling from strongly log-concave distribution
4. Computable bounds in total variation for super-exponential densities
5. Deviation inequalities
6. Non-smooth potentials

Slide 52

Slide 52 text

Bounds for functionals
Let f : R^d → R be a Lipschitz function and (X_k)_{k≥0} the Euler discretization of the Langevin diffusion. We approximate ∫_{R^d} f(x) π(dx) by the weighted average estimator
π̂_n^N(f) = Σ_{k=N+1}^{N+n} ω_{k,n} f(X_k) ,  ω_{k,n} = γ_{k+1} Γ^{−1}_{N+2,N+n+1} ,
where Γ_{p,q} = Σ_{k=p}^{q} γ_k, N ≥ 0 is the length of the burn-in period and n ≥ 1 is the number of effective samples.
Objective: compute an explicit bound for the mean squared error (MSE) of this estimator, defined by
MSE_f(N, n) = E_x[ (π̂_n^N(f) − π(f))² ] .
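A sketch of the weighted-average estimator; the array layout below (f_values[k−1] = f(X_k), gammas[k−1] = γ_k) is an illustrative convention, not one taken from the slides.

```python
import numpy as np

def weighted_average(f_values, gammas, burn_in):
    """Weighted average estimator pi_hat_n^N(f) = sum_k w_{k,n} f(X_k),
    with weights proportional to gamma_{k+1} over k = N+1, ..., N+n.

    f_values : array of f(X_k) for k = 1, ..., N+n
    gammas   : array of stepsizes gamma_k for k = 1, ..., N+n+1
    burn_in  : N, the number of discarded iterations
    """
    w = gammas[burn_in + 1:]           # gamma_{k+1} for k = N+1, ..., N+n
    w = w / w.sum()                    # normalize by Gamma_{N+2, N+n+1}
    return np.sum(w * f_values[burn_in:])
```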

Slide 53

Slide 53 text

The MSE can be decomposed into the sum of the squared bias and the variance:
MSE_f(N, n) = (E_x[π̂_n^N(f)] − π(f))² + Var_x(π̂_n^N(f)) .
By definition of the Wasserstein distance,
Bias² ≤ ‖f‖²_Lip Σ_{k=N+1}^{N+n} ω_{k,n} W_1²(δ_x Q^k_γ, π) ,
and
W_1²(δ_x Q^k_γ, π) ≤ 2(‖x − x⋆‖² + d/m) u_k^{(1)}(γ) + u_k^{(2,3)}(γ) .

Slide 54

Slide 54 text

MSE
Theorem. Assume that U is L-smooth and strongly convex. Let (γ_k)_{k≥1} be a nonincreasing sequence with γ_1 ≤ 2/(m + L). Then for all N ≥ 0, n ≥ 1 and Lipschitz functions f : R^d → R,
Var_x(π̂_n^N(f)) ≤ 8 κ^{−2} ‖f‖²_Lip Γ^{−1}_{N+2,N+n+1} u_{N,n}^{(3)}(γ) ,
where
u_{N,n}^{(3)}(γ) := 1 + Γ^{−1}_{N+2,N+n+1} (κ^{−1} + 2/(m + L)) .
The upper bound is independent of the dimension and allows one to construct honest confidence bounds.

Slide 55

Slide 55 text

Bound for the MSE

α = 0        : γ_1² + (γ_1 n)^{−1} exp(−κ γ_1 N / 2)
α ∈ (0, 1/3) : γ_1² n^{−2α} + (γ_1 n^{1−α})^{−1} exp(−κ γ_1 N^{1−α} / (2(1 − α)))
α = 1/3      : γ_1² log(n) n^{−2/3} + (γ_1 n^{2/3})^{−1} exp(−κ γ_1 N^{1/2} / 4)
α ∈ (1/3, 1) : n^{α−1} γ_1² + γ_1^{−1} exp(−κ γ_1 N^{1−α} / (2(1 − α)))
α = 1        : log(n)^{−1} γ_1² + γ_1^{−1} N^{−γ_1 κ / 2}

Table: Bound for the MSE for γ_k = γ_1 k^{−α}, for fixed γ_1 and N, with more regularity on U.

Slide 56

Slide 56 text

Optimal choice of γ_1

α            | optimal γ_1     | bound for the MSE
α = 0        | n^{−1/3}        | n^{−2/3}
α ∈ (0, 1/2) | n^{α−1/3}       | n^{−2/3}
α = 1/2      | (log(n))^{−1/3} | log^{1/3}(n) n^{−2/3}
α ∈ (1/2, 1) | 1/(m + L)       | n^{1−α}
α = 1        | 1/(m + L)       | log(n)

Table: Bound for the MSE for γ_k = γ_1 k^{−α}, for fixed n, with more regularity on U.

Slide 57

Slide 57 text

Outline
1. Motivation
2. Framework
3. Sampling from strongly log-concave distribution
4. Computable bounds in total variation for super-exponential densities
5. Deviation inequalities
6. Non-smooth potentials

Slide 58

Slide 58 text

Non-smooth potentials
The target distribution has a density π with respect to the Lebesgue measure on R^d of the form
x ↦ e^{−U(x)} / ∫_{R^d} e^{−U(y)} dy ,
where U = f + g, with f : R^d → R and g : R^d → (−∞, +∞] two lower-bounded convex functions satisfying:
1. f is continuously differentiable and gradient Lipschitz with Lipschitz constant L_f, i.e. for all x, y ∈ R^d,
‖∇f(x) − ∇f(y)‖ ≤ L_f ‖x − y‖ ;
2. g is lower semi-continuous and ∫_{R^d} e^{−g(y)} dy ∈ (0, +∞).

Slide 59

Slide 59 text

Moreau-Yosida regularization
Let h : R^d → (−∞, +∞] be an l.s.c. convex function and λ > 0. The λ-Moreau-Yosida envelope h^λ : R^d → R and the proximal operator prox_h^λ : R^d → R^d associated with h are defined for all x ∈ R^d by
h^λ(x) = inf_{y ∈ R^d} { h(y) + (2λ)^{−1} ‖x − y‖² } ≤ h(x) .
For every x ∈ R^d, the minimum is achieved at a unique point, prox_h^λ(x), which is characterized by the inclusion
x − prox_h^λ(x) ∈ λ ∂h(prox_h^λ(x)) .
The Moreau-Yosida envelope is a regularized version of h, which approximates h from below.

Slide 60

Slide 60 text

Properties of proximal operators
As λ ↓ 0, h^λ converges pointwise to h, i.e. for all x ∈ R^d,
h^λ(x) ↑ h(x) , as λ ↓ 0 .
The function h^λ is convex and continuously differentiable, with
∇h^λ(x) = λ^{−1} (x − prox_h^λ(x)) .
The proximal operator is a monotone operator: for all x, y ∈ R^d,
⟨prox_h^λ(x) − prox_h^λ(y), x − y⟩ ≥ 0 ,
which implies that the Moreau-Yosida envelope is gradient Lipschitz:
‖∇h^λ(x) − ∇h^λ(y)‖ ≤ λ^{−1} ‖x − y‖ , for all x, y ∈ R^d.
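As a concrete example (not taken from the slides), for h(x) = α‖x‖_1 the proximal operator is componentwise soft-thresholding and the Moreau-Yosida envelope is a Huber-type smoothing; the sketch below also implements the gradient formula ∇h^λ(x) = λ^{−1}(x − prox_h^λ(x)).

```python
import numpy as np

def prox_l1(x, alpha, lam):
    """prox^lambda_h for h(x) = alpha * ||x||_1: componentwise soft-thresholding
    with threshold alpha * lambda."""
    return np.sign(x) * np.maximum(np.abs(x) - alpha * lam, 0.0)

def moreau_env_l1(x, alpha, lam):
    """Moreau-Yosida envelope h^lambda(x) = h(p) + ||x - p||^2 / (2 * lambda),
    with p = prox^lambda_h(x)."""
    p = prox_l1(x, alpha, lam)
    return alpha * np.abs(p).sum() + np.sum((x - p) ** 2) / (2.0 * lam)

def grad_moreau_env_l1(x, alpha, lam):
    """Gradient formula: grad h^lambda(x) = (x - prox^lambda_h(x)) / lambda."""
    return (x - prox_l1(x, alpha, lam)) / lam
```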

Slide 61

Slide 61 text

MY regularized potential
If g is not differentiable but the proximal operator associated with g is available, its λ-Moreau-Yosida envelope g^λ can be considered. This leads to the approximation of the potential U^λ : R^d → R defined for all x ∈ R^d by
U^λ(x) = f(x) + g^λ(x) .
Theorem (D., Moulines, Pereyra, 2016). Under (H), for all λ > 0,
0 < ∫_{R^d} e^{−U^λ(y)} dy < +∞ .

Slide 62

Slide 62 text

Some approximation results
Theorem. Assume (H).
1. Then lim_{λ→0} ‖π^λ − π‖_TV = 0.
2. Assume in addition that g is Lipschitz. Then for all λ > 0,
‖π^λ − π‖_TV ≤ λ ‖g‖²_Lip .

Slide 63

Slide 63 text

The MYULA algorithm (I)
Given a regularization parameter λ > 0 and a sequence of stepsizes {γ_k, k ∈ N*}, the algorithm produces the Markov chain {X^M_k, k ∈ N}: for all k ≥ 0,
X^M_{k+1} = X^M_k − γ_{k+1} { ∇f(X^M_k) + λ^{−1} (X^M_k − prox_g^λ(X^M_k)) } + √(2 γ_{k+1}) Z_{k+1} ,
where {Z_k, k ∈ N*} is a sequence of i.i.d. d-dimensional standard Gaussian random variables.
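A sketch of the MYULA recursion; `grad_f` and `prox_g(x, lam)` are assumed to be available (for a lasso-type prior g(x) = α‖x‖_1, `prox_g` could be the soft-thresholding map sketched above).

```python
import numpy as np

def myula(x0, grad_f, prox_g, lam, gamma, n_iter, rng=None):
    """Moreau-Yosida regularized ULA targeting pi^lambda ~ exp(-f - g^lambda):
    X_{k+1} = X_k - gamma * (grad f(X_k) + (X_k - prox^lambda_g(X_k)) / lambda)
              + sqrt(2 * gamma) * Z_{k+1}."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_iter, x.size))
    for k in range(n_iter):
        drift = grad_f(x) + (x - prox_g(x, lam)) / lam
        x = x - gamma * drift + np.sqrt(2.0 * gamma) * rng.standard_normal(x.size)
        samples[k] = x
    return samples
```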

Slide 64

Slide 64 text

The MYULA algorithm (II)
MYULA targets the smoothed distribution π^λ. To compute the expectation of a function h : R^d → R under π from {X^M_k ; 0 ≤ k ≤ n}, an importance sampling step is used to correct for the regularization.
This step amounts to approximating ∫_{R^d} h(x) π(x) dx by the weighted sum
S_n^h = Σ_{k=0}^n ω_{k,n} h(X^M_k) , with ω_{k,n} = γ_k e^{ḡ^λ(X^M_k)} / Σ_{j=0}^n γ_j e^{ḡ^λ(X^M_j)} ,
where for all x ∈ R^d,
ḡ^λ(x) = g^λ(x) − g(x) = g(prox_g^λ(x)) − g(x) + (2λ)^{−1} ‖x − prox_g^λ(x)‖² .
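A sketch of the importance-sampling correction; `g` and `prox_g` are assumed callable on the MYULA samples, and the weights are normalized on the log scale for numerical stability (standard practice, not stated on the slide).

```python
import numpy as np

def reweighted_estimate(h, samples, gammas, g, prox_g, lam):
    """Approximate E_pi[h] from MYULA samples targeting pi^lambda, using
    weights proportional to gamma_k * exp(g^lambda(X_k) - g(X_k))."""
    log_w = []
    for x, gam in zip(samples, gammas):
        p = prox_g(x, lam)
        g_bar = g(p) - g(x) + np.sum((x - p) ** 2) / (2.0 * lam)   # g^lambda - g
        log_w.append(np.log(gam) + g_bar)
    log_w = np.array(log_w)
    w = np.exp(log_w - log_w.max())       # stabilized, unnormalized weights
    w = w / w.sum()
    return np.sum(w * np.array([h(x) for x in samples]))
```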

Slide 65

Slide 65 text

Image deconvolution
Objective: recover an original image x ∈ R^n from a blurred and noisy observed image y ∈ R^n related to x by the linear observation model y = Hx + w, where H is a linear operator representing the blur point spread function and w is a Gaussian vector with zero mean and covariance matrix σ² I_n.
This inverse problem is usually ill-posed or ill-conditioned: one exploits prior knowledge about x. One of the most widely used image priors for deconvolution problems is the improper total-variation prior,
π(x) ∝ exp(−α ‖∇_d x‖_1) ,
where ∇_d denotes the discrete gradient operator that computes the vertical and horizontal differences between neighbouring pixels. Then
π(x|y) ∝ exp( −‖y − Hx‖² / (2σ²) − α ‖∇_d x‖_1 ) .

Slide 66

Slide 66 text

Figure: (a) Original Boat image (256 × 256 pixels), (b) blurred image, (c) MAP estimate.

Slide 67

Slide 67 text

Credibility intervals
Figure: (a) Pixel-wise 90% credibility intervals computed with proximal MALA (computing time 35 hours), (b) approximate intervals estimated with MYULA using λ = 0.01 (computing time 3.5 hours), (c) approximate intervals estimated with MYULA using λ = 0.1 (computing time 20 minutes).

Slide 68

Slide 68 text

Microscopy dataset
Figure: panels (a) and (b).

Slide 69

Slide 69 text

HPD credible region
Consider the ℓ_1-norm prior, π(x) ∝ exp(−α ‖x‖_1); then
π(x|y) ∝ exp( −‖y − Hx‖² / (2σ²) − α ‖x‖_1 ) .
We want to test whether the molecules are indeed present in the true image (as opposed to being noise artefacts, for example), and to quantify the uncertainty about their position.
For this, it can be relevant to compute the HPD credible region C*_α = {x : U(x) ≤ η_α}, with η_α ∈ R chosen such that P(x ∈ C*_α | y) = 1 − α. Here we use α = 0.01, corresponding to the 99% confidence level.
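One common way (an assumption here, not spelled out on the slide) to approximate the threshold η_α from MCMC output is to take the empirical (1 − α)-quantile of the potential U along the chain.

```python
import numpy as np

def hpd_threshold(U, samples, alpha=0.01):
    """Approximate eta_alpha such that P(U(x) <= eta_alpha | y) = 1 - alpha,
    using the empirical quantile of U over posterior samples."""
    u_vals = np.array([U(x) for x in samples])
    return np.quantile(u_vals, 1.0 - alpha)
```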

Slide 70

Slide 70 text

Comparison with PMALA
Figure: Microscopy experiment: (a) HPD region thresholds η_α for MYULA (2 × 10^6 iterations, λ = 1, γ = 0.6) and PMALA (2 × 10^7 iterations), (b) relative approximation error of MYULA.