
Alain Durmus (LTCI, Telecom ParisTech)

S³ Seminar — November 23, 2016

https://s3-seminar.github.io/seminars/alain-durmus

Title — High dimensional sampling with the Unadjusted Langevin Algorithm

Abstract — Recently, the problem of designing MCMC samplers adapted to high-dimensional distributions and with sensible theoretical guarantees has received a lot of interest. The applications are numerous, including large-scale inference in machine learning, Bayesian nonparametrics, Bayesian inverse problems, and aggregation of experts, among others. When the density is L-smooth (the log-density is continuously differentiable and its derivative is Lipschitz), we will advocate the use of a "rejection-free" algorithm, based on the Euler discretization of the Langevin diffusion with either constant or decreasing stepsizes. We will present several new results allowing convergence to stationarity under different conditions on the log-density (from the weakest, bounded oscillations on a compact set and super-exponential tails, to the log-concave case). When the density is strongly log-concave, the convergence of an appropriately weighted empirical measure is also investigated, and bounds on the mean square error and exponential deviation inequalities for Lipschitz functions will be reported. Finally, based on optimization techniques, we will propose new methods to sample from high-dimensional distributions. In particular, we will be interested in densities which are not continuously differentiable. Some Monte Carlo experiments will be presented to support our findings.


Transcript

  1. Non-asymptotic convergence bounds for the Unadjusted Langevin Algorithm — Alain Durmus, Eric Moulines, Marcelo Pereyra (Telecom ParisTech, Ecole Polytechnique, Bristol University). November 23, 2016.
  2. Outline: 1. Motivation; 2. Framework; 3. Sampling from strongly log-concave distributions; 4. Computable bounds in total variation for super-exponential densities; 5. Deviation inequalities; 6. Non-smooth potentials.
  3. Introduction — Sampling distributions over high-dimensional state spaces has recently attracted a lot of research effort in the computational statistics and machine learning communities. Applications (non-exhaustive): 1. Bayesian inference for high-dimensional models and Bayesian nonparametrics; 2. Bayesian linear inverse problems (typically function-space problems); 3. Aggregation of estimators and experts. Most of the sampling techniques known so far do not scale to high dimensions, and the challenges in this area are numerous.
  4. Bayesian setting (I) — In a Bayesian setting, a parameter $\beta \in \mathbb{R}^d$ is endowed with a prior distribution $p$, and the observations are given by a likelihood $Y \sim \mathrm{L}(\cdot \mid \beta)$. Inference is then based on the posterior distribution
$$\pi(\mathrm{d}\beta \mid Y) = \frac{p(\mathrm{d}\beta)\,\mathrm{L}(Y \mid \beta)}{\int \mathrm{L}(Y \mid u)\,p(\mathrm{d}u)}\,.$$
In most cases the normalizing constant is not tractable:
$$\pi(\mathrm{d}\beta \mid Y) \propto p(\mathrm{d}\beta)\,\mathrm{L}(Y \mid \beta)\,.$$
  5. Bayesian setting (II) — Bayesian decision theory relies on computing expectations of the form
$$\int_{\mathbb{R}^d} f(\beta)\,\mathrm{L}(Y \mid \beta)\,p(\mathrm{d}\beta)\,.$$
Generic problem: estimate an expectation $\mathbb{E}_\pi[f]$, where π is known only up to a multiplicative factor, sampling directly from π is not an option, and π is high dimensional.
  6. Logistic and probit regression — Likelihood: binary regression set-up in which the binary observations (responses) $(Y_1, \dots, Y_n)$ are conditionally independent Bernoulli random variables with success probability $F(\beta^T X_i)$, where (1) $X_i$ is a $d$-dimensional vector of known covariates, (2) $\beta$ is a $d$-dimensional vector of unknown regression coefficients, and (3) $F$ is a distribution function. Two important special cases: probit regression, where $F$ is the standard normal distribution function, and logistic regression, where $F$ is the standard logistic distribution function, $F(t) = e^t/(1 + e^t)$.
  7. A daunting problem? — The posterior density of $\beta$ is given, up to a proportionality constant, by $\pi(\beta \mid Y, X) \propto \exp(-U(\beta))$, where the potential $U$ is
$$U(\beta) = -\sum_{i=1}^{n}\left\{ Y_i \log F(\beta^T X_i) + (1 - Y_i)\log\left(1 - F(\beta^T X_i)\right) \right\} + g(\beta)\,,$$
where $g$ is the negative log-density of the prior. Two important cases: Gaussian prior, $g(\beta) = (1/2)\beta^T \Sigma \beta$ (ridge regression), and Laplace prior, $g(\beta) = \lambda \sum_{i=1}^{d} |\beta_i|$ (lasso regression). A concrete sketch of this potential follows.
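To fix ideas, here is a minimal sketch (Python/NumPy, not the authors' code) of the potential and its gradient for logistic regression with a ridge term; the data matrix X, the labels Y and the scalar ridge weight lam are assumptions of this example, and the prior precision is taken to be lam·I for simplicity.

```python
import numpy as np
from scipy.special import expit  # numerically stable logistic function F(t) = e^t/(1 + e^t)

def U(beta, X, Y, lam=1.0):
    """Potential U(beta) for logistic regression with ridge prior g(beta) = (lam/2)||beta||^2.

    The negative log-likelihood simplifies to sum_i [log(1 + exp(x_i^T beta)) - Y_i x_i^T beta].
    """
    t = X @ beta
    return np.sum(np.logaddexp(0.0, t) - Y * t) + 0.5 * lam * beta @ beta

def grad_U(beta, X, Y, lam=1.0):
    """Gradient of U: X^T (F(X beta) - Y) + lam * beta."""
    return X.T @ (expit(X @ beta) - Y) + lam * beta
```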
  8. New challenges — Problem: the number of predictor variables $d$ is large ($10^4$ and up). Examples: text categorization, genomics and proteomics (gene expression analysis). The most popular algorithms for Bayesian inference in binary regression models are based on data augmentation: for the probit link, Albert and Chib (1993); for the logistic link, the Polya-Gamma sampler of Polson and Scott (2012).
  9. Data augmentation algorithms (I) — Data augmentation: instead of sampling $\pi(\beta \mid X, Y)$, sample $\pi(\beta, W \mid X, Y)$, a probability measure on $\mathbb{R}^{d_1} \times \mathbb{R}^{d_2}$, and take the marginal w.r.t. $\beta$. Typical application of the Gibbs sampler: sample in turn $\pi(\beta \mid X, Y, W)$ and $\pi(W \mid X, Y, \beta)$. The Gibbs sampler simulates a Markov chain $(\beta_k, W_k)_{k \ge 0}$: given $(\beta_k, W_k)$, draw $W_{k+1} \sim \pi(\cdot \mid \beta_k, X, Y)$, then $\beta_{k+1} \sim \pi(\cdot \mid W_{k+1}, X, Y)$. The target density $\pi(\beta, W \mid X, Y)$ is invariant for this Markov chain. The choice of the data augmentation should make these two steps reasonably easy; a generic loop is sketched below.
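A generic two-block data-augmentation Gibbs loop, as a sketch; the two conditional samplers are assumptions supplied by the user (e.g. the Polya-Gamma conditionals for the logistic link):

```python
import numpy as np

def da_gibbs(beta0, sample_w_given_beta, sample_beta_given_w, n_iter, rng=None):
    """Two-block Gibbs sampler for pi(beta, W | X, Y).

    sample_w_given_beta(beta, rng) -> draw W    ~ pi(. | beta, X, Y)
    sample_beta_given_w(w, rng)    -> draw beta ~ pi(. | W, X, Y)
    Returns the chain (beta_k); its beta-marginal targets pi(beta | X, Y).
    """
    rng = rng or np.random.default_rng()
    beta = np.asarray(beta0, dtype=float)
    chain = [beta]
    for _ in range(n_iter):
        w = sample_w_given_beta(beta, rng)
        beta = sample_beta_given_w(w, rng)
        chain.append(beta)
    return np.array(chain)
```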
  10. Data augmentation algorithms (II) — Question: control the distance between the law of $(\beta_n, W_n)$ and the stationary distribution $\pi(\beta, W \mid X, Y)$. Definition (geometric ergodicity): the Markov kernel $P$ on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ is geometrically ergodic if there exists $\kappa \in (0, 1)$ such that for all $n \ge 0$ and $x \in \mathbb{R}^d$,
$$\|P^n(x, \cdot) - \pi\|_{\mathrm{TV}} \le C(x)\,\kappa^n\,,$$
where, for two probability measures $\mu, \nu$ on $\mathbb{R}^d$, $\|\mu - \nu\|_{\mathrm{TV}} = \sup_{|f| \le 1} |\mu(f) - \nu(f)|$.
  11. Data augmentation algorithms (III) — The algorithm of Albert and Chib and the Polya-Gamma sampler have been shown to be uniformly geometrically ergodic, BUT the geometric rate of convergence becomes exponentially small with the dimension, which does not allow one to construct honest confidence intervals or credible regions. The algorithms are also very demanding in terms of computational resources: they are applicable when $d$ is small ($\approx 10$) to moderate ($\approx 100$), but certainly not when $d$ is large ($10^4$ or more); the convergence time becomes prohibitive as soon as $d \ge 10^2$.
  12. A daunting problem? — In the case of ridge regression, the potential $U$ is smooth and strongly convex. In the case of lasso regression, the potential $U$ is non-smooth but still convex. A wealth of reasonably fast optimization algorithms are available to solve these problems in high dimension.
  13. Outline: 1. Motivation; 2. Framework; 3. Sampling from strongly log-concave distributions; 4. Computable bounds in total variation for super-exponential densities; 5. Deviation inequalities; 6. Non-smooth potentials.
  14. Framework — Denote by π a target density w.r.t. the Lebesgue measure on $\mathbb{R}^d$, known up to a normalisation factor:
$$x \mapsto e^{-U(x)} \Big/ \int_{\mathbb{R}^d} e^{-U(y)}\,\mathrm{d}y\,.$$
Implicitly, $d \gg 1$. Assumption: $U$ is $L$-smooth, i.e. continuously differentiable with a Lipschitz gradient: there exists a constant $L$ such that for all $x, y \in \mathbb{R}^d$, $\|\nabla U(x) - \nabla U(y)\| \le L\|x - y\|$. This condition can be weakened.
  15. Langevin diffusion — Langevin SDE:
$$\mathrm{d}Y_t = -\nabla U(Y_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t\,,$$
where $(B_t)_{t \ge 0}$ is a $d$-dimensional Brownian motion. The associated semigroup $(P_t)_{t \ge 0}$ is Markov, aperiodic and strong Feller (all compact sets are small), and reversible w.r.t. π; hence $\pi \propto e^{-U}$ is its unique invariant probability measure. For all $x \in \mathbb{R}^d$, $\lim_{t \to +\infty} \|\delta_x P_t - \pi\|_{\mathrm{TV}} = 0$.
  16. Discretized Langevin diffusion — Idea: sample the diffusion paths using, for example, the Euler-Maruyama (EM) scheme:
$$X_{k+1} = X_k - \gamma_{k+1} \nabla U(X_k) + \sqrt{2\gamma_{k+1}}\, Z_{k+1}\,,$$
where $(Z_k)_{k \ge 1}$ is an i.i.d. sequence of $\mathcal{N}(0, \mathrm{I}_d)$ random variables and $(\gamma_k)_{k \ge 1}$ is a sequence of stepsizes, which can either be held constant or chosen to decrease to 0 at a certain rate. The recursion is closely related to the gradient descent algorithm (see the sketch below).
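The iteration translates directly into code; a minimal sketch (Python/NumPy), where grad_U is the user-supplied $\nabla U$ (e.g. the logistic-regression gradient sketched earlier):

```python
import numpy as np

def ula(x0, grad_U, n_iter, step, rng=None):
    """Unadjusted Langevin Algorithm (Euler-Maruyama discretization of the Langevin SDE).

    step: either a float (constant stepsize gamma) or a callable k -> gamma_k
    (e.g. lambda k: gamma1 * k ** (-0.5) for decreasing stepsizes).
    """
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    chain = np.empty((n_iter + 1,) + x.shape)
    chain[0] = x
    for k in range(1, n_iter + 1):
        gamma = step(k) if callable(step) else step
        z = rng.standard_normal(x.shape)
        x = x - gamma * grad_U(x) + np.sqrt(2.0 * gamma) * z
        chain[k] = x
    return chain
```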
  17. Discretized Langevin diffusion: constant stepsize — When $\gamma_k = \gamma$, $(X_k)_{k \ge 1}$ is a homogeneous Markov chain with Markov kernel $R_\gamma$. Under appropriate conditions, this Markov chain is irreducible and positive recurrent, hence it has a unique invariant distribution $\pi_\gamma$. Problem: $\pi_\gamma \ne \pi$.
  18. The Euler-Maruyama (EM) Markov chain — When $(\gamma_k)_{k \ge 1}$ is nonincreasing and non-constant, $(X_k)_{k \ge 1}$ is an inhomogeneous Markov chain associated with the sequence of Markov kernels $(R_{\gamma_k})_{k \ge 1}$. We denote by $\delta_x Q^p_\gamma$ the law of $X_p$ started at $x$. The diffusion converges to the target distribution; question: since the EM discretization approximates the diffusion, can we use it to sample from π? Can we control $\|\delta_x Q^p_\gamma - \pi\|_{\mathrm{TV}}$ and obtain bounds with explicit dependence on the dimension $d$? Previous works: Lamberton and Pagès (2002); Dalalyan (2014).
  19. Metropolis-Adjusted Langevin Algorithm — To correct for the discretization bias, a Metropolis-Hastings step can be included, giving the Metropolis-Adjusted Langevin Algorithm (MALA); key reference: Roberts and Tweedie (1996). Algorithm: (1) propose $Y_{k+1} = X_k - \gamma \nabla U(X_k) + \sqrt{2\gamma}\, Z_{k+1}$, with $Z_{k+1} \sim \mathcal{N}(0, \mathrm{I}_d)$; (2) compute the acceptance ratio $\alpha_\gamma(X_k, Y_{k+1})$,
$$\alpha_\gamma(x, y) = 1 \wedge \frac{\pi(y)\, r_\gamma(y, x)}{\pi(x)\, r_\gamma(x, y)}\,, \qquad r_\gamma(x, y) \propto e^{-\|y - x + \gamma \nabla U(x)\|^2/(4\gamma)}\,;$$
(3) accept or reject the proposal. A sketch of the accept/reject loop follows.
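A minimal sketch of MALA with a constant stepsize (Python/NumPy, not the authors' implementation); the accept/reject step is computed on the log scale for numerical stability:

```python
import numpy as np

def mala(x0, U, grad_U, n_iter, gamma, rng=None):
    """Metropolis-Adjusted Langevin Algorithm with constant stepsize gamma."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    Ux, gx = U(x), grad_U(x)
    chain, n_accept = [x.copy()], 0
    for _ in range(n_iter):
        # Langevin proposal: y ~ N(x - gamma * grad_U(x), 2 * gamma * I)
        y = x - gamma * gx + np.sqrt(2.0 * gamma) * rng.standard_normal(x.shape)
        Uy, gy = U(y), grad_U(y)
        # log r(y, x) - log r(x, y), with r(x, y) ∝ exp(-||y - x + gamma*grad_U(x)||^2 / (4 gamma))
        log_r_rev = -np.sum((x - y + gamma * gy) ** 2) / (4.0 * gamma)
        log_r_fwd = -np.sum((y - x + gamma * gx) ** 2) / (4.0 * gamma)
        log_alpha = (Ux - Uy) + (log_r_rev - log_r_fwd)
        if np.log(rng.uniform()) < log_alpha:
            x, Ux, gx = y, Uy, gy
            n_accept += 1
        chain.append(x.copy())
    return np.array(chain), n_accept / n_iter
```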
  20. MALA: pros and cons — Each iteration requires one gradient evaluation and one evaluation of the objective function. Geometric convergence is established under the condition that, in the tails, the acceptance region is "inwards":
$$\lim_{\|x\| \to \infty} \int_{A_\gamma(x)\, \Delta\, I(x)} r_\gamma(x, y)\,\mathrm{d}y = 0\,,$$
where $I(x) = \{y : \|y\| \le \|x\|\}$ and $A_\gamma(x)$ is the acceptance region, $A_\gamma(x) = \{y : \pi(x) r_\gamma(x, y) \le \pi(y) r_\gamma(y, x)\}$.
  21. Outline of Section 3 (Sampling from strongly log-concave distributions): convergence in Wasserstein distance; convergence in total variation; numerical experiments.
  22. Strongly convex potential — Assume that $U$ is strongly convex: there exists $m > 0$ such that for all $x, y \in \mathbb{R}^d$,
$$\langle \nabla U(x) - \nabla U(y),\, x - y \rangle \ge m\|x - y\|^2\,.$$
This yields convergence of the semigroup in Wasserstein distance.
  23. Wasserstein distance — Definition: for two probability measures $\mu, \nu$ on $\mathbb{R}^d$, define
$$W_2(\mu, \nu) = \inf_{(X, Y) \in \Pi(\mu, \nu)} \mathbb{E}^{1/2}\!\left[\|X - Y\|^2\right]\,,$$
where $(X, Y) \in \Pi(\mu, \nu)$ if $X \sim \mu$ and $Y \sim \nu$. Note that by the Cauchy-Schwarz inequality, for all $f : \mathbb{R}^d \to \mathbb{R}$ with $\|f\|_{\mathrm{Lip}} \le 1$ and all $(X, Y) \in \Pi(\mu, \nu)$,
$$|\mu(f) - \nu(f)| \le \mathbb{E}^{1/2}\!\left[\|X - Y\|^2\right], \quad \text{hence} \quad |\mu(f) - \nu(f)| \le W_2(\mu, \nu)\,.$$
A one-dimensional sanity check is sketched below.
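For intuition: in one dimension, the $W_2$ distance between two empirical measures with equally many atoms reduces to an L2 distance between sorted samples (the optimal coupling is monotone). A small sketch:

```python
import numpy as np

def w2_empirical_1d(xs, ys):
    """W2 between two empirical measures on R with the same number of atoms.

    In 1D the optimal coupling pairs order statistics, so
    W2^2 = mean of squared differences of the sorted samples.
    """
    xs, ys = np.sort(xs), np.sort(ys)
    return np.sqrt(np.mean((xs - ys) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 10_000)
y = rng.normal(1.0, 1.0, 10_000)
print(w2_empirical_1d(x, y))  # close to 1.0: W2 between N(0,1) and N(1,1) is the mean shift
```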
  24. Wasserstein distance convergence — Theorem: assume that $U$ is $L$-smooth and $m$-strongly convex. Then, for all $x, y \in \mathbb{R}^d$ and $t \ge 0$,
$$W_2(\delta_x P_t, \delta_y P_t) \le e^{-mt}\, \|x - y\|\,.$$
The mixing rate depends only on the strong convexity constant.
  25. Elements of proof — Consider the synchronously coupled pair
$$\mathrm{d}Y_t = -\nabla U(Y_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t\,, \qquad \mathrm{d}\tilde{Y}_t = -\nabla U(\tilde{Y}_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t\,,$$
with $(Y_0, \tilde{Y}_0) = (x, y)$. This SDE has a unique strong solution $(Y_t, \tilde{Y}_t)_{t \ge 0}$.
  26. Elements of proof (continued) — Consider $V_t = \|Y_t - \tilde{Y}_t\|^2$. Since
$$\mathrm{d}\{Y_t - \tilde{Y}_t\} = -\left(\nabla U(Y_t) - \nabla U(\tilde{Y}_t)\right)\mathrm{d}t$$
(the Brownian terms cancel), we get a very simple equation for $(V_t)_{t \ge 0}$:
$$\mathrm{d}V_t = -2\left\langle \nabla U(Y_t) - \nabla U(\tilde{Y}_t),\, Y_t - \tilde{Y}_t \right\rangle \mathrm{d}t\,.$$
  27. Elements of proof — Integrating, we get
$$\|Y_t - \tilde{Y}_t\|^2 = \|Y_0 - \tilde{Y}_0\|^2 - 2\int_0^t \left\langle \nabla U(Y_s) - \nabla U(\tilde{Y}_s),\, Y_s - \tilde{Y}_s \right\rangle \mathrm{d}s\,.$$
Since $U$ is strongly convex, $\langle \nabla U(y) - \nabla U(y'),\, y - y' \rangle \ge m\|y - y'\|^2$, which implies
$$\|Y_t - \tilde{Y}_t\|^2 \le \|Y_0 - \tilde{Y}_0\|^2 - 2m\int_0^t \|Y_s - \tilde{Y}_s\|^2\,\mathrm{d}s\,.$$
  28. Elements of proof — From
$$\|Y_t - \tilde{Y}_t\|^2 \le \|Y_0 - \tilde{Y}_0\|^2 - 2m\int_0^t \|Y_s - \tilde{Y}_s\|^2\,\mathrm{d}s\,,$$
Grönwall's inequality yields
$$\|Y_t - \tilde{Y}_t\|^2 \le \|Y_0 - \tilde{Y}_0\|^2\, e^{-2mt}\,.$$
The proof follows since, for all $t \ge 0$, the law of $(Y_t, \tilde{Y}_t)$ is a coupling between $\delta_x P_t$ and $\delta_y P_t$. This contraction can also be checked numerically (see the sketch below).
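The contraction can be verified with a synchronous coupling (same Brownian increments for both chains); a sketch for the hypothetical strongly convex potential $U(x) = (m/2)\|x\|^2$, for which the rate $e^{-mt}$ is exact:

```python
import numpy as np

m, dt, T = 1.0, 1e-3, 5.0
rng = np.random.default_rng(1)
grad_U = lambda x: m * x  # U(x) = (m/2)||x||^2, so the contraction rate is exactly m

y, y_tilde = np.full(10, 5.0), np.zeros(10)
d0 = np.linalg.norm(y - y_tilde)
for _ in range(int(T / dt)):
    z = rng.standard_normal(10)  # shared Brownian increment: the noise cancels in the difference
    y = y - dt * grad_U(y) + np.sqrt(2 * dt) * z
    y_tilde = y_tilde - dt * grad_U(y_tilde) + np.sqrt(2 * dt) * z
print(np.linalg.norm(y - y_tilde) / d0, np.exp(-m * T))  # both ~ e^{-mT}
```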
  29. Theorem — Assume that $U$ is $L$-smooth and $m$-strongly convex. Then, for any $x \in \mathbb{R}^d$ and $t \ge 0$,
$$W_2(\delta_x P_t, \pi) \le \left( \|x - x^\star\| + \sqrt{d/m} \right) e^{-mt}\,, \qquad x^\star = \arg\min_{x \in \mathbb{R}^d} U(x)\,.$$
The squared constant depends only linearly on the dimension $d$.
  30. Elements of proof — Run the two synchronously coupled diffusions
$$\mathrm{d}Y_t = -\nabla U(Y_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t\,, \qquad \mathrm{d}\tilde{Y}_t = -\nabla U(\tilde{Y}_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t\,,$$
with $Y_0 = x$ and $\tilde{Y}_0 \sim \pi$. Since π is invariant for $(P_t)_{t \ge 0}$, $\tilde{Y}_t \sim \pi$ for all $t \ge 0$, showing that
$$W_2^2(\delta_x P_t, \pi) \le \mathbb{E}\left[\|Y_t - \tilde{Y}_t\|^2\right] \le e^{-2mt}\, \mathbb{E}\left[\|x - \tilde{Y}_0\|^2\right]\,.$$
The proof follows from $\mathbb{E}[\|x^\star - \tilde{Y}_0\|^2] \le d/m$ and the triangle inequality.
  31. A coupling proof — We now establish explicit bounds for $W_2(\mu_0 Q^n_\gamma, \pi)$ for $\mu_0 \in \mathcal{P}_2(\mathbb{R}^d)$. Since $\pi P_t = \pi$ for all $t \ge 0$, it suffices to bound $W_2(\mu_0 Q^n_\gamma, \nu_0 P_{\Gamma_n})$ with $\nu_0 \in \mathcal{P}_2(\mathbb{R}^d)$, where $\Gamma_n = \sum_{k=1}^n \gamma_k$, and take $\nu_0 = \pi$. Idea: construct a coupling between the diffusion and the linear interpolation of the Euler discretization.
  32. A coupling proof (II) — An obvious candidate is the synchronous coupling $(Y_t, \bar{Y}_t)_{t \ge 0}$, defined for all $n \ge 0$ and $t \in [\Gamma_n, \Gamma_{n+1})$ by
$$Y_t = Y_{\Gamma_n} - \int_{\Gamma_n}^t \nabla U(Y_s)\,\mathrm{d}s + \sqrt{2}\,(B_t - B_{\Gamma_n})\,,$$
$$\bar{Y}_t = \bar{Y}_{\Gamma_n} - \nabla U(\bar{Y}_{\Gamma_n})\,(t - \Gamma_n) + \sqrt{2}\,(B_t - B_{\Gamma_n})\,.$$
For all $n \ge 0$, $W_2^2(\mu_0 P_{\Gamma_n}, \nu_0 Q^n_\gamma) \le \mathbb{E}\left[\|Y_{\Gamma_n} - \bar{Y}_{\Gamma_n}\|^2\right]$.
  33. Explicit bound in Wasserstein distance for the Euler discretisation — Theorem: assume $U$ is $L$-smooth and strongly convex, and let $(\gamma_k)_{k \ge 1}$ be a nonincreasing sequence with $\gamma_1 \le 1/(m + L)$. (Optional assumption: $U \in C^3(\mathbb{R}^d)$ and there exists $\tilde{L}$ such that for all $x, y \in \mathbb{R}^d$, $\|\nabla^2 U(x) - \nabla^2 U(y)\| \le \tilde{L}\|x - y\|$.) Then there exist sequences $\{u^{(1)}_n(\gamma),\, n \in \mathbb{N}\}$ and $\{u^{(2)}_n(\gamma),\, n \in \mathbb{N}\}$ (explicit expressions are available) such that for all $x \in \mathbb{R}^d$ and $n \ge 1$,
$$W_2\!\left(\delta_x Q^n_\gamma, \pi\right) \le u^{(1)}_n(\gamma) \int_{\mathbb{R}^d} \|y - x\|^2\, \pi(\mathrm{d}y) + u^{(2)}_n(\gamma)\,.$$
  34. Decreasing step sizes — If $\lim_{k \to +\infty} \gamma_k = 0$ and $\lim_{k \to +\infty} \Gamma_k = +\infty$, then
$$\lim_{n \to +\infty} W_2\!\left(\delta_x Q^n_\gamma, \pi\right) = 0\,,$$
with explicit control. For $\gamma_k = \gamma_1 k^{-\alpha}$ with $\alpha \in (0, 1)$, the order of convergence of $W_2(\delta_x Q^n_\gamma, \pi)$ is $O(n^{-\alpha})$.
  35. Constant step sizes — For any $\epsilon > 0$, the minimal number of iterations to achieve $W_2(\delta_x Q^p_\gamma, \pi) \le \epsilon$ is $p = O(\sqrt{d}\,\epsilon^{-1})$. For a fixed number of iterations $p$, the stepsize γ can be chosen so that $W_2(\delta_x Q^p_\gamma, \pi) \le C p^{-1}$. For a given stepsize γ, letting $p \to +\infty$, we get $W_2(\pi_\gamma, \pi) \le C\gamma$.
  36. From the Wasserstein distance to the TV — Theorem: if $U$ is strongly convex, then for all $x, y \in \mathbb{R}^d$,
$$\|P_t(x, \cdot) - P_t(y, \cdot)\|_{\mathrm{TV}} \le 1 - 2\Phi\!\left( -\frac{\|x - y\|}{\sqrt{(4/m)(e^{2mt} - 1)}} \right)\,.$$
The proof uses a reflection coupling. As a consequence, for all $x, y \in \mathbb{R}^d$,
$$\|P_t(x, \cdot) - P_t(y, \cdot)\|_{\mathrm{TV}} \le \frac{\|x - y\|}{\sqrt{(2\pi/m)(e^{2mt} - 1)}}\,.$$
  37. From the Wasserstein distance to the TV (II) — From
$$\|P_t(x, \cdot) - P_t(y, \cdot)\|_{\mathrm{TV}} \le \frac{\|x - y\|}{\sqrt{(2\pi/m)(e^{2mt} - 1)}}$$
we get two consequences: (1) $(P_t)_{t \ge 0}$ converges exponentially fast to π in total variation, at rate $e^{-mt}$; (2) for every measurable $f : \mathbb{R}^d \to \mathbb{R}$ with $\sup|f| \le 1$, the map $x \mapsto P_t f(x)$ is Lipschitz with Lipschitz constant at most $1/\sqrt{(2\pi/m)(e^{2mt} - 1)}$.
  38. Explicit bound in total variation — Theorem: assume $U$ is $L$-smooth and strongly convex, and let $(\gamma_k)_{k \ge 1}$ be a nonincreasing sequence with $\gamma_1 \le 1/(m + L)$. (Optional assumption: $U \in C^3(\mathbb{R}^d)$ and there exists $\tilde{L}$ such that for all $x, y \in \mathbb{R}^d$, $\|\nabla^2 U(x) - \nabla^2 U(y)\| \le \tilde{L}\|x - y\|$.) Then there exist sequences $\{\tilde{u}^{(1)}_n(\gamma),\, n \in \mathbb{N}\}$ and $\{\tilde{u}^{(2)}_n(\gamma),\, n \in \mathbb{N}\}$ such that for all $x \in \mathbb{R}^d$ and $n \ge 1$,
$$\|\delta_x Q^n_\gamma - \pi\|_{\mathrm{TV}} \le \tilde{u}^{(1)}_n(\gamma) \int_{\mathbb{R}^d} \|y - x\|^2\, \pi(\mathrm{d}y) + \tilde{u}^{(2)}_n(\gamma)\,.$$
  39. Constant step sizes — For any $\epsilon > 0$, the minimal number of iterations to achieve $\|\delta_x Q^p_\gamma - \pi\|_{\mathrm{TV}} \le \epsilon$ is $p = O(\sqrt{d}\,\log(d)\,\epsilon^{-1}|\log(\epsilon)|)$. For a given stepsize γ, letting $p \to +\infty$, we get $\|\pi_\gamma - \pi\|_{\mathrm{TV}} \le C\gamma\,|\log(\gamma)|$.
  40. Comparison of MALA and ULA (I) — We compare MALA and ULA for logistic regression with a Gaussian prior on five real data sets. Dimensions of the data sets (observations, covariates d): German credit (1000, 25); Heart disease (270, 14); Australian credit (690, 35); Pima Indian diabetes (768, 9); Musk (476, 167).
  41. Comparison of MALA and ULA (II) — Define the marginal accuracy between two probability measures $\mu, \nu$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ by
$$\mathrm{MA}(\mu, \nu) = 1 - \tfrac{1}{2}\|\mu - \nu\|_{\mathrm{TV}}\,.$$
We compare MALA and ULA on each data set by estimating, for each component $i \in \{1, \dots, d\}$, the marginal accuracy between the marginal empirical distributions of the chains and the marginal posterior distributions (a histogram-based estimate is sketched below).
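In practice the marginal accuracy can be estimated by binning the two samples on a shared grid; a sketch (the bin count n_bins is an assumption of the example):

```python
import numpy as np

def marginal_accuracy(sample_mu, sample_nu, n_bins=50):
    """MA(mu, nu) = 1 - (1/2)||mu - nu||_TV, estimated from histograms on a shared grid.

    With the convention ||mu - nu||_TV = sup_{|f| <= 1} |mu(f) - nu(f)| used on the slide,
    the TV distance between the two histograms is sum_i |p_i - q_i|.
    """
    lo = min(sample_mu.min(), sample_nu.min())
    hi = max(sample_mu.max(), sample_nu.max())
    edges = np.linspace(lo, hi, n_bins + 1)
    p, _ = np.histogram(sample_mu, bins=edges)
    q, _ = np.histogram(sample_nu, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    return 1.0 - 0.5 * np.abs(p - q).sum()
```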
  42. Comparison of MALA and ULA (III) — To estimate the $d$ marginal posterior distributions, we run $2 \times 10^7$ iterations of the Polya-Gamma Gibbs sampler. Then 100 runs of MALA and ULA ($10^6$ iterations per run) are performed. For MALA, the step size is chosen so that the acceptance probability is $\approx 0.5$; for ULA, we use the same constant step size as for MALA.
  43. Comparison of MALA and ULA (IV) — Figure: marginal accuracy across all dimensions (boxplots for ULA and MALA). Upper left: German credit data set. Upper right: Australian credit data set. Lower left: Heart disease data set. Lower right: Pima Indian diabetes data set.
  44. Comparison of MALA and ULA (V) — Figure: marginal accuracy across all dimensions for the Musk data set.
  45. Comparison of MALA and ULA (VI) — Figure: 2-dimensional histogram for the Musk data set.
  46. Outline: 1. Motivation; 2. Framework; 3. Sampling from strongly log-concave distributions; 4. Computable bounds in total variation for super-exponential densities; 5. Deviation inequalities; 6. Non-smooth potentials.
  47. Convergence of the Euler discretization — Assume one of the following conditions: (a) there exist $\alpha > 1$, $\rho > 0$ and $M_\rho \ge 0$ such that for all $y \in \mathbb{R}^d$ with $\|y\| \ge M_\rho$, $\langle \nabla U(y), y \rangle \ge \rho\|y\|^\alpha$; or (b) $U$ is convex. If $\lim_{k \to +\infty} \gamma_k = 0$ and $\sum_k \gamma_k = +\infty$, then
$$\lim_{p \to +\infty} \|\delta_x Q^p_\gamma - \pi\|_{\mathrm{TV}} = 0\,,$$
with computable bounds for the convergence (D., Moulines, Annals of Applied Probability, 2016).
  48. Target precision: the convex case — Setting: $U$ is convex; constant stepsize. Optimal stepsize γ and number of iterations $p$ to achieve $\epsilon$-accuracy in TV, $\|\delta_x Q^p_\gamma - \pi\|_{\mathrm{TV}} \le \epsilon$:
- stepsize γ: $O(d^{-3})$ in $d$; $O(\epsilon^2/\log(\epsilon^{-1}))$ in $\epsilon$; $O(L^{-2})$ in $L$
- iterations $p$: $O(d^5)$ in $d$; $O(\epsilon^{-2}\log^2(\epsilon^{-1}))$ in $\epsilon$; $O(L^2)$ in $L$
In the strongly convex case, the convergence of the semigroup of the diffusion to π depends only on the strong convexity constant $m$; in the merely convex case, it depends on the dimension!
  49. Strongly convex outside a ball — Assume the potential $U$ is convex everywhere and strongly convex outside a ball: there exist $R \ge 0$ and $m > 0$ such that for all $x, y \in \mathbb{R}^d$ with $\|x - y\| \ge R$,
$$\langle \nabla U(x) - \nabla U(y),\, x - y \rangle \ge m\|x - y\|^2\,.$$
Eberle (2015) established that the convergence in Wasserstein distance does not depend on the dimension. D., Moulines (2016) established that the convergence of the semigroup to π in TV does not depend on the dimension but only on $R$, giving new bounds which scale nicely with the dimension.
  50. Dependence on the dimension — Setting: $U$ is convex and strongly convex outside a ball; constant stepsize. Optimal stepsize γ and number of iterations $p$ to achieve $\epsilon$-accuracy in TV, $\|\delta_x Q^p_\gamma - \pi\|_{\mathrm{TV}} \le \epsilon$:
- stepsize γ: $O(d^{-1})$ in $d$; $O(\epsilon^2/\log(\epsilon^{-1}))$ in $\epsilon$; $O(L^{-2})$ in $L$; $O(m)$ in $m$; $O(R^{-4})$ in $R$
- iterations $p$: $O(d\log(d))$ in $d$; $O(\epsilon^{-2}\log^2(\epsilon^{-1}))$ in $\epsilon$; $O(L^2)$ in $L$; $O(m^{-2})$ in $m$; $O(R^8)$ in $R$
  51. Outline: 1. Motivation; 2. Framework; 3. Sampling from strongly log-concave distributions; 4. Computable bounds in total variation for super-exponential densities; 5. Deviation inequalities; 6. Non-smooth potentials.
  52. Bounds for functionals — Let $f : \mathbb{R}^d \to \mathbb{R}$ be a Lipschitz function and $(X_k)_{k \ge 0}$ the Euler discretization of the Langevin diffusion. We approximate $\int_{\mathbb{R}^d} f(x)\,\pi(\mathrm{d}x)$ by the weighted-average estimator
$$\hat{\pi}^N_n(f) = \sum_{k=N+1}^{N+n} \omega_{k,n}\, f(X_k)\,, \qquad \omega_{k,n} = \gamma_{k+1}\, \Gamma^{-1}_{N+2, N+n+1}\,,$$
with $\Gamma_{i,j} = \sum_{k=i}^{j} \gamma_k$, where $N \ge 0$ is the length of the burn-in period and $n \ge 1$ is the number of effective samples. Objective: compute explicit bounds for the mean square error (MSE) of this estimator, defined by
$$\mathrm{MSE}_f(N, n) = \mathbb{E}_x\!\left[\left( \hat{\pi}^N_n(f) - \pi(f) \right)^2\right]\,.$$
A sketch of the estimator follows.
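A minimal sketch of the weighted-average estimator, to be applied after a chain and its stepsizes have been generated (e.g. by the ULA sketch above); the indexing convention is an assumption of the example:

```python
import numpy as np

def pi_hat(chain, gammas, f, N):
    """Weighted-average estimator pi_hat_N^n(f) = sum_{k=N+1}^{N+n} omega_{k,n} f(X_k).

    Convention: chain[k] is X_k and gammas[k] is gamma_{k+1}, so the (unnormalized)
    weight of X_k is gamma_{k+1}; normalizing gives omega_{k,n} = gamma_{k+1} / Gamma.
    """
    f_vals = np.array([f(x) for x in chain[N + 1:]])
    w = np.asarray(gammas, dtype=float)[N + 1:N + 1 + len(f_vals)]
    return np.dot(w / w.sum(), f_vals)
```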
  53. Bias-variance decomposition — The MSE can be decomposed into the sum of the squared bias and the variance:
$$\mathrm{MSE}_f(N, n) = \left( \mathbb{E}_x[\hat{\pi}^N_n(f)] - \pi(f) \right)^2 + \mathrm{Var}_x\!\left( \hat{\pi}^N_n(f) \right)\,.$$
By definition of the Wasserstein distance,
$$\mathrm{Bias}^2 \le \|f\|^2_{\mathrm{Lip}} \sum_{k=N+1}^{N+n} \omega_{k,n}\, W_1^2\!\left(\delta_x Q^k_\gamma, \pi\right)\,,$$
and
$$W_1^2\!\left(\delta_x Q^k_\gamma, \pi\right) \le 2\left( \|x - x^\star\|^2 + d/m \right) u^{(1)}_k(\gamma) + u^{(2,3)}_k(\gamma)\,.$$
  54. MSE — Theorem: assume that $U$ is $L$-smooth and strongly convex, and let $(\gamma_k)_{k \ge 1}$ be a nonincreasing sequence with $\gamma_1 \le 2/(m + L)$. Then for all $N \ge 0$, $n \ge 1$ and Lipschitz functions $f : \mathbb{R}^d \to \mathbb{R}$,
$$\mathrm{Var}_x\!\left( \hat{\pi}^N_n(f) \right) \le 8\kappa^{-2}\, \|f\|^2_{\mathrm{Lip}}\, \Gamma^{-1}_{N+2, N+n+1}\, u^{(3)}_{N,n}(\gamma)\,, \qquad u^{(3)}_{N,n}(\gamma) \stackrel{\mathrm{def}}{=} 1 + \Gamma^{-1}_{N+2, N+n+1}\left( \kappa^{-1} + 2/(m + L) \right)\,.$$
The upper bound is independent of the dimension and allows one to construct honest confidence bounds.
  55. Bound for the MSE — For $\gamma_k = \gamma_1 k^{-\alpha}$ with fixed $\gamma_1$ and $N$ (with more regularity on $U$):
- $\alpha = 0$: $\gamma_1^2 + (\gamma_1 n)^{-1} + \exp(-\kappa\gamma_1 N/2)$
- $\alpha \in (0, 1/3)$: $\gamma_1^2 n^{-2\alpha} + (\gamma_1 n^{1-\alpha})^{-1} + \exp(-\kappa\gamma_1 N^{1-\alpha}/(2(1-\alpha)))$
- $\alpha = 1/3$: $\gamma_1^2 \log(n)\, n^{-2/3} + (\gamma_1 n^{2/3})^{-1} + \exp(-\kappa\gamma_1 N^{1/2}/4)$
- $\alpha \in (1/3, 1)$: $n^{\alpha-1}\gamma_1^2 + \gamma_1^{-1}\exp(-\kappa\gamma_1 N^{1-\alpha}/(2(1-\alpha)))$
- $\alpha = 1$: $(\log n)^{-1}\gamma_1^2 + \gamma_1^{-1} N^{-\gamma_1\kappa/2}$
  56. Optimal choice of $\gamma_1$ — For $\gamma_k = \gamma_1 k^{-\alpha}$ with fixed $n$ (with more regularity on $U$), the optimal $\gamma_1$ and the resulting bound for the MSE:
- $\alpha = 0$: $\gamma_1 \asymp n^{-1/3}$, MSE bound $n^{-2/3}$
- $\alpha \in (0, 1/2)$: $\gamma_1 \asymp n^{\alpha - 1/3}$, MSE bound $n^{-2/3}$
- $\alpha = 1/2$: $\gamma_1 \asymp (\log n)^{-1/3}$, MSE bound $\log^{1/3}(n)\, n^{-2/3}$
- $\alpha \in (1/2, 1)$: $\gamma_1 = 1/(m + L)$, MSE bound $n^{\alpha - 1}$
- $\alpha = 1$: $\gamma_1 = 1/(m + L)$, MSE bound $1/\log(n)$
  57. Outline: 1. Motivation; 2. Framework; 3. Sampling from strongly log-concave distributions; 4. Computable bounds in total variation for super-exponential densities; 5. Deviation inequalities; 6. Non-smooth potentials.
  58. Non-smooth potentials — The target distribution has a density π with respect to the Lebesgue measure on $\mathbb{R}^d$ of the form $x \mapsto e^{-U(x)}/\int_{\mathbb{R}^d} e^{-U(y)}\,\mathrm{d}y$, where $U = f + g$ and $f : \mathbb{R}^d \to \mathbb{R}$, $g : \mathbb{R}^d \to (-\infty, +\infty]$ are two lower-bounded convex functions satisfying: (1) $f$ is continuously differentiable and gradient-Lipschitz with Lipschitz constant $L_f$, i.e. for all $x, y \in \mathbb{R}^d$, $\|\nabla f(x) - \nabla f(y)\| \le L_f\|x - y\|$; (2) $g$ is lower semi-continuous and $\int_{\mathbb{R}^d} e^{-g(y)}\,\mathrm{d}y \in (0, +\infty)$.
  59. Moreau-Yosida regularization — Let $h : \mathbb{R}^d \to (-\infty, +\infty]$ be an l.s.c. convex function and $\lambda > 0$. The λ-Moreau-Yosida envelope $h^\lambda : \mathbb{R}^d \to \mathbb{R}$ and the proximal operator $\mathrm{prox}^\lambda_h : \mathbb{R}^d \to \mathbb{R}^d$ associated with $h$ are defined for all $x \in \mathbb{R}^d$ by
$$h^\lambda(x) = \inf_{y \in \mathbb{R}^d} \left\{ h(y) + (2\lambda)^{-1}\|x - y\|^2 \right\} \le h(x)\,.$$
For every $x \in \mathbb{R}^d$, the minimum is achieved at a unique point, $\mathrm{prox}^\lambda_h(x)$, which is characterized by the inclusion
$$x - \mathrm{prox}^\lambda_h(x) \in \lambda\, \partial h\!\left(\mathrm{prox}^\lambda_h(x)\right)\,.$$
The Moreau-Yosida envelope is a regularized version of $h$ which approximates $h$ from below (a concrete instance for the $\ell_1$ norm is sketched below).
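For the $\ell_1$ norm (the lasso prior above), the proximal operator has the closed-form soft-thresholding expression, so the envelope and its gradient are cheap to evaluate; a sketch:

```python
import numpy as np

def prox_l1(x, lam):
    """Proximal operator of h = ||.||_1 at scale lam: componentwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def my_envelope_l1(x, lam):
    """Moreau-Yosida envelope h^lam(x) = h(p) + ||x - p||^2 / (2 lam), with p = prox."""
    p = prox_l1(x, lam)
    return np.sum(np.abs(p)) + np.sum((x - p) ** 2) / (2.0 * lam)

def grad_my_envelope_l1(x, lam):
    """Gradient of the envelope: grad h^lam(x) = (x - prox_h^lam(x)) / lam."""
    return (x - prox_l1(x, lam)) / lam
```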
  60. Properties of proximal operators — As $\lambda \downarrow 0$, $h^\lambda$ converges pointwise to $h$: for all $x \in \mathbb{R}^d$, $h^\lambda(x) \uparrow h(x)$ as $\lambda \downarrow 0$. The function $h^\lambda$ is convex and continuously differentiable, with
$$\nabla h^\lambda(x) = \lambda^{-1}\left( x - \mathrm{prox}^\lambda_h(x) \right)\,.$$
The proximal operator is monotone: for all $x, y \in \mathbb{R}^d$, $\langle \mathrm{prox}^\lambda_h(x) - \mathrm{prox}^\lambda_h(y),\, x - y \rangle \ge 0$, which implies that the Moreau-Yosida envelope is smooth with gradient-Lipschitz constant $\lambda^{-1}$:
$$\|\nabla h^\lambda(x) - \nabla h^\lambda(y)\| \le \lambda^{-1}\|x - y\|\,, \quad \text{for all } x, y \in \mathbb{R}^d\,.$$
  61. MY regularized potential — If $g$ is not differentiable but the proximal operator associated with $g$ is available, its λ-Moreau-Yosida envelope $g^\lambda$ can be considered. This leads to the approximation of the potential $U^\lambda : \mathbb{R}^d \to \mathbb{R}$ defined for all $x \in \mathbb{R}^d$ by
$$U^\lambda(x) = f(x) + g^\lambda(x)\,.$$
Theorem (D., Moulines, Pereyra, 2016): under (H), for all $\lambda > 0$, $0 < \int_{\mathbb{R}^d} e^{-U^\lambda(y)}\,\mathrm{d}y < +\infty$.
  62. Some approximation results — Theorem: assume (H). (1) Then $\lim_{\lambda \to 0} \|\pi^\lambda - \pi\|_{\mathrm{TV}} = 0$. (2) Assume in addition that $g$ is Lipschitz. Then for all $\lambda > 0$, $\|\pi^\lambda - \pi\|_{\mathrm{TV}} \le \lambda\|g\|^2_{\mathrm{Lip}}$.
  63. The MYULA algorithm (I) — Given a regularization parameter $\lambda > 0$ and a sequence of stepsizes $\{\gamma_k,\, k \in \mathbb{N}^*\}$, the algorithm produces the Markov chain $\{X^M_k,\, k \in \mathbb{N}\}$: for all $k \ge 0$,
$$X^M_{k+1} = X^M_k - \gamma_{k+1}\left( \nabla f(X^M_k) + \lambda^{-1}\left( X^M_k - \mathrm{prox}^\lambda_g(X^M_k) \right) \right) + \sqrt{2\gamma_{k+1}}\, Z_{k+1}\,,$$
where $\{Z_k,\, k \in \mathbb{N}^*\}$ is a sequence of i.i.d. $d$-dimensional standard Gaussian random variables. A sketch of the iteration follows.
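A minimal sketch of the MYULA iteration (Python/NumPy, not the authors' implementation); it is just ULA applied to the smoothed potential $U^\lambda = f + g^\lambda$, using the envelope-gradient identity from slide 60:

```python
import numpy as np

def myula(x0, grad_f, prox_g, lam, gammas, rng=None):
    """MYULA: ULA applied to U^lam = f + g^lam, with
    grad g^lam(x) = (x - prox_g^lam(x)) / lam.

    prox_g(x, lam) is the user-supplied proximal operator of g (e.g. prox_l1 above).
    """
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    chain = [x.copy()]
    for gamma in gammas:
        drift = grad_f(x) + (x - prox_g(x, lam)) / lam
        x = x - gamma * drift + np.sqrt(2.0 * gamma) * rng.standard_normal(x.shape)
        chain.append(x.copy())
    return np.array(chain)
```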
  64. The MYULA algorithm (II) — MYULA targets the smoothed distribution $\pi^\lambda$. To compute the expectation of a function $h : \mathbb{R}^d \to \mathbb{R}$ under π from $\{X^M_k;\ 0 \le k \le n\}$, an importance-sampling step is used to correct for the regularization. This step approximates $\int_{\mathbb{R}^d} h(x)\pi(x)\,\mathrm{d}x$ by the weighted sum
$$S^h_n = \sum_{k=0}^{n} \omega_{k,n}\, h(X^M_k)\,, \qquad \omega_{k,n} = \frac{\gamma_k\, e^{\bar{g}_\lambda(X^M_k)}}{\sum_{k=0}^{n} \gamma_k\, e^{\bar{g}_\lambda(X^M_k)}}\,,$$
where for all $x \in \mathbb{R}^d$,
$$\bar{g}_\lambda(x) = g^\lambda(x) - g(x) = g\!\left(\mathrm{prox}^\lambda_g(x)\right) - g(x) + (2\lambda)^{-1}\left\|x - \mathrm{prox}^\lambda_g(x)\right\|^2\,.$$
A sketch of the weight computation follows.
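A sketch of the importance weights, computed in log space before normalization for stability; g and prox_g are the user-supplied function and its proximal operator:

```python
import numpy as np

def myula_weights(chain, g, prox_g, lam, gammas):
    """Normalized weights omega_{k,n} ∝ gamma_k * exp(gbar_lam(X_k)), where
    gbar_lam(x) = g(prox(x)) - g(x) + ||x - prox(x)||^2 / (2 lam)  (always <= 0)."""
    gbar = []
    for x in chain:
        p = prox_g(x, lam)
        gbar.append(g(p) - g(x) + np.sum((x - p) ** 2) / (2.0 * lam))
    logw = np.log(np.asarray(gammas, dtype=float)) + np.asarray(gbar)
    w = np.exp(logw - logw.max())  # subtract the max before exponentiating
    return w / w.sum()

# S_n^h = sum_k weights[k] * h(chain[k]) then estimates E_pi[h].
```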
  65. Image deconvolution — Objective: recover an original image $x \in \mathbb{R}^n$ from a blurred and noisy observation $y \in \mathbb{R}^n$ related to $x$ by the linear model $y = Hx + w$, where $H$ is a linear operator representing the blur point spread function and $w$ is a zero-mean Gaussian vector with covariance matrix $\sigma^2 \mathrm{I}_n$. This inverse problem is usually ill-posed or ill-conditioned, so one exploits prior knowledge about $x$. One of the most widely used image priors for deconvolution is the (improper) total-variation prior, $\pi(x) \propto \exp(-\alpha\|\nabla_d x\|_1)$, where $\nabla_d$ denotes the discrete gradient operator computing the vertical and horizontal differences between neighbouring pixels. The posterior is
$$\pi(x \mid y) \propto \exp\!\left( -\|y - Hx\|^2/(2\sigma^2) - \alpha\|\nabla_d x\|_1 \right)\,.$$
  66. Figure: (a) original Boat image (256 × 256 pixels), (b) blurred image, (c) MAP estimate.
  67. Credibility intervals — Figure: (a) pixel-wise 90% credibility intervals computed with proximal MALA (computing time 35 hours), (b) approximate intervals estimated with MYULA using λ = 0.01 (computing time 3.5 hours), (c) approximate intervals estimated with MYULA using λ = 0.1 (computing time 20 minutes).
  68. Microscopy dataset — Figure: panels (a) and (b).
  69. HPD credible region — Consider the $\ell_1$ prior, $\pi(x) \propto \exp(-\alpha\|x\|_1)$, so that
$$\pi(x \mid y) \propto \exp\!\left( -\|y - Hx\|^2/(2\sigma^2) - \alpha\|x\|_1 \right)\,.$$
We want to test whether the molecules are indeed present in the true image (as opposed to being noise artefacts, for example) and to quantify the uncertainty about their positions. For this, it is relevant to compute the highest-posterior-density (HPD) credible region $C^*_\alpha = \{x : U(x) \le \eta_\alpha\}$, with $\eta_\alpha \in \mathbb{R}$ chosen such that $\mathbb{P}(x \in C^*_\alpha \mid y) = 1 - \alpha$. Here we use $\alpha = 0.01$, corresponding to the 99% confidence level.
  70. Comparison with PMALA — Figure: microscopy experiment: (a) HPD region thresholds $\eta_\alpha$ for MYULA ($2 \times 10^6$ iterations, λ = 1, γ = 0.6) and PMALA ($2 \times 10^7$ iterations), (b) relative approximation error of MYULA.