
Alain Durmus (LTCI, Telecom ParisTech)

S³ Seminar — November 23, 2016

https://s3-seminar.github.io/seminars/alain-durmus

Title — High dimensional sampling with the Unadjusted Langevin Algorithm

Abstract — Recently, the problem of designing MCMC samplers adapted to high-dimensional distributions and with sensible theoretical guarantees has received a lot of interest. The applications are numerous, including large-scale inference in machine learning, Bayesian nonparametrics, Bayesian inverse problems, and aggregation of experts, among others. When the density is L-smooth (the log-density is continuously differentiable and its derivative is Lipschitz), we will advocate the use of a "rejection-free" algorithm, based on the Euler discretization of the Langevin diffusion with either constant or decreasing stepsizes. We will present several new results allowing convergence to stationarity under different conditions on the log-density (from the weakest, bounded oscillations on a compact set and super-exponential tails, to the log-concave case). When the density is strongly log-concave, the convergence of an appropriately weighted empirical measure is also investigated, and bounds on the mean square error and exponential deviation inequalities for Lipschitz functions will be reported. Finally, based on optimization techniques, we will propose new methods to sample from high-dimensional distributions. In particular, we will be interested in densities which are not continuously differentiable. Some Monte Carlo experiments will be presented to support our findings.


Transcript

  1. Non-asymptotic convergence bounds for the Unadjusted Langevin Algorithm — Alain Durmus, Eric Moulines, Marcelo Pereyra (Telecom ParisTech, Ecole Polytechnique, Bristol University). November 23, 2016.
  2. Outline: 1. Motivation; 2. Framework; 3. Sampling from strongly log-concave distributions; 4. Computable bounds in total variation for super-exponential densities; 5. Deviation inequalities; 6. Non-smooth potentials.
  3. Introduction — Sampling distributions over high-dimensional state spaces has recently attracted a lot of research effort in the computational statistics and machine learning communities. Applications (non-exhaustive): 1. Bayesian inference for high-dimensional models and Bayesian nonparametrics; 2. Bayesian linear inverse problems (typically function-space problems); 3. Aggregation of estimators and experts. Most of the sampling techniques known so far do not scale to high dimensions, and the challenges in this area are numerous.
  4. Bayesian setting (I) — In a Bayesian setting, a parameter $\beta \in \mathbb{R}^d$ is endowed with a prior distribution $p$, and the observations are given by a likelihood $Y \sim \mathrm{L}(\cdot \mid \beta)$. Inference is then based on the posterior distribution
$$\pi(\mathrm{d}\beta \mid Y) = \frac{p(\mathrm{d}\beta)\,\mathrm{L}(Y \mid \beta)}{\int \mathrm{L}(Y \mid u)\,p(\mathrm{d}u)}\,.$$
In most cases the normalizing constant is not tractable:
$$\pi(\mathrm{d}\beta \mid Y) \propto p(\mathrm{d}\beta)\,\mathrm{L}(Y \mid \beta)\,.$$
  5. Bayesian setting (II) — Bayesian decision theory relies on computing expectations of the form
$$\int_{\mathbb{R}^d} f(\beta)\,\mathrm{L}(Y \mid \beta)\,p(\mathrm{d}\beta)\,.$$
Generic problem: estimate an expectation $\mathbb{E}_\pi[f]$, where π is known only up to a multiplicative factor, sampling directly from π is not an option, and π is high dimensional.
  6. Logistic and probit regression — Likelihood: binary regression set-up in which the binary observations (responses) $(Y_1, \dots, Y_n)$ are conditionally independent Bernoulli random variables with success probability $F(\beta^T X_i)$, where (1) $X_i$ is a $d$-dimensional vector of known covariates, (2) $\beta$ is a $d$-dimensional vector of unknown regression coefficients, and (3) $F$ is a distribution function. Two important special cases: probit regression, where $F$ is the standard normal distribution function, and logistic regression, where $F$ is the standard logistic distribution function, $F(t) = e^t/(1 + e^t)$.
  7. A daunting problem? — The posterior density of $\beta$ is given, up to a proportionality constant, by $\pi(\beta \mid Y, X) \propto \exp(-U(\beta))$, where the potential $U$ is
$$U(\beta) = -\sum_{i=1}^{n}\left\{ Y_i \log F(\beta^T X_i) + (1 - Y_i)\log\left(1 - F(\beta^T X_i)\right) \right\} + g(\beta)\,,$$
where $g$ is the negative log-density of the prior. Two important cases: Gaussian prior, $g(\beta) = (1/2)\beta^T \Sigma \beta$ (ridge regression), and Laplace prior, $g(\beta) = \lambda \sum_{i=1}^{d} |\beta_i|$ (lasso regression). A concrete sketch of this potential follows.
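To fix ideas, here is a minimal sketch (Python/NumPy, not the authors' code) of the potential and its gradient for logistic regression with a ridge term; the data matrix X, the labels Y and the scalar ridge weight lam are assumptions of this example, and the prior precision is taken to be lam·I for simplicity.

```python
import numpy as np
from scipy.special import expit  # numerically stable logistic function F(t) = e^t/(1 + e^t)

def U(beta, X, Y, lam=1.0):
    """Potential U(beta) for logistic regression with ridge prior g(beta) = (lam/2)||beta||^2.

    The negative log-likelihood simplifies to sum_i [log(1 + exp(x_i^T beta)) - Y_i x_i^T beta].
    """
    t = X @ beta
    return np.sum(np.logaddexp(0.0, t) - Y * t) + 0.5 * lam * beta @ beta

def grad_U(beta, X, Y, lam=1.0):
    """Gradient of U: X^T (F(X beta) - Y) + lam * beta."""
    return X.T @ (expit(X @ beta) - Y) + lam * beta
```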
  8. New challenges — Problem: the number of predictor variables $d$ is large ($10^4$ and up). Examples: text categorization, genomics and proteomics (gene expression analysis). The most popular algorithms for Bayesian inference in binary regression models are based on data augmentation: for the probit link, Albert and Chib (1993); for the logistic link, the Polya-Gamma sampler of Polson and Scott (2012).
  9. Data augmentation algorithms (I) — Data augmentation: instead of sampling $\pi(\beta \mid X, Y)$, sample $\pi(\beta, W \mid X, Y)$, a probability measure on $\mathbb{R}^{d_1} \times \mathbb{R}^{d_2}$, and take the marginal w.r.t. $\beta$. Typical application of the Gibbs sampler: sample in turn $\pi(\beta \mid X, Y, W)$ and $\pi(W \mid X, Y, \beta)$. The Gibbs sampler simulates a Markov chain $(\beta_k, W_k)_{k \ge 0}$: given $(\beta_k, W_k)$, draw $W_{k+1} \sim \pi(\cdot \mid \beta_k, X, Y)$, then $\beta_{k+1} \sim \pi(\cdot \mid W_{k+1}, X, Y)$. The target density $\pi(\beta, W \mid X, Y)$ is invariant for this Markov chain. The choice of the data augmentation should make these two steps reasonably easy; a generic loop is sketched below.
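A generic two-block data-augmentation Gibbs loop, as a sketch; the two conditional samplers are assumptions supplied by the user (e.g. the Polya-Gamma conditionals for the logistic link):

```python
import numpy as np

def da_gibbs(beta0, sample_w_given_beta, sample_beta_given_w, n_iter, rng=None):
    """Two-block Gibbs sampler for pi(beta, W | X, Y).

    sample_w_given_beta(beta, rng) -> draw W    ~ pi(. | beta, X, Y)
    sample_beta_given_w(w, rng)    -> draw beta ~ pi(. | W, X, Y)
    Returns the chain (beta_k); its beta-marginal targets pi(beta | X, Y).
    """
    rng = rng or np.random.default_rng()
    beta = np.asarray(beta0, dtype=float)
    chain = [beta]
    for _ in range(n_iter):
        w = sample_w_given_beta(beta, rng)
        beta = sample_beta_given_w(w, rng)
        chain.append(beta)
    return np.array(chain)
```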
  10. Data augmentation algorithms (II) — Question: control the distance between the law of $(\beta_n, W_n)$ and the stationary distribution $\pi(\beta, W \mid X, Y)$. Definition (geometric ergodicity): the Markov kernel $P$ on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ is geometrically ergodic if there exists $\kappa \in (0, 1)$ such that for all $n \ge 0$ and $x \in \mathbb{R}^d$,
$$\|P^n(x, \cdot) - \pi\|_{\mathrm{TV}} \le C(x)\,\kappa^n\,,$$
where, for two probability measures $\mu, \nu$ on $\mathbb{R}^d$, $\|\mu - \nu\|_{\mathrm{TV}} = \sup_{|f| \le 1} |\mu(f) - \nu(f)|$.
  11. Data augmentation algorithms (III) — The algorithm of Albert and Chib and the Polya-Gamma sampler have been shown to be uniformly geometrically ergodic, BUT the geometric rate of convergence becomes exponentially small with the dimension, which does not allow one to construct honest confidence intervals or credible regions. The algorithms are also very demanding in terms of computational resources: they are applicable when $d$ is small ($\approx 10$) to moderate ($\approx 100$), but certainly not when $d$ is large ($10^4$ or more); the convergence time becomes prohibitive as soon as $d \ge 10^2$.
  12. A daunting problem? — In the case of ridge regression, the potential $U$ is smooth and strongly convex. In the case of lasso regression, the potential $U$ is non-smooth but still convex. A wealth of reasonably fast optimization algorithms are available to solve these problems in high dimension.
  13. Outline: 1. Motivation; 2. Framework; 3. Sampling from strongly log-concave distributions; 4. Computable bounds in total variation for super-exponential densities; 5. Deviation inequalities; 6. Non-smooth potentials.
  14. Framework — Denote by π a target density w.r.t. the Lebesgue measure on $\mathbb{R}^d$, known up to a normalisation factor:
$$x \mapsto e^{-U(x)} \Big/ \int_{\mathbb{R}^d} e^{-U(y)}\,\mathrm{d}y\,.$$
Implicitly, $d \gg 1$. Assumption: $U$ is $L$-smooth, i.e. continuously differentiable with a Lipschitz gradient: there exists a constant $L$ such that for all $x, y \in \mathbb{R}^d$, $\|\nabla U(x) - \nabla U(y)\| \le L\|x - y\|$. This condition can be weakened.
  15. Langevin diffusion — Langevin SDE:
$$\mathrm{d}Y_t = -\nabla U(Y_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t\,,$$
where $(B_t)_{t \ge 0}$ is a $d$-dimensional Brownian motion. The associated semigroup $(P_t)_{t \ge 0}$ is Markov, aperiodic and strong Feller (all compact sets are small), and reversible w.r.t. π; hence $\pi \propto e^{-U}$ is its unique invariant probability measure. For all $x \in \mathbb{R}^d$, $\lim_{t \to +\infty} \|\delta_x P_t - \pi\|_{\mathrm{TV}} = 0$.
  16. Discretized Langevin diffusion — Idea: sample the diffusion paths using, for example, the Euler-Maruyama (EM) scheme:
$$X_{k+1} = X_k - \gamma_{k+1} \nabla U(X_k) + \sqrt{2\gamma_{k+1}}\, Z_{k+1}\,,$$
where $(Z_k)_{k \ge 1}$ is an i.i.d. sequence of $\mathcal{N}(0, \mathrm{I}_d)$ random variables and $(\gamma_k)_{k \ge 1}$ is a sequence of stepsizes, which can either be held constant or chosen to decrease to 0 at a certain rate. The recursion is closely related to the gradient descent algorithm (see the sketch below).
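The iteration translates directly into code; a minimal sketch (Python/NumPy), where grad_U is the user-supplied $\nabla U$ (e.g. the logistic-regression gradient sketched earlier):

```python
import numpy as np

def ula(x0, grad_U, n_iter, step, rng=None):
    """Unadjusted Langevin Algorithm (Euler-Maruyama discretization of the Langevin SDE).

    step: either a float (constant stepsize gamma) or a callable k -> gamma_k
    (e.g. lambda k: gamma1 * k ** (-0.5) for decreasing stepsizes).
    """
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    chain = np.empty((n_iter + 1,) + x.shape)
    chain[0] = x
    for k in range(1, n_iter + 1):
        gamma = step(k) if callable(step) else step
        z = rng.standard_normal(x.shape)
        x = x - gamma * grad_U(x) + np.sqrt(2.0 * gamma) * z
        chain[k] = x
    return chain
```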
  17. Discretized Langevin diffusion: constant stepsize — When $\gamma_k = \gamma$, $(X_k)_{k \ge 1}$ is a homogeneous Markov chain with Markov kernel $R_\gamma$. Under appropriate conditions, this Markov chain is irreducible and positive recurrent, hence it has a unique invariant distribution $\pi_\gamma$. Problem: $\pi_\gamma \ne \pi$.
  18. The Euler-Maruyama (EM) Markov chain — When $(\gamma_k)_{k \ge 1}$ is nonincreasing and non-constant, $(X_k)_{k \ge 1}$ is an inhomogeneous Markov chain associated with the sequence of Markov kernels $(R_{\gamma_k})_{k \ge 1}$. We denote by $\delta_x Q^p_\gamma$ the law of $X_p$ started at $x$. The diffusion converges to the target distribution; question: since the EM discretization approximates the diffusion, can we use it to sample from π? Can we control $\|\delta_x Q^p_\gamma - \pi\|_{\mathrm{TV}}$ and obtain bounds with explicit dependence on the dimension $d$? Previous works: Lamberton and Pagès (2002); Dalalyan (2014).
  19. Metropolis-Adjusted Langevin Algorithm — To correct for the discretization bias, a Metropolis-Hastings step can be included, giving the Metropolis-Adjusted Langevin Algorithm (MALA); key reference: Roberts and Tweedie (1996). Algorithm: (1) propose $Y_{k+1} = X_k - \gamma \nabla U(X_k) + \sqrt{2\gamma}\, Z_{k+1}$, with $Z_{k+1} \sim \mathcal{N}(0, \mathrm{I}_d)$; (2) compute the acceptance ratio $\alpha_\gamma(X_k, Y_{k+1})$,
$$\alpha_\gamma(x, y) = 1 \wedge \frac{\pi(y)\, r_\gamma(y, x)}{\pi(x)\, r_\gamma(x, y)}\,, \qquad r_\gamma(x, y) \propto e^{-\|y - x + \gamma \nabla U(x)\|^2/(4\gamma)}\,;$$
(3) accept or reject the proposal. A sketch of the accept/reject loop follows.
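A minimal sketch of MALA with a constant stepsize (Python/NumPy, not the authors' implementation); the accept/reject step is computed on the log scale for numerical stability:

```python
import numpy as np

def mala(x0, U, grad_U, n_iter, gamma, rng=None):
    """Metropolis-Adjusted Langevin Algorithm with constant stepsize gamma."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    Ux, gx = U(x), grad_U(x)
    chain, n_accept = [x.copy()], 0
    for _ in range(n_iter):
        # Langevin proposal: y ~ N(x - gamma * grad_U(x), 2 * gamma * I)
        y = x - gamma * gx + np.sqrt(2.0 * gamma) * rng.standard_normal(x.shape)
        Uy, gy = U(y), grad_U(y)
        # log r(y, x) - log r(x, y), with r(x, y) ∝ exp(-||y - x + gamma*grad_U(x)||^2 / (4 gamma))
        log_r_rev = -np.sum((x - y + gamma * gy) ** 2) / (4.0 * gamma)
        log_r_fwd = -np.sum((y - x + gamma * gx) ** 2) / (4.0 * gamma)
        log_alpha = (Ux - Uy) + (log_r_rev - log_r_fwd)
        if np.log(rng.uniform()) < log_alpha:
            x, Ux, gx = y, Uy, gy
            n_accept += 1
        chain.append(x.copy())
    return np.array(chain), n_accept / n_iter
```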
  20. MALA: pros and cons — Each iteration requires one gradient evaluation and one evaluation of the objective function. Geometric convergence is established under the condition that, in the tails, the acceptance region is "inwards":
$$\lim_{\|x\| \to \infty} \int_{A_\gamma(x)\, \Delta\, I(x)} r_\gamma(x, y)\,\mathrm{d}y = 0\,,$$
where $I(x) = \{y : \|y\| \le \|x\|\}$ and $A_\gamma(x)$ is the acceptance region, $A_\gamma(x) = \{y : \pi(x) r_\gamma(x, y) \le \pi(y) r_\gamma(y, x)\}$.
  21. Outline of Section 3 (Sampling from strongly log-concave distributions): convergence in Wasserstein distance; convergence in total variation; numerical experiments.
  22. Strongly convex potential — Assume that $U$ is strongly convex: there exists $m > 0$ such that for all $x, y \in \mathbb{R}^d$,
$$\langle \nabla U(x) - \nabla U(y),\, x - y \rangle \ge m\|x - y\|^2\,.$$
This yields convergence of the semigroup in Wasserstein distance.
  23. Wasserstein distance — Definition: for two probability measures $\mu, \nu$ on $\mathbb{R}^d$, define
$$W_2(\mu, \nu) = \inf_{(X, Y) \in \Pi(\mu, \nu)} \mathbb{E}^{1/2}\!\left[\|X - Y\|^2\right]\,,$$
where $(X, Y) \in \Pi(\mu, \nu)$ if $X \sim \mu$ and $Y \sim \nu$. Note that by the Cauchy-Schwarz inequality, for all $f : \mathbb{R}^d \to \mathbb{R}$ with $\|f\|_{\mathrm{Lip}} \le 1$ and all $(X, Y) \in \Pi(\mu, \nu)$,
$$|\mu(f) - \nu(f)| \le \mathbb{E}^{1/2}\!\left[\|X - Y\|^2\right], \quad \text{hence} \quad |\mu(f) - \nu(f)| \le W_2(\mu, \nu)\,.$$
A one-dimensional sanity check is sketched below.
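For intuition: in one dimension, the $W_2$ distance between two empirical measures with equally many atoms reduces to an L2 distance between sorted samples (the optimal coupling is monotone). A small sketch:

```python
import numpy as np

def w2_empirical_1d(xs, ys):
    """W2 between two empirical measures on R with the same number of atoms.

    In 1D the optimal coupling pairs order statistics, so
    W2^2 = mean of squared differences of the sorted samples.
    """
    xs, ys = np.sort(xs), np.sort(ys)
    return np.sqrt(np.mean((xs - ys) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 10_000)
y = rng.normal(1.0, 1.0, 10_000)
print(w2_empirical_1d(x, y))  # close to 1.0: W2 between N(0,1) and N(1,1) is the mean shift
```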
  24. Wasserstein distance convergence — Theorem: assume that $U$ is $L$-smooth and $m$-strongly convex. Then, for all $x, y \in \mathbb{R}^d$ and $t \ge 0$,
$$W_2(\delta_x P_t, \delta_y P_t) \le e^{-mt}\, \|x - y\|\,.$$
The mixing rate depends only on the strong convexity constant.
  25. Elements of proof — Consider the synchronously coupled pair
$$\mathrm{d}Y_t = -\nabla U(Y_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t\,, \qquad \mathrm{d}\tilde{Y}_t = -\nabla U(\tilde{Y}_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t\,,$$
with $(Y_0, \tilde{Y}_0) = (x, y)$. This SDE has a unique strong solution $(Y_t, \tilde{Y}_t)_{t \ge 0}$.
  26. Elements of proof (continued) — Consider $V_t = \|Y_t - \tilde{Y}_t\|^2$. Since
$$\mathrm{d}\{Y_t - \tilde{Y}_t\} = -\left(\nabla U(Y_t) - \nabla U(\tilde{Y}_t)\right)\mathrm{d}t$$
(the Brownian terms cancel), we get a very simple equation for $(V_t)_{t \ge 0}$:
$$\mathrm{d}V_t = -2\left\langle \nabla U(Y_t) - \nabla U(\tilde{Y}_t),\, Y_t - \tilde{Y}_t \right\rangle \mathrm{d}t\,.$$
  27. Elements of proof — Integrating, we get
$$\|Y_t - \tilde{Y}_t\|^2 = \|Y_0 - \tilde{Y}_0\|^2 - 2\int_0^t \left\langle \nabla U(Y_s) - \nabla U(\tilde{Y}_s),\, Y_s - \tilde{Y}_s \right\rangle \mathrm{d}s\,.$$
Since $U$ is strongly convex, $\langle \nabla U(y) - \nabla U(y'),\, y - y' \rangle \ge m\|y - y'\|^2$, which implies
$$\|Y_t - \tilde{Y}_t\|^2 \le \|Y_0 - \tilde{Y}_0\|^2 - 2m\int_0^t \|Y_s - \tilde{Y}_s\|^2\,\mathrm{d}s\,.$$
  28. Elements of proof — From
$$\|Y_t - \tilde{Y}_t\|^2 \le \|Y_0 - \tilde{Y}_0\|^2 - 2m\int_0^t \|Y_s - \tilde{Y}_s\|^2\,\mathrm{d}s\,,$$
Grönwall's inequality yields
$$\|Y_t - \tilde{Y}_t\|^2 \le \|Y_0 - \tilde{Y}_0\|^2\, e^{-2mt}\,.$$
The proof follows since, for all $t \ge 0$, the law of $(Y_t, \tilde{Y}_t)$ is a coupling between $\delta_x P_t$ and $\delta_y P_t$. This contraction can also be checked numerically (see the sketch below).
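The contraction can be verified with a synchronous coupling (same Brownian increments for both chains); a sketch for the hypothetical strongly convex potential $U(x) = (m/2)\|x\|^2$, for which the rate $e^{-mt}$ is exact:

```python
import numpy as np

m, dt, T = 1.0, 1e-3, 5.0
rng = np.random.default_rng(1)
grad_U = lambda x: m * x  # U(x) = (m/2)||x||^2, so the contraction rate is exactly m

y, y_tilde = np.full(10, 5.0), np.zeros(10)
d0 = np.linalg.norm(y - y_tilde)
for _ in range(int(T / dt)):
    z = rng.standard_normal(10)  # shared Brownian increment: the noise cancels in the difference
    y = y - dt * grad_U(y) + np.sqrt(2 * dt) * z
    y_tilde = y_tilde - dt * grad_U(y_tilde) + np.sqrt(2 * dt) * z
print(np.linalg.norm(y - y_tilde) / d0, np.exp(-m * T))  # both ~ e^{-mT}
```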
  29. Theorem — Assume that $U$ is $L$-smooth and $m$-strongly convex. Then, for any $x \in \mathbb{R}^d$ and $t \ge 0$,
$$W_2(\delta_x P_t, \pi) \le \left( \|x - x^\star\| + \sqrt{d/m} \right) e^{-mt}\,, \qquad x^\star = \arg\min_{x \in \mathbb{R}^d} U(x)\,.$$
The squared constant depends only linearly on the dimension $d$.
  30. Elements of proof — Run the two synchronously coupled diffusions
$$\mathrm{d}Y_t = -\nabla U(Y_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t\,, \qquad \mathrm{d}\tilde{Y}_t = -\nabla U(\tilde{Y}_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t\,,$$
with $Y_0 = x$ and $\tilde{Y}_0 \sim \pi$. Since π is invariant for $(P_t)_{t \ge 0}$, $\tilde{Y}_t \sim \pi$ for all $t \ge 0$, showing that
$$W_2^2(\delta_x P_t, \pi) \le \mathbb{E}\left[\|Y_t - \tilde{Y}_t\|^2\right] \le e^{-2mt}\, \mathbb{E}\left[\|x - \tilde{Y}_0\|^2\right]\,.$$
The proof follows from $\mathbb{E}[\|x^\star - \tilde{Y}_0\|^2] \le d/m$ and the triangle inequality.
  31. A coupling proof — We now establish explicit bounds for $W_2(\mu_0 Q^n_\gamma, \pi)$ for $\mu_0 \in \mathcal{P}_2(\mathbb{R}^d)$. Since $\pi P_t = \pi$ for all $t \ge 0$, it suffices to bound $W_2(\mu_0 Q^n_\gamma, \nu_0 P_{\Gamma_n})$ with $\nu_0 \in \mathcal{P}_2(\mathbb{R}^d)$, where $\Gamma_n = \sum_{k=1}^n \gamma_k$, and take $\nu_0 = \pi$. Idea: construct a coupling between the diffusion and the linear interpolation of the Euler discretization.
  32. A coupling proof (II) — An obvious candidate is the synchronous coupling $(Y_t, \bar{Y}_t)_{t \ge 0}$, defined for all $n \ge 0$ and $t \in [\Gamma_n, \Gamma_{n+1})$ by
$$Y_t = Y_{\Gamma_n} - \int_{\Gamma_n}^t \nabla U(Y_s)\,\mathrm{d}s + \sqrt{2}\,(B_t - B_{\Gamma_n})\,,$$
$$\bar{Y}_t = \bar{Y}_{\Gamma_n} - \nabla U(\bar{Y}_{\Gamma_n})\,(t - \Gamma_n) + \sqrt{2}\,(B_t - B_{\Gamma_n})\,.$$
For all $n \ge 0$, $W_2^2(\mu_0 P_{\Gamma_n}, \nu_0 Q^n_\gamma) \le \mathbb{E}\left[\|Y_{\Gamma_n} - \bar{Y}_{\Gamma_n}\|^2\right]$.
  33. Explicit bound in Wasserstein distance for the Euler discretisation — Theorem: assume $U$ is $L$-smooth and strongly convex, and let $(\gamma_k)_{k \ge 1}$ be a nonincreasing sequence with $\gamma_1 \le 1/(m + L)$. (Optional assumption: $U \in C^3(\mathbb{R}^d)$ and there exists $\tilde{L}$ such that for all $x, y \in \mathbb{R}^d$, $\|\nabla^2 U(x) - \nabla^2 U(y)\| \le \tilde{L}\|x - y\|$.) Then there exist sequences $\{u^{(1)}_n(\gamma),\, n \in \mathbb{N}\}$ and $\{u^{(2)}_n(\gamma),\, n \in \mathbb{N}\}$ (explicit expressions are available) such that for all $x \in \mathbb{R}^d$ and $n \ge 1$,
$$W_2\!\left(\delta_x Q^n_\gamma, \pi\right) \le u^{(1)}_n(\gamma) \int_{\mathbb{R}^d} \|y - x\|^2\, \pi(\mathrm{d}y) + u^{(2)}_n(\gamma)\,.$$
  34. Decreasing step sizes — If $\lim_{k \to +\infty} \gamma_k = 0$ and $\lim_{k \to +\infty} \Gamma_k = +\infty$, then
$$\lim_{n \to +\infty} W_2\!\left(\delta_x Q^n_\gamma, \pi\right) = 0\,,$$
with explicit control. For $\gamma_k = \gamma_1 k^{-\alpha}$ with $\alpha \in (0, 1)$, the order of convergence of $W_2(\delta_x Q^n_\gamma, \pi)$ is $O(n^{-\alpha})$.
  35. Constant step sizes — For any $\epsilon > 0$, the minimal number of iterations to achieve $W_2(\delta_x Q^p_\gamma, \pi) \le \epsilon$ is $p = O(\sqrt{d}\,\epsilon^{-1})$. For a fixed number of iterations $p$, the stepsize γ can be chosen so that $W_2(\delta_x Q^p_\gamma, \pi) \le C p^{-1}$. For a given stepsize γ, letting $p \to +\infty$, we get $W_2(\pi_\gamma, \pi) \le C\gamma$.
  36. From the Wasserstein distance to the TV — Theorem: if $U$ is strongly convex, then for all $x, y \in \mathbb{R}^d$,
$$\|P_t(x, \cdot) - P_t(y, \cdot)\|_{\mathrm{TV}} \le 1 - 2\Phi\!\left( -\frac{\|x - y\|}{\sqrt{(4/m)(e^{2mt} - 1)}} \right)\,.$$
The proof uses a reflection coupling. As a consequence, for all $x, y \in \mathbb{R}^d$,
$$\|P_t(x, \cdot) - P_t(y, \cdot)\|_{\mathrm{TV}} \le \frac{\|x - y\|}{\sqrt{(2\pi/m)(e^{2mt} - 1)}}\,.$$
  37. From the Wasserstein distance to the TV (II) — From
$$\|P_t(x, \cdot) - P_t(y, \cdot)\|_{\mathrm{TV}} \le \frac{\|x - y\|}{\sqrt{(2\pi/m)(e^{2mt} - 1)}}$$
we get two consequences: (1) $(P_t)_{t \ge 0}$ converges exponentially fast to π in total variation, at rate $e^{-mt}$; (2) for every measurable $f : \mathbb{R}^d \to \mathbb{R}$ with $\sup|f| \le 1$, the map $x \mapsto P_t f(x)$ is Lipschitz with Lipschitz constant at most $1/\sqrt{(2\pi/m)(e^{2mt} - 1)}$.
  38. Explicit bound in total variation — Theorem: assume $U$ is $L$-smooth and strongly convex, and let $(\gamma_k)_{k \ge 1}$ be a nonincreasing sequence with $\gamma_1 \le 1/(m + L)$. (Optional assumption: $U \in C^3(\mathbb{R}^d)$ and there exists $\tilde{L}$ such that for all $x, y \in \mathbb{R}^d$, $\|\nabla^2 U(x) - \nabla^2 U(y)\| \le \tilde{L}\|x - y\|$.) Then there exist sequences $\{\tilde{u}^{(1)}_n(\gamma),\, n \in \mathbb{N}\}$ and $\{\tilde{u}^{(2)}_n(\gamma),\, n \in \mathbb{N}\}$ such that for all $x \in \mathbb{R}^d$ and $n \ge 1$,
$$\|\delta_x Q^n_\gamma - \pi\|_{\mathrm{TV}} \le \tilde{u}^{(1)}_n(\gamma) \int_{\mathbb{R}^d} \|y - x\|^2\, \pi(\mathrm{d}y) + \tilde{u}^{(2)}_n(\gamma)\,.$$
  39. Constant step sizes — For any $\epsilon > 0$, the minimal number of iterations to achieve $\|\delta_x Q^p_\gamma - \pi\|_{\mathrm{TV}} \le \epsilon$ is $p = O(\sqrt{d}\,\log(d)\,\epsilon^{-1}|\log(\epsilon)|)$. For a given stepsize γ, letting $p \to +\infty$, we get $\|\pi_\gamma - \pi\|_{\mathrm{TV}} \le C\gamma\,|\log(\gamma)|$.
  40. Comparison of MALA and ULA (I) — We compare MALA and ULA for logistic regression with a Gaussian prior on five real data sets. Dimensions of the data sets (observations, covariates d): German credit (1000, 25); Heart disease (270, 14); Australian credit (690, 35); Pima Indian diabetes (768, 9); Musk (476, 167).
  41. Comparison of MALA and ULA (II) — Define the marginal accuracy between two probability measures $\mu, \nu$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ by
$$\mathrm{MA}(\mu, \nu) = 1 - \tfrac{1}{2}\|\mu - \nu\|_{\mathrm{TV}}\,.$$
We compare MALA and ULA on each data set by estimating, for each component $i \in \{1, \dots, d\}$, the marginal accuracy between the marginal empirical distributions of the chains and the marginal posterior distributions (a histogram-based estimate is sketched below).
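In practice the marginal accuracy can be estimated by binning the two samples on a shared grid; a sketch (the bin count n_bins is an assumption of the example):

```python
import numpy as np

def marginal_accuracy(sample_mu, sample_nu, n_bins=50):
    """MA(mu, nu) = 1 - (1/2)||mu - nu||_TV, estimated from histograms on a shared grid.

    With the convention ||mu - nu||_TV = sup_{|f| <= 1} |mu(f) - nu(f)| used on the slide,
    the TV distance between the two histograms is sum_i |p_i - q_i|.
    """
    lo = min(sample_mu.min(), sample_nu.min())
    hi = max(sample_mu.max(), sample_nu.max())
    edges = np.linspace(lo, hi, n_bins + 1)
    p, _ = np.histogram(sample_mu, bins=edges)
    q, _ = np.histogram(sample_nu, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    return 1.0 - 0.5 * np.abs(p - q).sum()
```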
  42. Comparison of MALA and ULA (III) — To estimate the $d$ marginal posterior distributions, we run $2 \times 10^7$ iterations of the Polya-Gamma Gibbs sampler. Then 100 runs of MALA and ULA ($10^6$ iterations per run) are performed. For MALA, the step size is chosen so that the acceptance probability is $\approx 0.5$; for ULA, we use the same constant step size as for MALA.
  43. Comparison of MALA and ULA (IV) — Figure: marginal accuracy across all dimensions (boxplots for ULA and MALA). Upper left: German credit data set. Upper right: Australian credit data set. Lower left: Heart disease data set. Lower right: Pima Indian diabetes data set.
  44. Comparison of MALA and ULA (V) — Figure: marginal accuracy across all dimensions for the Musk data set.
  45. Comparison of MALA and ULA (VI) — Figure: 2-dimensional histogram for the Musk data set.
  46. Outline: 1. Motivation; 2. Framework; 3. Sampling from strongly log-concave distributions; 4. Computable bounds in total variation for super-exponential densities; 5. Deviation inequalities; 6. Non-smooth potentials.
  47. Convergence of the Euler discretization — Assume one of the following conditions: (a) there exist $\alpha > 1$, $\rho > 0$ and $M_\rho \ge 0$ such that for all $y \in \mathbb{R}^d$ with $\|y\| \ge M_\rho$, $\langle \nabla U(y), y \rangle \ge \rho\|y\|^\alpha$; or (b) $U$ is convex. If $\lim_{k \to +\infty} \gamma_k = 0$ and $\sum_k \gamma_k = +\infty$, then
$$\lim_{p \to +\infty} \|\delta_x Q^p_\gamma - \pi\|_{\mathrm{TV}} = 0\,,$$
with computable bounds for the convergence (D., Moulines, Annals of Applied Probability, 2016).
  48. Target precision: the convex case — Setting: $U$ is convex; constant stepsize. Optimal stepsize γ and number of iterations $p$ to achieve $\epsilon$-accuracy in TV, $\|\delta_x Q^p_\gamma - \pi\|_{\mathrm{TV}} \le \epsilon$:
- stepsize γ: $O(d^{-3})$ in $d$; $O(\epsilon^2/\log(\epsilon^{-1}))$ in $\epsilon$; $O(L^{-2})$ in $L$
- iterations $p$: $O(d^5)$ in $d$; $O(\epsilon^{-2}\log^2(\epsilon^{-1}))$ in $\epsilon$; $O(L^2)$ in $L$
In the strongly convex case, the convergence of the semigroup of the diffusion to π depends only on the strong convexity constant $m$; in the merely convex case, it depends on the dimension!
  49. Strongly convex outside a ball — Assume the potential $U$ is convex everywhere and strongly convex outside a ball: there exist $R \ge 0$ and $m > 0$ such that for all $x, y \in \mathbb{R}^d$ with $\|x - y\| \ge R$,
$$\langle \nabla U(x) - \nabla U(y),\, x - y \rangle \ge m\|x - y\|^2\,.$$
Eberle (2015) established that the convergence in Wasserstein distance does not depend on the dimension. D., Moulines (2016) established that the convergence of the semigroup to π in TV does not depend on the dimension but only on $R$, giving new bounds which scale nicely with the dimension.
  50. Dependence on the dimension — Setting: $U$ is convex and strongly convex outside a ball; constant stepsize. Optimal stepsize γ and number of iterations $p$ to achieve $\epsilon$-accuracy in TV, $\|\delta_x Q^p_\gamma - \pi\|_{\mathrm{TV}} \le \epsilon$:
- stepsize γ: $O(d^{-1})$ in $d$; $O(\epsilon^2/\log(\epsilon^{-1}))$ in $\epsilon$; $O(L^{-2})$ in $L$; $O(m)$ in $m$; $O(R^{-4})$ in $R$
- iterations $p$: $O(d\log(d))$ in $d$; $O(\epsilon^{-2}\log^2(\epsilon^{-1}))$ in $\epsilon$; $O(L^2)$ in $L$; $O(m^{-2})$ in $m$; $O(R^8)$ in $R$
  51. Outline: 1. Motivation; 2. Framework; 3. Sampling from strongly log-concave distributions; 4. Computable bounds in total variation for super-exponential densities; 5. Deviation inequalities; 6. Non-smooth potentials.
  52. Bounds for functionals — Let $f : \mathbb{R}^d \to \mathbb{R}$ be a Lipschitz function and $(X_k)_{k \ge 0}$ the Euler discretization of the Langevin diffusion. We approximate $\int_{\mathbb{R}^d} f(x)\,\pi(\mathrm{d}x)$ by the weighted-average estimator
$$\hat{\pi}^N_n(f) = \sum_{k=N+1}^{N+n} \omega_{k,n}\, f(X_k)\,, \qquad \omega_{k,n} = \gamma_{k+1}\, \Gamma^{-1}_{N+2, N+n+1}\,,$$
with $\Gamma_{i,j} = \sum_{k=i}^{j} \gamma_k$, where $N \ge 0$ is the length of the burn-in period and $n \ge 1$ is the number of effective samples. Objective: compute explicit bounds for the mean square error (MSE) of this estimator, defined by
$$\mathrm{MSE}_f(N, n) = \mathbb{E}_x\!\left[\left( \hat{\pi}^N_n(f) - \pi(f) \right)^2\right]\,.$$
A sketch of the estimator follows.
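A minimal sketch of the weighted-average estimator, to be applied after a chain and its stepsizes have been generated (e.g. by the ULA sketch above); the indexing convention is an assumption of the example:

```python
import numpy as np

def pi_hat(chain, gammas, f, N):
    """Weighted-average estimator pi_hat_N^n(f) = sum_{k=N+1}^{N+n} omega_{k,n} f(X_k).

    Convention: chain[k] is X_k and gammas[k] is gamma_{k+1}, so the (unnormalized)
    weight of X_k is gamma_{k+1}; normalizing gives omega_{k,n} = gamma_{k+1} / Gamma.
    """
    f_vals = np.array([f(x) for x in chain[N + 1:]])
    w = np.asarray(gammas, dtype=float)[N + 1:N + 1 + len(f_vals)]
    return np.dot(w / w.sum(), f_vals)
```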
  53. Bias-variance decomposition — The MSE can be decomposed into the sum of the squared bias and the variance:
$$\mathrm{MSE}_f(N, n) = \left( \mathbb{E}_x[\hat{\pi}^N_n(f)] - \pi(f) \right)^2 + \mathrm{Var}_x\!\left( \hat{\pi}^N_n(f) \right)\,.$$
By definition of the Wasserstein distance,
$$\mathrm{Bias}^2 \le \|f\|^2_{\mathrm{Lip}} \sum_{k=N+1}^{N+n} \omega_{k,n}\, W_1^2\!\left(\delta_x Q^k_\gamma, \pi\right)\,,$$
and
$$W_1^2\!\left(\delta_x Q^k_\gamma, \pi\right) \le 2\left( \|x - x^\star\|^2 + d/m \right) u^{(1)}_k(\gamma) + u^{(2,3)}_k(\gamma)\,.$$
  54. MSE — Theorem: assume that $U$ is $L$-smooth and strongly convex, and let $(\gamma_k)_{k \ge 1}$ be a nonincreasing sequence with $\gamma_1 \le 2/(m + L)$. Then for all $N \ge 0$, $n \ge 1$ and Lipschitz functions $f : \mathbb{R}^d \to \mathbb{R}$,
$$\mathrm{Var}_x\!\left( \hat{\pi}^N_n(f) \right) \le 8\kappa^{-2}\, \|f\|^2_{\mathrm{Lip}}\, \Gamma^{-1}_{N+2, N+n+1}\, u^{(3)}_{N,n}(\gamma)\,, \qquad u^{(3)}_{N,n}(\gamma) \stackrel{\mathrm{def}}{=} 1 + \Gamma^{-1}_{N+2, N+n+1}\left( \kappa^{-1} + 2/(m + L) \right)\,.$$
The upper bound is independent of the dimension and allows one to construct honest confidence bounds.
  55. Bound for the MSE — For $\gamma_k = \gamma_1 k^{-\alpha}$ with fixed $\gamma_1$ and $N$ (with more regularity on $U$):
- $\alpha = 0$: $\gamma_1^2 + (\gamma_1 n)^{-1} + \exp(-\kappa\gamma_1 N/2)$
- $\alpha \in (0, 1/3)$: $\gamma_1^2 n^{-2\alpha} + (\gamma_1 n^{1-\alpha})^{-1} + \exp(-\kappa\gamma_1 N^{1-\alpha}/(2(1-\alpha)))$
- $\alpha = 1/3$: $\gamma_1^2 \log(n)\, n^{-2/3} + (\gamma_1 n^{2/3})^{-1} + \exp(-\kappa\gamma_1 N^{1/2}/4)$
- $\alpha \in (1/3, 1)$: $n^{\alpha-1}\gamma_1^2 + \gamma_1^{-1}\exp(-\kappa\gamma_1 N^{1-\alpha}/(2(1-\alpha)))$
- $\alpha = 1$: $(\log n)^{-1}\gamma_1^2 + \gamma_1^{-1} N^{-\gamma_1\kappa/2}$
  56. Optimal choice of $\gamma_1$ — For $\gamma_k = \gamma_1 k^{-\alpha}$ with fixed $n$ (with more regularity on $U$), the optimal $\gamma_1$ and the resulting bound for the MSE:
- $\alpha = 0$: $\gamma_1 \asymp n^{-1/3}$, MSE bound $n^{-2/3}$
- $\alpha \in (0, 1/2)$: $\gamma_1 \asymp n^{\alpha - 1/3}$, MSE bound $n^{-2/3}$
- $\alpha = 1/2$: $\gamma_1 \asymp (\log n)^{-1/3}$, MSE bound $\log^{1/3}(n)\, n^{-2/3}$
- $\alpha \in (1/2, 1)$: $\gamma_1 = 1/(m + L)$, MSE bound $n^{\alpha - 1}$
- $\alpha = 1$: $\gamma_1 = 1/(m + L)$, MSE bound $1/\log(n)$
  57. Outline: 1. Motivation; 2. Framework; 3. Sampling from strongly log-concave distributions; 4. Computable bounds in total variation for super-exponential densities; 5. Deviation inequalities; 6. Non-smooth potentials.
  58. Non-smooth potentials — The target distribution has a density π with respect to the Lebesgue measure on $\mathbb{R}^d$ of the form $x \mapsto e^{-U(x)}/\int_{\mathbb{R}^d} e^{-U(y)}\,\mathrm{d}y$, where $U = f + g$ and $f : \mathbb{R}^d \to \mathbb{R}$, $g : \mathbb{R}^d \to (-\infty, +\infty]$ are two lower-bounded convex functions satisfying: (1) $f$ is continuously differentiable and gradient-Lipschitz with Lipschitz constant $L_f$, i.e. for all $x, y \in \mathbb{R}^d$, $\|\nabla f(x) - \nabla f(y)\| \le L_f\|x - y\|$; (2) $g$ is lower semi-continuous and $\int_{\mathbb{R}^d} e^{-g(y)}\,\mathrm{d}y \in (0, +\infty)$.
  59. Moreau-Yosida regularization — Let $h : \mathbb{R}^d \to (-\infty, +\infty]$ be an l.s.c. convex function and $\lambda > 0$. The λ-Moreau-Yosida envelope $h^\lambda : \mathbb{R}^d \to \mathbb{R}$ and the proximal operator $\mathrm{prox}^\lambda_h : \mathbb{R}^d \to \mathbb{R}^d$ associated with $h$ are defined for all $x \in \mathbb{R}^d$ by
$$h^\lambda(x) = \inf_{y \in \mathbb{R}^d} \left\{ h(y) + (2\lambda)^{-1}\|x - y\|^2 \right\} \le h(x)\,.$$
For every $x \in \mathbb{R}^d$, the minimum is achieved at a unique point, $\mathrm{prox}^\lambda_h(x)$, which is characterized by the inclusion
$$x - \mathrm{prox}^\lambda_h(x) \in \lambda\, \partial h\!\left(\mathrm{prox}^\lambda_h(x)\right)\,.$$
The Moreau-Yosida envelope is a regularized version of $h$ which approximates $h$ from below (a concrete instance for the $\ell_1$ norm is sketched below).
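For the $\ell_1$ norm (the lasso prior above), the proximal operator has the closed-form soft-thresholding expression, so the envelope and its gradient are cheap to evaluate; a sketch:

```python
import numpy as np

def prox_l1(x, lam):
    """Proximal operator of h = ||.||_1 at scale lam: componentwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def my_envelope_l1(x, lam):
    """Moreau-Yosida envelope h^lam(x) = h(p) + ||x - p||^2 / (2 lam), with p = prox."""
    p = prox_l1(x, lam)
    return np.sum(np.abs(p)) + np.sum((x - p) ** 2) / (2.0 * lam)

def grad_my_envelope_l1(x, lam):
    """Gradient of the envelope: grad h^lam(x) = (x - prox_h^lam(x)) / lam."""
    return (x - prox_l1(x, lam)) / lam
```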
  60. Properties of proximal operators — As $\lambda \downarrow 0$, $h^\lambda$ converges pointwise to $h$: for all $x \in \mathbb{R}^d$, $h^\lambda(x) \uparrow h(x)$ as $\lambda \downarrow 0$. The function $h^\lambda$ is convex and continuously differentiable, with
$$\nabla h^\lambda(x) = \lambda^{-1}\left( x - \mathrm{prox}^\lambda_h(x) \right)\,.$$
The proximal operator is monotone: for all $x, y \in \mathbb{R}^d$, $\langle \mathrm{prox}^\lambda_h(x) - \mathrm{prox}^\lambda_h(y),\, x - y \rangle \ge 0$, which implies that the Moreau-Yosida envelope is smooth with gradient-Lipschitz constant $\lambda^{-1}$:
$$\|\nabla h^\lambda(x) - \nabla h^\lambda(y)\| \le \lambda^{-1}\|x - y\|\,, \quad \text{for all } x, y \in \mathbb{R}^d\,.$$
  61. MY regularized potential — If $g$ is not differentiable but the proximal operator associated with $g$ is available, its λ-Moreau-Yosida envelope $g^\lambda$ can be considered. This leads to the approximation of the potential $U^\lambda : \mathbb{R}^d \to \mathbb{R}$ defined for all $x \in \mathbb{R}^d$ by
$$U^\lambda(x) = f(x) + g^\lambda(x)\,.$$
Theorem (D., Moulines, Pereyra, 2016): under (H), for all $\lambda > 0$, $0 < \int_{\mathbb{R}^d} e^{-U^\lambda(y)}\,\mathrm{d}y < +\infty$.
  62. Some approximation results — Theorem: assume (H). (1) Then $\lim_{\lambda \to 0} \|\pi^\lambda - \pi\|_{\mathrm{TV}} = 0$. (2) Assume in addition that $g$ is Lipschitz. Then for all $\lambda > 0$, $\|\pi^\lambda - \pi\|_{\mathrm{TV}} \le \lambda\|g\|^2_{\mathrm{Lip}}$.
  63. The MYULA algorithm (I) — Given a regularization parameter $\lambda > 0$ and a sequence of stepsizes $\{\gamma_k,\, k \in \mathbb{N}^*\}$, the algorithm produces the Markov chain $\{X^M_k,\, k \in \mathbb{N}\}$: for all $k \ge 0$,
$$X^M_{k+1} = X^M_k - \gamma_{k+1}\left( \nabla f(X^M_k) + \lambda^{-1}\left( X^M_k - \mathrm{prox}^\lambda_g(X^M_k) \right) \right) + \sqrt{2\gamma_{k+1}}\, Z_{k+1}\,,$$
where $\{Z_k,\, k \in \mathbb{N}^*\}$ is a sequence of i.i.d. $d$-dimensional standard Gaussian random variables. A sketch of the iteration follows.
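A minimal sketch of the MYULA iteration (Python/NumPy, not the authors' implementation); it is just ULA applied to the smoothed potential $U^\lambda = f + g^\lambda$, using the envelope-gradient identity from slide 60:

```python
import numpy as np

def myula(x0, grad_f, prox_g, lam, gammas, rng=None):
    """MYULA: ULA applied to U^lam = f + g^lam, with
    grad g^lam(x) = (x - prox_g^lam(x)) / lam.

    prox_g(x, lam) is the user-supplied proximal operator of g (e.g. prox_l1 above).
    """
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    chain = [x.copy()]
    for gamma in gammas:
        drift = grad_f(x) + (x - prox_g(x, lam)) / lam
        x = x - gamma * drift + np.sqrt(2.0 * gamma) * rng.standard_normal(x.shape)
        chain.append(x.copy())
    return np.array(chain)
```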
  64. The MYULA algorithm (II) — MYULA targets the smoothed distribution $\pi^\lambda$. To compute the expectation of a function $h : \mathbb{R}^d \to \mathbb{R}$ under π from $\{X^M_k;\ 0 \le k \le n\}$, an importance-sampling step is used to correct for the regularization. This step approximates $\int_{\mathbb{R}^d} h(x)\pi(x)\,\mathrm{d}x$ by the weighted sum
$$S^h_n = \sum_{k=0}^{n} \omega_{k,n}\, h(X^M_k)\,, \qquad \omega_{k,n} = \frac{\gamma_k\, e^{\bar{g}_\lambda(X^M_k)}}{\sum_{k=0}^{n} \gamma_k\, e^{\bar{g}_\lambda(X^M_k)}}\,,$$
where for all $x \in \mathbb{R}^d$,
$$\bar{g}_\lambda(x) = g^\lambda(x) - g(x) = g\!\left(\mathrm{prox}^\lambda_g(x)\right) - g(x) + (2\lambda)^{-1}\left\|x - \mathrm{prox}^\lambda_g(x)\right\|^2\,.$$
A sketch of the weight computation follows.
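A sketch of the importance weights, computed in log space before normalization for stability; g and prox_g are the user-supplied function and its proximal operator:

```python
import numpy as np

def myula_weights(chain, g, prox_g, lam, gammas):
    """Normalized weights omega_{k,n} ∝ gamma_k * exp(gbar_lam(X_k)), where
    gbar_lam(x) = g(prox(x)) - g(x) + ||x - prox(x)||^2 / (2 lam)  (always <= 0)."""
    gbar = []
    for x in chain:
        p = prox_g(x, lam)
        gbar.append(g(p) - g(x) + np.sum((x - p) ** 2) / (2.0 * lam))
    logw = np.log(np.asarray(gammas, dtype=float)) + np.asarray(gbar)
    w = np.exp(logw - logw.max())  # subtract the max before exponentiating
    return w / w.sum()

# S_n^h = sum_k weights[k] * h(chain[k]) then estimates E_pi[h].
```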
  65. Image deconvolution — Objective: recover an original image $x \in \mathbb{R}^n$ from a blurred and noisy observation $y \in \mathbb{R}^n$ related to $x$ by the linear model $y = Hx + w$, where $H$ is a linear operator representing the blur point spread function and $w$ is a zero-mean Gaussian vector with covariance matrix $\sigma^2 \mathrm{I}_n$. This inverse problem is usually ill-posed or ill-conditioned, so one exploits prior knowledge about $x$. One of the most widely used image priors for deconvolution is the (improper) total-variation prior, $\pi(x) \propto \exp(-\alpha\|\nabla_d x\|_1)$, where $\nabla_d$ denotes the discrete gradient operator computing the vertical and horizontal differences between neighbouring pixels. The posterior is
$$\pi(x \mid y) \propto \exp\!\left( -\|y - Hx\|^2/(2\sigma^2) - \alpha\|\nabla_d x\|_1 \right)\,.$$
  66. Figure: (a) original Boat image (256 × 256 pixels), (b) blurred image, (c) MAP estimate.
  67. Credibility intervals — Figure: (a) pixel-wise 90% credibility intervals computed with proximal MALA (computing time 35 hours), (b) approximate intervals estimated with MYULA using λ = 0.01 (computing time 3.5 hours), (c) approximate intervals estimated with MYULA using λ = 0.1 (computing time 20 minutes).
  68. Microscopy dataset — Figure: panels (a) and (b).
  69. HPD credible region — Consider the $\ell_1$ prior, $\pi(x) \propto \exp(-\alpha\|x\|_1)$, so that
$$\pi(x \mid y) \propto \exp\!\left( -\|y - Hx\|^2/(2\sigma^2) - \alpha\|x\|_1 \right)\,.$$
We want to test whether the molecules are indeed present in the true image (as opposed to being noise artefacts, for example) and to quantify the uncertainty about their positions. For this, it is relevant to compute the highest-posterior-density (HPD) credible region $C^*_\alpha = \{x : U(x) \le \eta_\alpha\}$, with $\eta_\alpha \in \mathbb{R}$ chosen such that $\mathbb{P}(x \in C^*_\alpha \mid y) = 1 - \alpha$. Here we use $\alpha = 0.01$, corresponding to the 99% confidence level.
  70. Comparison with PMALA — Figure: microscopy experiment: (a) HPD region thresholds $\eta_\alpha$ for MYULA ($2 \times 10^6$ iterations, λ = 1, γ = 0.6) and PMALA ($2 \times 10^7$ iterations), (b) relative approximation error of MYULA.