A First Look at Proximal Methods

SMILE meeting, Mines ParisTech, May 2015

Samuel Vaiter

May 21, 2015
Transcript

  1. A First Look at Proximal Methods
     Samuel Vaiter, CMAP, École Polytechnique, samuel.vaiter@cmap.polytechnique.fr
     May 21, 2015, Télécom ParisTech
  2. What’s the Menu?
     argmin_{x ∈ M} f(x)
     In this tutorial: M = R^N (finite dimension, Euclidean setting); f nicely convex; no global smoothness assumption.
     Fundamental ideas covered: fixed point, splitting, duality.
  3. Algorithm: Gradient Descent x(n+1) = x(n) − ρ∇f (x(n))
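
The iteration above can be sketched in a few lines of NumPy; the quadratic f(x) = (1/2)||Ax − b||², the matrix A and the step size ρ are illustrative choices, not from the slides:

```python
import numpy as np

# Gradient descent x(n+1) = x(n) - rho * grad f(x(n)) on the toy quadratic
# f(x) = 0.5 * ||A x - b||^2.  Here grad f(x) = A^T (A x - b), with
# strong convexity alpha = 1 and Lipschitz gradient beta = 4, so
# rho = 0.1 lies in the contraction range 0 < rho < 2*alpha/beta^2 = 0.125.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([2.0, -1.0])

def grad_f(x):
    return A.T @ (A @ x - b)

rho = 0.1
x = np.zeros(2)
for _ in range(500):
    x = x - rho * grad_f(x)

x_star = np.linalg.solve(A.T @ A, A.T @ b)   # closed-form minimizer (1, -1)
```
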

  4. Descent Method
     Definition. A point d_x ∈ R^N is a descent direction at x if there exists ρ_x > 0 such that for all ρ ∈ (0, ρ_x), f(x + ρd_x) < f(x).
     [Figure: f decreasing from (x(n), f(x(n))) to (x(n+1), f(x(n+1))) along the direction d_x(n).]
     Descent method: x(n+1) = x(n) + ρ_x(n) d_x(n)
  5. The Gradient Descent . . . is a Descent Method
     Proposition. If f is differentiable and ∇f(x) ≠ 0, then −∇f(x) is a descent direction.
     [Figure: the parabola t ↦ t², with the descent direction at the point (t, t²).]
  6. Convergence of the Gradient Descent
     x(n+1) = x(n) − ρ∇f(x(n))
     Proposition. If 0 < ρ < 2α/β², then x(n) converges to the unique minimizer x* of f. Moreover, there exists 0 < γ < 1 such that ||x(n) − x*|| ≤ γ^n ||x(0) − x*||.
     (Here α is the strong convexity constant of f and β the Lipschitz constant of ∇f, as in the proof below.)
  7. Proof: Fixed Point Interpretation
     Tx = x − ρ∇f(x)
     Fix T = argmin f
     T is a contraction
     T^n x(0) → x*
  8. Fix T = argmin f
     Tx = x ⇔ x = x − ρ∇f(x) ⇔ 0 = ∇f(x) ⇔ x is a minimizer of f
  9. T is a Contraction
     ||Tx − Ty||² = ||x − y||² − 2ρ⟨∇f(x) − ∇f(y), x − y⟩ [=: A1] + ρ²||∇f(x) − ∇f(y)||² [=: A2]
     A1 ≥ 2αρ||x − y||²  (strong convexity)
     A2 ≤ β²ρ²||x − y||²  (Lipschitz gradient)
     Hence ||Tx − Ty||² ≤ γ²||x − y||² with γ² = 1 − 2αρ + β²ρ², and γ < 1 ⇔ 0 < ρ < 2α/β².
  10. Towards a New Operator
     First-order optimality condition:
     f convex + smooth: 0 = ∇f(x*) ⇔ x* minimizer of f
     f convex + nonsmooth: 0 ∈ ∂f(x*) ⇔ x* minimizer of f
  11. Historical Note
     Nonlinear analysis (starting from the 1950s): convex analysis, monotone operators, nonexpansive mappings.
     People: Brézis, Fenchel, Lions, Moreau, Rockafellar, etc.
     Study of multivalued mappings on Banach spaces. Today: an application-driven approach.
  12. Reminder on Convex Analysis
     f : R^N → R̄ (extended reals)
     Convexity: f(tx + (1 − t)y) ≤ tf(x) + (1 − t)f(y)
     Lower semi-continuity: lim inf_{x → x0} f(x) ≥ f(x0)
     Proper: f(x) > −∞ everywhere and ∃x, f(x) < +∞
     HERE, convex ≡ convex, l.s.c. and proper. We also assume that argmin f is not empty.
  13. Subdifferential
     [Figure: the parabola t ↦ t² with a supporting line at (t, t²).]
     ∂f(t) = {η : ∀t′, f(t′) ≥ f(t) + ⟨η, t′ − t⟩}
  14. Towards a New Operator
     First-order optimality condition:
     f convex + smooth: 0 = ∇f(x*) ⇔ x* minimizer of f
     f convex + nonsmooth: 0 ∈ ∂f(x*) ⇔ x* minimizer of f
  15. Properties of the Subdifferential
     Definition. The subdifferential of a convex function f at x ∈ R^N is the subset of R^N defined by ∂f(x) = {η : ∀y, f(y) ≥ f(x) + ⟨η, y − x⟩}.
     Proposition.
     If f is smooth, ∂f(x) = {∇f(x)}.
     For ρ > 0, ∂(ρf)(x) = ρ∂f(x).
     ∂f(x) + ∂g(x) ⊆ ∂(f + g)(x), with equality if ri dom f ∩ ri dom g ≠ ∅.
  16. Life is Smooth: Moreau–Yosida
     Infimal convolution: (f □ g)(x) = inf_v f(v) + g(x − v)
     Definition. The Moreau–Yosida regularization of f is M[f] = f □ (1/2)||·||².
     Theorem. For any convex function f (even nonsmooth or without full domain):
     dom M[f] = R^N
     M[f] is continuously differentiable
     argmin M[f] = argmin f
  17. Proximity Operator
     Proximity operator ≡ the unique minimizer in the Moreau infimum.
     Definition. The proximity operator of a convex function f is defined by prox_f(v) = argmin_x f(x) + (1/2)||x − v||².
     Smooth interpretation (implicit gradient step): prox_f(x) = x − ∇M[f](x)
  18. Proximity ≈ Generalized Projection
     Indicator function: ι_C(x) = 0 if x ∈ C, +∞ otherwise.
     Proposition (Proximity ≡ Projection). If C is a convex set, then prox_{ι_C} = Π_C:
     prox_{ι_C}(v) = argmin_x ι_C(x) + (1/2)||x − v||² = argmin_{x ∈ C} (1/2)||x − v||² = Π_C(v)
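
As a minimal sketch of this proposition (the box constraint C = [−1, 1]^N is a hypothetical example, not from the slides), the prox of the indicator of a box is just a componentwise clip:

```python
import numpy as np

# prox of the indicator of the box C = [-1, 1]^N is the Euclidean
# projection onto C, which for a box is a componentwise clip.
def prox_indicator_box(v, lo=-1.0, hi=1.0):
    return np.clip(v, lo, hi)

p = prox_indicator_box(np.array([-3.0, 0.5, 2.0]))   # -> [-1., 0.5, 1.]
```
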
  19. Subdifferential and Proximity Operator
     Proposition. p = prox_f(v) ⇔ v − p ∈ ∂f(p).
     Resolvent of the subdifferential (as a notation): prox_f(v) = (Id + ∂f)^(-1)(v)
     Theorem. Fix prox_f = argmin f.
  20. Proximal Fixed Point
     T = prox_f
     Fix T = argmin f
     T is not a contraction in general, but T is firmly nonexpansive; T^n x → x* (Krasnosel’skii–Mann).
     Firmly nonexpansive: ||prox_f(x) − prox_f(y)||² + ||(Id − prox_f)(x) − (Id − prox_f)(y)||² ≤ ||x − y||²
  21. A First Set of Properties
     Separability: f(x, y) = f1(x) + f2(y) ⇒ prox_f(v, w) = (prox_{f1}(v), prox_{f2}(w))
     Orthogonal precomposition: f(x) = g(Ax) with AA* = Id ⇒ prox_f(v) = A* prox_g(Av)
     Affine addition: f(x) = g(x) + ⟨a, x⟩ + b ⇒ prox_f(v) = prox_g(v − a)
  22. A Concrete Example: The Lasso
     Observations: y = Φx0 + w
     Assumption: x0 sparse, i.e. Card supp(x0) ≪ N
     Variational recovery: min_x (1/2)||y − Φx||² + λ Card supp(x)
     Convex relaxation: min_x (1/2)||y − Φx||² + λ||x||_1, where ||x||_1 = Σ_i |x_i|
  23. An Idea: Splitting
     min_x J(x) = (1/2)||y − Φx||² [f] + λ||x||_1 [λg]
     J is not smooth and prox_J is hard to compute. But: f is smooth, and prox_{λg} is easy to compute.
     Soft thresholding: (prox_{λ||·||_1}(x))_i = sign(x_i)(|x_i| − λ)_+
     [Figure: the soft-thresholding map t ↦ ST(t), flat on [−λ, λ].]
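
The soft-thresholding formula above translates directly into code; this is a sketch of the closed form, with an illustrative input:

```python
import numpy as np

# Soft thresholding: the closed form of prox_{lambda ||.||_1}.
# Entries with |x_i| <= lambda are set to zero; the rest shrink by lambda.
def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

st = soft_threshold(np.array([3.0, -0.5, 1.0]), 1.0)   # -> [2., 0., 0.]
```
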
  24. Fixed Point
     x* ∈ argmin f + g
     0 ∈ ∇f(x*) + ∂g(x*)
     0 ∈ ρ∇f(x*) + ρ∂g(x*)
     0 ∈ ρ∇f(x*) − x* + x* + ρ∂g(x*)
     (Id − ρ∇f)(x*) ∈ (Id + ρ∂g)(x*)
     x* = (Id + ρ∂g)^(-1)(Id − ρ∇f)(x*)
     x* = prox_{ρg}(x* − ρ∇f(x*))
     Proposition. Tx = prox_{ρg}(x − ρ∇f(x)) satisfies Fix T = argmin f + g.
  25. Algorithm: Forward-Backward
     x(n+1) = prox_{ρg}(x(n) − ρ∇f(x(n)))  (backward step: the prox; forward step: the gradient)
     Proposition. If 0 < ρ < 1/β, then x(n) converges to a minimizer x* of f + g. Moreover, J(x(n)) − J(x*) = O(1/n).
     The convergence of ||x(n) − x*|| may be arbitrarily slow.
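
A minimal forward-backward (ISTA) sketch on a toy Lasso instance; Φ, y and λ below are illustrative data, not from the slides:

```python
import numpy as np

# Forward-backward (ISTA) for min_x 0.5*||y - Phi x||^2 + lam*||x||_1.
def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rng = np.random.default_rng(0)
Phi = rng.standard_normal((20, 5))
y = Phi @ np.array([1.0, 0.0, 0.0, -2.0, 0.0])
lam = 0.1

beta = np.linalg.norm(Phi, 2) ** 2   # Lipschitz constant of grad f
rho = 1.0 / beta                     # step size from the proposition
x = np.zeros(5)
for _ in range(2000):
    # backward (prox) applied to the forward (gradient) step
    x = soft_threshold(x - rho * Phi.T @ (Phi @ x - y), rho * lam)
```

At convergence the iterate is a fixed point of the forward-backward map, which is the optimality condition from the previous slide.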
  26. Special Cases
     x(n+1) = prox_{ρg}(x(n) − ρ∇f(x(n)))
     Gradient descent (g = 0): x(n+1) = x(n) − ρ∇f(x(n))
     Proximal point (f = 0): x(n+1) = prox_{ρg}(x(n))
     Projected gradient (g = ι_C): x(n+1) = Π_C(x(n) − ρ∇f(x(n)))
  27. Another Example: The Basis Pursuit
     Noiseless observations: y = Φx0
     Assumption: x0 sparse, i.e. Card supp(x0) ≪ N
     Variational recovery: min_{y=Φx} Card supp(x)
     Convex relaxation: min_{y=Φx} ||x||_1
     Unconstrained formulation: min_x ||x||_1 + ι_{y=Φx}(x)
  28. Another Example: The Basis Pursuit
     min_x J(x) = ||x||_1 [f] + ι_{y=Φx}(x) [g]
     Neither f nor g is differentiable. But:
     (prox_{||·||_1}(x))_i = sign(x_i)(|x_i| − 1)_+
     prox_{ι_{y=Φx}}(x) = Π_{y=Φx}(x) = x − Φ⁺(Φx − y)
  29. Fixed Point
     x* ∈ argmin f + g
     0 ∈ ∂f(x*) + ∂g(x*)
     2x* ∈ (x* + ρ∂f(x*)) + (x* + ρ∂g(x*))
     2x* ∈ (Id + ρ∂f)(x*) + (Id + ρ∂g)(x*)
     Idea: take z ∈ (Id + ρ∂g)(x*), so that 2x* ∈ (Id + ρ∂f)(x*) + z and x* = (Id + ρ∂f)^(-1)(2x* − z).
     Almost a fixed point . . . let Γ_{ρf} = 2(Id + ρ∂f)^(-1) − Id. Since x* = prox_{ρg}(z), we have 2x* − z = 2 prox_{ρg}(z) − z = Γ_{ρg}(z), hence
     2x* − z ∈ (Id + ρ∂f)(x*)
     Γ_{ρg}(z) ∈ (Id + ρ∂f)(x*)
     In particular, x* = (Id + ρ∂f)^(-1)(Γ_{ρg}(z)).
  30. Fixed Point
     Γ_{ρf} = 2(Id + ρ∂f)^(-1) − Id
     x* = (Id + ρ∂f)^(-1)(Γ_{ρg}(z)) and Γ_{ρg}(z) = 2x* − z, thus
     z = 2x* − Γ_{ρg}(z) = 2(Id + ρ∂f)^(-1)(Γ_{ρg}(z)) − Γ_{ρg}(z) = Γ_{ρf}(Γ_{ρg}(z))
     Fixed point: x* ∈ argmin f + g ⇔ x* = prox_{ρg}(z) with z = (Γ_{ρf} ∘ Γ_{ρg})(z)
     Proposition. prox_{ρg}(Fix(Γ_{ρf} ∘ Γ_{ρg})) = argmin f + g
  31. Algorithm: Douglas–Rachford
     min_x f(x) + g(x)
     Douglas–Rachford:
     x(n) = prox_{ρg}(y(n))
     z(n) = prox_{ρf}(2x(n) − y(n))
     y(n+1) = y(n) + γ(z(n) − x(n))
     Proposition. If ρ > 0 and 1 < γ < 2, then x(n) converges to a minimizer x* of f + g.
     The rate of convergence is hard to derive.
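
A sketch of Douglas–Rachford applied to the basis pursuit example, with f = ||·||_1 and g the indicator of {x : Φx = y}; the data, ρ and γ below are illustrative choices within the stated ranges:

```python
import numpy as np

# Douglas-Rachford for min ||x||_1 s.t. y = Phi x, on toy data.
rng = np.random.default_rng(2)
Phi = rng.standard_normal((8, 16))
x0 = np.zeros(16)
x0[[2, 9]] = [1.5, -2.0]
y = Phi @ x0
pinv = np.linalg.pinv(Phi)

def prox_f(x, r):                       # soft thresholding (prox of r*||.||_1)
    return np.sign(x) * np.maximum(np.abs(x) - r, 0.0)

def prox_g(x):                          # projection onto {x : Phi x = y}
    return x - pinv @ (Phi @ x - y)

rho, gamma = 1.0, 1.5
yk = np.zeros(16)
for _ in range(3000):
    xk = prox_g(yk)
    zk = prox_f(2 * xk - yk, rho)
    yk = yk + gamma * (zk - xk)
xk = prox_g(yk)                         # final (feasible) iterate
```
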
  32. Algorithm: FISTA
     Forward-Backward: x(n+1) = prox_{ρg}(x(n) − ρ∇f(x(n)))
     Rate of convergence: (f + g)(x(n)) − (f + g)(x*) = O(1/n)
     FISTA: extrapolated updates (not a convex combination)
     x(n) = prox_{ρg}(y(n) − ρ∇f(y(n)))
     t(n+1) = (1 + √(1 + 4t(n)²)) / 2
     y(n+1) = x(n) + ((t(n) − 1)/t(n+1)) (x(n) − x(n−1))
     Rate of convergence: (f + g)(x(n)) − (f + g)(x*) = O(1/n²)
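
The three FISTA updates can be sketched as follows, again on a toy Lasso instance (Φ, y and λ are illustrative):

```python
import numpy as np

# FISTA for min_x 0.5*||y - Phi x||^2 + lam*||x||_1 on toy data.
def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rng = np.random.default_rng(1)
Phi = rng.standard_normal((30, 10))
y = rng.standard_normal(30)
lam = 0.5
rho = 1.0 / np.linalg.norm(Phi, 2) ** 2

x = np.zeros(10)      # x(n-1) at loop entry
yk = x.copy()         # extrapolated point y(n)
t = 1.0
for _ in range(1000):
    x_new = soft_threshold(yk - rho * Phi.T @ (Phi @ yk - y), rho * lam)
    t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
    yk = x_new + ((t - 1.0) / t_new) * (x_new - x)   # extrapolation step
    x, t = x_new, t_new
```
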
  33. [Figure: log–log plot of J(x(n)) − J(x*) versus n, comparing FB against the 1/n rate and FISTA against the 1/n² rate.]
  34. Composite Problem
     Total variation, known in statistics (1D) as the taut string:
     min_x (1/2)||y − Φx||² + λ||Dx||_p, with D a first-order finite difference operator.
     Composite problem: min_x f(Kx) + g(x), where K is a linear operator.
     prox_f easy to compute ⇒ prox_{f∘K} easy to compute? No (except if K is orthogonal).
  35. Fenchel–Rockafellar Conjugate
     Definition. f*(ξ) = sup_x ⟨x, ξ⟩ − f(x)
     [Figure: the line x ↦ ⟨ξ, x⟩ against f(x), with the gap maximized at x̂ = argmax ⟨x, ξ⟩ − f(x); the intercept is −f*(ξ).]
  36. Biconjugate
     Theorem. If f is convex, l.s.c. and proper, then f** = f.
  37. Back to Moreau–Yosida
     Using that (f □ g)* = f* + g* and that (1/2)||·||² is self-dual:
     M[f] = (f* + (1/2)||·||²)*
     A natural smoothing of the ℓ1 norm?
     (M[||·||_1](x))_i = x_i²/2 if |x_i| ≤ 1, |x_i| − 1/2 otherwise
     [Figure: |·| and its Moreau envelope M[|·|].]
     Other choices exist, such as √(|·|² + ε²).
  38. The Moreau Identity
     Theorem. If f is convex, l.s.c. and proper, then prox_f(x) + prox_{f*}(x) = x.
     Applications:
     Generalization of the orthogonal splitting of R^N: Π_T(x) + Π_{T⊥}(x) = x
     If prox_{f*} is easy to compute, so is prox_f.
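
A quick numerical check of the identity for f = λ||·||_1 (a standard pairing: the conjugate f* is the indicator of the ℓ∞ ball of radius λ, so prox_{f*} is a clip); the test vector is illustrative:

```python
import numpy as np

# Moreau identity check: prox_f(x) + prox_{f*}(x) = x for f = lam*||.||_1.
lam = 0.7
x = np.array([2.0, -0.3, 0.9, -5.0])
prox_f = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)   # soft threshold
prox_fstar = np.clip(x, -lam, lam)                       # projection on lam-ball
```
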
  39. The Moreau Identity: Norms and Balls
     prox_{λ||·||}(x) = x − λΠ_B(x/λ), where B = {x : ||x||* ≤ 1} and ||x||* = sup_{||v|| ≤ 1} ⟨v, x⟩
     ℓ2 norm: prox_{λ||·||_2}(x) = (1 − λ/||x||_2)_+ x
     Elastic net: f = ||·||_1 + (ε/2)||·||², prox_{λf}(x) = (1/(1 + λε)) prox_{λ||·||_1}(x)
     Group regularization: f(x) = Σ_{g∈G} ||x_g||_2, (prox_{λf}(x))_g = (1 − λ/||x_g||_2)_+ x_g
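
The group-regularization formula above is block soft thresholding; a sketch with hypothetical groups:

```python
import numpy as np

# Block soft thresholding: prox of lam * sum_g ||x_g||_2, groupwise.
def prox_group(x, groups, lam):
    out = np.zeros_like(x)
    for g in groups:
        n = np.linalg.norm(x[g])
        if n > lam:                     # otherwise the whole group is zeroed
            out[g] = (1.0 - lam / n) * x[g]
    return out

p = prox_group(np.array([3.0, 4.0, 0.1, -0.1]),
               [np.array([0, 1]), np.array([2, 3])], 1.0)
# first group has norm 5 -> scaled by 0.8; second has norm < 1 -> zeroed
```
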
  40. Primal–Dual Formulation
     min_x f(Kx) + g(x). Objective: remove K inside f.
     1. Biconjugate theorem: f(Kx) = (f*)*(Kx) = sup_ξ ⟨ξ, Kx⟩ − f*(ξ)
     2. Adjoint operator: f(Kx) = sup_ξ ⟨K*ξ, x⟩ − f*(ξ)
     3. Rewrite the initial problem (under a qualification assumption):
     min_x max_ξ ⟨K*ξ, x⟩ − f*(ξ) + g(x)
  41. Algorithm: Arrow–Hurwicz
     min_x max_ξ ⟨K*ξ, x⟩ − f*(ξ) + g(x)
     ξ(n+1) = prox_{σf*}(ξ(n) + σKx(n))
     x(n+1) = prox_{τg}(x(n) − τK*ξ(n+1))
  42. Algorithm: Chambolle–Pock
     min_x max_ξ ⟨K*ξ, x⟩ − f*(ξ) + g(x)
     ξ(n+1) = prox_{σf*}(ξ(n) + σKx̄(n))
     x(n+1) = prox_{τg}(x(n) − τK*ξ(n+1))
     x̄(n+1) = 2x(n+1) − x(n)
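
A Chambolle–Pock sketch on 1-D total-variation denoising, min_x λ||Dx||_1 + (1/2)||x − b||², so that prox_{σf*} is a clip to [−λ, λ] and prox_{τg} has a closed form; the signal b and the step sizes are illustrative (one needs στ||D||² < 1, and ||D||² ≤ 4 for first-order finite differences):

```python
import numpy as np

# Chambolle-Pock for min_x lam*||D x||_1 + 0.5*||x - b||^2 (1-D TV denoising).
N = 20
b = np.concatenate([np.zeros(10), np.ones(10)]) + 0.05 * np.sin(np.arange(N))
lam = 0.5
D = np.diff(np.eye(N), axis=0)        # (N-1) x N finite differences, K = D
sigma = tau = 0.4                     # sigma*tau*||D||^2 <= 0.64 < 1

x = b.copy()
xbar = b.copy()
xi = np.zeros(N - 1)
for _ in range(2000):
    xi = np.clip(xi + sigma * (D @ xbar), -lam, lam)        # prox_{sigma f*}
    x_new = (x - tau * (D.T @ xi) + tau * b) / (1.0 + tau)  # prox_{tau g}
    xbar = 2.0 * x_new - x
    x = x_new
```
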
  43. [Image-only slide.]
  44. Matrix Regularization
     Orthogonal invariance: (∀X ∈ R^{N×M}, ∀U ∈ O_N, ∀V ∈ O_M) F(VXU) = F(X)
     Proposition. If F is orthogonally invariant, then F(X) = F(diag(σ(X))), where σ(X) are the ordered singular values of X.
     Absolute symmetry: (∀x ∈ R^N, ∀Q signed permutation) f(Qx) = f(x)
     Theorem. F orthogonally invariant ⇔ F = f ∘ σ with f absolutely symmetric.
     Example: ||·||_nuc = ||·||_1 ∘ σ
  45. Transfer Principle
     F = f ∘ σ orthogonally invariant, with SVD X = V diag(σ(X)) U.
     Proposition.
     F convex ⇔ f convex
     ∂F(X) = {V diag(µ) U : µ ∈ ∂f(σ(X))}
     prox_F(X) = V diag(prox_f(σ(X))) U
     Example (nuclear norm): prox_{λ||·||_nuc}(X) = Σ_{i=1}^N (σ(X)_i − λ)_+ u_i v_i*
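
The transfer principle in action: the prox of the nuclear norm soft-thresholds the singular values. A sketch on a toy diagonal matrix:

```python
import numpy as np

# prox of lam*||.||_nuc: SVD, then soft-threshold the singular values.
def prox_nuclear(X, lam):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt   # scale columns of U by s

P = prox_nuclear(np.diag([3.0, 1.0, 0.2]), 0.5)
# singular values 3, 1, 0.2 are shrunk to 2.5, 0.5, 0
```
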
  46. Not Covered Today
     A lot of things: augmented Lagrangian methods (ADMM), parallelization and multi-objective problems, . . .
  47. Software
     C, Matlab: proximal (Parikh, Boyd) https://github.com/cvxgrp/proximal
     Matlab: TFOCS (Becker, Candès, Grant) https://github.com/cvxr/TFOCS
     Python: pyprox (Vaiter) https://github.com/svaiter/pyprox (currently being rewritten)
  48. Selected Bibliography
     Y. Nesterov, Introductory Lectures on Convex Optimization, 2004.
     N. Parikh and S. Boyd, Proximal Algorithms, 2013.
     P.L. Combettes and J.-C. Pesquet, Proximal Splitting Methods in Signal Processing, 2011.
     A. Beck and M. Teboulle, Gradient-Based Algorithms with Applications to Signal Recovery Problems, 2010.