
# A First Look at Proximal Methods

SMILE meeting, Mines ParisTech, May 2015

May 21, 2015

## Transcript

1. ### A First Look at Proximal Methods

Samuel Vaiter, CMAP, École Polytechnique, samuel.vaiter@cmap.polytechnique.fr. May 21, 2015, Télécom ParisTech.
2. ### What's the Menu?

$\operatorname{argmin}_{x \in \mathcal{M}} f(x)$

In this tutorial: $\mathcal{M} = \mathbb{R}^N$ (finite dimension, Euclidean setting); $f$ nicely convex; no global smoothness assumption. Fundamental ideas covered: fixed point, splitting, duality.

4. ### Descent Method

Definition: A point $d_x \in \mathbb{R}^N$ is a descent direction at $x$ if there exists $\rho_x > 0$ such that $\forall \rho \in (0, \rho_x)$, $f(x + \rho d_x) < f(x)$.

[Figure: iterates $x^{(n)}$, $x^{(n+1)}$ along the descent direction $d_{x^{(n)}}$ on the graph of $f$.]

Descent method: $x^{(n+1)} = x^{(n)} + \rho_{x^{(n)}} d_{x^{(n)}}$
5. ### The Gradient Descent... is a Descent Method

Proposition: If $f$ is differentiable and $\nabla f(x) \neq 0$, then $-\nabla f(x)$ is a descent direction.

[Figure: the parabola $t \mapsto t^2$ with a point $(t, t^2)$.]
6. ### Convergence of the Gradient Descent

$x^{(n+1)} = x^{(n)} - \rho \nabla f(x^{(n)})$

Proposition: If $0 < \rho < 2\alpha/\beta^2$ (with $f$ $\alpha$-strongly convex and $\nabla f$ $\beta$-Lipschitz), then $x^{(n)}$ converges to the unique minimizer $x^\star$ of $f$. Moreover, there exists $0 < \gamma < 1$ such that $\|x^{(n)} - x^\star\| \leq \gamma^n \|x^{(0)} - x^\star\|$.
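
As a minimal sketch of the iteration above (not from the slides; the quadratic objective, step size, and iteration count are illustrative assumptions), in NumPy:

```python
import numpy as np

def gradient_descent(grad_f, x0, rho, n_iter=200):
    """Plain gradient descent: x_{n+1} = x_n - rho * grad_f(x_n)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - rho * grad_f(x)
    return x

# Example: minimize f(x) = 0.5 * ||A x - b||^2, whose gradient is A^T (A x - b).
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, -1.0])
x_min = gradient_descent(lambda x: A.T @ (A @ x - b), x0=np.zeros(2), rho=0.1)
print(x_min)  # close to the minimizer [0.5, -1.0]
```
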
7. ### Proof

Fixed-point interpretation: $Tx = x - \rho \nabla f(x)$.
- $\operatorname{Fix} T = \operatorname{argmin} f$
- $T$ is a contraction
- $T^n x \to x^\star$
8. ### Fix T = argmin f

$Tx = x \Leftrightarrow x = x - \rho \nabla f(x) \Leftrightarrow 0 = \nabla f(x) \Leftrightarrow x$ is a minimizer of $f$.
9. ### T is a Contraction

$\|Tx - Ty\|^2 = \|x - y\|^2 \underbrace{- 2\rho \langle \nabla f(x) - \nabla f(y), x - y \rangle}_{A_1} + \underbrace{\rho^2 \|\nabla f(x) - \nabla f(y)\|^2}_{A_2}$

$A_1 \leq -2\alpha\rho \|x - y\|^2$ (strong convexity) and $A_2 \leq \beta^2\rho^2 \|x - y\|^2$ (Lipschitz gradient), hence $\|Tx - Ty\|^2 \leq \gamma^2 \|x - y\|^2$ with $\gamma^2 = 1 - 2\alpha\rho + \beta^2\rho^2$, and $\gamma < 1 \Leftrightarrow 0 < \rho < 2\alpha/\beta^2$.
10. ### Towards a New Operator

First-order optimality condition:
- $f$ convex + smooth: $0 = \nabla f(x^\star) \Leftrightarrow x^\star$ minimizer of $f$
- $f$ convex + nonsmooth: $0 \in \partial f(x^\star) \Leftrightarrow x^\star$ minimizer of $f$
11. ### Historical Note

Nonlinear analysis (starting from the 1950s): convex analysis, monotone operators, nonexpansive mappings, the study of multivalued mappings on Banach spaces. People: Brézis, Fenchel, Lions, Moreau, Rockafellar, etc. Today: an application-driven approach.
12. ### Reminder on Convex Analysis

$f : \mathbb{R}^N \to \bar{\mathbb{R}}$
- Convexity: $f(tx + (1 - t)y) \leq t f(x) + (1 - t) f(y)$
- Lower semi-continuity: $\liminf_{x \to x_0} f(x) \geq f(x_0)$
- Proper: $f(x) > -\infty$ and $\exists x, f(x) < +\infty$

HERE, convex ≡ convex, l.s.c. and proper. We also assume that $\operatorname{argmin} f$ is not empty.
13. ### Subdifferential

[Figure: the parabola $t \mapsto t^2$ with subgradients at a point $(t, t^2)$.]

$\partial f(t) = \{\eta : \forall t', \ f(t') \geq f(t) + \langle \eta, t' - t \rangle\}$
14. ### Towards a New Operator

First-order optimality condition:
- $f$ convex + smooth: $0 = \nabla f(x^\star) \Leftrightarrow x^\star$ minimizer of $f$
- $f$ convex + nonsmooth: $0 \in \partial f(x^\star) \Leftrightarrow x^\star$ minimizer of $f$
15. ### Properties of the Subdifferential

Definition: The subdifferential of a convex function $f$ at $x \in \mathbb{R}^N$ is the subset of $\mathbb{R}^N$ defined by $\partial f(x) = \{\eta : \forall x', \ f(x') \geq f(x) + \langle \eta, x' - x \rangle\}$.

Proposition:
- If $f$ is smooth, $\partial f(x) = \{\nabla f(x)\}$
- For $\rho > 0$, $\partial(\rho f)(x) = \rho \, \partial f(x)$
- $\partial f(x) + \partial g(x) \subseteq \partial(f + g)(x)$, with equality if $\operatorname{ri} \operatorname{dom} f \cap \operatorname{ri} \operatorname{dom} g \neq \emptyset$
16. ### Life is Smooth: Moreau–Yosida

Infimal convolution: $(f \,\square\, g)(x) = \inf_v f(v) + g(x - v)$

Definition: The Moreau–Yosida regularization of $f$ is defined as $M[f] = f \,\square\, \tfrac{1}{2}\|\cdot\|^2$.

Theorem: For any convex function $f$ (even nonsmooth, even without full domain),
- $\operatorname{dom} M[f] = \mathbb{R}^N$
- $M[f]$ is continuously differentiable
- $\operatorname{argmin} M[f] = \operatorname{argmin} f$
17. ### Proximity Operator

Proximity operator ≡ unique minimizer attaining the Moreau infimum.

Definition: The proximity operator of a convex function $f$ is defined by $\operatorname{prox}_f(v) = \operatorname{argmin}_x f(x) + \tfrac{1}{2}\|x - v\|^2$.

Smooth interpretation (implicit gradient step): $\operatorname{prox}_f(x) = x - \nabla M[f](x)$.
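
A small numerical sketch (not from the slides): for a quadratic $f(x) = \tfrac{1}{2}\langle Qx, x \rangle$ the prox has the closed form $(\operatorname{Id} + Q)^{-1}v$, which can be checked against the defining minimization; the matrix and point below are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

# Quadratic f(x) = 0.5 * x^T Q x; its prox at v solves (Id + Q) x = v.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
v = np.array([1.0, -2.0])

prox_closed_form = np.linalg.solve(np.eye(2) + Q, v)

# Numerical check: minimize f(x) + 0.5 * ||x - v||^2 directly.
obj = lambda x: 0.5 * x @ Q @ x + 0.5 * np.sum((x - v) ** 2)
prox_numerical = minimize(obj, x0=np.zeros(2)).x

assert np.allclose(prox_closed_form, prox_numerical, atol=1e-4)
```
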
18. ### Proximity ≈ Generalized Projection

Indicator function: $\iota_C(x) = 0$ if $x \in C$, $+\infty$ otherwise.

Proposition (Proximity ≡ Projection): If $C$ is a convex set, then $\operatorname{prox}_{\iota_C} = \Pi_C$:
$\operatorname{prox}_{\iota_C}(v) = \operatorname{argmin}_x \iota_C(x) + \tfrac{1}{2}\|x - v\|^2 = \operatorname{argmin}_{x \in C} \tfrac{1}{2}\|x - v\|^2 = \Pi_C(v)$
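
A tiny illustration of prox-as-projection (a hypothetical box constraint, not from the slides): the prox of the indicator of $C = [-1, 1]^N$ is entrywise clipping onto that box.

```python
import numpy as np

# prox of the indicator of the box C = [-1, 1]^N is the Euclidean projection onto C.
project_box = lambda x: np.clip(x, -1.0, 1.0)

x = np.array([0.3, -2.5, 1.7])
print(project_box(x))  # -> [ 0.3 -1.   1. ]
```
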
19. ### Subdifferential and Proximity Operator

Proposition: $p = \operatorname{prox}_f(v) \Leftrightarrow v - p \in \partial f(p)$

Resolvent of the subdifferential (as a notation): $\operatorname{prox}_f(v) = (\operatorname{Id} + \partial f)^{-1}(v)$

Theorem: $\operatorname{Fix} \operatorname{prox}_f = \operatorname{argmin} f$
20. ### Proximal Fixed Point

$T = \operatorname{prox}_f$
- $\operatorname{Fix} T = \operatorname{argmin} f$
- $T$ is firmly nonexpansive (not necessarily a contraction)
- $T^n x \to x^\star$ (Krasnosel'skii–Mann)

Firmly nonexpansive: $\|\operatorname{prox}_f(x) - \operatorname{prox}_f(y)\|^2 + \|(\operatorname{Id} - \operatorname{prox}_f)(x) - (\operatorname{Id} - \operatorname{prox}_f)(y)\|^2 \leq \|x - y\|^2$
21. ### A First Set of Properties

- Separability: $f(x, y) = f_1(x) + f_2(y)$, $\operatorname{prox}_f(v, w) = (\operatorname{prox}_{f_1}(v), \operatorname{prox}_{f_2}(w))$
- Orthogonal precomposition: $f(x) = g(Ax)$ with $AA^* = \operatorname{Id}$, $\operatorname{prox}_f(v) = A^* \operatorname{prox}_g(Av)$
- Affine addition: $f(x) = g(x) + \langle u, x \rangle + b$, $\operatorname{prox}_f(v) = \operatorname{prox}_g(v - u)$
22. ### A Concrete Example: The Lasso

Observations: $y = \Phi x_0 + w$. Assumption: $x_0$ sparse, i.e. $\operatorname{Card} \operatorname{supp}(x_0) \ll N$.

Variational recovery: $\min_x \tfrac{1}{2}\|y - \Phi x\|^2 + \lambda \operatorname{Card} \operatorname{supp}(x)$

Convex relaxation: $\min_x \tfrac{1}{2}\|y - \Phi x\|^2 + \lambda \|x\|_1$ where $\|x\|_1 = \sum_i |x_i|$
23. ### An Idea: Splitting

$\min_x J(x) = \underbrace{\tfrac{1}{2}\|y - \Phi x\|^2}_{f} + \underbrace{\lambda \|x\|_1}_{\lambda g}$

$J$ is not smooth / $\operatorname{prox}_J$ is hard to compute. But: $f$ is smooth and $\operatorname{prox}_{\lambda g}$ is easy to compute.

[Figure: the soft-thresholding curve $t \mapsto \operatorname{ST}(t)$, flat on $[-\lambda, \lambda]$.]

Soft thresholding: $(\operatorname{prox}_{\lambda\|\cdot\|_1}(x))_i = \operatorname{sign}(x_i)(|x_i| - \lambda)_+$
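
A direct NumPy transcription of the soft-thresholding formula above (the test vector and $\lambda$ are illustrative):

```python
import numpy as np

def soft_threshold(x, lam):
    """Entrywise soft thresholding: prox of lam * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Values inside [-0.5, 0.5] are set to zero, the others shrink toward 0.
print(soft_threshold(np.array([1.0, -0.2, 0.3, -2.0]), lam=0.5))
# -> [ 0.5 -0.   0.  -1.5]
```
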
24. ### Fixed Point

$x^\star \in \operatorname{argmin} f + g$
$\Leftrightarrow 0 \in \nabla f(x^\star) + \partial g(x^\star)$
$\Leftrightarrow 0 \in \rho \nabla f(x^\star) + \rho \partial g(x^\star)$
$\Leftrightarrow 0 \in \rho \nabla f(x^\star) - x^\star + x^\star + \rho \partial g(x^\star)$
$\Leftrightarrow (\operatorname{Id} - \rho \nabla f)(x^\star) \in (\operatorname{Id} + \rho \partial g)(x^\star)$
$\Leftrightarrow x^\star = (\operatorname{Id} + \rho \partial g)^{-1}(\operatorname{Id} - \rho \nabla f)(x^\star)$
$\Leftrightarrow x^\star = \operatorname{prox}_{\rho g}(x^\star - \rho \nabla f(x^\star))$

Proposition: With $Tx = \operatorname{prox}_{\rho g}(x - \rho \nabla f(x))$, $\operatorname{Fix} T = \operatorname{argmin} f(x) + g(x)$.
25. ### Algorithm: Forward-Backward

$x^{(n+1)} = \underbrace{\operatorname{prox}_{\rho g}}_{\text{backward}}(\underbrace{x^{(n)} - \rho \nabla f(x^{(n)})}_{\text{forward}})$

Proposition: If $0 < \rho < \tfrac{1}{\beta}$, then $x^{(n)}$ converges to a minimizer $x^\star$ of $f + g$. Moreover, $J(x^{(n)}) - J(x^\star) = O(1/n)$. The convergence of $\|x^{(n)} - x^\star\|$ may be arbitrarily slow.
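
A compact sketch of forward-backward applied to the lasso example (random $\Phi$ and $y$, illustrative $\lambda$; the step is taken just under $1/\beta$ with $\beta = \|\Phi\|^2$):

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.standard_normal((50, 100))
y = Phi @ np.concatenate([rng.standard_normal(5), np.zeros(95)])  # sparse ground truth
lam = 0.1
rho = 0.9 / np.linalg.norm(Phi, 2) ** 2  # step below 1/beta, beta = ||Phi||^2

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

x = np.zeros(100)
for _ in range(500):
    grad = Phi.T @ (Phi @ x - y)                    # forward (gradient) step on f
    x = soft_threshold(x - rho * grad, rho * lam)   # backward (prox) step on lam * ||.||_1
```
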
26. ### Special Cases

$x^{(n+1)} = \operatorname{prox}_{\rho g}(x^{(n)} - \rho \nabla f(x^{(n)}))$
- Gradient descent: $g = 0$, $x^{(n+1)} = x^{(n)} - \rho \nabla f(x^{(n)})$
- Proximal point: $f = 0$, $x^{(n+1)} = \operatorname{prox}_{\rho g}(x^{(n)})$
- Projected gradient: $g = \iota_C$, $x^{(n+1)} = \Pi_C(x^{(n)} - \rho \nabla f(x^{(n)}))$
27. ### Another Example: The Basis Pursuit

Noiseless observations: $y = \Phi x_0$. Assumption: $x_0$ sparse, i.e. $\operatorname{Card} \operatorname{supp}(x_0) \ll N$.

Variational recovery: $\min_{y = \Phi x} \operatorname{Card} \operatorname{supp}(x)$

Convex relaxation: $\min_{y = \Phi x} \|x\|_1$

Constraint-free formulation: $\min_x \|x\|_1 + \iota_{y = \Phi x}(x)$
28. ### Another Example: The Basis Pursuit

$\min_x J(x) = \underbrace{\|x\|_1}_{f} + \underbrace{\iota_{y = \Phi x}(x)}_{g}$

Neither $f$ nor $g$ is differentiable. But:
- $(\operatorname{prox}_{\|\cdot\|_1}(x))_i = \operatorname{sign}(x_i)(|x_i| - 1)_+$
- $\operatorname{prox}_{\iota_{y = \Phi x}}(x) = \Pi_{y = \Phi x}(x) = x - \Phi^+(\Phi x - y)$
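
A short NumPy sketch of the affine projection above, using the pseudo-inverse (random $\Phi$ and $y$ for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.standard_normal((20, 50))   # fat matrix, so {x : Phi x = y} is nonempty
y = Phi @ rng.standard_normal(50)

def project_affine(x, Phi, y):
    """Projection onto {x : Phi x = y}: x - Phi^+ (Phi x - y)."""
    return x - np.linalg.pinv(Phi) @ (Phi @ x - y)

x = rng.standard_normal(50)
p = project_affine(x, Phi, y)
assert np.allclose(Phi @ p, y)        # the projection satisfies the constraint
```
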
29. ### Fixed Point

$x^\star \in \operatorname{argmin} f + g$
$\Leftrightarrow 0 \in \partial f(x^\star) + \partial g(x^\star)$
$\Leftrightarrow 2x^\star \in (x^\star + \rho \partial f(x^\star)) + (x^\star + \rho \partial g(x^\star))$
$\Leftrightarrow 2x^\star \in (\operatorname{Id} + \rho \partial f)(x^\star) + (\operatorname{Id} + \rho \partial g)(x^\star)$

Idea: take $z \in (\operatorname{Id} + \rho \partial g)(x^\star)$, so that $2x^\star \in (\operatorname{Id} + \rho \partial f)(x^\star) + z$ and $x^\star = (\operatorname{Id} + \rho \partial f)^{-1}(2x^\star - z)$.

Almost a fixed point... Let $\Gamma_{\rho f} = 2(\operatorname{Id} + \rho \partial f)^{-1} - \operatorname{Id}$ (and similarly $\Gamma_{\rho g}$). Then
$2x^\star \in (\operatorname{Id} + \rho \partial f)(x^\star) + z \Leftrightarrow 2x^\star - z \in (\operatorname{Id} + \rho \partial f)(x^\star) \Leftrightarrow \Gamma_{\rho g}(z) \in (\operatorname{Id} + \rho \partial f)(x^\star)$.
In particular, $x^\star = (\operatorname{Id} + \rho \partial f)^{-1}(\Gamma_{\rho g}(z))$.
30. ### Fixed Point

$\Gamma_{\rho f} = 2(\operatorname{Id} + \rho \partial f)^{-1} - \operatorname{Id}$

$x^\star = (\operatorname{Id} + \rho \partial f)^{-1}(\Gamma_{\rho g}(z))$ and $\Gamma_{\rho g}(z) \in (\operatorname{Id} + \rho \partial f)(x^\star)$. Thus,
$z = 2x^\star - \Gamma_{\rho g}(z) = 2(\operatorname{Id} + \rho \partial f)^{-1}(\Gamma_{\rho g}(z)) - \Gamma_{\rho g}(z) = \Gamma_{\rho f}(\Gamma_{\rho g}(z))$

Fixed point: $x^\star \in \operatorname{argmin} f + g \Leftrightarrow x^\star = \operatorname{prox}_{\rho g}(z)$ with $z = (\Gamma_{\rho f} \circ \Gamma_{\rho g})(z)$.

Proposition: $\operatorname{prox}_{\rho g}(\operatorname{Fix}(\Gamma_{\rho f} \circ \Gamma_{\rho g})) = \operatorname{argmin} f + g$
31. ### Algorithm: Douglas–Rachford

$\min_x f(x) + g(x)$

Douglas–Rachford:
$x^{(n)} = \operatorname{prox}_{\rho g}(y^{(n)})$
$z^{(n)} = \operatorname{prox}_{\rho f}(2x^{(n)} - y^{(n)})$
$y^{(n+1)} = y^{(n)} + \gamma(z^{(n)} - x^{(n)})$

Proposition: If $\rho > 0$ and $1 < \gamma < 2$, then $x^{(n)}$ converges to a minimizer $x^\star$ of $f + g$. The rate of convergence is hard to derive.
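
A sketch of Douglas–Rachford on the basis pursuit splitting above (random data; $\rho$ and $\gamma$ are illustrative choices; `yk` denotes the auxiliary variable $y^{(n)}$, distinct from the observation `y`):

```python
import numpy as np

rng = np.random.default_rng(2)
Phi = rng.standard_normal((20, 50))
y = Phi @ np.concatenate([rng.standard_normal(3), np.zeros(47)])
rho, gamma = 1.0, 1.5
pinv = np.linalg.pinv(Phi)

prox_g = lambda x: x - pinv @ (Phi @ x - y)                       # projection onto {Phi x = y}
prox_f = lambda x: np.sign(x) * np.maximum(np.abs(x) - rho, 0.0)  # prox of rho * ||.||_1

yk = np.zeros(50)
for _ in range(1000):
    xk = prox_g(yk)
    zk = prox_f(2 * xk - yk)
    yk = yk + gamma * (zk - xk)
# xk now approximately solves min ||x||_1 s.t. Phi x = y
```
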
32. ### Algorithm: FISTA

Forward-Backward: $x^{(n+1)} = \operatorname{prox}_{\rho g}(x^{(n)} - \rho \nabla f(x^{(n)}))$, rate of convergence $(f + g)(x^{(n)}) - (f + g)(x^\star) = O(1/n)$.

FISTA (non-convex extrapolated updates):
$x^{(n)} = \operatorname{prox}_{\rho g}(y^{(n)} - \rho \nabla f(y^{(n)}))$
$t^{(n+1)} = \dfrac{1 + \sqrt{1 + 4 (t^{(n)})^2}}{2}$
$y^{(n+1)} = x^{(n)} + \dfrac{t^{(n)} - 1}{t^{(n+1)}}(x^{(n)} - x^{(n-1)})$

Rate of convergence: $(f + g)(x^{(n)}) - (f + g)(x^\star) = O(1/n^2)$
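
A sketch of FISTA on the same lasso setup (random data, illustrative parameters; note the extrapolation point $y^{(n)}$, held in `yk`, and the $t^{(n)}$ update):

```python
import numpy as np

rng = np.random.default_rng(3)
Phi = rng.standard_normal((50, 100))
y = Phi @ np.concatenate([rng.standard_normal(5), np.zeros(95)])
lam = 0.1
rho = 0.9 / np.linalg.norm(Phi, 2) ** 2

soft = lambda x, t: np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

x_prev = x = yk = np.zeros(100)
t = 1.0
for _ in range(500):
    x = soft(yk - rho * Phi.T @ (Phi @ yk - y), rho * lam)   # forward-backward step at yk
    t_next = (1 + np.sqrt(1 + 4 * t ** 2)) / 2
    yk = x + (t - 1) / t_next * (x - x_prev)                 # extrapolation
    x_prev, t = x, t_next
```
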
33. ### [Figure: log-log plot of $J(x^{(n)}) - J(x^\star)$ versus $n$ for FB and FISTA, with reference rates $1/n$ and $1/n^2$.]
34. ### Composite Problem

Total variation (known in 1D statistics as the taut string problem): $\min_x \tfrac{1}{2}\|y - \Phi x\|^2 + \lambda \|Dx\|_p$, with $D$ a first-order finite difference operator.

Composite problem: $\min_x f(Kx) + g(x)$ where $K$ is a linear operator. Does $\operatorname{prox}_f$ easy to compute $\Rightarrow$ $\operatorname{prox}_{f \circ K}$ easy to compute? No (except if $K$ is orthogonal).
35. ### Fenchel-Rockafellar Conjugate

Definition: $f^*(\xi) = \sup_x \langle x, \xi \rangle - f(x)$

[Figure: the graph of $f$, the linear function $x \mapsto \langle \xi, x \rangle$, and the maximal gap $-f^*(\xi)$ attained at $\hat{x}$.]

$f^{**} = f$
37. ### Back to Moreau–Yosida

Using that $(f \,\square\, g)^* = f^* + g^*$ and that $\tfrac{1}{2}\|\cdot\|^2$ is self-dual, $M[f] = (f^* + \tfrac{1}{2}\|\cdot\|^2)^*$.

A natural smoothing of the $\ell_1$ norm? $(M[\|\cdot\|_1](x))_i = \tfrac{x_i^2}{2}$ if $|x_i| \leq 1$, $|x_i| - \tfrac{1}{2}$ otherwise.

[Figure: $|\cdot|$ and its Moreau envelope $M[|\cdot|]$.]

Other choices exist, such as $\sqrt{|\cdot|^2 + \varepsilon^2}$.
38. ### The Moreau Identity

Theorem: If $f$ is convex, l.s.c. and proper, then $\operatorname{prox}_f(x) + \operatorname{prox}_{f^*}(x) = x$.

Applications:
- Generalization of the orthogonal splitting of $\mathbb{R}^N$: $\Pi_T(x) + \Pi_{T^\perp}(x) = x$
- If $\operatorname{prox}_{f^*}$ is easy to compute, then so is $\operatorname{prox}_f$.
39. ### The Moreau Identity: Norms and Balls

$\operatorname{prox}_{\lambda\|\cdot\|}(x) = x - \lambda \Pi_B(x/\lambda)$ where $B = \{x : \|x\|_* \leq 1\}$ and $\|x\|_* = \sup_{\|v\| \leq 1} \langle v, x \rangle$.

- $\ell_2$ norm: $\operatorname{prox}_{\lambda\|\cdot\|_2}(x) = \left(1 - \frac{\lambda}{\|x\|_2}\right)_+ x$
- Elastic net: $f = \|\cdot\|_1 + \tfrac{\varepsilon}{2}\|\cdot\|^2$, $\operatorname{prox}_{\lambda f}(x) = \frac{1}{1 + \lambda\varepsilon} \operatorname{prox}_{\lambda\|\cdot\|_1}(x)$
- Group regularization: $f(x) = \sum_{g \in \mathcal{G}} \|x_g\|_2$, $(\operatorname{prox}_{\lambda f}(x))_g = \left(1 - \frac{\lambda}{\|x_g\|_2}\right)_+ x_g$
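
A quick numerical check of the first identity above for the $\ell_1$ norm, whose dual ball is the $\ell_\infty$ ball (test vector and $\lambda$ are arbitrary):

```python
import numpy as np

lam = 0.7
x = np.array([1.5, -0.3, 0.9, -2.0])

# Moreau identity route: prox_{lam ||.||_1}(x) = x - lam * Pi_B(x / lam),
# where B is the unit ball of the dual norm ||.||_inf.
proj_linf_ball = lambda z: np.clip(z, -1.0, 1.0)
prox_via_moreau = x - lam * proj_linf_ball(x / lam)

# Direct route: soft thresholding.
prox_direct = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

assert np.allclose(prox_via_moreau, prox_direct)
```
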
40. ### Primal–Dual Formulation

$\min_x f(Kx) + g(x)$. Objective: remove $K$ from inside $f$.
1. Biconjugation theorem: $f(Kx) = (f^*)^*(Kx) = \sup_\xi \langle \xi, Kx \rangle - f^*(\xi)$
2. Adjoint operator: $f(Kx) = \sup_\xi \langle K^*\xi, x \rangle - f^*(\xi)$
3. Rewrite the initial problem (with a qualification assumption): $\min_x \max_\xi \langle K^*\xi, x \rangle - f^*(\xi) + g(x)$
41. ### Algorithm: Arrow–Hurwicz

$\min_x \max_\xi \langle K^*\xi, x \rangle - f^*(\xi) + g(x)$

$\xi^{(n+1)} = \operatorname{prox}_{\sigma f^*}(\xi^{(n)} + \sigma K x^{(n)})$
$x^{(n+1)} = \operatorname{prox}_{\tau g}(x^{(n)} - \tau K^* \xi^{(n+1)})$
42. ### Algorithm: Chambolle–Pock

$\min_x \max_\xi \langle K^*\xi, x \rangle - f^*(\xi) + g(x)$

$\xi^{(n+1)} = \operatorname{prox}_{\sigma f^*}(\xi^{(n)} + \sigma K \bar{x}^{(n)})$
$x^{(n+1)} = \operatorname{prox}_{\tau g}(x^{(n)} - \tau K^* \xi^{(n+1)})$
$\bar{x}^{(n+1)} = 2x^{(n+1)} - x^{(n)}$
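
A sketch of Chambolle–Pock on the 1D total-variation denoising problem of slide 34 with $\Phi = \operatorname{Id}$, i.e. $f = \lambda\|\cdot\|_1$, $K = D$, $g = \tfrac{1}{2}\|\cdot - y\|^2$. The signal and step sizes are illustrative, chosen so that $\sigma\tau\|K\|^2 < 1$:

```python
import numpy as np

rng = np.random.default_rng(4)
y = np.repeat([0.0, 1.0, -0.5], 30) + 0.1 * rng.standard_normal(90)  # noisy piecewise-constant signal
lam, sigma, tau = 0.5, 0.4, 0.4

K = lambda x: np.diff(x)                                             # D: first-order finite differences
Kt = lambda xi: np.concatenate([[-xi[0]], -np.diff(xi), [xi[-1]]])   # D^*: adjoint of the differences

prox_fstar = lambda xi: np.clip(xi, -lam, lam)   # f = lam*||.||_1  =>  f* = indicator of {||.||_inf <= lam}
prox_g = lambda x: (x + tau * y) / (1 + tau)     # g = 0.5*||. - y||^2

x = x_bar = np.zeros_like(y)
xi = np.zeros(len(y) - 1)
for _ in range(300):
    xi = prox_fstar(xi + sigma * K(x_bar))
    x_new = prox_g(x - tau * Kt(xi))
    x_bar = 2 * x_new - x
    x = x_new
# x is an approximately piecewise-constant denoised signal
```
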
43. [Image-only slide, no transcript text.]
44. ### Matrix Regularization

Orthogonal invariance: $\forall X \in \mathbb{R}^{N \times M}$, $\forall U \in \mathcal{O}_N$, $\forall V \in \mathcal{O}_M$, $F(UXV) = F(X)$.

Proposition: If $F$ is orthogonally invariant, then $F(X) = F(\operatorname{diag}(\sigma(X)))$, where $\sigma(X)$ is the vector of ordered singular values of $X$.

Absolute symmetry: $\forall x \in \mathbb{R}^N$, $\forall Q$ signed permutation, $f(Qx) = f(x)$.

Theorem: $F$ orthogonally invariant $\Leftrightarrow$ $F = f \circ \sigma$ with $f$ absolutely symmetric. Example: $\|\cdot\|_{\mathrm{nuc}} = \|\cdot\|_1 \circ \sigma$.
45. ### Transfer Principle

$F = f \circ \sigma$ orthogonally invariant. SVD: $X = V \operatorname{diag}(\sigma(X)) U^*$.

Proposition:
- $F$ convex $\Leftrightarrow$ $f$ convex
- $\partial F(X) = \{V \operatorname{diag}(\mu) U^* : \mu \in \partial f(\sigma(X))\}$
- $\operatorname{prox}_F(X) = V \operatorname{diag}(\operatorname{prox}_f(\sigma(X))) U^*$

Example (nuclear norm): $\operatorname{prox}_{\lambda\|\cdot\|_{\mathrm{nuc}}}(X) = \sum_{i=1}^{N} (\sigma(X)_i - \lambda)_+ v_i u_i^*$
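
A small NumPy sketch of the nuclear-norm prox via singular value soft thresholding (note that NumPy's `svd` returns $U, \sigma, V^*$ rather than the $V$-first notation of the slide; the test matrix is random):

```python
import numpy as np

def prox_nuclear(X, lam):
    """Singular value soft thresholding: prox of lam * nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt

rng = np.random.default_rng(5)
X = rng.standard_normal((6, 4)) @ rng.standard_normal((4, 8))  # a rank-4 test matrix
Y = prox_nuclear(X, lam=1.0)
print(np.linalg.matrix_rank(Y))  # shrinking singular values may also lower the rank
```
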
46. ### Not Covered Today

A lot of things: augmented Lagrangian methods (ADMM), parallelization and multi-objective problems, ...
47. ### Software

- C, Matlab: proximal (Parikh, Boyd), https://github.com/cvxgrp/proximal
- Matlab: TFOCS (Becker, Candès, Grant), https://github.com/cvxr/TFOCS
- Python: pyprox (Vaiter), https://github.com/svaiter/pyprox (currently being rewritten)
48. ### Selected Bibliography

- Y. Nesterov, Introductory Lectures on Convex Optimization, 2004
- N. Parikh and S. Boyd, Proximal Algorithms, 2013
- P.L. Combettes and J.-C. Pesquet, Proximal Splitting Methods in Signal Processing, 2011
- A. Beck and M. Teboulle, Gradient-Based Algorithms with Applications to Signal Recovery Problems, 2010