
# A First Look at Proximal Methods

SMILE meeting, Mines ParisTech, May 2015

May 21, 2015

## Transcript

1. ### A First Look at Proximal Methods

Samuel Vaiter, CMAP, École Polytechnique, samuel.vaiter@cmap.polytechnique.fr. May 21, 2015, Télécom ParisTech.
2. ### What's the Menu?

$\operatorname{argmin}_{x \in \mathcal{M}} f(x)$

In this tutorial: $\mathcal{M} = \mathbb{R}^N$ (finite dimension, Euclidean setting); $f$ nicely convex; no global smoothness assumption. Fundamental ideas covered: fixed point, splitting, duality.

4. ### Descent Method

Definition: A point $d_x \in \mathbb{R}^N$ is a descent direction at $x$ if there exists $\rho_x > 0$ such that $\forall \rho \in (0, \rho_x)$, $f(x + \rho d_x) < f(x)$.

[Figure: iterates $x^{(n)}$, $x^{(n+1)}$ along the descent direction $d_{x^{(n)}}$ on the graph of $f$.]

Descent method: $x^{(n+1)} = x^{(n)} + \rho_{x^{(n)}} d_{x^{(n)}}$
5. ### The Gradient Descent... is a Descent Method

Proposition: If $f$ is differentiable and $\nabla f(x) \neq 0$, then $-\nabla f(x)$ is a descent direction.

[Figure: the parabola $t \mapsto t^2$ with a point $(t, t^2)$.]
6. ### Convergence of the Gradient Descent

$x^{(n+1)} = x^{(n)} - \rho \nabla f(x^{(n)})$

Proposition: If $0 < \rho < 2\alpha/\beta^2$ (with $f$ $\alpha$-strongly convex and $\nabla f$ $\beta$-Lipschitz), then $x^{(n)}$ converges to the unique minimizer $x^\star$ of $f$. Moreover, there exists $0 < \gamma < 1$ such that $\|x^{(n)} - x^\star\| \leq \gamma^n \|x^{(0)} - x^\star\|$.
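
As a minimal sketch of the iteration above (not from the slides; the quadratic objective, step size, and iteration count are illustrative assumptions), in NumPy:

```python
import numpy as np

def gradient_descent(grad_f, x0, rho, n_iter=200):
    """Plain gradient descent: x_{n+1} = x_n - rho * grad_f(x_n)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - rho * grad_f(x)
    return x

# Example: minimize f(x) = 0.5 * ||A x - b||^2, whose gradient is A^T (A x - b).
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, -1.0])
x_min = gradient_descent(lambda x: A.T @ (A @ x - b), x0=np.zeros(2), rho=0.1)
print(x_min)  # close to the minimizer [0.5, -1.0]
```
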
7. ### Proof

Fixed-point interpretation: $Tx = x - \rho \nabla f(x)$.
- $\operatorname{Fix} T = \operatorname{argmin} f$
- $T$ is a contraction
- $T^n x \to x^\star$
8. ### Fix T = argmin f

$Tx = x \Leftrightarrow x = x - \rho \nabla f(x) \Leftrightarrow 0 = \nabla f(x) \Leftrightarrow x$ is a minimizer of $f$.
9. ### T is a Contraction

$\|Tx - Ty\|^2 = \|x - y\|^2 \underbrace{- 2\rho \langle \nabla f(x) - \nabla f(y), x - y \rangle}_{A_1} + \underbrace{\rho^2 \|\nabla f(x) - \nabla f(y)\|^2}_{A_2}$

$A_1 \leq -2\alpha\rho \|x - y\|^2$ (strong convexity) and $A_2 \leq \beta^2\rho^2 \|x - y\|^2$ (Lipschitz gradient), hence $\|Tx - Ty\|^2 \leq \gamma^2 \|x - y\|^2$ with $\gamma^2 = 1 - 2\alpha\rho + \beta^2\rho^2$, and $\gamma < 1 \Leftrightarrow 0 < \rho < 2\alpha/\beta^2$.
10. ### Towards a New Operator

First-order optimality condition:
- $f$ convex + smooth: $0 = \nabla f(x^\star) \Leftrightarrow x^\star$ minimizer of $f$
- $f$ convex + nonsmooth: $0 \in \partial f(x^\star) \Leftrightarrow x^\star$ minimizer of $f$
11. ### Historical Note

Nonlinear analysis (starting from the 1950s): convex analysis, monotone operators, nonexpansive mappings, the study of multivalued mappings on Banach spaces. People: Brézis, Fenchel, Lions, Moreau, Rockafellar, etc. Today: an application-driven approach.
12. ### Reminder on Convex Analysis

$f : \mathbb{R}^N \to \bar{\mathbb{R}}$
- Convexity: $f(tx + (1 - t)y) \leq t f(x) + (1 - t) f(y)$
- Lower semi-continuity: $\liminf_{x \to x_0} f(x) \geq f(x_0)$
- Proper: $f(x) > -\infty$ and $\exists x, f(x) < +\infty$

HERE, convex ≡ convex, l.s.c. and proper. We also assume that $\operatorname{argmin} f$ is not empty.
13. ### Subdifferential

[Figure: the parabola $t \mapsto t^2$ with subgradients at a point $(t, t^2)$.]

$\partial f(t) = \{\eta : \forall t', \ f(t') \geq f(t) + \langle \eta, t' - t \rangle\}$
14. ### Towards a New Operator

First-order optimality condition:
- $f$ convex + smooth: $0 = \nabla f(x^\star) \Leftrightarrow x^\star$ minimizer of $f$
- $f$ convex + nonsmooth: $0 \in \partial f(x^\star) \Leftrightarrow x^\star$ minimizer of $f$
15. ### Properties of the Subdifferential

Definition: The subdifferential of a convex function $f$ at $x \in \mathbb{R}^N$ is the subset of $\mathbb{R}^N$ defined by $\partial f(x) = \{\eta : \forall x', \ f(x') \geq f(x) + \langle \eta, x' - x \rangle\}$.

Proposition:
- If $f$ is smooth, $\partial f(x) = \{\nabla f(x)\}$
- For $\rho > 0$, $\partial(\rho f)(x) = \rho \, \partial f(x)$
- $\partial f(x) + \partial g(x) \subseteq \partial(f + g)(x)$, with equality if $\operatorname{ri} \operatorname{dom} f \cap \operatorname{ri} \operatorname{dom} g \neq \emptyset$
16. ### Life is Smooth: Moreau–Yosida

Infimal convolution: $(f \,\square\, g)(x) = \inf_v f(v) + g(x - v)$

Definition: The Moreau–Yosida regularization of $f$ is defined as $M[f] = f \,\square\, \tfrac{1}{2}\|\cdot\|^2$.

Theorem: For any convex function $f$ (even nonsmooth, even without full domain),
- $\operatorname{dom} M[f] = \mathbb{R}^N$
- $M[f]$ is continuously differentiable
- $\operatorname{argmin} M[f] = \operatorname{argmin} f$
17. ### Proximity Operator

Proximity operator ≡ unique minimizer attaining the Moreau infimum.

Definition: The proximity operator of a convex function $f$ is defined by $\operatorname{prox}_f(v) = \operatorname{argmin}_x f(x) + \tfrac{1}{2}\|x - v\|^2$.

Smooth interpretation (implicit gradient step): $\operatorname{prox}_f(x) = x - \nabla M[f](x)$.
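
A small numerical sketch (not from the slides): for a quadratic $f(x) = \tfrac{1}{2}\langle Qx, x \rangle$ the prox has the closed form $(\operatorname{Id} + Q)^{-1}v$, which can be checked against the defining minimization; the matrix and point below are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

# Quadratic f(x) = 0.5 * x^T Q x; its prox at v solves (Id + Q) x = v.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
v = np.array([1.0, -2.0])

prox_closed_form = np.linalg.solve(np.eye(2) + Q, v)

# Numerical check: minimize f(x) + 0.5 * ||x - v||^2 directly.
obj = lambda x: 0.5 * x @ Q @ x + 0.5 * np.sum((x - v) ** 2)
prox_numerical = minimize(obj, x0=np.zeros(2)).x

assert np.allclose(prox_closed_form, prox_numerical, atol=1e-4)
```
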
18. ### Proximity ≈ Generalized Projection

Indicator function: $\iota_C(x) = 0$ if $x \in C$, $+\infty$ otherwise.

Proposition (Proximity ≡ Projection): If $C$ is a convex set, then $\operatorname{prox}_{\iota_C} = \Pi_C$:
$\operatorname{prox}_{\iota_C}(v) = \operatorname{argmin}_x \iota_C(x) + \tfrac{1}{2}\|x - v\|^2 = \operatorname{argmin}_{x \in C} \tfrac{1}{2}\|x - v\|^2 = \Pi_C(v)$
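
A tiny illustration of prox-as-projection (a hypothetical box constraint, not from the slides): the prox of the indicator of $C = [-1, 1]^N$ is entrywise clipping onto that box.

```python
import numpy as np

# prox of the indicator of the box C = [-1, 1]^N is the Euclidean projection onto C.
project_box = lambda x: np.clip(x, -1.0, 1.0)

x = np.array([0.3, -2.5, 1.7])
print(project_box(x))  # -> [ 0.3 -1.   1. ]
```
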
19. ### Subdifferential and Proximity Operator

Proposition: $p = \operatorname{prox}_f(v) \Leftrightarrow v - p \in \partial f(p)$

Resolvent of the subdifferential (as a notation): $\operatorname{prox}_f(v) = (\operatorname{Id} + \partial f)^{-1}(v)$

Theorem: $\operatorname{Fix} \operatorname{prox}_f = \operatorname{argmin} f$
20. ### Proximal Fixed Point

$T = \operatorname{prox}_f$
- $\operatorname{Fix} T = \operatorname{argmin} f$
- $T$ is firmly nonexpansive (not necessarily a contraction)
- $T^n x \to x^\star$ (Krasnosel'skii–Mann)

Firmly nonexpansive: $\|\operatorname{prox}_f(x) - \operatorname{prox}_f(y)\|^2 + \|(\operatorname{Id} - \operatorname{prox}_f)(x) - (\operatorname{Id} - \operatorname{prox}_f)(y)\|^2 \leq \|x - y\|^2$
21. ### A First Set of Properties

- Separability: $f(x, y) = f_1(x) + f_2(y)$, $\operatorname{prox}_f(v, w) = (\operatorname{prox}_{f_1}(v), \operatorname{prox}_{f_2}(w))$
- Orthogonal precomposition: $f(x) = g(Ax)$ with $AA^* = \operatorname{Id}$, $\operatorname{prox}_f(v) = A^* \operatorname{prox}_g(Av)$
- Affine addition: $f(x) = g(x) + \langle u, x \rangle + b$, $\operatorname{prox}_f(v) = \operatorname{prox}_g(v - u)$
22. ### A Concrete Example: The Lasso

Observations: $y = \Phi x_0 + w$. Assumption: $x_0$ sparse, i.e. $\operatorname{Card} \operatorname{supp}(x_0) \ll N$.

Variational recovery: $\min_x \tfrac{1}{2}\|y - \Phi x\|^2 + \lambda \operatorname{Card} \operatorname{supp}(x)$

Convex relaxation: $\min_x \tfrac{1}{2}\|y - \Phi x\|^2 + \lambda \|x\|_1$ where $\|x\|_1 = \sum_i |x_i|$
23. ### An Idea: Splitting

$\min_x J(x) = \underbrace{\tfrac{1}{2}\|y - \Phi x\|^2}_{f} + \underbrace{\lambda \|x\|_1}_{\lambda g}$

$J$ is not smooth / $\operatorname{prox}_J$ is hard to compute. But: $f$ is smooth and $\operatorname{prox}_{\lambda g}$ is easy to compute.

[Figure: the soft-thresholding curve $t \mapsto \operatorname{ST}(t)$, flat on $[-\lambda, \lambda]$.]

Soft thresholding: $(\operatorname{prox}_{\lambda\|\cdot\|_1}(x))_i = \operatorname{sign}(x_i)(|x_i| - \lambda)_+$
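
A direct NumPy transcription of the soft-thresholding formula above (the test vector and $\lambda$ are illustrative):

```python
import numpy as np

def soft_threshold(x, lam):
    """Entrywise soft thresholding: prox of lam * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Values inside [-0.5, 0.5] are set to zero, the others shrink toward 0.
print(soft_threshold(np.array([1.0, -0.2, 0.3, -2.0]), lam=0.5))
# -> [ 0.5 -0.   0.  -1.5]
```
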
24. ### Fixed Point

$x^\star \in \operatorname{argmin} f + g$
$\Leftrightarrow 0 \in \nabla f(x^\star) + \partial g(x^\star)$
$\Leftrightarrow 0 \in \rho \nabla f(x^\star) + \rho \partial g(x^\star)$
$\Leftrightarrow 0 \in \rho \nabla f(x^\star) - x^\star + x^\star + \rho \partial g(x^\star)$
$\Leftrightarrow (\operatorname{Id} - \rho \nabla f)(x^\star) \in (\operatorname{Id} + \rho \partial g)(x^\star)$
$\Leftrightarrow x^\star = (\operatorname{Id} + \rho \partial g)^{-1}(\operatorname{Id} - \rho \nabla f)(x^\star)$
$\Leftrightarrow x^\star = \operatorname{prox}_{\rho g}(x^\star - \rho \nabla f(x^\star))$

Proposition: With $Tx = \operatorname{prox}_{\rho g}(x - \rho \nabla f(x))$, $\operatorname{Fix} T = \operatorname{argmin} f(x) + g(x)$.
25. ### Algorithm: Forward-Backward

$x^{(n+1)} = \underbrace{\operatorname{prox}_{\rho g}}_{\text{backward}}(\underbrace{x^{(n)} - \rho \nabla f(x^{(n)})}_{\text{forward}})$

Proposition: If $0 < \rho < \tfrac{1}{\beta}$, then $x^{(n)}$ converges to a minimizer $x^\star$ of $f + g$. Moreover, $J(x^{(n)}) - J(x^\star) = O(1/n)$. The convergence of $\|x^{(n)} - x^\star\|$ may be arbitrarily slow.
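
A compact sketch of forward-backward applied to the lasso example (random $\Phi$ and $y$, illustrative $\lambda$; the step is taken just under $1/\beta$ with $\beta = \|\Phi\|^2$):

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.standard_normal((50, 100))
y = Phi @ np.concatenate([rng.standard_normal(5), np.zeros(95)])  # sparse ground truth
lam = 0.1
rho = 0.9 / np.linalg.norm(Phi, 2) ** 2  # step below 1/beta, beta = ||Phi||^2

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

x = np.zeros(100)
for _ in range(500):
    grad = Phi.T @ (Phi @ x - y)                    # forward (gradient) step on f
    x = soft_threshold(x - rho * grad, rho * lam)   # backward (prox) step on lam * ||.||_1
```
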
26. ### Special Cases

$x^{(n+1)} = \operatorname{prox}_{\rho g}(x^{(n)} - \rho \nabla f(x^{(n)}))$
- Gradient descent: $g = 0$, $x^{(n+1)} = x^{(n)} - \rho \nabla f(x^{(n)})$
- Proximal point: $f = 0$, $x^{(n+1)} = \operatorname{prox}_{\rho g}(x^{(n)})$
- Projected gradient: $g = \iota_C$, $x^{(n+1)} = \Pi_C(x^{(n)} - \rho \nabla f(x^{(n)}))$
27. ### Another Example: The Basis Pursuit

Noiseless observations: $y = \Phi x_0$. Assumption: $x_0$ sparse, i.e. $\operatorname{Card} \operatorname{supp}(x_0) \ll N$.

Variational recovery: $\min_{y = \Phi x} \operatorname{Card} \operatorname{supp}(x)$

Convex relaxation: $\min_{y = \Phi x} \|x\|_1$

Constraint-free formulation: $\min_x \|x\|_1 + \iota_{y = \Phi x}(x)$
28. ### Another Example: The Basis Pursuit

$\min_x J(x) = \underbrace{\|x\|_1}_{f} + \underbrace{\iota_{y = \Phi x}(x)}_{g}$

Neither $f$ nor $g$ is differentiable. But:
- $(\operatorname{prox}_{\|\cdot\|_1}(x))_i = \operatorname{sign}(x_i)(|x_i| - 1)_+$
- $\operatorname{prox}_{\iota_{y = \Phi x}}(x) = \Pi_{y = \Phi x}(x) = x - \Phi^+(\Phi x - y)$
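
A short NumPy sketch of the affine projection above, using the pseudo-inverse (random $\Phi$ and $y$ for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.standard_normal((20, 50))   # fat matrix, so {x : Phi x = y} is nonempty
y = Phi @ rng.standard_normal(50)

def project_affine(x, Phi, y):
    """Projection onto {x : Phi x = y}: x - Phi^+ (Phi x - y)."""
    return x - np.linalg.pinv(Phi) @ (Phi @ x - y)

x = rng.standard_normal(50)
p = project_affine(x, Phi, y)
assert np.allclose(Phi @ p, y)        # the projection satisfies the constraint
```
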
29. ### Fixed Point

$x^\star \in \operatorname{argmin} f + g$
$\Leftrightarrow 0 \in \partial f(x^\star) + \partial g(x^\star)$
$\Leftrightarrow 2x^\star \in (x^\star + \rho \partial f(x^\star)) + (x^\star + \rho \partial g(x^\star))$
$\Leftrightarrow 2x^\star \in (\operatorname{Id} + \rho \partial f)(x^\star) + (\operatorname{Id} + \rho \partial g)(x^\star)$

Idea: take $z \in (\operatorname{Id} + \rho \partial g)(x^\star)$, so that $2x^\star \in (\operatorname{Id} + \rho \partial f)(x^\star) + z$ and $x^\star = (\operatorname{Id} + \rho \partial f)^{-1}(2x^\star - z)$.

Almost a fixed point... Let $\Gamma_{\rho f} = 2(\operatorname{Id} + \rho \partial f)^{-1} - \operatorname{Id}$ (and similarly $\Gamma_{\rho g}$). Then
$2x^\star \in (\operatorname{Id} + \rho \partial f)(x^\star) + z \Leftrightarrow 2x^\star - z \in (\operatorname{Id} + \rho \partial f)(x^\star) \Leftrightarrow \Gamma_{\rho g}(z) \in (\operatorname{Id} + \rho \partial f)(x^\star)$.
In particular, $x^\star = (\operatorname{Id} + \rho \partial f)^{-1}(\Gamma_{\rho g}(z))$.
30. ### Fixed Point

$\Gamma_{\rho f} = 2(\operatorname{Id} + \rho \partial f)^{-1} - \operatorname{Id}$

$x^\star = (\operatorname{Id} + \rho \partial f)^{-1}(\Gamma_{\rho g}(z))$ and $\Gamma_{\rho g}(z) \in (\operatorname{Id} + \rho \partial f)(x^\star)$. Thus,
$z = 2x^\star - \Gamma_{\rho g}(z) = 2(\operatorname{Id} + \rho \partial f)^{-1}(\Gamma_{\rho g}(z)) - \Gamma_{\rho g}(z) = \Gamma_{\rho f}(\Gamma_{\rho g}(z))$

Fixed point: $x^\star \in \operatorname{argmin} f + g \Leftrightarrow x^\star = \operatorname{prox}_{\rho g}(z)$ with $z = (\Gamma_{\rho f} \circ \Gamma_{\rho g})(z)$.

Proposition: $\operatorname{prox}_{\rho g}(\operatorname{Fix}(\Gamma_{\rho f} \circ \Gamma_{\rho g})) = \operatorname{argmin} f + g$
31. ### Algorithm: Douglas–Rachford

$\min_x f(x) + g(x)$

Douglas–Rachford:
$x^{(n)} = \operatorname{prox}_{\rho g}(y^{(n)})$
$z^{(n)} = \operatorname{prox}_{\rho f}(2x^{(n)} - y^{(n)})$
$y^{(n+1)} = y^{(n)} + \gamma(z^{(n)} - x^{(n)})$

Proposition: If $\rho > 0$ and $1 < \gamma < 2$, then $x^{(n)}$ converges to a minimizer $x^\star$ of $f + g$. The rate of convergence is hard to derive.
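
A sketch of Douglas–Rachford on the basis pursuit splitting above (random data; $\rho$ and $\gamma$ are illustrative choices; `yk` denotes the auxiliary variable $y^{(n)}$, distinct from the observation `y`):

```python
import numpy as np

rng = np.random.default_rng(2)
Phi = rng.standard_normal((20, 50))
y = Phi @ np.concatenate([rng.standard_normal(3), np.zeros(47)])
rho, gamma = 1.0, 1.5
pinv = np.linalg.pinv(Phi)

prox_g = lambda x: x - pinv @ (Phi @ x - y)                       # projection onto {Phi x = y}
prox_f = lambda x: np.sign(x) * np.maximum(np.abs(x) - rho, 0.0)  # prox of rho * ||.||_1

yk = np.zeros(50)
for _ in range(1000):
    xk = prox_g(yk)
    zk = prox_f(2 * xk - yk)
    yk = yk + gamma * (zk - xk)
# xk now approximately solves min ||x||_1 s.t. Phi x = y
```
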
32. ### Algorithm: FISTA

Forward-Backward: $x^{(n+1)} = \operatorname{prox}_{\rho g}(x^{(n)} - \rho \nabla f(x^{(n)}))$, rate of convergence $(f + g)(x^{(n)}) - (f + g)(x^\star) = O(1/n)$.

FISTA (non-convex extrapolated updates):
$x^{(n)} = \operatorname{prox}_{\rho g}(y^{(n)} - \rho \nabla f(y^{(n)}))$
$t^{(n+1)} = \dfrac{1 + \sqrt{1 + 4 (t^{(n)})^2}}{2}$
$y^{(n+1)} = x^{(n)} + \dfrac{t^{(n)} - 1}{t^{(n+1)}}(x^{(n)} - x^{(n-1)})$

Rate of convergence: $(f + g)(x^{(n)}) - (f + g)(x^\star) = O(1/n^2)$
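
A sketch of FISTA on the same lasso setup (random data, illustrative parameters; note the extrapolation point $y^{(n)}$, held in `yk`, and the $t^{(n)}$ update):

```python
import numpy as np

rng = np.random.default_rng(3)
Phi = rng.standard_normal((50, 100))
y = Phi @ np.concatenate([rng.standard_normal(5), np.zeros(95)])
lam = 0.1
rho = 0.9 / np.linalg.norm(Phi, 2) ** 2

soft = lambda x, t: np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

x_prev = x = yk = np.zeros(100)
t = 1.0
for _ in range(500):
    x = soft(yk - rho * Phi.T @ (Phi @ yk - y), rho * lam)   # forward-backward step at yk
    t_next = (1 + np.sqrt(1 + 4 * t ** 2)) / 2
    yk = x + (t - 1) / t_next * (x - x_prev)                 # extrapolation
    x_prev, t = x, t_next
```
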
33. ### [Figure: log-log plot of $J(x^{(n)}) - J(x^\star)$ versus $n$ for FB and FISTA, with reference rates $1/n$ and $1/n^2$.]
34. ### Composite Problem

Total variation (known in 1D statistics as the taut string problem): $\min_x \tfrac{1}{2}\|y - \Phi x\|^2 + \lambda \|Dx\|_p$, with $D$ a first-order finite difference operator.

Composite problem: $\min_x f(Kx) + g(x)$ where $K$ is a linear operator. Does $\operatorname{prox}_f$ easy to compute $\Rightarrow$ $\operatorname{prox}_{f \circ K}$ easy to compute? No (except if $K$ is orthogonal).
35. ### Fenchel-Rockafellar Conjugate

Definition: $f^*(\xi) = \sup_x \langle x, \xi \rangle - f(x)$

[Figure: the graph of $f$, the linear function $x \mapsto \langle \xi, x \rangle$, and the maximal gap $-f^*(\xi)$ attained at $\hat{x}$.]

$f^{**} = f$
37. ### Back to Moreau–Yosida

Using that $(f \,\square\, g)^* = f^* + g^*$ and that $\tfrac{1}{2}\|\cdot\|^2$ is self-dual, $M[f] = (f^* + \tfrac{1}{2}\|\cdot\|^2)^*$.

A natural smoothing of the $\ell_1$ norm? $(M[\|\cdot\|_1](x))_i = \tfrac{x_i^2}{2}$ if $|x_i| \leq 1$, $|x_i| - \tfrac{1}{2}$ otherwise.

[Figure: $|\cdot|$ and its Moreau envelope $M[|\cdot|]$.]

Other choices exist, such as $\sqrt{|\cdot|^2 + \varepsilon^2}$.
38. ### The Moreau Identity

Theorem: If $f$ is convex, l.s.c. and proper, then $\operatorname{prox}_f(x) + \operatorname{prox}_{f^*}(x) = x$.

Applications:
- Generalization of the orthogonal splitting of $\mathbb{R}^N$: $\Pi_T(x) + \Pi_{T^\perp}(x) = x$
- If $\operatorname{prox}_{f^*}$ is easy to compute, then so is $\operatorname{prox}_f$.
39. ### The Moreau Identity: Norms and Balls

$\operatorname{prox}_{\lambda\|\cdot\|}(x) = x - \lambda \Pi_B(x/\lambda)$ where $B = \{x : \|x\|_* \leq 1\}$ and $\|x\|_* = \sup_{\|v\| \leq 1} \langle v, x \rangle$.

- $\ell_2$ norm: $\operatorname{prox}_{\lambda\|\cdot\|_2}(x) = \left(1 - \frac{\lambda}{\|x\|_2}\right)_+ x$
- Elastic net: $f = \|\cdot\|_1 + \tfrac{\varepsilon}{2}\|\cdot\|^2$, $\operatorname{prox}_{\lambda f}(x) = \frac{1}{1 + \lambda\varepsilon} \operatorname{prox}_{\lambda\|\cdot\|_1}(x)$
- Group regularization: $f(x) = \sum_{g \in \mathcal{G}} \|x_g\|_2$, $(\operatorname{prox}_{\lambda f}(x))_g = \left(1 - \frac{\lambda}{\|x_g\|_2}\right)_+ x_g$
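
A quick numerical check of the first identity above for the $\ell_1$ norm, whose dual ball is the $\ell_\infty$ ball (test vector and $\lambda$ are arbitrary):

```python
import numpy as np

lam = 0.7
x = np.array([1.5, -0.3, 0.9, -2.0])

# Moreau identity route: prox_{lam ||.||_1}(x) = x - lam * Pi_B(x / lam),
# where B is the unit ball of the dual norm ||.||_inf.
proj_linf_ball = lambda z: np.clip(z, -1.0, 1.0)
prox_via_moreau = x - lam * proj_linf_ball(x / lam)

# Direct route: soft thresholding.
prox_direct = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

assert np.allclose(prox_via_moreau, prox_direct)
```
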
40. ### Primal–Dual Formulation

$\min_x f(Kx) + g(x)$. Objective: remove $K$ from inside $f$.
1. Biconjugation theorem: $f(Kx) = (f^*)^*(Kx) = \sup_\xi \langle \xi, Kx \rangle - f^*(\xi)$
2. Adjoint operator: $f(Kx) = \sup_\xi \langle K^*\xi, x \rangle - f^*(\xi)$
3. Rewrite the initial problem (with a qualification assumption): $\min_x \max_\xi \langle K^*\xi, x \rangle - f^*(\xi) + g(x)$
41. ### Algorithm: Arrow–Hurwicz

$\min_x \max_\xi \langle K^*\xi, x \rangle - f^*(\xi) + g(x)$

$\xi^{(n+1)} = \operatorname{prox}_{\sigma f^*}(\xi^{(n)} + \sigma K x^{(n)})$
$x^{(n+1)} = \operatorname{prox}_{\tau g}(x^{(n)} - \tau K^* \xi^{(n+1)})$
42. ### Algorithm: Chambolle–Pock

$\min_x \max_\xi \langle K^*\xi, x \rangle - f^*(\xi) + g(x)$

$\xi^{(n+1)} = \operatorname{prox}_{\sigma f^*}(\xi^{(n)} + \sigma K \bar{x}^{(n)})$
$x^{(n+1)} = \operatorname{prox}_{\tau g}(x^{(n)} - \tau K^* \xi^{(n+1)})$
$\bar{x}^{(n+1)} = 2x^{(n+1)} - x^{(n)}$
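
A sketch of Chambolle–Pock on the 1D total-variation denoising problem of slide 34 with $\Phi = \operatorname{Id}$, i.e. $f = \lambda\|\cdot\|_1$, $K = D$, $g = \tfrac{1}{2}\|\cdot - y\|^2$. The signal and step sizes are illustrative, chosen so that $\sigma\tau\|K\|^2 < 1$:

```python
import numpy as np

rng = np.random.default_rng(4)
y = np.repeat([0.0, 1.0, -0.5], 30) + 0.1 * rng.standard_normal(90)  # noisy piecewise-constant signal
lam, sigma, tau = 0.5, 0.4, 0.4

K = lambda x: np.diff(x)                                             # D: first-order finite differences
Kt = lambda xi: np.concatenate([[-xi[0]], -np.diff(xi), [xi[-1]]])   # D^*: adjoint of the differences

prox_fstar = lambda xi: np.clip(xi, -lam, lam)   # f = lam*||.||_1  =>  f* = indicator of {||.||_inf <= lam}
prox_g = lambda x: (x + tau * y) / (1 + tau)     # g = 0.5*||. - y||^2

x = x_bar = np.zeros_like(y)
xi = np.zeros(len(y) - 1)
for _ in range(300):
    xi = prox_fstar(xi + sigma * K(x_bar))
    x_new = prox_g(x - tau * Kt(xi))
    x_bar = 2 * x_new - x
    x = x_new
# x is an approximately piecewise-constant denoised signal
```
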
43. [Image-only slide, no transcript text.]
44. ### Matrix Regularization

Orthogonal invariance: $\forall X \in \mathbb{R}^{N \times M}$, $\forall U \in \mathcal{O}_N$, $\forall V \in \mathcal{O}_M$, $F(UXV) = F(X)$.

Proposition: If $F$ is orthogonally invariant, then $F(X) = F(\operatorname{diag}(\sigma(X)))$, where $\sigma(X)$ is the vector of ordered singular values of $X$.

Absolute symmetry: $\forall x \in \mathbb{R}^N$, $\forall Q$ signed permutation, $f(Qx) = f(x)$.

Theorem: $F$ orthogonally invariant $\Leftrightarrow$ $F = f \circ \sigma$ with $f$ absolutely symmetric. Example: $\|\cdot\|_{\mathrm{nuc}} = \|\cdot\|_1 \circ \sigma$.
45. ### Transfer Principle

$F = f \circ \sigma$ orthogonally invariant. SVD: $X = V \operatorname{diag}(\sigma(X)) U^*$.

Proposition:
- $F$ convex $\Leftrightarrow$ $f$ convex
- $\partial F(X) = \{V \operatorname{diag}(\mu) U^* : \mu \in \partial f(\sigma(X))\}$
- $\operatorname{prox}_F(X) = V \operatorname{diag}(\operatorname{prox}_f(\sigma(X))) U^*$

Example (nuclear norm): $\operatorname{prox}_{\lambda\|\cdot\|_{\mathrm{nuc}}}(X) = \sum_{i=1}^{N} (\sigma(X)_i - \lambda)_+ v_i u_i^*$
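
A small NumPy sketch of the nuclear-norm prox via singular value soft thresholding (note that NumPy's `svd` returns $U, \sigma, V^*$ rather than the $V$-first notation of the slide; the test matrix is random):

```python
import numpy as np

def prox_nuclear(X, lam):
    """Singular value soft thresholding: prox of lam * nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt

rng = np.random.default_rng(5)
X = rng.standard_normal((6, 4)) @ rng.standard_normal((4, 8))  # a rank-4 test matrix
Y = prox_nuclear(X, lam=1.0)
print(np.linalg.matrix_rank(Y))  # shrinking singular values may also lower the rank
```
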
46. ### Not Covered Today

A lot of things: augmented Lagrangian methods (ADMM), parallelization and multi-objective problems, ...
47. ### Software

- C, Matlab: proximal (Parikh, Boyd), https://github.com/cvxgrp/proximal
- Matlab: TFOCS (Becker, Candès, Grant), https://github.com/cvxr/TFOCS
- Python: pyprox (Vaiter), https://github.com/svaiter/pyprox (currently being rewritten)
48. ### Selected Bibliography

- Y. Nesterov, Introductory Lectures on Convex Optimization, 2004
- N. Parikh and S. Boyd, Proximal Algorithms, 2013
- P.L. Combettes and J.-C. Pesquet, Proximal Splitting Methods in Signal Processing, 2011
- A. Beck and M. Teboulle, Gradient-Based Algorithms with Applications to Signal Recovery Problems, 2010