A First Look at Proximal Methods

SMILE meeting, Mines ParisTech, May 2015

Samuel Vaiter

May 21, 2015
Transcript

  1. A First Look at Proximal Methods
     Samuel Vaiter, CMAP, École Polytechnique, samuel.vaiter@cmap.polytechnique.fr
     May 21, 2015, Télécom ParisTech
  2. What’s the Menu?
     argmin_{x ∈ M} f(x)
     In this tutorial: M = R^N (finite dimension, Euclidean setting); f nicely convex; no global smoothness assumption.
     Fundamental ideas covered: fixed point, splitting, duality.
  3. Algorithm: Gradient Descent x(n+1) = x(n) − ρ∇f (x(n))
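
The iteration above can be sketched in a few lines of NumPy; the quadratic f(x) = (1/2)||Ax − b||², the matrix A and the step size ρ are illustrative choices, not from the slides:

```python
import numpy as np

# Gradient descent x(n+1) = x(n) - rho * grad f(x(n)) on the toy quadratic
# f(x) = 0.5 * ||A x - b||^2.  Here grad f(x) = A^T (A x - b), with
# strong convexity alpha = 1 and Lipschitz gradient beta = 4, so
# rho = 0.1 lies in the contraction range 0 < rho < 2*alpha/beta^2 = 0.125.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([2.0, -1.0])

def grad_f(x):
    return A.T @ (A @ x - b)

rho = 0.1
x = np.zeros(2)
for _ in range(500):
    x = x - rho * grad_f(x)

x_star = np.linalg.solve(A.T @ A, A.T @ b)   # closed-form minimizer (1, -1)
```
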

  4. Descent Method
     Definition. A point d_x ∈ R^N is a descent direction at x if there exists ρ_x > 0 such that for all ρ ∈ (0, ρ_x), f(x + ρd_x) < f(x).
     [Figure: f decreasing from (x(n), f(x(n))) to (x(n+1), f(x(n+1))) along the direction d_x(n).]
     Descent method: x(n+1) = x(n) + ρ_x(n) d_x(n)
  5. The Gradient Descent . . . is a Descent Method
     Proposition. If f is differentiable and ∇f(x) ≠ 0, then −∇f(x) is a descent direction.
     [Figure: the parabola t ↦ t², with the descent direction at the point (t, t²).]
  6. Convergence of the Gradient Descent
     x(n+1) = x(n) − ρ∇f(x(n))
     Proposition. If 0 < ρ < 2α/β², then x(n) converges to the unique minimizer x* of f. Moreover, there exists 0 < γ < 1 such that ||x(n) − x*|| ≤ γ^n ||x(0) − x*||.
     (Here α is the strong convexity constant of f and β the Lipschitz constant of ∇f, as in the proof below.)
  7. Proof: Fixed Point Interpretation
     Tx = x − ρ∇f(x)
     Fix T = argmin f
     T is a contraction
     T^n x(0) → x*
  8. Fix T = argmin f
     Tx = x ⇔ x = x − ρ∇f(x) ⇔ 0 = ∇f(x) ⇔ x is a minimizer of f
  9. T is a Contraction
     ||Tx − Ty||² = ||x − y||² − 2ρ⟨∇f(x) − ∇f(y), x − y⟩ [=: A1] + ρ²||∇f(x) − ∇f(y)||² [=: A2]
     A1 ≥ 2αρ||x − y||²  (strong convexity)
     A2 ≤ β²ρ²||x − y||²  (Lipschitz gradient)
     Hence ||Tx − Ty||² ≤ γ²||x − y||² with γ² = 1 − 2αρ + β²ρ², and γ < 1 ⇔ 0 < ρ < 2α/β².
  10. Towards a New Operator
     First-order optimality condition:
     f convex + smooth: 0 = ∇f(x*) ⇔ x* minimizer of f
     f convex + nonsmooth: 0 ∈ ∂f(x*) ⇔ x* minimizer of f
  11. Historical Note
     Nonlinear analysis (starting from the 1950s): convex analysis, monotone operators, nonexpansive mappings.
     People: Brézis, Fenchel, Lions, Moreau, Rockafellar, etc.
     Study of multivalued mappings on Banach spaces. Today: an application-driven approach.
  12. Reminder on Convex Analysis
     f : R^N → R̄ (extended reals)
     Convexity: f(tx + (1 − t)y) ≤ tf(x) + (1 − t)f(y)
     Lower semi-continuity: lim inf_{x → x0} f(x) ≥ f(x0)
     Proper: f(x) > −∞ everywhere and ∃x, f(x) < +∞
     HERE, convex ≡ convex, l.s.c. and proper. We also assume that argmin f is not empty.
  13. Subdifferential
     [Figure: the parabola t ↦ t² with a supporting line at (t, t²).]
     ∂f(t) = {η : ∀t′, f(t′) ≥ f(t) + ⟨η, t′ − t⟩}
  14. Towards a New Operator
     First-order optimality condition:
     f convex + smooth: 0 = ∇f(x*) ⇔ x* minimizer of f
     f convex + nonsmooth: 0 ∈ ∂f(x*) ⇔ x* minimizer of f
  15. Properties of the Subdifferential
     Definition. The subdifferential of a convex function f at x ∈ R^N is the subset of R^N defined by ∂f(x) = {η : ∀y, f(y) ≥ f(x) + ⟨η, y − x⟩}.
     Proposition.
     If f is smooth, ∂f(x) = {∇f(x)}.
     For ρ > 0, ∂(ρf)(x) = ρ∂f(x).
     ∂f(x) + ∂g(x) ⊆ ∂(f + g)(x), with equality if ri dom f ∩ ri dom g ≠ ∅.
  16. Life is Smooth: Moreau–Yosida
     Infimal convolution: (f □ g)(x) = inf_v f(v) + g(x − v)
     Definition. The Moreau–Yosida regularization of f is M[f] = f □ (1/2)||·||².
     Theorem. For any convex function f (even nonsmooth or without full domain):
     dom M[f] = R^N
     M[f] is continuously differentiable
     argmin M[f] = argmin f
  17. Proximity Operator
     Proximity operator ≡ the unique minimizer in the Moreau infimum.
     Definition. The proximity operator of a convex function f is defined by prox_f(v) = argmin_x f(x) + (1/2)||x − v||².
     Smooth interpretation (implicit gradient step): prox_f(x) = x − ∇M[f](x)
  18. Proximity ≈ Generalized Projection
     Indicator function: ι_C(x) = 0 if x ∈ C, +∞ otherwise.
     Proposition (Proximity ≡ Projection). If C is a convex set, then prox_{ι_C} = Π_C:
     prox_{ι_C}(v) = argmin_x ι_C(x) + (1/2)||x − v||² = argmin_{x ∈ C} (1/2)||x − v||² = Π_C(v)
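
As a minimal sketch of this proposition (the box constraint C = [−1, 1]^N is a hypothetical example, not from the slides), the prox of the indicator of a box is just a componentwise clip:

```python
import numpy as np

# prox of the indicator of the box C = [-1, 1]^N is the Euclidean
# projection onto C, which for a box is a componentwise clip.
def prox_indicator_box(v, lo=-1.0, hi=1.0):
    return np.clip(v, lo, hi)

p = prox_indicator_box(np.array([-3.0, 0.5, 2.0]))   # -> [-1., 0.5, 1.]
```
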
  19. Subdifferential and Proximity Operator
     Proposition. p = prox_f(v) ⇔ v − p ∈ ∂f(p).
     Resolvent of the subdifferential (as a notation): prox_f(v) = (Id + ∂f)^(-1)(v)
     Theorem. Fix prox_f = argmin f.
  20. Proximal Fixed Point
     T = prox_f
     Fix T = argmin f
     T is not a contraction in general, but T is firmly nonexpansive; T^n x → x* (Krasnosel’skii–Mann).
     Firmly nonexpansive: ||prox_f(x) − prox_f(y)||² + ||(Id − prox_f)(x) − (Id − prox_f)(y)||² ≤ ||x − y||²
  21. A First Set of Properties
     Separability: f(x, y) = f1(x) + f2(y) ⇒ prox_f(v, w) = (prox_{f1}(v), prox_{f2}(w))
     Orthogonal precomposition: f(x) = g(Ax) with AA* = Id ⇒ prox_f(v) = A* prox_g(Av)
     Affine addition: f(x) = g(x) + ⟨a, x⟩ + b ⇒ prox_f(v) = prox_g(v − a)
  22. A Concrete Example: The Lasso
     Observations: y = Φx0 + w
     Assumption: x0 sparse, i.e. Card supp(x0) ≪ N
     Variational recovery: min_x (1/2)||y − Φx||² + λ Card supp(x)
     Convex relaxation: min_x (1/2)||y − Φx||² + λ||x||_1, where ||x||_1 = Σ_i |x_i|
  23. An Idea: Splitting
     min_x J(x) = (1/2)||y − Φx||² [f] + λ||x||_1 [λg]
     J is not smooth and prox_J is hard to compute. But: f is smooth, and prox_{λg} is easy to compute.
     Soft thresholding: (prox_{λ||·||_1}(x))_i = sign(x_i)(|x_i| − λ)_+
     [Figure: the soft-thresholding map t ↦ ST(t), flat on [−λ, λ].]
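
The soft-thresholding formula above translates directly into code; this is a sketch of the closed form, with an illustrative input:

```python
import numpy as np

# Soft thresholding: the closed form of prox_{lambda ||.||_1}.
# Entries with |x_i| <= lambda are set to zero; the rest shrink by lambda.
def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

st = soft_threshold(np.array([3.0, -0.5, 1.0]), 1.0)   # -> [2., 0., 0.]
```
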
  24. Fixed Point
     x* ∈ argmin f + g
     0 ∈ ∇f(x*) + ∂g(x*)
     0 ∈ ρ∇f(x*) + ρ∂g(x*)
     0 ∈ ρ∇f(x*) − x* + x* + ρ∂g(x*)
     (Id − ρ∇f)(x*) ∈ (Id + ρ∂g)(x*)
     x* = (Id + ρ∂g)^(-1)(Id − ρ∇f)(x*)
     x* = prox_{ρg}(x* − ρ∇f(x*))
     Proposition. Tx = prox_{ρg}(x − ρ∇f(x)) satisfies Fix T = argmin f + g.
  25. Algorithm: Forward-Backward
     x(n+1) = prox_{ρg}(x(n) − ρ∇f(x(n)))  (backward step: the prox; forward step: the gradient)
     Proposition. If 0 < ρ < 1/β, then x(n) converges to a minimizer x* of f + g. Moreover, J(x(n)) − J(x*) = O(1/n).
     The convergence of ||x(n) − x*|| may be arbitrarily slow.
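
A minimal forward-backward (ISTA) sketch on a toy Lasso instance; Φ, y and λ below are illustrative data, not from the slides:

```python
import numpy as np

# Forward-backward (ISTA) for min_x 0.5*||y - Phi x||^2 + lam*||x||_1.
def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rng = np.random.default_rng(0)
Phi = rng.standard_normal((20, 5))
y = Phi @ np.array([1.0, 0.0, 0.0, -2.0, 0.0])
lam = 0.1

beta = np.linalg.norm(Phi, 2) ** 2   # Lipschitz constant of grad f
rho = 1.0 / beta                     # step size from the proposition
x = np.zeros(5)
for _ in range(2000):
    # backward (prox) applied to the forward (gradient) step
    x = soft_threshold(x - rho * Phi.T @ (Phi @ x - y), rho * lam)
```

At convergence the iterate is a fixed point of the forward-backward map, which is the optimality condition from the previous slide.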
  26. Special Cases
     x(n+1) = prox_{ρg}(x(n) − ρ∇f(x(n)))
     Gradient descent (g = 0): x(n+1) = x(n) − ρ∇f(x(n))
     Proximal point (f = 0): x(n+1) = prox_{ρg}(x(n))
     Projected gradient (g = ι_C): x(n+1) = Π_C(x(n) − ρ∇f(x(n)))
  27. Another Example: The Basis Pursuit
     Noiseless observations: y = Φx0
     Assumption: x0 sparse, i.e. Card supp(x0) ≪ N
     Variational recovery: min_{y=Φx} Card supp(x)
     Convex relaxation: min_{y=Φx} ||x||_1
     Unconstrained formulation: min_x ||x||_1 + ι_{y=Φx}(x)
  28. Another Example: The Basis Pursuit
     min_x J(x) = ||x||_1 [f] + ι_{y=Φx}(x) [g]
     Neither f nor g is differentiable. But:
     (prox_{||·||_1}(x))_i = sign(x_i)(|x_i| − 1)_+
     prox_{ι_{y=Φx}}(x) = Π_{y=Φx}(x) = x − Φ⁺(Φx − y)
  29. Fixed Point
     x* ∈ argmin f + g
     0 ∈ ∂f(x*) + ∂g(x*)
     2x* ∈ (x* + ρ∂f(x*)) + (x* + ρ∂g(x*))
     2x* ∈ (Id + ρ∂f)(x*) + (Id + ρ∂g)(x*)
     Idea: take z ∈ (Id + ρ∂g)(x*), so that 2x* ∈ (Id + ρ∂f)(x*) + z and x* = (Id + ρ∂f)^(-1)(2x* − z).
     Almost a fixed point . . . let Γ_{ρf} = 2(Id + ρ∂f)^(-1) − Id. Since x* = prox_{ρg}(z), we have 2x* − z = 2 prox_{ρg}(z) − z = Γ_{ρg}(z), hence
     2x* − z ∈ (Id + ρ∂f)(x*)
     Γ_{ρg}(z) ∈ (Id + ρ∂f)(x*)
     In particular, x* = (Id + ρ∂f)^(-1)(Γ_{ρg}(z)).
  30. Fixed Point
     Γ_{ρf} = 2(Id + ρ∂f)^(-1) − Id
     x* = (Id + ρ∂f)^(-1)(Γ_{ρg}(z)) and Γ_{ρg}(z) = 2x* − z, thus
     z = 2x* − Γ_{ρg}(z) = 2(Id + ρ∂f)^(-1)(Γ_{ρg}(z)) − Γ_{ρg}(z) = Γ_{ρf}(Γ_{ρg}(z))
     Fixed point: x* ∈ argmin f + g ⇔ x* = prox_{ρg}(z) with z = (Γ_{ρf} ∘ Γ_{ρg})(z)
     Proposition. prox_{ρg}(Fix(Γ_{ρf} ∘ Γ_{ρg})) = argmin f + g
  31. Algorithm: Douglas–Rachford
     min_x f(x) + g(x)
     Douglas–Rachford:
     x(n) = prox_{ρg}(y(n))
     z(n) = prox_{ρf}(2x(n) − y(n))
     y(n+1) = y(n) + γ(z(n) − x(n))
     Proposition. If ρ > 0 and 1 < γ < 2, then x(n) converges to a minimizer x* of f + g.
     The rate of convergence is hard to derive.
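
A sketch of Douglas–Rachford applied to the basis pursuit example, with f = ||·||_1 and g the indicator of {x : Φx = y}; the data, ρ and γ below are illustrative choices within the stated ranges:

```python
import numpy as np

# Douglas-Rachford for min ||x||_1 s.t. y = Phi x, on toy data.
rng = np.random.default_rng(2)
Phi = rng.standard_normal((8, 16))
x0 = np.zeros(16)
x0[[2, 9]] = [1.5, -2.0]
y = Phi @ x0
pinv = np.linalg.pinv(Phi)

def prox_f(x, r):                       # soft thresholding (prox of r*||.||_1)
    return np.sign(x) * np.maximum(np.abs(x) - r, 0.0)

def prox_g(x):                          # projection onto {x : Phi x = y}
    return x - pinv @ (Phi @ x - y)

rho, gamma = 1.0, 1.5
yk = np.zeros(16)
for _ in range(3000):
    xk = prox_g(yk)
    zk = prox_f(2 * xk - yk, rho)
    yk = yk + gamma * (zk - xk)
xk = prox_g(yk)                         # final (feasible) iterate
```
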
  32. Algorithm: FISTA
     Forward-Backward: x(n+1) = prox_{ρg}(x(n) − ρ∇f(x(n)))
     Rate of convergence: (f + g)(x(n)) − (f + g)(x*) = O(1/n)
     FISTA: extrapolated updates (not a convex combination)
     x(n) = prox_{ρg}(y(n) − ρ∇f(y(n)))
     t(n+1) = (1 + √(1 + 4t(n)²)) / 2
     y(n+1) = x(n) + ((t(n) − 1)/t(n+1)) (x(n) − x(n−1))
     Rate of convergence: (f + g)(x(n)) − (f + g)(x*) = O(1/n²)
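
The three FISTA updates can be sketched as follows, again on a toy Lasso instance (Φ, y and λ are illustrative):

```python
import numpy as np

# FISTA for min_x 0.5*||y - Phi x||^2 + lam*||x||_1 on toy data.
def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rng = np.random.default_rng(1)
Phi = rng.standard_normal((30, 10))
y = rng.standard_normal(30)
lam = 0.5
rho = 1.0 / np.linalg.norm(Phi, 2) ** 2

x = np.zeros(10)      # x(n-1) at loop entry
yk = x.copy()         # extrapolated point y(n)
t = 1.0
for _ in range(1000):
    x_new = soft_threshold(yk - rho * Phi.T @ (Phi @ yk - y), rho * lam)
    t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
    yk = x_new + ((t - 1.0) / t_new) * (x_new - x)   # extrapolation step
    x, t = x_new, t_new
```
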
  33. [Figure: log–log plot of J(x(n)) − J(x*) versus n, comparing FB against the 1/n rate and FISTA against the 1/n² rate.]
  34. Composite Problem
     Total variation, known in statistics (1D) as the taut string:
     min_x (1/2)||y − Φx||² + λ||Dx||_p, with D a first-order finite difference operator.
     Composite problem: min_x f(Kx) + g(x), where K is a linear operator.
     prox_f easy to compute ⇒ prox_{f∘K} easy to compute? No (except if K is orthogonal).
  35. Fenchel–Rockafellar Conjugate
     Definition. f*(ξ) = sup_x ⟨x, ξ⟩ − f(x)
     [Figure: the line x ↦ ⟨ξ, x⟩ against f(x), with the gap maximized at x̂ = argmax ⟨x, ξ⟩ − f(x); the intercept is −f*(ξ).]
  36. Biconjugate
     Theorem. If f is convex, l.s.c. and proper, then f** = f.
  37. Back to Moreau–Yosida
     Using that (f □ g)* = f* + g* and that (1/2)||·||² is self-dual:
     M[f] = (f* + (1/2)||·||²)*
     A natural smoothing of the ℓ1 norm?
     (M[||·||_1](x))_i = x_i²/2 if |x_i| ≤ 1, |x_i| − 1/2 otherwise
     [Figure: |·| and its Moreau envelope M[|·|].]
     Other choices exist, such as √(|·|² + ε²).
  38. The Moreau Identity
     Theorem. If f is convex, l.s.c. and proper, then prox_f(x) + prox_{f*}(x) = x.
     Applications:
     Generalization of the orthogonal splitting of R^N: Π_T(x) + Π_{T⊥}(x) = x
     If prox_{f*} is easy to compute, so is prox_f.
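
A quick numerical check of the identity for f = λ||·||_1 (a standard pairing: the conjugate f* is the indicator of the ℓ∞ ball of radius λ, so prox_{f*} is a clip); the test vector is illustrative:

```python
import numpy as np

# Moreau identity check: prox_f(x) + prox_{f*}(x) = x for f = lam*||.||_1.
lam = 0.7
x = np.array([2.0, -0.3, 0.9, -5.0])
prox_f = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)   # soft threshold
prox_fstar = np.clip(x, -lam, lam)                       # projection on lam-ball
```
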
  39. The Moreau Identity: Norms and Balls
     prox_{λ||·||}(x) = x − λΠ_B(x/λ), where B = {x : ||x||* ≤ 1} and ||x||* = sup_{||v|| ≤ 1} ⟨v, x⟩
     ℓ2 norm: prox_{λ||·||_2}(x) = (1 − λ/||x||_2)_+ x
     Elastic net: f = ||·||_1 + (ε/2)||·||², prox_{λf}(x) = (1/(1 + λε)) prox_{λ||·||_1}(x)
     Group regularization: f(x) = Σ_{g∈G} ||x_g||_2, (prox_{λf}(x))_g = (1 − λ/||x_g||_2)_+ x_g
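
The group-regularization formula above is block soft thresholding; a sketch with hypothetical groups:

```python
import numpy as np

# Block soft thresholding: prox of lam * sum_g ||x_g||_2, groupwise.
def prox_group(x, groups, lam):
    out = np.zeros_like(x)
    for g in groups:
        n = np.linalg.norm(x[g])
        if n > lam:                     # otherwise the whole group is zeroed
            out[g] = (1.0 - lam / n) * x[g]
    return out

p = prox_group(np.array([3.0, 4.0, 0.1, -0.1]),
               [np.array([0, 1]), np.array([2, 3])], 1.0)
# first group has norm 5 -> scaled by 0.8; second has norm < 1 -> zeroed
```
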
  40. Primal–Dual Formulation
     min_x f(Kx) + g(x). Objective: remove K inside f.
     1. Biconjugate theorem: f(Kx) = (f*)*(Kx) = sup_ξ ⟨ξ, Kx⟩ − f*(ξ)
     2. Adjoint operator: f(Kx) = sup_ξ ⟨K*ξ, x⟩ − f*(ξ)
     3. Rewrite the initial problem (under a qualification assumption):
     min_x max_ξ ⟨K*ξ, x⟩ − f*(ξ) + g(x)
  41. Algorithm: Arrow–Hurwicz
     min_x max_ξ ⟨K*ξ, x⟩ − f*(ξ) + g(x)
     ξ(n+1) = prox_{σf*}(ξ(n) + σKx(n))
     x(n+1) = prox_{τg}(x(n) − τK*ξ(n+1))
  42. Algorithm: Chambolle–Pock
     min_x max_ξ ⟨K*ξ, x⟩ − f*(ξ) + g(x)
     ξ(n+1) = prox_{σf*}(ξ(n) + σKx̄(n))
     x(n+1) = prox_{τg}(x(n) − τK*ξ(n+1))
     x̄(n+1) = 2x(n+1) − x(n)
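
A Chambolle–Pock sketch on 1-D total-variation denoising, min_x λ||Dx||_1 + (1/2)||x − b||², so that prox_{σf*} is a clip to [−λ, λ] and prox_{τg} has a closed form; the signal b and the step sizes are illustrative (one needs στ||D||² < 1, and ||D||² ≤ 4 for first-order finite differences):

```python
import numpy as np

# Chambolle-Pock for min_x lam*||D x||_1 + 0.5*||x - b||^2 (1-D TV denoising).
N = 20
b = np.concatenate([np.zeros(10), np.ones(10)]) + 0.05 * np.sin(np.arange(N))
lam = 0.5
D = np.diff(np.eye(N), axis=0)        # (N-1) x N finite differences, K = D
sigma = tau = 0.4                     # sigma*tau*||D||^2 <= 0.64 < 1

x = b.copy()
xbar = b.copy()
xi = np.zeros(N - 1)
for _ in range(2000):
    xi = np.clip(xi + sigma * (D @ xbar), -lam, lam)        # prox_{sigma f*}
    x_new = (x - tau * (D.T @ xi) + tau * b) / (1.0 + tau)  # prox_{tau g}
    xbar = 2.0 * x_new - x
    x = x_new
```
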
  43. [Image-only slide.]
  44. Matrix Regularization
     Orthogonal invariance: (∀X ∈ R^{N×M}, ∀U ∈ O_N, ∀V ∈ O_M) F(VXU) = F(X)
     Proposition. If F is orthogonally invariant, then F(X) = F(diag(σ(X))), where σ(X) are the ordered singular values of X.
     Absolute symmetry: (∀x ∈ R^N, ∀Q signed permutation) f(Qx) = f(x)
     Theorem. F orthogonally invariant ⇔ F = f ∘ σ with f absolutely symmetric.
     Example: ||·||_nuc = ||·||_1 ∘ σ
  45. Transfer Principle
     F = f ∘ σ orthogonally invariant, with SVD X = V diag(σ(X)) U.
     Proposition.
     F convex ⇔ f convex
     ∂F(X) = {V diag(µ) U : µ ∈ ∂f(σ(X))}
     prox_F(X) = V diag(prox_f(σ(X))) U
     Example (nuclear norm): prox_{λ||·||_nuc}(X) = Σ_{i=1}^N (σ(X)_i − λ)_+ u_i v_i*
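
The transfer principle in action: the prox of the nuclear norm soft-thresholds the singular values. A sketch on a toy diagonal matrix:

```python
import numpy as np

# prox of lam*||.||_nuc: SVD, then soft-threshold the singular values.
def prox_nuclear(X, lam):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt   # scale columns of U by s

P = prox_nuclear(np.diag([3.0, 1.0, 0.2]), 0.5)
# singular values 3, 1, 0.2 are shrunk to 2.5, 0.5, 0
```
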
  46. Not Covered Today
     A lot of things: augmented Lagrangian methods (ADMM), parallelization and multi-objective problems, . . .
  47. Software
     C, Matlab: proximal (Parikh, Boyd) https://github.com/cvxgrp/proximal
     Matlab: TFOCS (Becker, Candès, Grant) https://github.com/cvxr/TFOCS
     Python: pyprox (Vaiter) https://github.com/svaiter/pyprox (currently being rewritten)
  48. Selected Bibliography
     Y. Nesterov, Introductory Lectures on Convex Optimization, 2004.
     N. Parikh and S. Boyd, Proximal Algorithms, 2013.
     P.L. Combettes and J.-C. Pesquet, Proximal Splitting Methods in Signal Processing, 2011.
     A. Beck and M. Teboulle, Gradient-Based Algorithms with Applications to Signal Recovery Problems, 2010.