Slide 1

Slide 1 text

Alternating Direction Method of Multipliers
ELEC5470/IEDA6100A - Convex Optimization
A talk by Vinícius and Prof. Daniel Palomar
The Hong Kong University of Science and Technology
December 2020

Slide 2

Slide 2 text

Contents
1. Introduction: optimization algorithms, motivation
2. Alternating Direction Method of Multipliers: the basics
3. Practical Examples: Robust PCA and Graphical Lasso

Slide 3

Slide 3 text

Why use optimization algorithms?

Slide 4

Slide 4 text

Motivations
- large-scale optimization: machine learning/statistics with huge datasets, computer vision
- decentralized optimization: entities/agents/threads coordinate to solve a large problem by passing small messages

Slide 5

Slide 5 text

Optimization Algorithms
- Gradient Descent
- Newton
- Interior Point Methods (IPM)
- Block Coordinate Descent (BCD)
- Majorization-Minimization (MM)
- Block Majorization-Minimization (BMM)
- Successive Convex Approximation (SCA)

Slide 6

Slide 6 text

Optimization Algorithms
- Gradient Descent
- Newton
- Interior Point Methods (IPM)
- Block Coordinate Descent (BCD)
- Majorization-Minimization (MM)
- Block Majorization-Minimization (BMM)
- Successive Convex Approximation (SCA)
- ...

Slide 7

Slide 7 text

Optimization Algorithms
- Gradient Descent
- Newton
- Interior Point Methods (IPM)
- Block Coordinate Descent (BCD)
- Majorization-Minimization (MM)
- Block Majorization-Minimization (BMM)
- Successive Convex Approximation (SCA)
- ...
- Alternating Direction Method of Multipliers (ADMM)

Slide 8

Slide 8 text

Reference
Boyd et al., Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, Foundations and Trends in Machine Learning, 2010.
- available online for free: https://web.stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf
- citations: 13,519 (as of Nov. 24th, 2020)
Yuxin Chen's Princeton lecture notes, ELE 522: Large-Scale Optimization for Data Science

Slide 9

Slide 9 text

Dual Problem
convex equality-constrained optimization problem:
    minimize_x  f(x)
    subject to  Ax = b
Lagrangian: L(x, y) = f(x) + y⊤(Ax − b)
dual function: g(y) = inf_x L(x, y)
dual problem: maximize_y g(y)
recover the primal solution: x⋆ = argmin_x L(x, y⋆)
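As a quick worked example (my own addition, not on the slide): take f(x) = ½∥x∥²₂, for which the inner minimization has a closed form.

```latex
% Worked example (assumption: f(x) = \tfrac{1}{2}\|x\|_2^2)
% L(x, y) = \tfrac{1}{2}\|x\|_2^2 + y^\top (Ax - b)
% \nabla_x L = x + A^\top y = 0 \;\Rightarrow\; x = -A^\top y
g(y) = \inf_x L(x, y) = -\tfrac{1}{2}\, y^\top A A^\top y - b^\top y
% so the dual problem is an unconstrained concave quadratic maximization,
% and the primal solution is recovered as x^\star = -A^\top y^\star.
```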

Slide 10

Slide 10 text

Dual Ascent

Slide 11

Slide 11 text

Dual Ascent
gradient method for the dual problem: y^{k+1} = y^k + ρ^k ∇g(y^k)
∇g(y^k) = Ax^{k+1} − b, where x^{k+1} = argmin_x L(x, y^k)
the dual ascent method is:
    x^{k+1} := argmin_x L(x, y^k)
    y^{k+1} := y^k + ρ^k (Ax^{k+1} − b)

Slide 12

Slide 12 text

Dual Ascent
gradient method for the dual problem: y^{k+1} = y^k + ρ^k ∇g(y^k)
∇g(y^k) = Ax^{k+1} − b, where x^{k+1} = argmin_x L(x, y^k)    (why?)
the dual ascent method is:
    x^{k+1} := argmin_x L(x, y^k)
    y^{k+1} := y^k + ρ^k (Ax^{k+1} − b)
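A minimal numerical sketch of dual ascent (my own illustration, not from the slides), assuming the toy objective f(x) = ½∥x∥²₂ from the worked example above so that the x-update is x = −A⊤y:

```python
import numpy as np

# Dual ascent sketch for: minimize (1/2)||x||^2  subject to  Ax = b
# (assumed toy objective so that argmin_x L(x, y) has a closed form)
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 10))
b = rng.standard_normal(3)

y = np.zeros(3)                          # dual variable
rho = 1.0 / np.linalg.norm(A, 2) ** 2    # fixed step size rho^k = 1 / lambda_max(A A^T)
for k in range(500):
    x = -A.T @ y                  # x^{k+1} := argmin_x L(x, y^k) = -A^T y^k
    y = y + rho * (A @ x - b)     # y^{k+1} := y^k + rho^k (A x^{k+1} - b)

print("feasibility ||Ax - b||:", np.linalg.norm(A @ x - b))
```

The step size 1/λ_max(AA⊤) guarantees convergence for this quadratic dual; in general, dual ascent requires fairly strong assumptions on f and careful step-size selection.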

Slide 13

Slide 13 text

Dual Decomposition

Slide 14

Slide 14 text

Dual Decomposition
suppose f is separable: f(x) = f₁(x₁) + · · · + f_n(x_n), with x = (x₁, . . . , x_n)
then the Lagrangian is separable in x (up to the constant term −y⊤b):
    L_i(x_i, y) = f_i(x_i) + y⊤ a_{*,i} x_i
the x-minimization splits into n separate minimizations,
    x_i^{k+1} := argmin_{x_i} L_i(x_i, y^k),   i = 1, . . . , n,
which can be done in parallel, and
    y^{k+1} = y^k + α^k (Σ_{i=1}^{n} a_{*,i} x_i^{k+1} − b)
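A short sketch of the scatter/gather pattern (my own illustration, again with the toy separable objective f_i(x_i) = ½x_i², so each block update is in closed form):

```python
import numpy as np

# Dual decomposition sketch for: minimize sum_i (1/2) x_i^2  subject to  A x = b.
# Each x_i couples to the constraint only through the column a_{*,i} of A.
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 10))
b = rng.standard_normal(3)

y = np.zeros(3)
alpha = 1.0 / np.linalg.norm(A, 2) ** 2   # constant step size alpha^k (assumed)
for k in range(1000):
    # scatter: each x_i := argmin_{x_i} (1/2) x_i^2 + y^T a_{*,i} x_i = -a_{*,i}^T y
    # (this loop could run on separate workers; here it is just sequential)
    x = np.array([-A[:, i] @ y for i in range(A.shape[1])])
    # gather: y^{k+1} = y^k + alpha^k (sum_i a_{*,i} x_i^{k+1} - b)
    y = y + alpha * (A @ x - b)

print("residual ||Ax - b||:", np.linalg.norm(A @ x - b))
```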

Slide 15

Slide 15 text

Optimization Problem
    minimize_{x, z}  f(x) + g(z)
    subject to       Ax + Bz = c      (1)
variables: x ∈ R^n and z ∈ R^m
parameters: A ∈ R^{p×n}, B ∈ R^{p×m}, and c ∈ R^p
optimal value: p⋆ = inf_{x,z} { f(x) + g(z) : Ax + Bz = c }

Slide 16

Slide 16 text

Augmented Lagrangian Method

Slide 17

Slide 17 text

Augmented Lagrangian Method
Augmented Lagrangian (the ordinary Lagrangian plus a quadratic penalty):
    L_ρ(x, z, y) = f(x) + g(z) + ⟨y, Ax + Bz − c⟩ + (ρ/2) ∥Ax + Bz − c∥²₂
ALM consists of the iterations:
    (x^{k+1}, z^{k+1}) := argmin_{x,z} L_ρ(x, z, y^k)       // primal update
    y^{k+1} := y^k + ρ (Ax^{k+1} + Bz^{k+1} − c)            // dual update
ρ > 0 is a penalty hyperparameter

Slide 18

Slide 18 text

Issues with the Augmented Lagrangian Method
- the primal step is often expensive to solve, as expensive as solving the original problem
- the minimization over x and z has to be done jointly

Slide 19

Slide 19 text

Alternating Direction Method of Multipliers

Slide 20

Slide 20 text

Alternating Direction Method of Multipliers
Augmented Lagrangian:
    L_ρ(x, z, y) = f(x) + g(z) + ⟨y, Ax + Bz − c⟩ + (ρ/2) ∥Ax + Bz − c∥²₂
ADMM consists of the iterations:
    x^{k+1} := argmin_x L_ρ(x, z^k, y^k)
    z^{k+1} := argmin_z L_ρ(x^{k+1}, z, y^k)
    y^{k+1} := y^k + ρ (Ax^{k+1} + Bz^{k+1} − c)
ρ > 0 is a penalty hyperparameter
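To make the template concrete, here is a short sketch (my own toy instance, not from the slides) of ADMM for the lasso, i.e., f(x) = ½∥Dx − d∥²₂, g(z) = λ∥z∥₁, with A = I, B = −I, c = 0:

```python
import numpy as np

# ADMM sketch for the lasso: minimize (1/2)||D x - d||^2 + lam*||z||_1  s.t.  x - z = 0
# (D, d, lam, rho are illustrative choices)
rng = np.random.default_rng(2)
D = rng.standard_normal((50, 20))
d = rng.standard_normal(50)
lam, rho = 0.5, 1.0

x = np.zeros(20)
z = np.zeros(20)
y = np.zeros(20)
DtD_rhoI = D.T @ D + rho * np.eye(20)   # in practice, factor this once and reuse it
for k in range(200):
    # x-update: argmin_x (1/2)||Dx - d||^2 + y^T(x - z) + (rho/2)||x - z||^2
    x = np.linalg.solve(DtD_rhoI, D.T @ d + rho * z - y)
    # z-update: prox of lam*||.||_1 at x + y/rho, i.e., soft-thresholding
    v = x + y / rho
    z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
    # dual update
    y = y + rho * (x - z)
```

Note that each subproblem is cheap here (one linear solve and one elementwise threshold), which is exactly the payoff of splitting the variables instead of minimizing over (x, z) jointly.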

Slide 21

Slide 21 text

Convergence and Stopping Criteria
assume (very little!):
- f, g are convex, closed, and proper
- L_0 has a saddle point
then ADMM converges:
- the iterates approach feasibility: Ax^k + Bz^k − c → 0
- the objective approaches the optimal value: f(x^k) + g(z^k) → p⋆
false (in general) statements: x converges, z converges
true statement: y converges
what matters: the residual is small and the objective value is near optimal
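For reference, the standard stopping test from the Boyd et al. monograph cited earlier monitors the primal and dual residuals (a sketch; this criterion is not spelled out on the slide):

```latex
% primal and dual residuals at iteration k:
r^{k} = A x^{k} + B z^{k} - c, \qquad
s^{k} = \rho\, A^{\top} B \,\bigl(z^{k} - z^{k-1}\bigr)
% terminate when \|r^{k}\|_2 \le \epsilon^{\mathrm{pri}} and \|s^{k}\|_2 \le \epsilon^{\mathrm{dual}}
```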

Slide 22

Slide 22 text

Convergence of ADMM in Practice
- ADMM is often slow to converge to high accuracy
- ADMM often reaches moderate accuracy within a few dozen iterations, which is sufficient for most practical purposes

Slide 23

Slide 23 text

Practical Examples

Slide 24

Slide 24 text

Robust PCA (Candès et al. '08)
We would like to model a data matrix M as the sum of low-rank and sparse components:
    minimize_{L, S}  ∥L∥_* + λ ∥S∥₁
    subject to       L + S = M
where ∥L∥_* := Σ_{i=1}^{n} σ_i(L) is the nuclear norm and ∥S∥₁ := Σ_{i,j} |S_{ij}| is the entrywise ℓ₁-norm

Slide 25

Slide 25 text

Robust PCA via ADMM
ADMM for solving robust PCA:
    L^{k+1} = argmin_L  ∥L∥_* + tr(Y^{k⊤} L) + (ρ/2) ∥L + S^k − M∥²_F
    S^{k+1} = argmin_S  λ ∥S∥₁ + tr(Y^{k⊤} S) + (ρ/2) ∥L^{k+1} + S − M∥²_F
    Y^{k+1} = Y^k + ρ (L^{k+1} + S^{k+1} − M)

Slide 26

Slide 26 text

Robust PCA via ADMM
    L^{k+1} = SVT_{ρ^{-1}}(M − S^k − (1/ρ) Y^k)
    S^{k+1} = ST_{λρ^{-1}}(M − L^{k+1} − (1/ρ) Y^k)
    Y^{k+1} = Y^k + ρ (L^{k+1} + S^{k+1} − M)
where, for any X with SVD X = UΣV⊤, Σ = diag({σ_i}), the singular value thresholding operator is
    SVT_τ(X) = U diag((σ_i − τ)₊) V⊤
and the elementwise soft-thresholding operator is
    (ST_τ(X))_{ij} = X_{ij} − τ  if X_{ij} > τ;   0  if |X_{ij}| ≤ τ;   X_{ij} + τ  if X_{ij} < −τ
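A compact sketch of these updates (my own synthetic data; the choices of λ and ρ are illustrative, not prescribed by the slides):

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding: SVT_tau(X) = U diag((sigma_i - tau)_+) V^T."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def soft(X, tau):
    """Entrywise soft-thresholding ST_tau(X)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

# Robust PCA via ADMM on synthetic low-rank + sparse data
rng = np.random.default_rng(3)
n = 50
M = rng.standard_normal((n, 2)) @ rng.standard_normal((2, n))   # low-rank part
M[rng.random((n, n)) < 0.05] += 10.0                            # sparse corruptions
lam, rho = 1.0 / np.sqrt(n), 1.0

L = np.zeros((n, n))
S = np.zeros((n, n))
Y = np.zeros((n, n))
for k in range(100):
    L = svt(M - S - Y / rho, 1.0 / rho)       # L-update
    S = soft(M - L - Y / rho, lam / rho)      # S-update
    Y = Y + rho * (L + S - M)                 # dual update
```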

Slide 27

Slide 27 text

Graphical Lasso
Precision matrix estimation from Gaussian samples:
    minimize_Θ   −log det Θ + ⟨Θ, S⟩ + λ ∥Θ∥₁
    subject to   Θ ≻ 0
(the first two terms are the Gaussian negative log-likelihood)
Or, equivalently, using a slack variable Ψ = Θ:
    minimize_{Θ, Ψ}  −log det Θ + ⟨Θ, S⟩ + λ ∥Ψ∥₁
    subject to       Θ ≻ 0,  Θ = Ψ

Slide 28

Slide 28 text

Graphical Lasso via ADMM
    Θ^{k+1} = argmin_{Θ ≻ 0}  −log det Θ + ⟨Θ, S + Y^k⟩ + (ρ/2) ∥Θ − Ψ^k∥²_F
    Ψ^{k+1} = argmin_Ψ  λ ∥Ψ∥₁ − ⟨Ψ, Y^k⟩ + (ρ/2) ∥Θ^{k+1} − Ψ∥²_F
    Y^{k+1} = Y^k + ρ (Θ^{k+1} − Ψ^{k+1})

Slide 29

Slide 29 text

Graphical Lasso via ADMM
    Θ^{k+1} = F_ρ(Ψ^k − (1/ρ)(Y^k + S))
    Ψ^{k+1} = ST_{λρ^{-1}}(Θ^{k+1} + (1/ρ) Y^k)
    Y^{k+1} = Y^k + ρ (Θ^{k+1} − Ψ^{k+1})
where F_ρ(X) := (1/2) U diag(λ_i + √(λ_i² + 4/ρ)) U⊤, for the eigendecomposition X = UΛU⊤, Λ = diag({λ_i}).
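A corresponding sketch (my own synthetic data; λ and ρ are illustrative):

```python
import numpy as np

def soft(X, tau):
    """Entrywise soft-thresholding ST_tau(X)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def F_rho(X, rho):
    """Theta-update: (1/2) U diag(w_i + sqrt(w_i^2 + 4/rho)) U^T for X = U diag(w) U^T."""
    w, U = np.linalg.eigh(X)
    theta = (w + np.sqrt(w**2 + 4.0 / rho)) / 2.0   # strictly positive, so Theta > 0
    return (U * theta) @ U.T

# Graphical lasso via ADMM; S is a sample covariance of synthetic Gaussian samples
rng = np.random.default_rng(4)
p, n = 10, 200
X = rng.standard_normal((n, p))
S = np.cov(X, rowvar=False)
lam, rho = 0.1, 1.0

Theta = np.eye(p)
Psi = np.eye(p)
Y = np.zeros((p, p))
for k in range(200):
    Theta = F_rho(Psi - (Y + S) / rho, rho)     # Theta-update (closed form via eigendecomposition)
    Psi = soft(Theta + Y / rho, lam / rho)      # Psi-update (elementwise soft-thresholding)
    Y = Y + rho * (Theta - Psi)                 # dual update
```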

Slide 30

Slide 30 text

Network of stocks via Graphical Lasso

Slide 31

Slide 31 text

Network of stocks via Graphical Lasso
[figure: two convergence plots versus CPU time in seconds, showing the Lagrangian value and ∥s^l∥²₂]

Slide 32

Slide 32 text

Conclusion
- ADMM is a versatile/flexible optimization framework
- it may not be the best method for a specific problem, but it often performs well in practice
- convergence often needs to be proved on a case-by-case basis

Slide 33

Slide 33 text

Questions?