via the Alternating Direction Method of Multipliers. Foundations and Trends in Machine Learning, 2010. Available online for free: https://web.stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf (citations: 13,519 as of Nov. 24, 2020). See also Yuxin Chen's Princeton lecture notes for ELE 522: Large-Scale Optimization for Data Science.
(x_1) + \cdots + f_n(x_n), x = (x_1, \ldots, x_n), then the Lagrangian is separable in $x$:
$$L_i(x_i, y) = f_i(x_i) + y^\top a_{*,i}\, x_i.$$
The $x$-minimization splits into $n$ separate minimizations
$$x_i^{k+1} := \arg\min_{x_i}\; L_i(x_i, y^k), \qquad i = 1, \ldots, n,$$
which can be done in parallel, and the dual update is
$$y^{k+1} = y^k + \alpha^k \Big( \sum_{i=1}^{n} a_{*,i}\, x_i^{k+1} - b \Big).$$
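As a minimal sketch of this dual-decomposition loop, the Python snippet below solves an illustrative separable problem with scalar blocks, $f_i(x_i) = \tfrac{1}{2}(x_i - c_i)^2$ subject to $a^\top x = b$; the problem data, step size, and function name are assumptions chosen for illustration, not taken from the notes.

```python
import numpy as np

def dual_decomposition(a, b, c, alpha=0.1, n_iter=500):
    """Dual decomposition sketch for: minimize sum_i 0.5*(x_i - c_i)^2
    subject to a^T x = b (illustrative quadratic f_i, scalar blocks)."""
    y = 0.0                      # dual variable for the constraint a^T x = b
    for _ in range(n_iter):
        # x-minimization splits into n independent scalar problems:
        #   x_i = argmin_{x_i} 0.5*(x_i - c_i)^2 + y * a_i * x_i = c_i - y * a_i
        x = c - y * a
        # dual gradient-ascent step on the constraint residual a^T x - b
        y = y + alpha * (a @ x - b)
    return x, y

# illustrative data: three scalar blocks
a = np.array([1.0, 2.0, -1.0])
c = np.array([3.0, -1.0, 2.0])
x_opt, y_opt = dual_decomposition(a, b=1.0, c=c)
```

Each block update here has a closed form; in general the $n$ block solves are independent given $y^k$, which is what allows them to run in parallel.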
are convex, closed, and proper, and that the unaugmented Lagrangian $L_0$ has a saddle point. Then ADMM converges: the iterates approach feasibility, $Ax^k + Bz^k - c \to 0$, and the objective approaches the optimal value, $f(x^k) + g(z^k) \to p^\star$. Statements that are false in general: $x^k$ converges, $z^k$ converges. True statement: $y^k$ converges. What matters in practice is that the residual is small and the objective value is near optimal.
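In practice, "small residual and near-optimal objective" is turned into a stopping test on the primal residual $r^k = Ax^k + Bz^k - c$ and the dual residual $s^k = \rho A^\top B (z^k - z^{k-1})$, as in the Boyd et al. monograph cited above. The Python sketch below is one way such a test might look; the variable names and default tolerances are assumptions for illustration.

```python
import numpy as np

def admm_stopping_test(A, B, c, x, z, z_prev, y, rho,
                       eps_abs=1e-4, eps_rel=1e-3):
    """Residual-based stopping test: stop when both the primal residual
    r = A x + B z - c and the dual residual s = rho * A^T B (z - z_prev)
    fall below their (absolute + relative) tolerances."""
    r = A @ x + B @ z - c
    s = rho * (A.T @ (B @ (z - z_prev)))
    p, n = A.shape
    eps_pri = np.sqrt(p) * eps_abs + eps_rel * max(
        np.linalg.norm(A @ x), np.linalg.norm(B @ z), np.linalg.norm(c))
    eps_dual = np.sqrt(n) * eps_abs + eps_rel * np.linalg.norm(A.T @ y)
    return np.linalg.norm(r) <= eps_pri and np.linalg.norm(s) <= eps_dual
```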
to converge to high accuracy, ADMM often reaches moderate accuracy within a few dozen iterations, which is sufficient for many practical purposes.
to model a data matrix $M$ as low-rank plus sparse components:
$$\underset{L,\, S}{\text{minimize}} \;\; \|L\|_* + \lambda \|S\|_1 \qquad \text{subject to} \;\; L + S = M,$$
where $\|L\|_* := \sum_{i=1}^{n} \sigma_i(L)$ is the nuclear norm and $\|S\|_1 := \sum_{i,j} |S_{ij}|$ is the entrywise $\ell_1$-norm.
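Below is a hedged sketch of an ADMM solver for this problem, using the standard closed-form updates: singular value thresholding for the $L$-step, entrywise soft-thresholding for the $S$-step, and a dual ascent step on the constraint $L + S = M$. The function names, the fixed penalty $\rho$, and the default $\lambda = 1/\sqrt{\max(m, n)}$ are illustrative choices, not prescribed by the notes.

```python
import numpy as np

def soft_threshold(X, tau):
    # Entrywise soft-thresholding: proximal operator of tau * ||.||_1
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    # Singular value thresholding: proximal operator of tau * ||.||_*
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def robust_pca_admm(M, lam=None, rho=1.0, n_iter=200):
    # ADMM for: minimize ||L||_* + lam * ||S||_1  subject to  L + S = M
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))   # a common default (illustrative)
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                 # dual variable for L + S = M
    for _ in range(n_iter):
        # L-step: prox of the nuclear norm at M - S - Y/rho
        L = svt(M - S - Y / rho, 1.0 / rho)
        # S-step: prox of lam * ||.||_1 at M - L - Y/rho
        S = soft_threshold(M - L - Y / rho, lam / rho)
        # dual update on the constraint residual
        Y = Y + rho * (L + S - M)
    return L, S
```

With a few hundred iterations this kind of solver typically reaches moderate accuracy, consistent with the convergence behavior described above.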