Taiji Suzuki
June 18, 2023

# Nonconvex Stochastic Optimization and Optimization in Reproducing Kernel Hilbert Spaces

Part II of "Optimization Theory and Learning-Theoretic Aspects in Machine Learning" @ Combinatorial Optimization Seminar 2020


## Transcript

1. ### Optimization Theory and Learning-Theoretic Aspects in Machine Learning. Part II: Nonconvex Stochastic Optimization and Optimization in Reproducing Kernel Hilbert Spaces. Taiji Suzuki (Department of Mathematical Informatics, Graduate School of Information Science and Technology, The University of Tokyo / RIKEN AIP). August 6, 2020, Combinatorial Optimization Seminar 2020 (COSS2020).

5. ### SGD for nonconvex functions

Objective: L(x) = E_z[ℓ(z, x)] (bounded below; set L* = inf_x L(x)).

SGD: observe z_t ∼ P(Z) and set ℓ_t(x) := ℓ(z_t, x); pick g_t ∈ ∂_x ℓ_t(x_{t−1}); update x_t = x_{t−1} − η_t g_t.

Assumptions: (A1) L is γ-smooth; (A2) E[∥g_t − E[g_t]∥²] = σ² (the variance of the stochastic gradient is σ²).

Taking η_t = min{ D̃/(σ√T), 1/γ } gives (Ghadimi and Lan, 2013)

min_{1≤t≤T} E[∥∇L(x_t)∥²] ≤ (γσ/√T)(D_f²/D̃ + D̃) + γ²D_f²/T,

where D_f = √(2(L(x_1) − L*)/γ). (This guarantees that the gradient converges to 0.) Instead of the min over t on the left-hand side, one may also draw t̂ ∈ {1, …, T} uniformly at random and bound E[∥∇L(x_{t̂})∥²].
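The scheme above can be sketched as follows: SGD with the Ghadimi–Lan step size η = min{D̃/(σ√T), 1/γ}, run on a toy nonconvex objective L(x) = x⁴/4 − x²/2 with Gaussian gradient noise. The objective, noise model, and all constants are illustrative assumptions, not from the slides.

```python
import math, random

def grad(x):
    # ∇L(x) for L(x) = x**4/4 - x**2/2 (nonconvex: two minima at x = ±1, saddle at 0)
    return x ** 3 - x

def sgd_nonconvex(T=2000, sigma=0.1, gamma=2.0, D=1.0, seed=0):
    rng = random.Random(seed)
    eta = min(D / (sigma * math.sqrt(T)), 1.0 / gamma)   # Ghadimi–Lan step size
    x, best = 2.0, float("inf")
    for _ in range(T):
        g = grad(x) + sigma * rng.gauss(0.0, 1.0)        # stochastic gradient, variance sigma^2
        x -= eta * g
        best = min(best, grad(x) ** 2)                   # track min_t ||∇L(x_t)||^2
    return best
```

As the bound predicts, the smallest squared gradient norm along the trajectory becomes small even though the objective is nonconvex.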
6. ### Nonconvex SVRG

min_{x∈R^p} L(x) = min_{x∈R^p} (1/n)∑_{i=1}^n ℓ_i(x)

SVRG can be applied to nonconvex optimization as-is (with the step size and minibatch size tuned appropriately). Number of updates T until E[∥∇L(x̂)∥²] ≤ ϵ (Allen-Zhu and Hazan, 2016, Reddi et al., 2016):

- If each ℓ_i is γ-smooth: T ≥ Ω(n + n^{2/3}/ϵ). (The ordinary deterministic gradient method needs Ω(n/ϵ).)
- If each ℓ_i is γ-smooth and L(x) − L(x*) ≤ τ∥∇L(x)∥² (∀x), where x* is the global optimum (Polyak–Łojasiewicz (PL) condition): T ≥ Ω((n + τn^{2/3}) log(1/ϵ)). (The ordinary deterministic gradient method needs Ω(τn log(1/ϵ)).)
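The SVRG update structure (snapshot, full gradient, variance-reduced inner steps) can be sketched as below. For simplicity the components here are convex quadratics ℓ_i(x) = (x − a_i)²/2 rather than a nonconvex loss, and the step size and epoch lengths are ad hoc illustrative choices.

```python
import random

def svrg(data, x0=0.0, eta=0.1, m=20, S=30, seed=0):
    """SVRG on L(x) = (1/n) sum_i (x - a_i)^2 / 2, where grad l_i(x) = x - a_i.
    Illustrative sketch; eta, m, S chosen ad hoc."""
    rng = random.Random(seed)
    n = len(data)
    x = x0
    for _ in range(S):
        x_snap = x
        full_grad = sum(x_snap - a for a in data) / n    # full gradient at the snapshot
        y = x_snap
        for _ in range(m):
            i = rng.randrange(n)
            # variance-reduced gradient estimate:
            v = (y - data[i]) - (x_snap - data[i]) + full_grad
            y -= eta * v
        x = y
    return x
```

On this toy problem the iterate converges to the minimizer, the mean of the data.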
7. ### Methods for escaping saddle points

- Add noise to the plain gradient method (Jin et al., 2017a).
- Extension to the accelerated gradient method (Jin et al., 2017b).
8. ### SARAH and its refinement

SSRGD (Li, 2019): SARAH + escaping saddle points by noise injection.

Simple Stochastic Recursive Gradient Descent (SSRGD). Iterate the following for t = 1, 2, …, T:

1. If not in saddle-escape mode and ∥∇L(x_t)∥ ≤ g_thresh, set x_t ← x_t + ξ (ξ ∼ Unif(B_r(R^d))) and enter saddle-escape mode.
2. y_0 = x_t, v_0 = ∇f(x_t).
3. For k = 1, …, m:
   1. y_k = y_{k−1} − ηv_{k−1},
   2. v_k = (1/b)∑_{i∈I_k}(∇f_i(y_k) − ∇f_i(y_{k−1})) + v_{k−1} (SARAH: variance reduction),
   3. if a certain stopping condition is met, leave saddle-escape mode.
4. x_{t+1} = y_m.

Output: x_T.

SARAH: StochAstic Recursive grAdient algoritHm (Nguyen et al., 2017, Pham et al., 2020). In the online setting, the computation of ∇L is replaced by a sample average (1/B)∑_{i∈I_t}∇f_i(x_t). Second-order optimality is also guaranteed with high probability.
9. ### SARAH in more detail

SARAH: v_k = (1/b)∑_{i∈I_k}(∇f_i(y_k) − ∇f_i(y_{k−1})) + v_{k−1}
SVRG: v_k = (1/b)∑_{i∈I_k}(∇f_i(y_k) − ∇f_i(x̃)) + ṽ

In SVRG, the variance grows as the inner loop proceeds. In SARAH, the variance does not grow as the inner loop proceeds, and it even converges to 0 in the strongly convex case. → The gradient estimate does not blow up, so solutions satisfying the first-order optimality condition are easier to reach. (In convex optimization this difference is hard to see as long as one only looks at objective values.)
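The recursive SARAH estimator can be sketched as below (batch size b = 1, a single outer loop, on the same kind of toy quadratic sum; all constants are illustrative). For this quadratic the recursion v_k = y_k − y_{k−1} + v_{k−1} happens to track the exact gradient, which illustrates the point that the variance does not grow along the inner loop.

```python
import random

def sarah(data, x0=0.0, eta=0.1, m=200, seed=0):
    """One outer loop of SARAH on L(x) = (1/n) sum_i (x - a_i)^2 / 2 (sketch).
    v_k = (1/b) sum_{i in I_k} (grad l_i(y_k) - grad l_i(y_{k-1})) + v_{k-1}, b = 1."""
    rng = random.Random(seed)
    n = len(data)
    y_prev = x0
    v = sum(x0 - a for a in data) / n       # v_0 = full gradient at y_0
    y = y_prev - eta * v
    for _ in range(m):
        i = rng.randrange(n)
        v = (y - data[i]) - (y_prev - data[i]) + v    # recursive SARAH update
        y_prev, y = y, y - eta * v
    return y
```

Here v stays equal to the exact gradient y − mean(data) at every step, so the iterate contracts geometrically to the minimizer.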
10. ### Comparison of computational complexity

ϵ-first-order optimality: E[∥∇L(x)∥²] ≤ ϵ. δ-second-order optimality: λ_min(∇²L(x)) ≥ −δ (with high probability).

Finite-sum setting:

| Method | # stochastic gradient evaluations | Optimality |
| --- | --- | --- |
| GD | O(n/ϵ) | 1st order |
| SVRG (Allen-Zhu and Hazan, 2016) | O(n + n^{2/3}/ϵ) | 1st order |
| SARAH (Pham et al., 2020) | O(n + √n/ϵ) | 1st order |
| SSRGD (Li, 2019) | O(n + √n/ϵ) | 1st order |
| PGD (Jin et al., 2017b) | O(n/ϵ + n/δ⁴) | 2nd order |
| SSRGD (Li, 2019) | O(√n/ϵ + √n/δ⁴ + n/δ³) | 2nd order |

Online setting:

| Method | # stochastic gradient evaluations | Optimality |
| --- | --- | --- |
| SGD (Ghadimi and Lan, 2013) | O(1/ϵ²) | 1st order |
| SVRG+ (Li and Li, 2018) | O(1/ϵ^{7/4}) | 1st order |
| SARAH (Pham et al., 2020) | O(1/ϵ^{3/2}) | 1st order |
| SSRGD (Li, 2019) | O(1/ϵ^{3/2}) | 1st order |
| SSRGD (Li, 2019) | O(1/ϵ^{3/2} + 1/(ϵδ³) + 1/(ϵ^{1/2}δ⁴)) | 2nd order |
11. ### (Reference) Strict saddle

Objectives such as those of deep learning have many stationary points. If the objective satisfies the strict saddle property, saddle points can be avoided.

Trust-region methods (Conn et al., 2000) and noise-injected stochastic gradient methods (Ge et al., 2015) reach a local optimum of strict-saddle objectives (Sun et al., 2015). ※ Adding noise to the iterate allows it to escape saddle points.

Strict saddle: a twice-differentiable function f is strict saddle if for every x at least one of the following holds:
- ∥∇f(x)∥ ≥ ϵ;
- λ_min(∇²f(x)) ≤ −γ;
- there exists x* such that ∥x − x*∥ ≤ δ and f is strongly convex on the neighborhood {x′ | ∥x* − x′∥ ≤ 2δ}.

E.g., the tensor decomposition max_{u∈R^p} ⟨∑_{r=1}^d a_r^{⊗4}, u ⊗ u ⊗ u ⊗ u⟩ is strict saddle if a_r^⊤a_{r′} = δ_{r,r′}.
12. ### Learning problems with linear constraints

min_x (1/n)∑_{i=1}^n f_i(z_i^⊤x) + ψ(B^⊤x) ⇔ min_{x,y} (1/n)∑_{i=1}^n f_i(z_i^⊤x) + ψ(y) s.t. y = B^⊤x.

Augmented Lagrangian:

L(x, y, λ) = (1/n)∑_i f_i(z_i^⊤x) + ψ(y) + λ^⊤(y − B^⊤x) + (ρ/2)∥y − B^⊤x∥².

The optimal solution is obtained from inf_{x,y} sup_λ L(x, y, λ).

- Method of multipliers: Hestenes (1969), Powell (1969), Rockafellar (1976).
- Alternating direction method of multipliers (ADMM): Gabay and Mercier (1976), Mota et al. (2011), He and Yuan (2012), Deng and Yin (2012), Hong and Luo (2012a).
- Stochastic ADMM: SGD-ADMM (Suzuki, 2013, Ouyang et al., 2013), RDA-ADMM (Suzuki, 2013), SDCA-ADMM (Suzuki, 2014), SVRG-ADMM (Zheng and Kwok, 2016), ASVRG-ADMM (Liu et al., 2017).
13. ### Example of structured regularization

Overlapped group lasso: ψ̃(w) = C∑_{g∈G}∥w_g∥. It is difficult to compute its proximal mapping directly.

Solution: prepare a ψ whose proximal mapping is easily computable, write ψ(B^⊤w) = ψ̃(w), and use the proximal mapping w.r.t. ψ. The map B^⊤ duplicates coordinates so that B^⊤w decomposes into independent (non-overlapping) groups:

ψ(y) = C∑_{g′∈G′}∥y_{g′}∥, prox(q|ψ) = (q_{g′} max{1 − C/∥q_{g′}∥, 0})_{g′∈G′}.
14. ### Other examples

Graph-guided regularization: ψ̃(w) = C∑_{(i,j)∈E}|w_i − w_j|.

Setting ψ(y) = C∑_{e∈E}|y_e| with y = B^⊤w = (w_i − w_j)_{(i,j)∈E} gives

ψ(B^⊤w) = ψ̃(w), prox(q|ψ) = (q_e max{1 − C/|q_e|, 0})_{e∈E},

i.e., the soft-thresholding function.
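The proximal mapping written above is coordinate-wise soft-thresholding; a minimal sketch:

```python
def prox_l1(q, C):
    """prox(q | psi) for psi(y) = C * sum_e |y_e|:
    each coordinate is soft-thresholded, q_e * max(1 - C/|q_e|, 0)."""
    return [qe * max(1.0 - C / abs(qe), 0.0) if qe != 0.0 else 0.0 for qe in q]
```

Coordinates with |q_e| ≤ C are set exactly to zero; larger ones are shrunk toward zero by C.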
15. ### ADMM for structured regularization

min_x {f(x) + ψ(B^⊤x)} ⇔ min_{x,y} {f(x) + ψ(y) s.t. y = B^⊤x},

L(x, y, λ) = f(x) + ψ(y) + λ^⊤(y − B^⊤x) + (ρ/2)∥y − B^⊤x∥², where f(x) = (1/n)∑_i f_i(z_i^⊤x).

Structured regularization learning by ADMM:

x^{(t)} = argmin_x {f(x) + λ^{(t−1)⊤}(−B^⊤x) + (ρ/2)∥y^{(t−1)} − B^⊤x∥²},
y^{(t)} = argmin_y {ψ(y) + λ^{(t−1)⊤}y + (ρ/2)∥y − B^⊤x^{(t)}∥²} (= prox(B^⊤x^{(t)} − λ^{(t−1)}/ρ | ψ/ρ)),
λ^{(t)} = λ^{(t−1)} − ρ(B^⊤x^{(t)} − y^{(t)}).

The y update is simply the proximal mapping w.r.t. the simple ψ → analytic solution. Convergence is O(1/k) in general (He and Yuan, 2012), and linear under strong convexity (Deng and Yin, 2012, Hong and Luo, 2012b).
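The three ADMM updates can be sketched on a toy graph-guided problem min_x ½∥x − z∥² + C|x₁ − x₂| (two nodes, one edge, so B^⊤x = x₁ − x₂ and the x-step is an explicit 2×2 linear solve). The loss f, penalty weight C, ρ, and iteration count are illustrative assumptions.

```python
def prox_abs(q, t):
    # soft-threshold: prox of t*|.|
    return max(abs(q) - t, 0.0) * (1.0 if q > 0 else -1.0)

def admm_fused(z, C=0.2, rho=1.0, iters=500):
    """ADMM for min_x 0.5*||x - z||^2 + C*|x1 - x2| with y = B^T x = x1 - x2."""
    x = [0.0, 0.0]
    y, lam = 0.0, 0.0
    det = 1.0 + 2.0 * rho                 # det(I + rho * B * B^T) for B = [1, -1]^T
    for _ in range(iters):
        # x-step: solve (I + rho*B*B^T) x = z + B*lam + rho*B*y
        b1 = z[0] + lam + rho * y
        b2 = z[1] - lam - rho * y
        x = [((1 + rho) * b1 + rho * b2) / det,
             (rho * b1 + (1 + rho) * b2) / det]
        # y-step: prox of (C/rho)*|.| at B^T x - lam/rho
        y = prox_abs(x[0] - x[1] - lam / rho, C / rho)
        # dual update
        lam -= rho * (x[0] - x[1] - y)
    return x
```

For z = (3, 1) and C = 0.2 the closed-form solution shrinks the difference: x = (2.8, 1.2), which the iteration reaches quickly.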
16. ### SGD-ADMM

min_x E_Z[ℓ(x, Z)] + ψ(B^⊤x) ⇒ augmented Lagrangian: E_Z[ℓ(x, Z)] + ψ(y) + λ^⊤(y − B^⊤x) + (ρ/2)∥y − B^⊤x∥².

Ordinary SGD: x_{t+1} = argmin_x {⟨g_t, x⟩ + ψ̃(x) + (1/(2η_t))∥x − x_t∥²} (g_t ∈ ∂_x ℓ(x_t, z_t)).

SGD-ADMM:

x_{t+1} = argmin_{x∈X} {g_t^⊤x − λ_t^⊤(B^⊤x − y_t) + (ρ/2)∥B^⊤x − y_t∥² + (1/(2η_t))∥x − x_t∥²_{G_t}},
y_{t+1} = argmin_{y∈Y} {ψ(y) − λ_t^⊤(B^⊤x_{t+1} − y) + (ρ/2)∥B^⊤x_{t+1} − y∥²} = prox(B^⊤x_{t+1} − λ_t/ρ | ψ/ρ),
λ_{t+1} = λ_t − ρ(B^⊤x_{t+1} − y_{t+1}).

The updates of y_{t+1} and λ_{t+1} are the same as in ordinary ADMM. G_t is an arbitrary positive definite symmetric matrix.
18. ### RDA-ADMM

Ordinary RDA: w_{t+1} = argmin_w {⟨ḡ_t, w⟩ + ψ̃(w) + (1/(2η_t))∥w∥²}, where ḡ_t = (1/t)(g_1 + ⋯ + g_t).

RDA-ADMM. Let x̄_t = (1/t)∑_{τ=1}^t x_τ, λ̄_t = (1/t)∑_{τ=1}^t λ_τ, ȳ_t = (1/t)∑_{τ=1}^t y_τ, ḡ_t = (1/t)∑_{τ=1}^t g_τ. Then

x_{t+1} = argmin_{x∈X} {ḡ_t^⊤x − (Bλ̄_t)^⊤x + (ρ/(2t))∥B^⊤x∥² + ρ(B^⊤x̄_t − ȳ_t)^⊤B^⊤x + (1/(2η_t))∥x∥²_{G_t}},
y_{t+1} = prox(B^⊤x_{t+1} − λ_t/ρ | ψ/ρ),
λ_{t+1} = λ_t − ρ(B^⊤x_{t+1} − y_{t+1}).

The updates of y_{t+1} and λ_{t+1} are the same as in ordinary ADMM.
19. ### Convergence analysis

We bound the expected risk: P(x) = E_Z[ℓ(Z, x)] + ψ̃(x).

Assumptions:
- (A1) ∃G s.t. every g ∈ ∂_x ℓ(z, x) satisfies ∥g∥ ≤ G for all z, x.
- (A2) ∃L s.t. every g ∈ ∂ψ(y) satisfies ∥g∥ ≤ L for all y.
- (A3) ∃R s.t. every x ∈ X satisfies ∥x∥ ≤ R.
20. ### Convergence rate: bounded gradient

Under (A1)–(A3) above:

Theorem (Convergence rate of RDA-ADMM).

E_{z_{1:T−1}}[P(x̄_T) − P(x*)] ≤ (G²/T)∑_{t=2}^T η_{t−1}/(2(t−1)) + (γ/η_T)∥x*∥² + K/T.

Theorem (Convergence rate of SGD-ADMM).

E_{z_{1:T−1}}[P(x̄_T) − P(x*)] ≤ (R²/(2T))∑_{t=2}^T max{γ/η_t − γ/η_{t−1}, 0} + (G²/T)∑_{t=1}^T η_t/2 + K/T.

Both methods achieve the convergence rate O(1/√T) by letting η_t = η_0√t for RDA-ADMM and η_t = η_0/√t for SGD-ADMM.
21. ### Finite-sum problems

Dual problem of the regularized training error. Let A = [a_1, a_2, …, a_n] ∈ R^{p×n}.

min_w {(1/n)∑_{i=1}^n f_i(a_i^⊤w) + ψ(B^⊤w)}  (P: primal)
= −min_{x∈R^n, y∈R^d} {(1/n)∑_{i=1}^n f_i^*(x_i) + ψ^*(y/n) s.t. Ax + By = 0}  (D: dual)

Optimality conditions: a_i^⊤w* ∈ ∇f_i^*(x_i^*), (1/n)y* ∈ ∇ψ(u)|_{u=B^⊤w*}, Ax* + By* = 0.

⋆ Each coordinate x_i corresponds to the observation a_i.
22. ### SDCA-ADMM

Augmented Lagrangian: L(x, y, w) := ∑_{i=1}^n f_i^*(x_i) + nψ^*(y/n) − ⟨w, Ax + By⟩ + (ρ/2)∥Ax + By∥².

SDCA-ADMM. For each t = 1, 2, …: choose i ∈ {1, …, n} uniformly at random, and update

y^{(t)} ← argmin_y {L(x^{(t−1)}, y, w^{(t−1)}) + (1/2)∥y − y^{(t−1)}∥²_Q},
x_i^{(t)} ← argmin_{x_i∈R} {L([x_i; x^{(t−1)}_{∖i}], y^{(t)}, w^{(t−1)}) + (1/2)∥x_i − x_i^{(t−1)}∥²_{G_{i,i}}},
w^{(t)} ← w^{(t−1)} − ξρ{n(Ax^{(t)} + By^{(t)}) − (n−1)(Ax^{(t−1)} + By^{(t−1)})}.

Q and G_{i,i} are positive definite symmetric matrices satisfying certain conditions. Each update modifies only the i-th coordinate x_i. The w update requires some care.

25. ### Linear regression

Design matrix X = (X_ij) ∈ R^{n×p}, responses Y = [y_1, …, y_n]^⊤ ∈ R^n, true vector β* ∈ R^p. Model: Y = Xβ* + ξ.

Ridge regression (Tikhonov regularization):

β̂ ← argmin_{β∈R^p} (1/n)∥Xβ − Y∥²₂ + λ_n∥β∥²₂.

Change of variables: because of the regularization term, β̂ ∈ Ker(X)^⊥, i.e., β̂ ∈ Im(X^⊤). Hence there exists α̂ ∈ R^n such that β̂ = X^⊤α̂. Equivalent problem:

α̂ ← argmin_{α∈R^n} (1/n)∥XX^⊤α − Y∥²₂ + λ_n α^⊤(XX^⊤)α.

※ Since (XX^⊤)_{ij} = x_i^⊤x_j, it suffices to be able to compute inner products between the observations x_i and x_j.
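The equivalence of the two formulations can be checked numerically; the sketch below solves the primal normal equations (X^⊤X + nλI)β = X^⊤Y and the change-of-variables problem via α = (XX^⊤ + nλI)^{−1}Y, β = X^⊤α, using a small Gaussian-elimination helper (illustrative code for tiny dense problems only).

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c and M[r][c]:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ridge_primal(X, Y, lam):
    """beta = (X^T X + n*lam*I)^{-1} X^T Y, the minimizer of (1/n)||Xb-Y||^2 + lam*||b||^2."""
    n, p = len(X), len(X[0])
    XtX = [[sum(X[k][i] * X[k][j] for k in range(n)) + (n * lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    XtY = [sum(X[k][i] * Y[k] for k in range(n)) for i in range(p)]
    return solve(XtX, XtY)

def ridge_dual(X, Y, lam):
    """alpha = (X X^T + n*lam*I)^{-1} Y (a stationary point of the rewritten problem);
    beta = X^T alpha. Only inner products between samples are needed."""
    n = len(X)
    K = [[sum(a * b for a, b in zip(X[i], X[j])) + (n * lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(K, Y)
    return [sum(alpha[i] * X[i][j] for i in range(n)) for j in range(len(X[0]))]
```

Both routines return the same coefficient vector, by the identity (X^⊤X + cI)^{−1}X^⊤ = X^⊤(XX^⊤ + cI)^{−1}.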
27. ### Kernelizing ridge regression

Ridge regression (change-of-variables form):

α̂ ← argmin_{α∈R^n} (1/n)∥(XX^⊤)α − Y∥²₂ + λ_n α^⊤(XX^⊤)α.

※ (XX^⊤)_{ij} = x_i^⊤x_j is the inner product of the samples x_i and x_j.

• The idea of the kernel method: replace the inner product between the x's by some other nonlinear function, x_i^⊤x_j → k(x_i, x_j). This k: R^p × R^p → R is called a kernel function.

Conditions a kernel function must satisfy:
- Symmetry: k(x, x′) = k(x′, x).
- Positive semidefiniteness: ∑_{i=1}^m ∑_{j=1}^m α_i α_j k(x_i, x_j) ≥ 0 (∀{x_i}_{i=1}^m, {α_i}_{i=1}^m, m).

Conversely, any function satisfying these properties may be used in the kernel method.
29. ### Kernel ridge regression

Kernel ridge regression: with K = (k(x_i, x_j))_{i,j=1}^n,

α̂ ← argmin_{α∈R^n} (1/n)∥Kα − Y∥²₂ + λ_n α^⊤Kα.

For a new input x, predict by y = ∑_{i=1}^n k(x, x_i) α̂_i.

Kernel function ⇔ reproducing kernel Hilbert space (RKHS) H_k: there exists ϕ(x): R^p → H_k such that k(x, x′) = ⟨ϕ(x), ϕ(x′)⟩_{H_k}.

Kernel trick: ⟨∑_{i=1}^n α_i ϕ(x_i), ϕ(x)⟩_{H_k} = ∑_{i=1}^n α_i k(x_i, x) → it suffices to be able to evaluate the kernel function.
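A minimal kernel ridge regression sketch with the Gaussian kernel (the data, bandwidth, and λ_n below are illustrative; the linear solver is a small Gaussian-elimination helper):

```python
import math

def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting, for small dense systems."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c and M[r][c]:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def k_gauss(x, xp, s=1.0):
    return math.exp(-(x - xp) ** 2 / (2 * s * s))

def krr_fit(xs, ys, lam):
    """alpha solves (K + n*lam*I) alpha = Y, a stationary point of
    (1/n)||K a - Y||^2 + lam * a^T K a."""
    n = len(xs)
    K = [[k_gauss(xs[i], xs[j]) + (n * lam if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return solve(K, ys)

def krr_predict(xs, alpha, x):
    # prediction y = sum_i k(x, x_i) * alpha_i
    return sum(a * k_gauss(xi, x) for xi, a in zip(xs, alpha))
```

With a very small λ_n the fit nearly interpolates the training data, so predictions at training points recover the training labels.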
31. ### Reproducing kernel Hilbert space (RKHS)

Input distribution: P_X; corresponding L² space: L²(P_X) = {f | E_{X∼P_X}[f(X)²] < ∞}.

The kernel function can be decomposed as follows (Steinwart and Scovel, 2012):

k(x, x′) = ∑_{j=1}^∞ μ_j e_j(x) e_j(x′),

where (e_j)_{j=1}^∞ is an orthonormal system in L²(P_X), i.e., ∥e_j∥_{L²(P_X)} = 1 and ⟨e_j, e_{j′}⟩_{L²(P_X)} = 0 (j ≠ j′), and μ_j ≥ 0.

Definition (reproducing kernel Hilbert space H_k). ⟨f, g⟩_{H_k} := ∑_{j=1}^∞ (1/μ_j) α_j β_j for f = ∑_{j=1}^∞ α_j e_j, g = ∑_{j=1}^∞ β_j e_j ∈ L²(P_X); ∥f∥_{H_k} := √⟨f, f⟩_{H_k}; H_k := {f ∈ L²(P_X) | ∥f∥_{H_k} < ∞} equipped with ⟨·, ·⟩_{H_k}.

Reproducing property: for f ∈ H_k, the value f(x) is "reproduced" as an inner product: f(x) = ⟨f, k(x, ·)⟩_{H_k}.
32. ### Properties of the RKHS

Writing ϕ_k(x) = k(x, ·) ∈ H_k, we have k(x, x′) = ⟨ϕ_k(x), ϕ_k(x′)⟩_{H_k}; this ϕ_k is also called the feature map.

Integral operator associated with the kernel, T_k: L²(P_X) → L²(P_X):

T_k f := ∫ f(x) k(x, ·) dP_X(x).

The kernel decomposition above corresponds to the spectral decomposition of T_k. The RKHS H_k can also be written as

H_k = T_k^{1/2} L²(P_X), ∥f∥_{H_k} = inf{∥h∥_{L²(P_X)} | f = T_k^{1/2} h, h ∈ L²(P_X)}.

Any f ∈ H_k can be written f(x) = ∑_{j=1}^∞ a_j √μ_j e_j(x) with ∥f∥_{H_k} = √(∑_{j=1}^∞ a_j²). Here (e_j)_j is an orthonormal system in L², and (√μ_j e_j)_j is a complete orthonormal system of the RKHS. Expressing the feature map ϕ_k(x) = k(x, ·) by its coefficients w.r.t. this complete orthonormal system:

ϕ_k(x) = (√μ_1 e_1(x), √μ_2 e_2(x), …)^⊤.
33. ### Intuitive picture of the RKHS

Nonlinear inference is carried out via the nonlinear map ϕ into the RKHS; inside the RKHS, the processing is linear. A kernel method can also be viewed as an infinite-width two-layer neural network whose first layer is fixed and whose second-layer parameters are learned (a representative example of a "shallow" learning method).
34. ### Reformulating kernel ridge regression

Reproducing property: f(x) = ⟨f, ϕ(x)⟩_{H_k} for f ∈ H_k.

Kernel ridge regression, reformulated:

f̂ ← min_{f∈H_k} (1/n)∑_{i=1}^n (y_i − f(x_i))² + C∥f∥²_{H_k}.

Representer theorem: ∃α_i ∈ R s.t. f̂(x) = ∑_{i=1}^n α_i k(x_i, x), ⇒ ∥f̂∥_{H_k} = √(∑_{i,j=1}^n α_i α_j k(x_i, x_j)) = √(α^⊤Kα).

This agrees with the earlier formulation of kernel ridge regression.
35. ### Examples of kernels

- Gaussian kernel: k(x, x′) = exp(−∥x − x′∥²/(2σ²)).
- Polynomial kernel: k(x, x′) = (1 + x^⊤x′)^p.
- χ²-kernel: k(x, x′) = exp(−γ² ∑_{j=1}^d (x_j − x′_j)²/(x_j + x′_j)).
- Matérn kernel: k(x, x′) = ∫_{R^d} e^{iλ^⊤(x−x′)} (1 + ∥λ∥²)^{−(α+d/2)} dλ.
- Graph kernels, time-series kernels, …
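Two of the kernels above, together with a check of the symmetry and positive-semidefiniteness conditions on a 3-point Gram matrix (a sketch; the test points are arbitrary, and the minor-based check below is only a sufficient-style test via Sylvester's criterion for the 3×3 case):

```python
import math

def gauss_kernel(x, xp, sigma=1.0):
    """Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)); x, x' are lists."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-d2 / (2 * sigma ** 2))

def poly_kernel(x, xp, p=2):
    """Polynomial kernel k(x, x') = (1 + <x, x'>)^p."""
    return (1 + sum(a * b for a, b in zip(x, xp))) ** p

def gram(kernel, xs):
    return [[kernel(a, b) for b in xs] for a in xs]

def leading_minors_nonneg(K, tol=1e-12):
    """Leading principal minors of a 3x3 symmetric matrix; all strictly positive
    would imply positive definiteness (Sylvester's criterion)."""
    m1 = K[0][0]
    m2 = K[0][0] * K[1][1] - K[0][1] * K[1][0]
    m3 = (K[0][0] * (K[1][1] * K[2][2] - K[1][2] * K[2][1])
          - K[0][1] * (K[1][0] * K[2][2] - K[1][2] * K[2][0])
          + K[0][2] * (K[1][0] * K[2][1] - K[1][1] * K[2][0]))
    return m1 >= -tol and m2 >= -tol and m3 >= -tol
```

For distinct points both Gram matrices are in fact positive definite, so all leading minors come out positive.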
36. ### Stochastic optimization in an RKHS

Problem setting: y_i = f°(x_i) + ξ_i. We want to estimate f° from (x_i, y_i)_{i=1}^n (f° is assumed to lie (almost) in H_k).

Decomposing the expected loss: E[(f(X) − Y)²] = E[(f(X) − f°(X) − ξ)²] = E[(f(X) − f°(X))²] + σ². → Solving min_{f∈H_k} E[(f(X) − Y)²] recovers f°.

Writing K_x = k(x, ·) ∈ H_k, the reproducing property f(x) = ⟨f, K_x⟩_{H_k} gives the Fréchet derivative of L(f) = E[(f(X) − Y)²] inside the RKHS:

∇L(f) = 2E[K_X(⟨K_X, f⟩_{H_k} − Y)] = 2(Σf − E[K_X Y]), where Σ := E[K_X K_X^*].

- Gradient descent on the expected loss: f*_t = f*_{t−1} − 2η(Σf*_{t−1} − E[K_X Y]).
- Gradient descent on the empirical loss (Ê[·] denotes the sample average, Σ̂ the empirical covariance operator): f̂_t = f̂_{t−1} − 2η(Σ̂f̂_{t−1} − Ê[K_X Y]).
- Stochastic gradient update: g_t = g_{t−1} − 2η(K_{x_{i_t}} K_{x_{i_t}}^* g_{t−1} − K_{x_{i_t}} y_{i_t}).

※ (x_{i_t}, y_{i_t})_{t=1}^∞ are drawn i.i.d. uniformly from (x_i, y_i)_{i=1}^n.
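The stochastic update above can be sketched directly in function space: each step adds one kernel function centered at the sampled point, with coefficient −2η(g_{t−1}(x_i) − y_i). The data, step size, and iteration count below are illustrative assumptions.

```python
import math, random

def k_gauss(x, xp, s=1.0):
    return math.exp(-(x - xp) ** 2 / (2 * s * s))

def kernel_sgd(data, eta=0.05, T=2000, seed=0):
    """Stochastic gradient update in the RKHS for the squared loss:
    g_t = g_{t-1} - 2*eta*(g_{t-1}(x_i) - y_i) * k(x_i, .)
    g is represented as a list of (center, coefficient) pairs.
    (Sketch: no regularization here; early stopping controls overfitting.)"""
    rng = random.Random(seed)
    g = []                                     # g(x) = sum_j c_j * k(z_j, x)
    for _ in range(T):
        x, y = data[rng.randrange(len(data))]
        gx = sum(c * k_gauss(z, x) for z, c in g)
        g.append((x, -2.0 * eta * (gx - y)))   # functional gradient step
    return g

def evaluate(g, x):
    return sum(c * k_gauss(z, x) for z, c in g)
```

On a tiny sample, the iterate fits the training values, consistent with the SGD update driving the empirical residuals toward zero.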
37. ### The gradient as smoothing

Update of the function values:

f*_t(x) = f*_{t−1}(x) − 2η ∫ k(x, X)(f*_{t−1}(X) − Y) dP(X, Y) = f*_{t−1}(x) − 2η T_k(f*_{t−1} − f°)(x),

where f*_{t−1}(X) − Y averages to f*_{t−1}(X) − f°(X). The integral operator T_k suppresses high-frequency components: the gradient in the RKHS is the functional gradient in L² smoothed by T_k. (In practice, an estimate of T_k computed from the sample is used.)

If we stop before the high-frequency components emerge, overfitting can be prevented → early stopping. Carelessly applying, e.g., Newton's method here is dangerous.
38. ### Regularization via early stopping

[Figure: trajectory from the initial value toward the training-error minimizer (overfitting), with early stopping partway.]

Bias–variance decomposition:

∥f° − f̂∥_{L²(P_X)} (estimation error) ≤ ∥f° − f̌∥_{L²(P_X)} (approximation error, bias) + ∥f̌ − f̂∥_{L²(P_X)} (sample deviation, variance).

Stopping before reaching the training-error minimizer (early stopping) acts as regularization. Infinite-dimensional models (RKHSs) overfit easily, so care is required.
39. ### Conditions used in the analysis

The following conditions are usually considered (the same standard assumptions are imposed in statistical theory) (Caponnetto and de Vito, 2007, Dieuleveut et al., 2016, Pillaud-Vivien et al., 2018):

- μ_i = O(i^{−α}) for α > 1. α characterizes the complexity of the RKHS H_k (small α: complex; large α: simple).
- f° ∈ T_k^r(L²(P_X)) for r > 0. Characterizes how far f° "sticks out" of the RKHS. r = 1/2 corresponds to f° ∈ H_k (r < 1/2: sticks out; r ≥ 1/2: contained).
- ∥f∥_{L∞(P_X)} ≲ ∥f∥_{L²(P_X)}^{1−μ} ∥f∥_{H_k}^{μ} (∀f ∈ H_k) for μ ∈ (0, 1]. Characterizes the smoothness of the functions contained in H_k (small μ: smooth).

※ On the last condition: if H_k is the Sobolev space W^m([0,1]^d) and P_X is supported on [0,1]^d with a density bounded below by some constant c > 0, then it holds with μ = d/(2m).
40. ### Convergence rate

Bias–variance decomposition:

∥f° − g_t∥²_{L²(P_X)} ≲ ∥f° − f*_t∥²_{L²(P_X)} ((a): bias) + ∥f*_t − f̂_t∥²_{L²(P_X)} ((b): variance) + ∥f̂_t − g_t∥²_{L²(P_X)} ((c): SGD deviation),

with (a) ≲ (ηt)^{−2r}, (b) ≲ ((ηt)^{1/α} + (ηt)^{μ−2r})/n, (c) ≲ η(ηt)^{1/α−1}.

- (a) Deviation between the expectation (over the data) of the gradient-descent solution and the true function (bias).
- (b) Variance of the gradient-descent solution.
- (c) Fluctuation caused by using stochastic gradients.

Increasing the number of updates t decreases the bias but increases the variance; these must be balanced (early stopping).

Theorem (Convergence rate of multi-pass SGD (Pillaud-Vivien et al., 2018)). Set η = 1/(4 sup_x k(x, x)²).
- If μα < 2rα + 1 < α, then taking t = Θ(n^{α/(2rα+1)}) gives E[L(g_t)] − L(f°) = O(n^{−2rα/(2rα+1)}).
- If μα ≥ 2rα + 1, then taking t = Θ(n^{1/μ}(log n)^{1/μ}) gives E[L(g_t)] − L(f°) = O(n^{−2r/μ}).
41. ### Convergence of natural gradient

Natural gradient descent: f̂_t = f̂_{t−1} − η(Σ + λI)^{−1}(Σf̂_{t−1} − Ê[K_X Y]).

(Setting where plenty of unlabeled data is available so that Σ can be estimated well; see Murata and Suzuki (2020) for the analysis of GD.)

Theorem (Convergence of natural gradient (Amari et al., 2020)).

E[∥f̂_t − f°∥²_{L²(P_X)}] ≲ B(t) + V(t),

where B(t) = exp(−ηt) ∨ (λ/(ηt))^{2r} and V(t) = (1 + ηt)(λ^{−1}B(t) + λ^{−1/α})/n + (1 + tη)⁴(1 ∨ λ^{2r−μ})λ^{−1/α}/n.

In particular, λ = n^{−α/(2rα+1)} and t = Θ(log n) give E[∥f̂_t − f°∥²_{L²(P_X)}] = O(n^{−2rα/(2rα+1)} log(n)⁴).

※ The bias converges rapidly, but the variance also grows rapidly. → Because of the preconditioning, high-frequency components appear early; one must stop even earlier to avoid overfitting.
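In coefficient form over the sample points (f = ∑_j c_j k(x_j, ·), so that the empirical Σ̂ acts as K/n on coefficients), one preconditioned step multiplies the residual in each eigendirection of K by roughly λ/(μ_j/n + λ), so large eigendirections are corrected almost immediately. A sketch of this empirical, preconditioned variant (toy data, ad hoc η and λ; the small solver is a Gaussian-elimination helper):

```python
import math

def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting, for small dense systems."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c and M[r][c]:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def k_gauss(x, xp, s=1.0):
    return math.exp(-(x - xp) ** 2 / (2 * s * s))

def natural_gradient(xs, ys, lam=0.01, eta=1.0, steps=20):
    """f_t = f_{t-1} - eta*(Sigma_hat + lam I)^{-1}(Sigma_hat f_{t-1} - E_hat[K_X Y]).
    In coefficients c: Sigma_hat -> K/n and E_hat[K_X Y] -> y/n, so
    c <- c - (eta/n) * (K/n + lam I)^{-1} (K c - y).
    Note: lam only preconditions; the fixed point interpolates (K c = y),
    which is why stopping early matters even more here."""
    n = len(xs)
    K = [[k_gauss(xs[i], xs[j]) for j in range(n)] for i in range(n)]
    P = [[K[i][j] / n + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    c = [0.0] * n
    for _ in range(steps):
        r = [sum(K[i][j] * c[j] for j in range(n)) - ys[i] for i in range(n)]  # K c - y
        d = solve(P, r)
        c = [ci - eta * di / n for ci, di in zip(c, d)]
    return c, K
```

Already after a few steps the training residuals are tiny, illustrating the "bias converges rapidly" behavior above.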

43. ### Operator Bernstein inequality

Σ = E_x[K_x K_x^*]: Σf = ∫ k(·, x) f(x) dP_X(x). Σ̂ = (1/n)∑_{i=1}^n K_{x_i} K_{x_i}^*: Σ̂f = (1/n)∑_{i=1}^n k(·, x_i) f(x_i).

Let Σ_λ := Σ + λI and F_∞(λ) := sup_x K_x^* Σ_λ^{−1} K_x. Evaluations of the following type are needed:

∥Σ_λ^{−1/2}(Σ − Σ̂)Σ_λ^{−1/2}∥ ≲ √(F_∞(λ)β/n) + (1 + F_∞(λ))β/n

with probability 1 − δ, where β = log(4Tr[ΣΣ_λ^{−1}]/δ). → This bounds the deviation between the empirical and the true distribution.

Theorem (Bernstein inequality for self-adjoint operators (Minsker, 2017)). Let (X_i)_{i=1}^n be independent self-adjoint operator-valued random variables with E[X_i] = 0, and let σ² ≥ ∥∑_{i=1}^n E[X_i²]∥ and U ≥ ∥X_i∥. With r(A) = Tr[A]/∥A∥,

P(∥∑_{i=1}^n X_i∥ ≥ t) ≤ 14 r(∑_{i=1}^n E[X_i²]) exp(−t²/(2(σ² + tU/3))).

Apply this with X_i = Σ_λ^{−1/2}(K_{x_i} K_{x_i}^* − Σ)Σ_λ^{−1/2}. (See also Tropp (2012).)
44. ### Stochastic optimization with regularization

Going beyond the squared loss, consider a general smooth convex loss ℓ (e.g., classification). Regularized expected risk minimization:

min_{f∈H_k} E[ℓ(Y, f(X))] + λ∥f∥²_{H_k} =: L_λ(f).

We solve this by SGD, exploiting that the objective is λ-strongly convex:

g_{t+1} = g_t − η_t(ℓ′(y_t, g_t(x_t)) K_{x_t} + λ g_t),
ḡ_{T+1} = ∑_{t=1}^{T+1} [2(c_0 + t − 1)/((2c_0 + T)(T + 1))] g_t (polynomial averaging).

Assumptions: (i) ℓ is γ-smooth and ∥ℓ′∥_∞ ≤ M; (ii) k(x, x) ≤ 1. Let g_λ = argmin_{g∈H_k} L_λ(g).

Theorem (Nitanda and Suzuki (2019)). For an appropriate c_0 > 0, taking η_t = 2/(λ(c_0 + t)) gives

E[L_λ(ḡ_{T+1}) − L_λ(g_λ)] ≲ M²/(λ(c_0 + T)) + (γ + λ)/(T + 1) ∥g_1 − g_λ∥²_{H_k}.

Moreover, a high-probability bound follows from a martingale concentration inequality. For classification, exponential convergence of the classification error can also be shown under a strong low-noise condition.
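A sketch of this scheme with the step size η_t = 2/(λ(c_0 + t)) and polynomial averaging, on a two-point toy classification problem with the logistic loss (so ∥ℓ′∥_∞ ≤ 1 = M). The data, λ, T, and the choice to average the first T iterates are illustrative assumptions.

```python
import math, random

def k_gauss(x, xp, s=1.0):
    return math.exp(-(x - xp) ** 2 / (2 * s * s))

def reg_kernel_sgd(data, lam=0.1, T=5000, c0=1.0, seed=0):
    """g_{t+1} = g_t - eta_t * (l'(y_t, g_t(x_t)) * k(x_t, .) + lam * g_t),
    eta_t = 2/(lam*(c0+t)), with logistic loss l(y, u) = log(1 + exp(-y*u)),
    and polynomial averaging with weights 2(c0+t-1)/((2c0+T)(T+1)).
    g is kept as a coefficient vector over the n sample points."""
    rng = random.Random(seed)
    xs = [x for x, _ in data]
    n = len(data)
    c = [0.0] * n                 # g(x) = sum_i c_i k(x_i, x)
    cbar = [0.0] * n              # coefficients of the averaged iterate
    for t in range(1, T + 1):
        w = 2.0 * (c0 + t - 1.0) / ((2.0 * c0 + T) * (T + 1.0))
        cbar = [cb + w * ci for cb, ci in zip(cbar, c)]   # accumulate average
        i = rng.randrange(n)
        x, y = data[i]
        gx = sum(cj * k_gauss(xj, x) for cj, xj in zip(c, xs))
        lp = -y / (1.0 + math.exp(y * gx))    # l'(y, u) = -y * sigmoid(-y*u), |l'| <= 1
        eta = 2.0 / (lam * (c0 + t))
        c = [(1.0 - eta * lam) * ci for ci in c]   # shrinkage from the -eta*lam*g_t term
        c[i] -= eta * lp                            # loss-gradient step at x_i
    return xs, cbar

def predict(xs, c, x):
    return sum(ci * k_gauss(xi, x) for ci, xi in zip(c, xs))
```

On linearly separated toy data the averaged iterate classifies both training points correctly, consistent with the exponential-convergence result for the classification error.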
45. ### Martingale Hoeffding inequality

Theorem (Martingale Hoeffding-type concentration inequality (Pinelis, 1994)). Let D_1, …, D_T ∈ H_k be a martingale difference sequence with E[D_t | D_1, …, D_{t−1}] = 0 and ∥D_t∥_{H_k} ≤ R_t (a.s.). Then for all ϵ > 0,

P[max_{1≤t≤T} ∥∑_{s=1}^t D_s∥_{H_k} ≥ ϵ] ≤ 2 exp(−ϵ²/(2∑_{t=1}^T R_t²)).

Setting D_t = E[ḡ_{T+1} | Z_1, …, Z_t] − E[ḡ_{T+1} | Z_1, …, Z_{t−1}] with Z_t = (x_t, y_t) gives ∑_{t=1}^T D_t = ḡ_{T+1} − E[ḡ_{T+1}], so the deviation between the realization and its expectation can be controlled.

(Supplement) Since L_λ is λ-strongly convex w.r.t. the RKHS norm, ∥ḡ_{T+1} − g_λ∥²_{H_k} ≤ O(1/(λ²T)) holds with high probability. Moreover, since ∥·∥_∞ ≤ ∥·∥_{H_k}, under the margin condition |P(Y = 1|X) − P(Y = −1|X)| ≥ δ (strong low-noise condition), perfect classification becomes possible with high probability.

48. ### References

- Z. Allen-Zhu and E. Hazan. Variance reduction for faster non-convex optimization. arXiv preprint arXiv:1603.05643, 2016.
- S. Amari, J. Ba, R. Grosse, X. Li, A. Nitanda, T. Suzuki, D. Wu, and J. Xu. When does preconditioning help or hurt generalization? 2020.
- A. Caponnetto and E. de Vito. Optimal rates for regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
- A. R. Conn, N. I. Gould, and P. L. Toint. Trust Region Methods, volume 1. SIAM, 2000.
- W. Deng and W. Yin. On the global and linear convergence of the generalized alternating direction method of multipliers. Technical report, Rice University CAAM TR12-14, 2012.
- A. Dieuleveut, F. Bach, et al. Nonparametric stochastic approximation with large step-sizes. The Annals of Statistics, 44(4):1363–1399, 2016.
- D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite-element approximations. Computers & Mathematics with Applications, 2:17–40, 1976.
- R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pages 797–842, 2015.
- S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
- B. He and X. Yuan. On the O(1/n) convergence rate of the Douglas–Rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700–709, 2012.
- M. Hestenes. Multiplier and gradient methods. Journal of Optimization Theory & Applications, 4:303–320, 1969.
- M. Hong and Z.-Q. Luo. On the linear convergence of the alternating direction method of multipliers. arXiv preprint arXiv:1208.3922, 2012a.
- M. Hong and Z.-Q. Luo. On the linear convergence of the alternating direction method of multipliers. Technical report, arXiv:1208.3922, 2012b.
- C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan. How to escape saddle points efficiently. In International Conference on Machine Learning, pages 1724–1732, 2017a.
- C. Jin, P. Netrapalli, and M. I. Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456, 2017b.
- Z. Li. SSRGD: Simple stochastic recursive gradient descent for escaping saddle points. In Advances in Neural Information Processing Systems, pages 1523–1533, 2019.
- Z. Li and J. Li. A simple proximal stochastic gradient method for nonsmooth nonconvex optimization. In Advances in Neural Information Processing Systems 31, pages 5564–5574. Curran Associates, Inc., 2018.
- Y. Liu, F. Shang, and J. Cheng. Accelerated variance reduced stochastic ADMM. In Thirty-First AAAI Conference on Artificial Intelligence, pages 2287–2293, 2017.
- S. Minsker. On some extensions of Bernstein's inequality for self-adjoint operators. Statistics & Probability Letters, 127:111–119, 2017.
- J. F. Mota, J. M. Xavier, P. M. Aguiar, and M. Püschel. A proof of convergence for the alternating direction method of multipliers applied to polyhedral-constrained functions. arXiv preprint arXiv:1112.2295, 2011.
- T. Murata and T. Suzuki. Gradient descent in RKHS with importance labeling. 2020.
- L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, PMLR 70, pages 2613–2621, 2017. URL http://proceedings.mlr.press/v70/nguyen17b.html.
- A. Nitanda and T. Suzuki. Stochastic gradient descent with exponential convergence rates of expected classification errors. In Proceedings of Machine Learning Research, volume 89, pages 1417–1426. PMLR, 2019. URL http://proceedings.mlr.press/v89/nitanda19a.html.
- H. Ouyang, N. He, L. Q. Tran, and A. Gray. Stochastic alternating direction method of multipliers. In Proceedings of the 30th International Conference on Machine Learning, 2013.
- N. H. Pham, L. M. Nguyen, D. T. Phan, and Q. Tran-Dinh. ProxSARAH: An efficient algorithmic framework for stochastic composite nonconvex optimization. Journal of Machine Learning Research, 21(110):1–48, 2020.
- L. Pillaud-Vivien, A. Rudi, and F. Bach. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. In Advances in Neural Information Processing Systems, pages 8114–8124, 2018.
- I. Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. The Annals of Probability, pages 1679–1706, 1994.
- M. Powell. A method for nonlinear constraints in minimization problems. In R. Fletcher, editor, Optimization, pages 283–298. Academic Press, London, New York, 1969.
- S. J. Reddi, A. Hefny, S. Sra, B. Póczos, and A. Smola. Stochastic variance reduction for nonconvex optimization. arXiv preprint arXiv:1603.06160, 2016.
- R. T. Rockafellar. Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of Operations Research, 1:97–116, 1976.
- I. Steinwart and C. Scovel. Mercer's theorem on general domains: on the interaction between measures, kernels, and RKHSs. Constructive Approximation, 35(3):363–417, 2012.
- J. Sun, Q. Qu, and J. Wright. When are nonconvex problems not scary? arXiv preprint arXiv:1510.06096, 2015.
- T. Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In Proceedings of the 30th International Conference on Machine Learning, pages 392–400, 2013.
- T. Suzuki. Stochastic dual coordinate ascent with alternating direction method of multipliers. In Proceedings of the 31st International Conference on Machine Learning, pages 736–744, 2014.
- J. A. Tropp. User-friendly tools for random matrices: An introduction. Technical report, 2012.
- S. Zheng and J. T. Kwok. Fast-and-light stochastic ADMM. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 2407–2413, 2016.