Slide 1

Slide 1 text

Optimization Theory in Machine Learning and Its Learning-Theoretic Aspects
Part 2: Nonconvex Stochastic Optimization and Optimization in Reproducing Kernel Hilbert Spaces
Taiji Suzuki
Department of Mathematical Informatics, Graduate School of Information Science and Technology, The University of Tokyo / RIKEN AIP
August 6, 2020, at the Combinatorial Optimization Seminar 2020 (COSS2020)

Slide 2

Slide 2 text

Outline
1. More advanced topics in stochastic optimization
   - Stochastic optimization of nonconvex functions
   - Optimization with structured regularization
2. Infinite-dimensional stochastic optimization: kernel methods
   - Definition of reproducing kernel Hilbert spaces
   - Optimization in reproducing kernel Hilbert spaces

Slide 3

Slide 3 text

Nonconvex minimization
Applying the convex optimization methods introduced so far, as they are, often works reasonably well in practice.
- However, methods that solve the dual problem cannot be applied as they are.
- The step size must be chosen appropriately.
- Global optimization is hard; what can be guaranteed is convergence to a stationary point where the gradient is 0.
- Rigorous global optimization requires methods such as annealing.

Slide 4

Slide 4 text

Stationary points and local optima
(Figure: a nonconvex function with several stationary points and local optima; one of the local optima is also a global optimum.)

Slide 5

Slide 5 text

SGD for nonconvex functions
Objective: L(x) = E_z[ℓ(z, x)] (bounded below; let L* = inf_x L(x)).

SGD: observe z_t ~ P(Z), set ℓ_t(x) := ℓ(z_t, x), take g_t ∈ ∂_x ℓ_t(x_{t-1}), and update x_t = x_{t-1} − η_t g_t.

Assumptions: (A1) L is γ-smooth; (A2) E[‖g_t − E[g_t]‖²] = σ² (the variance of the stochastic gradient is σ²).

Setting η_t = min{ D̃/(σ√T), 1/γ } yields (Ghadimi and Lan (2013))
  min_{1≤t≤T} E[‖∇L(x_t)‖²] ≤ (γσ/√T)(D_f²/D̃ + D̃) + γ²D_f²/T,
where D_f = √(2(L(x_1) − L*)/γ). (This guarantees that the gradient converges to 0.)

Instead of the min_{1≤t≤T} on the left-hand side, one may draw t̂ ∈ {1, …, T} uniformly at random and use E[‖∇L(x_{t̂})‖²].
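As a concrete illustration, here is a minimal Python sketch of this scheme on a synthetic nonconvex problem; the toy loss ℓ(z, x) = 1 − cos(z^⊤x) and the constants gamma, sigma, and D_tilde are assumptions made for the example, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic nonconvex problem: L(x) = E_z[ l(z, x) ] with l(z, x) = 1 - cos(z^T x)  (illustrative)
d, T = 10, 5000
gamma = 1.0          # assumed smoothness constant of L
sigma = 1.0          # assumed bound on the stochastic-gradient standard deviation
D_tilde = 1.0        # free parameter in the Ghadimi-Lan step size

def stoch_grad(x):
    z = rng.normal(size=d)                 # z_t ~ P(Z)
    return np.sin(z @ x) * z               # g_t = grad_x l(z_t, x)

eta = min(D_tilde / (sigma * np.sqrt(T)), 1.0 / gamma)   # constant step size from the slide

x = np.zeros(d)
iterates = []
for t in range(T):
    x = x - eta * stoch_grad(x)            # x_t = x_{t-1} - eta_t g_t
    iterates.append(x.copy())

# Output a uniformly chosen iterate x_{t_hat}; E||grad L(x_{t_hat})||^2 obeys the stated bound.
x_out = iterates[rng.integers(T)]
print(x_out[:3])
```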

Slide 6

Slide 6 text

Nonconvex SVRG
min_{x∈R^p} L(x) = min_{x∈R^p} (1/n) Σ_{i=1}^n ℓ_i(x)

SVRG can be applied to nonconvex optimization as it is (with the step size and mini-batch size tuned appropriately).

Number of updates T needed to reach E[‖∇L(x̂)‖²] ≤ ϵ (Allen-Zhu and Hazan, 2016, Reddi et al., 2016):
- If each ℓ_i is γ-smooth: T ≥ Ω(n + n^{2/3}/ϵ). (Plain deterministic gradient descent needs Ω(n/ϵ).)
- If each ℓ_i is γ-smooth and L(x) − L(x*) ≤ τ‖∇L(x)‖² (∀x), where x* is a global optimum (the Polyak-Łojasiewicz (PL) condition): T ≥ Ω((n + τ n^{2/3}) log(1/ϵ)). (Plain deterministic gradient descent needs Ω(τ n log(1/ϵ)).)
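A minimal sketch of the SVRG recursion on a toy nonconvex finite sum follows; the loss ℓ_i, the mini-batch size, the step size, and the loop lengths are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy nonconvex finite sum: l_i(x) = log(1 + (a_i^T x - b_i)^2) / 2   (illustrative)
n, d = 200, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def grad_batch(x, idx):
    r = A[idx] @ x - b[idx]
    return ((r / (1.0 + r**2))[:, None] * A[idx]).mean(axis=0)

def full_grad(x):
    return grad_batch(x, np.arange(n))

eta, batch, m, S = 0.1, 8, n, 30          # step size, mini-batch size, inner length, outer loops
x_ref = np.zeros(d)
for s in range(S):
    mu = full_grad(x_ref)                 # full gradient at the reference point
    x = x_ref.copy()
    for k in range(m):
        idx = rng.integers(n, size=batch)
        v = grad_batch(x, idx) - grad_batch(x_ref, idx) + mu   # variance-reduced gradient
        x = x - eta * v
    x_ref = x                              # new reference point

print(np.linalg.norm(full_grad(x_ref)))    # small value = approximate first-order stationarity
```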

Slide 7

Slide 7 text

Methods for escaping saddle points
- Add noise to plain gradient descent (Jin et al., 2017a).
- Extension to accelerated gradient methods (Jin et al., 2017b).

Slide 8

Slide 8 text

SARAH and its refinement
SSRGD (Li, 2019): SARAH + escaping saddle points by noise injection.

Simple Stochastic Recursive Gradient Descent (SSRGD)
Iterate the following for t = 1, 2, …, T:
1. If not in saddle-escape mode and ‖∇L(x_t)‖ ≤ g_thresh, set x_t ← x_t + ξ (ξ ~ Unif(B_r(R^d))) and enter saddle-escape mode.
2. y_0 = x_t, v_0 = ∇f(x_t).
3. For k = 1, …, m:
   (a) y_k = y_{k-1} − η v_{k-1}
   (b) v_k = (1/b) Σ_{i∈I_k} (∇f_i(y_k) − ∇f_i(y_{k-1})) + v_{k-1}  (SARAH: variance reduction)
   (c) If a certain stopping condition is met, leave saddle-escape mode.
4. x_{t+1} = y_m.
Output: x_T.

SARAH: StochAstic Recursive grAdient algoritHm (Nguyen et al., 2017, Pham et al., 2020).
In the online setting, ∇L is replaced by a sample average (1/B) Σ_{i∈I_t} ∇f_i(x_t).
Second-order optimality is also guaranteed with high probability.
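The sketch below shows the SARAH recursive gradient estimator with a simplified SSRGD-style perturbation step (the stopping condition of the escape mode is omitted); the toy finite sum and all constants are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy nonconvex finite sum (same form as before, illustrative)
n, d = 200, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def grad_batch(x, idx):
    r = A[idx] @ x - b[idx]
    return ((r / (1.0 + r**2))[:, None] * A[idx]).mean(axis=0)

def sarah_epoch(x0, eta=0.1, bsize=8, m=100):
    """One SARAH epoch: recursive variance-reduced estimator v_k of the gradient."""
    y_prev = x0.copy()
    v = grad_batch(x0, np.arange(n))          # v_0 = full gradient at the anchor point
    y = y_prev - eta * v
    for k in range(1, m):
        idx = rng.integers(n, size=bsize)
        v = grad_batch(y, idx) - grad_batch(y_prev, idx) + v   # SARAH recursion
        y_prev, y = y, y - eta * v
    return y

x = np.zeros(d)
g_thresh, r_pert = 1e-2, 1e-3
for t in range(20):
    if np.linalg.norm(grad_batch(x, np.arange(n))) <= g_thresh:
        # SSRGD-style perturbation: xi uniform in a small ball, to help escape saddle points
        xi = rng.normal(size=d)
        xi *= r_pert * rng.random() ** (1 / d) / np.linalg.norm(xi)
        x = x + xi
    x = sarah_epoch(x)

print(np.linalg.norm(grad_batch(x, np.arange(n))))
```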

Slide 9

Slide 9 text

About SARAH
SARAH: v_k = (1/b) Σ_{i∈I_k} (∇f_i(y_k) − ∇f_i(y_{k-1})) + v_{k-1}
SVRG:  v_k = (1/b) Σ_{i∈I_k} (∇f_i(y_k) − ∇f_i(x̃)) + ṽ

In SVRG the variance grows as the inner loop proceeds. In SARAH the variance does not grow as the inner loop proceeds, and it even converges to 0 in the strongly convex case.
→ The gradient estimate does not blow up, so it is easier to obtain a solution satisfying the first-order optimality condition.
(As long as one only looks at objective values in convex optimization, this difference is hard to see.)

Slide 10

Slide 10 text

Comparison of computational complexity
ϵ-first-order optimality: E[‖∇L(x)‖²] ≤ ϵ.
δ-second-order optimality: λ_min(∇²L(x)) ≥ −δ (with high probability).

Finite-sum setting (method / number of stochastic gradient evaluations / optimality):
- GD: O(n/ϵ), 1st order
- SVRG (Allen-Zhu and Hazan, 2016): O(n + n^{2/3}/ϵ), 1st order
- SARAH (Pham et al., 2020): O(n + √n/ϵ), 1st order
- SSRGD (Li, 2019): O(n + √n/ϵ), 1st order
- PGD (Jin et al., 2017b): O(n/ϵ + n/δ⁴), 2nd order
- SSRGD (Li, 2019): O(√n/ϵ + √n/δ⁴ + n/δ³), 2nd order

Online setting (method / number of stochastic gradient evaluations / optimality):
- SGD (Ghadimi and Lan, 2013): O(1/ϵ²), 1st order
- SVRG+ (Li and Li, 2018): O(1/ϵ^{7/4}), 1st order
- SARAH (Pham et al., 2020): O(1/ϵ^{3/2}), 1st order
- SSRGD (Li, 2019): O(1/ϵ^{3/2}), 1st order
- SSRGD (Li, 2019): O(1/ϵ^{3/2} + 1/(ϵδ³) + 1/(ϵ^{1/2}δ⁴)), 2nd order

Slide 11

Slide 11 text

(Reference) Strict saddle
Objectives such as those in deep learning have many stationary points. If the objective satisfies the strict saddle property, saddle points can be avoided. Trust-region methods (Conn et al., 2000) and noise-injected stochastic gradient methods (Ge et al., 2015) reach a local optimum of a strict-saddle objective (Sun et al., 2015).
* Adding noise to the iterate lets it escape from saddle points.

Strict saddle: a twice-differentiable function f is strict saddle if for every x one of the following holds:
- ‖∇f(x)‖ ≥ ϵ;
- λ_min(∇²f(x)) ≤ −γ;
- there exists x* with ‖x − x*‖ ≤ δ such that f is strongly convex on the neighborhood {x' | ‖x* − x'‖ ≤ 2δ} of x*.

E.g., the tensor decomposition problem max_{u∈R^p} ⟨Σ_{r=1}^d a_r^{⊗4}, u ⊗ u ⊗ u ⊗ u⟩ is strict saddle if a_r^⊤ a_{r'} = δ_{r,r'}.

Slide 12

Slide 12 text

Learning problems with linear constraints
min_x (1/n) Σ_{i=1}^n f_i(z_i^⊤ x) + ψ(B^⊤ x) ⇔ min_{x,y} (1/n) Σ_{i=1}^n f_i(z_i^⊤ x) + ψ(y) s.t. y = B^⊤ x.

Augmented Lagrangian: L(x, y, λ) = (1/n) Σ_i f_i(z_i^⊤ x) + ψ(y) + λ^⊤(y − B^⊤ x) + (ρ/2)‖y − B^⊤ x‖².
The optimal solution is obtained from inf_{x,y} sup_λ L(x, y, λ).

- Method of multipliers: Hestenes (1969), Powell (1969), Rockafellar (1976).
- Alternating direction method of multipliers (ADMM): Gabay and Mercier (1976), Mota et al. (2011), He and Yuan (2012), Deng and Yin (2012), Hong and Luo (2012a).
- Stochastic ADMM: SGD-ADMM (Suzuki, 2013, Ouyang et al., 2013), RDA-ADMM (Suzuki, 2013), SDCA-ADMM (Suzuki, 2014), SVRG-ADMM (Zheng and Kwok, 2016), ASVRG-ADMM (Liu et al., 2017).

Slide 13

Slide 13 text

Examples of structured regularization
Overlapped group lasso: ψ̃(w) = C Σ_{g∈G} ‖w_g‖.
It is difficult to compute the proximal mapping of ψ̃ directly.
Solution: prepare a ψ whose proximal mapping is easy to compute, write ψ(B^⊤ w) = ψ̃(w), and use the proximal mapping w.r.t. ψ.
B^⊤ w decomposes the variable into independent (non-overlapping) groups:
  ψ(y) = C Σ_{g'∈G'} ‖y_{g'}‖,  prox(q|ψ) = ( q_{g'} max{1 − C/‖q_{g'}‖, 0} )_{g'∈G'}.
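A small sketch of the block-wise proximal mapping prox(q|ψ) described above; the group index sets and the value of C are illustrative assumptions.

```python
import numpy as np

def prox_group_l2(q, groups, C):
    """prox(q | psi) for psi(y) = C * sum_{g'} ||y_{g'}||: block-wise soft-thresholding."""
    out = np.zeros_like(q)
    for g in groups:                       # groups: list of index arrays, non-overlapping in y
        norm = np.linalg.norm(q[g])
        if norm > 0:
            out[g] = q[g] * max(1.0 - C / norm, 0.0)
    return out

# Example: y = B^T w duplicates coordinates so that overlapping groups become independent blocks.
q = np.array([0.5, -2.0, 1.0, 0.1, 3.0])
groups = [np.array([0, 1]), np.array([2, 3, 4])]
print(prox_group_l2(q, groups, C=1.0))
```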

Slide 14

Slide 14 text

Other examples
Graph-guided regularization: ψ̃(w) = C Σ_{(i,j)∈E} |w_i − w_j|.
(Figure: a graph on the variables x_1, …, x_5.)
With ψ(y) = C Σ_{e∈E} |y_e| and y = B^⊤ w = (w_i − w_j)_{(i,j)∈E},
  ψ(B^⊤ w) = ψ̃(w),  prox(q|ψ) = ( q_e max{1 − C/|q_e|, 0} )_{e∈E},
which is the soft-thresholding function.

Slide 15

Slide 15 text

ADMM for structured regularization
min_x {f(x) + ψ(B^⊤ x)} ⇔ min_{x,y} {f(x) + ψ(y) s.t. y = B^⊤ x}
L(x, y, λ) = f(x) + ψ(y) + λ^⊤(y − B^⊤ x) + (ρ/2)‖y − B^⊤ x‖², where f(x) = (1/n) Σ_i f_i(z_i^⊤ x).

Structured-regularization learning by ADMM:
  x^(t) = argmin_x { f(x) + λ^{(t−1)⊤}(−B^⊤ x) + (ρ/2)‖y^{(t−1)} − B^⊤ x‖² }
  y^(t) = argmin_y { ψ(y) + λ^{(t−1)⊤} y + (ρ/2)‖y − B^⊤ x^(t)‖² }  (= prox(B^⊤ x^(t) − λ^{(t−1)}/ρ | ψ/ρ))
  λ^(t) = λ^{(t−1)} − ρ(B^⊤ x^(t) − y^(t))

The y-update is just the proximal mapping of the simple ψ → analytic solution.
The rate is O(1/k) in general (He and Yuan, 2012) and linear under strong convexity (Deng and Yin, 2012, Hong and Luo, 2012b).
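A minimal sketch of this ADMM loop under additional assumptions made only for the example: f is a squared loss (so the x-update is a linear solve), ψ = C‖·‖_1 (so the y-update is soft-thresholding), and B = I; structured choices of B only change the matrices involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative instance: f(x) = (1/2n)||Zx - b||^2,  psi(y) = C*||y||_1,  y = B^T x
n, p = 100, 20
Z = rng.normal(size=(n, p)); b = rng.normal(size=n)
B = np.eye(p)                      # B = I gives plain lasso; structured B works the same way
C, rho, T = 0.1, 1.0, 200

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

x = np.zeros(p); y = np.zeros(p); lam = np.zeros(p)
M = Z.T @ Z / n + rho * B @ B.T    # x-update is a linear solve because f is quadratic
for t in range(T):
    x = np.linalg.solve(M, Z.T @ b / n + B @ lam + rho * B @ y)
    y = soft_threshold(B.T @ x - lam / rho, C / rho)      # prox(B^T x - lam/rho | psi/rho)
    lam = lam - rho * (B.T @ x - y)

print(np.round(x[:5], 3))
```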

Slide 16

Slide 16 text

SGD-ADMM
min_x E_Z[ℓ(x, Z)] + ψ(B^⊤ x) ⇒ augmented Lagrangian: E_Z[ℓ(x, Z)] + ψ(y) + λ^⊤(y − B^⊤ x) + (ρ/2)‖y − B^⊤ x‖².
Plain SGD: x_{t+1} = argmin_x { ⟨g_t, x⟩ + ψ̃(x) + (1/(2η_t))‖x − x_t‖² } (g_t ∈ ∂_x ℓ(x_t, z_t)).

SGD-ADMM:
  x_{t+1} = argmin_{x∈X} { g_t^⊤ x − λ_t^⊤(B^⊤ x − y_t) + (ρ/2)‖B^⊤ x − y_t‖² + (1/(2η_t))‖x − x_t‖²_{G_t} }
  y_{t+1} = argmin_{y∈Y} { ψ(y) − λ_t^⊤(B^⊤ x_{t+1} − y) + (ρ/2)‖B^⊤ x_{t+1} − y‖² }
  λ_{t+1} = λ_t − ρ(B^⊤ x_{t+1} − y_{t+1})

The updates of y_{t+1} and λ_{t+1} are the same as in plain ADMM. G_t is an arbitrary positive definite symmetric matrix.

Slide 17

Slide 17 text

SGD-ADMM
min_x E_Z[ℓ(x, Z)] + ψ(B^⊤ x) ⇒ augmented Lagrangian: E_Z[ℓ(x, Z)] + ψ(y) + λ^⊤(y − B^⊤ x) + (ρ/2)‖y − B^⊤ x‖².
Plain SGD: x_{t+1} = argmin_x { ⟨g_t, x⟩ + ψ̃(x) + (1/(2η_t))‖x − x_t‖² } (g_t ∈ ∂_x ℓ(x_t, z_t)).

SGD-ADMM:
  x_{t+1} = argmin_{x∈X} { g_t^⊤ x − λ_t^⊤(B^⊤ x − y_t) + (ρ/2)‖B^⊤ x − y_t‖² + (1/(2η_t))‖x − x_t‖²_{G_t} }
  y_{t+1} = argmin_{y∈Y} { ψ(y) − λ_t^⊤(B^⊤ x_{t+1} − y) + (ρ/2)‖B^⊤ x_{t+1} − y‖² } = prox(B^⊤ x_{t+1} − λ_t/ρ | ψ)
  λ_{t+1} = λ_t − ρ(B^⊤ x_{t+1} − y_{t+1})

The updates of y_{t+1} and λ_{t+1} are the same as in plain ADMM. G_t is an arbitrary positive definite symmetric matrix.
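A minimal sketch of the SGD-ADMM updates under assumed simplifications: a squared loss on streaming data, G_t = I, B = I, ψ = C‖·‖_1, and η_t = η_0/√t as in the convergence analysis later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative streaming problem: l(x, z) = (a^T x - b)^2 / 2 with z = (a, b); psi(y) = C||y||_1
p = 20
x_true = np.concatenate([np.ones(5), np.zeros(p - 5)])
B = np.eye(p)
C, rho, eta0, T = 0.05, 1.0, 1.0, 2000

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

x = np.zeros(p); y = np.zeros(p); lam = np.zeros(p)
for t in range(1, T + 1):
    a = rng.normal(size=p); b = a @ x_true + 0.1 * rng.normal()   # observe z_t = (a, b)
    g = (a @ x - b) * a                                           # g_t in d_x l(x_t, z_t)
    eta = eta0 / np.sqrt(t)                                       # eta_t = eta_0 / sqrt(t)
    # x-update with G_t = I: (rho*B B^T + I/eta) x = -g + B lam + rho B y + x_old/eta
    M = rho * B @ B.T + np.eye(p) / eta
    x = np.linalg.solve(M, -g + B @ lam + rho * B @ y + x / eta)
    y = soft_threshold(B.T @ x - lam / rho, C / rho)              # same y-update as plain ADMM
    lam = lam - rho * (B.T @ x - y)

print(np.round(x[:8], 2))
```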

Slide 18

Slide 18 text

RDA-ADMM
Plain RDA: w_{t+1} = argmin_w { ⟨ḡ_t, w⟩ + ψ̃(w) + (1/(2η_t))‖w‖² } (ḡ_t = (1/t)(g_1 + ··· + g_t)).

RDA-ADMM: let x̄_t = (1/t) Σ_{τ=1}^t x_τ, λ̄_t = (1/t) Σ_{τ=1}^t λ_τ, ȳ_t = (1/t) Σ_{τ=1}^t y_τ, ḡ_t = (1/t) Σ_{τ=1}^t g_τ.
  x_{t+1} = argmin_{x∈X} { ḡ_t^⊤ x − (B λ̄_t)^⊤ x + (ρ/(2t))‖B^⊤ x‖² + ρ(B^⊤ x̄_t − ȳ_t)^⊤ B^⊤ x + (1/(2η_t))‖x‖²_{G_t} }
  y_{t+1} = prox(B^⊤ x_{t+1} − λ_t/ρ | ψ)
  λ_{t+1} = λ_t − ρ(B^⊤ x_{t+1} − y_{t+1})

The updates of y_{t+1} and λ_{t+1} are the same as in plain ADMM.

Slide 19

Slide 19 text

Convergence analysis
We bound the expected risk: P(x) = E_Z[ℓ(Z, x)] + ψ̃(x).
Assumptions:
(A1) ∃G s.t. every g ∈ ∂_x ℓ(z, x) satisfies ‖g‖ ≤ G for all z, x.
(A2) ∃L s.t. every g ∈ ∂ψ(y) satisfies ‖g‖ ≤ L for all y.
(A3) ∃R s.t. every x ∈ X satisfies ‖x‖ ≤ R.

Slide 20

Slide 20 text

Convergence rate: bounded gradient
(A1) ∃G s.t. every g ∈ ∂_x ℓ(z, x) satisfies ‖g‖ ≤ G for all z, x.
(A2) ∃L s.t. every g ∈ ∂ψ(y) satisfies ‖g‖ ≤ L for all y.
(A3) ∃R s.t. every x ∈ X satisfies ‖x‖ ≤ R.

Theorem (Convergence rate of RDA-ADMM). Under (A1), (A2), (A3), we have
  E_{z_{1:T−1}}[P(x̄_T) − P(x*)] ≤ (1/T) Σ_{t=2}^T (η_{t−1}/(2(t−1))) G² + (γ/η_T)‖x*‖² + K/T.

Theorem (Convergence rate of SGD-ADMM). Under (A1), (A2), (A3), we have
  E_{z_{1:T−1}}[P(x̄_T) − P(x*)] ≤ (1/(2T)) Σ_{t=2}^T max{γ/η_t − γ/η_{t−1}, 0} R² + (1/T) Σ_{t=1}^T (η_t/2) G² + K/T.

Both methods have convergence rate O(1/√T) by letting η_t = η_0 √t for RDA-ADMM and η_t = η_0/√t for SGD-ADMM.

Slide 21

Slide 21 text

Finite-sum problems
Dual problem of the regularized training error. Let A = [a_1, a_2, …, a_n] ∈ R^{p×n}.
  min_w { (1/n) Σ_{i=1}^n f_i(a_i^⊤ w) + ψ(B^⊤ w) }  (P: primal)
  = − min_{x∈R^n, y∈R^d} { (1/n) Σ_{i=1}^n f_i*(x_i) + ψ*(y/n)  s.t.  Ax + By = 0 }  (D: dual)

Optimality conditions: a_i^⊤ w* ∈ ∇f_i*(x_i*),  (1/n) y* ∈ ∇ψ(u)|_{u=B^⊤ w*},  Ax* + By* = 0.
⋆ Each coordinate x_i corresponds to one observation a_i.

Slide 22

Slide 22 text

SDCA-ADMM
Augmented Lagrangian: L(x, y, w) := Σ_{i=1}^n f_i*(x_i) + n ψ*(y/n) − ⟨w, Ax + By⟩ + (ρ/2)‖Ax + By‖².

SDCA-ADMM: for each t = 1, 2, …, choose i ∈ {1, …, n} uniformly at random and update
  y^(t) ← argmin_y { L(x^{(t−1)}, y, w^{(t−1)}) + (1/2)‖y − y^{(t−1)}‖²_Q }
  x_i^(t) ← argmin_{x_i∈R} { L([x_i; x^{(t−1)}_{\i}], y^(t), w^{(t−1)}) + (1/2)‖x_i − x_i^{(t−1)}‖²_{G_{i,i}} }
  w^(t) ← w^{(t−1)} − ξρ{ n(Ax^(t) + By^(t)) − (n−1)(Ax^{(t−1)} + By^{(t−1)}) }

Q and G_{i,i} are positive definite symmetric matrices satisfying certain conditions.
Each update modifies only the i-th coordinate x_i. The w-update needs some care.

Slide 23

Slide 23 text

Outline
1. More advanced topics in stochastic optimization
   - Stochastic optimization of nonconvex functions
   - Optimization with structured regularization
2. Infinite-dimensional stochastic optimization: kernel methods
   - Definition of reproducing kernel Hilbert spaces
   - Optimization in reproducing kernel Hilbert spaces

Slide 24

Slide 24 text

Optimization in reproducing kernel Hilbert spaces
(Introduced here since it also connects to the Neural Tangent Kernel discussed later.)

Slide 25

Slide 25 text

Linear regression
Design matrix X = (X_ij) ∈ R^{n×p}, responses Y = [y_1, …, y_n]^⊤ ∈ R^n, true vector β* ∈ R^p.
Model: Y = Xβ* + ξ.
Ridge regression (Tikhonov regularization): β̂ ← argmin_{β∈R^p} (1/n)‖Xβ − Y‖²_2 + λ_n‖β‖²_2.

Slide 26

Slide 26 text

Linear regression
Design matrix X = (X_ij) ∈ R^{n×p}, responses Y = [y_1, …, y_n]^⊤ ∈ R^n, true vector β* ∈ R^p.
Model: Y = Xβ* + ξ.
Ridge regression (Tikhonov regularization): β̂ ← argmin_{β∈R^p} (1/n)‖Xβ − Y‖²_2 + λ_n‖β‖²_2.

Change of variables: because of the regularization term, β̂ ∈ Ker(X)^⊥, i.e., β̂ ∈ Im(X^⊤).
Hence there exists α̂ ∈ R^n such that β̂ = X^⊤ α̂.
(Equivalent problem) α̂ ← argmin_{α∈R^n} (1/n)‖XX^⊤ α − Y‖²_2 + λ_n α^⊤(XX^⊤)α.
* Since (XX^⊤)_ij = x_i^⊤ x_j, it suffices to be able to compute inner products between the observations x_i and x_j.

Slide 27

Slide 27 text

Kernelizing ridge regression
Ridge regression (after the change of variables): α̂ ← argmin_{α∈R^n} (1/n)‖(XX^⊤)α − Y‖²_2 + λ_n α^⊤(XX^⊤)α.
* (XX^⊤)_ij = x_i^⊤ x_j is the inner product between the samples x_i and x_j.

Slide 28

Slide 28 text

Kernelizing ridge regression
Ridge regression (after the change of variables): α̂ ← argmin_{α∈R^n} (1/n)‖(XX^⊤)α − Y‖²_2 + λ_n α^⊤(XX^⊤)α.
* (XX^⊤)_ij = x_i^⊤ x_j is the inner product between the samples x_i and x_j.

Idea of the kernel method: replace the inner products between the x's by another nonlinear function, x_i^⊤ x_j → k(x_i, x_j). This k: R^p × R^p → R is called a kernel function.
Conditions a kernel function must satisfy:
- Symmetry: k(x, x') = k(x', x).
- Positive semidefiniteness: Σ_{i=1}^m Σ_{j=1}^m α_i α_j k(x_i, x_j) ≥ 0 for all {x_i}_{i=1}^m, {α_i}_{i=1}^m, m.
Conversely, any function with these properties may be used in the kernel method.

Slide 29

Slide 29 text

Kernel ridge regression
With K = (k(x_i, x_j))_{i,j=1}^n,
  α̂ ← argmin_{α∈R^n} (1/n)‖Kα − Y‖²_2 + λ_n α^⊤ K α.
For a new input x, predict with y = Σ_{i=1}^n k(x, x_i) α̂_i.
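A minimal sketch of kernel ridge regression with a Gaussian kernel on toy 1-D data (kernel, data, and λ are illustrative assumptions); α = (K + nλI)^{−1}Y satisfies the first-order optimality condition of the stated objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_kernel(X1, X2, sigma=0.5):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

# Toy 1-D regression data (illustrative)
n = 50
X = rng.uniform(-3, 3, size=(n, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

lam = 1e-2
K = gauss_kernel(X, X)
alpha = np.linalg.solve(K + n * lam * np.eye(n), Y)   # stationary point of (1/n)||Ka - Y||^2 + lam a^T K a

# Prediction at new inputs: y(x) = sum_i k(x, x_i) * alpha_i
X_test = np.linspace(-3, 3, 5)[:, None]
y_pred = gauss_kernel(X_test, X) @ alpha
print(np.round(y_pred, 2))
```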

Slide 30

Slide 30 text

Kernel ridge regression
With K = (k(x_i, x_j))_{i,j=1}^n,
  α̂ ← argmin_{α∈R^n} (1/n)‖Kα − Y‖²_2 + λ_n α^⊤ K α.
For a new input x, predict with y = Σ_{i=1}^n k(x, x_i) α̂_i.

Kernel function ⇔ reproducing kernel Hilbert space (RKHS) H_k:
there exists a map φ(x): R^p → H_k such that k(x, x') = ⟨φ(x), φ(x')⟩_{H_k}.
Kernel trick: ⟨Σ_{i=1}^n α_i φ(x_i), φ(x)⟩_{H_k} = Σ_{i=1}^n α_i k(x_i, x).
→ It suffices to be able to evaluate the kernel function.

Slide 31

Slide 31 text

Reproducing kernel Hilbert space (RKHS)
Let P_X be the input distribution and L2(P_X) = {f | E_{X∼P_X}[f(X)²] < ∞} the corresponding L2 space.
The kernel function can be decomposed as (Steinwart and Scovel, 2012)
  k(x, x') = Σ_{j=1}^∞ μ_j e_j(x) e_j(x'),
where (e_j)_{j=1}^∞ is an orthonormal system in L2(P_X) (‖e_j‖_{L2(P_X)} = 1, ⟨e_j, e_{j'}⟩_{L2(P_X)} = 0 for j ≠ j') and μ_j ≥ 0.

Definition (reproducing kernel Hilbert space H_k):
  ⟨f, g⟩_{H_k} := Σ_{j=1}^∞ (1/μ_j) α_j β_j for f = Σ_{j=1}^∞ α_j e_j, g = Σ_{j=1}^∞ β_j e_j ∈ L2(P_X);
  ‖f‖_{H_k} := √⟨f, f⟩_{H_k};
  H_k := {f ∈ L2(P_X) | ‖f‖_{H_k} < ∞} equipped with ⟨·, ·⟩_{H_k}.

Reproducing property: for f ∈ H_k, the value f(x) is "reproduced" as an inner product: f(x) = ⟨f, k(x, ·)⟩_{H_k}.

Slide 32

Slide 32 text

Properties of the RKHS
Writing φ_k(x) = k(x, ·) ∈ H_k, we have k(x, x') = ⟨φ_k(x), φ_k(x')⟩_{H_k}; this φ_k is also called the feature map.
Integral operator associated with the kernel, T_k: L2(P_X) → L2(P_X): T_k f := ∫ f(x) k(x, ·) dP_X(x).
The kernel decomposition above corresponds to the spectral decomposition of T_k.
The RKHS H_k can also be written as
  H_k = T_k^{1/2} L2(P_X),  ‖f‖_{H_k} = inf{ ‖h‖_{L2(P_X)} | f = T_k^{1/2} h, h ∈ L2(P_X) }.
Any f ∈ H_k can be written as f(x) = Σ_{j=1}^∞ a_j √μ_j e_j(x) with ‖f‖_{H_k} = √(Σ_{j=1}^∞ a_j²).
(e_j)_j is an orthonormal basis in L2, and (√μ_j e_j)_j is a complete orthonormal basis of the RKHS.
Expressed in coordinates w.r.t. this complete orthonormal basis, the feature map φ_k(x) = k(x, ·) ∈ H_k becomes φ_k(x) = (√μ_1 e_1(x), √μ_2 e_2(x), …)^⊤.

Slide 33

Slide 33 text

Picture of the RKHS
Nonlinear inference is carried out via the nonlinear map φ into the RKHS; inside the RKHS only linear operations are performed.
(Figure: data mapped into the reproducing kernel Hilbert space.)
A kernel method can be viewed as an infinitely wide two-layer neural network in which the first layer is fixed and only the second-layer parameters are learned (a representative example of a "shallow" learning method).

Slide 34

Slide 34 text

Reformulating kernel ridge regression
Reproducing property: for f ∈ H_k, f(x) = ⟨f, φ(x)⟩_{H_k}.
Reformulation: f̂ ← argmin_{f∈H_k} (1/n) Σ_{i=1}^n (y_i − f(x_i))² + C‖f‖²_{H_k}.
Representer theorem: ∃α_i ∈ R s.t. f̂(x) = Σ_{i=1}^n α_i k(x_i, x),
⇒ ‖f̂‖_{H_k} = √(Σ_{i,j=1}^n α_i α_j k(x_i, x_j)) = √(α^⊤ K α).
This coincides with the earlier formulation of kernel ridge regression.

Slide 35

Slide 35 text

Examples of kernels
- Gaussian kernel: k(x, x') = exp(−‖x − x'‖²/(2σ²))
- Polynomial kernel: k(x, x') = (1 + x^⊤ x')^p
- χ²-kernel: k(x, x') = exp(−γ² Σ_{j=1}^d (x_j − x'_j)²/(x_j + x'_j))
- Matérn kernel: k(x, x') = ∫_{R^d} e^{iλ^⊤(x−x')} (1 + ‖λ‖²)^{−(α+d/2)} dλ
- Graph kernels, time-series kernels, ...
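A small sketch that builds Gram matrices for the Gaussian and polynomial kernels on random data and numerically checks the two required properties, symmetry and positive semidefiniteness; the data and kernel parameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))

def gram(kernel, X):
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

gauss = lambda x, xp, sigma=1.0: np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))
poly  = lambda x, xp, p=3: (1.0 + x @ xp) ** p

for name, k in [("Gaussian", gauss), ("polynomial", poly)]:
    K = gram(k, X)
    # Symmetry and positive semidefiniteness (up to numerical error) = valid kernel
    print(name, np.allclose(K, K.T), np.linalg.eigvalsh(K).min() > -1e-8)
```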

Slide 36

Slide 36 text

Stochastic optimization in the RKHS
Setting: y_i = f°(x_i) + ξ_i; we want to estimate f° from (x_i, y_i)_{i=1}^n (f° is assumed to lie essentially in H_k).
Rewriting the expected loss:
  E[(f(X) − Y)²] = E[(f(X) − f°(X) − ξ)²] = E[(f(X) − f°(X))²] + σ²,
so solving min_{f∈H_k} E[(f(X) − Y)²] recovers f°.
With K_x = k(x, ·) ∈ H_k and f(x) = ⟨f, K_x⟩_{H_k}, the Fréchet derivative of L(f) = E[(f(X) − Y)²] within the RKHS is
  ∇L(f) = 2 E[K_X(⟨K_X, f⟩_{H_k} − Y)] = 2(Σ f − E[K_X Y]), where Σ := E[K_X K_X*].
Gradient descent on the expected loss: f*_t = f*_{t−1} − η·2(Σ f*_{t−1} − E[K_X Y]).
Gradient descent on the empirical loss (Σ̂, Ê denote sample averages): f̂_t = f̂_{t−1} − η·2(Σ̂ f̂_{t−1} − Ê[K_X Y]).
Stochastic gradient update: g_t = g_{t−1} − η·2(K_{x_{i_t}} K*_{x_{i_t}} g_{t−1} − K_{x_{i_t}} y_{i_t}).
* (x_{i_t}, y_{i_t})_{t=1}^∞ are drawn i.i.d. uniformly from (x_i, y_i)_{i=1}^n.
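A minimal sketch of the stochastic update g_t = g_{t−1} − 2η(g_{t−1}(x_{i_t}) − y_{i_t}) K_{x_{i_t}}, storing the iterate through its expansion coefficients over the training points; the Gaussian kernel, toy data, step size, and iteration count are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_kernel(X1, X2, sigma=0.5):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

# Toy data y_i = f°(x_i) + noise (illustrative)
n = 100
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)
K = gauss_kernel(X, X)

# g_t = sum_j c_j k(x_j, .): each stochastic step adds a multiple of K_{x_{i_t}} = k(x_{i_t}, .)
eta, T = 0.2, 3000
c = np.zeros(n)
for t in range(T):
    i = rng.integers(n)                        # (x_{i_t}, y_{i_t}) drawn uniformly
    residual = K[i] @ c - y[i]                 # g_{t-1}(x_{i_t}) - y_{i_t}
    c[i] -= 2 * eta * residual                 # g_t = g_{t-1} - 2*eta*residual*K_{x_{i_t}}

train_err = np.mean((K @ c - y) ** 2)
print(round(train_err, 4))                     # early stopping controls how far this decreases
```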

Slide 37

Slide 37 text

Viewing the gradient as smoothing
Update of the function values:
  f*_t(x) = f*_{t−1}(x) − 2η ∫ k(x, X)(f*_{t−1}(X) − Y) dP(X, Y) = f*_{t−1}(x) − 2η T_k(f*_{t−1} − f°)(x),
since Y averages to f°(X) under the integral.
The integral operator T_k suppresses high-frequency components: the gradient in the RKHS is the L2 functional gradient smoothed by T_k (in practice an estimate of T_k from the sample is used).
Stopping before the high-frequency components appear prevents overfitting → early stopping.
Carelessly applying Newton's method and the like is therefore dangerous.

Slide 38

Slide 38 text

Regularization by early stopping
(Figure: trajectory from the initial value toward the minimizer of the training error (overfitting), stopped by early stopping.)
Bias-variance decomposition:
  ‖f° − f̂‖_{L2(P_X)} (estimation error) ≤ ‖f° − f̌‖_{L2(P_X)} (approximation error, bias) + ‖f̌ − f̂‖_{L2(P_X)} (sample deviation, variance).
Stopping before reaching the minimizer of the training error (early stopping) acts as regularization.
Infinite-dimensional models (RKHSs) overfit easily, so this requires care.

Slide 39

Slide 39 text

Conditions used in the analysis
The following standard conditions are assumed (the same kind of assumptions are standard in statistical theory) (Caponnetto and de Vito, 2007, Dieuleveut et al., 2016, Pillaud-Vivien et al., 2018):
- μ_i = O(i^{−α}) for some α > 1. α characterizes the complexity of the RKHS H_k (small α: complex, large α: simple).
- f° ∈ T_k^r(L2(P_X)) for some r > 0. r characterizes how far f° "sticks out" of the RKHS; r = 1/2 corresponds to f° ∈ H_k (r < 1/2: outside, r ≥ 1/2: contained).
- ‖f‖_{L∞(P_X)} ≲ ‖f‖^{1−μ}_{L2(P_X)} ‖f‖^μ_{H_k} (∀f ∈ H_k) for some μ ∈ (0, 1]. μ characterizes the smoothness of the functions in H_k (small μ: smooth).
* On the last condition: if f ∈ W^m([0,1]^d) (Sobolev space) and P_X is supported on [0,1]^d with a density bounded below by some constant c > 0, then it holds with μ = d/(2m).

Slide 40

Slide 40 text

Convergence rate
Bias-variance decomposition:
  ‖f° − g_t‖²_{L2(P_X)} ≲ ‖f° − f*_t‖²_{L2(P_X)} ((a): bias) + ‖f*_t − f̂_t‖²_{L2(P_X)} ((b): variance) + ‖f̂_t − g_t‖²_{L2(P_X)} ((c): SGD deviation),
with (a) ≲ (ηt)^{−2r}, (b) ≲ ((ηt)^{1/α} + (ηt)^{μ−2r})/n, (c) ≲ η(ηt)^{1/α−1}.
(a) is the gap between the expectation (over the data) of the gradient-descent solution and the true function (bias); (b) is the variance of the gradient-descent solution; (c) is the fluctuation caused by using stochastic gradients.
Increasing the number of updates t decreases the bias but increases the variance, so the two must be balanced (early stopping).

Theorem (convergence rate of multi-pass SGD (Pillaud-Vivien et al., 2018)). Let η = 1/(4 sup_x k(x, x)²).
- If μα < 2rα + 1 < α, then with t = Θ(n^{α/(2rα+1)}): E[L(g_t)] − L(f°) = O(n^{−2rα/(2rα+1)}).
- If μα ≥ 2rα + 1, then with t = Θ(n^{1/μ}(log n)^{1/μ}): E[L(g_t)] − L(f°) = O(n^{−2r/μ}).

Slide 41

Slide 41 text

Convergence of natural gradient descent
Natural gradient: f̂_t = f̂_{t−1} − η(Σ̂ + λI)^{−1}(Σ̂ f̂_{t−1} − Ê[K_X Y]).
(Setting: plenty of unlabeled data is available, so Σ can be estimated well; see Murata and Suzuki (2020) for the analysis of GD.)

Theorem (convergence of natural gradient (Amari et al., 2020)).
  E[‖f̂_t − f°‖²_{L2(P_X)}] ≲ B(t) + V(t),
where B(t) = exp(−ηt) ∨ (λ/(ηt))^{2r} and V(t) = (1 + ηt)(λ^{−1}B(t) + λ^{−1/α})/n + (1 + tη)^4 (1 ∨ λ^{2r−μ}) λ^{−1/α}/n.
In particular, with λ = n^{−α/(2rα+1)} and t = Θ(log n), E[‖f̂_t − f°‖²_{L2(P_X)}] = O(n^{−2rα/(2rα+1)} log(n)^4).

* The bias converges rapidly, but the variance also grows quickly.
→ Because of the preconditioning, high-frequency components appear earlier, so one has to stop even earlier to avoid overfitting.
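A sketch of the empirical natural-gradient update restricted to the span of {k(x_i, ·)}, where (Σ̂ + λI) acts on the expansion coefficients as (K/n + λI); the kernel, data, and constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_kernel(X1, X2, sigma=0.5):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

n = 100
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)
K = gauss_kernel(X, X)

# Empirical natural gradient on f = sum_j alpha_j k(x_j, .):
# the preconditioner (Sigma_hat + lam*I) acts on the coefficients as (K/n + lam*I).
eta, lam, T = 1.0, 1e-3, 50
P = K / n + lam * np.eye(n)
alpha = np.zeros(n)
for t in range(T):
    grad_coef = (K @ alpha - y) / n                 # coefficients of Sigma_hat f - E_hat[K_X Y]
    alpha -= eta * np.linalg.solve(P, grad_coef)    # natural-gradient step

print(round(np.mean((K @ alpha - y) ** 2), 4))      # bias drops fast; stop early to avoid overfitting
```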

Slide 42

Slide 42 text

How convergence proceeds
(Figure: predictive error, bias, and variance versus the optimization step, comparing natural gradient and gradient descent.)

Slide 43

Slide 43 text

Bernstein's inequality for operators
Σ = E_x[K_x K_x*]: Σf = ∫ k(·, x) f(x) dP_X(x);  Σ̂ = (1/n) Σ_{i=1}^n K_{x_i} K*_{x_i}: Σ̂f = (1/n) Σ_{i=1}^n k(·, x_i) f(x_i).
Let Σ_λ := Σ + λI and F_∞(λ) := sup_x K_x* Σ_λ^{−1} K_x. An estimate of the following type is needed:
  ‖Σ_λ^{−1}(Σ − Σ̂)Σ_λ^{−1}‖ ≲ √(F_∞(λ)β/n) + (1 + F_∞(λ))β/n
with probability 1 − δ, where β = log(4 Tr[Σ Σ_λ^{−1}]/δ).
→ This bounds the deviation of the empirical distribution from the true one.

Theorem (Bernstein's inequality for self-adjoint operators (Minsker, 2017)).
Let (X_i)_{i=1}^n be independent self-adjoint operator-valued random variables with E[X_i] = 0, and let σ² ≥ ‖Σ_{i=1}^n E[X_i²]‖ and U ≥ ‖X_i‖. With r(A) = Tr[A]/‖A‖,
  P( ‖Σ_{i=1}^n X_i‖ ≥ t ) ≤ 14 r(Σ_{i=1}^n E[X_i²]) exp( −t²/(2(σ² + tU/3)) ).
Apply this with X_i = Σ_λ^{−1} K_{x_i} K*_{x_i} Σ_λ^{−1}. (See also Tropp (2012).)

Slide 44

Slide 44 text

Stochastic optimization with regularization
Going beyond the squared loss, consider a general smooth convex loss ℓ (e.g., for classification).
Regularized expected-loss minimization: min_{f∈H_k} E[ℓ(Y, f(X))] + λ‖f‖²_{H_k} =: L_λ(f).
Solve this by SGD, exploiting that the objective is λ-strongly convex:
  g_{t+1} = g_t − η_t (ℓ'(y_t, g_t(x_t)) K_{x_t} + λ g_t),
  ḡ_{T+1} = Σ_{t=1}^{T+1} [2(c_0 + t − 1)/((2c_0 + T)(T + 1))] g_t  (polynomial averaging).
Assumptions: (i) ℓ is γ-smooth and ‖ℓ'‖_∞ ≤ M; (ii) k(x, x) ≤ 1. Let g_λ = argmin_{g∈H_k} L_λ(g).

Theorem (Nitanda and Suzuki (2019)). For an appropriate c_0 > 0, setting η_t = 2/(λ(c_0 + t)) gives
  E[L_λ(ḡ_{T+1}) − L_λ(g_λ)] ≲ M²/(λ(c_0 + T)) + ((γ + λ)/(T + 1)) ‖g_1 − g_λ‖²_{H_k}.
A high-probability bound also follows from a martingale concentration inequality.
For classification, under the strong low noise condition one can even show exponential convergence of the classification error.
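A sketch of this regularized kernel SGD with polynomial averaging, instantiated (as an assumption) with the logistic loss for binary labels and a Gaussian kernel; the iterate and its average are stored as coefficient vectors over the training points, and all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_kernel(X1, X2, sigma=0.7):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

# Toy binary classification data; logistic loss l(y, u) = log(1 + exp(-y*u)), so |l'| <= 1
n = 200
X = rng.normal(size=(n, 2))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=n))
K = gauss_kernel(X, X)

lam, c0, T = 1e-2, 8.0, 5000
c = np.zeros(n)          # g_t = sum_j c_j k(x_j, .)
c_avg = np.zeros(n)      # polynomially averaged iterate bar(g)
wsum = 0.0
for t in range(1, T + 1):
    i = rng.integers(n)
    eta = 2.0 / (lam * (c0 + t))                 # eta_t = 2 / (lam * (c0 + t))
    u = K[i] @ c                                 # g_t(x_t)
    dloss = -y[i] / (1.0 + np.exp(y[i] * u))     # l'(y_t, g_t(x_t))
    c *= (1.0 - eta * lam)                       # shrinkage from the lam*g_t term
    c[i] -= eta * dloss                          # subtract eta * l' * K_{x_t}
    w = c0 + t - 1                               # polynomial-averaging weight (up to normalization)
    c_avg += w * c
    wsum += w
c_avg /= wsum

acc = np.mean(np.sign(K @ c_avg) == y)
print(round(acc, 3))
```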

Slide 45

Slide 45 text

Hoeffding's inequality for martingales
Theorem (martingale Hoeffding-type concentration inequality (Pinelis, 1994)).
Let D_1, …, D_T ∈ H_k be a sequence of random variables with E[D_t] = 0 and ‖D_t‖_{H_k} ≤ R_t (a.s.). Then for every ϵ > 0,
  P[ max_{1≤t≤T} ‖Σ_{s=1}^t D_s‖_{H_k} ≥ ϵ ] ≤ 2 exp( −ϵ²/(2 Σ_{t=1}^T R_t²) ).
Taking D_t = E[ḡ_{T+1} | Z_1, …, Z_t] − E[ḡ_{T+1} | Z_1, …, Z_{t−1}] with Z_t = (x_t, y_t) gives Σ_{t=1}^T D_t = ḡ_{T+1} − E[ḡ_{T+1}], so the deviation of the realization from its expectation can be controlled.

Slide 46

Slide 46 text

Hoeffding's inequality for martingales
Theorem (martingale Hoeffding-type concentration inequality (Pinelis, 1994)).
Let D_1, …, D_T ∈ H_k be a sequence of random variables with E[D_t] = 0 and ‖D_t‖_{H_k} ≤ R_t (a.s.). Then for every ϵ > 0,
  P[ max_{1≤t≤T} ‖Σ_{s=1}^t D_s‖_{H_k} ≥ ϵ ] ≤ 2 exp( −ϵ²/(2 Σ_{t=1}^T R_t²) ).
Taking D_t = E[ḡ_{T+1} | Z_1, …, Z_t] − E[ḡ_{T+1} | Z_1, …, Z_{t−1}] with Z_t = (x_t, y_t) gives Σ_{t=1}^T D_t = ḡ_{T+1} − E[ḡ_{T+1}], so the deviation of the realization from its expectation can be controlled.

(Supplement) Since L_λ is λ-strongly convex w.r.t. the RKHS norm,
  ‖ḡ_{T+1} − g_λ‖²_{H_k} ≤ O(1/(λ²T))
holds with high probability. Since moreover ‖·‖_∞ ≤ ‖·‖_{H_k}, under the margin condition |P(Y = 1|X) − P(Y = −1|X)| ≥ δ (strong low noise condition), perfect classification becomes possible with high probability.

Slide 47

Slide 47 text

(Reference) Strong low noise condition

Slide 48

Slide 48 text

Z. Allen-Zhu and E. Hazan. Variance reduction for faster non-convex optimization. arXiv preprint arXiv:1603.05643, 2016.
S. Amari, J. Ba, R. Grosse, X. Li, A. Nitanda, T. Suzuki, D. Wu, and J. Xu. When does preconditioning help or hurt generalization?, 2020.
A. Caponnetto and E. de Vito. Optimal rates for regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
A. R. Conn, N. I. Gould, and P. L. Toint. Trust region methods, volume 1. SIAM, 2000.
W. Deng and W. Yin. On the global and linear convergence of the generalized alternating direction method of multipliers. Technical report, Rice University CAAM TR12-14, 2012.
A. Dieuleveut, F. Bach, et al. Nonparametric stochastic approximation with large step-sizes. The Annals of Statistics, 44(4):1363–1399, 2016.
D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite-element approximations. Computers & Mathematics with Applications, 2:17–40, 1976.
R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pages 797–842, 2015.

Slide 49

Slide 49 text

S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
B. He and X. Yuan. On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700–709, 2012.
M. Hestenes. Multiplier and gradient methods. Journal of Optimization Theory & Applications, 4:303–320, 1969.
M. Hong and Z.-Q. Luo. On the linear convergence of the alternating direction method of multipliers. arXiv preprint arXiv:1208.3922, 2012a.
M. Hong and Z.-Q. Luo. On the linear convergence of the alternating direction method of multipliers. Technical report, 2012b. arXiv:1208.3922.
C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan. How to escape saddle points efficiently. In International Conference on Machine Learning, pages 1724–1732, 2017a.
C. Jin, P. Netrapalli, and M. I. Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456, 2017b.
Z. Li. SSRGD: Simple stochastic recursive gradient descent for escaping saddle points. In Advances in Neural Information Processing Systems, pages 1523–1533, 2019.

Slide 50

Slide 50 text

Z. Li and J. Li. A simple proximal stochastic gradient method for nonsmooth nonconvex optimization. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 5564–5574. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7800-a-simple-proximal-stochastic-gradient-method-for-nonsmooth-n
Y. Liu, F. Shang, and J. Cheng. Accelerated variance reduced stochastic ADMM. In Thirty-First AAAI Conference on Artificial Intelligence, pages 2287–2293, 2017.
S. Minsker. On some extensions of Bernstein's inequality for self-adjoint operators. Statistics & Probability Letters, 127:111–119, 2017.
J. F. Mota, J. M. Xavier, P. M. Aguiar, and M. Püschel. A proof of convergence for the alternating direction method of multipliers applied to polyhedral-constrained functions. arXiv preprint arXiv:1112.2295, 2011.
T. Murata and T. Suzuki. Gradient descent in RKHS with importance labeling, 2020.
L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2613–2621, International Convention Centre, Sydney, Australia, 06–11

Slide 51

Slide 51 text

Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/nguyen17b.html.
A. Nitanda and T. Suzuki. Stochastic gradient descent with exponential convergence rates of expected classification errors. In K. Chaudhuri and M. Sugiyama, editors, Proceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pages 1417–1426. PMLR, 16–18 Apr 2019. URL http://proceedings.mlr.press/v89/nitanda19a.html.
H. Ouyang, N. He, L. Q. Tran, and A. Gray. Stochastic alternating direction method of multipliers. In Proceedings of the 30th International Conference on Machine Learning, 2013.
N. H. Pham, L. M. Nguyen, D. T. Phan, and Q. Tran-Dinh. ProxSARAH: An efficient algorithmic framework for stochastic composite nonconvex optimization. Journal of Machine Learning Research, 21(110):1–48, 2020.
L. Pillaud-Vivien, A. Rudi, and F. Bach. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. In Advances in Neural Information Processing Systems, pages 8114–8124, 2018.
I. Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. The Annals of Probability, pages 1679–1706, 1994.
M. Powell. A method for nonlinear constraints in minimization problems. In R. Fletcher, editor, Optimization, pages 283–298. Academic Press, London, New York, 1969.

Slide 52

Slide 52 text

S. J. Reddi, A. Hefny, S. Sra, B. Póczós, and A. Smola. Stochastic variance reduction for nonconvex optimization. arXiv preprint arXiv:1603.06160, 2016.
R. T. Rockafellar. Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of Operations Research, 1:97–116, 1976.
I. Steinwart and C. Scovel. Mercer's theorem on general domains: on the interaction between measures, kernels, and RKHSs. Constructive Approximation, 35(3):363–417, 2012.
J. Sun, Q. Qu, and J. Wright. When are nonconvex problems not scary? arXiv preprint arXiv:1510.06096, 2015.
T. Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In Proceedings of the 30th International Conference on Machine Learning, pages 392–400, 2013.
T. Suzuki. Stochastic dual coordinate ascent with alternating direction method of multipliers. In Proceedings of the 31st International Conference on Machine Learning, pages 736–744, 2014.
J. A. Tropp. User-friendly tools for random matrices: An introduction. Technical report, 2012.
S. Zheng and J. T. Kwok. Fast-and-light stochastic ADMM. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 2407–2413, 2016.