Slide 1

Slide 1 text

Optimization Theory in Machine Learning and Its Learning-Theoretic Aspects. Part I: Proximal Gradient Methods and Stochastic Gradient Descent. Taiji Suzuki, Department of Mathematical Informatics, Graduate School of Information Science and Technology, The University of Tokyo / RIKEN AIP. August 6, 2020, @ Combinatorial Optimization Seminar 2020 (COSS2020). 1 / 167

Slide 2

Slide 2 text

ຊηϛφʔͷΞ΢τϥΠϯ 1 ୈҰ෦ɿۙ઀ޯ഑๏ͱ֬཰త࠷దԽ (ತɼ༗ݶ࣍ݩ) 2 ୈೋ෦ɿඇತ࠷దԽͱ࠶ੜ֩ώϧϕϧτۭؒʹ͓͚Δ࠷దԽ 3 ୈࡾ෦ɿਂ૚ֶशͷ࠷దԽ (ඇತɼແݶ࣍ݩ) ൚ԽޡࠩΛߟྀͨ͠࠷దԽख๏ͷઃܭ γϯϓϧͳղ๏ʹΑΔʮ͍ܰʯֶशͷ࣮ݱɿϏοάσʔλղੳ ਂ૚ֶशͱ͍͏ղੳͷ೉͍͠ର৅ͷ࠷దԽཧ࿦ɿඇತ࠷దԽʹݱΕΔ “ತੑ” 2 / 167

Slide 3

Slide 3 text

Outline 1 ౷ܭతֶशͷجຊతఆࣜԽ 2 Ұ༷ό΢ϯυ جຊతͳෆ౳ࣜ Rademacher ෳࡶ͞ͱ Dudley ੵ෼ ہॴ Rademacher ෳࡶ͞ 3 ࠷దੑ ڐ༰ੑ minimax ࠷దੑ 4 ϕΠζͷֶशཧ࿦ 5 ػցֶशͷ࠷దԽ͓Αͼۙ઀ޯ഑๏ 6 ֬཰త࠷దԽ֓ཁ 7 ΦϯϥΠϯܕ֬཰త࠷దԽ ֬཰తޯ഑߱Լ๏ SGD ʹର͢Δ Nesterov ͷՃ଎๏ 8 όονܕ֬཰త࠷దԽ ֬཰త෼ࢄॖখޯ഑๏ 9 Appendix: Convex analysis Duality 3 / 167

Slide 4

Slide 4 text

Outline 1 ౷ܭతֶशͷجຊతఆࣜԽ 2 Ұ༷ό΢ϯυ جຊతͳෆ౳ࣜ Rademacher ෳࡶ͞ͱ Dudley ੵ෼ ہॴ Rademacher ෳࡶ͞ 3 ࠷దੑ ڐ༰ੑ minimax ࠷దੑ 4 ϕΠζͷֶशཧ࿦ 5 ػցֶशͷ࠷దԽ͓Αͼۙ઀ޯ഑๏ 6 ֬཰త࠷దԽ֓ཁ 7 ΦϯϥΠϯܕ֬཰త࠷దԽ ֬཰తޯ഑߱Լ๏ SGD ʹର͢Δ Nesterov ͷՃ଎๏ 8 όονܕ֬཰త࠷దԽ ֬཰త෼ࢄॖখޯ഑๏ 9 Appendix: Convex analysis Duality 4 / 167

Slide 5

Slide 5 text

ػցֶशͷ໰୊ઃఆ ڭࢣ͋Γֶश ɹσʔλ͕ೖྗͱͦΕʹର͢Δϥϕϧͷ૊Ͱ༩͑ΒΕΔɽ ɹ৽͍͠ೖྗ͕དྷͨ࣌ʹରԠ͢ΔϥϕϧΛ༧ଌ͢Δ໰୊ɽ ɹ໰୊ͷྫɿճؼɼ൑ผ ɹɹɹɹɹ ( , 3) ( , 5) ڭࢣͳֶ͠श ɹσʔλʹϥϕϧ͕෇͍͍ͯͳ͍ɽ ɹ໰୊ͷྫɿΫϥελϦϯάɼԻݯ෼཭ɼҟৗݕ஌ ൒ڭࢣ͋Γֶश ɹҰ෦ͷσʔλʹϥϕϧ͕෇͍͍ͯΔɽ ڧԽֶश ɹࢼߦࡨޡ͠ͳ͕Βࣗ෼ͰσʔλΛूΊΔɽ 5 / 167

Slide 6

Slide 6 text

ػցֶशͷྲྀΕ ಛ௃நग़: ը૾ͳͲͷର৅ΛԿΒ͔ͷํ๏ͰϕΫτϧʹม׵ɽ(෼໺͝ͱͷ ϊ΢ϋ΢) Ұ౓ಛ௃ϕΫτϧʹม׵ͯ͠͠·͑͹͋ͱ͸౷ܭͷ໰୊ɽ x1 x2 xd ಛ௃நग़ ಛ௃ϕΫτϧ ෼ੳ ύϥϝʔλਪఆ ༧ଌϞσϧͷߏங (θ: Ϟσϧͷύϥϝʔλ) ʢڭࢣ༗Γֶशʣ y = f (x; θ) ˞ਂ૚ֶश͸ಛ௃நग़ͷ෦෼ΛωοτϫʔΫߏ଄Λ޻෉͢Δ͜ͱͰֶशʹ૊Έࠐ ΜͰ͍Δɽ 6 / 167

Slide 7

Slide 7 text

ଛࣦؔ਺Λ༻͍ͨఆࣜԽ ڭࢣ͋Γ/ͳֶ͠शɼ͍ͣΕ΋ଛࣦؔ਺ͷ࠷খԽͱͯ͠ఆࣜԽͰ͖Δɽ σʔλͷߏ଄Λද͢ύϥϝʔλ θ ∈ Θ (Θ ͸Ծઆू߹ (Ϟσϧ)) ˡ ʮֶशʯ≈ θ ͷਪఆ ଛࣦؔ਺: ύϥϝʔλ θ ͕σʔλ z ΛͲΕ͚ͩΑ͘આ໌͍ͯ͠Δ͔; ℓ(z, θ). ൚Խޡࠩ (ظ଴ޡࠩ)ɿଛࣦͷظ଴஋ ˠ ൚Խޡࠩ࠷খԽ ≈ʮֶशʯ min θ∈Θ EZ [ℓ(Z, θ)]. ܇࿅ޡࠩʢܦݧޡࠩʣɿ؍ଌ͞ΕͨσʔλͰ୅༻, min θ∈Θ 1 n n ∑ i=1 ℓ(zi , θ). ˞ ܇࿅ޡࠩͱ൚Խޡࠩʹ͕ࠩ͋Δ͜ͱ͕ػցֶशʹ͓͚Δ࠷దԽͷಛ௃ɽ 7 / 167

Slide 8

Slide 8 text

Ϟσϧͷྫ (ڭࢣ͋Γ) ճؼ z = (x, y) ∈ Rd+1 ℓ(z, θ) = (y − θ⊤x)2 (ೋ৐ޡࠩ) min θ∈Rd 1 n n ∑ i=1 ℓ(zi , θ) = min θ∈Rd 1 n n ∑ i=1 (yi − θ⊤xi )2 (࠷খೋ৐๏) zi 8 / 167

Slide 9

Slide 9 text

ڭࢣ͋Γֶशͷଛࣦؔ਺ʢճؼʣ ͷσʔλ z = (x, y) ʹ͓͚Δ f = x⊤θ ͷଛࣦ. ೋ৐ଛࣦ: ℓ(y, f ) = 1 2 (y − f )2. τ-෼Ґ఺ଛࣦ: ℓ(y, f ) = (1 − τ) max{f − y, 0} + τ max{y − f , 0}. ͨͩ͠ɼτ ∈ (0, 1). ෼Ґ఺ճؼʹ༻͍ΒΕΔɽ ϵ-ײ౓ଛࣦ: ℓ(y, f ) = max{|y − f | − ϵ, 0}, ͨͩ͠ɼϵ > 0. αϙʔτϕΫτϧճؼʹ༻͍ΒΕΔɽ f-y -3 -2 -1 0 1 2 3 0 1 2 3 τ Huber ǫ 9 / 167

Slide 10

Slide 10 text

ڭࢣ͋Γֶशͷଛࣦؔ਺ʢ൑ผʣ y ∈ {±1} ϩδεςΟ οΫଛࣦ: ℓ(y, f ) = log((1 + exp(−yf ))/2). ώϯδଛࣦ: ℓ(y, f ) = max{1 − yf , 0}. ࢦ਺ଛࣦ: ℓ(y, f ) = exp(−yf ). ฏ׈Խώϯδଛࣦ: ℓ(y, f ) =      0, (yf ≥ 1), 1 2 − yf , (yf < 0), 1 2 (1 − yf )2, (otherwise). yf -3 -2 -1 0 1 2 3 0 1 2 3 4 0-1 10 / 167

Slide 11

Slide 11 text

աֶश ܦݧޡࠩ࠷খԽͱ൚Խޡࠩ࠷খԽʹ͸େ͖ͳΪϟ οϓ͕͋Δɽ ୯ͳΔܦݧޡࠩ࠷খԽ͸ʮաֶशʯΛҾ͖ى͜͢ɽ 5 10 15 20 0 2 4 6 8 10 12 cubic spline fitting Index y True Overfitting 11 / 167

Slide 12

Slide 12 text

ਖ਼ଇԽ๏ ී௨ͷϩεؔ਺ (ෛͷର਺໬౓) ࠷খԽ: min β n ∑ i=1 ℓ(yi , β⊤xi ). ਖ਼ଇԽ෇͖ଛࣦؔ਺࠷খԽ: min β n ∑ i=1 ℓ(yi , β⊤xi ) + ψ(β) ਖ਼ଇԽ߲ . ਖ਼ଇԽ߲ͷྫ: Ϧοδਖ਼ଇԽ (ℓ2 -ਖ਼ଇԽ): ψ(β) = λ∥β∥2 2 ℓ1 -ਖ਼ଇԽ: ψ(β) = λ∥β∥1 τϨʔεϊϧϜਖ਼ଇԽ: ψ(W ) = Tr[(W ⊤W )1/2] (W ∈ RN×M : ߦྻ) 12 / 167

Slide 13

Slide 13 text

ਖ਼ଇԽ๏ ී௨ͷϩεؔ਺ (ෛͷର਺໬౓) ࠷খԽ: min β n ∑ i=1 ℓ(yi , β⊤xi ). ਖ਼ଇԽ෇͖ଛࣦؔ਺࠷খԽ: min β n ∑ i=1 ℓ(yi , β⊤xi ) + ψ(β) ਖ਼ଇԽ߲ . ਖ਼ଇԽ߲ͷྫ: Ϧοδਖ਼ଇԽ (ℓ2 -ਖ਼ଇԽ): ψ(β) = λ∥β∥2 2 ℓ1 -ਖ਼ଇԽ: ψ(β) = λ∥β∥1 τϨʔεϊϧϜਖ਼ଇԽ: ψ(W ) = Tr[(W ⊤W )1/2] (W ∈ RN×M : ߦྻ) ਖ਼ଇԽ߲ʹΑΓ෼ࢄ͕཈͑ΒΕɼաֶश͕๷͕ΕΔɽ ͦͷ෼ɼόΠΞε͕৐Δɽ ˠ ద੾ͳਖ਼ଇԽͷڧ͞ (λ) ΛબͿඞཁ͕͋Δɽ 12 / 167
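As a concrete illustration of how the regularization strength λ trades variance against bias, here is a minimal sketch (not taken from the slides; data sizes and the λ grid are made-up choices) that fits ridge regression in closed form for several values of λ and compares training and held-out error. Small λ overfits, large λ underfits, and an intermediate λ balances the two.

```python
# Minimal synthetic sketch (not from the slides): ridge regression solved in
# closed form for several regularization strengths lambda, comparing training
# and held-out error. Data sizes and the lambda grid are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 30
beta_true = rng.normal(size=p)
X_tr, X_te = rng.normal(size=(n, p)), rng.normal(size=(n, p))
y_tr = X_tr @ beta_true + rng.normal(scale=2.0, size=n)
y_te = X_te @ beta_true + rng.normal(scale=2.0, size=n)

for lam in [1e-4, 1e-2, 1.0, 100.0]:
    # ridge estimator: argmin_b (1/n)||y - Xb||^2 + lam * ||b||^2
    beta = np.linalg.solve(X_tr.T @ X_tr / n + lam * np.eye(p), X_tr.T @ y_tr / n)
    tr = np.mean((y_tr - X_tr @ beta) ** 2)
    te = np.mean((y_te - X_te @ beta) ** 2)
    print(f"lambda={lam:8.4f}  train MSE={tr:7.3f}  test MSE={te:7.3f}")
```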

Slide 14

Slide 14 text

ਖ਼ଇԽͷྫɿ Ϧοδਖ਼ଇԽͱաֶश ଟ߲ࣜճؼʢ15 ࣍ଟ߲ࣜʣ min θ∈R15 1 n n ∑ i=1 {yi − (β1 xi + β2 x2 i + · · · + β15 x15 i )}2 + λ∥β∥2 2 13 / 167

Slide 15

Slide 15 text

ਖ਼ଇԽͷྫɿℓ1 -ਖ਼ଇԽʢεύʔεਪఆʣ ˆ β = arg min β∈Rp 1 n n ∑ i=1 (yi − x⊤ i β)2 + λ p ∑ j=1 |βj |. R. Tsibshirani (1996). Regression shrinkage and selection via the lasso. J. Royal. Statist. Soc B., Vol. 58, No. 1, pages 267–288. 14 / 167

Slide 16

Slide 16 text

εύʔεੑͷԸܙ yi = x⊤ i β∗ + ϵi (i = 1, . . . , n)ɽβ∗ : ਅͷϕΫτϧɽ ˆ β = arg min β∈Rp 1 n n ∑ i=1 (yi − x⊤ i β)2 + λ p ∑ j=1 |βj |. xi ∈ Rp (p ࣍ݩ), d = ∥β∗∥0 (ਅͷඇ 0 ཁૉͷ਺) ͱ͢Δɽ Theorem (Lasso ͷऩଋϨʔτ) ͋Δ৚݅ͷ΋ͱɼ͋Δఆ਺ C ͕ଘࡏͯ͠ ∥ˆ β − β∗∥2 2 ≤ C dlog(p) n . ˞શମͷ࣍ݩ p ͸͔͔ͨͩ O(log(p)) Ͱ͔͠Өڹ͠ͳ͍ ! ࣮࣭త࣍ݩ d ͕ࢧ഑తɽ (Lasso) d log(p) n ≪ p n (࠷খೋ৐๏) 15 / 167

Slide 17

Slide 17 text

੍ݶݻ༗஋৚݅ (Restricted eigenvalue condition) A = 1 n X⊤X ͱ͢Δɽ Definition (੍ݶݻ༗஋৚݅ (RE(k′, C))) ϕRE (k′, C) = ϕRE (k′, C, A) := inf J⊆{1,...,n},v∈Rp: |J|≤k′,C∥vJ ∥1≥∥vJc ∥1 v⊤Av ∥vJ ∥2 2 ʹର͠ɼϕRE > 0 ͕੒Γཱͭɽ ΄΅εύʔεͳϕΫτϧʹ੍ݶͯ͠ఆٛͨ͠࠷খݻ༗஋ɽ k′ = 2d Ͱ੒Γཱ͍ͬͯΕ͹Α͍ɽ ϥϯμϜͳ X ʹରͯ͠ߴ֬཰Ͱ੒Γཱͭ͜ͱ͕ࣔͤΔ: Johnson Lindenstrauss ͷิ୊ (Johnson et al., 1986, Dasgupta and Gupta, 1999, Rudelson and Zhou, 2013)ɽ J 相関が小さい 一次独立 16 / 167

Slide 18

Slide 18 text

Uniform bound. (Figure: population risk L(f) over the hypothesis class, with minimizer f*.) L(f) = E[ℓ(Y, f(X))], ˆL(f) = (1/n) ∑_{i=1}^n ℓ(y_i, f(x_i)). 17 / 167

Slide 19

Slide 19 text

Ұ༷ό΢ϯυ f f* L(f) L(f) ^ f ^ “ͨ·ͨ·” ͏·͍͘͘΍͕͍ͭΔ (աֶश) ͔΋͠Εͳ͍ɽ ࣮ࡍɼF ͕ෳࡶͳ৔߹ऩଋ͠ͳ͍ྫ͕ 17 / 167

Slide 20

Slide 20 text

Ұ༷ό΢ϯυ f f* L(f) L(f) ^ f ^ Ұ༷ͳό΢ϯυ Ұ༷ͳό΢ϯυʹΑͬͯʮͨ·ͨ·͏·͍͘͘ʯ͕ (΄ͱΜͲ) ͳ͍͜ͱΛอূ (ܦݧաఔͷཧ࿦) 17 / 167

Slide 21

Slide 21 text

Rademacher ෳࡶ౓ (Ұ༷ό΢ϯυ) L(ˆ f ) − ˆ L(ˆ f ) ≤ sup f ∈F { L(f ) − ˆ L(f ) } ≤ (?) Rademacher ෳࡶ౓: ϵ1, ϵ2, . . . , ϵn : Rademacher ม਺, i.e., P(ϵi = 1) = P(ϵi = −1) = 1 2 . R(ℓ ◦ F) := E{ϵi },{xi } [ sup f ∈F 1 n n ∑ i=1 ϵi ℓ(yi , f (xi )) ] . ରশԽ: (ظ଴஋ͷό΢ϯυ) E [ sup f ∈F |L(f ) − L(f )| ] ≤ 2R(ℓ ◦ F). Rademacher ෳࡶ͞Λ཈͑Ε͹Ұ༷ό΢ϯυ͕ಘΒΕΔʂ جຊతʹɼR(ℓ ◦ F) ≤ O(1/ √ n) Ͱ཈͑ΒΕΔɽ ྫɿ F = {f (x) = x⊤β | β ∈ Rd , ∥β∥ ≤ 1} ͔ͭ ℓ ͕ 1-Lipshitz ࿈ଓͳ࣌ɼ R(ℓ ◦ F) ≤ O( √ d/n). 18 / 167
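The Rademacher complexity defined here can be estimated by Monte Carlo for simple classes. The sketch below (a hypothetical setup with synthetic data, not from the slides) uses the unit-ball linear class F = {x -> <beta, x> : ||beta||_2 <= 1}, for which the supremum over F has a closed form, so the empirical Rademacher complexity (before composing with a Lipschitz loss, which changes it only up to the Lipschitz constant) is obtained by averaging over random sign draws.

```python
# Monte Carlo sketch of the empirical Rademacher complexity of the unit-ball
# linear class F = {x -> <beta, x> : ||beta||_2 <= 1}. For this class the
# supremum has a closed form: sup_{||beta||<=1} (1/n) sum_i eps_i <beta, x_i>
# equals || (1/n) sum_i eps_i x_i ||_2, so we only average over random signs.
import numpy as np

rng = np.random.default_rng(0)
n, d, n_mc = 200, 10, 2000
X = rng.normal(size=(n, d))

vals = []
for _ in range(n_mc):
    eps = rng.choice([-1.0, 1.0], size=n)      # Rademacher signs
    vals.append(np.linalg.norm(eps @ X / n))   # sup over the unit ball = dual norm
print("estimated empirical Rademacher complexity:", np.mean(vals))
print("O(sqrt(d/n)) reference scale             :", np.sqrt(d / n))
```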

Slide 22

Slide 22 text

ΧόϦϯάφϯόʔʢࢀߟʣ Rademacher ෳࡶ౓Λ཈͑ΔͨΊʹ༗༻ɽ ΧόϦϯάφϯόʔ: Ծઆू߹ F ͷෳࡶ͞ɾ༰ྔɽ ϵ-ΧόϦϯάφϯόʔ N(F, ϵ, d) ϊϧϜ d Ͱఆ·Δ൒ܘ ϵ ͷϘʔϧͰ F Λ෴͏ͨΊ ʹඞཁͳ࠷খͷϘʔϧͷ਺ɽ F ༗ݶݸͷݩͰ F Λۙࣅ͢Δͷʹ࠷௿ݶඞཁͳݸ਺ɽ Theorem (Dudley ੵ෼) ∥f ∥2 n := 1 n ∑ n i=1 f (xi )2 ͱ͢Δͱɼ R(F) ≤ C √ n EDn [∫ ∞ 0 √ log(N(F, ϵ, ∥ · ∥n ))dϵ ] . 19 / 167

Slide 23

Slide 23 text

ہॴ Rademacher ෳࡶ͞ʢࢀߟʣ ہॴ Rademacher ෳࡶ͞: Rδ (F) := R({f ∈ F | E[(f − f ∗)2] ≤ δ}). ࣍ͷ৚݅ΛԾఆͯ͠ΈΔ. F ͸ 1 Ͱ্͔Β཈͑ΒΕ͍ͯΔ: ∥f ∥∞ ≤ 1 (∀f ∈ F). ℓ ͸ Lipschitz ࿈ଓ͔ͭڧತ: E[ℓ(Y , f (X))] − E[ℓ(Y , f ∗(X))] ≥ BE[(f − f ∗)2] (∀f ∈ F). Theorem (Fast learning rate (Bartlett et al., 2005)) δ∗ = inf{δ | δ ≥ Rδ (F)} ͱ͢Δͱɼ֬཰ 1 − e−t Ͱ L(ˆ f ) − L(f ∗) ≤ C ( δ∗ + t n ) . δ∗ ≤ R(F) ͸ৗʹ੒Γཱͭ (ӈਤࢀর). ͜ΕΛ Fast learning rate ͱݴ͏ɽ R± (F) ± ± ±* 20 / 167

Slide 24

Slide 24 text

ਖ਼ଇԽͱ࠷దԽ Ϟσϧͷ੍ݶʹΑΔਖ਼ଇԽ Early stopping ʹΑΔਖ਼ଇԽ ਅͷؔ਺ Ϟσϧ ਪఆྔ όΠΞε όϦΞϯε όϦΞϯε ॳظ஋ ܇࿅ޡࠩ࠷খԽݩ ʢաֶशʣ &BSMZTUPQQJOH όΠΞε-όϦΞϯε෼ղ ∥f o − ˆ f ∥L2(PX ) Estimation error ≤ ∥f o − ˇ f ∥L2(PX ) Approximation error (bias) + ∥ˇ f − ˆ f ∥L2(PX ) Sample deviation (variance) ܇࿅ޡࠩ࠷খԽݩʹୡ͢ΔલʹࢭΊΔ (early stopping) ͜ͱͰਖ਼ଇԽ͕ಇ͘ɽ ˠ ਂ૚ֶशɼBoosting ͷৗ౟खஈɽ 21 / 167

Slide 25

Slide 25 text

Avoiding overfitting by early stopping. Figure from Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron, Chapter 4: Training Models. https://www.oreilly.com/library/view/hands-on-machine-learning/9781491962282/ch04.html 22 / 167
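The mechanism illustrated above can be written as a short loop. The following is a sketch only, assuming plain gradient descent on a made-up over-parameterized least-squares problem: iterate on the training loss and keep the parameter with the smallest validation error rather than the last iterate.

```python
# Sketch of early stopping: run plain gradient descent on the training loss of
# a made-up over-parameterized least-squares problem and keep the iterate with
# the smallest validation error instead of the last one.
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 60                                      # p > n: easy to overfit
beta_true = np.zeros(p); beta_true[:5] = 3.0
X_tr, X_va = rng.normal(size=(n, p)), rng.normal(size=(n, p))
y_tr = X_tr @ beta_true + rng.normal(size=n)
y_va = X_va @ beta_true + rng.normal(size=n)

L_smooth = 2.0 * np.linalg.norm(X_tr, 2) ** 2 / n  # smoothness of (1/n)||y - Xb||^2
eta = 1.0 / L_smooth
beta = np.zeros(p)
best_val, best_beta, best_t = np.inf, beta.copy(), 0
for t in range(1, 501):
    beta = beta - eta * (2.0 / n) * X_tr.T @ (X_tr @ beta - y_tr)
    val = np.mean((y_va - X_va @ beta) ** 2)
    if val < best_val:
        best_val, best_beta, best_t = val, beta.copy(), t
final_val = np.mean((y_va - X_va @ beta) ** 2)
print(f"best validation MSE {best_val:.3f} at iteration {best_t}; last iterate: {final_val:.3f}")
```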

Slide 26

Slide 26 text

ػցֶशͷ࠷దԽͷಛ௃ ൚ԽޡࠩΛখ͘͢͞Δ͜ͱ͕ॏཁɽඞͣ͠΋࠷దԽ໰୊Λ׬શʹղ͘ඞཁ͸ ͳ͍ɽ ໨తʹԠͯ͡࠷దԽ͠΍͍͢Α͏ʹ໰୊Λม͑ͯྑ͍ɽ ྫ: εύʔεਪఆ (૊߹ͤ࠷దԽΛತ࠷దԽʹ؇࿨)ɽ େن໛ɾߴ࣍ݩσʔλɽ ˠ ͳΔ΂ָͯ͘͠࠷దԽ͍ͨ͠ɽҰ࣍࠷దԽ๏ɼ֬཰త࠷దԽ๏ɽ 23 / 167

Slide 27

Slide 27 text

Outline 1 ౷ܭతֶशͷجຊతఆࣜԽ 2 Ұ༷ό΢ϯυ جຊతͳෆ౳ࣜ Rademacher ෳࡶ͞ͱ Dudley ੵ෼ ہॴ Rademacher ෳࡶ͞ 3 ࠷దੑ ڐ༰ੑ minimax ࠷దੑ 4 ϕΠζͷֶशཧ࿦ 5 ػցֶशͷ࠷దԽ͓Αͼۙ઀ޯ഑๏ 6 ֬཰త࠷దԽ֓ཁ 7 ΦϯϥΠϯܕ֬཰త࠷దԽ ֬཰తޯ഑߱Լ๏ SGD ʹର͢Δ Nesterov ͷՃ଎๏ 8 όονܕ֬཰త࠷దԽ ֬཰త෼ࢄॖখޯ഑๏ 9 Appendix: Convex analysis Duality 24 / 167

Slide 28

Slide 28 text

ਖ਼ଇԽֶश๏ ܇࿅ޡࠩ࠷খԽ: min x∈Rp 1 n n ∑ i=1 ℓ(zi , x). ਖ਼ଇԽ෇͖܇࿅ޡࠩ࠷খԽ: min x∈Rp 1 n n ∑ i=1 ℓ(zi , x) + ψ(x). ͠͹Β͘ ℓ ͱ ψ ͸ತؔ਺Ͱ͋ΔͱԾఆ. 25 / 167

Slide 29

Slide 29 text

ฏ׈ੑͱڧತੑ Definition ฏ׈ੑ: ޯ഑͕Ϧϓγοπ࿈ଓ: ∥∇f (x) − ∇f (x′)∥ ≤ L∥x − x′∥. ⇒ f (x) ≤ f (y) + ⟨x − y, ∇f (y)⟩ + L 2 ∥x − y∥2. ڧತੑ: ∀θ ∈ (0, 1), ∀x, y ∈ dom(f ), µ 2 θ(1 − θ)∥x − y∥2 + f (θx + (1 − θ)y) ≤ θf (x) + (1 − θ)f (y). ฏ׈͕ͩ ڧತͰ͸ͳ͍ ฏ׈͔ͭ ڧತ ڧತ͕ͩ ฏ׈Ͱ͸ͳ͍ 26 / 167

Slide 30

Slide 30 text

ฏ׈ੑˠ࠷ద஋Λ্͔Β཈͑ΒΕΔɽ ڧತੑˠ࠷ద஋ͷൣғΛݶఆͰ͖Δɽ 27 / 167

Slide 31

Slide 31 text

ฏ׈ੑͱڧತੑͷ૒ରੑ ฏ׈ੑͱڧತੑ͸ޓ͍ʹ૒ରͷؔ܎ʹ͋Δ. Theorem f : Rp → ¯ R Λਅดತؔ਺Ͱ͋Δͱ͢Δɽͦͷ࣌ɼҎԼ͕੒Γཱͭ: f ͕ L-ฏ׈ ⇐⇒ f ∗ ͕ 1/L-ڧತ. logistic loss its dual function Smooth but not strongly convex Strongly convex but not smooth (gradient → ∞) f ∗(y) = sup x∈Rp {⟨x, y⟩ − f (x)}. 28 / 167

Slide 32

Slide 32 text

Ұ࣍࠷దԽ๏ xt-1 xt xt+1 ؔ਺஋ f (x) ͱޯ഑ g ∈ ∂f (x) ͷ৘ใͷΈΛ༻͍ͨ࠷దԽख๏. Ұճͷߋ৽ʹ͔͔Δܭࢉྔ͕ܰ͘ɼߴ࣍ݩ࠷దԽ໰୊ʹ༗༻ɽ χϡʔτϯ๏͸ೋ࣍࠷దԽख๏. 29 / 167

Slide 33

Slide 33 text

࠷ٸ߱Լ๏ f (x) = ∑ n i=1 ℓ(zi , x) ͱ͢Δ. min x f (x). (ྼ) ޯ഑๏ ඍ෼Մೳͳ f (x): xt = xt−1 − ηt ∇f (xt−1 ). 30 / 167

Slide 34

Slide 34 text

࠷ٸ߱Լ๏ f (x) = ∑ n i=1 ℓ(zi , x) ͱ͢Δ. min x f (x). (ྼ) ޯ഑๏ ྼඍ෼Մೳͳ f (x): gt ∈ ∂f (xt−1 ), xt = xt−1 − ηt gt . 30 / 167

Slide 35

Slide 35 text

࠷ٸ߱Լ๏ f (x) = ∑ n i=1 ℓ(zi , x) ͱ͢Δ. min x f (x). (ྼ) ޯ഑๏ (ಉ஋ͳදݱ) ྼඍ෼Մೳͳ f (x): xt = xt−1 − ηt gt = argmin x { 1 2ηt ∥x − (xt−1 − ηt gt )∥2 } = argmin x { ⟨x, gt ⟩ + 1 2ηt ∥x − xt−1∥2 } , ͨͩ͠ɼgt ∈ ∂f (xt−1 ). ۙ઀఺ΞϧΰϦζϜ: xt = argmin x { f (x) + 1 2ηt ∥x − xt−1 ∥2 } . Ұൠͷ৔߹: f (xt ) − f (x∗) ≤ 1 2 ∑ t k=1 ηk ∥x0 − x∗∥. f (x) ͕ڧತͷ৔߹: f (xt ) − f (x∗) ≤ 1 2η ( 1 1+ση ) t−1 ∥x0 − x∗∥2. 30 / 167
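A sketch of the subgradient method with the step size η_t = η_0/√t discussed here, applied to a made-up nonsmooth convex objective (mean absolute deviation regression); the problem and the constants are purely illustrative.

```python
# Sketch of the subgradient method with step size eta_t = eta0 / sqrt(t) on a
# made-up nonsmooth convex objective: mean absolute deviation regression.
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
Z = rng.normal(size=(n, p))
x_true = rng.normal(size=p)
y = Z @ x_true + rng.normal(scale=0.1, size=n)

def f(x):
    return np.mean(np.abs(y - Z @ x))

def subgrad(x):
    # one subgradient of x -> mean_i |y_i - z_i^T x|
    return -Z.T @ np.sign(y - Z @ x) / n

x = np.zeros(p)
eta0 = 1.0
for t in range(1, 2001):
    x = x - eta0 / np.sqrt(t) * subgrad(x)
print("objective at the last iterate:", f(x))
print("objective at the true vector :", f(x_true))
```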

Slide 36

Slide 36 text

xt-1 31 / 167

Slide 37

Slide 37 text

xt-1 f(x) 31 / 167

Slide 38

Slide 38 text

˒ ۙ઀ޯ഑๏ f (x) = ∑ n i=1 ℓ(zi , x) ͱ͢Δ. min x f (x) + ψ(x). ۙ઀ޯ഑๏ xt = argmin x { ⟨x, gt⟩ + ψ(x) + 1 2ηt ∥x − xt−1∥2 } = argmin x { ηtψ(x) + 1 2 ∥x − (xt−1 − ηt gt )∥2 } ͨͩ͠, gt ∈ ∂f (xt−1 ). ߋ৽ଇ͸ۙ઀ࣸ૾Ͱ༩͑ΒΕΔ: prox(q| ˜ ψ) = argmin x { ˜ ψ(x) + 1 2 ∥x − q∥2 } → ۙ઀ࣸ૾ʹΑΓਖ਼ଇԽ߲ͷѱ͍ੑ࣭ͷӨڹΛճආ (ඍ෼ෆՄೳੑͳͲ). 32 / 167

Slide 39

Slide 39 text

ۙ઀ࣸ૾ͷྫ̍ L1 ਖ਼ଇԽ: ψ(x) = λ∥x∥1 . xt = argmin x { ληt∥x∥1 + 1 2 ∥x − (xt−1 − ηt gt ) =:qt ∥2 } = argmin x { ∑p j=1 [ ληt|xj | + 1 2 (xj − qt,j )2 ]} ࠲ඪ͝ͱʹ෼͔Ε͍ͯΔʂ xt,j = STληt (qt,j ) (j ൪໨ͷཁૉ) ͨͩ͠ ST ͸Soft-Thresholding functionͱݺ͹ΕΔ: STC (q) = sign(q) max{|q| − C, 0}. → ॏཁͰͳ͍ཁૉΛ 0 ʹ͢Δ. 33 / 167
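Below is a sketch of the soft-thresholding operator ST_C above together with the resulting proximal gradient iteration for the Lasso. The data are synthetic, and the step size is set to the reciprocal of the smoothness constant of the quadratic part, which is an assumption of this sketch rather than something prescribed on the slide.

```python
# Soft-thresholding operator and the resulting proximal gradient iteration for
# the Lasso on synthetic data; the step size is 1 / (smoothness of the
# quadratic part), an assumption of this sketch.
import numpy as np

def soft_threshold(q, c):
    # ST_c(q) = sign(q) * max(|q| - c, 0), applied coordinate-wise
    return np.sign(q) * np.maximum(np.abs(q) - c, 0.0)

rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.normal(size=(n, p))
beta_true = np.zeros(p); beta_true[:5] = 2.0
y = X @ beta_true + 0.1 * rng.normal(size=n)

lam = 0.1
eta = n / np.linalg.norm(X, 2) ** 2          # 1 / smoothness of f(b) = (1/(2n))||y - Xb||^2
beta = np.zeros(p)
for t in range(500):
    grad = X.T @ (X @ beta - y) / n          # gradient of the smooth part
    beta = soft_threshold(beta - eta * grad, lam * eta)   # prox of eta * lam * ||.||_1
print("estimated support:", np.flatnonzero(beta))
```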

Slide 40

Slide 40 text

ۙ઀ࣸ૾ͷྫ̎ τϨʔεϊϧϜ: ψ(X) = λ∥X∥tr = λ ∑ j σj (X) (ಛҟ஋ͷ࿨). Xt−1 − ηt Gt = U diag(σ1, . . . , σd )V , ͱಛҟ஋෼ղ͢Δͱɼ Xt = U    STληt (σ1 ) ... STλη (σd )    V . 34 / 167
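A sketch of the corresponding proximal map for the trace norm: compute an SVD and soft-threshold the singular values. The input matrix below is random and only serves to show the shrinkage of the spectrum.

```python
# Proximal map of the trace norm: soft-threshold the singular values.
# The input matrix below is random and only illustrates the shrinkage.
import numpy as np

def prox_trace_norm(Q, c):
    # argmin_X  c * ||X||_tr + (1/2) * ||X - Q||_F^2
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    return U @ np.diag(np.maximum(s - c, 0.0)) @ Vt

rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 6))
X = prox_trace_norm(Q, c=1.0)
print("singular values before:", np.round(np.linalg.svd(Q, compute_uv=False), 2))
print("singular values after :", np.round(np.linalg.svd(X, compute_uv=False), 2))
```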

Slide 41

Slide 41 text

ۙ઀ޯ഑๏ͷऩଋ ڧತੑͱฏ׈ੑ͕ऩଋϨʔτΛܾΊΔɽ xt = prox(xt−1 − ηt gt|ηtψ(x)). f ͷੑ࣭ µ-ڧತ ඇڧತ γ-ฏ׈ exp ( −t µ γ ) γ t ඇฏ׈ 1 µt 1 √ t εςοϓ෯ ηt ͸ద੾ʹબͿඞཁ͕͋Δ. ηt ͷઃఆ ڧತ ඇڧತ ฏ׈ 1 γ 1 γ ඇฏ׈ 2 µt 1 √ t ࠷దͳऩଋϨʔτΛಘΔͨΊʹ͸దٓ {xt}t ͷฏۉΛऔΔඞཁ͕͋Δɽ Polyak-Ruppert ฏۉԽ, ଟ߲ࣜฏۉԽ. 35 / 167

Slide 42

Slide 42 text

ۙ઀ޯ഑๏ͷऩଋ ڧತੑͱฏ׈ੑ͕ऩଋϨʔτΛܾΊΔɽ xt = prox(xt−1 − ηt gt|ηtψ(x)). f ͷੑ࣭ µ-ڧತ ඇڧತ γ-ฏ׈ exp ( −t √ µ γ ) γ t2 ඇฏ׈ 1 µt 1 √ t εςοϓ෯ ηt ͸ద੾ʹબͿඞཁ͕͋Δ. ηt ͷઃఆ ڧತ ඇڧತ ฏ׈ 1 γ 1 γ ඇฏ׈ 2 µt 1 √ t ࠷దͳऩଋϨʔτΛಘΔͨΊʹ͸దٓ {xt}t ͷฏۉΛऔΔඞཁ͕͋Δɽ Polyak-Ruppert ฏۉԽ, ଟ߲ࣜฏۉԽ. ฏ׈ͳଛࣦͳΒ Nesterov ͷՃ଎๏ʹΑΓऩଋΛվળͰ͖Δ. → ࠷దͳऩଋϨʔτ 35 / 167

Slide 43

Slide 43 text

Nesterov ͷՃ଎๏ (ඇڧತ) minx {f (x) + ψ(x)} Ծఆ: f (x) ͸ γ-ฏ׈. Nesterov ͷՃ଎๏ s1 = 1, η = 1 γ ͱ͢Δɽt = 1, 2, . . . ͰҎԼΛ܁Γฦ͢: 1 gt = ∇f (yt ) ͱͯ͠ xt = prox(yt − ηgt|ηψ) ͱߋ৽. 2 st+1 = 1+ √ 1+4s2 t 2 ͱઃఆ. 3 yt+1 = xt + ( st −1 st+1 ) (xt − xt−1 ) ͱߋ৽. f ͕ γ-ฏ׈ͳΒ͹ɼ f (xt ) − f (x∗) ≤ 2γ∥x0 − x∗∥2 t2 . Fast Iterative Shrinkage Thresholding Algorithm (FISTA) (Beck and Teboulle, 2009) ͱ΋ݺ͹Ε͍ͯΔ. εςοϓαΠζ η = 1/γ ͸όοΫτϥοΩϯάͰܾఆͰ͖Δ. ਂ૚ֶशͰ࢖ΘΕ͍ͯΔ “ϞʔϝϯλϜ” ๏΋ࣅͨΑ͏ͳํ๏ (Sutskever et al., 2013). 36 / 167
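A sketch of FISTA following the momentum schedule on this slide (s_1 = 1, s_{t+1} = (1 + √(1 + 4 s_t^2))/2, y_{t+1} = x_t + ((s_t - 1)/s_{t+1})(x_t - x_{t-1})), applied to a synthetic Lasso problem; the data and the iteration count are made up for illustration.

```python
# FISTA sketch with the momentum schedule s_1 = 1, s_{t+1} = (1 + sqrt(1 + 4 s_t^2)) / 2,
# y_{t+1} = x_t + ((s_t - 1) / s_{t+1}) (x_t - x_{t-1}), on a synthetic Lasso problem.
import numpy as np

def soft_threshold(q, c):
    return np.sign(q) * np.maximum(np.abs(q) - c, 0.0)

rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.normal(size=(n, p))
beta_true = np.zeros(p); beta_true[:5] = 2.0
y = X @ beta_true + 0.1 * rng.normal(size=n)

lam = 0.1
gamma = np.linalg.norm(X, 2) ** 2 / n        # smoothness of f(b) = (1/(2n))||y - Xb||^2
eta = 1.0 / gamma

x = np.zeros(p)
yk = x.copy()
s = 1.0
for t in range(300):
    grad = X.T @ (X @ yk - y) / n
    x_new = soft_threshold(yk - eta * grad, lam * eta)
    s_new = (1.0 + np.sqrt(1.0 + 4.0 * s ** 2)) / 2.0
    yk = x_new + (s - 1.0) / s_new * (x_new - x)   # momentum step
    x, s = x_new, s_new
print("objective:", 0.5 * np.mean((y - X @ x) ** 2) + lam * np.sum(np.abs(x)))
```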

Slide 44

Slide 44 text

37 / 167

Slide 45

Slide 45 text

Nesterov ͷՃ଎๏ͷղऍ Ճ଎๏ͷղऍ͸༷ʑͳํ޲͔Βͳ͞Ε͖ͯͨɽͦͷதͰ΋ɼAhn (2020) ʹΑΔ ࠷ۙͷ݁Ռ͸ཧղ͠΍͍͢ɽ ۙ઀఺ΞϧΰϦζϜ: xt+1 = argminx {f (x) + 1 2ηt+1 ∥x − xt∥2} (ྑ͍ऩଋ). ˠ̎छྨͷۙࣅ: gt = ∇f (yt ), f (yt ) + ⟨gt, x − yt⟩ ≤ f (x) ≤ f (yt ) + ⟨gt, x − yt⟩ + γ 2 ∥x − yt∥2. ̎छྨͷۙࣅΛ༻͍ͨަޓ࠷దԽ: zt+1 = argmin z { f (yt ) + ⟨∇f (yt ), z − yt⟩ + 1 2ηt+1 ∥z − zt∥2 } , yt+1 = argmin y { f (yt ) + ⟨∇f (yt ), y − yt⟩ + γ 2 ∥y − yt∥2 + 1 2ηt+1 ∥y − zt+1∥2 } . yt = 1/γ 1/γ+1/ηt zt + 1/γ 1/γ+1/ηt xt, zt+1 = zt − ηt+1∇f (yt ), xt+1 = yt − 1 γ ∇f (yt ). ≃ yt = 1/γ 1/γ+1/ηt zt + 1/γ 1/γ+1/ηt xt, xt+1 = yt − 1 γ ∇f (yt ), zt+1 = xt+1 + γηt (xt+1 − xt ). ɹ ◦ (γηt+1 + 1/2)2 = (γηt + 1)2 + 1/4 ͱ͢Ε͹ɼݩͷߋ৽ࣜΛಘΔ (ηt = Θ(t))ɽ ◦ ࠨͷߋ৽ࣜͰ΋ O(1/t2) Λୡ੒ɽ 38 / 167

Slide 46

Slide 46 text

ηt = Θ(t) ͱ͢Δɽͭ·Γɼt ͕૿େ͢ΔʹͭΕɼԼքʹؔ͢Δߋ৽͕ڧௐ͞Ε Δɽڧತ౓߹͍͕ڧ͍ํ޲΁ઌʹऩଋͯ͠ʢ্քͷํʣ ɼޙ͔Βڧತ۩߹͕ऑ͍ ํ޲ʢԼքͷํʣΛऩଋͤ͞Δಈ͖ʹͳΔɽ 39 / 167

Slide 47

Slide 47 text

Ճ଎๏ͷيಓɽAhn (2020) ΑΓɽ Approach 1: ԼքͷΈɽApproach 2: ্քͷΈɽApproach 1+2: Ճ଎๏ɽ 40 / 167

Slide 48

Slide 48 text

Nesterov ͷՃ଎๏ (ڧತ) Ϧελʔτ๏ ͋Δఔ౓ Nesterov ͷՃ଎๏Ͱߋ৽Λ܁Γฦͨ͠ΒɼॳظԽ͠ͳ͓ͯ͠Ϧελʔ τ͢Δɽ ௚઀Ճ଎͢Δόʔδϣϯ΋͋Δ͕ɼ৚͕݅ऑͯ͘ࡁΉ (Ұ఺ڧತੑ)ɼ Ϧελʔτ൛ͷํ͕ݟ௨͕͠ Α͍ɼ࣮૷΋ָɽ Ϧελʔτͷن४ t ≥ √ 8γ µ ճߋ৽ͨ͠ΒϦελʔτ (Excess risk ͕ 1/2 ʹͳΔͨΊ) ໨తؔ਺͕Ұ౓্ঢͨ͠ΒϦελʔτ (yt+1 − xt−1 )⊤(xt − xt−1 ) ≥ 0 ͱͳͬͨΒϦελʔτ ୈೋɼୈࡾͷํ๏͸ώϡʔϦεςΟΫεɽexp ( −t √ µ γ ) ͷऩଋϨʔτɽ Ϧελʔτ 41 / 167
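A sketch of the objective-value restart criterion mentioned here, wrapped around FISTA-style updates: whenever the objective increases, the step is rejected and the momentum is reset (s = 1). This is only one of several possible restart variants, chosen for simplicity; the data are synthetic.

```python
# Sketch of the objective-value restart heuristic around FISTA-style updates:
# whenever the objective goes up, reject the step and reset the momentum.
# This is only one of several possible restart variants; data are synthetic.
import numpy as np

def soft_threshold(q, c):
    return np.sign(q) * np.maximum(np.abs(q) - c, 0.0)

rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)
lam = 0.1
eta = n / np.linalg.norm(X, 2) ** 2

def objective(b):
    return 0.5 * np.mean((y - X @ b) ** 2) + lam * np.sum(np.abs(b))

x = np.zeros(p); yk = x.copy(); s = 1.0
prev_obj = objective(x)
for t in range(300):
    grad = X.T @ (X @ yk - y) / n
    x_new = soft_threshold(yk - eta * grad, lam * eta)
    obj = objective(x_new)
    if obj > prev_obj:                    # objective increased: restart momentum
        s, yk = 1.0, x.copy()
    else:
        s_new = (1.0 + np.sqrt(1.0 + 4.0 * s ** 2)) / 2.0
        yk = x_new + (s - 1.0) / s_new * (x_new - x)
        x, s, prev_obj = x_new, s_new, obj
print("final objective:", objective(x))
```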

Slide 49

Slide 49 text

(Figure: P(w) - P(w*) versus number of iterations for Prox-Grad, Nesterov (normal), and Nesterov (restart).) Nesterov's acceleration vs. normal gradient descent. Lasso: n = 8,000, p = 500. 42 / 167

Slide 50

Slide 50 text

Outline 1 ౷ܭతֶशͷجຊతఆࣜԽ 2 Ұ༷ό΢ϯυ جຊతͳෆ౳ࣜ Rademacher ෳࡶ͞ͱ Dudley ੵ෼ ہॴ Rademacher ෳࡶ͞ 3 ࠷దੑ ڐ༰ੑ minimax ࠷దੑ 4 ϕΠζͷֶशཧ࿦ 5 ػցֶशͷ࠷దԽ͓Αͼۙ઀ޯ഑๏ 6 ֬཰త࠷దԽ֓ཁ 7 ΦϯϥΠϯܕ֬཰త࠷దԽ ֬཰తޯ഑߱Լ๏ SGD ʹର͢Δ Nesterov ͷՃ଎๏ 8 όονܕ֬཰త࠷దԽ ֬཰త෼ࢄॖখޯ഑๏ 9 Appendix: Convex analysis Duality 43 / 167

Slide 51

Slide 51 text

Outline 1 ౷ܭతֶशͷجຊతఆࣜԽ 2 Ұ༷ό΢ϯυ جຊతͳෆ౳ࣜ Rademacher ෳࡶ͞ͱ Dudley ੵ෼ ہॴ Rademacher ෳࡶ͞ 3 ࠷దੑ ڐ༰ੑ minimax ࠷దੑ 4 ϕΠζͷֶशཧ࿦ 5 ػցֶशͷ࠷దԽ͓Αͼۙ઀ޯ഑๏ 6 ֬཰త࠷దԽ֓ཁ 7 ΦϯϥΠϯܕ֬཰త࠷దԽ ֬཰తޯ഑߱Լ๏ SGD ʹର͢Δ Nesterov ͷՃ଎๏ 8 όονܕ֬཰త࠷దԽ ֬཰త෼ࢄॖখޯ഑๏ 9 Appendix: Convex analysis Duality 44 / 167

Slide 52

Slide 52 text

ػցֶशʹ͓͚Δ֬཰త࠷దԽͷྺ࢙ 1951 Robbins and Monro Stochastic approximation ྵ఺໰୊ 1957 Rosenblatt ύʔηϓτϩϯ 1978 Nemirovskii and Yudin ׈Β͔Ͱͳ͍ؔ਺ʹ͓͚Δ 1983 ϩόετͳํࡦ͓Αͼ࠷దੑ 1988 Ruppert ׈Β͔ͳؔ਺ʹ͓͚Δϩόετͳ 1992 Polyak and Juditsky εςοϓαΠζ΍ฏۉԽͷํࡦ 1998 Bottou ΦϯϥΠϯܕ֬཰త࠷దԽʹΑΔ 2004 Bottou and LeCun େن໛ػցֶश 2009- 2012 Singer and Duchi; Duchi et al.; Xiao FOBOS, AdaGrad, RDA 2012- Le Roux et al. όονܕख๏, ઢܗऩଋ 2013 Shalev-Shwartz and Zhang (SAG,SDCA,SVRG) Johnson and Zhang 2016 Allen-Zhu Katyusya όονܕख๏ͷՃ଎ 2017- ֤छඇತ࠷దԽख๏ͷൃల 45 / 167

Slide 53

Slide 53 text

֬཰త࠷దԽ๏ͱ͸ ໨తؔ਺: F(x) = EZ [ℓ(Z, x)] F ࣗମͰ͸ͳ͘ɼℓ(Z, x) ΛαϯϓϦϯά͢Δ͜ͱ͔͠Ͱ͖ͳ͍ঢ়گͰ F Λ࠷খ Խ͢Δ໰୊ʢ֬཰తܭը໰୊ʣΛղ͘ख๏ɼ·ͨ͸ҙਤతʹϥϯμϜωεΛ༻͍ ͯ F Λ࠷దԽ͢Δख๏ɽػցֶशͰ͸ F ͕ཅʹܭࢉͰ͖Δঢ়گͰ΋Θ͟ͱϥϯ μϜωεΛར༻ͯ͠ղ͘͜ͱ΋ଟ͍ɽ ΦϯϥΠϯܕ σʔλ͸͔࣍Β࣍΁ͱདྷΔ. جຊతʹ֤܇࿅σʔλ͸Ұճ͔͠࢖Θͳ͍. min x EZ [ℓ(Z, x)] όονܕ σʔλ਺ݻఆ. ܇࿅σʔλ͸Կ౓΋࢖ͬͯྑ͍͕ɼn ͕େ͖͍ঢ়گΛ૝ఆɽ ∑ n i=1 · ͸ͳΔ΂͘ ܭࢉͨ͘͠ͳ͍ɽ min x 1 n n ∑ i=1 ℓ(zi , x) 46 / 167

Slide 54

Slide 54 text

ΦϯϥΠϯܕ֬཰త࠷దԽͷ໨తؔ਺ ℓ(z, x) Λ؍ଌ z ʹର͢Δύϥϝʔλ x ͷଛࣦ. (ظ଴ଛࣦ) L(x) = EZ [ℓ(Z, x)] or (ਖ਼ଇԽ෇͖ظ଴ଛࣦ) Lψ (x) = EZ [ℓ(Z, x)] + ψ(x) ؍ଌ஋ Z ͷ෼෍͸ঢ়گʹΑ͍ͬͯΖ͍Ζ ਅͷ෼෍ (ͭ·Γ L(x) = ∫ ℓ(Z, x)dP(Z) ͷ࣌) → L(x) ͸൚Խޡࠩ. ΦϯϥΠϯܕ࠷దԽ͸ͦΕࣗମֶ͕श! ڊେετϨʔδʹهԱ͞Ε͍ͯΔσʔλͷܦݧ෼෍ (ͭ·Γ L(x) = 1 n ∑ n i=1 ℓ(zi , x) ͷ࣌) → L (·ͨ͸ Lψ ) ͸ (ਖ਼ଇԽ͋Γͷ) ܇࿅ޡࠩ. 47 / 167

Slide 55

Slide 55 text

Outline 1 ౷ܭతֶशͷجຊతఆࣜԽ 2 Ұ༷ό΢ϯυ جຊతͳෆ౳ࣜ Rademacher ෳࡶ͞ͱ Dudley ੵ෼ ہॴ Rademacher ෳࡶ͞ 3 ࠷దੑ ڐ༰ੑ minimax ࠷దੑ 4 ϕΠζͷֶशཧ࿦ 5 ػցֶशͷ࠷దԽ͓Αͼۙ઀ޯ഑๏ 6 ֬཰త࠷దԽ֓ཁ 7 ΦϯϥΠϯܕ֬཰త࠷దԽ ֬཰తޯ഑߱Լ๏ SGD ʹର͢Δ Nesterov ͷՃ଎๏ 8 όονܕ֬཰త࠷దԽ ֬཰త෼ࢄॖখޯ഑๏ 9 Appendix: Convex analysis Duality 48 / 167

Slide 56

Slide 56 text

֬཰తޯ഑߱Լ๏ (Stochastic Gradient Descent, SGD) SGD (ਖ਼ଇԽͳ͠) zt ∼ P(Z) Λ؍ଌɽℓt (x) := ℓ(zt, x) ͱ͢Δ. ʢ͚͕ͩ͜͜ී௨ͷޯ഑๏ͱҧ͏఺ʣ ଛࣦؔ਺ͷྼඍ෼Λܭࢉ: gt ∈ ∂x ℓt (xt−1 ). x Λߋ৽: xt = xt−1 − ηt gt. ֤൓෮ͰҰݸͷσʔλ zt Λ؍ଌ͢Ε͹ྑ͍. → ֤൓෮͝ͱʹ O(1) ͷܭࢉྔ (શσʔλ࢖͏ޯ഑๏͸ O(n)). σʔλશମ {zi }n i=1 Λ࢖Θͳ͍Ͱྑ͍ɽ Reminder: prox(q|ψ) := argminx { ψ(x) + 1 2 ∥x − q∥2 } . 49 / 167

Slide 57

Slide 57 text

֬཰తޯ഑߱Լ๏ (Stochastic Gradient Descent, SGD) SGD (ਖ਼ଇԽ͋Γ) zt ∼ P(Z) Λ؍ଌɽℓt (x) := ℓ(zt, x) ͱ͢Δ. ʢ͚͕ͩ͜͜ී௨ͷޯ഑๏ͱҧ͏ʣ ଛࣦؔ਺ͷྼඍ෼Λܭࢉ: gt ∈ ∂x ℓt (xt−1 ). x Λߋ৽: xt = prox(xt−1 − ηt gt|ηtψ). ֤൓෮ͰҰݸͷσʔλ zt Λ؍ଌ͢Ε͹ྑ͍. → ֤൓෮͝ͱʹ O(1) ͷܭࢉྔ (શσʔλ࢖͏ޯ഑๏͸ O(n)). σʔλશମ {zi }n i=1 Λ࢖Θͳ͍Ͱྑ͍ɽ Reminder: prox(q|ψ) := argminx { ψ(x) + 1 2 ∥x − q∥2 } . 49 / 167
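A sketch of the proximal SGD update above with an L1 regularizer, the η_t = η_0/√t step size, and a running Polyak-Ruppert average, on a synthetic sparse-regression stream. One pass is made over the data and all constants are illustrative.

```python
# Sketch of proximal SGD with eta_t = eta0 / sqrt(t), an L1 regularizer, and a
# running Polyak-Ruppert average, on a synthetic sparse-regression stream
# (one pass over the data; all constants are illustrative).
import numpy as np

def soft_threshold(q, c):
    return np.sign(q) * np.maximum(np.abs(q) - c, 0.0)

rng = np.random.default_rng(0)
n, p = 5000, 20
Z = rng.normal(size=(n, p))
x_true = np.zeros(p); x_true[:3] = 1.0
y = Z @ x_true + 0.1 * rng.normal(size=n)
lam, eta0 = 0.01, 0.5

x = np.zeros(p)
x_bar = np.zeros(p)
for t in range(1, n + 1):                        # each sample is used exactly once
    i = t - 1
    g = (Z[i] @ x - y[i]) * Z[i]                 # stochastic gradient of 0.5 * (y_i - z_i^T x)^2
    eta = eta0 / np.sqrt(t)
    x = soft_threshold(x - eta * g, lam * eta)   # proximal step for eta * lam * ||.||_1
    x_bar += (x - x_bar) / t                     # running average of the iterates
print("largest coordinates of the averaged iterate:", np.argsort(-np.abs(x_bar))[:3])
```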

Slide 58

Slide 58 text

ॏཁͳ఺ ֬཰తޯ഑ͷظ଴஋͸ຊ౰ͷޯ഑ gt = ∇ℓt (xt−1 ) ΑΓɼ Ezt [gt ] = Ezt [∇ℓ(Z, xt−1 )] = ∇Ezt [ℓ(Z, xt−1 )] = ∇L(xt−1 ) ⇒ ֬཰తޯ഑͸ຊ౰ͷޯ഑ͷෆภਪఆྔ 50 / 167

Slide 59

Slide 59 text

SGD ͷৼΔ෣͍ 0 0.2 0.4 0.6 0.8 1 -0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 SGD Batch 51 / 167

Slide 60

Slide 60 text

SGD ͷऩଋղੳ Ծఆ (A1) E[∥gt∥2] ≤ G2. (A2) E[∥xt − x∗∥2] ≤ D2. (Ծఆ (A2) ͸ {x | ∥x∥ ≤ D/2} ͳΔू߹ʹఆٛҬΛݶΔ͜ͱͰอূ) Theorem ¯ xT = 1 T+1 ∑ T t=0 xt (Polyak-Ruppert ฏۉԽ) ͱ͢Δ. εςοϓαΠζΛ ηt = η0 √ t ͱ͢Ε͹ Ez1:T [Lψ (¯ xT ) − Lψ (x∗)] ≤ η0 G2 + D2/η0 √ T . η0 = D G ͱ͢Ε͹, ظ଴ޡࠩͷ্ք͸ 2GD √ T . ͜Ε͸ϛχϚοΫε࠷ద (ఆ਺ഒΛআ͍ͯ). G ͸ ψ ͱؔ܎ͳ͍. ˠۙ઀ࣸ૾ͷ͓͔͛. ˞ L1 ਖ਼ଇԽͰ͸ ∥∂ψ(x)∥ ≤ C √ p Ͱ͋Δ. 52 / 167

Slide 61

Slide 61 text

SGD ͷऩଋղੳ (ฏ׈ͳ໨తؔ਺) Ծఆ (A1’) L ͸ γ-ฏ׈ɼE[∥gt − E[gt ]∥2] = σ2. (A2) E[∥xt − x∗∥2] ≤ D2. (Ծఆ (A2) ͸ {x | ∥x∥ ≤ D/2} ͳΔू߹ʹఆٛҬΛݶΔ͜ͱͰอূ) Theorem ¯ xT = 1 T+1 ∑ T t=0 xt (Polyak-Ruppert ฏۉԽ) ͱ͢Δ. ∀µ′ > 0, Ez1:T [Lψ (¯ xT ) − Lψ (x∗)] ≤ 1 2T T ∑ t=1 ( 1 ηt+1 − 1 ηt ) D2 + 1 2Tη1 D2 + 1 2 T ∑ t=1 ( γ − 1 2ηt ) ∥βt − βt+1∥2 + σ2 T T ∑ t=1 ηt ηt = D σ 1 √ t ͱ͔ͯͭ͠ 1 2γ > ηt ͳΒ, ظ଴ޡࠩͷ্ք͸ O( σD √ T ). σ2 = 0 (ϊΠζͳ͠) ͳΒɼηt = 1/(2γ) ͱ͢Δ͜ͱͰظ଴ޡࠩ = O( γ T D2) ͕ಘΒΕɼ௨ৗͷޯ഑๏ͷϨʔτ͕෮ݩ͞ΕΔ. 53 / 167

Slide 62

Slide 62 text

ূ໌ ψ = 0 Ͱࣔ͢ɽ L(xt ) ≤ L(xt−1 ) + ∇⊤L(xt−1 )(xt − xt−1 ) + γ 2 ∥xt − xt−1∥2 L(xt−1 ) + ∇⊤L(xt−1 )(x∗ − xt−1 ) ≤ L(x∗) ʹ஫ҙ͢Δɽ͜ͷೋࣜΛ଍͢͜ͱͰɼ L(xt ) − L(x∗) ≤ ∇⊤L(xt−1 )(xt − x∗) + γ 2 ∥xt − xt−1∥2 = ⟨gt + ϵt , xt − x∗⟩ + γ 2 ∥xt − xt−1∥2 ≤ − 1 ηt ⟨xt − xt−1, xt − x∗⟩ + ⟨ϵt , xt − xt−1 + xt−1 − x∗⟩ + γ 2 ∥xt − xt−1∥2 ≤ − 1 ηt ⟨xt − xt−1, xt − x∗⟩ + 1 4ηt ∥xt − xt−1∥2 + ηt ∥ϵt ∥2 + γ 2 ∥xt − xt−1∥2 + ⟨ϵt , xt−1 − x∗⟩ = 1 2ηt ( ∥xt−1 − x∗∥2 − ∥xt−1 − xt ∥2 − ∥xt − x∗∥2 ) + 1 4ηt ∥xt − x∗∥2 + ηt ∥ϵt ∥2 + γ 2 ∥xt − xt−1∥2 = 1 2ηt ( ∥xt−1 − x∗∥2 − ∥xt − x∗∥2 ) + 1 2 ( γ − 1 2ηt ) ∥xt−1 − xt ∥2 + [ηt ∥ϵt ∥2 + ⟨ϵt , xt−1 − x∗⟩] (← [·] ͷظ଴஋ ≤ ηt σ2 + 0). 54 / 167

Slide 63

Slide 63 text

ূ໌ (ଓ) ͋ͱ͸྆ลظ଴஋औͬͯɼt = 1, . . . , T Ͱ଍͠߹ΘͤΕ͹Α͍ɽ ΄΅௨ৗͷޯ഑๏ͷධՁํ๏ͱಉ͕ͩ͡ɼϊΠζ͕৐ͬͨ෼͚ͩ σ2 T ∑ T t=1 ηt ͩ ͚ͣΕΔɽ ͜ͷޙग़ͯ͘Δ෼ࢄॖখޯ഑߱Լ๏ͳͲ΋جຊ͸͜ͷධՁࣜɽ 55 / 167

Slide 64

Slide 64 text

SGD ͷऩଋղੳ (ڧತ) Ծఆ (A1) E[∥gt∥2] ≤ G2. (A3) Lψ ͸ µ-ڧತ. Theorem ¯ xT = 1 T+1 ∑ T t=0 xt ͱ͢Δ. εςοϓαΠζΛ ηt = 1 µt ͱ͢Ε͹ɼ Ez1:T [Lψ (¯ xT ) − Lψ (x∗)] ≤ G2 log(T) Tµ . ඇڧತͳ৔߹ΑΓ΋଎͍ऩଋɽ ͔͠͠, ͜Ε͸ϛχϚοΫε࠷దͰ͸ͳ͍. ্քࣗମ͸λΠτ (Rakhlin et al., 2012). 56 / 167

Slide 65

Slide 65 text

ڧತ໨తؔ਺ʹ͓͚Δଟ߲ࣜฏۉԽ Ծఆ (A1) E[∥gt∥2] ≤ G2. (A3) Lψ ͸ µ-ڧತ. ߋ৽ଇΛ xt = prox ( xt−1 − ηt t t + 1 gt|ηtψ ) , ͱ͠ɼॏΈ෇͖ฏۉΛऔΔ: ¯ xT = 2 (T+1)(T+2) ∑ T t=0 (t + 1)xt. Theorem ηt = 2 µt ʹର͠, Ez1:T [Lψ (¯ xT ) − Lψ (x∗)] ≤ 2G2 Tµ Ͱ͋Δɽ log(T) ͕ফ͑ͨ. ͜Ε͸ϛχϚοΫε࠷ద. 57 / 167

Slide 66

Slide 66 text

ҰൠԽͨ͠εςοϓαΠζͱՙॏํࡦ st (t = 1, 2, . . . , T + 1) Λ ∑ T+1 t=1 st = 1 ͳΔ਺ྻͱ͢Δ. xt = prox ( xt−1 − ηt st st+1 gt|ηtψ ) (t = 1, . . . , T) ¯ xT = T ∑ t=0 st+1 xt. Ծఆ: (A1) E[∥gt ∥2] ≤ G2, (A2) E[∥xt − x∗∥2] ≤ D2, (A3) Lψ ͸ µ-ڧತ. Theorem Ez1:T [Lψ (¯ xT ) − Lψ (x∗)] ≤ T ∑ t=1 st+1ηt+1 2 G2 + T−1 ∑ t=0 max{ st+2 ηt+1 − st+1 ( 1 ηt + µ), 0}D2 2 ͨͩ͠ t = 0 Ͱ͸ 1/η0 = 0 ͱ͢Δ. 58 / 167

Slide 67

Slide 67 text

ಛผͳྫ ॏΈ st ΛεςοϓαΠζ ηt ʹൺྫͤͯ͞ΈΔ (εςοϓαΠζΛॏཁ౓ͱΈͳ ͢): st = ηt ∑ T+1 τ=1 ητ . ͜ͷઃఆͰ͸ɼલड़ͷఆཧΑΓ Ez1:T [Lψ (¯ xT ) − Lψ (x∗)] ≤ ∑ T t=1 η2 t G2 + D2 2 ∑ T t=1 ηt ∞ ∑ t=1 ηt = ∞ ∞ ∑ t=1 η2 t < ∞ ͳΒ͹ऩଋɽԕ͘·Ͱ౸ୡͰ͖ͯɼ͔ͭద౓ʹݮ଎. 59 / 167

Slide 68

Slide 68 text

֬཰త࠷దԽʹΑΔֶश͸ʮ଎͍ʯ 60 / 167

Slide 69

Slide 69 text

ܭࢉྔͱ൚Խޡࠩͷؔ܎ ڧತͳ൚Խޡࠩͷ࠷దͳऩଋϨʔτ͸ O(1/n) (n ͸σʔλ਺). O(1/n) ͳΔ൚ԽޡࠩΛୡ੒͢Δʹ͸ɼ܇࿅ޡࠩ΋ O(1/n) ·Ͱݮগͤ͞ͳ ͍ͱ͍͚ͳ͍. ௨ৗͷޯ഑๏ SGD ൓෮͝ͱͷܭࢉྔ n 1 ޡࠩ ϵ ·Ͱͷ൓෮਺ log(1/ϵ) 1/ϵ ޡࠩ ϵ ·Ͱͷܭࢉྔ n log(1/ϵ) 1/ϵ ޡࠩ 1/n ·Ͱͷܭࢉྔ n log(n) n (Bottou, 2010) SGD ͸ O(log(n)) ͚ͩ௨ৗͷޯ഑๏ΑΓ΋ʮ଎͍ʯ. ʮn ݸσʔλݟΔ·Ͱݮগͤͣʯ vs ʮn ݸσʔλݟΕ͹ 1/n ·Ͱݮগʯ 61 / 167

Slide 70

Slide 70 text

Typical behavior. (Figure: training error and generalization error versus elapsed time (s) for SGD vs. batch gradient descent.) Normal gradient descent vs. SGD, L1-regularized logistic regression: n = 15,000, p = 1,000. At first SGD decreases both the training error and the generalization error quickly; on the training error it is eventually caught up with and overtaken by the gradient method that uses all the data, while on the generalization error SGD is often superior. 62 / 167

Slide 71

Slide 71 text

SGD ʹର͢Δ Nesterov ͷՃ଎๏ Ծఆ: ظ଴ޡࠩ L(x) ͸ γ-ฏ׈. ޯ഑ͷ෼ࢄ͸ σ2: EZ [∥∇β ℓ(Z, β) − ∇L(β)∥2] ≤ σ2. → Nesterov ͷՃ଎ʹΑͬͯɼΑΓ଎͍ऩଋΛ࣮ݱ. SGD ͷՃ଎: Hu et al. (2009) SDA ͷՃ଎: Xiao (2010), Chen et al. (2012) ඇತؔ਺΋ؚΊͨΑΓҰൠతͳ࿮૊Έ: Lan (2012), Ghadimi and Lan (2012, 2013) (ඇڧತ) Ez1:T [Lψ (x(T))] − Lψ (x∗) ≤ C ( σD √ T + D2γ T2 ) (ڧತ) Ez1:T [Lψ (x(T))] − Lψ (x∗) ≤ C ( σ2 µT + exp ( −C √ µ γ T )) 63 / 167

Slide 72

Slide 72 text

Ճ଎ SGD ͷϛχόον๏ʹΑΔՃ଎ Ez1:T [Lψ (x(T))] − Lψ (x∗) ≤ C ( σD √ T + D2γ T2 ) σ2 ͸ޯ഑ͷਪఆྔͷ෼ࢄ: EZ [∥∇β ℓ(Z, β) − ∇L(β)∥2] ≤ σ2. ෼ࢄ͸ϛχόον๏Λ༻͍Δ͜ͱͰݮগͤ͞Δ͜ͱ͕Մೳ: g = ∇ℓ(z, x(t−1)) ⇒ g = 1 K K ∑ k=1 ∇ℓ(zk , x(t−1)) (෼ࢄ) σ2 σ2/K K ݸͷޯ഑Λܭࢉ͢Δͷ͸ฒྻԽͰ͖Δ. σ → 0 Ͱ͖Ε͹ɼ্هͷ্ք͸ O(1/T2) ͱͳΔɽ͜Ε͸ඇ֬཰తͳ Nesterov ͷՃ଎๏ͱಉ͡ऩଋ଎౓ɽ 64 / 167
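Here is a quick numeric check (synthetic least-squares losses, gradients evaluated at a fixed point) that averaging K independent stochastic gradients reduces the variance by roughly the factor K, as used in the bound above.

```python
# Numeric check that averaging K stochastic gradients divides the variance by
# roughly K (synthetic least-squares losses, gradients evaluated at a fixed point).
import numpy as np

rng = np.random.default_rng(0)
n, p = 20000, 10
Z = rng.normal(size=(n, p))
y = Z @ rng.normal(size=p) + rng.normal(size=n)
x0 = np.zeros(p)

full_grad = Z.T @ (Z @ x0 - y) / n           # gradient of the full training loss at x0

def minibatch_grad(K):
    idx = rng.integers(0, n, size=K)
    return Z[idx].T @ (Z[idx] @ x0 - y[idx]) / K

for K in [1, 10, 100]:
    G = np.array([minibatch_grad(K) for _ in range(2000)])
    var = np.mean(np.sum((G - full_grad) ** 2, axis=1))
    print(f"K = {K:4d}   E||g - grad L||^2 ~ {var:9.3f}")
```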

Slide 73

Slide 73 text

(a) Objective for L1 (b) Objective for Elastic-net ਓ޻σʔλʹ͓͚Δ਺஋࣮ݧ (a) L1 ਖ਼ଇԽ (Lasso)ɼ(b) ΤϥεςΟ οΫωοτ ਖ਼ଇԽ (figure is from Chen et al. (2012)). SAGE: Accelerated SGD (Hu et al., 2009), AC-RDA: Accelerated stochastic RDA (Xiao, 2010), AC-SA: Accelerated stochastic approximation Ghadimi and Lan (2012), ORDA: Optimal stochastic RDA (Chen et al., 2012) 65 / 167

Slide 74

Slide 74 text

ΦϯϥΠϯܕ֬཰త࠷దԽͷ·ͱΊ ֤ߋ৽ͰͨͬͨҰͭͷσʔλΛݟΔ͚ͩͰྑ͍ɽ ͋Δఔ౓ͷ൚ԽޡࠩΛಘΔ·Ͱ͕ૣ͍ɽ Nesterov ͷՃ଎๏ͱͷ૊߹ͤ΋Մೳ: Hu et al. (2009), Xiao (2010), Chen et al. (2012)ɽ AdaGrad (Duchi et al., 2011) ͳͲʹΑΔܭྔͷࣗಈௐઅ๏΋͋Δɽ ظ଴ޡࠩͷऩଋϨʔτ: GR √ T (ඇฏ׈, ඇڧತ) Polyak-Ruppert ฏۉԽ G2 µT (ඇฏ׈, ڧತ) ଟ߲ࣜฏۉԽ σR √ T + R2L T2 (ฏ׈, ඇڧತ) Ճ଎ σ2 µT + exp ( − √ µ L T ) (ฏ׈, ڧತ) Ճ଎ G: ޯ഑ͷϊϧϜͷ্ք, R: ఆٛҬͷ൒ܘ, L: ฏ׈ੑ, µ: ڧತੑ, σ: ޯ഑ͷ෼ࢄ 66 / 167

Slide 75

Slide 75 text

֬཰తҰ࣍࠷దԽ๏ͷϛχϚοΫε࠷దϨʔτ min x∈B L(x) = min x∈B EZ [ℓ(Z, x)] Condition ˆ gx ∈ ∂x ℓ(Z, x) ͷظ଴஋͸༗ք: ∥E[ˆ gx ]∥ ≤ G (∀x ∈ B). υϝΠϯ B ͸൒ܘ R ͷٿΛؚΉ. L(x) ͸ µ-ڧತ (µ = 0 ΋ڐ͢). Theorem (ϛχϚοΫε࠷దϨʔτ (Agarwal et al., 2012, Nemirovsky and Yudin, 1983)) ೚ҙͷ֬཰తҰ࣍࠷దԽ๏ʹର͠ɼ্هͷ৚݅Λຬͨ͋͢Δଛࣦؔ਺ ℓ ͱ෼෍ P(Z) ͕ଘࡏͯ͠ɼ E[L(x(T)) − L(x∗)] ≥ c min { GR √ T , G2 µT , GR √ p } . SGD ͸͜ͷ࠷దϨʔτΛୡ੒͢Δɽ Ұ࣍࠷దԽ๏: ଛࣦؔ਺ͷ஋͓ΑͼΫΤϦ఺ x ʹ͓͚Δޯ഑ (ℓ(Z, x), ˆ gx ) ʹͷΈ ґଘ͢ΔΞϧΰϦζϜ. (SGD ͸ؚ·Ε͍ͯΔ) 67 / 167

Slide 76

Slide 76 text

Outline 1 ౷ܭతֶशͷجຊతఆࣜԽ 2 Ұ༷ό΢ϯυ جຊతͳෆ౳ࣜ Rademacher ෳࡶ͞ͱ Dudley ੵ෼ ہॴ Rademacher ෳࡶ͞ 3 ࠷దੑ ڐ༰ੑ minimax ࠷దੑ 4 ϕΠζͷֶशཧ࿦ 5 ػցֶशͷ࠷దԽ͓Αͼۙ઀ޯ഑๏ 6 ֬཰త࠷దԽ֓ཁ 7 ΦϯϥΠϯܕ֬཰త࠷దԽ ֬཰తޯ഑߱Լ๏ SGD ʹର͢Δ Nesterov ͷՃ଎๏ 8 όονܕ֬཰త࠷దԽ ֬཰త෼ࢄॖখޯ഑๏ 9 Appendix: Convex analysis Duality 68 / 167

Slide 77

Slide 77 text

ΦϯϥΠϯͱόονͷҧ͍ όονܕͰ͸σʔλ͸ݻఆ. ͨͩ୯ʹ܇࿅ޡࠩΛ࠷খԽ͢Δ: P(x) = 1 n n ∑ i=1 ℓi (x) + ψ(x). όονܕख๏ ̍ճͷߋ৽ʹ̍σʔλͷΈར༻ (ΦϯϥΠϯܕͱಉ͡) ઢܗऩଋ͢Δ (ΦϯϥΠϯܕͱҧ͏) : T > (n + γ/λ)log(1/ϵ) ճͷߋ৽Ͱ ϵ ޡࠩΛୡ੒ɽͨͩ͠ γ-ฏ׈ͳଛࣦͱ λ-ڧತͳ ਖ਼ଇԽΛԾఆ. ˞௨ৗͷޯ഑๏͸ nγ/λ log(1/ϵ) ͳΔܭࢉྔɽ ֬཰తख๏ͱඇ֬཰తख๏ͷྑ͍ͱ͜ͲΓΛ͍ͨ͠ɽ 69 / 167

Slide 78

Slide 78 text

ߋ৽ճ਺ ܇࿅ޡࠩ MPH εέʔϧ 4UPDIBTUJD %FUFSNJOJTUJD 'JOJUFTVNTUPDIBTUJD 70 / 167

Slide 79

Slide 79 text

୅දతͳࡾͭͷํ๏ ֬཰త෼ࢄॖখޯ഑๏ Stochastic Variance Reduced Gradient descent, SVRG (Johnson and Zhang, 2013, Xiao and Zhang, 2014) ֬཰తฏۉޯ഑๏ Stochastic Average Gradient descent, SAG (Le Roux et al., 2012, Schmidt et al., 2013, Defazio et al., 2014) ֬཰త૒ର࠲ඪ߱Լ๏ Stochastic Dual Coordinate Ascent, SDCA (Shalev-Shwartz and Zhang, 2013) - SAG ͱ SVRG ͸ओ໰୊Λղ͘ํ๏. - SDCA ͸૒ର໰୊Λղ͘ํ๏. 71 / 167

Slide 80

Slide 80 text

໰୊ઃఆ P(x) = 1 n ∑ n i=1 ℓi (x) ฏ׈ + ψ(x) ڧತ Ծఆ: ℓi : ଛࣦؔ਺͸ γ-ฏ׈. ψ: ਖ਼ଇԽؔ਺͸ λ-ڧತ. యܕతʹ͸ λ = O(1/n) ·ͨ͸ O(1/ √ n). ྫ: ଛࣦؔ਺ ฏ׈Խώϯδଛࣦ ϩδεςΟ οΫଛࣦ ਖ਼ଇԽؔ਺ L2 ਖ਼ଇԽ ΤϥεςΟ οΫωοτਖ਼ଇԽ ˜ ψ(x) + λ∥x∥2 72 / 167

Slide 81

Slide 81 text

Outline 1 ౷ܭతֶशͷجຊతఆࣜԽ 2 Ұ༷ό΢ϯυ جຊతͳෆ౳ࣜ Rademacher ෳࡶ͞ͱ Dudley ੵ෼ ہॴ Rademacher ෳࡶ͞ 3 ࠷దੑ ڐ༰ੑ minimax ࠷దੑ 4 ϕΠζͷֶशཧ࿦ 5 ػցֶशͷ࠷దԽ͓Αͼۙ઀ޯ഑๏ 6 ֬཰త࠷దԽ֓ཁ 7 ΦϯϥΠϯܕ֬཰త࠷దԽ ֬཰తޯ഑߱Լ๏ SGD ʹର͢Δ Nesterov ͷՃ଎๏ 8 όονܕ֬཰త࠷దԽ ֬཰త෼ࢄॖখޯ഑๏ 9 Appendix: Convex analysis Duality 73 / 167

Slide 82

Slide 82 text

ओ໰୊Λѻ͏ํ๏ ΞΠσΟΞ: ޯ഑ͷ෼ࢄΛখ͘͢͞Δɽ 1 n n ∑ i=1 ℓi (x) Ͳ͏΍ͬͯۙࣅ͢Δ͔? +ψ(x) 74 / 167

Slide 83

Slide 83 text

ओ໰୊Λѻ͏ํ๏ ΞΠσΟΞ: ޯ഑ͷ෼ࢄΛখ͘͢͞Δɽ 1 n n ∑ i=1 ℓi (x) Ͳ͏΍ͬͯۙࣅ͢Δ͔? ⟨g,x⟩ +ψ(x) ΦϯϥΠϯܕख๏: ˆ i ∈ {1, . . . , n} ΛϥϯμϜʹબ୒͠ɼઢܗۙࣅ. g = ∇ℓˆ i (x) ⇒ E[g] = 1 n ∑ n i=1 ∇ℓi (x) ͭ·Γޯ഑ͷෆภਪఆྔ. ෼ࢄ͸? → ෼ࢄ͕໰୊! → όονͷઃఆͰ͸෼ࢄΛॖখͤ͞Δ͜ͱ͕༰қ. 74 / 167

Slide 84

Slide 84 text

֬཰త෼ࢄॖখޯ഑๏ SVRG (Johnson and Zhang, 2013, Xiao and Zhang, 2014) minx {L(x) + ψ(x)} = minx {1 n ∑ n i=1 ℓi (x) + ψ(x)} ͋Δ x ʹ͍ۙࢀর఺ ˆ x ʹର͠ɼ෼ࢄΛॖখͨ͠ޯ഑Λ࣍ͷΑ͏ʹٻΊΔ: g = ∇ℓi (x)−∇ℓi (ˆ x) + 1 n ∑ n j=1 ∇ℓj (ˆ x) ∇L(ˆ x) . ภΓ: ෆภ E[g] = 1 n n ∑ i=1 [∇ℓi (x) − ∇ℓi (ˆ x) + ∇L(ˆ x)] = 1 n n ∑ i=1 ∇ℓi (x) = ∇L(x). ෼ࢄ͸? 75 / 167

Slide 85

Slide 85 text

Variance reduction. g = ∇ℓi(x) − ∇ℓi(ˆx) + ∇L(ˆx). Variance: Var[g] = (1/n) ∑_{i=1}^n ∥∇ℓi(x) − ∇ℓi(ˆx) + ∇L(ˆx) − ∇L(x)∥^2 = (1/n) ∑_{i=1}^n ∥∇ℓi(x) − ∇ℓi(ˆx)∥^2 − ∥∇L(ˆx) − ∇L(x)∥^2 (∵ Var[X] = E[∥X∥^2] − ∥E[X]∥^2) ≤ (1/n) ∑_{i=1}^n ∥∇ℓi(x) − ∇ℓi(ˆx)∥^2 ≤ γ^2 ∥x − ˆx∥^2. If each ℓi is smooth and x is close to ˆx, the variance is also small. 76 / 167

Slide 86

Slide 86 text


Slide 87

Slide 87 text


Slide 88

Slide 88 text

SVRG ͷखॱ SVRG t = 1, 2, . . . ʹ͓͍ͯ, ҎԼͷखॱΛ܁Γฦ͢: 1 ˆ x = ˆ x(t−1), x[0] = ˆ x ͱͯ͠, ˆ g = ∇L(ˆ x) = 1 n ∑ n i=1 ∇ℓi (ˆ x). (શޯ഑) 2 k = 1, . . . , m ʹ͓͍ͯҎԼΛ࣮ߦ: 1 i ∈ {1, . . . , n} ΛҰ༷ʹαϯϓϧɽ 2 ෼ࢄॖখ͞Εͨޯ഑Λܭࢉ: g = ∇ℓi (x[k−1] ) − ∇ℓi (ˆ x) + ˆ g. (෼ࢄॖখ) 3 x[k] Λ࣍ͷΑ͏ʹߋ৽: x[k] = prox(x[k−1] − ηg|ηψ). 3 ˆ x(t) = 1 m ∑ m k=1 x[k] ͱ͢Δ. t ൓෮·Ͱͷޯ഑ͷܭࢉճ਺: t × (n + 2m). 78 / 167
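A sketch of the SVRG procedure on this slide, using an L2 regularizer ψ(x) = (λ/2)∥x∥^2 so that the proximal step has the closed form q/(1 + ηλ). The data, step size, and loop lengths are illustrative choices and are not tuned to satisfy the constants of the theorem on the next slide.

```python
# SVRG sketch following the procedure above, with an L2 regularizer
# psi(x) = (lam/2) * ||x||^2 so that the proximal step is q / (1 + eta * lam).
# Data, step size, and loop lengths are illustrative choices only.
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 20
Z = rng.normal(size=(n, p))
y = Z @ rng.normal(size=p) + 0.1 * rng.normal(size=n)
lam = 1.0 / n
gamma = np.max(np.sum(Z ** 2, axis=1))       # smoothness bound for each loss 0.5 * (y_i - z_i^T x)^2
eta, m = 0.1 / gamma, 2 * n

def full_grad(x):
    return Z.T @ (Z @ x - y) / n

x_ref = np.zeros(p)
for s in range(10):                          # outer loop
    g_ref = full_grad(x_ref)                 # full gradient at the reference point
    x = x_ref.copy()
    x_sum = np.zeros(p)
    for k in range(m):                       # inner loop
        i = rng.integers(n)
        gi = (Z[i] @ x - y[i]) * Z[i]
        gi_ref = (Z[i] @ x_ref - y[i]) * Z[i]
        g = gi - gi_ref + g_ref              # variance-reduced gradient
        x = (x - eta * g) / (1.0 + eta * lam)    # proximal step for eta * psi
        x_sum += x
    x_ref = x_sum / m                        # new reference point = average of inner iterates
    obj = 0.5 * np.mean((y - Z @ x_ref) ** 2) + 0.5 * lam * np.sum(x_ref ** 2)
    print(f"epoch {s:2d}  objective = {obj:.6f}")
```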

Slide 89

Slide 89 text

ऩଋղੳ Ծఆ: ℓi ͸ γ-ฏ׈Ͱɼψ ͸ λ-ڧತ. Theorem εςοϓαΠζ η ͱ಺෦൓෮ճ਺ m ͕ η < 1/(4γ) Λຬͨ͠ɼ ρ := η−1 λ(1−4γη)m + 4γ(m+1)η (1−4γη)m < 1, ͳΒ, T ൓෮ޙͷ໨తؔ਺஋͸࣍ͷΑ͏ʹ্͔Β཈͑ΒΕΔ: E[P(ˆ x(T)) − P(x∗)] ≤ ρT (P(ˆ x(0)) − P(x∗)). ఆཧͷԾఆ͸ m ͕࣍Λຬͨ͢͜ͱΛཁ੥: m ≥ Ω ( γ λ ) . ಺෦൓෮ͷܭࢉྔ: O(n + m) ޡࠩ ϵ ·Ͱͷ֎෦൓෮ͷճ਺: T = O(log(1/ϵ)) ⇒ શମతͳܭࢉྔ: O ((n + m) log(1/ϵ)) = O (( n + γ λ ) log(1/ϵ) ) 79 / 167

Slide 90

Slide 90 text

ඇ֬཰తख๏ͱͷൺֱ E[P(x(T)) − P(x∗)] ≤ ϵ ·ͰͷܭࢉྔΛൺֱ. κ = γ/λ ͱ͢Δ (৚݅਺). SVRG: (n + κ) log (1/ϵ) Ω((n + κ) log (1/ϵ)) ճͷ൓෮ × ̍൓෮͝ͱ Ω(1) ඇ֬཰తޯ഑๏: nκ log (1/ϵ) Ω(κ log (1/ϵ)) ճͷ൓෮ × ̍൓෮͝ͱ Ω(n) σʔλ਺ n = 100, 000, ਖ਼ଇԽύϥϝʔλ λ = 1/1000, ฏ׈ੑ γ = 1: n×κ = 108, n+κ = 105. 1000 ഒ 80 / 167

Slide 91

Slide 91 text

SVRG ౳όονܕख๏ͷՃ଎ 81 / 167

Slide 92

Slide 92 text

Catalyst: SVRG, SAG, SAGA ͷՃ଎ (൚༻తͳख๏) Catalyst (Lin et al., 2015) For t = 1, 2, . . . : 1 ڧತੑΛڧΊͨҎԼͷ໰୊Λղ͘: x(t) ≃ argminx { P(x) + α 2 ∥x − y(t−1)∥2 } (up to ϵt precision). 2 ղΛ “Ճ଎” ͢Δ: y(t) = x(t) + βt (x(t) − x(t−1)). − Catalyst ͸ۙࣅతۙ઀఺๏ͷՃ଎๏ʹ͋ͨΔɽ − ϵt = C(1 − √ λ/2(λ + α))t ͱ͢Ε͹ɼ P(x(t)) − P(x∗) ≤ C′ ( 1 − √ λ 2(λ+α) )t . − α = max{c γ n , λ} ͱͯ͠ SVRG, SAG, SAGA Λ಺෦ϧʔϓʹద༻͢Ε͹ɼશମ Ͱ (n + √ γn λ ) log(1/ϵ) ͷܭࢉྔͰࡁΉɽ − ൚༻తͳख๏Ͱ͸͋Δ͕ɼ಺෦ϧʔϓͷߋ৽ճ਺΍ α ͷબ୒ʹහײɽ − APPA (Frostig et al., 2015) ΋ࣅͨΑ͏ͳख๏ɽ 82 / 167

Slide 93

Slide 93 text

KatyushaɿSVRG ͷՃ଎๏ 83 / 167

Slide 94

Slide 94 text

Katyusha Katyusha = SVRG + Inner Acceleration + Katyusha momentum (Allen-Zhu, 2016, 2017) (STOC2017) Katyusha ͸ “negative momentum” Λ಺෦ϧʔϓͰ༻͍Δ: yk = θ{xk−1 + β(xk−1 − xk−2 ) acceleration } + (1 − θ) x0 magnet = xk−1 + θβ(xk−1 − xk−2 ) + (1 − θ)(x0 − xk−1 ) Katyusha momentum 84 / 167

Slide 95

Slide 95 text

Katyusha Katyusha (Allen-Zhu (2016)) For s = 0, . . . , S − 1: 1 µ(s) = ∇L(ˆ x(s)). 2 For j = 0, . . . , m − 1: 1 k = (sm) + j. 2 x[k+1] = τ1 zk + τ2 ˆ x(s) + (1 − τ1 − τ2 )y[k] (Ճ଎εςοϓ) 3 i ∈ {1, . . . , n} ΛҰ༷ʹαϯϓϦϯά͠ɼ෼ࢄॖখޯ഑Λܭࢉ: g = µ(s) + ∇ℓi (x[k+1] ) − ∇ℓi (ˆ x(s)) (෼ࢄॖখޯ഑). 4 z[k+1] = arg minz { 1 2α ∥z − z[k] ∥2 + ⟨g, z⟩ + ψ(z)}. 5 y[k+1] = arg miny {3L 2 ∥y − x[k+1] ∥2 + ⟨g, y⟩ + ψ(y)} 3 ˆ x(s+1) = 1 ∑ m−1 j=0 (1+ασ)j ( ∑ m−1 j=0 (1 + ασ)j ysm+j+1 ) ௨ৗͷڧತؔ਺ʹؔ͢ΔՃ଎͸ x[k+1] = τ1 zk + (1 − τ1 )y[k] ͱ͢Δ͕ɼˆ x(s) ੒෼΋ೖΕΔ͜ͱͰ SVRG Ͱ΋Ճ଎Λୡ੒ɽ ˠҎલͷղʹ͚ۙͮΔͷͰʮnegative momentumʯͱݴ͏ɽ τ1, τ2 , α ͸ద੾ʹઃఆʢ࣍ͷϖʔδʣ 85 / 167

Slide 96

Slide 96 text

Katyusha ͷܭࢉྔ (ڧತ) ℓi ͕ γ-ฏ׈͔ͭɼψ ͕ λ-ڧತͷ࣌: m = 2n, τ1 = min{ √ mλ 3γ , 1 2 }, τ2 = 1 2 , α = 1 3τ2γ ͱ͢Δ͜ͱͰ (n + √ nγ λ ) log(1/ϵ) ճͷߋ৽ճ਺Ͱޡࠩ ϵ ·Ͱ౸ୡɽ SVRG (n + γ λ ) log(1/ϵ) → Katyusha (n + √ nγ λ ) log(1/ϵ) ʹվળ ۙ઀ޯ഑๏ͷՃ଎๏͸ O(n √ γ/λ log(1/ϵ)) ͳͷͰ࠷େͰ √ n ഒ଎͍ɽ n = 106 ͳΒ 1000 ഒ଎͍ɽɹ 86 / 167

Slide 97

Slide 97 text

Katyusha ͷܭࢉྔ (ඇڧತ) ℓi ͕ γ-ฏ׈͔ͭɼψ ͕ڧತͱ͸ݶΒͳ͍࣌: m = 2n, τ1 = 2 s+4 , τ2 = 1 2 , α = 1 3τ1γ (τ1 ͱ α ͸֤ s ͝ͱʹมԽͤ͞Δ) ͱ͢ Ε͹ n √ ϵ √ P(ˆ x(0) − P(x∗) + √ nγ ϵ ∥ˆ x(0) − x∗∥ = Ω ( n √ ϵ ) ճͷߋ৽ճ਺Ͱޡࠩ ϵ ·Ͱୡ੒ɽ ͞ΒʹɼAdaptReg (Allen-Zhu and Hazan (2016)) ͱ͍͏ํ๏ͱ૊Έ߹ΘͤΔ͜ ͱͰ n log(1/ϵ) + √ nγ ϵ ·Ͱվળ͢Δ͜ͱ͕Ͱ͖Δɽ ⇒ ۙ઀ޯ഑๏ͷՃ଎๏ͷܭࢉྔ O(n √ γ/ϵ) ͱൺ΂ͯ √ n ഒ଎͍ɽ 87 / 167

Slide 98

Slide 98 text

Katyusha ͷܭࢉྔ·ͱΊ SVRG ͷՃ଎Λୡ੒. ܭࢉྔͷൺֱ µ-strongly convex Non-strongly convex GD O ( n log ( 1 ϵ )) O(n √ L ϵ ) SVRG O ( (n + κ) log ( 1 ϵ )) O ( n+L ϵ ) Katyusha O (( n + √ nκ ) log ( 1 ϵ )) O ( n log ( 1 ϵ ) + √ nL ϵ ) (κ = L/µ: ৚݅਺) ڧತͷՃ଎Ϩʔτ͸֬཰త࠲ඪ߱Լ๏ͷ APCG (Lin et al., 2014) ΍ओ૒ରܕͷ SPDC (Zhang and Lin, 2015) Ͱ΋ୡ੒Ͱ͖Δ. 88 / 167

Slide 99

Slide 99 text

͞Βʹվળɿϛχόον๏ + Ճ଎ ʢೋॏՃ଎෼ࢄॖখ๏ʣ 89 / 167

Slide 100

Slide 100 text

Doubly accelerated stochastic variance reduced dual averaging method (Murata and Suzuki, NIPS2017) minx P(x) = 1 n ∑ n i=1 fi (x) + ψ(x) = F(x) + ψ(x) ϛχόον๏: ֤ߋ৽Ͱ I ⊂ [n] (|I| = b) ͳΔϛχόονΛϥϯμϜʹબ୒ gt = 1 |I| ∑ i∈I (∇x fi (xt ) − ∇x fi (ˆ x)) + 1 n n ∑ i=1 ∇x fi (ˆ x) ϛχόον SVRG + ೋॏՃ଎๏ 90 / 167

Slide 101

Slide 101 text

Doubly accelerated stochastic variance reduced dual averaging method (Murata and Suzuki, NIPS2017) minx P(x) = 1 n ∑ n i=1 fi (x) + ψ(x) = F(x) + ψ(x) ϛχόον๏: ֤ߋ৽Ͱ I ⊂ [n] (|I| = b) ͳΔϛχόονΛϥϯμϜʹબ୒ gt = 1 |I| ∑ i∈I (∇x fi (xt ) − ∇x fi (ˆ x)) + 1 n n ∑ i=1 ∇x fi (ˆ x) ϛχόον SVRG + ೋॏՃ଎๏ ϛχόοναΠζ΁ͷґଘੑΛվળɽ Katyusha ͱൺ΂ͯνϡʔχϯάύϥϝʔλ͕গͳ͍ɽ ޡࠩ ϵ ·Ͱͷܭࢉྔɽ(b = |I| ͸ϛχόοναΠζ) Katyusha DASVRDA Strong conv. O (( n + √ bnκ ) log(1/ϵ) ) O (( n + √ nκ + b √ κ ) log(1/ϵ) ) Non-Strong O ( n log(1/ϵ) + √ b nL ϵ ) O ( n log(1/ϵ) + √ nL ϵ + b √ L ϵ ) ࠷େͰ √ b ഒ଎͍ɽ 90 / 167

Slide 102

Slide 102 text

AccProxSVRG: Nitanda (2014) 91 / 167

Slide 103

Slide 103 text

Algorithm (Murata and Suzuki, 2017) DASVRDA(˜ x0 , η, m, b, S) Iterate the following for s = 1, 2, . . . , S: ˜ ys = ˜ xs−1 + s−2 s+1 (˜ xs−1 − ˜ xs−2 ) + s−1 s+1 (˜ zs−1 − ˜ xs−2 ) (outer acceleration) (˜ xs, ˜ zs ) = One-Pass-AccSVRDA(˜ ys, ˜ xs−1, η, β, m, b) 92 / 167

Slide 104

Slide 104 text

Algorithm (Murata and Suzuki, 2017) DASVRDA(˜ x0 , η, m, b, S) Iterate the following for s = 1, 2, . . . , S: ˜ ys = ˜ xs−1 + s−2 s+1 (˜ xs−1 − ˜ xs−2 ) + s−1 s+1 (˜ zs−1 − ˜ xs−2 ) (outer acceleration) (˜ xs, ˜ zs ) = One-Pass-AccSVRDA(˜ ys, ˜ xs−1, η, β, m, b) ڧತͷ৔߹ɼ͞ΒʹʮϦελʔτ๏ʯΛద༻͢Δɽ DASVRDAsc(ˇ x0 , η, m, b, S, T) Iterate the following for t = 1, 2, . . . , T: ˇ xt = DASVRDA(ˇ xt−1, η, m, b, S). 92 / 167

Slide 105

Slide 105 text

One-Pass-AccSVRDA(x0 , ˜ x, η, m, b) Iterate the following for k = 1, 2, . . . , m: Sample Ik ⊂ {1, . . . , n} with size b uniformly. yk = xk−1 + k−1 k+1 (xk−1 − xk−2 ) (inner acceleration) vk = 1 b ∑ i∈Ik (∇fi (yk ) − ∇fi (˜ x)) + ∇F(˜ x) (variance reduction) ¯ vk = ( 1 − 2 k+1 ¯ vk−1 + 2 k+1 vk ) (dual averaging) zk = prox(x0 − ηk(k+1) 4 ¯ vk |ηk(k+1) 4 ψ) (prox. gradient descent) xk = ( 1 − 2 k+1 ) xk−1 + 2 k+1 zk (inner acceleration) Output: (xm, zm ) ΞΠσΟΞ: ಺෦Ճ଎ ͱ ֎෦Ճ଎ ͷ૊Έ߹Θͤ (Double acceleration) ֎ଆͷՃ଎ɿAPG (Accelerated Proximal Gradient) ɹ ৽͍͠ϞʔϝϯλϜ߲ s−1 s+1 (˜ zs−1 − ˜ xs−2 ) ΛՃ͑Δɽ ಺ଆͷՃ଎ɿAccSVRDA = AccSDA (Xiao, 2009) + ෼ࢄॖখ AccProxSVRG (Nitanda, 2014) ͷٕ๏΋ԉ༻ˠྑ͍ϛχόονޮ཰ੑɼϛχ όονʹΑΔ෼ࢄॖখΛ༻͍ͯΑΓੵۃతʹՃ଎ɽ 93 / 167

Slide 106

Slide 106 text

ऩଋղੳ Ծఆ: ໨తؔ਺ P ͸ µ-ڧತ. ℓi ͸ γ-ฏ׈. Theorem (ڧತ) η = O ( 1 (1+n/b2)γ ) , S = O(1 + b n √ γ µ + √ γ nµ ), T = O(log(1/ϵ)) ͱ͢Δɽ DASVRDAsc(ˇ x0, η, n/b, b, S, T) (+ೋஈ֊Ճ଎๏) ͸ O (( n + √ nγ µ + b √ γ µ ) log ( 1 ϵ )) ճͷޯ഑ܭࢉͰ E[P(ˇ xT ) − P(x∗)] ≤ ϵ Λୡ੒ɽ Katyusha O((n + √ nb √ κ) log(1/ϵ)) DASVRDA O((n + ( √ n + b) √ κ) log(1/ϵ)) (κ := γ/µ) → min{ √ b, √ n/b} ഒ଎͍. b = max{ √ n, n/ √ κ} ͱ͢Ε͹ɼߋ৽ճ਺͸ = √ κ log(1/ϵ). 94 / 167

Slide 107

Slide 107 text

Ծఆ: ℓi ͸ γ-ฏ׈ɽ Theorem (ඇڧತ) η = O ( 1 (1+n/b2)γ ) , S = O(1 + b n √ γ ϵ + √ γ nϵ ) ͱ͢Ε͹ɼ DASVRDA(x0, η, n/b, b, S) ͸ O ( n log(1/ϵ) + √ nγ ϵ + b √ γ ϵ ) ճͷޯ഑ܭࢉͰ E[P(˜ xS ) − P(x∗)] ≤ ϵ Λୡ੒ɽ (͜ͷϨʔτΛୡ੒͢Δʹ͸ warm-start Λ࠷ॳʹೖΕΔඞཁ͕͋Δɽ) Katyusha O((n log(1/ϵ) + √ nb √ γ/ϵ)) DASVRDA O((n log(1/ϵ) + ( √ n + b) √ γ/ϵ)) → min{ √ b, √ n/b} ഒ଎͍. b = √ n ͱ͢Ε͹ɼߋ৽ճ਺͸ = √ γ/ϵɽ 95 / 167

Slide 108

Slide 108 text

96 / 167

Slide 109

Slide 109 text

ϛχόοναΠζͷ࠷దੑ ద੾ͳϛχόοναΠζΛ༻͍ͨ DASVRDA ͸ߋ৽ճ਺ɼશܭࢉྔͱ΋ʹ΄΅ ࠷ద (Nitanda, Murata, and Suzuki, 2019) (see also Nesterov (2004), Woodworth and Srebro (2016), Arjevani and Shamir (2016)). શܭࢉྔ (ޯ഑ͷܭࢉճ਺) = ϛχόοναΠζ ʷ ߋ৽ճ਺ શܭࢉྔ ߋ৽ճ਺ GD n √ κ log(1/ϵ) √ κ log(1/ϵ) DASVRDA O (( n + √ nκ ) log(1/ϵ) ) O ( √ κ log(1/ϵ)) (with b∗) Optimal Ω(n + √ nκ log(1/ϵ)) Ω( √ κ log(1/ϵ)) ʢڧತͷ৔߹ɼ࠷దϛχόοναΠζ b∗ = √ n + n √ κ log(1/ϵ) ʣ ྫ͑͹ n ≤ κ ͷ࣌ɼϛχόοναΠζΛ b ≫ b∗ = Θ( √ n) ͱͯ͠΋ແବɽ ͭ·Γɼ௨ৗͷޯ഑๏ͷΑ͏ͳେόοναΠζͷख๏͸ԿʹͤΑಘΛ͠ͳ͍ɽ NJOJCBUDITJ[F NJOJCBUDITJ[F NJOJCBUDITJ[F NJOJCBUDITJ[F ඇڧತͷ৔߹΋ಉ༷ɽ 97 / 167

Slide 110

Slide 110 text

όονܕ֬཰త࠷దԽख๏ͷ·ͱΊ ֤छख๏ͷੑ࣭ ख๏ SDCA SVRG SAG/SAGA ओ/૒ର ૒ର ओ ओ ϝϞϦޮ཰ △ Ճ଎ (µ > 0) Katyusha/DASVRDA Catalyst ͦͷଞ੍໿ ℓi (β) = fi (x⊤ i β) ೋॏϧʔϓ ฏ׈ͳਖ਼ଇԽ 98 / 167

Slide 111

Slide 111 text

όονܕ֬཰త࠷దԽख๏ͷ·ͱΊ ֤छख๏ͷੑ࣭ ख๏ SDCA SVRG SAG/SAGA ओ/૒ର ૒ର ओ ओ ϝϞϦޮ཰ △ Ճ଎ (µ > 0) Katyusha/DASVRDA Catalyst ͦͷଞ੍໿ ℓi (β) = fi (x⊤ i β) ೋॏϧʔϓ ฏ׈ͳਖ਼ଇԽ ऩଋϨʔτ (γ, µ ͷ log ߲͸আ͘) ख๏ µ > 0 µ = 0 Ճ଎๏ (µ > 0) ۙ઀ޯ഑๏ nγ µ log(1/ϵ) n √ γ/ϵ (Ճ଎) n √ γ µ log(1/ϵ) SDCA & SVRG (n + γ µ ) log(1/ϵ) O(n log(1 ϵ ) + √ nγ ϵ ) (n + √ nγ µ ) log(1/ϵ) SAG/SAGA (n + γ µ ) log(1/ϵ) γn/ϵ(˞) (n + √ nγ µ ) log(1/ϵ) ˞ µ = 0 ͷ࣌ɼCatalyst ͸ O((n + √ nγ/ϵ) log2(1/ϵ))ɼKatyusha ͸ AdaptReg ͱ߹Θͤ ͯ O(n log(1/ϵ) + √ nγ/ϵ) ͷऩଋϨʔτΛୡ੒. ۙ઀ޯ഑๏ ֬཰త࠷దԽ n √ γ µ log(1/ϵ) ≥ (n + √ nγ µ ) log(1/ϵ) 98 / 167

Slide 112

Slide 112 text

༗༻ͳจݙ Yurii Nesterov: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2014. ˒ Guanghui Lan: First-order and Stochastic Optimization Methods for Machine Learning. Springer Series in the Data Sciences, Springer, 2020. ۚ৿ ܟจɼླ໦ େ࣊ɼ஛಺ Ұ࿠ɼࠤ౻ Ұ੣:ʰػցֶशͷͨΊͷ࿈ଓ࠷ద Խʱ ɽߨஊࣾɼ2016ɽ ླ໦େ࣊:ʰ֬཰త࠷దԽʱ ɽߨஊࣾɼ2015ɽ 99 / 167

Slide 113

Slide 113 text

Outline 1 ౷ܭతֶशͷجຊతఆࣜԽ 2 Ұ༷ό΢ϯυ جຊతͳෆ౳ࣜ Rademacher ෳࡶ͞ͱ Dudley ੵ෼ ہॴ Rademacher ෳࡶ͞ 3 ࠷దੑ ڐ༰ੑ minimax ࠷దੑ 4 ϕΠζͷֶशཧ࿦ 5 ػցֶशͷ࠷దԽ͓Αͼۙ઀ޯ഑๏ 6 ֬཰త࠷దԽ֓ཁ 7 ΦϯϥΠϯܕ֬཰త࠷దԽ ֬཰తޯ഑߱Լ๏ SGD ʹର͢Δ Nesterov ͷՃ଎๏ 8 όονܕ֬཰త࠷దԽ ֬཰త෼ࢄॖখޯ഑๏ 9 Appendix: Convex analysis Duality 100 / 167

Slide 114

Slide 114 text

Convex set Definition (Convex set) A convex set is a set that contains the segment connecting two points in the set: x1, x2 ∈ C =⇒ θx1 + (1 − θ2 )x2 ∈ C (θ ∈ [0, 1]). ತू߹ ඇತू߹ ඇತू߹ 101 / 167

Slide 115

Slide 115 text

Epigraph and domain Let ¯ R := R ∪ {∞}. Definition (Epigraph and domain) The epigraph of a function f : Rp → ¯ R is given by epi(f ) := {(x, µ) ∈ Rp+1 : f (x) ≤ µ}. The domain of a function f : Rp → ¯ R is given by dom(f ) := {x ∈ Rp : f (x) < ∞}. (Figure: the epigraph and the domain of a function.) 102 / 167

Slide 116

Slide 116 text

Convex function Let ¯ R := R ∪ {∞}. Definition (Convex function) A function f : Rp → ¯ R is a convex function if f satisfies θf (x) + (1 − θ)f (y) ≥ f (θx + (1 − θ)y) (∀x, y ∈ Rp, θ ∈ [0, 1]), where ∞ + ∞ = ∞, ∞ ≤ ∞. ತ ඇತ f is convex ⇔ epi(f ) is a convex set. 103 / 167

Slide 117

Slide 117 text

Proper and closed convex function. If the domain of a function f is not empty (dom(f) ̸= ∅), f is called proper. If the epigraph of a convex function f is a closed set, then f is called closed. (We are interested only in proper closed functions in this lecture.) Even if f is closed, its domain is not necessarily closed (even in 1D). “f is closed” does not imply “f is continuous.” A closed convex function is continuous along any line segment contained in its domain. A closed function is lower semicontinuous. 104 / 167

Slide 118

Slide 118 text

Convex loss functions (regression) All well used loss functions are (closed) convex. The followings are convex w.r.t. u with a fixed label y ∈ R. Squared loss: ℓ(y, u) = 1 2 (y − u)2. τ-quantile loss: ℓ(y, u) = (1 − τ) max{u − y, 0} + τ max{y − u, 0}. for some τ ∈ (0, 1). Used for quantile regression. ϵ-sensitive loss: ℓ(y, u) = max{|y − u| − ϵ, 0} for some ϵ > 0. Used for support vector regression. f-y -3 -2 -1 0 1 2 3 0 1 2 3 τRVBOUJMF TFOTJUJWF 4RVBSFE )VCFS 105 / 167

Slide 119

Slide 119 text

Convex surrogate loss (classification) y ∈ {±1} Logistic loss: ℓ(y, u) = log((1 + exp(−yu))/2). Hinge loss: ℓ(y, u) = max{1 − yu, 0}. Exponential loss: ℓ(y, u) = exp(−yu). Smoothed hinge loss: ℓ(y, u) =      0, (yu ≥ 1), 1 2 − yu, (yu < 0), 1 2 (1 − yu)2, (otherwise). yf -3 -2 -1 0 1 2 3 0 1 2 3 4 -PHJTUJD FYQ )JOHF 4NPPUIFEIJOHF 106 / 167

Slide 120

Slide 120 text

Convex regularization functions Ridge regularization: R(x) = ∥x∥2 2 := ∑ p j=1 x2 j . L1 regularization: R(x) = ∥x∥1 := ∑ p j=1 |xj |. Trace norm regularization: R(X) = ∥X∥tr = ∑min{q,r} k=1 σj (X) where σj (X) ≥ 0 is the j-th singular value. -1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1 -1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1 #SJEHF - 3JEHF &MBTUJDOFU 1 n ∑ n i=1 (yi − z⊤ i x)2 + λ∥x∥1 : Lasso 1 n ∑ n i=1 log(1 + exp(−yi z⊤ i x)) + λ∥X∥tr : Low rank matrix recovery 107 / 167

Slide 121

Slide 121 text

Other definitions of sets Convex hull: conv(C) is the smallest convex set that contains a set C ⊆ Rp. Affine set: A set A is an affine set if and only if ∀x, y ∈ A, the line that intersects x and y lies in A: λx + (1 − λ)y ∀λ ∈ R. Affine hull: The smallest affine set that contains a set C ⊆ Rp. Relative interior: ri(C). Let A be the affine hull of a convex set C ⊆ Rp. ri(C) is a set of internal points with respect to the relative topology induced by the affine hull A. $POWFYIVMM "⒏OFIVMM 3FMBUJWFJOUFSJPS 108 / 167

Slide 122

Slide 122 text

Continuity of a closed convex function Theorem For a (possibly non-convex) function f : Rp → ¯ R, the following three conditions are equivalent to each other. 1 f is lower semi-continuous. 2 For any converging sequence {xn}∞ n=1 ⊆ Rp s.t. x∞ = limn xn , lim infn f (xn ) ≥ f (x∞ ). 3 f is closed. Remark: Any convex function f is continuous in ri(dom(f )). The continuity could be broken on the boundary of the domain. 109 / 167

Slide 123

Slide 123 text

Outline 1 ౷ܭతֶशͷجຊతఆࣜԽ 2 Ұ༷ό΢ϯυ جຊతͳෆ౳ࣜ Rademacher ෳࡶ͞ͱ Dudley ੵ෼ ہॴ Rademacher ෳࡶ͞ 3 ࠷దੑ ڐ༰ੑ minimax ࠷దੑ 4 ϕΠζͷֶशཧ࿦ 5 ػցֶशͷ࠷దԽ͓Αͼۙ઀ޯ഑๏ 6 ֬཰త࠷దԽ֓ཁ 7 ΦϯϥΠϯܕ֬཰త࠷దԽ ֬཰తޯ഑߱Լ๏ SGD ʹର͢Δ Nesterov ͷՃ଎๏ 8 όονܕ֬཰త࠷దԽ ֬཰త෼ࢄॖখޯ഑๏ 9 Appendix: Convex analysis Duality 110 / 167

Slide 124

Slide 124 text

Subgradient We want to deal with non-differentiable function such as L1 regularization. To do so, we need to define something like gradient. Definition (Subdifferential, subgradient) For a proper convex function f : Rp → ¯ R, the subdifferential of f at x ∈ dom(f ) is defined by ∂f (x) := {g ∈ Rp | ⟨x′ − x, g⟩ + f (x) ≤ f (x′) (∀x′ ∈ Rp)}. An element of the subdifferential is called subgradient. f(x) x ྼඍ෼ 111 / 167

Slide 125

Slide 125 text

Properties of subgradient Subgradient does not necessarily exist (∂f (x) could be empty). f (x) = x log(x) (x ≥ 0) is proper convex but not subdifferentiable at x = 0. Subgradient always exists on ri(dom(f )). 112 / 167

Slide 126

Slide 126 text

Properties of subgradient Subgradient does not necessarily exist (∂f (x) could be empty). f (x) = x log(x) (x ≥ 0) is proper convex but not subdifferentiable at x = 0. Subgradient always exists on ri(dom(f )). If f is differentiable at x, its gradient is the unique element of subdiff. ∂f (x) = {∇f (x)}. 112 / 167

Slide 127

Slide 127 text

Properties of subgradient Subgradient does not necessarily exist (∂f (x) could be empty). f (x) = x log(x) (x ≥ 0) is proper convex but not subdifferentiable at x = 0. Subgradient always exists on ri(dom(f )). If f is differentiable at x, its gradient is the unique element of subdiff. ∂f (x) = {∇f (x)}. If ri(dom(f )) ∩ ri(dom(h)) ̸= ∅, then ∂(f + h)(x) = ∂f (x) + ∂h(x) = {g + g′ | g ∈ ∂f (x), g′ ∈ ∂h(x)} (∀x ∈ dom(f ) ∩ dom(h)). 112 / 167

Slide 128

Slide 128 text

Properties of subgradient Subgradient does not necessarily exist (∂f (x) could be empty). f (x) = x log(x) (x ≥ 0) is proper convex but not subdifferentiable at x = 0. Subgradient always exists on ri(dom(f )). If f is differentiable at x, its gradient is the unique element of subdiff. ∂f (x) = {∇f (x)}. If ri(dom(f )) ∩ ri(dom(h)) ̸= ∅, then ∂(f + h)(x) = ∂f (x) + ∂h(x) = {g + g′ | g ∈ ∂f (x), g′ ∈ ∂h(x)} (∀x ∈ dom(f ) ∩ dom(h)). For all g ∈ ∂f (x) and all g′ ∈ ∂f (x′) (x, x′ ∈ dom(f )), ⟨g − g′, x − x′⟩ ≥ 0. 112 / 167

Slide 129

Slide 129 text

Legendre transform Defines the other representation on the dual space (the space of gradients). Definition (Legendre transform) Let f be a (possibly non-convex) function f : Rp → ¯ R s.t. dom(f ) ̸= ∅. Its convex conjugate is given by f ∗(y) := sup x∈Rp {⟨x, y⟩ − f (x)}. The map from f to f ∗ is called Legendre transform. f(x) f*(y) MJOFXJUIHSBEJFOUy x x* 0 f(x*) 113 / 167

Slide 130

Slide 130 text

Examples f (x) f ∗(y) Squared loss 1 2 x2 1 2 y2 Hinge loss max{1 − x, 0} { y (−1 ≤ y ≤ 0), ∞ (otherwise). Logistic loss log(1 + exp(−x)) { (−y) log(−y) + (1 + y) log(1 + y) (−1 ≤ y ≤ 0), ∞ (otherwise). L1 regularization ∥x∥1 { 0 (maxj |yj | ≤ 1), ∞ (otherwise). Lp regularization ∑ d j=1 |xj |p ∑ d j=1 p−1 p p p−1 |yj | p p−1 (p > 1) logistic dual of logistic L1 -norm dual 114 / 167

Slide 131

Slide 131 text

Properties of Legendre transform f ∗ is a convex function even if f is not. f ∗∗ is the closure of the convex hull of f : f ∗∗ = cl(conv(f )). Corollary Legendre transform is a bijection from the set of proper closed convex functions onto that defined on the dual space. f (proper closed convex) ⇔ f ∗ (proper closed convex) 0 1 2 3 4 -6 -4 -2 0 2 4 6 G Y DMDPOW 115 / 167

Slide 132

Slide 132 text

Connection to subgradient Lemma y ∈ ∂f (x) ⇔ f (x) + f ∗(y) = ⟨x, y⟩ ⇔ x ∈ ∂f ∗(y). ∵ y ∈ ∂f (x) ⇒ x = argmax x′∈Rp {⟨x′, y⟩ − f (x′)} (take the “derivative” of ⟨x′, y⟩ − f (x′)) ⇒ f ∗(y) = ⟨x, y⟩ − f (x). Remark: By definition, we always have f (x) + f ∗(y) ≥ ⟨x, y⟩. → Young-Fenchel’s inequality. 116 / 167

Slide 133

Slide 133 text

⋆ Fenchel’s duality theorem Theorem (Fenchel’s duality theorem) Let f : Rp → ¯ R, g : Rq → ¯ R be proper closed convex, and A ∈ Rq×p. Suppose that either of condition (a) or (b) is satisfied, then it holds that inf x∈Rp {f (x) + g(Ax)} = sup y∈Rq {−f ∗(A⊤y) − g∗(−y)}. (a) ∃x ∈ Rp s.t. x ∈ ri(dom(f )) and Ax ∈ ri(dom(g)). (b) ∃y ∈ Rq s.t. A⊤y ∈ ri(dom(f ∗)) and −y ∈ ri(dom(g∗)). If (a) is satisfied, there exists y∗ ∈ Rq that attains sup of the RHS. If (b) is satisfied, there exists x∗ ∈ Rp that attains inf of the LHS. Under (a) and (b), x∗, y∗ are the optimal solutions of the each side iff A⊤y∗ ∈ ∂f (x∗), Ax∗ ∈ ∂g∗(−y∗). → Karush-Kuhn-Tucker condition. 117 / 167

Slide 134

Slide 134 text

Equivalence to the separation theorem $POWFY $PODBWF 118 / 167

Slide 135

Slide 135 text

Applying Fenchel’s duality theorem to RERM (Regularized Empirical Risk Minimization): let ℓi(z⊤_i x) = ℓ(yi, z⊤_i x), where (zi, yi) is the input-output pair of the i-th observation, and write f(Zx) := ∑_{i=1}^n ℓi(z⊤_i x). (Primal) inf_{x∈Rp} { ∑_{i=1}^n ℓi(z⊤_i x) + ψ(x) } = inf_{x∈Rp} {f(Zx) + ψ(x)}. [Fenchel’s duality theorem] inf_{x∈Rp} {f(Zx) + ψ(x)} = − inf_{y∈Rn} {f∗(y) + ψ∗(−Z⊤y)}. (Dual) sup_{y∈Rn} { −∑_{i=1}^n ℓ∗_i(yi) − ψ∗(−Z⊤y) }. This fact will be used to derive the dual coordinate descent algorithm. 119 / 167
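As a numeric check of the primal-dual identity above, consider the special case of ridge regression, where both conjugates have closed forms: with ℓi(u) = 0.5 (yi − u)^2 and ψ(x) = (λ/2)∥x∥^2 one has ℓi∗(v) = yi v + 0.5 v^2 and ψ∗(z) = ∥z∥^2/(2λ), and both sides can be evaluated exactly. This concrete instance is an added example, not part of the slides.

```python
# Numeric check of the primal-dual identity above for ridge regression, where
# both conjugates have closed forms (an added example, not from the slides):
# l_i(u) = 0.5 * (y_i - u)^2  =>  l_i*(v) = y_i * v + 0.5 * v^2,
# psi(x) = (lam/2) * ||x||^2  =>  psi*(z) = ||z||^2 / (2 * lam).
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 30, 10, 2.0
Z = rng.normal(size=(n, p))
y = Z @ rng.normal(size=p) + rng.normal(size=n)

# primal optimum of  sum_i 0.5 * (y_i - z_i^T x)^2 + (lam/2) * ||x||^2
x_star = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)
primal = 0.5 * np.sum((y - Z @ x_star) ** 2) + 0.5 * lam * np.sum(x_star ** 2)

# dual optimum of  -sum_i l_i*(v_i) - psi*(-Z^T v)
v_star = -np.linalg.solve(np.eye(n) + Z @ Z.T / lam, y)
dual = -(y @ v_star + 0.5 * np.sum(v_star ** 2)) - np.sum((Z.T @ v_star) ** 2) / (2.0 * lam)

print("primal optimal value:", primal)
print("dual optimal value  :", dual)   # equal up to numerical error (strong duality)
```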

Slide 136

Slide 136 text

A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transcations on Information Theory, 58(5): 3235–3249, 2012. K. Ahn. From proximal point method to nesterov’s acceleration, 2020. Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. arXiv preprint arXiv:1603.05953, 2016. Z. Allen-Zhu. Katyusha: the first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1200–1205. ACM, 2017. Z. Allen-Zhu and E. Hazan. Optimal black-box reductions between optimization objectives. arXiv preprint arXiv:1603.05642, 2016. Y. Arjevani and O. Shamir. Dimension-free iteration complexity of finite sum optimization problems. In Advances in Neural Information Processing Systems, pages 3540–3548, 2016. P. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. The Annals of Statistics, 33:1487–1537, 2005. A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009. L. Bottou. Online algorithms and stochastic approximations. 1998. URL http://leon.bottou.org/papers/bottou-98x. revised, oct 2012. 119 / 167

Slide 137

Slide 137 text

L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010. L. Bottou and Y. LeCun. Large scale online learning. In S. Thrun, L. Saul, and B. Sch¨ olkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004. URL http://leon.bottou.org/papers/bottou-lecun-2004. X. Chen, Q. Lin, and J. Pena. Optimal regularized dual averaging methods for stochastic optimization. In F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 395–403. Curran Associates, Inc., 2012. S. Dasgupta and A. Gupta. An elementary proof of the johnson-lindenstrauss lemma. Technical Report 99–006, U.C. Berkeley, 1999. A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1646–1654. Curran Associates, Inc., 2014. J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011. 119 / 167

Slide 138

Slide 138 text

R. Frostig, R. Ge, S. Kakade, and A. Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In International Conference on Machine Learning, pages 2540–2548, 2015. S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012. S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4):2061–2089, 2013. C. Hu, W. Pan, and J. T. Kwok. Accelerated gradient methods for stochastic optimization and online learning. In Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 781–789. Curran Associates, Inc., 2009. R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 315–323. Curran Associates, Inc., 2013. W. B. Johnson, J. Lindenstrauss, and G. Schechtman. Extensions of lipschitz maps into banach spaces. Israel Journal of Mathematics, 54(2):129–138, 1986. G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365–397, 2012. 119 / 167

Slide 139

Slide 139 text

N. Le Roux, M. Schmidt, and F. R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2663–2671. Curran Associates, Inc., 2012. H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. Technical report, 2015. arXiv:1506.02186. Q. Lin, Z. Lu, and L. Xiao. An accelerated proximal coordinate gradient method. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3059–3067. Curran Associates, Inc., 2014. T. Murata and T. Suzuki. Doubly accelerated stochastic variance reduced dual averaging method for regularized empirical risk minimization. In Advances in Neural Information Processing Systems 30, pages 608–617. 2017. A. Nemirovskii and D. Yudin. On cezari ʟ s convergence of the steepest descent method for approximating saddle points of convex-concave functions. Soviet Mathematics Doklady, 19(2):576–601, 1978. A. Nemirovsky and D. Yudin. Problem complexity and method efficiency in optimization. John Wiley, New York, 1983. Y. Nesterov. Introductory lectures on convex optimization: A basic course. Kluwer Academic Publishers, 2004. 119 / 167

Slide 140

Slide 140 text

A. Nitanda. Stochastic proximal gradient descent with acceleration techniques. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1574–1582. Curran Associates, Inc., 2014. A. Nitanda, T. Murata, and T. Suzuki. Sharp characterization of optimal minibatch size for stochastic finite sum convex optimization. In 2019 IEEE International Conference on Data Mining (ICDM), pages 488–497, 2019. B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992. A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In J. Langford and J. Pineau, editors, Proceedings of the 29th International Conference on Machine Learning, pages 449–456. Omnipress, 2012. ISBN 978-1-4503-1285-1. H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951. F. Rosenblatt. The perceptron: A perceiving and recognizing automaton. Technical Report Technical Report 85-460-1, Project PARA, Cornell Aeronautical Lab., 1957. M. Rudelson and S. Zhou. Reconstruction from anisotropic random measurements. IEEE Transactions of Information Theory, 39, 2013. 119 / 167

Slide 141

Slide 141 text

D. Ruppert. Efficient estimations from a slowly convergent robbins-monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988. M. Schmidt, N. Le Roux, and F. R. Bach. Minimizing finite sums with the stochastic average gradient, 2013. hal-00860051. S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14: 567–599, 2013. Y. Singer and J. C. Duchi. Efficient learning using forward-backward splitting. In Advances in Neural Information Processing Systems, pages 495–503, 2009. I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th international conference on machine learning (ICML-13), pages 1139–1147, 2013. B. E. Woodworth and N. Srebro. Tight complexity bounds for optimizing composite objectives. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3639–3647. Curran Associates, Inc., 2016. URL http://papers.nips.cc/paper/ 6058-tight-complexity-bounds-for-optimizing-composite-objectives. pdf. 119 / 167

Slide 142

Slide 142 text

L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems 23. 2009. L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010. L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24:2057–2075, 2014. Y. Zhang and X. Lin. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 353–361, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/zhanga15.html. 119 / 167