Slide 1

Slide 1 text

Overview of Sparse Estimation: Models, Theory, and Applications
Taiji Suzuki† (†Tokyo Institute of Technology, Department of Mathematical and Computing Sciences)
September 15, 2014, Joint Statistical Meeting @ The University of Tokyo

Slide 2

Slide 2 text

Outline
1 Models for sparse estimation
2 Various sparse regularizations
3 Theory of sparse estimation
   Theory for n ≫ p
   Theory for n ≪ p
4 Hypothesis testing for high-dimensional linear regression
5 Optimization methods for sparse estimation

Slide 3

Slide 3 text

Motivation in high-dimensional data
Genome data, financial data, collaborative filtering, computer vision, speech recognition.
When the dimension is d = 10000, can we estimate from only n = 1000 samples?
Under what conditions is estimation possible?
Exploit some form of low-dimensional structure (sparsity).

Slide 4

Slide 4 text

History: methods and theory of sparse estimation
1992 Donoho and Johnstone: wavelet shrinkage (soft-thresholding)
1996 Tibshirani: proposal of the Lasso
2000 Knight and Fu: asymptotic distribution of the Lasso (n ≫ p)
2006 Candes and Tao, Donoho: compressed sensing (restricted isometry, exact recovery, p ≫ n)
2009 Bickel et al., Zhang: restricted eigenvalue condition (risk bounds for the Lasso, p ≫ n)
2013 van de Geer et al., Lockhart et al.: hypothesis testing for sparse estimation (p ≫ n)
Even before this, L1 regularization had been used in reflection seismology, image denoising, and structural learning with forgetting. See Tanaka (2010) for details.

Slide 5

Slide 5 text

Outline
1 Models for sparse estimation
2 Various sparse regularizations
3 Theory of sparse estimation
   Theory for n ≫ p
   Theory for n ≪ p
4 Hypothesis testing for high-dimensional linear regression
5 Optimization methods for sparse estimation

Slide 6

Slide 6 text

High-dimensional data analysis
Sample size ≪ dimension.
Bioinformatics, text data, image data.

Slide 7

Slide 7 text

High-dimensional data analysis
Sample size ≪ dimension — classical mathematical statistics assumes sample size ≫ dimension, so it does not apply here.
Bioinformatics, text data, image data.

Slide 8

Slide 8 text

Sparse estimation
Sample size ≪ dimension.
Cut away the useless information → sparsity.
The Lasso estimator:
R. Tibshirani (1996). Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. B, Vol. 58, No. 1, pages 267–288.
Citation count: 10185 (as of May 25, 2014).

Slide 9

Slide 9 text

The variable selection problem (linear regression)
Design matrix X = (X_{ij}) ∈ R^{n×p}, with p (dimension) ≫ n (sample size).
True vector β* ∈ R^p: the number of nonzero entries is at most d (sparse).
Model: Y = Xβ* + ξ.
Estimate β* from (Y, X).
The number of variables that actually has to be estimated is only d → variable selection.

Slide 10

Slide 10 text

The variable selection problem (linear regression)
Design matrix X = (X_{ij}) ∈ R^{n×p}, with p (dimension) ≫ n (sample size).
True vector β* ∈ R^p: the number of nonzero entries is at most d (sparse).
Model: Y = Xβ* + ξ.
Estimate β* from (Y, X).
The number of variables that actually has to be estimated is only d → variable selection.
Mallows' Cp, AIC:
β̂_MC = argmin_{β∈R^p} ‖Y − Xβ‖² + 2σ²‖β‖₀, where ‖β‖₀ = |{j | β_j ≠ 0}|.
→ Search over 2^p candidate subsets. NP-hard. (A brute-force sketch follows below.)
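To make the 2^p search concrete, here is a minimal brute-force sketch of the L0-penalized criterion above, assuming Python/NumPy; the function name and the noise-variance argument sigma2 are placeholders. It is feasible only for very small p, which is exactly the point.

    import itertools
    import numpy as np

    def best_subset_l0(X, Y, sigma2):
        """Exhaustive minimization of ||Y - X beta||^2 + 2*sigma2*||beta||_0."""
        n, p = X.shape
        best_val, best_beta = np.inf, np.zeros(p)
        for k in range(p + 1):
            for S in itertools.combinations(range(p), k):  # 2^p subsets in total
                beta = np.zeros(p)
                if k > 0:
                    beta[list(S)] = np.linalg.lstsq(X[:, S], Y, rcond=None)[0]
                val = np.sum((Y - X @ beta) ** 2) + 2 * sigma2 * k
                if val < best_val:
                    best_val, best_beta = val, beta
        return best_beta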

Slide 11

Slide 11 text

The Lasso estimator
Mallows' Cp minimization: β̂_MC = argmin_{β∈R^p} ‖Y − Xβ‖² + 2σ²‖β‖₀.
Problem: ‖β‖₀ is not convex, not even continuous, and has many local optima.
→ Approximate it by a convex function.
Lasso [L1 regularization]:
β̂_Lasso = argmin_{β∈R^p} ‖Y − Xβ‖² + λ‖β‖₁, where ‖β‖₁ = Σ_{j=1}^p |β_j|.
→ Convex optimization!
The L1 norm is the convex envelope of the L0 norm on [−1, 1]^p (the largest convex function bounding it from below).
The L1 norm is the Lovász extension of the cardinality function.

Slide 12

Slide 12 text

Sparsity of the Lasso estimator
Case p = n, X = I:
β̂_Lasso = argmin_{β∈R^p} (1/2)‖Y − β‖² + C‖β‖₁
⇒ β̂_Lasso,i = argmin_{b∈R} (1/2)(y_i − b)² + C|b| = sign(y_i)(|y_i| − C) if |y_i| > C, and 0 if |y_i| ≤ C.
Small signals are shrunk to exactly 0 → sparse!
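The closed-form solution above is the soft-thresholding operation; a minimal NumPy sketch (the vectorized form of the coordinate-wise formula, with an illustrative test vector):

    import numpy as np

    def soft_threshold(y, C):
        """Soft-thresholding: sign(y) * max(|y| - C, 0), applied elementwise."""
        return np.sign(y) * np.maximum(np.abs(y) - C, 0.0)

    # Small signals are shrunk exactly to zero -> sparsity.
    y = np.array([3.0, 0.5, -2.0, 0.1])
    print(soft_threshold(y, C=1.0))  # [ 2.  0. -1.  0.]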

Slide 13

Slide 13 text

Sparsity of the Lasso estimator
β̂ = argmin_{β∈R^p} (1/n)‖Xβ − Y‖²₂ + λ_n Σ_{j=1}^p |β_j|.

Slide 14

Slide 14 text

Benefits of sparsity
β̂ = argmin_{β∈R^p} (1/n)‖Xβ − Y‖²₂ + λ_n Σ_{j=1}^p |β_j|.
Theorem (convergence rate of the Lasso)
Under a certain condition, there exists a constant C such that, with high probability,
‖β̂ − β*‖²₂ ≤ C d log(p)/n.
Even when the dimension is high, it enters only through log(p); the effective dimension d is dominant.
(The "certain condition" is explained in detail later.)

Slide 15

Slide 15 text

Outline
1 Models for sparse estimation
2 Various sparse regularizations
3 Theory of sparse estimation
   Theory for n ≫ p
   Theory for n ≪ p
4 Hypothesis testing for high-dimensional linear regression
5 Optimization methods for sparse estimation

Slide 16

Slide 16 text

Generalizing the Lasso
Lasso: min_{β∈R^p} (1/n) Σ_{i=1}^n (y_i − x_i⊤β)² + ‖β‖₁ (regularization term).

Slide 17

Slide 17 text

Generalizing the Lasso
Lasso: min_{β∈R^p} (1/n) Σ_{i=1}^n (y_i − x_i⊤β)² + ‖β‖₁ (regularization term).
Generalized sparse regularized estimation:
min_{β∈R^p} (1/n) Σ_{i=1}^n ℓ(z_i, β) + ψ(β).
Besides the L1 penalty, what other regularization terms are useful?

Slide 18

Slide 18 text

Why L1 regularization produces sparsity: the penalty is pointed (non-smooth) along the coordinate axes.
By designing where the regularizer is pointed, various kinds of sparsity patterns can be obtained.

Slide 19

Slide 19 text

Group regularization
C Σ_{g∈G} ‖β_g‖ (groups without overlap / with overlap).
All variables inside a group tend to become 0 simultaneously.
Sparsity can be imposed more aggressively.
Example application: genome-wide association analysis.
(A sketch of the corresponding proximal map follows below.)
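For the non-overlapping case, the proximal map of C Σ_g ‖β_g‖ shrinks each group as a block. A minimal NumPy sketch, where `groups` (a list of index arrays) and the function name are assumptions for illustration:

    import numpy as np

    def prox_group_lasso(q, groups, C):
        """prox of C * sum_g ||beta_g||_2: block-wise soft-thresholding."""
        beta = q.copy()
        for g in groups:
            norm_g = np.linalg.norm(q[g])
            # The whole group is set to zero when its norm is below the threshold C.
            beta[g] = 0.0 if norm_g <= C else (1.0 - C / norm_g) * q[g]
        return beta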

Slide 20

Slide 20 text

Application of group regularization: multi-task learning (Lounici et al., 2009)
Estimate T tasks simultaneously: y_i^(t) ≈ x_i^(t)⊤ β^(t) (i = 1, ..., n^(t), t = 1, ..., T).
min_{β^(t)} Σ_{t=1}^T Σ_{i=1}^{n^(t)} (y_i^(t) − x_i^(t)⊤ β^(t))² + C Σ_{k=1}^p ‖(β_k^(1), ..., β_k^(T))‖ (group regularization).
[Figure: the vectors β^(1), β^(2), ..., β^(T); the k-th coordinates across all tasks form one group.]
Select variables that are nonzero jointly across the tasks.

Slide 21

Slide 21 text

Application of group regularization: multi-task learning (Lounici et al., 2009)
Estimate T tasks simultaneously: y_i^(t) ≈ x_i^(t)⊤ β^(t) (i = 1, ..., n^(t), t = 1, ..., T).
min_{β^(t)} Σ_{t=1}^T Σ_{i=1}^{n^(t)} (y_i^(t) − x_i^(t)⊤ β^(t))² + C Σ_{k=1}^p ‖(β_k^(1), ..., β_k^(T))‖ (group regularization).
[Figure: the vectors β^(1), β^(2), ..., β^(T); the k-th coordinates across all tasks form one group.]
Select variables that are nonzero jointly across the tasks.

Slide 22

Slide 22 text

Trace norm regularization
W: an M × N matrix.
‖W‖_Tr = Tr[(WW⊤)^{1/2}] = Σ_{j=1}^{min{M,N}} σ_j(W),
where σ_j(W) is the j-th singular value of W (taken to be nonnegative).
Sum of the singular values = L1 regularization on the singular values → sparse singular values.
Sparse singular values = low rank. (A sketch of the corresponding proximal map follows below.)
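Since the trace norm is the L1 norm of the singular values, its proximal map soft-thresholds the spectrum. A minimal NumPy sketch (the function name is illustrative):

    import numpy as np

    def prox_trace_norm(W, C):
        """prox of C * ||W||_Tr: soft-threshold the singular values of W."""
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        s_shrunk = np.maximum(s - C, 0.0)  # many singular values become 0 -> low rank
        return (U * s_shrunk) @ Vt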

Slide 23

Slide 23 text

Example: recommender systems

          Movie A   Movie B   Movie C   ...   Movie X
User 1       4         8         *      ...      2
User 2       2         *         2      ...      *
User 3       2         4         *      ...      *
...

(e.g., Srebro et al. (2005); the Netflix prize, Bennett and Lanning (2007))

Slide 24

Slide 24 text

Example: recommender systems — assume rank 1

          Movie A   Movie B   Movie C   ...   Movie X
User 1       4         8         4      ...      2
User 2       2         4         2      ...      1
User 3       2         4         2      ...      1
...

(e.g., Srebro et al. (2005); the Netflix prize, Bennett and Lanning (2007))

Slide 25

Slide 25 text

Example: recommender systems
[Figure: the rating matrix W* (users × movies).]
→ Low-rank matrix completion.
Rademacher complexity of low-rank matrices: Srebro et al. (2005).
Compressed sensing: Candès and Tao (2009), Candès and Recht (2009).

Slide 26

Slide 26 text

Example: reduced rank regression
Reduced rank regression (Anderson, 1951; Burket, 1964; Izenman, 1975); multi-task learning (Argyriou et al., 2008).
[Figure: reduced rank regression, Y = XW* + noise, with Y ∈ R^{n×N}, X ∈ R^{n×M}; W* ∈ R^{M×N} is low rank.]

Slide 27

Slide 27 text

Sparse covariance selection
x_k ~ N(0, Σ) (i.i.d., Σ ∈ R^{p×p}), Σ̂ = (1/n) Σ_{k=1}^n x_k x_k⊤.
Ŝ = argmin_{S: symmetric positive semidefinite} { −log(det(S)) + Tr[SΣ̂] + λ Σ_{i,j=1}^p |S_{i,j}| }.
(Meinshausen and Bühlmann, 2006; Yuan and Lin, 2007; Banerjee et al., 2008)
Estimates S, the inverse of Σ.
S_{i,j} = 0 ⇔ X^(i) and X^(j) are conditionally independent.
The Gaussian graphical model can be estimated by convex optimization. (A one-call sketch follows below.)
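A hedged one-call sketch of the program above, assuming scikit-learn is available; its GraphicalLasso solves this L1-penalized log-determinant objective, with alpha playing the role of λ (the data and the value alpha=0.1 are illustrative):

    import numpy as np
    from sklearn.covariance import GraphicalLasso

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 10))        # n samples of a p-dimensional Gaussian
    model = GraphicalLasso(alpha=0.1).fit(X)  # alpha ~ lambda in the slide's objective
    S_hat = model.precision_                  # sparse estimate of the inverse covariance
    print(np.sum(np.abs(S_hat) > 1e-8))       # number of nonzero entries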

Slide 28

Slide 28 text

(Generalized) fused lasso
ψ(β) = C Σ_{(i,j)∈E} |β_i − β_j|.
(Tibshirani et al. (2005), Jacob et al. (2009))
Genomic data analysis with the fused lasso (Tibshirani and Taylor '11); TV denoising (Chambolle '04).

Slide 29

Slide 29 text

Non-convex regularization
SCAD (Smoothly Clipped Absolute Deviation) (Fan and Li, 2001)
MCP (Minimax Concave Penalty) (Zhang, 2010)
Lq regularization (q < 1), bridge regularization (Frank and Friedman, 1993)
Sparser solutions, at the price of a harder optimization problem.

Slide 30

Slide 30 text

Other extensions of L1 regularization
Adaptive Lasso (Zou, 2006): given some consistent estimator β̃, use it in the penalty
ψ(β) = C Σ_{j=1}^p |β_j| / |β̃_j|^γ.
Smaller bias than the Lasso (asymptotically unbiased); oracle property. (A sketch follows below.)
Sparse additive models (Hastie and Tibshirani, 1999; Ravikumar et al., 2009): estimate a nonlinear function of the form f(x) = Σ_{j=1}^p f_j(x_j), with f_j ∈ H_j (H_j: a reproducing kernel Hilbert space), and
ψ(f) = C Σ_{j=1}^p ‖f_j‖_{H_j}.
A generalization of the group Lasso; also called multiple kernel learning.
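The adaptive-lasso objective with weights 1/|β̃_j|^γ can be reduced to an ordinary Lasso by rescaling the columns of X. A minimal sketch assuming scikit-learn; using ridge as the pilot estimator β̃, and the eps safeguard, are assumptions for illustration (also note scikit-learn's Lasso uses a 1/(2n) factor in the loss, unlike the slide's objective):

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    def adaptive_lasso(X, y, lam, gamma=1.0, eps=1e-6):
        # Pilot estimator beta_tilde (ridge is used here purely for illustration).
        beta_tilde = Ridge(alpha=1.0).fit(X, y).coef_
        w = 1.0 / (np.abs(beta_tilde) ** gamma + eps)  # adaptive weights
        X_scaled = X / w                               # column j divided by w_j
        theta = Lasso(alpha=lam).fit(X_scaled, y).coef_
        return theta / w                               # map back: beta_j = theta_j / w_j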

Slide 31

Slide 31 text

Outline
1 Models for sparse estimation
2 Various sparse regularizations
3 Theory of sparse estimation
   Theory for n ≫ p
   Theory for n ≪ p
4 Hypothesis testing for high-dimensional linear regression
5 Optimization methods for sparse estimation

Slide 32

Slide 32 text

Problem setting
For simplicity, consider linear regression:
Y = Xβ* + ε.
Y ∈ R^n: response variable, X ∈ R^{n×p}: explanatory variables, ε = [ε_1, ..., ε_n]⊤ ∈ R^n.
Generalization to generalized linear models is also possible.

Slide 33

Slide 33 text

Theory for n ≫ p

Slide 34

Slide 34 text

Asymptotic distribution of the Lasso
p is fixed; consider the asymptotic behavior as n → ∞.
(1/n) X⊤X →_p C ≻ O.
The noise ε_i has mean 0 and variance σ².
Theorem (asymptotic distribution of the Lasso (Knight and Fu, 2000))
If λ_n √n → λ_0 ≥ 0, then
√n(β̂ − β*) →_d argmin_u V(u),
where V(u) = u⊤Cu − 2u⊤W + λ_0 Σ_{j=1}^p [u_j sign(β*_j) 1(β*_j ≠ 0) + |u_j| 1(β*_j = 0)], W ~ N(0, σ²C).
β̂ is √n-consistent.
For components with β*_j = 0, β̂_j equals 0 with positive probability.
Because of the third term, an asymptotic bias remains.
Here β̂ = argmin_β (1/n)‖Y − Xβ‖² + λ_n Σ_{j=1}^p |β_j|.

Slide 35

Slide 35 text

Oracle property of the adaptive Lasso
β̃ is some consistent estimator.
β̂ = argmin_β (1/n)‖Y − Xβ‖² + λ_n Σ_{j=1}^p |β_j| / |β̃_j|^γ.
Theorem (oracle property of the adaptive Lasso (Zou, 2006))
If λ_n √n → 0 and λ_n n^{(1+γ)/2} → ∞, then
1. lim_{n→∞} P(Ĵ = J) = 1, where Ĵ := {j | β̂_j ≠ 0} and J := {j | β*_j ≠ 0},
2. √n(β̂_J − β*_J) →_d N(0, σ² C_{JJ}^{-1}).
Consistent variable selection.
Asymptotically unbiased and asymptotically normal.
Note, however, that this says nothing about local analyses where some components of β* approach the origin (situations with β*_j = O(1/√n)).

Slide 36

Slide 36 text

Theory for n ≪ p

Slide 37

Slide 37 text

Upper bound on the risk of the Lasso
β̂ = argmin_{β∈R^p} (1/n)‖Xβ − Y‖²₂ + λ_n Σ_{j=1}^p |β_j|.
Theorem (convergence rate of the Lasso (Bickel et al., 2009; Zhang, 2009))
If the design matrix satisfies the restricted eigenvalue condition (Bickel et al., 2009) and max_{i,j} |X_{ij}| ≤ 1, and the noise satisfies E[e^{τε_i}] ≤ e^{σ²τ²/2} (∀τ > 0), then with probability 1 − δ,
‖β̂ − β*‖²₂ ≤ C d log(p/δ)/n.
Even when the dimension is high, it enters only through log(p); the effective dimension d is dominant.
The noise condition is equivalent to sub-Gaussianity.

Slide 38

Slide 38 text

Minimax optimality of the Lasso
Theorem (minimax optimal rate (Raskutti and Wainwright, 2011))
Under a certain condition, with probability at least 1/2,
min_{β̂: estimator} max_{β*: d-sparse} ‖β̂ − β*‖²₂ ≥ C d log(p/d)/n.
The Lasso achieves the minimax rate (up to the d log(d)/n term).
Extensions of this result to multiple kernel learning: Raskutti et al. (2012), Suzuki and Sugiyama (2013).

Slide 39

Slide 39 text

Restricted eigenvalue condition
Let A = (1/n) X⊤X.
Definition (restricted eigenvalue condition RE(k′, C))
φ_RE(k′, C) = φ_RE(k′, C, A) := inf_{J⊆{1,...,p}, v∈R^p: |J|≤k′, C‖v_J‖₁ ≥ ‖v_{J^c}‖₁} v⊤Av / ‖v_J‖²₂,
and the condition requires φ_RE > 0.
This is the smallest eigenvalue restricted to (nearly) sparse vectors.

Slide 40

Slide 40 text

Compatibility condition
Let A = (1/n) X⊤X.
Definition (compatibility condition COM(J, C))
φ_COM(J, C) = φ_COM(J, C, A) := inf_{v∈R^p: C‖v_J‖₁ ≥ ‖v_{J^c}‖₁} |J| v⊤Av / ‖v_J‖²₁,
and the condition requires φ_COM > 0.
If |J| ≤ k′, this is a weaker condition than RE.

Slide 41

Slide 41 text

Restricted isometry condition
Definition (restricted isometry condition RI(k′, δ))
There exist a universal constant c and some δ > 0 such that
(1 − δ)‖β‖² ≤ ‖Xβ‖² ≤ (1 + δ)‖β‖²
holds for every k′-sparse vector β ∈ R^p.
A stronger condition than RE and COM.
Related to the Johnson–Lindenstrauss lemma.
Commonly used in the exact-recovery analysis of compressed sensing.

Slide 42

Slide 42 text

Relations among the conditions and convergence rates
β̂: the Lasso estimator. J := {j | β*_j ≠ 0}, d := |J|.

Condition (strong → weak) | (1/n)‖X(β̂−β*)‖²₂ | ‖β̂−β*‖²₂     | ‖β̂−β*‖²₁
RI(2d, δ)                 | → exact recovery in compressed sensing
  ⇓
RE(2d, 3)                 | d log(p)/n        | d log(p)/n    | d² log(p)/n
  ⇓
COM(J, 3)                 | d log(p)/n        | d² log(p)/n   | d² log(p)/n

Details of the related results are covered comprehensively in Bühlmann and van de Geer (2011).

Slide 43

Slide 43 text

Probability that the restricted eigenvalue condition (RE) holds
How easily is the restricted eigenvalue condition satisfied?
A p-dimensional random variable Z is isotropic if E[⟨Z, z⟩²] = ‖z‖²₂ (∀z ∈ R^p).
Define the sub-Gaussian norm ‖Z‖_{ψ₂} by
‖Z‖_{ψ₂} = sup_{z∈R^p, ‖z‖=1} inf_t {t | E[exp(⟨Z, z⟩²/t²)] ≤ 2}.
1. Let the rows Z_i ∈ R^p of Z = [Z_1, Z_2, ..., Z_n]⊤ ∈ R^{n×p} be independent isotropic sub-Gaussian random variables.
2. Suppose X = ZΣ for some positive semidefinite symmetric matrix Σ ∈ R^{p×p}.
Theorem (Rudelson and Zhou (2013))
Assume ‖Z_i‖_{ψ₂} ≤ κ (∀i). There is a universal constant c_0 such that, with m = c_0 max_i(Σ_{i,i})² / φ²_RE(k, 9, Σ), if n ≥ 4 c_0 m κ⁴ log(60ep/(mκ)), then
P( φ_RE(k, 3, X⊤X/n) ≥ (1/2) φ_RE(k, 9, Σ) ) ≥ 1 − 2 exp(−n/(4c_0κ⁴)).
In other words: if the true covariance matrix satisfies the restricted eigenvalue condition, then the empirical covariance matrix also satisfies it with high probability.

Slide 44

Slide 44 text

Properties of sparse estimators not based on convex optimization
Information-criterion-type estimators: Massart (2003), Bunea et al. (2007), Rigollet and Tsybakov (2011):
min_{β∈R^p} ‖Y − Xβ‖² + Cσ²‖β‖₀ {1 + log(p/‖β‖₀)}.
Bayes estimators: Dalalyan and Tsybakov (2008), Alquier and Lounici (2011), Suzuki (2012).
Oracle inequality: without imposing any condition on X, the following holds:
(1/n)‖Xβ* − Xβ̂‖² ≤ Cσ² (d/n) log(1 + p/d).
■ Minimax optimal.
■ A large gap from convex regularized estimators.
■ A trade-off between computational cost and statistical performance.

Slide 45

Slide 45 text

Outline
1 Models for sparse estimation
2 Various sparse regularizations
3 Theory of sparse estimation
   Theory for n ≫ p
   Theory for n ≪ p
4 Hypothesis testing for high-dimensional linear regression
5 Optimization methods for sparse estimation

Slide 46

Slide 46 text

The debiasing approach
Idea: remove the bias from the Lasso estimator β̂.
(van de Geer et al., 2014; Javanmard and Montanari, 2014)
β̃ = β̂ + MX⊤(Y − Xβ̂).
If M were (X⊤X)^{-1}, then β̃ = β* + (X⊤X)^{-1}X⊤ε → no bias, (asymptotically) normal.
Problem: when p ≫ n, X⊤X is not invertible.

Slide 47

Slide 47 text

How to choose M:
min_{M∈R^{p×p}} |Σ̂M⊤ − I|_∞, with Σ̂ = X⊤X/n
(|·|_∞ is the ∞-norm of the entries viewed as a vector).
Theorem (Javanmard and Montanari (2014))
Suppose ε_i ~ N(0, σ²) (i.i.d.). Then
√n(β̃ − β*) = Z + Δ, Z ~ N(0, σ²MΣ̂M⊤), Δ = √n(MΣ̂ − I)(β* − β̂).
Moreover, when X is random with a positive definite covariance matrix and λ_n = cσ√(log(p)/n),
‖Δ‖_∞ = O_p( d log(p)/√n ).
Note: if n ≫ d² log²(p), then Δ ≈ 0 and √n(β̃ − β*) is approximately normally distributed.
→ Confidence intervals and hypothesis tests can be constructed.

Slide 48

Slide 48 text

How to choose M:
min_{M∈R^{p×p}} |Σ̂M⊤ − I|_∞
(|·|_∞ is the ∞-norm of the entries viewed as a vector).
Theorem (Javanmard and Montanari (2014))
√n(β̃ − β*) = Z + Δ,
where Z follows a normal distribution and Δ is the leftover term caused by X⊤X not being invertible; under good conditions it converges to 0.
Note: if n ≫ d² log²(p), then Δ ≈ 0 and √n(β̃ − β*) is approximately normally distributed.
→ Confidence intervals and hypothesis tests can be constructed. (A numerical sketch follows below.)
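A hedged numerical sketch of the debiasing formula. Solving for M via the ℓ∞-constrained program above requires a convex solver, so purely for illustration M is replaced by a ridge-regularized inverse of Σ̂ = X⊤X/n; this is an assumption and not the estimator analyzed by Javanmard and Montanari. With M normalized against Σ̂, the correction carries a 1/n factor.

    import numpy as np

    def debiased_lasso(X, y, beta_hat, ridge=1e-2):
        """Debias a given Lasso fit: beta_tilde = beta_hat + M X^T (y - X beta_hat) / n."""
        n, p = X.shape
        Sigma_hat = X.T @ X / n
        # Crude surrogate for M (the real method solves min m^T Sigma_hat m
        # subject to |Sigma_hat m - e_j|_inf <= mu for each row).
        M = np.linalg.inv(Sigma_hat + ridge * np.eye(p))
        beta_tilde = beta_hat + M @ X.T @ (y - X @ beta_hat) / n
        # (Confidence intervals would use the diagonal of sigma^2 * M @ Sigma_hat @ M.T / n.)
        return beta_tilde, M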

Slide 49

Slide 49 text

Numerical experiments
(a) 95% confidence intervals on synthetic data, (n, p, d) = (1000, 600, 10).
(b) CDF of the p-values on synthetic data, (n, p, d) = (1000, 600, 10).
Figures are taken from Javanmard and Montanari (2014).

Slide 50

Slide 50 text

The covariance test statistic (Lockhart et al., 2014)
[Figure: Lasso coefficient paths plotted against the L1 norm of the solution.]

Slide 51

Slide 51 text

The covariance test statistic (Lockhart et al., 2014)
[Figure: Lasso coefficient paths plotted against the L1 norm of the solution.]
J = supp(β̂(λ_k)), J* = supp(β*),
β̃(λ_{k+1}) := argmin_{β: β_J ∈ R^{|J|}, β_{J^c} = 0} ‖Y − X_J β_J‖² + λ_{k+1}‖β_J‖₁.
If J* ⊆ J (i.e., the variable entering next has β*_j = 0), then
T_k = ( ⟨Y, Xβ̂(λ_{k+1})⟩ − ⟨Y, Xβ̃(λ_{k+1})⟩ ) / σ² →_d Exp(1) (n, p → ∞).

Slide 52

Slide 52 text

Outline
1 Models for sparse estimation
2 Various sparse regularizations
3 Theory of sparse estimation
   Theory for n ≫ p
   Theory for n ≪ p
4 Hypothesis testing for high-dimensional linear regression
5 Optimization methods for sparse estimation

Slide 53

Slide 53 text

Optimization issues in sparse estimation
R(β) = Σ_{i=1}^n ℓ(y_i, x_i⊤β) + ψ(β) = f(β) + ψ(β),
where f(β) is the loss term and ψ(β) is the regularization term.
ψ is pointed (non-smooth) → not differentiable; optimizing a pointed function is hard.
f is smooth in many cases.
By exploiting the structure of ψ, we can optimize almost as if R were smooth.

Slide 54

Slide 54 text

Optimization issues in sparse estimation
R(β) = Σ_{i=1}^n ℓ(y_i, x_i⊤β) + ψ(β) = f(β) + ψ(β),
where f(β) is the loss term and ψ(β) is the regularization term.
ψ is pointed (non-smooth) → not differentiable; optimizing a pointed function is hard.
f is smooth in many cases.
By exploiting the structure of ψ, we can optimize almost as if R were smooth.
Typical example: L1 regularization, ψ(β) = C Σ_{j=1}^p |β_j| → separable across coordinates.
The one-dimensional optimization min_b {(b − y)² + C|b|} is easy.

Slide 55

Slide 55 text

Coordinate descent
Procedure:
1. Choose a coordinate j ∈ {1, ..., p} by some rule.
2. Update the j-th coordinate β_j. (Examples of update rules:)
   β_j^{(k+1)} ← argmin_{β_j} R([β_1^{(k)}, ..., β_j, ..., β_p^{(k)}]).
   With g_j = ∂f(β^{(k)})/∂β_j,
   β_j^{(k+1)} ← argmin_{β_j} ⟨g_j, β_j⟩ + ψ_j(β_j) + (η_k/2)‖β_j − β_j^{(k)}‖².
It is also common to pick several coordinates at a time instead of one: block coordinate descent.
The coordinate can be chosen according to a fixed rule or at random. (A sketch for the Lasso follows below.)
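A minimal sketch of cyclic coordinate descent for the Lasso objective (1/n)‖Y − Xβ‖² + λ‖β‖₁, using the exact one-dimensional update. Plain NumPy is assumed; it is not tuned for efficiency and assumes no all-zero columns.

    import numpy as np

    def lasso_coordinate_descent(X, y, lam, n_iter=100):
        n, p = X.shape
        beta = np.zeros(p)
        r = y - X @ beta                  # residual, kept up to date
        for _ in range(n_iter):
            for j in range(p):            # cyclic sweep over the coordinates
                r += X[:, j] * beta[j]    # remove coordinate j from the residual
                c = 2.0 * (X[:, j] @ r) / n
                a = 2.0 * (X[:, j] @ X[:, j]) / n
                # Exact minimizer of the 1-D problem: soft-threshold, then rescale.
                beta[j] = np.sign(c) * max(abs(c) - lam, 0.0) / a
                r -= X[:, j] * beta[j]    # put the updated coordinate back
        return beta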

Slide 56

Slide 56 text

Convergence rate of coordinate descent
Consider a separable regularizer ψ(β) = Σ_{j=1}^p ψ_j(β_j).
f is differentiable with an L-Lipschitz continuous gradient (‖∇f(β) − ∇f(β′)‖ ≤ L‖β − β′‖) → in this case f is said to be "smooth".
Cyclic rule (Saha and Tewari, 2013):
R(β^{(k)}) − R(β̂) ≤ L‖β^{(0)} − β̂‖² / (2k) = O(1/k).
Random selection (Nesterov, 2012; Richtárik and Takáč, 2014):
Without acceleration: O(1/k).
With Nesterov's acceleration: O(1/k²) (Fercoq and Richtárik, 2013).
If f is α-strongly convex: O(exp(−C(α/L)k)).
If f is α-strongly convex, with acceleration: O(exp(−C√(α/L) k)) (Lin et al., 2014).
Choosing the coordinates at random is good enough.

Slide 57

Slide 57 text

Coordinate descent on large-scale data
Hydra: coordinate descent with parallel and distributed computation (Richtárik and Takáč, 2013; Fercoq et al., 2014).
[Figure: computational efficiency of Hydra on a large-scale Lasso problem (p = 5 × 10^8, n = 10^9) (Richtárik and Takáč, 2013); 128 nodes, 4,096 cores.]

Slide 58

Slide 58 text

Proximal-gradient-type methods
Linearize f(β), keep ψ(β).
g_k ∈ ∂f(β^{(k)}), ḡ_k = (1/k) Σ_{τ=1}^k g_τ.
Proximal gradient method:
β^{(k+1)} = argmin_{β∈R^p} { g_k⊤β + ψ(β) + (η_k/2)‖β − β^{(k)}‖² }.
Regularized dual averaging (Xiao, 2009; Nesterov, 2009):
β^{(k+1)} = argmin_{β∈R^p} { ḡ_k⊤β + ψ(β) + (η_k/2)‖β‖² }.
The key computation is the proximal map:
prox(q|ψ) := argmin_x { ψ(x) + (1/2)‖x − q‖² }.
For L1 regularization it can be computed easily (the soft-thresholding function).

Slide 59

Slide 59 text

Example of the proximal map: L1 regularization
prox(q | C‖·‖₁) = argmin_x { C‖x‖₁ + (1/2)‖x − q‖² } = ( sign(q_j) max(|q_j| − C, 0) )_j.
→ The soft-thresholding function. A closed-form solution!
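Putting the two pieces together, a minimal proximal-gradient (ISTA-style) sketch for the Lasso with a constant step size 1/L, where L is the Lipschitz constant of ∇f for f(β) = (1/n)‖Y − Xβ‖² (function name and iteration count are illustrative):

    import numpy as np

    def lasso_proximal_gradient(X, y, lam, n_iter=500):
        n, p = X.shape
        # Lipschitz constant of grad f(beta) = (2/n) X^T (X beta - y).
        L = 2.0 * np.linalg.norm(X, 2) ** 2 / n
        beta = np.zeros(p)
        for _ in range(n_iter):
            grad = 2.0 * X.T @ (X @ beta - y) / n
            q = beta - grad / L
            # prox of (lam/L) * ||.||_1, i.e. soft-thresholding.
            beta = np.sign(q) * np.maximum(np.abs(q) - lam / L, 0.0)
        return beta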

Slide 60

Slide 60 text

Convergence rates of proximal-gradient-type methods

Property of f       | smooth          | non-smooth
strongly convex     | exp(−√(α/L)·k)  | 1/k
not strongly convex | 1/k²            | 1/√k

1. In the smooth case, the table shows the rate obtained with Nesterov's acceleration (Nesterov, 2007; Zhang et al., 2010).
2. Without acceleration, the smooth rates become exp(−(α/L)k) and 1/k, respectively.
3. The orders above are optimal among methods that use only gradient information (first-order methods).

Slide 61

Slide 61 text

Augmented-Lagrangian-type methods
min_β f(β) + ψ(β) ⇔ min_{x,y} f(x) + ψ(y) s.t. x = y.
The difficulty of the optimization is split apart.
Augmented Lagrangian:
L(x, y, λ) = f(x) + ψ(y) + λ⊤(y − x) + (ρ/2)‖y − x‖².
Method of multipliers (Hestenes, 1969; Powell, 1969; Rockafellar, 1976):
1. (x^{(k+1)}, y^{(k+1)}) = argmin_{x,y} L(x, y, λ^{(k)}).
2. λ^{(k+1)} = λ^{(k)} + ρ(y^{(k+1)} − x^{(k+1)}).
The joint minimization over x and y is somewhat cumbersome → the alternating direction method of multipliers.

Slide 62

Slide 62 text

Alternating direction method of multipliers (ADMM)
min_{x,y} f(x) + ψ(y) s.t. x = y.
L(x, y, λ) = f(x) + ψ(y) + λ⊤(y − x) + (ρ/2)‖y − x‖².
ADMM (Gabay and Mercier, 1976):
x^{(k+1)} = argmin_x f(x) − λ^{(k)⊤}x + (ρ/2)‖y^{(k)} − x‖²
y^{(k+1)} = argmin_y ψ(y) + λ^{(k)⊤}y + (ρ/2)‖y − x^{(k+1)}‖²  (= prox(x^{(k+1)} − λ^{(k)}/ρ | ψ/ρ))
λ^{(k+1)} = λ^{(k)} − ρ(x^{(k+1)} − y^{(k+1)})
The joint minimization over x and y is avoided by optimizing them alternately.
The y-update is a proximal map, which is easy for penalties such as L1.
Extensions to structured regularization are also easy.
Convergence to an optimal solution is guaranteed: in general O(1/k) (He and Yuan, 2012), and linear convergence under strong convexity (Deng and Yin, 2012; Hong and Luo, 2012). (A sketch for the Lasso follows below.)
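A minimal ADMM sketch for the Lasso, following the three updates above with f(x) = (1/n)‖Y − Xx‖² and ψ(y) = λ‖y‖₁. It uses the scaled-multiplier form u = λ-multiplier/ρ, which is equivalent to the updates in the slide; the value of ρ and the iteration count are illustrative assumptions.

    import numpy as np

    def lasso_admm(X, Y, lam, rho=1.0, n_iter=200):
        n, p = X.shape
        y = np.zeros(p)   # the variable carrying the L1 penalty
        u = np.zeros(p)   # scaled multiplier
        # The x-update solves a ridge-type linear system; set it up once.
        A = 2.0 * X.T @ X / n + rho * np.eye(p)
        XtY = 2.0 * X.T @ Y / n
        for _ in range(n_iter):
            x = np.linalg.solve(A, XtY + rho * (y - u))
            q = x + u
            y = np.sign(q) * np.maximum(np.abs(q) - lam / rho, 0.0)  # prox(x + u | psi/rho)
            u = u + x - y
        return y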

Slide 63

Slide 63 text

Stochastic optimization
Useful for large-scale data with many samples.
A single update does not need to read in all the samples.
■ Online-type methods: FOBOS (Duchi and Singer, 2009), RDA (Xiao, 2009).
■ Batch-type methods: SVRG (Stochastic Variance Reduced Gradient) (Johnson and Zhang, 2013), SDCA (Stochastic Dual Coordinate Ascent) (Shalev-Shwartz and Zhang, 2013), SAG (Stochastic Averaging Gradient) (Le Roux et al., 2013).
Stochastic alternating direction method of multipliers: Suzuki (2013), Ouyang et al. (2013), Suzuki (2014).
(A sketch of an online proximal update follows below.)
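A minimal sketch of an online proximal (FOBOS-style) update for the L1-regularized squared loss: each step uses a single sample's gradient followed by soft-thresholding. The step-size schedule, the number of epochs, and the function name are illustrative assumptions, not the published algorithm's tuning.

    import numpy as np

    def l1_proximal_sgd(X, y, lam, n_epochs=5, eta0=0.1):
        n, p = X.shape
        beta = np.zeros(p)
        t = 0
        for _ in range(n_epochs):
            for i in np.random.permutation(n):   # one sample per update
                t += 1
                eta = eta0 / np.sqrt(t)
                grad = 2.0 * (X[i] @ beta - y[i]) * X[i]
                q = beta - eta * grad
                beta = np.sign(q) * np.maximum(np.abs(q) - eta * lam, 0.0)
        return beta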

Slide 64

Slide 64 text

Summary
Various kinds of sparse modeling: L1 regularization, group regularization, trace norm regularization.
Asymptotic behavior of the Lasso: asymptotic distribution, oracle property of the adaptive Lasso.
Restricted eigenvalue condition → ‖β̂ − β*‖² = O_p(d log(p)/n).
Testing: the debiasing approach, the covariance test statistic.
Optimization methods: coordinate descent, proximal gradient methods, the (alternating direction) method of multipliers.

Slide 65

Slide 65 text

P. Alquier and K. Lounici. PAC-Bayesian bounds for sparse regression estimation with exponential weights. Electronic Journal of Statistics, 5:127–145, 2011.
T. Anderson. Estimating linear restrictions on regression coefficients for multivariate normal distributions. Annals of Mathematical Statistics, 22:327–351, 1951.
A. Argyriou, C. A. Micchelli, M. Pontil, and Y. Ying. A spectral regularization framework for multi-task structure learning. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 25–32, Cambridge, MA, 2008. MIT Press.
O. Banerjee, L. E. Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485–516, 2008.
J. Bennett and S. Lanning. The Netflix prize. In Proceedings of KDD Cup and Workshop 2007, 2007.
P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.

Slide 66

Slide 66 text

P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data. Springer, 2011.
F. Bunea, A. Tsybakov, and M. Wegkamp. Aggregation for Gaussian regression. The Annals of Statistics, 35(4):1674–1697, 2007.
G. R. Burket. A study of reduced-rank models for multiple prediction, volume 12 of Psychometric Monographs. Psychometric Society, 1964.
E. Candès and T. Tao. The power of convex relaxations: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56:2053–2080, 2009.
E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
E. J. Candès and T. Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Transactions on Information Theory, 52(12):5406–5425, 2006.
A. Dalalyan and A. B. Tsybakov. Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Machine Learning, 72:39–61, 2008.

Slide 67

Slide 67 text

W. Deng and W. Yin. On the global and linear convergence of the generalized alternating direction method of multipliers. Technical report, Rice University CAAM TR12-14, 2012.
D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
D. L. Donoho and I. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425–455, 1994.
J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2873–2908, 2009.
J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 2001.
O. Fercoq and P. Richtárik. Accelerated, parallel and proximal coordinate descent. Technical report, 2013. arXiv:1312.5799.
O. Fercoq, Z. Qu, P. Richtárik, and M. Takáč. Fast distributed coordinate descent for non-strongly convex losses. In Proceedings of MLSP2014: IEEE International Workshop on Machine Learning for Signal Processing, 2014.

Slide 68

Slide 68 text

I. E. Frank and J. H. Friedman. A statistical view of some chemometrics regression tools. Technometrics, 35(2):109–135, 1993.
D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite-element approximations. Computers & Mathematics with Applications, 2:17–40, 1976.
T. Hastie and R. Tibshirani. Generalized Additive Models. Chapman & Hall Ltd, 1999.
B. He and X. Yuan. On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700–709, 2012.
M. Hestenes. Multiplier and gradient methods. Journal of Optimization Theory & Applications, 4:303–320, 1969.
M. Hong and Z.-Q. Luo. On the linear convergence of the alternating direction method of multipliers. Technical report, 2012. arXiv:1208.3922.
A. J. Izenman. Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis, pages 248–264, 1975.

Slide 69

Slide 69 text

L. Jacob, G. Obozinski, and J.-P. Vert. Group lasso with overlap and graph lasso. In Proceedings of the 26th International Conference on Machine Learning, 2009.
A. Javanmard and A. Montanari. Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research, to appear, 2014.
R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 315–323. Curran Associates, Inc., 2013. URL http://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-descent-using-predicti pdf.
K. Knight and W. Fu. Asymptotics for lasso-type estimators. The Annals of Statistics, 28(5):1356–1378, 2000.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems 25, 2013.

Slide 70

Slide 70 text

Q. Lin, Z. Lu, and L. Xiao. An accelerated proximal coordinate gradient method and its application to regularized empirical risk minimization. Technical report, 2014. arXiv:1407.1296.
R. Lockhart, J. Taylor, R. J. Tibshirani, and R. Tibshirani. A significance test for the lasso. The Annals of Statistics, 42(2):413–468, 2014.
K. Lounici, A. Tsybakov, M. Pontil, and S. van de Geer. Taking advantage of sparsity in multi-task learning. 2009.
P. Massart. Concentration Inequalities and Model Selection: École d'été de Probabilités de Saint-Flour 23. Springer, 2003.
N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.
Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 76, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 2007.
Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, Series B, 120:221–259, 2009.

Slide 71

Slide 71 text

Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
H. Ouyang, N. He, L. Q. Tran, and A. Gray. Stochastic alternating direction method of multipliers. In Proceedings of the 30th International Conference on Machine Learning, 2013.
P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, Series A, 144:1–38, 2014.
M. Powell. A method for nonlinear constraints in minimization problems. In R. Fletcher, editor, Optimization, pages 283–298. Academic Press, London, New York, 1969.
G. Raskutti and M. J. Wainwright. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Transactions on Information Theory, 57(10):6976–6994, 2011.
G. Raskutti, M. Wainwright, and B. Yu. Minimax-optimal rates for sparse additive models over kernel classes via convex programming. Journal of Machine Learning Research, 13:389–427, 2012.

Slide 72

Slide 72 text

P. Ravikumar, J. Lafferty, H. Liu, and L. Wasserman. Sparse additive models. Journal of the Royal Statistical Society: Series B, 71(5):1009–1030, 2009.
P. Richtárik and M. Takáč. Distributed coordinate descent method for learning with big data. Technical report, 2013. arXiv:1310.2059.
P. Rigollet and A. Tsybakov. Exponential screening and optimal rates of sparse estimation. The Annals of Statistics, 39(2):731–771, 2011.
R. T. Rockafellar. Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of Operations Research, 1:97–116, 1976.
M. Rudelson and S. Zhou. Reconstruction from anisotropic random measurements. IEEE Transactions on Information Theory, 39, 2013.
A. Saha and A. Tewari. On the non-asymptotic convergence of cyclic coordinate descent methods. SIAM Journal on Optimization, 23(1):576–601, 2013.
S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. Technical report, 2013. arXiv:1211.2717.

Slide 73

Slide 73 text

N. Srebro, N. Alon, and T. Jaakkola. Generalization error bounds for collaborative prediction with low-rank matrices. In Advances in Neural Information Processing Systems (NIPS) 17, 2005.
T. Suzuki. PAC-Bayesian bound for Gaussian process regression and multiple kernel additive model. In JMLR Workshop and Conference Proceedings, volume 23, pages 8.1–8.20, 2012. Conference on Learning Theory (COLT2012).
T. Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In Proceedings of the 30th International Conference on Machine Learning, pages 392–400, 2013.
T. Suzuki. Stochastic dual coordinate ascent with alternating direction method of multipliers. In Proceedings of the 31st International Conference on Machine Learning, pages 736–744, 2014.
T. Suzuki and M. Sugiyama. Fast learning rate of multiple kernel learning: trade-off between sparsity and smoothness. The Annals of Statistics, 41(3):1381–1405, 2013.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.

Slide 74

Slide 74 text

R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B, 67(1):91–108, 2005.
S. van de Geer, P. Bühlmann, Y. Ritov, and R. Dezeure. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202, 2014.
L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems 23, 2009.
M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.
C.-H. Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.
P. Zhang, A. Saha, and S. V. N. Vishwanathan. Regularized risk minimization by Nesterov's accelerated gradient methods: Algorithmic extensions and empirical studies. CoRR, abs/1011.0472, 2010.
T. Zhang. Some sharp performance bounds for least squares regression with L1 regularization. The Annals of Statistics, 37(5):2109–2144, 2009.
H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.

Slide 75

Slide 75 text

Toshiyuki Tanaka. Mathematics of compressed sensing (in Japanese). IEICE Fundamentals Review, 4(1):39–47, 2010.