Slide 1

Statistical Learning Theory Tutorial: From Basics to Applications
Taiji Suzuki
Department of Mathematical Informatics, Graduate School of Information Science and Technology, The University of Tokyo
IBIS 2012 @ University of Tsukuba, Tokyo Campus (Bunkyo)
November 7, 2012
1 / 60

Slide 2

Outline
1. Introduction: the role of theory
2. Statistical learning theory and empirical processes
3. Uniform bounds
   - Basic inequalities
   - Rademacher complexity and the Dudley integral
   - Local Rademacher complexity
4. Optimality
   - Admissibility
   - Minimax optimality
5. Bayesian learning theory

Slide 3

Outline (repeated as a section divider, highlighting Section 1: the role of theory).

Slide 4

Do we need theory in the first place? You can run an SVM without knowing the VC dimension or the theory of reproducing kernels. You can implement nonparametric Bayes without knowing measure theory. Are theorists useless, then, just spinning out difficult arguments?

Slide 5

The relationship between basic research and applications. (Figure: a timeline of applications vs. foundations/theory.) A steady stream of basic research keeps being turned into applications.

Slide 6

The same figure, annotated: the evolution of the machine learning field.

Slide 7

The same figure, annotated: basic research that leads to future applications.

Slide 8

The same figure, annotated: communication. The common language is mathematics, i.e., theory. Communication across different levels sometimes leads to new discoveries.

Slide 9

Success stories from history:
- SVM (Vapnik, 1998; Cortes and Vapnik, 1995): VC dimension
- AdaBoost (Freund and Schapire, 1995): learnability by weak learners
- Dirichlet process (Ferguson, 1973): probability theory, measure theory

Slide 10

Further examples, in addition to the above:
- Lasso (Tibshirani, 1996)
- AIC (Akaike, 1974)
- Compressed sensing (Candès, Tao and Donoho, 2004)
and so on. What matters is understanding the essence → new methods (and that is what this session is for!).

Slide 11

More direct benefits of knowing learning theory:
1. The meaning of a method: what is the method actually doing? → how to use it correctly
2. The validity of a method: does it give a proper solution ("does it really converge?")
3. The optimality of a method: is it optimal with respect to some criterion? → it can be used with confidence

Slide 12

(1) The meaning of a method. Example: choosing the loss function. Binary classification: hinge loss or logistic loss, which should we use?

  min_{f} (1/n) Σ_{i=1}^n φ(−y_i f(x_i))   (y_i ∈ {±1})

1. Both minimize the classification error. When φ is convex: "φ is classification-calibrated ⇔ φ is differentiable at the origin with φ′(0) > 0" (Bartlett et al., 2006). Classification calibration: the minimizer of the expected risk, argmin_f E[φ(−Y f(X))], is Bayes optimal.

Slide 13

The same example, continued.
2. Number of support vectors vs. ability to estimate the conditional probability p(Y|X):
   - Hinge: sparse solution (the conditional probability cannot be estimated at all)
   - Logistic: the conditional probability is recovered (but every sample becomes a support vector)
"There is a trade-off between the number of support vectors and the ability to estimate conditional probabilities; the two cannot be fully reconciled." (Bartlett and Tewari, 2007)

Slide 14

(2) The validity of a method and (3) the optimality of a method.
2. Validity. Example: consistency, \hat{f} (the estimator) →^p f* (the true function).
3. Optimality. Given that it converges, is the rate optimal? Admissibility; minimax optimality.
Today's talk centers on these topics.

Slide 15

Statistical Learning Theory (section title slide).

Slide 16

Outline (repeated as a section divider, highlighting Section 2: statistical learning theory and empirical processes).

Slide 17

Where statistical learning theory sits. (Figure.)

Slides 18-19

The same figure, with the note: in practice, the boundaries are quite blurry.

Slide 20

The learning theory covered today ≈ the theory of empirical processes. The key quantity to evaluate is

  sup_{f∈F} { (1/n) Σ_{i=1}^n f(x_i) − E[f] }.
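This supremum can be observed numerically in the classic Glivenko-Cantelli setting. The sketch below is my own illustration, not from the slides: for the class of threshold indicators f_t(x) = 1[x ≤ t] with x_i ~ U[0,1] we have E[f_t] = t, so the supremum over a grid of thresholds is (up to grid resolution) the Kolmogorov-Smirnov statistic, and it shrinks as n grows.

```python
import random
from bisect import bisect_right

def empirical_process_sup(n, grid=200, seed=0):
    """sup_t |(1/n) sum_i 1[x_i <= t] - t| over a grid of thresholds t,
    for x_i ~ U[0,1]. Since E[f_t] = t, this approximates the empirical
    process supremum over the class {f_t(x) = 1[x <= t]}."""
    rng = random.Random(seed)
    xs = sorted(rng.random() for _ in range(n))
    return max(abs(bisect_right(xs, k / grid) / n - k / grid)
               for k in range(grid + 1))

for n in (100, 1000, 10000):
    print(n, round(empirical_process_sup(n), 4))
```

The decay is roughly of order 1/√n, which is exactly the uniform-convergence behavior the slides go on to quantify.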

Slide 21

History: the theory of empirical processes.
1933  Glivenko, Cantelli: the Glivenko-Cantelli theorem (uniform law of large numbers)
1933  Kolmogorov: the Kolmogorov-Smirnov test (convergence rate, asymptotic distribution)
1952  Donsker: Donsker's theorem (uniform central limit theorem)
1967  Dudley: the Dudley integral
1968  Vapnik, Chervonenkis: the VC dimension (necessary and sufficient condition for uniform convergence)
1996  Talagrand: Talagrand's inequality

Slide 22

Outline (repeated as a section divider, highlighting Section 3: uniform bounds).

Slide 23

Problem setting: supervised learning.
- Training data: D_n = {(x_1, y_1), ..., (x_n, y_n)} ∈ (X × Y)^n, an i.i.d. sequence of input-output pairs.
- Loss function: ℓ(·, ·): Y × R → R_+, the penalty for a mistake.
- Hypothesis class (model): F, a set of functions X → R.
- \hat{f}: the estimator, an element of F constructed from the sample (x_i, y_i)_{i=1}^n.
The quantity to control (the generalization error), where the expectation is over test data:

  E_{(X,Y)}[ℓ(Y, \hat{f}(X))] − inf_{f: measurable} E_{(X,Y)}[ℓ(Y, f(X))]

Does the generalization error converge? How fast?

Slide 24

Bias-variance decomposition. Empirical risk: \hat{L}(f) = (1/n) Σ_{i=1}^n ℓ(y_i, f(x_i)); expected risk: L(f) = E_{(X,Y)}[ℓ(Y, f(X))].

  generalization error = L(\hat{f}) − inf_{f: measurable} L(f)
    = [L(\hat{f}) − inf_{f∈F} L(f)]   (estimation error)
    + [inf_{f∈F} L(f) − inf_{f: measurable} L(f)]   (model error)

For simplicity, assume that f* ∈ F exists with inf_{f∈F} L(f) = L(f*).

Slide 25

The same decomposition. Note: the model error is not treated today, but the modeling problem is extremely important: sieve methods, cross validation, information criteria, model averaging, and so on. For the treatment of the model error in kernel methods, see the theory of interpolation spaces (Steinwart et al., 2009; Eberts and Steinwart, 2012; Bennett and Sharpley, 1988). In what follows, the model error is assumed to be sufficiently small.

Slide 26

Empirical risk minimization (ERM):

  \hat{f} = argmin_{f∈F} \hat{L}(f)

Regularized empirical risk minimization (RERM), with regularization term ψ:

  \hat{f} = argmin_{f∈F} \hat{L}(f) + ψ(f)

There is also a large body of work on RERM (Steinwart and Christmann, 2008; Mukherjee et al., 2002); it lies on the natural extension of ERM.

Slide 27

The same slide, with ERM marked (the analysis below focuses on ERM).
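As a toy illustration of ERM (my own sketch, not from the slides; the name `erm_threshold` is invented): minimize the empirical 0-1 risk over the finite class of one-dimensional threshold classifiers, using the observed x-values as candidate thresholds.

```python
import random

def erm_threshold(data):
    """ERM over the class {x -> sign(x - t)} with the 0-1 loss:
    pick the threshold t minimizing the empirical risk."""
    def emp_risk(t):
        return sum((1 if x > t else -1) != y for x, y in data) / len(data)
    return min((x for x, _ in data), key=emp_risk)

rng = random.Random(1)
data = []
for _ in range(500):
    x = rng.random()
    y = 1 if x > 0.3 else -1          # true threshold is 0.3
    if rng.random() < 0.1:            # 10% label noise
        y = -y
    data.append((x, y))
print(erm_threshold(data))
```

With 500 noisy samples the empirical minimizer lands close to the true threshold 0.3, which is exactly the behavior the uniform bounds in the next sections guarantee.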

Slide 28

The starting point. The derivation of almost every bound begins with

  \hat{L}(\hat{f}) ≤ \hat{L}(f*)   (since \hat{f} minimizes the empirical risk)
  ⇒ L(\hat{f}) − L(f*) ≤ [L(\hat{f}) − \hat{L}(\hat{f})] + [\hat{L}(f*) − L(f*)]

Reminder: \hat{L}(f) = (1/n) Σ_{i=1}^n ℓ(y_i, f(x_i)), L(f) = E_{(X,Y)}[ℓ(Y, f(X))].

Slide 29

The same display, annotated: the left-hand side is the generalization error; the first bracket is the term in question ("?"); the second bracket is O_p(1/√n) (shown later).

Slide 30

A naive analysis would conclude

  L(\hat{f}) − \hat{L}(\hat{f}) → 0   (by the law of large numbers!!)
  L(\hat{f}) − \hat{L}(\hat{f}) = O_p(1/√n)   (by the central limit theorem!!)

An easy win!!!

Slide 31

No, this does not work: \hat{f} and the training data are not independent.

Slide 32

What is the problem? (Figure: the risk L(f) over f, with the minimizer f*.)

Slide 33

The same figure with \hat{L}(f) and \hat{f} added: some hypothesis may fit well "by accident" (overfitting). In fact, when F is complex there are examples where convergence fails.

Slide 34

The same figure with a uniform band: a uniform bound guarantees that such "accidental fits" (almost) never happen. That this is possible is not trivial; it is the subject of the theory of empirical processes.

Slide 35

The uniform bound:

  L(\hat{f}) − \hat{L}(\hat{f}) ≤ sup_{f∈F} { L(f) − \hat{L}(f) } ≤ (?)

Controlling the risk uniformly over F is the crucial step.

Slide 36

Outline (repeated as a section divider, highlighting "basic inequalities" under Section 3).

Slide 37

Start with the finite case: |F| < ∞.

Slide 38

Useful inequalities.

Hoeffding's inequality. Let Z_i (i = 1, ..., n) be independent (not necessarily identically distributed) mean-zero random variables with |Z_i| ≤ m_i. Then

  P( |Σ_{i=1}^n Z_i| / √n > t ) ≤ 2 exp( − t² / (2 Σ_{i=1}^n m_i² / n) )

Bernstein's inequality. Let Z_i (i = 1, ..., n) be independent (not necessarily identically distributed) mean-zero random variables with E[Z_i²] = σ_i² and |Z_i| ≤ M. Then

  P( |Σ_{i=1}^n Z_i| / √n > t ) ≤ 2 exp( − t² / (2((1/n) Σ_{i=1}^n σ_i² + Mt/√n)) )

Bernstein's inequality exploits the variance information.
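A quick numerical sanity check of Hoeffding's inequality (an illustrative sketch, not from the slides), taking Z_i to be fair ±1 coin flips so that m_i = 1: the empirical tail stays below the bound 2 exp(−t²/2).

```python
import math, random

def hoeffding_check(n=100, t=2.0, trials=5000, seed=0):
    """Empirical tail P(|sum Z_i| / sqrt(n) > t) for fair +-1 coin flips
    (so m_i = 1), compared with the Hoeffding bound 2 exp(-t^2 / 2)."""
    rng = random.Random(seed)
    hits = sum(
        abs(sum(rng.choice((-1, 1)) for _ in range(n))) / math.sqrt(n) > t
        for _ in range(trials)
    )
    return hits / trials, 2 * math.exp(-t * t / 2)

emp, bound = hoeffding_check()
print(emp, bound)  # the empirical tail stays below the bound
```

For this sub-Gaussian case the Hoeffding bound is loose (the true tail is roughly Gaussian), which is one motivation for the sharper variance-aware Bernstein bound.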

Slide 39

Useful inequalities: extended versions.

Hoeffding's inequality (sub-Gaussian tails). Let Z_i (i = 1, ..., n) be independent (not necessarily identically distributed) mean-zero random variables with E[e^{τZ_i}] ≤ e^{σ_i²τ²/2} for all τ > 0. Then

  P( |Σ_{i=1}^n Z_i| / √n > t ) ≤ 2 exp( − t² / (2 Σ_{i=1}^n σ_i² / n) )

Bernstein's inequality. Let Z_i (i = 1, ..., n) be independent (not necessarily identically distributed) mean-zero random variables with E[Z_i²] = σ_i² and E|Z_i|^k ≤ (k!/2) σ_i² M^{k−2} for all k ≥ 2. Then

  P( |Σ_{i=1}^n Z_i| / √n > t ) ≤ 2 exp( − t² / (2((1/n) Σ_{i=1}^n σ_i² + Mt/√n)) )

(Hilbert-space versions also exist.)

Slide 40

Uniform bound for a finite class, 1: the Hoeffding version. This alone is worth knowing (think of f ← ℓ(y, g(x)) − E[ℓ(Y, g(X))]).

Let F = {f_m : m = 1, ..., M} be a finite set of functions, each with mean zero (E[f_m(X)] = 0). Hoeffding's inequality (substituting Z_i = f_m(X_i)) gives

  P( |Σ_{i=1}^n f_m(X_i)| / √n > t ) ≤ 2 exp( − t² / (2 ‖f_m‖_∞²) ).

Uniform bound:
  • P( max_{1≤m≤M} |Σ_{i=1}^n f_m(X_i)| / √n > max_m ‖f_m‖_∞ √(2 log(2M/δ)) ) ≤ δ
  • E[ max_{1≤m≤M} |Σ_{i=1}^n f_m(X_i)| / √n ] ≤ C max_m ‖f_m‖_∞ √(log(1 + M))

Derivation (union bound):

  P( max_{1≤m≤M} |Σ_{i=1}^n f_m(X_i)| / √n > t )
    = P( ∪_{1≤m≤M} { |Σ_{i=1}^n f_m(X_i)| / √n > t } )
    ≤ 2 Σ_{m=1}^M exp( − t² / (2 ‖f_m‖_∞²) )
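The √(log(1 + M)) growth of the expected maximum can be seen in simulation. This is an illustrative sketch of my own; for simplicity the M functions are taken as independent ±1 sequences, although the union-bound argument itself does not require independence across m.

```python
import math, random, statistics

def expected_max(M, n=100, trials=200, seed=0):
    """Monte Carlo estimate of E[max_m |sum_i Z_i^(m)| / sqrt(n)] for M
    mean-zero +-1 sequences; the finite-class bound says this grows like
    sqrt(log(1 + M))."""
    rng = random.Random(seed)
    vals = []
    for _ in range(trials):
        vals.append(max(abs(sum(rng.choice((-1, 1)) for _ in range(n)))
                        / math.sqrt(n) for _ in range(M)))
    return statistics.fmean(vals)

ratios = [expected_max(M) / math.sqrt(math.log(1 + M)) for M in (2, 16, 128)]
print([round(r, 2) for r in ratios])  # ratios stay bounded as M grows
```

While each individual average is O(1) in this normalization, the maximum over M of them only grows logarithmically, which is why finite-class arguments scale so well to large M.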

Slide 41

Uniform bound for a finite class, 2: the Bernstein version. Let F = {f_m : m = 1, ..., M} be a finite set of functions, each with mean zero (E[f_m(X)] = 0). Bernstein's inequality gives

  P( |Σ_{i=1}^n f_m(X_i)| / √n > t ) ≤ 2 exp( − t² / (2(‖f_m‖_{L2}² + ‖f_m‖_∞ t/√n)) ).

Uniform bound:

  E[ max_{1≤m≤M} |Σ_{i=1}^n f_m(X_i)| / √n ] ≲ (1/√n) max_m ‖f_m‖_∞ log(1 + M) + max_m ‖f_m‖_{L2} √(log(1 + M))

Note: uniform bounds typically grow at the order √(log M).

Slide 42

Outline (repeated as a section divider, highlighting "Rademacher complexity and the Dudley integral" under Section 3).

Slide 43

From finite to infinite. What if the hypothesis class has infinitely many elements, or even a continuum? For example:

  F = { x ↦ x^⊤β | β ∈ R^d, ‖β‖ ≤ 1 },   F = { f ∈ H | ‖f‖_H ≤ 1 }

Slide 44

The basic idea: represent F by finitely many elements. (Figure.)

Slide 45

Rademacher complexity. Let ε_1, ε_2, ..., ε_n be Rademacher variables, i.e., P(ε_i = 1) = P(ε_i = −1) = 1/2. The Rademacher complexity of F is

  R(F) := E_{{ε_i},{x_i}}[ sup_{f∈F} (1/n) Σ_{i=1}^n ε_i f(x_i) ].

Symmetrization (in expectation):

  E[ sup_{f∈F} (1/n) Σ_{i=1}^n (f(x_i) − E[f]) ] ≤ 2 R(F).

If ‖f‖_∞ ≤ 1 for all f ∈ F, then (tail probability)

  P( sup_{f∈F} (1/n) Σ_{i=1}^n (f(x_i) − E[f]) ≥ 2 R(F) + √(t/(2n)) ) ≤ e^{−t}.

Bounding the Rademacher complexity therefore yields a uniform bound!
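For the linear class F = {x ↦ ⟨β, x⟩ : ‖β‖₂ ≤ 1} the supremum over f has the closed form ‖(1/n) Σ ε_i x_i‖₂, which makes a direct Monte Carlo estimate easy. The sketch below is my own illustration, not from the slides; for Gaussian inputs the estimate scales like √(d/n).

```python
import math, random

def rademacher_linear(xs, trials=500, seed=0):
    """Monte Carlo R(F) for F = {x -> <beta, x> : ||beta||_2 <= 1}, using
    sup_beta (1/n) sum_i eps_i <beta, x_i> = ||(1/n) sum_i eps_i x_i||_2."""
    rng = random.Random(seed)
    n, d = len(xs), len(xs[0])
    total = 0.0
    for _ in range(trials):
        s = [0.0] * d
        for x in xs:
            e = rng.choice((-1, 1))
            for j in range(d):
                s[j] += e * x[j]
        total += math.sqrt(sum(v * v for v in s)) / n
    return total / trials

rng = random.Random(1)
d = 5
estimates = {}
for n in (50, 200, 800):
    xs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    estimates[n] = rademacher_linear(xs)
    print(n, round(estimates[n] * math.sqrt(n), 2))  # roughly constant, about sqrt(d)
```

The √n-rescaled values stay roughly flat, matching the √(d/n) behavior that the covering-number bound on the next slides predicts for parametric classes.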

Slide 46

Properties of the Rademacher complexity.
- Contraction inequality: if ψ is Lipschitz continuous, i.e., |ψ(f) − ψ(f′)| ≤ B|f − f′|, then R({ψ(f) | f ∈ F}) ≤ B R(F).
- Convex hull: let conv(F) be the set of all convex combinations of elements of F; then R(conv(F)) = R(F).

Slide 47

The same properties; the first one is especially valuable. If |ℓ(y, f) − ℓ(y, f′)| ≤ |f − f′|, then

  E[ sup_{f∈F} |\hat{L}(f) − L(f)| ] ≤ 2 R(ℓ(F)) ≤ 2 R(F),

where ℓ(F) = {ℓ(·, f(·)) | f ∈ F}. Hence it suffices to control the Rademacher complexity of F itself! This Lipschitz continuity holds for the hinge loss and the logistic loss, and also for the squared loss when y and F are bounded.
Reminder: \hat{L}(f) = (1/n) Σ_{i=1}^n ℓ(y_i, f(x_i)), L(f) = E_{(X,Y)}[ℓ(Y, f(X))].

Slide 48

Covering numbers: a way to bound the Rademacher complexity. The covering number measures the complexity (capacity) of the hypothesis class F.

ε-covering number N(F, ε, d): the minimum number of balls of radius ε (with respect to the metric d) needed to cover F; equivalently, the minimum number of elements needed to approximate every member of F within ε.

Theorem (Dudley integral). With ‖f‖_n² := (1/n) Σ_{i=1}^n f(x_i)²,

  R(F) ≤ (C/√n) E_{D_n}[ ∫_0^∞ √(log N(F, ε, ‖·‖_n)) dε ].
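To get a feel for the Dudley integral, one can evaluate it numerically under a parametric-class heuristic log N(F, ε) ≈ d log(3/ε). This is my own assumption for illustration (the constant 3 is arbitrary); the point is that the bound then scales as √(d/n), independently of n beyond that factor.

```python
import math

def dudley_bound(d, n, eps_max=1.0, steps=2000):
    """Midpoint-rule evaluation of (1/sqrt(n)) * int_0^{eps_max}
    sqrt(log N(eps)) d(eps), with the heuristic log N(eps) = d * log(3/eps)."""
    h = eps_max / steps
    total = sum(math.sqrt(d * math.log(3 / ((k + 0.5) * h))) * h
                for k in range(steps))
    return total / math.sqrt(n)

for n in (100, 400, 1600):
    print(n, round(dudley_bound(d=5, n=n) / math.sqrt(5 / n), 3))
```

The ratio to √(d/n) is a constant (the value of the entropy integral), so for classes with logarithmic metric entropy the Dudley bound reproduces the parametric 1/√n rate.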

Slide 49

A picture of the Dudley integral:

  R(F) ≤ (C/√n) E_{D_n}[ ∫_0^∞ √(log N(F, ε, ‖·‖_n)) dε ].

Approximate F by finitely many elements, refine the resolution step by step, and merge similar elements along the way. This technique is called chaining.

Slide 50

Summary so far.

  \hat{L}(\hat{f}) ≤ \hat{L}(f*)   (empirical risk minimization)
  ⇒ L(\hat{f}) − L(f*) ≤ [L(\hat{f}) − \hat{L}(\hat{f})]  (the term to bound)  + [\hat{L}(f*) − L(f*)]  (O_p(1/√n) by Hoeffding)

If ℓ is 1-Lipschitz (|ℓ(y, f) − ℓ(y, f′)| ≤ |f − f′|) and ‖f‖_∞ ≤ 1 for all f ∈ F, then

  L(\hat{f}) − \hat{L}(\hat{f}) ≤ sup_{f∈F} (L(f) − \hat{L}(f))
    ≤ R(ℓ(F)) + √(t/n)   (with probability 1 − e^{−t})
    ≤ R(F) + √(t/n)   (contraction inequality, Lipschitz continuity)
    ≤ (1/√n) E_{D_n}[ ∫_0^∞ √(log N(F, ε, ‖·‖_n)) dε ] + √(t/n)   (Dudley integral)

Note: the smaller the covering number, the smaller the risk → Occam's razor.

Slide 51

Example: linear classifiers.

  F = { f(x) = sign(x^⊤β + c) | β ∈ R^d, c ∈ R },
  N(F, ε, ‖·‖_n) ≤ C(d + 2) (c/ε)^{2(d+1)}.

Then, for the 0-1 loss ℓ,

  L(\hat{f}) − \hat{L}(\hat{f}) ≤ O_p( (1/√n) E_{D_n}[ ∫_0^1 √(log N(F, ε, ‖·‖_n)) dε ] )
    ≤ O_p( (1/√n) ∫_0^1 C √(d log(1/ε) + log d) dε )
    ≤ O_p( √(d/n) ).

Slide 52

Example: VC dimension. Let F be a set of indicator functions, F = {1_C | C ∈ C}, where C is some class of sets (e.g., half-spaces).

Shattering: F shatters a given finite set X_n = {x_1, ..., x_n} ⇔ for every labeling Y_n = {y_1, ..., y_n} (y_i ∈ {±1}), some element of F classifies X_n correctly.

VC dimension V_F: the smallest n for which no n-point set can be shattered by F. Then

  N(F, ε, ‖·‖_n) ≤ K V_F (4e)^{V_F} (1/ε)^{2(V_F − 1)}   ⇒   generalization error = O_p(√(V_F / n)).

(Figure borrowed from http://www.tcs.fudan.edu.cn/rudolf/Courses/Algorithms/Alg_ss_07w/Webprojects/Qinbo_diameter/e_net.htm)

Finiteness of the VC dimension is necessary and sufficient for uniform convergence (the condition in the generalized Glivenko-Cantelli theorem).
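The shattering definition can be checked by brute force on a toy class (an illustrative sketch of my own, not from the slides): one-dimensional threshold classifiers realize both labelings of any single point, but can never realize the labeling (1, 0) on an ordered pair, consistent with a VC dimension of 1.

```python
def shatters(points, classifiers):
    """True iff the classifiers realize all 2^{|points|} label patterns."""
    patterns = {tuple(c(x) for x in points) for c in classifiers}
    return len(patterns) == 2 ** len(points)

# 1-D threshold classifiers f_t(x) = 1[x > t] over a grid of thresholds
thresholds = [i / 10 for i in range(-1, 12)]
classifiers = [lambda x, t=t: int(x > t) for t in thresholds]

print(shatters([0.5], classifiers))        # a single point is shattered
print(shatters([0.3, 0.7], classifiers))   # the pattern (1, 0) is unrealizable
```

The same brute-force check works for any small finite class of sets, and is a handy way to convince oneself of claimed VC dimensions on toy examples.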

Slide 53

Example: kernel methods. F = { f ∈ H | ‖f‖_H ≤ 1 }, with kernel function k and reproducing kernel Hilbert space H; assume k(x, x) ≤ 1 for all x ∈ X (e.g., the Gaussian kernel). We evaluate the Rademacher complexity directly, using

  Σ_{i=1}^n ε_i f(x_i) = ⟨ Σ_{i=1}^n ε_i k(x_i, ·), f ⟩_H ≤ ‖Σ_{i=1}^n ε_i k(x_i, ·)‖_H ‖f‖_H ≤ ‖Σ_{i=1}^n ε_i k(x_i, ·)‖_H.

Then

  R(F) = E[ sup_{f∈F} |Σ_{i=1}^n ε_i f(x_i)| / n ]
    ≤ E[ ‖Σ_{i=1}^n ε_i k(x_i, ·)‖_H / n ]
    = E[ √(Σ_{i,j=1}^n ε_i ε_j k(x_i, x_j)) / n ]
    ≤ √( E[Σ_{i,j=1}^n ε_i ε_j k(x_i, x_j)] ) / n   (Jensen)
    = √(Σ_{i=1}^n k(x_i, x_i)) / n ≤ 1/√n.
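This chain of inequalities is easy to check numerically. The sketch below is my own illustration with a Gaussian kernel (function names are invented): the Monte Carlo average of ‖Σ ε_i k(x_i, ·)‖_H / n, computed through the kernel matrix, stays below √(Σ_i k(x_i, x_i))/n ≤ 1/√n.

```python
import math, random

def kernel_rademacher(xs, bandwidth=1.0, trials=300, seed=0):
    """Monte Carlo of E[ ||sum_i eps_i k(x_i, .)||_H ] / n for the Gaussian
    kernel k(x, x') = exp(-(x - x')^2 / (2 bandwidth^2)); the RKHS norm is
    ||sum_i eps_i k(x_i, .)||_H^2 = sum_{i,j} eps_i eps_j k(x_i, x_j)."""
    rng = random.Random(seed)
    n = len(xs)
    K = [[math.exp(-((xs[i] - xs[j]) ** 2) / (2 * bandwidth ** 2))
          for j in range(n)] for i in range(n)]
    total = 0.0
    for _ in range(trials):
        eps = [rng.choice((-1, 1)) for _ in range(n)]
        quad = sum(eps[i] * eps[j] * K[i][j]
                   for i in range(n) for j in range(n))
        total += math.sqrt(max(quad, 0.0)) / n
    return total / trials

rng = random.Random(1)
n = 60
xs = [rng.gauss(0, 1) for _ in range(n)]
est = kernel_rademacher(xs)
print(est, 1 / math.sqrt(n))  # the estimate respects the 1/sqrt(n) bound
```

The gap between the estimate and 1/√n is the Jensen slack in the last step of the derivation.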

Slide 54

Example: the operator norm of a random matrix. Let A = (a_ij) be a p × q matrix whose entries a_ij are independent mean-zero random variables with |a_ij| ≤ 1. The operator norm of A is

  ‖A‖ := max_{z∈R^q, ‖z‖≤1} ‖Az‖ = max_{w∈R^p, z∈R^q, ‖w‖≤1, ‖z‖≤1} w^⊤Az.

Set F = { f_{w,z}(a_ij, (i,j)) = a_ij w_i z_j | w ∈ R^p, z ∈ R^q }, so that ‖A‖ = sup_{f∈F} Σ_{i,j} f(a_ij, (i,j)); regard the pq entries as n = pq samples. Then

  ‖f_{w,z} − f_{w′,z′}‖_n² = (1/pq) Σ_{i,j=1}^{p,q} |a_ij(w_i z_j − w′_i z′_j)|² ≤ (2/pq)(‖w − w′‖² + ‖z − z′‖²)

  ∴ N(F, ε, ‖·‖_n) ≤ C(√(pq) ε)^{−(p+q)} for ε ≤ 2/√(pq), and = 1 otherwise.

  E[ (1/pq) sup_{w,z} w^⊤Az ] ≤ (C/√(pq)) ∫_0^{1/√(pq)} √((p + q) log(C/(√(pq) ε))) dε ≤ √(p + q)/(pq)

Hence the operator norm of A is O_p(√(p + q)) → low-rank matrix estimation, robust PCA, ... See Tao (2012) and Davidson and Szarek (2001) for details.

Slide 55

Example: the convergence rate of the Lasso. Design matrix X = (X_ij) ∈ R^{n×p}, with p (dimension) ≫ n (sample size). True vector β* ∈ R^p with only d nonzero entries (sparse). Model: Y = Xβ* + ξ.

  \hat{β} ← argmin_{β∈R^p} (1/n)‖Xβ − Y‖₂² + λ_n‖β‖₁.

Theorem (convergence rate of the Lasso; Bickel et al., 2009; Zhang, 2009). If the design matrix satisfies the restricted eigenvalue condition (Bickel et al., 2009) and max_{i,j} |X_ij| ≤ 1, and the noise satisfies E[e^{τξ_i}] ≤ e^{σ²τ²/2} for all τ > 0, then with probability 1 − δ,

  ‖\hat{β} − β*‖₂² ≤ C d log(p/δ) / n.

Note: even in high dimensions, the dimension p enters only through log p; the effective dimension d dominates the rate.
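A minimal Lasso solver can be sketched with proximal gradient descent (ISTA). This is my own illustrative implementation, not the analysis on the slide; the step size and regularization level are hand-tuned for this toy problem. On a sparse instance it recovers the d = 3 active coordinates.

```python
import random

def soft(x, a):
    """Soft-thresholding operator, the proximal map of a * ||.||_1."""
    return x - a if x > a else x + a if x < -a else 0.0

def lasso_ista(X, y, lam, step, iters=800):
    """Proximal gradient (ISTA) for (1/n) ||X b - y||_2^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(iters):
        r = [sum(X[i][j] * b[j] for j in range(p)) - y[i] for i in range(n)]
        g = [(2.0 / n) * sum(X[i][j] * r[i] for i in range(n)) for j in range(p)]
        b = [soft(b[j] - step * g[j], step * lam) for j in range(p)]
    return b

rng = random.Random(0)
n, p, d = 60, 20, 3
beta_star = [1.0] * d + [0.0] * (p - d)          # d-sparse truth
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [sum(X[i][j] * beta_star[j] for j in range(p)) + rng.gauss(0, 0.1)
     for i in range(n)]
b_hat = lasso_ista(X, y, lam=0.1, step=0.05)
print([round(v, 2) for v in b_hat[:5]])
```

The ℓ1 penalty sets most of the p − d inactive coordinates exactly to zero, which is the mechanism behind the d log(p)/n rate in the theorem.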

Slide 56

Where did the log p come from? It came from the uniform bound over finitely many elements. Starting from

  (1/n)‖X\hat{β} − Y‖₂² + λ_n‖\hat{β}‖₁ ≤ (1/n)‖Xβ* − Y‖₂² + λ_n‖β*‖₁
  ⇒ (1/n)‖X(\hat{β} − β*)‖₂² + λ_n‖\hat{β}‖₁ ≤ (2/n)‖X^⊤ξ‖_∞ ‖\hat{β} − β*‖₁ + λ_n‖β*‖₁,

the key term is

  (1/n)‖X^⊤ξ‖_∞ = max_{1≤j≤p} | (1/n) Σ_{i=1}^n X_ij ξ_i |.

Slide 57

Continuing: by the Hoeffding-type uniform bound, with probability 1 − δ,

  max_{1≤j≤p} | (1/n) Σ_{i=1}^n X_ij ξ_i | ≤ σ √(2 log(2p/δ) / n).

Slide 58

Talagrand's concentration inequality: a highly versatile inequality.

Theorem (Talagrand, 1996b; Massart, 2000; Bousquet, 2002). Let σ² := sup_{f∈F} E[f(X)²], P_n f := (1/n) Σ_{i=1}^n f(x_i), and Pf := E[f(X)]. Then

  P[ sup_{f∈F} (P_n f − Pf) ≥ C( E[sup_{f∈F} (P_n f − Pf)] + σ√(t/n) + t/n ) ] ≤ e^{−t}.

Useful for proving fast learning rates.

Slide 59

Other topics.

- The Johnson-Lindenstrauss lemma (Johnson and Lindenstrauss, 1984; Dasgupta and Gupta, 1999): project n points {x_1, ..., x_n} ⊂ R^d into a k-dimensional space. If k ≥ c_δ log(n), then a random projection A ∈ R^{k×d} (a random matrix) satisfies

    (1 − δ)‖x_i − x_j‖ ≤ ‖Ax_i − Ax_j‖ ≤ (1 + δ)‖x_i − x_j‖

  with high probability → restricted isometry (Baraniuk et al., 2008; Candès, 2008).
- Gaussian concentration inequalities, concentration inequalities on product spaces (Ledoux, 2001):

    sup_{f∈F} (1/n) Σ_{i=1}^n ξ_i f(x_i)   (ξ_i Gaussian, etc.)

- Majorizing measures: upper and lower bounds for Gaussian processes (Talagrand, 2000).
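The Johnson-Lindenstrauss phenomenon is easy to observe directly (an illustrative sketch, not from the slides; here A has i.i.d. N(0, 1/k) entries, one standard choice of random projection): projecting 20 points from dimension 300 down to k = 200 keeps all pairwise distances within a modest distortion.

```python
import math, random

def jl_distortion(points, k, seed=0):
    """Largest relative change in pairwise distance under a random
    projection A in R^{k x d} with i.i.d. N(0, 1/k) entries."""
    rng = random.Random(seed)
    d = len(points[0])
    A = [[rng.gauss(0, 1 / math.sqrt(k)) for _ in range(d)] for _ in range(k)]
    projected = [[sum(row[j] * x[j] for j in range(d)) for row in A]
                 for x in points]
    worst = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            ratio = (math.dist(projected[i], projected[j])
                     / math.dist(points[i], points[j]))
            worst = max(worst, abs(ratio - 1))
    return worst

rng = random.Random(1)
points = [[rng.gauss(0, 1) for _ in range(300)] for _ in range(20)]
w = jl_distortion(points, k=200)
print(round(w, 3))
```

Since only n(n − 1)/2 pairwise distances must be preserved, a union bound over the pairs is what produces the log(n) requirement on k.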

Slide 60

Outline (repeated as a section divider, highlighting "local Rademacher complexity" under Section 3).

Slide 61

Can we prove rates faster than the O_p(1/√n) order?

Slide 62

Actively exploit the strong convexity of the loss function. (Figure: the risk curve L(f) over f, with the minimizer f*, the estimator \hat{f}, and a uniform band.) Using the strong convexity of the loss restricts the region where \hat{f} can lie → a tighter bound.

Slide 63

The same figure: applying the same argument repeatedly shows that the risk of \hat{f} is small. Exploiting that \hat{f} is close to f* leads to the "local" Rademacher complexity.

Slide 64

Local Rademacher complexity:

  R_δ(F) := R({ f ∈ F | E[(f − f*)²] ≤ δ }).

Assume the following conditions:
- F is bounded above by 1: ‖f‖_∞ ≤ 1 for all f ∈ F.
- ℓ is Lipschitz continuous and strongly convex: E[ℓ(Y, f(X))] − E[ℓ(Y, f*(X))] ≥ B E[(f − f*)²] for all f ∈ F.

Theorem (fast learning rate; Bartlett et al., 2005). Let δ* = inf{ δ | δ ≥ R_δ(F) }. Then with probability 1 − e^{−t},

  L(\hat{f}) − L(f*) ≤ C( δ* + t/n ).

δ* ≤ R(F) always holds (see the figure on the right). Such a bound is called a fast learning rate.

Slide 65

Slide 65 text

Fast learning rate ͷྫ log N(F, ϵ, ∥ · ∥n ) ≤ Cϵ−2ρ ͷͱ͖ɼ Rδ (F) ≤ C ( δ 1−ρ 2 √ n ∨ n− 1 1+ρ ) , ͕ࣔ͞Εɼδ∗ ͷఆ͔ٛΒ֬཰ 1 − e−t Ͱ͕࣍੒Γཱͭ: L(ˆ f ) − L(f ∗) ≤ C ( n− 1 1+ρ + t n ) . ˞ 1/ √ n ΑΓλΠτʂ ࢀߟจݙ ہॴ Rademacher ෳࡶ͞ͷҰൠ࿦: Bartlett et al. (2005), Koltchinskii (2006) ൑ผ໰୊, Tsybakov ͷ৚݅: Tsybakov (2004), Bartlett et al. (2006) Χʔωϧ๏ʹ͓͚Δ fast learning rate: Steinwart and Christmann (2008) Peeling device: van de Geer (2000) 48 / 60
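The fixed point δ* in this example can be computed by simple iteration (my own sketch): with R_δ = δ^{(1−ρ)/2}/√n, the equation δ = R_δ solves to δ* = n^{−1/(1+ρ)}, matching the rate in the bound.

```python
def local_complexity_fixed_point(n, rho, iters=200):
    """Iterate delta <- R_delta with R_delta = delta^{(1-rho)/2} / sqrt(n).
    The map is a contraction in log(delta) (slope (1-rho)/2 < 1), so it
    converges to the solution of delta = R_delta, i.e. n^{-1/(1+rho)}."""
    delta = 1.0
    for _ in range(iters):
        delta = delta ** ((1 - rho) / 2) / n ** 0.5
    return delta

n, rho = 10000, 0.5
print(local_complexity_fixed_point(n, rho), n ** (-1 / (1 + rho)))
```

Taking ρ → 0 (a very simple class) recovers δ* = 1/n, the fastest rate, while ρ → 1 degrades the rate back toward 1/√n.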

Slide 66

Outline (repeated as a section divider, highlighting Section 4: optimality).

Slide 67

Optimality. What does it mean for a learning method to be "optimal"? Every learning method has strengths and weaknesses depending on the data distribution ("it works well in this case but not in that one"). The main optimality criteria:
- Admissibility: no other method improves on its performance everywhere.
- Minimax optimality: its risk in the least favorable situation is the smallest.

Slide 68

Outline (repeated as a section divider, highlighting "admissibility" under Section 4).

Slide 69

Admissibility. Model of distributions: {P_θ | θ ∈ Θ}. The expected risk of an estimator \check{f} under P_θ:

  \bar{L}_θ(\check{f}) := E_{D_n ∼ P_θ}[ E_{(X,Y) ∼ P_θ}[ ℓ(Y, \check{f}(X)) ] ].

Definition (admissibility). \hat{f} is admissible ⇔ there exists no estimator \check{f} such that \bar{L}_θ(\check{f}) ≤ \bar{L}_θ(\hat{f}) for all θ ∈ Θ and \bar{L}_{θ′}(\check{f}) < \bar{L}_{θ′}(\hat{f}) for some θ′ ∈ Θ. (Figure: the risk curves \bar{L}_θ(\check{f}) and \bar{L}_θ(\hat{f}) plotted over θ.)

Slide 70

Examples. For simplicity, consider the problem of estimating P_θ (θ ∈ Θ) from a sample D_n = {x_1, ..., x_n} ∼ P_θ^n.

- Betting on a single point: always output some fixed θ_0. The fit at θ_0 itself is the best possible, but the fit at every other θ is poor.
- Bayes estimators: with prior π(θ) and risk L(θ_0, \hat{P}),

    \hat{P} = argmin_{\hat{P}: estimator} ∫ E_{D_n ∼ P_{θ_0}}[ L(θ_0, \hat{P}) ] π(θ_0) dθ_0.

  Squared risk L(θ, \hat{θ}) = ‖θ − \hat{θ}‖²: \hat{θ} = ∫ θ π(θ|D_n) dθ (the posterior mean).
  KL risk L(θ, \hat{P}) = KL(P_θ‖\hat{P}): \hat{P} = ∫ P(·|θ) π(θ|D_n) dθ (the Bayes predictive distribution).

By the definition of the Bayes estimator, no estimator improves the risk L(θ, \hat{P}) uniformly over θ.
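The posterior-mean claim can be illustrated numerically. The following is my own sketch for a conjugate normal model, not from the slides: averaging the squared error over θ drawn from the prior, the posterior mean (a shrinkage estimator) has lower Bayes risk than the sample mean.

```python
import random, statistics

def bayes_vs_mle(n=10, prior_sd=1.0, noise_sd=1.0, trials=20000, seed=0):
    """Bayes (prior-averaged) squared-error risk of the posterior mean vs.
    the sample mean, for x_i ~ N(theta, noise_sd^2), theta ~ N(0, prior_sd^2).
    In this conjugate model the posterior mean is shrink * xbar."""
    rng = random.Random(seed)
    precision = n / noise_sd ** 2
    shrink = precision / (precision + 1 / prior_sd ** 2)
    bayes_loss = mle_loss = 0.0
    for _ in range(trials):
        theta = rng.gauss(0, prior_sd)
        xbar = statistics.fmean(rng.gauss(theta, noise_sd) for _ in range(n))
        bayes_loss += (shrink * xbar - theta) ** 2
        mle_loss += (xbar - theta) ** 2
    return bayes_loss / trials, mle_loss / trials

bayes_risk, mle_risk = bayes_vs_mle()
print(bayes_risk, mle_risk)
```

Of course the sample mean wins pointwise for θ far out in the tails of the prior; the Bayes estimator only dominates on average over the prior, which is precisely why it cannot be uniformly improved.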

Slide 71

Outline (repeated as a section divider, highlighting "minimax optimality" under Section 4).

Slide 72

Definition (minimax optimality). \hat{f} is minimax optimal ⇔

  max_{θ∈Θ} \bar{L}_θ(\hat{f}) = min_{\check{f}: estimator} max_{θ∈Θ} \bar{L}_θ(\check{f}).

In learning theory one often allows a constant factor: there exists C such that

  max_{θ∈Θ} \bar{L}_θ(\hat{f}) ≤ C min_{\check{f}: estimator} max_{θ∈Θ} \bar{L}_θ(\check{f})   (for all n).

In that sense one says the method "achieves the minimax rate". (Figure: the risk curve \bar{L}_θ(\hat{f}) over θ.)

Slide 73

How to derive minimax rates: described in detail in Introduction to Nonparametric Estimation (Tsybakov, 2008). Represent F by finitely many elements {f_1, ..., f_{M_n}} ⊆ F, separated by ε_n, and consider the problem of picking the best one among them (easier than the original problem → yields a lower bound on the risk). There is a trade-off between the number of elements M_n and the separation ε_n: a smaller M_n makes picking the optimal element easier, but the error ε_n becomes larger. cf. Fano's inequality, Assouad's lemma.

Slide 74

The minimax rate for sparse estimation.

Theorem (Raskutti and Wainwright, 2011). Under certain conditions, with probability at least 1/2,

  min_{\hat{β}: estimator} max_{β*: d-sparse} ‖\hat{β} − β*‖₂² ≥ C d log(p/d) / n.

The Lasso achieves the minimax rate (up to the d log(d)/n term). Results extending this to multiple kernel learning: Raskutti et al. (2012), Suzuki and Sugiyama (2012).

Slide 75

Outline (repeated as a section divider, highlighting Section 5: Bayesian learning theory).

Slide 76

Bayesian learning theory: statistical properties of nonparametric Bayes.
- Textbook: Ghosh and Ramamoorthi (2003), Bayesian Nonparametrics. Springer, 2003.
- Convergence rates. General theory: Ghosal et al. (2000). Dirichlet mixtures: Ghosal and van der Vaart (2007). Gaussian processes: van der Vaart and van Zanten (2008a,b, 2011).

Slide 77

The same list, plus PAC-Bayes:

  L(\hat{f}_π) ≤ inf_ρ { ∫ L(f) ρ(df) + 2[ λC²/n + (KL(ρ‖π) + log(2/ε)) / λ ] }   (Catoni, 2007)

- Original papers: McAllester (1998, 1999)
- Oracle inequalities: Catoni (2004, 2007)
- Applications to sparse estimation: Dalalyan and Tsybakov (2008), Alquier and Lounici (2011), Suzuki (2012)

Slide 78

Summary.
- The uniform bound is the key quantity:

    sup_{f∈F} { (1/n) Σ_{i=1}^n ℓ(y_i, f(x_i)) − E[ℓ(Y, f(X))] }

- Tools: Rademacher complexity; covering numbers.
- The simpler the hypothesis class, the faster the convergence.
- Optimality criteria: admissibility; minimax optimality.
(Figure: the risk curves with a uniform band, as on Slide 34.)

Slide 79

H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716-723, 1974.
P. Alquier and K. Lounici. PAC-Bayesian bounds for sparse regression estimation with exponential weights. Electronic Journal of Statistics, 5:127-145, 2011.
R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin. A simple proof of the restricted isometry property for random matrices. Constructive Approximation, 28(3):253-263, 2008.
P. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. The Annals of Statistics, 33:1487-1537, 2005.
P. Bartlett, M. Jordan, and D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138-156, 2006.
P. L. Bartlett and A. Tewari. Sparseness vs estimating conditional probabilities: Some asymptotic results. Journal of Machine Learning Research, 8:775-790, 2007.
C. Bennett and R. Sharpley. Interpolation of Operators. Academic Press, Boston, 1988.
P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705-1732, 2009.
O. Bousquet. A Bennett concentration inequality and its application to suprema of empirical processes. C. R. Acad. Sci. Paris, Ser. I Math., 334:495-500, 2002.

Slide 80

E. Candès. The restricted isometry property and its implications for compressed sensing. Comptes Rendus de l'Académie des Sciences, Paris, Série I, 346:589-592, 2008.
F. P. Cantelli. Sulla determinazione empirica delle leggi di probabilità. G. Inst. Ital. Attuari, 4:221-424, 1933.
O. Catoni. Statistical Learning Theory and Stochastic Optimization. Lecture Notes in Mathematics. Springer, 2004. Saint-Flour Summer School on Probability Theory 2001.
O. Catoni. PAC-Bayesian Supervised Classification (The Thermodynamics of Statistical Learning). Lecture Notes in Mathematics. IMS, 2007.
C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.
A. Dalalyan and A. B. Tsybakov. Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Machine Learning, 72:39-61, 2008.
S. Dasgupta and A. Gupta. An elementary proof of the Johnson-Lindenstrauss lemma. Technical Report 99-006, U.C. Berkeley, 1999.
K. R. Davidson and S. J. Szarek. Local operator theory, random matrices and Banach spaces, volume 1, chapter 8, pages 317-366. North Holland, 2001.
M. Donsker. Justification and extension of Doob's heuristic approach to the Kolmogorov-Smirnov theorems. Annals of Mathematical Statistics, 23:277-281, 1952.

Slide 81

R. M. Dudley. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1:290-330, 1967.
M. Eberts and I. Steinwart. Optimal learning rates for least squares SVMs using Gaussian kernels. In Advances in Neural Information Processing Systems 25, 2012.
T. S. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209-230, 1973.
Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT '95, pages 23-37, 1995.
S. Ghosal and A. W. van der Vaart. Posterior convergence rates of Dirichlet mixtures at smooth densities. The Annals of Statistics, 35(2):697-723, 2007.
S. Ghosal, J. K. Ghosh, and A. W. van der Vaart. Convergence rates of posterior distributions. The Annals of Statistics, 28(2):500-531, 2000.
J. Ghosh and R. Ramamoorthi. Bayesian Nonparametrics. Springer, 2003.
V. I. Glivenko. Sulla determinazione empirica di probabilità. G. Inst. Ital. Attuari, 4:92-99, 1933.
W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in Modern Analysis and Probability, volume 26, pages 186-206, 1984.
A. Kolmogorov. Sulla determinazione empirica di una legge di distribuzione. G. Inst. Ital. Attuari, 4:83-91, 1933.

Slide 82

V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34:2593-2656, 2006.
M. Ledoux. The Concentration of Measure Phenomenon. American Mathematical Society, 2001.
P. Massart. About the constants in Talagrand's concentration inequalities for empirical processes. The Annals of Probability, 28(2):863-884, 2000.
D. McAllester. Some PAC-Bayesian theorems. In the Annual Conference on Computational Learning Theory, pages 230-234, 1998.
D. McAllester. PAC-Bayesian model averaging. In the Annual Conference on Computational Learning Theory, pages 164-170, 1999.
S. Mukherjee, R. Rifkin, and T. Poggio. Regression and classification with regularization. In D. D. Denison, M. H. Hansen, C. C. Holmes, B. Mallick, and B. Yu, editors, Lecture Notes in Statistics: Nonlinear Estimation and Classification, pages 107-124. Springer-Verlag, New York, 2002.
G. Raskutti and M. J. Wainwright. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Transactions on Information Theory, 57(10):6976-6994, 2011.
G. Raskutti, M. Wainwright, and B. Yu. Minimax-optimal rates for sparse additive models over kernel classes via convex programming. Journal of Machine Learning Research, 13:389-427, 2012.

Slide 83

I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.
I. Steinwart, D. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In Proceedings of the Annual Conference on Learning Theory, pages 79-93, 2009.
T. Suzuki. PAC-Bayesian bound for Gaussian process regression and multiple kernel additive model. In JMLR Workshop and Conference Proceedings, volume 23, pages 8.1-8.20, 2012. Conference on Learning Theory (COLT2012).
T. Suzuki and M. Sugiyama. Fast learning rate of multiple kernel learning: Trade-off between sparsity and smoothness. In JMLR Workshop and Conference Proceedings 22, pages 1152-1183, 2012. Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS2012).
M. Talagrand. New concentration inequalities in product spaces. Inventiones Mathematicae, 126:505-563, 1996a.
M. Talagrand. New concentration inequalities in product spaces. Inventiones Mathematicae, 126:505-563, 1996b.
M. Talagrand. The Generic Chaining. Springer, 2000.
T. Tao. Topics in Random Matrix Theory. American Mathematical Society, 2012.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.
A. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 35:135-166, 2004.

Slide 84

A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, 2008.
S. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.
A. W. van der Vaart and J. H. van Zanten. Rates of contraction of posterior distributions based on Gaussian process priors. The Annals of Statistics, 36(3):1435-1463, 2008a.
A. W. van der Vaart and J. H. van Zanten. Reproducing kernel Hilbert spaces of Gaussian priors. Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh, 3:200-222, 2008b. IMS Collections.
A. W. van der Vaart and J. H. van Zanten. Information rates of nonparametric Gaussian process methods. Journal of Machine Learning Research, 12:2095-2119, 2011.
V. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Soviet Math. Dokl., 9:915-918, 1968.
V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
T. Zhang. Some sharp performance bounds for least squares regression with L1 regularization. The Annals of Statistics, 37(5):2109-2144, 2009.