Taiji Suzuki
June 18, 2023

# Statistical Learning Theory Tutorial: From Fundamentals to Applications (IBIS2012)


## Transcript

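1. ### Statistical Learning Theory Tutorial: From Fundamentals to Applications

Taiji Suzuki, Department of Mathematical Informatics, Graduate School of Information Science and Technology, The University of Tokyo. IBIS 2012 @ University of Tsukuba, Tokyo Campus (Bunkyo School Building), November 7, 2012.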
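2. ### Outline

1. Introduction: the role of theory
2. Statistical learning theory and empirical processes
3. Uniform bounds
   - Basic inequalities
   - Rademacher complexity and the Dudley integral
   - Local Rademacher complexity
4. Optimality
   - Admissibility
   - Minimax optimality
5. Learning theory for Bayesian methods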
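8. ### The relationship between basic research and applications

[Figure: applications and foundations (theory) over the years; the two levels communicate through a common language, mathematics.]

Communication between different levels sometimes leads to new discoveries.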
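10. ### Success stories from history

- SVM (Vapnik, 1998, Cortes and Vapnik, 1995): VC dimension
- AdaBoost (Freund and Schapire, 1995): learnability by weak learners
- Dirichlet process (Ferguson, 1973): probability theory, measure theory
- Lasso (Tibshirani, 1996)
- AIC (Akaike, 1974)
- Compressed sensing (Candès, Tao and Donoho, 2004)
- and many more

What matters is understanding the essence, which leads to new methods (that is what this session is for!).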
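11. ### More direct benefits

Knowing learning theory is also useful in more direct ways:

1. The meaning of a method: what is the method actually doing? This leads to correct usage.
2. The validity of a method: does it produce a proper solution ("does it really converge?")
3. The optimality of a method: is it optimal with respect to some criterion? If so, it can be used with confidence.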
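13. ### The meaning of a method: example, choosing a loss function

Binary classification: should we use the hinge loss or the logistic loss?

$$\min_f \frac{1}{n}\sum_{i=1}^n \phi(-y_i f(x_i)) \qquad (y_i \in \{\pm 1\})$$

1. Both minimize the classification error. When $\phi$ is convex, "$\phi$ is classification consistent $\Leftrightarrow$ $\phi$ is differentiable at the origin and $\phi'(0) > 0$" (Bartlett et al., 2006). Classification consistency means that the minimizer of the expected risk, $\arg\min_f E[\phi(-Yf(X))]$, is Bayes optimal.
2. Number of support vectors vs. the ability to estimate the conditional probability $p(Y|X)$.
   - Hinge: sparse solution (the conditional probability cannot be estimated at all).
   - Logistic: the conditional probability can be recovered (on the other hand, every sample is a support vector).

"There is a trade-off between the number of support vectors and the ability to estimate the conditional probability. The two cannot be fully achieved at the same time." (Bartlett and Tewari, 2007)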
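
The two surrogate losses can be compared numerically. The following is a minimal sketch, assuming the slide's convention that the loss is evaluated as $\phi(-y f(x))$, so that the hinge loss is $\phi(u) = \max(0, 1+u)$ and the logistic loss is $\phi(u) = \log(1+e^u)$. The function names are illustrative and do not come from the talk.

```python
import math


def hinge(u):
    """Hinge loss phi(u) = max(0, 1 + u), to be evaluated at u = -y * f(x)."""
    return max(0.0, 1.0 + u)


def logistic(u):
    """Logistic loss phi(u) = log(1 + exp(u)), to be evaluated at u = -y * f(x)."""
    return math.log1p(math.exp(u))


def slope_at_zero(phi, h=1e-6):
    """Central finite difference approximating phi'(0)."""
    return (phi(h) - phi(-h)) / (2.0 * h)


# Both losses have a positive slope at the origin (the condition quoted above).
hinge_slope = slope_at_zero(hinge)
logistic_slope = slope_at_zero(logistic)

# A point classified with margin y * f(x) = 2 corresponds to u = -2.
hinge_at_margin_2 = hinge(-2.0)        # exactly zero: the point does not affect the fit
logistic_at_margin_2 = logistic(-2.0)  # strictly positive: every point contributes
```

The last two values illustrate the trade-off: the hinge loss vanishes on well-classified points (sparsity), while the logistic loss never does.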
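14. ### Validity and optimality of a method

2. Validity of a method. Example: consistency, $\hat f \xrightarrow{p} f^*$ (the estimator converges in probability to the true function).
3. Optimality of a method. Given that it converges, is the rate optimal?
   - Admissibility
   - Minimaxity

Today's talk focuses on these topics.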


20. ### Learning theory (as covered in this talk) ≈ empirical process theory

The key is to evaluate

$$\sup_{f\in\mathcal F}\left\{\frac{1}{n}\sum_{i=1}^n f(x_i) - E[f]\right\}.$$
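21. ### History: empirical process theory

- 1933, Glivenko, Cantelli: Glivenko-Cantelli theorem (uniform law of large numbers)
- 1933, Kolmogorov: Kolmogorov-Smirnov test (convergence rate, asymptotic distribution)
- 1952, Donsker: Donsker's theorem (uniform central limit theorem)
- 1967, Dudley: Dudley integral
- 1968, Vapnik, Chervonenkis: VC dimension (necessary and sufficient condition for uniform convergence)
- 1996a, Talagrand: Talagrand's inequality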
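23. ### Problem setting

Supervised learning.

- Training data: $D_n = \{(x_1, y_1), \dots, (x_n, y_n)\} \in (\mathcal X \times \mathcal Y)^n$, an i.i.d. sequence of inputs and outputs.
- Loss function: $\ell(\cdot,\cdot): \mathcal Y \times \mathbb R \to \mathbb R_+$, the penalty for mistakes.
- Hypothesis set (model): $\mathcal F$, a set of functions $\mathcal X \to \mathbb R$.
- $\hat f$: the estimator, an element of $\mathcal F$ constructed from the sample $(x_i, y_i)_{i=1}^n$.

Quantity to be bounded (generalization error):

$$\underbrace{E_{(X,Y)}[\ell(Y, \hat f(X))]}_{\text{test data}} - \inf_{f:\ \text{measurable}} E_{(X,Y)}[\ell(Y, f(X))]$$

Does the generalization error converge? How fast?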
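25. ### Bias-variance decomposition

Empirical risk: $\hat L(f) = \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i))$. Expected risk: $L(f) = E_{(X,Y)}[\ell(Y, f(X))]$.

$$\text{generalization error} = L(\hat f) - \inf_{f:\ \text{measurable}} L(f) = \underbrace{L(\hat f) - \inf_{f\in\mathcal F} L(f)}_{\text{estimation error}} + \underbrace{\inf_{f\in\mathcal F} L(f) - \inf_{f:\ \text{measurable}} L(f)}_{\text{model error}}$$

For simplicity, assume that there exists $f^* \in \mathcal F$ with $\inf_{f\in\mathcal F} L(f) = L(f^*)$.

Note: the model error is not covered in this talk. Modeling is nevertheless a very important problem.

- Sieve methods, cross validation, information criteria, model averaging, ...
- Treatment of the model error in kernel methods: theory of interpolation spaces (Steinwart et al., 2009, Eberts and Steinwart, 2012, Bennett and Sharpley, 1988).

From here on, the model error is assumed to be sufficiently small.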
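26. ### Empirical risk minimization

Empirical risk minimization (ERM):

$$\hat f = \arg\min_{f\in\mathcal F} \hat L(f)$$

Regularized empirical risk minimization (RERM):

$$\hat f = \arg\min_{f\in\mathcal F} \hat L(f) + \underbrace{\psi(f)}_{\text{regularization term}}$$

There is also a very large body of work on RERM (Steinwart and Christmann, 2008, Mukherjee et al., 2002). It is a natural extension of ERM.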
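31. ### Starting point

Most bounds are derived starting from the following inequality:

$$\hat L(\hat f) \le \hat L(f^*) \quad (\because \text{empirical risk minimization})$$

$$\Rightarrow\quad \underbrace{L(\hat f) - L(f^*)}_{\text{generalization error}} \le \underbrace{L(\hat f) - \hat L(\hat f)}_{?} + \underbrace{\hat L(f^*) - L(f^*)}_{O_p(1/\sqrt n)\ \text{(shown later)}}$$

The second line follows by adding and subtracting $\hat L(\hat f)$ and $\hat L(f^*)$ and dropping the term $\hat L(\hat f) - \hat L(f^*) \le 0$.

A naive analysis:

$$L(\hat f) - \hat L(\hat f)\ \begin{cases} \to 0 & (\because \text{law of large numbers!!}) \\ = O_p(1/\sqrt n) & (\because \text{central limit theorem!!}) \end{cases}$$

Easy win!

This is wrong: $\hat f$ and the training data are not independent.

Reminder: $\hat L(f) = \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i))$, $L(f) = E_{(X,Y)}[\ell(Y, f(X))]$.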
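
The failure of the naive analysis can be seen in a small simulation. This is a sketch under assumptions chosen only for illustration: labels are fair coin flips independent of the inputs (so every classifier has expected 0-1 risk exactly 1/2), and the hypothesis class is a hypothetical finite set of fixed sign patterns. ERM selects the pattern that happens to fit the sample best, so its empirical risk is far below its true risk.

```python
import random


def zero_one_risk(predictions, labels):
    """Fraction of positions where the prediction disagrees with the label."""
    return sum(p != y for p, y in zip(predictions, labels)) / len(labels)


def erm_on_pure_noise(n=50, num_hypotheses=2000, seed=0):
    rng = random.Random(seed)
    # Labels are fair coin flips independent of the inputs, so L(f) = 1/2 for every f.
    labels = [rng.choice((-1, 1)) for _ in range(n)]
    # Hypothetical finite class: each hypothesis is a fixed sign pattern on the n inputs.
    hypotheses = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(num_hypotheses)]
    empirical_risks = [zero_one_risk(h, labels) for h in hypotheses]
    # ERM picks the hypothesis with the smallest empirical risk.
    return min(empirical_risks), empirical_risks[0]


erm_risk, fixed_risk = erm_on_pure_noise()
true_risk = 0.5
gap_of_erm = true_risk - erm_risk  # L(f_hat) - L_hat(f_hat), which does not vanish
```

For a single hypothesis fixed in advance (`fixed_risk`) the law of large numbers applies; for the data-dependent choice `erm_risk` it does not, which is why a bound uniform over the class is needed.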

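33. ### What is the problem?

[Figure: the expected risk $L(f)$ and the empirical risk $\hat L(f)$ plotted over $f$, with $f^*$ and $\hat f$ marked.]

There may be a function that fits well "by chance" (overfitting). Indeed, there are examples in which convergence fails when $\mathcal F$ is complex.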
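34. ### What is the problem? (uniform bound)

[Figure: the same plot of $L(f)$ and $\hat L(f)$, with a uniform band around the risk.]

A uniform bound guarantees that "fitting well by chance" (almost) never happens. This is not trivial (empirical process theory).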
35. ### Uniform bound

$$L(\hat f) - \hat L(\hat f) \le \sup_{f\in\mathcal F}\left\{ L(f) - \hat L(f) \right\} \le\ (?)$$

It is important to bound the risk uniformly.

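38. ### Useful inequalities

Hoeffding's inequality. Let $Z_i$ $(i = 1, \dots, n)$ be independent (not necessarily identically distributed) random variables with mean 0 such that $|Z_i| \le m_i$. Then

$$P\left( \frac{|\sum_{i=1}^n Z_i|}{\sqrt n} > t \right) \le 2\exp\left( -\frac{t^2}{2\sum_{i=1}^n m_i^2/n} \right).$$

Bernstein's inequality. Let $Z_i$ $(i = 1, \dots, n)$ be independent (not necessarily identically distributed) random variables with mean 0 such that $E[Z_i^2] = \sigma_i^2$ and $|Z_i| \le M$. Then

$$P\left( \frac{|\sum_{i=1}^n Z_i|}{\sqrt n} > t \right) \le 2\exp\left( -\frac{t^2}{2\left(\frac{1}{n}\sum_{i=1}^n \sigma_i^2 + \frac{1}{\sqrt n} M t\right)} \right).$$

Bernstein's inequality exploits information about the variance.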
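
Hoeffding's inequality can be checked by simulation. The sketch below assumes the simplest case, $Z_i$ uniform on $\{-1, +1\}$ (so $m_i = 1$), and compares a Monte Carlo tail frequency with the right-hand side of the bound; the helper names are illustrative.

```python
import math
import random


def hoeffding_bound(t, bounds):
    """Right-hand side 2 * exp(-t^2 / (2 * sum(m_i^2) / n)) for |Z_i| <= m_i."""
    n = len(bounds)
    return 2.0 * math.exp(-t ** 2 / (2.0 * sum(m * m for m in bounds) / n))


def tail_frequency(t, n, trials, seed=0):
    """Monte Carlo estimate of P(|sum Z_i| / sqrt(n) > t) for Z_i uniform on {-1, +1}."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        total = sum(rng.choice((-1.0, 1.0)) for _ in range(n))
        if abs(total) / math.sqrt(n) > t:
            hits += 1
    return hits / trials


n, t = 100, 2.0
bound = hoeffding_bound(t, [1.0] * n)   # equals 2 * exp(-2)
frequency = tail_frequency(t, n, trials=2000)
```

The observed frequency is typically well below the bound, which is expected: Hoeffding's inequality uses only the range of the variables, not their variance.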
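39. ### Useful inequalities: extended versions

Hoeffding's inequality (sub-Gaussian tail). Let $Z_i$ $(i = 1, \dots, n)$ be independent (not necessarily identically distributed) random variables with mean 0 such that $E[e^{\tau Z_i}] \le e^{\sigma_i^2\tau^2/2}$ $(\forall \tau > 0)$. Then

$$P\left( \frac{|\sum_{i=1}^n Z_i|}{\sqrt n} > t \right) \le 2\exp\left( -\frac{t^2}{2\sum_{i=1}^n \sigma_i^2/n} \right).$$

Bernstein's inequality. Let $Z_i$ $(i = 1, \dots, n)$ be independent (not necessarily identically distributed) random variables with mean 0 such that $E[Z_i^2] = \sigma_i^2$ and $E|Z_i|^k \le \frac{k!}{2}\sigma^2 M^{k-2}$ $(\forall k \ge 2)$. Then

$$P\left( \frac{|\sum_{i=1}^n Z_i|}{\sqrt n} > t \right) \le 2\exp\left( -\frac{t^2}{2\left(\frac{1}{n}\sum_{i=1}^n \sigma_i^2 + \frac{1}{\sqrt n} M t\right)} \right).$$

(Hilbert space versions also exist.)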
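40. ### Uniform bound for a finite class 1: Hoeffding version

Even this alone is useful to know. (Think of $f \leftarrow \ell(y, g(x)) - E\ell(Y, g(X))$.)

$\mathcal F = \{f_m\ (m = 1, \dots, M)\}$: a finite set of functions, each with mean zero ($E[f_m(X)] = 0$).

Hoeffding's inequality (substituting $Z_i = f_m(X_i)$):

$$P\left( \frac{|\sum_{i=1}^n f_m(X_i)|}{\sqrt n} > t \right) \le 2\exp\left( -\frac{t^2}{2\|f_m\|_\infty^2} \right).$$

Uniform bound:

- $P\left( \max_{1\le m\le M} \frac{|\sum_{i=1}^n f_m(X_i)|}{\sqrt n} > \max_m \|f_m\|_\infty \sqrt{2\log(2M/\delta)} \right) \le \delta$
- $E\left[ \max_{1\le m\le M} \frac{|\sum_{i=1}^n f_m(X_i)|}{\sqrt n} \right] \le C \max_m \|f_m\|_\infty \sqrt{\log(1+M)}$

(Derivation)

$$P\left( \max_{1\le m\le M} \frac{|\sum_{i=1}^n f_m(X_i)|}{\sqrt n} > t \right) = P\left( \bigcup_{1\le m\le M} \left\{ \frac{|\sum_{i=1}^n f_m(X_i)|}{\sqrt n} > t \right\} \right) \le 2\sum_{m=1}^M \exp\left( -\frac{t^2}{2\|f_m\|_\infty^2} \right)$$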
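
To see where the threshold in the first uniform bound comes from, one can bound every term of the union bound by the largest sup norm and solve for $t$. This is a short worked step under the slide's assumptions; the shorthand $B := \max_m \|f_m\|_\infty$ is introduced here only for readability.

```latex
\begin{align*}
P\Bigl(\max_{1\le m\le M}\frac{|\sum_{i=1}^n f_m(X_i)|}{\sqrt n} > t\Bigr)
  &\le 2\sum_{m=1}^M \exp\Bigl(-\frac{t^2}{2\|f_m\|_\infty^2}\Bigr)
  \le 2M\exp\Bigl(-\frac{t^2}{2B^2}\Bigr), \\
2M\exp\Bigl(-\frac{t^2}{2B^2}\Bigr) = \delta
  &\iff t = B\sqrt{2\log(2M/\delta)}.
\end{align*}
```

The number of functions $M$ therefore enters the threshold only through $\sqrt{\log M}$.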
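41. ### Uniform bound for a finite class 2: Bernstein version

$\mathcal F = \{f_m\ (m = 1, \dots, M)\}$: a finite set of functions, each with mean zero ($E[f_m(X)] = 0$).

Bernstein's inequality:

$$P\left( \frac{|\sum_{i=1}^n f_m(X_i)|}{\sqrt n} > t \right) \le 2\exp\left( -\frac{t^2}{2\left(\|f_m\|_{L_2}^2 + \frac{1}{\sqrt n}\|f_m\|_\infty t\right)} \right).$$

Uniform bound:

$$E\left[ \max_{1\le m\le M} \frac{|\sum_{i=1}^n f_m(X_i)|}{\sqrt n} \right] \lesssim \frac{1}{\sqrt n}\max_m \|f_m\|_\infty \log(1+M) + \max_m \|f_m\|_{L_2}\sqrt{\log(1+M)}$$

Note: to leading order in $n$, the uniform bound grows at most on the order of $\sqrt{\log M}$.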
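43. ### From finite to infinite

What if the hypothesis set has infinitely many elements? What if it has the cardinality of the continuum?

- $\mathcal F = \{x^\top\beta \mid \beta \in \mathbb R^d,\ \|\beta\| \le 1\}$
- $\mathcal F = \{f \in \mathcal H \mid \|f\|_{\mathcal H} \le 1\}$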

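45. ### Rademacher complexity

$\epsilon_1, \epsilon_2, \dots, \epsilon_n$: Rademacher variables, i.e., $P(\epsilon_i = 1) = P(\epsilon_i = -1) = \frac{1}{2}$.

Rademacher complexity:

$$R(\mathcal F) := E_{\{\epsilon_i\},\{x_i\}}\left[ \sup_{f\in\mathcal F} \frac{1}{n}\sum_{i=1}^n \epsilon_i f(x_i) \right]$$

Symmetrization (in expectation):

$$E\left[ \sup_{f\in\mathcal F} \frac{1}{n}\sum_{i=1}^n (f(x_i) - E[f]) \right] \le 2R(\mathcal F).$$

If $\|f\|_\infty \le 1$ $(\forall f \in \mathcal F)$, then (tail probability)

$$P\left( \sup_{f\in\mathcal F} \frac{1}{n}\sum_{i=1}^n (f(x_i) - E[f]) \ge 2R(\mathcal F) + \sqrt{\frac{t}{2n}} \right) \le e^{-t},$$

that is, the supremum stays below $2R(\mathcal F) + \sqrt{t/(2n)}$ with probability at least $1 - e^{-t}$.

Bounding the Rademacher complexity yields a uniform bound!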
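
For a concrete class the Rademacher complexity can be estimated by Monte Carlo. The sketch below assumes the linear class $\{x \mapsto x^\top\beta : \|\beta\|_2 \le 1\}$, for which the supremum has the closed form $\|\frac{1}{n}\sum_i \epsilon_i x_i\|_2$ by the Cauchy-Schwarz inequality. It conditions on one fixed sample (the empirical Rademacher complexity) rather than also averaging over $x_i$, and the data are synthetic.

```python
import math
import random


def empirical_rademacher_linear(xs, num_draws=2000, seed=0):
    """Monte Carlo estimate of E_eps[ sup_{||beta|| <= 1} (1/n) sum_i eps_i <x_i, beta> ].

    For the unit-ball linear class the supremum equals || (1/n) sum_i eps_i x_i ||_2.
    """
    rng = random.Random(seed)
    n, d = len(xs), len(xs[0])
    total = 0.0
    for _ in range(num_draws):
        signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        mean_vec = [sum(s * x[j] for s, x in zip(signs, xs)) / n for j in range(d)]
        total += math.sqrt(sum(v * v for v in mean_vec))
    return total / num_draws


def jensen_upper_bound(xs):
    """Upper bound sqrt(sum_i ||x_i||^2) / n obtained from Jensen's inequality."""
    return math.sqrt(sum(sum(c * c for c in x) for x in xs)) / len(xs)


data_rng = random.Random(1)
sample = [[data_rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(100)]
estimate = empirical_rademacher_linear(sample)
bound = jensen_upper_bound(sample)
```

Both quantities scale like $1/\sqrt n$, even though the class contains infinitely many functions.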
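47. ### Properties of the Rademacher complexity

- Contraction inequality: if $\psi$ is Lipschitz continuous, i.e., $|\psi(f) - \psi(f')| \le B|f - f'|$, then $R(\{\psi(f) \mid f \in \mathcal F\}) \le B R(\mathcal F)$.
- Convex hull: let $\mathrm{conv}(\mathcal F)$ be the set of all convex combinations of elements of $\mathcal F$. Then $R(\mathrm{conv}(\mathcal F)) = R(\mathcal F)$.

The first property is particularly helpful. If $|\ell(y, f) - \ell(y, f')| \le |f - f'|$, then

$$E\left[ \sup_{f\in\mathcal F} |\hat L(f) - L(f)| \right] \le 2R(\ell(\mathcal F)) \le 2R(\mathcal F),$$

where $\ell(\mathcal F) = \{\ell(\cdot, f(\cdot)) \mid f \in \mathcal F\}$. Hence it suffices to bound the Rademacher complexity of $\mathcal F$!

Lipschitz continuity holds for the hinge loss, the logistic loss, and others. If $y$ and $\mathcal F$ are bounded, it also holds for the squared loss and similar losses.

Reminder: $\hat L(f) = \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i))$, $L(f) = E_{(X,Y)}[\ell(Y, f(X))]$.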
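48. ### Covering number

A way to bound the Rademacher complexity.

Covering number: the complexity, or capacity, of the hypothesis set $\mathcal F$.

$\epsilon$-covering number $N(\mathcal F, \epsilon, d)$: the minimum number of balls of radius $\epsilon$, measured in the norm $d$, needed to cover $\mathcal F$. Equivalently, it is the minimum number of elements needed to approximate $\mathcal F$ by a finite set.

[Figure: the set $\mathcal F$ covered by balls of radius $\epsilon$.]

Theorem (Dudley integral). Let $\|f\|_n^2 := \frac{1}{n}\sum_{i=1}^n f(x_i)^2$. Then

$$R(\mathcal F) \le \frac{C}{\sqrt n} E_{D_n}\left[ \int_0^\infty \sqrt{\log N(\mathcal F, \epsilon, \|\cdot\|_n)}\, d\epsilon \right].$$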
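
The definition of the covering number can be made concrete with a greedy construction. The sketch below is an illustration, not the method used in the talk: it takes a hypothetical finite class $f_a(x) = a x$ on a toy sample, measures distances in the empirical norm $\|\cdot\|_n$, and greedily selects centers. The number of selected centers is an upper bound on the $\epsilon$-covering number of that finite set.

```python
import math


def empirical_l2_distance(f_values, g_values):
    """||f - g||_n for two functions given by their values on the sample x_1, ..., x_n."""
    n = len(f_values)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f_values, g_values)) / n)


def greedy_cover(function_values, eps):
    """Greedily pick centers so that every function is within eps of some center.

    The number of centers is an upper bound on the eps-covering number of the finite set.
    """
    centers = []
    for values in function_values:
        if all(empirical_l2_distance(values, c) > eps for c in centers):
            centers.append(values)
    return centers


# Hypothetical class f_a(x) = a * x with a on a grid, evaluated on a toy sample.
sample = [-1.0, 1.0, -1.0, 1.0]
slopes = [i / 10 for i in range(11)]
function_values = [[a * x for x in sample] for a in slopes]
centers = greedy_cover(function_values, eps=0.25)
```

On this sample $\|f_a - f_b\|_n = |a - b|$, so the greedy pass keeps the slopes 0, 0.3, 0.6 and 0.9, that is, four centers.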
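49. ### Intuition behind the Dudley integral

$$R(\mathcal F) \le \frac{C}{\sqrt n} E_{D_n}\left[ \int_0^\infty \sqrt{\log N(\mathcal F, \epsilon, \|\cdot\|_n)}\, d\epsilon \right].$$

Approximate $\mathcal F$ by finitely many elements. The picture is to refine the resolution step by step and group similar elements together. This technique is called chaining.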
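50. ### Summary so far

$$\hat L(\hat f) \le \hat L(f^*) \quad (\because \text{empirical risk minimization})$$

$$\Rightarrow\quad L(\hat f) - L(f^*) \le \underbrace{L(\hat f) - \hat L(\hat f)}_{\text{we want to bound this}} + \underbrace{\hat L(f^*) - L(f^*)}_{O_p(1/\sqrt n)\ \text{(Hoeffding)}}$$

When $\ell$ is 1-Lipschitz ($|\ell(y, f) - \ell(y, f')| \le |f - f'|$) and $\|f\|_\infty \le 1$ $(\forall f \in \mathcal F)$, with constants omitted,

$$\begin{aligned}
L(\hat f) - \hat L(\hat f) &\le \sup_{f\in\mathcal F}(L(f) - \hat L(f)) \\
&\lesssim R(\ell(\mathcal F)) + \sqrt{\frac{t}{n}} && (\text{with prob. } 1 - e^{-t}) \\
&\lesssim R(\mathcal F) + \sqrt{\frac{t}{n}} && (\text{contraction ineq., Lipschitz continuity}) \\
&\lesssim \frac{1}{\sqrt n} E_{D_n}\left[ \int_0^\infty \sqrt{\log N(\mathcal F, \epsilon, \|\cdot\|_n)}\, d\epsilon \right] + \sqrt{\frac{t}{n}} && (\text{Dudley integral}).
\end{aligned}$$

Note: the smaller the covering number, the smaller the risk → Occam's Razor.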
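51. ### Example: linear classifiers

$$\mathcal F = \{f(x) = \mathrm{sign}(x^\top\beta + c) \mid \beta \in \mathbb R^d,\ c \in \mathbb R\}$$

$$N(\mathcal F, \epsilon, \|\cdot\|_n) \le C(d+2)\left(\frac{c}{\epsilon}\right)^{2(d+1)}$$

Then, for the 0-1 loss $\ell$,

$$\begin{aligned}
L(\hat f) - \hat L(\hat f) &\le O_p\left( \frac{1}{\sqrt n} E_{D_n}\left[ \int_0^1 \sqrt{\log N(\mathcal F, \epsilon, \|\cdot\|_n)}\, d\epsilon \right] \right) \\
&\le O_p\left( \frac{1}{\sqrt n} \int_0^1 C\sqrt{d\log(1/\epsilon) + \log(d)}\, d\epsilon \right) \\
&\le O_p\left( \sqrt{\frac{d}{n}} \right).
\end{aligned}$$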
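
The last step uses the fact that the entropy integral is finite and of order $\sqrt d$. A short worked computation, using only $\sqrt{a+b} \le \sqrt a + \sqrt b$ and the substitution $u = \log(1/\epsilon)$:

```latex
\begin{align*}
\int_0^1 \sqrt{d\log(1/\epsilon) + \log d}\; d\epsilon
  &\le \sqrt d \int_0^1 \sqrt{\log(1/\epsilon)}\; d\epsilon + \sqrt{\log d} \\
  &= \sqrt d \int_0^\infty \sqrt u\, e^{-u}\; du + \sqrt{\log d}
   = \sqrt d\,\Gamma(3/2) + \sqrt{\log d} \\
  &= \frac{\sqrt{\pi d}}{2} + \sqrt{\log d} = O(\sqrt d).
\end{align*}
```

Dividing by $\sqrt n$ gives the stated rate $\sqrt{d/n}$.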
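52. ### Example: VC dimension

$\mathcal F$ is a set of indicator functions: $\mathcal F = \{1_C \mid C \in \mathcal C\}$, where $\mathcal C$ is a family of sets (example: the set of half-spaces).

Shattering: $\mathcal F$ shatters a given finite set $X_n = \{x_1, \dots, x_n\}$ $\Leftrightarrow$ for any labels $Y_n = \{y_1, \dots, y_n\}$ $(y_i \in \{\pm 1\})$, $\mathcal F$ can classify $X_n$ correctly.

VC dimension $V_{\mathcal F}$: the smallest $n$ for which no set of size $n$ can be shattered by $\mathcal F$.

$$N(\mathcal F, \epsilon, \|\cdot\|_n) \le K V_{\mathcal F} (4e)^{V_{\mathcal F}} \left(\frac{1}{\epsilon}\right)^{2(V_{\mathcal F}-1)} \quad\Rightarrow\quad \text{generalization error} = O_p\left(\sqrt{V_{\mathcal F}/n}\right)$$

[Figure borrowed from http://www.tcs.fudan.edu.cn/rudolf/Courses/Algorithms/Alg_ss_07w/Webprojects/Qinbo_diameter/e_net.htm]

Finiteness of the VC dimension is a necessary and sufficient condition for uniform convergence (the necessary and sufficient condition for the generalized Glivenko-Cantelli theorem).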
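
Shattering can be checked by brute force for small point sets. The sketch below assumes a simple family not taken from the slides: indicators of closed intervals $[a, b]$ on the real line, with labels written as 0/1 instead of $\pm 1$. It enumerates every labeling the class can realize on a point set and compares the count with $2^n$.

```python
def interval_labelings(points):
    """All 0/1 labelings of `points` realizable by indicators of closed intervals [a, b].

    It suffices to try endpoints taken from the points themselves, plus the empty interval.
    """
    pts = sorted(points)
    labelings = {tuple(0 for _ in points)}  # the empty interval labels everything 0
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            a, b = pts[i], pts[j]
            labelings.add(tuple(1 if a <= x <= b else 0 for x in points))
    return labelings


def is_shattered(points):
    """True if the interval class realizes all 2^n labelings of the given points."""
    return len(interval_labelings(points)) == 2 ** len(points)


two_points_shattered = is_shattered([0.0, 1.0])
three_points_shattered = is_shattered([0.0, 1.0, 2.0])
```

Two points are shattered, while three points are not, because the labeling (1, 0, 1) would require an interval with a hole. With the slide's convention this gives $V_{\mathcal F} = 3$ for intervals.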
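53. ### Example: kernel methods

$\mathcal F = \{f \in \mathcal H \mid \|f\|_{\mathcal H} \le 1\}$, with kernel function $k$ and reproducing kernel Hilbert space $\mathcal H$. Assume $k(x, x) \le 1$ $(\forall x \in \mathcal X)$, e.g., the Gaussian kernel.

We evaluate the Rademacher complexity directly, using

$$\sum_{i=1}^n \epsilon_i f(x_i) = \left\langle \sum_{i=1}^n \epsilon_i k(x_i, \cdot), f \right\rangle_{\mathcal H} \le \left\| \sum_{i=1}^n \epsilon_i k(x_i, \cdot) \right\|_{\mathcal H} \|f\|_{\mathcal H} \le \left\| \sum_{i=1}^n \epsilon_i k(x_i, \cdot) \right\|_{\mathcal H}.$$

$$\begin{aligned}
R(\mathcal F) &= E\left[ \sup_{f\in\mathcal F} \frac{|\sum_{i=1}^n \epsilon_i f(x_i)|}{n} \right] \le E\left[ \frac{\|\sum_{i=1}^n \epsilon_i k(x_i, \cdot)\|_{\mathcal H}}{n} \right] = E\left[ \frac{\sqrt{\sum_{i,j=1}^n \epsilon_i\epsilon_j k(x_i, x_j)}}{n} \right] \\
&\le \frac{\sqrt{E\left[\sum_{i,j=1}^n \epsilon_i\epsilon_j k(x_i, x_j)\right]}}{n} \quad (\text{Jensen}) \\
&= \frac{\sqrt{\sum_{i=1}^n k(x_i, x_i)}}{n} \le \frac{1}{\sqrt n}
\end{aligned}$$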
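54. ### Example: operator norm of a random matrix

$A = (a_{ij})$: a $p \times q$ matrix whose entries $a_{ij}$ are independent random variables with mean 0 and $|a_{ij}| \le 1$.

Operator norm of $A$:

$$\|A\| := \max_{\|z\|\le 1,\ z\in\mathbb R^q} \|Az\| = \max_{\|w\|\le 1,\ \|z\|\le 1,\ w\in\mathbb R^p,\ z\in\mathbb R^q} w^\top A z.$$

$$\mathcal F = \{f_{w,z}(a_{ij}, (i,j)) = a_{ij} w_i z_j \mid w \in \mathbb R^p,\ z \in \mathbb R^q,\ \|w\| \le 1,\ \|z\| \le 1\} \quad\Rightarrow\quad \|A\| = \sup_{f\in\mathcal F} \sum_{i,j} f(a_{ij}, (i,j))$$

Regard this as having $n = pq$ samples.

$$\|f_{w,z} - f_{w',z'}\|_n^2 = \frac{1}{pq}\sum_{i,j=1}^{p,q} |a_{ij}(w_i z_j - w'_i z'_j)|^2 \le \frac{2}{pq}\left(\|w - w'\|^2 + \|z - z'\|^2\right)$$

$$\therefore\quad N(\mathcal F, \epsilon, \|\cdot\|_n) \begin{cases} \le C(\sqrt{pq}\,\epsilon)^{-(p+q)}, & (\epsilon \le 2/\sqrt{pq}), \\ = 1, & (\text{otherwise}). \end{cases}$$

$$E\left[ \frac{1}{pq}\sup_{w,z} w^\top A z \right] \le \frac{C}{\sqrt{pq}} \int_0^{\frac{1}{\sqrt{pq}}} \sqrt{(p+q)\log(C/\sqrt{pq}\,\epsilon)}\, d\epsilon \le \frac{\sqrt{p+q}}{pq}$$

Hence the operator norm of $A$ is $O_p(\sqrt{p+q})$. → Low-rank matrix estimation, Robust PCA, ...

For details see Tao (2012) and Davidson and Szarek (2001).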
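55. ### Example: convergence rate of the Lasso

Design matrix $X = (X_{ij}) \in \mathbb R^{n\times p}$, with $p$ (dimension) $\gg n$ (sample size).

True vector $\beta^* \in \mathbb R^p$: at most $d$ nonzero entries (sparse).

Model: $Y = X\beta^* + \xi$.

$$\hat\beta \leftarrow \arg\min_{\beta\in\mathbb R^p} \frac{1}{n}\|X\beta - Y\|_2^2 + \lambda_n\|\beta\|_1.$$

Theorem (convergence rate of the Lasso (Bickel et al., 2009, Zhang, 2009)). If the design matrix satisfies the restricted eigenvalue condition (Bickel et al., 2009) and $\max_{i,j}|X_{ij}| \le 1$, and the noise satisfies $E[e^{\tau\xi_i}] \le e^{\sigma^2\tau^2/2}$ $(\forall\tau > 0)$, then with probability $1 - \delta$,

$$\|\hat\beta - \beta^*\|_2^2 \le C\frac{d\log(p/\delta)}{n}.$$

Note: even when the dimension is high, it enters only through $\log(p)$. The effective dimension $d$ is dominant.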
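57. ### Where does the log(p) come from?

It comes from the uniform bound over a finite class.

$$\frac{1}{n}\|X\hat\beta - Y\|_2^2 + \lambda_n\|\hat\beta\|_1 \le \frac{1}{n}\|X\beta^* - Y\|_2^2 + \lambda_n\|\beta^*\|_1$$

$$\Rightarrow\quad \frac{1}{n}\|X(\hat\beta - \beta^*)\|_2^2 + \lambda_n\|\hat\beta\|_1 \le \underbrace{\frac{2}{n}\|X^\top\xi\|_\infty}_{\text{this term}} \|\hat\beta - \beta^*\|_1 + \lambda_n\|\beta^*\|_1$$

The second line follows by substituting $Y = X\beta^* + \xi$, expanding the squares, and applying Hölder's inequality to $\frac{2}{n}\xi^\top X(\hat\beta - \beta^*)$.

$$\frac{1}{n}\|X^\top\xi\|_\infty = \max_{1\le j\le p} \left| \frac{1}{n}\sum_{i=1}^n X_{ij}\xi_i \right|$$

By the uniform bound derived from Hoeffding's inequality, with probability $1 - \delta$,

$$\max_{1\le j\le p} \left| \frac{1}{n}\sum_{i=1}^n X_{ij}\xi_i \right| \le \sigma\sqrt{\frac{2\log(2p/\delta)}{n}}.$$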
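
The final bound on $\frac{1}{n}\|X^\top\xi\|_\infty$ can be checked by simulation. The sketch below assumes a design with entries in $\{-1, +1\}$ and Gaussian noise with standard deviation $\sigma$ (a special case of the sub-Gaussian condition); it estimates how often the maximum exceeds the threshold $\sigma\sqrt{2\log(2p/\delta)/n}$. All parameter values are illustrative.

```python
import math
import random


def sup_norm_noise_term(design, noise):
    """max_j | (1/n) sum_i X_ij * xi_i |, i.e. (1/n) * ||X^T xi||_inf."""
    n, p = len(design), len(design[0])
    return max(abs(sum(design[i][j] * noise[i] for i in range(n)) / n) for j in range(p))


def failure_rate(n=40, p=50, sigma=1.0, delta=0.1, trials=300, seed=0):
    rng = random.Random(seed)
    # Assumed design: entries in {-1, +1}, so that max |X_ij| <= 1 holds.
    design = [[rng.choice((-1.0, 1.0)) for _ in range(p)] for _ in range(n)]
    threshold = sigma * math.sqrt(2.0 * math.log(2.0 * p / delta) / n)
    failures = 0
    for _ in range(trials):
        noise = [rng.gauss(0.0, sigma) for _ in range(n)]  # Gaussian noise is sub-Gaussian
        if sup_norm_noise_term(design, noise) > threshold:
            failures += 1
    return failures / trials, threshold


rate, threshold = failure_rate()
```

The observed failure rate should not exceed $\delta$, and the threshold grows only like $\sqrt{\log p}$ in the number of columns.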
58. ### Talagrand's concentration inequality

A highly versatile inequality.

Theorem (Talagrand (1996b), Massart (2000), Bousquet (2002)). Let $\sigma^2 := \sup_{f\in\mathcal F} E[f(X)^2]$, $P_n f := \frac{1}{n}\sum_{i=1}^n f(x_i)$, and $Pf := E[f(X)]$. Then

$$P\left[ \sup_{f\in\mathcal F}(P_n f - Pf) \ge C\left( E\left[\sup_{f\in\mathcal F}(P_n f - Pf)\right] + \sqrt{\frac{t}{n}}\,\sigma + \frac{t}{n} \right) \right] \le e^{-t}$$

It is useful for proving fast learning rates.
59. ### Other topics

- Johnson-Lindenstrauss lemma (Johnson and Lindenstrauss, 1984, Dasgupta and Gupta, 1999). Project $n$ points $\{x_1, \dots, x_n\} \subset \mathbb R^d$ to a $k$-dimensional space. If $k \ge c_\delta\log(n)$, a random projection $A \in \mathbb R^{k\times d}$ (a random matrix) to $k$ dimensions satisfies $(1-\delta)\|x_i - x_j\| \le \|Ax_i - Ax_j\| \le (1+\delta)\|x_i - x_j\|$ with high probability. → restricted isometry (Baraniuk et al., 2008, Candès, 2008)
- Gaussian concentration inequality, concentration inequality on product spaces (Ledoux, 2001): $\sup_{f\in\mathcal F} \frac{1}{n}\sum_{i=1}^n \xi_i f(x_i)$ ($\xi_i$: Gaussian, etc.)
- Majorizing measure: upper and lower bounds related to Gaussian processes (Talagrand, 2000).
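
The Johnson-Lindenstrauss statement can be illustrated with a small experiment. The sketch below assumes one common construction, a Gaussian matrix with entries $N(0, 1/k)$; the slide does not fix a particular random matrix, and the sizes used here are arbitrary.

```python
import math
import random


def random_projection(points, k, seed=0):
    """Project points in R^d to R^k with a Gaussian matrix A, entries N(0, 1/k)."""
    rng = random.Random(seed)
    d = len(points[0])
    matrix = [[rng.gauss(0.0, 1.0 / math.sqrt(k)) for _ in range(d)] for _ in range(k)]
    return [[sum(row[j] * x[j] for j in range(d)) for row in matrix] for x in points]


def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distortion_range(points, projected):
    """Smallest and largest ratio ||A x_i - A x_j|| / ||x_i - x_j|| over all pairs."""
    ratios = [
        distance(projected[i], projected[j]) / distance(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    ]
    return min(ratios), max(ratios)


data_rng = random.Random(1)
points = [[data_rng.gauss(0.0, 1.0) for _ in range(200)] for _ in range(20)]
low, high = distortion_range(points, random_projection(points, k=100))
```

All pairwise distance ratios fall in a band around 1; increasing `k` narrows the band, which is the role of the condition $k \ge c_\delta\log(n)$.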

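62. ### Actively exploiting strong convexity of the loss

[Figure: the risk $L(f)$ over $f$ with $f^*$ and $\hat f$ marked, and a uniform bound around it.]

Using strong convexity of the loss restricts the region in which $\hat f$ can lie → a tighter bound.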
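63. ### Actively exploiting strong convexity of the loss (continued)

[Figure: the risk $L(f)$ over $f$ with $\hat f$ marked, and a uniform bound that shrinks around the minimizer.]

Applying the same argument repeatedly shows that the risk of $\hat f$ is small.

Exploit the fact that $\hat f$ is close to $f^*$ → "local" Rademacher complexity.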
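64. ### Local Rademacher complexity

Local Rademacher complexity: $R_\delta(\mathcal F) := R(\{f \in \mathcal F \mid E[(f - f^*)^2] \le \delta\})$.

Assume the following conditions.

- $\mathcal F$ is bounded by 1: $\|f\|_\infty \le 1$ $(\forall f \in \mathcal F)$.
- $\ell$ is Lipschitz continuous and strongly convex: $E[\ell(Y, f(X))] - E[\ell(Y, f^*(X))] \ge B\,E[(f - f^*)^2]$ $(\forall f \in \mathcal F)$.

Theorem (fast learning rate (Bartlett et al., 2005)). Let $\delta^* = \inf\{\delta \mid \delta \ge R_\delta(\mathcal F)\}$. Then with probability $1 - e^{-t}$,

$$L(\hat f) - L(f^*) \le C\left( \delta^* + \frac{t}{n} \right).$$

$\delta^* \le R(\mathcal F)$ always holds (see the figure). This is called a fast learning rate.

[Figure: $R_\delta(\mathcal F)$ plotted against $\delta$; the fixed point $\delta^*$ is where the curve meets the diagonal.]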
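65. ### Example of a fast learning rate

When $\log N(\mathcal F, \epsilon, \|\cdot\|_n) \le C\epsilon^{-2\rho}$, one can show

$$R_\delta(\mathcal F) \le C\left( \frac{\delta^{\frac{1-\rho}{2}}}{\sqrt n} \vee n^{-\frac{1}{1+\rho}} \right),$$

and from the definition of $\delta^*$ the following holds with probability $1 - e^{-t}$:

$$L(\hat f) - L(f^*) \le C\left( n^{-\frac{1}{1+\rho}} + \frac{t}{n} \right).$$

Note: this is tighter than $1/\sqrt n$!

References:

- General theory of local Rademacher complexity: Bartlett et al. (2005), Koltchinskii (2006)
- Classification, Tsybakov's condition: Tsybakov (2004), Bartlett et al. (2006)
- Fast learning rates for kernel methods: Steinwart and Christmann (2008)
- Peeling device: van de Geer (2000)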
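
The exponent $n^{-1/(1+\rho)}$ is the fixed point of the first term in the bound on $R_\delta(\mathcal F)$. A short worked step, ignoring constants:

```latex
\begin{align*}
\delta = \frac{\delta^{(1-\rho)/2}}{\sqrt n}
  \iff \delta^{(1+\rho)/2} = n^{-1/2}
  \iff \delta = n^{-\frac{1}{1+\rho}} .
\end{align*}
```

For $0 < \rho < 1$ the exponent satisfies $\frac{1}{1+\rho} > \frac{1}{2}$, so the resulting rate is faster than $1/\sqrt n$.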
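69. ### Admissibility

Model of distributions: $\{P_\theta \mid \theta \in \Theta\}$.

Expected risk of an estimator $\check f$ under $P_\theta$:

$$\bar L_\theta(\check f) := E_{D_n\sim P_\theta}\left[ E_{(X,Y)\sim P_\theta}[\ell(Y, \check f(X))] \right]$$

Definition (admissibility). $\hat f$ is admissible $\Leftrightarrow$ there exists no estimator $\check f$ such that $\bar L_\theta(\check f) \le \bar L_\theta(\hat f)$ $(\forall\theta \in \Theta)$ and $\bar L_{\theta'}(\check f) < \bar L_{\theta'}(\hat f)$ for some $\theta' \in \Theta$.

[Figure: risk curves $\bar L_\theta(\check f)$ and $\bar L_\theta(\hat f)$ plotted against $\theta$.]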
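70. ### Examples

For simplicity, consider the problem of estimating $P_\theta$ $(\theta \in \Theta)$ from a sample $D_n = \{(x_1, \dots, x_n)\} \sim P_\theta^n$.

- One-point bet: always use a fixed $\theta_0$. The fit is the best possible for that $\theta_0$ but poor for other $\theta$.
- Bayes estimator: with prior $\pi(\theta)$ and risk $L(\theta_0, \hat P)$,

$$\hat P = \arg\min_{\hat P:\ \text{estimator}} \int E_{D_n\sim P_{\theta_0}}[L(\theta_0, \hat P)]\,\pi(\theta_0)\,d\theta_0.$$

  - Squared risk $L(\theta, \hat\theta) = \|\theta - \hat\theta\|^2$: $\hat\theta = \int \theta\,\pi(\theta|D_n)\,d\theta$ (posterior mean)
  - KL risk $L(\theta, \hat P) = \mathrm{KL}(P_\theta\,\|\,\hat P)$: $\hat P = \int P(\cdot|\theta)\,\pi(\theta|D_n)\,d\theta$ (Bayes predictive distribution)

By the definition of the Bayes estimator, no estimator improves the risk $L(\theta, \hat P)$ for every $\theta$.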
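
The contrast between the one-point bet and the Bayes estimator can be computed exactly in a toy model. The sketch below assumes a Bernoulli($\theta$) sample with a Beta($a$, $b$) prior, for which the posterior mean is $(a + s)/(a + b + n)$ with $s$ the number of successes; this specific model is an illustration and is not taken from the slides.

```python
import math


def risk_posterior_mean(theta, n, a=1.0, b=1.0):
    """Exact squared-error risk of the posterior mean (a + s) / (a + b + n) under Bernoulli(theta)."""
    risk = 0.0
    for s in range(n + 1):
        prob = math.comb(n, s) * theta ** s * (1.0 - theta) ** (n - s)
        estimate = (a + s) / (a + b + n)
        risk += prob * (estimate - theta) ** 2
    return risk


def risk_one_point(theta, theta0=0.5):
    """Risk of the estimator that always answers theta0, ignoring the data."""
    return (theta0 - theta) ** 2


n = 10
risks_bayes = {theta: risk_posterior_mean(theta, n) for theta in (0.1, 0.5, 0.9)}
risks_fixed = {theta: risk_one_point(theta) for theta in (0.1, 0.5, 0.9)}
```

At $\theta = 0.5$ the one-point bet has zero risk and beats the posterior mean, while at $\theta = 0.1$ and $\theta = 0.9$ it is much worse. Neither estimator dominates the other over all $\theta$.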
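72. ### Minimax optimality

Definition (minimax optimality). $\hat f$ is minimax optimal $\Leftrightarrow$

$$\max_{\theta\in\Theta} \bar L_\theta(\hat f) = \min_{\check f:\ \text{estimator}} \max_{\theta\in\Theta} \bar L_\theta(\check f).$$

In learning theory a constant factor is often allowed: there exists $C$ such that

$$\max_{\theta\in\Theta} \bar L_\theta(\hat f) \le C \min_{\check f:\ \text{estimator}} \max_{\theta\in\Theta} \bar L_\theta(\check f) \quad (\forall n).$$

In this sense one says that an estimator "achieves the minimax rate".

[Figure: the risk curve $\bar L_\theta(\hat f)$ plotted against $\theta$.]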
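73. ### How to derive the minimax rate

A detailed account is given in Introduction to Nonparametric Estimation (Tsybakov, 2008).

Represent $\mathcal F$ by finitely many elements $\{f_1, \dots, f_{M_n}\} \subseteq \mathcal F$ and consider the problem of selecting the best one among them. (This is easier than the original problem → it gives a lower bound on the risk.)

[Figure: the set $\mathcal F$ with representative elements $f_j$ separated by $\varepsilon_n$.]

Trade-off between the number $M_n$ and the error $\varepsilon_n$: with smaller $M_n$ it is easier to select the optimal element, but the error $\varepsilon_n$ becomes larger.

cf. Fano's inequality, Assouad's lemma.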
74. ### Minimax rate of sparse estimation

Theorem (Raskutti and Wainwright (2011)). Under certain conditions, with probability at least $1/2$,

$$\min_{\hat\beta:\ \text{estimator}}\ \max_{\beta^*:\ d\text{-sparse}} \|\hat\beta - \beta^*\|^2 \ge C\frac{d\log(p/d)}{n}.$$

The Lasso achieves the minimax rate (except for the $\frac{d\log(d)}{n}$ term).

Extensions of this result to Multiple Kernel Learning: Raskutti et al. (2012), Suzuki and Sugiyama (2012).
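77. ### Learning theory for Bayesian methods

Statistical properties of nonparametric Bayes:

- Textbook: Ghosh and Ramamoorthi (2003), Bayesian Nonparametrics. Springer, 2003.
- Convergence rates
  - General theory: Ghosal et al. (2000)
  - Dirichlet mixture: Ghosal and van der Vaart (2007)
  - Gaussian process: van der Vaart and van Zanten (2008a,b, 2011).

PAC-Bayes:

$$L(\hat f_\pi) \le \inf_\rho \left\{ \int L(f)\,\rho(df) + 2\left[ \frac{\lambda C^2}{n} + \frac{\mathrm{KL}(\rho\,\|\,\pi) + \log\frac{2}{\epsilon}}{\lambda} \right] \right\} \quad (\text{Catoni, 2007})$$

- Original papers: McAllester (1998, 1999)
- Oracle inequalities: Catoni (2004, 2007)
- Applications to sparse estimation: Dalalyan and Tsybakov (2008), Alquier and Lounici (2011), Suzuki (2012)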
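78. ### Summary

Uniform bounds are important:

$$\sup_{f\in\mathcal F}\left\{ \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i)) - E[\ell(Y, f(X))] \right\}$$

- Rademacher complexity
- Covering number

The simpler the hypothesis set, the faster the convergence.

Optimality criteria:

- Admissibility
- Minimax optimality

[Figure: the expected risk $L(f)$ and the empirical risk $\hat L(f)$ over $f$, with $f^*$, $\hat f$ and the uniform bound.]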
79. ### References

- H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974.
- P. Alquier and K. Lounici. PAC-Bayesian bounds for sparse regression estimation with exponential weights. Electronic Journal of Statistics, 5:127–145, 2011.
- R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin. A simple proof of the restricted isometry property for random matrices. Constructive Approximation, 28(3):253–263, 2008.
- P. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. The Annals of Statistics, 33:1487–1537, 2005.
- P. Bartlett, M. Jordan, and D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138–156, 2006.
- P. L. Bartlett and A. Tewari. Sparseness vs estimating conditional probabilities: Some asymptotic results. Journal of Machine Learning Research, 8:775–790, 2007.
- C. Bennett and R. Sharpley. Interpolation of Operators. Academic Press, Boston, 1988.
- P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.
- O. Bousquet. A Bennett concentration inequality and its application to suprema of empirical processes. C. R. Acad. Sci. Paris Ser. I Math., 334:495–500, 2002.
80. ### References (continued)

- E. Candès. The restricted isometry property and its implications for compressed sensing. Comptes Rendus de l'Academie des Sciences, Paris, Serie I, 346:589–592, 2008.
- F. P. Cantelli. Sulla determinazione empirica delle leggi di probabilità. G. Inst. Ital. Attuari, 4:421–424, 1933.
- O. Catoni. Statistical Learning Theory and Stochastic Optimization. Lecture Notes in Mathematics. Springer, 2004. Saint-Flour Summer School on Probability Theory 2001.
- O. Catoni. PAC-Bayesian Supervised Classification (The Thermodynamics of Statistical Learning). Lecture Notes in Mathematics. IMS, 2007.
- C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
- A. Dalalyan and A. B. Tsybakov. Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Machine Learning, 72:39–61, 2008.
- S. Dasgupta and A. Gupta. An elementary proof of the Johnson-Lindenstrauss lemma. Technical Report 99–006, U.C. Berkeley, 1999.
- K. R. Davidson and S. J. Szarek. Local operator theory, random matrices and Banach spaces, volume 1, chapter 8, pages 317–366. North Holland, 2001.
- M. Donsker. Justification and extension of Doob's heuristic approach to the Kolmogorov-Smirnov theorems. Annals of Mathematical Statistics, 23:277–281, 1952.
81. ### References (continued)

- R. M. Dudley. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. J. Functional Analysis, 1:290–330, 1967.
- M. Eberts and I. Steinwart. Optimal learning rates for least squares SVMs using Gaussian kernels. In Advances in Neural Information Processing Systems 25, 2012.
- T. S. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209–230, 1973.
- Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT '95, pages 23–37, 1995.
- S. Ghosal and A. W. van der Vaart. Posterior convergence rates of Dirichlet mixtures at smooth densities. The Annals of Statistics, 35(2):697–723, 2007.
- S. Ghosal, J. K. Ghosh, and A. W. van der Vaart. Convergence rates of posterior distributions. The Annals of Statistics, 28(2):500–531, 2000.
- J. Ghosh and R. Ramamoorthi. Bayesian Nonparametrics. Springer, 2003.
- V. I. Glivenko. Sulla determinazione empirica di probabilità. G. Inst. Ital. Attuari, 4:92–99, 1933.
- W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in Modern Analysis and Probability, volume 26, pages 186–206, 1984.
- A. Kolmogorov. Sulla determinazione empirica di una legge di distribuzione. G. Inst. Ital. Attuari, 4:83–91, 1933.
82. ### References (continued)

- V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34:2593–2656, 2006.
- M. Ledoux. The concentration of measure phenomenon. American Mathematical Society, 2001.
- P. Massart. About the constants in Talagrand's concentration inequalities for empirical processes. The Annals of Probability, 28(2):863–884, 2000.
- D. McAllester. Some PAC-Bayesian theorems. In the Annual Conference on Computational Learning Theory, pages 230–234, 1998.
- D. McAllester. PAC-Bayesian model averaging. In the Annual Conference on Computational Learning Theory, pages 164–170, 1999.
- S. Mukherjee, R. Rifkin, and T. Poggio. Regression and classification with regularization. In D. D. Denison, M. H. Hansen, C. C. Holmes, B. Mallick, and B. Yu, editors, Lecture Notes in Statistics: Nonlinear Estimation and Classification, pages 107–124. Springer-Verlag, New York, 2002.
- G. Raskutti and M. J. Wainwright. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Transactions on Information Theory, 57(10):6976–6994, 2011.
- G. Raskutti, M. Wainwright, and B. Yu. Minimax-optimal rates for sparse additive models over kernel classes via convex programming. Journal of Machine Learning Research, 13:389–427, 2012.
83. ### References (continued)

- I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.
- I. Steinwart, D. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In Proceedings of the Annual Conference on Learning Theory, pages 79–93, 2009.
- T. Suzuki. PAC-Bayesian bound for Gaussian process regression and multiple kernel additive model. In JMLR Workshop and Conference Proceedings, volume 23, pages 8.1–8.20, 2012. Conference on Learning Theory (COLT2012).
- T. Suzuki and M. Sugiyama. Fast learning rate of multiple kernel learning: Trade-off between sparsity and smoothness. In JMLR Workshop and Conference Proceedings 22, pages 1152–1183, 2012. Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS2012).
- M. Talagrand. New concentration inequalities in product spaces. Inventiones Mathematicae, 126:505–563, 1996a.
- M. Talagrand. New concentration inequalities in product spaces. Inventiones Mathematicae, 126:505–563, 1996b.
- M. Talagrand. The generic chaining. Springer, 2000.
- T. Tao. Topics in random matrix theory. American Mathematical Society, 2012.
- R. Tibshirani. Regression shrinkage and selection via the lasso. J. Royal. Statist. Soc B., 58(1):267–288, 1996.
- A. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32:135–166, 2004.
84. ### References (continued)

- A. B. Tsybakov. Introduction to nonparametric estimation. Springer Series in Statistics. Springer, 2008.
- S. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.
- A. W. van der Vaart and J. H. van Zanten. Rates of contraction of posterior distributions based on Gaussian process priors. The Annals of Statistics, 36(3):1435–1463, 2008a.
- A. W. van der Vaart and J. H. van Zanten. Reproducing kernel Hilbert spaces of Gaussian priors. Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh, 3:200–222, 2008b. IMS Collections.
- A. W. van der Vaart and J. H. van Zanten. Information rates of nonparametric Gaussian process methods. Journal of Machine Learning Research, 12:2095–2119, 2011.
- V. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Soviet Math. Dokl., 9:915–918, 1968.
- V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
- T. Zhang. Some sharp performance bounds for least squares regression with l1 regularization. The Annals of Statistics, 37(5):2109–2144, 2009.