# Statistical Learning Theory Tutorial: From Fundamentals to Applications (IBIS2012)

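Statistical learning theory and empirical processes

Uniform bounds
Basic inequalities
Rademacher complexity and the Dudley integral
Local Rademacher complexity

Optimality
Admissibility
Minimax optimality

Bayesian learning theory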

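8. ### The relationship between basic research and applications

[Figure: layers labelled applications, foundations, and theory, arranged along a time axis (era). The layers are linked by communication; the language of theory is mathematics.]

Communication between different levels sometimes leads to new discoveries.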
9. ### Success stories from history

- SVM (Vapnik, 1998, Cortes and Vapnik, 1995): VC dimension
- AdaBoost (Freund and Schapire, 1995): learnability with weak learners
- Dirichlet process (Ferguson, 1973): probability theory, measure theory
- Lasso (Tibshirani, 1996)
- AIC (Akaike, 1974)
- Compressed sensing (Candès, Tao and Donoho, 2004)
- and many more

What matters is understanding the essence, which leads to new methods (that is what this session is for!).
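11. ### More direct benefits

Knowing learning theory is also useful in more direct ways:

1. The meaning of a method: what is the method actually doing? This leads to correct usage.
2. The validity of a method: does it yield a proper solution ("does it really converge?")
3. The optimality of a method: is it optimal with respect to some criterion? If so, it can be used with confidence.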
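12. ### The meaning of a method: an example, choosing the loss function

Binary classification: hinge loss or logistic loss, which should be used?

$$\min_f \frac{1}{n}\sum_{i=1}^n \phi(-y_i f(x_i)) \qquad (y_i \in \{\pm 1\})$$

1. Both minimize the classification error. When $\phi$ is convex, "$\phi$ is classification-calibrated $\Leftrightarrow$ $\phi$ is differentiable at the origin and $\phi'(0) > 0$" (Bartlett et al., 2006). Classification calibration: the minimizer of the expected risk, $\arg\min_f E[\phi(-Yf(X))]$, is Bayes optimal.

2. Number of support vectors vs. the ability to estimate the conditional probability $p(Y|X)$.
- Hinge: a sparse solution (the conditional probability cannot be estimated at all).
- Logistic: the conditional probability is obtained (but every sample becomes a support vector).

"There is a trade-off between the number of support vectors and the ability to estimate the conditional probability. The two cannot be fully achieved at once." (Bartlett and Tewari, 2007)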
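The two surrogate losses can be compared numerically. The following is a minimal sketch, assuming the convention $\phi(-y f(x))$ used on this slide, so that hinge becomes $\phi(u) = \max(0, 1+u)$ and logistic becomes $\phi(u) = \log(1+e^u)$. The function names and the toy data are illustrative assumptions, not part of the original slides.

```python
import math

def hinge(u):
    """Hinge surrogate in the phi(-y f) convention: phi(u) = max(0, 1 + u)."""
    return max(0.0, 1.0 + u)

def logistic(u):
    """Logistic surrogate in the same convention: phi(u) = log(1 + exp(u))."""
    return math.log1p(math.exp(u))

def empirical_surrogate_risk(phi, f, xs, ys):
    """(1/n) * sum_i phi(-y_i f(x_i)) for labels y_i in {-1, +1}."""
    return sum(phi(-y * f(x)) for x, y in zip(xs, ys)) / len(xs)

def derivative_at_zero(phi, h=1e-6):
    """Central finite difference approximating phi'(0)."""
    return (phi(h) - phi(-h)) / (2.0 * h)

# Toy one-dimensional data and the identity score function (illustrative).
xs = [-2.0, -1.0, 0.5, 3.0]
ys = [-1, -1, 1, 1]
f = lambda x: x

print("hinge risk:", empirical_surrogate_risk(hinge, f, xs, ys))
print("logistic risk:", empirical_surrogate_risk(logistic, f, xs, ys))
# Both derivatives at the origin are positive, matching the calibration condition.
print("phi'(0):", derivative_at_zero(hinge), derivative_at_zero(logistic))
```

Only the point at $x = 0.5$ falls inside the hinge margin, which is the sparsity effect mentioned above; the logistic loss assigns a nonzero penalty to every sample.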
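14. ### Validity and optimality of a method

2. Validity of a method. Example: consistency,

$$\hat f \ (\text{estimator}) \xrightarrow{\ p\ } f^* \ (\text{true function}).$$

3. Optimality of a method. Given that it converges, is the rate optimal?
- Admissibility
- Minimaxity

Today's talk focuses on these topics.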


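20. ### Learning theory (as covered in this talk) ≈ the theory of empirical processes

The key is to evaluate

$$\sup_{f \in \mathcal{F}} \left\{ \frac{1}{n}\sum_{i=1}^n f(x_i) - E[f] \right\}.$$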
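21. ### History: the theory of empirical processes

- 1933 Glivenko, Cantelli: Glivenko-Cantelli theorem (uniform law of large numbers)
- 1933 Kolmogorov: Kolmogorov-Smirnov test (convergence rate, asymptotic distribution)
- 1952 Donsker: Donsker's theorem (uniform central limit theorem)
- 1967 Dudley: Dudley integral
- 1968 Vapnik, Chervonenkis: VC dimension (necessary and sufficient condition for uniform convergence)
- 1996a Talagrand: Talagrand's inequality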
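23. ### Problem setting

Supervised learning.

- Training data: $D_n = \{(x_1, y_1), \dots, (x_n, y_n)\} \in (\mathcal{X} \times \mathcal{Y})^n$, an i.i.d. sequence of inputs and outputs.
- Loss function: $\ell(\cdot,\cdot) : \mathcal{Y} \times \mathbb{R} \to \mathbb{R}_+$, the penalty for a mistake.
- Hypothesis set (model): $\mathcal{F}$, a set of functions $\mathcal{X} \to \mathbb{R}$.
- $\hat f$: the estimator, an element of $\mathcal{F}$ constructed from the sample $(x_i, y_i)_{i=1}^n$.

The quantity to bound (generalization error), where $(X,Y)$ is test data:

$$E_{(X,Y)}[\ell(Y, \hat f(X))] - \inf_{f:\,\text{measurable}} E_{(X,Y)}[\ell(Y, f(X))]$$

Does the generalization error converge? How fast?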
24. ### Bias-variance decomposition

Empirical risk: $\hat L(f) = \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i))$. Expected risk: $L(f) = E_{(X,Y)}[\ell(Y, f(X))]$.

$$\text{generalization error} = L(\hat f) - \inf_{f:\,\text{measurable}} L(f) = \underbrace{L(\hat f) - \inf_{f \in \mathcal{F}} L(f)}_{\text{estimation error}} + \underbrace{\inf_{f \in \mathcal{F}} L(f) - \inf_{f:\,\text{measurable}} L(f)}_{\text{model error}}$$

For simplicity, assume there exists $f^* \in \mathcal{F}$ with $\inf_{f \in \mathcal{F}} L(f) = L(f^*)$.

Note: the model error is not covered in this talk, but the modeling problem is very important.
- Sieve methods, cross validation, information criteria, model averaging, ...
- Treatment of the model error in kernel methods: the theory of interpolation spaces (Steinwart et al., 2009, Eberts and Steinwart, 2012, Bennett and Sharpley, 1988).

From here on, the model error is assumed to be sufficiently small.
26. ### Empirical risk minimization

Empirical risk minimization (ERM):

$$\hat f = \arg\min_{f \in \mathcal{F}} \hat L(f)$$

Regularized empirical risk minimization (RERM), where $\psi(f)$ is the regularization term:

$$\hat f = \arg\min_{f \in \mathcal{F}} \hat L(f) + \psi(f)$$

There is also a very large body of work on RERM (Steinwart and Christmann, 2008, Mukherjee et al., 2002). It is a natural extension of ERM.
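ERM and RERM over a finite hypothesis set can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical class of one-dimensional threshold classifiers, the 0-1 loss, and an illustrative penalty $\psi = \lambda\theta^2$; none of these choices come from the slides.

```python
def zero_one_loss(y, prediction):
    """0-1 loss for labels in {-1, +1}."""
    return 0.0 if y == prediction else 1.0

def make_threshold(theta):
    """Hypothetical classifier f_theta(x) = +1 if x >= theta else -1."""
    return lambda x: 1 if x >= theta else -1

def empirical_risk(f, data):
    """hat L(f) = (1/n) * sum_i loss(y_i, f(x_i))."""
    return sum(zero_one_loss(y, f(x)) for x, y in data) / len(data)

def erm_threshold(thetas, data, penalty=lambda theta: 0.0):
    """Minimize empirical risk plus penalty over a finite grid of thresholds.

    With the default zero penalty this is plain ERM; a nonzero penalty
    gives RERM. Ties are broken in favor of the first grid point.
    """
    return min(thetas,
               key=lambda t: empirical_risk(make_threshold(t), data) + penalty(t))

# Toy data (illustrative): labels follow the rule x >= 0.3.
data = [(-1.0, -1), (-0.5, -1), (0.2, -1), (0.4, 1), (0.9, 1)]
thetas = [-0.75, 0.0, 0.3, 0.6]

theta_erm = erm_threshold(thetas, data)
theta_rerm = erm_threshold(thetas, data, penalty=lambda t: 10.0 * t * t)
print("ERM threshold:", theta_erm)
print("RERM threshold:", theta_rerm)
```

With a strong penalty the regularized solution moves toward the threshold with the smallest $\psi$, trading some empirical risk for a simpler hypothesis.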
28. ### Starting point

Almost every derivation of a bound starts from the following:

$$\hat L(\hat f) \le \hat L(f^*) \quad (\because \text{empirical risk minimization})$$

$$\Rightarrow\ \underbrace{L(\hat f) - L(f^*)}_{\text{generalization error}} \le \underbrace{L(\hat f) - \hat L(\hat f)}_{?} + \underbrace{\hat L(f^*) - L(f^*)}_{O_p(1/\sqrt{n})\ \text{(discussed later)}}$$

A naive analysis:

$$L(\hat f) - \hat L(\hat f) \begin{cases} \to 0 & (\because \text{law of large numbers!!}) \\ = O_p(1/\sqrt{n}) & (\because \text{central limit theorem!!}) \end{cases}$$

An easy win!

This is wrong: $\hat f$ and the training data are not independent.

Reminder: $\hat L(f) = \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i))$, $L(f) = E_{(X,Y)}[\ell(Y, f(X))]$.
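The failure of the naive analysis can be seen in a small simulation. The sketch below assumes a deliberately artificial setting (pure-noise labels and a finite set of random labelings as hypotheses, neither taken from the slides) in which every hypothesis has risk exactly 0.5 on fresh labels. A hypothesis fixed in advance has empirical risk near 0.5, while the one selected by ERM looks much better than it really is.

```python
import random

def overfitting_gap(n, num_hypotheses, rng):
    """Compare a fixed hypothesis with the ERM choice on pure-noise labels.

    Labels are fair coin flips, so every hypothesis has risk 0.5 on fresh
    labels. Returns the pair
        (0.5 - empirical risk of one fixed hypothesis,
         0.5 - empirical risk of the ERM hypothesis).
    """
    # Hypotheses are labelings of the n points, drawn before the data.
    hypotheses = [[rng.choice((-1, 1)) for _ in range(n)]
                  for _ in range(num_hypotheses)]
    labels = [rng.choice((-1, 1)) for _ in range(n)]

    def empirical_risk(h):
        return sum(hi != yi for hi, yi in zip(h, labels)) / n

    risks = [empirical_risk(h) for h in hypotheses]
    return 0.5 - risks[0], 0.5 - min(risks)

rng = random.Random(0)
fixed_gap, erm_gap = overfitting_gap(n=50, num_hypotheses=200, rng=rng)
print("gap for a fixed hypothesis:", fixed_gap)
print("gap for the ERM hypothesis:", erm_gap)
```

The ERM gap does not shrink like a single average because the minimizer is chosen using the same sample; controlling it requires a bound that holds for all hypotheses at once.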

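33. ### What is the problem?

[Figure: the expected risk $L(f)$ and the empirical risk $\hat L(f)$ plotted over $f$, with $f^*$ and $\hat f$ marked; a second version adds a uniform bound around $L(f)$.]

Some hypothesis may do well "by chance" (overfitting). In fact, there are examples where convergence fails when $\mathcal{F}$ is complex.

A uniform bound guarantees that "doing well by chance" (almost) never happens. This is not trivial; it is the subject of empirical process theory.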
35. ### Uniform bound

$$L(\hat f) - \hat L(\hat f) \le \sup_{f \in \mathcal{F}} \left\{ L(f) - \hat L(f) \right\} \le (?)$$

It is important to bound the risk uniformly.

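38. ### Useful inequalities

Hoeffding's inequality. Let $Z_i$ ($i = 1, \dots, n$) be independent (not necessarily identically distributed) random variables with mean 0 such that $|Z_i| \le m_i$. Then

$$P\left( \frac{|\sum_{i=1}^n Z_i|}{\sqrt{n}} > t \right) \le 2\exp\left( -\frac{t^2}{2\sum_{i=1}^n m_i^2/n} \right).$$

Bernstein's inequality. Let $Z_i$ ($i = 1, \dots, n$) be independent (not necessarily identically distributed) random variables with mean 0 such that $E[Z_i^2] = \sigma_i^2$ and $|Z_i| \le M$. Then

$$P\left( \frac{|\sum_{i=1}^n Z_i|}{\sqrt{n}} > t \right) \le 2\exp\left( -\frac{t^2}{2\left(\frac{1}{n}\sum_{i=1}^n \sigma_i^2 + \frac{1}{\sqrt{n}} M t\right)} \right).$$

Bernstein's inequality makes use of variance information.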
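39. ### Useful inequalities: extended versions

Hoeffding's inequality (sub-Gaussian tail). Let $Z_i$ ($i = 1, \dots, n$) be independent (not necessarily identically distributed) random variables with mean 0 such that $E[e^{\tau Z_i}] \le e^{\sigma_i^2 \tau^2/2}$ ($\forall \tau > 0$). Then

$$P\left( \frac{|\sum_{i=1}^n Z_i|}{\sqrt{n}} > t \right) \le 2\exp\left( -\frac{t^2}{2\sum_{i=1}^n \sigma_i^2/n} \right).$$

Bernstein's inequality. Let $Z_i$ ($i = 1, \dots, n$) be independent (not necessarily identically distributed) random variables with mean 0 such that $E[Z_i^2] = \sigma_i^2$ and $E|Z_i|^k \le \frac{k!}{2}\sigma_i^2 M^{k-2}$ ($\forall k \ge 2$). Then

$$P\left( \frac{|\sum_{i=1}^n Z_i|}{\sqrt{n}} > t \right) \le 2\exp\left( -\frac{t^2}{2\left(\frac{1}{n}\sum_{i=1}^n \sigma_i^2 + \frac{1}{\sqrt{n}} M t\right)} \right).$$

(Hilbert-space versions also exist.)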
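40. ### Uniform bound for a finite set 1: Hoeffding version

Knowing just this is already useful. (Think of $f \leftarrow \ell(y, g(x)) - E\ell(Y, g(X))$.)

$\mathcal{F} = \{f_m \ (m = 1, \dots, M)\}$: a finite set of functions, each with mean 0 ($E[f_m(X)] = 0$).

Hoeffding's inequality (substituting $Z_i = f_m(X_i)$):

$$P\left( \frac{|\sum_{i=1}^n f_m(X_i)|}{\sqrt{n}} > t \right) \le 2\exp\left( -\frac{t^2}{2\|f_m\|_\infty^2} \right).$$

Uniform bound:

$$P\left( \max_{1 \le m \le M} \frac{|\sum_{i=1}^n f_m(X_i)|}{\sqrt{n}} > \max_m \|f_m\|_\infty \sqrt{2\log(2M/\delta)} \right) \le \delta$$

$$E\left[ \max_{1 \le m \le M} \frac{|\sum_{i=1}^n f_m(X_i)|}{\sqrt{n}} \right] \le C \max_m \|f_m\|_\infty \sqrt{\log(1+M)}$$

(Derivation)

$$P\left( \max_{1 \le m \le M} \frac{|\sum_{i=1}^n f_m(X_i)|}{\sqrt{n}} > t \right) = P\left( \bigcup_{1 \le m \le M} \left\{ \frac{|\sum_{i=1}^n f_m(X_i)|}{\sqrt{n}} > t \right\} \right) \le 2\sum_{m=1}^M \exp\left( -\frac{t^2}{2\|f_m\|_\infty^2} \right)$$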
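The step from the union bound to the high-probability statement is a one-line calculation. As a worked sketch, write $B := \max_m \|f_m\|_\infty$ (a shorthand introduced here, not in the slides), bound every term of the sum by its largest value, and solve for $t$:

```latex
\begin{align*}
2\sum_{m=1}^{M} \exp\!\left(-\frac{t^2}{2\|f_m\|_\infty^2}\right)
  &\le 2M \exp\!\left(-\frac{t^2}{2B^2}\right), \qquad B := \max_m \|f_m\|_\infty, \\
2M \exp\!\left(-\frac{t^2}{2B^2}\right) = \delta
  &\iff \frac{t^2}{2B^2} = \log\frac{2M}{\delta}
  \iff t = B\sqrt{2\log(2M/\delta)} .
\end{align*}
```

The number of hypotheses $M$ enters only through $\sqrt{\log M}$, which is why a finite union bound remains useful even for very large classes.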
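41. ### Uniform bound for a finite set 2: Bernstein version

$\mathcal{F} = \{f_m \ (m = 1, \dots, M)\}$: a finite set of functions, each with mean 0 ($E[f_m(X)] = 0$).

Bernstein's inequality:

$$P\left( \frac{|\sum_{i=1}^n f_m(X_i)|}{\sqrt{n}} > t \right) \le 2\exp\left( -\frac{t^2}{2\left(\|f_m\|_{L_2}^2 + \frac{1}{\sqrt{n}}\|f_m\|_\infty t\right)} \right).$$

Uniform bound:

$$E\left[ \max_{1 \le m \le M} \frac{|\sum_{i=1}^n f_m(X_i)|}{\sqrt{n}} \right] \lesssim \frac{1}{\sqrt{n}} \max_m \|f_m\|_\infty \log(1+M) + \max_m \|f_m\|_{L_2} \sqrt{\log(1+M)}$$

Note: for fixed $n$, the leading term of the uniform bound grows only at order $\sqrt{\log M}$.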
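43. ### From finite to infinite

What if the hypothesis set has infinitely many elements? What if it has the cardinality of the continuum?

$$\mathcal{F} = \{x^\top \beta \mid \beta \in \mathbb{R}^d, \ \|\beta\| \le 1\}$$

$$\mathcal{F} = \{f \in \mathcal{H} \mid \|f\|_{\mathcal{H}} \le 1\}$$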

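45. ### Rademacher complexity

$\epsilon_1, \epsilon_2, \dots, \epsilon_n$: Rademacher variables, i.e., $P(\epsilon_i = 1) = P(\epsilon_i = -1) = \frac{1}{2}$.

Rademacher complexity:

$$R(\mathcal{F}) := E_{\{\epsilon_i\},\{x_i\}}\left[ \sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \epsilon_i f(x_i) \right]$$

Symmetrization (expectation):

$$E\left[ \sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n (f(x_i) - E[f]) \right] \le 2R(\mathcal{F}).$$

If $\|f\|_\infty \le 1$ ($\forall f \in \mathcal{F}$), then (tail probability):

$$P\left( \sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n (f(x_i) - E[f]) \ge 2R(\mathcal{F}) + \sqrt{\frac{t}{2n}} \right) \le e^{-t}.$$

Bounding the Rademacher complexity yields a uniform bound!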
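For a finite class the supremum is a maximum, so the Rademacher average can be estimated by Monte Carlo. The sketch below estimates the empirical version (conditional on a fixed sample $x_1, \dots, x_n$; the definition above additionally averages over the sample) for a hypothetical grid of threshold functions, and compares it with Massart's finite-class bound $\sqrt{2\log M / n}$, which is valid when $|f| \le 1$. The class, the sample, and all names are assumptions for illustration.

```python
import math
import random

def empirical_rademacher(values, num_draws, rng):
    """Monte Carlo estimate of E_eps[ max_m (1/n) sum_i eps_i f_m(x_i) ].

    values[m][i] holds f_m(x_i) for a fixed sample x_1, ..., x_n.
    """
    n = len(values[0])
    total = 0.0
    for _ in range(num_draws):
        eps = [rng.choice((-1, 1)) for _ in range(n)]
        total += max(sum(e * v for e, v in zip(eps, row)) for row in values) / n
    return total / num_draws

rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(100)]

# Hypothetical finite class: 21 threshold functions with values in {-1, +1}.
thetas = [-1.0 + 0.1 * j for j in range(21)]
values = [[1.0 if x >= t else -1.0 for x in xs] for t in thetas]

estimate = empirical_rademacher(values, num_draws=500, rng=rng)
massart = math.sqrt(2.0 * math.log(len(values)) / len(xs))
print("Monte Carlo estimate:", estimate)
print("finite-class bound:  ", massart)
```

The estimate is typically well below the finite-class bound, since neighboring thresholds are highly correlated; exploiting that correlation is what covering numbers and chaining do.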
46. ### Properties of the Rademacher complexity

Contraction inequality: if $\psi$ is Lipschitz continuous, i.e., $|\psi(f) - \psi(f')| \le B|f - f'|$, then

$$R(\{\psi(f) \mid f \in \mathcal{F}\}) \le B R(\mathcal{F}).$$

Convex hull: let $\mathrm{conv}(\mathcal{F})$ be the set of all convex combinations of elements of $\mathcal{F}$. Then

$$R(\mathrm{conv}(\mathcal{F})) = R(\mathcal{F}).$$

The first property is especially valuable. If $|\ell(y, f) - \ell(y, f')| \le |f - f'|$, then

$$E\left[ \sup_{f \in \mathcal{F}} |\hat L(f) - L(f)| \right] \le 2R(\ell(\mathcal{F})) \le 2R(\mathcal{F}),$$

where $\ell(\mathcal{F}) = \{\ell(\cdot, f(\cdot)) \mid f \in \mathcal{F}\}$. Hence it suffices to bound the Rademacher complexity of $\mathcal{F}$!

Lipschitz continuity holds for the hinge loss, the logistic loss, and others. If in addition $y$ and $\mathcal{F}$ are bounded, it also holds for the squared loss and similar losses.

Reminder: $\hat L(f) = \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i))$, $L(f) = E_{(X,Y)}[\ell(Y, f(X))]$.
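48. ### Covering number

A way to bound the Rademacher complexity.

Covering number: the complexity or capacity of the hypothesis set $\mathcal{F}$.

$\epsilon$-covering number $N(\mathcal{F}, \epsilon, d)$: the smallest number of balls of radius $\epsilon$, measured in the norm $d$, needed to cover $\mathcal{F}$. Equivalently, the minimum number of elements needed to approximate $\mathcal{F}$ by a finite set.

Theorem (Dudley integral). Let $\|f\|_n^2 := \frac{1}{n}\sum_{i=1}^n f(x_i)^2$. Then

$$R(\mathcal{F}) \le \frac{C}{\sqrt{n}} E_{D_n}\left[ \int_0^\infty \sqrt{\log N(\mathcal{F}, \epsilon, \|\cdot\|_n)}\, d\epsilon \right].$$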
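49. ### Intuition for the Dudley integral

$$R(\mathcal{F}) \le \frac{C}{\sqrt{n}} E_{D_n}\left[ \int_0^\infty \sqrt{\log N(\mathcal{F}, \epsilon, \|\cdot\|_n)}\, d\epsilon \right].$$

Approximate $\mathcal{F}$ by finitely many elements. The picture is to refine the resolution step by step while grouping similar elements together. This technique is called chaining.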
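50. ### Summary so far

$$\hat L(\hat f) \le \hat L(f^*) \quad (\because \text{empirical risk minimization})$$

$$\Rightarrow\ L(\hat f) - L(f^*) \le \underbrace{L(\hat f) - \hat L(\hat f)}_{\text{the term to bound}} + \underbrace{\hat L(f^*) - L(f^*)}_{O_p(1/\sqrt{n})\ \text{(Hoeffding)}}$$

When $\ell$ is 1-Lipschitz ($|\ell(y, f) - \ell(y, f')| \le |f - f'|$) and $\|f\|_\infty \le 1$ ($\forall f \in \mathcal{F}$), the following holds up to constant factors:

$$\begin{aligned}
L(\hat f) - \hat L(\hat f) &\le \sup_{f \in \mathcal{F}} (L(f) - \hat L(f)) \\
&\lesssim R(\ell(\mathcal{F})) + \sqrt{\frac{t}{n}} \quad (\text{with prob. } 1 - e^{-t}) \\
&\lesssim R(\mathcal{F}) + \sqrt{\frac{t}{n}} \quad (\text{contraction ineq., Lipschitz continuity}) \\
&\lesssim \frac{1}{\sqrt{n}} E_{D_n}\left[ \int_0^\infty \sqrt{\log N(\mathcal{F}, \epsilon, \|\cdot\|_n)}\, d\epsilon \right] + \sqrt{\frac{t}{n}} \quad (\text{Dudley integral}).
\end{aligned}$$

Note: the smaller the covering number, the smaller the risk. This is Occam's razor.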
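51. ### Example: linear classifiers

$$\mathcal{F} = \{f(x) = \mathrm{sign}(x^\top \beta + c) \mid \beta \in \mathbb{R}^d, \ c \in \mathbb{R}\}$$

$$N(\mathcal{F}, \epsilon, \|\cdot\|_n) \le C(d+2)\left(\frac{c}{\epsilon}\right)^{2(d+1)}$$

Then, for the 0-1 loss $\ell$,

$$\begin{aligned}
L(\hat f) - \hat L(\hat f) &\le O_p\left( \frac{1}{\sqrt{n}} E_{D_n}\left[ \int_0^1 \sqrt{\log N(\mathcal{F}, \epsilon, \|\cdot\|_n)}\, d\epsilon \right] \right) \\
&\le O_p\left( \frac{1}{\sqrt{n}} \int_0^1 C\sqrt{d\log(1/\epsilon) + \log(d)}\, d\epsilon \right) \\
&\le O_p\left( \sqrt{\frac{d}{n}} \right).
\end{aligned}$$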
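The last step of the linear-classifier example only needs the entropy integral to be finite. As a worked sketch (the splitting of the square root and the substitution $\epsilon = e^{-u}$ are filled in here and are not spelled out in the slides):

```latex
\begin{align*}
\int_0^1 \sqrt{d\log(1/\epsilon) + \log d}\; d\epsilon
  &\le \sqrt{d}\int_0^1 \sqrt{\log(1/\epsilon)}\; d\epsilon + \sqrt{\log d}
  && (\sqrt{a+b} \le \sqrt{a} + \sqrt{b}) \\
\int_0^1 \sqrt{\log(1/\epsilon)}\; d\epsilon
  &= \int_0^\infty \sqrt{u}\, e^{-u}\, du = \Gamma\!\left(\tfrac{3}{2}\right) = \frac{\sqrt{\pi}}{2}
  && (\epsilon = e^{-u}) \\
\Rightarrow\quad
\frac{1}{\sqrt{n}}\int_0^1 \sqrt{d\log(1/\epsilon) + \log d}\; d\epsilon
  &\le \frac{1}{\sqrt{n}}\left(\frac{\sqrt{\pi}}{2}\sqrt{d} + \sqrt{\log d}\right)
  = O\!\left(\sqrt{\frac{d}{n}}\right).
\end{align*}
```

The logarithmic singularity of the entropy at $\epsilon \to 0$ is integrable, so a polynomial covering number always gives the parametric rate $\sqrt{d/n}$.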
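52. ### Example: VC dimension

$\mathcal{F}$ is a set of indicator functions: $\mathcal{F} = \{1_C \mid C \in \mathcal{C}\}$, where $\mathcal{C}$ is a family of sets (e.g., the set of half-spaces).

Shattering: $\mathcal{F}$ shatters a given finite set $X_n = \{x_1, \dots, x_n\}$ $\Leftrightarrow$ for every labeling $Y_n = \{y_1, \dots, y_n\}$ ($y_i \in \{\pm 1\}$), $\mathcal{F}$ can classify $X_n$ correctly.

VC dimension $V_{\mathcal{F}}$: the smallest $n$ for which no set of size $n$ can be shattered by $\mathcal{F}$.

$$N(\mathcal{F}, \epsilon, \|\cdot\|_n) \le K V_{\mathcal{F}} (4e)^{V_{\mathcal{F}}} \left(\frac{1}{\epsilon}\right)^{2(V_{\mathcal{F}} - 1)} \ \Rightarrow\ \text{generalization error} = O_p\left(\sqrt{V_{\mathcal{F}}/n}\right)$$

(Figure borrowed from http://www.tcs.fudan.edu.cn/rudolf/Courses/Algorithms/Alg_ss_07w/Webprojects/Qinbo_diameter/e_net.htm)

Finiteness of the VC dimension is necessary and sufficient for uniform convergence (the necessary and sufficient condition for the generalized Glivenko-Cantelli theorem).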
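53. ### Example: kernel methods

$$\mathcal{F} = \{f \in \mathcal{H} \mid \|f\|_{\mathcal{H}} \le 1\}$$

Kernel function $k$, reproducing kernel Hilbert space $\mathcal{H}$. Assume $k(x, x) \le 1$ ($\forall x \in \mathcal{X}$), e.g., the Gaussian kernel.

Evaluate the Rademacher complexity directly, using

$$\sum_{i=1}^n \epsilon_i f(x_i) = \left\langle \sum_{i=1}^n \epsilon_i k(x_i, \cdot), f \right\rangle_{\mathcal{H}} \le \left\| \sum_{i=1}^n \epsilon_i k(x_i, \cdot) \right\|_{\mathcal{H}} \|f\|_{\mathcal{H}} \le \left\| \sum_{i=1}^n \epsilon_i k(x_i, \cdot) \right\|_{\mathcal{H}}.$$

$$\begin{aligned}
R(\mathcal{F}) &= E\left[ \sup_{f \in \mathcal{F}} \frac{|\sum_{i=1}^n \epsilon_i f(x_i)|}{n} \right] \le E\left[ \frac{\|\sum_{i=1}^n \epsilon_i k(x_i, \cdot)\|_{\mathcal{H}}}{n} \right] = E\left[ \frac{\sqrt{\sum_{i,j=1}^n \epsilon_i \epsilon_j k(x_i, x_j)}}{n} \right] \\
&\le \frac{\sqrt{E\left[ \sum_{i,j=1}^n \epsilon_i \epsilon_j k(x_i, x_j) \right]}}{n} \quad (\text{Jensen}) \\
&= \frac{\sqrt{E\left[ \sum_{i=1}^n k(x_i, x_i) \right]}}{n} \le \frac{1}{\sqrt{n}}.
\end{aligned}$$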
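The chain of inequalities for the RKHS unit ball can be checked numerically. The sketch below assumes a one-dimensional Gaussian kernel and a fixed random sample (both illustrative choices). For the unit ball, the supremum equals $\sqrt{\epsilon^\top K \epsilon}$ with Gram matrix $K$, so the Monte Carlo average of that quantity divided by $n$ is compared with the Jensen bound $\sqrt{\mathrm{tr}(K)}/n$.

```python
import math
import random

def gaussian_kernel(x, y, bandwidth=1.0):
    """k(x, y) = exp(-(x - y)^2 / (2 * bandwidth^2)), so k(x, x) = 1."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))

def gram_matrix(xs, kernel):
    return [[kernel(a, b) for b in xs] for a in xs]

def unit_ball_rademacher(K, num_draws, rng):
    """Monte Carlo estimate of E_eps[ sqrt(eps^T K eps) ] / n.

    For the RKHS unit ball, sup_f sum_i eps_i f(x_i) equals
    || sum_i eps_i k(x_i, .) ||_H = sqrt(eps^T K eps).
    """
    n = len(K)
    total = 0.0
    for _ in range(num_draws):
        eps = [rng.choice((-1, 1)) for _ in range(n)]
        quad = sum(eps[i] * eps[j] * K[i][j] for i in range(n) for j in range(n))
        total += math.sqrt(max(quad, 0.0)) / n  # guard against rounding below 0
    return total / num_draws

def trace_bound(K):
    """Jensen bound sqrt(sum_i k(x_i, x_i)) / n."""
    n = len(K)
    return math.sqrt(sum(K[i][i] for i in range(n))) / n

rng = random.Random(0)
xs = [rng.uniform(-3.0, 3.0) for _ in range(20)]
K = gram_matrix(xs, gaussian_kernel)

estimate = unit_ball_rademacher(K, num_draws=2000, rng=rng)
bound = trace_bound(K)
print("Monte Carlo estimate:", estimate)
print("sqrt(trace K) / n:   ", bound, "  1/sqrt(n):", 1.0 / math.sqrt(len(xs)))
```

Because $k(x, x) = 1$ for the Gaussian kernel, the trace bound coincides with $1/\sqrt{n}$; the dimension of the feature space never appears.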
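54. ### Example: operator norm of a random matrix

$A = (a_{ij})$: a $p \times q$ matrix whose entries $a_{ij}$ are independent random variables with mean 0 and $|a_{ij}| \le 1$.

Operator norm of $A$:

$$\|A\| := \max_{\|z\| \le 1,\ z \in \mathbb{R}^q} \|Az\| = \max_{\|w\| \le 1, \|z\| \le 1,\ w \in \mathbb{R}^p, z \in \mathbb{R}^q} w^\top A z.$$

$$\mathcal{F} = \{f_{w,z}(a_{ij}, (i,j)) = a_{ij} w_i z_j \mid w \in \mathbb{R}^p, z \in \mathbb{R}^q, \|w\| \le 1, \|z\| \le 1\} \ \Rightarrow\ \|A\| = \sup_{f \in \mathcal{F}} \sum_{i,j} f(a_{ij}, (i,j))$$

Regard this as having $n = pq$ samples.

$$\|f_{w,z} - f_{w',z'}\|_n^2 = \frac{1}{pq}\sum_{i,j=1}^{p,q} |a_{ij}(w_i z_j - w'_i z'_j)|^2 \le \frac{2}{pq}\left( \|w - w'\|^2 + \|z - z'\|^2 \right)$$

$$\therefore\ N(\mathcal{F}, \epsilon, \|\cdot\|_n) \begin{cases} \le C(\sqrt{pq}\,\epsilon)^{-(p+q)}, & (\epsilon \le 2/\sqrt{pq}), \\ = 1, & (\text{otherwise}). \end{cases}$$

$$E\left[ \frac{1}{pq} \sup_{w,z} w^\top A z \right] \le \frac{C}{\sqrt{pq}} \int_0^{\frac{1}{\sqrt{pq}}} \sqrt{(p+q)\log(C/(\sqrt{pq}\,\epsilon))}\, d\epsilon \lesssim \frac{\sqrt{p+q}}{pq}$$

Hence the operator norm of $A$ is $O_p(\sqrt{p+q})$. This is used in low-rank matrix estimation, Robust PCA, and so on.

For details see Tao (2012) and Davidson and Szarek (2001).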
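55. ### Example: convergence rate of the Lasso

Design matrix $X = (X_{ij}) \in \mathbb{R}^{n \times p}$, with $p$ (dimension) $\gg$ $n$ (sample size).

True vector $\beta^* \in \mathbb{R}^p$: at most $d$ nonzero entries (sparse).

Model: $Y = X\beta^* + \xi$.

$$\hat\beta \leftarrow \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{n}\|X\beta - Y\|_2^2 + \lambda_n \|\beta\|_1.$$

Theorem (convergence rate of the Lasso (Bickel et al., 2009, Zhang, 2009)). If the design matrix satisfies the restricted eigenvalue condition (Bickel et al., 2009) and $\max_{i,j} |X_{ij}| \le 1$, and the noise satisfies $E[e^{\tau \xi_i}] \le e^{\sigma^2 \tau^2/2}$ ($\forall \tau > 0$), then with probability $1 - \delta$,

$$\|\hat\beta - \beta^*\|_2^2 \le C\frac{d\log(p/\delta)}{n}.$$

Note: even when the dimension is high, it enters only through $\log(p)$. The effective dimension $d$ dominates.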
56. ### Where did the log(p) come from?

It came from the uniform bound over a finite set.

$$\frac{1}{n}\|X\hat\beta - Y\|_2^2 + \lambda_n\|\hat\beta\|_1 \le \frac{1}{n}\|X\beta^* - Y\|_2^2 + \lambda_n\|\beta^*\|_1$$

$$\Rightarrow\ \frac{1}{n}\|X(\hat\beta - \beta^*)\|_2^2 + \lambda_n\|\hat\beta\|_1 \le \underbrace{\frac{2}{n}\|X^\top \xi\|_\infty}_{\text{this term}} \|\hat\beta - \beta^*\|_1 + \lambda_n\|\beta^*\|_1$$

$$\frac{1}{n}\|X^\top \xi\|_\infty = \max_{1 \le j \le p} \left| \frac{1}{n}\sum_{i=1}^n X_{ij}\xi_i \right|$$

By the uniform bound derived from Hoeffding's inequality, with probability $1 - \delta$,

$$\max_{1 \le j \le p} \left| \frac{1}{n}\sum_{i=1}^n X_{ij}\xi_i \right| \le \sigma\sqrt{\frac{2\log(2p/\delta)}{n}}.$$
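The noise-level bound is easy to check by simulation. The sketch below assumes a random sign design ($X_{ij} = \pm 1$, so $\max|X_{ij}| \le 1$) and Gaussian noise, which is sub-Gaussian with parameter $\sigma$; both are illustrative choices, as are the function names. It counts how often $\max_j |\frac{1}{n}\sum_i X_{ij}\xi_i|$ exceeds $\sigma\sqrt{2\log(2p/\delta)/n}$.

```python
import math
import random

def noise_level(sigma, n, p, delta):
    """High-probability bound sigma * sqrt(2 * log(2p / delta) / n)."""
    return sigma * math.sqrt(2.0 * math.log(2.0 * p / delta) / n)

def max_abs_correlation(X, xi):
    """max_j | (1/n) * sum_i X[i][j] * xi[i] |, i.e. ||X^T xi||_inf / n."""
    n, p = len(X), len(X[0])
    return max(abs(sum(X[i][j] * xi[i] for i in range(n))) / n for j in range(p))

def exceed_fraction(n, p, sigma, delta, trials, rng):
    """Fraction of simulated trials in which the bound is violated."""
    bound = noise_level(sigma, n, p, delta)
    exceed = 0
    for _ in range(trials):
        X = [[rng.choice((-1.0, 1.0)) for _ in range(p)] for _ in range(n)]
        xi = [rng.gauss(0.0, sigma) for _ in range(n)]
        if max_abs_correlation(X, xi) > bound:
            exceed += 1
    return exceed / trials

rng = random.Random(0)
fraction = exceed_fraction(n=100, p=20, sigma=1.0, delta=0.1, trials=200, rng=rng)
print("bound:", noise_level(1.0, 100, 20, 0.1))
print("fraction of violations:", fraction, "(should not exceed delta = 0.1)")
```

In analyses of this type the regularization parameter $\lambda_n$ is usually chosen as a constant multiple of this noise level, which is how the $\log(p)$ factor reaches the final rate.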
58. ### Talagrand's concentration inequality

A highly versatile inequality.

Theorem (Talagrand (1996b), Massart (2000), Bousquet (2002)). Let $\sigma^2 := \sup_{f \in \mathcal{F}} E[f(X)^2]$, $P_n f := \frac{1}{n}\sum_{i=1}^n f(x_i)$, and $Pf := E[f(X)]$. Then

$$P\left[ \sup_{f \in \mathcal{F}} (P_n f - Pf) \ge C\left( E\left[ \sup_{f \in \mathcal{F}} (P_n f - Pf) \right] + \sqrt{\frac{t}{n}}\,\sigma + \frac{t}{n} \right) \right] \le e^{-t}.$$

Useful for proving fast learning rates.
59. ### Other topics

Johnson-Lindenstrauss lemma (Johnson and Lindenstrauss, 1984, Dasgupta and Gupta, 1999). Project $n$ points $\{x_1, \dots, x_n\} \subset \mathbb{R}^d$ into a $k$-dimensional space. If $k \ge c_\delta \log(n)$, a random projection $A \in \mathbb{R}^{k \times d}$ (a random matrix) onto $k$ dimensions satisfies

$$(1 - \delta)\|x_i - x_j\| \le \|Ax_i - Ax_j\| \le (1 + \delta)\|x_i - x_j\|$$

with high probability. This leads to restricted isometry (Baraniuk et al., 2008, Candès, 2008).

Gaussian concentration inequality, concentration inequality on product space (Ledoux, 2001):

$$\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \xi_i f(x_i) \quad (\xi_i: \text{Gaussian, etc.})$$

Majorizing measure: upper and lower bounds for Gaussian processes (Talagrand, 2000).
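The Johnson-Lindenstrauss statement can be illustrated with a Gaussian random projection. The sketch below assumes entries drawn i.i.d. from $N(0, 1/k)$, which is one standard construction (the slides do not fix a particular one), and reports the smallest and largest ratio $\|Ax_i - Ax_j\| / \|x_i - x_j\|$ over all pairs of a small random point set.

```python
import math
import random

def random_projection(k, d, rng):
    """k x d matrix with i.i.d. N(0, 1/k) entries (one standard JL construction)."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]

def project(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distortion_range(points, A):
    """Smallest and largest ratio ||A x_i - A x_j|| / ||x_i - x_j|| over pairs."""
    projected = [project(A, x) for x in points]
    ratios = [distance(projected[i], projected[j]) / distance(points[i], points[j])
              for i in range(len(points)) for j in range(i + 1, len(points))]
    return min(ratios), max(ratios)

rng = random.Random(0)
d, k, n = 500, 200, 10
points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
A = random_projection(k, d, rng)

low, high = distortion_range(points, A)
print("distance ratios lie in [%.3f, %.3f]" % (low, high))
```

The required target dimension depends on the number of points only through $\log(n)$ and not on the original dimension $d$, the same logarithmic dependence seen in the finite-class uniform bound.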

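62. ### Actively exploiting the strong convexity of the loss function

[Figure: the expected risk $L(f)$ over $f$ with $f^*$, $\hat f$, and a uniform bound; in the second version the region that can contain $\hat f$ shrinks around $f^*$.]

Using the strong convexity of the loss restricts the region where $\hat f$ can lie, which gives a tighter bound.

Applying the same argument repeatedly shows that the risk of $\hat f$ is small.

This exploits the fact that $\hat f$ is close to $f^*$, which leads to the "local" Rademacher complexity.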
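64. ### Local Rademacher complexity

Local Rademacher complexity: $R_\delta(\mathcal{F}) := R(\{f \in \mathcal{F} \mid E[(f - f^*)^2] \le \delta\})$.

Assume the following conditions.
- $\mathcal{F}$ is bounded above by 1: $\|f\|_\infty \le 1$ ($\forall f \in \mathcal{F}$).
- $\ell$ is Lipschitz continuous and strongly convex: $E[\ell(Y, f(X))] - E[\ell(Y, f^*(X))] \ge B E[(f - f^*)^2]$ ($\forall f \in \mathcal{F}$).

Theorem (fast learning rate (Bartlett et al., 2005)). Let $\delta^* = \inf\{\delta \mid \delta \ge R_\delta(\mathcal{F})\}$. Then with probability $1 - e^{-t}$,

$$L(\hat f) - L(f^*) \le C\left( \delta^* + \frac{t}{n} \right).$$

$\delta^* \le R(\mathcal{F})$ always holds (see the figure). This is called a fast learning rate.

[Figure: $R_\delta(\mathcal{F})$ plotted against $\delta$, with the fixed point $\delta^*$ marked.]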
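65. ### Example of a fast learning rate

When $\log N(\mathcal{F}, \epsilon, \|\cdot\|_n) \le C\epsilon^{-2\rho}$, one can show

$$R_\delta(\mathcal{F}) \le C\left( \frac{\delta^{\frac{1-\rho}{2}}}{\sqrt{n}} \vee n^{-\frac{1}{1+\rho}} \right),$$

and from the definition of $\delta^*$, the following holds with probability $1 - e^{-t}$:

$$L(\hat f) - L(f^*) \le C\left( n^{-\frac{1}{1+\rho}} + \frac{t}{n} \right).$$

Note: tighter than $1/\sqrt{n}$!

References
- General theory of local Rademacher complexity: Bartlett et al. (2005), Koltchinskii (2006)
- Classification, Tsybakov's condition: Tsybakov (2004), Bartlett et al. (2006)
- Fast learning rates in kernel methods: Steinwart and Christmann (2008)
- Peeling device: van de Geer (2000)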
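The rate $n^{-1/(1+\rho)}$ follows from solving the fixed-point condition that defines $\delta^*$. A worked sketch, assuming $0 < \rho < 1$ (an assumption added here so that the exponents behave as stated) and ignoring constants:

```latex
\begin{align*}
\delta = \frac{\delta^{\frac{1-\rho}{2}}}{\sqrt{n}}
  \;&\iff\; \delta^{\,1 - \frac{1-\rho}{2}} = \delta^{\frac{1+\rho}{2}} = n^{-\frac{1}{2}}
  \;\iff\; \delta = n^{-\frac{1}{1+\rho}}, \\
\text{so}\quad \delta^* &\lesssim n^{-\frac{1}{1+\rho}},
  \qquad\text{and}\qquad \frac{1}{1+\rho} > \frac{1}{2} \ \text{ for } 0 < \rho < 1 .
\end{align*}
```

For every $\delta$ above this fixed point the first term of the bound on $R_\delta(\mathcal{F})$ is below $\delta$, and the second term already has the same order, which is why the resulting rate beats $n^{-1/2}$.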
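69. ### Admissibility

Model of distributions: $\{P_\theta \mid \theta \in \Theta\}$.

Expected risk of an estimator $\check f$ under $P_\theta$:

$$\bar L_\theta(\check f) := E_{D_n \sim P_\theta}\left[ E_{(X,Y) \sim P_\theta}[\ell(Y, \check f(X))] \right].$$

Definition (admissibility). $\hat f$ is admissible $\Leftrightarrow$ there is no estimator $\check f$ such that $\bar L_\theta(\check f) \le \bar L_\theta(\hat f)$ ($\forall \theta \in \Theta$) and $\bar L_{\theta'}(\check f) < \bar L_{\theta'}(\hat f)$ for some $\theta' \in \Theta$.

[Figure: risk curves $\bar L_\theta(\check f)$ and $\bar L_\theta(\hat f)$ plotted against $\theta$.]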
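70. ### Examples

For simplicity, consider the problem of estimating $P_\theta$ ($\theta \in \Theta$) from a sample $D_n = \{(x_1, \dots, x_n)\} \sim P_\theta^n$.

One-point bet: always use a fixed $\theta_0$. The fit is best for that $\theta_0$ but poor for other $\theta$.

Bayes estimator: prior distribution $\pi(\theta)$, risk $L(\theta_0, \hat P)$,

$$\hat P = \arg\min_{\hat P:\,\text{estimator}} \int E_{D_n \sim P_{\theta_0}}[L(\theta_0, \hat P)]\,\pi(\theta_0)\,d\theta_0.$$

- Squared risk $L(\theta, \hat\theta) = \|\theta - \hat\theta\|^2$: $\hat\theta = \int \theta\,\pi(\theta|D_n)\,d\theta$ (posterior mean).
- KL risk $L(\theta, \hat P) = \mathrm{KL}(P_\theta \| \hat P)$: $\hat P = \int P(\cdot|\theta)\,\pi(\theta|D_n)\,d\theta$ (Bayes predictive distribution).

By the definition of the Bayes estimator, no estimator improves the risk $L(\theta, \hat P)$ for every $\theta$.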
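Both estimators can be made concrete in a Bernoulli model. The sketch below assumes a Beta$(a, b)$ prior on the success probability $\theta$ (an illustrative choice, not taken from the slides), for which the posterior mean after $k$ successes in $n$ trials is $(a+k)/(a+b+n)$. It compares the exact squared risk of the posterior mean with that of a one-point bet on $\theta_0 = 0.5$.

```python
import math

def posterior_mean(k, n, a=1.0, b=1.0):
    """Posterior mean of theta under a Beta(a, b) prior after k successes in n trials."""
    return (a + k) / (a + b + n)

def one_point_bet(theta0):
    """Estimator that ignores the data and always returns theta0."""
    return lambda k, n: theta0

def squared_risk(estimator, theta, n):
    """Exact E[(estimator(k, n) - theta)^2] when k ~ Binomial(n, theta)."""
    return sum(math.comb(n, k) * theta ** k * (1.0 - theta) ** (n - k)
               * (estimator(k, n) - theta) ** 2
               for k in range(n + 1))

n = 4
bet = one_point_bet(0.5)
for theta in (0.1, 0.5, 0.9):
    print("theta=%.1f  posterior mean risk=%.4f  one-point bet risk=%.4f"
          % (theta, squared_risk(posterior_mean, theta, n), squared_risk(bet, theta, n)))
```

The one-point bet has zero risk at $\theta_0$ and so cannot be dominated there, while it is far worse away from $\theta_0$; this is why admissibility alone is a weak criterion and minimax optimality is considered as well.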
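72. ### Minimax optimality

Definition (minimax optimality). $\hat f$ is minimax optimal $\Leftrightarrow$

$$\max_{\theta \in \Theta} \bar L_\theta(\hat f) = \min_{\check f:\,\text{estimator}} \max_{\theta \in \Theta} \bar L_\theta(\check f).$$

In learning theory a constant factor is often allowed: there exists $C$ such that

$$\max_{\theta \in \Theta} \bar L_\theta(\hat f) \le C \min_{\check f:\,\text{estimator}} \max_{\theta \in \Theta} \bar L_\theta(\check f) \quad (\forall n).$$

In this sense one says that an estimator "achieves the minimax rate".

[Figure: the risk curve $\bar L_\theta(\hat f)$ plotted against $\theta$.]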
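73. ### How to derive the minimax rate

A detailed account is given in Introduction to Nonparametric Estimation (Tsybakov, 2008).

Represent $\mathcal{F}$ by finitely many elements and consider the problem of choosing the best one among them. (This is easier than the original problem, so it gives a lower bound on the risk.)

$$\{f_1, \dots, f_{M_n}\} \subseteq \mathcal{F}$$

[Figure: the set $\mathcal{F}$ with representative elements $f_j$ separated by $\varepsilon_n$.]

Trade-off between the number $M_n$ and the error $\varepsilon_n$: a smaller $M_n$ makes it easier to choose the best element, but the error $\varepsilon_n$ becomes larger.

cf. Fano's inequality, Assouad's lemma.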
74. ### Minimax rate for sparse estimation

Theorem (Raskutti and Wainwright (2011)). Under certain conditions, with probability at least $1/2$,

$$\min_{\hat\beta:\,\text{estimator}} \max_{\beta^*:\,d\text{-sparse}} \|\hat\beta - \beta^*\|^2 \ge C\frac{d\log(p/d)}{n}.$$

The Lasso achieves the minimax rate (up to the $\frac{d\log(d)}{n}$ term).

Extensions of this result to Multiple Kernel Learning: Raskutti et al. (2012), Suzuki and Sugiyama (2012).
76. ### Bayesian learning theory

Statistical properties of nonparametric Bayes
- Textbook: Ghosh and Ramamoorthi (2003), Bayesian Nonparametrics. Springer, 2003.
- Convergence rates
  - General theory: Ghosal et al. (2000)
  - Dirichlet mixture: Ghosal and van der Vaart (2007)
  - Gaussian process: van der Vaart and van Zanten (2008a,b, 2011).

PAC-Bayes

$$L(\hat f_\pi) \le \inf_\rho \left\{ \int L(f)\,\rho(df) + 2\left[ \frac{\lambda C^2}{n} + \frac{\mathrm{KL}(\rho\|\pi) + \log\frac{2}{\epsilon}}{\lambda} \right] \right\} \quad \text{(Catoni, 2007)}$$

- Original papers: McAllester (1998, 1999)
- Oracle inequalities: Catoni (2004, 2007)
- Applications to sparse estimation: Dalalyan and Tsybakov (2008), Alquier and Lounici (2011), Suzuki (2012)
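78. ### Summary

Uniform bounds are important:

$$\sup_{f \in \mathcal{F}} \left\{ \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i)) - E[\ell(Y, f(X))] \right\}$$

- Rademacher complexity
- Covering number

The simpler the hypothesis set, the faster the convergence.

Optimality criteria
- Admissibility
- Minimax optimality

[Figure: the expected risk $L(f)$ and the empirical risk $\hat L(f)$ over $f$, with $f^*$, $\hat f$, and a uniform bound.]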
