Slide 1

Murphy, Machine Learning, Chapter 13: Sparse Linear Models (supplementary material). Daisuke Yoneoka, September 26, 2014.

Slide 2

Notations

- γ is a bit vector: γ_j = 1 if feature j is relevant, 0 otherwise.
- ∥γ∥_0 = Σ_{j=1}^D γ_j is the l0 pseudo-norm.
- ∥γ∥_1 = Σ_{j=1}^D |γ_j| is the l1 norm.
- ∥γ∥_2 = (Σ_{j=1}^D γ_j^2)^{1/2} is the l2 norm.
- π_0: the probability that a given feature is relevant.
- xor: exclusive or (exclusive disjunction); an operation whose output is true when an odd number of inputs are true and false when an even number are true.
- .*: element-wise array multiplication; A .* B is the element-wise product of arrays A and B (MATLAB notation).
- x_{:,j}: the j-th column vector of the matrix X (again MATLAB notation).
- subderivative: the subdifferential of a convex function f : I → R at θ_0 is the set of all g satisfying f(θ) − f(θ_0) ≥ g(θ − θ_0) for all θ ∈ I.
- NLL: negative log likelihood, NLL(θ) ≡ −Σ_{i=1}^N log p(y_i | x_i, θ).
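
A minimal numpy sketch of the notation above (the vector w, the matrices A and B, and all values are made up for illustration): the l0 pseudo-norm counts non-zero entries, and numpy's * plays the role of MATLAB's .* element-wise product.

```python
import numpy as np

w = np.array([0.0, -1.5, 0.0, 2.0])
gamma = (w != 0).astype(int)          # bit vector: gamma_j = 1 iff feature j is "active"

l0 = np.count_nonzero(w)              # ||w||_0  (number of non-zero entries)
l1 = np.sum(np.abs(w))                # ||w||_1
l2 = np.sqrt(np.sum(w ** 2))          # ||w||_2

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[10.0, 20.0], [30.0, 40.0]])
elementwise = A * B                   # numpy's "*" is MATLAB's ".*" (element-wise product)

print(gamma, l0, l1, l2)
print(elementwise)
```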

Slide 3

Introduction

- In feature selection we consider p(y|x) = p(y|f(w^T x)) and take w to be sparse.
- This has attracted a lot of attention in recent years!
- Lots of computational advantages.
- The D >> N problem (classical statistics assumes D < N); D: parameter dimension, N: sample size.
- In gene analysis, D ~ 10,000 while N ~ 100, and we want to discover as small a set of features as possible.
- Ch. 14 treats kernel-based analysis; there the design matrix is N × N, so feature selection amounts to choosing a subset of the training data (sparse kernel machines).
- In signal processing, signals are represented in a wavelet basis; sparsity is used when we want to select as few basis functions as possible.

Slide 4

Bayesian variable selection

- We want the posterior over feature subsets: p(γ|D) = e^{−f(γ)} / Σ_{γ'} e^{−f(γ')}, where f(γ) ≡ −[log p(D|γ) + log p(γ)].
- This becomes hard to interpret once the number of models is large.
- Thinking in terms of summary statistics, the posterior mode, i.e. the MAP estimate, comes to mind naturally: γ̂ = argmax p(γ|D) = argmin f(γ).
- The mode can be unrepresentative; the median model is γ̂ = {j : p(γ_j = 1|D) > 0.5}.
- This, however, requires the posterior marginal inclusion probabilities p(γ_j = 1|D).
- And computing those becomes hard as the number of dimensions grows (a brute-force sketch for tiny D follows below).
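
To make the quantities above concrete, here is a hedged brute-force sketch: for a tiny D it enumerates all 2^D bit vectors, scores them with a BIC-style f(γ) (my own toy stand-in for the book's f, with made-up data, π_0 and noise level), normalizes exp(−f) to obtain p(γ|D), and reads off the MAP model, the marginal inclusion probabilities, and the median model.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N, D, pi0 = 50, 4, 0.5
X = rng.normal(size=(N, D))
w_true = np.array([2.0, 0.0, -1.5, 0.0])
y = X @ w_true + rng.normal(scale=0.5, size=N)

def f(gamma):
    """f(gamma) = -[log p(D|gamma) + log p(gamma)], with a BIC-style approximation."""
    k = gamma.sum()
    if k == 0:
        rss = np.sum(y ** 2)
    else:
        Xg = X[:, gamma.astype(bool)]
        w_hat, *_ = np.linalg.lstsq(Xg, y, rcond=None)
        rss = np.sum((y - Xg @ w_hat) ** 2)
    log_lik = -0.5 * N * np.log(rss / N)          # Gaussian log-likelihood up to constants
    log_prior = k * np.log(pi0) + (D - k) * np.log(1 - pi0)
    return -(log_lik - 0.5 * k * np.log(N) + log_prior)

gammas = np.array(list(itertools.product([0, 1], repeat=D)))
scores = np.array([f(g) for g in gammas])
post = np.exp(-(scores - scores.min()))
post /= post.sum()                                         # p(gamma|D) over all 2^D models

map_model = gammas[post.argmax()]                          # posterior mode (MAP)
incl = (post[:, None] * gammas).sum(axis=0)                # p(gamma_j = 1 | D)
median_model = (incl > 0.5).astype(int)                    # median model
print(map_model, incl.round(2), median_model)
```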

Slide 5

Spike and slab model

- The posterior is p(γ|D) ∝ p(γ) p(D|γ).
- The prior is p(γ) = Π_{j=1}^D Ber(γ_j|π_0) = π_0^{∥γ∥_0} (1 − π_0)^{D−∥γ∥_0}.
- The likelihood is p(D|γ) = p(y|X, γ) = ∫∫ p(y|X, w, γ) p(w|γ, σ^2) p(σ^2) dw dσ^2.
- The prior on w is p(w_j|γ_j, σ^2) = δ_0(w_j) if γ_j = 0, and N(w_j|0, σ^2 σ_w^2) if γ_j = 1 (x and y are assumed standardized).
- The first case is a spike at the origin; as σ_w → ∞, p(w_j|γ_j = 1) approaches a uniform distribution, hence the "slab" (Zou 2007).
- The marginal likelihood can be approximated with BIC: log p(D|γ) ≈ log p(y|X, ŵ_γ, σ̂^2) − (∥γ∥_0 / 2) log N, where ∥γ∥_0 plays the role of the degrees of freedom.
- Hence log p(γ|D) ≈ log p(y|X, ŵ_γ, σ̂^2) − (∥γ∥_0 / 2) log N − λ∥γ∥_0 + const, where the λ∥γ∥_0 term comes from the prior p(γ).
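
A minimal sketch of drawing weights from the spike-and-slab prior defined above; D, π_0, σ and σ_w are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
D, pi0, sigma, sigma_w = 10, 0.25, 1.0, 5.0

gamma = rng.binomial(1, pi0, size=D)                    # gamma_j ~ Ber(pi0)
w = np.where(gamma == 1,
             rng.normal(0.0, sigma * sigma_w, size=D),  # slab: wide Gaussian
             0.0)                                       # spike: point mass at 0
print(gamma)
print(np.round(w, 2))
```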

Slide 6

From the Bernoulli-Gaussian model to l0 regularization

- Model: y_i | x_i, w, γ, σ^2 ∼ N(Σ_j γ_j w_j x_ij, σ^2), γ_j ∼ Ber(π_0), w_j ∼ N(0, σ_w^2).
- Called the Bernoulli-Gaussian model or the binary mask model (γ_j masks out w_j).
- Binary mask: γ_j → y ← w_j, vs. slab: γ_j → w_j → y.
- γ_j and w_j are not separately identifiable; only the product γ_j w_j is.
- Still, it has its advantages: it leads to objectives that non-Bayesians are already familiar with.
- The joint prior is p(γ, w) ∝ N(w|0, σ_w^2 I) π_0^{∥γ∥_0} (1 − π_0)^{D−∥γ∥_0}.
- The (scaled negative) log posterior is then f(γ, w) ≡ −2σ^2 log p(γ, w, y|X) = ∥y − X(γ .* w)∥^2 + (σ^2/σ_w^2) ∥w∥^2 + λ∥γ∥_0 + const, where λ ≡ 2σ^2 log((1 − π_0)/π_0).
- Writing w_{−γ} = 0 for the entries with γ_j = 0 and w_γ for those with γ_j = 1, letting σ_w^2 → ∞ gives f(γ, w) = ∥y − X_γ w_γ∥_2^2 + λ∥γ∥_0. Note that this resembles the BIC expression above.
- l0 regularization: drop γ and let the support of w encode which features matter, giving f(w) = ∥y − Xw∥_2^2 + λ∥w∥_0 (a numeric sketch follows below).
- This turns the optimization over the binary γ ∈ {0, 1}^D into one over the continuous w, but the second term is still hard to optimize!
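
A small numeric check (toy data, arbitrary λ, my own choice of γ and w) that the masked objective f(γ, w) in the σ_w → ∞ limit and the l0 form f(w) agree once γ is absorbed into the support of w, as the slide argues.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, lam = 30, 5, 4.0
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, 0.0, 0.0, -2.0, 0.0]) + rng.normal(scale=0.3, size=N)

def f_masked(gamma, w):
    """||y - X(gamma .* w)||^2 + lam * ||gamma||_0  (sigma_w -> infinity limit)."""
    return np.sum((y - X @ (gamma * w)) ** 2) + lam * np.count_nonzero(gamma)

def f_l0(w):
    """||y - Xw||^2 + lam * ||w||_0  (gamma encoded in the support of w)."""
    return np.sum((y - X @ w) ** 2) + lam * np.count_nonzero(w)

gamma = np.array([1, 0, 0, 1, 0])
w = np.array([1.1, 0.7, -0.2, -1.9, 0.0])
print(f_masked(gamma, w), f_l0(gamma * w))   # identical once w is masked by gamma
```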

Slide 7

Algorithms

- Since γ is a bit vector, exhaustive enumeration of all models is hopeless, so we resort to heuristics.
- Wrapper method: search over the model family while evaluating a cost f(γ) (e.g. an error rate), computing argmax p(D|w) or ∫ p(D|w) p(w) dw for each candidate. Intuitively, you wrap a function fun that implements the learning algorithm and apply it repeatedly to subsets of the features, computing a score each time (a sketch follows below).
- Key to efficiency: how to update the score for the previous γ into the score for γ' ⇔ efficiently update the sufficient statistics of the cost f(γ) ⇔ make f(γ) depend only on X_γ and change γ into γ' by a small move (adding or removing a single variable); a QR decomposition then lets us update X_γ^T X_γ into X_{γ'}^T X_{γ'} incrementally.
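
A hedged sketch of the wrapper idea: `score` refits the model on X_γ and returns a cost f(γ), and `neighbours` enumerates the single-flip moves. The function names, the l0-penalized RSS cost and the toy data are my own choices; a real implementation would update a QR factorization of X_γ incrementally instead of refitting from scratch, as the slide notes.

```python
import numpy as np

def score(gamma, X, y, lam=4.0):
    """Cost f(gamma): residual sum of squares of the refit model plus an l0 penalty."""
    if gamma.sum() == 0:
        rss = np.sum(y ** 2)
    else:
        Xg = X[:, gamma.astype(bool)]
        w, *_ = np.linalg.lstsq(Xg, y, rcond=None)   # refit from scratch (no QR update)
        rss = np.sum((y - Xg @ w) ** 2)
    return rss + lam * gamma.sum()

def neighbours(gamma):
    """All bit vectors differing from gamma in exactly one position (add/remove one variable)."""
    for j in range(len(gamma)):
        g = gamma.copy()
        g[j] = 1 - g[j]
        yield g

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))
y = X[:, 1] * 2.0 + rng.normal(scale=0.2, size=40)
gamma0 = np.zeros(6, dtype=int)
best = min(neighbours(gamma0), key=lambda g: score(g, X, y))
print(best, score(best, X, y))   # the single-variable model on feature 1 should win
```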

Slide 8

Greedy search 1

- Consider optimizing the l0-regularized objective. The structure of the least-squares problem can be exploited (for details see Miller 2002; Soussen et al. 2010).
- Single best replacement (SBR): greedy hill climbing over the neighboring models reachable by flipping a single bit of γ. Since the goal is a sparse solution, initialize with γ = 0 and keep adding or removing variables until no move improves the score.
- Orthogonal least squares: if λ = 0 (i.e. there is no penalty from the prior p(γ)), it suffices to add variables forward only. This is called orthogonal least squares, or greedy forward selection. The error is then a monotonically decreasing function of ∥γ∥_0. The update is γ^{(t+1)} = γ^{(t)} ∪ {j*}, where j* = argmin_{j ∉ γ^{(t)}} min_w ∥y − X_{γ^{(t)} ∪ {j}} w∥^2.
- Orthogonal matching pursuit (OMP): the method above is expensive; OMP is a cheaper approximation. The next candidate is found by solving j* = argmin_{j ∉ γ^{(t)}} min_β ∥y − Xw_t − β x_{:,j}∥^2 with w_t held fixed. The inner minimization has an immediate closed-form solution, β = x_{:,j}^T (y − Xw_t) / (x_{:,j}^T x_{:,j}). This amounts to selecting the column x_{:,j} most correlated with the current residual y − Xw_t; with the enlarged feature set, w_{t+1} is then recomputed (a sketch follows below).
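
A minimal sketch of OMP as described above: pick the column most correlated with the current residual, add it to the support, and refit by least squares on the selected columns. The names `omp` and `k_max` and the toy data are illustrative choices, not from the slides.

```python
import numpy as np

def omp(X, y, k_max):
    """Return (w, support) after k_max orthogonal matching pursuit steps."""
    N, D = X.shape
    support, w = [], np.zeros(D)
    col_norm_sq = np.sum(X ** 2, axis=0)                # x_{:,j}^T x_{:,j} for each column
    for _ in range(k_max):
        r = y - X @ w                                   # residual with w_t held fixed
        gain = (X.T @ r) ** 2 / col_norm_sq             # reduction in ||r||^2 per column
        gain[support] = -np.inf                         # already-selected columns are excluded
        support.append(int(np.argmax(gain)))            # j* = most correlated column
        Xs = X[:, support]
        w_s, *_ = np.linalg.lstsq(Xs, y, rcond=None)    # refit on the new support
        w = np.zeros(D)
        w[support] = w_s
    return w, support

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 20))
y = X[:, [2, 7]] @ np.array([3.0, -2.0]) + rng.normal(scale=0.1, size=100)
w_hat, S = omp(X, y, k_max=2)
print(S, np.round(w_hat[S], 2))   # should recover columns 2 and 7
```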

Slide 9

Greedy search 2 (continued)

- Matching pursuit: the same as sparse boosting (least squares boosting); covered in Chapter 16.
- Backwards selection: start from the saturated model and remove variables one at a time. This generally gives better results than forward selection, because each inclusion/exclusion decision is made in the context of all the other variables.
- FoBa: the forward-backward algorithm. Similar to SBR, except that the next candidate is chosen OMP-style.
- Bayesian matching pursuit: similar to OMP, but instead of using the squared error as the objective it uses a Bayesian marginal likelihood scoring criterion. It uses beam search (when the number of branches exceeds a preset beam width, the worst branches are pruned).

Slide 10

Stochastic search

- Instead of always moving to the best neighbor (greedy search), choose the next state stochastically.
- If you want the posterior itself, use MCMC.
- Since the proposal only perturbs γ slightly, obtaining p(γ'|D) from p(γ|D) is relatively easy (see O'Hara and Sillanpaa 2009 for details).
- In a discrete state space, however, full MCMC is not really needed, because the unnormalized probability p(γ|D) ∝ exp(−f(γ)) can be computed directly for any state; previously visited states never have to be revisited, which improves efficiency.
- To improve efficiency further, build a set S of high-scoring models and approximate the posterior by p(γ|D) ≈ e^{−f(γ)} / Σ_{γ'∈S} e^{−f(γ')} (Heaton and Scott 2009). (A sketch of both ideas follows below.)
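
A hedged illustration of the two ideas above: a Metropolis-style single-bit-flip search over γ with target proportional to exp(−f(γ)), caching every distinct model visited in a set S so the posterior can be approximated by normalizing exp(−f) over S only. This is my own generic sketch (the score function f, e.g. the wrapper score sketched earlier, is passed in), not the specific algorithm cited in the text.

```python
import numpy as np

def stochastic_search(f, D, n_iter=2000, seed=0):
    """Bit-flip Metropolis search over gamma; returns (models, approximate posterior)."""
    rng = np.random.default_rng(seed)
    gamma = np.zeros(D, dtype=int)
    fg = f(gamma)
    S = {tuple(gamma): fg}                       # cache: every distinct model is scored once
    for _ in range(n_iter):
        prop = gamma.copy()
        j = rng.integers(D)
        prop[j] = 1 - prop[j]                    # propose flipping one bit of gamma
        fp = S.get(tuple(prop))
        if fp is None:
            fp = f(prop)
            S[tuple(prop)] = fp
        if rng.random() < np.exp(min(0.0, fg - fp)):   # Metropolis acceptance for exp(-f)
            gamma, fg = prop, fp
    models = np.array(list(S.keys()))
    scores = np.array(list(S.values()))
    post = np.exp(-(scores - scores.min()))
    post /= post.sum()                           # p(gamma|D) ~ e^{-f} normalized over S
    return models, post
```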

Slide 11

EM and variational inference

- Estimating the slab model (γ_j → w_j → y) with the EM algorithm: E step: compute p(γ_j|w_j)? M step: optimize over w?
- This does not work as stated, because δ_0(w_j) and N(w_j|0, σ_w^2) in (13.11) cannot be compared → resolve it by approximating δ_0(w_j) with a narrow Gaussian (the local-minima problem remains).
- Estimating the Bernoulli-Gaussian model (γ_j → y ← w_j) with EM: the posterior p(γ|D, w) is hard to compute.
- However, a mean-field approximation Π_j q(γ_j) q(w_j) can be computed (Huang et al. 2007; Rattray et al. 2009).

Slide 12

l1 regularization: basics

- The l0 penalty ∥w∥_0 is neither convex nor continuous! → Use a convex approximation.
- Part of the difficulty of computing p(γ|D) comes from γ ∈ {0, 1}^D being discrete.
- Replace the prior p(w) by a continuous distribution, the Laplace: p(w|λ) = Π_{j=1}^D Lap(w_j|0, 1/λ) ∝ Π_{j=1}^D e^{−λ|w_j|}.
- The penalized negative log likelihood is f(w) = −log p(D|w) − log p(w|λ) = NLL(w) + λ∥w∥_1.
- This can be viewed as a convex approximation of the non-convex l0 objective argmin_w NLL(w) + λ∥w∥_0.
- In the linear regression case (known as BPDN, basis pursuit denoising): f(w) = Σ_{i=1}^N (1/(2σ^2)) (y_i − w^T x_i)^2 + λ∥w∥_1, which is equivalent (up to scale) to minimizing RSS(w) + λ'∥w∥_1 with λ' = 2λσ^2 (a numeric check follows below).
- Putting a zero-mean Laplace prior on w and computing the MAP estimate is what we call l1 regularization.
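
A small numeric check (toy data, arbitrary σ and λ) of the equivalence stated above: the MAP objective under a zero-mean Laplace prior equals RSS(w) + λ'∥w∥_1 divided by the positive constant 2σ^2, with λ' = 2λσ^2, so both have the same minimizer.

```python
import numpy as np

rng = np.random.default_rng(4)
N, D, sigma, lam = 40, 3, 0.7, 1.3
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, 0.0, -0.5]) + rng.normal(scale=sigma, size=N)
w = rng.normal(size=D)                                     # an arbitrary test point

nll = np.sum((y - X @ w) ** 2) / (2 * sigma ** 2)          # Gaussian NLL up to constants
neg_log_prior = lam * np.sum(np.abs(w))                    # -log Laplace prior up to constants
f_map = nll + neg_log_prior

lam_prime = 2 * lam * sigma ** 2
f_scaled = (np.sum((y - X @ w) ** 2) + lam_prime * np.sum(np.abs(w))) / (2 * sigma ** 2)
print(np.isclose(f_map, f_scaled))                         # True: same objective up to scale
```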

Slide 13

Why does l1 regularization yield sparse solutions?

- We restrict attention to linear regression, but everything extends to GLMs in general.
- The objective is min_w RSS(w) + λ∥w∥_1, which is equivalent to the lasso constrained form: min_w RSS(w) s.t. ∥w∥_1 ≤ B.
- A small B corresponds to a large λ.
- For comparison, min_w RSS(w) + λ∥w∥_2^2 is equivalent to the ridge constrained form: min_w RSS(w) s.t. ∥w∥_2^2 ≤ B.

Figure 13.3: l1 (left) vs l2 (right) regularization.

Slide 14

Optimality conditions for lasso

- The lasso is an example of non-smooth (non-differentiable) optimization.
- The objective is min_w RSS(w) + λ∥w∥_1.
- The derivative of the first term is ∂RSS(w)/∂w_j = a_j w_j − c_j, where a_j = 2 Σ_{i=1}^N x_{ij}^2 and c_j = 2 Σ_{i=1}^N x_{ij} (y_i − w_{−j}^T x_{i,−j}), the inner product of the j-th feature with the residual obtained by excluding feature j.
- c_j measures how relevant feature j is for predicting y.
- The subgradient of the full objective is
  ∂_{w_j} f(w) = (a_j w_j − c_j) + λ ∂_{w_j} ∥w∥_1 =
    {a_j w_j − c_j − λ}      if w_j < 0,
    [−c_j − λ, −c_j + λ]     if w_j = 0,
    {a_j w_j − c_j + λ}      if w_j > 0.
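
The standard consequence of the subgradient conditions above (not spelled out on this slide) is that the one-dimensional subproblem in w_j is solved exactly by soft thresholding, w_j = soft(c_j/a_j, λ/a_j). The sketch below uses this as a coordinate-descent update for the lasso; the function names, n_sweeps and the toy data are my own choices for illustration.

```python
import numpy as np

def soft(u, t):
    """Soft-thresholding operator: sign(u) * max(|u| - t, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for min_w RSS(w) + lam * ||w||_1."""
    N, D = X.shape
    w = np.zeros(D)
    a = 2 * np.sum(X ** 2, axis=0)                  # a_j = 2 * sum_i x_ij^2
    for _ in range(n_sweeps):
        for j in range(D):
            r_j = y - X @ w + X[:, j] * w[j]        # residual excluding feature j
            c_j = 2 * X[:, j] @ r_j                 # c_j = 2 * sum_i x_ij * (residual w/o j)
            w[j] = soft(c_j / a[j], lam / a[j])     # exact solution of the 1-D subproblem
    return w

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))
y = X[:, 0] * 3.0 + rng.normal(scale=0.1, size=100)
print(np.round(lasso_coordinate_descent(X, y, lam=5.0), 2))   # typically zeros except w_0
```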