kilometer
June 09, 2018

# BeginnerSession1_70th_TokyoR

Tutorial material on Bayesian statistics.


## Transcript

1. ### 70th Tokyo.R @kilometer BeginneR Session 1 -- Bayesian Modeling --

2018.06.09 at Microsoft Co.

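3. ### Who!? Name: Mimura (@kilometer). Occupation: postdoc (Ph.D. in engineering). Specialties: behavioral neuroscience (primates), brain imaging,

medical systems engineering. R experience: about 10 years. Currently into: banyan trees (gajumaru).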

8. ### BeginneR Advanced Hoxo_m If I have seen further it is

by standing on the sholders of Giants. -- Sir Isaac Newton, 1676

14. ### What is modeling?

Hypothesis Driven: “strong” hypothesis + data. Data Driven: “weak” hypothesis + data.
15. ### What is modeling?

[Diagram: a model f applied to data X, drawn once for the Hypothesis Driven case and once for the Data Driven case.]
16. ### What is modeling? A/B test (Hypothesis Driven; though I have never actually run one!)

A or B? HA: A is better. HB: B is better. H0: We have to choose the better one of the two. Strong hypothesis, simple data.
17. ### What is modeling? Meta Analysis (Data Driven)

H0: There is a best (or better) way. Weak hypothesis, complex data. (What does everyone call this, anyway?)
18. ### What is modeling? Data Driven Analysis Hypothesis Driven Analysis How

to do? What to do? Decision Making Weak Hypothesis Strong Hypothesis Simple Data Complex Data
19. ### What is modeling? Data Driven Hypothesis Driven How to do?

What to do? Decision Making Weak Hypothesis Strong Hypothesis Simple Data Complex Data Simple Model Complex Model
20. ### What is modeling? Data Driven Hypothesis Driven How to do?

What to do? Decision Making Weak Hypothesis Strong Hypothesis Simple Data Complex Data Simple Model Complex Model Narrow sense Broad sense

22. ### A or B? HA: A is better. HB: B is better.

H0: We have to choose the better one of the two. ＊ A is better.
23. ### There is only one difference between a madman and me.

The madman thinks he is sane. I know I am mad. -- Salvador Dalí, in “Dalí is a dilly,” The American Magazine, 162(1), 28–9, 107–9 (1956).
24. ### A or B? HA: A is better. HB: B is better.

H0: We have to choose the better one of the two. H1: There is a difference θ between A and B. A > B: A is better.

26. ### Dice with α faces (regular polyhedron)

Truth / Knowledge / Hypothesis: α = ? Observation: $x = 5$.
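27. ### Dice with α faces, observation $x = 5$

Likelihood: $P(x = 5 \mid \alpha = 4) = 0$, $P(x = 5 \mid \alpha = 6) = \frac{1}{6}$, $P(x = 5 \mid \alpha = 8) = \frac{1}{8}$, $P(x = 5 \mid \alpha = 12) = \frac{1}{12}$, $P(x = 5 \mid \alpha = 20) = \frac{1}{20}$. The maximum likelihood is at $\alpha = 6$.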
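28. ### Dice with α faces, data $X = \{5, 4, 3, 4, 2, 1, 2, 3, 1, 4\}$

Likelihood: $P(X = x \mid \alpha = 4) = 0$, $P(X = x \mid \alpha = 6) = \frac{1}{6^{10}}$, $P(X = x \mid \alpha = 8) = \frac{1}{8^{10}}$, $P(X = x \mid \alpha = 12) = \frac{1}{12^{10}}$, $P(X = x \mid \alpha = 20) = \frac{1}{20^{10}}$. The maximum likelihood is at $\alpha = 6$.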
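The likelihood values on this slide can be checked numerically. The following is a minimal sketch in R, assuming a fair die for every candidate α; the helper name `likelihood` is introduced here for illustration and does not come from the slides.

```r
# Candidate numbers of faces (regular polyhedra) and the ten observed rolls.
alpha <- c(4, 6, 8, 12, 20)
X <- c(5, 4, 3, 4, 2, 1, 2, 3, 1, 4)

# Likelihood of the whole data set for a fair die with `a` faces
# (hypothetical helper): (1/a)^n if every roll is possible,
# 0 if any roll exceeds the number of faces.
likelihood <- function(a, x) {
  if (max(x) > a) 0 else (1 / a)^length(x)
}

lik <- sapply(alpha, likelihood, x = X)
names(lik) <- alpha
print(lik)

# Maximum likelihood estimate of alpha.
alpha_ml <- alpha[which.max(lik)]
print(alpha_ml)
```

Because the roll 5 is impossible on a four-sided die, α = 4 gets likelihood 0, and among the remaining candidates the smallest die has the largest likelihood.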
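29. ### Could you find α?

friend: “Could you find α?” you: “Yes. α is estimated at 6!!” friend: “Why do you think so?” you: “Because $\arg\max_{\alpha} \{P(X \mid \alpha)\} = 6$!!” friend: “Hmmmm…, well.., what is $P(\alpha = 6)$?” you: “Oh, it is $\frac{1}{6^{10}}$!! ….nnNNNO!!! WHAT!!????”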
30. ### Dice with α faces, data $X = \{5, 4, 3, 4, 2, 1, 2, 3, 1, 4\}$

you (before): maximum likelihood, $P(X = x \mid \alpha = 6) = \frac{1}{6^{10}}$. friend: “Hmmmm… Well.., what is $P(\alpha = 6)$?” you (after): “$P(\alpha = 6 \mid X)$!!??”
31. ### = 1 , … , ∞ , ∀ ∈ ℕ

= 1 , … , p realization x <- sample(, 1) ∶= ∀ = || sample space (can NEVER get) stochastic variable probability distribution <- c(1, 1, 1, 1, 1, 2, 2, 3, 4, 5, 5) = hist(, freq = FALSE, label = TRUE) = 2 ~ ⇔ t → : number of trial
32. ### ∶ → = 1 , … , ∞ , ∀

∈ ℕ = 1 , … , p realization sample space (can NEVER get) = = ∀ = || probability distribution g <- function( = 6) { map(1:∞, ~sample(1: , n=10, replace = TRUE)) } = <- g() X <- density() ~ x → t → ⇔ ~(|) statistical modeling outcome function of face dice
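The slide's pseudocode rolls the die over `1:∞`, which cannot be evaluated. A finite simulation gives the same intuition; the sketch below assumes a fair six-sided die, and the argument `n_trial`, the seed, and the names `omega_hat` and `p_hat` are choices made here, not taken from the slides. Note that base R `sample()` takes `size`, not `n`.

```r
set.seed(70)  # arbitrary seed, only for reproducibility

# Finite stand-in for the outcome function g: roll an alpha-faced die
# n_trial times instead of infinitely often.
g <- function(alpha = 6, n_trial = 1e5) {
  sample(1:alpha, size = n_trial, replace = TRUE)
}

omega_hat <- g(alpha = 6)

# Relative frequencies approximate P(x | alpha = 6) = 1/6 for each face.
p_hat <- table(omega_hat) / length(omega_hat)
print(round(p_hat, 3))

# One realization x drawn from the simulated outcomes.
x <- sample(omega_hat, 1)
print(x)
```

As the number of trials grows, the relative frequencies move closer to 1/6, which is the sense in which the full sample space can only be approached and never obtained.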
33. ### probability distribution sample space | = ~ (|) ∶ →

parameter = 1 , … , p ∈ | realization X <- map(1:∞, ~g()) x <- sample(X, 1) = 1 , … , ∞ , ∀ ∈ ℕ statistical modeling

6 = 12 = 20
35. ### ~z (|) : → ~ (|) : → ∈ ∈

| ← = {1 , … , ∞ } x | ← , ∈ 4, 6, 8, 12, 20 t ← = 1 , … , ∞ x t ← , ∀ ∈ ℕ, ∀ ≤ , (|) (|) statistical modeling statistical modeling
36. ### ∀ ≤ | ← = {1 , … , ∞

} x | ← , ∈ {4, 6, 8, 12, 20} t ← = 1 , … , ∞ x t ← , ∀ ∈ ℕ, ∀ ≤ , ~(|) ~ (|)

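38. ### Bayes' theorem

$P(A \cap B) = P(B \cap A)$, so $P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$, and therefore $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$, where $P(B) \neq 0$.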
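39. ### Bayes' theorem for the dice

$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$, where $P(B) \neq 0$. With data $X$ and parameter $\alpha$: $P(\alpha \mid X) = \frac{P(X \mid \alpha)\,P(\alpha)}{P(X)}$, with $x \sim P(X \mid \alpha)$ and $\alpha \sim P(\alpha \mid X)$.
40. ### Likelihood

$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$, where $P(B) \neq 0$. In $P(\alpha \mid X) = \frac{P(X \mid \alpha)\,P(\alpha)}{P(X)}$, the factor $P(X \mid \alpha)$ is the likelihood, with $x \sim P(X \mid \alpha)$ and $\alpha \sim P(\alpha \mid X)$.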
41. ### = = likelihood = ) ∗ () ~ (|) ~

(|) : → : → | ← = 1 , … , ∞ t ← = 1 , … , ∞ , ∈ 4, 6, 8, 12, 20
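42. ### Marginalization

$P(X) = \sum_i \{P(\alpha_i)\,P(X \mid \alpha_i)\}$ for all $i$, with $\alpha \in \{4, 6, 8, 12, 20\}$. In $P(\alpha \mid X) = \frac{P(X \mid \alpha)\,P(\alpha)}{P(X)}$, $P(X \mid \alpha)$ is the likelihood, with $x \sim P(X \mid \alpha)$ and $\alpha \sim P(\alpha \mid X)$.
43. ### Bayes' theorem with the marginalized denominator

$P(\alpha \mid X) = \frac{P(X \mid \alpha)\,P(\alpha)}{\sum_i \{P(\alpha_i)\,P(X \mid \alpha_i)\}} = \frac{P(X \mid \alpha)\,P(\alpha)}{P(X)}$, where $P(X \mid \alpha)$ is the likelihood, with $x \sim P(X \mid \alpha)$ and $\alpha \sim P(\alpha \mid X)$.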
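44. ### Dice with α faces, data $X = \{5, 4, 3, 4, 2, 1, 2, 3, 1, 4\}$

Likelihood: $P(X = x \mid \alpha = 4) = 0$, $P(X = x \mid \alpha = 6) = \frac{1}{6^{10}}$, $P(X = x \mid \alpha = 8) = \frac{1}{8^{10}}$, $P(X = x \mid \alpha = 12) = \frac{1}{12^{10}}$, $P(X = x \mid \alpha = 20) = \frac{1}{20^{10}}$. The maximum likelihood is at $\alpha = 6$.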
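45. ### The prior depends on the sample space

$P(\alpha \mid X) = \frac{P(X \mid \alpha)\,P(\alpha \mid \Omega)}{\sum_i \{P(\alpha_i \mid \Omega)\,P(X \mid \alpha_i)\}}$, where $P(X \mid \alpha)$ is the likelihood and the prior $P(\alpha \mid \Omega)$ is conditioned on the sample space $\Omega = \{\omega_1, \dots, \omega_\infty\}$, $\forall \omega \in \mathbb{N}$ (which we can NEVER get).
46. ### The sample space CAN NEVER BE OBTAINED

$P(\alpha \mid X) = \frac{P(X \mid \alpha)\,P(\alpha \mid \Omega)}{\sum_i \{P(\alpha_i \mid \Omega)\,P(X \mid \alpha_i)\}}$, with $\Omega = \{\omega_1, \dots, \omega_\infty\}$, $\forall \omega \in \mathbb{N}$. The sample space CAN NEVER BE OBTAINED, so $P(\alpha \mid \Omega)$ is unknown.
47. ### Uniform prior

Approximate $\forall P(\alpha \mid \Omega) \cong \forall P(\alpha \mid X') = \frac{1}{5}$ for $\alpha \in \{4, 6, 8, 12, 20\}$. Then $P(\alpha \mid X) \cong \frac{P(X \mid \alpha)\,P(\alpha \mid X')}{\sum_i \{P(\alpha_i \mid X')\,P(X \mid \alpha_i)\}}$.
48. ### Posterior under the uniform prior

$P(\alpha \mid X) \cong \frac{P(X \mid \alpha)\,P(\alpha \mid X')}{\sum_i \{P(\alpha_i \mid X')\,P(X \mid \alpha_i)\}}$, where $\forall P(\alpha \mid X') = 1/5$. The prior cancels, so $P(\alpha \mid X) = \frac{P(X \mid \alpha)}{\sum_i \{P(X \mid \alpha_i)\}} \approx \frac{P(X \mid \alpha)}{1.7485 \times 10^{-8}}$, where the denominator is $P(X \mid \alpha = 4) + P(X \mid \alpha = 6) + P(X \mid \alpha = 8) + P(X \mid \alpha = 12) + P(X \mid \alpha = 20)$.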
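The posterior under a uniform prior can be reproduced in a few lines. This is a minimal sketch, assuming fair dice and the uniform prior of 1/5 per candidate; the variable names are chosen here for illustration.

```r
alpha <- c(4, 6, 8, 12, 20)
X <- c(5, 4, 3, 4, 2, 1, 2, 3, 1, 4)

# Likelihood P(X | alpha): 0 when a roll exceeds the number of faces.
lik <- ifelse(max(X) > alpha, 0, (1 / alpha)^length(X))

# Uniform prior over the five candidate dice.
prior <- rep(1 / length(alpha), length(alpha))

# Evidence (marginal likelihood) and posterior by Bayes' theorem.
evidence <- sum(lik * prior)
posterior <- lik * prior / evidence
names(posterior) <- alpha

# Sum of likelihoods: the denominator once the uniform prior has cancelled.
print(sum(lik))
print(round(100 * posterior, 4))  # posterior in percent

# MAP (maximum a posteriori) estimate.
alpha_map <- alpha[which.max(posterior)]
print(alpha_map)
```

With these assumptions the sum of likelihoods comes out at about 1.75e-08 and the posterior for α = 6 at about 94.58%.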
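49. ### Dice with α faces, data $X = \{5, 4, 3, 4, 2, 1, 2, 3, 1, 4\}$

you: maximum likelihood, $P(X = x \mid \alpha = 6) = \frac{1}{6^{10}}$. friend: “Hmmmm… Well.., what is $P(\alpha = 6)$?” you: “$P(\alpha = 6 \mid X) \cong \frac{P(X \mid \alpha = 6)}{1.7485 \times 10^{-8}} \approx 94.58\%$.”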
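50. ### Prior probability, posterior probability, MAP estimation

Prior probability: $P(\alpha = 4 \mid X') = P(\alpha = 6 \mid X') = P(\alpha = 8 \mid X') = P(\alpha = 12 \mid X') = P(\alpha = 20 \mid X') = \frac{1}{5}$. Posterior probability: $P(\alpha = 4 \mid X) = 0\%$, $P(\alpha = 6 \mid X) \approx 94.58\%$, $P(\alpha = 8 \mid X) \approx 5.32\%$, $P(\alpha = 12 \mid X) \approx 0.09\%$, $P(\alpha = 20 \mid X) \approx 0.0005\%$. MAP (maximum a posteriori) estimation: $\arg\max_{\alpha} \{P(\alpha \mid X)\} = 6$.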
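51. ### Predictive probability: dice with α faces, $X = \{5, 4, 3, 4, 2, 1, 2, 3, 1, 4\}$

$P(x_{11} \le 6 \mid \alpha = 6)\,P(\alpha = 6 \mid X) \approx 94.58\%$, $P(x_{11} \le 6 \mid \alpha = 4)\,P(\alpha = 4 \mid X) = 0\%$, $P(x_{11} \le 6 \mid \alpha = 8)\,P(\alpha = 8 \mid X) \approx 3.99\%$, $P(x_{11} \le 6 \mid \alpha = 12)\,P(\alpha = 12 \mid X) \approx 0.046\%$, $P(x_{11} \le 6 \mid \alpha = 20)\,P(\alpha = 20 \mid X) \approx 0.0001\%$. Summing over α gives the predictive probability $P(x_{11} \le 6 \mid X) \approx 98.62\%$.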
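The predictive probability is the posterior-weighted average of the per-die probabilities. A minimal sketch, again assuming fair dice and a uniform prior; `p_le6` and `pred` are names introduced here.

```r
alpha <- c(4, 6, 8, 12, 20)
X <- c(5, 4, 3, 4, 2, 1, 2, 3, 1, 4)

# Posterior P(alpha | X); the uniform prior cancels in the normalization.
lik <- ifelse(max(X) > alpha, 0, (1 / alpha)^length(X))
posterior <- lik / sum(lik)

# P(x11 <= 6 | alpha): 1 for dice with at most 6 faces, otherwise 6 / alpha.
p_le6 <- pmin(6, alpha) / alpha

# Predictive probability P(x11 <= 6 | X): average over the posterior.
pred <- sum(p_le6 * posterior)
print(round(100 * pred, 2))
```

The α = 8 term contributes most of the gap between the posterior for α = 6 (about 94.58%) and the predictive probability (about 98.62%), because an eight-sided die still shows a face of 6 or less three times out of four.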
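52. ### Dice with α faces, $X = \{5, 4, 3, 4, 2, 1, 2, 3, 1, 4\}$

you: “OK, let’s try $x_{11}$!!” friend: “$P(x_{11} \le 6 \mid X) \approx 98.62\%$. And $P(\alpha = 6 \mid X) \approx 94.58\%$.”
53. ### Dice with α faces, $X = \{5, 4, 3, 4, 2, 1, 2, 3, 1, 4\}$

you: “OK, let’s try $x_{11}$!!” friend: “$P(x_{11} \le 6 \mid X) \approx 98.62\%$. And $P(\alpha = 6 \mid X) \approx 94.58\%$.” Result: $x_{11} = 8$.
54. ### Dice with α faces, $X = \{5, 4, 3, 4, 2, 1, 2, 3, 1, 4\}$

you: “OK, let’s try $x_{11}$!!” friend: “$P(x_{11} \le 6 \mid X) \approx 98.62\%$. And $P(\alpha = 6 \mid X) \approx 94.58\%$.” Result: $x_{11} = 8$, so $P(\alpha = 6 \mid \{X, x_{11}\}) = 0\%$.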
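55. ### Dice with α faces: $X = \{5, 4, 3, 4, 2, 1, 2, 3, 1, 4\}$, $\acute{X} = \{X, 8\}$

For $X$: $P(\alpha \mid X) \cong \frac{P(X \mid \alpha)\,P(\alpha \mid X')}{P(X)}$ (posterior, likelihood, prior). For $\acute{X}$: $P(\alpha \mid \acute{X}) \cong \frac{P(\acute{X} \mid \alpha)\,P(\alpha \mid X'')}{P(\acute{X})}$, which needs a prior $P(\alpha \mid X'')$.
56. ### The posterior becomes the next prior

For $X$: $P(\alpha \mid X) \cong \frac{P(X \mid \alpha)\,P(\alpha \mid X')}{P(X)}$ (posterior, likelihood, prior). For $\acute{X} = \{X, 8\}$, the posterior $P(\alpha \mid X)$ serves as the prior, and only the new roll $x_{11} = 8$ enters the likelihood, because $X$ is already accounted for in $P(\alpha \mid X)$: $P(\alpha \mid \acute{X}) \cong \frac{P(x_{11} \mid \alpha)\,P(\alpha \mid X)}{P(x_{11} \mid X)}$.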
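57. ### Prior and posterior for $\alpha \in \{4, 6, 8, 12, 20\}$

Prior $P(\alpha \mid X')$ = 20%, 20%, 20%, 20%, 20%. Posterior $P(\alpha \mid X)$ ≈ 0%, 94.58%, 5.32%, 0.09%, 0.0005% for $X = \{5, 4, 3, 4, 2, 1, 2, 3, 1, 4\}$; this posterior is the prior for $\acute{X} = \{X, 8\}$. Posterior $P(\alpha \mid \acute{X})$ ≈ 0%, 0%, 98.85%, 1.14%, 0.004%.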
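Updating one roll at a time makes the "posterior becomes the next prior" step explicit. The sketch below assumes fair dice and uses only the likelihood of the new roll in each step, so that earlier rolls are not counted twice; `roll_lik` and `bayes_update` are hypothetical helper names.

```r
alpha <- c(4, 6, 8, 12, 20)
X <- c(5, 4, 3, 4, 2, 1, 2, 3, 1, 4)

# Single-roll likelihood P(x | alpha) for each candidate die.
roll_lik <- function(x) ifelse(x > alpha, 0, 1 / alpha)

# One Bayes update: posterior is proportional to likelihood times prior.
bayes_update <- function(prior, x) {
  unnorm <- roll_lik(x) * prior
  unnorm / sum(unnorm)
}

# Start from the uniform prior and fold in the ten rolls one by one.
prior <- rep(1 / 5, 5)
post_X <- Reduce(bayes_update, X, prior)

# The eleventh roll x11 = 8 updates the previous posterior.
post_new <- bayes_update(post_X, 8)
names(post_new) <- alpha
print(round(100 * post_new, 4))

# Same result as a single batch update on all eleven rolls.
batch <- ifelse(8 > alpha, 0, (1 / alpha)^11)
print(all.equal(unname(post_new), batch / sum(batch)))
```

Under these assumptions the sequential and batch routes agree, with roughly 98.85% on α = 8 and 1.14% on α = 12 after the roll of 8.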
58. ### Dice with α faces, $\acute{X} = \{5, 4, 3, 4, 2, 1, 2, 3, 1, 4, 8\}$

you: “OK!!! Let’s try $x_{12}$!! COME OOON” friend: “$P(x_{12} \le 8 \mid \acute{X}) \approx 99.62\%$. And $P(\alpha = 8 \mid \acute{X}) \approx 98.85\%$.”

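60. ### Prior, likelihood, posterior, prediction

Posterior ≅ likelihood × prior, normalized: $P(\alpha \mid X) \cong \frac{P(X \mid \alpha)\,P(\alpha)}{P(X)}$. The prior distribution and the data give the posterior distribution $P(\alpha \mid X)$; combined with the likelihood $P(x \mid \alpha)$, this gives the predictive distribution $\int P(x \mid \alpha)\,P(\alpha \mid X)\,d\alpha$.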
61. ### likelihood posterior ≅ ) ∗ () predictive distribution () (|)

Truth Information Criterion in Bayesian modeling prior likelihood | ”(t|i)∗”(i|z) | ”(||z) ”(t|i) prior distribution posterior distribution data
62. ### likelihood prior posterior ≅ ) ∗ () predictive distribution ()

(|) —˜ (| = − () Kullback–Leibler divergence Information Criterion in Bayesian modeling Truth = − log − − log = log likelihood | ”(t|i)∗”(i|z) | ”(||z) ”(t|i) prior distribution posterior distribution data expectation self-information
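63. ### Information Criterion in Bayesian modeling: Kullback–Leibler divergence

$D_{KL}(q \,\|\, p) = \mathbb{E}_q\!\left[\log \frac{q(x)}{p(x \mid X)}\right] = \int q(x) \log \frac{q(x)}{p(x \mid X)}\,dx = \int q(x) \log q(x)\,dx - \int q(x) \log p(x \mid X)\,dx = -S(x) - \int q(x) \log p(x \mid X)\,dx$, where $q(x)$ is the truth, $p(x \mid X)$ is the predictive distribution, and $S(x)$ is the entropy of $q$. Generalization error: $G := -\int q(x) \log p(x \mid X)\,dx$, so $\min_p D_{KL}(q \,\|\, p) \Leftrightarrow \min_p G$. WAIC.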
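The decomposition of the Kullback–Leibler divergence into entropy and generalization error can be checked on the dice example. This sketch assumes, purely for illustration, that the truth q(x) is a fair six-sided die and that the predictive distribution comes from the uniform-prior posterior; sums replace integrals because x is discrete.

```r
alpha <- c(4, 6, 8, 12, 20)
X <- c(5, 4, 3, 4, 2, 1, 2, 3, 1, 4)
lik <- ifelse(max(X) > alpha, 0, (1 / alpha)^length(X))
posterior <- lik / sum(lik)

faces <- 1:20

# Predictive distribution p(x | X) = sum over alpha of P(x | alpha) P(alpha | X).
p_pred <- sapply(faces, function(x) {
  sum(ifelse(x > alpha, 0, 1 / alpha) * posterior)
})

# Assumed truth q(x): a fair six-sided die (an assumption of this sketch).
q <- ifelse(faces <= 6, 1 / 6, 0)

# Only faces with q(x) > 0 contribute to the sums.
s <- q > 0
S  <- -sum(q[s] * log(q[s]))            # entropy of the truth
G  <- -sum(q[s] * log(p_pred[s]))       # generalization error
KL <- sum(q[s] * log(q[s] / p_pred[s])) # Kullback-Leibler divergence

print(c(S = S, G = G, KL = KL))
print(all.equal(KL, G - S))
```

Since S depends only on the truth, lowering G is the only way to lower the divergence, which is why G is the quantity an information criterion tries to estimate.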
64. ### likelihood prior posterior ≅ ) ∗ () predictive distribution ()

(|) Truth Information Criterion in Bayesian modeling —˜ (| = −• + Generalization error ≈ likelihood | ”(t|i)∗”(i|z) | ”(||z) ”(t|i) prior distribution posterior distribution data
65. ### likelihood prior posterior ≅ ) ∗ () likelihood posterior distribution

predictive distribution () (|) Truth Information Criterion in Bayesian modeling evidence = − log ≔ —˜ (| = −• + Generalization error ≈ likelihood | ”(t|i)∗”(i|z) | ”(||z) ”(t|i) prior distribution posterior distribution data self-information
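66. ### Information Criterion in Bayesian modeling: evidence

With the evidence $P(X)$ and the truth $q(X)$: $F := -\log P(X) = \log\!\left(\frac{q(X)}{P(X)} \cdot \frac{1}{q(X)}\right) = \log \frac{q(X)}{P(X)} - \log q(X)$. Taking the expectation over $q(X)$, $F = D_{KL}(q \,\|\, P) - \sum_p q(X_p) \log q(X_p) = D_{KL}(q \,\|\, P) + S(X)$, hence $D_{KL}(q \,\|\, P) = F - S(X)$.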
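For the dice example the free energy is simply the negative log of the evidence. A minimal sketch, assuming fair dice and the uniform prior of 1/5; the variable names are chosen here for illustration.

```r
alpha <- c(4, 6, 8, 12, 20)
X <- c(5, 4, 3, 4, 2, 1, 2, 3, 1, 4)

# Likelihood and uniform prior, as in the dice example.
lik <- ifelse(max(X) > alpha, 0, (1 / alpha)^length(X))
prior <- rep(1 / 5, 5)

# Evidence P(X): the likelihood marginalized over alpha.
evidence <- sum(lik * prior)

# Free energy F = -log P(X) (natural log).
free_energy <- -log(evidence)
print(evidence)
print(free_energy)
```

A smaller free energy means the model, prior included, assigned a higher probability to the observed data.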
67. ### likelihood prior posterior ≅ ) ∗ () likelihood | ”(t|i)∗”(i|z)

| ”(||z) ”(t|i) prior distribution posterior distribution predictive distribution data () (|) Truth Information Criterion in Bayesian modeling evidence —˜ ( | = −•(t) + —˜ ( | = − •(z) Free energy Generalization error ≈ ≈ = − log ≔ self-information

69. ### Dice with α faces (regular polyhedron)

Truth / Knowledge / Hypothesis: α = ? Observation: $X = \{5, 4, 3, 4, 2, 1, 2, 3, 1, 4\}$.
70. ### ∶ → = 1 , … , ∞ , ∀

∈ ℕ = 1 , … , realization sample space (can NEVER get) = = ∀ = || probability distribution = <- g() X <- density() ~ x → t → ⇔ ~(|) statistical modeling outcome function of face dice g <- function( = 6) { map(1:∞, ~sample(1: , n=10, replace = TRUE)) }
71. ### ~ (|) : → ∈ 4, 6, 8, 12, 20

(|) = 6 = 12 = 20 t ← = 1 , … , ∞ x t ← , ∀ ∈ ℕ, ∀ ≤ ,
72. ### ~ (|) : → (|) log ( ) = 6

= 12 = 20 (|) = 8 likelihood = 6 ∈ 4, 6, 8, 12, 20 t ← = 1 , … , ∞ x t ← , ∀ ∈ ℕ, ∀ ≤ ,
73. ### ~ (|) : → log ( ) = 6 =

12 = 20 (|) = 8 likelihood ~ (|) ~ (|) : → (|) = =
74. ### ~ (|) : → log ( ) = 6 =

12 = 20 (|) = 8 likelihood ~ (|) ~ (|) : → (|) = ) ∗ | (|) ≅ ) ∗ |′ ∑ { ∗ (|′) ∀i } = ) ∗ () () Bayes' theorem likelihood prior posterior
75. ### log ( ) = 6 = 12 = 20 (|)

= 8 likelihood = 6 ≅ 94.58% = 6 ℎ ∀ |′ = 1/5 ~ (|) : → ~ (|) ~ (|) : → (|) = ) ∗ | (|) ≅ ) ∗ |′ ∑ { ∗ (|′) ∀i } likelihood prior posterior
76. ### ~ (|) : → ~ (|) ~ (|) : →

(|) = ) ∗ | (|) ≅ ) ∗ |′ ∑ { ∗ (|′) ∀i } likelihood prior posterior log ( ) = 6 = 12 = 20 (|) = 8 likelihood = 6 ≅ 94.58% = 6 ℎ ∀ |′ = 1/5
77. ### likelihood prior posterior ≅ ) ∗ () likelihood | ”(t|i)∗”(i|z)

| ”(||z) ”(t|i) prior distribution posterior distribution predictive distribution data () (|) Truth Information Criterion in Bayesian modeling evidence —˜ ( | = −•(t) + —˜ ( | = − •(z) Free energy Generalization error ≈ ≈ = − log ≔ self-information

79. ### A or B? HA: A is better. HB: B is better.

H0: We have to choose the better one of the two. H1: There is a difference θ between A and B. A > B: A is better.
80. ### or x y HA : A is better HB :

B is better H0 : We have to choose the better one of the two. x y There is a difference between x and y A>B A is better θ H1 : = t − § § ← | ”(t|i) ©ª t ← | ”(t|i) ©¬ ←
81. ### or x y HA : A is better HB :

B is better H0 : We have to choose the better one of the two. x y There is a difference between x and y A>B A is better θ H1 : = t − § t - ← | ← | ”(|│z) ”(t│i) § - ← | ← | ”(°│±) ”(§│²)
82. ### or x y HA : A is better HB :

B is better H0 : We have to choose the better one of the two. x y There is a difference between x and y A>B A is better θ H1 : ³ ← [t , § ] t - ← | ← | ”(|│z) ”(t│i) § - ← | ← | ”(°│±) ”(§│²)
83. ### or x y HA : A is better HB :

B is better H0 : We have to choose the better one of the two. x y There is a difference between x and y A>B A is better θ H1 : ³ ← [t , § ] t - ← | ← | ” ” § - ← | ← | ” ”

87. ### What is modeling?

[Diagram: a model f applied to data X, drawn once for the Hypothesis Driven case and once for the Data Driven case.]
88. ### ∶ → = 1 , … , ∞ , ∀

∈ ℕ = 1 , … , p realization sample space (can NEVER get) = = ∀ = || probability distribution = <- g() X <- density() ~ x → t → ⇔ ~(|) statistical modeling outcome function with parameter g <- function( = 6) { map(1:∞, ~sample(1: , n=10, replace = TRUE)) }
89. ### | ← = {1 , … , ∞ } x

| ← t ← = 1 , … , ∞ x t ← ~(|) ~ (|) (, ) Bayesian Modeling
90. ### vs. me “MUST be wholly REJECTED!!!” “p-value **cking!!!” Frequentist Bayesian

Old Stereotype
91. ### f X ℎℎ . f X ℎℎ . ℎ ℎ

Hypothesis Driven Data Driven ∶ → ∶ → →