Slide 1


70th Tokyo.R @kilometer BeginneR Session 1 -- Bayesian Modeling -- 2018.06.09 at Microsoft Co.

Slide 2


Who!?

Slide 3

Who!?
Name: Mimura (@kilometer)
Occupation: postdoc (Ph.D. in engineering)
Specialty: behavioral neuroscience (primates), brain imaging, medical systems engineering
R experience: about 10 years
Current obsession: gajumaru (banyan) trees

Slide 4


BeginneR Session

Slide 5


BeginneR

Slide 6


BeginneR

Slide 7

Before the BeginneR Session: BeginneR. After: BeginneR.

Slide 8

BeginneR → Advanced (hoxo_m)
"If I have seen further it is by standing on the shoulders of Giants." -- Sir Isaac Newton, 1676

Slide 9


BeginneR Session 1 -- Bayesian Modeling --

Slide 10

Agenda
1. What is modeling?
2. Welcome to Bayesian statistics

Slide 11


What is modeling?

Slide 12

What is modeling?
[Diagram: a model f(X) links Truth to Knowledge]

Slide 13

What is modeling?
[Diagram: a model f(X) links Truth to Knowledge -- modeling in the narrow sense vs. the broad sense]

Slide 14

What is modeling?
Hypothesis driven: a "strong" hypothesis meets the data.
Data driven: a "weak" hypothesis meets the data.

Slide 15

What is modeling?
[Diagram: hypothesis-driven and data-driven modeling, each with its own model f(X)]

Slide 16

What is modeling? A/B test (hypothesis driven) -- not that I've ever run one!
A or B?
HA: A is better. HB: B is better. H0: we must choose the better of the two.
A strong hypothesis, simple data.

Slide 17

What is modeling? Meta-analysis (data driven) -- what does everyone call this kind of thing?
H0: there is a best (or better) way.
A weak hypothesis, complex data.

Slide 18

What is modeling?
Hypothesis-driven analysis: "what to do?" -- decision making; a strong hypothesis, simple data.
Data-driven analysis: "how to do it?"; a weak hypothesis, complex data.

Slide 19

What is modeling?
Hypothesis-driven analysis: "what to do?" -- decision making; a strong hypothesis, simple data, a simple model.
Data-driven analysis: "how to do it?"; a weak hypothesis, complex data, a complex model.

Slide 20

What is modeling?
Hypothesis-driven analysis: "what to do?" -- decision making; a strong hypothesis, simple data, a simple model.
Data-driven analysis: "how to do it?"; a weak hypothesis, complex data, a complex model.
Modeling in the narrow sense vs. the broad sense.

Slide 21

What is modeling?
[Diagram: a model f(X) links Truth to Knowledge -- modeling in the narrow sense vs. the broad sense]

Slide 22

A or B?
HA: A is better. HB: B is better. H0: we must choose the better of the two.
Conclusion: A is better.

Slide 23

"There is only one difference between a madman and me. The madman thinks he is sane. I know I am mad."
-- Salvador Dalí, "Dalí is a dilly," The American Magazine, 162(1), 28-9, 107-9, 1956.

Slide 24

A or B?
HA: A is better. HB: B is better. H0: we must choose the better of the two.
H1: there is a difference d between A and B; A > B, so A is better.

Slide 25


Welcome to Bayesian statistics

Slide 26

Dice with α faces (regular polyhedron).
Truth: the die's α, which we cannot see. Knowledge: ? Hypothesis: α. Observation: x = 5.

Slide 27

Dice with α faces, observation x = 5.
P(x = 5 | α = 4) = 0
P(x = 5 | α = 6) = 1/6
P(x = 5 | α = 8) = 1/8
P(x = 5 | α = 12) = 1/12
P(x = 5 | α = 20) = 1/20
The likelihood is maximized at α = 6 (maximum likelihood).

Slide 28

Dice with α faces. You observe D = {5, 4, 3, 4, 2, 1, 2, 3, 1, 4}.
P(D | α = 4) = 0
P(D | α = 6) = 1/6^10
P(D | α = 8) = 1/8^10
P(D | α = 12) = 1/12^10
P(D | α = 20) = 1/20^10
The likelihood is maximized at α = 6 (maximum likelihood).
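These likelihoods are easy to check in R; the names `alphas` and `D` are chosen here for illustration:

```r
# Candidate regular-polyhedron dice and the ten observed rolls
alphas <- c(4, 6, 8, 12, 20)
D <- c(5, 4, 3, 4, 2, 1, 2, 3, 1, 4)

# P(D | alpha): 0 if any roll exceeds alpha, otherwise (1/alpha)^10
lik <- sapply(alphas, function(a) if (max(D) > a) 0 else (1 / a)^length(D))
names(lik) <- alphas

which.max(lik)  # the likelihood is maximized at alpha = 6
```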

Slide 29

Friend: Could you find α?
You: Yes. α is estimated at 6!!
Friend: Why do you think so?
You: Because argmax_α { P(D | α) } = 6!!
Friend: Hmmmm..., well.., how likely is (α = 6)?
You: Oh, it is 1/6^10!!
Friend: ....nnNNNO!!! WHAT!!????

Slide 30

Dice with α faces, D = {5, 4, 3, 4, 2, 1, 2, 3, 1, 4}.
You (before): maximum likelihood -- P(D | α = 6) = 1/6^10.
Friend: Hmmmm... Well.., how likely is (α = 6)?
You (after): ...P(α = 6 | D)!!??

Slide 31

Sample space Ω = {x_1, ..., x_∞}, ∀x_i ∈ ℕ (can NEVER get).
A stochastic variable X yields a realization x: x <- sample(X, 1).
Probability distribution: P(X = x_i) = |x_i| / |Ω|, each value's share of the sample space.
In R, with a finite vector standing in for the sample space:
X <- c(1, 1, 1, 1, 1, 2, 2, 3, 4, 5, 5)
x <- sample(X, 1)                     # realization
hist(X, freq = FALSE, labels = TRUE)  # empirical probability distribution
As the number of trials t → ∞, the histogram converges to the distribution: X ~ p(x).

Slide 32

Statistical modeling: treat the outcome as a function of the die's number of faces, X ~ p(x | α), p: α → X.
In R (purrr; a large finite number of trials stands in for 1:∞, and sample's size argument replaces n):
library(purrr)
g <- function(alpha = 6) {
  map(1:1000, ~ sample(1:alpha, size = 10, replace = TRUE))
}
X <- g()
plot(density(unlist(X)))  # empirical distribution of the outcomes

Slide 33

Statistical modeling: X | θ ~ p(x | θ), where θ ∈ Θ is a parameter, p: Θ → X.
Sample space Ω = {x_1, ..., x_∞}, ∀x_i ∈ ℕ (can NEVER get); a realization is one draw:
X <- map(1:∞, ~ g(θ))
x <- sample(X, 1)

Slide 34

P(D | α = 6) = 1/6^10 -- but what we want is P(α = 6 | D)!!??
[Plots of the likelihood for α = 6, 12, 20]

Slide 35

Two statistical models:
x_t ~ p(x | α), p: α → X, with x_t ← Ω_x = {x_1, ..., x_∞}, ∀t ∈ ℕ, ∀x_t ≤ α.
α ~ p(α | D), p: D → α, with α ← Ω_α = {α_1, ..., α_∞}, α ∈ {4, 6, 8, 12, 20}.

Slide 36

x_t ~ p(x | α), ∀t ∈ ℕ, ∀x_t ≤ α, and α ~ p(α | D), with α ∈ {4, 6, 8, 12, 20}.

Slide 37

Conditional probability:
P(A | B) P(B) = P(A ∩ B) = P(B ∩ A) = P(B | A) P(A)

Slide 38

Bayes' theorem. From P(A | B) P(B) = P(A ∩ B) = P(B | A) P(A):
P(A | B) = P(B | A) * P(A) / P(B), where P(B) ≠ 0.

Slide 39

Apply Bayes' theorem to the two models x ~ p(x | α) and α ~ p(α | D):
p(α | D) = p(D | α) * p(α) / p(D), where p(D) ≠ 0.

Slide 40

p(α | D) = p(D | α) * p(α) / p(D), where p(D) ≠ 0.
p(D | α) is the likelihood.

Slide 41

p(α | D) = p(D | α) * p(α) / p(D)
likelihood: p(D | α), with D ← Ω = {x_1, ..., x_∞} and α ∈ {4, 6, 8, 12, 20}.

Slide 42

The denominator comes from marginalizing over α ∈ {4, 6, 8, 12, 20}:
p(D) = Σ_α { p(α) * p(D | α) }  (marginalization)

Slide 43

p(α | D) = p(D | α) * p(α) / Σ_α { p(α) * p(D | α) }

Slide 44

Dice with α faces. You observe D = {5, 4, 3, 4, 2, 1, 2, 3, 1, 4}.
P(D | α = 4) = 0
P(D | α = 6) = 1/6^10
P(D | α = 8) = 1/8^10
P(D | α = 12) = 1/12^10
P(D | α = 20) = 1/20^10
The likelihood is maximized at α = 6 (maximum likelihood).

Slide 45

p(α | D) = p(D | α) * p(α) / Σ_α { p(α) * p(D | α) }
But what is the prior p(α)? It is defined over a sample space Ω = {α_1, ..., α_∞} (can NEVER get).

Slide 46

p(α | D) = p(D | α) * p(α) / Σ_α { p(α) * p(D | α) }
The prior p(α) lives on a sample space we CAN NEVER GET.

Slide 47

So approximate it: ∀α, p(α) ≅ p'(α) = 1/5, for α ∈ {4, 6, 8, 12, 20}:
p(α | D) ≅ p(D | α) * p'(α) / Σ_α { p'(α) * p(D | α) }

Slide 48

p(α | D) ≅ p(D | α) * p'(α) / Σ_α { p'(α) * p(D | α) }, where ∀α, p'(α) = 1/5.
The flat prior cancels, so
p(α | D) ≈ p(D | α) / ( p(D | 4) + p(D | 6) + p(D | 8) + p(D | 12) + p(D | 20) ) ≈ p(D | α) / 1.7485e-08.
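A minimal sketch of this calculation in R (variable names are illustrative); with the flat prior, the posterior is each likelihood divided by their sum:

```r
alphas <- c(4, 6, 8, 12, 20)
D <- c(5, 4, 3, 4, 2, 1, 2, 3, 1, 4)

lik   <- sapply(alphas, function(a) if (max(D) > a) 0 else (1 / a)^length(D))
prior <- rep(1 / 5, length(alphas))          # p'(alpha) = 1/5

# Bayes' theorem, with marginalization in the denominator
posterior <- lik * prior / sum(lik * prior)

sum(lik)                    # ~ 1.7485e-08, the slide's denominator
round(posterior * 100, 2)   # ~ 0.00 94.58 5.33 0.09 0.00
```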

Slide 49

Dice with α faces, D = {5, 4, 3, 4, 2, 1, 2, 3, 1, 4}.
Friend: Hmmmm... Well.., how likely is (α = 6)?
You: P(D | α = 6) = 1/6^10 (maximum likelihood), so P(α = 6 | D) ≅ (1/6^10) / 1.7485e-08 ≈ 94.58%.

Slide 50

Prior probability: p'(4) = p'(6) = p'(8) = p'(12) = p'(20) = 1/5.
Posterior probability:
p(4 | D) = 0%
p(6 | D) ≈ 94.58%
p(8 | D) ≈ 5.32%
p(12 | D) ≈ 0.09%
p(20 | D) ≈ 0.0005%
MAP (Maximum a posteriori) estimation: argmax_α { p(α | D) } = 6.

Slide 51

Predictive probability for the next roll x11. D = {5, 4, 3, 4, 2, 1, 2, 3, 1, 4}, dice with α faces:
P(x11 ≤ 6 | α = 4) * p(4 | D) = 0%
P(x11 ≤ 6 | α = 6) * p(6 | D) ≈ 94.58%
P(x11 ≤ 6 | α = 8) * p(8 | D) ≈ 3.99%
P(x11 ≤ 6 | α = 12) * p(12 | D) ≈ 0.046%
P(x11 ≤ 6 | α = 20) * p(20 | D) ≈ 0.0001%
P(x11 ≤ 6 | D) ≈ 98.62%
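The predictive probability mixes P(x11 ≤ 6 | α) over the posterior; a sketch in R, recomputing the posterior from the data (names illustrative):

```r
alphas <- c(4, 6, 8, 12, 20)
D <- c(5, 4, 3, 4, 2, 1, 2, 3, 1, 4)
lik <- sapply(alphas, function(a) if (max(D) > a) 0 else (1 / a)^length(D))
posterior <- lik / sum(lik)            # the flat prior cancels

# P(x11 <= 6 | alpha) * p(alpha | D), summed over the candidates
p_le6 <- pmin(6, alphas) / alphas
predictive <- sum(p_le6 * posterior)
round(predictive * 100, 2)             # ~ 98.62
```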

Slide 52

You: P(x11 ≤ 6 | D) ≈ 98.62%, and P(α = 6 | D) ≈ 94.58%. OK, let's try the 11th roll!!
D = {5, 4, 3, 4, 2, 1, 2, 3, 1, 4}, dice with α faces.

Slide 53

You: P(x11 ≤ 6 | D) ≈ 98.62%, and P(α = 6 | D) ≈ 94.58%. OK, let's try the 11th roll!!
The roll comes up: x11 = 8.
D = {5, 4, 3, 4, 2, 1, 2, 3, 1, 4}, dice with α faces.

Slide 54

You: P(x11 ≤ 6 | D) ≈ 98.62%, and P(α = 6 | D) ≈ 94.58%.
But x11 = 8, so P(α = 6 | {D, x11}) = 0%.
D = {5, 4, 3, 4, 2, 1, 2, 3, 1, 4}, dice with α faces.

Slide 55

Dice with α faces; D = {5, 4, 3, 4, 2, 1, 2, 3, 1, 4}, D' = {D, 8}.
First update: posterior p(α | D) ≅ p(D | α) * p'(α) / p(D), from the prior p'.
Next update: p(α | D') ≅ p(D' | α) * p''(α) / p(D'), from some new prior p''.

Slide 56

Dice with α faces; D = {5, 4, 3, 4, 2, 1, 2, 3, 1, 4}, D' = {D, 8}.
First update: posterior p(α | D) ≅ p(D | α) * p'(α) / p(D), from the prior p'.
Next update: p(α | D') ≅ p(D' | α) * p(α | D) / p(D') -- the previous posterior becomes the new prior.

Slide 57

For α = {4, 6, 8, 12, 20}:
Prior: p'(α) = {20%, 20%, 20%, 20%, 20%}
Posterior after D = {5, 4, 3, 4, 2, 1, 2, 3, 1, 4}: ≈ {0%, 94.58%, 5.32%, 0.09%, 0.0005%}
Posterior after D' = {D, 8}: ≈ {0%, 0%, 99.98%, 0.020%, 0.0000004%}
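The sequential update can be sketched by appending the new roll and renormalizing (a flat initial prior is assumed here, so the exact percentages come from this recomputation):

```r
alphas <- c(4, 6, 8, 12, 20)
D  <- c(5, 4, 3, 4, 2, 1, 2, 3, 1, 4)
D2 <- c(D, 8)                                # the 11th roll came up 8

lik2 <- sapply(alphas, function(a) if (max(D2) > a) 0 else (1 / a)^length(D2))
posterior2 <- lik2 / sum(lik2)               # flat prior cancels again

names(posterior2) <- alphas
round(posterior2 * 100, 2)                   # alpha = 4 and 6 are now ruled out
```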

Slide 58

You: P(x12 ≤ 8 | D') ≈ 99.98%, and P(α = 8 | D') ≈ 99.98%. OK!!! Let's try the 12th!! COME OOON
Dice with α faces, D' = {5, 4, 3, 4, 2, 1, 2, 3, 1, 4, 8}.

Slide 59

Nobody has known their whereabouts since then...

Slide 60

posterior ≅ likelihood * prior:
p(θ | D) = p(D | θ) * p(θ) / p(D)
prior distribution p(θ) → posterior distribution p(θ | D) → predictive distribution p*(x | D) = ∫ p(x | θ) p(θ | D) dθ, given data D.

Slide 61

Information Criterion in Bayesian modeling: how close is the predictive distribution p*(x) to the Truth q(x)?
(posterior ≅ likelihood * prior; prior distribution → posterior distribution → predictive distribution, given data.)

Slide 62

Kullback–Leibler divergence between the truth q(x) and the predictive p*(x):
D_KL(q || p*) = E_q[ −log p*(x) ] − E_q[ −log q(x) ] = E_q[ log( q(x) / p*(x) ) ]
(−log p(x) is the self-information; E_q[·] is the expectation under the truth q.)

Slide 63

D_KL(q || p*) = ∫ q(x) log( q(x) / p*(x) ) dx
= ∫ q(x) log q(x) dx − ∫ q(x) log p*(x) dx
= −H(q) − ∫ q(x) log p*(x) dx
Generalization error: G := −∫ q(x) log p*(x) dx.
The entropy H(q) is fixed by the truth, so min D_KL(q || p*) ⇔ min G, and WAIC estimates G.
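A toy check of this divergence in R: take the truth q to be a fair 6-sided die and compare two candidate predictive distributions over faces 1..8 (the setup is illustrative):

```r
# D_KL(q || p) = sum over outcomes with q > 0 of q * log(q / p)
kl <- function(q, p) sum(q[q > 0] * log(q[q > 0] / p[q > 0]))

faces <- 1:8
q  <- ifelse(faces <= 6, 1 / 6, 0)   # truth: fair 6-sided die
p1 <- rep(1 / 8, 8)                  # predictive 1: fair 8-sided die
p2 <- q                              # predictive 2: the truth itself

kl(q, p1)   # log(4/3) ~ 0.288: probability wasted on faces 7 and 8
kl(q, p2)   # 0: the divergence vanishes only at the truth
```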

Slide 64

Information Criterion in Bayesian modeling:
D_KL(q || p*) = −H(q) + G, and the Generalization error G ≈ WAIC.
(posterior ≅ likelihood * prior; prior → posterior → predictive distribution p*(x), compared with the Truth q(x).)

Slide 65

The evidence is p(D), the denominator of Bayes' theorem.
Its self-information F := −log p(D) is the free energy.

Slide 66

F := −log p(D) = −log Z (evidence). Averaging over the true distribution of data q(D):
E_q[F] = −Σ_D q(D) log p(D)
= Σ_D q(D) log( q(D) / p(D) ) − Σ_D q(D) log q(D)
= D_KL( q(D) || p(D) ) + H( q(D) )
so D_KL( q(D) || p(D) ) = −H( q(D) ) + E_q[F].

Slide 67

Information Criterion in Bayesian modeling:
D_KL( q(x) || p*(x) ) = −H( q(x) ) + G, Generalization error G ≈ WAIC
D_KL( q(D) || p(D) ) = −H( q(D) ) + F, Free energy F := −log p(D) (self-information of the evidence)

Slide 68


Summary

Slide 69

Dice with α faces (regular polyhedron).
Truth: the die's α, which we cannot see. Hypothesis: α. Observation: D = {5, 4, 3, 4, 2, 1, 2, 3, 1, 4}.

Slide 70

Statistical modeling: the outcome is a function of the die's number of faces, X ~ p(x | α), p: α → X.
In R (purrr; a large finite number of trials stands in for 1:∞):
library(purrr)
g <- function(alpha = 6) {
  map(1:1000, ~ sample(1:alpha, size = 10, replace = TRUE))
}
X <- g()
plot(density(unlist(X)))  # empirical distribution of the outcomes

Slide 71

x_t ~ p(x | α), p: α → X, ∀t ∈ ℕ, ∀x_t ≤ α, with α ∈ {4, 6, 8, 12, 20}.
[Plots of p(D | α) for α = 6, 12, 20]

Slide 72

x_t ~ p(x | α), p: α → X, ∀t ∈ ℕ, ∀x_t ≤ α, with α ∈ {4, 6, 8, 12, 20}.
The likelihood p(D | α), shown as log p(D | α) for α = 6, 8, 12, 20: it is maximized at α = 6.

Slide 73

Two models:
x ~ p(x | α), p: α → X, with likelihood p(D | α) (maximized at α = 6), and
α ~ p(α | D), p: D → α.

Slide 74

Bayes' theorem links the two models:
p(α | D) = p(D | α) * p(α) / p(D) ≅ p(D | α) * p'(α) / Σ_α { p'(α) * p(D | α) }
posterior ≅ likelihood * prior.

Slide 75

p(α | D) ≅ p(D | α) * p'(α) / Σ_α { p'(α) * p(D | α) }, with the flat prior ∀α, p'(α) = 1/5.
Result: p(α = 6 | D) ≅ 94.58%, so the MAP estimate is α = 6.

Slide 76

p(α | D) ≅ p(D | α) * p'(α) / Σ_α { p'(α) * p(D | α) }, with the flat prior ∀α, p'(α) = 1/5.
Result: p(α = 6 | D) ≅ 94.58%, so the MAP estimate is α = 6.

Slide 77

Information Criterion in Bayesian modeling (summary):
posterior ≅ likelihood * prior; prior distribution → posterior distribution → predictive distribution p*(x), compared with the Truth q(x).
D_KL( q(x) || p*(x) ) = −H( q(x) ) + G, Generalization error G ≈ WAIC
D_KL( q(D) || p(D) ) = −H( q(D) ) + F, Free energy F := −log p(D) (self-information of the evidence)

Slide 78


Oh, by the way…

Slide 79

A or B?
HA: A is better. HB: B is better. H0: we must choose the better of the two.
H1: there is a difference θ between A and B; A > B, so A is better.

Slide 80

Bayesian version: model each arm's observations, x ~ p(x | θ_A) and y ~ p(y | θ_B), and study the difference δ = x̃ − ỹ of draws x̃ ← x | θ_A and ỹ ← y | θ_B.

Slide 81

δ = x̃ − ỹ, where
x̃ ← x | θ_A ~ p(x | θ_A), with θ_A ← θ | D_A ~ p(θ | D_A)
ỹ ← y | θ_B ~ p(y | θ_B), with θ_B ← θ | D_B ~ p(θ | D_B)
i.e. draws from each arm's posterior predictive distribution.

Slide 82

Collect the paired draws δ ← [x̃, ỹ], with
x̃ ← x | θ_A ~ p(x | θ_A), θ_A ← θ | D_A
ỹ ← y | θ_B ~ p(y | θ_B), θ_B ← θ | D_B
and examine the posterior distribution of δ.

Slide 83

δ ← [x̃, ỹ]: from the posterior of each arm, draw x̃ and ỹ and accumulate the difference; the distribution of δ answers whether A is better.
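A minimal sketch of this Bayesian A/B comparison in R, using hypothetical conversion counts and conjugate Beta(1, 1) priors (all data and names here are made up for illustration):

```r
set.seed(1)

# Hypothetical observations: successes out of trials for variants A and B
n_A <- 100; s_A <- 60
n_B <- 100; s_B <- 48

# Beta(1,1) prior + binomial likelihood -> Beta posterior (conjugacy)
theta_A <- rbeta(10000, 1 + s_A, 1 + n_A - s_A)
theta_B <- rbeta(10000, 1 + s_B, 1 + n_B - s_B)

# Posterior of the difference delta, as in the slides
delta <- theta_A - theta_B
mean(delta > 0)                       # P(theta_A > theta_B | data)
quantile(delta, c(0.025, 0.975))      # 95% credible interval for delta
```

Conjugacy is only a convenience here; with a non-conjugate model the same δ would be built from MCMC draws instead.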

Slide 84


Slide 85


Summary, again…

Slide 86

What is modeling?
[Diagram: a model f(X) links Truth to Knowledge -- modeling in the narrow sense vs. the broad sense]

Slide 87

What is modeling?
[Diagram: hypothesis-driven and data-driven modeling, each with its own model f(X)]

Slide 88

Statistical modeling: the outcome is a function with a parameter, X ~ p(x | θ).
In R (purrr; a large finite number of trials stands in for 1:∞):
library(purrr)
g <- function(alpha = 6) {
  map(1:1000, ~ sample(1:alpha, size = 10, replace = TRUE))
}
X <- g()
plot(density(unlist(X)))  # empirical distribution of the outcomes

Slide 89

Bayesian Modeling: combine the two models,
x ~ p(x | θ), with x ← Ω_x = {x_1, ..., x_∞}, and θ ~ p(θ | D), with θ ← Ω_θ = {θ_1, ..., θ_∞},
and estimate (x, θ) together.

Slide 90

Frequentist vs. Bayesian, as seen by me:
"MUST be wholly REJECTED!!!" / "p-value **cking!!!"
-- an old stereotype.

Slide 91

Hypothesis Driven and Data Driven
[Diagram: the two directions, p: θ → X and p: X → θ, connect them]

Slide 92


“Life shrinks or expands to one’s courage.” -- Anaïs Nin, 2000 http://theamericanreader.com

Slide 93

Before the BeginneR Session: BeginneR. After: BeginneR ... ?

Slide 94


Enjoy!! KMT©

Slide 95


Bar DraDra KMT©