SOC 4015 & SOC 5050 - Lecture 05

THE DISTRIBUTION OF RANDOM VARIABLES
QUANTITATIVE ANALYSIS / FALL 2018 / WEEK 05 / LECTURE 05
CHRISTOPHER PRENER, PH.D.

AGENDA
1. Front Matter
2. Binomial Distribution
3. Poisson Distribution
4. Normal Distribution
5. Statistical Significance
6. Normality Testing
7. Back Matter

⋆ THEME: We want to think systematically about the likelihood of observing particular outcomes relative to a known set of possible outcomes.

1. FRONT MATTER

1. FRONT MATTER / ANNOUNCEMENTS
▸ Lab 04, PS-02, and Lecture Prep 06 are due before the next lecture.
▸ There was an update on the final project due today as an issue in your final project GitHub repo (DoeProject if your last name is “Doe”).

2. BINOMIAL DISTRIBUTION

2. BINOMIAL DISTRIBUTION / DEFINITION
▸ A sequence of independent trials (n) with constant probability of success at each trial (p) where we are interested in the number of successes (x)
▸ Acronym:
  • B = binary outcome
  • I = independence
  • N = fixed sample size
  • S = same probability

2. BINOMIAL DISTRIBUTION / EMPIRICAL RULE

2. BINOMIAL DISTRIBUTION / R FUNCTIONS
▸ stats::dbinom(k, size=n, prob=p) returns the probability of observing exactly k successes, P(X = k)
▸ stats::pbinom(k, size=n, prob=p, lower.tail = TRUE) returns the probability of observing k or fewer successes, P(X ≤ k)
▸ stats::pbinom(k, size=n, prob=p, lower.tail = FALSE) returns the probability of observing more than k successes, P(X > k)
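A minimal sketch of how these three calls relate to one another. The values n = 10, p = 0.5, and k = 3 are illustrative assumptions, not figures from the lecture.

```r
# Illustrative values only (assumed, not from the lecture)
n <- 10
p <- 0.5
k <- 3

exactly_k   <- dbinom(k, size = n, prob = p)                      # P(X = k)
k_or_fewer  <- pbinom(k, size = n, prob = p, lower.tail = TRUE)   # P(X <= k)
more_than_k <- pbinom(k, size = n, prob = p, lower.tail = FALSE)  # P(X > k)

# The lower tail is the sum of the point probabilities from 0 through k
all.equal(k_or_fewer, sum(dbinom(0:k, size = n, prob = p)))

# The lower and upper tails are complements
all.equal(k_or_fewer + more_than_k, 1)
```

Note that `lower.tail = FALSE` excludes k itself, so P(X ≥ k) has to be requested as `pbinom(k - 1, ..., lower.tail = FALSE)`.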

2. BINOMIAL DISTRIBUTION / BINOMIAL WORKFLOW
▸ What is the probability of 10 or fewer successes occurring in a sequence of 100 independent trials with a binary outcome where the probability of success is .25 for each trial?
▸ Is the binomial distribution appropriate?
▸ What is the appropriate R function?
▸ What is n? What is k? What is p?

> pbinom(10, size=100, prob=.25, lower.tail = TRUE)
[1] 0.0001371006

p(k or fewer successes) = .0001

2. BINOMIAL DISTRIBUTION / BERNOULLI TRIAL
▸ A trial where there are only two outcomes - “success” and “failure”
▸ Over the long run (law of large numbers), the observed share of successes approaches the probability of success; for a fair coin flip, that is a 50% chance of “success” and a 50% chance of “failure”
▸ Jacob Bernoulli (1654-1705) was a Swiss mathematician and professor at the University of Basel
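The long-run behavior of repeated Bernoulli trials can be sketched with a simulation. The fair-coin probability, the number of flips, and the seed below are assumptions chosen for illustration.

```r
set.seed(5050)   # arbitrary seed so the simulation is reproducible
p <- 0.5         # probability of success for a fair coin (assumed)

# Each draw with size = 1 is a single Bernoulli trial: 1 = success, 0 = failure
flips <- rbinom(10000, size = 1, prob = p)

# Running proportion of successes after each flip
running_prop <- cumsum(flips) / seq_along(flips)

# The proportion drifts toward p as the number of trials grows
running_prop[c(10, 100, 1000, 10000)]
```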

2. BINOMIAL DISTRIBUTION / PASCAL’S TRIANGLE
▸ Blaise Pascal (1623-1662) was a French mathematician
▸ Described a triangular array of repeated binomial coefficients
▸ The triangle had been known for centuries in China, Persia, and even in Europe

YANG HUI’S TRIANGLE (1303)

2. BINOMIAL DISTRIBUTION / PASCAL’S TRIANGLE
      1
     1 1
    1 2 1
   1 3 3 1
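The rows shown above can be reproduced with base R's `choose()`, which returns binomial coefficients. This is a small sketch; the link to `dbinom()` at p = 0.5 is added here for illustration and does not come from the slides.

```r
# Print rows 0 through 3 of Pascal's triangle
for (n in 0:3) {
  cat(choose(n, 0:n), "\n")
}

# Row n divided by 2^n gives the binomial probabilities when p = 0.5
row3 <- choose(3, 0:3)
all.equal(row3 / 2^3, dbinom(0:3, size = 3, prob = 0.5))
```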

2. BINOMIAL DISTRIBUTION / GALTON BOX (“BEAN MACHINE”)
▸ Sir Francis Galton (1822-1911), English statistician
▸ Cousin of Charles Darwin
▸ A eugenicist who coined the phrase “nature versus nurture”
▸ Demonstrated a number of important statistical ideas, including correlation
▸ Invented what is known as the quincunx or “Galton Box”

2. BINOMIAL DISTRIBUTION / THE QUINCUNX
http://goo.gl/qKUTsx

2. BINOMIAL DISTRIBUTION / DE MOIVRE–LAPLACE THEOREM
ABRAHAM DE MOIVRE (1718)
PIERRE-SIMON LAPLACE
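The de Moivre–Laplace theorem says that, for large n, a binomial distribution is well approximated by a normal distribution with mean np and standard deviation sqrt(np(1 - p)). A rough sketch follows, reusing n = 100 and p = .25 from the binomial workflow. The cutoff of 30 and the continuity correction of 0.5 are choices made here for illustration.

```r
n <- 100
p <- 0.25

mu    <- n * p                    # mean of the binomial
sigma <- sqrt(n * p * (1 - p))    # standard deviation of the binomial

# Exact binomial probability of 30 or fewer successes
exact <- pbinom(30, size = n, prob = p)

# Normal approximation, with a continuity correction of 0.5 (an assumption)
normal_approx <- pnorm(30 + 0.5, mean = mu, sd = sigma)

c(exact = exact, normal_approx = normal_approx)
```

The approximation is weakest far out in the tails, so the two numbers agree less well for extreme cutoffs such as k = 10.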

3. POISSON DISTRIBUTION

3. POISSON DISTRIBUTION / SIMÉON POISSON (1781-1840)
▸ French mathematician and physicist
▸ Laplace’s student
▸ Built upon the binomial distribution to describe events that occur with exceptional rarity

3. POISSON DISTRIBUTION / PRUSSIAN HUSSARS
LADISLAUS BORTKIEWICZ (1898)

3. POISSON DISTRIBUTION / DEFINITION
▸ Used for events where n is large and p is very small, …
▸ …so small that their product approaches a constant we call lambda (λ).

3. POISSON DISTRIBUTION / DEFINITION
Let:
▸ n = a count of independent events
▸ p = probability
▸ n can occur a (theoretically) infinite number of times.
▸ Its only parameter is λ = np, written with the Greek letter lambda.
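One way to see why λ = np is the only parameter: hold λ fixed, let n grow (so p = λ/n shrinks), and watch the binomial probability settle on the Poisson probability. The values λ = 4 and k = 6 below are illustrative assumptions.

```r
lambda <- 4   # fixed product n * p (assumed for illustration)
k <- 6        # number of events of interest (assumed)

# Binomial probability of k events as n grows and p = lambda / n shrinks
for (n in c(10, 100, 1000, 100000)) {
  cat("n =", n, " binomial P(X = k):", dbinom(k, size = n, prob = lambda / n), "\n")
}

# Poisson probability with the same lambda
cat("Poisson P(X = k):", dpois(k, lambda = lambda), "\n")
```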

3. POISSON DISTRIBUTION / R FUNCTIONS
▸ stats::dpois(k, lambda=m) returns the probability of observing exactly k successes, P(X = k)
▸ stats::ppois(k, lambda=m, lower.tail = TRUE) returns the probability of observing k or fewer successes, P(X ≤ k)
▸ stats::ppois(k, lambda=m, lower.tail = FALSE) returns the probability of observing more than k successes, P(X > k)

3. POISSON DISTRIBUTION / POISSON WORKFLOW
▸ The probability of a car accident at a given intersection is 0.00004. In a typical week, 100,000 cars pass through the intersection. What is the probability of observing 6 car accidents in a single week at this intersection?
  • Is the Poisson distribution appropriate?
  • What is the appropriate R function?
  • What is λ? What is k? (Here λ = np = 100,000 × 0.00004 = 4 and k = 6.)

> dpois(6, lambda=4)
[1] 0.1041956

p(k successes) = .104
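The workflow can also be written so that λ is computed from the inputs rather than typed in by hand. This is a sketch using the figures from the example; the two tail probabilities are extra quantities not asked for on the slide.

```r
n <- 100000     # cars per week (from the example)
p <- 0.00004    # probability of an accident per car (from the example)

lambda <- n * p   # expected number of accidents per week

dpois(6, lambda = lambda)                        # P(X = 6), the workflow's answer
ppois(6, lambda = lambda, lower.tail = TRUE)     # P(X <= 6)
ppois(6, lambda = lambda, lower.tail = FALSE)    # P(X > 6)
```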

4. NORMAL DISTRIBUTION

4. NORMAL DISTRIBUTION / HISTORY
▸ First suggested by French mathematician Abraham de Moivre (1667-1754) as an outcome of the binomial distribution
▸ Carl Friedrich Gauss (1777-1855), a German mathematician who had an immensely influential career, demonstrated its importance in 1809
▸ Pierre-Simon Laplace (1749-1827) also made significant contributions to its usefulness beginning in 1810

4. NORMAL DISTRIBUTION / MEMORIALIZING GAUSS

4. NORMAL DISTRIBUTION / DEFINITION
▸ As opposed to the binomial and Poisson distributions, the normal is a continuous probability function
▸ Can take on an infinite range of values ( -∞ < x < ∞ )
▸ Symmetric around the mean (μ), which has the same value as the median and mode
▸ Spread of the distribution is determined by the standard deviation (σ)
▸ Standard normal has μ = 0 and σ = 1

68.2% predictive interval
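The share of a normal distribution that falls within one standard deviation of the mean can be checked with `pnorm()`. This is a small sketch; the two- and three-standard-deviation intervals are added here for comparison and are not on the slide.

```r
# Share of a normal distribution within 1, 2, and 3 standard deviations of the mean
z <- 1:3
coverage <- pnorm(z) - pnorm(-z)

# Expressed as percentages
round(100 * coverage, 1)
```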

4. NORMAL DISTRIBUTION / Z-SCORES
▸ The value of an observation expressed in standard deviation units.
▸ $z = \frac{x - \mu}{\sigma}$ or $z = \frac{x - \bar{x}}{s}$
▸ stats::pnorm(z, mean=0, sd=1, lower.tail=TRUE) returns the cumulative probability under the standard normal distribution, P(X ≤ z)

4. NORMAL DISTRIBUTION / Z-SCORE WORKFLOW
▸ A series of surveys conducted over the past month found that support for a new trade policy was 41.5% with a standard deviation of 2.25. What is the probability that a randomly selected poll shows support for the trade policy to be 46% or less?
  • Is the normal distribution appropriate?
  • What is z? (Here z = (46 - 41.5) / 2.25 = 2.)

> pnorm(2, mean=0, sd=1, lower.tail=TRUE)
[1] 0.9772499

97.72%
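The z-score step can be made explicit in code. This sketch uses the example's figures; the upper-tail call at the end is an extra quantity not shown on the slide.

```r
x     <- 46      # poll result of interest (from the example)
mu    <- 41.5    # mean support (from the example)
sigma <- 2.25    # standard deviation (from the example)

z <- (x - mu) / sigma   # observation expressed in standard deviation units

pnorm(z, mean = 0, sd = 1, lower.tail = TRUE)        # P(X <= 46) via the standard normal
pnorm(x, mean = mu, sd = sigma, lower.tail = TRUE)   # same probability without standardizing
pnorm(z, lower.tail = FALSE)                         # P(X > 46)
```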

5. STATISTICAL SIGNIFICANCE

5. STATISTICAL SIGNIFICANCE / KARL PEARSON (1857-1936)
▸ English mathematician
▸ Student of Sir Francis Galton (and, like Galton, he was a eugenicist and social Darwinist)
▸ Formalized the concept of the “p-value” around 1914
▸ Based on Laplace’s earlier use of the idea
▸ Also introduced “moments”, histograms, and a number of other concepts we’ll get to this semester!

5. STATISTICAL SIGNIFICANCE / R.A. FISHER (1890-1962)
▸ English mathematician and biologist
▸ Like Galton and Pearson, he was a eugenicist and social Darwinist
▸ Introduced the concept of the null hypothesis…
▸ … and popularized the idea of statistical significance including the selection of p = .05.
▸ Responsible for movement away from Bayesian analyses because of his preference for objectivity.

INFORMALLY, A P-VALUE IS THE PROBABILITY UNDER A SPECIFIED STATISTICAL
MODEL THAT A STATISTICAL SUMMARY OF THE DATA WOULD BE EQUAL TO OR MORE EXTREME THAN ITS OBSERVED VALUE. American Statistical Association “Statement on p-values”  (2016)

THE PROBABILITY OF GETTING RESULTS AT LEAST AS EXTREME AS
THE ONES YOU OBSERVED, GIVEN THAT THE NULL HYPOTHESIS IS CORRECT Christie Aschwanden "Not Even Scientists Can Easily Explain P-values"  (2015)

IMAGINE, HE SAID, THAT YOU HAVE A COIN THAT YOU
SUSPECT IS WEIGHTED TOWARD HEADS. (YOUR NULL HYPOTHESIS IS THEN THAT THE COIN IS FAIR.) YOU FLIP IT 100 TIMES AND GET MORE HEADS THAN TAILS. THE P-VALUE WON’T TELL YOU WHETHER THE COIN IS FAIR, BUT IT WILL TELL YOU THE PROBABILITY THAT YOU’D GET AT LEAST AS MANY HEADS AS YOU DID IF THE COIN WAS FAIR. THAT’S IT — NOTHING MORE. Christie Aschwanden "Not Even Scientists Can Easily Explain P-values"  (2015)
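The coin example in the quote can be sketched in R. The quote gives no specific count of heads, so the 60 heads used below is a hypothetical value chosen only for illustration.

```r
heads <- 60    # hypothetical count; the quote does not give a number
flips <- 100   # number of flips (from the quote)

# P(X >= heads) under the null hypothesis of a fair coin.
# lower.tail = FALSE returns P(X > k), so k is set to heads - 1.
p_value <- pbinom(heads - 1, size = flips, prob = 0.5, lower.tail = FALSE)
p_value

# Base R's exact binomial test reports the same one-sided p-value
binom.test(heads, flips, p = 0.5, alternative = "greater")$p.value
```

As the quote stresses, this number describes how surprising the data would be if the coin were fair. It does not say how likely it is that the coin is fair.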

ONE-TAILED
z               1.64     2.33     3.09
p <             0.05     0.01     0.001
% of scores <   95%      99%      99.9%
% of scores >   5%       1%       0.1%

TWO-TAILED (EACH TAIL)
z               1.96     2.58     3.29
p (per tail)    0.025    0.005    0.0005
% of scores <   97.5%    99.5%    99.95%
% of scores >   2.5%     0.5%     0.05%

TWO-TAILED (BOTH TAILS COMBINED)
z                     1.96     2.58     3.29
p <                   0.05     0.01     0.001
% of scores inside    95%      99%      99.9%
% of scores outside   5%       1%       0.1%
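The critical z values in these tables can be recovered with `qnorm()`, the inverse of `pnorm()`. This is a brief sketch; the variable names are illustrative.

```r
alpha <- c(0.05, 0.01, 0.001)

one_tailed <- qnorm(1 - alpha)       # all of alpha placed in the upper tail
two_tailed <- qnorm(1 - alpha / 2)   # alpha split evenly across both tails

round(rbind(alpha, one_tailed, two_tailed), 3)
```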

Q: WHY DO SO MANY COLLEGES AND GRAD SCHOOLS TEACH
P = 0.05? A: BECAUSE THAT’S STILL WHAT THE SCIENTIFIC COMMUNITY AND JOURNAL EDITORS USE.  Q: WHY DO SO MANY PEOPLE STILL USE P = 0.05? A: BECAUSE THAT’S WHAT THEY WERE TAUGHT IN COLLEGE OR GRAD SCHOOL. George Cobb, Ph.D. “Statement on p-values”  (2015)

WE TEACH IT BECAUSE IT’S WHAT WE DO; WE DO
IT BECAUSE IT’S WHAT WE TEACH. George Cobb, Ph.D. “Statement on p-values”  (2015)

6. NORMALITY TESTING

“REAL” DATA ARE NEVER PERFECTLY NORMAL

6. NORMALITY TESTING / QUANTIFYING ABNORMAL
▸ diagnostic plots
▸ descriptive statistics
▸ hypothesis tests

6. NORMALITY TESTING / SKEW
Let:
▸ x̄ = sample mean
▸ i = lower bound
▸ n = sample size
▸ x_i = a given value in the vector

6. NORMALITY TESTING / SKEW
▸ “Third moment” of a distribution
▸ Measure of the asymmetry of a continuous distribution
▸ Normal distributions have sk = 0
▸ sk > 0 indicates a longer right tail relative to the left tail’s length
▸ sk < 0 indicates a longer left tail relative to the right tail’s length
▸ Values < -2 or > 2 are indicators of a non-normal distribution
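The skew formula itself did not survive on the slide, so the sketch below assumes one common moment-based definition: the third central moment divided by the second central moment raised to the 3/2 power. The helper name `skew()` and the simulated data are hypothetical, and other software may apply small-sample corrections that give slightly different values.

```r
# Moment-based skewness (an assumed definition, not confirmed by the slides):
# sk = mean((x_i - x_bar)^3) / mean((x_i - x_bar)^2)^(3/2)
skew <- function(x) {
  x <- x[!is.na(x)]     # drop missing values
  d <- x - mean(x)      # deviations from the sample mean
  mean(d^3) / mean(d^2)^1.5
}

set.seed(5050)       # arbitrary seed for reproducibility
skew(rnorm(1000))    # roughly symmetric sample: value near 0
skew(rexp(1000))     # long right tail: clearly positive value
```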

“NORMALLY DISTRIBUTED” HISTOGRAM

NON-NORMAL HISTOGRAM

6. NORMALITY TESTING / KURTOSIS
Let:
▸ x̄ = sample mean
▸ i = lower bound
▸ n = sample size
▸ x_i = a given value in the vector

6. NORMALITY TESTING / KURTOSIS
▸ “Fourth moment” of a distribution
▸ Measure of the “weight” of the tails - how many observations fall in the tails relative to the center
▸ Normal distributions have k = 3, where k is always positive and k ≥ 1.
▸ Normal distributions have excess kurtosis ek = 0 (ek = k - 3), where ek can be negative.
▸ k > 5 is one rule of thumb for problematic distributions
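As with skew, the kurtosis formula is missing from the slide, so this sketch assumes the moment-based definition: the fourth central moment divided by the squared second central moment. The helper name `kurt()` and the simulated data are hypothetical.

```r
# Moment-based kurtosis (an assumed definition, not confirmed by the slides):
# k = mean((x_i - x_bar)^4) / mean((x_i - x_bar)^2)^2
kurt <- function(x) {
  x <- x[!is.na(x)]     # drop missing values
  d <- x - mean(x)      # deviations from the sample mean
  mean(d^4) / mean(d^2)^2
}

set.seed(5050)        # arbitrary seed for reproducibility
x <- rnorm(10000)     # simulated normal data

k  <- kurt(x)   # close to 3 for normally distributed data
ek <- k - 3     # excess kurtosis, close to 0

c(kurtosis = k, excess = ek)
```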

6. NORMALITY TESTING / KURTOSIS
▸ Mesokurtic - kurtosis = 3 (example: k = 3)
▸ Leptokurtic - kurtosis > 3 (example: k = 4.805)
▸ Platykurtic - kurtosis < 3 (example: k = 1.810)


6. NORMALITY TESTING HISTOGRAM WITH NORMAL DISTRIBUTION

6. NORMALITY TESTING / QUANTILE-QUANTILE PLOT
▸ Known as the “q-q” plot for short
▸ Plots a given variable against a theoretical distribution (such as the standard normal distribution) as a diagnostic test
▸ Look for how your variable (the points) sits relative to the normal distribution (the 45-degree line). Pay particular attention to the tails.
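A q-q plot can be drawn with base R graphics. This is a minimal sketch using simulated right-skewed data (an assumption); the lecture's own plots may have been produced with a different package.

```r
set.seed(5050)     # arbitrary seed for reproducibility
x <- rexp(200)     # simulated right-skewed data (assumed)

# Sample quantiles plotted against standard normal quantiles
qqnorm(x, main = "Q-Q plot of a right-skewed sample")

# Reference line; base R draws it through the first and third quartiles
qqline(x)

# The plotted coordinates are also available without drawing anything
qq <- qqnorm(x, plot.it = FALSE)
head(data.frame(theoretical = qq$x, sample = qq$y))
```

Points that bend away from the line in the upper tail, as they do here, are the pattern the slides label positive (right) skew.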

“NORMALLY DISTRIBUTED”

POSITIVE (RIGHT) SKEW

NEGATIVE (LEFT) SKEW



6. NORMALITY TESTING / SHAPIRO-FRANCIA TEST
Assumptions:
1. Results are valid for 5 ≤ n ≤ 5000

H0: Data are not markedly different from the normal distribution.
HA: Data are markedly different from the normal distribution.

If the p value associated with the test statistic is greater than .05…
If the p value associated with the test statistic is less than .05…
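A sketch of the hypothesis-test workflow in R. Base R does not include the Shapiro-Francia test, so the closely related Shapiro-Wilk test (`shapiro.test()`) is swapped in here as a stand-in. The `nortest` package provides `sf.test()` for the Shapiro-Francia test itself, and the code only calls it if that package happens to be installed. The data are simulated and assumed.

```r
set.seed(5050)     # arbitrary seed for reproducibility
x <- rexp(200)     # simulated, clearly non-normal data (assumed)

# Stand-in: Shapiro-Wilk test from base R (not the Shapiro-Francia test)
result <- shapiro.test(x)
result$p.value

# Shapiro-Francia test, run only if the nortest package is available
if (requireNamespace("nortest", quietly = TRUE)) {
  print(nortest::sf.test(x))
}

# Decision at the conventional .05 threshold
decision <- if (result$p.value < 0.05) "reject H0" else "fail to reject H0"
decision
```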

7. BACK MATTER

7. BACK MATTER / AGENDA REVIEW
2. Binomial Distribution
3. Poisson Distribution
4. Normal Distribution
5. Statistical Significance
6. Normality Testing

7. BACK MATTER / REMINDERS
▸ Lab 04, PS-02, and Lecture Prep 06 are due before the next lecture.
▸ There was an update on the final project due today as an issue in your final project GitHub repo (DoeProject if your last name is “Doe”).
