Para-normal Statistics:
Analyzing what doesn't add up.
Steven Lembark
Workhorse Computing
[email protected]
Normality
We expect data to be normal.
It's what we are trained for.
Chi-Squared, F depend on it.
It's the guts of ANOVA.
Theory guarantees it, sort of.
What is "normal"?
Normal data is:
Parametric
Real
Symmetric
Unimodal
Ab-normal data
Not all data is parametric: Nominal Data
"Bold" + "Tide" / 2 == ??
"Bald" - "Harry" >= 0 ??
Ab-normal data
Not all data is parametric: Ordinal Data
"On a scale of 1 to 5 how would you rate..."
Is the average really 3? For different people?
Are differences between ranks uniform?
Ab-normal data
Not all data is unimodal, symmetric.
Bi-modal data has higher sample variance.
Strictly positive data is often right-skewed.
Ab-normal data
Counts are usually Binomial or Poisson.
Binomial: Coin flips.
Poisson: Sample success/failure.
Power of Positive Thinking
Binomial: Count of successes from n IID experiments.
Mean = np
Variance = npq
Power of Positive Thinking
Poisson: Count of occurrences in a sample of size n.
Mean = np
Variance = np
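A quick simulation can sanity-check these moment formulas. This is an illustrative sketch using only the Python standard library; the seed, trial count, and helper name `poisson_sample` are my own choices, not from the talk:

```python
import random
import statistics

random.seed(42)
n, p = 20, 0.3
trials = 50_000

# Binomial: count successes in n IID Bernoulli(p) experiments.
binom = [sum(random.random() < p for _ in range(n)) for _ in range(trials)]

# Poisson with rate lambda = n*p: count arrivals in one unit of time,
# using exponential inter-arrival gaps.
def poisson_sample(lam):
    t, k = 0.0, 0
    while True:
        t += random.expovariate(lam)
        if t > 1.0:
            return k
        k += 1

pois = [poisson_sample(n * p) for _ in range(trials)]

print(statistics.mean(binom), statistics.variance(binom))  # near np = 6, npq = 4.2
print(statistics.mean(pois), statistics.variance(pois))    # near np = 6, np = 6
```

The binomial's variance comes out smaller than its mean (npq < np), while the Poisson's mean and variance agree, matching the formulas above.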
Power of Positive Thinking
Curves all positive.
Right tailed.
Binomial has highest power if sample data is binomial.
Result: Smaller n for given Beta.
Kinda normal
Approximations work some of the time.
Rule: npq > 5 for binomial approximation.
Goal: Keep mean > 3σ so normal is all positive.
Q: How good an approximation?
A: It depends...
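One way to quantify "it depends" is to compute the worst-case gap between the binomial CDF and its normal approximation (with continuity correction). A stdlib-only sketch, with n = 20 and the three p values from the slides that follow; the helper names are mine:

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def norm_cdf(x, mu, sigma):
    """CDF of N(mu, sigma) via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def max_gap(n, p):
    """Largest |binomial CDF - normal CDF|, with continuity correction."""
    mu, sigma = n * p, math.sqrt(n * p * (1 - p))
    return max(abs(binom_cdf(k, n, p) - norm_cdf(k + 0.5, mu, sigma))
               for k in range(n + 1))

for p in (0.5, 0.3, 0.1):
    print(f"p={p}: npq={20 * p * (1 - p):.1f}, max CDF gap={max_gap(20, p):.4f}")
```

The gap grows as p drifts away from 0.5 and npq falls below 5, which is exactly the pattern the next three slides illustrate.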
The middle way
Binomial:
n=20, p=0.5
Normal:
µ = 10, σ = 2.24
Decent approximation.
Off to one side
Binomial:
n=20, p=0.3
Normal:
µ = 6, σ = 2.0
Drifting negative.
Life on the edge
Binomial:
n=20, p=0.1
Normal:
µ = 2, σ = 1.3
Significantly negative.
General rule: npq > 5
Small or large p is skewed.
Six-sigma range should be positive.
At that point n > 5 / pq.
For p = 0.0013, n ≈ 3852.
Sample size around 4000?
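Under the npq > 5 rule, the minimum n for a given p is a one-line computation. A small sketch (the function name is mine):

```python
import math

def min_sample_size(p, limit=5.0):
    """Smallest n with n * p * (1 - p) strictly greater than `limit`."""
    return math.floor(limit / (p * (1 - p))) + 1

print(min_sample_size(0.0013))  # → 3852, i.e. "around 4000"
```

Small p drives pq toward zero, so rare events need very large samples before the normal approximation is defensible.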
When we assume we make...
Assuming normal data leaves a less robust conclusion.
Stronger, less robust:
Sensitive to individual datasets.
Not reproducible.
Non-parametric Statistics
Origins in Psychology, Biology, Marketing.
Analyze counts, ranks.
Tests based on discrete distributions.
Common in Quality
Frequency of failures.
QC with go/no-go gauges.
Variations between batch runs.
Customer feedback.
Example: Safety study
Q: Are departments equally "safe"?
Q: Is a new configuration any "safer"?
Compare sample populations.
What is "safe"?
Fewer reported injuries?
What is P( injury ) per operation?
0.5?
0.1?
A whole lot less?
N( 0.01, 0.01 ) is heavily negative.
Severe?
Parametric ranking of injuries?
( Finger + Thumb ) / 2 == ?
( Hand + Eye ) == Arm ?
( Hand + Hand ) == 2 * Hand ?
Ordinal Data
Ranked data, not scaled.
Hangnail < Finger Tip < Finger < Hand < Arm
"Fuzzy Buckets"
Have p( accident ) from history.
Kolmogorov-Smirnov
Got tonic? Nope, not vodka.
Like F or ANOVA: tests whether populations are "different".
K-S Test
Compare cumulative data (blue) vs. expected (red).
Measure is largest difference (arrow).
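The statistic itself is simple to compute by hand. A stdlib sketch of the one-sample version (the function name is mine), checking the empirical CDF's gap on both sides of each jump:

```python
def ks_statistic(sample, expected_cdf):
    """Largest vertical gap between the empirical CDF of `sample`
    (the blue curve) and `expected_cdf` (the red curve)."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = expected_cdf(x)
        # The ECDF jumps from (i-1)/n to i/n at x; check both sides.
        d = max(d, abs(i / n - f), abs((i - 1) / n - f))
    return d

# Three points checked against a Uniform(0, 1) CDF:
print(ks_statistic([0.25, 0.5, 0.75], lambda x: x))  # → 0.25
```

If SciPy is available, `scipy.stats.kstest` and `scipy.stats.ks_2samp` compute the same statistic along with a p-value.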
K-S for safety
Rank the injuries on a relative scale.
Compare counts by bucket.
Cumulative distribution accommodates:
empty cells.
minor mis-categorization.
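Applied to the safety study, the comparison reduces to cumulative fractions over the ordered severity buckets. A sketch with invented counts (the department data below is hypothetical, for illustration only):

```python
# Hypothetical injury counts per severity bucket for two departments;
# buckets are ordered least to most severe, numbers are invented.
buckets = ["hangnail", "finger tip", "finger", "hand", "arm"]
dept_a = [40, 25, 20, 10, 5]
dept_b = [30, 20, 25, 15, 10]

def cumulative_fractions(counts):
    """Running proportion of injuries at or below each severity bucket."""
    total = sum(counts)
    running, out = 0, []
    for c in counts:
        running += c
        out.append(running / total)
    return out

# K-S-style statistic: largest gap between the two cumulative curves.
# An empty cell just contributes a zero-height step, and an injury
# filed one bucket off moves the curve only slightly.
d = max(abs(a - b) for a, b in
        zip(cumulative_fractions(dept_a), cumulative_fractions(dept_b)))
print(f"D = {d:.2f}")  # → D = 0.15
```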
A good datum is hard to find,
You always get the other kind.
Apologies to Bessie Smith
Sliding-scale questions:
"How would you rate..."
"How well did..."
"How likely are you to..."
Reproducibility:
Variable skill.
Variable methods.
Variable data handling.
Big Data:
Multiple sources.
Multiple populations.
Multiple data standards.
Repeatable Analysis
Variety of non-parametric tests for "messy" data.
Handle protocol, sampling variations.
Robust conclusions with real data.
Summary
Non-parametric data: counts, nominal, ordinal data.
Non-parametric analysis avoids NID assumptions.
Robust analysis of real data.
Even the para-normal.
References: K-S
http://itl.nist.gov/div898/handbook/eda/section3/eda35g.htm
Exploratory data analysis is worth exploring.
https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
As always... really good writeup of the test definition, math.
References: Robust Analysis
https://en.wikipedia.org/wiki/Robust_statistics
https://en.wikipedia.org/wiki/Robust_regression
Decent introductions.
Also look up "robust statistics" at nist.gov or "robust statistical
analysis" at duckduckgo.
References: This talk
http://slideshare.net/lembark
Along with everything else I've done...