Resolved: All Data Are Wrong (by definition)

Most data analysis focuses on the analysis and presentation of data, as if the data are somehow divine. Most people remain blissfully unaware of the fact that all data has to be sourced and that sourcing requires the application of a measurement process. The measurement process always introduces discrepancies between the expected result and the actual result. This talk presents examples of bad data presentation and then shows you how to make measurement discrepancies explicit.

Dr. Neil Gunther

April 11, 2020

Transcript

  1. Resolved: All Data Are Wrong (by definition)

    Dr. Neil Gunther, Performance Dynamics. Distributed Systems Meetup Online, Pune, India, April 11, 2020. © 2020 Performance Dynamics.
  2. Abstract

    “In God we Trust, all others must bring data” is a quote usually attributed to the famous proponent of data-driven decisions, William Edwards Deming. His unique viewpoint (c.1950) was that data analysis is essential for achieving superior performance in all facets of manufacturing. Today, the SARS-CoV-2 pandemic has highlighted the importance of data-driven decisions in the effort to combat its devastating impact until a vaccine becomes available. Deming’s approach, however, though necessary, is not sufficient. Put more bluntly, how do I know the data you are bringing is any good? Early Covid-19 data was manipulated by the Chinese government. Russia has since been accused of the same thing. But, even if the data have not been doctored, that doesn’t mean you should treat data as sacrosanct, no matter the context.

    The illusion that data are divine comes, in part, from the naive acceptance of the way measured values are reported. For example, a %cpu of 72.2 is commonly seen in various O/S performance tools. This pristine numerical representation gives the illusion of a divine source: an illusion that is especially rampant in distributed performance monitoring applications. In reality, data are devilish and thus need to be ‘waterboarded’ to extract the truth. In this talk, I’ll show you how.
  3. Introduction: Outline

    1. Introduction
    2. Measurement Basics
    3. Bad Data Examples
    4. How to Express Errors
  4. Introduction: Things I’ve Tweeted about data

    - All data are wrong ... by definition
    - Treating data as divine is a sin
    - Measurement is much more than just sucking up data from various repositories
    - You can’t trust Data Scientists with data because they never learnt to make measurements
    - Measurement is a process, not a number
  5. Introduction: Types of data

    Linux ‘top’ shows both:

    - Continuous data:
      - These data have fractional values (usually ≥ 0)
      - Represented by positive reals (decimal point), ℝ≥0
      - Examples: %CPU column, timestamp
    - Discrete data:
      - These data are whole numbers (usually ≥ 0)
      - Counts represented by positive integers (cardinals plus 0), ℤ+
      - Examples: memory pages, packets

    Won’t be considering: categorical data, nominal data, ordinal data.
  6. Measurement Basics: Outline

    1. Introduction
    2. Measurement Basics
    3. Bad Data Examples
    4. How to Express Errors
  7. Measurement Basics: Cut me some slack

    The expected value is never identical to the actual value. How much slack should be allowed between the expected (pencil line) and actual (saw cut)?

    - Construction: Woodframe tolerance 1/8" (USA) or 2 mm ISO (EU).
    - Your Income: It’s tax season. How accurate are the amounts on your income tax filing? Try asking your CPA.
    - GFC 2008: How many mortgages went into default? How much real money was actually lost? These are BANKS! How could they not know?

    We don’t have any “building code” tolerances in the IT world. Generally done (if at all) on an ad hoc design basis.
  8. Bad Data Examples: Outline

    1. Introduction
    2. Measurement Basics
    3. Bad Data Examples
    4. How to Express Errors
  12. Bad Data Examples: Superluminal neutrino data

    - Sept 2011: OPERA/LHC Italian team announces neutrinos with $v_\nu > c$
    - Big deal: Implies Einstein was wrong coz SRT is badly broken
    - Extremely difficult measurements
    - “It only takes ONE experiment to prove me wrong”
    - Data check: 6σ confidence level
      - 1 in a BILLION chance that these data are a fluke
      - 5σ (1 in a million) is sufficient for particle physics (e.g., Higgs boson)
    - Dec 2011: OPERA team withdraws paper
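The "1 in a billion" and "1 in a million" figures can be checked against the tail area of a normal distribution. The following is a minimal sketch in R, assuming a Gaussian model and a one-sided tail (the slide does not say which convention it uses); the variable names are illustrative.

```r
# One-sided tail probability of a standard normal beyond k sigma.
# Assumptions: Gaussian errors and a one-sided test.
sigma_levels <- c(5, 6)
p_fluke <- pnorm(-sigma_levels)   # P(Z > k) for k = 5 and k = 6
one_in  <- 1 / p_fluke            # "1 in N" form of the same probability

for (i in seq_along(sigma_levels)) {
  cat(sprintf("%g sigma: p = %.3g (about 1 in %.3g)\n",
              sigma_levels[i], p_fluke[i], one_in[i]))
}
```

Under these assumptions 6σ comes out close to 1 in a billion, while 5σ is nearer 1 in 3.5 million, so "1 in a million" reads as an order-of-magnitude figure.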
  15. Bad Data Examples: Superluminal neutrinos disappear

    - Unlocked: $v_\nu > c$
    - Locked: $v_\nu < c$
    - OPERA team manager also disappeared
  16. Bad Data Examples: Hubble’s bubble 1929

    The most important scatter plot in history (1 pc = 3.3 ly)
  17. Bad Data Examples: Hubble’s trouble 1929

    1. Linear regression model $\hat{y} = 423.94\,x_1$ (through origin, c = 0)
       - regressand $\hat{y}$: recession velocity (km/sec)
       - regressor $x_1$: enormous distance to the “stars”
    2. Slope (rise/run) ⟹ expansion rate $H_0$ (inverse time units)
    3. Inverted slope ⟹ time the universe has been expanding
    4. Edwin double checked and “corrected” the slope (in wrong direction!)
       - $H_0 = 465.18$ km/s/Mpc (Hubble’s constant)
       - $T_{\text{univ}} \approx 2$ billion years
    5. $T_{\text{earth}} > 3$ billion yrs in 1929 ... 4.5 billion yrs today
    6. Oops! What do you do?
       - Toss it (data says you’re wrong)
       - Publish (possibly destroy your career)
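The step from slope to age is a unit conversion: inverting a Hubble constant in km/s/Mpc gives a time. Here is a rough sketch in R, assuming a constant expansion rate (so the age is simply 1/H0) and standard values for the megaparsec and the Julian year; the function name is hypothetical.

```r
# Convert a Hubble constant in km/s/Mpc into an expansion time 1/H0.
# Assumption: the expansion rate is constant, so age = 1/H0.
KM_PER_MPC   <- 3.0857e19    # kilometres in one megaparsec
SEC_PER_YEAR <- 3.15576e7    # seconds in one Julian year

hubble_time_gyr <- function(h0_km_s_mpc) {
  seconds <- KM_PER_MPC / h0_km_s_mpc   # 1/H0 expressed in seconds
  seconds / SEC_PER_YEAR / 1e9          # converted to billions of years
}

t_1929  <- hubble_time_gyr(465.18)   # Hubble's "corrected" slope
t_today <- hubble_time_gyr(70)       # approximate modern value
cat(sprintf("H0 = 465.18 km/s/Mpc -> about %.1f billion years\n", t_1929))
cat(sprintf("H0 = 70 km/s/Mpc     -> about %.1f billion years\n", t_today))
```

With the 1929 slope the universe comes out at roughly 2 billion years old, younger than the Earth was already believed to be, which is the "Oops!" in item 6.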
  18. Bad Data Examples: Hubble’s win 2003

    Data comes from the Devil; models come from God ($H_0 \approx 70$ today)
  19. Bad Data Examples: Counting is easy

    Can’t have errors ... right? But might get entangled
  20. Bad Data Examples: CERN LHC paper

    - “ATLAS Collaboration” is a proxy for the entire experimental team
    - Too many authors to fit under the paper title
    - But exactly how many?
  22. Bad Data Examples: CERN authors

    - Author list is 15 appended pages!
    - Article limit is 4 pages
    - More than 3000 authors!!!
  23. Bad Data Examples: Just write a program (right?)

    - Author counts. Sources of error:
      - PDF has to be parsed as text
      - Comma separators, new lines, etc.
      - Weird numbers for author affiliation
      - Different programs get different counts
      - What about the tea lady?
    - Streaming counts. Sources of error:
      - Is counter or sensor working correctly?
      - Is event being triggered correctly?
      - Is the count being transported reliably?
      - Is count being accumulated correctly?
      - Are statistics (quantiles, histograms, etc.) computed correctly?
      - What is the error margin?
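To see how "different programs get different counts", consider the toy case below. It is a sketch in R using an invented author string (the names and affiliation numbers are made up, not taken from the ATLAS paper), where one author has two comma-separated affiliation indices.

```r
# Hypothetical author list as a PDF-to-text tool might emit it.
# Names are invented; trailing digits are affiliation indices,
# and "2,3" means one author with two affiliations.
raw <- "A. Alpha1, B. Beta2,3, C. Gamma4,\nD. Delta5"

pieces <- strsplit(raw, ",", fixed = TRUE)[[1]]

# Program 1: every comma-separated piece is counted as an author.
naive_count <- length(pieces)

# Program 2: only pieces containing a letter are counted as authors.
careful_count <- sum(grepl("[A-Za-z]", pieces))

cat("naive:", naive_count, " careful:", careful_count, "\n")
```

The two programs disagree (5 versus 4) on a four-author list, and neither rule is guaranteed to survive the next formatting quirk, which is why even a simple count deserves an error margin.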
  24. Bad Data Examples: Load average, mangled metrics

        [njg]~% uptime
        11:28 up 2:50, 3 users, load averages: 0.91 1.09 1.08

    - Load average is the original performance tool (c.1965)
    - Measures run-queue length in Unix/Linux OS
    - Not your average kind of average (low-pass filter)
    - In Solaris 2.0 and 2.2 kernel code for LA was modified
    - Sun Microsystems never told anyone
    - Remained broken until Solaris 2.3
    - If you used it for sys admin or capacity planning ... good luck!
    - Software is so malleable
    - You need to regression test your vendor perf tools
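The "low-pass filter" remark can be made concrete. This sketch in R assumes the exponentially damped form used by the Linux kernel, where the run-queue length is sampled every 5 seconds and blended into 1-, 5- and 15-minute averages. It uses floating point as a stand-in for the kernel's fixed-point arithmetic; the function name and the constant run-queue input are illustrative only.

```r
# Exponentially damped moving average of run-queue length.
# Assumptions: 5-second sampling, damping factor exp(-step/window).
damped_load <- function(runq, window_s, step_s = 5) {
  decay <- exp(-step_s / window_s)
  load  <- numeric(length(runq))
  prev  <- 0                                  # start from an idle system
  for (i in seq_along(runq)) {
    prev    <- prev * decay + runq[i] * (1 - decay)
    load[i] <- prev
  }
  load
}

runq    <- rep(1, 12)                         # one runnable process for 60 s
la_1min <- damped_load(runq, window_s = 60)
la_5min <- damped_load(runq, window_s = 300)
cat(sprintf("after 60 s: 1-min = %.3f, 5-min = %.3f\n",
            la_1min[12], la_5min[12]))
```

After a full minute of constant load 1, the 1-minute figure has only reached about 0.63, so the reported number lags the true run-queue length by design.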
  25. How to Express Errors: Outline

    1. Introduction
    2. Measurement Basics
    3. Bad Data Examples
    4. How to Express Errors
  26. How to Express Errors: All data has errors

    1. Data has to be sourced
    2. The source is usually some form of measuring instrument(ation)
    3. Measuring always introduces errors (wrongness)
    4. All data are wrong
    5. The real question is:
       - How big is that wrongidity?
       - How much wrongidityness can you tolerate? (SLAs)

    Out of sight, out of mind: A widespread problem today is that most people doing data analysis are far removed from the sources of data and the type of measurement processes that generated those data. Blissfully unaware of the measurement aspect, the whole notion of errors never enters their mind. Data appear divine.
  27. How to Express Errors: Expressing measurement error

    Need more than one sample measurement:

    1. Take n data samples
    2. Take the sample mean (usual average)
    3. Take the sample standard deviation around that mean
    4. Compute the standard error (divide std dev by √n)
    5. Write the result as mean ± std error
    6. Range around the mean quantifies the error margin
    7. Rule of thumb: Relative error: (std error / mean) × 100. A relative standard error of around ±5% is typical and acceptable
  28. How to Express Errors: Example

        times.ms <- rnorm(100, mean=10, sd=3)  # fake some data
        plot(times.ms, type="h", col="blue", xlab="Sample", ylab="Time (ms)")

        # data analysis
        mu <- mean(times.ms)                   # 9.871424
        sd <- sd(times.ms)                     # 3.231457
        se <- sd / sqrt(length(times.ms))      # 0.3231457
        cat(sprintf("Reported value: %.2f \u00b1 %.2f ms\n", mu, se))
        abline(h=mu, col="red")
        abline(h=mu + se, col="red", lty="dashed")
        abline(h=mu - se, col="red", lty="dashed")

    Output:

        Reported value: 9.87 ± 0.32 ms

    [Figure: two plots of Time (ms) against Sample]
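The example stops at mean ± standard error. The relative-error rule of thumb from slide 27 can be applied to the same numbers. This short sketch in R reuses the mean and standard error quoted in the comments above; the 5% threshold is the slide's rule of thumb and the variable names are illustrative.

```r
# Values quoted in the comments of the example above.
mu <- 9.871424    # sample mean (ms)
se <- 0.3231457   # standard error (ms)

rel_err_pct <- se / mu * 100          # relative error in percent
acceptable  <- rel_err_pct <= 5       # slide 27 rule of thumb

cat(sprintf("Relative error: %.2f%% (within 5%%: %s)\n",
            rel_err_pct, acceptable))
```

For these data the relative error is a little over 3%, comfortably inside the ±5% guideline.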
  29. How to Express Errors: Other error representations

    See slide 17 (Hubble’s win 2003) for error bars on Hubble data
  30. How to Express Errors: Covid-19 curves (Added)

    - Financial Times daily SIR curves don’t show any error bars (but could)
    - Appear each day on Twitter
    - Created by John Burn-Murdoch @jburnmurdoch
  31. How to Express Errors: Confidence intervals
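The slide's figure is not available in this transcript. As a stand-in, here is one common way to turn a mean and standard error into a confidence interval. It is a sketch in R that assumes a 95% level and a Student t interval, applied to the slide 28 values (n = 100); none of these choices are stated in the talk.

```r
# 95% confidence interval for the mean, using a Student t critical value.
# Assumptions: 95% level, approximately normal sampling distribution.
n  <- 100         # number of samples in the slide 28 example
mu <- 9.871424    # sample mean (ms) from slide 28
se <- 0.3231457   # standard error (ms) from slide 28

t_crit <- qt(0.975, df = n - 1)       # two-sided 95% critical value
ci     <- c(lower = mu - t_crit * se, upper = mu + t_crit * se)

cat(sprintf("95%% CI: [%.2f, %.2f] ms\n", ci["lower"], ci["upper"]))
```

The interval is roughly twice as wide as mean ± one standard error, so a report should always say which of the two it is quoting.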
  32. How to Express Errors: Significant digits

    - Guerrilla CaP book Chap. 3
    - Can’t have more digits reported than the smallest resolution of the measuring instrument
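A related habit is to let the error decide how many digits of the mean are worth printing. The sketch below in R follows a common reporting convention (round the error to one or two significant digits, then round the value to the same decimal place). That convention is an assumption here rather than something stated in the talk, and the function name is hypothetical.

```r
# Round a measured value to the decimal place of its (rounded) error.
# Assumption: the error keeps err_digits significant digits.
report <- function(value, err, err_digits = 1) {
  err_r    <- signif(err, err_digits)
  decimals <- max(0, err_digits - 1 - floor(log10(err_r)))
  paste0(formatC(round(value, decimals), format = "f", digits = decimals),
         " \u00b1 ",
         formatC(err_r, format = "f", digits = decimals))
}

one_digit  <- report(9.871424, 0.3231457)      # error kept to 1 digit
two_digits <- report(9.871424, 0.3231457, 2)   # error kept to 2 digits
cat(one_digit, "\n")
cat(two_digits, "\n")
```

With two significant digits in the error this reproduces the 9.87 ± 0.32 ms reported on slide 28; with one digit it tightens to 9.9 ± 0.3 ms.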
  33. How to Express Errors: Accuracy vs. precision

    - Also in Chap. 3
    - Archery targets
    - Distance from bullseye analogous to accuracy
    - Clustering analogous to precision
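The archery analogy maps onto two numbers: the offset of the sample mean from the true value (accuracy) and the spread of the samples (precision). Below is a small sketch in R with made-up measurements of a quantity whose true value is assumed to be 10; all data and names are hypothetical.

```r
# Made-up measurements of a quantity whose true value is assumed known.
true_value <- 10
tight_but_off <- c(12.1, 11.9, 12.0, 12.2, 11.8)  # clustered, far from bullseye
loose_but_on  <- c(8, 12, 10, 11, 9)              # scattered around bullseye

summarise_shots <- function(x, truth) {
  c(bias = mean(x) - truth,   # accuracy: distance of the mean from the bullseye
    spread = sd(x))           # precision: how tightly the shots cluster
}

a <- summarise_shots(tight_but_off, true_value)
b <- summarise_shots(loose_but_on, true_value)
print(rbind(tight_but_off = a, loose_but_on = b))
```

The first set is precise but inaccurate (small spread, bias of 2), the second accurate but imprecise; taking more samples shrinks the standard error but does nothing to remove the bias.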
  34. How to Express Errors: Hubble’s data

    - Hubble didn’t have a precision problem
    - He had an accuracy problem
    - His data was way off the bullseye
    - Systematic error in 1920s telescope optics
    - Likely suspected that or wouldn’t have published
    - But made it worse by “correcting” the slope
    - Despite all that, his linear model survived
  35. How to Express Errors: Check. Check. Then check again!
  36. How to Express Errors: Waterboarding your data

    [Figure: three panels plotted against offered load: Measured Throughput (TPS), Measured Response Time (ms), Running Threads (active threads)]

    - Use Little’s law N = X ∗ R
    - X and R are measured directly
    - N: offered load on DVR (driver) side
    - X ∗ R: actual load on SUT (system under test) side
    - N = X ∗ R !?
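The Little's law cross-check can be scripted so that every load test is interrogated the same way. The sketch below in R uses invented load-test numbers (not the data behind the plots) and an arbitrary 5% tolerance; all variable names are hypothetical.

```r
# Hypothetical load-test results: all numbers are made up for illustration.
offered_n <- c(100, 200, 300, 400)   # N: offered load set on the driver
x_tps     <- c(250, 480, 500, 500)   # X: measured throughput (per second)
r_ms      <- c(400, 410, 420, 600)   # R: measured response time (ms)

actual_n <- x_tps * (r_ms / 1000)    # Little's law, with R converted to seconds
rel_gap  <- abs(actual_n - offered_n) / offered_n
suspect  <- rel_gap > 0.05           # 5% tolerance is an arbitrary choice

print(data.frame(offered_n, actual_n, rel_gap = round(rel_gap, 3), suspect))
```

Rows flagged as suspect are those where the load seen by the system under test does not match what the driver claims to be offering, which points to a problem in the measurement setup rather than in the system itself.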
  37. How to Express Errors: Summary

    - All data are wrong. Why?
      - Measurement is a process
      - That process introduces errors (systematic, random, etc.)
      - Need to see error margin to decide acceptability
      - Check ... check ... and check again
    - Measurements are represented by numbers (but not math)
      - Numbers like 72 or 72.37 are not sufficient
      - Require 72.37 ± 0.21 to indicate error explicitly
      - Don’t exceed sigfigs
    - Need (i) sample mean $\hat{\mu}$ and (ii) sample standard deviation $\hat{\sigma}$
      - Measure n samples (1-shots are meaningless)
      - Standard error: $\hat{\sigma}/\sqrt{n}$
      - Reported value: $\hat{\mu} \pm \hat{\sigma}/\sqrt{n}$ (as above)
    - But reporting errors does not guarantee meaningfulness (remember those neutrinos)
  38. How to Express Errors: Wanna know more?

    1. Guerrilla book
    2. Online training
  39. Comments and/or Questions?

    Performance Dynamics, Castro Valley, California. www.perfdynamics.com. Training, Twitter, Facebook, Blog. [email protected]. +1-510-537-5758.