Resolved: All Data Are Wrong (by definition)

Most data analysis focuses on the analysis and presentation of data, as if the data are somehow divine. Most people remain blissfully unaware of the fact that all data has to be sourced and that sourcing requires the application of a measurement process. The measurement process always introduces discrepancies between the expected result and the actual result. This talk presents examples of bad data presentation and then shows you how to make measurement discrepancies explicit.

Dr. Neil Gunther

April 11, 2020

Transcript

  1. Resolved: All Data Are Wrong
    (by definition)
    Dr. Neil Gunther
    Performance Dynamics
    Distributed Systems Meetup Online
    Pune, INDIA
    April 11 2020
    © 2020 Performance Dynamics Resolved: All Data Are Wrong April 11, 2020 1 / 37

  2. Abstract
    “In God we Trust, all others must bring data” is a quote usually attributed to the famous proponent of
    data-driven decisions, William Edwards Deming. His unique viewpoint (c.1950) was that data analysis is
    essential for achieving superior performance in all facets of manufacturing. Today, the SARS-CoV-2 pandemic
    has highlighted the importance of data-driven decisions in the effort to combat its devastating impact until a
    vaccine becomes available. Deming’s approach, however, though necessary, is not sufficient. Put more bluntly,
    how do I know the data you are bringing is any good? Early Covid-19 data was manipulated by the Chinese
    government. Russia has since been accused of the same thing. But, even if the data have not been doctored,
    that doesn’t mean you should treat data as sacrosanct, no matter the context. The illusion that data are divine
    comes, in part, from the naive acceptance of the way measured values are reported. For example, a %cpu of
    72.2 is commonly seen in various O/S performance tools. This pristine numerical representation gives the
    illusion of a divine source: an illusion that is especially rampant in distributed performance monitoring
    applications. In reality, data are devilish and thus need to be ‘waterboarded’ to extract the truth. In this talk, I’ll
    show you how.

  3. Introduction
    Outline
    1 Introduction
    2 Measurement Basics
    3 Bad Data Examples
    4 How to Express Errors

  4. Introduction
    Things I’ve Tweeted about data
    All data are wrong ... by definition
    Treating data as divine is a sin
    Measurement is much more than just sucking up data from various repositories
    You can’t trust Data Scientists with data because they never learnt to make measurements
    Measurement is a process, not a number

  5. Introduction
    Typical performance data

  6. Introduction
    Types of data
    Linux ‘top’ shows both
    Continuous data: These data have fractional values (usually ≥ 0)
    Represented by positive reals
    Decimal point, ℝ≥0
    Examples: %CPU column, timestamp
    Discrete data: These data are whole numbers (usually ≥ 0)
    Counts represented by positive integers
    Cardinals plus zero, ℤ+
    Examples: memory pages, packets
    Won’t be considering:
    Categorical data
    Nominal data
    Ordinal data

  7. Measurement Basics
    Outline
    1 Introduction
    2 Measurement Basics
    3 Bad Data Examples
    4 How to Express Errors

  8. Measurement Basics
    Measure twice ...

  9. Measurement Basics
    ... cut once

  10. Measurement Basics
    No perfect measurement

  11. Measurement Basics
    Cut me some slack
    The expected value is never identical to the actual value
    How much slack should be allowed between the expected (pencil line) and
    actual (saw cut)?
    Construction: Woodframe tolerance 1/8" (USA) or 2 mm ISO (EU).
    Your Income: It’s tax season. How accurate are the amounts on your income
    tax filing? Try asking your CPA.
    GFC 2008: How many mortgages went into default? How much real money
    was actually lost? These are BANKS! How could they not know?
    We don’t have any “building code” tolerances in the IT world.
    Generally done (if at all) on an ad hoc design basis.

  12. Bad Data Examples
    Outline
    1 Introduction
    2 Measurement Basics
    3 Bad Data Examples
    4 How to Express Errors

  16. Bad Data Examples
    Superluminal neutrino data
    Sept 2011: OPERA/LHC Italian team announces neutrinos with $v_\nu > c$
    Big deal: Implies Einstein was wrong coz SRT is badly broken
    Extremely difficult measurements
    “It only takes ONE experiment to prove me wrong”
    Data check: 6σ confidence level
    1 in a BILLION chance that these data are a fluke
    5σ (1 in a million) is sufficient for particle physics (e.g., Higgs boson)
    Dec 2011: OPERA team withdraws paper
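
The "1 in N" figures quoted for 5σ and 6σ come from the tail area of a normal distribution. The following is a minimal R sketch, assuming a one-sided Gaussian tail (the usual particle-physics convention); the helper names `tail_prob` and `one_in` are illustrative and not from the talk. Under that assumption 6σ is close to 1 in a billion, while 5σ is nearer 1 in 3.5 million than 1 in a million.

```r
# One-sided tail probability of a standard normal beyond k standard deviations
tail_prob <- function(k) pnorm(k, lower.tail = FALSE)

# The same probability expressed as "1 in N"
one_in <- function(k) 1 / tail_prob(k)

for (k in c(5, 6)) {
  cat(sprintf("%d sigma: p = %.3g, about 1 in %.3g\n", k, tail_prob(k), one_in(k)))
}
```

A small tail probability only says the result is unlikely to be a random fluke; it says nothing about systematic error, which is what undid this measurement.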

  19. Bad Data Examples
    Superluminal neutrinos disappear
    Unlocked: $v_\nu > c$
    Locked: $v_\nu < c$
    OPERA team manager also disappeared

  20. Bad Data Examples
    Hubble’s bubble 1929
    The most important scatter plot in history (1 pc = 3.3 ly)

  21. Bad Data Examples
    Hubble’s trouble 1929
    1 Linear regression model $\hat{y} = 423.94\,x_1$ (through origin, $c = 0$)
      regressand $\hat{y}$: recession velocity (km/sec)
      regressor $x_1$: enormous distance to the “stars”
    2 Slope (rise/run) $\Rightarrow$ expansion rate $H_0$ (inverse time units)
    3 Inverted slope $\Rightarrow$ time the universe has been expanding
    4 Edwin double checked and “corrected” the slope (in wrong direction!)
      $H_0 = 465.18$ km/s/Mpc (Hubble’s constant)
      $T_{\text{univ}} \approx 2$ billion years
    5 $T_{\text{earth}} > 3$ billion yrs in 1929 ... 4.5 billion yrs today
    6 Oops! What do you do?
      Toss it (data says you’re wrong)
      Publish (possibly destroy your career)
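
The step from slope to age can be checked with a few lines of arithmetic. This is a minimal R sketch, assuming the age is approximated by the Hubble time 1/H0 (which ignores any change in the expansion rate over time); the function name and the unit-conversion constants are supplied here for illustration and are not from the slide.

```r
# Convert a Hubble constant in km/s/Mpc into a Hubble time (1/H0) in billions of years
hubble_time_gyr <- function(h0_km_s_mpc) {
  km_per_mpc <- 3.0857e19               # kilometres in one megaparsec
  s_per_gyr  <- 1e9 * 365.25 * 86400    # seconds in a billion (Julian) years
  (km_per_mpc / h0_km_s_mpc) / s_per_gyr
}

hubble_time_gyr(465.18)  # Hubble's "corrected" 1929 slope
hubble_time_gyr(70)      # a value near today's estimate
```

With the 1929 slope the result is about 2 billion years, younger than the Earth was already believed to be, which is the contradiction the slide points at.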

  22. Bad Data Examples
    Hubble’s win 2003
    Data comes from the Devil; models come from God ($H_0 \approx 70$ km/s/Mpc today)

  23. Bad Data Examples
    Counting is easy
    Can’t have errors ... right?
    But might get entangled

  24. Bad Data Examples
    CERN LHC paper
    “ATLAS Collaboration” is a proxy for the entire experimental team
    Too many authors to fit under the paper title
    But exactly how many?

  26. Bad Data Examples
    CERN authors
    Author list is 15 appended pages!
    Article limit is 4 pages
    More than 3000 authors !!!

  27. Bad Data Examples
    Just write a program (right?)
    Author counts
    Sources of error:
      PDF has to be parsed as text
      Comma separators, new lines, etc.
      Weird numbers for author affiliation
      Different programs get different counts
      What about the tea lady?
    Streaming counts
    Sources of error:
      Is counter or sensor working correctly?
      Is event being triggered correctly?
      Is the count being transported reliably?
      Is count being accumulated correctly?
      Are statistics (quantiles, histograms, etc.) being computed correctly?
      What is the error margin?
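
To see how "different programs get different counts", here is a minimal R sketch. The author string is invented for illustration (it is not taken from the ATLAS paper) and mimics text extracted from a PDF, with affiliation numbers fused to surnames and a suffix after a comma. Three plausible counting rules give three different answers.

```r
# Hypothetical author-list text as a PDF-to-text tool might emit it
raw <- "A. Author1,2, B. Writer3,\nC. Contributor1, D. Scribe, Jr.4"

# Rule 1: one author per comma-separated field
n_by_comma <- length(strsplit(raw, ",", fixed = TRUE)[[1]])

# Rule 2: one author per line
n_by_line <- length(strsplit(raw, "\n", fixed = TRUE)[[1]])

# Rule 3: one author per "Initial. Surname" pattern
pattern <- "[[:upper:]]\\. [[:upper:]][[:lower:]]+"
n_by_pattern <- length(regmatches(raw, gregexpr(pattern, raw, perl = TRUE))[[1]])

cat("comma rule:", n_by_comma, " line rule:", n_by_line,
    " pattern rule:", n_by_pattern, "\n")
```

Only the pattern rule recovers the four authors in this toy string, and it would in turn miss hyphenated or multi-part surnames, so even a count carries an error margin.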

  28. Bad Data Examples
    Load average — mangled metrics
    [njg]~% uptime
    11:28 up 2:50, 3 users, load averages: 0.91 1.09 1.08
    Load average is the original performance tool (c.1965)
    Measures run-queue length in Unix/Linux OS
    Not your average kind of average (low-pass filter)
    In Solaris 2.0 and 2.2 kernel code for LA was modified
    Sun Microsystems never told anyone
    Remained broken until Solaris 2.3
    If you used it for sys admin or capacity planning ... good luck!
    Software is so malleable
    You need to regression test your vendor perf tools
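
The "low-pass filter" remark can be made concrete. This is a minimal R sketch, assuming the exponentially damped update used for the Linux 1-minute load average (run queue sampled every 5 seconds, 60-second time constant); the function name, arguments, and the constant run-queue input are illustrative.

```r
# Exponentially damped moving average of sampled run-queue lengths
load_average <- function(run_queue, window_s = 60, sample_s = 5, init = 0) {
  decay <- exp(-sample_s / window_s)   # weight kept from the previous value
  la <- numeric(length(run_queue))
  prev <- init
  for (i in seq_along(run_queue)) {
    prev <- prev * decay + run_queue[i] * (1 - decay)
    la[i] <- prev
  }
  la
}

# One process continuously runnable for 5 minutes, starting from an idle system
la <- load_average(rep(1, 60))
round(la[c(12, 36, 60)], 3)   # values after 1, 3 and 5 minutes
```

After one minute the reported value has covered only about 63% of the step, so the metric lags the true run-queue length by design. A regression test for a vendor tool could feed it a known load and compare against a curve like this.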

  29. How to Express Errors
    Outline
    1 Introduction
    2 Measurement Basics
    3 Bad Data Examples
    4 How to Express Errors

  30. How to Express Errors
    All data has errors
    1 Data has to be sourced
    2 The source is usually some form of measuring instrument(ation)
    3 Measuring always introduces errors (wrongness)
    4 All data are wrong
    5 The real question is:
    How big is that wrongidity?
    How much wrongidityness can you tolerate? (SLAs)
    Out of sight, out of mind
    A widespread problem today is that most people doing data analysis are far
    removed from the sources of data and the type of measurement processes
    that generated those data. Blissfully unaware of the measurement aspect, the
    whole notion of errors never enters their mind. Data appear divine.

  31. How to Express Errors
    Expressing measurement error
    Need more than one sample measurement
    1 Take n data samples
    2 Take the sample mean (usual average)
    3 Take the sample standard deviation around that mean
    4 Compute the standard error (divide std dev by $\sqrt{n}$)
    5 Write the result as mean ± std error
    6 Range around the mean quantifies the error margin
    7 Rule of thumb:
      Relative error: $\frac{\text{std error}}{\text{mean}} \times 100$
      A relative error of around ±5% is typical and acceptable

  32. How to Express Errors
    Example
    times.ms <- rnorm(100, mean=10, sd=3)   # fake some data (no seed set, so values vary per run)
    plot(times.ms, type="h", col="blue", xlab="Sample", ylab="Time (ms)")
    # data analysis (commented values are those shown on the slide)
    mu    <- mean(times.ms)                  # 9.871424
    sigma <- sd(times.ms)                    # 3.231457 (named sigma to avoid masking sd())
    se    <- sigma / sqrt(length(times.ms))  # 0.3231457
    cat(sprintf("Reported value: %.2f \u00b1 %.2f ms\n", mu, se))
    abline(h=mu, col="red")                      # sample mean
    abline(h=mu + se, col="red", lty="dashed")   # mean plus one standard error
    abline(h=mu - se, col="red", lty="dashed")   # mean minus one standard error
    Reported value: 9.87 ± 0.32 ms
    [Figure: two plots of Time (ms) versus Sample, 0 to 100]
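
The relative-error rule of thumb can be applied to the same kind of data. This is a minimal R sketch; the function name `summarize_measurement`, the sample values, and the 5% threshold used in the check are illustrative assumptions, not code from the talk.

```r
# Mean, standard error, and relative error (in percent) for a vector of samples
summarize_measurement <- function(x) {
  stopifnot(length(x) > 1)                 # a single sample has no error estimate
  mu <- mean(x)
  se <- sd(x) / sqrt(length(x))
  list(mean = mu, std_error = se, rel_error_pct = 100 * se / mu)
}

# Hypothetical response-time samples in milliseconds
times_ms <- c(9.8, 10.4, 9.1, 11.0, 10.2, 9.6, 10.9, 9.9)
res <- summarize_measurement(times_ms)
cat(sprintf("%.2f \u00b1 %.2f ms (relative error %.1f%%)\n",
            res$mean, res$std_error, res$rel_error_pct))
res$rel_error_pct <= 5   # within the rule-of-thumb margin?
```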

  33. How to Express Errors
    Other error representations
    See the “Hubble’s win 2003” slide (17 of 37) for error bars on Hubble data

  34. How to Express Errors
    Covid-19 curves (Added)
    Financial Times daily SIR curves don’t show any error bars (but could)
    Appear each day on Twitter
    Created by John Burn-Murdoch @jburnmurdoch

  35. How to Express Errors
    Confidence intervals

  36. How to Express Errors
    Significant digits
    Guerrilla CaP book Chap. 3
    Can’t have more digits reported than
    the smallest resolution of the measuring
    instrument
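
One common way to avoid reporting digits the measurement cannot support is to round the value to the decimal place of its error. This is a minimal R sketch, assuming the standard error stands in for the instrument resolution and is quoted to two significant digits; the function name `report_value` and that two-digit default are illustrative choices, not a rule from the book.

```r
# Format "value ± error" with no more decimals than the error supports
report_value <- function(mu, se, se_sigfigs = 2) {
  stopifnot(se > 0)
  # decimal places needed to show se to se_sigfigs significant digits
  decimals <- max(0, se_sigfigs - 1 - floor(log10(se)))
  paste(formatC(mu, format = "f", digits = decimals),
        "\u00b1",
        formatC(se, format = "f", digits = decimals))
}

report_value(9.871424, 0.3231457)
report_value(72.3712, 0.2149)
```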

  37. How to Express Errors
    Accuracy vs. precision
    Also in Chap. 3
    Archery targets
    Distance from bullseye
    analogous to accuracy
    Clustering analogous to
    precision
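
The archery analogy maps onto two separate numbers. This is a minimal R sketch with invented data and illustrative helper names: distance of the sample mean from a known true value measures accuracy, and the spread of the samples measures precision.

```r
truth <- 10                                  # the bullseye (known reference value)
tight_but_off       <- c(11.9, 12.0, 12.1)   # precise, not accurate
scattered_on_target <- c(7, 10, 13)          # accurate on average, not precise

accuracy_error   <- function(x, truth) abs(mean(x) - truth)  # distance from bullseye
precision_spread <- function(x) sd(x)                        # clustering of the shots

for (name in c("tight_but_off", "scattered_on_target")) {
  x <- get(name)
  cat(sprintf("%-20s accuracy error = %.2f, spread = %.2f\n",
              name, accuracy_error(x, truth), precision_spread(x)))
}
```

Taking more samples shrinks the standard error of the first data set but never removes its offset, which is why a systematic error cannot be averaged away.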

  38. How to Express Errors
    Hubble’s data
    Hubble didn’t have a
    precision problem
    He had an accuracy
    problem
    His data was way off the
    bullseye
    Systematic error in 1920s
    telescope optics
    Likely suspected that or
    wouldn’t have published
    But made it worse by
    “correcting” the slope
    Despite all that, his linear
    model survived

  39. How to Express Errors


  40. How to Express Errors
    Check. Check. Then check again!

  41. How to Express Errors
    Waterboarding your data
    [Figure: three panels plotted against offered load: Measured Throughput (TPS), Measured Response Time (ms), Running Threads (active threads)]
    Use Little’s law N = X ∗ R
    X and R are measured directly
    N: offered load on driver (DVR) side
    X ∗ R: actual load on system-under-test (SUT) side
    N = X ∗ R !?
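
The Little's law cross-check can be automated. This is a minimal R sketch; the function name, the sample numbers, and the 5% tolerance are assumptions made for illustration, and the response time is converted from milliseconds to seconds so that X ∗ R comes out as a thread count.

```r
# Compare offered load N with the load implied by measured X (TPS) and R (ms)
littles_law_check <- function(offered_n, tps, resp_ms, tol_pct = 5) {
  actual_n <- tps * (resp_ms / 1000)                      # N = X * R, with R in seconds
  discrepancy_pct <- 100 * (actual_n - offered_n) / offered_n
  data.frame(offered_n = offered_n,
             actual_n = actual_n,
             discrepancy_pct = discrepancy_pct,
             ok = abs(discrepancy_pct) <= tol_pct)
}

# Hypothetical load-test rows: the second one fails the consistency check
littles_law_check(offered_n = c(100, 400),
                  tps       = c(400, 450),
                  resp_ms   = c(250, 500))
```

A row flagged as not ok means the driver was not delivering the load it claimed, or that X or R was mismeasured. Either way the data need interrogating before any analysis.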

  42. How to Express Errors
    Summary
    All data are wrong. Why?
    Measurement is a process
    That process introduces errors (systematic, random, etc.)
    Need to see error margin to decide acceptability
    Check ... check ... and check again
    Measurements are represented by numbers (but not math)
    Numbers like 72 or 72.37 are not sufficient
    Require 72.37 ± 0.21 to indicate error explicitly
    Don’t exceed sigfigs
    Need (i) sample mean $\hat{\mu}$ and (ii) sample standard deviation $\hat{\sigma}$
    Measure n samples (1-shots are meaningless)
    Standard error: $\hat{\sigma}/\sqrt{n}$
    Reported value: $\hat{\mu} \pm \hat{\sigma}/\sqrt{n}$ (as above)
    But reporting errors does not guarantee meaningfulness (remember those neutrinos)

  43. How to Express Errors
    Wanna know more?
    1. Guerrilla book 2. Online training

  44. Comments and/or Questions?
    Performance Dynamics
    Castro Valley, California
    www.perfdynamics.com
    Training
    Twitter
    Facebook
    Blog
    +1-510-537-5758