
Hunting for Bad Data: A Practitioner’s Guide to Self Healing Systems

emanuil
November 06, 2019


In 2013, the total amount of data in the world was 4.4 zettabytes. By 2020 it was estimated to be ten times larger. With speed advances and the miniaturization of computers, it is now easier than ever to collect all sorts of data in vast amounts. But quantity does not equate to quality. Some of the most expensive software defects were caused by handling incorrect data. The most recent example is the crash of the Schiaparelli Mars lander.

Back on Earth, at Falcon.io, we investigate every exception that is generated in our production environment. We were surprised to find out that 19% are caused by bad data. This includes missing data, data with the wrong type, truncated data, duplicated data, inconsistent formats, etc. As our core business is to collect, process, and analyze data from the biggest social networks, this finding was a major wake-up call.

“Jidoka” is a term that comes from lean manufacturing. It means to give the machines “the ability to detect when an abnormal condition has occurred”. We wanted to go one step further and create a simple yet robust system with the ability not only to recognize bad data but also to fix it automatically, so that it can heal itself. As it turns out, in some cases it is faster and easier to fix bad data than to locate the code that produced it. This is a practical talk, containing ideas and tips that you can quickly implement when dealing with data quality.


Transcript

  1. Hunting for Bad Data
    A Practitioner’s Guide to Self Healing Systems
    [email protected]
    @EmanuilSlavov


  2. A Tale of Two Defects


  5. Defects
    Manifestations
    Resources
    Logs
    Data

  6. Test your data the way
    you test your code.


  8. Gartner, 2013
    IBM, 2016
    Ovum, 2014

  9. 5% of our data was bad. Any operation
    on it would cause an exception or defect.

  10. 19% of our backend exceptions
    are caused by bad data.

  11. What is Bad Data?*
    Missing
    Bad Format
    Unrealistic
    Unsynchronized
    Conflicting
    Duplicated
    * The Quartz guide to bad data
    github.com/Quartz/bad-data-guide

  12. Two Types
    of Checks


  13. Data Sanity Checks
    The data is clearly not valid: wrong format, missing, etc.

  14. Business Data Checks
    The data looks valid, but does not conform to the business at hand.
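The two kinds of checks can be sketched as plain validation functions. This is a minimal Python sketch; the record fields (`id`, `created_at`, `text`) and the 280-character limit are illustrative assumptions, not from the talk:

```python
from datetime import datetime


def sanity_check(post: dict) -> list[str]:
    """Data sanity: the value is structurally invalid (missing, wrong type, bad format)."""
    errors = []
    if not isinstance(post.get("id"), int):
        errors.append("id missing or not an integer")
    try:
        datetime.fromisoformat(post.get("created_at", ""))
    except ValueError:
        errors.append("created_at is not an ISO-8601 timestamp")
    return errors


def business_check(post: dict) -> list[str]:
    """Business data: the value parses fine but contradicts domain rules."""
    errors = []
    try:
        when = datetime.fromisoformat(post.get("created_at", ""))
        now = datetime.now(when.tzinfo)  # naive or aware, matching the parsed value
        if when > now:
            errors.append("created_at is in the future")
    except ValueError:
        pass  # unparseable values are already reported by the sanity check
    if len(post.get("text", "")) > 280:
        errors.append("text exceeds the network's length limit")
    return errors
```

Keeping the two layers separate means an unparseable value is reported once, by the sanity layer, and never double-counted as a business failure.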

  15. Self Check

  16. Investigate Defect or Exception → Fix Root Cause → Write Automated Test

  17. Investigate Defect or Exception → Caused by Bad Data?
    Yes → Add Automatic Bad Data Check (System Level [database])
    No → Fix Root Cause → Write Automated Test (Unit Level [codebase])

  18. Check production periodically
    Run after automated tests pass*
    *on dedicated test environment

  19. The Problems
    with DB Checks
    May take too much time
    Data entering the system

  20. Read/Write Checks
    Data Input → Some Manipulation → DB → Data Output
    Data input is usually covered by input validation;
    around the DB reads and writes, checks are most likely missing.

  21. Checks Before DB Write
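One way to run checks before the DB write is a guard function directly in front of the persistence call. A sketch under assumed names (`BadDataError`, `save_post`, and the dict-backed store are all hypothetical, not from the talk):

```python
class BadDataError(ValueError):
    """Raised when a record fails its checks on the write path."""


def validate_before_write(record: dict, checks) -> dict:
    # Run every check and collect all failures, so the log shows the full picture.
    errors = [msg for check in checks for msg in check(record)]
    if errors:
        raise BadDataError(f"refusing to write bad data: {errors}")
    return record


def save_post(db: dict, record: dict, checks) -> None:
    # The guard sits directly in front of the write, so nothing bypasses it.
    db[record["id"]] = validate_before_write(record, checks)
```

Raising instead of silently dropping the record turns bad data into an exception at the point of origin, which is exactly where the investigation workflow on the earlier slides starts.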

  22. Schema vs NoSchema

  23. DB Schema Advantages
    Data Type
    Default Value
    Permitted Values
    Nullable
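These four advantages map directly onto column constraints. A sketch using SQLite from Python (the `posts` table and its columns are illustrative, not from the talk):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE posts (
        id         INTEGER PRIMARY KEY,                  -- data type
        text       TEXT NOT NULL,                        -- nullable: forbidden
        network    TEXT NOT NULL
                   CHECK (network IN ('facebook', 'twitter', 'instagram')),  -- permitted values
        created_at TEXT NOT NULL DEFAULT (datetime('now'))                   -- default value
    )
""")
conn.execute("INSERT INTO posts (text, network) VALUES ('hello', 'twitter')")

# A value outside the permitted set is rejected by the database itself:
try:
    conn.execute("INSERT INTO posts (text, network) VALUES ('oops', 'myspace')")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

One caveat: SQLite enforces `NOT NULL`, `DEFAULT`, and `CHECK`, but its type affinity is looser than most schema databases, so the "data type" guarantee is stronger in engines like PostgreSQL.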

  24. github.com/emanuil/kobold


  27. As testers, our job does not
    end when we release a feature.

  28. Data Repair

  29. Automatic detection is good,
    but automatic repair is better.

  30. It's faster to fix data than code
    The offending code might no longer be there
    It might be hard to find what caused the bad data
    Future defect prevention

  31. Standard Fixes

  32. Remove an entry
    Set a default value
    Extract missing value from metadata
    Approximate from neighboring data
    Request missing data again
    Archive/delete old data

  33. Those standard fixes
    are easy to script.
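As an illustration of how easily they script, a single repair pass can cover several of the standard fixes at once. A minimal sketch; the field names and the `"en"` default are assumptions, not from the talk:

```python
def repair(records: list[dict]) -> list[dict]:
    """One repair pass: drop unidentifiable entries, deduplicate, fill defaults."""
    seen_ids = set()
    fixed = []
    for rec in records:
        if "id" not in rec:
            continue                      # remove an entry that cannot be identified
        if rec["id"] in seen_ids:
            continue                      # drop duplicated data
        seen_ids.add(rec["id"])
        rec.setdefault("language", "en")  # set a default value for a missing field
        fixed.append(rec)
    return fixed
```

Fixes like "approximate from neighboring data" or "request missing data again" follow the same shape: a condition that detects the problem, paired with a deterministic repair.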

  34. Run the script automatically at a given interval to self-heal your system.

  35. How to Start?

  36. Define what bad data means in your context
    Examine bugs and exceptions
    Put checks in a script and run it periodically
    Study the common fixes and script them
    Make sure your backups are working
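The checklist can then be wired into one periodic job. A sketch of the detect-then-fix pass (the check and fix functions are placeholders; scheduling is left to cron or a similar scheduler, not shown in the talk):

```python
def heal_once(records, checks, fix) -> int:
    """One self-healing pass: find records failing any check, then repair them."""
    bad = [rec for rec in records if any(check(rec) for check in checks)]
    for rec in bad:
        fix(rec)
    return len(bad)  # how much bad data was found, to track the trend over time

# Scheduling belongs to the environment, e.g. a cron entry such as:
#   0 * * * *  /usr/bin/python3 /opt/checks/heal.py
```

Returning the count of repaired records matters: a rising trend means the root cause is still in the codebase and the unit-level branch of the workflow is due.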

  37. Jidoka
    自働化


  40. WE’RE HIRING.
