Hunting for Bad Data: A Practitioner’s Guide to Self Healing Systems

November 06, 2019

In 2013, the total amount of data in the world was 4.4 zettabytes. By 2020 it is estimated to be 10 times more. With speed advances and the miniaturization of computers, it is now easier than ever to collect all sorts of data in vast amounts. But quantity does not equate to quality. Some of the most expensive software defects were caused by handling incorrect data. The most recent example is the crash of the Schiaparelli Mars Lander.

Back on Earth, at Falcon.io, we investigate every exception generated from our production environment. We were surprised to find that 19% are caused by bad data. This includes missing data, data with the wrong type, truncated data, duplicated data, inconsistent formats, etc. As our core business is to collect, process, and analyze data from the biggest social networks, this finding was a major wake-up call.

“Jidoka” is a term that comes from lean manufacturing. It means giving machines “the ability to detect when an abnormal condition has occurred”. We wanted to go one step further and create a simple yet robust system able not only to recognize bad data but also to fix it automatically, so that the system can heal itself. As it turns out, in some cases it’s faster and easier to fix bad data than to locate the code that produced it. This talk is a practical one, containing ideas and tips that you can quickly implement when dealing with data quality.



  1. Hunting for Bad Data A Practitioner’s Guide to Self Healing

    Systems [email protected] @EmanuilSlavov
  2. A Tale of Two Defects @EmanuilSlavov

  3. @EmanuilSlavov

  4. @EmanuilSlavov

  5. Defects Manifestations Resources Logs Data @EmanuilSlavov

  6. Test your data the way you test your code. @EmanuilSlavov

  7. @EmanuilSlavov

  8. Gartner, 2013 IBM, 2016 Ovum, 2014 @EmanuilSlavov

  9. 5% of our data was bad. Any operation on it would

    cause an exception/defect @EmanuilSlavov
  10. 19% of our backend exceptions are caused by bad data @EmanuilSlavov

  11. What is Bad Data?* Missing Bad Format Unrealistic Unsynchronized Conflicting

    Duplicated * The Quartz guide to bad data
 github.com/Quartz/bad-data-guide @EmanuilSlavov
  12. Two Types of Checks

  13. Data Sanity Checks @EmanuilSlavov The data is clearly not valid,

    wrong format, missing etc.
  14. Business Data Checks @EmanuilSlavov The data looks valid, but does

    not conform to the business at hand.
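The two check types from slides 13 and 14 can be sketched in a few lines. This is a minimal illustration, assuming hypothetical user records with `email` and `signup_date` fields — those names are not from the talk:

```python
# Sanity checks catch clearly invalid data (missing, wrong format);
# business checks catch data that looks valid but breaks a domain rule.
import re
from datetime import date

def sanity_check(record):
    """Data sanity: the value is clearly not valid."""
    errors = []
    if not record.get("email"):
        errors.append("email is missing")
    elif not re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", record["email"]):
        errors.append("email has wrong format")
    return errors

def business_check(record):
    """Business check: a well-formed value that is impossible for the business."""
    errors = []
    signup = record.get("signup_date")
    if signup and signup > date.today():
        errors.append("signup_date is in the future")  # a valid date, but nonsense here
    return errors

record = {"email": "not-an-email", "signup_date": date(2999, 1, 1)}
print(sanity_check(record))    # ['email has wrong format']
print(business_check(record))  # ['signup_date is in the future']
```

The split matters because sanity checks can run generically over any table, while business checks need domain knowledge per field.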
  15. Self Check @EmanuilSlavov

  16. Investigate Defect or Exception Fix Root Cause Write Automated Test

  17. Investigate Defect or Exception Caused by Bad Data? Add Automatic

    Bad Data Check Fix Root Cause Write Automated Test Yes No System Level [database] Unit Level [codebase] @EmanuilSlavov
  18. Check production periodically Run after automated tests pass* *on dedicated

    test environment @EmanuilSlavov
  19. The Problems with DB Checks May take too much time

    Data entering the system @EmanuilSlavov
  20. Data Input Data Output Some Manipulation DB Usually covered by

    input validation Most likely checks are missing Read/Write Checks @EmanuilSlavov
  21. Checks Before DB Write @EmanuilSlavov

  22. Schema vs NoSchema @EmanuilSlavov

  23. @EmanuilSlavov DB Schema Advantages Data Type Default Value Permitted Values
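The three schema advantages on slide 23 can be demonstrated with SQLite's built-in constraints. A sketch with invented table and column names:

```python
# Data type, default value, and permitted values, all declared in the schema
# so bad data is rejected before it ever reaches the table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE posts (
        id      INTEGER PRIMARY KEY,          -- declared data type (SQLite enforces
                                              -- it loosely, via type affinity)
        likes   INTEGER NOT NULL DEFAULT 0,   -- default value for missing data
        network TEXT NOT NULL
                CHECK (network IN ('facebook', 'twitter', 'instagram'))  -- permitted values
    )
""")
conn.execute("INSERT INTO posts (network) VALUES ('twitter')")
try:
    conn.execute("INSERT INTO posts (network) VALUES ('myspace')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # the CHECK constraint keeps bad data out
```

Schemaless stores skip these guarantees, which is why the deck contrasts "Schema vs NoSchema": without them, every one of these checks has to live in application code instead.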

  24. @EmanuilSlavov github.com/emanuil/kobold

  25. @EmanuilSlavov

  26. @EmanuilSlavov

  27. As testers, our job does not end when we release

    a feature. @EmanuilSlavov
  28. Data Repair @EmanuilSlavov

  29. Automatic detection is good, but automatic repair is better. @EmanuilSlavov

  30. It’s faster to fix data than code The offending code

    might no longer be there Might be hard to find what caused the bad data Future defect prevention @EmanuilSlavov
  31. Standard Fixes @EmanuilSlavov

  32. Remove an entry Set a default value Extract missing value

    from metadata Approximate from neighboring data Request missing data again Archive/delete old data @EmanuilSlavov
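Two of the standard fixes above — "set a default value" and "approximate from neighboring data" — are simple enough to sketch. The metric series here is invented for illustration:

```python
# Repair missing entries in a numeric series: average the neighbors when
# both exist, otherwise fall back to a default value.
def repair(series, default=0):
    fixed = list(series)
    for i, value in enumerate(fixed):
        if value is None:
            left = fixed[i - 1] if i > 0 else None
            right = fixed[i + 1] if i < len(fixed) - 1 else None
            if left is not None and right is not None:
                fixed[i] = (left + right) / 2   # approximate from neighbors
            else:
                fixed[i] = default              # set a default value
    return fixed

print(repair([10, None, 20, None]))  # → [10, 15.0, 20, 0]
```

The other fixes on the slide (re-request the data, extract from metadata, archive old rows) follow the same pattern: a small, deterministic function per failure mode.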
  33. Those standard fixes are easy to script. @EmanuilSlavov

  34. Run the script automatically on a given period to self

    heal your system. @EmanuilSlavov
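A hedged sketch of that periodic run, with illustrative names only — in practice a cron job or scheduler would trigger the script rather than a long-running loop:

```python
# Run the check-and-repair script on a fixed period so the system self-heals.
import time

def check_and_repair():
    """Placeholder for the scripted bad-data checks and standard fixes."""
    return True  # True = everything healthy or repaired

def self_heal_loop(period_seconds, iterations):
    """Invoke check_and_repair `iterations` times, sleeping between runs."""
    runs = 0
    for _ in range(iterations):
        check_and_repair()
        runs += 1
        time.sleep(period_seconds)
    return runs

print(self_heal_loop(period_seconds=0, iterations=3))  # → 3
```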
  35. How to Start? @EmanuilSlavov

  36. Define what is bad data for your context Examine Bugs

    and Exceptions Put checks in a script and run it periodically Study the common fixes and script them Make sure your backups are working @EmanuilSlavov
  37. Jidoka 自働化 @EmanuilSlavov

  38. @EmanuilSlavov

  39. None