
Hunting for Bad Data: A Practitioner’s Guide to Self Healing Systems

In 2013, the total amount of data in the world was 4.4 zettabytes. In 2020 it is estimated to be 10 times more. With speed advances and the miniaturization of computers, it is now easier than ever to collect all sorts of data in vast amounts. But quantity does not equate to quality. Some of the most expensive software defects were caused by handling incorrect data. The most recent example is the crash of the Schiaparelli Mars Lander.

Back to Earth, at Falcon.io we investigate every exception generated in our production environment. We were surprised to find that 19% of them are caused by bad data: missing data, data with the wrong type, truncated data, duplicated data, inconsistent formats, and so on. As our core business is to collect, process, and analyze data from the biggest social networks, this finding was a major wake-up call.

“Jidoka” is a term from lean manufacturing. It means giving machines “the ability to detect when an abnormal condition has occurred”. We wanted to go one step further and create a simple yet robust system that can not only recognize bad data but also fix it automatically, so that the system heals itself. As it turns out, in some cases it is faster and easier to fix bad data than to locate the code that produced it. This is a practical talk, with ideas and tips that you can quickly implement when dealing with data quality.

emanuil

November 06, 2019

Transcript

  1. 5% of our data was bad. Any operation on it would cause an exception or a defect.
  2. What is Bad Data?* Missing, bad format, unrealistic, unsynchronized, conflicting, duplicated (a check sketch covering several of these categories follows the transcript). * The Quartz guide to bad data: github.com/Quartz/bad-data-guide
  3. Investigate the defect or exception: is it caused by bad data? Yes: add an automatic bad data check at the system level [database]. No: fix the root cause and write an automated test at the unit level [codebase].
  4. The problems with DB checks: they may take too much time, and bad data keeps entering the system between runs.
  5. Data flow: Data Input → Some Manipulation → DB → Data Output. The input side is usually covered by input validation; read/write checks around the DB are most likely missing (see the repository sketch after the transcript).
  6. It's faster to fix data than code: the offending code might not be there anymore; it might be hard to find what caused the bad data; and it helps with future defect prevention.
  7. Fix strategies: remove an entry; set a default value; extract the missing value from metadata; approximate from neighboring data; request the missing data again; archive/delete old data (a sketch of a few of these follows the transcript).
  8. Define what bad data means for your context; examine bugs and exceptions; put checks in a script and run it periodically (see the first sketch below); study the common fixes and script them; make sure your backups are working.
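
A few sketches follow; they are not part of the deck and use hypothetical names throughout. First, the periodic bad-data check that slides 2 and 8 describe: each bad-data category becomes a small check over an assumed "users" table, and the script can be scheduled (for example via cron) to report anything suspicious.

# Minimal sketch of a periodic bad-data check (hypothetical schema: a "users"
# table with id, email, country, signup_date), intended to run from cron.
import re
import sqlite3
from collections import Counter

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def find_bad_data(conn):
    """Return (record_id, problem) pairs covering a few of the bad-data
    categories from the slides: missing, bad format, unrealistic, duplicated."""
    problems = []
    rows = conn.execute("SELECT id, email, country, signup_date FROM users").fetchall()
    email_counts = Counter(row[1] for row in rows if row[1])

    for record_id, email, country, signup_date in rows:
        if not email:
            problems.append((record_id, "missing email"))
        elif not EMAIL_RE.match(email):
            problems.append((record_id, "bad email format"))
        elif email_counts[email] > 1:
            problems.append((record_id, "duplicated email"))
        if country is not None and len(country) != 2:
            problems.append((record_id, "country is not a two-letter code"))
        if signup_date is not None and signup_date > "2100-01-01":
            problems.append((record_id, "unrealistic signup date"))
    return problems

if __name__ == "__main__":
    with sqlite3.connect("app.db") as conn:
        for record_id, problem in find_bad_data(conn):
            print(f"user {record_id}: {problem}")  # or push to your alerting channel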
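
For the read/write checks from slide 5, one possible shape (the User record, the validation rules, and the in-memory stand-in for the database are all assumptions) is to validate records right before they are written and right after they are read, so bad data is caught at the storage boundary instead of deep inside the business logic.

# Sketch of read/write checks around the persistence layer; names are illustrative.
from dataclasses import dataclass

@dataclass
class User:
    id: int
    email: str
    country: str

class BadDataError(Exception):
    pass

def validate(user: User) -> None:
    # Raise early instead of letting bad data propagate through the system.
    if not user.email or "@" not in user.email:
        raise BadDataError(f"user {user.id}: missing or malformed email")
    if len(user.country) != 2:
        raise BadDataError(f"user {user.id}: unexpected country code {user.country!r}")

class InMemoryDB:
    # Stand-in for a real database client, just enough to make the sketch runnable.
    def __init__(self):
        self.users = {}

class UserRepository:
    def __init__(self, db: InMemoryDB):
        self.db = db

    def save(self, user: User) -> None:
        validate(user)                 # write check: refuse to persist bad data
        self.db.users[user.id] = user

    def load(self, user_id: int) -> User:
        user = self.db.users[user_id]
        validate(user)                 # read check: catch bad data already stored
        return user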
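
Finally, the fix strategies from slide 7 can be scripted as well. The sketch below uses an invented list of hourly metric readings and shows three of the strategies: setting a default value, approximating a missing value from neighboring data, and removing an entry that cannot be repaired.

# Sketch of self-healing fixes for a list of hourly readings, e.g.
# {"hour": 13, "value": 42.0, "source": "api"}; field names are invented.
DEFAULT_SOURCE = "unknown"

def heal(readings):
    healed = []
    for i, reading in enumerate(readings):
        fixed = dict(reading)

        # Strategy: set a default value for missing metadata.
        if not fixed.get("source"):
            fixed["source"] = DEFAULT_SOURCE

        # Strategy: approximate a missing value from neighboring data points.
        if fixed.get("value") is None:
            prev = readings[i - 1].get("value") if i > 0 else None
            nxt = readings[i + 1].get("value") if i + 1 < len(readings) else None
            neighbors = [v for v in (prev, nxt) if v is not None]
            if neighbors:
                fixed["value"] = sum(neighbors) / len(neighbors)
            else:
                # Strategy: remove an entry that cannot be repaired.
                continue

        healed.append(fixed)
    return healed

if __name__ == "__main__":
    sample = [
        {"hour": 12, "value": 40.0, "source": "api"},
        {"hour": 13, "value": None, "source": ""},   # broken reading
        {"hour": 14, "value": 44.0, "source": "api"},
    ]
    print(heal(sample))  # the 13:00 reading becomes value 42.0, source "unknown"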