In 2013, the total amount of data in the world was 4.4 zettabytes. In 2020 it is estimated to be 10 times more. With speed advances and the miniaturization of computers, it is now easier than ever to collect all sorts of data in vast amounts. But quantity does not equate to quality. Some of the most expensive software defects were caused by handling incorrect data. The most recent example is the crash of the Schiaparelli Mars Lander.
Back on Earth, at Falcon.io, we investigate every exception generated in our production environment. We were surprised to find that 19% of them are caused by bad data: missing data, data of the wrong type, truncated data, duplicated data, inconsistent formats, and so on. As our core business is to collect, process, and analyze data from the biggest social networks, this finding was a major wake-up call.
“Jidoka” is a term that comes from lean manufacturing. It means giving machines “the ability to detect when an abnormal condition has occurred”. We wanted to go one step further and create a simple yet robust system with the ability not only to recognize bad data but also to fix it automatically, so that it can heal itself. As it turns out, in some cases it is faster and easier to fix bad data than to locate the code that produced it. This talk is a practical one, containing ideas and tips that you can quickly implement when dealing with data quality.
Hunting for Bad Data
A Practitioner’s Guide to Self-Healing Systems
A Tale of Two Defects
Test your data the way you test your code.
5% of our data was bad. Any operation on it would cause an exception/defect.
19% of our backend exceptions are caused by bad data.
What is Bad Data?*
Missing, bad format
* The Quartz guide to bad data
Data Sanity Checks
The data is clearly not valid: wrong format, wrong type, etc.
Business Data Checks
The data looks valid, but does not conform to the business at hand.
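As a minimal sketch of the two kinds of checks (the record fields and rules are illustrative, not Falcon.io's actual schema), assuming a social-media engagement record:

```python
from datetime import datetime, timezone

def sanity_checks(rec):
    """Catch data that is clearly invalid: missing fields, wrong types, bad format."""
    errors = []
    for field in ("post_id", "likes", "created_at"):
        if field not in rec:
            errors.append(f"missing field: {field}")
    if not isinstance(rec.get("likes"), int):
        errors.append("likes is not an integer")
    try:
        datetime.fromisoformat(rec["created_at"])
    except (KeyError, ValueError):
        errors.append("created_at is not a valid ISO-8601 timestamp")
    return errors

def business_checks(rec):
    """Catch data that looks valid but contradicts the business at hand.
    Assumes the sanity checks above already passed."""
    errors = []
    if rec.get("likes", 0) < 0:
        errors.append("likes cannot be negative")
    created = datetime.fromisoformat(rec["created_at"])
    if created > datetime.now(timezone.utc):
        errors.append("created_at lies in the future")
    return errors

record = {"post_id": "42", "likes": 17, "created_at": "2017-03-01T12:00:00+00:00"}
print(sanity_checks(record) + business_checks(record))  # [] when the record is clean
```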
Bad Data Check
Check production periodically
Run after automated tests pass*
*on a dedicated test environment
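To make "run after automated tests pass" concrete, here is a minimal sketch of wiring the checks into the test run, assuming pytest; `load_sample_records` is a hypothetical helper that pulls records from the dedicated test environment:

```python
import pytest

# Hypothetical module holding the check functions sketched above
# plus a loader for records from the test environment's database.
from checks import business_checks, load_sample_records, sanity_checks

@pytest.mark.parametrize("record", load_sample_records(limit=1000))
def test_no_bad_data(record):
    errors = sanity_checks(record) + business_checks(record)
    assert not errors, f"bad data in record {record.get('post_id')}: {errors}"
```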
DB checks may take too much time
Data entering the system is most likely checked by input validation
DB Schema Advantages
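One advantage of the DB schema is that constraints reject bad rows before they ever enter the system. A minimal sketch, using SQLite and an illustrative table (not Falcon.io's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE posts (
        post_id    TEXT    NOT NULL PRIMARY KEY,          -- no missing ids
        likes      INTEGER NOT NULL CHECK (likes >= 0),   -- no negative counts
        created_at TEXT    NOT NULL                       -- no missing timestamps
    )
""")

try:
    conn.execute("INSERT INTO posts VALUES (?, ?, ?)", ("42", -5, "2017-03-01"))
except sqlite3.IntegrityError as exc:
    # The schema stops the bad row at write time, before any code reads it.
    print("rejected:", exc)
```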
As testers, our job does not end when we release a feature.
Automatic detection is good, but automatic repair is better.
It's faster to fix data than code
The offending code might not be there anymore
Might be hard to find what caused the bad data
Future defect prevention
Remove an entry
Set a default value
Extract missing value from metadata
Approximate from neighboring data
Request missing data again
Archive/delete old data
Those standard fixes are easy to script, as sketched below.
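A minimal sketch of two of those fixes, setting a default value and approximating a missing value from neighboring data, applied to the hypothetical records from earlier:

```python
def fix_records(records):
    """Apply standard fixes to a list of records; returns fixed copies."""
    fixed = []
    for i, rec in enumerate(records):
        rec = dict(rec)  # work on a copy, keep the input intact
        # Fix: set a default value for a missing field.
        rec.setdefault("likes", 0)
        # Fix: approximate a missing value from neighboring data.
        if rec.get("followers") is None:
            left = records[i - 1].get("followers") if i > 0 else None
            right = records[i + 1].get("followers") if i + 1 < len(records) else None
            neighbors = [v for v in (left, right) if v is not None]
            if neighbors:
                rec["followers"] = sum(neighbors) // len(neighbors)
        fixed.append(rec)
    return fixed

records = [
    {"post_id": "1", "likes": 3, "followers": 100},
    {"post_id": "2", "followers": None},  # missing both values
    {"post_id": "3", "likes": 7, "followers": 120},
]
print(fix_records(records))
# record "2" comes back with likes=0 and followers=110 (average of 100 and 120)
```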
Run the script automatically at a given interval to self-heal.
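A minimal sketch of that self-healing loop, reusing the hypothetical helpers from the sketches above; in practice you would schedule this with cron or a similar scheduler rather than a bare sleep loop:

```python
import time

# All helpers are the hypothetical ones sketched earlier:
# checks, a record loader, the fixer, and a writer to persist results.
from checks import (business_checks, fix_records, load_sample_records,
                    sanity_checks, save_records)

INTERVAL_SECONDS = 60 * 60  # illustrative: run hourly

def self_heal_once():
    records = load_sample_records(limit=10_000)
    bad = [r for r in records if sanity_checks(r) or business_checks(r)]
    if bad:
        save_records(fix_records(bad))  # apply the standard fixes and persist

if __name__ == "__main__":
    while True:  # in production, prefer a cron job over a sleep loop
        self_heal_once()
        time.sleep(INTERVAL_SECONDS)
```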
Define what bad data means in your context
Examine bugs and exceptions
Put checks in a script and run it periodically
Study the common fixes and script them
Make sure your backups are working