Upgrade to Pro — share decks privately, control downloads, hide ads and more …

The Quest to Find Bad Data With Data Profiling

Richie Rump
February 07, 2015

The Quest to Find Bad Data With Data Profiling

The legendary adventurer is on the quest of a lifetime. He's found a database that holds the locations of the world's greatest relics, but there's a problem. No, it's not snakes. The database was written using an ancient format and it may hold incorrect data. Our data archaeologist must find a way move the correct data into SQL Server or risk losing our most important treasures, forever. Luckily, he has an powerful tool, data profiling.
 
Data profiling is a way that you can identify bad data, measure inaccurate data, and know for sure that your queries are right. There are many different ways our adventurer can use data profiling. This session will show you what data profiling is and how it can help lead you to fortune and glory.

Richie Rump

February 07, 2015
Tweet

More Decks by Richie Rump

Other Decks in Technology

Transcript

  1. • Degree of excellence exhibited by the data in relation

    to the portrayal of the actual scenario. • The state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use. • The totality of features and characteristics of data that bears on their ability to satisfy a given purpose; the sum of the degrees of excellence for factors related to data. • The processes and technologies involved in ensuring the conformance of data values to business requirements and acceptance criteria. • Complete, standards based, consistent, accurate and time stamped. “Data Quality” Wikipedia, revised on December 17, 2014, http://en.wikipedia.org/wiki/Data_quality
  2. • Boring enterprisey buzz words. • This is not a

    data quality strategy discussion • Hands-on tactical view for people who want to know how to get things done.
  3. • Investigate what data profiling is • Look at data

    profiling techniques • Walk through some data profiling tools • “Free Stuff” • The stuff you can use right now.
  4. Data profiling is a process whereby one examines the data

    available in an existing database or flat file and collect statistics and information about that data. Ed Lindsey, Three-Dimensional Analysis: Data Profiling Techniques (Data Profiling LLC, 2008), 29.
  5. Because you need to: • Transform the data • Move

    the data • Find patterns in data • The data is unknown (to you) • Improve data quality • Data warehouse stuff
  6. • Inferred data type • Data types • Minimum, mean

    and max values • Nulls rule, % of nulls, the number of nulls • Distinct values • Frequency distribution (counts) • Pattern analysis
  7. • Identify primary keys • Identify candidate keys • Verify

    foreign keys • Candidates for normalization • Duplicate rows
  8. • Apply your business rules to the data. • If

    you don’t know the rules of the business you better figure it out.
  9. • Evaluate the Table and Column Schema • Get a

    feel of what kind of data you’ll be dealing with. • Check out the data types • Take a peek at the data itself • DMV’s are great for viewing the table and column information
  10. • Determine Min, Max and Mean for numeric columns. •

    Gives an idea about the range of the data. • Does the Min, Max and Mean make sense for the data?
  11. • Nulls can be a problem so we need to

    know what we’re dealing with. • Nulls rule: Is it nullable? • Number of nulls: How many nulls are we dealing with? • Percentage of nulls: How many nulls compared with the number of rows?
  12. • Need to look at the individual values of a

    column and how frequently they occur. • We can do this by getting the distinct values and counts (frequency distribution) of a column.
  13. • Determine the format that a string is in. •

    Valid Phone Numbers • 3058675309 • (305)8675309 • 305-867-5309 • (305) 867-5309
  14. • We can use LIKE or PATINDEX in TSQL •

    We can use Regular Expressions against the data.
  15. • Mostly important for doing wibbly, wobbly, data warehouse-y stuff.

    • Need to go though column(s) and identify which combination of columns would make good candidate keys
  16. • Does the data in one table match the data

    in another. • What if we had a type that was missing? • We can accomplish this by using left joins and see what is missing.
  17. • Do we have data that needs to be normalized?

    • Delimited strings • Addresses • Repeated data that could use it’s own table?
  18. • Because we don’t want another Clone Wars. • Maybe

    it’s valid…maybe it’s not. • But we need to know if it’s there or not.
  19. • Since SQL Server 2008 SSIS has included the Data

    Profiling Task • Connects to a database and performs different data profiling tasks • Outputs information in XML • Use the Data Profile Viewer application to view the results.
  20. • Column Length Distribution • Column Null Ratio • Column

    Pattern • Column Statistics • Column Value Distribution • Candidate Key • Functional Dependency • Value Inclusion
  21. • Shows the extent that the values in one column

    depends on the values in another column(s). • Example: Zip Codes should all correspond to the same State.
  22. • Data profiling tools in one easy to use script.

    • Table Schema • NULL Ratios and Unique Values • Column Statistics • Candidate Key Check • Column Value Distribution
  23. • Three-Dimensional Analysis: Data Profiling Techniques, Ed Lindsey • CLR

    Assembly RegEx Functions for SQL Server by Example - https://www.simple-talk.com/sql/t-sql- programming/clr-assembly-regex-functions-for-sql- server-by-example/ • Pattern Matching (Regex) in T-SQL - http://www.sqllion.com/2010/12/pattern-matching- regex-in-t-sql/ • Data Profiling Task (MSDN) http://msdn.microsoft.com/en- us/library/bb895263.aspx