Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Data Cleansing / data related to individual

Data Cleansing / data related to individual

Bence Faludi

June 25, 2014
Tweet

More Decks by Bence Faludi

Other Decks in Technology

Transcript

  1. HELLO WORLD Bence Faludi (@bfaludi) Senior Database Manager ! Data

    warehouse specialist, IT project manager, Python expert. Built databases for Nissan, led IT development for Union Insurance, Central European University, Procter & Gamble and Profession.hu. Organizer of the Budapest Database Meetup and creator of the mETL business intelligence tool. ! Bence has more than 9 years of experience in development and project management. ! email: [email protected] twitter: @bfaludi Positions * Senior Database Manager @ Mito Europe * Head Of Development @ Ozmo * Organizer @ Budapest Database Meetup Responsibilities * Data warehouse design * Mathematical predictions * Data Cleansing & Analytics * Data Consulting * ETL & Python/Go Development * IT Project Management
  2. MITO & DATA data mining data cleansing & deduplication data

    data warehouse maintenance data visualisation data analysis CRM data reporting BI software development data migration
  3. WAT? ‣ Data cleansing is the process of detecting and

    correcting or removing corrupted or inaccurate records from a record set. ‣ Data cleansing also involves activities like, harmonization and standardization of data.
  4. PRINCIPLES ‣ Incorrect or inconsistent data can lead to false

    conclusions. ‣ Incorrect data can be costly. ‣ Spoils the image of the company. ‣ Makes it unable to reach out your target audience.
  5. PURPOSE ‣ Improve data quality. ‣ Increase data reliability. ‣

    Ensure data consistency. ‣ Eliminate errors and redundancy to remove all duplicated records. ‣ Prevent data purging issues.
  6. PROCESS Data Auditing to detect anomalies 1 Define structure and

    operations on the data 2 Execution run the operation workflow 3 Post-processing verify the result manually* 4
  7. SPLIT Standardize & correct your data before it will be

    loaded into the database Improve existing database Prevent Follow-up
  8. PROCESS OF PREVENTING Preparations detect anomalies, augment, rearrange, standardize automatically*

    1 Define structure and transformations & operations on the data 2 Execution run the operation workflow 3 Post-processing verify the result manually* 4 ETL load final dataset into database 5
  9. PROBLEMS TO SOLVE ‣ typographical errors, missing letters ‣ split

    address information into parts ‣ address standardization ‣ roman number conversion ‣ missing accents ‣ missing special characters ‣ remove test or scam data ‣ first name, last name classification and sorting ‣ detecting company’s name ‣ formatting phone numbers to E164 ‣ replacement of phone number’s national code
  10. EXAMPLE Full Name Company’s Name Address City Postal Code Phone!

    Number Email ! Address Kovács Péter Koin Finances Kft. Nador 23 Buda 1051 +36 1 132 1111 fp kukac gmail pont hu Dr. Kovacs Akos Sziv út 49, 1em/11 Budapest VI. kerület 1788333 [email protected] Farkas Ambrus János Player Media 8. kerulet 06- (31)-987/777 fa.janos [at] gmai [dot] com Éva Ambrusné Hegedűs Hegedűs Bt. 21 Ady Endre Str Érd 1221 6301234567, 06301238888 Mito Europe Kft. Bence Faludi Nádor u. 23. V. emelet Budapest 1053 [email protected]
  11. DEFINE EXPECTED DATA Full Name Company’s Name Address City Postal

    Code Phone! Number Email ! Address Kovács Péter Koin Finances Kft. Nador 23 Buda 1051 +36 1 132 1111 fp kukac gmail pont hu Dr. Kovacs Akos Sziv út 49, 1em/11 Budapest VI. kerület 1788333 [email protected] Farkas Ambrus János Player Media 8. kerulet 06- (31)-987/777 fa.janos [at] gmai [dot] com Éva Ambrusné Hegedűs Hegedűs Bt. 21 Ady Endre Str Érd 1221 6301234567, 06301238888 Mito Europe Kft. Bence Faludi Nádor u. 23. V. emelet Budapest 1053 [email protected] Full Name Company Name Phone Email Address City Postal Code
  12. AUGMENTATION Full Name Kovács Péter Dr. Kovacs Akos Farkas Ambrus

    János Éva Ambrusné Hegedűs Mito Europe Kft. FIRSTNAME LASTNAME FIRSTNAME LASTNAME TITLE FIRSTNAME LASTNAME FIRSTNAME LASTNAME FIRSTNAME LASTNAME FIRSTNAME MARRIED LASTNAME LASTNAME COMPANY POSTFIX UNKNOWN UNKNOWN 80 percent of the records contain name parts Common format (up to 75%): [Lastname] [Firstname] Send to manual check Mark it, not valid name value Full Name FULL NAME COMPANY NAME FULL NAME FULL NAME FULL NAME
  13. AUGMENTATION Company’s Name Koin Finances Kft. Player Media Hegedűs Bt.

    Bence Faludi Company’s name Mark it, not valid company’s name COMPANY POSTFIX UNKNOWN UNKNOWN FIRSTNAME LASTNAME UNKNOWN UNKNOWN LASTNAME COMPANY POSTFIX COMPANY NAME COMPANY NAME COMPANY NAME FULL NAME We won’t send it to manual check.
  14. AUGMENTATION Full address Address City Postal Code Nador 23 Buda

    1051 Sziv út 49, 1em/11 Budapest VI. kerület 8. kerulet 21 Ady Str Érd Nádor u. 23. V. emelet Budapest 1053 CITY PART CITY N LASTNAME STR SUFFIX CITY CITY CITY W DISTRICT CITY W DISTRICT ZIP STR SUFFIX N FIRSTNAME RN FLOOR SUFFIX STR SUFFIX N UNKNOWN FLOOR SUFFIX N N FIRSTNAME N CITY CITY CITY ZIP DISTRICT CITY DISTRICT CITY CITY ZIP STREET STREET STREET STREET ZIP
  15. AUGMENTATION Phone + Email Phone! Number Email ! Address +36

    1 132 1111 fp kukac gmail pont hu 1788333 [email protected] 06- (31)-987/777 fa.janos [at] gmai [dot] com 1221 6301234567, 06301238888 [email protected] PHONE N PHONE PHONE UNKNOWN AT UNKNOWN DOT DOMAIN SUFFIX EMAIL EMAIL UNKNOWN AT UNKNOWN DOT DOMAIN SUFFIX PHONE ZIP N ZIP MULTIPLE PHONE EMAIL EMAIL EMAIL EMAIL PHONE PHONE PHONE Mark it, not valid email address & phone number N PHONE
  16. REARRANGE Full Name Company’s Name Address City Postal Code Phone!

    Number Email ! Address Kovács Péter Koin Finances Kft. Nador 23 Buda 1051 +36 1 132 1111 fp kukac gmail pont hu Dr. Kovacs Akos Sziv út 49, 1em/11 Budapest VI. kerület 1788333 [email protected] Farkas Ambrus János Player Media 8. kerulet 06- (31)-987/777 fa.janos [at] gmai [dot] com Éva Ambrusné Hegedűs Hegedűs Bt. 21 Ady Str Érd 1221 6301234567, 06301238888 Mito Europe Kft. Bence Faludi Nádor u. 23. V. emelet Budapest 1053 [email protected] FULL NAME COMPANY NAME FULL NAME FULL NAME FULL NAME COMPANY NAME COMPANY NAME COMPANY NAME FULL NAME STREET STREET STREET CITY CITY CITY STREET ZIP DISTRICT CITY DISTRICT CITY CITY ZIP ZIP PHONE PHONE PHONE EMAIL EMAIL EMAIL EMAIL MULTIPLE PHONE
  17. IMPLEMENTATION ‣ Name collections per country
 9419 first name, 106484

    last name in Hungary ‣ Available titles and name prefixes 
 Dr, I﬋, Özv, … ‣ Postal code ranges 
 1000-9999 ‣ Cities, city parts, districts
 Budapest, Debrecen, Margitsziget, Békásmegyer, … ‣ Regular expressions
 Email, Phone number, … ! ‣ List of possible terms and spelling variations: ‣ District suffix 
 kerület, ker., … ‣ Street type
 út, utca, u., … ‣ Floor suffix
 emelet, em., … ‣ Email components
 pont, kukac, at, dot, … ‣ Company abbreviations
 Bt., Kft., Zrt., …
  18. PROCESS OF PREVENTING Preparations detect anomalies, augment, rearrange, standardize automatically*

    1 Define structure and transformations & operations on the data 2 Execution run the operation workflow 3 Post-processing verify the result manually* 4 ETL load final dataset into database 5
  19. BASIC OPERATIONS Input data Output data Name Full name data

    type Full name Full name data type Company name Company name data type City City data type City w empty/constant value Address Full Str. Address data type Address Full Str. Address data type
  20. COPY FROM Input data Output data City 2. City data

    type City 1. City data type City 1. City data type
  21. SPLIT MULTIPLE VALUES Phone Numbers Phone Number data type Input

    data Output data Phone 1. Phone Number data type Phone 2. Phone Number data type example: “6301234567, 06301238888”
  22. MULTI LEVEL DATA TYPES Street Street type House number Building

    City District Postal Code Full Street Name Full Street Address City Part Full Address Street Number Full Street Floor Door
  23. SPLIT DATA TYPE INTO PARTS Address Full Str. Address data

    type Input data Output data Full street Full Street data type Number House Number data type Building Building data type
  24. MERGE DATA TYPES Input data Output data First name First

    name data type Last name Last name data type Full name Full Name data type Title Title data type Middle name First name data type Need to define the expected ordering, if it is different then the default one.
  25. EXAMPLE Input Full name Full name data type First name

    First name data type Last name Last name data type Address Full Str. Address data type Full Street Full Street data type Number Street Number data type Company name Company name data type City City data type City City data type Postal Code Postal Code data type Postal Code Postal Code data type District District data type Email Email data type Email Email data type Phone Number Phone Number data type Phone 1. Phone Number data type Phone 2. Phone Number data type Output structure Country Country data type
  26. ARRANGE THE STRUCTURE First name First name data type Full

    Street Full Street data type City City data type Postal Code Postal Code data type Phone 1. Phone Number data type Email Email data type Last name Last name data type Number Street Number data type District District data type Phone 2. Phone Number data type Output structure Country Country data type 1 2 4 6 5 3 7 8 9 10 11
  27. PROCESS OF PREVENTING Define structure and transformations & operations on

    the data 2 Execution run the operation workflow 3 Post-processing verify the result manually* 4 ETL load final dataset into database 5 Preparations detect anomalies, augment, rearrange, standardize automatically* 1
  28. EMAIL Python  package:     validate_email Standardize to email format

    Validate mx domain Email ! Address fp kukac gmail pont hu [email protected] fa.janos [at] gmai [dot] com [email protected] [email protected] 1 [email protected] 1 [email protected] 1 [email protected] 2 [email protected] 1 Not existing domain 2 Not existing domain 2 Clear the value
  29. PHONE NUMBER Phone! Number +36 1 132 1111 1788333 06-(31)-987/777

    6301234567 Not valid 3 +3631987777 2 +3611321111 1 +361788333 1 0631987777 1 Clear the value 06301234567 1 +36301234567 2 •Regular expression collection •Phone number parser •Determine number type (mobile, landline) •Standardize to E164 format. Python  package:     phonenumbers
  30. COMPANY ‣ Companies form
 Bt., Kft., Zrt., … ‣ Companies

    types
 Önkormányzat, Hivatal, … ‣ List of companies ‣ Company name
 Mito Europe ‣ Company form
 Kft. ‣ Company form *long
 Korlátolt felelősségű társaság ‣ Full Company name
 Mito Europe Kft. ‣ Full Company name *long
 Mito Europe Korlátolt felelősségű társaság ‣ Regular expressions without accent ‣ We cannot fix the spellings ‣ Leibenstein distance from the elements of the company list. ‣ We can fix the spellings
  31. NAME ‣ List of first names ‣ List of last

    names ‣ List of titles ‣ Before
 Dr. Kovácsné-Nagy Benjamin ‣ Split the full name into parts. ‣ Convert into lowercase. 1. ‣ kovácsné-nagy ‣ dr ‣ benjamin
  32. NAME ‣ List of first names ‣ List of last

    names ‣ List of titles ‣ Generate lowercase NFDK format of the name 2. ‣ kovácsné-nagy ‣ dr ‣ benjamin ‣ kovácsné-nagy
 kovacsne-nagy ‣ benjamin
 benjamin ‣ dr
 dr
  33. NAME ‣ List of first names ‣ List of last

    names ‣ List of titles ‣ Validate the original name and the NFDK format name. ‣ Use the name lists for the validation. ‣ Classify the identified parts. 3. ‣ kovácsné-nagy
 kovacsne-nagy ‣ benjamin
 benjamin ‣ dr
 dr Not exists kovácsné-nagy not exists ! kovacsne-nagy not exists dr benjamin benjámin? TITLE FIRST NAME
  34. NAME ‣ List of first names ‣ List of last

    names ‣ List of titles ‣ Try to split the non- existing names. ‣ Try to remove the affixes. ‣ Classify the identified parts. 4. ‣ kovácsné-nagy
 kovacsne-nagy ‣ kovácsné
 kovacsne ‣ nagy
 nagy ‣ kovács / né
 kovacs / ne LAST NAME MARRIED LAST NAME
  35. NAME ‣ List of first names ‣ List of last

    names ‣ List of titles ‣ Leibenstein distance from the elements of the name list. ! ! ‣ Extend the name to the complex form. ‣ Fix the spelling based on population data. 5. ‣ benjamin
 benjamin (~5001)
 benjámin (~6610) benjamin ‣ benjámin Use the common one if other parts not contain accent. Use if other parts contain accent. FIRST NAME FIRST NAME Mark it, manual check required of the name
  36. NAME ‣ List of first names ‣ List of last

    names ‣ List of titles ‣ Classify the gender 6. ‣ kovács /né
 last name ‣ benjamin
 first name FEMALE MALE Mark it, manual check required of the name If multiple genders were determined for the different part of the name
  37. NAME ‣ List of first names ‣ List of last

    names ‣ List of titles ‣ Classify the gender 6. ‣ anikó
 first name ‣ benjamin
 first name FEMALE MALE Set the gender to Unknown If multiple genders were determined for the same part of the name
  38. NAME ‣ List of first names ‣ List of last

    names ‣ List of titles ‣ Standardize to the given order. ! ! ‣ Convert to titlecase. 7. ‣ kovács /né
 last name ‣ benjamin
 first name ‣ dr
 title ‣ nagy
 last name ‣ Kovács /né
 last name ‣ Benjamin
 first name ‣ Dr
 title ‣ Nagy
 last name Data Type Merge - Expecting ordering: [Title]* [Firstname] [Lastname]+ Mark ambiguous names, manual correction required
  39. ADDRESS ‣ Determine the city ‣ from city’s name ‣

    from district ‣ from city part ‣ from postal code ‣ Fill out the postal code if missing 1. ‣ Validate the street by regular expressions ‣ street name ‣ street name parts ‣ Use the closest version from the matching records ‣ postal code ‣ district ‣ neighbourhood ‣ city 2. ‣ Map of cities & postal codes. ‣ List of streets, street types
  40. ADDRESS ‣ Map of cities & postal codes. ‣ List

    of streets, street types ‣ Fix the Postal Code if the difference is under a given score.
 2 letters in Hungary ‣ Postal Code exists but doesn’t.
 Manual check required with the suggestion of the valid city name if we can find the given street information. ‣ City is exists but Postal Code is not.
 Check the street information. If street is not filled out then manual check is required but the default suggestion is to clear the field’s value. ‣ City and Postal Code not existing.
 Clear the values. ‣ Postal Code exists but City is not filled out.
 Use the suggestion of the valid city name. ‣ City exists but Postal Code is not filled out.
 Check the street information and fill out with the matching record. CITY != POSTAL CODE
  41. ADDRESS ‣ Map of cities & postal codes. ‣ List

    of streets, street types ‣ Fix postal code and other information ‣ Set match level
 street, city part, district, etc. 3. ‣ Standardise the address information. ‣ Determine house number: 
 regular expressions per country 4.
  42. PROCESS OF PREVENTING Preparations detect anomalies, augment, rearrange, standardize automatically*

    1 Define structure and transformations & operations on the data 2 Execution run the operation workflow 3 Post-processing verify the result manually* 4 ETL load final dataset into database 5
  43. MARKED RECORDS ‣ Marked records were generated after the execution

    and require manual decision. ‣ Use suggested data ‣ Revert modifications ‣ Clear marked field ‣ Manual correction
  44. AMBIGUOUS RECORDS Full name Farkas Ambrus János Input data Execution

    First name Farkas Ambrus János Last name Farkas Ambrus János First name Ambrus János Last name Farkas Manual correction
  45. LEARNING CONTRADICTION Full name Farkas Ambrus János 3 month ago

    First name Ambrus János Last name Farkas 1 month ago Full name Farkas Ambrus János First name János Last name Farkas Ambrus now ?
  46. ADDRESSES ‣ Validate the range of house numbers to determine

    the correct classification. ‣ 2nd building, 4th floor ‣ 2-4 building ‣ Check roman numbers in addresses. ‣ Building: I ‣ House Number: 1 ‣ All abbreviations should be removed.
  47. NAMES ‣ Gender classification failed then check it manually. ‣

    Validate collected titles and prefixes. ‣ Check all last names that were modified during the execution. ‣ Validate splittings if name contains hyphen. ‣ Not common first name validation.
  48. MARKED RECORDS ‣ Not known first or last name ‣

    Not existing address or mayor fixes on the given input. ‣ Not identified city name when postal code related to multiple cities. ‣ Names with many parts. ‣ Scam or test record possibility. ‣ etc.
  49. REMOVE DUPLICATES Records in this Records after deduplication Kis Tamás

    Orfű Kádár Noémi Győr Nagy Péter Debrecen Nagy Péter Debrecen Kis Tamás Orfű Kádár Noémi Győr Nagy Péter Debrecen
  50. REMOVE DUPLICATES ‣ Recommended deduplication algorithm for preventing: ‣ Name

    && Phone Number ‣ Name && Email ‣ Name && City && Postal Code && Full Street Name && Name Freq. for City <= 0.00000061555 Mark as duplicated record
  51. PROCESS OF PREVENTING Preparations detect anomalies, augment, rearrange, standardize automatically*

    1 Define structure and transformations & operations on the data 2 Execution run the operation workflow 3 Post-processing verify the result manually* 4 ETL load final dataset into database 5
  52. ETL ‣ Extract, Transform and Load (ETL) refers to a

    process in database usage and especially in data warehousing that involves: ‣ Extracting data from outside sources ‣ Transforming it to fit operation needs, which can include quality levels ‣ Loading it into the end target ‣ Tools: ‣ Pentaho Kettle, IBM Information Server, Oracle Warehouse Builder (OWB), mETL
  53. SUMMARY ‣ It’s hard to create your own data cleansing

    application. ‣ Necessary and usable data is not available publicly. ‣ Data cleansing professionals with experience are indispensable. ‣ Every country has there own exceptions. ‣ Business rules are different every time. ‣ Use existing services.
  54. Bence Faludi (@bfaludi) Senior Database Manager [email protected] Bence’s inbox is

    open 0-24 if you have any questions. Thanks for your attention! we are hiring! [email protected]