Deep Dive Data Cleaning

Advanced data cleaning concepts that every data scientist MUST know to simplify the data cleaning process.

VICTOR OMONDI

May 07, 2022

Transcript

  1. 3RD ANNUAL PYCON SUMMIT KE: Biggest Gathering For People & Organisations Interested In The Python Programming Language & Its Ecosystem

    DEEP DIVE DATA CLEANING
    Victor Omondi
  2. About Me

    Victor Omondi (VICK)
    • Data Analyst @ Maisha Meds
    • Organizer, PyConKE 2022
  3. A Brief About Maisha Meds

    • https://maishameds.org/app/
    • Our mission is to improve access to essential healthcare for patients by motivating providers to deliver affordable and quality care, enabled by technology.
    • We’re Hiring: https://maishameds.org
  4. Snapshot of my work

    01 Access Data: CSV, DB, web scraping
    02 Explore & Process Data: Transform and standardize the data in a required format
    03 Extract Insights: Analyze data
    04 Report Insights: Create visualization, dashboard or ppt
  5. But Why Clean the Data?

    01 Access Data, 02 Explore & Process Data, 03 Extract Insights, 04 Report Insights
    • Human Error
    • Technical Error
  6. Table of Contents
  7. Common Data Problems

    • Inconsistent column names
    • Missing data
    • Outliers/Anomalies
    • Duplicate rows
    • Untidy
    • Inconsistent column types
    • SPELLING MISTAKES in TEXT
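
Several of the problems listed on this slide can be surfaced with a quick profile of the data. The following is a minimal sketch using only the Python standard library; the sample rows, column names, and the helpers `normalize_column` and `profile` are hypothetical and not taken from the talk.

```python
import re
from collections import Counter

def normalize_column(name: str) -> str:
    """Lower-case a column name and join its words with underscores."""
    words = re.findall(r"[A-Za-z0-9]+", name)
    return "_".join(word.lower() for word in words)

def profile(rows: list[dict]) -> dict:
    """Count missing values per column and fully duplicated rows."""
    missing = Counter()
    seen = Counter()
    for row in rows:
        for column, value in row.items():
            # Treat None and blank strings as missing.
            if value is None or str(value).strip() == "":
                missing[column] += 1
        seen[tuple(sorted(row.items()))] += 1
    duplicates = sum(count - 1 for count in seen.values())
    return {"missing": dict(missing), "duplicate_rows": duplicates}

# Hypothetical sample with inconsistent headers, a duplicate, and gaps.
raw = [
    {"Patient Name": "Amina", " Town": "Nairobi", "AGE": "34"},
    {"Patient Name": "Amina", " Town": "Nairobi", "AGE": "34"},
    {"Patient Name": "Brian", " Town": "", "AGE": None},
]
rows = [{normalize_column(k): v for k, v in row.items()} for row in raw]
report = profile(rows)
```

Here `report` records one missing value in each of `town` and `age`, plus one duplicated row. Outliers, column types, and spelling mistakes need checks of their own.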
  8. [No text on this slide]
  9. Most Used Data Cleaning methods

    • Dropping
    • Imputation
    • Replacement
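
The three methods named on this slide can be illustrated on a tiny in-memory dataset. This is a sketch under stated assumptions: the records, the choice of the median as the fill value, and the `canonical` lookup table are all hypothetical, and the code uses plain Python rather than any particular data library.

```python
from statistics import median

# Hypothetical records with gaps and an inconsistent town label.
records = [
    {"town": "Nairobi", "price": 120.0},
    {"town": "NBO", "price": None},
    {"town": None, "price": 80.0},
    {"town": "Kisumu", "price": 100.0},
]

# Dropping: discard rows where a required field is missing.
dropped = [r for r in records if r["town"] is not None]

# Imputation: fill missing prices with the median of the observed prices.
observed = [r["price"] for r in dropped if r["price"] is not None]
fill_value = median(observed)
imputed = [
    {**r, "price": fill_value if r["price"] is None else r["price"]}
    for r in dropped
]

# Replacement: map known variants onto one canonical spelling.
canonical = {"NBO": "Nairobi"}  # hypothetical lookup table
cleaned = [{**r, "town": canonical.get(r["town"], r["town"])} for r in imputed]
```

Each step builds a new list instead of mutating `records`, so the raw data stays available for comparison after cleaning.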
  10. An example of dirty data
  11. Dirty data is costly and difficult

    A lot of time, money and expertise is spent on cleaning it, and the cleaning itself is often done wrong.
  12. Ways of evaluating data quality

    • Valid
    • Accurate
    • Complete
    • Consistent
    • Uniform
    • Repeatable
  13. Dirty Data Adventure
  14. Dealing with Duplicates?
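
One common way to deal with duplicates is to normalize the fields that identify a record and keep only the first row per key. The sketch below assumes that approach; the function `dedupe`, its `key_fields` parameter, and the sample rows are hypothetical and do not reproduce the code shown in the talk.

```python
def dedupe(rows, key_fields):
    """Keep the first row for each normalized key; return (unique, duplicates)."""
    seen = set()
    unique, duplicates = [], []
    for row in rows:
        # Ignore case and surrounding whitespace when comparing key fields.
        key = tuple(str(row[f]).strip().casefold() for f in key_fields)
        if key in seen:
            duplicates.append(row)
        else:
            seen.add(key)
            unique.append(row)
    return unique, duplicates

rows = [
    {"name": "Amina Otieno", "town": "Nairobi"},
    {"name": "amina otieno ", "town": "NAIROBI"},
    {"name": "Brian Kiptoo", "town": "Eldoret"},
]
unique, duplicates = dedupe(rows, key_fields=("name", "town"))
```

Returning the rejected rows alongside the kept ones makes it possible to review what was removed before discarding it.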
  15. [No text on this slide]
  16. String Matching

    • JellyFish: https://github.com/jamesturk/jellyfish
    • Thefuzz (prev fuzzywuzzy): https://github.com/seatgeek/thefuzz
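
String-matching libraries of this kind are built around edit-distance style measures. To show the idea without depending on either library, here is a pure-Python sketch of the Levenshtein distance and a simple similarity score derived from it. The function names and the scaling to a 0..1 score are assumptions made for illustration, not either library's API.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]

def similarity(a: str, b: str) -> float:
    """Scale the distance to a 0..1 score, where 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest
```

For example, `levenshtein("Nairobi", "Narobi")` is 1, because a single deletion separates the two spellings.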
  17. Where replacement is impossible…

    • Nairobi
    • Nairobae
    • Nairob
    • Narobi
    • Kanairobi
    • Kanairo
    • Naiobi
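
When the variants are too many to list in a replacement table, fuzzy matching against a short list of accepted spellings is one option. The sketch below uses `difflib.get_close_matches` from the standard library as a stand-in for a dedicated string-matching package. The reference list `CANONICAL_TOWNS`, the helper `standardize`, and the 0.6 cutoff are assumptions chosen for this example.

```python
from difflib import get_close_matches

CANONICAL_TOWNS = ["Nairobi", "Mombasa", "Kisumu"]  # hypothetical reference list

def standardize(value, choices=CANONICAL_TOWNS, cutoff=0.6):
    """Return the closest canonical spelling, or None when nothing is close enough."""
    lookup = {choice.casefold(): choice for choice in choices}
    matches = get_close_matches(
        value.strip().casefold(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None

# The spellings listed on the slide.
variants = ["Nairobi", "Nairobae", "Nairob", "Narobi", "Kanairobi", "Kanairo", "Naiobi"]
standardized = {v: standardize(v) for v in variants}
```

With this cutoff every spelling from the slide maps to "Nairobi", while an unrelated value such as "Eldoret" returns `None` and can be routed to manual review. The cutoff is a trade-off: a lower value catches more misspellings but raises the risk of merging distinct places.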
  18. Record linkage…?

    • Record linkage: https://github.com/J535D165/recordlinkage
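
Record linkage pairs up rows from different sources that describe the same entity even though they share no common identifier. The following sketch shows the basic shape of the technique with the standard library only: block on one field to limit comparisons, then score the remaining pairs. The datasets, the `link` function, and the 0.85 threshold are hypothetical, and this is not the API of the toolkit linked on the slide.

```python
from difflib import SequenceMatcher
from itertools import product

def name_score(a: str, b: str) -> float:
    """Case-insensitive similarity between two names, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()

def link(left, right, threshold=0.85):
    """Pair records that share a town (blocking) and have similar names."""
    links = []
    for rec_a, rec_b in product(left, right):
        if rec_a["town"] != rec_b["town"]:
            continue  # blocking: skip pairs that cannot match
        score = name_score(rec_a["name"], rec_b["name"])
        if score >= threshold:
            links.append((rec_a["id"], rec_b["id"], round(score, 2)))
    return links

# Two hypothetical sources describing overlapping people.
clinic = [
    {"id": "c1", "name": "Amina Otieno", "town": "Nairobi"},
    {"id": "c2", "name": "Brian Kiptoo", "town": "Eldoret"},
]
pharmacy = [
    {"id": "p1", "name": "Amina Otienoh", "town": "Nairobi"},
    {"id": "p2", "name": "Brian Kiptoo", "town": "Nairobi"},
]
links = link(clinic, pharmacy)
```

Only `c1` and `p1` are linked here. The two "Brian Kiptoo" rows are never compared because their towns differ, which shows both the speed benefit and the risk of blocking on a field that may itself be dirty.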
  19. Privacy cleaning

    • Scrubadub: https://github.com/LeapBeyond/scrubadub
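
Privacy cleaning replaces personal identifiers in free text with placeholders. As a rough illustration of the idea, the sketch below masks email addresses and phone numbers with plain regular expressions. It is a swapped-in technique and not the linked library; the patterns, the placeholder strings, and the sample note are assumptions, and the patterns are deliberately simple.

```python
import re

# Simplified patterns for illustration; real-world data needs stricter rules.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")

def scrub(text: str) -> str:
    """Replace email addresses and phone numbers with placeholders."""
    text = EMAIL.sub("{{EMAIL}}", text)
    return PHONE.sub("{{PHONE}}", text)

note = "Call Amina on +254 700 000 000 or mail amina@example.org today."
clean = scrub(note)
```

The name "Amina" survives this pass. Detecting names reliably needs more than regular expressions, which is a reason to reach for a dedicated tool.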
  20. Improved datetimes

    • Arrow: https://github.com/arrow-py/arrow
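
Date columns often arrive in several formats at once, and cleaning them usually means converting everything to a single representation. The sketch below does this with the standard library `datetime` module, not Arrow. The list of accepted formats, the day-first reading of slash-separated dates, and the helper `to_iso` are assumptions for this example, and the month-name formats assume an English locale.

```python
from datetime import datetime

# Assumed input formats, tried in order; slash dates are read as day/month/year.
FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %b %Y", "%B %d, %Y")

def to_iso(value: str):
    """Try each known format; return an ISO 8601 date string or None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

raw_dates = ["2022-05-07", "07/05/2022", "7 May 2022", "May 07, 2022", "not a date"]
iso_dates = [to_iso(d) for d in raw_dates]
```

Values that match none of the formats come back as `None`, so they can be counted and inspected instead of being silently coerced.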
  21. THANK YOU