

ianozsvald
August 28, 2015

Data Cleaning on text to prepare for analysis and machine learning @ EuroSciPy 2015

Dirty data makes analysis and machine learning harder (or impossible!) and more prone to failure. I'll talk about the techniques we use at ModelInsight to fix badly encoded, inconsistent and hard-to-parse text data, enabling us to prepare real-world industrial data for research.

Topics will include text cleaning through normalisation and similarity measures, date parsing, data joining and visualisation. This talk is aimed at helping you make rapid progress on new projects.

Conference link:
https://www.euroscipy.org/2015/schedule/presentation/4/
Write-up:
http://ianozsvald.com/2015/08/28/euroscipy-2015-and-data-cleaning-on-text-for-ml-talk/

Transcript

  1. Data Cleaning on text to prepare for analysis and machine learning @ EuroSciPy 2015
     Ian Ozsvald @IanOzsvald ModelInsight.io
  2. [email protected] @IanOzsvald EuroSciPy August 2015
     Who am I?
     • Past speaker+teacher at PyDatas, EuroPythons, PyCons, PyConUKs
     • Co-org of PyDataLondon
     • O'Reilly Author
     • ModelInsight.io for NLP+ML IP creation in London
     • "I clean data" #sigh
     • Please learn from my mistakes
  3. Unstructured data -> Value
     • Increasing rate of growth, and meant for human consumption
     • Hard to: extract, parse, make machine-readable
     • It is also very valuable... part of my consultancy; we're currently automating recruitment
     • Uses: search, visualisation, new ML features
     • Most industrial problems are messy, not "hard", but time-consuming!
     • How can we make it easier for ourselves?
     • "80% of our time is cleaning data" [The Internet]
  4. Extracting text from PDFs & Word
     • http://textract.readthedocs.org/en/latest/ (Python)
     • Apache Tika (Java, via jnius?)
     • Difficulties:
       • Formatting probably horrible
       • No semantic interpretation (e.g. CVs)
       • Keyword stuffing, images, out-of-order or multi-column text, tables #sigh
     • "Content ExtRactor and MINEr" (CERMINE) for academic papers
     • Commercial CV parsers (e.g. Sovren)
     • Do you know of other tools that add structure?
  5. Extracting tables from PDFs
     • ScraperWiki's https://pdftables.com/ (builds on pdfminer)
     • http://tabula.technology/ (Ruby/Java OS, seems to require user intervention)
     • messytables (Python, lesser known, auto-guesses dtypes for CSV & HTML & PDFs)
     • Maybe you can help with better solutions?
  6. Fixing badly encoded text
     • http://ftfy.readthedocs.org/en/latest/
     • HTML unescaping
     • chromium-compact-language-detector will guess the human language from 80+ options (so you can choose your own decoding options) -> "Turkish I problem" (next)
     • chardet: CP1252 Windows vs UTF-8
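The two fixes above can be sketched with the standard library alone (ftfy automates this kind of repair; the strings here are made-up examples). The classic CP1252-vs-UTF-8 mojibake is reversed by re-encoding with the wrong codec and decoding with the right one:

```python
import html

# Classic mojibake: the UTF-8 bytes of "Café" were wrongly decoded as CP1252.
broken = "Caf\u00c3\u00a9"  # renders as "CafÃ©"
fixed = broken.encode("cp1252").decode("utf-8")  # undo the wrong decode

# Leftover HTML entities from scraped pages:
unescaped = html.unescape("Fish &amp; Chips&nbsp;Ltd")

print(fixed)      # Café
print(unescaped)  # Fish & Chips Ltd (with a real NBSP before "Ltd")
```

This round-trip only works when the original bytes survive the bad decode, which CP1252/latin-ish codecs mostly guarantee; ftfy handles the messier cases heuristically.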
  7. Interpreting dtypes
     • Use pandas to get text data (e.g. from JSON/CSV)
     • Categories (e.g. "male"/"female") are easily spotted by eye
     • ["33cm", "22inches", ...] could be easily converted
     • Date parsing:
       • The default is US-style (MM/DD), not Euro-style (DD/MM)
       • pd.read_csv(..., parse_dates=[cols], dayfirst=False)
       • Labix dateutil, delorean, arrow, parsedatetime (NLP)
     • Could you write a module to suggest possible conversions on a DataFrame for the user (and notify if ambiguities are present, e.g. 1/1 to 12/12 could be MM/DD or DD/MM)?
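The dayfirst pitfall is easy to demonstrate (a sketch with made-up CSV data; assumes pandas is installed):

```python
import io
import pandas as pd

csv = io.StringIO("when,value\n01/02/2015,10\n")

us = pd.read_csv(csv, parse_dates=["when"])                   # default: MM/DD
csv.seek(0)
euro = pd.read_csv(csv, parse_dates=["when"], dayfirst=True)  # Euro: DD/MM

print(us["when"][0].month, euro["when"][0].month)  # 1 2
```

The same string parses as 2nd of January or 1st of February depending on the flag, which is exactly the ambiguity the slide warns about.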
  8. Automate feature extraction?
     • Can we extract features from e.g. Id columns (products/rooms/categories)?
     • We could identify categorical labels and suggest Boolean column equivalents
     • We could remove some of the leg-work... you avoid missing possibilities, junior data scientists get "free help"
     • What tools do you know of and use?
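A minimal sketch of the "suggest categorical columns" idea. The helper and its heuristic are hypothetical (not a tool from the talk): it flags low-cardinality text columns as candidates for Boolean dummy columns, while skipping ID-like columns where every value is unique.

```python
# Hypothetical helper: flag low-cardinality columns as categorical candidates.
def suggest_categoricals(rows, max_levels=5):
    """rows: list of dicts with identical keys -> {column: sorted levels}."""
    suggestions = {}
    for col in rows[0]:
        values = {row[col] for row in rows}
        # Few distinct values, and fewer than rows (i.e. not an ID column).
        if len(values) <= max_levels and len(values) < len(rows):
            suggestions[col] = sorted(values)
    return suggestions

rows = [
    {"gender": "male", "room_id": "R1"},
    {"gender": "female", "room_id": "R2"},
    {"gender": "male", "room_id": "R3"},
]
print(suggest_categoricals(rows))  # {'gender': ['female', 'male']}
```

Each suggested level could then become a Boolean column (e.g. `is_male`), which is the "free help" the slide describes.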
  9. Merging two data sources
     • pd.merge(df1, df2)  # exact keys, SQL-like
     • fuzzywuzzy/metaphone for approximate string matching
     • DataMade's dedupe.readthedocs.org to identify duplicates (or OpenRefine)
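Both halves of the slide in one sketch: an exact-key join with pandas, and approximate matching scored with the standard library's difflib as a stand-in for fuzzywuzzy (the data is invented):

```python
import difflib
import pandas as pd

left = pd.DataFrame({"company": ["bigco", "accenture"], "revenue": [1, 2]})
right = pd.DataFrame({"company": ["bigco", "accenture"], "employees": [10, 20]})

# Exact keys, SQL-like inner join on the shared column:
merged = pd.merge(left, right, on="company")

# Approximate matching, fuzzywuzzy-style, sketched with stdlib difflib:
score = difflib.SequenceMatcher(None, "bigco ltd", "bigco").ratio()

print(len(merged), round(score, 2))  # 2 0.71
```

In practice you normalise first (next slide), join exactly where you can, and fall back to a similarity threshold for the rest.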
  10. Manual Normalisation
     • Eyeball the problem, solve by hand
     • Lots of unit-tests!
     • lower()  # "Accenture" -> "accenture"
     • strip()  # " this and " -> "this and"
     • Beware &nbsp; (approx 20 variants of these!)
     • replace(<pattern>, "")  # "BigCo Ltd" -> "BigCo"
     • unidecode  # "áéîöũ" -> "aeiou"
     • Normalise unicode (e.g. >50 dash variants!)
     • NLTK stemming & WordNet IS-A relations
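The bullets above chain naturally into one function. This is a standard-library sketch (the NFKD accent-stripping step only approximates what the third-party unidecode package does, and the dash range and `ltd` pattern are illustrative choices):

```python
import re
import unicodedata

def normalise(text):
    """One possible manual-normalisation pipeline, step by step."""
    text = text.replace("\u00a0", " ")            # NBSP -> plain space
    text = text.strip().lower()                   # " Accenture " -> "accenture"
    text = re.sub(r"[\u2010-\u2015]", "-", text)  # fold dash variants to "-"
    # Poor-man's unidecode: NFKD-decompose, then drop combining accents.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"\s+ltd$", "", text)           # "bigco ltd" -> "bigco"
    return text

print(normalise("\u00a0BigCo\u00a0Ltd "))  # bigco
print(normalise("áéîö"))                   # aeio
```

Each step is a one-liner, which is exactly why the slide insists on lots of unit tests: the order of the steps matters, and every new dirty input tends to add another one.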
  11. Representations of Null
     • Just have 1 representation (not 4!)
     • Consider Engarde & Hypothesis
     • Write a schema-checker to check all of this on the source data & in your DataFrames!
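Collapsing several ad-hoc null spellings into one real NaN can happen at load time (a sketch with invented CSV data; assumes pandas is installed):

```python
import io
import pandas as pd

# Four spellings of "missing" in one column: N/A, empty, -, null.
raw = io.StringIO("name,age\nalice,31\nbob,N/A\ncarol,\ndan,-\neve,null\n")

# na_values maps the extra sentinels onto NaN; "" is NaN by default.
df = pd.read_csv(raw, na_values=["N/A", "-", "null"])

print(int(df["age"].isnull().sum()))  # 4
```

After this, `df["age"].isnull()` is the single source of truth, which is what a schema-checker can then assert against.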
  12. Automated Normalisation?
     • My annotate.io
     • Why not make the machine do this for us? No regular expressions! No fiddling!
  13. Visualising new data sources
     • GlueViz (numeric)
     • Seaborn
     • setosa.io/csv-fingerprint/
     • Do you have good tools?
  14. Augmenting Data
     • Alchemy: sentiment and entities
     • DBPedia: entities, e.g. http://dbpedia.org/data/Barclays.json
     • NLTK
     • spaCy
     • Use ML... but don't forget regexes and other simple techniques
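As a trivial illustration of the "simple techniques" point: before reaching for NLTK or spaCy, a regex over capitalised tokens already yields rough entity candidates (the sentence is made up, and this heuristic will of course also catch sentence-initial words):

```python
import re

text = "Barclays and HSBC announced results in London yesterday."

# Capitalised tokens as crude named-entity candidates.
candidates = re.findall(r"\b[A-Z][A-Za-z]+\b", text)

print(candidates)  # ['Barclays', 'HSBC', 'London']
```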
  15. Starting ML on Text
     • scikit-learn's CountVectorizer for binary features
     • BernoulliNB (then LogisticRegression)
     • Favour better data & features
     • Diagnose failures at each stage
     • Avoid complex models until you need them
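The suggested starting point fits in a few lines (a sketch with a made-up four-document corpus; assumes scikit-learn is installed):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Tiny invented corpus; real labels would come from your data.
texts = ["great product, loved it", "terrible, broke quickly",
         "loved the build quality", "terrible support, broke"]
labels = [1, 0, 1, 0]

# binary=True gives 0/1 presence features, which BernoulliNB expects.
model = make_pipeline(CountVectorizer(binary=True), BernoulliNB())
model.fit(texts, labels)

pred = model.predict(["loved it, great quality"])[0]
print(pred)  # 1
```

Swapping `BernoulliNB()` for `LogisticRegression()` in the pipeline is the slide's suggested next step once the baseline is diagnosed.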
  16. Closing...
     • Give me feedback on annotate.io
     • Give me your dirty-data horror stories; I want to fix some of these problems
     • http://ianozsvald.com/
     • PyDataLondon monthly meetup
     • Do you have data science deployment stories for my keynote at BudapestBIForum? What's "hardest" in (data) science for your team?