Data Cleaning on text to prepare for analysis and machine learning @ EuroSciPy 2015

August 28, 2015

Dirty data makes analysis and machine learning harder (or impossible!) and more prone to failure. I'll talk about the techniques we use at ModelInsight to fix badly encoded, inconsistent and hard-to-parse text data, which enable us to prepare real-world industrial data for research.

Topics will include text cleaning through normalisation and similarity measures, date parsing, data joining and visualisation. This talk is aimed at helping you make rapid progress on new projects.



  1. Data Cleaning on text to prepare for
    analysis and machine learning @
    EuroSciPy 2015
    Ian Ozsvald @IanOzsvald ModelInsight.io

  2. [email protected] @IanOzsvald
    EuroSciPy August 2015
    Who am I?

    Past speaker+teacher at PyDatas,
    EuroPythons, PyCons, PyConUKs

    Co-org of PyDataLondon

    O'Reilly Author

    ModelInsight.io for NLP+ML
    IP creation in London

    “I clean data” #sigh

    Please learn from my mistakes

  3. Unstructured data->Value

    Increasing rate of growth and meant for human consumption

    Hard to:



    Make machine-readable

    It is also very valuable...part of my consultancy - we're
    currently automating recruitment:

    Uses: search, visualisation, new ML features

    Most industrial problems messy, not “hard”, but time consuming!

    How can we make it easier for ourselves?

    “80% of our time is cleaning data” [The Internet]

  4. Extracting text from PDFs & Word

    http://textract.readthedocs.org/en/latest/ (Python)

    Apache Tika (Java - via jnius?)


    Formatting probably horrible

    No semantic interpretation (e.g. CVs)

    Keyword stuffing, images, out-of-order or multi-column text,
    tables #sigh

    “Content ExtRactor and MINEr” (CERMINE) for academic

    Commercial CV parsers (e.g. Sovren)

    Do you know of other tools that add structure?

  5. Extracting tables from PDFs

    ScraperWiki's https://pdftables.com/
    (builds on pdfminer)

    http://tabula.technology/ (Ruby/Java OS,
    seems to require user intervention)

    messytables (Python, lesser known, auto-
    guesses dtypes for CSV & HTML & PDFs)

    Maybe you can help with better solutions?

  6. Fixing badly encoded text


    HTML unescaping:

    chromium-compact-language-detector will guess human
    language from 80+ options (so you can choose your own
    decoding options) -> “Turkish I Problem (next)”

    chardet: CP1252 Windows vs UTF-8
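
The two repairs named on this slide can be sketched with the Python standard library alone. This is a minimal illustration rather than the speaker's code: the sample strings and the helper name `fix_mojibake` are made up, and it only handles the single case of UTF-8 bytes that were wrongly decoded as CP1252.

```python
import html

def fix_mojibake(text):
    """Undo UTF-8 bytes wrongly decoded as CP1252 (hypothetical helper)."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not the CP1252-vs-UTF-8 mix-up this helper targets: leave unchanged.
        return text

# HTML unescaping: entities back to plain characters.
unescaped = html.unescape("Marks &amp; Spencer &lt;UK&gt;")

# "caf\u00c3\u00a9" is what "café" looks like after the wrong decode.
repaired = fix_mojibake("caf\u00c3\u00a9")

# Text that was already fine passes through untouched.
untouched = fix_mojibake("caf\u00e9")
```

When the encoding of the raw bytes is unknown, a detector such as chardet would normally run first; the round trip above only helps once the damage has already been done.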

  7. The “Turkish I Problem”
    Irish: dotted and dotless lowercase i mean the same thing
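
A short sketch of why this bites in code: Python's `str.lower()` and `str.upper()` apply the default Unicode case mappings and ignore language, so Turkish text does not survive a case round trip. The variable names are illustrative only.

```python
dotted_capital = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
dotless_small = "\u0131"   # LATIN SMALL LETTER DOTLESS I

# Lowercasing the dotted capital yields TWO code points: "i" + combining dot.
lowered = dotted_capital.lower()

# Uppercasing the dotless i gives plain "I", which lowercases to dotted "i",
# so the original letter is lost on the way back.
round_trip = dotless_small.upper().lower()

print(len(lowered), round_trip == dotless_small)  # prints "2 False"
```

Knowing the language of the text (for example from a language detector) is what lets you decide whether these mappings are acceptable for a given column.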

  8. Interpreting dtypes

    Use pandas to get text data (e.g. from JSON/CSV)

    Categories (e.g. “male”/“female”) are easily spotted by eye

    [“33cm”, “22inches”, ...] could be easily converted

    Date parsing:

    The default is for US-style (MMDD), not Euro-style (DDMM)

    pd.read_csv(parse_dates=[cols], dayfirst=False)

    Labix dateutil, delorean, arrow, parsedatetime (NLP)

    Could you write a module to suggest possible conversions
    on dataframe for the user (and notify if ambiguities are
    present e.g. 1/1 to 12/12...MM/DD or DD/MM)?
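
One possible starting point for the ambiguity check asked about above, using only the standard library. It assumes dates arrive as `NN/NN/YYYY` strings; the function names and the sample values are hypothetical.

```python
from datetime import datetime

def possible_dates(text):
    """Return the distinct dates `text` could mean as DD/MM/YYYY or MM/DD/YYYY."""
    found = set()
    for fmt in ("%d/%m/%Y", "%m/%d/%Y"):
        try:
            found.add(datetime.strptime(text, fmt).date())
        except ValueError:
            pass  # this reading is impossible, e.g. month 28
    return sorted(found)

def ambiguous_rows(values):
    """List the values that parse to two different dates."""
    return [v for v in values if len(possible_dates(v)) == 2]

sample = ["01/01/2015", "03/04/2015", "28/08/2015", "12/31/2015"]
flagged = ambiguous_rows(sample)
print(flagged)  # prints "['03/04/2015']"
```

If a column contains even one value that only parses day-first (such as `28/08/2015`), that is decent evidence for the whole column, which is the kind of suggestion such a module could surface to the user.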

  9. Automate feature extraction?

    Can we extract features from e.g. Id columns?

    We could identify categorical labels and
    suggest Boolean column equivalents

    We could remove some of the leg-work...you
    avoid missing possibilities, junior data
    scientists get “free help”

    What tools do you know of and use?
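
A rough sketch of the "suggest Boolean column equivalents" idea, assuming rows are plain dictionaries. The function name, the `max_labels` threshold and the `col_is_label` naming scheme are all invented for illustration; no existing tool is being described.

```python
def suggest_boolean_columns(rows, max_labels=5):
    """Suggest one Boolean column per label for low-cardinality text columns."""
    suggestions = {}
    columns = rows[0].keys() if rows else []
    for col in columns:
        labels = {row[col] for row in rows}
        all_text = all(isinstance(label, str) for label in labels)
        # A column where every value is unique looks like an Id, not a category.
        if all_text and 1 < len(labels) <= max_labels and len(labels) < len(rows):
            suggestions[col] = sorted(f"{col}_is_{label}" for label in labels)
    return suggestions

rows = [
    {"id": "a1", "gender": "male", "age": 33},
    {"id": "b2", "gender": "female", "age": 41},
    {"id": "c3", "gender": "male", "age": 27},
]
suggested = suggest_boolean_columns(rows)
print(suggested)  # prints "{'gender': ['gender_is_female', 'gender_is_male']}"
```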

  10. Automated validation?

    Use 'known' to validate 'unknown' DF?

  11. Automated validation?

  12. Merging two data sources

    pd.merge(df1, df2) # exact keys, SQL-

    fuzzywuzzy/metaphone for approximate
    string matching

    DataMade's dedupe.readthedocs.org to
    identify duplicates (or OpenRefine)
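
To show the shape of an approximate join without extra installs, the sketch below swaps in the standard library's `difflib` for fuzzywuzzy; it is a stand-in, not the fuzzywuzzy or dedupe API. The helper `fuzzy_join`, the `cutoff` value and the company names are assumptions.

```python
import difflib

def fuzzy_join(left_names, right_names, cutoff=0.7):
    """Pair each left name with its closest right name, or None if none is close."""
    lookup = {name.lower(): name for name in right_names}
    pairs = {}
    for name in left_names:
        hits = difflib.get_close_matches(name.lower(), list(lookup), n=1, cutoff=cutoff)
        pairs[name] = lookup[hits[0]] if hits else None
    return pairs

matches = fuzzy_join(
    ["Accenture", "Barclays PLC", "Unknown Co"],
    ["accenture ltd", "Barclays", "BigCo"],
)
```

Here `matches` pairs "Accenture" with "accenture ltd" and "Barclays PLC" with "Barclays", and leaves "Unknown Co" unmatched. The resulting key pairs could then feed an exact `pd.merge`; the cutoff needs tuning and eyeballing on real data.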

  13. Manual Normalisation

    Eyeball the problem, solve by hand

    Lots of unit-tests!

    lower() # “Accenture”->“accenture”

    strip() # “ this and ”->“this and”

    Beware &nbsp; (approx 20 of these!)

    replace(“ Ltd”,“”) # “BigCo Ltd”->“BigCo”

    unidecode # “áéîöũ”->“aeiou”

    normalise unicode (e.g. >50 dash variants!)

    NLTK stemming & WordNet ISA relt.
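
The steps on this slide could be chained roughly as follows. This is a standard-library sketch: `unicodedata` is used as a rough stand-in for unidecode (it only strips combining accents), and the function name and the suffix list are assumptions.

```python
import re
import unicodedata

def normalise_name(text, suffixes=(" ltd", " plc")):
    """Lowercase, trim, and fold Unicode variants in a company-style name."""
    # NFKC folds compatibility characters, e.g. non-breaking space -> space.
    text = unicodedata.normalize("NFKC", text)
    # Map every dash-punctuation code point (category Pd) to an ASCII hyphen.
    text = "".join("-" if unicodedata.category(ch) == "Pd" else ch for ch in text)
    # Strip accents: decompose, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse runs of whitespace, trim the ends, lowercase.
    text = re.sub(r"\s+", " ", text).strip().lower()
    for suffix in suffixes:
        if text.endswith(suffix):
            text = text[: -len(suffix)]
    return text

examples = [
    normalise_name("\u00a0Accenture "),
    normalise_name("BigCo Ltd"),
    normalise_name("\u00e1\u00e9\u00ee\u00f6\u0169"),
    normalise_name("Rolls\u2013Royce"),
]
print(examples)  # prints "['accenture', 'bigco', 'aeiou', 'rolls-royce']"
```

Each example doubles as a unit test candidate, which is in the spirit of "Lots of unit-tests!" above.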

  14. Representations of Null

    Just have 1 (not 4!)

    Consider Engarde & Hypothesis

    Write a schema-checker to check all of
    this on the source data & in your
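
A minimal sketch of collapsing several null spellings into one. The token list in `NULL_TOKENS` is an assumption for illustration, not taken from the slide, and would need to be adapted to the source data.

```python
# Assumed spellings of "missing"; compared after strip() and lower().
NULL_TOKENS = {"", "na", "n/a", "nan", "null", "none", "-"}

def to_single_null(value):
    """Map every known representation of 'missing' to None."""
    if value is None:
        return None
    # NaN is the only float that is not equal to itself.
    if isinstance(value, float) and value != value:
        return None
    if isinstance(value, str) and value.strip().lower() in NULL_TOKENS:
        return None
    return value

raw = ["N/A", "", None, float("nan"), "NULL", "0", 0]
cleaned = [to_single_null(v) for v in raw]
print(cleaned)  # prints "[None, None, None, None, None, '0', 0]"
```

Legitimate values such as `"0"` and `0` are deliberately left alone, which is the kind of rule a schema check should pin down.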

  15. Automated Normalisation?

    My annotate.io

    Why not make the machine do this for
    No regular
    No fiddling!

  16. Visualising new data sources

    GlueViz (numeric)



    Do you have good tools?

  17. Augmenting Data

    Alchemy - sentiment and entities

    DBPedia - entities
    – http://dbpedia.org/data/Barclays.json



    Use ML... but don't forget regexes and other
    simple techniques
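
As an example of a simple technique, a single regular expression is enough to convert length strings like the “33cm” and “22inches” values mentioned earlier in the talk. The unit table and the helper name `to_cm` are assumptions for this sketch.

```python
import re

# Assumed unit spellings, each mapped to centimetres per unit.
CM_PER_UNIT = {"cm": 1.0, "inch": 2.54, "inches": 2.54}

# A number (optionally with decimals) followed by a unit word.
LENGTH = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-z]+)\s*$", re.IGNORECASE)

def to_cm(text):
    """Return the length in centimetres, or None if the text is not understood."""
    match = LENGTH.match(text)
    if not match or match.group(2).lower() not in CM_PER_UNIT:
        return None
    return float(match.group(1)) * CM_PER_UNIT[match.group(2).lower()]

lengths = [to_cm(v) for v in ["33cm", "22inches", "tall"]]
```

Returning `None` for anything unrecognised keeps the failures visible, so they can be counted and inspected instead of silently guessed.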

  18. Starting ML on Text

    SKLearn's CountVectorizer for binary

    BernoulliNaiveBayes (then LogReg)

    Favour better data & features

    Diagnose failures at each stage

    Avoid complex models until you need
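
To make the first step concrete, here is a standard-library sketch of the binary term-presence representation that scikit-learn's `CountVectorizer(binary=True)` produces. It is not scikit-learn code; the tokenisation rule and the sample documents are assumptions chosen to keep the example small.

```python
import re

TOKEN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters

def binary_bag_of_words(docs):
    """Return (vocabulary, matrix) where matrix[i][j] is 1 if term j is in doc i."""
    tokenised = [set(TOKEN.findall(doc.lower())) for doc in docs]
    vocabulary = sorted(set().union(*tokenised))
    matrix = [[1 if term in tokens else 0 for term in vocabulary]
              for tokens in tokenised]
    return vocabulary, matrix

vocabulary, matrix = binary_bag_of_words(
    ["Python Python developer", "Java developer"]
)
print(vocabulary)  # prints "['developer', 'java', 'python']"
print(matrix)      # prints "[[1, 0, 1], [1, 1, 0]]"
```

Rows like these are the kind of 0/1 features a Bernoulli Naive Bayes model expects, and being able to read the matrix by eye helps when diagnosing failures at each stage.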

  19. Give me feedback on annotate.io

    Give me your dirty-data horror stories, I want
    to fix some of these problems


    PyDataLondon monthly meetup

    Do you have data science deployment stories
    for my keynote at BudapestBIForum? What's
    “hardest” in (data) science for your team?
