Data Cleaning on text to prepare for analysis and machine learning @ EuroSciPy 2015

Ian Ozsvald @IanOzsvald ModelInsight.io

[email protected] @IanOzsvald EuroSciPy August 2015
Who am I?
• Past speaker+teacher at PyDatas, EuroPythons, PyCons, PyConUKs
• Co-org of PyDataLondon
• O'Reilly Author
• ModelInsight.io for NLP+ML IP creation in London
• “I clean data” #sigh
• Please learn from my mistakes

Unstructured data->Value
• Increasing rate of growth and meant for human consumption
• Hard to:
  – Extract
  – Parse
  – Make machine-readable
• It is also very valuable...part of my consultancy - we're currently automating recruitment
• Uses: search, visualisation, new ML features
• Most industrial problems are messy, not “hard”, but time-consuming!
• How can we make it easier for ourselves?
• “80% of our time is cleaning data” [The Internet]

Extracting text from PDFs & Word
• http://textract.readthedocs.org/en/latest/ (Python)
• Apache Tika (Java - via jnius?)
• Difficulties
  – Formatting probably horrible
  – No semantic interpretation (e.g. CVs)
  – Keyword stuffing, images, out-of-order or multi-column text, tables #sigh
• “Content ExtRactor and MINEr” (CERMINE) for academic papers
• Commercial CV parsers (e.g. Sovren)
• Do you know of other tools that add structure?

Extracting tables from PDFs
• ScraperWiki's https://pdftables.com/ (builds on pdfminer)
• http://tabula.technology/ (Ruby/Java, open source, seems to require user intervention)
• messytables (Python, lesser known, auto-guesses dtypes for CSV & HTML & PDFs)
• Maybe you can help with better solutions?

Fixing badly encoded text
• http://ftfy.readthedocs.org/en/latest/
• HTML unescaping
• chromium-compact-language-detector will guess human language from 80+ options (so you can choose your own decoding options) -> “Turkish I Problem” (next)
• chardet: CP1252 Windows vs UTF-8
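The CP1252-vs-UTF-8 mix-up can be reproduced and undone with the standard library alone. This is a minimal sketch, not how ftfy or chardet work internally: `fix_mojibake` is a hypothetical helper, and it assumes you already know (or have guessed) which two encodings were confused.

```python
import html


def fix_mojibake(text, wrong="cp1252", right="utf-8"):
    """Undo text that was encoded as `right` but decoded as `wrong`."""
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not mangled in this particular way; leave it alone.
        return text


# Simulate the damage: UTF-8 bytes read as if they were Windows CP1252.
broken = "café".encode("utf-8").decode("cp1252")
repaired = fix_mojibake(broken)

# HTML entities are a separate problem with a standard-library fix.
unescaped = html.unescape("Marks &amp; Spencer")
```

Text that was never mangled fails the round trip and is returned unchanged, so the helper is safe to apply to a whole column, but only for the one encoding pair it is told about.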

The “Turkish I Problem”
Irish: dotted and dotless lowercase i mean the same thing (in Turkish they are two distinct letters)
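A short sketch of why this bites in Python: `str.upper()` and `str.lower()` ignore locale and apply the default Unicode case mappings, so Turkish text does not survive a case round trip. `turkish_lower` is a hypothetical helper shown only to illustrate the language-aware alternative.

```python
DOTLESS_LOWER = "\u0131"  # ı
DOTTED_UPPER = "\u0130"   # İ

# Python's str methods are locale-independent.
upper_of_dotless = DOTLESS_LOWER.upper()    # plain ASCII "I"
round_trip = DOTLESS_LOWER.upper().lower()  # comes back as dotted "i"
lower_of_dotted = DOTTED_UPPER.lower()      # "i" plus a combining dot above


def turkish_lower(text):
    """Lower-case using Turkish rules: I -> ı and İ -> i."""
    return text.replace("I", DOTLESS_LOWER).replace(DOTTED_UPPER, "i").lower()
```

This is why guessing the human language first matters: the right case-folding rule depends on it.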

Interpreting dtypes
• Use pandas to get text data (e.g. from JSON/CSV)
• Categories (e.g. “male”/“female”) are easily spotted by eye
• [“33cm”, “22inches”, ...] could be easily converted
• Date parsing:
  – The default is for US-style (MMDD), not Euro-style (DDMM)
  – pd.read_csv(parse_dates=[cols], dayfirst=False)
  – Labix dateutil, delorean, arrow, parsedatetime (NLP)
• Could you write a module to suggest possible conversions on a dataframe for the user (and notify if ambiguities are present e.g. 1/1 to 12/12...MM/DD or DD/MM)?
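The unit conversion and the MM/DD-vs-DD/MM ambiguity check could be sketched as below. Everything here is an assumption for illustration: the helper names, the tiny unit table, and the fixed `a/b/yyyy` layout are mine, not part of pandas or any of the libraries listed.

```python
import re
from datetime import datetime

CM_PER_UNIT = {"cm": 1.0, "inch": 2.54, "inches": 2.54}  # illustrative unit table


def to_cm(text):
    """Convert strings like '33cm' or '22inches' to centimetres, else None."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([a-z]+)\s*", text.lower())
    if not match or match.group(2) not in CM_PER_UNIT:
        return None
    return float(match.group(1)) * CM_PER_UNIT[match.group(2)]


def date_readings(text):
    """Return every valid reading of an 'a/b/yyyy' string, keyed by style."""
    readings = {}
    for label, fmt in (("DD/MM", "%d/%m/%Y"), ("MM/DD", "%m/%d/%Y")):
        try:
            readings[label] = datetime.strptime(text, fmt).date()
        except ValueError:
            pass  # this reading is impossible, e.g. month 25
    return readings


def is_ambiguous(text):
    """True when both readings are valid and give different dates."""
    return len(set(date_readings(text).values())) == 2
```

A column is safe to parse automatically only if no value is ambiguous, or if at least one value rules out one of the two styles for the whole column.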

Automate feature extraction?
• Can we extract features from e.g. Id columns (products/rooms/categories)?
• We could identify categorical labels and suggest Boolean column equivalents
• We could remove some of the leg-work...you avoid missing possibilities, junior data scientists get “free help”
• What tools do you know of and use?
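One way to picture the "suggest Boolean column equivalents" idea, using plain dictionaries rather than a DataFrame. The function name and the heuristic for deciding what counts as categorical are assumptions; in pandas the conversion itself is what `pd.get_dummies` does.

```python
def suggest_boolean_columns(rows, column, max_categories=10):
    """Return one-hot rows for `column` if it looks categorical, else None."""
    labels = sorted({row[column] for row in rows})
    # Heuristic (an assumption): few distinct values that repeat suggest a category.
    if len(labels) > max_categories or len(labels) == len(rows):
        return None
    return [
        {f"{column}_is_{label}": row[column] == label for label in labels}
        for row in rows
    ]


rows = [
    {"id": 1, "room": "kitchen"},
    {"id": 2, "room": "lounge"},
    {"id": 3, "room": "kitchen"},
]
room_flags = suggest_boolean_columns(rows, "room")
id_flags = suggest_boolean_columns(rows, "id")  # all values unique, so no suggestion
```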

Automated validation?
• Use 'known' to validate 'unknown' DF?
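A rough sketch of what validating an unknown dataset against a known one might look like: learn the types and numeric ranges from trusted rows, then report where new rows disagree. The function names and the choice of checks are assumptions, and rows are plain dictionaries to keep the example dependency-free.

```python
def learn_schema(known_rows):
    """Record, per column, the Python types and numeric range seen in trusted rows."""
    schema = {}
    for row in known_rows:
        for column, value in row.items():
            entry = schema.setdefault(column, {"types": set(), "values": []})
            entry["types"].add(type(value))
            entry["values"].append(value)
    for entry in schema.values():
        numeric = [v for v in entry.pop("values") if isinstance(v, (int, float))]
        entry["range"] = (min(numeric), max(numeric)) if numeric else None
    return schema


def validate(rows, schema):
    """Return (row_index, column, problem) tuples for rows that break the schema."""
    problems = []
    for index, row in enumerate(rows):
        for column, entry in schema.items():
            if column not in row:
                problems.append((index, column, "missing"))
                continue
            value = row[column]
            if type(value) not in entry["types"]:
                problems.append((index, column, "unexpected type"))
            elif entry["range"] and isinstance(value, (int, float)):
                low, high = entry["range"]
                if not low <= value <= high:
                    problems.append((index, column, "outside known range"))
    return problems
```

Out-of-range values are not necessarily wrong, so the output is best read as a list of things to eyeball rather than a list of errors.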

Automated validation?

Merging two data sources
• pd.merge(df1, df2) # exact keys, SQL-like
• fuzzywuzzy/metaphone for approximate string matching
• DataMade's dedupe.readthedocs.org to identify duplicates (or OpenRefine)
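When keys do not match exactly, approximate matching can build the join key first. The sketch below swaps in the standard library's `difflib` in place of fuzzywuzzy so it runs without extra installs; `best_match`, the 0.6 cutoff, and the company list are illustrative assumptions.

```python
import difflib


def best_match(name, candidates, cutoff=0.6):
    """Return the candidate most similar to `name`, or None if nothing is close."""
    lowered = {candidate.lower(): candidate for candidate in candidates}
    hits = difflib.get_close_matches(name.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


companies = ["Accenture", "Barclays Bank", "BigCo"]
```

Mapping each messy name through `best_match` gives a clean key column that an exact merge can then use; names that return `None` are the ones to check by hand.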

Manual Normalisation
• Eyeball the problem, solve by hand
• Lots of unit-tests!
• lower() # “Accenture”->“accenture”
• strip() # “ this and ”->“this and”
• Beware &nbsp; (approx 20 of these!)
• replace(<pattern>, "") # “BigCo Ltd”->“BigCo”
• unidecode # “áéîöũ”->“aeiou”
• normalise unicode (e.g. >50 dash variants!)
• NLTK stemming & WordNet IS-A relationships
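The hand-written steps above can be collected into small, unit-testable functions. This is a sketch under assumptions: it uses `unicodedata` from the standard library as a stand-in for unidecode (it only removes combining accents, it does not transliterate), and the list of legal suffixes is illustrative.

```python
import html
import re
import unicodedata


def strip_accents(text):
    """Drop combining marks after decomposition: 'áéîöũ' becomes 'aeiou'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalise_dashes(text):
    """Map every character in Unicode category Pd (dash punctuation) to '-'."""
    return "".join("-" if unicodedata.category(ch) == "Pd" else ch for ch in text)


def normalise_company(name):
    """Lower-case, tidy whitespace and drop a trailing legal suffix."""
    name = html.unescape(name)     # turns a literal '&nbsp;' into U+00A0
    name = " ".join(name.split())  # split() treats U+00A0 and friends as whitespace
    name = strip_accents(normalise_dashes(name)).lower()
    return re.sub(r"\s+(ltd|limited|plc)\.?$", "", name)
```

Each function does one thing, which keeps the "lots of unit-tests" advice cheap to follow.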

Representations of Null
• Just have 1 (not 4!)
• Consider Engarde & Hypothesis
• Write a schema-checker to check all of this on the source data & in your DataFrames!
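Collapsing several spellings of "missing" into one might look like this. The token list is an assumption and should be extended per data source; `normalise_null` is a hypothetical helper, not part of Engarde or Hypothesis.

```python
NULL_TOKENS = {"", "na", "n/a", "nan", "null", "none", "-"}  # illustrative, extend per source


def normalise_null(value):
    """Map every known spelling of 'missing' to None; pass real values through."""
    if value is None:
        return None
    if isinstance(value, float) and value != value:  # NaN is the only float unequal to itself
        return None
    if isinstance(value, str) and value.strip().lower() in NULL_TOKENS:
        return None
    return value


cleaned = [normalise_null(v) for v in ["", "N/A", " null ", float("nan"), None, "0", 0]]
```

Note that the string "0" and the number 0 are real values and pass through untouched.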

Automated Normalisation?
• My annotate.io
• Why not make the machine do this for us? No regular expressions! No fiddling!

Visualising new data sources
• GlueViz (numeric)
• Seaborn
• setosa.io/csv-fingerprint/
• Do you have good tools?

Augmenting Data
• Alchemy - sentiment and entities
• DBPedia - entities
  – http://dbpedia.org/data/Barclays.json
• NLTK
• spaCy
• Use ML... but don't forget regexes and other simple techniques

Starting ML on Text
• SKLearn's CountVectorizer for binary features
• Bernoulli Naive Bayes (then LogReg)
• Favour better data & features
• Diagnose failures at each stage
• Avoid complex models until you need them
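To make the binary-features-plus-Bernoulli-Naive-Bayes baseline concrete, here is a dependency-free sketch of the computation. It is not scikit-learn's implementation: in practice you would reach for `CountVectorizer(binary=True)` and `BernoulliNB`, and the function names, tokeniser, and toy documents below are assumptions for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def binary_features(text):
    """Set of lower-cased word tokens: presence only, counts are discarded."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def train(documents, labels):
    """Fit Bernoulli Naive Bayes with add-one (Laplace) smoothing."""
    vocabulary = set()
    class_sizes = Counter(labels)
    word_counts = defaultdict(Counter)  # label -> word -> number of docs containing it
    for text, label in zip(documents, labels):
        words = binary_features(text)
        vocabulary |= words
        word_counts[label].update(words)
    model = {}
    for label, size in class_sizes.items():
        prior = math.log(size / len(labels))
        probs = {w: (word_counts[label][w] + 1) / (size + 2) for w in vocabulary}
        model[label] = (prior, probs)
    return model


def predict(model, text):
    """Return the label with the highest log-posterior for `text`."""
    words = binary_features(text)
    scores = {}
    for label, (prior, probs) in model.items():
        score = prior
        for word, p in probs.items():
            # Bernoulli model: absent words contribute evidence too.
            score += math.log(p if word in words else 1 - p)
        scores[label] = score
    return max(scores, key=scores.get)


docs = [
    "python developer pandas numpy",
    "java developer spring backend",
    "sales manager retail",
    "account manager sales targets",
]
model = train(docs, ["tech", "tech", "sales", "sales"])
```

Because the per-word probabilities are easy to print, a model this simple also makes it straightforward to diagnose why a document was misclassified before moving to anything more complex.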

Closing...
• Give me feedback on annotate.io
• Give me your dirty-data horror stories, I want to fix some of these problems
• http://ianozsvald.com/
• PyDataLondon monthly meetup
• Do you have data science deployment stories for my keynote at BudapestBIForum? What's “hardest” in (data) science for your team?
