Data Cleaning on text to prepare for analysis and machine learning @ EuroSciPy 2015

Slide 1

Slide 1 text

Data Cleaning on text to prepare for analysis and machine learning @ EuroSciPy 2015 Ian Ozsvald @IanOzsvald ModelInsight.io

Slide 2

Slide 2 text

[email protected] @IanOzsvald EuroSciPy August 2015
Who am I?
● Past speaker+teacher at PyDatas, EuroPythons, PyCons, PyConUKs
● Co-org of PyDataLondon
● O'Reilly Author
● ModelInsight.io for NLP+ML IP creation in London
● “I clean data” #sigh
● Please learn from my mistakes

Slide 3

Slide 3 text

Unstructured data->Value
● Increasing rate of growth and meant for human consumption
● Hard to:
  ● Extract
  ● Parse
  ● Make machine-readable
● It is also very valuable...part of my consultancy - we're currently automating recruitment:
  ● Uses: search, visualisation, new ML features
● Most industrial problems are messy, not “hard”, but time consuming!
● How can we make it easier for ourselves?
● “80% of our time is cleaning data” [The Internet]

Slide 4

Slide 4 text

Extracting text from PDFs & Word
● http://textract.readthedocs.org/en/latest/ (Python)
● Apache Tika (Java - via jnius?)
● Difficulties
  ● Formatting probably horrible
  ● No semantic interpretation (e.g. CVs)
  ● Keyword stuffing, images, out-of-order or multi-column text, tables #sigh
● “Content ExtRactor and MINEr” (CERMINE) for academic papers
● Commercial CV parsers (e.g. Sovren)
● Do you know of other tools that add structure?

Slide 5

Slide 5 text

Extracting tables from PDFs
● ScraperWiki's https://pdftables.com/ (builds on pdfminer)
● http://tabula.technology/ (Ruby/Java, open source, seems to require user intervention)
● messytables (Python, lesser known, auto-guesses dtypes for CSV & HTML & PDFs)
● Maybe you can help with better solutions?

Slide 6

Slide 6 text

Fixing badly encoded text
● http://ftfy.readthedocs.org/en/latest/
● HTML unescaping:
● chromium-compact-language-detector will guess human language from 80+ options (so you can choose your own decoding options) -> “Turkish I Problem (next)”
● chardet: CP1252 Windows vs UTF-8
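A minimal sketch of the two repairs named on this slide, using only the Python standard library. It assumes the damage is UTF-8 text that was wrongly decoded as CP1252 exactly once; ftfy covers many more cases. The helper name `fix_mojibake` is illustrative and is not part of ftfy or chardet.

```python
import html


def fix_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as CP1252.

    Assumption: exactly one round of CP1252 mis-decoding happened.
    If the text does not fit that pattern it is returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


broken = "cafÃ© &amp; crÃ¨me"
# Repair the encoding first, then turn HTML entities back into characters.
repaired = html.unescape(fix_mojibake(broken))
print(repaired)
```

Text that is already clean fails the round trip and is passed through untouched, so the helper is safe to apply to a mixed column.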

Slide 7

Slide 7 text

The “Turkish I Problem”
Irish: dotted and dotless lowercase i mean the same thing
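In Turkish, by contrast, dotted i/İ and dotless ı/I are different letters, so locale-independent case folding can silently change a word. The sketch below shows the default Python behaviour and one possible workaround; `lower_tr` is a hypothetical helper, not a standard library function, and it assumes you already know the text is Turkish (e.g. from a language detector).

```python
# Default case mapping in Python ignores the language of the text.
print("I".lower() == "i")           # capital I becomes dotted i
print("ı".upper() == "I")           # dotless ı becomes capital I
print("ı".upper().lower() == "ı")   # so a round trip loses the dotless ı

# Turkish-aware lowercasing: map the two capitals explicitly,
# then let str.lower() handle every other character.
_TR_LOWER = str.maketrans({"I": "ı", "İ": "i"})


def lower_tr(text):
    """Lowercase text using Turkish rules for I and İ (illustrative helper)."""
    return text.translate(_TR_LOWER).lower()


print(lower_tr("ISPARTA"), lower_tr("İZMİR"))
```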

Slide 8

Slide 8 text

Interpreting dtypes
● Use pandas to get text data (e.g. from JSON/CSV)
● Categories (e.g. “male”/“female”) are easily spotted by eye
● [“33cm”, “22inches”, ...] could be easily converted
● Date parsing:
  ● The default is for US-style (MMDD), not Euro-style (DDMM)
  ● pd.read_csv(parse_dates=[cols], dayfirst=False)
  ● Labix dateutil, delorean, arrow, parsedatetime (NLP)
● Could you write a module to suggest possible conversions on dataframe for the user (and notify if ambiguities are present e.g. 1/1 to 12/12...MM/DD or DD/MM)?
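One way the ambiguity check asked for above might look, as a standard-library sketch. It assumes dates are written as two numbers and a four-digit year separated by "/"; the function name and the return labels are made up for illustration.

```python
from datetime import datetime


def _parses(value, fmt):
    """Return True if value is a valid date under the given strptime format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def classify_date_order(values):
    """Report which reading fits every value in the column.

    Returns "dayfirst", "monthfirst", "ambiguous" (both fit) or
    "invalid" (neither reading fits all values).
    """
    day_first = all(_parses(v, "%d/%m/%Y") for v in values)
    month_first = all(_parses(v, "%m/%d/%Y") for v in values)
    if day_first and month_first:
        return "ambiguous"
    if day_first:
        return "dayfirst"
    if month_first:
        return "monthfirst"
    return "invalid"


print(classify_date_order(["01/02/2015", "12/12/2015"]))
print(classify_date_order(["13/02/2015", "01/02/2015"]))
```

A column is only unambiguous once at least one value has a number above 12 in one position, which is why the check looks at the whole column rather than single values.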

Slide 9

Slide 9 text

Automate feature extraction?
● Can we extract features from e.g. ID columns (products/rooms/categories)?
● We could identify categorical labels and suggest Boolean column equivalents
● We could remove some of the leg-work...you avoid missing possibilities, junior data scientists get “free help”
● What tools do you know of and use?
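A small sketch of the "suggest Boolean column equivalents" idea, on plain lists of dicts so it runs without pandas (with a DataFrame, `pd.get_dummies` does a similar job). The function name, the `max_categories` threshold and the `<column>_is_<value>` naming are all assumptions made for this example.

```python
def suggest_boolean_columns(rows, column, max_categories=10):
    """Add one Boolean column per distinct value of `column`.

    If the column has more than `max_categories` distinct values it is
    probably not categorical, so the rows are returned unchanged.
    """
    categories = sorted({row[column] for row in rows})
    if len(categories) > max_categories:
        return rows
    return [
        {**row, **{f"{column}_is_{cat}": row[column] == cat for cat in categories}}
        for row in rows
    ]


rows = [{"room": "kitchen"}, {"room": "bath"}]
print(suggest_boolean_columns(rows, "room"))
```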

Slide 10

Slide 10 text

Automated validation?
● Use 'known' to validate 'unknown' DF?
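One possible reading of "use 'known' to validate 'unknown'", sketched on lists of dicts rather than DataFrames: learn which columns and value types appear in trusted rows, then report where new rows disagree. The function names and problem labels are hypothetical.

```python
def learn_profile(rows):
    """Record, per column, the Python types seen in trusted rows."""
    profile = {}
    for row in rows:
        for col, value in row.items():
            profile.setdefault(col, set()).add(type(value))
    return profile


def validate(rows, profile):
    """Return (row_index, column, problem) tuples for rows that disagree."""
    problems = []
    for i, row in enumerate(rows):
        # Columns the trusted data had but this row lacks.
        for col in sorted(profile.keys() - row.keys()):
            problems.append((i, col, "missing column"))
        for col, value in row.items():
            if col not in profile:
                problems.append((i, col, "unexpected column"))
            elif type(value) not in profile[col]:
                problems.append((i, col, "unexpected type"))
    return problems


known = [{"age": 33, "name": "a"}]
unknown = [{"age": "33", "name": "b"}, {"age": 4}]
print(validate(unknown, learn_profile(known)))
```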

Slide 11

Slide 11 text

Automated validation?

Slide 12

Slide 12 text

Merging two data sources
● pd.merge(df1, df2) # exact keys, SQL-like
● fuzzywuzzy/metaphone for approximate string matching
● DataMade's dedupe.readthedocs.org to identify duplicates (or OpenRefine)
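When keys do not match exactly, approximate matching can pair them up before a merge. The sketch below swaps in the standard library's `difflib` for fuzzywuzzy so that it runs without extra installs; the function name and the 0.8 cutoff are assumptions, and a real project would tune the cutoff on its own data.

```python
import difflib


def fuzzy_match_keys(left_keys, right_keys, cutoff=0.8):
    """Map each left key to its closest right key, or None if nothing is close."""
    # Compare on a lightly normalised form, but return the original right key.
    normalised = {key.lower().strip(): key for key in right_keys}
    matches = {}
    for key in left_keys:
        close = difflib.get_close_matches(
            key.lower().strip(), list(normalised), n=1, cutoff=cutoff
        )
        matches[key] = normalised[close[0]] if close else None
    return matches


left = ["Accenture", "BigCo Ltd", "Unrelated"]
right = ["accenture ", "BigCo Ltd.", "Other"]
print(fuzzy_match_keys(left, right))
```

The resulting mapping can be added as a new key column on one side, after which an exact `pd.merge` works as usual.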

Slide 13

Slide 13 text

Manual Normalisation
● Eyeball the problem, solve by hand
● Lots of unit-tests!
● lower() # “Accenture”->“accenture”
● strip() # “ this and ”->“this and”
● Beware &nbsp; (approx 20 of these!)
● replace(,””) # “BigCo Ltd”->“BigCo”
● unidecode # “áéîöũ”->“aeiou”
● normalise unicode (e.g. >50 dash variants!)
● NLTK stemming & WordNet ISA relationships
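The steps on this slide can be chained into one function. This is a standard-library sketch: `unicodedata` stands in for unidecode (it only strips accents that decompose, which is narrower than unidecode), the dash table lists just a handful of variants, and the `" ltd"` suffix is an assumption taken from the “BigCo Ltd” example.

```python
import unicodedata

# A few of the many Unicode dash variants; extend as your data requires.
_DASHES = dict.fromkeys(map(ord, "\u2010\u2011\u2012\u2013\u2014\u2212"), "-")


def strip_accents(text):
    """Drop combining marks after decomposition, e.g. 'é' -> 'e'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalise_name(text, suffixes=(" ltd",)):
    """Lowercase, trim, unify dashes and spaces, and drop known suffixes."""
    text = unicodedata.normalize("NFKC", text)  # folds no-break space into a plain space
    text = text.translate(_DASHES)
    text = strip_accents(text).lower().strip()
    for suffix in suffixes:
        if text.endswith(suffix):
            text = text[: -len(suffix)]
    return " ".join(text.split())  # collapse runs of whitespace


print(normalise_name("\u00a0BigCo Ltd "))
print(strip_accents("áéîöũ"))
```

Each step is small enough to unit-test on its own, which is the point of the "lots of unit-tests" bullet.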

Slide 14

Slide 14 text

Representations of Null
● Just have 1 (not 4!)
● Consider Engarde & Hypothesis
● Write a schema-checker to check all of this on the source data & in your DataFrames!
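A sketch of collapsing several null spellings into a single one (`None` here). The token list is an assumption for illustration; the slide does not say which four representations were found, so build the list from what actually appears in your source data.

```python
# Hypothetical list of strings that should all mean "no value".
NULL_TOKENS = {"", "na", "n/a", "nan", "null", "none", "-", "?"}


def to_single_null(value):
    """Return None for any recognised null representation, else the value."""
    if value is None:
        return None
    if isinstance(value, float) and value != value:  # NaN is never equal to itself
        return None
    if isinstance(value, str) and value.strip().lower() in NULL_TOKENS:
        return None
    return value


raw = ["N/A", " null ", float("nan"), None, "0", 0]
print([to_single_null(v) for v in raw])
```

Legitimate values such as `"0"` and `0` pass through, so the function does not confuse "zero" with "missing".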

Slide 15

Slide 15 text

Automated Normalisation?
● My annotate.io
● Why not make the machine do this for us? No regular expressions! No fiddling!

Slide 16

Slide 16 text

Visualising new data sources
● GlueViz (numeric)
● Seaborn
● setosa.io/csv-fingerprint/
● Do you have good tools?

Slide 17

Slide 17 text

Augmenting Data
● Alchemy - sentiment and entities
● DBpedia - entities – http://dbpedia.org/data/Barclays.json
● NLTK
● spaCy
● Use ML... but don't forget regexes and other simple techniques

Slide 18

Slide 18 text

Starting ML on Text
● scikit-learn's CountVectorizer for binary features
● Bernoulli Naive Bayes (then LogReg)
● Favour better data & features
● Diagnose failures at each stage
● Avoid complex models until you need them
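To make "binary features" concrete, here is a standard-library sketch of what a binary bag-of-words looks like: each column records whether a word is present, not how often. In scikit-learn the same idea is `CountVectorizer(binary=True)`; the tokenizer and function names below are simplified stand-ins, not scikit-learn's implementation.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def binary_features(docs):
    """Return (vocabulary, rows); each row marks word presence with 0 or 1."""
    vocab = sorted(set().union(*(tokens(d) for d in docs)))
    rows = [[int(word in tokens(d)) for word in vocab] for d in docs]
    return vocab, rows


vocab, rows = binary_features(["spam spam eggs", "eggs and ham"])
print(vocab)
print(rows)
```

Presence/absence rows like these are the input a Bernoulli Naive Bayes model expects, which is why the two bullets are paired.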

Slide 19

Slide 19 text

Closing...
● Give me feedback on annotate.io
● Give me your dirty-data horror stories, I want to fix some of these problems
● http://ianozsvald.com/
● PyDataLondon monthly meetup
● Do you have data science deployment stories for my keynote at BudapestBIForum? What's “hardest” in (data) science for your team?