Slide 1

Slide 1 text

Annotate.io – automated data cleaning Ian Ozsvald @IanOzsvald ModelInsight.io

Slide 2

Slide 2 text

[email protected] @IanOzsvald PyDataLondon February 2015
Automated data cleaning
● Cleaning raw text is boring, hinders new projects
● Help non-programmers investigate dirty data
● Keynote “The real unsolved problems in data science”
● What tooling exists?
● Can't the machine do this for us?

Slide 3

Slide 3 text

CV Cleaning
What I have          Cleaning operation?
Accenture PLC        Remove PLC?
ACCENTURE            Lowercase?
Lancôme              Remove 'foreign characters'?
Lancôme              Strip pre/post whitespace?
Société Générale     Deal with badly encoded data?
Huge variety of company names, all hand-written

Slide 4

Slide 4 text

Declarative CV Cleaning
What I have          What I want
Accenture PLC        accenture
ACCENTURE            accenture
Lancôme              lancome
Lancôme              lancome
Société Générale     societe generale
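For comparison, the cleaning operations listed on the previous slide could be hand-coded roughly as below. This is only a sketch of the manual approach, not how Annotate.io works: the function name `clean_company_name` and the `LEGAL_SUFFIXES` tuple are made up for illustration, and it does not attempt to repair badly encoded input.

```python
import unicodedata

# Hypothetical suffix list; only "PLC" appears on the slide.
LEGAL_SUFFIXES = ("plc",)


def clean_company_name(raw):
    """Map a hand-written company name to a normalised form."""
    # Strip leading/trailing whitespace and lowercase.
    text = raw.strip().lower()
    # Decompose accented characters (e.g. "ô" -> "o" + combining mark),
    # then drop the combining marks to leave plain letters.
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Drop a trailing legal suffix such as "plc".
    tokens = text.split()
    if tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens = tokens[:-1]
    return " ".join(tokens)


if __name__ == "__main__":
    for name in ["Accenture PLC", "ACCENTURE", " Lancôme ", "Société Générale"]:
        print(repr(name), "->", repr(clean_company_name(name)))
```

Every new kind of dirt in the data means another hand-written rule like these, which is the cost the declarative "what I have / what I want" pairs are meant to avoid.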

Slide 5

Slide 5 text

Salary extraction on Adzuna

Slide 6

Slide 6 text

Salary extraction
What I have                 What I want
To 53K w/benefits           53000
30000 OTE plus bonus        30000
£55000 salary               55000
Excellent                   “”
Forty two thousand GBP      42000
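To show what these five rows cost when done by hand, here is a rough sketch of a rule-based extractor. It is the kind of regular-expression fiddling the talk wants to avoid, not Annotate.io's method. The names `extract_salary` and `words_to_number` are hypothetical, and the rules are only tuned to the examples in the table.

```python
import re

# Small vocabulary for spelled-out numbers (illustrative, not exhaustive).
UNITS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11,
    "twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15,
    "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def words_to_number(tokens):
    """Return the number spelled out in tokens, or None if there is none."""
    total, current, seen = 0, 0, False
    for tok in tokens:
        if tok in UNITS:
            current += UNITS[tok]
            seen = True
        elif tok in TENS:
            current += TENS[tok]
            seen = True
        elif tok == "hundred" and seen:
            current *= 100
        elif tok == "thousand" and seen:
            total += current * 1000
            current = 0
        # Any other word (e.g. "gbp") is ignored.
    return total + current if seen else None


def extract_salary(text):
    """Return the salary as a digit string, or "" if none is found."""
    lowered = text.lower()
    # Digits first: "53k", "30000", "55,000".
    match = re.search(r"(\d[\d,]*)(k\b)?", lowered)
    if match:
        value = int(match.group(1).replace(",", ""))
        if match.group(2):
            value *= 1000
        return str(value)
    # Otherwise try spelled-out numbers: "forty two thousand".
    value = words_to_number(re.findall(r"[a-z]+", lowered))
    return str(value) if value is not None else ""


if __name__ == "__main__":
    for raw in ["To 53K w/benefits", "30000 OTE plus bonus",
                "£55000 salary", "Excellent", "Forty two thousand GBP"]:
        print(repr(raw), "->", repr(extract_salary(raw)))
```

Even this small sketch needs a number vocabulary and two separate code paths, and inputs such as salary ranges or hourly rates would need more rules again.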

Slide 7

Slide 7 text

Salary extraction
No regular expressions! No fiddling!

Slide 8

Slide 8 text

Your data cleaning problems?
● Write-ups: http://ianozsvald.com/
● http://annotate.io/ announce email list & working demo (Python 2.7 & 3.4)
● Tell me
  ● what data do you want to clean?
  ● where have you lost time cleaning before?