Annotate.io introduction

ianozsvald
February 04, 2015

Lightning talk at PyData London, February 2015, discussing my new self-learning text-cleaning service for data scientists: http://annotate.io

Transcript

  1. [email protected] @IanOzsvald PyDataLondon February 2015
    Automated data cleaning
    • Cleaning raw text is boring, hinders new projects
    • Help non-programmers investigate dirty data
    • Keynote “The real unsolved problems in data science”
    • What tooling exists?
    • Can't the machine do this for us?
  2. CV Cleaning
    What I have          Cleaning operation?
    Accenture PLC        Remove PLC?
    ACCENTURE            Lowercase?
    Lancôme              Remove 'foreign characters'?
    Lancôme              Strip pre/post whitespace?
    Société Générale     Deal with badly encoded data?
    Huge variety of company names, all hand-written
  3. Declarative CV Cleaning
    What I have          What I want
    Accenture PLC        accenture
    ACCENTURE            accenture
    Lancôme              lancome
    Lancôme              lancome
    Société Générale     societe generale
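The input/output pairs on this slide can be reproduced by hand with a few lines of Python. The sketch below is only a hand-written baseline, not the Annotate.io service or its API; the names `fix_mojibake`, `clean_company_name`, and `LEGAL_SUFFIXES` are hypothetical, and the suffix list assumes "PLC" is the only legal suffix of interest because it is the only one shown in the slides.

```python
import re
import unicodedata

# Assumption: only the suffix shown on the slides; extend as needed.
LEGAL_SUFFIXES = ("plc",)
_SUFFIX_RE = re.compile(
    r"\s+(?:%s)\.?$" % "|".join(re.escape(s) for s in LEGAL_SUFFIXES)
)


def fix_mojibake(text):
    """Try to undo UTF-8 text that was wrongly decoded as Latin-1.

    If the round trip fails, the text was probably fine, so return it as-is.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def clean_company_name(raw):
    """Apply the cleaning operations listed on the 'CV Cleaning' slide."""
    text = fix_mojibake(raw)
    text = text.strip().lower()
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove a trailing legal suffix such as "plc".
    return _SUFFIX_RE.sub("", text)


if __name__ == "__main__":
    for name in ["Accenture PLC", "ACCENTURE", "Lancôme", "  Lancôme  ",
                 "Société Générale"]:
        print(repr(name), "->", repr(clean_company_name(name)))
```

Every rule here had to be spotted and coded by hand, which is the effort the declarative "what I have / what I want" approach aims to avoid.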
  4. Salary extraction
    What I have               What I want
    To 53K w/benefits         53000
    30000 OTE plus bonus      30000
    £55000 salary             55000
    Excellent                 “”
    Forty two thousand GBP    42000
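For comparison, a rule-based version of the salary extraction on this slide might look like the sketch below. It is an illustration only and not how Annotate.io works; `extract_salary` and `words_to_number` are hypothetical names, and returning `None` where the slide shows an empty result is an assumption.

```python
import re

# Number words needed to read amounts such as "forty two thousand".
SMALL = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11,
    "twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15,
    "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19,
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50, "sixty": 60,
    "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"thousand": 1000, "million": 1000000}

# Digits with optional thousands separators, decimals and a "k" multiplier.
NUMERIC = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*(k\b)?", re.IGNORECASE)


def words_to_number(text):
    """Parse the first run of English number words, or return None."""
    total, current, seen = 0, 0, False
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in SMALL:
            current += SMALL[token]
            seen = True
        elif token == "hundred":
            current = max(current, 1) * 100
            seen = True
        elif token in SCALES:
            total += max(current, 1) * SCALES[token]
            current = 0
            seen = True
        elif token == "and" and seen:
            continue
        elif seen:
            break  # the number has ended, e.g. at "GBP"
    return total + current if seen else None


def extract_salary(text):
    """Return the salary as an int, or None if no amount is found."""
    match = NUMERIC.search(text)
    if match:
        value = float(match.group(1).replace(",", ""))
        if match.group(2):  # "53K" means 53 thousand
            value *= 1000
        return int(round(value))
    return words_to_number(text)


if __name__ == "__main__":
    for raw in ["To 53K w/benefits", "30000 OTE plus bonus", "£55000 salary",
                "Excellent", "Forty two thousand GBP"]:
        print(repr(raw), "->", extract_salary(raw))
```

Even for five examples the rules already need a number-word table and a special case for "K", which illustrates why hand-written cleaning eats time on real data.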
  5. Your data cleaning problems?
    • Write-ups: http://ianozsvald.com/
    • http://annotate.io/ announce email list & working demo (Python 2.7 & 3.4)
    • Tell me
        • what data do you want to clean?
        • where have you lost time cleaning before?