Annotate.io introduction

ianozsvald
February 04, 2015

Lightning talk at PyData London, February 2015, introducing my new self-learning text cleaning service for data scientists: http://annotate.io


Transcript

  1. [email protected] @IanOzsvald PyDataLondon February 2015

    Automated data cleaning
    • Cleaning raw text is boring, hinders new projects
    • Help non-programmers investigate dirty data
    • Keynote “The real unsolved problems in data science”
    • What tooling exists?
    • Can't the machine do this for us?
  2. CV Cleaning

    What I have          Cleaning operation?
    Accenture PLC        Remove PLC?
    ACCENTURE            Lowercase?
    Lancôme              Remove 'foreign characters'?
    Lancôme              Strip pre/post whitespace?
    Société Générale     Deal with badly encoded data?

    Huge variety of company names, all hand-written
  3. Declarative CV Cleaning

    What I have          What I want
    Accenture PLC        accenture
    ACCENTURE            accenture
    Lancôme              lancome
    Lancôme              lancome
    Société Générale     societe generale
  4. Salary extraction

    What I have                 What I want
    To 53K w/benefits           53000
    30000 OTE plus bonus        30000
    £55000 salary               55000
    Excellent                   “”
    Forty two thousand GBP      42000
  5. Your data cleaning problems?

    • Write-ups: http://ianozsvald.com/
    • http://annotate.io/ announce email list & working demo (Python 2.7 & 3.4)
    • Tell me
      • what data do you want to clean?
      • where have you lost time cleaning before?
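
For comparison, here is a minimal hand-written sketch of the "What I have" to "What I want" mappings from the company-name and salary slides. It is not how Annotate.io works (the service is described as self-learning, whereas these rules are fixed), and every name in it (`clean_company`, `extract_salary`, `COMPANY_SUFFIXES`, `SMALL`) is hypothetical. It assumes "PLC" is the only suffix to drop, does not attempt to repair badly encoded text, and returns `None` where the slide shows an empty result.

```python
import re
import unicodedata

# Assumption: the slides only mention "PLC"; a real list would be longer.
COMPANY_SUFFIXES = ("plc",)

# Spelled-out numbers below one hundred (hypothetical helper table).
SMALL = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
SMALL.update({w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)})


def clean_company(name):
    """Lowercase a company name, strip accents, whitespace and trailing PLC."""
    # Decompose accented characters, then drop the combining marks (ô -> o).
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch))
    # split() also discards leading and trailing whitespace.
    tokens = unaccented.lower().split()
    while tokens and tokens[-1] in COMPANY_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def extract_salary(text):
    """Return the first salary found as an int, or None if there is none."""
    lowered = text.lower().replace(",", "")
    # Digits first: "53K" -> 53000, "£55000" -> 55000.
    match = re.search(r"(\d+)\s*(k\b)?", lowered)
    if match:
        value = int(match.group(1))
        return value * 1000 if match.group(2) else value
    # Fall back to spelled-out amounts such as "forty two thousand".
    total = current = 0
    seen_number = False
    for word in re.findall(r"[a-z]+", lowered):
        if word in SMALL:
            current += SMALL[word]
            seen_number = True
        elif word == "hundred" and seen_number:
            current *= 100
        elif word == "thousand" and seen_number:
            total += current * 1000
            current = 0
    return total + current if seen_number else None


if __name__ == "__main__":
    for raw in ["Accenture PLC", "ACCENTURE", " Lancôme ", "Société Générale"]:
        print(repr(raw), "->", repr(clean_company(raw)))
    for raw in ["To 53K w/benefits", "30000 OTE plus bonus", "£55000 salary",
                "Excellent", "Forty two thousand GBP"]:
        print(repr(raw), "->", extract_salary(raw))
```

Rules like these break as soon as a new variant appears (a different suffix, a salary range, a mis-encoded character), which is the maintenance cost that motivates asking whether the machine can learn the cleaning from examples instead.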