Van Der Walt via EuroSciPy 2014 Evolutionary behavioural genetics and population structure of the Great White Shark Carcharodon Carcharias, Sara Andreotti
• You have small volumes of data
• Speed isn't important
• Reproducibility is a low priority
• → Use manual approaches (e.g. humans)
• So...what gets in the way of Data Science?
lines or few genuine examples
• Missing fields and illegal contents
• Undocumented schema
• ASCII vs UTF-8 vs CP-1252 → "" ”” “”
• Booleans (2 types or 3 or more?)
• 3/4/2012 and dateutil
• MM-DD-YY vs DD-MM-YY vs YY-MM-DD
• "J.P. Morgan" "jpmc - Project X" – are they similar?
• What are the common paths to solutions?
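A few of the problems above can be reproduced with the standard library alone. This is a minimal sketch rather than a recommended cleaning pipeline: the helper names `parse_bool` and `similarity`, the accepted true/false spellings, and the use of `difflib` as a similarity measure are all assumptions made for illustration, and `datetime.strptime` stands in for dateutil to keep the example dependency-free.

```python
from datetime import datetime
from difflib import SequenceMatcher

# 1. The same string is a different date under each assumed convention.
raw = "3/4/2012"
as_month_first = datetime.strptime(raw, "%m/%d/%Y")  # 4 March 2012
as_day_first = datetime.strptime(raw, "%d/%m/%Y")    # 3 April 2012

# 2. Curly quotes written as CP-1252 bytes are not valid UTF-8.
smart_bytes = "\u201cquoted\u201d".encode("cp1252")
try:
    smart_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
recovered = smart_bytes.decode("cp1252")

# 3. Booleans: many spellings, plus a third "unknown" state.
# The accepted spellings here are an assumption, not a standard.
TRUE_WORDS = {"true", "t", "yes", "y", "1"}
FALSE_WORDS = {"false", "f", "no", "n", "0"}

def parse_bool(value):
    """Return True, False, or None when the value is not recognised."""
    text = str(value).strip().lower()
    if text in TRUE_WORDS:
        return True
    if text in FALSE_WORDS:
        return False
    return None

# 4. A crude character-level similarity score between 0 and 1.
def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

score = similarity("J.P. Morgan", "jpmc - Project X")
```

A character-level score like this knows nothing about abbreviations such as "jpmc", which is part of why company-name matching tends to need domain-specific rules or a dedicated record-linkage tool.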
data
• Current project – 9 months invested cleaning company names
• Chief Data Scientists cite as significant expense
• On-going 'below the surface' costs with adding dirty data, maintaining data integrity, keeping pipeline consistent
• Do a Data Audit to understand what you have
• We need more data cleaning tools and better integration to non-Python systems
• We can only do clever things if we have clean data
• Garbage in, garbage out...
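One possible starting point for a data audit is a per-column count of missing and distinct values. The sketch below is an assumption about what such an audit might look like, not a method from the talk: the `audit` function, the list of tokens treated as missing, and the sample rows are all hypothetical.

```python
import csv
import io

def audit(rows, missing=("", "NA", "N/A", "null")):
    """Summarise each column: row count, missing count, distinct values."""
    stats = {}
    for row in rows:
        for column, value in row.items():
            col = stats.setdefault(
                column, {"rows": 0, "missing": 0, "values": set()}
            )
            col["rows"] += 1
            text = (value or "").strip()
            if text in missing:
                col["missing"] += 1
            else:
                col["values"].add(text)
    return {
        column: {
            "rows": col["rows"],
            "missing": col["missing"],
            "distinct": len(col["values"]),
        }
        for column, col in stats.items()
    }

# Hypothetical sample data, inlined so the example is self-contained.
sample = io.StringIO(
    "name,country\n"
    "Acme Ltd,UK\n"
    "acme ltd,\n"
    "J.P. Morgan,NA\n"
)
report = audit(csv.DictReader(sample))
for column, summary in report.items():
    print(column, summary)
```

Even this small report surfaces two of the issues listed earlier: "Acme Ltd" and "acme ltd" count as distinct names, and the country column is mostly empty.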
help?
• Normalise company/place/people - names and addresses (new US-address-parser?)
• General “join on these columns” tool (Duke/Dedupe)
• Named Entity Recognition
• Recognise product photos
• Label reader from photos
• Domain-specific sentiment analysis
• Do you have APIs you could publish?
help?
• Please tell me exactly what datetime I have in my dataset
• What's wrong with my addresses?
• What are the closest Wikipedia pages to my names/companies/places
• Does the sex column match the names column?
• Is this photo upside down?
• We need more automation here
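The datetime question above could be partly automated by trying each candidate format against a whole column and keeping only the formats that parse every value. This is a rough sketch under stated assumptions: `matching_formats` and the candidate list are hypothetical, and real data would need more formats and some tolerance for bad rows.

```python
from datetime import datetime

# Candidate layouts: MM-DD-YY, DD-MM-YY, YY-MM-DD.
CANDIDATES = ["%m-%d-%y", "%d-%m-%y", "%y-%m-%d"]

def matching_formats(values, candidates=CANDIDATES):
    """Return the candidate formats that parse every value in the column."""
    matches = []
    for fmt in candidates:
        try:
            for value in values:
                datetime.strptime(value, fmt)
        except ValueError:
            continue  # at least one value is impossible under this format
        matches.append(fmt)
    return matches

# Every field is 12 or less, so all three layouts remain plausible.
ambiguous = matching_formats(["03-04-12", "05-06-11"])

# "25" cannot be a month and "99" cannot be a day, leaving one layout.
resolved = matching_formats(["03-04-12", "25-12-99"])
```

When more than one format survives, the column is genuinely ambiguous and the answer has to come from the data's source rather than from the values themselves.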
too
• Efficient algorithms
• Profilers/Compilers
• Multi-core
• Clusters
• Julia perceived as 'fast solution'
• R has better stats support so you 'work faster'
• Need better 'go-fast' ideas
we can share tooling with other languages?
• Shared data frames?
• Do you want 2 languages in your head?
radar.oreilly.com/2014/01/ipython-a-unified-environment-for-interactive-data-analysis.html
Have a clear objective
• Get lots of clean, tagged data
• Visualise it
• Make a classifier
• Use open datasets for practice
• Kaggle
• Where to find more?