PyConIreland 2014 Keynote "The Real Unsolved Problems In Data Science"

ianozsvald

October 11, 2014

Transcript

  1. Who Am I?

    [email protected] @IanOzsvald PyConIreland October 2014
    • Solving “Data Science” for 15 years in industry
    • Author
    • Teacher at PyCons
  2. Who is a Data Scientist?

    http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
  3. Who else is a Data Scientist?

    http://datascopeanalytics.com/what-we-think/2014/02/05/what-is-a-data-scientist
  4. Shark Fin Fingerprinting

    Attribution: Stefan Van Der Walt via EuroSciPy 2014
    Evolutionary behavioural genetics and population structure of the Great White Shark Carcharodon Carcharias, Sara Andreotti
  5. Why 'now'? (hint: not Big Data!)

    http://en.wikipedia.org/wiki/List_of_countries_by_number_of_Internet_users
  6. When to avoid Data Science?

    • You have small volumes of data
    • Speed isn't important
    • Reproducibility is a low priority
    • → Use manual approaches (e.g. humans)
    • So...what gets in the way of Data Science?
  7. Request: Magic Quickly!

    • Explain unrealistic requests
    • R&D != Engineering
    • Data quality
    • Time frames
    • Expertise required
    • Need more success stories
    • Attribution: xkcd.com/1425/
  8. Poor quality data

    • Few lines or few genuine examples
    • Missing fields and illegal contents
    • Undocumented schema
    • ASCII vs UTF-8 vs CP-1252 → "" ”” “”
    • Booleans (2 types or 3 or more?)
    • 3/4/2012 and dateutil
    • MM-DD-YY vs DD-MM-YY vs YY-MM-DD
    • "J.P. Morgan" "jpmc - Project X" – are they similar?
    • What are the common paths to solutions?
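Three of the problems on the poor-quality-data slide can be reproduced in a few lines. The following is a minimal sketch using only the standard library, not code from the talk: the sample strings and the `similarity` helper are illustrative assumptions, and `datetime.strptime` stands in for dateutil to make the two readings of "3/4/2012" explicit.

```python
import difflib
from datetime import datetime

# 1. Encoding: the same bytes decoded under two different assumptions.
curly = "\u201cquoted\u201d"        # text wrapped in curly double quotes
raw = curly.encode("utf-8")         # bytes as a UTF-8 system would store them
# A reader that assumes CP-1252 turns the quotes into mojibake.
misread = raw.decode("cp1252", errors="replace")

# 2. Dates: "3/4/2012" parses cleanly under both conventions.
text = "3/4/2012"
month_first = datetime.strptime(text, "%m/%d/%Y")   # 4 March 2012
day_first = datetime.strptime(text, "%d/%m/%Y")     # 3 April 2012

# 3. Names: a character-level score (hypothetical helper).
def similarity(a, b):
    """Return a 0..1 similarity ratio, ignoring case."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

score = similarity("J.P. Morgan", "jpmc - Project X")

print(repr(misread))
print(month_first.date(), day_first.date())
print(round(score, 2))
```

Neither date parse raises an error, so the ambiguity is silent unless the source convention is documented. The character-level score says nothing about whether the two names refer to the same company, which is why the slide asks whether they are "similar" at all.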
  9. The cost of poor quality data

    • Current project – 9 months invested cleaning company names
    • Chief Data Scientists cite this as a significant expense
    • On-going 'below the surface' costs with adding dirty data, maintaining data integrity, keeping pipeline consistent
    • Do a Data Audit to understand what you have
    • We need more data cleaning tools and better integration with non-Python systems
    • We can only do clever things if we have clean data
    • Garbage in, garbage out...
  10. Camera-OCR (generally still bad)

    • http://vbridge.co.uk/2012/11/05/how-we-tuned-tesseract-to-perform-as-well-as-a-commercial-ocr-package/
  11. New APIs – can you help?

    • Normalise company/place/people - names and addresses (new US-address-parser?)
    • General “join on these columns” tool (Duke/Dedupe)
    • Named Entity Recognition
    • Recognise product photos
    • Label reader from photos
    • Domain-specific sentiment analysis
    • Do you have APIs you could publish?
  12. Data checkers – too low-level

    • How many ints bools strs?
    • setosa.io csv fingerprint
    • What about human-level data?
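To make "how many ints bools strs?" concrete, here is a minimal sketch of such a low-level checker. It is not the setosa.io tool; the `BOOL_TOKENS` set, the `guess_type` and `profile_column` helpers, and the sample column are all assumptions chosen for illustration.

```python
from collections import Counter

# Assumed spellings of boolean values; real datasets often use more.
BOOL_TOKENS = {"true", "false", "yes", "no", "y", "n", "t", "f"}

def guess_type(value):
    """Guess a coarse type label for one raw string cell."""
    text = value.strip()
    if text == "":
        return "missing"
    if text.lower() in BOOL_TOKENS:
        return "bool"
    try:
        int(text)
        return "int"
    except ValueError:
        pass
    try:
        float(text)
        return "float"
    except ValueError:
        return "str"

def profile_column(values):
    """Count the guessed types in a column of raw strings."""
    return Counter(guess_type(v) for v in values)

# "1" and "0" are counted as ints here, although they may well be booleans.
column = ["1", "0", "yes", "No", "", "3.5", "N/A", "42"]
counts = profile_column(column)
print(counts)
```

The counts show that the column is mixed, but not what the values mean: whether "1" is a flag or a quantity, or whether "N/A" is a missing value, is exactly the human-level question the slide says these checkers cannot answer.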
  13. New APIs – can you help?

    • Please tell me exactly what datetime I have in my dataset
    • What's wrong with my addresses?
    • What are the closest Wikipedia pages to my names/companies/places
    • Does the sex column match the names column?
    • Is this photo upside down?
    • We need more automation here
  14. Visualisation – still too hard

    • matplotlib
    • clunky
    • unsexy
    • Seaborn
    • ggplot2
    • mpld3
    • GIS is hard
    • R+ggplot == win for R
    • Bokeh
  15. Go fast if you need to

    • Efficient algorithms
    • Profilers/Compilers
    • Multi-core
    • Clusters
    • Julia perceived as 'fast solution'
    • R has better stats support so you 'work faster'
    • Need better 'go-fast' ideas
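As an illustration of the "Profilers" bullet, the sketch below uses the standard library's `cProfile` and `pstats` to find where time goes before any optimisation is attempted. The `sum_of_squares` workload and the `profile_call` helper are hypothetical stand-ins, not code from the talk.

```python
import cProfile
import io
import pstats

def sum_of_squares(n):
    """Deliberately naive loop standing in for real work."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def profile_call(func, *args):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show the five entries with the highest cumulative time.
    stats.sort_stats("cumulative").print_stats(5)
    return result, buffer.getvalue()

result, report = profile_call(sum_of_squares, 100_000)
print(report)
```

Measuring first keeps the later choices on the slide (better algorithms, compilers, multi-core, clusters) focused on the code that actually dominates the run time.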
  16. Statisticians vs Engineers

    • Maths folk or coders – team balance?
    • Shared language?
    • “you should be looking for the structure”
    • “watch for high skew and kurtosis”
  17. Statisticians vs Engineers

    • “watch for high skew and kurtosis”
    • How do we cross this barrier?
    • How does one “debug data”?
    http://en.wikipedia.org/wiki/Fat-tailed_distribution
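For engineers wondering what "watch for high skew and kurtosis" means in practice, here is a minimal sketch that computes both from the central moments of a sample. It uses the simple population (biased) estimators, and the two synthetic datasets and the function name are assumptions for illustration only.

```python
import random

def skew_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # variance
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0            # 0 for a normal distribution
    return skew, excess_kurtosis

rng = random.Random(0)                              # fixed seed for repeatability
symmetric = [rng.gauss(0, 1) for _ in range(10_000)]
heavy_tailed = [rng.lognormvariate(0, 1) for _ in range(10_000)]

sym_skew, sym_kurt = skew_and_excess_kurtosis(symmetric)
heavy_skew, heavy_kurt = skew_and_excess_kurtosis(heavy_tailed)
print("normal sample:   ", sym_skew, sym_kurt)
print("lognormal sample:", heavy_skew, heavy_kurt)
```

Values near zero suggest a roughly symmetric, normal-like column. Large positive values point to a long right tail, where a handful of extreme rows can dominate means and distort models that assume normality.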
  18. It's a heterogeneous world

    We need a "LAMP Stack" for data science with Python as a more integral part – from ingestion to visualisation
  19. Project Jupyter

    • What if we can share tooling with other languages?
    • Shared data frames?
    • Do you want 2 languages in your head?
    radar.oreilly.com/2014/01/ipython-a-unified-environment-for-interactive-data-analysis.html
  20. How to get started

    • Have a clear objective
    • Get lots of clean, tagged data
    • Visualise it
    • Make a classifier
    • Use open datasets for practice
    • Kaggle
    • Where to find more?
  21. What comes next?

    • Lots more (open) data
    • HealthKit (200 million phones?)
    • How do we automatically unmangle this data?
    • Takeaway – data cleanliness is fundamental
  22. Final thoughts

    • Design Patterns for Python Data Science?
    • Python can be the bedrock for “doing data science”