Slide 1

Slide 1 text

The Real Unsolved Problems in Data Science Ian Ozsvald @IanOzsvald ModelInsight.io

Slide 2

Slide 2 text

Ian.Ozsvald@ModelInsight.io @IanOzsvald PyConIreland October 2014
Who Am I?
● Solving “Data Science” for 15 years in industry
● Author
● Teacher at PyCons

Slide 3

Slide 3 text

Who is a Data Scientist?
http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/

Slide 4

Slide 4 text

How long have we “existed”?

Slide 5

Slide 5 text

Who else is a Data Scientist?
http://datascopeanalytics.com/what-we-think/2014/02/05/what-is-a-data-scientist

Slide 6

Slide 6 text

Who benefits from it?

Slide 7

Slide 7 text

Lyst's image deduplication

Slide 8

Slide 8 text

Shark Fin Fingerprinting
Attribution: Stefan Van Der Walt via EuroSciPy 2014
Evolutionary behavioural genetics and population structure of the Great White Shark Carcharodon Carcharias, Sara Andreotti

Slide 9

Slide 9 text

Clean Water in Tanzania
http://taarifa.org/ via Dirk Gorissen at PyDataLondon 4th meetup

Slide 10

Slide 10 text

Why 'now'? (hint: not Big Data!)
http://en.wikipedia.org/wiki/List_of_countries_by_number_of_Internet_users

Slide 11

Slide 11 text

When to avoid Data Science?
● You have small volumes of data
● Speed isn't important
● Reproducibility is a low priority
● → Use manual approaches (e.g. humans)
● So...what gets in the way of Data Science?

Slide 12

Slide 12 text

Request: Magic Quickly!
● Explain unrealistic requests
● R&D != Engineering
● Data quality
● Time frames
● Expertise required
● Need more success stories
Attribution: xkcd.com/1425/

Slide 13

Slide 13 text

Survey results

Slide 14

Slide 14 text

Survey results

Slide 15

Slide 15 text

Poor quality data
http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?_r=0

Slide 16

Slide 16 text

Poor quality data
● Few lines or few genuine examples
● Missing fields and illegal contents
● Undocumented schema
● ASCII vs UTF-8 vs CP-1252 → "" ”” “”
● Booleans (2 types or 3 or more?)
● 3/4/2012 and dateutil
● MM-DD-YY vs DD-MM-YY vs YY-MM-DD
● "J.P. Morgan" "jpmc - Project X" – are they similar?
● What are the common paths to solutions?
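
Two of the bullets above (mixed encodings and booleans with more than two states) can be illustrated with a minimal sketch using only the Python standard library. The helper names and the token sets are assumptions made for illustration; the slide names the problems, not a solution.

```python
# Hypothetical helpers: none of these names come from the talk.

def mojibake(text: str) -> str:
    """Simulate UTF-8 bytes being wrongly decoded as CP-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair_mojibake(text: str) -> str:
    """Undo the mistake above. Only valid if exactly that mistake was made."""
    return text.encode("cp1252").decode("utf-8")


# Assumed spellings of true/false; real datasets will need their own lists.
TRUE_TOKENS = {"true", "t", "yes", "y", "1"}
FALSE_TOKENS = {"false", "f", "no", "n", "0"}


def parse_bool(raw: str):
    """Return True, False, or None (the 'third' state: missing or unknown)."""
    token = raw.strip().lower()
    if token in TRUE_TOKENS:
        return True
    if token in FALSE_TOKENS:
        return False
    return None


damaged = mojibake("\u201cquoted")          # starts with a left curly quote
print(ascii(damaged))                       # three junk characters, then 'quoted'
print(ascii(repair_mojibake(damaged)))      # the curly quote is back
print([parse_bool(v) for v in ["Yes", "0", "N/A", ""]])
```

The repair only works when the damage is known to be UTF-8 read as CP-1252; applied blindly to clean text it can raise an error or corrupt it, which is part of why an undocumented schema is costly.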

Slide 17

Slide 17 text

The cost of poor quality data
● Current project – 9 months invested cleaning company names
● Chief Data Scientists cite as significant expense
● On-going 'below the surface' costs with adding dirty data, maintaining data integrity, keeping pipeline consistent
● Do a Data Audit to understand what you have
● We need more data cleaning tools and better integration to non-Python systems
● We can only do clever things if we have clean data
● Garbage in, garbage out...
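
To give a feel for why cleaning company names takes so long, here is a rough sketch that scores name pairs with the standard library's difflib. The normalisation rules and function names are assumptions for illustration, not the approach used on the project mentioned above.

```python
import re
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Lower-case, drop punctuation, collapse runs of whitespace."""
    name = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(name.split())


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] after normalisation."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


pairs = [
    ("J.P. Morgan", "JP Morgan"),          # differs only by punctuation
    ("J.P. Morgan", "jpmc - Project X"),   # same company to a human reader
]
for left, right in pairs:
    print(f"{left!r} vs {right!r}: {similarity(left, right):.2f}")
```

Punctuation variants collapse to the same string, but an abbreviation such as "jpmc" scores poorly on character similarity alone. Cases like that usually need a curated alias table or domain knowledge, which is where the months go.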

Slide 18

Slide 18 text

Camera-OCR (generally still bad)
● http://vbridge.co.uk/2012/11/05/how-we-tuned-tesseract-to-perform-as-well-as-a-commercial-ocr-package/

Slide 19

Slide 19 text

New APIs – can you help?
● Normalise company/place/people - names and addresses (new US-address-parser?)
● General “join on these columns” tool (Duke/Dedupe)
● Named Entity Recognition
● Recognise product photos
● Label reader from photos
● Domain-specific sentiment analysis
● Do you have APIs you could publish?

Slide 20

Slide 20 text

Data checkers – too low-level
● How many ints bools strs?
● setosa.io csv fingerprint
● What about human-level data?
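
As an illustration of what a low-level checker reports, the sketch below counts guessed cell types in one column. It is an assumed, simplified stand-in written for this example, not the setosa.io tool or any existing library.

```python
from collections import Counter


def guess_type(cell: str) -> str:
    """Classify a raw CSV cell as empty, bool, int, float or str."""
    token = cell.strip()
    if token == "":
        return "empty"
    if token.lower() in {"true", "false"}:
        return "bool"
    try:
        int(token)
        return "int"
    except ValueError:
        pass
    try:
        float(token)
        return "float"
    except ValueError:
        return "str"


def profile_column(cells):
    """Count how many cells fall into each guessed type."""
    return Counter(guess_type(c) for c in cells)


column = ["1", "2", "true", "", "3.5", "Dublin", "07"]
print(profile_column(column))
```

The value "07" is counted as an int, although it may well be a code where the leading zero matters. A type count cannot tell the difference, which is the gap between low-level checks and human-level data.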

Slide 21

Slide 21 text

New APIs – can you help?
● Please tell me exactly what datetime I have in my dataset
● What's wrong with my addresses?
● What are the closest Wikipedia pages to my names/companies/places?
● Does the sex column match the names column?
● Is this photo upside down?
● We need more automation here
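
The first request above could start from something like the following sketch, which reports the date formats that parse every value in a column. The candidate format list and the function names are assumptions for illustration; a real tool would need far more formats plus time zones.

```python
from datetime import datetime

# Assumed candidate formats; extend for the data at hand.
CANDIDATE_FORMATS = ["%m/%d/%Y", "%d/%m/%Y", "%Y-%m-%d"]


def matching_formats(value, formats=CANDIDATE_FORMATS):
    """Return every candidate format that parses this single value."""
    matches = []
    for fmt in formats:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            continue
        matches.append(fmt)
    return matches


def formats_for_column(values, formats=CANDIDATE_FORMATS):
    """Return the formats that parse every value in the column."""
    return [
        fmt for fmt in formats
        if all(fmt in matching_formats(v, formats) for v in values)
    ]


print(matching_formats("3/4/2012"))                   # two formats fit: ambiguous
print(formats_for_column(["3/4/2012", "25/4/2012"]))  # 25 cannot be a month
print(formats_for_column(["3/4/2012", "5/6/2012"]))   # still ambiguous
```

One unambiguous value (a day above 12) settles the whole column, assuming the column uses a single format throughout. When nothing settles it, the honest answer is "ambiguous" rather than a guess.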

Slide 22

Slide 22 text

Visualisation – still too hard
● matplotlib
● clunky
● unsexy
● SeaBorn
● ggplot2
● mplD3
● GIS is hard
● R+ggplot == win for R
● Bokeh

Slide 23

Slide 23 text

Visualisation – make it easier

Slide 24

Slide 24 text

Go fast if you need to
● Efficient algorithms
● Profilers/Compilers
● Multi-core
● Clusters
● Julia perceived as 'fast solution'
● R has better stats support so you 'work faster'
● Need better 'go-fast' ideas
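
The first two bullets above can be sketched together: profile first, then swap in a better algorithm. The duplicate-finding task and the function names are made up for illustration; cProfile and pstats are standard-library modules.

```python
import cProfile
import io
import pstats
from collections import Counter


def duplicates_quadratic(items):
    """O(n^2): list.count rescans the whole list for every element."""
    return sorted({x for x in items if items.count(x) > 1})


def duplicates_linear(items):
    """O(n): one pass to count, one pass to filter."""
    counts = Counter(items)
    return sorted(x for x, n in counts.items() if n > 1)


def profile(func, *args):
    """Run func under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)
    return result, buffer.getvalue()


data = [i % 1000 for i in range(2000)]      # every value appears twice
slow_result, slow_report = profile(duplicates_quadratic, data)
fast_result, fast_report = profile(duplicates_linear, data)
print(slow_result == fast_result)           # same answer, different cost
print(slow_report)
print(fast_report)
```

The profiler report for the quadratic version shows the time concentrated in repeated `count` calls, which points at the algorithm rather than at the language. Compilers, multi-core and clusters are worth reaching for once that kind of fix has been exhausted.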

Slide 25

Slide 25 text

Statisticians vs Engineers
● Maths folk or coders – team balance?
● Shared language?
● “you should be looking for the structure”
● “watch for high skew and kurtosis”

Slide 26

Slide 26 text

Statisticians vs Engineers
● “watch for high skew and kurtosis”
● How do we cross this barrier?
● How does one “debug data”?
http://en.wikipedia.org/wiki/Fat-tailed_distribution
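
For engineers unsure what “watch for high skew and kurtosis” means in practice, here is a small sketch that computes both from central moments in plain Python. It uses the population (biased) definitions; libraries such as SciPy offer bias-corrected variants, and the function name here is an assumption.

```python
def skew_and_excess_kurtosis(values):
    """Population skewness and excess kurtosis from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # variance
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    if m2 == 0:
        raise ValueError("all values are identical; moments are undefined")
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0            # 0 for a normal distribution
    return skew, excess_kurtosis


symmetric = [1, 2, 3, 4, 5]
one_outlier = [1, 1, 1, 1, 1, 1, 1, 1, 1, 100]

print(skew_and_excess_kurtosis(symmetric))
print(skew_and_excess_kurtosis(one_outlier))
```

The symmetric sample has zero skew, while the single outlier pushes both numbers well above zero. Large positive values like that are a hint of a fat tail, where means and standard deviations become unreliable summaries.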

Slide 27

Slide 27 text

It's a heterogeneous world
We need a "LAMP Stack" for data science with Python as a more integral part – from ingestion to visualisation

Slide 28

Slide 28 text

Collaboration still too hard

Slide 29

Slide 29 text

Custom data cleaning tools
We need pre-built tools

Slide 30

Slide 30 text

Project Jupyter
● What if we can share tooling with other languages?
● Shared data frames?
● Do you want 2 languages in your head?
radar.oreilly.com/2014/01/ipython-a-unified-environment-for-interactive-data-analysis.html

Slide 31

Slide 31 text

Linked Open Data 2009

Slide 32

Slide 32 text

Linked Open Data 2014

Slide 33

Slide 33 text

Strong data sources

Slide 34

Slide 34 text

APIs for data enrichment
Remember - companies get acquired!

Slide 35

Slide 35 text

Our 'common tool language'
Maintainers needed ->
All Python 3+ compatible!

Slide 36

Slide 36 text

How to get started
● Have a clear objective
● Get lots of clean, tagged data
● Visualise it
● Make a classifier
● Use open datasets for practice
● Kaggle
● Where to find more?

Slide 37

Slide 37 text

What comes next?
● Lots more (open) data
● HealthKit (200 million phones?)
● How do we automatically unmangle this data?
● Takeaway – data cleanliness is fundamental

Slide 38

Slide 38 text

Join a usergroup

Slide 39

Slide 39 text

Final thoughts
● Design Patterns for Python Data Science?
● Python can be the bedrock for “doing data science”

Slide 40

Slide 40 text

Addendum: Py 3 Adoption

Slide 41

Slide 41 text

Addendum: Py 3 Adoption

Slide 42

Slide 42 text

Addendum: Tools to learn