Slide 1

Slide 1 text

Ship It! Turning raw data into valuable services
PyDataLondon 2015 Conference
Ian Ozsvald @IanOzsvald ModelInsight.io

Slide 2

Slide 2 text

Ian.Ozsvald@ModelInsight.io @IanOzsvald PyDataLondon Conf June 2015
Who Am I?
● “Industrial Data Science” for 15 years
● Consultant
● O'Reilly Author
● Teacher at PyCons

Slide 3

Slide 3 text

Which projects succeed?
● Explain existing data (visualisation!)
● Automate repetitive/slow processes (higher accuracy, more repeatable)
● Augment data to make new data (e.g. for search engines and ML)
● Predict the future (e.g. replace human intuition or use subtler relationships)

Slide 4

Slide 4 text

Why is it valuable?

Slide 5

Slide 5 text

Visualising data
● Most data isn't interesting...
● Requires human curation + detective skills to get the good stuff
● Probably needs an engineer/researcher + business person

Slide 6

Slide 6 text

Extracting data from binary files
● Copy/pasting PDF/PNG data is laborious
● How can we scale it?
● textract - unified interface
● Apache Tika is (maybe) better
● Specialised tools e.g. Sovren
● Think in pipelines of transforms
● This might take months!
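
The "pipelines of transforms" idea can be sketched in plain Python. This is only an illustration, not code from the talk: the step names (`strip_whitespace`, `drop_empty_lines`, `normalise_dashes`), the `run_pipeline` helper and the sample text are all hypothetical, and the raw text is assumed to have been extracted already (for example by textract or Tika).

```python
from functools import reduce
from typing import Callable, Iterable

# A transform takes text and returns cleaned text (hypothetical convention).
Transform = Callable[[str], str]


def strip_whitespace(text: str) -> str:
    """Trim leading/trailing whitespace on every line."""
    return "\n".join(line.strip() for line in text.splitlines())


def drop_empty_lines(text: str) -> str:
    """Remove lines that are empty."""
    return "\n".join(line for line in text.splitlines() if line)


def normalise_dashes(text: str) -> str:
    """Replace en/em dashes (common in PDF output) with ASCII hyphens."""
    return text.replace("\u2013", "-").replace("\u2014", "-")


def run_pipeline(text: str, steps: Iterable[Transform]) -> str:
    """Apply each transform in order, feeding each output to the next step."""
    return reduce(lambda acc, step: step(acc), steps, text)


PIPELINE = [strip_whitespace, drop_empty_lines, normalise_dashes]

# Made-up sample standing in for text extracted from a PDF.
raw = "  Invoice 2015\u201306  \n\n   Total: 42  \n"
clean = run_pipeline(raw, PIPELINE)
print(clean)
```

Keeping each step a small pure function means a step can be tested, reordered or replaced on its own, which helps when the extraction work stretches over months.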

Slide 7

Slide 7 text

Optical Character Recognition

Slide 8

Slide 8 text

Optical Character Recognition

Slide 9

Slide 9 text

Augmenting data
● Identifying people, places, brands, sentiment
● “i love my apple phone”
● Context-sensitive (e.g. movies vs products)
● Accurately count mentions & sentiment

Slide 10

Slide 10 text

Medical data (anti-allergy)
ML prediction using food+alcohol, pollen, pollution, location, cats, ...

Slide 11

Slide 11 text

Machine Learning
● PyMC (Markov Chain Monte Carlo)
Please cite these projects! (it helps their funding)

Slide 12

Slide 12 text

Debugging Machine Learning?
● Thoughts from you?
● No obvious tools to show me:
  ● these examples were well-fitted
  ● these always wrongly-fitted
  ● these always uncertain
● No data-diagnostics to validate inputs (e.g. for Logistic Regression)
● No visualisers for most of the models
● Your hard-won knowledge -> new debug tools? (PLEASE!)
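
One possible shape for the missing "show me the well-fitted, wrongly-fitted and uncertain examples" tool is sketched below. It is an assumption-laden illustration, not an existing library: `bucket_examples`, the `margin` threshold and the sample numbers are hypothetical, the task is assumed to be binary classification, and the mean probability across repeated runs (e.g. cross-validation repeats) is used as a simplification of "always".

```python
from statistics import mean
from typing import Dict, List


def bucket_examples(
    true_labels: List[int],
    prob_runs: List[List[float]],
    margin: float = 0.2,
) -> Dict[str, List[int]]:
    """Group example indices by how the model treats them across runs.

    prob_runs[r][i] is the predicted probability of class 1 for example i
    in run r. An example whose mean probability lies within `margin` of 0.5
    is called uncertain; otherwise it is well or wrongly fitted depending on
    whether the mean prediction agrees with the true label.
    """
    buckets: Dict[str, List[int]] = {
        "well_fitted": [],
        "wrongly_fitted": [],
        "uncertain": [],
    }
    for i, label in enumerate(true_labels):
        avg = mean(run[i] for run in prob_runs)
        if abs(avg - 0.5) < margin:
            buckets["uncertain"].append(i)
        elif (avg > 0.5) == (label == 1):
            buckets["well_fitted"].append(i)
        else:
            buckets["wrongly_fitted"].append(i)
    return buckets


# Made-up labels and probabilities from two hypothetical runs.
labels = [1, 0, 1, 0]
runs = [
    [0.9, 0.1, 0.2, 0.55],
    [0.8, 0.2, 0.1, 0.45],
]
report = bucket_examples(labels, runs)
print(report)
```

The returned indices can then be used to pull the original rows for manual inspection, which is usually where the real debugging happens.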

Slide 13

Slide 13 text

Debugging Machine Learning?
“How “good” is your model, and how can you make it better?”
Chih-Chun Chen, Elena Chatzimichali at PyDataLondon 2015

Slide 14

Slide 14 text

Debugging Machine Learning?
Roelof Pieters, PyDataLondon 2015

Slide 15

Slide 15 text

Delivery: Keep It Simple (Stupid!)
● We're (probably) not publishing the best result
● Debuggability is key - 3am Sunday CTO beeper alert is no time for complexity
● “cult of the imperfect” (Watson-Watt)
● Dumb models + clean data beat other combinations

Slide 16

Slide 16 text

Don't Kill It!
● Your data is missing, it is poor and it lies
● Missing data kills projects!
● Log everything!
● Make data quality tools & reports
● Note! More data -> desynchronisation
● DESYNCH IS BAAAAAAD!
● R&D != Engineering
  ● Discovery-based
  ● Iterative
  ● Success and failure equally useful
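
A data quality report can start very small. The sketch below counts missing values per column using only the standard library; the function name `missing_value_report`, the list of tokens treated as missing and the sample CSV are assumptions made for illustration.

```python
import csv
import io
from collections import Counter


def missing_value_report(csv_text, missing_tokens=("", "NA", "null")):
    """Return the fraction of missing values per column of a CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = Counter()
    total_rows = 0
    for row in reader:
        total_rows += 1
        for column in reader.fieldnames:
            value = row.get(column)
            # Short rows yield None; blank or placeholder cells also count.
            if value is None or value.strip() in missing_tokens:
                missing[column] += 1
    if total_rows == 0:
        return {}
    return {column: missing[column] / total_rows for column in reader.fieldnames}


# Made-up sample data.
sample = (
    "user_id,pollen,location\n"
    "1,high,London\n"
    "2,,London\n"
    "3,NA,\n"
    "4,low,Leeds\n"
)
report = missing_value_report(sample)
for column, fraction in report.items():
    print(f"{column}: {fraction:.0%} missing")
```

Running a report like this on every new data drop, and logging the result, makes it easier to spot when a source quietly stops supplying a field.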

Slide 17

Slide 17 text

Internal deployment
● Scripts to drive report
● CSVs/Reports
● Database updates
● IPython Notebook (not secure though!)
● Bokeh

Slide 18

Slide 18 text

Deploying live systems
● Spyre (locked-down)
● Microservices
● Flask is my go-to tool
● Swagger docs
● (git pull / fabric / provisioned machines)
● Docker + Amazon ECS

Slide 19

Slide 19 text

flask-restful-swagger

Slide 20

Slide 20 text

Reproducibility

Slide 21

Slide 21 text

Reproducibility (Notebooks)

Slide 22

Slide 22 text

Python Deployment
● Make Python modules (setup.py)
● python setup.py develop # symlink
● Unit tests + coverage
● Use a config system (e.g. my github.com/ianozsvald/python_template_with_config)
● Use git with a deployment branch
● Post-commit git hooks for unit testing
● Keep Separation of Concerns!
● “12 Factor App” useful ideas
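
As one way to picture a "config system" in the 12 Factor style, the sketch below reads settings from environment variables with development defaults. It is not the template repository mentioned on the slide; the `Config` fields, the `APP_*` variable names and the example URLs are all hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Immutable settings object passed to the rest of the application."""

    database_url: str
    debug: bool
    log_level: str


def load_config(environ=None) -> Config:
    """Build a Config from environment variables, with development defaults.

    Passing a plain dict as `environ` keeps unit tests independent of the
    real process environment.
    """
    env = os.environ if environ is None else environ
    return Config(
        database_url=env.get("APP_DATABASE_URL", "sqlite:///dev.db"),
        debug=env.get("APP_DEBUG", "0") == "1",
        log_level=env.get("APP_LOG_LEVEL", "INFO"),
    )


# Hypothetical development and production environments.
dev = load_config({})
prod = load_config(
    {
        "APP_DATABASE_URL": "postgresql://db.example/prod",
        "APP_LOG_LEVEL": "WARNING",
    }
)
print(dev)
print(prod)
```

Because the same code runs in every environment and only the variables differ, a deployment branch never needs environment-specific edits.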

Slide 23

Slide 23 text

Some common gotchas
● MySQL's utf8 is 3 bytes per character by default (utf8mb4 is the 4-byte variant) #sigh
● JavaScript months are 0-based (not 1)
● Never compromise on datetimes (ISO 8601)
● iOS NSDate's epoch is 2001
● Windows CP1252 text (strongly prefer UTF8)
● MongoDB no_cursor_timeout=True
● GitHub's 100MB file limit (new Large File Storage)
● Never throw data away! Never overwrite original data! Always transform it (e.g. Luigi)
● Data duplication bites you in the end...
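
Several of these gotchas can be shown with the standard library alone. The helper names below (`nsdate_to_iso8601`, `js_month_to_python`, `cp1252_to_utf8`) are hypothetical and the inputs are made up; treat this as a sketch of the conversions rather than production code.

```python
from datetime import datetime, timedelta, timezone

# NSDate counts seconds from 2001-01-01T00:00:00Z, not the Unix epoch (1970).
NSDATE_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)


def nsdate_to_iso8601(seconds_since_2001: float) -> str:
    """Convert an NSDate reference-date offset to an ISO 8601 UTC string."""
    return (NSDATE_EPOCH + timedelta(seconds=seconds_since_2001)).isoformat()


def js_month_to_python(js_month: int) -> int:
    """JavaScript months run 0-11; Python's datetime expects 1-12."""
    return js_month + 1


def cp1252_to_utf8(raw: bytes) -> bytes:
    """Re-encode Windows CP1252 bytes as UTF-8."""
    return raw.decode("cp1252").encode("utf-8")


iso = nsdate_to_iso8601(86400)  # one day after the NSDate epoch
january = js_month_to_python(0)
quoted = cp1252_to_utf8(b"\x93hi\x94")  # CP1252 curly quotes around "hi"

# Characters outside the Basic Multilingual Plane need 4 bytes in UTF-8,
# which is why a 3-byte-per-character column cannot store them.
emoji_bytes = len("\U0001F600".encode("utf-8"))

print(iso, january, quoted, emoji_bytes)
```

Storing timestamps as timezone-aware ISO 8601 strings at the boundary of each system avoids most of the epoch and month-offset confusion further down the pipeline.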

Slide 24

Slide 24 text

(Perhaps) Avoid Big Data
● Don't be in a rush - 50,000 lines of good data will beat a pile of Bad Big Data
● 244GB RAM EC2 + many Xeons $2.80/hr

Slide 25

Slide 25 text

Closing
● Tell me your dirty data stories (I want to automate some of this)
● Keep It Simple
● Come talk on your projects at our PyDataLondon monthly meetup...
● It isn't what you know but who you know
● (I do coaching on this stuff in ModelInsight.io)