Slide 1

Slide 1 text

Shipping Data Science Products!
Turning raw data into valuable services
BudapestBI Forum 2015
License: CC By Attribution
Ian Ozsvald
@IanOzsvald
ModelInsight.io

Slide 2

Slide 2 text

Ian.Ozsvald@ModelInsight.io @IanOzsvald BudapestBI Forum October 2015
Who Am I?
● “Industrial Data Science” for 15 years
● Data Product Builder
● O'Reilly Author
● Teacher at PyCons

Slide 3

Slide 3 text

Who are you?
● Type A(nalysis) or B(uilding)
● Robert Chang - “Doing Data Science at Twitter”

Slide 4

Slide 4 text

What frustrations do we share?
● Lack of useful data
● Biggest time sink - cleaning & transforming
● Conservative management
● How can we derisk projects?
● Medium Data
● Luckily we have Wes in the room

Slide 5

Slide 5 text

Which projects succeed?
● Explain existing data (visualisation!)
● Automate repetitive/slow processes (higher accuracy, more repeatable)
● Augment data to make new data (e.g. for search engines and ML)
● Predict the future (e.g. replace human intuition or use subtler relationships)

Slide 6

Slide 6 text

Why is it valuable?

Slide 7

Slide 7 text

Visualising data
● Most data isn't interesting...
● Requires human curation + detective skills to get the good stuff
● Couple a researcher + a business person

Slide 8

Slide 8 text

Medical data (anti-allergy)
Perceived complexity might make sign-off more difficult...

Slide 9

Slide 9 text

Medical data (anti-allergy)
Predict using:
● food
● alcohol
● pollen
● pollution
● location
● cats
● ...

Slide 10

Slide 10 text

Extracting data from binary files
● Copy/pasting PDF/PNG data is laborious
● How can we scale it?
● textract/Tika - unified interface
● Specialised tools e.g. Sovren
● This might take months!

Slide 11

Slide 11 text

Augmenting data
● Identifying people, places, brands, sentiment
● “i love my apple phone”
● Context-sensitive (e.g. movies vs products)
● Build custom machine-learned tools
● Augment job titles
● Reconcile the same order in 2 tables

Slide 12

Slide 12 text

Machine Learning
● PyMC (Markov Chain Monte Carlo)
Please cite these projects! (it helps their funding)

Slide 13

Slide 13 text

Debugging Machine Learning?
● Thoughts from you?
● No obvious tools to show me:
  ● these examples were well-fitted
  ● these always wrongly-fitted
  ● these always uncertain
● No data-diagnostics to validate inputs (e.g. for Logistic Regression)
● No visualisers for most of the models
● Your hard-won knowledge -> new debug tools? (PLEASE!)
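
The three groups named on this slide (well-fitted, wrongly-fitted, uncertain) can be approximated by hand when no tool provides them. The following is a minimal sketch, assuming a binary classifier that outputs a probability for the positive class. The function name `bucket_predictions` and the 0.3/0.7 thresholds are illustrative assumptions, not something from the talk.

```python
def bucket_predictions(y_true, y_prob, low=0.3, high=0.7):
    """Split example indices into well-fitted, wrongly-fitted and uncertain.

    y_true: iterable of 0/1 labels.
    y_prob: predicted probability that the label is 1.
    low/high: hypothetical thresholds, tune them for your problem.
    """
    buckets = {"well_fitted": [], "wrongly_fitted": [], "uncertain": []}
    for idx, (label, prob) in enumerate(zip(y_true, y_prob)):
        if low < prob < high:
            # The model does not commit either way.
            buckets["uncertain"].append(idx)
        elif (prob >= high) == bool(label):
            # Confident prediction that agrees with the label.
            buckets["well_fitted"].append(idx)
        else:
            # Confident prediction that disagrees with the label.
            buckets["wrongly_fitted"].append(idx)
    return buckets


example = bucket_predictions([1, 0, 1, 0, 1], [0.9, 0.1, 0.2, 0.8, 0.5])
print(example)
```

Running this over several cross-validation folds and keeping only the indices that land in the same bucket every time would get closer to the "always wrongly-fitted" and "always uncertain" sets the slide asks for.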

Slide 14

Slide 14 text

Debugging Machine Learning?
Roelof Pieters, PyDataLondon 2015

Slide 15

Slide 15 text

Delivery: Keep It Simple (Stupid!)
● We're (probably) not publishing the best result
● Debuggability is key - 3am Sunday CTO beeper alert is no time for complexity
● “Cult of the imperfect” (Watson-Watt)
● Dumb models + clean data beat other combinations

Slide 16

Slide 16 text

Don't Kill It!
● Your data is missing, it is poor and it lies
● Missing data kills projects!
● Log everything!
● Make data quality tools & reports
● More data -> desynchronisation
● R&D != Engineering
● Discovery-based
● Success and failure equally useful
engarde
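
A data quality report can start very small, for example the fraction of missing values per column. The sketch below uses only the Python standard library. The set of strings treated as "missing" and the sample columns are assumptions for illustration, not taken from the talk.

```python
import csv
import io

# Assumed markers for a missing value; adjust to match your data sources.
MISSING_TOKENS = {"", "na", "n/a", "null", "none", "nan"}


def missing_report(csv_file):
    """Return {column: fraction of rows where the value looks missing}."""
    reader = csv.DictReader(csv_file)
    counts = {name: 0 for name in reader.fieldnames or []}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name in counts:
            value = row.get(name)  # None when the row is short
            if value is None or value.strip().lower() in MISSING_TOKENS:
                counts[name] += 1
    return {name: (count / n_rows if n_rows else 0.0)
            for name, count in counts.items()}


# Hypothetical sample data standing in for a real export.
sample = io.StringIO("order_id,price,city\n1,9.99,Budapest\n2,,London\n3,NA,\n")
report = missing_report(sample)
for column, fraction in report.items():
    print(f"{column}: {fraction:.0%} missing")
```

Tracking these fractions per load, and logging them, gives an early warning when an upstream source silently changes.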

Slide 17

Slide 17 text

Internal deployment
● CSVs/Reports
● Database updates
● IPython Notebook (not secure though!)

Slide 18

Slide 18 text

Deploying live systems
● Spyre (locked-down)
● Microservices
● Flask is my go-to tool
● Swagger docs
● (git pull / fabric / provisioned machines)
● Docker + Amazon ECS

Slide 19

Slide 19 text

Python Deployment
● Make Python modules (setup.py)
● python setup.py develop # symlink
● Unit tests + coverage
● Use a config system (e.g. github.com/ianozsvald/python_template_with_config)
● Keep Separation of Concerns!
● “12 Factor App” useful ideas
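
As an illustration of the "config system" and "12 Factor App" points, here is a minimal sketch of configuration selected by environment variables. It is not the contents of the linked template repository. The names `Config`, `load_config`, `APP_ENV` and `APP_DB_URL`, and the default values, are all hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    env: str
    db_url: str
    debug: bool


# Hypothetical per-environment defaults for illustration only.
_DEFAULTS = {
    "dev": {"db_url": "sqlite:///dev.db", "debug": True},
    "prod": {"db_url": "sqlite:///prod.db", "debug": False},
}


def load_config(environ=os.environ):
    """Build a Config from environment variables (variable names are assumptions)."""
    env = environ.get("APP_ENV", "dev")
    if env not in _DEFAULTS:
        raise ValueError(f"Unknown APP_ENV: {env!r}")
    defaults = _DEFAULTS[env]
    return Config(
        env=env,
        # An explicit APP_DB_URL overrides the per-environment default.
        db_url=environ.get("APP_DB_URL", defaults["db_url"]),
        debug=defaults["debug"],
    )


config = load_config({"APP_ENV": "prod", "APP_DB_URL": "postgresql://db/prod"})
print(config)
```

Passing the mapping in as an argument keeps the loader easy to unit test, which fits the "unit tests + coverage" bullet on the same slide.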

Slide 20

Slide 20 text

Some common gotchas
● MySQL's utf8 is 3 bytes per character by default #sigh
● JavaScript months are 0-based (not 1)
● Never compromise on datetimes (ISO 8601)
● iOS NSDate's epoch is 2001
● Windows CP1252 text (strongly prefer UTF8)
● MongoDB no_cursor_timeout=True
● GitHub's 100MB file limits (new Large File Storage)
● Never throw data away! Never overwrite original data! Always transform it (e.g. Luigi)
● Data duplication bites you in the end...
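
Several of these gotchas can be checked with a few lines of standard-library Python. The helpers below are a sketch with hypothetical names, not code from the talk. They cover the 3-byte limit of MySQL's legacy utf8 charset (the usual fix is utf8mb4), the NSDate reference date, ISO 8601 output and CP1252 input.

```python
from datetime import datetime, timedelta, timezone


def needs_utf8mb4(text):
    """True if any character needs 4 bytes in UTF-8 (e.g. emoji),
    which MySQL's legacy 3-byte "utf8" charset cannot store."""
    return any(len(ch.encode("utf-8")) > 3 for ch in text)


# NSDate counts seconds from 2001-01-01 00:00:00 UTC, not from 1970.
NSDATE_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)


def nsdate_to_datetime(seconds):
    """Convert seconds since the NSDate reference date to an aware datetime."""
    return NSDATE_EPOCH + timedelta(seconds=seconds)


def to_iso8601(dt):
    """Serialise an aware datetime as an ISO 8601 string in UTC."""
    if dt.tzinfo is None:
        raise ValueError("refusing to serialise a naive datetime")
    return dt.astimezone(timezone.utc).isoformat()


def cp1252_to_utf8(raw_bytes):
    """Decode Windows CP1252 bytes explicitly and re-encode as UTF-8."""
    return raw_bytes.decode("cp1252").encode("utf-8")


print(needs_utf8mb4("caf\u00e9"), needs_utf8mb4("\U0001F600"))
print(to_iso8601(nsdate_to_datetime(0)))
```

Rejecting naive datetimes at the serialisation boundary is one way to honour "never compromise on datetimes": the ambiguity is caught once, instead of surfacing later as an off-by-some-hours bug.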

Slide 21

Slide 21 text

(Perhaps) Avoid Big Data
● Don't be in a rush - 50,000 lines of good data will beat a pile of Bad Big Data
● 244GB RAM EC2 + many Xeons: $2.80/hr

Slide 22

Slide 22 text

“Data Science Delivered”
● New mini project / pamphlet
● Includes dirty data strategies, ways to debug ML, thoughts on managing projects - 15 years' experience (please critique and file bugs!)
● https://github.com/ianozsvald/data_science_delivered
● Please give me your feedback

Slide 23

Slide 23 text

Closing
● Tell me your dirty data stories, perhaps in a Ruin Pub? (I am automating some of this)
● Take-home - Keep it clean, keep it simple
● Come talk about your projects at our PyDataLondon monthly meetup or start your own!