Slide 1

Slide 1 text

Ship Data Science Products
Turning raw data into valuable services
PyConUK 2015 Conference
Ian Ozsvald @IanOzsvald ModelInsight.io

Slide 2

Slide 2 text

Ian.Ozsvald@ModelInsight.io @IanOzsvald PyConUK Conf Sept 2015
Who Am I?
● “Industrial Data Science” for 15 years
● Consultant
● O'Reilly Author
● Teacher at PyCons

Slide 3

Slide 3 text

Which projects succeed?
● Explain existing data (visualisation!)
● Automate repetitive/slow processes (higher accuracy, more repeatable)
● Augment data to make new data (e.g. for search engines and ML)
● Predict the future (e.g. replace human intuition or use subtler relationships)

Slide 4

Slide 4 text

Why is it valuable?

Slide 5

Slide 5 text

Visualising data
● Most data isn't interesting...
● Requires human curation + detective skills to get the good stuff
● Probably needs an engineer/researcher + business person

Slide 6

Slide 6 text

Extracting data from binary files
● Copy/pasting PDF/PNG data is laborious
● How can we scale it?
● textract - unified interface
● Apache's Tika (maybe) better
● Specialised tools e.g. Sovren
● This might take months!

Slide 7

Slide 7 text

Augmenting data
● Identifying people, places, brands, sentiment
● “i love my apple phone”
● Context-sensitive (e.g. movies vs products)
● Accurately count mentions & sentiment

Slide 8

Slide 8 text

Medical data (anti-allergy)
ML prediction using food+alcohol, pollen, pollution, location, cats, ...

Slide 9

Slide 9 text

Machine Learning
● PyMC (Markov Chain Monte Carlo)
Please cite these projects! (it helps their funding)

Slide 10

Slide 10 text

Delivery: Keep It Simple (Stupid!)
● We're (probably) not publishing the best result
● Debuggability is key - 3am Sunday CTO beeper alert is no time for complexity
● “cult of the imperfect” Watson-Watt
● Dumb models + clean data beat other combinations

Slide 11

Slide 11 text

Don't Kill It!
● Your data is missing, it is poor and it lies
● Missing data kills projects!
● Log everything!
● Make data quality tools & reports
● Note! More data->desynchronisation
● DESYNCH IS BAAAAAAD!
● R&D != Engineering
● Discovery-based
● Iterative
● Success and failure equally useful
engarde
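The "make data quality tools & reports" point can be illustrated with a tiny per-column missing-value report. This is only a sketch using the standard library; the column names, the sample rows and the set of strings treated as "missing" are all assumptions made for illustration, not anything shown in the talk.

```python
import csv
import io
from collections import Counter

# Strings treated as "missing"; this set is an assumption for illustration.
MISSING = {"", "na", "n/a", "null", "none"}


def missing_report(rows, fieldnames):
    """Return {column: fraction of rows where the value is missing}."""
    missing = Counter()
    total = 0
    for row in rows:
        total += 1
        for name in fieldnames:
            value = row.get(name)
            # DictReader yields None for fields absent from a short row.
            if value is None or value.strip().lower() in MISSING:
                missing[name] += 1
    if total == 0:
        return {name: 0.0 for name in fieldnames}
    return {name: missing[name] / total for name in fieldnames}


# Hypothetical sample data, not taken from the talk.
SAMPLE = "user,pollen,location\na,3,London\nb,,London\nc,N/A,\nd,5,Leeds\n"

reader = csv.DictReader(io.StringIO(SAMPLE))
report = missing_report(reader, reader.fieldnames)
for column, fraction in report.items():
    print(f"{column}: {fraction:.0%} missing")
```

Running a report like this on every new data drop, and logging the result, is one cheap way to notice missing data before it reaches a model.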

Slide 12

Slide 12 text

Internal deployment
● CSVs/Reports
● Database updates
● IPython Notebook (not secure though!)
● Bokeh

Slide 13

Slide 13 text

Deploying live systems
● Spyre (locked-down)
● Microservices
● Docker + Amazon ECS
● Flask+Swagger

Slide 14

Slide 14 text

Python Deployment
● Make Python modules (setup.py)
● python setup.py develop # symlink
● Unit tests + coverage
● Use a config system (e.g. my github.com/ianozsvald/python_template_with_config)
● Use git with a deployment branch
● Post-commit git hooks for unit testing
● Keep Separation of Concerns!
● “12 Factor App” useful ideas
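One possible shape for "use a config system", borrowing the "12 Factor App" idea of reading settings from the environment. This is a minimal sketch and not the contents of the template repository mentioned above; the variable names APP_ENV and DATABASE_URL, the DEFAULTS table and the example URLs are all hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    name: str
    database_url: str
    debug: bool


# Hypothetical per-environment defaults, for illustration only.
DEFAULTS = {
    "dev": {"database_url": "sqlite:///dev.db", "debug": True},
    "prod": {"database_url": "sqlite:///prod.db", "debug": False},
}


def load_config(environ=None):
    """Build a Config from environment variables (12-factor style)."""
    environ = os.environ if environ is None else environ
    name = environ.get("APP_ENV", "dev")  # APP_ENV is an assumed variable name
    if name not in DEFAULTS:
        raise ValueError(f"Unknown environment: {name!r}")
    defaults = DEFAULTS[name]
    return Config(
        name=name,
        # An explicit DATABASE_URL overrides the per-environment default.
        database_url=environ.get("DATABASE_URL", defaults["database_url"]),
        debug=defaults["debug"],
    )


# Passing a dict instead of os.environ keeps the function easy to unit test.
config = load_config({"APP_ENV": "prod", "DATABASE_URL": "postgresql://db.example/app"})
print(config)
```

Keeping all environment lookups in one function means the rest of the code only ever sees a Config object, which helps with the Separation of Concerns bullet.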

Slide 15

Slide 15 text

Some common gotchas
● MySQL UTF8 is 3 byte by default #sigh
● JavaScript months are 0-based (not 1)
● Never compromise on datetimes (ISO 8601)
● iOS NSDate's epoch is 2001
● Windows CP1252 text (strongly prefer UTF8)
● MongoDB no_timeout_cursor=True
● Github's 100MB file limits (new Large File Support)
● Never throw data away! Never overwrite original data! Always transform it (e.g. Luigi)
● Data duplication bites you in the end...
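Three of the gotchas above (NSDate's 2001 epoch, 0-based JavaScript months, CP1252 text) can be made concrete with a few standard-library helpers. This is a sketch; the function names are made up for illustration and the inputs are assumed to be plain seconds, integer month indexes and raw bytes respectively.

```python
from datetime import datetime, timedelta, timezone

# NSDate counts seconds from 2001-01-01 00:00:00 UTC, not from 1970.
NSDATE_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)


def nsdate_to_iso8601(seconds_since_2001):
    """Convert an NSDate-style offset to a timezone-aware ISO 8601 string."""
    moment = NSDATE_EPOCH + timedelta(seconds=seconds_since_2001)
    return moment.isoformat()


def js_month_to_calendar(js_month):
    """JavaScript Date months run 0..11; calendar months run 1..12."""
    if not 0 <= js_month <= 11:
        raise ValueError(f"JavaScript month out of range: {js_month}")
    return js_month + 1


def cp1252_to_utf8(raw):
    """Re-encode Windows CP1252 bytes as UTF-8 bytes."""
    return raw.decode("cp1252").encode("utf-8")


print(nsdate_to_iso8601(0))        # 2001-01-01T00:00:00+00:00
print(js_month_to_calendar(0))     # 1 (January)
print(cp1252_to_utf8(b"\x93hi\x94").decode("utf-8"))  # curly-quoted "hi"
```

Normalising to ISO 8601 strings in UTC and to UTF-8 text at the point of ingestion keeps these quirks out of everything downstream.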

Slide 16

Slide 16 text

(Perhaps) Avoid Big Data
● Don't be in a rush - 50,000 lines of good data will beat a pile of Bad Big Data
● 244GB RAM EC2+many Xeons $2.80/hr

Slide 17

Slide 17 text

Closing
● Tell me your dirty data stories (I'm starting to automate some of this)
● Keep It Simple
● Come talk on your projects at our PyDataLondon monthly meetup...
● It isn't what you know but who you know
● (I do coaching and consulting on this stuff in ModelInsight.io)