Ship It!
Turning raw data into valuable services
PyDataLondon 2015 Conference
Ian Ozsvald @IanOzsvald ModelInsight.io
Slide 2
Ian.Ozsvald@ModelInsight.io @IanOzsvald
PyDataLondon Conf June 2015
Who Am I?
● “Industrial Data Science” for 15 years
● Consultant
● O'Reilly Author
● Teacher at PyCons
Slide 3
Which projects succeed?
● Explain existing data (visualisation!)
● Automate repetitive/slow processes (higher accuracy, more repeatable)
● Augment data to make new data (e.g. for search engines and ML)
● Predict the future (e.g. replace human intuition or use subtler relationships)
Slide 4
Why is it valuable?
Slide 5
Visualising data
● Most data isn't interesting...
● Requires human curation + detective skills to get the good stuff
● Probably needs an engineer/researcher + business person
Slide 6
Extracting data from binary files
● Copy/pasting PDF/PNG data is laborious
● How can we scale it?
● textract - unified interface
● Apache's Tika (maybe) better
● Specialised tools e.g. Sovren
● Think in pipelines of transforms
● This might take months!
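The "pipelines of transforms" idea can be sketched as a chain of small functions, each taking text and returning text. This is only a minimal sketch: the function names and the sample string are hypothetical, and the raw string stands in for whatever an extractor such as textract or Tika would return from a PDF or image.

```python
import re
from functools import reduce


def strip_page_numbers(text):
    """Drop lines that contain only a page marker such as 'Page 3'."""
    lines = [ln for ln in text.split("\n")
             if not re.fullmatch(r"\s*Page \d+\s*", ln)]
    return "\n".join(lines)


def rejoin_hyphenated(text):
    """Rejoin words that were split across a line break with a hyphen."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def normalise_whitespace(text):
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def run_pipeline(text, steps):
    """Apply each transform in order, feeding each output into the next."""
    return reduce(lambda acc, step: step(acc), steps, text)


# Order matters: line-based steps must run before whitespace is collapsed.
PIPELINE = [strip_page_numbers, rejoin_hyphenated, normalise_whitespace]

raw = "Copy/pasting PDF data is labor-\nious\nPage 3\n  How can we   scale it?"
clean = run_pipeline(raw, PIPELINE)
print(clean)
```

Keeping each step tiny makes it possible to test and reorder steps independently as new document quirks turn up.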
Slide 7
Optical Character Recognition
Slide 8
Optical Character Recognition
Slide 9
Augmenting data
● Identifying people, places, brands, sentiment
● “i love my apple phone”
● Context-sensitive (e.g. movies vs products)
● Accurately count mentions & sentiment
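A toy illustration of context-sensitive mention counting, using the "apple" example from the slide. Everything here is an assumption made for illustration: the word lists are hand-made and hypothetical, and a production system would use trained entity-recognition and sentiment models rather than keyword lookups.

```python
from collections import Counter

# Hypothetical, hand-made word lists; a real system would learn these from data.
POSITIVE = {"love", "great", "brilliant"}
NEGATIVE = {"hate", "awful", "broken"}
BRAND_CONTEXT = {"phone", "iphone", "macbook", "laptop", "store"}
FRUIT_CONTEXT = {"pie", "juice", "tree", "eat", "ate"}


def classify_apple(tokens):
    """Guess whether 'apple' means the brand or the fruit from nearby words."""
    words = set(tokens)
    if words & BRAND_CONTEXT:
        return "brand"
    if words & FRUIT_CONTEXT:
        return "fruit"
    return "unknown"


def sentiment(tokens):
    """Score a message as positive word count minus negative word count."""
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)


def summarise(messages):
    """Count mentions of 'apple' per sense and total the sentiment per sense."""
    mentions, scores = Counter(), Counter()
    for message in messages:
        tokens = message.lower().split()
        if "apple" not in tokens:
            continue
        sense = classify_apple(tokens)
        mentions[sense] += 1
        scores[sense] += sentiment(tokens)
    return mentions, scores


messages = [
    "i love my apple phone",
    "i ate an apple pie",
    "my apple laptop is broken",
]
mentions, scores = summarise(messages)
print(dict(mentions), dict(scores))
```

Separating the sense ("brand" vs "fruit") from the sentiment score is what lets the counts stay accurate when the same word means different things in different contexts.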
Slide 10
Medical data (anti-allergy)
ML prediction using food+alcohol, pollen, pollution, location, cats, ...
Slide 11
Machine Learning
● PyMC (Markov Chain Monte Carlo)
Please cite these projects! (it helps their funding)
Slide 12
Debugging Machine Learning?
● Thoughts from you?
● No obvious tools to show me:
  ● these examples were well-fitted
  ● these were always wrongly fitted
  ● these were always uncertain
● No data-diagnostics to validate inputs (e.g. for Logistic Regression)
● No visualisers for most of the models
● Your hard-won knowledge -> new debug tools? (PLEASE!)
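One possible shape for the missing diagnostic is sketched below: given the predicted probabilities of a binary classifier over several runs (for example repeated cross-validation folds or bootstrap resamples), bucket each example as well-fitted, always wrong, or always uncertain. The function name, the bucket names and the 0.2 margin are all assumptions made for illustration, not an existing tool.

```python
def triage(y_true, prob_runs, margin=0.2):
    """Bucket each example by how a binary classifier treated it across runs.

    y_true    -- list of 0/1 labels
    prob_runs -- list of runs; each run is a list of P(class == 1) per example
    margin    -- probabilities within `margin` of 0.5 count as uncertain
    """
    buckets = {"well_fitted": [], "always_wrong": [],
               "always_uncertain": [], "mixed": []}
    for i, label in enumerate(y_true):
        probs = [run[i] for run in prob_runs]
        uncertain = [abs(p - 0.5) < margin for p in probs]
        correct = [(p >= 0.5) == bool(label) for p in probs]
        if all(uncertain):
            buckets["always_uncertain"].append(i)
        elif all(correct) and not any(uncertain):
            buckets["well_fitted"].append(i)
        elif not any(correct):
            buckets["always_wrong"].append(i)
        else:
            buckets["mixed"].append(i)
    return buckets


# Four examples scored in three hypothetical runs.
y_true = [1, 0, 1, 0]
prob_runs = [
    [0.90, 0.80, 0.55, 0.10],
    [0.95, 0.90, 0.45, 0.60],
    [0.85, 0.75, 0.50, 0.20],
]
report = triage(y_true, prob_runs)
print(report)
```

The "always_wrong" bucket is usually the first place to look for mislabelled or corrupted input rows.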
Slide 13
Debugging Machine Learning?
See the talk “How ‘good’ is your model, and how can you make it better?” by Chih-Chun Chen and Elena Chatzimichali at PyDataLondon 2015
Delivery: Keep It Simple (Stupid!)
● We're (probably) not publishing the best result
● Debuggability is key - a 3am Sunday CTO beeper alert is no time for complexity
● “Cult of the imperfect” (Watson-Watt)
● Dumb models + clean data beat other combinations
Slide 16
Don't Kill It!
● Your data is missing, it is poor and it lies
● Missing data kills projects!
● Log everything!
● Make data quality tools & reports
● Note! More data -> desynchronisation
● DESYNCH IS BAAAAAAD!
● R&D != Engineering
  ● Discovery-based
  ● Iterative
  ● Success and failure equally useful
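A data quality report can start very small. The sketch below counts missing values per field across a list of records; the field names, the sample rows and the set of values treated as "missing" are all hypothetical and would need adjusting for a real dataset.

```python
from collections import Counter

# Values treated as "missing"; this set is an assumption, adjust per dataset.
MISSING = {None, "", "NA", "null"}


def quality_report(rows, expected_fields):
    """Count missing values per field and rows lacking any expected field."""
    missing = Counter({field: 0 for field in expected_fields})
    incomplete_rows = 0
    for row in rows:
        # A field counts as missing if the key is absent or the value is empty.
        bad = [f for f in expected_fields if row.get(f) in MISSING]
        missing.update(bad)
        incomplete_rows += bool(bad)
    return {
        "rows": len(rows),
        "incomplete_rows": incomplete_rows,
        "missing_by_field": dict(missing),
    }


rows = [
    {"user": "a", "pollen": "12", "timestamp": "2015-06-19T10:00:00Z"},
    {"user": "b", "pollen": "", "timestamp": "2015-06-19T10:05:00Z"},
    {"user": "c", "timestamp": None},
]
report = quality_report(rows, ["user", "pollen", "timestamp"])
print(report)
```

Running a report like this on every new data delivery, and logging the result, makes gaps visible before they reach a model.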
Deploying live systems
● Spyre (locked-down)
● Microservices
● Flask is my go-to tool
● Swagger docs
● (git pull / fabric / provisioned machines)
● Docker + Amazon ECS
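To show the shape of a small prediction microservice without any third-party dependency, the sketch below uses a plain WSGI callable from the standard library's conventions rather than Flask, which is the tool the slide actually recommends. The endpoint path, the `predict` stand-in model and the `call` helper are hypothetical.

```python
import json
from urllib.parse import parse_qs


def predict(text):
    """Stand-in model (hypothetical): flags text that mentions 'apple'."""
    return {"text": text, "mentions_apple": "apple" in text.lower().split()}


def app(environ, start_response):
    """WSGI application exposing GET /predict?text=... as JSON."""
    if environ.get("PATH_INFO") != "/predict":
        body, status = {"error": "not found"}, "404 Not Found"
    else:
        query = parse_qs(environ.get("QUERY_STRING", ""))
        body, status = predict(query.get("text", [""])[0]), "200 OK"
    payload = json.dumps(body).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(payload)))])
    return [payload]


def call(path, query=""):
    """Invoke the app in-process, the way a WSGI server would."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    chunks = app({"PATH_INFO": path, "QUERY_STRING": query}, start_response)
    return captured["status"], json.loads(b"".join(chunks))


print(call("/predict", "text=i+love+my+apple+phone"))
```

The same callable can be served locally with `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`; in Flask the equivalent would be a single routed view function returning JSON.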
Slide 19
flask-restful-swagger
Slide 20
Reproducibility
Slide 21
Reproducibility (Notebooks)
Slide 22
Python Deployment
● Make Python modules (setup.py)
● python setup.py develop # symlink
● Unit tests + coverage
● Use a config system (e.g. my github.com/ianozsvald/python_template_with_config)
● Use git with a deployment branch
● Post-commit git hooks for unit testing
● Keep Separation of Concerns!
● “12 Factor App” useful ideas
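A config system in the "12 Factor" spirit can be as simple as choosing a settings dictionary from an environment variable. This is a minimal sketch and is not taken from the template repository named above; the `APP_ENV` variable name, the environment names and the settings values are all hypothetical.

```python
import os

# Hypothetical settings; real projects would keep secrets out of source control.
CONFIGS = {
    "dev": {"db_url": "sqlite:///dev.db", "debug": True},
    "test": {"db_url": "sqlite:///:memory:", "debug": True},
    "production": {"db_url": "postgresql://db.internal/app", "debug": False},
}


def get_config(env=None):
    """Return the settings for `env`, defaulting to the APP_ENV variable."""
    env = env or os.environ.get("APP_ENV", "dev")
    try:
        return CONFIGS[env]
    except KeyError:
        raise ValueError("Unknown environment %r; expected one of %s"
                         % (env, sorted(CONFIGS)))


config = get_config("production")
print(config["db_url"], config["debug"])
```

Failing loudly on an unknown environment name is deliberate: a typo in deployment should stop the process rather than silently fall back to development settings.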
Slide 23
Some common gotchas
● MySQL's utf8 is 3 bytes per character by default (utf8mb4 gives full UTF-8) #sigh
● JavaScript months are 0-based (not 1)
● Never compromise on datetimes (ISO 8601)
● iOS NSDate's epoch is 2001 (not 1970)
● Windows CP1252 text (strongly prefer UTF-8)
● MongoDB no_cursor_timeout=True
● GitHub's 100MB file limit (new Git Large File Storage)
● Never throw data away! Never overwrite original data! Always transform it (e.g. Luigi)
● Data duplication bites you in the end...
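Several of these gotchas can be demonstrated in a few lines of standard-library Python. The helper names below are hypothetical; the facts they encode are that NSDate counts seconds from 2001-01-01T00:00:00Z, that JavaScript's `Date.getMonth()` returns 0 for January, and that characters outside the Basic Multilingual Plane need 4 bytes in UTF-8.

```python
from datetime import datetime, timedelta, timezone

# NSDate counts seconds from 2001-01-01T00:00:00Z rather than the Unix epoch.
NSDATE_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)


def nsdate_to_iso8601(seconds):
    """Convert an NSDate reference-date interval to an ISO 8601 UTC string."""
    return (NSDATE_EPOCH + timedelta(seconds=seconds)).isoformat()


def js_month_to_iso(js_month):
    """JavaScript's Date.getMonth() is 0-based; ISO 8601 months are 1-based."""
    return js_month + 1


# One day after the NSDate epoch.
print(nsdate_to_iso8601(86400))  # 2001-01-02T00:00:00+00:00

# An emoji needs 4 bytes in UTF-8, which a 3-byte column cannot store.
print(len("\U0001F600".encode("utf-8")))  # 4

# Windows CP1252 "smart quotes" decoded explicitly, then re-encoded as UTF-8.
windows_bytes = b"\x93quote\x94"
text = windows_bytes.decode("cp1252")
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)
```

Decoding with an explicit codec at the boundary, and storing only UTF-8 and ISO 8601 internally, keeps these surprises in one place.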
Slide 24
(Perhaps) Avoid Big Data
● Don't be in a rush - 50,000 lines of good data will beat a pile of Bad Big Data
● 244GB RAM EC2 + many Xeons: $2.80/hr
Slide 25
Closing
● Tell me your dirty data stories (I want to automate some of this)
● Keep It Simple
● Come and talk about your projects at our PyDataLondon monthly meetup...
● It isn't what you know but who you know
● (I do coaching on this stuff at ModelInsight.io)