Ship Data Science Products
Turning raw data into valuable services
PyConUK 2015 Conference
Ian Ozsvald @IanOzsvald ModelInsight.io
Ian.Ozsvald@ModelInsight.io @IanOzsvald
PyConUK Conf Sept 2015
Who Am I?
●
“Industrial Data Science” for 15 years
●
Consultant
●
O'Reilly Author
●
Teacher at PyCons
Which projects succeed?
●
Explain existing data (visualisation!)
●
Automate repetitive/slow processes
(higher accuracy, more repeatable)
●
Augment data to make new data (e.g. for
search engines and ML)
●
Predict the future (e.g. replace human
intuition or use subtler relationships)
Why is it valuable?
Visualising data
●
Most data isn't interesting...
●
Requires human curation + detective
skills to get the good stuff
●
Probably needs an engineer/researcher
+ business person
Extracting data from binary files
●
Copy/pasting PDF/PNG data is laborious
●
How can we scale it?
●
textract - unified interface
●
Apache Tika is (maybe) better
●
Specialised tools e.g. Sovren
●
This might take months!
Augmenting data
●
Identifying people, places, brands,
sentiment
●
“i love my apple phone”
●
Context-sensitive (e.g. movies vs
products)
●
Accurately count mentions & sentiment
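A minimal sketch of context-sensitive mention counting, using only the standard library. The word lists and the function names (`classify_mention`, `count_mentions`) are made up for illustration; a real system would probably learn its context and sentiment signals from labelled data rather than hard-code them.

```python
import re
from collections import Counter

# Hypothetical hand-picked word lists, for illustration only.
PRODUCT_CONTEXT = {"phone", "iphone", "mac", "laptop", "ipad"}
POSITIVE = {"love", "great", "awesome"}
NEGATIVE = {"hate", "broken", "awful"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def classify_mention(text, brand="apple"):
    """Return (is_brand_mention, sentiment) for one short message.

    The brand word only counts as a brand mention when a product
    context word appears in the same message.
    """
    tokens = tokenize(text)
    if brand not in tokens or not (PRODUCT_CONTEXT & set(tokens)):
        return False, "none"
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return True, "positive"
    if score < 0:
        return True, "negative"
    return True, "neutral"


def count_mentions(messages, brand="apple"):
    """Count brand mentions by sentiment across many messages."""
    counts = Counter()
    for message in messages:
        is_brand, sentiment = classify_mention(message, brand)
        if is_brand:
            counts[sentiment] += 1
    return counts
```

With these word lists "i love my apple phone" counts as a positive brand mention, while "i ate an apple" is ignored because no product context word is present.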
Medical data (anti-allergy)
ML prediction using food+alcohol, pollen, pollution, location, cats, ...
Machine Learning
●
PyMC (Markov Chain Monte Carlo)
Please cite these projects!
(it helps their funding)
Delivery: Keep It Simple (Stupid!)
●
We're (probably) not publishing the best
result
●
Debuggability is key - 3am Sunday CTO
beeper alert is no time for complexity
●
“cult of the imperfect” Watson-Watt
●
Dumb models + clean data beat other
combinations
Don't Kill It!
●
Your data is missing, it is poor and it lies
●
Missing data kills projects!
●
Log everything!
●
Make data quality tools & reports
●
Note! More data->desynchronisation
●
DESYNCH IS BAAAAAAD!
●
R&D != Engineering
●
Discovery-based
●
Iterative
●
Success and failure equally useful
engarde (defensive data-quality checks for pandas DataFrames)
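One way to start on "data quality tools & reports" is a tiny missing-value report. This is a sketch in plain Python, not a replacement for a proper tool; the function name `missing_report` and the example field names are assumptions for illustration.

```python
from collections import Counter


def missing_report(rows, fields):
    """Count missing values per field across a list of dict records.

    A value counts as missing if the key is absent, the value is None,
    or the value is an empty/whitespace-only string.
    """
    missing = Counter()
    for row in rows:
        for field in fields:
            value = row.get(field)
            if value is None or (isinstance(value, str) and not value.strip()):
                missing[field] += 1
    total = len(rows)
    return {
        field: {
            "missing": missing[field],
            "fraction": missing[field] / total if total else 0.0,
        }
        for field in fields
    }


# Hypothetical records: "pollen" is absent, None or blank in three of four rows.
rows = [
    {"id": 1, "pollen": 3.2},
    {"id": 2, "pollen": None},
    {"id": 3},
    {"id": 4, "pollen": " "},
]
report = missing_report(rows, ["id", "pollen"])
```

Running a report like this on every data delivery, and logging the result, makes it obvious when a source quietly starts dropping values.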
Python Deployment
●
Make Python modules (setup.py)
●
python setup.py develop # symlink
●
Unit tests + coverage
●
Use a config system (e.g. my
github.com/ianozsvald/python_template_with_config)
●
Use git with a deployment branch
●
Post-commit git hooks for unit testing
●
Keep Separation of Concerns!
●
“12 Factor App” useful ideas
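A minimal sketch of a config system selected by an environment variable, in the spirit of the "12 Factor App" advice to keep config out of the code path. This is not the template linked above; the variable name `APP_ENV`, the class names and the database URLs are all assumptions for illustration.

```python
import os


class BaseConfig:
    """Production defaults; other environments override what they need."""
    DEBUG = False
    DB_URL = "sqlite:///prod.db"  # hypothetical value


class DevConfig(BaseConfig):
    DEBUG = True
    DB_URL = "sqlite:///dev.db"  # hypothetical value


class TestConfig(BaseConfig):
    DEBUG = True
    DB_URL = "sqlite:///:memory:"  # keep unit tests away from real data


CONFIGS = {"production": BaseConfig, "dev": DevConfig, "test": TestConfig}


def get_config(env=None):
    """Return the config class for `env`, else for $APP_ENV, else dev."""
    name = env or os.environ.get("APP_ENV", "dev")
    try:
        return CONFIGS[name]
    except KeyError:
        raise ValueError("Unknown environment: %r" % name)
```

Unit tests can call `get_config("test")` explicitly, while a deployment only needs to set `APP_ENV=production` on the server.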
Some common gotchas
●
MySQL's utf8 is 3 bytes per character by default (use utf8mb4 for full UTF-8) #sigh
●
JavaScript months are 0-based (not 1)
●
Never compromise on datetimes (ISO 8601)
●
iOS NSDate's epoch is 2001
●
Windows CP1252 text (strongly prefer UTF8)
●
MongoDB cursors time out: in PyMongo use no_cursor_timeout=True
●
GitHub's 100MB file limit (new Git Large File Storage)
●
Never throw data away! Never overwrite original data! Always
transform it (e.g. Luigi)
●
Data duplication bites you in the end...
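Three of the gotchas above can be sketched with the standard library: converting an NSDate-style timestamp (seconds since 2001-01-01 UTC) to ISO 8601, converting CP1252 bytes to UTF-8, and spotting text that needs 4-byte UTF-8 characters (which MySQL's 3-byte utf8 cannot store, unlike utf8mb4). The helper names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# NSDate counts seconds from 2001-01-01T00:00:00Z, not the Unix epoch of 1970.
NSDATE_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)


def nsdate_to_iso8601(seconds_since_2001):
    """Convert an NSDate-style timestamp to an ISO 8601 string in UTC."""
    return (NSDATE_EPOCH + timedelta(seconds=seconds_since_2001)).isoformat()


def cp1252_to_utf8(raw_bytes):
    """Decode Windows CP1252 bytes and re-encode them as UTF-8."""
    return raw_bytes.decode("cp1252").encode("utf-8")


def needs_utf8mb4(text):
    """True if any character lies outside the Basic Multilingual Plane.

    Such characters take 4 bytes in UTF-8.
    """
    return any(ord(ch) > 0xFFFF for ch in text)
```

For example `nsdate_to_iso8601(0)` gives `2001-01-01T00:00:00+00:00`, and `needs_utf8mb4` is True for most emoji.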
(Perhaps) Avoid Big Data
●
Don't be in a rush - 50,000 lines of good
data will beat a pile of Bad Big Data
●
244GB RAM EC2+many Xeons $2.80/hr
Closing
●
Tell me your dirty data stories (I'm starting to
automate some of this)
●
Keep It Simple
●
Come and talk about your projects at our
PyDataLondon monthly meetup...
●
It isn't what you know but who you know
●
(I do coaching and consulting on this stuff at
ModelInsight.io)