Shipping data science products is hard! Learn from 10 years of my experience how to visualise, extract, annotate and machine-learn on data sets, through to successful strategies for deployment and reproducibility.
Explain existing data (visualisation!) • Automate repetitive/slow processes (higher accuracy, more repeatable) • Augment data to make new data (e.g. for search engines and ML) • Predict the future (e.g. replace human intuition or use subtler relationships)
Extracting data from files • Copy/pasting PDF/PNG data is laborious • How can we scale it? • textract - a unified interface • Apache's Tika is (maybe) better • Specialised tools e.g. Sovren • Think in pipelines of transforms (see the sketch below) • This might take months!
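A minimal sketch of the textract route (the file path is a placeholder, and a real pipeline would chain many more transforms):

```python
# Minimal sketch: textract gives one interface over many extractors
# (pdftotext for PDFs, tesseract OCR for images, etc.).
# "report.pdf" is a placeholder path.
import textract

raw = textract.process("report.pdf")  # returns bytes
text = raw.decode("utf-8")
print(text[:500])  # eyeball the first extracted characters
```

Each extracted document should then feed the next transform in the pipeline rather than being edited in place.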
Thoughts from you? • No obvious tools to show me: • these examples were well-fitted • these were always wrongly fitted • these were always uncertain • No data diagnostics to validate inputs (e.g. for Logistic Regression) • No visualisers for most of the models • Could your hard-won knowledge become new debug tools? (PLEASE! - a sketch of the idea follows)
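As a hedged sketch of one such debug tool (not an existing library - the data, names and threshold below are illustrative assumptions): rank examples by how confidently a classifier gets them wrong, and by how uncertain it stays.

```python
# Sketch: surface "always wrongly-fitted" and "always uncertain"
# examples for a binary classifier. Synthetic data and the 0.5
# decision threshold are assumptions for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X)[:, 1]      # P(class == 1) per example
pred = (proba >= 0.5).astype(int)
confidence = np.abs(proba - 0.5) * 2.0  # 0 = uncertain, 1 = certain
wrong = pred != y

# Most confident mistakes first: candidates for label/feature bugs.
print("confidently wrong:", np.argsort(-(confidence * wrong))[:10])
# Least confident overall: the model never makes its mind up here.
print("most uncertain:  ", np.argsort(confidence)[:10])
```

Run on a held-out set, these two orderings are a cheap first pass at the well-fitted / wrongly-fitted / uncertain split the slide asks for.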
Keep It Simple (Stupid!) • We're (probably) not publishing the best result • Debuggability is key - a 3am Sunday CTO beeper alert is no time for complexity • Watson-Watt's “cult of the imperfect” • Dumb models + clean data beat other combinations
Your data is missing, it is poor and it lies • Missing data kills projects! • Log everything! • Build data-quality tools & reports (see the sketch below) • Note! More data -> desynchronisation • DESYNCH IS BAAAAAAD! • R&D != Engineering • Discovery-based • Iterative • Success and failure are equally useful
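A minimal data-quality report sketch with pandas (the CSV path and the created_at column are placeholder assumptions):

```python
# Sketch: the two questions a data-quality report should answer
# first -- what is missing, and are the feeds in sync?
import pandas as pd

df = pd.read_csv("input.csv")  # placeholder path

# Fraction of missing values per column, worst first.
print(df.isnull().mean().sort_values(ascending=False))

# Row counts per day expose desynchronised or stalled feeds.
if "created_at" in df.columns:  # hypothetical timestamp column
    dates = pd.to_datetime(df["created_at"], errors="coerce")
    print(dates.dt.date.value_counts().sort_index())
```

Logging these numbers on every run turns "the data lies" from a surprise into a diffable report.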
Python modules (setup.py) • python setup.py develop # symlink (see the sketch below) • Unit tests + coverage • Use a config system (e.g. my github.com/ianozsvald/python_template_with_config) • Use git with a deployment branch • Post-commit git hooks for unit testing • Keep Separation of Concerns! • The “12 Factor App” has useful ideas
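A minimal setup.py along those lines (the package name and version are placeholders):

```python
# setup.py -- minimal sketch. After `python setup.py develop`,
# setuptools symlinks the package into site-packages, so code edits
# take effect without reinstalling.
from setuptools import setup, find_packages

setup(
    name="myproject",         # hypothetical package name
    version="0.1.0",
    packages=find_packages(),
    install_requires=[],      # pin real dependencies here
)
```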
MySQL UTF8 is 3-byte by default #sigh • JavaScript months are 0-based (not 1-based) • Never compromise on datetimes (ISO 8601 - example below) • iOS NSDate's epoch is 2001 • Windows CP1252 text (strongly prefer UTF8) • MongoDB: pass no_cursor_timeout=True on long-running cursors • GitHub's 100MB file limit (new Large File Storage helps) • Never throw data away! Never overwrite original data! Always transform it (e.g. with Luigi) • Data duplication bites you in the end...
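On the datetime point, a small sketch of the safe pattern: keep timestamps timezone-aware in UTC and serialise them as ISO 8601.

```python
# Sketch: timezone-aware UTC timestamps, ISO 8601 on the wire.
from datetime import datetime, timezone

now = datetime.now(timezone.utc)  # aware, never naive
stamp = now.isoformat()           # e.g. "2016-05-06T14:03:02.123456+00:00"
print(stamp)

# ISO 8601 round-trips cleanly (fromisoformat needs Python 3.7+),
# and same-offset strings sort chronologically as plain text.
assert datetime.fromisoformat(stamp) == now
```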
Tell me your dirty data stories (I want to automate some of this) • Keep It Simple • Come talk about your projects at our PyDataLondon monthly meetup... • It isn't what you know but who you know • (I do coaching on this stuff at ModelInsight.io)