Ship It! [PyDataLondon 2015]

Shipping data science products is hard! Learn from my 10 years of experience how to visualise, extract, annotate and apply machine learning to data sets, through to successful strategies for deployment and reproducibility.

ianozsvald

June 21, 2015

Transcript

  1. Ship It!
    Turning raw data into valuable services
    PyDataLondon 2015 Conference
    Ian Ozsvald @IanOzsvald ModelInsight.io

  2. [email protected] @IanOzsvald
    PyDataLondon Conf June 2015
    Who Am I?

    “Industrial Data Science” for 15 years

    Consultant

    O'Reilly Author

    Teacher at PyCons

  3. Which projects succeed?

    Explain existing data (visualisation!)

    Automate repetitive/slow processes
    (higher accuracy, more repeatable)

    Augment data to make new data (e.g. for
    search engines and ML)

    Predict the future (e.g. replace human
    intuition or use subtler relationships)

  4. Why is it valuable?

  5. Visualising data

    Most data isn't interesting...

    Requires human curation + detective
    skills to get the good stuff

    Probably needs an engineer/researcher
    + business person

  6. Extracting data from binary files

    Copy/pasting PDF/PNG data is laborious

    How can we scale it?

    textract - unified interface

    Apache Tika is (maybe) better

    Specialised tools e.g. Sovren

    Think in terms of pipelines of transforms

    This might take months!
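
The "pipelines of transforms" idea above can be sketched as a list of small text-cleaning functions applied in order. This is only an illustration: the three steps and the sample string are hypothetical, and in practice the raw text would come from an extractor such as textract or Tika rather than a literal.

```python
import re
from functools import reduce


def strip_page_numbers(text):
    """Drop lines that contain nothing but a number (assumed page numbers)."""
    kept = [line for line in text.splitlines()
            if not re.fullmatch(r"\s*\d+\s*", line)]
    return "\n".join(kept)


def join_hyphenated(text):
    """Re-join words that were split across a line break by a hyphen."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def collapse_whitespace(text):
    """Squash runs of spaces and tabs into a single space."""
    return re.sub(r"[ \t]+", " ", text).strip()


def run_pipeline(text, steps):
    """Apply each transform in order, feeding the output of one into the next."""
    return reduce(lambda acc, step: step(acc), steps, text)


# Hypothetical pipeline and input, standing in for extracted PDF text.
PIPELINE = [strip_page_numbers, join_hyphenated, collapse_whitespace]
raw = "Invoice  total\n12\nship-\nment   due"
cleaned = run_pipeline(raw, PIPELINE)
print(cleaned)
```

Keeping each step as its own function makes it cheap to test a step in isolation and to reorder or drop steps as new document types turn up.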

  7. Optical Character Recognition

  8. Optical Character Recognition

  9. Augmenting data

    Identifying people, places, brands,
    sentiment

    “i love my apple phone”

    Context-sensitive (e.g. movies vs
    products)

    Accurately count mentions & sentiment

  10. Medical data (anti-allergy)
    ML prediction using food+alcohol, pollen, pollution, location, cats, ...

  11. Machine Learning

    PyMC (Markov Chain Monte Carlo)
    Please cite these projects!
    (it helps their funding)

  12. Debugging Machine Learning?

    Thoughts from you?

    No obvious tools to show me:

    these examples were well-fitted

    these were always wrongly fitted

    these were always uncertain

    No data-diagnostics to validate inputs (e.g. for Logistic
    Regression)

    No visualisers for most of the models

    Your hard-won knowledge->new debug tools? (PLEASE!)
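
One possible shape for the missing diagnostic above is to refit a model several times (for example on resampled training sets) and bucket each example by how consistently it was predicted. The sketch below assumes that "uncertain" means "the prediction flips between fits"; the function name, bucket names and toy data are all hypothetical.

```python
from collections import defaultdict


def bucket_examples(true_labels, runs):
    """Group example indices by how consistently they were predicted.

    true_labels: one label per example.
    runs: one list of predictions per model fit, aligned with true_labels.
    """
    buckets = defaultdict(list)
    for i, truth in enumerate(true_labels):
        hits = sum(1 for preds in runs if preds[i] == truth)
        if hits == len(runs):
            buckets["always_right"].append(i)
        elif hits == 0:
            buckets["always_wrong"].append(i)
        else:
            buckets["unstable"].append(i)
    return dict(buckets)


# Toy data: four examples, three hypothetical model fits.
true_labels = [1, 0, 1, 0]
runs = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
]
report = bucket_examples(true_labels, runs)
print(report)
```

The "always_wrong" bucket is usually the first place to look for mislabelled or broken input rows.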

  13. Debugging Machine Learning?
    “How “good” is your model, and how can you make
    it better?” Chih-Chun Chen, Elena Chatzimichali at
    PyDataLondon 2015

  14. Debugging Machine Learning?
    Roelof Pieters at PyDataLondon 2015

  15. Delivery: Keep It Simple (Stupid!)

    We're (probably) not publishing the best
    result

    Debuggability is key - 3am Sunday CTO
    beeper alert is no time for complexity

    “cult of the imperfect” Watson-Watt

    Dumb models + clean data beat other
    combinations

  16. Don't Kill It!

    Your data is missing, it is poor and it lies

    Missing data kills projects!

    Log everything!

    Make data quality tools & reports

    Note! More data->desynchronisation

    DESYNCH IS BAAAAAAD!

    R&D != Engineering

    Discovery-based

    Iterative

    Success and failure equally useful
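
As a minimal sketch of the "data quality tools & reports" point above, the function below counts missing values per column in CSV-style rows. The column names, the sample rows and the set of markers treated as "missing" are assumptions for illustration only.

```python
import csv
import io
from collections import Counter


def missing_value_report(rows, missing_markers=("", "NA", "null")):
    """Return {column: (missing_count, total_rows)} for an iterable of dict rows."""
    total = 0
    missing = Counter()
    columns = []
    for row in rows:
        total += 1
        for column, value in row.items():
            if column is None:
                # csv.DictReader files surplus unnamed fields under None; skip them.
                continue
            if column not in columns:
                columns.append(column)
            # DictReader yields None for fields absent from a short row.
            if value is None or value.strip() in missing_markers:
                missing[column] += 1
    return {column: (missing[column], total) for column in columns}


# Hypothetical sample data standing in for a real export.
sample = io.StringIO(
    "user_id,pollen,location\n"
    "1,12,London\n"
    "2,,London\n"
    "3,NA,\n"
)
report = missing_value_report(csv.DictReader(sample))
for column, (count, total) in report.items():
    print(f"{column}: {count}/{total} missing")
```

Run on every new data drop, a report like this makes silent gaps visible before they reach a model.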

  17. Internal deployment

    Scripts to drive
    report

    CSVs/Reports

    Database updates

    IPython Notebook
    (not secure though!)

    Bokeh

  18. Deploying live systems

    Spyre (locked-down)

    Microservices

    Flask is my go-to tool

    Swagger docs

    (git pull / fabric / provisioned machines)

    Docker + Amazon ECS
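
To show the shape of a prediction microservice without depending on Flask being installed, here is a rough sketch using only the standard library's WSGI support. It is a stand-in for the Flask approach named above, not a replacement for it: the `/predict` route, the threshold rule posing as a model, and the `call` helper are all hypothetical.

```python
import io
import json
from wsgiref.util import setup_testing_defaults


def predict(features):
    """Hypothetical stand-in for a trained model: a fixed threshold rule."""
    return {"high_risk": features.get("pollen", 0) > 50}


def app(environ, start_response):
    """Tiny WSGI service: POST JSON features to /predict, get JSON back."""
    if environ["PATH_INFO"] != "/predict" or environ["REQUEST_METHOD"] != "POST":
        start_response("404 Not Found", [("Content-Type", "application/json")])
        return [b'{"error": "not found"}']
    length = int(environ.get("CONTENT_LENGTH") or 0)
    features = json.loads(environ["wsgi.input"].read(length) or b"{}")
    body = json.dumps(predict(features)).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]


def call(wsgi_app, method, path, payload=b""):
    """Invoke the WSGI app in-process, without opening a socket."""
    environ = {}
    setup_testing_defaults(environ)
    environ.update({
        "REQUEST_METHOD": method,
        "PATH_INFO": path,
        "CONTENT_LENGTH": str(len(payload)),
        "wsgi.input": io.BytesIO(payload),
    })
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(wsgi_app(environ, start_response))
    return captured["status"], json.loads(body)


status, result = call(app, "POST", "/predict", b'{"pollen": 80}')
print(status, result)
```

Keeping the model behind one small JSON endpoint keeps the service easy to debug, and the in-process `call` helper doubles as a unit-test harness.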

  19. flask-restful-swagger

  20. Reproducibility

  21. Reproducibility (Notebooks)

  22. Python Deployment

    Make Python modules (setup.py)

    python setup.py develop # symlink

    Unit tests + coverage

    Use a config system (e.g. my
    github.com/ianozsvald/python_template_with_config)

    Use git with a deployment branch

    Post-commit git hooks for unit testing

    Keep Separation of Concerns!

    “12 Factor App” useful ideas
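
One way to read the "config system" and "12 Factor App" points above is to pick the active configuration from an environment variable, so the same code runs unchanged on a laptop and in production. This sketch is not the layout of the linked template; the variable name `APP_CONFIG`, the class names and the database URLs are hypothetical.

```python
import os


class DevConfig:
    """Defaults for local development (hypothetical values)."""
    db_url = "sqlite:///dev.db"
    debug = True


class ProductionConfig(DevConfig):
    """Overrides for the live system (hypothetical values)."""
    db_url = "postgresql://db.internal/prod"
    debug = False


CONFIGS = {"dev": DevConfig, "production": ProductionConfig}


def get_config(environ=os.environ):
    """Pick a config class from the APP_CONFIG environment variable."""
    name = environ.get("APP_CONFIG", "dev")
    try:
        return CONFIGS[name]
    except KeyError:
        raise ValueError(f"Unknown APP_CONFIG value: {name!r}")


config = get_config({"APP_CONFIG": "production"})
print(config.db_url, config.debug)
```

Failing loudly on an unknown name is deliberate: a typo in a deployment script should stop the service rather than silently fall back to development settings.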

  23. Some common gotchas

    MySQL's utf8 is 3 bytes per character by default (utf8mb4 is the full 4-byte UTF-8) #sigh

    JavaScript months are 0-based (not 1)

    Never compromise on datetimes (ISO 8601)

    iOS NSDate's epoch is 2001

    Windows CP1252 text (strongly prefer UTF8)

    MongoDB no_cursor_timeout=True (PyMongo's find() option)

    GitHub's 100MB file limit (new Git Large File Storage)

    Never throw data away! Never overwrite original data! Always
    transform it (e.g. Luigi)

    Data duplication bites you in the end...
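
Three of the gotchas above (NSDate's 2001 epoch, 0-based JavaScript months, CP1252 text) can be handled with small normalising helpers at the point where data enters the system. The helper names below are hypothetical; the epoch and encodings are the standard ones.

```python
from datetime import datetime, timedelta, timezone

# NSDate counts seconds from 2001-01-01 00:00:00 UTC, not from 1970.
NSDATE_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)


def nsdate_to_iso8601(seconds_since_2001):
    """Convert an NSDate-style timestamp to a timezone-aware ISO 8601 string."""
    return (NSDATE_EPOCH + timedelta(seconds=seconds_since_2001)).isoformat()


def js_month_to_calendar(js_month):
    """JavaScript's Date months run 0-11; shift to the usual 1-12."""
    return js_month + 1


def cp1252_to_utf8(raw_bytes):
    """Re-encode Windows CP1252 bytes as UTF-8."""
    return raw_bytes.decode("cp1252").encode("utf-8")


iso = nsdate_to_iso8601(0)
month = js_month_to_calendar(0)
utf8 = cp1252_to_utf8(b"\x93quoted\x94")  # CP1252 curly quotes
print(iso, month, utf8)
```

Converting once, at the boundary, means everything downstream only ever sees UTC ISO 8601 strings and UTF-8 bytes.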

  24. (Perhaps) Avoid Big Data

    Don't be in a rush - 50,000 lines of good
    data will beat a pile of Bad Big Data

    EC2 with 244GB RAM + many Xeons: $2.80/hr

  25. Closing

    Tell me your dirty data stories (I want to
    automate some of this)

    Keep It Simple

    Come talk about your projects at our
    PyDataLondon monthly meetup...

    It isn't what you know but who you know

    (I do coaching on this stuff at ModelInsight.io)
