PyConSE 2015 Opening Keynote "Data Science Delivered"

Opening keynote for PyCon Sweden 2015 discussing how to start, develop and deploy a successful data product using Python

ianozsvald

May 13, 2015

Transcript

  1. Data Science Deployed
    Turning raw data into valuable services
    Ian Ozsvald @IanOzsvald ModelInsight.io

  2. [email protected] @IanOzsvald
    PyConSE May 2015
    Who Am I?

    “Industrial Data Science” for 15 years

    O'Reilly Author

    Teacher at PyCons

  3. PyDataLondon Meetups

  4. I want to encourage you to...

    Mix “data people” and “engineers” to deliver
    high-value products so we can...

    Go faster than humans

    Be more accurate than humans

    Be consistent and reproducible

    I want you to become a
    data scientist
    Attrib: http://www.xara.com/news/april07/tutorial2.asp

  5. Who is a Data Scientist?
    http://datascopeanalytics.com/what-we-think/2014/02/05/what-is-a-data-scientist

  6. Why 'now'?
    http://en.wikipedia.org/wiki/List_of_countries_by_number_of_Internet_users

  7. Why is it valuable?

    “Massively customised service”

    Data Moats are hard to copy

  8. Why is it valuable?

  9. “A day in my life”

    “How can I turn our data into business
    value?”

    Thinking about our data quality and the
    transformations that improve it

    How can I better predict or classify
    something that's valuable?

    Deploying, testing, documenting

  10. Starting your first project

    Need: High value & easy problem

    Share insight, augment data, automate a
    process or predict the future

    Deliver value at the end of day 1, day 2,
    week 1, week 2, month 1, etc.

    Tutorials on my blog (IanOzsvald.com)

  11. Example of “insight”
    Data via: https://twitter.com/echen/status/594353863374737409
    http://ianozsvald.com/2015/05/03/talkpay-tweet-salary-visualisation/

  12. Example of “insight”

  13. Example of “insight”

  14. Extracting data from binary files

    Copy/pasting PDF/PNG data is laborious

    How can we scale it?

    textract - unified interface

    Apache Tika is (maybe) better

    Specialised tools e.g. Sovren

    Think on pipelines of transforms
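
The "pipelines of transforms" idea can be sketched with plain functions. This is a minimal illustration, not the speaker's code: the three transform functions and the sample string are hypothetical, and the raw text is assumed to have already come out of an extraction tool such as textract or Tika.

```python
from functools import reduce
from typing import Callable, Iterable

Transform = Callable[[str], str]


def strip_control_chars(text: str) -> str:
    """Drop non-printable characters that extraction often leaves behind."""
    return "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")


def normalise_whitespace(text: str) -> str:
    """Collapse every run of whitespace into a single space."""
    return " ".join(text.split())


def lowercase(text: str) -> str:
    """Lower-case the text so later matching is case-insensitive."""
    return text.lower()


def run_pipeline(text: str, transforms: Iterable[Transform]) -> str:
    """Apply each transform in order, feeding each output to the next step."""
    return reduce(lambda acc, step: step(acc), transforms, text)


# Hypothetical pipeline and input, invented for this sketch.
PIPELINE = [strip_control_chars, normalise_whitespace, lowercase]
raw = "Invoice\x0c  TOTAL:\n\n 42.00   EUR "
cleaned = run_pipeline(raw, PIPELINE)
print(cleaned)  # invoice total: 42.00 eur
```

Keeping each step a small, separately testable function makes it easy to reorder steps or swap one out when the extracted data turns out to be messier than expected.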

  15. Optical Character Recognition

  16. Optical Character Recognition

  17. Augmenting data

    Identifying people, places, brands,
    sentiment

    “i love my apple phone”

    Context-sensitive (e.g. movies vs
    products)

    Accurately count mentions & sentiment
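
To make "count mentions & sentiment" concrete, here is a deliberately naive sketch. The brand list and sentiment lexicons are hypothetical toy data invented for this example, and a word-list lookup like this is far weaker than a real entity or sentiment model. The third sample message shows why the slide stresses context: "apple pie" is wrongly counted as a brand mention.

```python
import re
from collections import Counter

# Hypothetical toy lexicons, invented for this sketch.
BRANDS = {"apple", "samsung"}
POSITIVE = {"love", "great", "like"}
NEGATIVE = {"hate", "awful", "broken"}


def tokenise(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def score_mentions(messages):
    """Count brand mentions and sum a crude polarity score per brand."""
    mentions = Counter()
    sentiment = Counter()
    for message in messages:
        tokens = tokenise(message)
        # Polarity = positive word count minus negative word count.
        polarity = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
        for brand in BRANDS.intersection(tokens):
            mentions[brand] += 1
            sentiment[brand] += polarity
    return mentions, sentiment


messages = [
    "i love my apple phone",
    "my samsung screen is broken",
    "apple pie is great",  # not about the brand: context matters
]
mentions, sentiment = score_mentions(messages)
print(dict(mentions))
print(dict(sentiment))
```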

  18. Augmenting images

  19. Predicting the unknown

    Forecasting the future or filling the gaps

    Demand prediction, life expectancy, price
    estimation

  20. Predicting the unknown

  21. Gaussian Process price estimates

  22. Classification

    “Is it X or is it something else?”

    Spam, malware, lead identification, text
    disambiguation, fraud classification

    Many examples online, lots of tutorials
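
As one small instance of "is it X or is it something else?", the sketch below trains a tiny multinomial naive Bayes spam filter using only the standard library. The talk does not say which classifier was used, so the algorithm choice, the four training messages and the function names are all assumptions made for illustration; in practice a library such as scikit-learn would be the usual route.

```python
import math
from collections import Counter


def train(examples):
    """Build per-label word counts and per-label document counts."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return the label with the highest log-probability (add-one smoothing)."""
    vocab = {word for counts in word_counts.values() for word in counts}
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Start from the log prior, then add one log likelihood per word.
        score = math.log(doc_counts[label] / total_docs)
        denominator = sum(counts.values()) + len(vocab)
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data, invented for this sketch.
TRAINING = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
model = train(TRAINING)
first = classify("claim your free prize", *model)
second = classify("agenda for the team lunch", *model)
print(first, second)  # spam ham
```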

  23. Digit classification

  24. More problems we can solve

    Text topic detection

    Duplicate detection

    Data cleaning

    Copyright violation (DMCA)

    Speech recognition for call centre
    automation
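
Of the problems listed, duplicate detection is easy to prototype. The sketch below flags near-duplicate records with the standard library's `difflib.SequenceMatcher`. The sample records and the 0.85 threshold are assumptions chosen for illustration, and the pairwise comparison is quadratic, so it only suits small datasets.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalise(record):
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(record.lower().split())


def find_near_duplicates(records, threshold=0.85):
    """Return (i, j, similarity) for every pair at or above the threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        ratio = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
        if ratio >= threshold:
            pairs.append((i, j, round(ratio, 2)))
    return pairs


# Hypothetical records, invented for this sketch.
records = [
    "Acme Ltd, 12 High Street, London",
    "ACME Ltd,  12 High Street London",
    "Bolt & Sons, 4 Mill Lane, Leeds",
]
pairs = find_near_duplicates(records)
print(pairs)
```

Only the first two records are reported as a likely duplicate pair; the threshold would need tuning against real data.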

  25. Tooling
    IDE: Spyder (PyCharm)
    Notebooks are great for tutorials
    & demos, not as an IDE

  26. First project: outline

    Iterate on:

    Visualise

    Seaborn/Bokeh

    Create milestones

    KISS!

    Think+hypothesise+test

    Communicate results

    IPython Notebook

    (Engineer a solution)

  27. Don't Kill It!

    Your data is missing, it is poor and it lies

    Missing data kills projects!

    Log everything!

    Make data quality reports

    R&D != Engineering

    Discovery-based

    Iterative

    Success and failure equally useful
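
A data quality report can start very small. The sketch below counts missing and distinct values per column of a CSV using only the standard library. The sample data, the column names and the set of strings treated as "missing" are assumptions made for this example, not something the talk specifies.

```python
import csv
import io

# Strings treated as missing; a hypothetical convention for this sketch.
MISSING_MARKERS = {"", "na", "n/a", "null", "none", "-"}


def quality_report(rows):
    """Summarise row count, missing count and distinct values per column."""
    rows = list(rows)
    report = {}
    for column in (rows[0].keys() if rows else []):
        values = [(row.get(column) or "").strip() for row in rows]
        present = [v for v in values if v.lower() not in MISSING_MARKERS]
        report[column] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return report


# Hypothetical sample data, invented for this sketch.
SAMPLE_CSV = """customer_id,country,monthly_spend
1,SE,120.50
2,,99.00
3,SE,N/A
4,GB,
"""
report = quality_report(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for column, stats in report.items():
    print(column, stats)
```

Running a report like this on every new data delivery, and logging the result, makes gaps visible before they reach a model.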

  28. Internal deployment

    Scripts to drive
    reports

    CSVs/Reports

    Database updates

    IPython Notebook
    (not secure though!)

    Bokeh

  29. Deploying live systems

    Spyre (locked-down)

    Microservices

    Flask is my go-to tool

    Swagger docs

    (git pull / fabric / provisioned machines)

    Docker + Amazon ECS
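
The shape of a small model-serving microservice can be sketched as follows. The talk names Flask as the go-to tool; to keep this example dependency-free it is written as a plain WSGI callable using the standard library instead, and the same handler would be a single route function in Flask. The `/estimate` endpoint, the `area_m2` parameter and the flat-rate "model" are all hypothetical.

```python
import json
from urllib.parse import parse_qs


def estimate_price(area_m2):
    """Placeholder model: a hypothetical flat rate per square metre."""
    return round(area_m2 * 50.0, 2)


def app(environ, start_response):
    """WSGI entry point exposing GET /estimate?area_m2=<number> as JSON."""
    if environ.get("PATH_INFO") != "/estimate":
        status, payload = "404 Not Found", {"error": "unknown endpoint"}
    else:
        query = parse_qs(environ.get("QUERY_STRING", ""))
        try:
            area = float(query["area_m2"][0])
            status = "200 OK"
            payload = {"area_m2": area, "estimate": estimate_price(area)}
        except (KeyError, ValueError):
            status, payload = "400 Bad Request", {"error": "area_m2 must be a number"}
    body = json.dumps(payload).encode("utf-8")
    headers = [("Content-Type", "application/json"), ("Content-Length", str(len(body)))]
    start_response(status, headers)
    return [body]


def call(path, query=""):
    """Invoke the app in-process, the way a unit test would."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app({"PATH_INFO": path, "QUERY_STRING": query}, start_response))
    return captured["status"], json.loads(body)


status, data = call("/estimate", "area_m2=80")
print(status, data)
# To serve it locally one could use wsgiref.simple_server.make_server.
```

Keeping the model behind a narrow JSON interface like this is what lets it be versioned, documented and redeployed independently of the applications that call it.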

  30. flask-restful-swagger

  31. Avoid Big Data if possible...

    Don't be in a rush - 5000 lines of good
    data will beat a pile of Bad Big Data

    EC2 with 244GB RAM + many Xeons: $2.80/hr

    Scaling options:

    ElasticSearch + Jython/Java

    Azure/Amazon ML

    Apache Spark (if you already have HDFS)

  32. Questions?

    We have a crazy-good selection of tools!

    Don't worry about imposter syndrome - your
    business knowledge has a lot of value

    We need data science patterns - what's your
    story?

    Ask me how you can get started (I respond well
    to beer)

    ianozsvald.com
