PyConSE 2015 Opening Keynote "Data Science Delivered"

Opening keynote for PyCon Sweden 2015 discussing how to start, develop and deploy a successful data product using Python

ianozsvald

May 13, 2015

Transcript

  1. Data Science Deployed: Turning raw data into valuable services
    Ian Ozsvald, @IanOzsvald, ModelInsight.io
  2. Ian.Ozsvald@ModelInsight.io @IanOzsvald PyConSE May 2015
    Who Am I?
    • “Industrial Data Science” for 15 years
    • O'Reilly Author
    • Teacher at PyCons
  3. PyDataLondon Meetups

  4. I want to encourage you to...
    • Mix “data people” and “engineers” to deliver high-value products so we can...
      • Go faster than humans
      • Be more accurate than humans
      • Be consistent and reproducible
    • I want you to become a data scientist
    Attrib: http://www.xara.com/news/april07/tutorial2.asp
  5. Who is a Data Scientist?
    http://datascopeanalytics.com/what-we-think/2014/02/05/what-is-a-data-scientist
  6. Why 'now'?
    http://en.wikipedia.org/wiki/List_of_countries_by_number_of_Internet_users

  7. Why is it valuable?
    • “Massively customised service”
    • Data Moats are hard to copy
  8. Why is it valuable?

  9. “A day in my life”
    • “How can I turn our data into business value?”
    • Thinking about our data quality and the transformations that improve it
    • How can I better predict or classify something that's valuable?
    • Deploying, testing, documenting
  10. Starting your first project
    • Need: High value & easy problem
    • Share insight, augment data, automate a process or predict the future
    • Deliver value at the end of day 1, day 2, week 1, week 2, month 1 etc.
    • Tutorials on my blog (IanOzsvald.com)
  11. Example of “insight”
    Data via: https://twitter.com/echen/status/594353863374737409
    http://ianozsvald.com/2015/05/03/talkpay-tweet-salary-visualisation/
  12. Example of “insight”

  13. Example of “insight”

  14. Extracting data from binary files
    • Copy/pasting PDF/PNG data is laborious
    • How can we scale it?
    • textract: a unified interface
    • Apache Tika is (maybe) better
    • Specialised tools, e.g. Sovren
    • Think in terms of pipelines of transforms
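
The "pipelines of transforms" idea from slide 14 can be sketched as a list of small functions applied in order. This is only an illustration: the function names and the sample string are hypothetical, and in practice the raw text would come from an extraction tool such as textract or Tika rather than a hard-coded literal.

```python
import re
from functools import reduce


def strip_control_chars(text):
    """Drop non-printable debris that extraction tools often leave behind."""
    return "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")


def join_hyphenated_line_breaks(text):
    """Re-join words that were split across a line break with a hyphen."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def collapse_whitespace(text):
    """Squash runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def run_pipeline(text, steps):
    """Apply each transform in order, feeding each output to the next step."""
    return reduce(lambda acc, step: step(acc), steps, text)


# Hypothetical cleaning pipeline; reorder or extend the steps as needed.
CLEANING_STEPS = [strip_control_chars, join_hyphenated_line_breaks, collapse_whitespace]

# Hypothetical raw output, standing in for text extracted from a PDF.
raw = "Invoice\x0c total:\n  1,200  SEK for data ex-\ntraction"
cleaned = run_pipeline(raw, CLEANING_STEPS)
print(cleaned)
```

Keeping each step as a plain function makes it easy to test steps in isolation and to log the intermediate output when a document comes out wrong.
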
  15. Optical Character Recognition

  16. Optical Character Recognition

  17. Augmenting data
    • Identifying people, places, brands, sentiment
    • “i love my apple phone”
    • Context-sensitive (e.g. movies vs products)
    • Accurately count mentions & sentiment
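
As a toy illustration of counting mentions and sentiment with some context sensitivity, the sketch below only treats "apple" as a brand when a product word appears in the same message. The word lists are hypothetical and hand-made; a real system would use a trained entity and sentiment model rather than lookups like these.

```python
import re
from collections import Counter

# Hypothetical, hand-made lexicons purely for illustration.
POSITIVE = {"love", "great", "good"}
NEGATIVE = {"hate", "broken", "bad"}
# "apple" only counts as the brand when a product word is in the same message.
BRAND_CONTEXT = {"apple": {"phone", "iphone", "laptop", "mac"}}


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def score_message(text):
    """Return (brands mentioned, sentiment score) for one message."""
    tokens = tokenize(text)
    token_set = set(tokens)
    brands = [b for b, ctx in BRAND_CONTEXT.items() if b in token_set and token_set & ctx]
    sentiment = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return brands, sentiment


def summarise(messages):
    """Count brand mentions and sum the sentiment of the messages mentioning them."""
    mentions, sentiment_total = Counter(), Counter()
    for message in messages:
        brands, sentiment = score_message(message)
        for brand in brands:
            mentions[brand] += 1
            sentiment_total[brand] += sentiment
    return mentions, sentiment_total


messages = [
    "i love my apple phone",
    "this apple pie is great",    # the fruit, so not counted as a brand mention
    "my apple laptop is broken",
]
mentions, sentiment_total = summarise(messages)
print(mentions["apple"], sentiment_total["apple"])
```
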
  18. Augmenting images

  19. Predicting the unknown
    • Forecasting the future or filling the gaps
    • Demand prediction, life expectancy, price estimation
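
A minimal sketch of both "filling the gaps" and "forecasting the future", assuming the simplest possible model: a straight line fitted by ordinary least squares. The sales figures are made up for illustration, and this is not the Gaussian Process approach shown later in the deck.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical weekly demand; week 4 is missing from the records.
weeks = [1, 2, 3, 5, 6]
units_sold = [12, 14, 16, 20, 22]

slope, intercept = fit_line(weeks, units_sold)
gap_week_4 = slope * 4 + intercept        # fill the gap
forecast_week_7 = slope * 7 + intercept   # forecast the future
print(round(gap_week_4, 2), round(forecast_week_7, 2))
```
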
  20. Predicting the unknown

  21. Gaussian Process price estimates

  22. Classification
    • “Is it X or is it something else?”
    • Spam, malware, lead identification, text disambiguation, fraud classification
    • Many examples online, lots of tutorials
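
To make the "Is it X or is it something else?" question concrete, here is a small sketch of a spam classifier using multinomial naive Bayes with add-one smoothing, written with the standard library only. The training messages are invented, and a real project would more likely reach for an established library and far more data.

```python
import math
from collections import Counter


def train(examples):
    """examples: list of (text, label). Returns per-label word counts and label counts."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts


def predict(text, word_counts, label_counts):
    """Return the label with the highest log-probability for the text."""
    vocab = set().union(*word_counts.values())
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(label_counts[label] / total_docs)  # class prior
        denom = sum(counts.values()) + len(vocab)           # add-one smoothing
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data for illustration only.
examples = [
    ("win cash prize now", "spam"),
    ("cheap pills win big", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
word_counts, label_counts = train(examples)
print(predict("win a cash prize", word_counts, label_counts))
print(predict("team meeting on monday", word_counts, label_counts))
```
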
  23. Digit classification

  24. More problems we can solve
    • Text topic detection
    • Duplicate detection
    • Data cleaning
    • Copyright violation (DMCA)
    • Speech recognition for call centre automation
  25. Tooling
    • IDE: Spyder (PyCharm)
    • Notebooks: great for tutorials & demos, not as an IDE
  26. First project: outline
    • Iterate on:
    • Visualise
    • Seaborn/Bokeh
    • Create milestones
    • KISS!
    • Think+hypothesise+test
    • Communicate results
    • IPython Notebook
    • (Engineer a solution)
  27. Don't Kill It!
    • Your data is missing, it is poor and it lies
    • Missing data kills projects!
    • Log everything!
    • Make data quality reports
    • R&D != Engineering
    • Discovery-based
    • Iterative
    • Success and failure equally useful
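
A data quality report can start very small. The sketch below, using only the standard library, reports the fraction of missing values per column of a CSV. The sample data, the column names and the list of "missing" markers are all assumptions chosen for illustration.

```python
import csv
import io


def missing_value_report(rows, missing_markers=("", "NA", "null")):
    """Return {column: fraction of rows where the value is missing}."""
    rows = list(rows)
    if not rows:
        return {}
    report = {}
    for column in rows[0].keys():
        missing = sum(1 for r in rows if (r[column] or "").strip() in missing_markers)
        report[column] = missing / len(rows)
    return report


# Hypothetical customer extract with some gaps in it.
sample_csv = """customer_id,age,city
1,34,Stockholm
2,,London
3,NA,
4,51,Malmo
"""

rows = csv.DictReader(io.StringIO(sample_csv))
report = missing_value_report(rows)
for column, fraction in report.items():
    print(f"{column}: {fraction:.0%} missing")
```

Running a report like this on every new data delivery, and logging the result, gives an early warning when a source starts dropping fields.
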
  28. Internal deployment
    • Scripts to drive reports
    • CSVs/Reports
    • Database updates
    • IPython Notebook (not secure though!)
    • Bokeh
  29. Deploying live systems
    • Spyre (locked-down)
    • Microservices
    • Flask is my go-to tool
    • Swagger docs
    • (git pull / fabric / provisioned machines)
    • Docker + Amazon ECS
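
A minimal sketch of a Flask microservice that wraps a model behind a JSON endpoint. The `/predict` route, the `area_m2` field and the `estimate_price` rule are hypothetical stand-ins for whatever trained model and schema a real service would use; Swagger documentation is not shown here.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def estimate_price(area_m2):
    """Hypothetical stand-in for a trained model: a fixed linear rule."""
    return 1000.0 + 50.0 * area_m2


@app.route("/predict", methods=["POST"])
def predict():
    """Validate the JSON payload and return a price estimate."""
    payload = request.get_json(silent=True) or {}
    area = payload.get("area_m2")
    # Reject missing or non-numeric input (bool is excluded explicitly).
    if not isinstance(area, (int, float)) or isinstance(area, bool):
        return jsonify(error="area_m2 must be a number"), 400
    return jsonify(price=estimate_price(area))
```

For local experiments the app can be started with Flask's development server (for example `app.run(port=5000)`), and exercised without a network using `app.test_client()`. Validating inputs at the boundary keeps bad requests away from the model code.
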
  30. flask-restful-swagger

  31. Avoid Big Data if possible...
    • Don't be in a rush: 5000 lines of good data will beat a pile of Bad Big Data
    • 244GB RAM EC2 + many Xeons: $2.80/hr
    • Scaling options:
      • ElasticSearch + Jython/Java
      • Azure/Amazon ML
      • Apache Spark (if you have HDFS already)
  32. Frågor? (Questions?)
    • We have a crazy-good selection of tools!
    • Don't worry about imposter syndrome: your business knowledge has a lot of value
    • We need data science patterns. What's your story?
    • Ask me how you can get started (I respond well to beer)
    • ianozsvald.com