PyConSE 2015 Opening Keynote "Data Science Delivered"

Opening keynote for PyCon Sweden 2015 discussing how to start, develop and deploy a successful data product using Python

ianozsvald

May 13, 2015

Transcript

  1. [email protected] @IanOzsvald PyConSE May 2015 Who Am I?

    • “Industrial Data Science” for 15 years
    • O'Reilly Author
    • Teacher at PyCons
  2. I want to encourage you to...

    • Mix “data people” and “engineers” to deliver high-value products so we can...
    • Go faster than humans
    • Be more accurate than humans
    • Be consistent and reproducible
    • I want you to become a data scientist
    Attrib: http://www.xara.com/news/april07/tutorial2.asp
  3. Who is a Data Scientist?

    http://datascopeanalytics.com/what-we-think/2014/02/05/what-is-a-data-scientist
  4. Why is it valuable?

    • “Massively customised service”
    • Data Moats are hard to copy
  5. “A day in my life”

    • “How can I turn our data into business value?”
    • Thinking about our data quality and the transformations that improve it
    • How can I better predict or classify something that's valuable?
    • Deploying, testing, documenting
  6. Starting your first project

    • Need: High value & easy problem
    • Share insight, augment data, automate a process or predict the future
    • Deliver value at the end of day 1, day 2, week 1, week 2, month 1 etc
    • Tutorials on my blog (IanOzsvald.com)
  7. Example of “insight”

    Data via: https://twitter.com/echen/status/594353863374737409
    http://ianozsvald.com/2015/05/03/talkpay-tweet-salary-visualisation/
  8. Extracting data from binary files

    • Copy/pasting PDF/PNG data is laborious
    • How can we scale it?
    • textract - unified interface
    • Apache's Tika (maybe) better
    • Specialised tools e.g. Sovren
    • Think on pipelines of transforms
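The “pipelines of transforms” idea from this slide can be sketched with plain Python functions. This is only an illustration, not the textract or Tika API: the step names (`strip_control_chars`, `normalise_whitespace`, `lowercase`, `run_pipeline`) and the sample string are hypothetical, and the raw text is assumed to have already been extracted from a PDF or image by one of the tools listed.

```python
import re
from functools import reduce


def strip_control_chars(text):
    """Drop non-printable characters that extraction often leaves behind."""
    return "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")


def normalise_whitespace(text):
    """Collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def lowercase(text):
    """Lower-case the text so later matching is case-insensitive."""
    return text.lower()


def run_pipeline(text, steps):
    """Apply each transform in order, feeding each output to the next step."""
    return reduce(lambda acc, step: step(acc), steps, text)


# Hypothetical raw output from an earlier extraction step.
raw = "Invoice\x0c  No.\t 1234\n\n TOTAL:   99.50 "
steps = [strip_control_chars, normalise_whitespace, lowercase]
cleaned = run_pipeline(raw, steps)
print(cleaned)
```

Keeping each step as a small function makes it easy to test steps in isolation and to reorder or swap them as the data changes.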
  9. Augmenting data

    • Identifying people, places, brands, sentiment
    • “i love my apple phone”
    • Context-sensitive (e.g. movies vs products)
    • Accurately count mentions & sentiment
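A deliberately naive sketch of counting mentions and sentiment follows. It assumes tiny hand-made lexicons (`BRANDS`, `POSITIVE`, `NEGATIVE` are hypothetical) rather than the trained, context-aware models a real system would need, and it exists mainly to show why context sensitivity matters.

```python
import re
from collections import Counter

# Hypothetical, hand-made lexicons for illustration only.
BRANDS = {"apple", "samsung"}
POSITIVE = {"love", "great"}
NEGATIVE = {"hate", "broken"}


def tokenize(text):
    """Split text into lower-case word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def score_mentions(messages):
    """Count brand mentions and a naive per-brand sentiment total."""
    mentions = Counter()
    sentiment = Counter()
    for message in messages:
        tokens = tokenize(message)
        positives = sum(t in POSITIVE for t in tokens)
        negatives = sum(t in NEGATIVE for t in tokens)
        for brand in BRANDS.intersection(tokens):
            mentions[brand] += 1
            sentiment[brand] += positives - negatives
    return mentions, sentiment


messages = [
    "i love my apple phone",
    "my samsung screen is broken",
    "apple pie is great",  # the fruit, not the brand
]
mentions, sentiment = score_mentions(messages)
print(mentions, sentiment)
```

The third message is counted as a positive brand mention even though it is about food, which is exactly the kind of error that keyword matching makes without context.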
  10. Predicting the unknown

    • Forecasting the future or filling the gaps
    • Demand prediction, life expectancy, price estimation
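Both “filling the gaps” and “forecasting the future” can be illustrated with the simplest possible model, a straight-line least-squares fit. The talk does not name a model, so this choice is an assumption, and the weekly demand figures and the `fit_line` helper are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical weekly demand; week 3 was never recorded.
weeks = [1, 2, 4, 5]
demand = [10.0, 12.0, 16.0, 18.0]

slope, intercept = fit_line(weeks, demand)
week_3 = slope * 3 + intercept  # fill the gap
week_6 = slope * 6 + intercept  # forecast the next week
print(week_3, week_6)
```

A baseline this simple is useful as a yardstick: a more elaborate model should only be adopted if it clearly beats it on held-out data.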
  11. Classification

    • “Is it X or is it something else?”
    • Spam, malware, lead identification, text disambiguation, fraud classification
    • Many examples online, lots of tutorials
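As one concrete instance of “is it X or is it something else?”, here is a minimal naive Bayes spam classifier written with the standard library only. The technique, the four training messages and the `train` and `classify` helpers are assumptions chosen for illustration; the slide itself does not prescribe a method.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Collect per-label word counts and label counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the most probable label, using add-one (Laplace) smoothing."""
    vocab = {word for counts in word_counts.values() for word in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data.
examples = [
    ("win cash now", "spam"),
    ("cheap cash prize", "spam"),
    ("meeting at noon", "ham"),
    ("lunch at noon tomorrow", "ham"),
]
word_counts, label_counts = train(examples)
print(classify("cash prize now", word_counts, label_counts))
print(classify("lunch meeting at noon", word_counts, label_counts))
```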
  12. More problems we can solve

    • Text topic detection
    • Duplicate detection
    • Data cleaning
    • Copyright violation (DMCA)
    • Speech recognition for call centre automation
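Duplicate detection, one of the problems listed on this slide, can be sketched with the standard library's `difflib.SequenceMatcher`. The sample records, the `find_near_duplicates` helper and the 0.9 threshold are hypothetical, and the pairwise comparison is quadratic, so this only suits small datasets.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalise(text):
    """Lower-case and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def find_near_duplicates(records, threshold=0.9):
    """Return index pairs whose normalised texts are at least `threshold` similar."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        ratio = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
        if ratio >= threshold:
            pairs.append((i, j))
    return pairs


# Hypothetical records; the first two differ only in case and spacing.
records = [
    "Acme Ltd, 12 High Street, London",
    "ACME Ltd,  12 High Street, London",
    "Widget Co, 4 Low Road, Leeds",
]
duplicates = find_near_duplicates(records)
print(duplicates)
```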
  13. First project: outline

    • Iterate on:
    • Visualise
    • Seaborn/Bokeh
    • Create milestones
    • KISS!
    • Think+hypothesise+test
    • Communicate results
    • IPython Notebook
    • (Engineer a solution)
  14. Don't Kill It!

    • Your data is missing, it is poor and it lies
    • Missing data kills projects!
    • Log everything!
    • Make data quality reports
    • R&D != Engineering
    • Discovery-based
    • Iterative
    • Success and failure equally useful
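A data quality report can start as something very small, for example the fraction of missing values per column. The sketch below uses only the `csv` module; the `MISSING` markers, the sample data and the `quality_report` helper are hypothetical, and a real report would also cover ranges, types and duplicates.

```python
import csv
import io

# Hypothetical markers that should be treated as "no value".
MISSING = {"", "na", "n/a", "null", "none"}


def quality_report(csv_text):
    """Return {column: fraction of rows where the value is missing}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {name: 0 for name in reader.fieldnames}
    rows = 0
    for row in reader:
        rows += 1
        for name in missing:
            value = row.get(name)
            if value is None or value.strip().lower() in MISSING:
                missing[name] += 1
    if rows == 0:
        return {}
    return {name: count / rows for name, count in missing.items()}


# Hypothetical extract with gaps in two columns.
sample = """user_id,age,city
1,34,Stockholm
2,,Gothenburg
3,NA,
4,29,Uppsala
"""
report = quality_report(sample)
for column, fraction in report.items():
    print(f"{column}: {fraction:.0%} missing")
```

Running a report like this on every new data delivery, and logging the result, makes it visible when a source silently degrades.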
  15. Internal deployment

    • Scripts to drive report
    • CSVs/Reports
    • Database updates
    • IPython Notebook (not secure though!)
    • Bokeh
  16. Deploying live systems

    • Spyre (locked-down)
    • Microservices
    • Flask is my go-to tool
    • Swagger docs
    • (git pull / fabric / provisioned machines)
    • Docker + Amazon ECS
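To show the shape of a prediction microservice without third-party dependencies, the sketch below uses a plain WSGI callable from the standard library instead of Flask, which the slide names as the go-to tool. The `/predict` route, the `predict` stand-in model and the `call` test helper are all hypothetical.

```python
import io
import json
from wsgiref.util import setup_testing_defaults

JSON_HEADERS = [("Content-Type", "application/json")]


def predict(features):
    """Hypothetical stand-in for a trained model: sum the inputs."""
    return sum(features)


def app(environ, start_response):
    """Tiny WSGI service: POST a JSON list of numbers to /predict."""
    is_predict = environ.get("PATH_INFO") == "/predict"
    if not is_predict or environ.get("REQUEST_METHOD") != "POST":
        start_response("404 Not Found", JSON_HEADERS)
        return [json.dumps({"error": "not found"}).encode("utf-8")]
    length = int(environ.get("CONTENT_LENGTH") or 0)
    try:
        features = json.loads(environ["wsgi.input"].read(length))
        body, status = {"prediction": predict(features)}, "200 OK"
    except (ValueError, TypeError):
        body = {"error": "expected a JSON list of numbers"}
        status = "400 Bad Request"
    start_response(status, JSON_HEADERS)
    return [json.dumps(body).encode("utf-8")]


def call(path, payload, method="POST"):
    """Invoke the app in-process, without opening a network socket."""
    environ = {}
    setup_testing_defaults(environ)
    data = json.dumps(payload).encode("utf-8")
    environ.update({
        "PATH_INFO": path,
        "REQUEST_METHOD": method,
        "CONTENT_LENGTH": str(len(data)),
        "wsgi.input": io.BytesIO(data),
    })
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], json.loads(body)


print(call("/predict", [1, 2, 3.5]))
```

For a quick local run, the same `app` can be served with `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`. The equivalent Flask service would be a single route function wrapping the same `predict` call.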
  17. Avoid Big Data if possible...

    • Don't be in a rush - 5000 lines of good data will beat a pile of Bad Big Data
    • 244GB RAM EC2+many Xeons $2.80/hr
    • Scaling options:
    • ElasticSearch + Jython/Java
    • Azure/Amazon ML
    • Apache Spark # if you have HDFS already
  18. Frågor? (Questions?)

    • We have a crazy-good selection of tools!
    • Don't worry about imposter syndrome - your business knowledge has a lot of value
    • We need data science patterns - what's your story?
    • Ask me how you can get started (I respond well to beer)
    • ianozsvald.com