PyConSE 2015 Opening Keynote "Data Science Delivered"

Opening keynote for PyCon Sweden 2015 discussing how to start, develop and deploy a successful data product using Python


May 13, 2015

  1. [email protected] @IanOzsvald PyConSE May 2015 Who Am I?
    • “Industrial Data Science” for 15 years
    • O'Reilly Author
    • Teacher at PyCons
  2. I want to encourage you to...
    • Mix “data people” and “engineers” to deliver high-value products so we can...
    • Go faster than humans
    • Be more accurate than humans
    • Be consistent and reproducible
    • I want you to become a data scientist
    Attrib: http://www.xara.com/news/april07/tutorial2.asp
  3. Who is a Data Scientist?
  4. Why is it valuable?
    • “Massively customised service”
    • Data Moats are hard to copy
  5. “A day in my life”
    • “How can I turn our data into business value?”
    • Thinking about our data quality and the transformations that improve it
    • How can I better predict or classify something that's valuable?
    • Deploying, testing, documenting
  6. Starting your first project
    • Need: High value & easy problem
    • Share insight, augment data, automate a process or predict the future
    • Deliver value at the end of day 1, day 2, week 1, week 2, month 1 etc
    • Tutorials on my blog (IanOzsvald.com)
  7. Example of “insight”
    Data via: https://twitter.com/echen/status/594353863374737409
    http://ianozsvald.com/2015/05/03/talkpay-tweet-salary-visualisation/
  8. Extracting data from binary files
    • Copy/pasting PDF/PNG data is laborious
    • How can we scale it?
    • textract - unified interface
    • Apache's Tika (maybe) better
    • Specialised tools e.g. Sovren
    • Think on pipelines of transforms
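The "pipelines of transforms" idea on the slide above can be sketched as a chain of small functions applied in order. This is a minimal illustration in plain Python, not the speaker's code: `fake_extract` is a hypothetical stand-in for a real extractor such as textract or Tika, and the other function names and the sample bytes are assumptions made for the example.

```python
import re
from functools import reduce


def fake_extract(raw_bytes):
    """Hypothetical stand-in for a real extractor (e.g. textract or Tika)."""
    return raw_bytes.decode("utf-8", errors="replace")


def normalise_whitespace(text):
    """Collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def lowercase(text):
    """Lowercase the text so later matching is case-insensitive."""
    return text.lower()


def run_pipeline(data, steps):
    """Apply each transform in order, feeding each output to the next step."""
    return reduce(lambda value, step: step(value), steps, data)


# The pipeline is just an ordered list, so steps are easy to add or swap.
PIPELINE = [fake_extract, normalise_whitespace, lowercase]
cleaned = run_pipeline(b"Invoice  No. 42\n\nTotal:\t100 SEK", PIPELINE)
print(cleaned)  # invoice no. 42 total: 100 sek
```

Keeping each step a plain function means a step can be tested on its own and replaced later, for example by swapping the stand-in extractor for a real one.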
  9. Augmenting data
    • Identifying people, places, brands, sentiment
    • “i love my apple phone”
    • Context-sensitive (e.g. movies vs products)
    • Accurately count mentions & sentiment
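Counting mentions and sentiment, as on the slide above, might start from something as crude as the sketch below. It is a toy under stated assumptions: the brand and sentiment word lists are tiny and made up for illustration, and the approach is not context-sensitive, so it would also count "apple" the fruit, which is exactly the difficulty the slide points at.

```python
import re
from collections import Counter

# Illustrative assumptions: tiny hand-made lexicons, not real resources.
BRANDS = {"apple", "samsung"}
POSITIVE = {"love", "great"}
NEGATIVE = {"hate", "broken"}


def tokenise(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def annotate(text):
    """Return brand mentions and a crude sentiment score for one message."""
    tokens = tokenise(text)
    mentions = Counter(t for t in tokens if t in BRANDS)
    # Score = number of positive words minus number of negative words.
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return {"mentions": dict(mentions), "sentiment": score}


result = annotate("i love my apple phone")
print(result)  # {'mentions': {'apple': 1}, 'sentiment': 1}
```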
  10. Predicting the unknown
    • Forecasting the future or filling the gaps
    • Demand prediction, life expectancy, price estimation
  11. Classification
    • “Is it X or is it something else?”
    • Spam, malware, lead identification, text disambiguation, fraud classification
    • Many examples online, lots of tutorials
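As one concrete instance of the "is it X or something else?" question above, here is a minimal spam-vs-ham sketch using a multinomial naive Bayes classifier with add-one smoothing, written with the standard library only. The technique and the four training messages are assumptions chosen for illustration; the talk does not name a specific algorithm, and a real project would more likely use a library such as scikit-learn and far more data.

```python
import math
import re
from collections import Counter

# Made-up training examples, purely for illustration.
TRAINING = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting notes for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]


def tokenise(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Count words per label and how many documents each label has."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenise(text))
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Pick the label with the highest log-probability (add-one smoothing)."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        # Start from the log prior, then add each word's log likelihood.
        score = math.log(label_counts[label] / total_docs)
        for word in tokenise(text):
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAINING)
print(classify("free prize inside", *model))              # spam
print(classify("notes from the monday meeting", *model))  # ham
```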
  12. More problems we can solve
    • Text topic detection
    • Duplicate detection
    • Data cleaning
    • Copyright violation (DMCA)
    • Speech recognition for call centre automation
  13. First project: outline
    • Iterate on:
    • Visualise
    • Seaborn/Bokeh
    • Create milestones
    • KISS!
    • Think+hypothesise+test
    • Communicate results
    • IPython Notebook
    • (Engineer a solution)
  14. Don't Kill It!
    • Your data is missing, it is poor and it lies
    • Missing data kills projects!
    • Log everything!
    • Make data quality reports
    • R&D != Engineering
    • Discovery-based
    • Iterative
    • Success and failure equally useful
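A data quality report, as suggested on the slide above, can begin as a simple count of missing values per column. The sketch below is one possible starting point, not the speaker's method: the records are made up, and treating `None`, the empty string and an absent key as "missing" is an assumption that would need adjusting for real data.

```python
# Made-up records; None, "" and absent keys are treated as missing (an assumption).
ROWS = [
    {"name": "Anna", "age": 34, "city": "Stockholm"},
    {"name": "Björn", "age": None, "city": ""},
    {"name": "", "age": 51},
]


def quality_report(rows):
    """Return the fraction of missing values for every column seen in the rows."""
    columns = []
    for row in rows:
        for column in row:
            if column not in columns:
                columns.append(column)
    report = {}
    for column in columns:
        # row.get() returns None when the key is absent, so that counts too.
        missing = sum(1 for row in rows if row.get(column) in (None, ""))
        report[column] = missing / len(rows)
    return report


report = quality_report(ROWS)
for column, fraction in report.items():
    print(f"{column}: {fraction:.0%} missing")
```

Running a report like this on every new data delivery gives an early warning when a source starts dropping fields.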
  15. Internal deployment
    • Scripts to drive report
    • CSVs/Reports
    • Database updates
    • IPython Notebook (not secure though!)
    • Bokeh
  16. Deploying live systems
    • Spyre (locked-down)
    • Microservices
    • Flask is my go-to tool
    • Swagger docs
    • (git pull / fabric / provisioned machines)
    • Docker + Amazon ECS
  17. Avoid Big Data if possible...
    • Don't be in a rush - 5000 lines of good data will beat a pile of Bad Big Data
    • 244GB RAM EC2+many Xeons $2.80/hr
    • Scaling options:
    • ElasticSearch + Jython/Java
    • Azure/Amazon ML
    • Apache Spark # if you have HDFS already
  18. Frågor? (Questions?)
    • We have a crazy-good selection of tools!
    • Don't worry about imposter syndrome - your business knowledge has a lot of value
    • We need data science patterns - what's your story?
    • Ask me how you can get started (I respond well to beer)
    • ianozsvald.com