Slide 1

Slide 1 text

Data Science Deployed
Turning raw data into valuable services
Ian Ozsvald @IanOzsvald ModelInsight.io

Slide 2

Slide 2 text

Ian.Ozsvald@ModelInsight.io @IanOzsvald PyConSE May 2015
Who Am I?
● “Industrial Data Science” for 15 years
● O'Reilly Author
● Teacher at PyCons

Slide 3

Slide 3 text

PyDataLondon Meetups

Slide 4

Slide 4 text

I want to encourage you to...
● Mix “data people” and “engineers” to deliver high-value products so we can...
● Go faster than humans
● Be more accurate than humans
● Be consistent and reproducible
● I want you to become a data scientist
Attrib: http://www.xara.com/news/april07/tutorial2.asp

Slide 5

Slide 5 text

Who is a Data Scientist?
http://datascopeanalytics.com/what-we-think/2014/02/05/what-is-a-data-scientist

Slide 6

Slide 6 text

Why 'now'?
http://en.wikipedia.org/wiki/List_of_countries_by_number_of_Internet_users

Slide 7

Slide 7 text

Why is it valuable?
● “Massively customised service”
● Data Moats are hard to copy

Slide 8

Slide 8 text

Why is it valuable?

Slide 9

Slide 9 text

“A day in my life”
● “How can I turn our data into business value?”
● Thinking about our data quality and the transformations that improve it
● How can I better predict or classify something that's valuable?
● Deploying, testing, documenting

Slide 10

Slide 10 text

Starting your first project
● Need: High value & easy problem
● Share insight, augment data, automate a process or predict the future
● Deliver value at the end of day 1, day 2, week 1, week 2, month 1 etc.
● Tutorials on my blog (IanOzsvald.com)

Slide 11

Slide 11 text

Example of “insight”
Data via: https://twitter.com/echen/status/594353863374737409
http://ianozsvald.com/2015/05/03/talkpay-tweet-salary-visualisation/

Slide 12

Slide 12 text

Example of “insight”

Slide 13

Slide 13 text

Example of “insight”

Slide 14

Slide 14 text

Extracting data from binary files
● Copy/pasting PDF/PNG data is laborious
● How can we scale it?
● textract - unified interface
● Apache Tika is (maybe) better
● Specialised tools e.g. Sovren
● Think in terms of pipelines of transforms
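The “pipelines of transforms” idea from this slide can be sketched as a list of small text-cleaning functions applied in order. This is a minimal illustration using only the standard library, not the speaker's code: the function names and the sample string are hypothetical, and the raw text stands in for whatever an extraction tool such as textract or Tika would return.

```python
import re
from functools import reduce


def fix_hyphenation(text):
    """Rejoin words that were split across lines, e.g. 'Cus-\\ntomer'."""
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)


def normalise_whitespace(text):
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def lowercase(text):
    """Lower-case the text so later matching is case-insensitive."""
    return text.lower()


def run_pipeline(text, steps):
    """Apply each transform in order, feeding each output into the next step."""
    return reduce(lambda acc, step: step(acc), steps, text)


# Hypothetical raw output from an extraction tool (assumption, for illustration)
raw = "Invoice  No: 1234\nCus-\ntomer:   ACME  Ltd\n"

# Order matters: hyphenation is repaired before newlines are collapsed
steps = [fix_hyphenation, normalise_whitespace, lowercase]
cleaned = run_pipeline(raw, steps)
print(cleaned)
```

Keeping each transform as a separate function makes it easy to test steps individually and to reorder or swap them as the data changes.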

Slide 15

Slide 15 text

Optical Character Recognition

Slide 16

Slide 16 text

Optical Character Recognition

Slide 17

Slide 17 text

Augmenting data
● Identifying people, places, brands, sentiment
● “i love my apple phone”
● Context-sensitive (e.g. movies vs products)
● Accurately count mentions & sentiment
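To make the counting problem concrete, here is a deliberately naive sketch of tallying brand mentions and sentiment with word lists. The lexicons and messages are made up for illustration (only “i love my apple phone” comes from the slide), and a real system would need context-aware disambiguation rather than plain word matching.

```python
import re
from collections import Counter

# Hypothetical, tiny lexicons for illustration only
BRANDS = {"apple", "samsung"}
POSITIVE = {"love", "great", "good"}
NEGATIVE = {"hate", "broken", "bad"}


def tokenize(text):
    """Split a message into lower-case word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def score_message(text):
    """Return (set of brands mentioned, sentiment score) for one message."""
    tokens = tokenize(text)
    brands = {t for t in tokens if t in BRANDS}
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return brands, score


messages = [
    "i love my apple phone",
    "my samsung screen is broken",
    "apple pie is great",  # a fruit, not a brand: naive matching miscounts this
]

mentions = Counter()
sentiment = Counter()
for msg in messages:
    brands, score = score_message(msg)
    for brand in brands:
        mentions[brand] += 1
        sentiment[brand] += score

print(mentions)
print(sentiment)
```

The third message is counted as a positive mention of the brand, which is exactly the context-sensitivity problem the slide points at.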

Slide 18

Slide 18 text

Augmenting images

Slide 19

Slide 19 text

Predicting the unknown
● Forecasting the future or filling the gaps
● Demand prediction, life expectancy, price estimation
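As a small worked example of “forecasting the future or filling the gaps”, the sketch below fits a straight line by ordinary least squares and uses it both to fill a missing week and to predict the next one. The demand figures are invented, and a straight line is an assumption chosen for simplicity; the talk does not say which model was used for these cases.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up weekly demand figures; week 3 was never recorded
weeks = [1, 2, 4, 5]
demand = [110.0, 120.0, 140.0, 150.0]

slope, intercept = fit_line(weeks, demand)


def predict(week):
    """Estimate demand for any week from the fitted line."""
    return slope * week + intercept


filled_gap = predict(3)  # fill the missing week
forecast = predict(6)    # forecast the next, unseen week
print(filled_gap, forecast)
```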

Slide 20

Slide 20 text

Predicting the unknown

Slide 21

Slide 21 text

Gaussian Process price estimates

Slide 22

Slide 22 text

Classification
● “Is it X or is it something else?”
● Spam, malware, lead identification, text disambiguation, fraud classification
● Many examples online, lots of tutorials
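The “is it X or is it something else?” question can be illustrated with a toy spam classifier. This is a minimal sketch of multinomial naive Bayes with add-one smoothing written in plain Python so it runs without extra packages; the training messages are invented, and in practice a library implementation trained on far more data would be the sensible choice.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count words per label and documents per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the highest log-probability for the text."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Log prior: how common this label is in the training data
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training data, for illustration only
training = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
model = train(training)

print(classify("free prize money", *model))
print(classify("monday team meeting", *model))
```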

Slide 23

Slide 23 text

Digit classification

Slide 24

Slide 24 text

More problems we can solve
● Text topic detection
● Duplicate detection
● Data cleaning
● Copyright violation (DMCA)
● Speech recognition for call centre automation

Slide 25

Slide 25 text

Tooling
IDE: Spyder (PyCharm)
Notebooks great for tutorials & demos, not as an IDE

Slide 26

Slide 26 text

First project: outline
● Iterate on:
● Visualise
● Seaborn/Bokeh
● Create milestones
● KISS!
● Think+hypothesise+test
● Communicate results
● IPython Notebook
● (Engineer a solution)

Slide 27

Slide 27 text

Don't Kill It!
● Your data is missing, it is poor and it lies
● Missing data kills projects!
● Log everything!
● Make data quality reports
● R&D != Engineering
● Discovery-based
● Iterative
● Success and failure equally useful
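One simple form a data quality report can take is the fraction of missing values per column. The sketch below is an assumption about what such a report might contain, not the speaker's own tooling: the CSV content, the column names and the list of markers treated as “missing” are all made up for illustration.

```python
import csv
import io

# Made-up CSV standing in for a real export (assumption, for illustration)
raw_csv = """customer_id,age,city
1,34,London
2,,Stockholm
3,N/A,
4,29,London
"""

# Hypothetical markers that should count as missing values
MISSING = {"", "n/a", "na", "null", "none"}


def quality_report(rows):
    """Return {column: fraction of rows where the value is missing}."""
    rows = list(rows)
    report = {}
    for column in rows[0].keys():
        missing = sum(
            1 for row in rows if (row[column] or "").strip().lower() in MISSING
        )
        report[column] = missing / len(rows)
    return report


report = quality_report(csv.DictReader(io.StringIO(raw_csv)))
for column, fraction in report.items():
    print(f"{column}: {fraction:.0%} missing")
```

Running a report like this on every new data delivery, and logging the result, gives an early warning when a source starts to degrade.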

Slide 28

Slide 28 text

Internal deployment
● Scripts to drive report
● CSVs/Reports
● Database updates
● IPython Notebook (not secure though!)
● Bokeh

Slide 29

Slide 29 text

Deploying live systems
● Spyre (locked-down)
● Microservices
● Flask is my go-to tool
● Swagger docs
● (git pull / fabric / provisioned machines)
● Docker + Amazon ECS

Slide 30

Slide 30 text

flask-restful-swagger

Slide 31

Slide 31 text

Avoid Big Data if possible...
● Don't be in a rush - 5000 lines of good data will beat a pile of Bad Big Data
● 244GB RAM EC2+many Xeons $2.80/hr
● Scaling options:
● ElasticSearch + Jython/Java
● Azure/Amazon ML
● Apache Spark # if you have HDFS already

Slide 32

Slide 32 text

Frågor? (Questions?)
● We have a crazy-good selection of tools!
● Don't worry about imposter syndrome - your business knowledge has a lot of value
● We need data science patterns - what's your story?
● Ask me how you can get started (I respond well to beer)
● ianozsvald.com