PyConSE 2015 Opening Keynote "Data Science Delivered"

Data Science Deployed
Turning raw data into valuable services
Ian Ozsvald @IanOzsvald ModelInsight.io

[email protected] @IanOzsvald PyConSE May 2015
Who Am I?
• “Industrial Data Science” for 15 years
• O'Reilly Author
• Teacher at PyCons

PyDataLondon Meetups

I want to encourage you to...
• Mix “data people” and “engineers” to deliver high-value products so we can...
• Go faster than humans
• Be more accurate than humans
• Be consistent and reproducible
• I want you to become a data scientist
Attrib: http://www.xara.com/news/april07/tutorial2.asp

Who is a Data Scientist?
http://datascopeanalytics.com/what-we-think/2014/02/05/what-is-a-data-scientist

Why 'now'?
http://en.wikipedia.org/wiki/List_of_countries_by_number_of_Internet_users

Why is it valuable?
• “Massively customised service”
• Data Moats are hard to copy

Why is it valuable?

“A day in my life”
• “How can I turn our data into business value?”
• Thinking about our data quality and the transformations that improve it
• How can I better predict or classify something that's valuable?
• Deploying, testing, documenting

Starting your first project
• Need: High value & easy problem
• Share insight, augment data, automate a process or predict the future
• Deliver value at the end of day 1, day 2, week 1, week 2, month 1 etc.
• Tutorials on my blog (IanOzsvald.com)

Example of “insight”
Data via: https://twitter.com/echen/status/594353863374737409
http://ianozsvald.com/2015/05/03/talkpay-tweet-salary-visualisation/

Example of “insight”

Extracting data from binary files
• Copy/pasting PDF/PNG data is laborious
• How can we scale it?
• textract - unified interface
• Apache's Tika is (maybe) better
• Specialised tools e.g. Sovren
• Think in terms of pipelines of transforms
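
The "pipelines of transforms" idea can be sketched in plain Python. This is only an illustration under stated assumptions, not code from the talk: the function names and the sample string are made up, and `raw` stands in for whatever text an extractor such as textract or Tika would hand back.

```python
import re
from functools import reduce


def strip_control_chars(text):
    """Drop non-printable characters that often survive PDF/OCR extraction."""
    return "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")


def rejoin_hyphenated(text):
    """Rejoin words that were split across a line break with a hyphen."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def normalise_whitespace(text):
    """Collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def run_pipeline(text, steps):
    """Apply each transform in order, feeding one step's output into the next."""
    return reduce(lambda acc, step: step(acc), steps, text)


# Hypothetical pipeline: each step is small and can be tested on its own.
PIPELINE = [strip_control_chars, rejoin_hyphenated, normalise_whitespace]

# Made-up sample standing in for extracted text (form feed, hyphenation, odd spacing).
raw = "Invoice\x0c total:\n  1,250   SEK for data trans-\nformation"
clean = run_pipeline(raw, PIPELINE)
print(clean)
```

Keeping each transform as its own small function makes it easy to log what every step changed and to swap a step when a new file type turns up.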

Optical Character Recognition

Augmenting data
• Identifying people, places, brands, sentiment
• “i love my apple phone”
• Context-sensitive (e.g. movies vs products)
• Accurately count mentions & sentiment
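
A toy sketch of context-sensitive mention counting, using the slide's “i love my apple phone” example. Everything here is an assumption made for illustration: the word lists, the rule that a brand only counts when a product word appears in the same message, and the crude sentiment score. A real system would more likely use a trained entity and sentiment model.

```python
from collections import Counter

# Hypothetical lookup tables, chosen only for this illustration.
BRAND_CONTEXT = {"apple": {"phone", "iphone", "mac", "laptop", "watch"}}
POSITIVE = {"love", "great", "like"}
NEGATIVE = {"hate", "awful", "broken"}


def tokenise(text):
    """Lower-case and strip simple punctuation from whitespace-split words."""
    return [tok.strip(".,!?\"'").lower() for tok in text.split()]


def brand_mentions(tokens):
    """Count a brand only when a product word appears in the same message."""
    found = Counter()
    token_set = set(tokens)
    for brand, context_words in BRAND_CONTEXT.items():
        if brand in token_set and token_set & context_words:
            found[brand] += tokens.count(brand)
    return found


def sentiment(tokens):
    """Crude score: number of positive words minus number of negative words."""
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)


messages = [
    "i love my apple phone",
    "I ate an apple for lunch",
    "My apple laptop is broken!",
]

totals = Counter()
scores = {}
for msg in messages:
    toks = tokenise(msg)
    totals.update(brand_mentions(toks))
    scores[msg] = sentiment(toks)

print(totals)
print(scores)
```

The fruit in the second message is not counted as a brand mention, which is the kind of disambiguation that keeps mention counts honest.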

Augmenting images

Predicting the unknown
• Forecasting the future or filling the gaps
• Demand prediction, life expectancy, price estimation
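
Both "filling the gaps" and "forecasting the future" can be illustrated with the simplest possible model, a straight-line fit. This is a sketch under assumptions: the weekly demand figures are invented (and deliberately lie on a perfect line), and ordinary least squares is a stand-in technique, not the method used in the talk.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up weekly demand figures; week 4 is missing (a gap to fill).
weeks = [1, 2, 3, 5, 6]
demand = [110, 120, 130, 150, 160]

slope, intercept = fit_line(weeks, demand)


def predict(week):
    """Estimate demand for any week from the fitted line."""
    return slope * week + intercept


gap_fill = predict(4)  # fill the missing week inside the observed range
forecast = predict(7)  # extrapolate one week beyond the data
print(round(gap_fill, 1), round(forecast, 1))
```

Real demand data is noisy, so a first project would also report how far the fitted line sits from the observed points before trusting either number.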

Predicting the unknown

Gaussian Process price estimates

Classification
• “Is it X or is it something else?”
• Spam, malware, lead identification, text disambiguation, fraud classification
• Many examples online, lots of tutorials
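
The “Is it X or is it something else?” question can be sketched with a tiny spam classifier. This is an illustration only: the four training messages are invented, and the multinomial Naive Bayes with Laplace smoothing shown here is one common textbook approach, not necessarily what the talk demonstrated. In practice a library such as scikit-learn would replace the hand-written functions.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (text, label). Returns per-label word counts and label counts."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Pick the label with the highest log-probability, using Laplace smoothing."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        counts = word_counts[label]
        n_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # log prior
        for word in text.lower().split():
            # +1 smoothing so unseen words do not zero out the probability
            score += math.log((counts[word] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data for illustration.
examples = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
model = train(examples)

print(classify("free prize now", *model))
print(classify("agenda for the team meeting", *model))
```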

Digit classification

More problems we can solve
• Text topic detection
• Duplicate detection
• Data cleaning
• Copyright violation (DMCA)
• Speech recognition for call centre automation

Tooling
IDE: Spyder (PyCharm)
Notebooks: great for tutorials & demos, not as an IDE

First project: outline
• Iterate on:
• Visualise
• Seaborn/Bokeh
• Create milestones
• KISS!
• Think+hypothesise+test
• Communicate results
• IPython Notebook
• (Engineer a solution)

Don't Kill It!
• Your data is missing, it is poor and it lies
• Missing data kills projects!
• Log everything!
• Make data quality reports
• R&D != Engineering
• Discovery-based
• Iterative
• Success and failure equally useful
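
A "data quality report" can start very small. The sketch below counts missing cells per column and flags values that are present but implausible. The CSV content, the column names and the age range check are all invented for illustration.

```python
import csv
import io

# Made-up CSV with gaps and an implausible value, standing in for real data.
RAW = """customer_id,age,country
1,34,SE
2,,UK
3,290,
4,41,SE
"""


def quality_report(rows, columns):
    """Count missing (empty) cells per column and the share they represent."""
    report = {}
    for col in columns:
        missing = sum(1 for row in rows if not (row[col] or "").strip())
        report[col] = {
            "missing": missing,
            "missing_pct": 100.0 * missing / len(rows),
        }
    return report


reader = csv.DictReader(io.StringIO(RAW))
rows = list(reader)
report = quality_report(rows, reader.fieldnames)

# A simple range check catches values that are present but "lie".
suspect_ages = [
    r["customer_id"] for r in rows if r["age"] and not 0 < int(r["age"]) < 120
]

for col, stats in report.items():
    print(col, stats)
print("suspect ages for customers:", suspect_ages)
```

Running a report like this on every new data delivery, and logging the result, makes it visible when quality drifts before it quietly breaks a model.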

Internal deployment
• Scripts to drive report
• CSVs/Reports
• Database updates
• IPython Notebook (not secure though!)
• Bokeh

Deploying live systems
• Spyre (locked-down)
• Microservices
• Flask is my go-to tool
• Swagger docs
• (git pull / fabric / provisioned machines)
• Docker + Amazon ECS
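
A minimal sketch of a Flask microservice that wraps a model behind a JSON endpoint. The route name `/estimate`, the `area_m2` field and the `estimate_price` rule are hypothetical placeholders for whatever trained model the service would really expose; Flask itself must be installed separately.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def estimate_price(area_m2):
    """Placeholder model: a made-up linear rule standing in for a trained model."""
    return 1000.0 + 50.0 * area_m2


@app.route("/estimate", methods=["POST"])
def estimate():
    # silent=True returns None instead of raising on a missing or invalid JSON body
    payload = request.get_json(silent=True) or {}
    try:
        area = float(payload["area_m2"])
    except (KeyError, TypeError, ValueError):
        return jsonify(error="expected JSON with a numeric 'area_m2' field"), 400
    return jsonify(area_m2=area, estimate=estimate_price(area))
```

Keeping the model call in its own function means the HTTP layer stays thin and the model can be tested without a web server. Flask's built-in server is suitable for development only; a live deployment would sit behind a production WSGI server.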

flask-restful-swagger

Avoid Big Data if possible...
• Don't be in a rush - 5000 lines of good data will beat a pile of Bad Big Data
• An EC2 machine with 244GB RAM and many Xeons costs $2.80/hr
• Scaling options:
• ElasticSearch + Jython/Java
• Azure/Amazon ML
• Apache Spark # if you have HDFS already

Frågor? (Questions?)
• We have a crazy-good selection of tools!
• Don't worry about imposter syndrome - your business knowledge has a lot of value
• We need data science patterns - what's your story?
• Ask me how you can get started (I respond well to beer)
• ianozsvald.com
