[email protected] @IanOzsvald PyConSE May 2015

I want to encourage you to...
● Mix “data people” and “engineers” to deliver high-value products so we can...
  ● Go faster than humans
  ● Be more accurate than humans
  ● Be consistent and reproducible
● I want you to become a data scientist
Attrib: http://www.xara.com/news/april07/tutorial2.asp

Who is a Data Scientist?
http://datascopeanalytics.com/what-we-think/2014/02/05/what-is-a-data-scientist

“A day in my life”
● “How can I turn our data into business value?”
● Thinking about our data quality and the transformations that improve it
● How can I better predict or classify something that's valuable?
● Deploying, testing, documenting

Starting your first project
● Need: High value & easy problem
● Share insight, augment data, automate a process or predict the future
● Deliver value at the end of day 1, day 2, week 1, week 2, month 1 etc.
● Tutorials on my blog (IanOzsvald.com)

Example of “insight”
Data via: https://twitter.com/echen/status/594353863374737409
http://ianozsvald.com/2015/05/03/talkpay-tweet-salary-visualisation/

Extracting data from binary files
● Copy/pasting PDF/PNG data is laborious
● How can we scale it?
● textract: a unified interface
● Apache Tika is (maybe) better
● Specialised tools, e.g. Sovren
● Think in terms of pipelines of transforms
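
The "pipelines of transforms" idea can be sketched as a chain of small functions, each taking text and returning text. This is a minimal sketch, assuming the raw text has already been pulled out of the PDF/PNG by an extraction tool such as textract or Tika. The sample text and the function names (`normalise_whitespace`, `drop_page_numbers`, `drop_blank_lines`, `run_pipeline`) are hypothetical and not part of any library.

```python
import re
from functools import reduce


def normalise_whitespace(text):
    """Collapse runs of spaces/tabs and trim each line."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(lines)


def drop_page_numbers(text):
    """Remove lines that contain only a page marker such as 'Page 3'."""
    kept = [ln for ln in text.splitlines() if not re.fullmatch(r"Page \d+", ln)]
    return "\n".join(kept)


def drop_blank_lines(text):
    """Remove empty lines left behind by earlier steps."""
    return "\n".join(ln for ln in text.splitlines() if ln)


def run_pipeline(text, transforms):
    """Apply each transform in order, feeding the output of one into the next."""
    return reduce(lambda acc, fn: fn(acc), transforms, text)


# Hypothetical text as it might come out of an extraction tool.
raw = "Invoice   2015-05\n\n  Page 1  \nTotal:\t100 SEK\n"
clean = run_pipeline(raw, [normalise_whitespace, drop_page_numbers, drop_blank_lines])
print(clean)
```

Keeping each step as a plain function makes it easy to test steps in isolation and to reorder or swap them as you learn more about the documents.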

Augmenting data
● Identifying people, places, brands, sentiment
● “i love my apple phone”
● Context-sensitive (e.g. movies vs products)
● Accurately count mentions & sentiment
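
To make the "apple phone" example concrete, here is a deliberately naive sketch of counting brand mentions and their sentiment. The word lists and function names are hypothetical and far too small for real use; a production system would likely use a trained entity recogniser and sentiment model rather than lookups like these.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny lexicons for illustration only.
POSITIVE = {"love", "great", "good"}
NEGATIVE = {"hate", "broken", "bad"}
BRAND_CONTEXT = {"phone", "iphone", "laptop", "mac"}


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def mentions_apple_brand(tokens):
    """Treat 'apple' as the brand only if a product word is in the same message."""
    return "apple" in tokens and bool(BRAND_CONTEXT & set(tokens))


def sentiment(tokens):
    """Positive minus negative word count: >0 positive, <0 negative."""
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)


def summarise(messages):
    """Count brand mentions by sentiment label."""
    counts = Counter()
    for msg in messages:
        tokens = tokenize(msg)
        if mentions_apple_brand(tokens):
            score = sentiment(tokens)
            label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
            counts[label] += 1
    return counts


messages = [
    "i love my apple phone",
    "my apple laptop is broken",
    "i ate a good apple",  # fruit, not the brand: ignored
]
print(summarise(messages))
```

The third message shows why context matters: without the product-word check it would be counted as a positive brand mention.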

Predicting the unknown
● Forecasting the future or filling the gaps
● Demand prediction, life expectancy, price estimation
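
As one simple instance of demand prediction, the sketch below fits a straight line to past values with ordinary least squares and extrapolates one step ahead. The weekly demand figures are made up for illustration, and a straight line is only an assumption about the trend; real demand data usually needs a richer model.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical weekly demand (units sold) for weeks 1..5.
weeks = [1, 2, 3, 4, 5]
demand = [12, 14, 16, 18, 20]

slope, intercept = fit_line(weeks, demand)
forecast_week_6 = slope * 6 + intercept
print(forecast_week_6)  # 22.0 for this perfectly linear toy data
```

The same fitted line can also "fill the gaps" by evaluating it at a week whose value is missing.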

Classification
● “Is it X or is it something else?”
● Spam, malware, lead identification, text disambiguation, fraud classification
● Many examples online, lots of tutorials
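
The "is it X or something else?" question can be illustrated with a tiny spam classifier. This is a minimal sketch of multinomial Naive Bayes with add-one smoothing, written with the standard library only; the training messages are hypothetical, and in practice you would more likely reach for an existing machine-learning library and far more data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the highest log-probability (add-one smoothing)."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data, far too small for real use.
examples = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
word_counts, label_counts = train(examples)
print(classify("free prize money", word_counts, label_counts))
print(classify("agenda for the team meeting", word_counts, label_counts))
```

Working in log-probabilities avoids numeric underflow when messages get long, and the smoothing stops a single unseen word from zeroing out a label.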

More problems we can solve
● Text topic detection
● Duplicate detection
● Data cleaning
● Copyright violation (DMCA)
● Speech recognition for call centre automation
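
Duplicate detection is one of the easier items on this list to prototype. The sketch below flags near-duplicate records by comparing their word sets with Jaccard similarity; the records, the 0.8 threshold and the function names are assumptions chosen for illustration.

```python
import re
from itertools import combinations


def word_set(text):
    """Lower-case the text and keep alphanumeric words, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_near_duplicates(records, threshold=0.8):
    """Return index pairs whose word-set similarity is at or above threshold."""
    sets = [word_set(r) for r in records]
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if jaccard(sets[i], sets[j]) >= threshold
    ]


# Hypothetical address records; the first two differ only in case and punctuation.
records = [
    "Acme Ltd, 12 High Street, London",
    "ACME Ltd. 12 High Street London",
    "Bolt Engineering, 4 Mill Lane, Leeds",
]
pairs = find_near_duplicates(records)
print(pairs)
```

Comparing every pair is quadratic in the number of records, so this only suits small datasets; larger ones need some form of blocking or hashing first.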

Don't Kill It!
● Your data is missing, it is poor and it lies
● Missing data kills projects!
● Log everything!
● Make data quality reports
● R&D != Engineering
  ● Discovery-based
  ● Iterative
  ● Success and failure equally useful
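
A data quality report can start very small. The sketch below reports the fraction of missing values per column in a CSV file, using only the standard library. The sample data, the choice of `""` and `"NA"` as missing markers and the `quality_report` name are assumptions for illustration.

```python
import csv
import io

# Hypothetical CSV export with gaps; empty strings and "NA" count as missing.
RAW = """customer_id,age,city
1,34,London
2,,Stockholm
3,NA,
4,51,Malmo
"""

MISSING = {"", "NA"}


def quality_report(csv_text):
    """Return {column: fraction of rows where the value is missing}."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    report = {}
    for column in rows[0].keys():
        missing = sum(1 for row in rows if row[column].strip() in MISSING)
        report[column] = missing / len(rows)
    return report


report = quality_report(RAW)
for column, fraction in report.items():
    print(f"{column}: {fraction:.0%} missing")
```

Running a report like this on every new data delivery, and logging the result, makes it visible early when a column starts going missing.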

Avoid Big Data if possible...
● Don't be in a rush - 5000 lines of good data will beat a pile of Bad Big Data
● A 244GB RAM EC2 instance with many Xeon cores costs $2.80/hr
● Scaling options:
  ● ElasticSearch + Jython/Java
  ● Azure/Amazon ML
  ● Apache Spark (if you already have HDFS)

Questions?
● We have a crazy-good selection of tools!
● Don't worry about imposter syndrome - your business knowledge has a lot of value
● We need data science patterns - what's your story?
● Ask me how you can get started (I respond well to beer)
● ianozsvald.com