to...
• Mix “data people” and “engineers” to deliver high-value products so we can...
  • Go faster than humans
  • Be more accurate than humans
  • Be consistent and reproducible
• I want you to become a data scientist
Attrib: http://www.xara.com/news/april07/tutorial2.asp
• “How can I turn our data into business value?”
• Thinking about our data quality and the transformations that improve it
• How can I better predict or classify something that's valuable?
• Deploying, testing, documenting
Need: High value & easy problem
• Share insight, augment data, automate a process or predict the future
• Deliver value at the end of day 1, day 2, week 1, week 2, month 1, etc.
• Tutorials on my blog (IanOzsvald.com)
• Copy/pasting PDF/PNG data is laborious
• How can we scale it?
• textract - a unified interface for extracting text from many file types
• Apache Tika is (maybe) better
• Specialised tools, e.g. Sovren
• Think in terms of pipelines of transforms
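The "pipelines of transforms" idea can be sketched as a list of small text-cleaning functions applied in order. This is a minimal illustration, not the tooling used in the talk: the transform names, `run_pipeline` and `PIPELINE` are hypothetical, and the input is assumed to be text that an extractor has already produced (with textract installed, that would typically be `textract.process(path).decode("utf-8")`).

```python
import re
from functools import reduce
from typing import Callable, List

# A transform takes text and returns cleaned text.
Transform = Callable[[str], str]


def strip_control_chars(text: str) -> str:
    """Remove control characters (e.g. form feeds) that PDF extraction often leaves."""
    return re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)


def dehyphenate(text: str) -> str:
    """Rejoin words that were hyphenated across a line break."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def normalise_whitespace(text: str) -> str:
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def run_pipeline(text: str, transforms: List[Transform]) -> str:
    """Apply each transform in order, feeding the output of one into the next."""
    return reduce(lambda acc, transform: transform(acc), transforms, text)


# Order matters: dehyphenate needs the newlines that normalise_whitespace removes.
PIPELINE = [strip_control_chars, dehyphenate, normalise_whitespace]

if __name__ == "__main__":
    raw = "Senior  data\x0c scien-\ntist,\n London "
    print(run_pipeline(raw, PIPELINE))
```

Keeping each step as a separate function means a step can be tested, logged or swapped on its own when a new document type turns up.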
data is missing, it is poor and it lies
• Missing data kills projects!
• Log everything!
• Make data quality reports
• R&D != Engineering
• Discovery-based
• Iterative
• Success and failure equally useful
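For the "make data quality reports" and "log everything" points, a first report can be as simple as counting missing values per column and logging the result. The sketch below uses only the standard library; the function name `quality_report`, the `MISSING` marker set and the sample columns are assumptions made for illustration, not something taken from the talk.

```python
import csv
import io
import logging
from collections import Counter

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("quality")

# Assumed markers for "missing"; adjust to whatever your data actually uses.
MISSING = {"", "na", "n/a", "null", "none", "-"}


def quality_report(csv_text: str) -> dict:
    """Return {column: {"missing": count, "fraction": share of rows}} for CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    missing = Counter()
    n_rows = 0
    for row in reader:
        n_rows += 1
        for col in fieldnames:
            value = row.get(col)
            # Short rows yield None; otherwise compare against the marker set.
            if value is None or value.strip().lower() in MISSING:
                missing[col] += 1
    report = {
        col: {
            "missing": missing[col],
            "fraction": missing[col] / n_rows if n_rows else 0.0,
        }
        for col in fieldnames
    }
    for col, stats in report.items():
        log.info("%s: %d of %d rows missing", col, stats["missing"], n_rows)
    return report


if __name__ == "__main__":
    sample = "name,salary,city\nAnn,50000,London\nBob,,N/A\nCy,NULL,Leeds\n"
    quality_report(sample)
```

Running a report like this on every new data delivery, and keeping the logs, makes it visible early when a column quietly goes missing.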
• Don't be in a rush - 5000 lines of good data will beat a pile of Bad Big Data
• An EC2 instance with 244GB RAM and many Xeons costs $2.80/hr
• Scaling options:
  • ElasticSearch + Jython/Java
  • Azure/Amazon ML
  • Apache Spark (if you already have HDFS)
crazy-good selection of tools!
• Don't worry about imposter syndrome - your business knowledge has a lot of value
• We need data science patterns - what's your story?
• Ask me how you can get started (I respond well to beer)
• ianozsvald.com