Computing Division
Context
• McKinsey report
  – Technology and techniques
  – Mind the gap
• 140-160k deep analytic talent
• 1.5m data-savvy managers
  – UK Top 6 in producing talent
• EMC Global Survey
• Kaggle.com
• Our clients
Does it matter? Where’s the next generation coming from?
Source: Big data: The next frontier for innovation, competition, and productivity, http://tinyurl.com/74tdfdv
What’s been trending – courtesy of Datasift http://tinyurl.com/6vek5ge
• # views
  – 1747072 What is a data scientist? | Datablog http://t.co/tFfVvstm
  – 1537330 I love Oxford-style debates. This one at #strataconf: the data science debate: domain expertise or machine learning? http://t.co/jKGhx8AY
  – 1536264 #strataconf is amazing. Data science is the new black.
• Most popular links:
  – 281 2012-03-02 14:01 - What is a data scientist? | Datablog | News | guardian.co.uk
  – 208 2012-03-08 11:25 - bitly's Hilary Mason on "What is A Data Scientist?" - Forbes
  – 175 2012-03-02 15:24 - A Data Scientist You’ve Never Heard of Is Now the Master of Your Domain
Snapshot of you: tools in your toolbox
• Warehousing/Analytics (single response)
  – SQL -> IBM -> Oracle
• How do you manipulate your data (multi-response)
  – Excel -> SQL -> Python
• How do you analyse your data (multi-response)
  – SAS -> STATA -> SPSS (R was last)
• How do you visualise your data (multi-response)
  – MS BI tools -> Oracle -> IBM -> SAP -> Microstrategy
Snapshot of you: your traits
• You want a full set of data (53%)
• Only 13% were comfortable working with incomplete data
• “I explore the data” vs. “I report what it says” – evenly distributed
• “My findings drive decisions” vs. “I report what has happened” – evenly distributed
Why EMC?
• Award winner for enterprise development (TSIA 2011)
• Commitment to open source initiatives (Chorus)
• Relationships with 700 universities around the world – many already take our course content
  – Carnegie Mellon
  – Berkeley
Where’s the bar?
• Solid understanding of statistics
• Experience with a scripting language (Java, Perl, Python, R)
• Experience with SQL (or PSQL)
Is this for you? Practice, practice, practice
• “Tell me and I forget. Show me and I remember. Involve me and I understand”
• 40% is hands-on lab
• Take “dirty” data, tidy it up, start exploring data, basic statistics, simple plots, complex stats, beautiful graphs, build models, test models, present your findings
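To give a flavour of the first steps in that list (take “dirty” data, tidy it up, basic statistics), here is a minimal Python sketch. It is not taken from the course labs: the readings in `RAW`, the helper names `tidy` and `summarise`, and the rule that a comma is a decimal mark are all assumptions made for illustration.

```python
import statistics

# Made-up "dirty" readings: stray spaces, missing values, a decimal comma, an outlier.
RAW = ["12.5", " 13.1 ", "n/a", "", "12,9", "250", "13.4", None]


def tidy(values):
    """Return the entries that parse as numbers, as floats, in their original order."""
    clean = []
    for value in values:
        if value is None:
            continue  # missing entry
        # Assumption for this sketch: a comma is a decimal mark, not a thousands separator.
        text = str(value).strip().replace(",", ".")
        try:
            clean.append(float(text))
        except ValueError:
            continue  # unparseable entries such as "n/a" or "" are dropped
    return clean


def summarise(values):
    """Basic statistics for a list holding at least two numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


if __name__ == "__main__":
    for name, value in summarise(tidy(RAW)).items():
        print(f"{name}: {value:.2f}")
```

On this sample the median stays close to 13 while the single reading of 250 pulls the mean up to about 60, which is the kind of thing the “start exploring data” step is meant to surface before any model is built.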
Show me: what problem do I have? How do I solve it?

Problem to solve                                                      Technique (e.g.)
Need to group items by similarity                                     Clustering (k-means)
Need to discover relationships between actions or items               Association rules (Apriori)
Want to determine relationship between outcome and input variables    Regression (linear/logistic)
Want to assign (known) labels to items                                Classification (Naïve Bayes, decision trees)
Want to analyse my text                                               Regular expressions, bag of words
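Two of these techniques are small enough to sketch in plain Python: k-means for grouping items by similarity, and a regular-expression bag of words for analysing text. This is an illustrative sketch only, not the course material. The function names, the sample points and the tokenising pattern are assumptions, and real work would normally use a tested library rather than hand-rolled code.

```python
import math
import random
import re
from collections import Counter


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means (Lloyd's algorithm) on tuples of numbers.

    Returns (centroids, labels), where labels[i] is the cluster index of points[i].
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = []
        for i, cluster in enumerate(clusters):
            if cluster:
                updated.append(tuple(sum(axis) / len(cluster) for axis in zip(*cluster)))
            else:
                updated.append(centroids[i])  # keep the centroid of an empty cluster
        if updated == centroids:
            break  # converged: nothing moved
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


def bag_of_words(text):
    """Count lower-cased word tokens found with a simple regular expression."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


if __name__ == "__main__":
    # Hypothetical sample data: two visibly separate groups of 2-D points.
    sample = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
    centroids, labels = kmeans(sample, k=2)
    print(labels)
    print(bag_of_words("Data science is the new black. Data!").most_common(3))
```

The k-means sketch starts from randomly chosen input points, so on less clearly separated data the result can depend on `seed`. The bag of words ignores word order entirely, which is often enough for a first look at text.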
So what now? What about you?
• If you know all of this already – congratulations!
• If you’d like to know more – our course goes live 26 March (register at http://education.emc.com)
• If you couldn’t care less – you’re probably in the wrong room