Slide 1

Slide 1 text

How to develop a data scientist
What business has requested
Big Data Meet Up, 21 March 2012
Brendan Moran, EMC Data Scientist
© Copyright 2010 EMC Corporation. All rights reserved. Data Computing Division

Slide 2

Slide 2 text

Context
•  McKinsey report
   –  Technology and techniques
   –  Mind the gap
      •  140-160k deep analytic talent
      •  1.5m data savvy managers
   –  UK Top 6 in producing talent
•  EMC Global Survey
•  Kaggle.com
•  Our clients

Slide 3

Slide 3 text

Does it matter?
Where’s the next generation coming from?
Source: Big data: The next frontier for innovation, competition, and productivity, http://tinyurl.com/74tdfdv

Slide 4

Slide 4 text

What about the UK?
Source: Big data: The next frontier for innovation, competition, and productivity, http://tinyurl.com/74tdfdv

Slide 5

Slide 5 text

What’s been trending, courtesy of Datasift (http://tinyurl.com/6vek5ge)
•  # views
   –  1747072: What is a data scientist? | Datablog http://t.co/tFfVvstm
   –  1537330: I love Oxford-style debates. This one at #strataconf: the data science debate: domain expertise or machine learning? http://t.co/jKGhx8AY
   –  1536264: #strataconf is amazing. Data science is the new black.
•  Most popular links:
   –  281 (2012-03-02 14:01): What is a data scientist? | Datablog | News | guardian.co.uk
   –  208 (2012-03-08 11:25): bitly's Hilary Mason on "What is A Data Scientist?" - Forbes
   –  175 (2012-03-02 15:24): A Data Scientist You’ve Never Heard of Is Now the Master of Your Domain

Slide 6

Slide 6 text

Who do we have in tonight?
Show of hands…
•  Are you a data scientist?
   –  Beginner
   –  Proficient
   –  Expert?

Slide 7

Slide 7 text

Snapshot of you
Tools in your toolbox
•  Warehousing/Analytics (single response)
   –  SQL -> IBM -> Oracle
•  How do you manipulate your data (multi-response)
   –  Excel -> SQL -> Python
•  How do you analyse your data (multi-response)
   –  SAS -> STATA -> SPSS (R was last)
•  How do you visualise your data (multi-response)
   –  MS BI tools -> Oracle -> IBM -> SAP -> MicroStrategy

Slide 8

Slide 8 text

Snapshot of you
Your traits
•  You want a full set of data (53%)
•  Only 13% were comfortable working with incomplete data
•  “I explore the data” vs “I report what it says”: evenly distributed
•  “My findings drive decisions” vs “My findings report what has happened”: evenly distributed

Slide 9

Slide 9 text

Why EMC?
•  Award winner for enterprise development (TSIA 2011)
•  Commitment to open source initiatives (Chorus)
•  Relationships with 700 universities around the world, many of which already take our course content
   •  Carnegie Mellon
   •  Berkeley

Slide 10

Slide 10 text

How did we form the content?
•  Experts
•  Kaggle
•  Universities
•  Our enterprise clients

Slide 11

Slide 11 text

Guiding principles
•  Open source: R, RStudio, SQL, Python
•  Vendor neutral
•  No licensing implications (important for universities)
   –  Community editions of MPP DB, Hadoop
•  Applied learning: lots of labs (~40% of the time)
•  Foundation course

Slide 12

Slide 12 text

Where’s the bar?
•  Solid understanding of statistics
•  Experience with a scripting language (Java, Perl, Python, R)
•  Experience with SQL (or PSQL)

Slide 13

Slide 13 text

What’s on the course?

Slide 14

Slide 14 text

Is this for you?
Practice, practice, practice
•  “Tell me and I forget. Show me and I remember. Involve me and I understand”
•  40% is hands-on lab
•  Take “dirty” data, tidy it up, start exploring data, basic statistics, simple plots, complex stats, beautiful graphs, build models, test models, present your findings
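
The first steps of that lab workflow (take "dirty" data, tidy it up, compute basic statistics) might look like the minimal sketch below. It uses Python, one of the course's open-source tools, with the standard library only. The raw values, the `tidy` helper and the 0 to 168 plausibility range are assumptions made up for illustration; they do not come from the course material.

```python
import statistics

# Hypothetical "dirty" column of weekly hours: stray whitespace, a thousands
# separator, blanks, a text placeholder and a missing value (all made up).
raw_hours = [" 40", "38.5", "", "n/a", "45 ", "1,200", None, "40"]


def tidy(values, lower=0.0, upper=168.0):
    """Parse numeric strings and keep only plausible weekly hours."""
    cleaned = []
    for value in values:
        if value is None:
            continue  # missing value
        text = value.strip().replace(",", "")
        try:
            number = float(text)
        except ValueError:
            continue  # "" and "n/a" cannot be parsed as numbers
        if lower <= number <= upper:
            cleaned.append(number)  # drops 1200.0 as implausible
    return cleaned


hours = tidy(raw_hours)

# Basic statistics on the tidied column.
summary = {
    "n": len(hours),
    "mean": statistics.mean(hours),
    "median": statistics.median(hours),
    "stdev": statistics.stdev(hours),
}
print(hours)
print(summary)
```

Silently dropping unparseable or implausible values keeps the sketch short; in a real lab you would probably log or count what was discarded before moving on to plots and models.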

Slide 15

Slide 15 text

Show me: hypothesis testing
Null & Alternative hypotheses
•  Is there a difference?
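
One way to make "is there a difference?" concrete is a permutation test on the difference in means, sketched here in Python with the standard library only. The slide does not say which test the course uses, so the choice of a permutation test is an assumption, and the two samples and the `permutation_test` helper are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in means.

    Null hypothesis: both groups come from the same distribution, so the
    group labels are interchangeable. Alternative: the means differ.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign labels at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value above zero.
    return (extreme + 1) / (n_permutations + 1)


# Made-up measurements for illustration only.
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]
treated = [12.9, 13.1, 12.7, 13.4, 12.8, 13.0, 13.2, 12.6]

p_value = permutation_test(control, treated)
print(f"p-value: {p_value:.4f}")
```

A small p-value means a difference at least as large as the observed one would rarely arise if the labels were interchangeable, which is evidence against the null hypothesis.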

Slide 16

Slide 16 text

Show me: hypothesis testing
How good is my model?
•  Receiver Operating Characteristics (ROC)
   –  False positives
   –  True positives
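
A ROC curve plots the true positive rate against the false positive rate as the decision threshold moves. The sketch below computes those points and the area under the curve in plain Python. The labels and scores are made up, and `roc_points` and `auc` are illustrative helper names rather than any library's API.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs.

    labels are 1 (positive) or 0 (negative); a higher score means the
    model is more confident that the item is positive.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Made-up model output for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
print(points)
print(f"AUC: {auc(points):.3f}")
```

An area of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of positives above negatives.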

Slide 17

Slide 17 text

Show me: visualising your data
Study hours by education
#Code
#http://tinyurl.com/7rbx7qs
library(arules)   # provides the AdultUCI data set
data("AdultUCI")
# Keep only the two columns of interest and give them syntactic names
dframe = AdultUCI[, c("education", "hours-per-week")]
colnames(dframe) = c("education", "hours_per_week")
library(ggplot2)
# Jittered points with a semi-transparent boxplot on top, axes flipped
ggplot(dframe, aes(x=education, y=hours_per_week)) +
  geom_point(colour="lightblue", alpha=0.1, position="jitter") +
  geom_boxplot(outlier.size=0, alpha=0.2) +
  coord_flip()

Slide 18

Slide 18 text

Show me: what problem do I have? How do I solve it?
Problem to solve:
•  Need to group items by similarity
•  Need to discover relationships between actions or items
•  Want to determine relationship between outcome and input variables
•  Want to assign (known) labels to items
•  Want to analyse my text
Technique (e.g.):

Slide 19

Slide 19 text

Show me: what problem do I have? How do I solve it?
Problem to solve, with technique (e.g.):
•  Need to group items by similarity
   –  Clustering (k-means)
•  Need to discover relationships between actions or items
   –  Association rules (Apriori)
•  Want to determine relationship between outcome and input variables
   –  Regression (linear/logistic)
•  Want to assign (known) labels to items
   –  Classification (Naïve Bayes, decision trees)
•  Want to analyse my text
   –  Regular expressions, Bag of words
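
As an illustration of the text-analysis techniques listed above, the sketch below tokenises text with a regular expression and builds bag-of-words count vectors in Python. The two example sentences are taken from the trending items earlier in the deck; the token pattern and the `bag_of_words` helper are assumptions for illustration, not the course's own code.

```python
import re
from collections import Counter

# A token is a run of lowercase letters, digits or apostrophes (an assumption;
# real tokenisers handle hyphens, hashtags and so on more carefully).
TOKEN = re.compile(r"[a-z0-9']+")


def bag_of_words(text):
    """Count how often each token occurs, ignoring case and word order."""
    return Counter(TOKEN.findall(text.lower()))


docs = [
    "What is a data scientist?",
    "Data science is the new black.",
]

bags = [bag_of_words(doc) for doc in docs]

# A shared, sorted vocabulary turns each bag into a fixed-length vector.
vocabulary = sorted(set().union(*bags))
vectors = [[bag[word] for word in vocabulary] for bag in bags]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Vectors like these are the usual input for the other techniques in the list, for example clustering documents or training a Naïve Bayes classifier.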

Slide 20

Slide 20 text

So what now?
What about you?
•  If you know all of this already, congratulations!
•  If you’d like to know more, our course goes live 26 March (register at http://education.emc.com)
•  If you couldn’t care less, you’re probably in the wrong room