How to develop a data scientist – What business has requested v02

Presentation by Brendan Moran, Data Scientist @Greenplum EMC at Data Science London 21/03/12

Data Science London

July 03, 2012

Transcript

  1. © Copyright 2010 EMC Corporation. All rights reserved. Data Computing Division

    How to develop a data scientist
    What business has requested
    Big Data Meet Up, 21 March 2012
    Brendan Moran, EMC Data Scientist
  2. Context

    •  McKinsey report
       –  Technology and techniques
       –  Mind the gap
          •  140-160k deep analytic talent
          •  1.5m data savvy managers
       –  UK Top 6 in producing talent
    •  EMC Global Survey
    •  Kaggle.com
    •  Our clients
  3. Does it matter? Where’s the next generation coming from?

    Source: Big data: The next frontier for innovation, competition, and productivity, http://tinyurl.com/74tdfdv
  4. What about the UK?

    Source: Big data: The next frontier for innovation, competition, and productivity, http://tinyurl.com/74tdfdv
  5. What’s been trending – courtesy of Datasift http://tinyurl.com/6vek5ge

    •  # views
       –  1747072: What is a data scientist? | Datablog http://t.co/tFfVvstm
       –  1537330: I love Oxford-style debates. This one at #strataconf: the data science debate: domain expertise or machine learning? http://t.co/jKGhx8AY
       –  1536264: #strataconf is amazing. Data science is the new black.
    •  Most popular links:
       –  281, 2012-03-02 14:01: What is a data scientist? | Datablog | News | guardian.co.uk
       –  208, 2012-03-08 11:25: bitly's Hilary Mason on "What is A Data Scientist?" - Forbes
       –  175, 2012-03-02 15:24: A Data Scientist You’ve Never Heard of Is Now the Master of Your Domain
  6. Who do we have in tonight? Show of hands…

    •  Are you a data scientist?
       –  Beginner
       –  Proficient
       –  Expert?
  7. Snapshot of you: tools in your toolbox

    •  Warehousing/Analytics (single response)
       –  SQL -> IBM -> Oracle
    •  How do you manipulate your data (multi-response)
       –  Excel -> SQL -> Python
    •  How do you analyse your data (multi-response)
       –  SAS -> STATA -> SPSS (R was last)
    •  How do you visualise your data (multi-response)
       –  MS BI tools -> Oracle -> IBM -> SAP -> Microstrategy
  8. Snapshot of you: your traits

    •  You want a full set of data (53%)
    •  Only 13% were comfortable working with complete data
    •  “I explore the data” vs. “I report what it says” – evenly distributed
    •  “My findings drive decisions” vs. “I report what has happened” – evenly distributed
  9. Why EMC?

    •  Award winner for enterprise development (TSIA 2011)
    •  Commitment to open source initiatives (Chorus)
    •  Relationships with 700 universities around the world – many already take our course content
       –  Carnegie Mellon
       –  Berkeley
  10. How did we form the content?

    •  Experts
    •  Kaggle
    •  Universities
    •  Our enterprise clients
  11. Guiding principles

    •  Open source – R, RStudio, SQL, Python
    •  Vendor neutral
    •  No licensing implications (important for universities)
       –  Community editions of MPP DB, Hadoop
    •  Applied learning – lots of labs (~40% of the time)
    •  Foundation course
  12. Where’s the bar?

    •  Solid understanding of statistics
    •  Experience with a scripting language (Java, Perl, Python, R)
    •  Experience with SQL (or PSQL)
  13. What’s on the course?
  14. Is this for you? Practice, practice, practice

    •  “Tell me and I forget. Show me and I remember. Involve me and I understand.”
    •  40% is hands-on lab
    •  Take “dirty” data, tidy it up, start exploring data, basic statistics, simple plots, complex stats, beautiful graphs, build models, test models, present your findings
  15. Show me: hypothesis testing

    Null & alternative hypotheses
    •  Is there a difference?
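As a minimal sketch of the null versus alternative idea on this slide, the R snippet below runs a two-sample t-test in base R. The data are simulated, and the group names, sample sizes, and the 0.05 threshold are assumptions made for illustration; none of them come from the talk.

```r
# Simulated data: hypothetical weekly study hours for two groups.
# These numbers are illustrative assumptions, not data from the talk.
set.seed(42)
group_a <- rnorm(50, mean = 40, sd = 5)
group_b <- rnorm(50, mean = 43, sd = 5)

# H0: the two group means are equal. H1: the means differ.
# t.test() runs a Welch two-sample t-test by default.
result <- t.test(group_a, group_b)

# A small p-value counts as evidence against H0.
# The 0.05 cut-off is a common convention, assumed here.
print(result$p.value)
reject_null <- result$p.value < 0.05
print(reject_null)
```

Printing `result` on its own also shows the test statistic, degrees of freedom, and a confidence interval for the difference in means.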
  16. Show me: hypothesis testing

    How good is my model?
    •  Receiver Operating Characteristic (ROC)
       –  False positives
       –  True positives
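One way to see how false positives and true positives combine into a ROC curve is to compute both rates at every score threshold. The sketch below does this in base R on simulated labels and scores; the data, the variable names, and the trapezoidal area calculation are illustrative assumptions rather than material from the course.

```r
# Simulated classifier output (illustrative assumption, not course data):
# 100 positives that tend to score higher than 100 negatives.
set.seed(1)
labels <- rep(c(1, 0), each = 100)
scores <- c(rnorm(100, mean = 1), rnorm(100, mean = 0))

# Sweep the decision threshold from strictest to most lenient.
thresholds <- c(Inf, sort(unique(scores), decreasing = TRUE))

# True positive rate: share of positives scored at or above the threshold.
tpr <- sapply(thresholds, function(t) mean(scores[labels == 1] >= t))
# False positive rate: share of negatives scored at or above the threshold.
fpr <- sapply(thresholds, function(t) mean(scores[labels == 0] >= t))

# Area under the ROC curve by the trapezoidal rule.
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
print(auc)
```

Calling `plot(fpr, tpr, type = "l")` draws the curve. A model that guesses at random would follow the diagonal, with an area of about 0.5.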
  17. Show me: visualising your data

    Study hours by education

    # Code source: http://tinyurl.com/7rbx7qs
    library(arules)    # provides the AdultUCI data set
    library(ggplot2)

    data("AdultUCI")
    # Keep the two columns of interest and give them syntactic names
    dframe <- AdultUCI[, c("education", "hours-per-week")]
    colnames(dframe) <- c("education", "hours_per_week")

    # Jittered points with box plots overlaid, one row per education level
    ggplot(dframe, aes(x = education, y = hours_per_week)) +
      geom_point(colour = "lightblue", alpha = 0.1, position = "jitter") +
      geom_boxplot(outlier.size = 0, alpha = 0.2) +
      coord_flip()
  18. Show me: what problem do I have? How do I solve it?

    Problem to solve | Technique (e.g.)
    Need to group items by similarity |
    Need to discover relationships between actions or items |
    Want to determine relationship between outcome and input variables |
    Want to assign (known) labels to items |
    Want to analyse my text |
  19. Show me: what problem do I have? How do I solve it?

    Problem to solve | Technique (e.g.)
    Need to group items by similarity | Clustering (k-means)
    Need to discover relationships between actions or items | Association rules (Apriori)
    Want to determine relationship between outcome and input variables | Regression (linear/logistic)
    Want to assign (known) labels to items | Classification (Naïve Bayes, decision trees)
    Want to analyse my text | Regular expressions, bag of words
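To make the first pairing concrete (grouping items by similarity with k-means), here is a small sketch in base R. It uses R's built-in `iris` data set as a stand-in; the choice of data, of three clusters, and of `nstart = 20` are assumptions for illustration and are not taken from the course.

```r
# Illustrative assumption: the built-in iris measurements stand in for
# "items" to be grouped by similarity.
set.seed(123)

# Standardise the four numeric columns so each contributes equally
# to the distance calculation.
features <- scale(iris[, 1:4])

# Ask for 3 clusters; nstart = 20 repeats the random start 20 times
# and keeps the best solution.
fit <- kmeans(features, centers = 3, nstart = 20)

# Compare the discovered clusters with the known species labels.
print(table(cluster = fit$cluster, species = iris$Species))
```

The species labels are used only to check the result afterwards. k-means itself never sees them, which is what separates clustering from the classification row of the table.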
  20. So what now? What about you?

    •  If you know all of this already – congratulations!
    •  If you’d like to know more – our course goes live 26 March (register at http://education.emc.com)
    •  If you couldn’t care less – you’re probably in the wrong room