Data Science: Methods and Tools - Big Data Camp LA - June 2014

szilard
June 14, 2014

Transcript

  1. Data Science: Methods and Tools
    Szilárd Pafka, PhD
    Chief Scientist, Epoch
    Organizer, LA Data Meetups
    Big Data Camp LA, June 2014
  2. About

  3. Data Science: similar to the KDD process (1996), CRISP-DM (1999) or SEMMA
  4. https://blog.cloudera.com/blog/2013/02/big-datas-new-use-cases-transformation-active-archive-and-exploration/

  5. None
  6. “the data preparation process prepares both the data and the modeler”
  7. None
  8. Hastie, Tibshirani and Friedman: The Elements of Statistical Learning

  9. Hastie, Tibshirani and Friedman: The Elements of Statistical Learning

  10. http://icml2008.cs.helsinki.fi/papers/632.pdf

  11. http://icml2008.cs.helsinki.fi/papers/632.pdf
    http://www.datanami.com/2014/03/26/forget_the_algorithms_and_start_cleaning_your_data/
    http://www.nltk.org/book/ch06.html

  12. None
  13. None
  14. https://www.youtube.com/watch?v=hVimVzgtD6w (2007)

  15. Tools I Use ...

  18. Tools Others Use (Survey): LA Data Science/ML meetup, Apr 2014, 200 people
    • Data munging: R 60%, Python 50%, SQL 40%, Hadoop (mostly Hive) 30%, Unix shell 20%, Excel 10% + Perl, Matlab, SAS, Impala, Pig, Shark...
    • Visualization: R 40%, Python 30%, Tableau 10%, JavaScript 10% + Matlab, Excel...
    • Machine learning/modeling: R 30%, Python 30% + Vowpal Wabbit, Matlab, Mahout, SAS, SPSS...
    http://bit.ly/datasc-tools-survey
    many other surveys, but...
  19. Is Data Science New?
    Data Science is a newly emerging field dedicated to analyzing and manipulating data to derive insights and build data products. https://www.kaggle.com/wiki/WhatIsDataScience
    The United States alone faces a shortage of 140,000 to 190,000 people with deep analytical skills. [2011] http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
  21. Four major influences act on data analysis today:
    • The formal theory of statistics
    • Revolutionary developments in computers and display devices
    • The challenge, in many fields, of more and ever larger bodies of data
    • The accelerating emphasis on quantification in an ever wider variety of disciplines
    Tukey & Wilk, 1965
    Tukey, J.W., & Wilk, M.B. (1965). Data analysis and statistics: techniques and approaches. Reprinted in The Collected Works of John W. Tukey, Vol. V, Graphics 1965-1985, 1-22 (1988)
  22. (meaning the entire data mining process)

  23. Largest Impact on My Workflow:
    • 2009: ggplot2
    • 2010: useR! conference
    • 2011: RStudio
    • 2012: knitr
    • 2013: Impala (Hadoop interactive SQL)
    • 2014: see next...
  24. Need Tools Like This

  25. Got Tools Like This

  26. • Grammar for data manipulation: filter, mutate, select, arrange, summarize, group_by, various joins (SQL is not that dumb after all), %>%
    • Same API for several “backends”: data.frame, data.table, MySQL, PostgreSQL...
    • Fast: Rcpp/C++
    Demo...
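The verbs listed on slide 26 match those of the R package dplyr. The talk's demo is not part of this transcript, so what follows is only a rough sketch of the idea in plain Python, not dplyr itself and not the author's code. It shows how a few small, single-purpose verbs chained by a pipe can express a whole manipulation. The helper names (`pipe`, `filter_rows`, `group_summarize`) and the `flights` toy data are hypothetical, made up for illustration.

```python
from collections import defaultdict
from functools import reduce


def pipe(data, *steps):
    """Thread data through each step in order (the role %>% plays in R)."""
    return reduce(lambda acc, step: step(acc), steps, data)


def filter_rows(pred):
    """Keep only the rows for which pred(row) is true."""
    return lambda rows: [r for r in rows if pred(r)]


def mutate(**new_cols):
    """Add columns, each computed from the existing row."""
    return lambda rows: [
        {**r, **{name: f(r) for name, f in new_cols.items()}} for r in rows
    ]


def select(*cols):
    """Keep only the named columns."""
    return lambda rows: [{c: r[c] for c in cols} for r in rows]


def arrange(key, descending=False):
    """Sort rows by one column."""
    return lambda rows: sorted(rows, key=lambda r: r[key], reverse=descending)


def group_summarize(by, **aggs):
    """group_by + summarize: one output row per distinct value of `by`."""
    def step(rows):
        groups = defaultdict(list)
        for r in rows:
            groups[r[by]].append(r)
        return [
            {by: k, **{name: f(g) for name, f in aggs.items()}}
            for k, g in groups.items()
        ]
    return step


# Hypothetical toy data: distance in miles, air_time in minutes.
flights = [
    {"carrier": "AA", "distance": 2475, "air_time": 330},
    {"carrier": "AA", "distance": 1089, "air_time": 160},
    {"carrier": "UA", "distance": 2475, "air_time": 345},
    {"carrier": "UA", "distance": 200, "air_time": 50},
    {"carrier": "DL", "distance": 760, "air_time": 115},
]

# Mean speed (mph) per carrier on flights longer than 500 miles, fastest first.
result = pipe(
    flights,
    filter_rows(lambda r: r["distance"] > 500),
    mutate(speed=lambda r: r["distance"] / r["air_time"] * 60),
    select("carrier", "speed"),
    group_summarize(
        "carrier",
        n=len,
        mean_speed=lambda g: sum(r["speed"] for r in g) / len(g),
    ),
    arrange("mean_speed", descending=True),
)

for row in result:
    print(row)
```

Each verb returns a function from rows to rows, which is what lets the steps be listed in reading order. This sketch runs only on in-memory lists of dicts. The "same API for several backends" point from the slide is not modeled here.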
  27. LA Data Science Community

  28. None
  29. None
  30. [ email removed ] @DataScienceLA www.linkedin.com/in/szilard