Data Science: Methods and Tools - Big Data Camp LA - June 2014

szilard
June 14, 2014

Transcript

  1. Data Science: Methods and Tools
    Szilárd Pafka, PhD
    Chief Scientist, Epoch
    Organizer, LA Data Meetups
    Big Data Camp LA
    June 2014

  2. About

  3. Data Science
    Similar to KDD process (1996), CRISP-DM (1999) or SEMMA

  4. https://blog.cloudera.com/blog/2013/02/big-datas-new-use-cases-transformation-active-archive-and-exploration/

  6. “the data preparation process
    prepares both the data and the
    modeler”

  8. Hastie, Tibshirani and Friedman: The Elements of Statistical Learning

  9. Hastie, Tibshirani and Friedman: The Elements of Statistical Learning

  10. http://icml2008.cs.helsinki.fi/papers/632.pdf

  11. http://icml2008.cs.helsinki.fi/papers/632.pdf
    http://www.datanami.com/2014/03/26/forget_the_algorithms_and_start_cleaning_your_data/
    http://www.nltk.org/book/ch06.html

  14. https://www.youtube.com/watch?v=hVimVzgtD6w (2007)

  15. Tools I Use
    ...

  18. Tools Others Use (Survey)
    LA Data Science/ML meetup, Apr 2014, 200 people

    Data munging: R 60%, Python 50%, SQL 40%,
    Hadoop (mostly Hive) 30%, Unix shell 20%, Excel
    10% + Perl, Matlab, SAS, Impala, Pig, Shark...

    Visualization: R 40%, Python 30%, Tableau 10%,
    Javascript 10% + Matlab, Excel...

    Machine learning/modeling: R 30%, Python 30% +
    Vowpal Wabbit, Matlab, Mahout, SAS, SPSS...
    http://bit.ly/datasc-tools-survey
    many other surveys, but...

  19. Is Data Science New?
    Data Science is a newly emerging field dedicated to analyzing and
    manipulating data to derive insights and build data products.
    https://www.kaggle.com/wiki/WhatIsDataScience
    The United States alone faces a shortage of 140,000 to 190,000 people
    with deep analytical skills. [2011]
    http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation

  21. Four major influences act on data analysis today:

    The formal theory of statistics

    Revolutionary developments in computers and display
    devices

    The challenge, in many fields, of more and ever larger
    bodies of data

    The accelerating emphasis on quantification in an ever
    wider variety of disciplines
    Tukey & Wilk, 1965
    Tukey, J.W., & Wilk, M.B. (1965). Data analysis and statistics: techniques and approaches.
    Reprinted in The Collected Works of John W. Tukey, Vol. V, Graphics 1965-1985, 1-22 (1988).

  22. (meaning the entire data mining process)

  23. Largest Impact on My Workflow:

    2009: ggplot2

    2010: useR! conference

    2011: RStudio

    2012: knitr

    2013: Impala (Hadoop interactive SQL)

    2014: see next...

  24. Need Tools Like This

  25. Got Tools Like This

  26. Grammar for data manipulation - filter, mutate,
    select, arrange, summarize, group_by, various
    joins (SQL is not that dumb after all), %>%

    Same API for several “backends” - data.frame,
    data.table, MySQL, PostgreSQL...

    Fast – Rcpp/C++
    Demo...
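
The verbs listed on this slide appear to match those of R's dplyr package. As a rough illustration of the "grammar" idea (small single-purpose verbs chained left to right), here is a minimal pure-Python sketch. It is not that package's API: the helper names `pipe`, `filter_rows`, `mutate`, `select`, `arrange` and `group_summarize`, and the `orders` toy data, are all hypothetical and exist only for this example.

```python
from collections import defaultdict
from functools import reduce


def pipe(data, *steps):
    """Feed data through each step in order (a stand-in for the %>% operator)."""
    return reduce(lambda acc, step: step(acc), steps, data)


def filter_rows(pred):
    """Keep only the rows for which pred(row) is true."""
    return lambda rows: [r for r in rows if pred(r)]


def mutate(**new_cols):
    """Add columns; each keyword maps a column name to a function of the row."""
    return lambda rows: [
        {**r, **{name: f(r) for name, f in new_cols.items()}} for r in rows
    ]


def select(*cols):
    """Keep only the named columns."""
    return lambda rows: [{c: r[c] for c in cols} for r in rows]


def arrange(key, desc=False):
    """Sort rows by one column."""
    return lambda rows: sorted(rows, key=lambda r: r[key], reverse=desc)


def group_summarize(by, **aggs):
    """Group rows by one column, then reduce each group to a single row."""
    def step(rows):
        groups = defaultdict(list)
        for r in rows:
            groups[r[by]].append(r)
        return [
            {by: k, **{name: f(g) for name, f in aggs.items()}}
            for k, g in groups.items()
        ]
    return step


# Hypothetical toy data, for illustration only.
orders = [
    {"city": "LA", "units": 3, "price": 2.0},
    {"city": "LA", "units": 1, "price": 10.0},
    {"city": "SF", "units": 5, "price": 1.0},
    {"city": "SF", "units": 0, "price": 4.0},
]

result = pipe(
    orders,
    filter_rows(lambda r: r["units"] > 0),
    mutate(revenue=lambda r: r["units"] * r["price"]),
    group_summarize("city", total=lambda g: sum(r["revenue"] for r in g)),
    arrange("total", desc=True),
)
print(result)
```

Each verb returns a function from a table to a table, so the chain reads in the order the operations happen, much like a SQL query written top to bottom. The same shape is what would let a real implementation swap the in-memory list for another backend without changing the calling code.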

  27. LA Data Science Community

  30. [ email removed ]
    @DataScienceLA
    www.linkedin.com/in/szilard
