Slide 1

Slide 1 text

Data Science: Methods and Tools
Szilárd Pafka, PhD
Chief Scientist, Epoch
Organizer, LA Data Meetups
Big Data Camp LA, June 2014

Slide 2

Slide 2 text

About

Slide 3

Slide 3 text

Data Science
Similar to the KDD process (1996), CRISP-DM (1999), or SEMMA

Slide 4

Slide 4 text

https://blog.cloudera.com/blog/2013/02/big-datas-new-use-cases-transformation-active-archive-and-exploration/

Slide 6

Slide 6 text

“the data preparation process prepares both the data and the modeler”

Slide 8

Slide 8 text

Hastie, Tibshirani and Friedman: The Elements of Statistical Learning

Slide 9

Slide 9 text

Hastie, Tibshirani and Friedman: The Elements of Statistical Learning

Slide 10

Slide 10 text

http://icml2008.cs.helsinki.fi/papers/632.pdf

Slide 11

Slide 11 text

http://icml2008.cs.helsinki.fi/papers/632.pdf
http://www.datanami.com/2014/03/26/forget_the_algorithms_and_start_cleaning_your_data/
http://www.nltk.org/book/ch06.html

Slide 14

Slide 14 text

https://www.youtube.com/watch?v=hVimVzgtD6w (2007)

Slide 15

Slide 15 text

Tools I Use ...

Slide 16

Slide 16 text

Tools Others Use (Survey)
LA Data Science/ML meetup, Apr 2014, 200 people
● Data munging: R 60%, Python 50%, SQL 40%, Hadoop (mostly Hive) 30%, Unix shell 20%, Excel 10% + Perl, Matlab, SAS, Impala, Pig, Shark...

Slide 17

Slide 17 text

Tools Others Use (Survey)
● Visualization: R 40%, Python 30%, Tableau 10%, JavaScript 10% + Matlab, Excel...

Slide 18

Slide 18 text

Tools Others Use (Survey)
● Machine learning/modeling: R 30%, Python 30% + Vowpal Wabbit, Matlab, Mahout, SAS, SPSS...
http://bit.ly/datasc-tools-survey
many other surveys, but...

Slide 19

Slide 19 text

Is Data Science New?
Data Science is a newly emerging field dedicated to analyzing and manipulating data to derive insights and build data products.
https://www.kaggle.com/wiki/WhatIsDataScience
The United States alone faces a shortage of 140,000 to 190,000 people with deep analytical skills. [2011]
http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation

Slide 20

Slide 20 text

Four major influences act on data analysis today:
● The formal theory of statistics
● Revolutionary developments in computers and display devices
● The challenge, in many fields, of more and ever larger bodies of data
● The accelerating emphasis on quantification in an ever wider variety of disciplines

Slide 21

Slide 21 text

Tukey & Wilk, 1965
Tukey, J.W., & Wilk, M.B. (1965). Data analysis and statistics: techniques and approaches. Reprinted in The Collected Works of John W. Tukey, Vol. V, Graphics 1965-1985, 1-22 (1988).

Slide 22

Slide 22 text

(meaning the entire data mining process)

Slide 23

Slide 23 text

Largest Impact on My Workflow:
● 2009: ggplot2
● 2010: useR! conference
● 2011: RStudio
● 2012: knitr
● 2013: Impala (Hadoop interactive SQL)
● 2014: see next...

Slide 24

Slide 24 text

Need Tools Like This

Slide 25

Slide 25 text

Got Tools Like This

Slide 26

Slide 26 text

● Grammar for data manipulation - filter, mutate, select, arrange, summarize, group_by, various joins (SQL is not that dumb after all), %>%
● Same API for several “backends” - data.frame, data.table, MySQL, PostgreSQL...
● Fast - Rcpp/C++
Demo...
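
The slide text does not name the library, but the verbs and the %>% pipe match R's dplyr package. As a rough illustration of what a verb-based "grammar" means, here is a minimal sketch in plain Python rather than R. Everything in it is an assumption made for illustration: the helper functions (filter_rows, mutate, select, arrange, group_summarize, pipe) are hypothetical stand-ins and not any library's API, and the sample rows are invented. It does not cover joins, the multiple backends, or the C++ speed mentioned above.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical sample rows, invented purely for illustration.
rows = [
    {"city": "LA", "year": 2013, "sales": 10},
    {"city": "LA", "year": 2014, "sales": 14},
    {"city": "SF", "year": 2013, "sales": 7},
    {"city": "SF", "year": 2014, "sales": 9},
]

# Each "verb" returns a function from a list of rows to a new list of rows,
# so verbs can be chained in any order.

def filter_rows(pred):
    """Keep only the rows for which pred(row) is true."""
    return lambda data: [r for r in data if pred(r)]

def mutate(**new_cols):
    """Add columns, each computed from the existing row."""
    return lambda data: [
        {**r, **{name: f(r) for name, f in new_cols.items()}} for r in data
    ]

def select(*cols):
    """Keep only the named columns."""
    return lambda data: [{c: r[c] for c in cols} for r in data]

def arrange(key, descending=False):
    """Sort rows by one column."""
    return lambda data: sorted(data, key=lambda r: r[key], reverse=descending)

def group_summarize(by, **aggs):
    """group_by and summarize in one step: one output row per group."""
    def step(data):
        groups = defaultdict(list)
        for r in data:
            groups[r[by]].append(r)
        return [
            {by: k, **{name: f(g) for name, f in aggs.items()}}
            for k, g in groups.items()
        ]
    return step

def pipe(data, *steps):
    """Apply the steps left to right, similar in spirit to chaining with %>%."""
    return reduce(lambda acc, step: step(acc), steps, data)

result = pipe(
    rows,
    filter_rows(lambda r: r["year"] == 2014),
    mutate(sales_k=lambda r: r["sales"] * 1000),
    select("city", "sales_k"),
    group_summarize("city", total=lambda g: sum(r["sales_k"] for r in g)),
    arrange("total", descending=True),
)
print(result)
```

The pipeline reads top to bottom as a sequence of small, single-purpose steps, which is the property the first bullet points at; the parenthetical about SQL fits because the same steps map closely onto WHERE, SELECT, GROUP BY, and ORDER BY.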

Slide 27

Slide 27 text

LA Data Science Community

Slide 30

Slide 30 text

@DataScienceLA
www.linkedin.com/in/szilard