Slide 1

Slide 1 text

BIG DATA AND DATA SCIENCE
Study materials and online courses by @dspadawan
Background © Jim Kaskade: Big Data

Slide 2

Slide 2 text

Copyright © 2013-2014 by Teradata. All rights reserved.
WHAT IS DATA SCIENCE
THE DATA SCIENCE VENN DIAGRAM

Slide 3

Slide 3 text

DATA SCIENCE DOMAINS
1. Data Science (Fundamentals)
2. Statistics
3. Programming languages
4. Machine Learning / Data Mining
5. Text Mining / Natural Language Processing
6. Data Visualization
7. Big Data (Hadoop, MapReduce, NoSQL)
8. Data Ingestion
9. Data Munging or Data Wrangling
10. Toolbox (Weka, …, Spark, Storm, …, Sqoop, RHIPE, etc.)
All links go to Wikipedia; if you are not sure what something means, you can look it up there.

Slide 4

Slide 4 text

DATA SCIENCE METRO MAP
BECOMING A DATA SCIENTIST

Slide 5

Slide 5 text

MASSIVE OPEN ONLINE COURSES (MOOC)
• Aggregator
> http://www.mooc-list.com
• Platforms
> https://www.coursera.org
> https://www.edx.org
> https://www.open2study.com
> https://www.udacity.com
> https://www.udemy.com
> http://online.stanford.edu
• Interactive platforms
> http://www.codecademy.com
> https://www.datacamp.com

Slide 6

Slide 6 text

WANT TO WORK AS A DATA SCIENTIST?

Slide 7

Slide 7 text

DATA SCIENCE & ANALYTICS (domain 1)
• Coursera
> Core Concepts in Data Analysis: https://www.coursera.org/course/datan
> Introduction to Data Science: https://www.coursera.org/course/datasci
> Data Science Specialization: https://www.coursera.org/specialization/jhudatascience/1
– 9 courses + 1 capstone project
– Each course or capstone takes 4 weeks
– You can do it for free or you can pay 49 USD for certification
> Welcome To Process Mining: Data science in Action! https://www.coursera.org/course/procmin

Slide 8

Slide 8 text

DATA SCIENCE & ANALYTICS (domain 1)
• edX
> The Analytics Edge: http://www.edx.org/course/mitx/mitx-15-071x-analytics-edge-1416
> Data, Analytics and Learning: http://www.edx.org/course/utarlingtonx/utarlingtonx-link5-10x-data-analytics-2186
• Udacity
> Intro to Data Science: https://www.udacity.com/course/ud359

Slide 9

Slide 9 text

MATH DANCE

Slide 10

Slide 10 text

STATISTICS COURSES (domain 2)
• Coursera
> Data analysis and statistical inference: https://www.coursera.org/course/statistics
> Statistical inference and exploratory data analysis: https://www.coursera.org/specialization/jhudatascience/1/courses
• edX
> Introduction to Statistics: Descriptive Statistics http://www.edx.org/course/uc-berkeleyx/uc-berkeleyx-stat2-1x-introduction-1138
> Introduction to Statistics: Probability http://www.edx.org/course/uc-berkeleyx/uc-berkeleyx-stat2-2x-introduction-1534
> Introduction to Statistics: Inference http://www.edx.org/course/uc-berkeleyx/uc-berkeleyx-stat2-3x-introduction-1533

Slide 11

Slide 11 text

STATISTICS COURSES CONT. (domain 2)
• Udacity
> Intro to statistics: https://www.udacity.com/course/st101
> Exploratory data analysis: https://www.udacity.com/course/ud651
> Intro to Inferential Statistics: https://www.udacity.com/course/ud201
• Mathematical monk
> https://www.youtube.com/playlist?list=PL17567A1A3F5DB5E4

Slide 12

Slide 12 text

PROGRAMMING LANGUAGES (domain 3)
• Analysis/Data mining:
> R language
> Python
> SQL
> (Perl)
> (Octave)
• Big Data (Hadoop)
> Java (!)
> Python
• Visualization
> JavaScript

Slide 13

Slide 13 text

R LANGUAGE (domain 3)
• Basic info and SW
> R Language: http://www.r-project.org
> RStudio (IDE): http://www.rstudio.com
• Courses
> R Programming: https://www.coursera.org/course/rprog
• Practice
> Interactive courses: https://www.datacamp.com/courses
> Data mining examples in R: http://www.rdatamining.com

Slide 14

Slide 14 text

PYTHON (domain 3)
• Basic info and SW:
> Python language: https://www.python.org
> Eclipse Python (PyDev): http://pydev.org
• Python for Java developers:
> http://www.sthurlow.com/python
• Google's Python Class
> https://developers.google.com/edu/python
• Codecademy Python
> http://www.codecademy.com/tracks/python

Slide 15

Slide 15 text

OCTAVE (domain 3)
Octave is mostly compatible with MATLAB.
• Basic info and SW:
> http://octave.sourceforge.net
> https://gnu.org/software/octave
> http://en.wikipedia.org/wiki/GNU_Octave
• Coursera:
> Machine learning: https://www.coursera.org/course/ml

Slide 16

Slide 16 text

MACHINE LEARNING COURSES (domain 4A)
Subfield of computer science and artificial intelligence concerned with learning from data.
• Coursera
> Machine Learning (Stanford): https://www.coursera.org/course/ml
> Machine Learning (University of Washington): https://www.coursera.org/course/machlearning
> Practical Machine Learning (Johns Hopkins): https://www.coursera.org/course/predmachlearn
– part of Data Science Specialization
• Udacity
> Machine Learning (Supervised, Reinforcement, Unsupervised)
https://www.udacity.com/course/ud675
https://www.udacity.com/course/ud820
https://www.udacity.com/course/ud741

Slide 17

Slide 17 text

MACHINE LEARNING VIDEOS (domain 4A)
• Udemy
> Hilary Mason: An Intro to Machine Learning with Web Data https://www.udemy.com/hilary-mason-an-intro-to-machine-learning-with-web-data
> Hilary Mason: Advanced Machine Learning https://www.udemy.com/hilary-mason-advanced-machine-learning/
• Mathematical monk
> https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA
• Videolectures.net
> http://blog.videolectures.net/100-most-popular-machine-learning-talks-at-videolectures-net/

Slide 18

Slide 18 text

DATA MINING COURSES (domain 4B)
Process of discovering patterns in large data sets via machine learning or statistics.
• Coursera
> Mining Massive Datasets (Stanford) https://www.coursera.org/course/mmds
• Udemy
> Matthew Russell on Mining the Social Web https://www.udemy.com/matthew-russell-on-mining-the-social-web/
> Data Mining https://www.udemy.com/data-mining
• Web page
> http://www.rdatamining.com

Slide 19

Slide 19 text

DATA MINING COURSES & TOOLS (domain 4B)
• Courses:
> Data Mining with Weka: https://weka.waikato.ac.nz/dataminingwithweka/preview
> More Data Mining with Weka: https://weka.waikato.ac.nz/moredataminingwithweka
• Weka
> SW: http://www.cs.waikato.ac.nz/ml/weka
• KNIME
> SW: https://www.knime.org/downloads/overview
• RapidMiner
> Official site: http://rapidminer.com
> SW: http://sourceforge.net/projects/rapidminer

Slide 20

Slide 20 text

TEXT MINING (domain 5A)
• R Data Mining (Word Cloud)
> http://www.rdatamining.com/examples/text-mining
• Videolectures.net
> http://videolectures.net/Top/Computer_Science/Text_Mining
• Tool (Word Cloud)
> Wordle.net
TOP RECURRING THEMES ABOUT BIG DATA
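A word cloud is driven by word frequencies, so the counting step behind the tools listed above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the method used by Wordle or the R examples; the stop-word list, the sample sentence and the function name `word_frequencies` are all assumptions made for this sketch.

```python
import re
from collections import Counter

# Hypothetical stop-word list; real text-mining tools ship much larger ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}


def word_frequencies(text, top_n=5):
    """Return the top_n most frequent words in text, ignoring stop words."""
    # Lowercase the text and split it into alphabetic tokens.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)


# Illustrative input only; any document text could be used instead.
sample = "Big data is big. Data science turns big data into value."
print(word_frequencies(sample, top_n=2))
```

A word cloud tool would then map each count to a font size; that rendering step is left out here.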

Slide 21

Slide 21 text

NATURAL LANGUAGE PROCESSING COURSES (domain 5B)
Subfield of computer science, artificial intelligence and linguistics.
• Coursera
> Natural Language Processing (Columbia University): https://www.coursera.org/course/nlangp
> Natural Language Processing (Stanford): https://www.coursera.org/course/nlp
• Deeper Learning MOOC
> http://dlmooc.deeper-learning.org/
• Wikipedia
> http://en.wikipedia.org/wiki/Natural_language_processing

Slide 22

Slide 22 text

VISUALIZATION TOOLS (domain 6)
• Tableau
> http://www.tableausoftware.com
> Commercial visualization software
• D3.js
> http://d3js.org
> Data-Driven Documents: JavaScript visualization library
• GraphViz
> http://www.graphviz.org
> Graph visualization tools
• Gephi
> https://gephi.github.io
> Visualization platform

Slide 23

Slide 23 text

TABLEAU (domain 6)
• Training
> http://www.tableausoftware.com/learn/training
> On demand
> Live online sessions scheduled for specific topics
• Download
> Tableau Public: http://www.tableausoftware.com/public
> Tableau Trial: http://www.tableausoftware.com/products/trial
• Certification
> Desktop (Qualified Associate, Certified Professional)
> Server (Qualified Associate, Certified Professional)
> http://www.tableausoftware.com/support/certification

Slide 24

Slide 24 text

HOW BIG IS BIG ENOUGH?

Slide 25

Slide 25 text

BIG DATA STUDY (domain 7)
• MOOC
> http://bigdatauniversity.com
> http://bigdatacourse.appspot.com
• Coursera
> Web Intelligence and Big Data https://www.coursera.org/course/bigdata
• Udemy
> Big Data and Hadoop Essentials https://www.udemy.com/big-data-and-hadoop-essentials-free-tutorial
• Open2Study
> Big Data for Better Performance http://www.open2study.com/courses/big-data-for-better-performance

Slide 26

Slide 26 text

BIG DATA TOOLS (domain 7)
• Hadoop – Big Data framework
• Hive – DWH infrastructure built on top of Hadoop
• HBase – Non-relational, distributed DB
• Pig – Hadoop programming tool
• Storm – Real-time computation system for Hadoop
• Solr – Search platform
• Falcon – Data management and processing for Hadoop
• Sqoop – Command-line application for transferring data into Hadoop
• Flume – Large-scale log aggregation framework
• Oozie – Workflow scheduler for Hadoop
• Ambari – Simpler management for Hadoop clusters
• Mahout – Machine learning algorithms implemented on Hadoop
• ZooKeeper – Coordination service for distributed applications
• Knox – REST API gateway for interacting with Hadoop clusters

Slide 27

Slide 27 text

HADOOP STUDY (domain 7)
• Hadoop providers
> http://www.cloudera.com
> http://hortonworks.com
> http://www.mapr.com
> http://www.teradata.com/aster
• Udacity
> Intro to Hadoop and MapReduce https://www.udacity.com/course/ud617
• Udemy
> Become a Certified Hadoop Developer | Training | Tutorial https://www.udemy.com/hadoop-tutorial
There are more Hadoop providers: IBM, Pivotal, etc.

Slide 28

Slide 28 text

NOT ONLY SQL DATABASES (domain 7)
• MongoDB – JSON document store
> http://www.mongodb.com
> https://university.mongodb.com
• CouchDB – JSON document store
> http://couchdb.apache.org
• Cassandra – High-performance column-oriented DB
> http://cassandra.apache.org
• VoltDB – In-memory database
> http://voltdb.com
• Redis – In-memory key-value store
> http://redis.io
• NuoDB – Distributed SQL DB
> http://www.nuodb.com

Slide 29

Slide 29 text

BIG DATA UNIVERSITY (domain 7)
• Big Data Courses path:
> Big Data Fundamentals
> Hadoop Fundamentals
> Moving Data into Hadoop (Sqoop and Flume tools)
> Query languages for Hadoop (Hive, Pig and Jaql)
> SQL Access for Hadoop
> Using HBase for Real-time Access to your Big Data
> Accessing Hadoop Data Using Hive
> Introduction to Pig
> Controlling Hadoop Jobs using Oozie
> Hadoop Reporting and Analysis
> Introduction to MapReduce Programming
• Courses are provided by IBM

Slide 30

Slide 30 text

IT IS EVEN BETTER, DON’T YOU THINK?

Slide 31

Slide 31 text

CLOUDERA HADOOP (domain 7)
• Tutorials
> 8 different paths
> On demand and free
> Taught together with Udacity (paid monthly)
> http://cloudera.com/content/cloudera/en/training/courses.html
> http://cloudera.com/content/cloudera/en/training/library.html
• Sandbox
> http://cloudera.com/content/support/en/downloads/quickstart_vms/cdh-5-1-x1.html
• Certification
> 200 USD per exam
> http://cloudera.com/content/cloudera/en/training/certification.html

Slide 32

Slide 32 text

HORTONWORKS HADOOP (domain 7)
• Tutorials
> http://hortonworks.com/tutorials
> 3 paths for
– Developers
– Administrators
– Data Scientists
• Sandbox
> http://hortonworks.com/hdp/downloads
• Certifications
> 200 USD per exam
> http://hortonworks.com/training/certification

Slide 33

Slide 33 text

MAPR HADOOP (domain 7)
• Tutorials
> https://www.mapr.com/services/mapr-academy/training-videos
> 3 paths for
– Developers
– Administrators
– Business users
• Sandbox
> https://www.mapr.com/products/mapr-sandbox-hadoop
• Certification
> For administrators only
> You must pass the Hadoop Cluster Administration on MapR course
> https://www.mapr.com/services/mapr-academy/certification

Slide 34

Slide 34 text

STREAMING – NO BIG DEAL

Slide 35

Slide 35 text

STREAMING DATA PROCESSING (domain 7)
• Storm (https://storm.incubator.apache.org)
> Open source (ASF) real-time Hadoop
> Twitter project
• Spark (https://spark.apache.org)
> Open source (ASF) in-memory Hadoop
> Apache project
• S4 (http://incubator.apache.org/s4)
> Open source (ASF) processing of stream data
> Yahoo project
• Samza (http://samza.incubator.apache.org)
> Open source processing of messaging data
> LinkedIn project

Slide 36

Slide 36 text

DATA INGESTION (domain 8)
Process of obtaining, importing and processing data for later use or storage.
• Techniques
> Data import and export
> Data fusion – integration of multiple data sources
> Data sampling – selection of a data subset (rows)
> Data discovery – detection of patterns in data
> Exploratory data analysis – summary of the main data characteristics
> Feature extraction – selection of a data subset (columns)
> Data scrubbing – data error correction
> Missing data values – data correction
> Etc.
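Three of the techniques listed above (missing data values, selection of columns, and data sampling) can be sketched with the Python standard library alone. This is only an illustration under stated assumptions: the records, the column names and the helper names `fill_missing`, `select_columns` and `sample_rows` are invented for the example, and filling gaps with the column mean is just one of several possible correction strategies.

```python
import random
import statistics

# Hypothetical records; None marks a missing value.
rows = [
    {"id": 1, "age": 34, "city": "Prague", "income": 52000},
    {"id": 2, "age": None, "city": "Brno", "income": 48000},
    {"id": 3, "age": 29, "city": "Prague", "income": None},
    {"id": 4, "age": 41, "city": "Ostrava", "income": 61000},
]


def fill_missing(records, column):
    """Missing data values: replace None with the mean of the known values."""
    known = [r[column] for r in records if r[column] is not None]
    mean = statistics.mean(known)
    return [
        {**r, column: mean if r[column] is None else r[column]}
        for r in records
    ]


def select_columns(records, columns):
    """Selection of a data subset (columns)."""
    return [{c: r[c] for c in columns} for r in records]


def sample_rows(records, k, seed=0):
    """Data sampling: draw k rows without replacement, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(records, k)


cleaned = fill_missing(fill_missing(rows, "age"), "income")
features = select_columns(cleaned, ["age", "income"])
subset = sample_rows(features, k=2)
print(subset)
```

In practice a library such as pandas would usually handle these steps; the plain-Python version is shown only to make each technique explicit.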

Slide 37

Slide 37 text

DATA WRANGLING / DATA MUNGING (domain 9)
Converting or mapping data from one "raw" form into another format.
• Coursera
> Getting and Cleaning Data, part of Data Science Specialization https://www.coursera.org/course/getdata
• Udacity
> Data Wrangling with MongoDB https://www.udacity.com/course/ud032
• School of Data
> Many different courses http://schoolofdata.org
• Tools
> OpenRefine, DataWrangler – clean up and transform tools
> Talend, Pentaho – integration
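As a small illustration of mapping data from a "raw" form into another format, the sketch below turns a messy CSV snippet into clean JSON records. The input text, its column names and the function name `wrangle` are assumptions made for this example; dedicated tools such as OpenRefine cover far more cases.

```python
import csv
import io
import json

# Hypothetical raw export: stray whitespace and an empty numeric field.
RAW = """name, signup_date ,visits
 Alice ,2014-01-05,12
Bob,2014-02-11,
"""


def wrangle(raw_text):
    """Parse raw CSV text into a list of cleaned dictionaries."""
    reader = csv.reader(io.StringIO(raw_text))
    header = [h.strip() for h in next(reader)]
    records = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        record = dict(zip(header, (v.strip() for v in row)))
        # Convert the count to an integer; keep a missing count as None.
        record["visits"] = int(record["visits"]) if record["visits"] else None
        records.append(record)
    return records


records = wrangle(RAW)
print(json.dumps(records, indent=2))
```

The resulting JSON could then be loaded into a document store or passed on to an analysis step.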

Slide 38

Slide 38 text

TOOLBOX (domain 10)
• Hadoop and realtime
> Scribe
• Machine Learning
> H2O – In-memory machine learning
• Data Mining
> Rattle – GUI for DM using R
• Python and NLP
> NLTK = Natural Language ToolKit for Python
• R and Hadoop
> RHIPE = R + Hadoop Integrated Programming Environment
• Visualization
> Many Eyes – Online visualization system from IBM

Slide 39

Slide 39 text

ONLINE SOURCES
• Data Science Servers:
> http://www.datasciencecentral.com
> http://www.hadoop360.com
> http://www.datascienceweekly.org
• Aggregators
> https://trello.com/b/rbpEfMld/data-science
• Blogs
> http://datasciencemasters.org
> http://www.kdnuggets.com
> http://www.zipfianacademy.com/blog/post/46864003608/a-practical-intro-to-data-science
> http://datascience101.wordpress.com
> http://fivethirtyeight.blogs.nytimes.com

Slide 40

Slide 40 text

FREE BOOKS
• Data Science
> Doing Data Science
> Agile Data Science
> Data Science for Business
• Statistics
> Think Stats
• Programming
> R language
– 25 Recipes for Getting Started with R
– Learning R
> Python
– Learning Python, 5th Edition
– Think Python

Slide 41

Slide 41 text

FREE BOOKS CONTINUED
• Machine Learning / Data Mining
> Machine Learning for Hackers
> Mining the Social Web
• Visualization
> Visualizing Data
> Getting Started with D3
> Communicating Data with Tableau
• Text mining / Natural Language Processing
> 21 Recipes for Mining Twitter
> Natural Language Processing with Python
> Natural Language Annotation for Machine Learning

Slide 42

Slide 42 text

FREE BOOKS CONTINUED
• Big Data
> Hadoop: The Definitive Guide, 3rd Edition
> Ethics of Big Data
> Big Data Analytics with R and Hadoop
• Data Ingestion
> Data Analysis with Open Source Tools
> Python for Data Analysis
• Data Wrangling and Munging
> Using OpenRefine
• Toolbox
> Getting Started with Storm
> Fast Data Processing with Spark

Slide 43

Slide 43 text

QUESTIONS AND ANSWERS
By Tara Laskowski
Contact me at [email protected]
Follow me on Twitter @dspadawan
Read my blog http://datasciencepadawan.blogspot.com