Big Data and Data Science study - study materials and online courses

Big Data and Data Science study, subtitled "study materials and online courses", is a presentation of a little over 40 slides about 10 domains of Data Science, covered by free and paid online MOOC courses, study materials and free books.

Based on almost a year of studying, investigating and collecting materials, tutorials, courses, books, links, etc., I have distilled the best of them into this short presentation.

Of course the list is not complete, because there is always something new, undiscovered and better than before. But it contains the most important information for those who want to start, or who have already begun and don't know exactly where to continue.

Data Science Padawan

September 26, 2014

Transcript

  1. study materials and online courses by @dspadawan BIG DATA AND

    DATA SCIENCE Background © Jim Kaskade: Big Data
  2. 2 Copyright © 2013-2014 by Teradata. All rights reserved. WHAT

    IS DATA SCIENCE THE DATA SCIENCE VENN DIAGRAM @dspadawan
  3. 1.

    Data Science (Fundamentals) 2. Statistics 3. Programming languages 4. Machine Learning / Data Mining 5. Text Mining / Natural Language Processing 6. Data Visualization 7. Big Data (Hadoop, MapReduce, NoSQL) 8. Data Ingestion 9. Data Munging or Data Wrangling 10. Toolbox (Weka, …, Spark, Storm, …, Sqoop, RHIPE, etc.) DATA SCIENCE DOMAINS All links go to Wiki. If you are not sure what something means you can learn. @dspadawan
  4. DATA

    SCIENCE METRO MAP BECOMING A DATA SCIENTIST
  5. •

    Aggregator > http://www.mooc-list.com • Platforms > https://www.coursera.org > https://www.edx.org > https://www.open2study.com > https://www.udacity.com > https://www.udemy.com > http://online.stanford.edu • Interactive platforms > http://www.codecademy.com > https://www.datacamp.com MASSIVE OPEN ONLINE COURSES (MOOC) @dspadawan
  6. DATA

    SCIENCE & ANALYTICS • Coursera > Core Concepts in Data Analysis https://www.coursera.org/course/datan > Introduction to Data Science: https://www.coursera.org/course/datasci > Data Science Specialization: https://www.coursera.org/specialization/jhudatascience/1 – 9 courses + 1 capstone project – Each course or capstone takes 4 weeks – You can do it for free or you can pay 49 USD for certification > Welcome To Process Mining: Data science in Action! https://www.coursera.org/course/procmin 1 @dspadawan
  7. •

    edX > The Analytics Edge http://www.edx.org/course/mitx/mitx-15-071x-analytics-edge-1416 > Data, Analytics and Learning http://www.edx.org/course/utarlingtonx/utarlingtonx-link5-10x-data-analytics-2186 • Udacity > Intro to Data Science https://www.udacity.com/course/ud359 DATA SCIENCE & ANALYTICS 1 $ @dspadawan
  8. STATISTICS

    COURSES • Coursera > Data analysis and statistical inference: https://www.coursera.org/course/statistics > Statistical inference and exploratory data analysis: https://www.coursera.org/specialization/jhudatascience/1/courses • edX > Introduction to Statistics: Descriptive Statistics http://www.edx.org/course/uc-berkeleyx/uc-berkeleyx-stat2-1x-introduction-1138 > Introduction to Statistics: Probability http://www.edx.org/course/uc-berkeleyx/uc-berkeleyx-stat2-2x-introduction-1534 > Introduction to Statistics: Inference http://www.edx.org/course/uc-berkeleyx/uc-berkeleyx-stat2-3x-introduction-1533 2 @dspadawan
  9. •

    Udacity > Intro to statistics: https://www.udacity.com/course/st101 > Exploratory data analysis: https://www.udacity.com/course/ud651 > Intro to Inferential Statistics https://www.udacity.com/course/ud201 • Mathematical monk > https://www.youtube.com/playlist?list=PL17567A1A3F5DB5E4 STATISTICS COURSES CONT. 2 $ @dspadawan
  10. PROGRAMMING

    LANGUAGES • Analysis/Data mining: > R language > Python > SQL > (Perl) > (Octave) • Big Data (Hadoop) > Java (!) > Python • Visualization > JavaScript 3 @dspadawan
  11. R

    LANGUAGE • Basic info and SW > R Language: http://www.r-project.org > R Studio (IDE): http://www.rstudio.com • Courses > R Programming: https://www.coursera.org/course/rprog • Practice > Interactive courses: https://www.datacamp.com/courses > Data mining examples in R: http://www.rdatamining.com 3 @dspadawan
  12. PYTHON

    • Basic info and SW: > Python language: https://www.python.org > Eclipse Python: http://pydev.org • Python for Java developers: > http://www.sthurlow.com/python • Google's Python Class > https://developers.google.com/edu/python • Codecademy Python > http://www.codecademy.com/tracks/python 3 @dspadawan
  13. •

    Basic info and SW: > http://octave.sourceforge.net > https://gnu.org/software/octave > http://en.wikipedia.org/wiki/GNU_Octave • Coursera: > Machine learning: https://www.coursera.org/course/ml OCTAVE Octave is mostly compatible with MATLAB. 3 @dspadawan
  14. MACHINE

    LEARNING COURSES Subfield of computer science and artificial intelligence about learning from data. • Coursera > Machine Learning (Stanford): https://www.coursera.org/course/ml > Machine Learning (University of Washington): https://www.coursera.org/course/machlearning > Practical Machine Learning (Johns Hopkins): https://www.coursera.org/course/predmachlearn – part of Data Science Specialization • Udacity > Machine Learning (Supervised, Reinforcement, Unsupervised) https://www.udacity.com/course/ud675 https://www.udacity.com/course/ud820 https://www.udacity.com/course/ud741 4A $ @dspadawan
  15. MACHINE

    LEARNING VIDEOS • Udemy > Hilary Mason: An Intro to Machine Learning with Web Data https://www.udemy.com/hilary-mason-an-intro-to-machine-learning-with-web-data > Hilary Mason: Advanced Machine Learning https://www.udemy.com/hilary-mason-advanced-machine-learning/ • Mathematical monk > https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA • Videolectures.net > http://blog.videolectures.net/100-most-popular-machine-learning-talks-at-videolectures-net/ 4A $ @dspadawan
  16. DATA

    MINING COURSES Process of discovering patterns in large data sets via machine learning or statistics. • Coursera > Mining Massive Datasets (Stanford) https://www.coursera.org/course/mmds • Udemy > Matthew Russell on Mining the Social Web https://www.udemy.com/matthew-russell-on-mining-the-social-web/ > Data Mining https://www.udemy.com/data-mining • Web page > http://www.rdatamining.com 4B $ @dspadawan
  17. DATA

    MINING COURSES & TOOLS • Courses: > Data Mining with Weka: https://weka.waikato.ac.nz/dataminingwithweka/preview > More Data Mining with Weka: https://weka.waikato.ac.nz/moredataminingwithweka • Weka > SW: http://www.cs.waikato.ac.nz/ml/weka • Knime > SW: https://www.knime.org/downloads/overview • RapidMiner > Official site: http://rapidminer.com > SW: http://sourceforge.net/projects/rapidminer 4B @dspadawan
  18. •

    R Data Mining (Word Cloud) > http://www.rdatamining.com/examples/text-mining • Videolectures.net > http://videolectures.net/Top/Computer_Science/Text_Mining • Tool (Word Cloud) > Wordle.net TEXT MINING 5A TOP RECURRING THEMES ABOUT BIG DATA @dspadawan
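A word cloud such as the one this slide points to is driven by word frequencies. The following is a minimal Python sketch of that counting step only, not the method used by the R example or by Wordle; the stop-word list, the `word_frequencies` helper and the sample sentence are all hypothetical and chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in", "for"}

def word_frequencies(text, top_n=5):
    """Return the top_n most frequent non-stop-words in text."""
    # Lowercase the text and split it into alphabetic tokens.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)

# Hypothetical sample text.
sample = "Big data is big. Data science is the science of data."
top_words = word_frequencies(sample, top_n=3)
print(top_words)
```

A word cloud tool would then scale each word's font size by its count.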
  19. •

    Coursera > Natural Language Processing (Columbia University): https://www.coursera.org/course/nlangp > Natural Language Processing (Stanford): https://www.coursera.org/course/nlp • Deeper Learning MOOC > http://dlmooc.deeper-learning.org/ • Wikipedia > http://en.wikipedia.org/wiki/Natural_language_processing NATURAL LANGUAGE PROCESSING COURSES Subfield of computer science, artificial intelligence and linguistics. 5B @dspadawan
  20. •

    Tableau > http://www.tableausoftware.com > Commercial visualization software • D3.js > http://d3js.org > Data-Driven Documents visualization library • GraphViz > http://www.graphviz.org > Graph visualization tools • Gephi > https://gephi.github.io > Visualization platform VISUALIZATION TOOLS 6 @dspadawan
  21. •

    Trainings > http://www.tableausoftware.com/learn/training > On demand > Live Online planned for specific topic • Download > Tableau Public: http://www.tableausoftware.com/public > Tableau Trial: http://www.tableausoftware.com/products/trial • Certification > Desktop (Qualified associate, Certified Professional) > Server (Qualified associate, Certified Professional) > http://www.tableausoftware.com/support/certification TABLEAU 6 @dspadawan
  22. •

    MOOC > http://bigdatauniversity.com > http://bigdatacourse.appspot.com • Coursera > Web Intelligence and Big Data https://www.coursera.org/course/bigdata • Udemy > Big Data and Hadoop Essentials https://www.udemy.com/big-data-and-hadoop-essentials-free-tutorial • Open2Study > Big Data for Better Performance http://www.open2study.com/courses/big-data-for-better-performance BIG DATA STUDY 7 $ @dspadawan
  23. BIG

    DATA TOOLS • Hadoop – Big Data framework • Hive – DWH infrastructure built on top of Hadoop • HBase – Non-relational, distributed DB • Pig – Hadoop programming tool • Storm – Real-time computation system for Hadoop • Solr – Search platform • Falcon – Data management and processing for Hadoop • Sqoop – Command-line application for transferring data into Hadoop • Flume – Large-scale log aggregation framework • Oozie – Workflow scheduler for Hadoop • Ambari – Simpler management for Hadoop clusters • Mahout – Machine Learning algorithms implemented on Hadoop • ZooKeeper – Coordination service for distributed applications • Knox – REST API gateway for interacting with Hadoop clusters 7 @dspadawan
  24. HADOOP

    STUDY • Hadoop providers > http://www.cloudera.com > http://hortonworks.com > http://www.mapr.com > http://www.teradata.com/aster • Udacity > Intro to Hadoop and MapReduce https://www.udacity.com/course/ud617 • Udemy > Become a Certified Hadoop Developer | Training | Tutorial https://www.udemy.com/hadoop-tutorial 7 There are more Hadoop providers: IBM, Pivotal, etc. $ $ @dspadawan
  25. NOT

    ONLY SQL DATABASES • MongoDB – JSON document store > http://www.mongodb.com > https://university.mongodb.com • CouchDB – JSON document store > http://couchdb.apache.org • Cassandra – High performance column oriented DB > http://cassandra.apache.org • VoltDB – In-memory database > http://voltdb.com • Redis – In-memory key-value store > http://redis.io • NuoDB – Distributed SQL DB > http://www.nuodb.com 7 @dspadawan
  26. •

    Big Data Courses path: > Big Data Fundamentals > Hadoop Fundamentals > Moving Data into Hadoop (Sqoop and Flume tools) > Query languages for Hadoop (Hive, Pig and Jaql) > SQL Access for Hadoop > Using HBase for Real-time Access to your Big Data > Accessing Hadoop Data Using Hive > Introduction to Pig > Controlling Hadoop Jobs using Oozie > Hadoop Reporting and Analysis > Introduction to MapReduce Programming • Courses are provided by IBM BIG DATA UNIVERSITY 7 @dspadawan
  27. IT

    IS EVEN BETTER, DON’T YOU THINK? @dspadawan
  28. •

    Tutorials > 8 different paths > On demand and free > Taught together with Udacity (paid on a monthly basis) > http://cloudera.com/content/cloudera/en/training/courses.html > http://cloudera.com/content/cloudera/en/training/library.html • Sandbox > http://cloudera.com/content/support/en/downloads/quickstart_vms/cdh-5-1-x1.html • Certification > 200 USD per exam > http://cloudera.com/content/cloudera/en/training/certification.html CLOUDERA HADOOP 7 @dspadawan
  29. •

    Tutorials > http://hortonworks.com/tutorials > 3 paths for – Developers – Administrators – Data Scientists • Sandbox > http://hortonworks.com/hdp/downloads • Certifications > 200 USD per exam > http://hortonworks.com/training/certification HORTONWORKS HADOOP 7 @dspadawan
  30. •

    Tutorials > https://www.mapr.com/services/mapr-academy/training-videos > 3 paths for – Developers – Administrators – Business users • Sandbox > https://www.mapr.com/products/mapr-sandbox-hadoop • Certification > For administrator only > You must pass Hadoop Cluster Administration on MapR course > https://www.mapr.com/services/mapr-academy/certification MAPR HADOOP 7 @dspadawan
  31. STREAMING

    DATA PROCESSING • Storm (https://storm.incubator.apache.org) • Open source (ASF) real-time Hadoop • Twitter project • Spark (https://spark.apache.org) • Open source (ASF) in-memory Hadoop • Apache project • S4 (http://incubator.apache.org/s4) • Open source (ASF) processing of stream data • Yahoo project • Samza (http://samza.incubator.apache.org) • Open source processing of messaging data • LinkedIn project 7 @dspadawan
  32. •

    Techniques > Data import and export > Data fusion – integration of multiple data sources > Data sampling – selection of a data subset (rows) > Data discovery – detection of patterns in data > Exploratory data analysis – summary of the main data characteristics > Feature extraction – selection of a data subset (columns) > Data scrubbing – data error correction > Missing data values – data correction > Etc. DATA INGESTION 8 Process of obtaining, importing and processing data for later use or storage. @dspadawan
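Three of the techniques listed on this slide (handling missing values, selecting a subset of columns, sampling a subset of rows) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records, the column names, the helper functions and the choice of mean imputation are all hypothetical, and real projects would more likely use a library such as pandas.

```python
import random
import statistics

# Hypothetical records; None marks a missing value.
rows = [
    {"id": 1, "age": 34, "city": "A", "income": 52000},
    {"id": 2, "age": None, "city": "B", "income": 48000},
    {"id": 3, "age": 29, "city": "A", "income": None},
    {"id": 4, "age": 41, "city": "C", "income": 61000},
]

def fill_missing(records, column):
    """Replace None in `column` with the mean of the observed values."""
    observed = [r[column] for r in records if r[column] is not None]
    mean = statistics.mean(observed)
    return [{**r, column: mean if r[column] is None else r[column]}
            for r in records]

def select_columns(records, columns):
    """Keep only the named columns (a subset of columns)."""
    return [{c: r[c] for c in columns} for r in records]

def sample_rows(records, k, seed=0):
    """Draw k rows without replacement (a subset of rows)."""
    return random.Random(seed).sample(records, k)

cleaned = fill_missing(fill_missing(rows, "age"), "income")
features = select_columns(cleaned, ["age", "income"])
subset = sample_rows(features, k=2)
print(subset)
```

Mean imputation is just one simple choice for missing values; dropping incomplete rows or using a median are equally plausible depending on the data.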
  33. •

    Coursera > Getting and Cleaning Data part of Data Science Specialization https://www.coursera.org/course/getdata • Udacity > Data Wrangling with MongoDB https://www.udacity.com/course/ud032 • School of Data > Many different courses http://schoolofdata.org • Tools > OpenRefine, DataWrangler – clean up and transform tools > Talend, Pentaho – integration DATA WRANGLING / DATA MUNGING 9 Converting or mapping data from one "raw" form into another format. $ @dspadawan
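The slide's definition of data wrangling, converting data from one "raw" form into another format, can be illustrated with a small Python sketch that turns messy CSV text into typed JSON records. The input text, the column names and the `wrangle` helper are hypothetical and not taken from any of the listed courses or tools.

```python
import csv
import io
import json

# Hypothetical raw input: stray whitespace and string-typed values.
raw = (
    "name, enrolled ,price_usd\n"
    " course_a ,yes,49\n"
    "course_b,no,0\n"
)

def wrangle(raw_csv):
    """Parse raw CSV text into a list of cleaned, typed records."""
    reader = csv.reader(io.StringIO(raw_csv))
    header = [h.strip() for h in next(reader)]
    records = []
    for line in reader:
        record = dict(zip(header, (v.strip() for v in line)))
        # Map the "yes"/"no" strings to booleans and prices to integers.
        record["enrolled"] = record["enrolled"].lower() == "yes"
        record["price_usd"] = int(record["price_usd"])
        records.append(record)
    return records

records = wrangle(raw)
print(json.dumps(records, indent=2))
```

Tools such as OpenRefine perform this kind of clean-up interactively; the sketch only shows the idea in code.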
  34. •

    Hadoop and realtime > Scribe • Machine Learning > H2O – In-memory machine learning • Data Mining > Rattle – GUI for DM using R • Python and NLP > NLTK = Natural Language ToolKit for Python • R and Hadoop > RHIPE = R + Hadoop Integrated Programming Environment • Visualization > Many Eyes – Online visualization system from IBM TOOLBOX 10 @dspadawan
  35. ONLINE

    SOURCES • Data Science Servers: > http://www.datasciencecentral.com > http://www.hadoop360.com > http://www.datascienceweekly.org • Aggregators > https://trello.com/b/rbpEfMld/data-science • Blogs • http://datasciencemasters.org • http://www.kdnuggets.com • http://www.zipfianacademy.com/blog/post/46864003608/a-practical-intro-to-data-science • http://datascience101.wordpress.com • http://fivethirtyeight.blogs.nytimes.com @dspadawan
  36. FREE

    BOOKS • Data Science > Doing Data Science > Agile Data Science > Data Science for Business • Statistics > Think Stats • Programming > R language – 25 Recipes for Getting Started with R – Learning R > Python – Learning Python, 5th Edition – Think Python @dspadawan
  37. •

    Machine Learning / Data Mining > Machine Learning for Hackers > Mining the Social Web • Visualization > Visualizing Data > Getting Started with D3 > Communicating Data with Tableau • Text mining / Natural Language Processing > 21 Recipes for Mining Twitter > Natural Language Processing with Python > Natural Language Annotation for Machine Learning FREE BOOKS CONTINUED @dspadawan
  38. •

    Big Data > Hadoop: The Definitive Guide, 3rd Edition > Ethics of Big Data > Big Data Analytics with R and Hadoop • Data Ingestion > Data Analysis with Open Source Tools > Python for Data Analysis • Data Wrangling and Munging > Using OpenRefine • Toolbox > Getting Started with Storm > Fast Data Processing with Spark FREE BOOKS CONTINUED $ @dspadawan
  39. QUESTIONS

    AND ANSWERS By Tara Laskowski @dspadawan Contact me at [email protected] Follow me on Twitter @dspadawan Read my blog http://datasciencepadawan.blogspot.com