Slide 1

Slide 1 text

Data Science 101 - An overview medley - Dev.Talk December 2015 Marcel Körtgen

Slide 2

Slide 2 text

Disclaimer - What this is •A medley of great talks on DS (see references) •copied together & compressed → To fit into a 20 min. overview talk on the topic

Slide 3

Slide 3 text

Agenda •What is Data Science? •Why Data Science? •What is a Data Scientist? •How to become a Data Scientist?

Slide 4

Slide 4 text

What is Data Science?

Slide 5

Slide 5 text

D. Conway's Venn Diagram

Slide 6

Slide 6 text

Data Science: The Origins •1970s: Peter Naur introduces “data science” as a synonym for “computer science” •1997: Jeff Wu claims “statisticians” are “data scientists” •2001: William Cleveland introduces data science as an independent discipline, extending statistics •2008: DJ Patil (LinkedIn) and Jeff Hammerbacher (Facebook) describe their job role as that of “Data Scientist”

Slide 7

Slide 7 text

Data Science: The Origins The term has been trending since 2008, 38 years after it was first introduced

Slide 8

Slide 8 text

What about Big Data? •Volume: SQL → HDFS •Velocity: complex event processing, Apache Storm, Apache Spark Streaming •Variety: structured | semi-structured | unstructured; social graphs, system logs, tweets/blogs, CCTV; many variables, sampling variability (e.g., spatiotemporal)

Slide 9

Slide 9 text

What about Big Data? •Volume •Velocity •Variety Nobody wants data. Everybody wants data-driven, reliable, actionable insights.

Slide 10

Slide 10 text

What is Data Science? Big Data ≠ Data Science

Slide 11

Slide 11 text

Really, what is data science? According to the NIST Big Data Framework: “Data science is the empirical synthesis of actionable knowledge from raw data through the complete data lifecycle process.”

Slide 12

Slide 12 text

Why Data Science?

Slide 13

Slide 13 text

Man on the Moon - Small Data! •Computer Program: date 1969; 64 Kb, 2Kb RAM, Fortran; must work 1st time •Apollo XI: speed 3,500 km/hour; weight 13,500 kg; lots of complex data •Man on the Moon: distance 356,000 km; never been there before; must return to Earth

Slide 14

Slide 14 text

Think About It - We live in Crazy Times! •Apollo XI, 1969: 64 Kb •SkyDive Stratos, 2012: Tens of Gigabytes

Slide 15

Slide 15 text

Data Analyst Shortage Source: http://www.delphianalytics.net/wp-content/uploads/2013/04/GrowthOfDataVsDataAnalysts.png

Slide 16

Slide 16 text

What is a Data Scientist?

Slide 17

Slide 17 text

What is a Data Scientist?

Slide 18

Slide 18 text

What is a Data Scientist?

Slide 21

Slide 21 text

How to become a Data Scientist?

Slide 22

Slide 22 text

How to become a Data Scientist? Golden times for autodidacts • So much Open Source & Open Data (GitHub) • Never been easier to get in touch (Twitter, Social) • Low-cost compute resources (Cloud, SaaS, PaaS) → However, you will eventually hit a wall.

Slide 23

Slide 23 text

How to become a Data Scientist? By then at the latest, look into • bootcamps, retreats, courses such as... • NYC DS bootcamp, DS Retreat, Udacity, Education.emc For now, some references to get started...

Slide 24

Slide 24 text

References • 10 myths about Data Scientists (J. Kobelius) • Big Data [sorry] & Data Science (Data Science London) • Intro to Data Science (P. Nathan) • Data Science in 2016: Moving Up (P. Nathan) • How to Become a Data Scientist (R. Orban) • Jose Quesada on the skills of data scientists (Heise, in German) • slideshare.net/urlwolf (Jose Quesada at SlideShare)

Slide 25

Slide 25 text

References Statistical Modeling: The Two Cultures by Leo Breiman, Statistical Science, 2001

Slide 26

Slide 26 text

References Data Quality by Jack Olson, Morgan Kaufmann, 2003

Slide 27

Slide 27 text

References Building Data Science Teams by DJ Patil, O’Reilly, 2011

Slide 28

Slide 28 text

References Data Jujitsu by DJ Patil, O’Reilly, 2012

Slide 29

Slide 29 text

References RStudio: download and run on your laptop (https://www.rstudio.com)

Slide 30

Slide 30 text

Some great tools

Slide 31

Slide 31 text

[Sort of] Data Scientist Toolkit • Java, R, Python... (bonus: Clojure, Haskell, Scala) • Hadoop, HDFS & MapReduce... (bonus: Spark, Storm) • HBase, Pig & Hive... (bonus: Shark, Impala, Cascalog) • ETL, web scrapers, Flume, Sqoop... (bonus: Hume) • SQL RDBMS, DW, OLAP… • Knime, Weka, RapidMiner... (bonus: SciPy, NumPy, scikit-learn, pandas) • D3.js, Gephi, ggplot2, Tableau, Flare, Shiny… • SPSS, Matlab, SAS... (the enterprise man) • NoSQL, MongoDB, Couchbase, Cassandra… • And Yes! ... MS-Excel: the most used, most underrated DS tool

Slide 32

Slide 32 text

Some great algorithms

Slide 33

Slide 33 text

Thank You Time for Questions!