State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain 2014

STATE OF PLAY SEAN OWEN DIRECTOR OF DATA SCIENCE CLOUDERA

State of Play: Data Science on Hadoop in 2015
Sean Owen // Director, Data Science @ Cloudera

2 About …
• Engineer
• Data Science @ Cloudera
• Oryx project founder
• Committer, erstwhile VP Apache Mahout
• Apache Spark contributor / personality
• Co-author, Mahout in Action / Advanced Analytics on Spark
• [email protected] / @sean_r_owen

3 Where Is My Magic Wand?

4 We Like Hadoop Because …
It’s Aspirational
• (Was) Shiny New Toy
• Be Like Yahoo, Google, FB
• Data as Strategy
It Costs Less
• Free – Just Add Hardware
• Open, Standard
• Cost-Savings Projects
We Get More Computing
• Bigger and Faster is Better
• Fewer Hacks to Survive Scale
• Do The Previously Impossible
www.avalonconsulting.net/blog/485-thinking-beyond-shiny-and-new
www.pianta.co.uk/massive-sale-now-on/
www.google.com/about/careers/locations/mayes-county/

5 Incremental Today vs. Revolutionary Tomorrow
Incremental Today
• We set up a prototype Hadoop cluster as part of a big data POC
• We cut our IT budget by 22% by moving some operations to Hadoop
• Our SQL queries are 3 times faster and overnight reports finish in 39 minutes now
• We do the same things with data, but do them notably better.
Revolutionary Tomorrow
• We want to become a real-time product business that reacts to new machine sensor data in seconds, not days
• We want to predict which merchants will take out a business loan this month
• We want a complete customer profile that “understands” what they want at any time
• We think there is a magic wand available?

6
Phase 1. Collect Data
Phase 2. Data Science?
Phase 3. Profit!

7 Demystifying with Data Science
• Machine Learning is not new
• Big Machine Learning is qualitatively different
  – More data beats algorithm improvement
  – Scale trumps noise and sample size effects
  – Can brute-force manual tasks
    • Feature selection
    • Hyperparameter tuning
• Engineering “Big” is Difficult
  – Build new scalable data platforms
  – Re-engineering parallel algorithms
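The "brute-force hyperparameter tuning" bullet can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not something from the talk: the names `GridSearchSketch`, `fit`, `mse`, and `search`, and the toy model (one-dimensional ridge regression through the origin). The point is only that, given enough compute, one can train a model for every candidate value and keep whichever scores best on held-out data.

```scala
// Hypothetical sketch: exhaustive ("brute-force") search over one hyperparameter.
object GridSearchSketch {

  // Toy model: 1-D ridge regression through the origin.
  // Closed form: w = sum(x*y) / (sum(x*x) + lambda)
  def fit(train: Seq[(Double, Double)], lambda: Double): Double = {
    val sxy = train.map { case (x, y) => x * y }.sum
    val sxx = train.map { case (x, _) => x * x }.sum
    sxy / (sxx + lambda)
  }

  // Mean squared error of the weight w on a data set.
  def mse(w: Double, data: Seq[(Double, Double)]): Double =
    data.map { case (x, y) => math.pow(y - w * x, 2) }.sum / data.size

  // Train once per candidate lambda; return the (lambda, validation MSE)
  // pair with the lowest validation error.
  def search(train: Seq[(Double, Double)],
             validation: Seq[(Double, Double)],
             grid: Seq[Double]): (Double, Double) =
    grid.map(lambda => (lambda, mse(fit(train, lambda), validation)))
      .minBy(_._2)
}
```

Each candidate is evaluated independently, which is why this kind of search parallelizes easily across a cluster; the sketch runs the candidates sequentially on local collections to stay self-contained.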

8
What is Data Science?
What skill sets does it require?
What tools are commonly used?
How do we architect data products?
How do we get started?

9 Three Camps

10 s3.amazonaws.com/aws.drewconway.com/viz/venn_diagram/data_science.html

11 Business

12 Business

13 Engineering vs. Statistics
Engineering
• Programming languages, systems languages
• Latency, throughput
• Huge data
• Online problems
• Automated
• Developers, Engineers
Statistics
• Statistical environments, BI tools
• High-level languages
• Accuracy
• Medium-sized data
• Offline work
• Ad-hoc
• Statisticians, Analysts

14 Data Science + Hadoop

15 Engineering, Statistics & Hadoop: Before Gap.

16 Engineering, Statistics & Hadoop: 2014 YARN RM

17 Apache Spark: Something for Everyone
• Now Apache TLP
  – From UC Berkeley AMPLab
  – … inspired by MS DryadLINQ
• Scala-based
  – Expressive, efficient
  – JVM-based
• Scala-like abstractions
  – RDD: Resilient Distributed (immutable) Dataset
  – Distributed works like local
  – Like Apache Crunch is Collection-like
• Read-Evaluate-Print-Loop
  – Interactive
  – No compile/deploy cycle needed
• Python API too
• Natively Distributed
• Hadoop-friendly
  – Integrate with where data already is
  – ETL no longer separate
• Subprojects: MLlib and more
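To make "distributed works like local" concrete, here is a small sketch on an ordinary local Scala collection. The object name `LocalCollectionSketch` and the function `tagCounts` are hypothetical, chosen only for this illustration. It counts how often each tag occurs in a sequence of (post ID, tag) pairs.

```scala
// Hypothetical sketch: a collection-style pipeline on a plain local Seq.
object LocalCollectionSketch {

  // Count occurrences of each tag in (postID, tag) pairs.
  def tagCounts(postTags: Seq[(Int, String)]): Map[String, Int] =
    postTags
      .map { case (_, tag) => (tag, 1) }        // emit (tag, 1) per pair
      .groupBy(_._1)                             // group the pairs by tag
      .map { case (tag, ones) => tag -> ones.map(_._2).sum } // sum the 1s
}
```

On a Spark RDD of the same pairs, the analogous pipeline would keep the `map` step and replace the grouping and summing with `reduceByKey(_ + _)`. The code has the same shape even though the work is spread across a cluster.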

18 Statisticians: Shell, Concise Syntax

<row Id="4" ... Tags="...c#...winforms..."/>
(4,"c#") (4,"winforms") ...
(4,3104,1.0) (4,2148819,1.0) ...

    scala> val postIDTags = postsXML.flatMap { line =>
      // Capture the post ID and the raw Tags attribute from one <row> line
      val idTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
      // Tags are XML-escaped in the raw line, e.g. &lt;c#&gt;&lt;winforms&gt;
      val tagRegex = "&lt;([^&]+)&gt;".r
      idTagRegex.findFirstMatchIn(line) match {
        case None => None              // line has no ID or no tags: emit nothing
        case Some(m) => {
          val postID = m.group(1).toInt
          val tagsString = m.group(2)
          val tags = tagRegex.findAllMatchIn(tagsString)
            .map(_.group(1)).toList
          tags.map((postID,_))         // one (postID, tag) pair per tag
        }
      }
    }

19 Engineers: Distributed, Manageable

20 2015 is Time to Operationalize

21 From Exploratory to Operational
• Exploratory Analytics; Operational Analytics
• Explore Data; Pick Model; Build Model at Scale, Offline; Continuously Update Model; Score Model in Real-Time

22 Lambda λ Architecture
noun. 1. Name of a design idea you’ve had before but didn’t realize was a thing that needed a name.

23 Lambda Architecture λ: Streaming
• Lambda Architecture
  – Batch Layer: compute full answer offline, in batch
  – Speed Layer: compute approximate answer online, in near-real-time
  – Serving Layer: stitch speed/batch answers together in real-time
• Great fit for big, real-time ML
• Ecosystem has right components now
  – Batch: Spark + MLlib
  – Speed: Spark Streaming
  – Serving: Tomcat / Jetty
  – Data Fabric: Kafka, HDFS
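The three layers can be sketched on local collections, using event counting as a stand-in for a model. This is an illustrative assumption, not the Oryx design or any real API: the names `LambdaSketch`, `batchView`, `speedUpdate`, and `serve` are hypothetical, and a real speed layer would typically hold an approximation, whereas this toy version is exact.

```scala
// Hypothetical sketch of the three Lambda Architecture layers,
// using per-key event counts in place of a real model.
object LambdaSketch {
  type Counts = Map[String, Long]

  // Batch layer: recompute the full answer from all historical events.
  def batchView(history: Seq[String]): Counts =
    history.groupBy(identity).map { case (k, v) => k -> v.size.toLong }

  // Speed layer: fold one new event into a small incremental view
  // covering only the events seen since the last batch run.
  def speedUpdate(view: Counts, event: String): Counts =
    view.updated(event, view.getOrElse(event, 0L) + 1L)

  // Serving layer: stitch the batch and speed answers together at query time.
  def serve(batch: Counts, speed: Counts, key: String): Long =
    batch.getOrElse(key, 0L) + speed.getOrElse(key, 0L)
}
```

For example, folding recent events into an empty view with `recent.foldLeft(Map.empty[String, Long])(LambdaSketch.speedUpdate)` yields the speed view, which `serve` then combines with the last batch result. When the next batch run completes, the speed view can be discarded and rebuilt from that point.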

24 Oryx 2: Lambda for ML (alpha) github.com/OryxProject/oryx

Thank You [email protected] @sean_r_owen

17TH ~ 18th NOV 2014 MADRID (SPAIN)
