Slide 1

Slide 1 text

STATE OF PLAY SEAN OWEN DIRECTOR OF DATA SCIENCE CLOUDERA

Slide 2

Slide 2 text

State of Play Data Science on Hadoop in 2015 Sean Owen // Director, Data Science @ Cloudera

Slide 3

Slide 3 text

About …
• Engineer
• Data Science @ Cloudera
• Oryx project founder
• Committer, erstwhile VP Apache Mahout
• Apache Spark contributor / personality
• Co-author, Mahout in Action / Advanced Analytics on Spark
• [email protected] / @sean_r_owen

Slide 4

Slide 4 text

Where Is My Magic Wand?

Slide 5

Slide 5 text

We Like Hadoop Because …
It’s Aspirational
• (Was) Shiny New Toy
• Be Like Yahoo, Google, FB
• Data as Strategy
It Costs Less
• Free – Just Add Hardware
• Open, Standard
• Cost-Savings Projects
We Get More Computing
• Bigger and Faster is Better
• Fewer Hacks to Survive Scale
• Do The Previously Impossible
www.avalonconsulting.net/blog/485-thinking-beyond-shiny-and-new
www.pianta.co.uk/massive-sale-now-on/
www.google.com/about/careers/locations/mayes-county/

Slide 6

Slide 6 text

Incremental Today vs. Revolutionary Tomorrow
Incremental Today:
• We set up a prototype Hadoop cluster as part of a big data POC
• We cut our IT budget by 22% by moving some operations to Hadoop
• Our SQL queries are 3 times faster and overnight reports finish in 39 minutes now
• We do the same things with data, but do them notably better.
Revolutionary Tomorrow:
• We want to become a real-time product business that reacts to new machine sensor data in seconds, not days
• We want to predict which merchants will take out a business loan this month
• We want a complete customer profile that “understands” what they want at any time
• We think there is a magic wand available?

Slide 7

Slide 7 text

Phase 1. Collect Data
Phase 2. Data Science?
Phase 3. Profit!

Slide 8

Slide 8 text

Demystifying with Data Science
• Machine Learning is not new
• Big Machine Learning is qualitatively different
  – More data beats algorithm improvement
  – Scale trumps noise and sample size effects
  – Can brute-force manual tasks
    • Feature selection
    • Hyperparameter tuning
• Engineering “Big” is Difficult
  – Build new scalable data platforms
  – Re-engineering parallel algorithms
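
To make "brute-force hyperparameter tuning" concrete, here is a minimal sketch of a grid search in plain Scala. It is not from the talk: the one-dimensional ridge regression model, the toy data, and every name (`GridSearchSketch`, `fit`, `mse`, `bestLambda`) are assumptions chosen only to keep the example small.

```scala
object GridSearchSketch {
  // Closed-form 1-D ridge regression: w = sum(x*y) / (sum(x*x) + lambda).
  def fit(train: Seq[(Double, Double)], lambda: Double): Double = {
    val sxy = train.map { case (x, y) => x * y }.sum
    val sxx = train.map { case (x, _) => x * x }.sum
    sxy / (sxx + lambda)
  }

  // Mean squared error of the model y = w * x on held-out data.
  def mse(w: Double, data: Seq[(Double, Double)]): Double =
    data.map { case (x, y) => math.pow(y - w * x, 2) }.sum / data.size

  // Brute force: try every candidate and keep the one with the lowest validation error.
  def bestLambda(train: Seq[(Double, Double)],
                 valid: Seq[(Double, Double)],
                 grid: Seq[Double]): Double =
    grid.minBy(lambda => mse(fit(train, lambda), valid))

  // Made-up toy data, roughly y = 2x (hypothetical, for illustration only).
  val train: Seq[(Double, Double)] = Seq((1.0, 2.1), (2.0, 3.9), (3.0, 6.2))
  val valid: Seq[(Double, Double)] = Seq((4.0, 8.0), (5.0, 10.1))
  val grid: Seq[Double] = Seq(0.0, 0.1, 1.0, 10.0)

  val chosen: Double = bestLambda(train, valid, grid)
}
```

Each candidate is evaluated independently, so on a cluster the same loop could in principle be spread across many machines, which is what makes the brute-force approach practical at scale.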

Slide 9

Slide 9 text

What is Data Science?
What skill sets does it require?
What tools are commonly used?
How do we architect data products?
How do we get started?

Slide 10

Slide 10 text

Three Camps

Slide 11

Slide 11 text

s3.amazonaws.com/aws.drewconway.com/viz/venn_diagram/data_science.html

Slide 12

Slide 12 text

Business

Slide 13

Slide 13 text

Business

Slide 14

Slide 14 text

Engineering vs. Statistics
Engineering:
• Programming languages
• Systems languages
• Latency, throughput
• Huge data
• Online problems
• Automated
• Developers, Engineers
Statistics:
• Statistical environments, BI tools
• High-level languages
• Accuracy
• Medium-sized data
• Offline work
• Ad-hoc
• Statisticians, Analysts

Slide 15

Slide 15 text

Data Science + Hadoop

Slide 16

Slide 16 text

Engineering, Statistics & Hadoop: Before
Gap.

Slide 17

Slide 17 text

Engineering, Statistics & Hadoop: 2014
YARN RM

Slide 18

Slide 18 text

Apache Spark: Something for Everyone
• Now Apache TLP
  – From UC Berkeley AMPLab
  – … inspired by MS DryadLINQ
• Scala-based
  – Expressive, efficient
  – JVM-based
• Scala-like abstractions
  – RDD: Resilient Distributed (immutable) Dataset
  – Distributed works like local
  – Collection-like, as Apache Crunch is
• Read-Evaluate-Print Loop (REPL)
  – Interactive
  – No compile/deploy cycle needed
• Python API too
• Natively Distributed
• Hadoop-friendly
  – Integrate with where data already is
  – ETL no longer separate
• Subprojects: MLlib and more

Slide 19

Slide 19 text

Statisticians: Shell, Concise Syntax
(4,"c#") (4,"winforms") ...
(4,3104,1.0) (4,2148819,1.0) ...
scala> val postIDTags = postsXML.flatMap { line =>
         // Capture the post ID and the raw Tags attribute from one XML row
         val idTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
         // Tags appear entity-escaped in the XML, e.g. &lt;c#&gt;&lt;winforms&gt;
         val tagRegex = "&lt;([^&]+)&gt;".r
         idTagRegex.findFirstMatchIn(line) match {
           case None => None
           case Some(m) => {
             val postID = m.group(1).toInt
             val tagsString = m.group(2)
             val tags = tagRegex.findAllMatchIn(tagsString)
               .map(_.group(1)).toList
             // Emit one (postID, tag) pair per tag
             tags.map((postID,_))
           }
         }
       }
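
The "distributed works like local" idea can be tried without a cluster: the sketch below runs the same kind of `flatMap` over a plain local `List` instead of an RDD. It is a minimal illustration, not the speaker's code. The regexes follow the slide's pattern, with tag delimiters written as the entities `&lt;` and `&gt;` on the assumption that the posts XML escapes them; the sample lines and the names `TagExtraction`, `extractTags` and `sampleLines` are made up.

```scala
object TagExtraction {
  // Post ID and raw Tags attribute of one XML row.
  private val idTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
  // Individual tags, assuming they are entity-escaped inside the attribute.
  private val tagRegex = "&lt;([^&]+)&gt;".r

  // Returns one (postID, tag) pair per tag; rows without Id/Tags yield nothing.
  def extractTags(line: String): List[(Int, String)] =
    idTagRegex.findFirstMatchIn(line) match {
      case None => Nil
      case Some(m) =>
        val postID = m.group(1).toInt
        tagRegex.findAllMatchIn(m.group(2)).map(t => (postID, t.group(1))).toList
    }

  // Hypothetical sample input, made up for illustration.
  val sampleLines: List[String] = List(
    "<row Id=\"4\" PostTypeId=\"1\" Tags=\"&lt;c#&gt;&lt;winforms&gt;\" />",
    "<row Id=\"7\" PostTypeId=\"2\" />"
  )

  // A local List stands in for the RDD; the flatMap call reads the same way.
  val postIDTags: List[(Int, String)] = sampleLines.flatMap(extractTags)
}
```

With these two sample rows, `TagExtraction.postIDTags` is `List((4,c#), (4,winforms))`; the second row has no `Tags` attribute and contributes nothing.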

Slide 20

Slide 20 text

Engineers: Distributed, Manageable

Slide 21

Slide 21 text

2015 is Time to Operationalize

Slide 22

Slide 22 text

From Exploratory to Operational
Exploratory Analytics, Operational Analytics
• Explore Data
• Pick Model
• Build Model at Scale, Offline
• Continuously Update Model
• Score Model in Real-Time

Slide 23

Slide 23 text

Lambda λ Architecture
noun. 1. Name of a design idea you’ve had before but didn’t realize was a thing that needed a name.

Slide 24

Slide 24 text

Lambda Architecture λ: Streaming
• Lambda Architecture
  – Batch Layer: compute full answer offline, in batch
  – Speed Layer: compute approximate answer online, in near-real-time
  – Serving Layer: stitch speed/batch answers together in real-time
• Great fit for big, real-time ML
• Ecosystem has right components now
  – Batch: Spark + MLlib
  – Speed: Spark Streaming
  – Serving: Tomcat / Jetty
  – Data Fabric: Kafka, HDFS
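
The three layers can be sketched with plain Scala collections. This is a toy illustration under stated assumptions, not how Spark, Spark Streaming or Oryx implement it: the "model" is just a count per key, the speed layer here happens to be exact rather than approximate, and all names (`LambdaSketch`, `batchView`, `speedUpdate`, `serve`) and events are hypothetical.

```scala
object LambdaSketch {
  type Counts = Map[String, Long]

  // Batch layer: recompute the full answer from all historical events.
  def batchView(history: Seq[String]): Counts =
    history.groupBy(identity).map { case (k, v) => k -> v.size.toLong }

  // Speed layer: fold one newly arrived event into a small incremental view.
  def speedUpdate(view: Counts, event: String): Counts =
    view.updated(event, view.getOrElse(event, 0L) + 1L)

  // Serving layer: stitch the batch and speed views together at query time.
  def serve(batch: Counts, speed: Counts, key: String): Long =
    batch.getOrElse(key, 0L) + speed.getOrElse(key, 0L)

  // Made-up events: three already processed in batch, two arriving since.
  val batch: Counts = batchView(Seq("a", "b", "a"))
  val speed: Counts = Seq("a", "c").foldLeft(Map.empty[String, Long])(speedUpdate)
}
```

Here `LambdaSketch.serve(LambdaSketch.batch, LambdaSketch.speed, "a")` returns 3: two occurrences from the batch view plus one from the speed view. When the next batch run absorbs the recent events, the speed view would be reset.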

Slide 25

Slide 25 text

Oryx 2: Lambda for ML (alpha)
github.com/OryxProject/oryx

Slide 26

Slide 26 text

Thank You [email protected] @sean_r_owen

Slide 27

Slide 27 text

17TH ~ 18th NOV 2014 MADRID (SPAIN)