Slide 1

Slide 1 text

© 2014 MapR Technologies
Lambda Architecture with Apache Spark
Michael Hausenblas, Chief Data Engineer, MapR
Big Data Beers, Berlin, 2014-07-24

Slide 2

Slide 2 text

Fault tolerance: hardware, software, developer?

Slide 3

Slide 3 text

Let’s talk about developers…

Slide 4

Slide 4 text

http://xkcd.com/327/

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

Let’s talk about developers…
human fault tolerance

Slide 8

Slide 8 text

Human fault tolerance

Slide 9

Slide 9 text

When things go wrong …
2010: unfortunate handling of error condition
http://allfacebook.com/the-real-reason-facebook-went-down-yesterday-its-complicated_b19366

Slide 10

Slide 10 text

When things go wrong …
2012: cascaded bug
http://money.cnn.com/2012/06/21/technology/twitter-down/index.htm

Slide 11

Slide 11 text

When things go wrong …
2012: upgrade of batch processing
http://www.v3.co.uk/v3-uk/news/2196577/rbs-takes-gbp125m-hit-over-it-outage

Slide 12

Slide 12 text

When things go wrong …
2014: bug/bad config
http://www.androidcentral.com/google-explains-reasons-behind-today-s-30-minute-service-outage

Slide 13

Slide 13 text

Lambda Architecture to the rescue!

Slide 14

Slide 14 text

Let’s step back a bit …
•  Nathan Marz (BackType, Twitter, stealth startup)
•  Creator of …
   –  Storm
   –  Cascalog
   –  ElephantDB
http://manning.com/marz/

Slide 15

Slide 15 text

Lambda Architecture: Requirements
•  Fault-tolerant against both hardware failures and human errors
•  Supports a variety of use cases, including low-latency querying as well as updates
•  Linear scale-out capabilities
•  Extensible, so that the system stays manageable and can easily accommodate new features

Slide 16

Slide 16 text

Lambda Architecture: Concept
•  Latency: the time it takes to run a query
•  Timeliness: how up to date the query results are (→ consistency)
•  Accuracy: tradeoff between performance and scalability (→ approximations)
query = function(all data)
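The idea "query = function(all data)" can be sketched in a few lines of plain Python. This is only an illustration, not code from the talk: the event tuples and the names `count_take_offs`, `master_data`, and `precomputed_view` are made up for the example. It also shows the latency/timeliness tension: a precomputed result is cheap to read but goes stale as soon as new data arrives.

```python
# A query is just a function applied to the complete dataset.
# All names and sample events below are hypothetical.

def count_take_offs(all_data):
    """Query: how many take-off events are in the data?"""
    return sum(1 for _airport, action in all_data if action == "take-off")

master_data = [
    ("DUB", "take-off"),
    ("HEL", "take-off"),
    ("LHR", "landing"),
]

# Precomputing the query gives low latency on reads ...
precomputed_view = count_take_offs(master_data)

# ... but the view is no longer timely once new data is appended.
master_data.append(("AMS", "take-off"))
fresh_result = count_take_offs(master_data)

print(precomputed_view, fresh_result)
```

Running the function over all data is always correct but slow at scale; reading the precomputed view is fast but lags behind. The layers of the Lambda Architecture are one way to balance the two.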

Slide 17

Slide 17 text

Lambda Architecture
[Diagram: a new data stream feeds both the batch layer (immutable master data, batch recompute, precompute views) and the speed layer (process stream, real-time increment, increment views). The serving layer holds the batch views (View 1, View 2, … View N), the speed layer holds the real-time views (View 1, View 2, … View N), and a query merges both.]

Slide 18

Slide 18 text

Lambda Architecture: Layers
•  Batch layer
   –  manages the master dataset, an immutable, append-only set of raw data
   –  pre-computes arbitrary query functions, called batch views
•  Serving layer indexes the batch views so that they can be queried ad hoc with low latency
•  Speed layer accommodates all requests that are subject to low-latency requirements. It uses fast, incremental algorithms and deals with recent data only
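How the three layers cooperate can be sketched in plain Python. This is a toy model under stated assumptions, not the talk's implementation: the functions `batch_recompute`, `speed_increment`, and `merged_query` and the sample events are hypothetical, and a real system would use a distributed store and a stream processor instead of in-memory counters.

```python
from collections import Counter

def batch_recompute(master_data):
    """Batch layer: recompute the view from scratch over all master data."""
    return Counter(airport for airport, action in master_data
                   if action == "take-off")

def speed_increment(realtime_view, event):
    """Speed layer: update the real-time view incrementally, one event at a time."""
    airport, action = event
    if action == "take-off":
        realtime_view[airport] += 1

def merged_query(batch_view, realtime_view, airport):
    """Query: merge the batch view with the real-time view."""
    return batch_view[airport] + realtime_view[airport]

# Data already absorbed by the last batch run (hypothetical sample).
master_data = [("DUB", "take-off"), ("HEL", "take-off"),
               ("DUB", "take-off"), ("LHR", "landing")]
batch_view = batch_recompute(master_data)

# Recent events the batch layer has not absorbed yet.
recent_events = [("DUB", "take-off"), ("AMS", "take-off")]
realtime_view = Counter()
for event in recent_events:
    master_data.append(event)          # the master dataset is append-only
    speed_increment(realtime_view, event)

dub_before = merged_query(batch_view, realtime_view, "DUB")

# The next batch run absorbs the recent data; the real-time view is discarded.
batch_view = batch_recompute(master_data)
realtime_view = Counter()
dub_after = merged_query(batch_view, realtime_view, "DUB")

print(dub_before, dub_after)
```

The answer is the same before and after the batch run; only the layer that supplies the recent part changes. Because the batch view is always recomputed from the raw data, a bug in the incremental code is corrected by the next batch run.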

Slide 19

Slide 19 text

Lambda Architecture: Compensate
[Diagram: time axis with the labels "batch", "not absorbed", and "now"]

Slide 20

Slide 20 text

Lambda Architecture: Immutable Data + Views
http://openflights.org

Slide 21

Slide 21 text

Lambda Architecture: Immutable Data + Views

immutable master dataset (the slide appends one row at a time):

timestamp            airport  flight  action
2014-01-01T10:00:00  DUB      EI123   take-off
2014-01-01T10:05:00  HEL      SAS45   take-off
2014-01-01T10:07:00  AMS      BA99    take-off
2014-01-01T10:09:00  LHR      LH17    landing
2014-01-01T10:10:00  CDG      AF03    landing
2014-01-01T10:10:00  FCO      AZ501   take-off

Slide 22

Slide 22 text

Lambda Architecture: Immutable Data + Views

immutable master dataset:

timestamp            airport  flight  action
2014-01-01T10:00:00  DUB      EI123   take-off
2014-01-01T10:05:00  HEL      SAS45   take-off
2014-01-01T10:07:00  AMS      BA99    take-off
2014-01-01T10:09:00  LHR      LH17    landing
2014-01-01T10:10:00  CDG      AF03    landing
2014-01-01T10:10:00  FCO      AZ501   take-off

views:

airport load:
airport  planes
AMS      69
CDG      44
DUB      31
FCO      10
HEL      17
LHR      101

air-borne per airline:
airline  planes
AF       59
AZ       23
BA       167
EI       19
LH       201
SAS      28

air-borne: 2307
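Deriving views from an immutable master dataset can be sketched as follows. The six events are the rows shown on the slide; everything else is an assumption made for illustration: the `Event` type, the helper names, and the rule that the airline code is the leading run of letters in the flight number. The views are deliberately simpler than the slide's, and the results differ from the slide's figures, which presumably come from a much larger dataset.

```python
import re
from collections import Counter
from typing import NamedTuple

class Event(NamedTuple):
    timestamp: str
    airport: str
    flight: str
    action: str

# Immutable master dataset: a tuple of immutable records (rows from the slide).
MASTER_DATASET = (
    Event("2014-01-01T10:00:00", "DUB", "EI123", "take-off"),
    Event("2014-01-01T10:05:00", "HEL", "SAS45", "take-off"),
    Event("2014-01-01T10:07:00", "AMS", "BA99", "take-off"),
    Event("2014-01-01T10:09:00", "LHR", "LH17", "landing"),
    Event("2014-01-01T10:10:00", "CDG", "AF03", "landing"),
    Event("2014-01-01T10:10:00", "FCO", "AZ501", "take-off"),
)

def airline_of(flight):
    # Assumption: the airline code is the leading run of letters.
    return re.match(r"[A-Z]+", flight).group(0)

def movements_per_airport(events):
    """View: number of recorded movements per airport."""
    return Counter(e.airport for e in events)

def take_offs_per_airline(events):
    """View: number of take-offs per airline."""
    return Counter(airline_of(e.flight) for e in events
                   if e.action == "take-off")

def net_airborne_change(events):
    """View: take-offs minus landings over the recorded events."""
    return sum(1 if e.action == "take-off" else -1 for e in events)

airport_view = movements_per_airport(MASTER_DATASET)
airline_view = take_offs_per_airline(MASTER_DATASET)
airborne_delta = net_airborne_change(MASTER_DATASET)
print(airport_view, airline_view, airborne_delta)
```

Each view is a pure function of the raw events, so any view can be thrown away and rebuilt at any time, and new views can be added later without touching the stored data.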

Slide 23

Slide 23 text

Implementing the Lambda Architecture

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

How about an integrated approach?
•  Twitter Summingbird
•  Lambdoop
•  Apache Spark

Slide 27

Slide 27 text

Apache Spark

Slide 28

Slide 28 text

Apache Spark
•  Originally developed in 2009 at UC Berkeley’s AMPLab
•  Open sourced in 2010 as part of the BDAS stack
•  Top-level Apache project as of 2014
http://spark.apache.org/

Slide 29

Slide 29 text

The Spark community

Slide 30

Slide 30 text

Spark: a unified platform …
Stack, top to bottom:
•  Spark SQL (SQL/HQL), Spark Streaming (stream processing), MLlib (machine learning), GraphX (graph processing)
•  Spark (core execution engine)
•  Mesos, YARN
•  Distributed File System (local FS, HDFS, S3, …)
Continued innovation bringing new functionality, such as:
•  Tachyon (shared RDDs, off-heap solution)
•  BlinkDB (approximate queries)
•  SparkR (R wrapper for Spark)

Slide 31

Slide 31 text

Easy and fast Big Data
•  Easy to Develop
   –  Rich APIs available through Java, Scala, Python
   –  Interactive shell
•  Fast to Run
   –  Advanced data storage model (automated optimization between memory and disk)
   –  General execution graphs
2-5× less code; up to 10× faster on disk, 100× in memory
https://amplab.cs.berkeley.edu/benchmark/

Slide 32

Slide 32 text

… for complex workloads …
•  Iterative Algorithms
   –  machine learning
   –  graph processing beyond DAG
•  Interactive Data Mining
•  Streaming Applications

Slide 33

Slide 33 text

… across multiple data sources
•  Local Files
   –  file:///opt/httpd/logs/access_log
•  Object Stores (e.g. Amazon S3)
•  HDFS
   –  text files, sequence files, any other Hadoop InputFormat
•  Key-Value datastores (e.g. Apache HBase)

Slide 34

Slide 34 text

Easy: expressive API
map, reduce

Slide 35

Slide 35 text

Easy: expressive API
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...

Slide 36

Slide 36 text

Easy: get started immediately

Python
    lines = sc.textFile(...)
    lines.filter(lambda s: "ERROR" in s).count()

Scala
    val lines = sc.textFile(...)
    lines.filter(x => x.contains("ERROR")).count()

Java
    JavaRDD<String> lines = sc.textFile(...);
    lines.filter(new Function<String, Boolean>() {
      public Boolean call(String s) {
        return s.contains("ERROR");
      }
    }).count();

Slide 37

Slide 37 text

… and scale as you go (mentally and physically)
YARN, Standalone

Slide 38

Slide 38 text

Resilient Distributed Datasets (RDD)
•  RDDs are the core of the Spark execution engine
•  Collections of elements that can be operated on in parallel
•  Persistent in memory between operations
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

Slide 39

Slide 39 text

RDD Operations
•  Lazy evaluation is key to Spark
•  Transformations
   –  Create a new dataset from an existing one: map, filter, distinct, union, sample, groupByKey, join, etc.
•  Actions
   –  Return a value after running a computation: collect, count, first, takeSample, foreach, etc.
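The split between lazy transformations and eager actions can be illustrated with a small plain-Python analogy. This is not Spark's API or its implementation: the `LazyDataset` class, its methods, and the sample log lines are hypothetical and only mimic the behaviour described above. Transformations merely record what to do; work happens when an action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)

    # Transformations: return a new dataset, compute nothing yet.
    def map(self, fn):
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._ops + (("filter", predicate),))

    def _evaluate(self):
        # Chain the recorded steps into one lazy pipeline over the source.
        items = iter(self._source)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger the computation and return a value.
    def collect(self):
        return list(self._evaluate())

    def count(self):
        return sum(1 for _ in self._evaluate())


inspected = []  # records which lines the predicate has actually looked at

def is_error(line):
    inspected.append(line)
    return "ERROR" in line

log_lines = ["INFO start", "ERROR disk full", "WARN slow", "ERROR timeout"]

errors = LazyDataset(log_lines).filter(is_error).map(str.lower)
inspected_before_action = len(inspected)   # nothing has run yet

error_count = errors.count()               # the action triggers evaluation
inspected_after_action = len(inspected)

print(inspected_before_action, error_count, inspected_after_action)
```

Because the pipeline is only a description until an action runs, an engine is free to plan the whole chain before executing it; in this toy version each action simply re-runs the chain from the source.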

Slide 40

Slide 40 text

RDD persistence
http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-persistence

Slide 41

Slide 41 text

And in the real world?

Slide 42

Slide 42 text

Industry Leading Ad-Targeting Platform: Real-time Decisions
•  High performance analytics over MapR-DB
•  Load from MapR-DB table into RDD to augment scoring
•  Results stored back in MapR-DB for other applications

Slide 43

Slide 43 text

Cisco: Security Intelligence Operations
•  Sensor data lands in MapR
•  Spark Streaming on MapR for first check on known threats
•  Data next processed on GraphX and Mahout
•  Additional SQL querying done via Shark and Impala

Slide 44

Slide 44 text

Leading Pharma Company: NextGen Genomics
•  Existing process takes several weeks to align chemical compounds with genes
•  ADAM on Spark allows realignment in a few hours
•  Geneticists can minimize engineering dependency

Slide 45

Slide 45 text

Further resources …

Slide 46

Slide 46 text

The book: Learning Spark
http://shop.oreilly.com/product/0636920028512.do

Slide 47

Slide 47 text

http://lambda-architecture.net

Slide 48

Slide 48 text

http://spark-stack.org

Slide 49

Slide 49 text

Conclusion
•  Let’s scale systems and humans
•  How? Lambda Architecture!
•  Apache Spark is an efficient way to implement Lambda Architecture

Slide 50

Slide 50 text

Q & A
@mhausenblas
mhausenblas@mapr.com
Engage with us! MapR, maprtech, mapr-technologies