Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!

Slide 2

© 2016 MapR Technologies @tgrall

{“about” : “me”}

Tugdual “Tug” Grall
• MapR
  • Technical Evangelist
• MongoDB
  • Technical Evangelist
• Couchbase
  • Technical Evangelist
• eXo
  • CTO
• Oracle
  • Developer/Product Manager
  • Mainly Java/SOA
• Developer in consulting firms

• Web
  • @tgrall
  • http://tgrall.github.io
  • tgrall
• NantesJUG co-founder
• Pet Project:
  • http://www.resultri.com

Slide 5

Data Hub

Choose the best “connector”:
• File
• Sqoop
• ETL
• …

Use the aggregated data:
• In your applications
• To update other systems
• As an Open Data API
• …

[Diagram labels: Customer DB; Customer DB; Logs; …; Hadoop; NoSQL]

Slide 6

Financial Services

[Diagram labels: Fraud detection; Personalized offers; Fraud investigation tool; Fraud investigator; Fraud model; Recommendations table; Clickstream analysis; Online transactions; MapR Distribution for Hadoop; Analytics; Real-time Operational Applications; Interactive marketer]

Slide 15

Lambda Architecture Requirements
• Fault-tolerant against both hardware failures and human errors
• Supports a variety of use cases, including low-latency queries as well as updates
• Linear scale-out capabilities
• Extensible, so that the system stays manageable and can easily accommodate new features

Slide 17

Lambda Architecture

[Diagram: a NEW DATA STREAM feeds both the BATCH LAYER and the SPEED LAYER.
• BATCH LAYER: IMMUTABLE MASTER DATA, BATCH RECOMPUTE, PRECOMPUTE VIEWS
• SPEED LAYER: PROCESS STREAM, INCREMENT VIEWS
• SERVING LAYER: BATCH VIEWS (View 1, View 2, … View N) and REAL-TIME VIEWS (View 1, View 2, … View N), MERGE, QUERY]

Slide 19

Batch Layer
• Manages the master dataset, an immutable, append-only set of raw data
• Pre-computes the results of arbitrary query functions; these results are called batch views

[Diagram: BATCH LAYER: IMMUTABLE MASTER DATA, BATCH RECOMPUTE, PRECOMPUTE VIEWS, BATCH VIEWS (View 1, View 2, … View N)]
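The two responsibilities of the batch layer can be sketched in a few lines of plain Python. This is a minimal, single-machine illustration, not how a real batch layer (for example MapReduce or Spark over a distributed file system) is implemented. The `Event` record shape, the `MasterDataset` class and the count-per-key view are all hypothetical names chosen for the example.

```python
from collections import Counter
from typing import NamedTuple


class Event(NamedTuple):
    """One raw, immutable fact (hypothetical record shape)."""
    timestamp: str
    key: str


class MasterDataset:
    """Append-only store: records can be added, never updated or deleted."""

    def __init__(self):
        self._records = []

    def append(self, event):
        self._records.append(event)

    def scan(self):
        # Hand out an immutable snapshot so callers cannot rewrite history.
        return tuple(self._records)


def recompute_batch_view(master):
    """Rebuild the view from scratch as a pure function of all raw data."""
    return dict(Counter(event.key for event in master.scan()))


master = MasterDataset()
master.append(Event("2016-02-04T10:00:00", "a"))
master.append(Event("2016-02-04T10:05:00", "b"))
master.append(Event("2016-02-04T10:07:00", "a"))

batch_view = recompute_batch_view(master)
print(batch_view)
```

Because the view is derived only from the raw data, a faulty view can be thrown away and rebuilt by running the function again.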

Slide 20

Speed Layer
• Serves requests that are subject to low-latency requirements
• Uses fast, incremental algorithms and deals with recent data only

[Diagram: SPEED LAYER: PROCESS STREAM, INCREMENT VIEWS, REAL-TIME VIEWS (View 1, View 2, … View N)]
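As a rough illustration of "incremental" and "recent data only", the sketch below updates a view one event at a time instead of rescanning history. The `RealTimeView` class and its `expire` method are hypothetical. Dropping the real-time state once a batch run has absorbed the same events is a common Lambda Architecture convention, assumed here rather than taken from the slide.

```python
from collections import defaultdict


class RealTimeView:
    """Hypothetical in-memory view that is updated per incoming event."""

    def __init__(self):
        self.counts = defaultdict(int)

    def increment(self, key, amount=1):
        # Constant-time update: no rescan of historical data is needed.
        self.counts[key] += amount

    def expire(self):
        # Assumption: once the batch layer has caught up, recent state is dropped.
        self.counts.clear()


view = RealTimeView()
for key in ["a", "a", "b"]:
    view.increment(key)
print(dict(view.counts))
```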

Slide 21

Serving Layer
• Indexes the batch views so that they can be queried ad hoc with low latency

[Diagram: SERVING LAYER: QUERY, MERGE of BATCH VIEWS (View 1, View 2, … View N) and REAL-TIME VIEWS (View 1, View 2, … View N)]
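The MERGE step in the diagram can be sketched as a query that combines a batch view with a real-time view. The example assumes both views are simple per-key counts, so merging is plain addition; other view types would need their own merge logic. The function name and the sample numbers are hypothetical.

```python
def merged_query(key, batch_view, realtime_view):
    """Answer a query from the indexed batch view plus recent increments."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)


# Hypothetical views: the batch view covers history up to the last batch run,
# the real-time view covers only events that arrived since then.
batch_view = {"a": 100, "b": 40}
realtime_view = {"a": 2, "c": 1}

print(merged_query("a", batch_view, realtime_view))
```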

Slide 24

Lambda Architecture: Immutable Data + Views

Immutable master dataset:

timestamp            airport  flight  action
2016-02-04T10:00:00  MUC      EY123   take-off
2016-02-04T10:05:00  BRU      SAS45   take-off
2016-02-04T10:07:00  AMS      BA99    take-off
2016-02-04T10:09:00  LHR      LH17    landing
2016-02-04T10:10:00  CDG      AF03    landing
2016-02-04T10:10:00  FCO      AZ501   take-off

Slide 25

Lambda Architecture: Immutable Data + Views

timestamp            airport  flight  action
2016-02-04T10:00:00  MUC      EY123   take-off
2016-02-04T10:05:00  BRU      SAS45   take-off
2016-02-04T10:07:00  AMS      BA99    take-off
2016-02-04T10:09:00  LHR      LH17    landing
2016-02-04T10:10:00  CDG      AF03    landing
2016-02-04T10:10:00  FCO      AZ501   take-off

Views:

air-borne: 2307

air-borne per airline:
airline  planes
AF       59
AZ       23
BA       167
EY       19
LH       201
SAS      28

airport load:
airport  planes
AMS      69
CDG      44
BRU      31
FCO      10
HEL      17
LHR      101
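To make the link between the raw events and a view concrete, here is a small Python sketch that derives an "air-borne per airline" count from the six events listed above. Two things are assumptions: the airline code is taken to be the leading letters of the flight number, and a take-off counts +1 while a landing counts -1. The slide's figures (for example 2307 air-borne planes) come from a much larger dataset that is not shown, so the output here will not match them, and landings without a recorded take-off show up as negative counts.

```python
import re
from collections import Counter

# The six events from the master dataset excerpt.
EVENTS = [
    ("2016-02-04T10:00:00", "MUC", "EY123", "take-off"),
    ("2016-02-04T10:05:00", "BRU", "SAS45", "take-off"),
    ("2016-02-04T10:07:00", "AMS", "BA99", "take-off"),
    ("2016-02-04T10:09:00", "LHR", "LH17", "landing"),
    ("2016-02-04T10:10:00", "CDG", "AF03", "landing"),
    ("2016-02-04T10:10:00", "FCO", "AZ501", "take-off"),
]


def airline_of(flight):
    """Assumption: the airline code is the leading letters of the flight number."""
    return re.match(r"[A-Z]+", flight).group(0)


def airborne_per_airline(events):
    """Net take-offs minus landings per airline, over the given events only."""
    view = Counter()
    for _timestamp, _airport, flight, action in events:
        view[airline_of(flight)] += 1 if action == "take-off" else -1
    return dict(view)


def airborne_total(events):
    """Total net air-borne count, derived from the per-airline view."""
    return sum(airborne_per_airline(events).values())


print(airborne_per_airline(EVENTS))
print(airborne_total(EVENTS))
```

The "airport load" view is left out because the slide does not define how it is computed.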

Slide 27

Lambda Architecture

[Diagram: Lambda Architecture overview, same as on slide 17: NEW DATA STREAM, BATCH LAYER, SPEED LAYER, SERVING LAYER, QUERY]

Slide 31

© 2015 MapR Technologies @tgrall

Spark components
• Spark SQL
• Spark Streaming (Streaming)
• MLlib (Machine Learning)
• GraphX (Graph Computation)
• Spark Core (General execution engine)
• Mesos
• Hadoop YARN
• Distributed File System (HDFS, MapR-FS, S3, …)

Slide 33

Spark Resilient Distributed Datasets “RDD”

[Diagram: sc.textFile creates the Sensor RDD, whose partitions P1 to P4 are spread over three “W Executor” boxes: one holds P4, one holds P1 and P3, one holds P2]

P1: 8213034705, 95, 2.927373, jake7870, 0……
P2: 8213034705, 115, 2.943484, Davidbresler2, 1….
P3: 8213034705, 100, 2.951285, gladimacowgirl, 58…
P4: 8213034705, 117, 2.998947, daysrus, 95….

Slide 35

Transformations
• Process an RDD, return an RDD
• Examples:
  • map(): one value => another value
  • mapToPair(): one value => a tuple
  • filter(): filters values/tuples on a given condition
  • groupByKey(): groups values by key
  • reduceByKey(): aggregates values by key
  • join(), cogroup(), …: joins RDDs
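The difference between the key-based transformations listed above is easier to see on a tiny in-memory example. The functions below are plain-Python stand-ins that mimic what `mapToPair()`, `groupByKey()` and `reduceByKey()` compute. They are not the Spark API, they run on one machine, and their names and the sample data are hypothetical.

```python
from collections import defaultdict
from functools import reduce
from operator import add


def map_to_pair(values, func):
    """One value => one (key, value) tuple."""
    return [func(v) for v in values]


def group_by_key(pairs):
    """Collect all values that share a key into a list."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_by_key(pairs, func):
    """Fold the values of each key into a single aggregate."""
    return {key: reduce(func, values) for key, values in group_by_key(pairs).items()}


words = ["a", "b", "a", "c", "a"]
pairs = map_to_pair(words, lambda w: (w, 1))

print(group_by_key(pairs))
print(reduce_by_key(pairs, add))
```

Grouping keeps every value per key, while reducing keeps only one aggregate per key, which is usually what a counting or summing job needs.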

Slide 36

Actions
• Process an RDD, return a value
• Examples:
  • count(): counts the number of items in the dataset
  • first(): returns the first entry
  • take(n): returns an array of the first n elements
  • foreach(): applies a function to each element
  • collect(): returns all elements
  • saveAsTextFile(): saves each element to files
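In Spark, transformations are lazy: they only describe a new RDD, and nothing is computed until an action asks for a value. The toy class below models that split on a single machine using Python generators. `ToyRDD` is a hypothetical teaching aid, not the Spark API, and it ignores partitioning and distribution entirely.

```python
import itertools


class ToyRDD:
    """Toy single-machine stand-in for an RDD (not the Spark API)."""

    def __init__(self, source):
        # `source` is a zero-argument callable that returns a fresh iterator.
        self._source = source

    @classmethod
    def of(cls, items):
        data = list(items)
        return cls(lambda: iter(data))

    # Transformations: return a new ToyRDD and compute nothing yet.
    def map(self, func):
        return ToyRDD(lambda: (func(x) for x in self._source()))

    def filter(self, predicate):
        return ToyRDD(lambda: (x for x in self._source() if predicate(x)))

    # Actions: run the pipeline and return a plain value.
    def count(self):
        return sum(1 for _ in self._source())

    def take(self, n):
        return list(itertools.islice(self._source(), n))

    def collect(self):
        return list(self._source())


calls = []


def double(x):
    calls.append(x)  # Record when the function actually runs.
    return x * 2


pipeline = ToyRDD.of([1, 2, 3, 4]).map(double).filter(lambda x: x > 2)
ran_before_action = list(calls)  # Still empty: only transformations so far.
result = pipeline.collect()      # The action triggers the computation.
print(ran_before_action, result)
```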

Slide 44

Spark Streaming Architecture
• Divide the data stream into batches of X seconds (micro-batching)
• The result is called a DStream = a sequence of RDDs

[Diagram: input data stream, Spark Streaming, DStream of RDD batches, one per batch interval:
• data from time 0 to 1: RDD @ time 1
• data from time 1 to 2: RDD @ time 2
• data from time 2 to 3: RDD @ time 3]
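Micro-batching itself is a simple idea and can be sketched without Spark. The function below groups timestamped events into consecutive windows of `batch_interval` seconds and labels each batch with its end time, mirroring "data from time 0 to 1 becomes RDD @ time 1". The function name and event format are hypothetical, and this sketch skips empty intervals instead of emitting an empty batch for them.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp_seconds, value) events into fixed-width batches.

    Returns a list of (batch_end_time, values) tuples ordered by time.
    """
    buckets = defaultdict(list)
    for timestamp, value in events:
        index = int(timestamp // batch_interval)
        buckets[index].append(value)
    return [((i + 1) * batch_interval, buckets[i]) for i in sorted(buckets)]


events = [(0.2, "a"), (0.9, "b"), (1.5, "c"), (2.1, "d"), (2.7, "e")]
batches = micro_batches(events, batch_interval=1)
print(batches)
```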

Slide 47

Lambda Architecture

[Diagram: Lambda Architecture overview, same as on slide 17, with technology labels added: NoSQL; Distributed File System; NoSQL; Streams]

Slide 48

Lambda Architecture in Action

[Diagram labels: Geolocation (×4); Real-time stream; Online alerts; Batch processing (MapReduce); Tax reduction reporting; Shortest path graph algorithm (Titan on MapR-DB); Route optimization; . . .]

Slide 49

Lambda Architecture
• Fault-tolerant
• Use the batch layer to pre-compute queries over complex/large data sets
• Use the speed layer to deal with “near real time” use cases
• Linear scale-out capabilities
• Tolerant of errors:
  • Recompute data from the master data set when needed
