Slide 1

Slide 1 text

© 2016 MapR Technologies
Tugdual Grall, Technical Evangelist, @tgrall
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
OOP-2016, Feb 04, 2016

Slide 2

Slide 2 text

{“about” : “me”}
Tugdual “Tug” Grall
• MapR
  – Technical Evangelist
• MongoDB
  – Technical Evangelist
• Couchbase
  – Technical Evangelist
• eXo
  – CTO
• Oracle
  – Developer/Product Manager
  – Mainly Java/SOA
• Developer in consulting firms
• Web
  – @tgrall
  – http://tgrall.github.io
  – tgrall
• NantesJUG co-founder
• Pet Project:
  – http://www.resultri.com
  – [email protected]

Slide 3

Slide 3 text

Big Data & Hadoop In Production

Slide 4

Slide 4 text

Data Warehouse Optimization

Slide 5

Slide 5 text

Data Hub
Choose the best “connector”:
• File
• Sqoop
• ETL
• …
Use the aggregated data
• In your applications
• To update other systems
• As an Open Data API
• …
[Diagram labels: Customer DB, Customer DB, Logs, …, Hadoop, NoSQL]

Slide 6

Slide 6 text

Financial Services
[Diagram labels: Fraud detection, Personalized offers, Fraud investigation tool, Fraud investigator, Fraud model, Recommendations table, Clickstream analysis, Online transactions, MapR Distribution for Hadoop, Analytics, Real-time Operational Applications, Interactive marketer]

Slide 7

Slide 7 text

Fault Tolerance

Slide 8

Slide 8 text

Fault Tolerance
• hardware
• software
• developer?

Slide 9

Slide 9 text

Human fault tolerance

Slide 10

Slide 10 text


Slide 11

Slide 11 text


Slide 12

Slide 12 text


Slide 13

Slide 13 text

Lambda Architecture (λ): To the rescue

Slide 14

Slide 14 text

A little bit of history…
• Defined by Nathan Marz
• ex BackType, Twitter
• in a new Startup
• Creator of …
  – Storm
  – Cascalog
  – ElephantDB

Slide 15

Slide 15 text

Lambda Architecture Requirements
• Fault-tolerant against both hardware failures and human errors
• Supports a variety of use cases, including low-latency querying as well as updates
• Linear scale-out capabilities
• Extensible, so that the system stays manageable and can accommodate new features easily

Slide 16

Slide 16 text


Slide 17

Slide 17 text

Lambda Architecture
[Diagram: the NEW DATA STREAM feeds both the BATCH LAYER (IMMUTABLE MASTER DATA, BATCH RECOMPUTE, PRECOMPUTE VIEWS) and the SPEED LAYER (PROCESS STREAM, INCREMENT VIEWS). The SERVING LAYER holds the BATCH VIEWS (View 1, View 2, … View N); the speed layer produces the REAL-TIME VIEWS (View 1, View 2, … View N). A QUERY is answered by a MERGE of both sets of views.]

Slide 18

Slide 18 text

Data Ingestion
All data entering the system are dispatched to both
• the batch layer
• the speed layer
[Diagram labels: NEW DATA STREAM, BATCH LAYER, SPEED LAYER]
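The dual dispatch described on this slide can be sketched in a few lines of plain Python. This is only an illustration: `BatchLayer`, `SpeedLayer`, and `ingest` are hypothetical names that do not come from the talk, and a real system would typically put a messaging layer in front of both consumers.

```python
from dataclasses import dataclass, field


@dataclass
class BatchLayer:
    """Hypothetical batch layer: keeps every raw event, append-only."""
    master_dataset: list = field(default_factory=list)

    def append(self, event: dict) -> None:
        self.master_dataset.append(event)


@dataclass
class SpeedLayer:
    """Hypothetical speed layer: sees the same events for low-latency processing."""
    recent_events: list = field(default_factory=list)

    def process(self, event: dict) -> None:
        self.recent_events.append(event)


def ingest(event: dict, batch: BatchLayer, speed: SpeedLayer) -> None:
    """Dispatch one incoming event to both layers."""
    batch.append(event)
    speed.process(event)


batch, speed = BatchLayer(), SpeedLayer()
ingest({"flight": "EY123", "airport": "MUC", "action": "take-off"}, batch, speed)
ingest({"flight": "LH17", "airport": "LHR", "action": "landing"}, batch, speed)
```

Both layers end up with the same events; they differ only in what they do with them afterwards.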

Slide 19

Slide 19 text

Batch Layer
The batch layer is responsible for:
• managing the master dataset, an immutable, append-only set of raw data
• pre-computing arbitrary query functions; the results are called batch views
[Diagram labels: BATCH LAYER (IMMUTABLE MASTER DATA, BATCH RECOMPUTE, PRECOMPUTE VIEWS), BATCH VIEWS (View 1, View 2, … View N)]

Slide 20

Slide 20 text

Speed Layer
• The speed layer serves requests that are subject to low-latency requirements.
• It uses fast, incremental algorithms and deals with recent data only.
[Diagram labels: SPEED LAYER (PROCESS STREAM, INCREMENT VIEWS), REAL-TIME VIEWS (View 1, View 2, … View N)]
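An incremental update, as opposed to a full recomputation, might look like the following sketch. The event shape and the rule "take-off adds one air-borne plane, landing removes one" are assumptions made for illustration; `apply_event` is a hypothetical name.

```python
from collections import Counter


def apply_event(view: Counter, event: dict) -> None:
    """Incrementally update the real-time view for one new event."""
    airline = event["airline"]
    if event["action"] == "take-off":
        view[airline] += 1
    elif event["action"] == "landing":
        view[airline] -= 1


# Change in air-borne planes per airline since the last batch run.
realtime_view = Counter()
for event in [
    {"airline": "EY", "action": "take-off"},
    {"airline": "BA", "action": "take-off"},
    {"airline": "EY", "action": "take-off"},
    {"airline": "BA", "action": "landing"},
]:
    apply_event(realtime_view, event)
```

Each event touches a single counter, so the cost per event stays constant no matter how large the master dataset grows.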

Slide 21

Slide 21 text

Serving Layer
• The serving layer indexes the batch views so that they can be queried ad hoc with low latency.
[Diagram labels: QUERY, MERGE, SERVING LAYER, BATCH VIEWS (View 1, View 2, … View N), REAL-TIME VIEWS (View 1, View 2, … View N)]
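The MERGE step in the diagram can be sketched as a query function that combines a batch view with a real-time view. This assumes both views are simple per-key counters that can be added together; the `query` function and the sample values are hypothetical.

```python
def query(key: str, batch_view: dict, realtime_view: dict) -> int:
    """Answer a query by merging the precomputed and the recent counts."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)


# Assumed sample data: counts up to the last batch run, plus changes since then.
batch_view = {"AF": 59, "BA": 167}
realtime_view = {"BA": 3, "EY": 1}

ba_total = query("BA", batch_view, realtime_view)
af_total = query("AF", batch_view, realtime_view)
ey_total = query("EY", batch_view, realtime_view)
```

Addition only works because these views are counts. Other aggregates (distinct counts, averages) need a merge function that suits them.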

Slide 22

Slide 22 text

Lambda Architecture: Compensate
[Diagram labels: Batch, time, not absorbed, now]

Slide 23

Slide 23 text

Lambda Architecture: Immutable Data + Views
http://openflights.org

Slide 24

Slide 24 text

Lambda Architecture: Immutable Data + Views

immutable master dataset:
timestamp            airport  flight  action
2016-02-04T10:00:00  MUC      EY123   take-off
2016-02-04T10:05:00  BRU      SAS45   take-off
2016-02-04T10:07:00  AMS      BA99    take-off
2016-02-04T10:09:00  LHR      LH17    landing
2016-02-04T10:10:00  CDG      AF03    landing
2016-02-04T10:10:00  FCO      AZ501   take-off

Slide 25

Slide 25 text

Lambda Architecture: Immutable Data + Views

timestamp            airport  flight  action
2016-02-04T10:00:00  MUC      EY123   take-off
2016-02-04T10:05:00  BRU      SAS45   take-off
2016-02-04T10:07:00  AMS      BA99    take-off
2016-02-04T10:09:00  LHR      LH17    landing
2016-02-04T10:10:00  CDG      AF03    landing
2016-02-04T10:10:00  FCO      AZ501   take-off

air-borne: 2307

air-borne per airline:
airline  planes
AF       59
AZ       23
BA       167
EY       19
LH       201
SAS      28

airport load:
airport  planes
AMS      69
CDG      44
BRU      31
FCO      10
HEL      17
LHR      101
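To make the relation between the master dataset and its views concrete, here is a small sketch that derives two views from the six rows shown on the slide. The slide does not define how its views are computed, and its totals clearly come from a much larger dataset, so the two functions below (take-offs per airline, events per airport) are stand-ins chosen for illustration. `airline_of` assumes the airline code is the flight number without its trailing digits.

```python
from collections import Counter

# The six rows shown on the slide: (timestamp, airport, flight, action).
MASTER_DATASET = (
    ("2016-02-04T10:00:00", "MUC", "EY123", "take-off"),
    ("2016-02-04T10:05:00", "BRU", "SAS45", "take-off"),
    ("2016-02-04T10:07:00", "AMS", "BA99", "take-off"),
    ("2016-02-04T10:09:00", "LHR", "LH17", "landing"),
    ("2016-02-04T10:10:00", "CDG", "AF03", "landing"),
    ("2016-02-04T10:10:00", "FCO", "AZ501", "take-off"),
)


def airline_of(flight: str) -> str:
    """Assumption: airline code = flight number minus its trailing digits."""
    return flight.rstrip("0123456789")


def takeoffs_per_airline(events) -> Counter:
    """View 1 (illustrative): number of take-offs per airline."""
    return Counter(
        airline_of(flight)
        for _, _, flight, action in events
        if action == "take-off"
    )


def events_per_airport(events) -> Counter:
    """View 2 (illustrative): number of events seen per airport."""
    return Counter(airport for _, airport, _, _ in events)


airline_view = takeoffs_per_airline(MASTER_DATASET)
airport_view = events_per_airport(MASTER_DATASET)
```

The views are pure functions of the raw events, so they can be thrown away and rebuilt at any time without touching the master dataset.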

Slide 26

Slide 26 text

Implementation

Slide 27

Slide 27 text

Lambda Architecture
[Diagram: the NEW DATA STREAM feeds both the BATCH LAYER (IMMUTABLE MASTER DATA, BATCH RECOMPUTE, PRECOMPUTE VIEWS) and the SPEED LAYER (PROCESS STREAM, INCREMENT VIEWS). The SERVING LAYER holds the BATCH VIEWS (View 1, View 2, … View N); the speed layer produces the REAL-TIME VIEWS (View 1, View 2, … View N). A QUERY is answered by a MERGE of both sets of views.]

Slide 28

Slide 28 text

Batch Layer: View Generation
[Diagram labels: Events, “Raw” Storage (Master Data, Master Data, Master Data, Master Data), Processing, Aggregated Data (View 1, View 2)]

Slide 29

Slide 29 text


Slide 30

Slide 30 text

Apache Spark
• Cluster Computing Platform
• Extends “MapReduce” with extensions
  – Streaming
  – Interactive Analytics
• Runs in memory

Slide 31

Slide 31 text

Spark components
• Spark SQL
• Spark Streaming (Streaming)
• MLlib (Machine Learning)
• GraphX (Graph Computation)
• Spark Core (General execution engine)
• Mesos, Hadoop YARN
• Distributed File System (HDFS, MapR-FS, S3, …)

Slide 32

Slide 32 text

Spark Jobs
Driver Program (application):
    sc = new SparkContext
    rDD = sc.textFile("hdfs:// …")
    rDD.map
[Diagram labels: Driver Program, Cluster Manager, Worker (Executor: Task, Task), Worker (Executor: Task, Task)]

Slide 33

Slide 33 text

Spark Resilient Distributed Datasets “RDD”
sc.textFile → Sensor RDD, split into partitions:
P1  8213034705, 95, 2.927373, jake7870, 0……
P2  8213034705, 115, 2.943484, Davidbresler2, 1….
P3  8213034705, 100, 2.951285, gladimacowgirl, 58…
P4  8213034705, 117, 2.998947, daysrus, 95….
[Diagram labels: three workers (W), each with an Executor, holding P4; P1 and P3; P2]

Slide 34

Slide 34 text

Spark Resilient Distributed Datasets
[Diagram: RDD → Transformation Filter() → newRDD → Action Count() → Value]

Slide 35

Slide 35 text

Transformations
• Process an RDD, return an RDD
• Examples:
  – map() : one value => another value
  – mapToPair() : one value => a tuple
  – filter() : filters values/tuples on a given condition
  – groupByKey() : groups values by key
  – reduceByKey() : aggregates values by key
  – join(), cogroup(), … : join RDDs

Slide 36

Slide 36 text

Actions
• Process an RDD, return a value
• Examples:
  – count() : counts the number of items in the dataset
  – first() : returns the first entry
  – take(n) : returns an array of the first n elements
  – foreach() : applies a function to each element
  – collect() : returns all elements
  – saveAsTextFile() : saves each element to files
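Spark evaluates transformations lazily: they only describe a new dataset, and nothing runs until an action asks for a value. The distinction can be imitated in plain Python. The `LazyDataset` class below is a hypothetical stand-in, not the Spark API, and it ignores partitioning and distribution entirely.

```python
from typing import Callable, Iterable


class LazyDataset:
    """Hypothetical stand-in for an RDD: records a pipeline, runs it on demand."""

    def __init__(self, source: Callable[[], Iterable]):
        self._source = source  # called again for every action

    # Transformations: return a new LazyDataset and compute nothing yet.
    def map(self, fn: Callable) -> "LazyDataset":
        return LazyDataset(lambda: (fn(x) for x in self._source()))

    def filter(self, predicate: Callable) -> "LazyDataset":
        return LazyDataset(lambda: (x for x in self._source() if predicate(x)))

    # Actions: run the whole pipeline and return a plain value.
    def count(self) -> int:
        return sum(1 for _ in self._source())

    def collect(self) -> list:
        return list(self._source())


lines = ["MUC,EY123,take-off", "LHR,LH17,landing", "AMS,BA99,take-off"]
parsed_lines = []  # records when parsing actually happens


def parse(line: str) -> list:
    parsed_lines.append(line)
    return line.split(",")


takeoffs = LazyDataset(lambda: lines).map(parse).filter(lambda f: f[2] == "take-off")
parsed_before_action = len(parsed_lines)  # still 0: only transformations so far
takeoff_count = takeoffs.count()          # the action triggers the work
```

After `count()` returns, `parsed_lines` holds all three input lines, which shows that the `map` step ran only when the action was called.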

Slide 37

Slide 37 text

Speed Layer
[Diagram labels: Events, Processing, NoSQL, Real Time View 1, Real Time View 2]

Slide 38

Slide 38 text

Serving Layer: Aggregated Data
• Views are stored in a Read/Write database
  – Apache HBase
  – MapR-DB (Binary & JSON)
  – Cassandra
  – MongoDB
  – Elasticsearch
  – …

Slide 39

Slide 39 text

Serving Layer
[Diagram labels: Events, Processing, Real Time View, Aggregated Batch View, Query - SQL, Dataviz, Query/Visualisation, SQL]

Slide 40

Slide 40 text

// Join MapR-DB Table, Parquet and MongoDB collection
> SELECT u.name, b.category, count(1) nb_review
  FROM mongo.yelp.`user` u,
       dfs.yelp.`review.parquet` r,
       (select business_id, flatten(categories) category from maprdb.`business`) b
  WHERE u.user_id = r.user_id
    AND b.business_id = r.business_id
  GROUP BY u.user_id, u.name, b.category
  ORDER BY nb_review DESC
  LIMIT 10;
+-----------+--------------+------------+
| name      | category     | nb_review  |
+-----------+--------------+------------+
| Rand      | Restaurants  | 1086       |
| J         | Restaurants  | 661        |
| Aileen    | Restaurants  | 499        |
| Michael   | Restaurants  | 496        |
+-----------+--------------+------------+

Slide 41

Slide 41 text

Events Capture?

Slide 42

Slide 42 text

Events Capture
[Diagram labels: Customer DB, API, Logs, …, Streaming, Streams, Files]

Slide 43

Slide 43 text

What is Spark Streaming?
• Enables scalable, high-throughput, fault-tolerant stream processing of live data
• Extension of the core Spark API
[Diagram labels: Data Sources, Data Sinks]

Slide 44

Slide 44 text

Spark Streaming Architecture
• Divides the data stream into batches of X seconds (micro-batching)
• The result is called a DStream = a sequence of RDDs
[Diagram: input data stream → Spark Streaming → DStream of RDD batches, one per batch interval: data from time 0 to 1 → RDD @ time 1, data from time 1 to 2 → RDD @ time 2, data from time 2 to 3 → RDD @ time 3]
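The micro-batching idea can be sketched without Spark: timestamped events are grouped into consecutive intervals, and each group is then processed as one small batch. `micro_batches` is a hypothetical helper. It assumes timestamps are non-negative seconds measured from the start of the stream, and it works on a finite list rather than a live stream.

```python
def micro_batches(events: list, batch_interval: float) -> list:
    """Group (timestamp, value) pairs into one list per batch interval."""
    if not events:
        return []
    last_index = max(int(ts // batch_interval) for ts, _ in events)
    batches = [[] for _ in range(last_index + 1)]  # empty intervals stay empty
    for ts, value in events:
        batches[int(ts // batch_interval)].append(value)
    return batches


stream = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (2.5, "d"), (2.9, "e")]
batches = micro_batches(stream, batch_interval=1.0)

# Each batch can now be handled like a small, ordinary dataset.
counts_per_batch = [len(batch) for batch in batches]
```

With a one-second interval the five events fall into three batches, matching the three intervals in the diagram above.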

Slide 45

Slide 45 text

What are Apache Kafka & MapR Streams?
• Publish/Subscribe Messaging
• Fast
• Scalable
• Durable
• Distributed
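The publish/subscribe model can be illustrated with a toy in-memory topic: producers append messages to a log, and each subscriber reads at its own pace by tracking an offset. `Topic` and `Subscriber` are hypothetical classes for illustration only. They are not the Kafka or MapR Streams API, and they leave out everything that makes those systems durable and distributed.

```python
class Topic:
    """Toy append-only message log."""

    def __init__(self) -> None:
        self._log = []

    def publish(self, message) -> int:
        """Append a message and return its offset."""
        self._log.append(message)
        return len(self._log) - 1

    def read(self, offset: int, max_messages: int = 10) -> list:
        """Return up to max_messages starting at the given offset."""
        return self._log[offset:offset + max_messages]


class Subscriber:
    """Reads a topic independently by remembering its own offset."""

    def __init__(self, topic: Topic) -> None:
        self.topic = topic
        self.offset = 0

    def poll(self) -> list:
        messages = self.topic.read(self.offset)
        self.offset += len(messages)
        return messages


flights = Topic()
batch_consumer = Subscriber(flights)
speed_consumer = Subscriber(flights)

flights.publish("EY123 take-off")
speed_first = speed_consumer.poll()   # the speed layer reads right away
flights.publish("LH17 landing")
batch_all = batch_consumer.poll()     # the batch layer catches up later
speed_second = speed_consumer.poll()
```

Because every subscriber keeps its own offset, the batch and speed layers can consume the same events at different speeds without interfering with each other.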

Slide 46

Slide 46 text

Summary

Slide 47

Slide 47 text

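Lambda Architecture
[Diagram: the NEW DATA STREAM feeds both the BATCH LAYER (IMMUTABLE MASTER DATA, BATCH RECOMPUTE, PRECOMPUTE VIEWS) and the SPEED LAYER (PROCESS STREAM, INCREMENT VIEWS). The SERVING LAYER holds the BATCH VIEWS (View 1, View 2, … View N); the speed layer produces the REAL-TIME VIEWS (View 1, View 2, … View N). A QUERY is answered by a MERGE of both sets of views. Technology labels added on this slide: NoSQL, Distributed File System, NoSQL, Streams.]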

Slide 48

Slide 48 text

Lambda Architecture in Action
[Diagram labels: Geolocation, Geolocation, Geolocation, Geolocation, …; Real-time stream; Online alerts; Batch processing (MapReduce); Tax reduction reporting; Shortest path graph algorithm (Titan on MapR-DB); Route optimization]

Slide 49

Slide 49 text

Lambda Architecture
• Fault-tolerant
• Use the batch layer to pre-compute complex/large data set queries
• Use the speed layer to deal with “near real time” use cases
• Linear scale-out capabilities
• Tolerant of human errors:
  – Recompute data from the master data set when needed
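The last point, recomputing from the master data set, is what gives the architecture its tolerance to human error. A minimal sketch, assuming a hypothetical view with a bug that is fixed later:

```python
from collections import Counter

# Immutable raw events: (airport, action). They are never modified.
MASTER_DATASET = (
    ("MUC", "take-off"),
    ("LHR", "landing"),
    ("AMS", "take-off"),
)


def takeoffs_per_airport_buggy(events) -> Counter:
    # Bug: the filter on the action is missing, so landings are counted too.
    return Counter(airport for airport, _ in events)


def takeoffs_per_airport_fixed(events) -> Counter:
    return Counter(airport for airport, action in events if action == "take-off")


# The buggy code is deployed and produces a wrong view...
corrupted_view = takeoffs_per_airport_buggy(MASTER_DATASET)

# ...the fix is deployed and the view is simply recomputed from the raw data.
batch_view = takeoffs_per_airport_fixed(MASTER_DATASET)
```

The bug only ever damaged a derived view. Since the raw events were untouched, rerunning the corrected function is enough to repair it.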

Slide 50

Slide 50 text


Slide 51

Slide 51 text

Q & A
@tgrall
Engage with us!
• maprtech
• [email protected]
• MapR
• maprtech
• mapr-technologies