Tugdual Grall
February 09, 2016

Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!

Lambda Architecture is a useful framework for thinking about the design of big data applications. It was initially developed at Twitter. In this presentation you will learn, based on concrete examples, how to build and deploy scalable and fault-tolerant applications, with a focus on Big Data and Hadoop.

This presentation was delivered at the OOP conference in Munich in February 2016.

Transcript

  1. Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!

    Tugdual Grall, Technical Evangelist, @tgrall
    OOP-2016, Feb 04, 2016
    © 2016 MapR Technologies
  2. {“about” : “me”}

    Tugdual “Tug” Grall
    • MapR: Technical Evangelist
    • MongoDB: Technical Evangelist
    • Couchbase: Technical Evangelist
    • eXo: CTO
    • Oracle: Developer/Product Manager, mainly Java/SOA
    • Developer in consulting firms
    • Web: @tgrall, http://tgrall.github.io, tgrall
    • NantesJUG co-founder
    • Pet project: http://www.resultri.com
    • [email protected] • [email protected]
  3. Data Hub

    Choose the best “connector”:
    • File
    • Sqoop
    • ETL
    • …
    Use the aggregated data:
    • In your applications
    • To update other systems
    • As an Open Data API
    • …
    [Diagram labels: Customer DB, Customer DB, Logs, …, Hadoop, NoSQL]
  4. Financial Services

    [Diagram labels: Fraud detection, Personalized offers, Fraud investigation tool, Fraud investigator, Fraud model, Recommendations table, Clickstream analysis, Online transactions, MapR Distribution for Hadoop, Analytics, Real-time, Operational Applications, Interactive marketer]
  5. A little bit of history…

    • Defined by Nathan Marz
    • Ex-BackType, Twitter
    • In a new startup
    • Creator of …
        – Storm
        – Cascalog
        – ElephantDB
  6. Lambda Architecture Requirements

    • Fault-tolerant against both hardware failures and human errors
    • Supports a variety of use cases, including low-latency querying as well as updates
    • Linear scale-out capabilities
    • Extensible, so that the system is manageable and can accommodate new features easily
  7. Lambda Architecture

    [Diagram]
    NEW DATA STREAM: feeds the BATCH LAYER and the SPEED LAYER
    BATCH LAYER: IMMUTABLE MASTER DATA, BATCH RECOMPUTE, PRECOMPUTE VIEWS
    SPEED LAYER: PROCESS STREAM, INCREMENT VIEWS, REAL-TIME VIEWS (View 1, View 2, View N)
    SERVING LAYER: BATCH VIEWS (View 1, View 2, View N)
    QUERY: MERGE of batch views and real-time views
  8. Data Ingestion

    All data entering the system is dispatched to both:
    • the batch layer
    • the speed layer
    [Diagram labels: NEW DATA STREAM, BATCH LAYER, SPEED LAYER]
  9. Batch Layer

    The batch layer is responsible for:
    • managing the master dataset, an immutable, append-only set of raw data
    • pre-computing the results of arbitrary query functions; these results are called batch views
    [Diagram labels: BATCH LAYER, IMMUTABLE MASTER DATA, BATCH RECOMPUTE, PRECOMPUTE VIEWS, BATCH VIEWS (View 1, View 2, View N)]
  10. Speed Layer

    • The speed layer serves requests that are subject to low-latency requirements
    • Using fast, incremental algorithms, it deals with recent data only
    [Diagram labels: SPEED LAYER, PROCESS STREAM, INCREMENT VIEWS, REAL-TIME VIEWS (View 1, View 2, View N)]
  11. Serving Layer

    • The serving layer indexes the batch views so that they can be queried ad hoc with low latency
    [Diagram labels: SERVING LAYER, BATCH VIEWS (View 1, View 2, View N), REAL-TIME VIEWS (View 1, View 2, View N), MERGE, QUERY]
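
The interplay of the three layers described in slides 8 to 11 can be illustrated with a minimal in-memory sketch. It is not taken from the deck: the class `LambdaSketch`, its method names and the choice of "events per airport" as the view are all hypothetical, and it assumes that a batch run is instantaneous (a real system has to handle events that arrive while the batch job is running).

```python
from collections import Counter


class LambdaSketch:
    """Hypothetical in-memory model of the three layers (illustration only)."""

    def __init__(self):
        self.master = []                # batch layer: append-only raw events
        self.batch_view = Counter()     # serving layer: precomputed batch view
        self.realtime_view = Counter()  # speed layer: view of recent events only

    def ingest(self, airport):
        # New data is dispatched to both layers.
        self.master.append(airport)        # batch layer: append, never update
        self.realtime_view[airport] += 1   # speed layer: increment the view

    def run_batch(self):
        # Batch recompute: rebuild the view from the whole master dataset,
        # then drop the real-time increments that the new batch view covers.
        self.batch_view = Counter(self.master)
        self.realtime_view = Counter()

    def query(self, airport):
        # Query time: merge the batch view with the real-time view.
        return self.batch_view[airport] + self.realtime_view[airport]


system = LambdaSketch()
for airport in ["MUC", "LHR", "MUC"]:
    system.ingest(airport)
before_batch = system.query("MUC")   # answered from the real-time view only

system.run_batch()
system.ingest("MUC")
after_batch = system.query("MUC")    # batch view plus one recent event
```

In this sketch the answer to a query is always correct, but where it comes from shifts over time: recent events are served from the speed layer until the next batch run absorbs them.
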
  12. Lambda Architecture: Immutable Data + Views

    Immutable master dataset:

    timestamp            airport  flight  action
    2016-02-04T10:00:00  MUC      EY123   take-off
    2016-02-04T10:05:00  BRU      SAS45   take-off
    2016-02-04T10:07:00  AMS      BA99    take-off
    2016-02-04T10:09:00  LHR      LH17    landing
    2016-02-04T10:10:00  CDG      AF03    landing
    2016-02-04T10:10:00  FCO      AZ501   take-off
  13. Lambda Architecture: Immutable Data + Views

    timestamp            airport  flight  action
    2016-02-04T10:00:00  MUC      EY123   take-off
    2016-02-04T10:05:00  BRU      SAS45   take-off
    2016-02-04T10:07:00  AMS      BA99    take-off
    2016-02-04T10:09:00  LHR      LH17    landing
    2016-02-04T10:10:00  CDG      AF03    landing
    2016-02-04T10:10:00  FCO      AZ501   take-off

    air-borne: 2307

    air-borne per airline:

    airline  planes
    AF       59
    AZ       23
    BA       167
    EY       19
    LH       201
    SAS      28

    airport load:

    airport  planes
    AMS      69
    CDG      44
    BRU      31
    FCO      10
    HEL      17
    LHR      101
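
As a rough illustration of how such views can be derived from the immutable events, here is a small sketch over the six rows shown on the slide. Several things are assumptions rather than content of the deck: the function names, the rule that the airline code is the leading run of letters in the flight number, and the reading of "airport load" as landings minus take-offs. Because the six rows are only an excerpt, the results are net changes, not the totals shown on the slide.

```python
import itertools
from collections import Counter

# Excerpt of the master dataset from the slide:
# (timestamp, airport, flight, action).
EVENTS = [
    ("2016-02-04T10:00:00", "MUC", "EY123", "take-off"),
    ("2016-02-04T10:05:00", "BRU", "SAS45", "take-off"),
    ("2016-02-04T10:07:00", "AMS", "BA99", "take-off"),
    ("2016-02-04T10:09:00", "LHR", "LH17", "landing"),
    ("2016-02-04T10:10:00", "CDG", "AF03", "landing"),
    ("2016-02-04T10:10:00", "FCO", "AZ501", "take-off"),
]


def airline_of(flight):
    # Assumption: the airline code is the leading run of letters.
    return "".join(itertools.takewhile(str.isalpha, flight))


def compute_views(events):
    """Derive three views from the raw events without modifying them."""
    airborne = 0
    per_airline = Counter()
    airport_load = Counter()
    for _timestamp, airport, flight, action in events:
        # Any action other than "take-off" is treated as a landing.
        delta = 1 if action == "take-off" else -1
        airborne += delta
        per_airline[airline_of(flight)] += delta
        # Assumption: a landing adds a plane to the airport, a take-off removes one.
        airport_load[airport] -= delta
    return airborne, dict(per_airline), dict(airport_load)


airborne, per_airline, airport_load = compute_views(EVENTS)
```

Since the views are a pure function of the events, they can be thrown away and rebuilt at any time, which is what the batch layer relies on.
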
  14. Lambda Architecture

    [Diagram]
    NEW DATA STREAM: feeds the BATCH LAYER and the SPEED LAYER
    BATCH LAYER: IMMUTABLE MASTER DATA, BATCH RECOMPUTE, PRECOMPUTE VIEWS
    SPEED LAYER: PROCESS STREAM, INCREMENT VIEWS, REAL-TIME VIEWS (View 1, View 2, View N)
    SERVING LAYER: BATCH VIEWS (View 1, View 2, View N)
    QUERY: MERGE of batch views and real-time views
  15. Batch Layer: View Generation

    [Diagram labels: Events, “Raw” Storage (Master Data), Processing, Aggregated Data (View 1, View 2)]
  16. Apache Spark

    • Cluster computing platform
    • Extends “MapReduce” with:
        – Streaming
        – Interactive analytics
    • Runs in memory
  17. Spark components

    • Spark SQL
    • Spark Streaming (Streaming)
    • MLlib (Machine Learning)
    • GraphX (Graph Computation)
    • Spark Core (General execution engine)
    • Mesos
    • Hadoop YARN
    • Distributed File System (HDFS, MapR-FS, S3, …)
  18. Spark Jobs

    [Diagram]
    Driver Program (application): sc = new SparkContext; rDD = sc.textFile(“hdfs:// …”); rDD.map
    Cluster Manager
    Worker: Executor (Task, Task)
    Worker: Executor (Task, Task)
  19. Spark Resilient Distributed Datasets “RDD”

    [Diagram] sc.textFile creates the Sensor RDD. Its partitions are spread across the worker (W) executors: one executor holds P4, one holds P1 and P3, one holds P2.
    P1: 8213034705, 95, 2.927373, jake7870, 0……
    P2: 8213034705, 115, 2.943484, Davidbresler2, 1….
    P3: 8213034705, 100, 2.951285, gladimacowgirl, 58…
    P4: 8213034705, 117, 2.998947, daysrus, 95….
  20. Transformations

    • Process an RDD, return an RDD
    • Examples:
        – map(): one value => another value
        – mapToPair(): one value => a tuple
        – filter(): filters values/tuples on a given condition
        – groupByKey(): groups values by key
        – reduceByKey(): aggregates values by key
        – join(), cogroup(), …: join RDDs
  21. Actions

    • Process an RDD, return a value
    • Examples:
        – count(): counts the number of items in the dataset
        – first(): returns the first entry
        – take(n): returns an array of the first n elements
        – foreach(): applies a function to each element
        – collect(): returns all elements
        – saveAsTextFile(): saves each element to text files
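
The split between transformations (RDD in, RDD out) and actions (RDD in, value out) can be mimicked in plain Python. The sketch below is a toy, not the Spark API: `MiniRDD` and its method names are hypothetical, everything runs in a single process, and the input lines are adapted from the flight events on slide 12. Transformations only describe the work; nothing is computed until an action is called.

```python
import itertools


class MiniRDD:
    """Hypothetical toy stand-in for an RDD; this is not the Spark API."""

    def __init__(self, compute):
        # compute: zero-argument function that returns a fresh iterator.
        self._compute = compute

    @classmethod
    def from_list(cls, items):
        data = list(items)
        return cls(lambda: iter(data))

    # Transformations: return a new MiniRDD, nothing is computed yet.
    def map(self, fn):
        return MiniRDD(lambda: (fn(x) for x in self._compute()))

    def filter(self, predicate):
        return MiniRDD(lambda: (x for x in self._compute() if predicate(x)))

    def reduce_by_key(self, fn):
        def run():
            totals = {}
            for key, value in self._compute():
                totals[key] = fn(totals[key], value) if key in totals else value
            return iter(totals.items())
        return MiniRDD(run)

    # Actions: run the pipeline and return a plain Python value.
    def count(self):
        return sum(1 for _ in self._compute())

    def first(self):
        return next(self._compute())

    def take(self, n):
        return list(itertools.islice(self._compute(), n))

    def collect(self):
        return list(self._compute())


lines = MiniRDD.from_list([
    "MUC,EY123,take-off", "BRU,SAS45,take-off", "AMS,BA99,take-off",
    "LHR,LH17,landing", "CDG,AF03,landing", "FCO,AZ501,take-off",
])
# Transformations only: build (action, 1) pairs and sum them per key.
counts = (lines.map(lambda line: line.split(","))
               .map(lambda fields: (fields[2], 1))
               .reduce_by_key(lambda a, b: a + b))
take_offs = lines.filter(lambda line: line.endswith("take-off"))

# Actions: this is where the work actually happens.
total = lines.count()
first_take_off = take_offs.first()
two_take_offs = take_offs.take(2)
per_action = counts.collect()
```
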
  22. Speed Layer

    [Diagram labels: Events, Processing, Real Time View 1, Real Time View 2, NoSQL]
  23. Serving Layer: Aggregated Data

    Views are stored in a read/write database:
    • Apache HBase
    • MapR DB Binary & JSON
    • Cassandra
    • MongoDB
    • Elasticsearch
    • …
  24. Serving Layer

    [Diagram labels: Events, Processing, Real Time View, Aggregated Batch View, Query - SQL, Dataviz, Query/Visualisation, SQL]
  25. Join MapR-DB Table, Parquet and MongoDB collection

    // Join MapR-DB Table, Parquet and MongoDB collection
    > SELECT u.name, b.category, count(1) nb_review
      FROM mongo.yelp.`user` u,
           dfs.yelp.`review.parquet` r,
           (SELECT business_id, flatten(categories) category
            FROM maprdb.`business`) b
      WHERE u.user_id = r.user_id
        AND b.business_id = r.business_id
      GROUP BY u.user_id, u.name, b.category
      ORDER BY nb_review DESC
      LIMIT 10;
    +-----------+--------------+------------+
    | name      | category     | nb_review  |
    +-----------+--------------+------------+
    | Rand      | Restaurants  | 1086       |
    | J         | Restaurants  | 661        |
    | Aileen    | Restaurants  | 499        |
    | Michael   | Restaurants  | 496        |
    +-----------+--------------+------------+
  26. What is Spark Streaming?

    • Enables scalable, high-throughput, fault-tolerant stream processing of live data
    • Extension of the core Spark API
    [Diagram labels: Data Sources, Data Sinks]
  27. Spark Streaming Architecture

    • Divides the data stream into batches of X seconds (micro-batching)
    • The result is called a DStream: a sequence of RDDs
    [Diagram] input data stream, Spark Streaming, DStream of RDD batches (one per batch interval):
    data from time 0 to 1: RDD @ time 1
    data from time 1 to 2: RDD @ time 2
    data from time 2 to 3: RDD @ time 3
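
The micro-batching idea can be sketched without Spark: timestamped events are cut into consecutive batches, one per interval, and each batch plays the role of one RDD in the DStream. The function `micro_batches` and the sample events are hypothetical. The sketch assumes non-negative timestamps in seconds and half-open intervals, so an event at exactly time 1 falls into the second batch.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp_seconds, value) pairs into consecutive batches.

    Batch i holds the values whose timestamp t satisfies
    i * batch_interval <= t < (i + 1) * batch_interval.
    Intervals without events yield an empty batch.
    """
    buckets = defaultdict(list)
    for timestamp, value in events:
        buckets[int(timestamp // batch_interval)].append(value)
    if not buckets:
        return []
    return [buckets.get(i, []) for i in range(max(buckets) + 1)]


# Hypothetical input stream with a batch interval of 1 second.
stream = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (2.5, "d"), (2.9, "e")]
batches = micro_batches(stream, batch_interval=1)

# Each batch can then be processed like a small, bounded dataset.
sizes = [len(batch) for batch in batches]
```
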
  28. What are Apache Kafka & MapR Streams?

    • Publish-subscribe messaging
    • Fast
    • Scalable
    • Durable
    • Distributed
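
To show why publish-subscribe messaging fits the ingestion step, where the same events must reach both the batch layer and the speed layer, here is a toy in-memory log. It is not the Kafka or MapR Streams API: `TopicLog`, `publish` and `poll` are hypothetical names, and there is no persistence, partitioning or distribution. The point it illustrates is that messages are retained in an append-only log and each subscriber keeps its own read position.

```python
class TopicLog:
    """Hypothetical in-memory topic; not the Kafka or MapR Streams API."""

    def __init__(self):
        self._messages = []   # append-only log of published messages
        self._offsets = {}    # next read position per subscriber

    def publish(self, message):
        self._messages.append(message)
        return len(self._messages) - 1   # offset of the new message

    def poll(self, subscriber, max_messages=10):
        # Each subscriber reads from its own offset, so reading a message
        # does not remove it for the other subscribers.
        start = self._offsets.get(subscriber, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[subscriber] = start + len(batch)
        return batch


log = TopicLog()
for event in ["e1", "e2", "e3"]:
    log.publish(event)

batch_seen = log.poll("batch-layer")                  # reads everything at once
speed_first = log.poll("speed-layer", max_messages=2)
speed_rest = log.poll("speed-layer")                  # resumes where it stopped
batch_again = log.poll("batch-layer")                 # nothing new yet
```
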
  29. Lambda Architecture

    [Diagram]
    NEW DATA STREAM: feeds the BATCH LAYER and the SPEED LAYER
    BATCH LAYER: IMMUTABLE MASTER DATA, BATCH RECOMPUTE, PRECOMPUTE VIEWS
    SPEED LAYER: PROCESS STREAM, INCREMENT VIEWS, REAL-TIME VIEWS (View 1, View 2, View N)
    SERVING LAYER: BATCH VIEWS (View 1, View 2, View N)
    QUERY: MERGE of batch views and real-time views
    Technology labels: NoSQL, Distributed File System, NoSQL, Streams
  30. Lambda Architecture in Action

    [Diagram labels: Geolocation (×4), Real-time stream, Online alerts, Batch processing (MapReduce), Tax reduction reporting, Shortest path graph algorithm (Titan on MapR-DB), Route optimization, . . .]
  31. Lambda Architecture

    • Fault-tolerant
    • Use the batch layer to pre-compute complex/large data set queries
    • Use the speed layer to deal with “near real time” use cases
    • Linear scale-out capabilities
    • Error prone:
        – Recompute data from the master data set when needed
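
The last point, recomputing from the master data set, is what makes the architecture tolerant of human errors: a bug in a view function corrupts only the derived view, never the raw events. The following sketch is a hypothetical example of that recovery path; the event values and both functions are invented for illustration.

```python
# Immutable raw events (a tuple, so they cannot be modified in place).
MASTER = ("take-off", "take-off", "landing", "take-off")


def buggy_airborne(events):
    # Bug: landings are ignored, so the view over-counts.
    return sum(1 for action in events if action == "take-off")


def fixed_airborne(events):
    # Fix: a landing removes a plane from the air.
    return sum(1 if action == "take-off" else -1 for action in events)


wrong_view = buggy_airborne(MASTER)      # view produced by the faulty job
corrected_view = fixed_airborne(MASTER)  # same raw data, recomputed after the fix
```

Nothing has to be repaired in place: deploying the corrected function and rerunning the batch job over the untouched master data set is enough to replace the wrong view.
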
  32. Q & A

    @tgrall • maprtech • [email protected]
    Engage with us! MapR • maprtech • mapr-technologies