Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!

Tugdual Grall
February 09, 2016

Lambda Architecture is a useful framework for thinking about the design of big data applications. It was initially developed at Twitter. In this presentation you will learn, based on concrete examples, how to build and deploy scalable, fault-tolerant applications, with a focus on Big Data and Hadoop.

This presentation was delivered at the OOP conference in Munich in February 2016.

Transcript

  1. © 2016 MapR Technologies
    Tugdual Grall
    Technical Evangelist
    @tgrall
    Lambda Architecture: The Best Way to Build
    Scalable and Reliable Applications!
    OOP-2016
    Feb 04, 2016

  2. © 2016 MapR Technologies
    @tgrall
    {“about” : “me”}
    Tugdual “Tug” Grall
    • MapR: Technical Evangelist
    • MongoDB: Technical Evangelist
    • Couchbase: Technical Evangelist
    • eXo: CTO
    • Oracle: Developer/Product Manager (mainly Java/SOA)
    • Developer in consulting firms
    • Web: @tgrall, http://tgrall.github.io, tgrall
    • NantesJUG co-founder
    • Pet project: http://www.resultri.com
    [email protected]
    [email protected]

  3. © 2016 MapR Technologies
    @tgrall 3
    Big Data & Hadoop
    In Production

  4. © 2016 MapR Technologies 4
    Data Warehouse Optimization

  5. © 2016 MapR Technologies 5
    Data Hub
    Choose the best “connector”:
    • File
    • Sqoop
    • ETL
    • …
    Use the aggregated data
    • In your applications
    • To update other systems
    • As an Open Data API
    • …
    [Diagram labels: Customer DB, Customer DB, Logs, Hadoop, NoSQL]

  6. © 2016 MapR Technologies 6
    Financial Services
    [Diagram labels: Fraud detection; Personalized offers; Fraud investigation tool; Fraud investigator; Fraud model; Recommendations table; Clickstream analysis; Online transactions; MapR Distribution for Hadoop; Analytics; Real-time Operational Applications; Interactive marketer]

  7. © 2016 MapR Technologies
    @tgrall 7
    Fault Tolerance

  8. © 2016 MapR Technologies 8
    Fault Tolerance
    hardware
    software
    developer
    ?

  9. © 2016 MapR Technologies 9
    Human fault tolerance

  10. © 2014 MapR Technologies 10

  11. © 2014 MapR Technologies 11

  12. © 2014 MapR Technologies 12

  13. © 2016 MapR Technologies
    @tgrall 13
    Lambda Architecture
    To the rescue
    λ

  14. © 2016 MapR Technologies 14
    A little bit of history…
    • Defined by Nathan Marz
    • Formerly at BackType and Twitter
    • Now at a new startup
    • Creator of:
    – Storm
    – Cascalog
    – ElephantDB

  15. © 2016 MapR Technologies 15
    Lambda Architecture Requirements
    • Fault-tolerant against both hardware failures and human errors
    • Supports a variety of use cases, including low-latency querying as well as updates
    • Linear scale-out capabilities
    • Extensible, so that the system stays manageable and can accommodate new features easily

  16. © 2016 MapR Technologies 16

  17. © 2016 MapR Technologies 17
    Lambda Architecture
    [Diagram: the NEW DATA STREAM feeds both the BATCH LAYER (IMMUTABLE MASTER DATA; BATCH RECOMPUTE; PRECOMPUTE VIEWS) and the SPEED LAYER (PROCESS STREAM; INCREMENT VIEWS). In the SERVING LAYER, a QUERY is answered by a MERGE of the BATCH VIEWS (View 1, View 2, … View N) and the REAL-TIME VIEWS (View 1, View 2, … View N).]

  18. © 2016 MapR Technologies 18
    Data Ingestion
    All data entering the system are dispatched to both
    • the batch layer
    • the speed layer
    NEW DATA 

    STREAM
    BATCH LAYER
    SPEED LAYER
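
A minimal sketch of this dispatch step, in plain Python rather than any particular messaging product. The `Event` fields follow the flight table shown later in this deck; `BatchLayer`, `SpeedLayer`, and `ingest` are hypothetical names introduced only for illustration.

```python
from collections import Counter
from typing import NamedTuple


class Event(NamedTuple):
    """One immutable raw event (fields follow the flight example in this deck)."""
    timestamp: str
    airport: str
    flight: str
    action: str


class BatchLayer:
    """Hypothetical batch layer: it only appends raw events to the master data."""

    def __init__(self):
        self.master_data = []

    def append(self, event):
        self.master_data.append(event)


class SpeedLayer:
    """Hypothetical speed layer: it updates a real-time view as each event arrives."""

    def __init__(self):
        self.events_per_airport = Counter()

    def process(self, event):
        self.events_per_airport[event.airport] += 1


def ingest(event, batch, speed):
    """Dispatch the same event to both layers."""
    batch.append(event)
    speed.process(event)


batch, speed = BatchLayer(), SpeedLayer()
ingest(Event("2016-02-04T10:00:00", "MUC", "EY123", "take-off"), batch, speed)
ingest(Event("2016-02-04T10:09:00", "LHR", "LH17", "landing"), batch, speed)
```

In a real deployment the dispatch would more likely go through a durable stream that both layers subscribe to; the direct function call here only shows that every event reaches both paths.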

  19. © 2016 MapR Technologies
    Batch Layer
    • Manages the master dataset, an immutable, append-only set of raw data
    • Pre-computes arbitrary query functions, called batch views
    BATCH VIEWS
    BATCH LAYER
    IMMUTABLE
    MASTER DATA
    PRECOMPUTE
    VIEWS
    BATCH
    RECOMPUTE
    View 1 View 2 View N
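
The two responsibilities can be sketched as follows, assuming an in-memory list stands in for the distributed master dataset. `MasterDataset` and `recompute_batch_view` are hypothetical names, and the three records are made-up sample data.

```python
from collections import Counter


class MasterDataset:
    """Append-only store: records can be added, never updated or deleted."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)

    def scan(self):
        # Hand out a tuple so callers cannot mutate the stored list.
        return tuple(self._records)


def recompute_batch_view(master):
    """Rebuild the view from scratch as a pure function of all raw data."""
    return Counter(airport for airport, _action in master.scan())


master = MasterDataset()
for record in [("MUC", "take-off"), ("LHR", "landing"), ("MUC", "landing")]:
    master.append(record)

batch_view = recompute_batch_view(master)

# A wrong view (for example, one produced by a buggy job) can be thrown away
# and rebuilt, because the raw data was never modified.
rebuilt_view = recompute_batch_view(master)
```

Because the view is derived only from the raw records, recomputing it always gives the same result, which is what makes recovery from a faulty job possible.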

  20. © 2016 MapR Technologies 20
    Speed Layer

    View 1 View 2 View N
    REAL-TIME VIEWS
    SPEED LAYER
    PROCESS
    STREAM
    INCREMENT
    VIEWS
    • The speed layer serves requests that are subject to low-latency requirements
    • It uses fast, incremental algorithms and deals with recent data only
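
As an illustration of an incremental algorithm, the sketch below updates a counter once per event instead of rescanning history. `RealTimeView` is a hypothetical name and the events are made-up sample data.

```python
from collections import Counter


class RealTimeView:
    """Hypothetical real-time view: take-offs minus landings per airline."""

    def __init__(self):
        self.delta = Counter()

    def increment(self, airline, action):
        # Constant-time update per event; historical data is never rescanned.
        self.delta[airline] += 1 if action == "take-off" else -1


view = RealTimeView()
for airline, action in [("EY", "take-off"), ("LH", "landing"), ("EY", "take-off")]:
    view.increment(airline, action)
```

The view only holds a delta for recent events; the totals for older data are expected to come from the batch views.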

  21. © 2016 MapR Technologies 21
    Serving Layer
    QUERY
    BATCH VIEWS

    View 1 View 2 View N
    REAL-TIME VIEWS
    SERVING LAYER
    MERGE
    View 1 View 2 View N
    • The serving layer indexes batch views so that they can be queried ad hoc with low latency
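
The MERGE step in the diagram can be sketched as a lookup in both views, assuming the views are simple key/value counts. The `query` function is hypothetical; the batch figures reuse two per-airline values from the example slide later in this deck, and the real-time deltas are made up.

```python
def query(key, batch_view, realtime_view):
    """Merge the precomputed batch count with the delta from recent events."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)


batch_view = {"AF": 59, "LH": 201}   # precomputed by the batch layer
realtime_view = {"LH": -1, "EY": 1}  # recent events not yet in the batch view

lh_now = query("LH", batch_view, realtime_view)
ey_now = query("EY", batch_view, realtime_view)
af_now = query("AF", batch_view, realtime_view)
```

A plain sum only works for views that are additive, such as counts. Other kinds of views need their own merge function.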

  22. © 2014 MapR Technologies 22
    Lambda Architecture—Compensate Batch
    time
    not absorbed
    now
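
One common reading of this timeline, offered here as an interpretation since the slide is only a diagram: the speed layer compensates for the events that the last batch run has not absorbed yet, and may drop everything older. The function name, the cutoff, and the sample events below are hypothetical.

```python
def not_yet_absorbed(recent_events, batch_cutoff):
    """Keep only events newer than the last timestamp covered by a batch run."""
    # ISO 8601 timestamps in the same format sort chronologically as strings.
    return [event for event in recent_events if event[0] > batch_cutoff]


recent = [
    ("2016-02-04T10:00:00", "MUC"),
    ("2016-02-04T10:07:00", "AMS"),
    ("2016-02-04T10:10:00", "CDG"),
]
batch_cutoff = "2016-02-04T10:05:00"  # assumed end of the last batch run

pending = not_yet_absorbed(recent, batch_cutoff)
```

Only the `pending` events need to be reflected in the real-time views; everything before the cutoff is already covered by the batch views.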

  23. © 2016 MapR Technologies 23
    Lambda Architecture—Immutable Data + Views
    http://openflights.org

  24. © 2016 MapR Technologies 24
    Lambda Architecture—Immutable Data + Views
    timestamp airport flight action
    2016-02-04T10:00:00 MUC EY123 take-off
    2016-02-04T10:05:00 BRU SAS45 take-off
    2016-02-04T10:07:00 AMS BA99 take-off
    2016-02-04T10:09:00 LHR LH17 landing
    2016-02-04T10:10:00 CDG AF03 landing
    2016-02-04T10:10:00 FCO AZ501 take-off
    immutable master dataset

  25. © 2016 MapR Technologies 25
    Lambda Architecture—Immutable Data + Views
    timestamp airport flight action
    2016-02-04T10:00:00 MUC EY123 take-off
    2016-02-04T10:05:00 BRU SAS45 take-off
    2016-02-04T10:07:00 AMS BA99 take-off
    2016-02-04T10:09:00 LHR LH17 landing
    2016-02-04T10:10:00 CDG AF03 landing
    2016-02-04T10:10:00 FCO AZ501 take-off
    air-borne: 2307
    air-borne per airline:
    airline planes
    AF 59
    AZ 23
    BA 167
    EY 19
    LH 201
    SAS 28
    airport load:
    airport planes
    AMS 69
    CDG 44
    BRU 31
    FCO 10
    HEL 17
    LHR 101
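
A sketch of how such views could be derived from the six rows shown on the slide. Two assumptions are made here and are not stated in the deck: the airline code is taken to be the leading letters of the flight number, and "airport load" is approximated by the number of movements per airport. Since the six rows are only a window of the full dataset, the results are deltas, not the totals shown on the slide.

```python
from collections import Counter
from itertools import takewhile

# The six master-dataset rows from the slide.
EVENTS = [
    ("2016-02-04T10:00:00", "MUC", "EY123", "take-off"),
    ("2016-02-04T10:05:00", "BRU", "SAS45", "take-off"),
    ("2016-02-04T10:07:00", "AMS", "BA99", "take-off"),
    ("2016-02-04T10:09:00", "LHR", "LH17", "landing"),
    ("2016-02-04T10:10:00", "CDG", "AF03", "landing"),
    ("2016-02-04T10:10:00", "FCO", "AZ501", "take-off"),
]


def airline_of(flight):
    """Assumption: the airline code is the leading run of letters."""
    return "".join(takewhile(str.isalpha, flight))


def build_views(events):
    """Derive three views from the immutable events in a single pass."""
    airborne = 0
    per_airline = Counter()
    movements_per_airport = Counter()
    for _timestamp, airport, flight, action in events:
        step = 1 if action == "take-off" else -1
        airborne += step
        per_airline[airline_of(flight)] += step
        movements_per_airport[airport] += 1
    return airborne, per_airline, movements_per_airport


airborne, per_airline, movements = build_views(EVENTS)
```

The events themselves are never changed; adding a new kind of view only means writing another function over the same rows.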

  26. © 2016 MapR Technologies
    @tgrall 26
    Implementation

  27. © 2016 MapR Technologies 27
    Lambda Architecture
    [Diagram: the NEW DATA STREAM feeds both the BATCH LAYER (IMMUTABLE MASTER DATA; BATCH RECOMPUTE; PRECOMPUTE VIEWS) and the SPEED LAYER (PROCESS STREAM; INCREMENT VIEWS). In the SERVING LAYER, a QUERY is answered by a MERGE of the BATCH VIEWS (View 1, View 2, … View N) and the REAL-TIME VIEWS (View 1, View 2, … View N).]

  28. © 2016 MapR Technologies 28
    Batch Layer: View Generation
    [Diagram: Events → “Raw” Storage (Master Data) → Processing → Aggregated Data (View 1, View 2)]

  29. © 2016 MapR Technologies 29

  30. © 2016 MapR Technologies 30
    • Cluster computing platform
    • Extends the “MapReduce” model with:
    – Streaming
    – Interactive analytics
    • Runs in memory

  31. © 2015 MapR Technologies
    @tgrall
    Spark components
    • Libraries: Spark SQL, Spark Streaming (streaming), MLlib (machine learning), GraphX (graph computation)
    • Spark Core (general execution engine)
    • Cluster managers: Mesos, Hadoop YARN
    • Storage: distributed file system (HDFS, MapR-FS, S3, …)

  32. © 2016 MapR Technologies 32
    Spark Jobs
    Driver Program (application):
    sc = new SparkContext
    rdd = sc.textFile("hdfs://…")
    rdd.map
    Cluster Manager
    Worker: Executor (Task, Task)
    Worker: Executor (Task, Task)

  33. © 2016 MapR Technologies 33
    Spark Resilient Distributed Datasets “RDD”
    [Diagram: sc.textFile creates the Sensor RDD, whose partitions are spread over three worker (W) executors: P4 | P1, P3 | P2]
    P1: 8213034705, 95, 2.927373, jake7870, 0……
    P2: 8213034705, 115, 2.943484, Davidbresler2, 1….
    P3: 8213034705, 100, 2.951285, gladimacowgirl, 58…
    P4: 8213034705, 117, 2.998947, daysrus, 95….

  34. © 2016 MapR Technologies 34
    Spark Resilient Distributed Datasets
    [Diagram: RDD → Transformation, e.g. filter() → newRDD → Action, e.g. count() → Value]

  35. © 2015 MapR Technologies
    @tgrall
    Transformations
    • Process an RDD and return a new RDD
    • Examples:
    • map(): one value => another value
    • mapToPair(): one value => a tuple
    • filter(): filters values/tuples on a given condition
    • groupByKey(): groups values by key
    • reduceByKey(): aggregates values by key
    • join(), cogroup(), …: joins RDDs
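
To make the semantics of the key-based transformations concrete, here is a plain-Python analogue on ordinary lists. It is not the Spark API: `group_by_key` and `reduce_by_key` are hypothetical helpers, they run on one machine, and the flight strings are made-up sample data.

```python
from collections import defaultdict
from functools import reduce


def group_by_key(pairs):
    """Collect all values that share a key: [(k, v)] -> {k: [v, ...]}."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_by_key(pairs, fn):
    """Fold the values of each key with fn: [(k, v)] -> {k: reduced value}."""
    return {key: reduce(fn, values) for key, values in group_by_key(pairs).items()}


flights = ["EY123 take-off", "LH17 landing", "EY7 take-off"]

# Map-to-pair step: one value => a (key, value) tuple.
pairs = [(line[:2], 1) for line in flights]

# Filter step: keep only the pairs that satisfy a condition.
ey_only = [pair for pair in pairs if pair[0] == "EY"]

# Reduce-by-key step: aggregate the values of each key.
counts = reduce_by_key(pairs, lambda a, b: a + b)
```

In Spark the same steps return new distributed datasets and are spread across executors; this sketch only mirrors the input and output shapes.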

  36. © 2015 MapR Technologies
    @tgrall
    Actions
    • Process an RDD and return a value
    • Examples:
    • count(): counts the number of items in the dataset
    • first(): returns the first entry
    • take(n): returns an array of the first n elements
    • foreach(): applies a function to each element
    • collect(): returns all elements
    • saveAsTextFile(): saves each element to text files
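
The split between transformations and actions matters because Spark evaluates transformations lazily: nothing runs until an action asks for a value. Python generators behave in a similar way, so the sketch below uses them as an analogy. It is not Spark code, and `traced` and `log` are hypothetical helpers.

```python
log = []


def traced(values):
    """Yield values one by one and record when each is actually read."""
    for value in values:
        log.append(value)
        yield value


source = traced([5, 12, 7, 30])

# "Transformations": this only describes the pipeline, nothing is read yet.
pipeline = (x * 2 for x in source if x > 6)
read_before_action = len(log)

# "Action": asking for the result pulls every element through the pipeline.
result = list(pipeline)
read_after_action = len(log)
```

Here `read_before_action` stays at zero because building the pipeline reads no data; the `list(...)` call plays the role of `collect()`.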

  37. © 2016 MapR Technologies 37
    Speed Layer
    [Diagram: Events → Processing → NoSQL (Real Time View 1, Real Time View 2)]

  38. © 2016 MapR Technologies 38
    Serving Layer: Aggregated Data
    • Views are stored in a Read/Write database
    • Apache HBase
    • MapR DB Binary & JSON
    • Cassandra
    • MongoDB
    • Elasticsearch
    • …

  39. © 2016 MapR Technologies 39
    Serving Layer
    [Diagram labels: Events; Processing; Aggregated (Real Time View, Batch View); Query/Visualisation (Query - SQL, Dataviz, SQL)]

  40. © 2016 MapR Technologies
    // Join MapR-DB Table, Parquet and MongoDB collection
    > SELECT u.name, b.category, count(1) nb_review
      FROM mongo.yelp.`user` u,
           dfs.yelp.`review.parquet` r,
           (SELECT business_id, flatten(categories) category
            FROM maprdb.`business`) b
      WHERE u.user_id = r.user_id
        AND b.business_id = r.business_id
      GROUP BY u.user_id, u.name, b.category
      ORDER BY nb_review DESC
      LIMIT 10;
    +----------+--------------+------------+
    | name     | category     | nb_review  |
    +----------+--------------+------------+
    | Rand     | Restaurants  | 1086       |
    | J        | Restaurants  | 661        |
    | Aileen   | Restaurants  | 499        |
    | Michael  | Restaurants  | 496        |
    +----------+--------------+------------+

  41. © 2016 MapR Technologies
    @tgrall 41
    Events Capture?

  42. © 2016 MapR Technologies 42
    Events Capture
    [Diagram labels: Customer DB, API, Logs; Streaming, Streams, Files]

  43. © 2016 MapR Technologies 43
    What is Spark Streaming?
    • Enables scalable, high-throughput, fault-tolerant stream processing of live data
    • Extension of the core Spark API
    [Diagram labels: Data Sources, Data Sinks]

  44. © 2016 MapR Technologies 44
    Spark Streaming Architecture
    • Divides the data stream into batches of X seconds (micro-batching)
    • The result is called a DStream: a sequence of RDDs
    [Diagram: input data stream → Spark Streaming → DStream of RDD batches, one per batch interval: data from time 0 to 1 → RDD @ time 1; data from time 1 to 2 → RDD @ time 2; data from time 2 to 3 → RDD @ time 3]
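
Micro-batching can be sketched in plain Python as grouping timestamped events by interval. This is not the Spark Streaming API: `micro_batches` is a hypothetical helper, each inner list stands in for one RDD, and the events are made-up sample data.

```python
def micro_batches(events, batch_interval):
    """Group (timestamp_seconds, value) events into consecutive intervals."""
    batches = {}
    for timestamp, value in events:
        index = int(timestamp // batch_interval)
        batches.setdefault(index, []).append(value)
    if not batches:
        return []
    # Emit one batch per interval, including empty intervals in between.
    return [batches.get(i, []) for i in range(max(batches) + 1)]


events = [(0.2, "a"), (0.9, "b"), (1.5, "c"), (2.1, "d"), (2.8, "e")]
dstream = micro_batches(events, batch_interval=1)
```

With a one-second interval, the three inner lists correspond to the three batches in the diagram: data from time 0 to 1, 1 to 2, and 2 to 3.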

  45. © 2016 MapR Technologies 45
    What are Apache Kafka & MapR Streams?
    • Publish Subscribe Messaging
    • Fast
    • Scalable
    • Durable
    • Distributed
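
The publish/subscribe idea can be sketched as an append-only log with one read position per subscriber. This in-memory `Topic` class is hypothetical; it is neither the Kafka nor the MapR Streams API, and it is not durable or distributed. It only shows why the batch layer and the speed layer can consume the same events independently.

```python
class Topic:
    """Hypothetical in-memory topic: an append-only log plus per-subscriber offsets."""

    def __init__(self):
        self._log = []
        self._offsets = {}

    def publish(self, message):
        self._log.append(message)

    def poll(self, subscriber):
        """Return the messages this subscriber has not seen and advance its offset."""
        start = self._offsets.get(subscriber, 0)
        self._offsets[subscriber] = len(self._log)
        return self._log[start:]


topic = Topic()
topic.publish("EY123 take-off")
topic.publish("LH17 landing")

batch_first = topic.poll("batch-layer")
speed_first = topic.poll("speed-layer")

topic.publish("AF03 landing")
speed_second = topic.poll("speed-layer")
```

Each subscriber sees every message exactly once in this sketch, regardless of when the other subscriber reads.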

  46. © 2016 MapR Technologies
    @tgrall 46
    Summary

  47. © 2016 MapR Technologies 47
    Lambda Architecture
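    [Diagram: the NEW DATA STREAM feeds both the BATCH LAYER (IMMUTABLE MASTER DATA; BATCH RECOMPUTE; PRECOMPUTE VIEWS) and the SPEED LAYER (PROCESS STREAM; INCREMENT VIEWS). In the SERVING LAYER, a QUERY is answered by a MERGE of the BATCH VIEWS (View 1, View 2, … View N) and the REAL-TIME VIEWS (View 1, View 2, … View N).]
    [Technology labels on the diagram: NoSQL, Distributed File System, NoSQL, Streams]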

  48. © 2016 MapR Technologies 48
    Lambda Architecture in Action
    [Diagram labels: Geolocation (×4, …); Real-time stream; Online alerts; Batch processing (MapReduce); Tax reduction reporting; Shortest path graph algorithm (Titan on MapR-DB); Route optimization]

  49. © 2016 MapR Technologies 49
    Lambda Architecture
    • Fault-tolerant
    • Use the batch layer to precompute complex/large data set queries
    • Use the speed layer to deal with “near real time” use cases
    • Linear scale-out capabilities
    • Tolerant of human error:
    • Recompute data from the master dataset when needed

  50. © 2016 MapR Technologies 50

  51. © 2016 MapR Technologies 51
    Q & A
    @tgrall maprtech
    [email protected]
    Engage with us!
    MapR
    maprtech
    mapr-technologies
