Lambda Architecture with Apache Spark

Talk at Big Data Beers in Berlin, 2014-07-24, see also: http://www.meetup.com/Big-Data-Beers/events/189314292/

Michael Hausenblas

July 24, 2014

Transcript

  1. Lambda Architecture with Apache Spark
    Michael Hausenblas, Chief Data Engineer MapR
    Big Data Beers, Berlin, 2014-07-24
    © 2014 MapR Technologies

  2. Fault tolerance
    hardware
    software
    developer
    ?

  3. Let’s talk about developers…

  4. http://xkcd.com/327/

  7. human fault tolerance
    Let’s talk about developers…

  8. Human fault tolerance

  9. When things go wrong …
    http://allfacebook.com/the-real-reason-facebook-went-down-yesterday-its-complicated_b19366
    2010: unfortunate handling of error condition

  10. When things go wrong …
    2012: cascaded bug
    http://money.cnn.com/2012/06/21/technology/twitter-down/index.htm

  11. When things go wrong …
    http://www.v3.co.uk/v3-uk/news/2196577/rbs-takes-gbp125m-hit-over-it-outage
    2012: upgrade of batch processing

  12. When things go wrong …
    http://www.androidcentral.com/google-explains-reasons-behind-today-s-30-minute-service-outage
    2014: bug/bad config

  13. Lambda Architecture to the rescue!

  14. Let’s step back a bit …
    •  Nathan Marz (Backtype, Twitter, stealth startup)
    •  Creator of …
    –  Storm
    –  Cascalog
    –  ElephantDB
    http://manning.com/marz/

  15. Lambda Architecture: Requirements
    •  Fault-tolerant against both hardware failures and human errors
    •  Supports a variety of use cases, including low-latency querying as well as updates
    •  Linear scale-out capabilities
    •  Extensible, so that the system stays manageable and can accommodate new features easily

  16. Lambda Architecture: Concept
    •  Latency—the time it takes to run a query
    •  Timeliness—how up to date the query results are
    (à consistency)
    •  Accuracy—tradeoff between performance and scalability
    (à approximations)
    query = function(all data)
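
The formula "query = function(all data)" can be read quite literally. The sketch below is only an illustration under assumptions: the event records and the `query` function are hypothetical and not taken from the talk.

```python
from collections import Counter

# Hypothetical raw events as (timestamp, action) pairs.
# Stored in a tuple: records are only ever appended, never updated in place.
ALL_DATA = (
    ("2014-01-01T10:00:00", "take-off"),
    ("2014-01-01T10:05:00", "take-off"),
    ("2014-01-01T10:09:00", "landing"),
)


def query(all_data):
    """A query is a pure function of the complete dataset: events per action."""
    return Counter(action for _, action in all_data)


result = query(ALL_DATA)
print(result["take-off"], result["landing"])  # 2 1
```

Because `query` has no side effects, a buggy version can be fixed and simply re-run over the same raw data. Running it over all data on every request would be slow, which is what the precomputed views address.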

  17. Lambda Architecture
    [Diagram]
    NEW DATA STREAM feeds the BATCH LAYER and the SPEED LAYER
    BATCH LAYER: IMMUTABLE MASTER DATA, BATCH RECOMPUTE, PRECOMPUTE VIEWS
    SERVING LAYER: BATCH VIEWS (View 1, View 2, View N)
    SPEED LAYER: PROCESS STREAM, REAL-TIME INCREMENT, INCREMENT VIEWS, REAL-TIME VIEWS (View 1, View 2, View N)
    QUERY: MERGE of BATCH VIEWS and REAL-TIME VIEWS

  18. Lambda Architecture: Layers
    •  Batch layer
    –  managing the master dataset, an immutable, append-only set of raw data
    –  pre-computing arbitrary query functions, called batch views
    •  Serving layer indexes batch views so that they can be queried ad hoc with low latency
    •  Speed layer accommodates all requests that are subject to low-latency requirements; it uses fast, incremental algorithms and deals with recent data only
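
The interplay of the three layers can be sketched in a few lines of plain Python. This is a toy model under assumptions, not the talk's implementation: the function names, the (timestamp, airport) events and the in-memory `Counter` views are all hypothetical.

```python
from collections import Counter


def batch_view(master_data):
    """Batch layer: recompute the view from scratch over the master dataset."""
    return Counter(airport for _, airport in master_data)


def realtime_update(view, event):
    """Speed layer: incrementally fold one recent event into the real-time view."""
    _, airport = event
    view[airport] += 1
    return view


def serve(batch, realtime, airport):
    """Query time: merge the batch view with the real-time view."""
    return batch[airport] + realtime[airport]


# Hypothetical events as (timestamp, airport) pairs
master_data = [("10:00", "DUB"), ("10:05", "HEL"), ("10:07", "DUB")]  # seen by the last batch run
recent = [("10:09", "DUB"), ("10:10", "CDG")]                          # arrived after it

batch = batch_view(master_data)
realtime = Counter()
for event in recent:
    realtime_update(realtime, event)

print(serve(batch, realtime, "DUB"))  # 3
```

Once the next batch run has absorbed the recent events, the real-time view can be discarded and rebuilt, so any error in the incremental logic is corrected by the recomputation.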

  19. Lambda Architecture: Compensate Batch
    [Figure: timeline up to "now"; the most recent stretch of data is marked "not absorbed"]

  20. Lambda Architecture: Immutable Data + Views
    http://openflights.org

  21. Lambda Architecture: Immutable Data + Views
    immutable master dataset (records appended one at a time):
    timestamp            airport  flight  action
    2014-01-01T10:00:00  DUB      EI123   take-off
    2014-01-01T10:05:00  HEL      SAS45   take-off
    2014-01-01T10:07:00  AMS      BA99    take-off
    2014-01-01T10:09:00  LHR      LH17    landing
    2014-01-01T10:10:00  CDG      AF03    landing
    2014-01-01T10:10:00  FCO      AZ501   take-off

  22. Lambda Architecture: Immutable Data + Views
    immutable master dataset:
    timestamp            airport  flight  action
    2014-01-01T10:00:00  DUB      EI123   take-off
    2014-01-01T10:05:00  HEL      SAS45   take-off
    2014-01-01T10:07:00  AMS      BA99    take-off
    2014-01-01T10:09:00  LHR      LH17    landing
    2014-01-01T10:10:00  CDG      AF03    landing
    2014-01-01T10:10:00  FCO      AZ501   take-off
    views:
    airport load:
    airport  planes
    AMS      69
    CDG      44
    DUB      31
    FCO      10
    HEL      17
    LHR      101
    air-borne per airline:
    airline  planes
    AF       59
    AZ       23
    BA       167
    EI       19
    LH       201
    SAS      28
    air-borne: 2307
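
A view such as "air-borne per airline" can be derived by replaying the master dataset. The slide does not say how the views are computed, so the rules below are assumptions: a flight counts as air-borne between its take-off and its landing, and the airline code is taken to be the leading letters of the flight number. Only the six records shown are used, so the results will not match the slide's figures, which evidently stem from a larger dataset.

```python
import re
from collections import Counter

# The six records shown on the slide: (timestamp, airport, flight, action)
MASTER = (
    ("2014-01-01T10:00:00", "DUB", "EI123", "take-off"),
    ("2014-01-01T10:05:00", "HEL", "SAS45", "take-off"),
    ("2014-01-01T10:07:00", "AMS", "BA99", "take-off"),
    ("2014-01-01T10:09:00", "LHR", "LH17", "landing"),
    ("2014-01-01T10:10:00", "CDG", "AF03", "landing"),
    ("2014-01-01T10:10:00", "FCO", "AZ501", "take-off"),
)


def airline(flight):
    """Assumption: the airline code is the leading letters of the flight number."""
    return re.match(r"[A-Z]+", flight).group(0)


def airborne_flights(records):
    """Replay the records in order; keep flights that took off and have not landed."""
    airborne = set()
    for _, _, flight, action in records:
        if action == "take-off":
            airborne.add(flight)
        else:  # "landing"
            airborne.discard(flight)
    return airborne


def airborne_per_airline(records):
    """View: number of air-borne planes per airline."""
    return Counter(airline(flight) for flight in airborne_flights(records))


print(len(airborne_flights(MASTER)))                # 4
print(sorted(airborne_per_airline(MASTER).items()))
```

The master dataset is never modified; each view is a disposable result that can be thrown away and recomputed from the raw records at any time.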

  23. Implementing the Lambda Architecture

  26. How about an integrated approach?
    •  Twitter Summingbird
    •  Lambdoop
    •  Apache Spark

  27. Apache Spark

  28. Apache Spark
    •  Originally developed in 2009 at UC Berkeley’s AMPLab
    •  Open sourced in 2010 as part of BDAS (Berkeley Data Analytics Stack)
    •  Top-level Apache project as of 2014
    http://spark.apache.org/

  29. The Spark community

  30. Spark: a unified platform …
    Continued innovation bringing new functionality, such as:
    •  Tachyon (Shared RDDs, off-heap solution)
    •  BlinkDB (approximate queries)
    •  SparkR (R wrapper for Spark)
    [Stack diagram, top to bottom]
    Spark SQL (SQL/HQL), Spark Streaming (stream processing), MLlib (machine learning), GraphX (graph processing)
    Spark (core execution engine)
    Mesos, YARN
    Distributed File System (local FS, HDFS, S3, …)

  31. Easy and fast Big Data
    •  Easy to Develop
    –  Rich APIs available through
    Java, Scala, Python
    –  Interactive shell
    •  Fast to Run
    –  Advanced data storage model
    (automated optimization
    between memory and disk)
    –  General execution graphs
    2-5× less code
    Up to 10× faster on disk, 100× in memory
    https://amplab.cs.berkeley.edu/benchmark/

  32. … for complex workloads …
    •  Iterative Algorithms
    –  machine learning
    –  graph processing beyond DAG
    •  Interactive Data Mining
    •  Streaming Applications

  33. … across multiple datasources
    •  Local Files
    –  file:///opt/httpd/logs/access_log
    •  Object Stores (e.g. Amazon S3)
    •  HDFS
    –  text files, sequence files, any other Hadoop InputFormat
    •  Key-Value datastores (e.g. Apache HBase)

  34. Easy: expressive API
    map reduce

  35. Easy: expressive API
    map
    filter
    groupBy
    sort
    union
    join
    leftOuterJoin
    rightOuterJoin
    reduce
    count
    fold
    reduceByKey
    groupByKey
    cogroup
    cross
    zip
    sample
    take
    first
    partitionBy
    mapWith
    pipe
    save ...

  36. Easy: get started immediately
    Python
    lines = sc.textFile(...)
    lines.filter(lambda s: "ERROR" in s).count()
    Scala
    val lines = sc.textFile(...)
    lines.filter(x => x.contains("ERROR")).count()
    Java
    JavaRDD<String> lines = sc.textFile(...);
    lines.filter(new Function<String, Boolean>() {
      public Boolean call(String s) {
        return s.contains("ERROR");
      }
    }).count();

  37. … and scale as you go (mentally and physically)
    YARN
    Standalone

  38. Resilient Distributed Datasets (RDD)
    •  RDDs are the core of the Spark execution engine
    •  Collections of elements that can be operated on in parallel
    •  Persistent in memory between operations
    http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

  39. RDD Operations
    •  Lazy evaluation is key to Spark
    •  Transformations
    –  Creation of a new dataset from an existing one:
    map, filter, distinct, union, sample, groupByKey, join, etc.
    •  Actions
    –  Return a value after running a computation:
    collect, count, first, takeSample, foreach, etc.
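
The split between lazy transformations and eager actions can be mimicked in plain Python. The `LazyDataset` class below is a hypothetical toy stand-in, not Spark's API: transformations only record what should happen, and nothing is computed until an action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source  # any re-iterable collection
        self._steps = steps    # recorded transformations, not yet applied

    # Transformations: return a new dataset, compute nothing
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        """Apply the recorded steps, using Python's built-in lazy map and filter."""
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger the computation and return a value
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []


def shout(line):
    calls.append(line)  # records when the function is actually invoked
    return line.upper()


lines = LazyDataset(["error: disk", "ok", "error: net"])
errors = lines.filter(lambda s: "error" in s).map(shout)
print(len(calls))      # 0, nothing has run yet
print(errors.count())  # 2, the action triggers the pipeline
```

Deferring the work until an action is called is what lets an engine see the whole chain of transformations before it decides how to execute it.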

  40. RDD persistence
    http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-persistence

  41. And in the real world?

  42. Industry Leading Ad-Targeting Platform: Real-time Decisions
    •  High performance analytics over MapR-DB
    •  Load from MapR-DB table into RDD to augment scoring
    •  Results stored back in MapR-DB for other applications

  43. Cisco: Security Intelligence Operations
    •  Sensor data lands in MapR
    •  Spark Streaming on MapR for first check on known threats
    •  Data next processed on GraphX and Mahout
    •  Additional SQL querying done via Shark and Impala

  44. Leading Pharma Company: NextGen Genomics
    •  Existing process takes several weeks to align chemical compounds with genes
    •  ADAM on Spark allows realignment in a few hours
    •  Geneticists can minimize engineering dependency

  45. Further resources …

  46. The book: Learning Spark
    http://shop.oreilly.com/product/0636920028512.do

  47. http://lambda-architecture.net

  48. http://spark-stack.org

  49. Conclusion
    •  Let’s scale systems and humans
    •  How? Lambda Architecture!
    •  Apache Spark is an efficient way to implement Lambda Architecture

  50. Q & A
    @mhausenblas maprtech
    Engage with us!
    MapR
    maprtech
    mapr-technologies
