Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Chaudhri at Big Data Spain 2017

In this presentation, attendees will see how to speed up existing Hadoop and Spark deployments by making Apache Ignite responsible for RAM utilization, with no code modifications and no new architecture built from scratch.

https://www.bigdataspain.org/2017/talk/boost-hadoop-and-spark-with-in-memory-technologies

Big Data Spain 2017
November 16th-17th, Kinépolis Madrid

Big Data Spain

December 05, 2017

Transcript

  2. Akmal Chaudhri, GridGain Systems
    Boost Hadoop and Spark with in-memory technologies

  3. Agenda
    • Introduction to Apache Ignite
    • Hadoop Acceleration
    • Spark Acceleration
    • Demos
    • Q&A

  4. Apache Ignite in one slide
    • Memory-centric platform
    – that is strongly consistent
    – and highly-available
    – with powerful SQL
    – key-value and processing APIs
    • Designed for
    – Performance
    – Scalability

  5. Apache Ignite
    • Data source agnostic
    • Fully fledged compute engine and durable storage
    • OLAP and OLTP
    • Fully ACID transactions across memory and disk
    • In-memory SQL support
    • Early ML libraries
    • Growing community

  6. Hadoop Acceleration
    • In-memory Hadoop Execution
    • Alternative job tracker
    – Faster MapReduce
    • Built on Ignite File System (IGFS)
    • Secondary File System
    – Read-through and Write-through

  7. Ignite In-Memory File System
    • Distributed in-memory file system
    • Implements HDFS API
    • Can be transparently plugged into Hadoop or Spark deployments

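To make the "transparently plugged in" point concrete, here is a minimal Scala sketch of a Hadoop client that reaches IGFS through the standard Hadoop `FileSystem` API. It assumes the Hadoop client libraries and the Ignite Hadoop Accelerator jars are on the classpath. The property and class names are the ones the Ignite Hadoop Accelerator documentation is understood to use and should be checked against the release in use. The endpoint `igfs://igfs@localhost/` (an IGFS instance named `igfs` on the local node) and the object name `IgfsClientSketch` are assumptions for illustration, not taken from the slides.

```scala
import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object IgfsClientSketch {
  // Assumed endpoint: an IGFS instance named "igfs" running on the local node.
  val IgfsUri = "igfs://igfs@localhost/"

  // Registers the Ignite file system implementations for the igfs:// scheme.
  // In a real deployment these entries would normally live in core-site.xml.
  def igfsConf(): Configuration = {
    val conf = new Configuration()
    conf.set("fs.igfs.impl",
      "org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem")
    conf.set("fs.AbstractFileSystem.igfs.impl",
      "org.apache.ignite.hadoop.fs.v2.IgniteHadoopFileSystem")
    conf
  }

  def main(args: Array[String]): Unit = {
    // Same FileSystem API as for HDFS; only the URI scheme differs.
    val fs = FileSystem.get(new URI(IgfsUri), igfsConf())
    try fs.listStatus(new Path("/")).foreach(status => println(status.getPath))
    finally fs.close()
  }
}
```

Because the client code only sees the Hadoop `FileSystem` abstraction, existing jobs would in principle need a changed URI or configuration entry rather than code changes.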

  8. MapReduce

  9. MapReduce
    • Parallelize processing of data in HDFS
    • Eliminate Hadoop JobTracker and TaskTracker overhead
    • Low-latency distributed processing
    • Minimal configuration change

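The "minimal configuration change" bullet can be sketched as follows. The two property values are the ones the Ignite Hadoop Accelerator documentation is understood to place in `mapred-site.xml`; setting them programmatically, the address `localhost:11211`, the job name, and the object name `IgniteMapReduceSketch` are assumptions made for this illustration and do not come from the slides.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

object IgniteMapReduceSketch {
  // Builds a job configuration that selects the Ignite MapReduce engine
  // instead of the default framework. The address is an assumed Ignite node.
  def igniteJobConf(jobTracker: String = "localhost:11211"): Configuration = {
    val conf = new Configuration()
    conf.set("mapreduce.framework.name", "ignite")
    conf.set("mapreduce.jobtracker.address", jobTracker)
    conf
  }

  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(igniteJobConf(), "ignite-accelerated-job")
    // Mapper, reducer, input and output paths would be set here exactly as
    // for a regular Hadoop job; nothing is submitted in this sketch.
    println(job.getConfiguration.get("mapreduce.framework.name"))
  }
}
```

The job definition itself is untouched; only the framework name and the tracker address differ from a stock Hadoop setup.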

  10. Spark Acceleration
    • Long-running applications
    – Passing state between jobs
    • Disk File System
    – Convert RDDs to disk files and back
    • Share RDDs in-memory
    – Native Spark API
    – Native Spark transformations

  11. Ignite for Spark
    • Spark RDD abstraction
    • Shared in-memory view on data across different Spark jobs, workers or applications
    • Implemented as a view over a distributed Ignite cache

  12. IgniteContext
    • Main entry-point to Spark-Ignite integration
    • Takes a SparkContext plus one of
    – A closure returning an IgniteConfiguration
    – A path to an XML configuration file
    • Optional Boolean client argument
    – true => Shared deployment
    – false => Embedded deployment

  13. IgniteContext examples
    val igniteContext = new IgniteContext(sparkContext,
        () => new IgniteConfiguration())

    val igniteContext = new IgniteContext(sparkContext,
        "examples/config/spark/example-shared-rdd.xml")

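The optional Boolean argument is not shown in the two constructor calls on the slide, so here is a hedged sketch of the embedded variant. It assumes the `ignite-spark` module of Ignite 2.x, where the flag is the third constructor parameter; its name has differed between releases, so it is passed positionally. The object name `EmbeddedContextSketch` and the cache name `embeddedCache` are hypothetical, and the master URL is expected to come from `spark-submit`.

```scala
import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.{SparkConf, SparkContext}

object EmbeddedContextSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("EmbeddedContextSketch"))

    // Third argument false selects the embedded deployment described on the
    // IgniteContext slide; omitting it (or passing true) gives the shared one.
    val ic = new IgniteContext(sc, () => new IgniteConfiguration(), false)

    // "embeddedCache" is a hypothetical cache name used only in this sketch.
    val rdd = ic.fromCache[Int, Int]("embeddedCache")
    rdd.savePairs(sc.parallelize(1 to 10).map(i => (i, i)))
    println("Entries visible through the RDD: " + rdd.count())

    // true asks the context to also stop the Ignite nodes started on workers.
    ic.close(true)
    sc.stop()
  }
}
```

In the embedded case the Ignite nodes live inside the Spark processes, so the data is tied to the lifetime of the Spark application; the shared deployment keeps the data in a separately managed Ignite cluster.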

  14. IgniteRDD
    • Implementation of Spark RDD representing a live view of an Ignite cache
    • Mutable (unlike native RDDs)
    – All changes in Ignite cache will be visible to RDD users immediately
    • Provides partitioning information to Spark executor
    • Provides affinity information to Spark so that RDD computations can use data locality

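The "live view" property can be illustrated with a small sketch that changes the cache through the Ignite API and then re-runs an action on the same `IgniteRDD`. It reuses the configuration path and the cache name `sharedRDD` from the deck's own examples; the object name `LiveViewSketch` and the key `-1` are assumptions for illustration. The counts are only expected to differ by one if that key was not already present and nothing else is writing to the cache.

```scala
import org.apache.ignite.spark.{IgniteContext, IgniteRDD}
import org.apache.spark.{SparkConf, SparkContext}

object LiveViewSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LiveViewSketch"))
    val ic = new IgniteContext(sc, "examples/config/spark/example-shared-rdd.xml")
    val sharedRDD: IgniteRDD[Int, Int] = ic.fromCache("sharedRDD")

    val before = sharedRDD.count()

    // Modify the cache directly through the Ignite API, bypassing Spark.
    ic.ignite().cache[Int, Int]("sharedRDD").put(-1, -1)

    // The same RDD instance is queried again; no reload step is involved.
    val after = sharedRDD.count()
    println(s"count before put: $before, count after put: $after")

    ic.close()
    sc.stop()
  }
}
```

A native Spark RDD would keep returning the same count here, since it is immutable once defined.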

  15. Write to Ignite
    • Ignite caches operate on key-value pairs
    • Spark tuple RDD for key-value pairs and savePairs method
    – RDD partitioning, store values in parallel if possible
    • Value-only RDD and saveValues method
    – IgniteRDD generates a unique affinity-local key for each value stored into the cache

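The deck's write example covers `savePairs`; the value-only path could look like the sketch below. It assumes the same `ignite-spark` module and configuration file as the deck's examples. The cache name `sharedValues` and the object name `SaveValuesSketch` are hypothetical, and the key type is declared as `AnyRef` because the keys are generated by `IgniteRDD` and are treated as opaque here.

```scala
import org.apache.ignite.spark.{IgniteContext, IgniteRDD}
import org.apache.spark.{SparkConf, SparkContext}

object SaveValuesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SparkIgniteValueWriter"))
    val ic = new IgniteContext(sc, "examples/config/spark/example-shared-rdd.xml")

    // "sharedValues" is a hypothetical cache name chosen for this sketch.
    val values: IgniteRDD[AnyRef, String] = ic.fromCache("sharedValues")

    // Only values are supplied; IgniteRDD generates a key for each of them.
    values.saveValues(sc.parallelize(Seq("alpha", "beta", "gamma")))

    // Read the values back, ignoring the generated keys.
    values.map(_._2).collect().foreach(println)

    ic.close()
    sc.stop()
  }
}
```

`saveValues` suits data that has no natural key; when entries must be looked up or overwritten by key later, `savePairs` is the more appropriate call.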

  16. Write code example
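    import org.apache.ignite.spark.{IgniteContext, IgniteRDD}
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("SparkIgniteWriter")
    val sc = new SparkContext(conf)
    // Connect to Ignite using the Spring XML configuration file
    val ic = new IgniteContext(sc, "examples/config/spark/example-shared-rdd.xml")
    // Live view over the Ignite cache named "sharedRDD"
    val sharedRDD: IgniteRDD[Int, Int] = ic.fromCache("sharedRDD")
    // Store the pairs (1, 1) to (100000, 100000), using 10 Spark partitions
    sharedRDD.savePairs(sc.parallelize(1 to 100000, 10).map(i => (i, i)))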

  17. Read from Ignite
    • IgniteRDD is a live view of an Ignite cache
    – No need to explicitly load data to Spark application from Ignite
    – All RDD methods are available to use right away after an instance of IgniteRDD is created

  18. Read code example
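    import org.apache.ignite.spark.{IgniteContext, IgniteRDD}
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("SparkIgniteReader")
    val sc = new SparkContext(conf)
    // Connect to Ignite using the same Spring XML configuration as the writer
    val ic = new IgniteContext(sc, "examples/config/spark/example-shared-rdd.xml")
    // Live view over the cache populated by the writer application
    val sharedRDD: IgniteRDD[Int, Int] = ic.fromCache("sharedRDD")
    // Standard Spark transformation: keep pairs whose value exceeds 50000
    val greaterThanFiftyThousand = sharedRDD.filter(_._2 > 50000)
    println("The count is " + greaterThanFiftyThousand.count())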

  19. Demos

  20. Any Questions?
    Thank you for joining us. Follow the conversation.
    http://ignite.apache.org