[OracleCode SF] In-memory Analytics with Spark and Hazelcast

Apache Spark is a distributed computation framework optimized to work in-memory, and heavily influenced by concepts from functional programming languages.

Hazelcast, an open-source in-memory data grid capable of amazing feats of scale, provides a wide range of distributed computing primitives, including ExecutorService, MapReduce, and Aggregations frameworks.

Data exploration and analysis require that data scientists be able to ask questions nobody planned for in advance, and get answers fast!

In this talk, Viktor will explore Spark and see how it works together with Hazelcast to provide a robust in-memory open-source big data analytics solution!

Viktor Gamov

March 01, 2017

Transcript

  1. @gamussa @hazelcast #oraclecode
    IN-MEMORY ANALYTICS
    with APACHE SPARK and
    HAZELCAST

  2. @gamussa @hazelcast #oraclecode
    Solutions Architect
    Developer Advocate
    @gamussa in internetz
    Please, follow me on Twitter
    I’m very interesting ©
    Who am I?

  3. @gamussa @hazelcast #oraclecode
    What’s Apache Spark?
    Lightning-Fast Cluster Computing

  4. @gamussa @hazelcast #oraclecode
    Run programs up to 100x
    faster than Hadoop
    MapReduce in memory,
    or 10x faster on disk.

  5. @gamussa @hazelcast #oraclecode
    When to use Spark?
    Data Science Tasks
    when questions are unknown
    Data Processing Tasks
    when you have too much data
    You’re tired of Hadoop

  6. @gamussa @hazelcast #oraclecode
    Spark Architecture

  7. @gamussa @hazelcast #oraclecode

  8. @gamussa @hazelcast #oraclecode
    RDD

  9. @gamussa @hazelcast #oraclecode
    Resilient Distributed Datasets (RDD)
    are the primary abstraction in Spark –
    a fault-tolerant collection of elements that can be
    operated on in parallel

  10. @gamussa @hazelcast #oraclecode

  11. @gamussa @hazelcast #oraclecode
    RDD Operations

  12. @gamussa @hazelcast #oraclecode
    operations on RDDs:
    transformations and actions

  13. @gamussa @hazelcast #oraclecode
    transformations are lazy
    (not computed immediately)
    by default, the transformed RDD is recomputed
    each time an action is run on it
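
The laziness described on this slide can be observed without a Spark cluster. The following is a minimal sketch that uses `java.util.stream` as a stand-in for an RDD (an assumption made for illustration, not Spark's API): building the pipeline plays the role of a transformation, and each terminal operation plays the role of an action. The class and variable names are hypothetical.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Stream;

public class LazyTransformationDemo {

    /**
     * Returns {map calls before any action, after the first action,
     * after the second action, sum, max}.
     */
    static int[] run() {
        List<Integer> source = List.of(1, 2, 3, 4);
        AtomicInteger mapCalls = new AtomicInteger();

        // "Transformation": only a recipe is recorded here, nothing is computed yet.
        Supplier<Stream<Integer>> squared = () -> source.stream()
                .map(x -> {
                    mapCalls.incrementAndGet();
                    return x * x;
                });
        int before = mapCalls.get();

        // "Action" 1: forces the recipe to run over all four elements.
        int sum = squared.get().mapToInt(Integer::intValue).sum();
        int afterFirst = mapCalls.get();

        // "Action" 2: nothing was kept, so the recipe runs again from the source.
        int max = squared.get().mapToInt(Integer::intValue).max().getAsInt();
        int afterSecond = mapCalls.get();

        return new int[] {before, afterFirst, afterSecond, sum, max};
    }

    public static void main(String[] args) {
        int[] r = run();
        System.out.println("map calls: " + r[0] + " -> " + r[1] + " -> " + r[2]);
    }
}
```

In Spark itself, calling `cache()` or `persist()` on an RDD asks the engine to keep the computed result so that later actions can reuse it instead of recomputing.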

  14. @gamussa @hazelcast #oraclecode
    RDD
    Transformations

  15. @gamussa @hazelcast #oraclecode

  16. @gamussa @hazelcast #oraclecode

  17. @gamussa @hazelcast #oraclecode
    RDD
    Actions

  18. @gamussa @hazelcast #oraclecode

  19. @gamussa @hazelcast #oraclecode

  20. @gamussa @hazelcast #oraclecode
    RDD
    Fault Tolerance

  21. @gamussa @hazelcast #oraclecode

  22. @gamussa @hazelcast #oraclecode
    RDD
    Construction

  23. @gamussa @hazelcast #oraclecode
    parallelized collections
    take an existing Scala collection
    and run functions on it in parallel
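
In Spark's Java API this corresponds to `JavaSparkContext.parallelize(list, numSlices)`. To show the idea without a cluster, here is a rough single-JVM sketch that slices a list and maps a function over each slice on a thread pool. The class `ParallelizeSketch` and its methods are hypothetical illustrations, not part of Spark.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelizeSketch {

    /** Splits data into at most numSlices contiguous slices (the last may be shorter). */
    static <T> List<List<T>> slice(List<T> data, int numSlices) {
        List<List<T>> slices = new ArrayList<>();
        int size = (data.size() + numSlices - 1) / numSlices; // ceiling division
        for (int from = 0; from < data.size(); from += size) {
            slices.add(data.subList(from, Math.min(data.size(), from + size)));
        }
        return slices;
    }

    /** Applies fn to every element, one task per slice, preserving element order. */
    static <T, R> List<R> parallelMap(List<T> data, int numSlices, Function<T, R> fn) {
        ExecutorService pool = Executors.newFixedThreadPool(numSlices);
        try {
            List<Future<List<R>>> futures = new ArrayList<>();
            for (List<T> s : slice(data, numSlices)) {
                futures.add(pool.submit(() -> {
                    List<R> out = new ArrayList<>();
                    for (T item : s) {
                        out.add(fn.apply(item));
                    }
                    return out;
                }));
            }
            // Collect slice results in submission order.
            List<R> result = new ArrayList<>();
            for (Future<List<R>> f : futures) {
                result.addAll(f.get());
            }
            return result;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(parallelMap(List.of(1, 2, 3, 4, 5), 2, x -> x * 10));
    }
}
```

Spark distributes the slices across executors on different machines; the thread pool here only mimics that on one machine.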

  24. @gamussa @hazelcast #oraclecode
    Hadoop datasets
    run functions on each record of a file in Hadoop distributed
    file system or any other storage system supported by
    Hadoop
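
In Spark's Java API, `JavaSparkContext.textFile(path)` produces an RDD with one record per line. The sketch below shows the same "function per record" shape on a local temporary file using only the JDK. It is a stand-in under stated assumptions: the file contents and the class `RecordPerLineSketch` are made up for illustration, and no Hadoop storage is involved.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class RecordPerLineSketch {

    /** Treats each line as one record and returns its comma-separated field count. */
    static List<Integer> fieldsPerRecord(Path file) {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.map(line -> line.split(",").length)
                        .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes two hypothetical records to a temp file and processes them. */
    static List<Integer> demo() {
        try {
            Path tmp = Files.createTempFile("records", ".csv");
            try {
                Files.write(tmp, List.of("1,Alien,1979", "2,Heat,1995,crime"));
                return fieldsPerRecord(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```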

  25. @gamussa @hazelcast #oraclecode
    What’s Hazelcast IMDG?
    The Fastest In-memory Data Grid

  26. @gamussa @hazelcast #oraclecode
    Hazelcast IMDG
    is an operational,
    in-memory,
    distributed computing platform
    that manages data using
    in-memory storage, and
    performs parallel execution for
    breakthrough application speed
    and scale

  27. @gamussa @hazelcast #oraclecode
    High-Density
    Caching
    In-Memory
    Data Grid
    Web Session
    Clustering
    Microservices
    Infrastructure

  28. @gamussa @hazelcast #oraclecode
    What’s Hazelcast IMDG?
    In-memory Data Grid
    Apache v2 Licensed
    Distributed
    Caches (IMap, JCache)
    Java Collections (IList, ISet, IQueue)
    Messaging (Topic, RingBuffer)
    Computation (ExecutorService, M-R)

  29. @gamussa @hazelcast #oraclecode
    Green
    Primary
    Green
    Backup
    Green
    Shard

  30. @gamussa @hazelcast #oraclecode

  31. @gamussa @hazelcast #oraclecode
    final SparkConf sparkConf = new SparkConf()
        .set("hazelcast.server.addresses", "localhost")
        .set("hazelcast.server.groupName", "dev")
        .set("hazelcast.server.groupPass", "dev-pass")
        .set("hazelcast.spark.readBatchSize", "5000")
        .set("hazelcast.spark.writeBatchSize", "5000")
        .set("hazelcast.spark.valueBatchingEnabled", "true");
    final JavaSparkContext jsc =
        new JavaSparkContext("spark://localhost:7077", "app", sparkConf);
    final HazelcastSparkContext hsc = new HazelcastSparkContext(jsc);
    final HazelcastJavaRDD mapRdd = hsc.fromHazelcastMap("movie");
    final HazelcastJavaRDD cacheRdd = hsc.fromHazelcastCache("my-cache");

  35. @gamussa @hazelcast #oraclecode
    Demo

  36. @gamussa @hazelcast #oraclecode
    LIMITATIONS

  37. @gamussa @hazelcast #oraclecode
    DATA SHOULD NOT BE
    UPDATED WHILE READING
    FROM SPARK

  38. @gamussa @hazelcast #oraclecode
    WHY ?

  39. @gamussa @hazelcast #oraclecode
    MAP EXPANSION
    SHUFFLES THE DATA
    INSIDE THE BUCKET

  40. @gamussa @hazelcast #oraclecode
    CURSOR DOESN’T POINT TO
    CORRECT ENTRY ANYMORE,
    DUPLICATE OR MISSING
    ENTRIES COULD OCCUR
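
The effect described on the last two slides can be reproduced with a toy model. The sketch below is not Hazelcast's storage engine: `ToyTable` is a hypothetical hash table in which key `k` lives in bucket `k % capacity`, and the reader keeps a positional cursor between batches. Expanding the table between two batch reads reorders the entries, so the saved cursor lands on the wrong place.

```java
import java.util.ArrayList;
import java.util.List;

public class CursorDriftDemo {

    /** Toy hash table: key k is stored in bucket k % capacity. */
    static final class ToyTable {
        private List<List<Integer>> buckets;

        ToyTable(int capacity) {
            buckets = emptyBuckets(capacity);
        }

        private static List<List<Integer>> emptyBuckets(int capacity) {
            List<List<Integer>> b = new ArrayList<>();
            for (int i = 0; i < capacity; i++) {
                b.add(new ArrayList<>());
            }
            return b;
        }

        void put(int key) {
            buckets.get(key % buckets.size()).add(key);
        }

        /** Expansion: rehashes every key into a larger bucket array. */
        void expand(int newCapacity) {
            List<List<Integer>> old = buckets;
            buckets = emptyBuckets(newCapacity);
            for (List<Integer> bucket : old) {
                for (int key : bucket) {
                    put(key);
                }
            }
        }

        /** Returns up to batchSize keys starting at position cursor, in bucket order. */
        List<Integer> readBatch(int cursor, int batchSize) {
            List<Integer> flat = new ArrayList<>();
            for (List<Integer> bucket : buckets) {
                flat.addAll(bucket);
            }
            int end = Math.min(flat.size(), cursor + batchSize);
            return new ArrayList<>(flat.subList(Math.min(cursor, end), end));
        }
    }

    /** Reads keys 0..7 in two batches of four, with an expansion in between. */
    static List<Integer> scanWithExpansionBetweenBatches() {
        ToyTable table = new ToyTable(4);
        for (int k = 0; k < 8; k++) {
            table.put(k);
        }
        // Bucket order with capacity 4 is 0,4,1,5,2,6,3,7.
        List<Integer> seen = new ArrayList<>(table.readBatch(0, 4));
        // A concurrent writer triggers expansion; bucket order becomes 0..7.
        table.expand(8);
        // The reader resumes at position 4 of the new order.
        seen.addAll(table.readBatch(4, 4));
        return seen;
    }

    public static void main(String[] args) {
        // Prints [0, 4, 1, 5, 4, 5, 6, 7]: keys 4 and 5 twice, keys 2 and 3 never.
        System.out.println(scanWithExpansionBetweenBatches());
    }
}
```

In this toy run, every key was present the whole time, yet the scan returns two duplicates and misses two entries, which is why the data should stay unchanged while Spark reads it.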

  41. @gamussa @hazelcast #oraclecode
    github.com/hazelcast/hazelcast-spark

  42. @gamussa @hazelcast #oraclecode
    THANKS!
    Any questions?
    You can find me at
    @gamussa
    [email protected]
