[OracleCode SF] In-memory Analytics with Spark and Hazelcast

Apache Spark is a distributed computation framework optimized for in-memory processing and heavily influenced by concepts from functional programming languages.

Hazelcast is an open-source in-memory data grid capable of amazing feats of scale. It provides a wide range of distributed computing primitives, including the ExecutorService, MapReduce, and Aggregations frameworks.

The nature of data exploration and analysis requires that data scientists be able to ask questions that weren't planned in advance—and get answers fast!

In this talk, Viktor will explore Spark and see how it works together with Hazelcast to provide a robust in-memory open-source big data analytics solution!

Viktor Gamov

March 01, 2017
Transcript

  1. @gamussa @hazelcast #oraclecode IN-MEMORY ANALYTICS with APACHE SPARK and HAZELCAST

  2. @gamussa @hazelcast #oraclecode Solutions Architect Developer Advocate @gamussa in internetz

    Please, follow me on Twitter I’m very interesting © Who am I?
  3. @gamussa @hazelcast #oraclecode What’s Apache Spark? Lightning-Fast Cluster Computing

  4. @gamussa @hazelcast #oraclecode Run programs up to 100x faster than

    Hadoop MapReduce in memory, or 10x faster on disk.
  5. @gamussa @hazelcast #oraclecode When to use Spark? Data Science Tasks

    when questions are unknown Data Processing Tasks when you have too much data You’re tired of Hadoop
  6. @gamussa @hazelcast #oraclecode Spark Architecture

  7. @gamussa @hazelcast #oraclecode

  8. @gamussa @hazelcast #oraclecode RDD

  9. @gamussa @hazelcast #oraclecode Resilient Distributed Datasets (RDD) are the primary

    abstraction in Spark – a fault-tolerant collection of elements that can be operated on in parallel
  10. @gamussa @hazelcast #oraclecode

  11. @gamussa @hazelcast #oraclecode RDD Operations

  12. @gamussa @hazelcast #oraclecode operations on RDDs: transformations and actions

  13. @gamussa @hazelcast #oraclecode transformations are lazy (not computed immediately) the

    transformed RDD gets recomputed when an action is run on it (default)
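Spark's lazy transformations behave much like Java 8 stream pipelines, where intermediate operations do no work until a terminal operation runs. A single-JVM sketch of the idea (plain JDK, not Spark itself):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

public class LazyDemo {
    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();

        // Intermediate op (a "transformation"): nothing is computed yet.
        Stream<Integer> doubled = List.of(1, 2, 3).stream()
                .map(x -> { calls.incrementAndGet(); return x * 2; });

        System.out.println("after map: " + calls.get());    // prints 0

        // Terminal op (an "action") triggers the whole pipeline.
        List<Integer> result = doubled.toList();
        System.out.println("after toList: " + calls.get()); // prints 3
        System.out.println(result);                         // [2, 4, 6]
    }
}
```

The analogy is not exact — Spark additionally recomputes (or caches) the lineage across a cluster — but the deferred-evaluation contract is the same.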
  14. @gamussa @hazelcast #oraclecode RDD Transformations

  15. @gamussa @hazelcast #oraclecode

  16. @gamussa @hazelcast #oraclecode

  17. @gamussa @hazelcast #oraclecode RDD Actions

  18. @gamussa @hazelcast #oraclecode

  19. @gamussa @hazelcast #oraclecode

  20. @gamussa @hazelcast #oraclecode RDD Fault Tolerance

  21. @gamussa @hazelcast #oraclecode

  22. @gamussa @hazelcast #oraclecode RDD Construction

  23. @gamussa @hazelcast #oraclecode parallelized collections take an existing Scala collection

    and run functions on it in parallel
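In Spark this is `sc.parallelize(collection)`; the closest plain-JDK analogue is a parallel stream over an existing collection. A local, single-JVM sketch of the same map-then-reduce shape:

```java
import java.util.List;

public class ParallelizeSketch {
    public static void main(String[] args) {
        List<Integer> data = List.of(1, 2, 3, 4, 5);

        // Run a function on each element in parallel and combine the results —
        // the same shape as sc.parallelize(data).map(...).reduce(...) in Spark,
        // except the parallelism here is threads in one JVM, not cluster nodes.
        int sumOfSquares = data.parallelStream()
                .mapToInt(x -> x * x)
                .sum();

        System.out.println(sumOfSquares); // 55
    }
}
```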
  24. @gamussa @hazelcast #oraclecode Hadoop datasets run functions on each record

    of a file in Hadoop distributed file system or any other storage system supported by Hadoop
  25. @gamussa @hazelcast #oraclecode What’s Hazelcast IMDG? The Fastest In-memory Data

    Grid
  26. @gamussa @hazelcast #oraclecode Hazelcast IMDG is an operational, in-memory, distributed

    computing platform that manages data using in-memory storage, and performs parallel execution for breakthrough application speed and scale
  27. @gamussa @hazelcast #oraclecode High-Density Caching In-Memory Data Grid Web Session

    Clustering Microservices Infrastructure
  28. @gamussa @hazelcast #oraclecode What’s Hazelcast IMDG? In-memory Data Grid Apache

    v2 Licensed Distributed Caches (IMap, JCache) Java Collections (IList, ISet, IQueue) Messaging (Topic, RingBuffer) Computation (ExecutorService, M-R)
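Hazelcast's distributed executor deliberately mirrors `java.util.concurrent.ExecutorService`, except tasks are serialized and shipped to cluster members. The programming model looks like this single-JVM sketch, with a plain JDK executor standing in for Hazelcast's `IExecutorService`:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ExecutorSketch {
    public static void main(String[] args) throws Exception {
        // With Hazelcast this would be hazelcastInstance.getExecutorService("name"),
        // and the Callable must be Serializable so it can run on another member.
        ExecutorService executor = Executors.newFixedThreadPool(4);

        Callable<Integer> task = () -> 6 * 7; // work that could run anywhere
        Future<Integer> answer = executor.submit(task);

        System.out.println(answer.get()); // 42
        executor.shutdown();
    }
}
```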
  29. @gamussa @hazelcast #oraclecode Green Primary Green Backup Green Shard

  30. @gamussa @hazelcast #oraclecode

  31. @gamussa @hazelcast #oraclecode

     final SparkConf sparkConf = new SparkConf()
         .set("hazelcast.server.addresses", "localhost")
         .set("hazelcast.server.groupName", "dev")
         .set("hazelcast.server.groupPass", "dev-pass")
         .set("hazelcast.spark.readBatchSize", "5000")
         .set("hazelcast.spark.writeBatchSize", "5000")
         .set("hazelcast.spark.valueBatchingEnabled", "true");
     final JavaSparkContext jsc =
         new JavaSparkContext("spark://localhost:7077", "app", sparkConf);
     final HazelcastSparkContext hsc = new HazelcastSparkContext(jsc);
     final HazelcastJavaRDD<Object, Object> mapRdd = hsc.fromHazelcastMap("movie");
     final HazelcastJavaRDD<Object, Object> cacheRdd = hsc.fromHazelcastCache("my-cache");
  35. @gamussa @hazelcast #oraclecode Demo

  36. @gamussa @hazelcast #oraclecode LIMITATIONS

  37. @gamussa @hazelcast #oraclecode DATA SHOULD NOT BE UPDATED WHILE READING

    FROM SPARK
  38. @gamussa @hazelcast #oraclecode WHY ?

  39. @gamussa @hazelcast #oraclecode MAP EXPANSION SHUFFLES THE DATA INSIDE THE

    BUCKET
  40. @gamussa @hazelcast #oraclecode CURSOR DOESN’T POINT TO CORRECT ENTRY ANYMORE,

    DUPLICATE OR MISSING ENTRIES COULD OCCUR
  41. @gamussa @hazelcast #oraclecode github.com/hazelcast/hazelcast-spark

  42. @gamussa @hazelcast #oraclecode THANKS! Any questions? You can find me

    at @gamussa viktor@hazelcast.com