[OracleCode SF] In-memory Analytics with Spark and Hazelcast

Apache Spark is a distributed computation framework optimized to work in-memory, and heavily influenced by concepts from functional programming languages.

Hazelcast, an open-source in-memory data grid capable of amazing feats of scale, provides a wide range of distributed computing primitives, including ExecutorService, M/R and Aggregations frameworks.

The nature of data exploration and analysis requires data scientists be able to ask questions that weren't planned to be asked—and get an answer fast!

In this talk, Viktor will explore Spark and see how it works together with Hazelcast to provide a robust in-memory open-source big data analytics solution!

Viktor Gamov

March 01, 2017

Transcript

  1. @gamussa @hazelcast #oraclecode Who am I? Solutions Architect, Developer Advocate. @gamussa in internetz. Please, follow me on Twitter, I’m very interesting ©
  2. Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
  3. When to use Spark? Data Science Tasks: when questions are unknown. Data Processing Tasks: when you have too much data. You’re tired of Hadoop.
  4. Resilient Distributed Datasets (RDD) are the primary abstraction in Spark – a fault-tolerant collection of elements that can be operated on in parallel
  5. Transformations are lazy (not computed immediately). The transformed RDD gets recomputed when an action is run on it (default).
  6. Hadoop datasets run functions on each record of a file in Hadoop distributed file system or any other storage system supported by Hadoop
  7. Hazelcast IMDG is an operational, in-memory, distributed computing platform that manages data using in-memory storage, and performs parallel execution for breakthrough application speed and scale
  8. What’s Hazelcast IMDG? In-memory Data Grid; Apache v2 Licensed; Distributed Caches (IMap, JCache); Java Collections (IList, ISet, IQueue); Messaging (Topic, RingBuffer); Computation (ExecutorService, M-R)
  9. // Hazelcast cluster address, group credentials and connector batching settings
     final SparkConf sparkConf = new SparkConf()
         .set("hazelcast.server.addresses", "localhost")
         .set("hazelcast.server.groupName", "dev")
         .set("hazelcast.server.groupPass", "dev-pass")
         .set("hazelcast.spark.readBatchSize", "5000")
         .set("hazelcast.spark.writeBatchSize", "5000")
         .set("hazelcast.spark.valueBatchingEnabled", "true");

     // Connect to the Spark master and wrap the context for Hazelcast access
     final JavaSparkContext jsc = new JavaSparkContext("spark://localhost:7077", "app", sparkConf);
     final HazelcastSparkContext hsc = new HazelcastSparkContext(jsc);

     // Expose a Hazelcast map and a JCache cache as RDDs
     final HazelcastJavaRDD<Object, Object> mapRdd = hsc.fromHazelcastMap("movie");
     final HazelcastJavaRDD<Object, Object> cacheRdd = hsc.fromHazelcastCache("my-cache");
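
The lazy-evaluation behaviour described in slide 5 can be illustrated without a Spark cluster. The sketch below is an analogy, not Spark code: it uses plain java.util.stream, whose intermediate operations are also lazy. The class name LazyPipelineSketch and the helper squares are hypothetical names invented for this illustration. A Supplier stands in for the RDD lineage, so each "action" re-runs the transformation from the source, which mirrors the default recomputation the slide mentions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Analogy only: java.util.stream stands in for Spark's RDD API (an assumption of this sketch).
class LazyPipelineSketch {

    // Records every element the transformation touches, so we can see when work happens.
    static final List<Integer> evaluated = new ArrayList<>();

    // Describes the pipeline (the "lineage"); nothing is computed when this returns.
    static Supplier<Stream<Integer>> squares(List<Integer> source) {
        return () -> source.stream().map(x -> {
            evaluated.add(x);
            return x * x;
        });
    }

    public static void main(String[] args) {
        Supplier<Stream<Integer>> pipeline = squares(Arrays.asList(1, 2, 3));
        System.out.println("after defining: evaluated " + evaluated.size());

        // First "action": the map function runs now, once per element.
        List<Integer> first = pipeline.get().collect(Collectors.toList());
        System.out.println("first action: " + first + ", evaluated " + evaluated.size());

        // Second "action": the transformation is recomputed from the source.
        int sum = pipeline.get().mapToInt(Integer::intValue).sum();
        System.out.println("second action: sum=" + sum + ", evaluated " + evaluated.size());
    }
}
```

The evaluation counter stays at zero until the first terminal operation and grows again on the second one. In Spark itself, calling persist() or cache() on an RDD is the usual way to avoid this default recomputation.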