March 01, 2017

[OracleCode SF] In-memory Analytics with Spark and Hazelcast

Apache Spark is a distributed computation framework optimized to work in-memory, and heavily influenced by concepts from functional programming languages.

Hazelcast, an open-source in-memory data grid capable of amazing feats of scale, provides a wide range of distributed computing primitives, including ExecutorService, MapReduce (M/R), and Aggregations frameworks.

The nature of data exploration and analysis requires that data scientists be able to ask questions that weren't planned in advance and get an answer fast!

In this talk, Viktor will explore Spark and see how it works together with Hazelcast to provide a robust in-memory open-source big data analytics solution!

Viktor Gamov

Transcript

@gamussa @hazelcast #oraclecode IN-MEMORY ANALYTICS with APACHE SPARK and HAZELCAST
Who am I? Solutions Architect, Developer Advocate. @gamussa in internetz. Please, follow me on Twitter, I’m very interesting ©
What’s Apache Spark? Lightning-Fast Cluster Computing
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
When to use Spark? Data Science Tasks: when questions are unknown. Data Processing Tasks: when you have too much data. You’re tired of Hadoop.
Spark Architecture
RDD
Resilient Distributed Datasets (RDD) are the primary abstraction in Spark: a fault-tolerant collection of elements that can be operated on in parallel
RDD Operations
Operations on RDDs: transformations and actions
Transformations are lazy (not computed immediately). By default, the transformed RDD is recomputed each time an action is run on it.
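The lazy-transformation idea can be illustrated without a Spark cluster. The sketch below is only an analogy built on `java.util.stream`, not Spark's API: the class `LazyPipelineSketch`, the `CALLS` counter, and the `doubled` and `sum` helpers are made-up names for illustration. It assumes that Java's intermediate stream operations stand in for RDD transformations and terminal operations stand in for actions.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Stream;

public class LazyPipelineSketch {
    // Counts how often the mapping function really executes.
    public static final AtomicInteger CALLS = new AtomicInteger();

    // "Transformation": describes the work but runs nothing yet.
    public static Supplier<Stream<Integer>> doubled(List<Integer> input) {
        return () -> input.stream().map(x -> {
            CALLS.incrementAndGet();
            return x * 2;
        });
    }

    // "Action": a terminal operation forces the pipeline to execute.
    public static int sum(Supplier<Stream<Integer>> pipeline) {
        return pipeline.get().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        Supplier<Stream<Integer>> pipeline = doubled(List.of(1, 2, 3));
        System.out.println("calls after transformation: " + CALLS.get());
        System.out.println("sum: " + sum(pipeline));
        System.out.println("sum again: " + sum(pipeline));
        System.out.println("calls after two actions: " + CALLS.get());
    }
}
```

Defining the pipeline performs no work; each call to `sum` re-runs the mapping over all three elements, which mirrors the default recomputation described on the slide. In Spark itself, persisting or caching an RDD is the usual way to avoid that repeated work.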
RDD Transformations
RDD Actions
RDD Fault Tolerance
RDD Construction
Parallelized collections: take an existing Scala collection and run functions on it in parallel
Hadoop datasets: run functions on each record of a file in the Hadoop Distributed File System or any other storage system supported by Hadoop
What’s Hazelcast IMDG? The Fastest In-memory Data Grid
Hazelcast IMDG is an operational, in-memory, distributed computing platform that manages data using in-memory storage, and performs parallel execution for breakthrough application speed and scale
High-Density Caching, In-Memory Data Grid, Web Session Clustering, Microservices Infrastructure
What’s Hazelcast IMDG? In-memory Data Grid; Apache v2 Licensed; Distributed Caches (IMap, JCache); Java Collections (IList, ISet, IQueue); Messaging (Topic, RingBuffer); Computation (ExecutorService, M-R)
Green Primary, Green Backup, Green Shard
// Connector settings are passed to Spark through SparkConf
final SparkConf sparkConf = new SparkConf()
        .set("hazelcast.server.addresses", "localhost")
        .set("hazelcast.server.groupName", "dev")
        .set("hazelcast.server.groupPass", "dev-pass")
        .set("hazelcast.spark.readBatchSize", "5000")
        .set("hazelcast.spark.writeBatchSize", "5000")
        .set("hazelcast.spark.valueBatchingEnabled", "true");

final JavaSparkContext jsc = new JavaSparkContext("spark://localhost:7077", "app", sparkConf);
// Wrap the Spark context to expose Hazelcast maps and caches as RDDs
final HazelcastSparkContext hsc = new HazelcastSparkContext(jsc);
final HazelcastJavaRDD<Object, Object> mapRdd = hsc.fromHazelcastMap("movie");
final HazelcastJavaRDD<Object, Object> cacheRdd = hsc.fromHazelcastCache("my-cache");
Demo
LIMITATIONS
DATA SHOULD NOT BE UPDATED WHILE READING FROM SPARK
WHY?
MAP EXPANSION SHUFFLES THE DATA INSIDE THE BUCKET
CURSOR DOESN’T POINT TO CORRECT ENTRY ANYMORE, DUPLICATE OR MISSING ENTRIES COULD OCCUR
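The limitation above can be made concrete with a toy model. This is a minimal sketch under stated assumptions, not Hazelcast's actual storage layout or the connector's real cursor logic: the class `CursorDriftSketch` and its helpers are hypothetical, keys are plain integers, and key `k` is assumed to live in bucket `k % capacity`. A reader scans the first two buckets, the table then grows from 4 to 8 buckets, and the reader resumes from its now stale cursor.

```java
import java.util.ArrayList;
import java.util.List;

public class CursorDriftSketch {
    // Toy hash table: key k lives in bucket (k % capacity).
    public static List<List<Integer>> build(int capacity, int keyCount) {
        List<List<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i < capacity; i++) {
            buckets.add(new ArrayList<>());
        }
        for (int k = 0; k < keyCount; k++) {
            buckets.get(k % capacity).add(k);
        }
        return buckets;
    }

    // Reads buckets [from, to) and appends their keys to 'seen'.
    public static void readBuckets(List<List<Integer>> table, int from, int to,
                                   List<Integer> seen) {
        for (int i = from; i < to; i++) {
            seen.addAll(table.get(i));
        }
    }

    // Batched scan where the table expands between the two batches.
    public static List<Integer> scanWithExpansion() {
        List<Integer> seen = new ArrayList<>();
        List<List<Integer>> table = build(4, 8);
        int cursor = 2;
        readBuckets(table, 0, cursor, seen);   // first batch: buckets 0 and 1
        table = build(8, 8);                   // expansion rehashes the same 8 keys
        readBuckets(table, cursor, 8, seen);   // resume from the stale cursor
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(scanWithExpansion());
    }
}
```

In this model the scan returns ten values for eight keys, because keys 4 and 5 were read from buckets 0 and 1 before the expansion and again from buckets 4 and 5 after it. The toy layout only reproduces the duplicate case; missing entries would need a layout in which entries move behind the cursor.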
github.com/hazelcast/hazelcast-spark
THANKS! Any questions? You can find me at @gamussa, viktor@hazelcast.com