Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Data at the Speed of your Users
Search
Rustam Aliyev
September 26, 2014
Technology
1
66
Data at the Speed of your Users
Apache Cassandra and Spark for simple, distributed, near real-time stream processing.
Rustam Aliyev
September 26, 2014
Tweet
Share
More Decks by Rustam Aliyev
See All by Rustam Aliyev
From monolith web app to micro-frontends
rstml
0
900
Lightning Fast Analytics with Spark and Cassandra
rstml
2
300
Deep dive into CQL
rstml
1
45
Other Decks in Technology
See All in Technology
反実仮想機械学習とは何か
usaito
PRO
11
4.6k
Cloud Native Java with Spring Boot (CNCF Aarhus, April 2024)
thomasvitale
1
170
本当のAWS基礎
toru_kubota
0
520
プロトタイピングによる不確実性の低減 / Reducing Uncertainty through Prototyping
ohbarye
5
380
ワールドカフェI /チューターを改良する / World Café I and Improving the Tutors
ks91
PRO
0
120
ゼロから始めるVue.jsコミュニティ貢献 / first-vuejs-community-contribution-link-and-motivation
lmi
1
120
Janus
bkuhlmann
1
490
チームでロジカルシンキングに改めて向き合っている話 〜学習環境と実践⽅法〜
sansantech
PRO
3
2.4k
Azure Container Apps + Bicep 〜 こんな感じで運用しています
kaz29
2
470
APIファーストなプロダクトマネジメントの実践 〜SaaSus Platformでの例〜 / "Practicing API-First Product Management - An Example with SaaSus Platform
oztick139
0
100
Hands-on Gemini, the Google DeepMind LLM
meteatamel
1
110
検証を通して見えてきたTiDBの性能特性
lycorptech_jp
PRO
6
3.8k
Featured
See All Featured
Thoughts on Productivity
jonyablonski
58
3.8k
KATA
mclloyd
15
12k
RailsConf 2023
tenderlove
4
540
Building Applications with DynamoDB
mza
88
5.6k
The Cost Of JavaScript in 2023
addyosmani
16
3.9k
Rails Girls Zürich Keynote
gr2m
91
13k
10 Git Anti Patterns You Should be Aware of
lemiorhan
648
58k
From Idea to $5000 a Month in 5 Months
shpigford
377
45k
Pencils Down: Stop Designing & Start Developing
hursman
117
11k
Robots, Beer and Maslow
schacon
PRO
155
7.9k
Java REST API Framework Comparison - PWX 2021
mraible
PRO
18
6.9k
Clear Off the Table
cherdarchuk
84
310k
Transcript
Data at the Speed of your Users Apache Cassandra and
Spark for simple, distributed, near real-time stream processing. GOTO Copenhagen 2014
Rustam Aliyev Solution Architect at . ! ! @rstml
Big Data? Photo: Flickr / Watches En Masse
" Volume # Variety $ Velocity
Velocity = Near Real Time
Near Real Time?
0.5 sec ≤ ≤ 60 sec Near Real Time
Use Cases Photo: Flickr / Swiss Army / Jim Pennucci
Web Analytics Dynamic Pricing Recommendation Fraud Detection
Architecture Photo: Ilkin Kangarli / Baku Haydar Aliyev Center
Architecture Goals Low Latency High Availability Horizontal Scalability Simplicity
Stream Processing % % % % % % % %
% % % % % % % % % % % % % % % % % % % % % % % Collection Processing Storing Delivery
Stream Processing % % % % % % % %
% % % % % % % % % % % % % % % % % % % % % % % Collection ! ! Spark ! Cassandra Delivery
Cassandra Distributed Database Photo: Flickr / Hypostyle Hall / Jorge
Láscar
Data Model
Partition Cell 1 Cell 2 … Cell 3 Partition Key
Partition os: Android storage: 32GB version: 4.4 weight: 130g sort
order on disk Nexus 5
Table os: Android storage: 32GB version: 4.4 weight: 130g Nexus
5 os: iOS storage: 64GB version: 8.0 weight: 129g iPhone 6
Distribution
0000 8000 4000 C000 2000 6000 E000 A000 3D97 Nexus
5
0000 8000 4000 C000 2000 6000 E000 A000 9C4F iPhone
6 3D97
Replication
0000 8000 4000 C000 2000 6000 E000 A000 3D97 9C4F
1 replica
0000 8000 4000 C000 2000 6000 E000 A000 3D97 9C4F
9C4F 3D97 2 replicas
Spark Distributed Data Processing Engine Photo: Flickr / Sparklers /
Alexandra Compo / CreativeCommons
Fast In-memory
Logistic Regression Running Time (s) 1000 2000 3000 4000 Number
of Iterations 1 5 10 20 30 Spark Hadoop
Easy
map ! reduce
map filter groupBy sort union join leftOuterJoin rightOuterJoin reduce count
fold reduceByKey groupByKey cogroup cross zip sample take first partitionBy mapWith pipe save ...
RDD Resilient Distributed Datasets Node 1 Node 2 Node 3
Node 1 Node 2 Node 3
Operator DAG groupBy join filter map Disk RDD Memory RDD
Spark Streaming Micro-batching
RDD DStream Data Stream
Spark + Cassandra DataStax Spark Cassandra Connector
https://github.com/datastax/spark-cassandra-connector
M M
M Cassandra Spark Worker Spark Master & Worker
Demo ! ! Twitter Analytics
Cassandra Data Model
ALL: 7139 2014-09-21: 220 2014-09-20: 309 2014-09-19: 129 sort order
#hashtag
CREATE TABLE hashtags ( hashtag text,
interval text, mentions counter, PRIMARY KEY((hashtag), interval) ) WITH CLUSTERING ORDER BY (interval DESC);
Processing Data Stream
import com.datastax.spark.connector.streaming._ ! val sc = new SparkConf()
.setMaster("spark://127.0.0.1:7077") .setAppName("Twitter-‐Demo") .setJars("demo-‐assembly-‐1.0.jar")) .set("spark.cassandra.connection.host", "127.0.0.1") ! val ssc = new StreamingContext(sc, Seconds(2)) ! val stream = TwitterUtils. createStream(ssc, None, Nil, storageLevel = StorageLevel.MEMORY_ONLY_SER_2) ! val hashTags = stream.flatMap(tweet => tweet.getText.toLowerCase.split(" "). filter(tags.contains(Seq("#iphone", "#android")))) ! val tagCounts = hashTags.map((_, 1)).reduceByKey(_ + _) ! val tagCountsAll = tagCounts.map{ case (tag, mentions) => (tag, mentions, "ALL") } !
import com.datastax.spark.connector.streaming._ ! val sc = new SparkConf()
.setMaster("spark://127.0.0.1:7077") .setAppName("Twitter-‐Demo") .setJars("demo-‐assembly-‐1.0.jar")) .set("spark.cassandra.connection.host", "127.0.0.1") ! val ssc = new StreamingContext(sc, Seconds(2)) ! val stream = TwitterUtils. createStream(ssc, None, Nil, storageLevel = StorageLevel.MEMORY_ONLY_SER_2) ! val hashTags = stream.flatMap(tweet => tweet.getText.toLowerCase.split(" "). filter(tags.contains(Seq("#iphone", "#android")))) ! val tagCounts = hashTags.map((_, 1)).reduceByKey(_ + _) ! val tagCountsAll = tagCounts.map{ case (tag, mentions) => (tag, mentions, "ALL") } !
import com.datastax.spark.connector.streaming._ ! val sc = new SparkConf()
.setMaster("spark://127.0.0.1:7077") .setAppName("Twitter-‐Demo") .setJars("demo-‐assembly-‐1.0.jar")) .set("spark.cassandra.connection.host", "127.0.0.1") ! val ssc = new StreamingContext(sc, Seconds(2)) ! val stream = TwitterUtils. createStream(ssc, None, Nil, storageLevel = StorageLevel.MEMORY_ONLY_SER_2) ! val hashTags = stream.flatMap(tweet => tweet.getText.toLowerCase.split(" "). filter(tags.contains(Seq("#iphone", "#android")))) ! val tagCounts = hashTags.map((_, 1)).reduceByKey(_ + _) ! val tagCountsAll = tagCounts.map{ case (tag, mentions) => (tag, mentions, "ALL") } !
! val ssc = new StreamingContext(sc, Seconds(2)) ! val
stream = TwitterUtils. createStream(ssc, None, Nil, storageLevel = StorageLevel.MEMORY_ONLY_SER_2) ! val hashTags = stream.flatMap(tweet => tweet.getText.toLowerCase.split(" "). filter(tags.contains(Seq("#iphone", "#android")))) ! val tagCounts = hashTags.map((_, 1)).reduceByKey(_ + _) ! val tagCountsAll = tagCounts.map{ case (tag, mentions) => (tag, mentions, "ALL") } ! tagCountsAll.saveToCassandra( "demo_ks", "hashtags", Seq("hashtag", "mentions", "interval")) ! ssc.start() ssc.awaitTermination()
! val ssc = new StreamingContext(sc, Seconds(2)) ! val
stream = TwitterUtils. createStream(ssc, None, Nil, storageLevel = StorageLevel.MEMORY_ONLY_SER_2) ! val hashTags = stream.flatMap(tweet => tweet.getText.toLowerCase.split(" "). filter(tags.contains(Seq("#iphone", "#android")))) ! val tagCounts = hashTags.map((_, 1)).reduceByKey(_ + _) ! val tagCountsByDay = tagCounts.map{ case (tag, mentions) => (tag, mentions, DateTime.now.toString("yyyyMMdd")) } ! tagCountsByDay.saveToCassandra( "demo_ks", "hashtags", Seq("hashtag", "mentions", "interval")) ! ssc.start() ssc.awaitTermination()
! val ssc = new StreamingContext(sc, Seconds(2)) ! val
stream = TwitterUtils. createStream(ssc, None, Nil, storageLevel = StorageLevel.MEMORY_ONLY_SER_2) ! val hashTags = stream.flatMap(tweet => tweet.getText.toLowerCase.split(" "). filter(tags.contains(Seq("#iphone", "#android")))) ! val tagCounts = hashTags.map((_, 1)).reduceByKey(_ + _) ! val tagCountsAll = tagCounts.map{ case (tag, mentions) => (tag, mentions, "ALL") } ! tagCountsAll.saveToCassandra( "demo_ks", "hashtags", Seq("hashtag", "mentions", "interval")) ! ssc.start() ssc.awaitTermination()
Questions ?