Data at the Speed of your Users
Rustam Aliyev
September 26, 2014
Apache Cassandra and Spark for simple, distributed, near real-time stream processing.
Transcript
Data at the Speed of your Users: Apache Cassandra and Spark for simple, distributed, near real-time stream processing. GOTO Copenhagen 2014
Rustam Aliyev, Solution Architect, @rstml
Big Data? Photo: Flickr / Watches En Masse
Volume, Variety, Velocity
Velocity = Near Real Time
Near Real Time?
0.5 sec ≤ Near Real Time ≤ 60 sec
Use Cases Photo: Flickr / Swiss Army / Jim Pennucci
Web Analytics, Dynamic Pricing, Recommendation, Fraud Detection
Architecture Photo: Ilkin Kangarli / Baku Haydar Aliyev Center
Architecture Goals: Low Latency, High Availability, Horizontal Scalability, Simplicity
Stream Processing: Collection → Processing → Storing → Delivery
Stream Processing: Collection → Spark → Cassandra → Delivery
Cassandra: Distributed Database. Photo: Flickr / Hypostyle Hall / Jorge Láscar
Data Model
Partition: Partition Key → Cell 1, Cell 2, Cell 3, …
Partition (cells in sort order on disk): Nexus 5 → os: Android, storage: 32GB, version: 4.4, weight: 130g
Table:
  Nexus 5  → os: Android, storage: 32GB, version: 4.4, weight: 130g
  iPhone 6 → os: iOS, storage: 64GB, version: 8.0, weight: 129g
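The table slide above can be pictured as a map from partition key to a sorted set of cells. The following is a minimal sketch in plain Scala, not Cassandra code; the names `DataModelSketch`, `Partition`, `Table`, `devices`, and `cell` are made up for illustration, and sorting cells by name is an assumption that happens to match the order shown on the slide.

```scala
import scala.collection.immutable.SortedMap

object DataModelSketch {
  // A partition: cell name -> cell value, kept sorted by cell name.
  type Partition = SortedMap[String, String]
  // A table: partition key -> partition.
  type Table = Map[String, Partition]

  // The two partitions shown on the slide.
  val devices: Table = Map(
    "Nexus 5" -> SortedMap(
      "os" -> "Android", "storage" -> "32GB", "version" -> "4.4", "weight" -> "130g"),
    "iPhone 6" -> SortedMap(
      "os" -> "iOS", "storage" -> "64GB", "version" -> "8.0", "weight" -> "129g")
  )

  // Look up a single cell by partition key and cell name.
  def cell(table: Table, key: String, name: String): Option[String] =
    table.get(key).flatMap(_.get(name))
}
```

For example, `DataModelSketch.cell(DataModelSketch.devices, "Nexus 5", "os")` yields `Some("Android")`, and a lookup for an unknown partition key yields `None`.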
Distribution
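[Diagram: token ring with nodes at 0000, 2000, 4000, 6000, 8000, A000, C000, E000; partition key Nexus 5 maps to token 3D97]
[Diagram: token ring with nodes at 0000, 2000, 4000, 6000, 8000, A000, C000, E000; iPhone 6 maps to token 9C4F, alongside 3D97]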
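A minimal sketch of how a token on the ring could be mapped to its owning node. It assumes that a node owns the range ending at its own token, so a key belongs to the first node token at or after the key's token, wrapping around at the end of the ring. The 16-bit hex tokens are the simplified values from the slide, and `TokenRingSketch` and `owner` are hypothetical names.

```scala
import scala.collection.immutable.TreeSet

object TokenRingSketch {
  // Node tokens from the slide, written as hex literals.
  val nodeTokens: TreeSet[Int] =
    TreeSet(0x0000, 0x2000, 0x4000, 0x6000, 0x8000, 0xA000, 0xC000, 0xE000)

  // Owner = first node token >= key token; past the highest token, wrap to the lowest.
  def owner(keyToken: Int): Int = {
    val it = nodeTokens.iteratorFrom(keyToken)
    if (it.hasNext) it.next() else nodeTokens.head
  }
}
```

Under these assumptions, token 3D97 (Nexus 5) lands on the node at 4000 and token 9C4F (iPhone 6) lands on the node at A000.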
Replication
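[Diagram: token ring with nodes at 0000, 2000, 4000, 6000, 8000, A000, C000, E000; 3D97 and 9C4F each stored on one node: 1 replica]
[Diagram: token ring with nodes at 0000, 2000, 4000, 6000, 8000, A000, C000, E000; 3D97 and 9C4F each stored on two nodes: 2 replicas]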
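One simple way to pick replica nodes is to take the owning node and then keep walking clockwise around the ring. The sketch below assumes exactly that placement rule; the slides do not say which nodes hold the extra copies, so treat the rule, and the names `ReplicationSketch` and `replicas`, as illustrative assumptions.

```scala
object ReplicationSketch {
  // Node tokens from the slide, in ring order.
  val ring: Vector[Int] =
    Vector(0x0000, 0x2000, 0x4000, 0x6000, 0x8000, 0xA000, 0xC000, 0xE000)

  // Replicas = owning node plus the next nodes clockwise, rf nodes in total.
  def replicas(keyToken: Int, rf: Int): Vector[Int] = {
    val first = ring.indexWhere(_ >= keyToken) match {
      case -1 => 0 // past the highest token: wrap around to the first node
      case i  => i
    }
    // Never return more replicas than there are nodes.
    Vector.tabulate(math.min(rf, ring.size))(k => ring((first + k) % ring.size))
  }
}
```

With a replication factor of 2, token 3D97 would be stored on the nodes at 4000 and 6000, and token 9C4F on the nodes at A000 and C000.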
Spark: Distributed Data Processing Engine. Photo: Flickr / Sparklers / Alexandra Compo / CreativeCommons
Fast In-memory
[Chart: Logistic Regression, running time (s) vs. number of iterations, Spark vs. Hadoop]
Easy
map, reduce
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
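Several of the operators listed above also exist on ordinary Scala collections, which makes it easy to try the style locally. This is a sketch on a plain `Seq`, not on a Spark RDD; the page-view data and the names `OperatorsSketch`, `views`, `viewsPerPage`, and `homeViews` are invented for illustration.

```scala
object OperatorsSketch {
  // Hypothetical page-view events: (page, user).
  val views: Seq[(String, String)] = Seq(
    ("/home", "ann"), ("/pricing", "bob"), ("/home", "bob"), ("/home", "ann")
  )

  // map + groupBy + reduce: views per page (a local stand-in for reduceByKey).
  val viewsPerPage: Map[String, Int] =
    views
      .map { case (page, _) => (page, 1) }
      .groupBy(_._1)
      .map { case (page, ones) => page -> ones.map(_._2).reduce(_ + _) }

  // filter + size: number of views of a single page.
  val homeViews: Int = views.filter(_._1 == "/home").size
}
```

On an RDD of pairs the grouping and summing would typically be a single `reduceByKey(_ + _)`, as in the demo code later in the deck.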
RDD: Resilient Distributed Datasets, spread across Node 1, Node 2, Node 3
Operator DAG: groupBy, join, filter, map. Legend: Disk RDD, Memory RDD
Spark Streaming Micro-batching
RDD DStream Data Stream
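Micro-batching can be sketched without Spark: events are grouped by the fixed-length interval in which they arrive, and each group is then processed as one small batch (one RDD of the DStream). The code below is a plain-Scala illustration under that reading; `MicroBatchSketch`, `Event`, and `microBatches` are hypothetical names, and unlike a real stream this sketch omits intervals in which nothing arrived.

```scala
object MicroBatchSketch {
  // Hypothetical event: arrival time in milliseconds plus a payload.
  final case class Event(timeMs: Long, payload: String)

  // Assign each event to the batch interval covering its arrival time.
  // Returns (batch start time, payloads in arrival order), sorted by start time.
  def microBatches(events: Seq[Event], batchMs: Long): Seq[(Long, Seq[String])] =
    events
      .groupBy(e => e.timeMs / batchMs * batchMs) // start of the containing interval
      .toSeq
      .sortBy(_._1)
      .map { case (start, es) => (start, es.sortBy(_.timeMs).map(_.payload)) }
}
```

With a 2-second batch interval, as used in the demo, events arriving at 100 ms and 1900 ms fall into the batch starting at 0, and an event at 2000 ms opens the next batch.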
Spark + Cassandra DataStax Spark Cassandra Connector
https://github.com/datastax/spark-cassandra-connector
[Diagram legend: M, Cassandra, Spark Worker, Spark Master & Worker]
Demo: Twitter Analytics
Cassandra Data Model
Partition #hashtag (cells in sort order): ALL: 7139, 2014-09-21: 220, 2014-09-20: 309, 2014-09-19: 129
CREATE TABLE hashtags (
    hashtag  text,
    interval text,
    mentions counter,
    PRIMARY KEY ((hashtag), interval)
) WITH CLUSTERING ORDER BY (interval DESC);
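The behaviour of this table can be modelled in a few lines of plain Scala: a write to a counter column adds to the stored value, and the cells of a partition are kept in descending `interval` order. This is an in-memory sketch, not Cassandra or connector code; `CounterTableSketch`, `Hashtags`, and `increment` are hypothetical names.

```scala
import scala.collection.immutable.SortedMap

object CounterTableSketch {
  // hashtag -> (interval -> mentions); intervals are kept in descending order
  // to mirror CLUSTERING ORDER BY (interval DESC).
  type Hashtags = Map[String, SortedMap[String, Long]]

  private val descending: Ordering[String] = Ordering[String].reverse

  // A counter write adds to the stored value instead of overwriting it.
  def increment(table: Hashtags, hashtag: String, interval: String, by: Long): Hashtags = {
    val partition = table.getOrElse(hashtag, SortedMap.empty[String, Long](descending))
    val updated   = partition.updated(interval, partition.getOrElse(interval, 0L) + by)
    table.updated(hashtag, updated)
  }
}
```

Because `"ALL"` compares greater than any date string that starts with a digit, descending order puts the `ALL` cell first and the most recent day next, which matches the order on the data model slide.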
Processing Data Stream
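import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._  // pair-DStream operations such as reduceByKey on older Spark 1.x releases
import org.apache.spark.streaming.twitter.TwitterUtils
import com.datastax.spark.connector.streaming._

// Spark configuration (despite the name, sc is a SparkConf, not a SparkContext)
val sc = new SparkConf()
  .setMaster("spark://127.0.0.1:7077")
  .setAppName("Twitter-Demo")
  .setJars(Seq("demo-assembly-1.0.jar"))
  .set("spark.cassandra.connection.host", "127.0.0.1")

// Micro-batches of 2 seconds
val ssc = new StreamingContext(sc, Seconds(2))

// Twitter stream with default authentication and no server-side filters
val stream = TwitterUtils.createStream(ssc, None, Nil,
  storageLevel = StorageLevel.MEMORY_ONLY_SER_2)

// Keep only the hashtags we track
val tags = Set("#iphone", "#android")
val hashTags = stream.flatMap(tweet =>
  tweet.getText.toLowerCase.split(" ").filter(tags.contains))

// Count mentions per hashtag within each batch
val tagCounts = hashTags.map((_, 1)).reduceByKey(_ + _)

// Rows for the all-time counter
val tagCountsAll = tagCounts.map { case (tag, mentions) => (tag, mentions, "ALL") }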
// Micro-batches of 2 seconds
val ssc = new StreamingContext(sc, Seconds(2))

val stream = TwitterUtils.createStream(ssc, None, Nil,
  storageLevel = StorageLevel.MEMORY_ONLY_SER_2)

// Keep only the hashtags we track
val tags = Set("#iphone", "#android")
val hashTags = stream.flatMap(tweet =>
  tweet.getText.toLowerCase.split(" ").filter(tags.contains))

val tagCounts = hashTags.map((_, 1)).reduceByKey(_ + _)

// Rows for the all-time counter
val tagCountsAll = tagCounts.map { case (tag, mentions) => (tag, mentions, "ALL") }

// mentions is a counter column, so each batch adds to the stored value
tagCountsAll.saveToCassandra(
  "demo_ks", "hashtags", Seq("hashtag", "mentions", "interval"))

ssc.start()
ssc.awaitTermination()
// Micro-batches of 2 seconds
val ssc = new StreamingContext(sc, Seconds(2))

val stream = TwitterUtils.createStream(ssc, None, Nil,
  storageLevel = StorageLevel.MEMORY_ONLY_SER_2)

// Keep only the hashtags we track
val tags = Set("#iphone", "#android")
val hashTags = stream.flatMap(tweet =>
  tweet.getText.toLowerCase.split(" ").filter(tags.contains))

val tagCounts = hashTags.map((_, 1)).reduceByKey(_ + _)

// Rows for the per-day counter (DateTime is org.joda.time.DateTime);
// the pattern follows the dates on the data model slide, e.g. 2014-09-21
val tagCountsByDay = tagCounts.map { case (tag, mentions) =>
  (tag, mentions, DateTime.now.toString("yyyy-MM-dd")) }

tagCountsByDay.saveToCassandra(
  "demo_ks", "hashtags", Seq("hashtag", "mentions", "interval"))

ssc.start()
ssc.awaitTermination()
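The per-batch logic of the job can be tried without a cluster by applying the same steps to an in-memory batch of tweet texts. The sketch below uses plain Scala collections instead of DStreams; `HashtagCountSketch`, `tracked`, and `countBatch` are hypothetical names, and the day string is passed in rather than read from the clock so that the result is reproducible.

```scala
object HashtagCountSketch {
  val tracked: Set[String] = Set("#iphone", "#android")

  // Same steps as the streaming job, applied to one batch of tweet texts.
  // Returns (hashtag, mentions, interval) rows for both "ALL" and the given day.
  def countBatch(tweets: Seq[String], day: String): Seq[(String, Int, String)] = {
    // Split on spaces and keep only tracked hashtags (case-insensitive).
    val hashTags = tweets.flatMap(_.toLowerCase.split(" ").filter(tracked.contains))
    // Local stand-in for map((_, 1)).reduceByKey(_ + _), sorted for stable output.
    val tagCounts = hashTags
      .groupBy(identity)
      .map { case (tag, hits) => (tag, hits.size) }
      .toSeq
      .sortBy(_._1)
    tagCounts.flatMap { case (tag, mentions) =>
      Seq((tag, mentions, "ALL"), (tag, mentions, day))
    }
  }
}
```

Splitting on single spaces means a hashtag followed directly by punctuation, such as `#android,`, is not counted; the demo code has the same limitation.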
Questions?