Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Data at the Speed of your Users
Search
Rustam Aliyev
September 26, 2014
Technology
1
77
Data at the Speed of your Users
Apache Cassandra and Spark for simple, distributed, near real-time stream processing.
Rustam Aliyev
September 26, 2014
Tweet
Share
More Decks by Rustam Aliyev
See All by Rustam Aliyev
From monolith web app to micro-frontends
rstml
0
960
Lightning Fast Analytics with Spark and Cassandra
rstml
2
320
Deep dive into CQL
rstml
1
65
Other Decks in Technology
See All in Technology
22nd ACRi Webinar - NTT Kawahara-san's slide
nao_sumikawa
0
100
AI駆動開発を事業のコアに置く
tasukuonizawa
1
320
Introduction to Sansan for Engineers / エンジニア向け会社紹介
sansan33
PRO
6
68k
Why Organizations Fail: ノーベル経済学賞「国家はなぜ衰退するのか」から考えるアジャイル組織論
kawaguti
PRO
1
140
コミュニティが変えるキャリアの地平線:コロナ禍新卒入社のエンジニアがAWSコミュニティで見つけた成長の羅針盤
kentosuzuki
0
130
Frontier Agents (Kiro autonomous agent / AWS Security Agent / AWS DevOps Agent) の紹介
msysh
3
180
ブロックテーマ、WordPress でウェブサイトをつくるということ / 2026.02.07 Gifu WordPress Meetup
torounit
0
190
Oracle Cloud Observability and Management Platform - OCI 運用監視サービス概要 -
oracle4engineer
PRO
2
14k
日本の85%が使う公共SaaSは、どう育ったのか
taketakekaho
1
230
Oracle Base Database Service 技術詳細
oracle4engineer
PRO
15
93k
予期せぬコストの急増を障害のように扱う――「コスト版ポストモーテム」の導入とその後の改善
muziyoshiz
1
2k
Tebiki Engineering Team Deck
tebiki
0
24k
Featured
See All Featured
Lightning talk: Run Django tests with GitHub Actions
sabderemane
0
120
DBのスキルで生き残る技術 - AI時代におけるテーブル設計の勘所
soudai
PRO
62
50k
The AI Search Optimization Roadmap by Aleyda Solis
aleyda
1
5.2k
Agile Actions for Facilitating Distributed Teams - ADO2019
mkilby
0
120
Dominate Local Search Results - an insider guide to GBP, reviews, and Local SEO
greggifford
PRO
0
78
Music & Morning Musume
bryan
47
7.1k
Why Mistakes Are the Best Teachers: Turning Failure into a Pathway for Growth
auna
0
54
Navigating Weather and Climate Data
rabernat
0
110
Improving Core Web Vitals using Speculation Rules API
sergeychernyshev
21
1.4k
The Art of Programming - Codeland 2020
erikaheidi
57
14k
Fireside Chat
paigeccino
41
3.8k
The AI Revolution Will Not Be Monopolized: How open-source beats economies of scale, even for LLMs
inesmontani
PRO
3
3k
Transcript
Data at the Speed of your Users Apache Cassandra and
Spark for simple, distributed, near real-time stream processing. GOTO Copenhagen 2014
Rustam Aliyev Solution Architect at . ! ! @rstml
Big Data? Photo: Flickr / Watches En Masse
" Volume # Variety $ Velocity
Velocity = Near Real Time
Near Real Time?
0.5 sec ≤ ≤ 60 sec Near Real Time
Use Cases Photo: Flickr / Swiss Army / Jim Pennucci
Web Analytics Dynamic Pricing Recommendation Fraud Detection
Architecture Photo: Ilkin Kangarli / Baku Haydar Aliyev Center
Architecture Goals Low Latency High Availability Horizontal Scalability Simplicity
Stream Processing % % % % % % % %
% % % % % % % % % % % % % % % % % % % % % % % Collection Processing Storing Delivery
Stream Processing % % % % % % % %
% % % % % % % % % % % % % % % % % % % % % % % Collection ! ! Spark ! Cassandra Delivery
Cassandra Distributed Database Photo: Flickr / Hypostyle Hall / Jorge
Láscar
Data Model
Partition Cell 1 Cell 2 … Cell 3 Partition Key
Partition os: Android storage: 32GB version: 4.4 weight: 130g sort
order on disk Nexus 5
Table os: Android storage: 32GB version: 4.4 weight: 130g Nexus
5 os: iOS storage: 64GB version: 8.0 weight: 129g iPhone 6
Distribution
0000 8000 4000 C000 2000 6000 E000 A000 3D97 Nexus
5
0000 8000 4000 C000 2000 6000 E000 A000 9C4F iPhone
6 3D97
Replication
0000 8000 4000 C000 2000 6000 E000 A000 3D97 9C4F
1 replica
0000 8000 4000 C000 2000 6000 E000 A000 3D97 9C4F
9C4F 3D97 2 replicas
Spark Distributed Data Processing Engine Photo: Flickr / Sparklers /
Alexandra Compo / CreativeCommons
Fast In-memory
Logistic Regression Running Time (s) 1000 2000 3000 4000 Number
of Iterations 1 5 10 20 30 Spark Hadoop
Easy
map ! reduce
map filter groupBy sort union join leftOuterJoin rightOuterJoin reduce count
fold reduceByKey groupByKey cogroup cross zip sample take first partitionBy mapWith pipe save ...
RDD Resilient Distributed Datasets Node 1 Node 2 Node 3
Node 1 Node 2 Node 3
Operator DAG groupBy join filter map Disk RDD Memory RDD
Spark Streaming Micro-batching
RDD DStream Data Stream
Spark + Cassandra DataStax Spark Cassandra Connector
https://github.com/datastax/spark-cassandra-connector
M M
M Cassandra Spark Worker Spark Master & Worker
Demo ! ! Twitter Analytics
Cassandra Data Model
ALL: 7139 2014-09-21: 220 2014-09-20: 309 2014-09-19: 129 sort order
#hashtag
CREATE TABLE hashtags ( hashtag text,
interval text, mentions counter, PRIMARY KEY((hashtag), interval) ) WITH CLUSTERING ORDER BY (interval DESC);
Processing Data Stream
import com.datastax.spark.connector.streaming._ ! val sc = new SparkConf()
.setMaster("spark://127.0.0.1:7077") .setAppName("Twitter-‐Demo") .setJars("demo-‐assembly-‐1.0.jar")) .set("spark.cassandra.connection.host", "127.0.0.1") ! val ssc = new StreamingContext(sc, Seconds(2)) ! val stream = TwitterUtils. createStream(ssc, None, Nil, storageLevel = StorageLevel.MEMORY_ONLY_SER_2) ! val hashTags = stream.flatMap(tweet => tweet.getText.toLowerCase.split(" "). filter(tags.contains(Seq("#iphone", "#android")))) ! val tagCounts = hashTags.map((_, 1)).reduceByKey(_ + _) ! val tagCountsAll = tagCounts.map{ case (tag, mentions) => (tag, mentions, "ALL") } !
import com.datastax.spark.connector.streaming._ ! val sc = new SparkConf()
.setMaster("spark://127.0.0.1:7077") .setAppName("Twitter-‐Demo") .setJars("demo-‐assembly-‐1.0.jar")) .set("spark.cassandra.connection.host", "127.0.0.1") ! val ssc = new StreamingContext(sc, Seconds(2)) ! val stream = TwitterUtils. createStream(ssc, None, Nil, storageLevel = StorageLevel.MEMORY_ONLY_SER_2) ! val hashTags = stream.flatMap(tweet => tweet.getText.toLowerCase.split(" "). filter(tags.contains(Seq("#iphone", "#android")))) ! val tagCounts = hashTags.map((_, 1)).reduceByKey(_ + _) ! val tagCountsAll = tagCounts.map{ case (tag, mentions) => (tag, mentions, "ALL") } !
import com.datastax.spark.connector.streaming._ ! val sc = new SparkConf()
.setMaster("spark://127.0.0.1:7077") .setAppName("Twitter-‐Demo") .setJars("demo-‐assembly-‐1.0.jar")) .set("spark.cassandra.connection.host", "127.0.0.1") ! val ssc = new StreamingContext(sc, Seconds(2)) ! val stream = TwitterUtils. createStream(ssc, None, Nil, storageLevel = StorageLevel.MEMORY_ONLY_SER_2) ! val hashTags = stream.flatMap(tweet => tweet.getText.toLowerCase.split(" "). filter(tags.contains(Seq("#iphone", "#android")))) ! val tagCounts = hashTags.map((_, 1)).reduceByKey(_ + _) ! val tagCountsAll = tagCounts.map{ case (tag, mentions) => (tag, mentions, "ALL") } !
! val ssc = new StreamingContext(sc, Seconds(2)) ! val
stream = TwitterUtils. createStream(ssc, None, Nil, storageLevel = StorageLevel.MEMORY_ONLY_SER_2) ! val hashTags = stream.flatMap(tweet => tweet.getText.toLowerCase.split(" "). filter(tags.contains(Seq("#iphone", "#android")))) ! val tagCounts = hashTags.map((_, 1)).reduceByKey(_ + _) ! val tagCountsAll = tagCounts.map{ case (tag, mentions) => (tag, mentions, "ALL") } ! tagCountsAll.saveToCassandra( "demo_ks", "hashtags", Seq("hashtag", "mentions", "interval")) ! ssc.start() ssc.awaitTermination()
! val ssc = new StreamingContext(sc, Seconds(2)) ! val
stream = TwitterUtils. createStream(ssc, None, Nil, storageLevel = StorageLevel.MEMORY_ONLY_SER_2) ! val hashTags = stream.flatMap(tweet => tweet.getText.toLowerCase.split(" "). filter(tags.contains(Seq("#iphone", "#android")))) ! val tagCounts = hashTags.map((_, 1)).reduceByKey(_ + _) ! val tagCountsByDay = tagCounts.map{ case (tag, mentions) => (tag, mentions, DateTime.now.toString("yyyyMMdd")) } ! tagCountsByDay.saveToCassandra( "demo_ks", "hashtags", Seq("hashtag", "mentions", "interval")) ! ssc.start() ssc.awaitTermination()
! val ssc = new StreamingContext(sc, Seconds(2)) ! val
stream = TwitterUtils. createStream(ssc, None, Nil, storageLevel = StorageLevel.MEMORY_ONLY_SER_2) ! val hashTags = stream.flatMap(tweet => tweet.getText.toLowerCase.split(" "). filter(tags.contains(Seq("#iphone", "#android")))) ! val tagCounts = hashTags.map((_, 1)).reduceByKey(_ + _) ! val tagCountsAll = tagCounts.map{ case (tag, mentions) => (tag, mentions, "ALL") } ! tagCountsAll.saveToCassandra( "demo_ks", "hashtags", Seq("hashtag", "mentions", "interval")) ! ssc.start() ssc.awaitTermination()
Questions ?