Data at the Speed of your Users

Data at the Speed of your Users: Apache Cassandra and Spark for simple, distributed, near real-time stream processing. GOTO Copenhagen 2014

Rustam Aliyev, Solution Architect. @rstml

Big Data? Photo: Flickr / Watches En Masse

Volume, Variety, Velocity

Velocity = Near Real Time

Near Real Time?

0.5 sec ≤ Near Real Time ≤ 60 sec

Use Cases Photo: Flickr / Swiss Army / Jim Pennucci

Web Analytics, Dynamic Pricing, Recommendation, Fraud Detection

Architecture Photo: Ilkin Kangarli / Baku Haydar Aliyev Center

Architecture Goals: Low Latency, High Availability, Horizontal Scalability, Simplicity

Stream Processing: Collection → Processing → Storing → Delivery

Stream Processing: Collection → Spark (processing) → Cassandra (storing) → Delivery

Cassandra: Distributed Database. Photo: Flickr / Hypostyle Hall / Jorge Láscar

Data Model

Partition: Partition Key → Cell 1, Cell 2, …, Cell 3

Partition (key: Nexus 5): os: Android, storage: 32GB, version: 4.4, weight: 130g (cells in sort order on disk)

Table
Nexus 5: os: Android, storage: 32GB, version: 4.4, weight: 130g
iPhone 6: os: iOS, storage: 64GB, version: 8.0, weight: 129g
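The table and partition slides can be read as a map from partition key to a sorted map of cells. The sketch below is only an in-memory illustration of that shape in plain Scala; the names `DataModelSketch`, `phones`, and `cellNames` are made up for this example and are not part of Cassandra.

```scala
import scala.collection.immutable.SortedMap

object DataModelSketch {
  // A partition: cells (name -> value) kept sorted by cell name.
  type Partition = SortedMap[String, String]

  // A table: partition key -> partition. Values are taken from the slides.
  val phones: Map[String, Partition] = Map(
    "Nexus 5" -> SortedMap(
      "weight" -> "130g", "os" -> "Android", "version" -> "4.4", "storage" -> "32GB"),
    "iPhone 6" -> SortedMap(
      "os" -> "iOS", "storage" -> "64GB", "version" -> "8.0", "weight" -> "129g")
  )

  // Cell names of one partition, in sorted order (empty if the key is unknown).
  def cellNames(key: String): List[String] =
    phones.get(key).map(_.keys.toList).getOrElse(Nil)
}
```

Even though the Nexus 5 cells are inserted out of order here, `cellNames("Nexus 5")` returns them sorted, which mirrors the "sort order on disk" note on the slide.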

Distribution

Ring tokens: 0000, 2000, 4000, 6000, 8000, A000, C000, E000. Nexus 5 → token 3D97

Ring tokens: 0000, 2000, 4000, 6000, 8000, A000, C000, E000. iPhone 6 → token 9C4F; Nexus 5 → token 3D97
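To make the ring lookup concrete, here is a minimal sketch that finds the node responsible for a token. It assumes the usual Cassandra convention that a node owns the range ending at its own token (the slides do not show which node stores which key), and it uses the 16-bit tokens from the diagram rather than a real partitioner hash. `TokenRingSketch` and `ownerOf` are hypothetical names.

```scala
import scala.collection.immutable.TreeSet

object TokenRingSketch {
  // Node tokens from the slide, kept sorted.
  val nodeTokens: TreeSet[Int] =
    TreeSet(0x0000, 0x2000, 0x4000, 0x6000, 0x8000, 0xA000, 0xC000, 0xE000)

  // Owner = first node token >= the key's token; wrap to the first node
  // when the key's token is past the last node on the ring.
  def ownerOf(token: Int): Int = {
    val it = nodeTokens.iteratorFrom(token)
    if (it.hasNext) it.next() else nodeTokens.head
  }
}
```

Under that assumption, token 3D97 (Nexus 5) lands on node 4000 and token 9C4F (iPhone 6) lands on node A000.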

Replication

Ring tokens: 0000, 2000, 4000, 6000, 8000, A000, C000, E000. 1 replica: 3D97 and 9C4F are each stored once

Ring tokens: 0000, 2000, 4000, 6000, 8000, A000, C000, E000. 2 replicas: 3D97 and 9C4F are each stored twice
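One simple way to place replicas is to start at the owning node and continue clockwise around the ring. The sketch below assumes that placement rule; the slides do not state which nodes hold the second copy, so treat the concrete node choices as an illustration only. `ReplicationSketch` and `replicasFor` are hypothetical names.

```scala
object ReplicationSketch {
  // Node tokens from the slide, in ring order.
  val ring: Vector[Int] =
    Vector(0x0000, 0x2000, 0x4000, 0x6000, 0x8000, 0xA000, 0xC000, 0xE000)

  // Pick the owner (first node token >= key token, wrapping around),
  // then the next nodes clockwise until replicationFactor nodes are chosen.
  def replicasFor(token: Int, replicationFactor: Int): Vector[Int] = {
    val first = ring.indexWhere(_ >= token) match {
      case -1 => 0 // past the last node: wrap to the start of the ring
      case i  => i
    }
    val n = math.min(replicationFactor, ring.size)
    Vector.tabulate(n)(k => ring((first + k) % ring.size))
  }
}
```

With two replicas, this rule puts 3D97 on nodes 4000 and 6000, and 9C4F on nodes A000 and C000.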

Spark: Distributed Data Processing Engine. Photo: Flickr / Sparklers / Alexandra Compo / CreativeCommons

Fast In-memory

[Chart: Logistic Regression, Running Time (s) vs. Number of Iterations, Spark vs. Hadoop]

map → reduce

map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
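Several of these operators share names with the Scala collections API, so their behaviour can be tried on a local collection first. The sketch below uses plain Scala collections, not Spark RDDs; it only illustrates what map, groupBy, reduce, filter, sort, and take do to the data. The sample words and the name `OperatorsSketch` are made up.

```scala
object OperatorsSketch {
  val words: Seq[String] =
    Seq("spark", "cassandra", "spark", "storm", "spark", "cassandra")

  // map + groupBy + reduce: count occurrences of each word.
  val counts: Map[String, Int] =
    words
      .map(w => (w, 1))
      .groupBy(_._1)
      .map { case (w, pairs) => (w, pairs.map(_._2).reduce(_ + _)) }

  // filter + sort + take: the two most frequent words seen at least twice.
  val frequent: Seq[String] =
    counts
      .filter { case (_, n) => n >= 2 }
      .toSeq
      .sortBy { case (_, n) => -n }
      .map(_._1)
      .take(2)
}
```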

RDD: Resilient Distributed Datasets, spread across Node 1, Node 2, Node 3

Operator DAG: groupBy, join, filter, map. Legend: Disk RDD, Memory RDD

Spark Streaming: Micro-batching

Data Stream → DStream → RDD
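Micro-batching means cutting a continuous stream into small batches by time interval and processing each batch as an ordinary dataset. The sketch below shows that idea on a local sequence of timestamped events; it is not Spark Streaming code, and `Event`, `microBatches`, and `intervalMs` are hypothetical names. Unlike a real streaming engine, it emits nothing for intervals that contain no events.

```scala
object MicroBatchSketch {
  final case class Event(timestampMs: Long, payload: String)

  // Assign each event to the interval its timestamp falls into,
  // then return the batches in time order.
  def microBatches(events: Seq[Event], intervalMs: Long): Seq[Seq[Event]] =
    events
      .groupBy(_.timestampMs / intervalMs)
      .toSeq
      .sortBy(_._1)
      .map(_._2)
}
```

With a 2-second interval, events at 100 ms and 1900 ms fall into the first batch, and an event at 2100 ms starts the second.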

Spark + Cassandra: DataStax Spark Cassandra Connector

https://github.com/datastax/spark-cassandra-connector

[Diagram: cluster nodes, each running Cassandra and a Spark Worker; nodes marked M run Spark Master & Worker]

Demo: Twitter Analytics

Cassandra Data Model

Partition #hashtag (cells in sort order): ALL: 7139, 2014-09-21: 220, 2014-09-20: 309, 2014-09-19: 129

CREATE TABLE hashtags (
    hashtag text,
    interval text,
    mentions counter,
    PRIMARY KEY ((hashtag), interval)
) WITH CLUSTERING ORDER BY (interval DESC);
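The descending clustering order explains why the "ALL" cell comes first in the partition: as text, "ALL" compares greater than any date string that starts with a digit. A small sketch in plain Scala (not CQL) using the interval values from the data model slide; `ClusteringOrderSketch` is a made-up name.

```scala
object ClusteringOrderSketch {
  // Interval values as they appear on the data model slide, in arbitrary order.
  val intervals: Seq[String] = Seq("2014-09-19", "ALL", "2014-09-21", "2014-09-20")

  // Descending string order, mirroring CLUSTERING ORDER BY (interval DESC).
  val descending: Seq[String] = intervals.sorted(Ordering[String].reverse)
}
```

Sorting this way puts "ALL" first, followed by the dates from newest to oldest, so a query for the first few cells of a hashtag returns the total and the most recent days.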

Processing Data Stream

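import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.twitter.TwitterUtils
import com.datastax.spark.connector.streaming._
// Spark configuration, pointing at a local Spark master and Cassandra node
val sc = new SparkConf()
  .setMaster("spark://127.0.0.1:7077")
  .setAppName("Twitter-Demo")
  .setJars(Seq("demo-assembly-1.0.jar"))
  .set("spark.cassandra.connection.host", "127.0.0.1")
// Micro-batches of 2 seconds
val ssc = new StreamingContext(sc, Seconds(2))
val stream = TwitterUtils.createStream(ssc, None, Nil, storageLevel = StorageLevel.MEMORY_ONLY_SER_2)
// Keep only the hashtags being tracked
val tags = Seq("#iphone", "#android")
val hashTags = stream.flatMap(tweet => tweet.getText.toLowerCase.split(" ").filter(tags.contains(_)))
// Mentions per hashtag within each batch
val tagCounts = hashTags.map((_, 1)).reduceByKey(_ + _)
val tagCountsAll = tagCounts.map { case (tag, mentions) => (tag, mentions, "ALL") }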

// Save each batch's counts under the "ALL" interval
tagCountsAll.saveToCassandra("demo_ks", "hashtags", Seq("hashtag", "mentions", "interval"))
// Start the streaming job and block until it is stopped
ssc.start()
ssc.awaitTermination()

// Variant: per-day interval instead of "ALL". DateTime is assumed to be org.joda.time.DateTime.
val tagCountsByDay = tagCounts.map { case (tag, mentions) => (tag, mentions, DateTime.now.toString("yyyyMMdd")) }
tagCountsByDay.saveToCassandra("demo_ks", "hashtags", Seq("hashtag", "mentions", "interval"))
ssc.start()
ssc.awaitTermination()

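The demo pipeline can be followed end to end on local data: split tweets into words, keep the tracked hashtags, count them per batch, and add each batch's counts to a running counter. The sketch below is plain Scala with no Spark or Cassandra involved. `CounterTable` is a hypothetical in-memory stand-in for the `hashtags` counter table, and it assumes that each saved row is applied as an increment, since counter columns can only be incremented or decremented.

```scala
import scala.collection.mutable

object HashtagCounterSketch {
  val tracked: Set[String] = Set("#iphone", "#android")

  // One micro-batch: tweets -> lowercase words -> tracked hashtags -> (tag, mentions).
  def countBatch(tweets: Seq[String]): Map[String, Int] =
    tweets
      .flatMap(_.toLowerCase.split(" ").toSeq)
      .filter(tracked.contains)
      .groupBy(identity)
      .map { case (tag, hits) => (tag, hits.size) }

  // In-memory stand-in for the counter table: (hashtag, interval) -> mentions.
  final class CounterTable {
    private val cells = mutable.Map.empty[(String, String), Long]

    // Add to the stored value instead of overwriting it.
    def increment(hashtag: String, interval: String, by: Long): Unit = {
      val key = (hashtag, interval)
      cells(key) = cells.getOrElse(key, 0L) + by
    }

    def read(hashtag: String, interval: String): Long =
      cells.getOrElse((hashtag, interval), 0L)
  }
}
```

Feeding two batches through `countBatch` and incrementing the same ("#iphone", "ALL") cell after each one leaves that cell holding the sum over both batches, which is how the ALL total on the data model slide keeps growing.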

Questions?
