
Spark / Cassandra Zurich Meetup

Ale
August 18, 2014


Presentation by Hayato Shimizu


Transcript

  1. Database History

     1970's: Codd's Relational Model (1970), IBM System R (1977); 1980's: Oracle; 1990's: MySQL, PostgreSQL, Teradata; 2000's: Netezza, Hadoop, NoSQL; 2010's: Spark

     DataStax Confidential. Do not distribute without consent.
  2. Apache Cassandra™

     Apache Cassandra™ is a massively scalable, open source, NoSQL, distributed database built for modern, mission-critical online applications. Written in Java, it is a hybrid of Amazon Dynamo and Google BigTable. Masterless, with no single point of failure. Distributed and topology aware. 100% uptime, predictable scaling, low latency, high throughput.
     BigTable: http://research.google.com/archive/bigtable-osdi06.pdf
     Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
  3. Apache Spark

     Project started in 2009 at the AMPLab, UC Berkeley. A distributed large-scale data processing engine: real-time streaming, distributed processing, GraphX, MLlib. Ease of use. No storage engine of its own. 10x – 100x the speed of MapReduce.
  4. Why is Spark Fast? MapReduce conceptualised in the early 2000's.
  6. Why is Spark Fast? MapReduce conceptualised in the early 2000's; the Spark project started in the late 2000's.
  8. Ease of Use - Spark API

     map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
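A minimal sketch of a few of these operations. Plain Scala collections are used here instead of RDDs (much of the method surface is shared); `reduceByKey` does not exist on plain collections, so it is emulated with `groupBy` plus a sum:

```scala
object RddOpsDemo extends App {
  // Word-count style pipeline using collection operations that mirror
  // part of the Spark RDD API listed above: map, filter, reduceByKey.
  val words = Seq("spark", "cassandra", "spark", "scala", "spark")

  val pairs = words.map(w => (w, 1))                      // map
  val kept  = pairs.filter { case (w, _) => w.nonEmpty }  // filter

  // Plain collections lack reduceByKey; groupBy + sum is equivalent here.
  val counts = kept.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }

  println(counts)  // e.g. spark -> 3, cassandra -> 1, scala -> 1
}
```

On a real RDD the same pipeline would run partitioned across the cluster, with `reduceByKey` shuffling by key instead of grouping in local memory.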
  9. Fast

     [Chart: Logistic Regression Performance. Running time (s) vs number of iterations (1, 5, 10, 20, 30), Hadoop vs Spark. Hadoop: 110 sec per iteration; Spark: 80 sec for the first iteration, 1 sec for further iterations.]
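From the chart's numbers one can estimate total running time for n iterations (a sketch; the 110 s, 80 s, and 1 s figures are the ones quoted above):

```scala
object RuntimeEstimate extends App {
  // Estimated total running time for n iterations of logistic regression,
  // using the figures from the chart: Hadoop ~110 s per iteration; Spark
  // ~80 s for the first iteration and ~1 s for each further iteration
  // (the input stays cached in memory after the first pass).
  def hadoopSeconds(n: Int): Int = 110 * n
  def sparkSeconds(n: Int): Int  = 80 + (n - 1)

  println(s"20 iterations: Hadoop ${hadoopSeconds(20)} s, Spark ${sparkSeconds(20)} s")
  // 20 iterations: Hadoop 2200 s, Spark 99 s
}
```

The gap grows with the iteration count, which is why iterative workloads (machine learning, graph processing) benefit most from in-memory caching.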
  10. Operator Graph: Optimization and Fault Tolerance

     [Diagram: RDD lineage graph of map, join, filter, and groupBy operations over RDDs A–F, divided into Stage 1, Stage 2, and Stage 3, with cached partitions marked.]
  11. Spark Integration with Cassandra

     Data Ingestion → Spark Distributed Processing → Data Persistence → ODBC / Custom Analysis
  12. Spark Streaming - Discretized Stream Processing

     Run a streaming computation as a series of very small, deterministic batch jobs:
     •  Chop up the live stream into batches of X seconds
     •  Spark treats each batch of data as RDDs and processes them using RDD operations
     •  Finally, the processed results of the RDD operations are returned in batches
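The three steps above can be sketched without the Spark Streaming API at all. Here a plain Scala iterator stands in for the live stream and `grouped` chops it into micro-batches (in Spark Streaming each batch would be an RDD, and X would be a time window rather than a count):

```scala
object MicroBatchDemo extends App {
  // Discretized-stream sketch: chop an incoming event sequence into
  // fixed-size micro-batches and process each batch as one bulk job.
  val liveStream = Iterator.from(1).take(10)  // stand-in for a live data stream
  val batchSize  = 3                          // stand-in for "batches of X seconds"

  val processed = liveStream
    .grouped(batchSize)       // chop the stream into batches
    .map(batch => batch.sum)  // process each batch with a bulk (RDD-style) operation
    .toList

  println(processed)  // List(6, 15, 24, 10)
}
```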
  13. Spark Streaming Integration with Cassandra

     Data Stream → Spark Stream Processing → Data Persistence → Real-Time Queries
  14. Spark / MLlib

     Hadoop MapReduce is not suitable for machine learning algorithms.
     •  Collaborative Filtering: Alternating Least Squares
     •  Classification and regression
     •  Clustering: K-means
     •  Dimensionality Reduction: Singular Value Decomposition, Principal Component Analysis
     •  Optimization: Stochastic Gradient Descent, Limited-memory BFGS (Broyden-Fletcher-Goldfarb-Shanno)
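As an illustration of one algorithm from the list, a tiny single-node 1-D k-means in plain Scala (MLlib's implementation is distributed and far more robust; the initial centroids here are fixed for determinism):

```scala
object KMeansSketch extends App {
  // Minimal 1-D k-means: assign each point to its nearest centroid,
  // recompute each centroid as the mean of its cluster, repeat.
  def step(points: Seq[Double], centroids: Seq[Double]): Seq[Double] = {
    val clusters = points.groupBy(p => centroids.minBy(c => math.abs(p - c)))
    centroids.map(c => clusters.get(c).map(ps => ps.sum / ps.size).getOrElse(c))
  }

  val points = Seq(1.0, 2.0, 3.0, 10.0, 11.0, 12.0)
  var centroids: Seq[Double] = Seq(0.0, 5.0)  // fixed seeds for determinism
  for (_ <- 1 to 10) centroids = step(points, centroids)

  println(centroids)  // List(2.0, 11.0), the two cluster means
}
```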
  15. Spark Machine Learning Integration with Cassandra

     Data Inserts → Spark Machine Learning → Data Persistence. No ETL.
  16. Why Spark on Cassandra?

     •  Data model independent queries
     •  Cross-table operations (JOIN, UNION, etc.)
     •  Complex analytics (e.g. machine learning)
     •  Data transformation, aggregation, etc.
     •  Stream processing (coming soon)
  17. How to Spark on Cassandra?

     •  DataStax Cassandra Spark driver, open source: https://github.com/datastax/cassandra-driver-spark
     •  Compatible with: Spark 0.9+, Cassandra 2.0+, DataStax Enterprise 4.5+
  18. Cassandra Spark Driver

     •  Cassandra tables exposed as Spark RDDs
     •  Read from and write to Cassandra
     •  Mapping of C* tables and rows to Scala objects
     •  All Cassandra types supported and converted to Scala types
     •  Server-side data selection
     •  Virtual Nodes support
     •  Scala-only driver for now
  19. Connecting to Cassandra

     // Import Cassandra-specific functions on SparkContext and RDD objects
     import com.datastax.driver.spark._

     // Spark connection options
     val conf = new SparkConf(true)
       .setMaster("spark://192.168.123.10:7077")
       .setAppName("cassandra-demo")
       .set("cassandra.connection.host", "192.168.123.10")  // initial contact
       .set("cassandra.username", "cassandra")
       .set("cassandra.password", "cassandra")

     val sc = new SparkContext(conf)
  20. Accessing Data

     CREATE TABLE test.words (word text PRIMARY KEY, count int);

     INSERT INTO test.words (word, count) VALUES ('bar', 30);
     INSERT INTO test.words (word, count) VALUES ('foo', 20);

     Accessing the table above as an RDD:

     // Use table as RDD
     val rdd = sc.cassandraTable("test", "words")
     // rdd: CassandraRDD[CassandraRow] = CassandraRDD[0]

     rdd.toArray.foreach(println)
     // CassandraRow[word: bar, count: 30]
     // CassandraRow[word: foo, count: 20]

     rdd.columnNames  // Stream(word, count)
     rdd.size         // 2

     val firstRow = rdd.first  // firstRow: CassandraRow = CassandraRow[word: bar, count: 30]
     firstRow.getInt("count")  // Int = 30
  21. Saving Data

     val newRdd = sc.parallelize(Seq(("cat", 40), ("fox", 50)))
     // newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2]

     newRdd.saveToCassandra("test", "words", Seq("word", "count"))

     The RDD above saved to Cassandra:

     SELECT * FROM test.words;

      word | count
     ------+-------
       bar |    30
       foo |    20
       cat |    40
       fox |    50

     (4 rows)
  22. Spark SQL vs Shark

     [Diagram: Spark as the general execution engine, with Shark or Spark SQL, Streaming, ML, and Graph on top; Cassandra compatible.]
  23. Spark Integration / Shark

     •  Hive Query Language - ANSI SQL-like
     •  Joins across multiple Cassandra tables
     •  Batch queries
     •  Caching
     •  Massively faster than Hadoop/Hive queries

     CREATE TABLE CachedStocks TBLPROPERTIES ("shark.cache" = "true")
     AS SELECT * FROM PortfolioDemo.Stocks WHERE value > 95.0;

     SELECT * FROM CachedStocks;
  24. Spark Integration – What's coming?

     •  Spark 1.0
     •  SparkSQL
     •  Streaming: real-time event processing, data enrichment, Cassandra as the persistence layer
  25. Thank You

     We power the big data apps that transform business. ©2013 DataStax Confidential. Do not distribute without consent.