
Spark / Cassandra Zurich Meetup

Ale
August 18, 2014


Presentation by Hayato Shimizu


Transcript

  1. Database History

     1970's: Codd's Relational Model (1970), IBM System R (1977); 1980's: Oracle; 1990's: MySQL, PostgreSQL, Teradata; 2000's: Netezza, Hadoop, NoSQL; 2010's: Spark

     DataStax Confidential. Do not distribute without consent.
  2. Apache Cassandra™

     Apache Cassandra™ is a massively scalable, open source, NoSQL, distributed database built for modern, mission-critical online applications. Written in Java, it is a hybrid of Amazon Dynamo and Google BigTable. Masterless, with no single point of failure. Distributed and topology aware. 100% uptime, predictable scaling, low latency, high throughput.
     BigTable: http://research.google.com/archive/bigtable-osdi06.pdf
     Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
  3. Apache Spark

     Project started in 2009 at the AMPLab, UC Berkeley. A distributed large-scale data processing engine: real-time streaming, distributed processing, GraphX, MLlib. Ease of use. No storage engine of its own. 10x – 100x the speed of MapReduce.
  4. Why is Spark Fast? MapReduce conceptualised in the early 2000's.
  6. Why is Spark Fast? MapReduce conceptualised in the early 2000's; the Spark project started in the late 2000's.
  8. Ease of Use - Spark API

     map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
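A minimal sketch of a few of these operations. Plain Scala collections are used here instead of RDDs (much of the method surface is shared); `reduceByKey` does not exist on plain collections, so it is emulated with `groupBy` plus a sum:

```scala
object RddOpsDemo extends App {
  // Word-count style pipeline using collection operations that mirror
  // part of the Spark RDD API listed above: map, filter, reduceByKey.
  val words = Seq("spark", "cassandra", "spark", "scala", "spark")

  val pairs = words.map(w => (w, 1))                      // map
  val kept  = pairs.filter { case (w, _) => w.nonEmpty }  // filter

  // Plain collections lack reduceByKey; groupBy + sum is equivalent here.
  val counts = kept.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }

  println(counts)  // e.g. spark -> 3, cassandra -> 1, scala -> 1
}
```

On a real RDD the same pipeline would run partitioned across the cluster, with `reduceByKey` shuffling by key instead of grouping in local memory.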
  9. Fast

     [Chart: Logistic Regression Performance. Running time (s) vs number of iterations (1, 5, 10, 20, 30), Hadoop vs Spark. Hadoop: 110 sec per iteration; Spark: 80 sec for the first iteration, 1 sec for further iterations.]
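From the chart's numbers one can estimate total running time for n iterations (a sketch; the 110 s, 80 s, and 1 s figures are the ones quoted above):

```scala
object RuntimeEstimate extends App {
  // Estimated total running time for n iterations of logistic regression,
  // using the figures from the chart: Hadoop ~110 s per iteration; Spark
  // ~80 s for the first iteration and ~1 s for each further iteration
  // (the input stays cached in memory after the first pass).
  def hadoopSeconds(n: Int): Int = 110 * n
  def sparkSeconds(n: Int): Int  = 80 + (n - 1)

  println(s"20 iterations: Hadoop ${hadoopSeconds(20)} s, Spark ${sparkSeconds(20)} s")
  // 20 iterations: Hadoop 2200 s, Spark 99 s
}
```

The gap grows with the iteration count, which is why iterative workloads (machine learning, graph processing) benefit most from in-memory caching.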
  10. Operator Graph: Optimization and Fault Tolerance

     [Diagram: RDD lineage graph of map, join, filter, and groupBy operations over RDDs A–F, divided into Stage 1, Stage 2, and Stage 3, with cached partitions marked.]
  11. Spark Integration with Cassandra

     Data Ingestion → Spark Distributed Processing → Data Persistence → ODBC / Custom Analysis
  12. Spark Streaming - Discretized Stream Processing

     Run a streaming computation as a series of very small, deterministic batch jobs:
     •  Chop up the live stream into batches of X seconds
     •  Spark treats each batch of data as RDDs and processes them using RDD operations
     •  Finally, the processed results of the RDD operations are returned in batches
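The three steps above can be sketched without the Spark Streaming API at all. Here a plain Scala iterator stands in for the live stream and `grouped` chops it into micro-batches (in Spark Streaming each batch would be an RDD, and X would be a time window rather than a count):

```scala
object MicroBatchDemo extends App {
  // Discretized-stream sketch: chop an incoming event sequence into
  // fixed-size micro-batches and process each batch as one bulk job.
  val liveStream = Iterator.from(1).take(10)  // stand-in for a live data stream
  val batchSize  = 3                          // stand-in for "batches of X seconds"

  val processed = liveStream
    .grouped(batchSize)       // chop the stream into batches
    .map(batch => batch.sum)  // process each batch with a bulk (RDD-style) operation
    .toList

  println(processed)  // List(6, 15, 24, 10)
}
```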
  13. Spark Streaming Integration with Cassandra

     Data Stream → Spark Stream Processing → Data Persistence → Real-Time Queries
  14. Spark / MLlib

     Hadoop MapReduce is not suitable for machine learning algorithms.
     •  Collaborative Filtering: Alternating Least Squares
     •  Classification and regression
     •  Clustering: K-means
     •  Dimensionality Reduction: Singular Value Decomposition, Principal Component Analysis
     •  Optimization: Stochastic Gradient Descent, Limited-memory BFGS (Broyden-Fletcher-Goldfarb-Shanno)
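As an illustration of one algorithm from the list, a tiny single-node 1-D k-means in plain Scala (MLlib's implementation is distributed and far more robust; the initial centroids here are fixed for determinism):

```scala
object KMeansSketch extends App {
  // Minimal 1-D k-means: assign each point to its nearest centroid,
  // recompute each centroid as the mean of its cluster, repeat.
  def step(points: Seq[Double], centroids: Seq[Double]): Seq[Double] = {
    val clusters = points.groupBy(p => centroids.minBy(c => math.abs(p - c)))
    centroids.map(c => clusters.get(c).map(ps => ps.sum / ps.size).getOrElse(c))
  }

  val points = Seq(1.0, 2.0, 3.0, 10.0, 11.0, 12.0)
  var centroids: Seq[Double] = Seq(0.0, 5.0)  // fixed seeds for determinism
  for (_ <- 1 to 10) centroids = step(points, centroids)

  println(centroids)  // List(2.0, 11.0), the two cluster means
}
```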
  15. Spark Machine Learning Integration with Cassandra

     Data Inserts → Spark Machine Learning → Data Persistence. No ETL.
  16. Why Spark on Cassandra?

     •  Data model independent queries
     •  Cross-table operations (JOIN, UNION, etc.)
     •  Complex analytics (e.g. machine learning)
     •  Data transformation, aggregation, etc.
     •  Stream processing (coming soon)
  17. How to Spark on Cassandra?

     •  DataStax Cassandra Spark driver, open source: https://github.com/datastax/cassandra-driver-spark
     •  Compatible with: Spark 0.9+, Cassandra 2.0+, DataStax Enterprise 4.5+
  18. Cassandra Spark Driver

     •  Cassandra tables exposed as Spark RDDs
     •  Read from and write to Cassandra
     •  Mapping of C* tables and rows to Scala objects
     •  All Cassandra types supported and converted to Scala types
     •  Server-side data selection
     •  Virtual Nodes support
     •  Scala-only driver for now
  19. Connecting to Cassandra

     // Import Cassandra-specific functions on SparkContext and RDD objects
     import com.datastax.driver.spark._

     // Spark connection options
     val conf = new SparkConf(true)
       .setMaster("spark://192.168.123.10:7077")
       .setAppName("cassandra-demo")
       .set("cassandra.connection.host", "192.168.123.10")  // initial contact
       .set("cassandra.username", "cassandra")
       .set("cassandra.password", "cassandra")

     val sc = new SparkContext(conf)
  20. Accessing Data

     CREATE TABLE test.words (word text PRIMARY KEY, count int);

     INSERT INTO test.words (word, count) VALUES ('bar', 30);
     INSERT INTO test.words (word, count) VALUES ('foo', 20);

     Accessing the table above as an RDD:

     // Use table as RDD
     val rdd = sc.cassandraTable("test", "words")
     // rdd: CassandraRDD[CassandraRow] = CassandraRDD[0]

     rdd.toArray.foreach(println)
     // CassandraRow[word: bar, count: 30]
     // CassandraRow[word: foo, count: 20]

     rdd.columnNames  // Stream(word, count)
     rdd.size         // 2

     val firstRow = rdd.first  // firstRow: CassandraRow = CassandraRow[word: bar, count: 30]
     firstRow.getInt("count")  // Int = 30
  21. Saving Data

     val newRdd = sc.parallelize(Seq(("cat", 40), ("fox", 50)))
     // newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2]

     newRdd.saveToCassandra("test", "words", Seq("word", "count"))

     The RDD above saved to Cassandra:

     SELECT * FROM test.words;

      word | count
     ------+-------
       bar |    30
       foo |    20
       cat |    40
       fox |    50

     (4 rows)
  22. Spark SQL vs Shark

     [Diagram: Spark as the general execution engine, with Shark or Spark SQL, Streaming, ML, and Graph on top; Cassandra compatible.]
  23. Spark Integration / Shark

     •  Hive Query Language - ANSI SQL-like
     •  Joins across multiple Cassandra tables
     •  Batch queries
     •  Caching
     •  Massively faster than Hadoop/Hive queries

     CREATE TABLE CachedStocks TBLPROPERTIES ("shark.cache" = "true")
     AS SELECT * FROM PortfolioDemo.Stocks WHERE value > 95.0;

     SELECT * FROM CachedStocks;
  24. Spark Integration – What's coming?

     •  Spark 1.0
     •  SparkSQL
     •  Streaming: real-time event processing, data enrichment, Cassandra as the persistence layer
  25. Thank You

     We power the big data apps that transform business. ©2013 DataStax Confidential. Do not distribute without consent.