Cassandra and Spark for the Internet of Things

Josep Casals

January 29, 2015

Transcript

  1. Cassandra and Spark for the Internet of Things
     Josep Casals, Lead Data Engineer - British Gas Connected Homes
  2. My Energy
     - 3.8M - monthly
     - 400K - daily
     - 200K - half-hourly
     - 5K - 10 seconds
  3. Fridges - they are exciting… really
     30-minute data -> Filtering -> Time Series Magic -> Clustering -> Markov Model -> Fridge Consumption
  4. Linear Scalability
     - C* is designed around Partition tolerance and Availability (the A and P of CAP).
     - Consistency is tunable.
     - Lightweight transactions with Paxos allow a CP mode.
     To achieve linear scalability, consistency is enforced only eventually.
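     A minimal sketch of what tunable consistency and a Paxos lightweight transaction look like in practice, assuming the Spark-Cassandra connector's consistency-level properties; the demo.users table is illustrative, not from the deck:

     import com.datastax.spark.connector.cql.CassandraConnector
     import org.apache.spark.{SparkConf, SparkContext}

     val conf = new SparkConf()
       .setAppName("ConsistencyDemo")
       .set("spark.cassandra.connection.host", "127.0.0.1")
       .set("spark.cassandra.input.consistency.level", "LOCAL_ONE")     // relaxed reads (AP-leaning)
       .set("spark.cassandra.output.consistency.level", "LOCAL_QUORUM") // stricter writes
     val sc = new SparkContext(conf)

     CassandraConnector(conf).withSessionDo { session =>
       // Lightweight transaction: Paxos ensures the row is inserted at most once (CP mode)
       session.execute("INSERT INTO demo.users (id, name) VALUES (42, 'jane') IF NOT EXISTS")
     }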
  5. Very high availability
     - Implements Amazon's Dynamo partitioning and replication model
     - All nodes can be used to talk to the database
     - Replication factor and consistency level determine how many nodes we can lose
     Cassandra introduces no single point of failure.
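     A worked example of that trade-off, in the same client.execute style as the code slides further down; the keyspace name is illustrative:

     // With replication_factor = 3:
     //   CL = QUORUM needs 2 of 3 replicas -> the cluster tolerates losing 1 node
     //   CL = ONE needs 1 of 3 replicas    -> it tolerates losing 2 nodes
     client.execute(
       "CREATE KEYSPACE IF NOT EXISTS demo WITH REPLICATION = " +
       "{'class': 'SimpleStrategy', 'replication_factor': 3};")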
  6. Good match for Time Series data
     - Data model from Google Bigtable
     - Wide rows allow for sequential disk reads when time is selected as the clustering key
     Data is stored sequentially on disk.
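     A hypothetical time-series table to make the clustering-key point concrete (table and column names are mine, not the deck's actual schema):

     client.execute(
       "CREATE TABLE IF NOT EXISTS demo.readings (" +
       "  meter_id text, " +   // partition key part: one wide row per meter per day
       "  day text, " +
       "  ts timestamp, " +    // clustering key: cells ordered by time within the partition
       "  value double, " +
       "  PRIMARY KEY ((meter_id, day), ts)" +
       ") WITH CLUSTERING ORDER BY (ts ASC);")
     // A slice such as WHERE meter_id = ? AND day = ? AND ts >= ? AND ts < ?
     // becomes a sequential read inside a single partition.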
  7. Very complete API
     Transformations: map(func), filter(func), flatMap(func), mapPartitions(func), mapPartitionsWithIndex(func), sample(withReplacement, fraction, seed), union(otherDataset), intersection(otherDataset), distinct([numTasks]), groupByKey([numTasks]), reduceByKey(func, [numTasks]), aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]), sortByKey([ascending], [numTasks]), join(otherDataset, [numTasks]), cogroup(otherDataset, [numTasks]), cartesian(otherDataset), pipe(command, [envVars]), coalesce(numPartitions), repartition(numPartitions), repartitionAndSortWithinPartitions(partitioner)
     Actions: reduce(func), collect(), count(), first(), take(n), takeSample(withReplacement, num, [seed]), takeOrdered(n, [ordering]), saveAsTextFile(path), saveAsSequenceFile(path), saveAsObjectFile(path), countByKey(), foreach(func)
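     A small illustration (not on the slide) chaining a few of these transformations and one action on a plain RDD:

     val words = sc.parallelize(Seq("spark", "cassandra", "spark", "iot"))
     val counts = words
       .map(w => (w, 1))      // transformation
       .reduceByKey(_ + _)    // transformation
       .filter(_._2 > 1)      // transformation
     counts.collect()         // action: Array((spark,2))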
  8. Use cases
     • Data storage - Spark Streaming from a queue
     • Data processing - Transformations and joins
     • Data analytics - Data science productionising
  9. Data Streaming - end 2015:
     • Hive Home -> 200k users, ~15,000 messages/s
     • Connected boilers -> 25k users, ~2,500 messages/s
     • Live Energy -> 50k users, ~8,500 messages/s
  10. (1) Spark Streaming - Storing to C*

     import org.apache.spark.streaming.{Seconds, StreamingContext}
     import org.apache.spark.SparkConf

     object StreamingDemo {
       def main(args: Array[String]) {
         // Connect to the Cassandra cluster
         val client: CassandraConnector = new CassandraConnector()
         val cassandraNode = if (args.length > 0) args(0) else "127.0.0.1"
         client.connect(cassandraNode)
         client.execute("CREATE KEYSPACE IF NOT EXISTS streaming_test WITH REPLICATION = " +
           "{'class': 'SimpleStrategy', 'replication_factor': 1};")
         client.execute("CREATE TABLE IF NOT EXISTS streaming_test.words (word text PRIMARY KEY, count int);")

         // Create a StreamingContext with a SparkConf configuration
         val sparkConf = new SparkConf()
           .setAppName("StreamingDemo")
         val ssc = new StreamingContext(sparkConf, Seconds(5))

         // Create a DStream that will connect to serverIP:serverPort
         val lines = ssc.socketTextStream("localhost", 9999)

         client.execute("USE streaming_test;")
         lines.foreachRDD(rdd => {
           rdd.collect().foreach(line => {
             if (line.length > 1) {
               client.execute("INSERT INTO words (word, count) VALUES ('" + line + "', 1);")
             }
             println("Line from RDD: " + line)
           })
         })
         ssc.start()
         ssc.awaitTermination()
       }
     }

     Slide annotations: use the Spark-C* connector; replace the socket source by a RabbitMQ or Kafka stream.
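     A sketch (mine, not the deck's) of those two annotations combined: reading from Kafka instead of a socket and writing through the connector's saveToCassandra instead of hand-built INSERT statements. Broker, topic, keyspace and table names are illustrative; the Kafka API shown is the 0.8-era direct stream that matched Spark 1.x at the time of the talk.

     import kafka.serializer.StringDecoder
     import org.apache.spark.SparkConf
     import org.apache.spark.streaming.{Seconds, StreamingContext}
     import org.apache.spark.streaming.kafka.KafkaUtils
     import com.datastax.spark.connector.SomeColumns
     import com.datastax.spark.connector.streaming._   // adds saveToCassandra to DStreams

     object KafkaStreamingDemo {
       def main(args: Array[String]): Unit = {
         val conf = new SparkConf()
           .setAppName("KafkaStreamingDemo")
           .set("spark.cassandra.connection.host", "127.0.0.1")
         val ssc = new StreamingContext(conf, Seconds(5))

         val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
         val lines = KafkaUtils
           .createDirectStream[String, String, StringDecoder, StringDecoder](
             ssc, kafkaParams, Set("words"))
           .map(_._2)   // keep only the message value

         // One (word, count) row per message, written straight to Cassandra
         lines.filter(_.length > 1)
              .map(word => (word, 1))
              .saveToCassandra("streaming_test", "words", SomeColumns("word", "count"))

         ssc.start()
         ssc.awaitTermination()
       }
     }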
  11. Data Processing - energy readings from British Gas head ends:
     • 30M rows monthly, ~1M rows daily
     • 48 columns per row (one reading every half hour)
     • 4M rows in the contracts table
     • Need to join both tables based on contract date
     • Need to convert columns to rows for data science
     Difficult to handle on a relational DB.
  12. (2) Transformations and Joins - join + transform

     object TransformReadings {
       def main(args: Array[String]) {
         val cassandraHost = if (args.length > 0) args(0) else "127.0.0.1"
         val conf = new SparkConf(true).setAppName("TransformReadings")
           .set("spark.cassandra.connection.host", cassandraHost)
           .setMaster("local[2]")

         val sc = new SparkContext(conf)
         val serIngest = sc.cassandraTable("ser_ingest", "profile_reads")
           .groupBy(row => row.get[String]("mpxn"))

         val contracts = sc.cassandraTable("ser_ingest", "contract")
           .select("mpxn", "move_in_date", "business_partner", "contract_account", "premise", "postcode_gis")
           .groupBy(row => row.get[String]("mpxn").takeRight(10))

         serIngest.join(contracts)
           .flatMap(transformRow)
           .saveToCassandra("ser",
             "half_hourly_reads",
             SomeColumns("mxpn",
               "month",
               "energy_type",
               "readdate",
               "reading",
               "business_partner_id",
               "contract_id",
               "premise_id",
               "postcode"))
       }

       def transformRow(data: (String, (Iterable[CassandraRow], Iterable[CassandraRow]))) = {…}
     }
  13. (2) Transformations and Joins - transformRow

     def transformRow(data: (String, (Iterable[CassandraRow], Iterable[CassandraRow]))) = {
       // Create the sequence of half-hourly column names
       val halfHourlyColumnNames = (0 to 47).map(i => f"t${i/2}%02d${30*(i%2)}%02d")
       val row = data._2._1.head
       val contracts = data._2._2
       val readDate = row.get[Date]("reading_date")
       // Get the right contract: the latest one whose move-in date precedes the reading date
       val current_contract = contracts
         .filter(c => c.get[Date]("move_in_date").before(readDate))
         .toSeq
         .sortBy(_.get[Date]("move_in_date"))
         .reverse
         .head

       // Create a row for each half-hourly column
       halfHourlyColumnNames.map(col => (
         row.get[String]("mpxn"),
         new SimpleDateFormat("yyyy-MM").format(readDate),
         "elec",
         new SimpleDateFormat("yyyy-MM-dd hhmm").parse(
           new SimpleDateFormat("yyyy-MM-dd ").format(readDate) + col.drop(1)),
         row.get[Option[Double]](col).getOrElse(0.0),
         current_contract.get[String]("business_partner"),
         current_contract.get[String]("contract_account"),
         current_contract.get[String]("premise"),
         current_contract.get[String]("postcode_gis")
       ))
     }
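     For reference (my note, not on the slide), the column-name expression above expands to the 48 half-hourly column names t0000, t0030, t0100, …, t2330:

     (0 to 47).map(i => f"t${i/2}%02d${30*(i%2)}%02d")
     // Vector(t0000, t0030, t0100, t0130, ..., t2300, t2330)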
  14. Data science productionising
     • Data scientists write algorithms in R and C#
     • Fridge energy calculation algorithm -> 1000 LOC in C#
     • Spark allows us to reduce from 1000 to 400 LOC
     • We translate R and C# to Java / Scala
     • It would be great if Scala were the language of choice for data scientists
  15. (3) Data science productionising - Spark allows us to chain the different algorithms

     def main(args: Array[String]) = {
       val sparkConf = new SparkConf().{…}

       val (month, keyspace, interimSave) = parseParams(args, config)

       val sc = new SparkContext(sparkConf)

       val readings = sc.cassandraTable("ser", "profile_reads")
         .filter({ r: CassandraRow => r.getString("month") == month })

       val grouped = readings.groupBy({ r: CassandraRow => r.getLong("mxpn") })

       def buildStructurePartial = buildStructure _

       val initial = grouped.map(buildStructurePartial.tupled).flatMap(m => m)

       val baseload = initial.map({ x: MonthlyBreakdown => {
         x.markProcessed(getBaseload(x), "appliances", "baseload")
         x }})

       val fridged = baseload.map({ x: MonthlyBreakdown => {
         x.markProcessed(getFridge(x), "appliances", "fridge")
         x }})

       if (interimSave) {
         val fridgeresults = fridged.map({ x: MonthlyBreakdown => x.interimStage("fridge") })
           .flatMap(identity).flatMap(identity)
         fridgeresults.saveToCassandra(keyspace, "interim_breakdown",
           Seq("mxpn", "stage", "month", "readdate", "breakdown"))
       }

       val results = fridged.map({ x: MonthlyBreakdown => x.rollupResults() }).flatMap(identity)

       results.saveToCassandra(keyspace, "energy_breakdown",
         Seq("business_partner_id", "premise_id", "mpxn", "customer_type",
             "start_date", "end_date", "group", "energy_type", …))   // column list truncated in the transcript
     }