Cassandra and Spark for the Internet of Things

Josep Casals

January 29, 2015

Transcript

  1. Cassandra and Spark for the Internet of Things
     Josep Casals, Lead Data Engineer - British Gas Connected Homes
  2. My Energy
     - 3.8M - monthly
     - 400K - daily
     - 200K - half-hourly
     - 5K - 10 seconds
  3. Fridges - they are exciting… really
     30-minute data -> Filtering -> Time Series Magic -> Clustering -> Markov Model -> Fridge Consumption
  4. Linear Scalability
     - C* is designed around Partition tolerance and Availability (the A and P of CAP).
     - Consistency is tunable.
     - Lightweight transactions with Paxos allow a CP mode.
     To achieve linear scalability, consistency is enforced only eventually.
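     A minimal sketch of what tunable consistency and a Paxos lightweight transaction look like in practice, assuming the Spark-Cassandra connector's consistency-level properties; the demo.users table is illustrative, not from the deck:

     import com.datastax.spark.connector.cql.CassandraConnector
     import org.apache.spark.{SparkConf, SparkContext}

     val conf = new SparkConf()
       .setAppName("ConsistencyDemo")
       .set("spark.cassandra.connection.host", "127.0.0.1")
       .set("spark.cassandra.input.consistency.level", "LOCAL_ONE")     // relaxed reads (AP-leaning)
       .set("spark.cassandra.output.consistency.level", "LOCAL_QUORUM") // stricter writes
     val sc = new SparkContext(conf)

     CassandraConnector(conf).withSessionDo { session =>
       // Lightweight transaction: Paxos ensures the row is inserted at most once (CP mode)
       session.execute("INSERT INTO demo.users (id, name) VALUES (42, 'jane') IF NOT EXISTS")
     }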
  5. Very high availability
     - Implements Amazon's Dynamo partitioning and replication model
     - All nodes can be used to talk to the database
     - Replication factor and consistency level determine how many nodes we can lose
     Cassandra introduces no single point of failure.
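     A worked example of that trade-off, in the same client.execute style as the code slides further down; the keyspace name is illustrative:

     // With replication_factor = 3:
     //   CL = QUORUM needs 2 of 3 replicas -> the cluster tolerates losing 1 node
     //   CL = ONE needs 1 of 3 replicas    -> it tolerates losing 2 nodes
     client.execute(
       "CREATE KEYSPACE IF NOT EXISTS demo WITH REPLICATION = " +
       "{'class': 'SimpleStrategy', 'replication_factor': 3};")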
  6. Good match for Time Series data
     - Data model from Google Bigtable
     - Wide rows allow for sequential disk reads when time is selected as the clustering key
     Data is stored sequentially on disk.
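     A hypothetical time-series table to make the clustering-key point concrete (table and column names are mine, not the deck's actual schema):

     client.execute(
       "CREATE TABLE IF NOT EXISTS demo.readings (" +
       "  meter_id text, " +   // partition key part: one wide row per meter per day
       "  day text, " +
       "  ts timestamp, " +    // clustering key: cells ordered by time within the partition
       "  value double, " +
       "  PRIMARY KEY ((meter_id, day), ts)" +
       ") WITH CLUSTERING ORDER BY (ts ASC);")
     // A slice such as WHERE meter_id = ? AND day = ? AND ts >= ? AND ts < ?
     // becomes a sequential read inside a single partition.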
  7. Very complete API
     Transformations: map(func), filter(func), flatMap(func), mapPartitions(func), mapPartitionsWithIndex(func), sample(withReplacement, fraction, seed), union(otherDataset), intersection(otherDataset), distinct([numTasks]), groupByKey([numTasks]), reduceByKey(func, [numTasks]), aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]), sortByKey([ascending], [numTasks]), join(otherDataset, [numTasks]), cogroup(otherDataset, [numTasks]), cartesian(otherDataset), pipe(command, [envVars]), coalesce(numPartitions), repartition(numPartitions), repartitionAndSortWithinPartitions(partitioner)
     Actions: reduce(func), collect(), count(), first(), take(n), takeSample(withReplacement, num, [seed]), takeOrdered(n, [ordering]), saveAsTextFile(path), saveAsSequenceFile(path), saveAsObjectFile(path), countByKey(), foreach(func)
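     A small illustration (not on the slide) chaining a few of these transformations and one action on a plain RDD:

     val words = sc.parallelize(Seq("spark", "cassandra", "spark", "iot"))
     val counts = words
       .map(w => (w, 1))      // transformation
       .reduceByKey(_ + _)    // transformation
       .filter(_._2 > 1)      // transformation
     counts.collect()         // action: Array((spark,2))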
  8. Use cases
     • Data storage - Spark Streaming from a queue
     • Data processing - Transformations and joins
     • Data analytics - Data science productionising
  9. Data Streaming - end 2015:
     • Hive Home -> 200k users, ~15,000 messages/s
     • Connected boilers -> 25k users, ~2,500 messages/s
     • Live Energy -> 50k users, ~8,500 messages/s
  10. (1) Spark Streaming - Storing to C*

     import org.apache.spark.streaming.{Seconds, StreamingContext}
     import org.apache.spark.SparkConf

     object StreamingDemo {
       def main(args: Array[String]) {
         // Connect to the Cassandra cluster
         val client: CassandraConnector = new CassandraConnector()
         val cassandraNode = if (args.length > 0) args(0) else "127.0.0.1"
         client.connect(cassandraNode)
         client.execute("CREATE KEYSPACE IF NOT EXISTS streaming_test WITH REPLICATION = " +
           "{'class': 'SimpleStrategy', 'replication_factor': 1};")
         client.execute("CREATE TABLE IF NOT EXISTS streaming_test.words (word text PRIMARY KEY, count int);")

         // Create a StreamingContext with a SparkConf configuration
         val sparkConf = new SparkConf()
           .setAppName("StreamingDemo")
         val ssc = new StreamingContext(sparkConf, Seconds(5))

         // Create a DStream that will connect to serverIP:serverPort
         val lines = ssc.socketTextStream("localhost", 9999)

         client.execute("USE streaming_test;")
         lines.foreachRDD(rdd => {
           rdd.collect().foreach(line => {
             if (line.length > 1) {
               client.execute("INSERT INTO words (word, count) VALUES ('" + line + "', 1);")
             }
             println("Line from RDD: " + line)
           })
         })
         ssc.start()
         ssc.awaitTermination()
       }
     }

     Slide annotations: use the Spark-C* connector; replace the socket source by a RabbitMQ or Kafka stream.
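     A sketch (mine, not the deck's) of those two annotations combined: reading from Kafka instead of a socket and writing through the connector's saveToCassandra instead of hand-built INSERT statements. Broker, topic, keyspace and table names are illustrative; the Kafka API shown is the 0.8-era direct stream that matched Spark 1.x at the time of the talk.

     import kafka.serializer.StringDecoder
     import org.apache.spark.SparkConf
     import org.apache.spark.streaming.{Seconds, StreamingContext}
     import org.apache.spark.streaming.kafka.KafkaUtils
     import com.datastax.spark.connector.SomeColumns
     import com.datastax.spark.connector.streaming._   // adds saveToCassandra to DStreams

     object KafkaStreamingDemo {
       def main(args: Array[String]): Unit = {
         val conf = new SparkConf()
           .setAppName("KafkaStreamingDemo")
           .set("spark.cassandra.connection.host", "127.0.0.1")
         val ssc = new StreamingContext(conf, Seconds(5))

         val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
         val lines = KafkaUtils
           .createDirectStream[String, String, StringDecoder, StringDecoder](
             ssc, kafkaParams, Set("words"))
           .map(_._2)   // keep only the message value

         // One (word, count) row per message, written straight to Cassandra
         lines.filter(_.length > 1)
              .map(word => (word, 1))
              .saveToCassandra("streaming_test", "words", SomeColumns("word", "count"))

         ssc.start()
         ssc.awaitTermination()
       }
     }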
  11. Data Processing - energy readings from British Gas head ends:
     • 30M rows monthly, ~1M rows daily
     • 48 columns per row (one reading every half hour)
     • 4M rows in the contracts table
     • Need to join both tables based on contract date
     • Need to convert columns to rows for data science
     Difficult to handle on a relational DB.
  12. (2) Transformations and Joins - join + transform

     object TransformReadings {
       def main(args: Array[String]) {
         val cassandraHost = if (args.length > 0) args(0) else "127.0.0.1"
         val conf = new SparkConf(true).setAppName("TransformReadings")
           .set("spark.cassandra.connection.host", cassandraHost)
           .setMaster("local[2]")

         val sc = new SparkContext(conf)
         val serIngest = sc.cassandraTable("ser_ingest", "profile_reads")
           .groupBy(row => row.get[String]("mpxn"))

         val contracts = sc.cassandraTable("ser_ingest", "contract")
           .select("mpxn", "move_in_date", "business_partner", "contract_account", "premise", "postcode_gis")
           .groupBy(row => row.get[String]("mpxn").takeRight(10))

         serIngest.join(contracts)
           .flatMap(transformRow)
           .saveToCassandra("ser",
             "half_hourly_reads",
             SomeColumns("mxpn",
               "month",
               "energy_type",
               "readdate",
               "reading",
               "business_partner_id",
               "contract_id",
               "premise_id",
               "postcode"))
       }

       def transformRow(data: (String, (Iterable[CassandraRow], Iterable[CassandraRow]))) = {…}
     }
  13. (2) Transformations and Joins - transformRow

     def transformRow(data: (String, (Iterable[CassandraRow], Iterable[CassandraRow]))) = {
       // Create the sequence of half-hourly column names
       val halfHourlyColumnNames = (0 to 47).map(i => f"t${i/2}%02d${30*(i%2)}%02d")
       val row = data._2._1.head
       val contracts = data._2._2
       val readDate = row.get[Date]("reading_date")
       // Get the right contract: the latest one whose move-in date precedes the reading date
       val current_contract = contracts
         .filter(c => c.get[Date]("move_in_date").before(readDate))
         .toSeq
         .sortBy(_.get[Date]("move_in_date"))
         .reverse
         .head

       // Create a row for each half-hourly column
       halfHourlyColumnNames.map(col => (
         row.get[String]("mpxn"),
         new SimpleDateFormat("yyyy-MM").format(readDate),
         "elec",
         new SimpleDateFormat("yyyy-MM-dd hhmm").parse(
           new SimpleDateFormat("yyyy-MM-dd ").format(readDate) + col.drop(1)),
         row.get[Option[Double]](col).getOrElse(0.0),
         current_contract.get[String]("business_partner"),
         current_contract.get[String]("contract_account"),
         current_contract.get[String]("premise"),
         current_contract.get[String]("postcode_gis")
       ))
     }
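     For reference (my note, not on the slide), the column-name expression above expands to the 48 half-hourly column names t0000, t0030, t0100, …, t2330:

     (0 to 47).map(i => f"t${i/2}%02d${30*(i%2)}%02d")
     // Vector(t0000, t0030, t0100, t0130, ..., t2300, t2330)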
  14. Data science productionising
     • Data scientists write algorithms in R and C#
     • Fridge energy calculation algorithm -> 1000 LOC in C#
     • Spark allows us to reduce from 1000 to 400 LOC
     • We translate R and C# to Java / Scala
     • It would be great if Scala were the language of choice for data scientists
  15. (3) Data science productionising - Spark allows us to chain the different algorithms

     def main(args: Array[String]) = {
       val sparkConf = new SparkConf().{…}

       val (month, keyspace, interimSave) = parseParams(args, config)

       val sc = new SparkContext(sparkConf)

       val readings = sc.cassandraTable("ser", "profile_reads")
         .filter({ r: CassandraRow => r.getString("month") == month })

       val grouped = readings.groupBy({ r: CassandraRow => r.getLong("mxpn") })

       def buildStructurePartial = buildStructure _

       val initial = grouped.map(buildStructurePartial.tupled).flatMap(m => m)

       val baseload = initial.map({ x: MonthlyBreakdown => {
         x.markProcessed(getBaseload(x), "appliances", "baseload")
         x }})

       val fridged = baseload.map({ x: MonthlyBreakdown => {
         x.markProcessed(getFridge(x), "appliances", "fridge")
         x }})

       if (interimSave) {
         val fridgeresults = fridged.map({ x: MonthlyBreakdown => x.interimStage("fridge") })
           .flatMap(identity).flatMap(identity)
         fridgeresults.saveToCassandra(keyspace, "interim_breakdown",
           Seq("mxpn", "stage", "month", "readdate", "breakdown"))
       }

       val results = fridged.map({ x: MonthlyBreakdown => x.rollupResults() }).flatMap(identity)

       results.saveToCassandra(keyspace, "energy_breakdown",
         Seq("business_partner_id", "premise_id", "mpxn", "customer_type",
             "start_date", "end_date", "group", "energy_type", …))   // column list truncated in the transcript
     }