Spark Streaming Basics and Scalable Time-Series Data Processing

Slide 1

Slide 1 text

Slide 2

Slide 2 text

Slide 3

Slide 3 text

© 2015 MapR Technologies

Why Spark Streaming?
• Time-series data processing:
  – Get results in real time
• Use cases
  – Social network trends
  – Website statistics and monitoring
  – Fraud detection
  – Ad click billing
• Time-stamped data (data for real-time monitoring)
  – Sensors, system metrics, events, log files
  – Stock tickers, user activity
  – High volume, high velocity

Slide 4

Slide 4 text

What is time-series data?
• Data with timestamps
  – Sensor data
  – Log files
  – Phones
• Examples: credit card transactions, social media, log files, geodata, web user behavior history, sensors

Slide 5

Slide 5 text

Why Apache Spark Streaming?
• What if?
  – You want to analyze data as it arrives
• Examples of time-series data: sensors, clicks, logs, stats

Slide 6

Slide 6 text

Batch processing
    It's 6:01 and 72 degrees
    It's 6:02 and 75 degrees
    It's 6:03 and 77 degrees
    It's 6:04 and 85 degrees
    It's 6:05 and 90 degrees
    It's 6:06 and 85 degrees
    It's 6:07 and 77 degrees
    It's 6:08 and 75 degrees
    Result of the batch job: "It was hot at 6:05 yesterday!"
• Batch processing may be too late for some events

Slide 7

Slide 7 text

Event processing
    It's 6:05 and 90 degrees
    Result of streaming: "Someone should open a window!"
• It is becoming important to process events as they arrive

Slide 8

Slide 8 text

Slide 9

Slide 9 text

Stream processing architecture
[Diagram: Streaming sources/apps → Data ingest → Stream processing → Data storage → Apps. Component labels: Topics, MapR-FS, MapR-DB, HDFS, HBase]

Slide 10

Slide 10 text

Slide 11

Slide 11 text

Spark Streaming architecture
• The data stream is split into batches of X seconds (the batch interval)
  – This is called a DStream: a continuous sequence of RDDs
[Diagram: input data stream → Spark Streaming → DStream of RDD batches. Data from time 0 to 1 becomes RDD @ time 1, data from time 1 to 2 becomes RDD @ time 2, data from time 2 to 3 becomes RDD @ time 3]
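The batching idea on this slide can be sketched in plain Scala without Spark. This is only an illustration under stated assumptions: the `Event` type, the `toBatches` helper, and the sample data are hypothetical, and the sketch groups by each event's own timestamp for simplicity, whereas Spark Streaming forms batches by arrival time within each batch interval.

```scala
// Hypothetical event type: a timestamp in seconds plus a payload.
final case class Event(timestampSec: Double, payload: String)

object MicroBatchSketch {
  // Assign each event to the batch whose interval contains its timestamp.
  // Batch n covers the half-open range [n * interval, (n + 1) * interval).
  def toBatches(events: Seq[Event], intervalSec: Double): Map[Int, Seq[Event]] =
    events.groupBy(e => math.floor(e.timestampSec / intervalSec).toInt)

  def main(args: Array[String]): Unit = {
    // Made-up sample stream.
    val events = Seq(Event(0.2, "a"), Event(0.9, "b"), Event(1.5, "c"), Event(2.1, "d"))
    val batches = toBatches(events, intervalSec = 1.0)
    // Each group plays the role of one RDD in the DStream.
    batches.toSeq.sortBy(_._1).foreach { case (n, es) =>
      println(s"batch for time $n to ${n + 1}: ${es.map(_.payload).mkString(", ")}")
    }
  }
}
```

Each group corresponds to one "RDD @ time n" box in the diagram; a real DStream produces these continuously rather than from a finite list.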

Slide 12

Slide 12 text

Slide 13

Slide 13 text

Slide 14

Slide 14 text

Slide 15

Slide 15 text

Slide 16

Slide 16 text

RDD operations
[Diagram: RDD → transformations → RDD → action → value]
    linesRDD = sc.textFile("SomeFile.txt")
    linesWithErrorRDD = linesRDD.filter(lambda line: "ERROR" in line)
    linesWithErrorRDD.count()   # 6
    linesWithErrorRDD.first()   # Error line

Slide 17

Slide 17 text

DStream processing
• Process the stream with transformations
  – Each transformation creates new RDDs
[Diagram: transformations (labels: map, reduceByValue, count) applied batch by batch. RDD @ time 1, RDD @ time 2, and RDD @ time 3 of the input DStream are each transformed into the matching RDD @ time 1, RDD @ time 2, and RDD @ time 3 of a new DStream]

Slide 18

Slide 18 text

Key concepts
• Data sources
• Transformations: create a new DStream
  – Standard RDD operations: map, filter, union, reduce, join, ...
  – Stateful operations: updateStateByKey(function), countByValueAndWindow, ...
• Output operations
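To make the stateful operations more concrete, here is a minimal plain-Scala sketch of the kind of function passed to updateStateByKey. The function takes the new values for a key in the current batch plus the previous state, and returns the new state. The name `updateCount` and the simulated batches are assumptions for illustration; in Spark the framework calls the function once per key per batch.

```scala
object RunningCountSketch {
  // Same shape as an updateStateByKey function for a running count:
  // (values seen for the key in this batch, previous state) => new state.
  def updateCount(newValues: Seq[Int], state: Option[Int]): Option[Int] =
    Some(state.getOrElse(0) + newValues.sum)

  def main(args: Array[String]): Unit = {
    // Simulate three consecutive batches for a single key (made-up data).
    val batches = Seq(Seq(1, 1), Seq.empty[Int], Seq(1, 1, 1))
    val finalState = batches.foldLeft(Option.empty[Int]) { (state, values) =>
      updateCount(values, state)
    }
    println(finalState) // running total carried across the batches
  }
}
```

Returning `Option` lets the function keep state for a key across batches in which the key does not appear, as in the empty second batch above.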

Slide 19

Slide 19 text

Spark Streaming architecture
• Processed results are output in batches
[Diagram: input data stream → Spark Streaming → DStream of RDD batches (RDD @ time 1, RDD @ time 2, RDD @ time 3) → Spark → batches of processed results]

Slide 20

Slide 20 text

Key concepts
• Data sources
• Transformations
• Output operations: trigger the computation
  – saveAsHadoopFiles: save to HDFS
  – saveAsHadoopDataset: save to HBase
  – saveAsTextFiles
  – foreach: arbitrary processing on each batch of RDDs (foreachRDD in the code on later slides)

Slide 21

Slide 21 text

Slide 22

Slide 22 text

Slide 23

Slide 23 text

Converting CSV rows into Sensor objects
    // Schema for one sensor reading
    case class Sensor(resid: String, date: String, time: String, hz: Double, disp: Double,
                      flo: Double, sedPPM: Double, psi: Double, chlPPM: Double)

    // Parse one comma-separated line into a Sensor
    def parseSensor(str: String): Sensor = {
      val p = str.split(",")
      Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
             p(6).toDouble, p(7).toDouble, p(8).toDouble)
    }
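As a usage sketch, the parser from this slide can be exercised on a single line. The sample row is made up (only the resid, date, time, hz, and psi values echo the schema slide), and `parseSensorOpt` is a hypothetical wrapper, not part of the slides, added because `parseSensor` throws on rows with missing or non-numeric fields.

```scala
import scala.util.Try

case class Sensor(resid: String, date: String, time: String, hz: Double, disp: Double,
                  flo: Double, sedPPM: Double, psi: Double, chlPPM: Double)

object ParseSensorSketch {
  // Parser as shown on the slide.
  def parseSensor(str: String): Sensor = {
    val p = str.split(",")
    Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
           p(6).toDouble, p(7).toDouble, p(8).toDouble)
  }

  // Hypothetical helper: None instead of an exception for malformed rows.
  def parseSensorOpt(str: String): Option[Sensor] = Try(parseSensor(str)).toOption

  def main(args: Array[String]): Unit = {
    val line = "COHUTTA,3/10/14,1:01,10.37,1.73,14.0,0.62,84.0,1.47" // made-up sample row
    println(parseSensorOpt(line))              // a populated Sensor
    println(parseSensorOpt("COHUTTA,3/10/14")) // too few fields, so None
  }
}
```

In a streaming job, a wrapper like this could be combined with `flatMap` so that one bad row does not fail the whole batch; that is a design option, not something the slides prescribe.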

Slide 24

Slide 24 text

Schema
• All events are stored; the data column family may have a data retention period set
• Filtered alerts are stored in the alerts column family
• Daily aggregates are stored in the stats column family

    Row key               | CF data         | CF alerts | CF stats
                          | hz    ...  psi  | psi  ...  | hz_avg  ...  psi_min
    COHUTTA_3/10/14_1:01  | 10.37 ...  84   | 0         |
    COHUTTA_3/10/14       |                 |           | 10      ...  0
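The two row-key shapes in the table can be sketched as follows. The helper names are hypothetical, and the underscore-separated format is inferred from the example keys on this slide rather than taken from the author's code.

```scala
object RowKeySketch {
  // One row per reading: used for the data and alerts column families.
  def eventRowKey(resid: String, date: String, time: String): String =
    s"${resid}_${date}_${time}"

  // One row per sensor per day: used for the stats column family.
  def dailyRowKey(resid: String, date: String): String =
    s"${resid}_${date}"

  def main(args: Array[String]): Unit = {
    println(eventRowKey("COHUTTA", "3/10/14", "1:01"))
    println(dailyRowKey("COHUTTA", "3/10/14"))
  }
}
```

Leading the key with the sensor id keeps all rows for one sensor adjacent in HBase's sorted key space, which suits scans over a single sensor.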

Slide 25

Slide 25 text

Basic steps of Spark Streaming code
The basic steps of Spark Streaming code are:
1. Initialize a Spark StreamingContext object
2. Use the context to create a DStream
   – Represents streaming data from a source
3. Apply transformations
   – Produces a new DStream
4. Apply output operations
   – Persists or outputs the data
5. Start receiving and processing data
   – Using streamingContext.start()
6. Wait for processing to stop
   – Using streamingContext.awaitTermination()
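The six steps can be sketched end to end as one small program. This is a minimal sketch, not the author's application: the app name, the `isError` filter, and the use of `print()` as the output operation are assumptions for illustration, and the master URL is assumed to be supplied by spark-submit. It needs the Spark Streaming library on the classpath.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingStepsSketch {
  // Hypothetical predicate used as the example transformation.
  def isError(line: String): Boolean = line.contains("ERROR")

  def main(args: Array[String]): Unit = {
    // 1. Initialize the StreamingContext with a 2-second batch interval.
    val sparkConf = new SparkConf().setAppName("StreamingStepsSketch")
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // 2. Create a DStream from new files appearing in a directory.
    val linesDStream = ssc.textFileStream("/mapr/stream")

    // 3. Apply a transformation; this yields a new DStream.
    val errorLines = linesDStream.filter(isError)

    // 4. Apply an output operation; print() shows the first elements of each batch.
    errorLines.print()

    // 5. Start receiving and processing data.
    ssc.start()

    // 6. Block until the computation is stopped.
    ssc.awaitTermination()
  }
}
```

Nothing is computed until step 5: the transformation and output operation only declare the work to run on each batch.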

Slide 26

Slide 26 text

Creating a DStream
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    val linesDStream = ssc.textFileStream("/mapr/stream")
[Diagram: linesDStream as a series of batches: batch time 0-1, batch time 1-2, ...]
• DStream: a continuous sequence of RDDs representing a data stream
• Stored in memory as RDDs

Slide 27

Slide 27 text

DStream processing
    val linesDStream = ssc.textFileStream("directory path")
    val sensorDStream = linesDStream.map(parseSensor)
[Diagram: map is applied to the linesDStream RDD of each batch (batch time 0-1, batch time 1-2, ...) to produce the matching sensorDStream RDD]
• A new RDD is created for every batch

Slide 28

Slide 28 text

Slide 29

Slide 29 text

DataFrame and SQL operations
    // For each RDD: parse into Sensor objects and filter
    sensorDStream.foreachRDD { rdd =>
      . . .
      alertRdd.toDF().registerTempTable("alert")
      // Join the alert data with pump maintenance information
      val alertViewDF = sqlContext.sql(
        "select s.resid, s.psi, p.pumpType from alert s join pump p on s.resid = p.resid join maint m on p.resid=m.resid")
      . . .
    }

Slide 30

Slide 30 text

Saving to HBase
    // For each RDD: parse into Sensor objects and filter
    sensorDStream.foreachRDD { rdd =>
      . . .
      // Convert the alerts to Put objects and write them to HBase
      rdd.map(Sensor.convertToPutAlert)
        .saveAsHadoopDataset(jobConfig)
    }

Slide 31

Slide 31 text

Saving to HBase
    rdd.map(Sensor.convertToPut).saveAsHadoopDataset(jobConfig)
[Diagram: for each batch (batch time 0-1, batch time 1-2, ...), the linesRDD of the DStream is mapped to a sensorRDD, which is then saved to HBase]
• Put objects are written to HBase
• Output operation: persists data to external storage

Slide 32

Slide 32 text

Slide 33

Slide 33 text

Slide 34

Slide 34 text

Reading from and writing to HBase
[Diagram: newAPIHadoopRDD scans HBase and yields (row key, Result) pairs; saveAsHadoopDataset writes (key, Put) pairs to HBase]
    val hBaseRDD = sc.newAPIHadoopRDD(
      conf, classOf[TableInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])

    keyStatsRDD.map { case (k, v) => convertToPut(k, v) }.saveAsHadoopDataset(jobConfig)

Slide 35

Slide 35 text

Reading data from HBase
    // Load an RDD of (rowkey, Result) tuples from the HBase table
    val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])
    // Keep only the Result of each tuple
    val resultRDD = hBaseRDD.map(tuple => tuple._2)
    // Transform into an RDD of (RowKey, ColumnValue)
    val keyValueRDD = resultRDD.map(
      result => (Bytes.toString(result.getRow()).split(" ")(0), Bytes.toDouble(result.value)))
    // Group by rowkey and compute statistics on the column values
    val keyStatsRDD = keyValueRDD.groupByKey().mapValues(list => StatCounter(list))
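The final group-and-summarize step can be illustrated with ordinary Scala collections, without Spark or HBase. This is a sketch under stated assumptions: the `Stats` case class stands in for Spark's StatCounter (which also reports count, mean, min, and max), `statsByKey` is a hypothetical name, and the (key, psi) pairs are made up.

```scala
object KeyStatsSketch {
  // Simplified stand-in for the per-key summary that StatCounter provides.
  final case class Stats(count: Long, mean: Double, min: Double, max: Double)

  // Local equivalent of groupByKey().mapValues(...) on (key, value) pairs.
  def statsByKey(pairs: Seq[(String, Double)]): Map[String, Stats] =
    pairs.groupBy(_._1).map { case (key, kvs) =>
      val values = kvs.map(_._2)
      key -> Stats(values.size.toLong, values.sum / values.size, values.min, values.max)
    }

  def main(args: Array[String]): Unit = {
    // Made-up (daily row key, psi) readings.
    val readings = Seq(
      ("COHUTTA_3/10/14", 80.0),
      ("COHUTTA_3/10/14", 90.0),
      ("LAGNAPPE_3/10/14", 70.0))
    statsByKey(readings).foreach { case (k, s) => println(s"$k -> $s") }
  }
}
```

Each resulting (key, stats) pair corresponds to one row that would be written to the stats column family.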

Slide 36

Slide 36 text

Writing data to HBase
    // Save to the data column family of the HBase table
    val jobConfig: JobConf = new JobConf(conf, this.getClass)
    jobConfig.setOutputFormat(classOf[TableOutputFormat])
    jobConfig.set(TableOutputFormat.OUTPUT_TABLE, tableName)
    // Convert the pressure (psi) statistics to Puts and write them
    // to the stats column family of the HBase table
    keyStatsRDD.map { case (k, v) => convertToPut(k, v) }.saveAsHadoopDataset(jobConfig)