Spark Streaming Basics and Scalable Time-Series Data Processing

Slide 1: Spark Streaming Basics and Scalable Time-Series Data Processing
Akihiko Kusanagi (草薙昭彦), MapR Technologies
December 9, 2015
© 2015 MapR Technologies

Slide 2: Today's topics
•  Why is Apache Spark Streaming needed?
•  Apache Spark Streaming overview
   –  Key concepts and architecture
•  Use case
Akihiko Kusanagi (@nagix)

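Slide 3: Why is Spark Streaming needed?
•  Time-series data processing:
   –  Get results in real time
•  Use cases
   –  Social network trends
   –  Website statistics and monitoring
   –  Fraud detection
   –  Ad click billing
(Diagram: time-stamped data is written with repeated "put" operations and becomes data for real-time monitoring.)
•  Sensors, system metrics, events, log files
•  Stock tickers, user activity
•  High volume, high frequency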

Slide 4: What is time-series data?
•  Data with timestamps
   –  Sensor data
   –  Log files
   –  Phones
(Diagram labels: credit card transactions, social media, log files, geodata, web user behavior history, sensors.)

Slide 5: Why is Apache Spark Streaming needed?
•  What if?
   –  You want to analyze data the moment it arrives?
Examples of time-series data: sensors, clicks, logs, stats

Slide 6: Batch processing
    It's 6:01 and 72 degrees
    It's 6:02 and 75 degrees
    It's 6:03 and 77 degrees
    It's 6:04 and 85 degrees
    It's 6:05 and 90 degrees
    It's 6:06 and 85 degrees
    It's 6:07 and 77 degrees
    It's 6:08 and 75 degrees
Batch result: "It was hot at 6:05 yesterday!"
Batch processing may be too late for some events.

Slide 7: Event processing
    It's 6:05 and 90 degrees
Streaming result: "Someone should open a window!"
It is becoming important to process events as they arrive.

Slide 8: Spark Streaming overview
•  Enables scalable, high-throughput, fault-tolerant stream processing of live data
•  Extends the core Spark API
(Diagram: data sources → Spark Streaming → data sinks.)

Slide 9: Stream processing architecture
(Diagram: streaming sources/apps → data ingest (topics, MapR-FS) → stream processing → data storage (MapR-DB, MapR-FS) → apps. The diagram also carries the labels HDFS, HDFS, and HBase.)

Slide 10: Key concepts
•  Data sources:
   –  File-based: HDFS
   –  Network-based: TCP sockets, Twitter, Kafka, Flume, ZeroMQ, Akka Actor
•  Transformations
•  Output operations

Slide 11: Spark Streaming architecture
•  The data stream is split into chunks (batches) every X seconds
   –  This is called a DStream: a continuous sequence of RDDs
(Diagram: input data stream → Spark Streaming → DStream of RDD batches, one per batch interval. Data from time 0 to 1 becomes RDD @ time 1, data from time 1 to 2 becomes RDD @ time 2, and data from time 2 to 3 becomes RDD @ time 3.)
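The batching idea on this slide can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions: the function name `split_into_batches` and the sample events are hypothetical, events are bucketed by an explicit timestamp (Spark Streaming itself batches by arrival time), and intervals with no data are simply omitted, whereas Spark would still produce an empty RDD for them.

```python
from collections import defaultdict

def split_into_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length batches.

    Batch k holds the events with
    k * batch_interval <= timestamp < (k + 1) * batch_interval,
    which mirrors a DStream being a sequence of per-interval RDDs.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batch_index = int(timestamp // batch_interval)
        batches[batch_index].append(value)
    # Return the batches in time order; empty intervals are omitted here.
    return [batches[k] for k in sorted(batches)]

# Hypothetical events: (seconds since start, payload)
events = [(0.2, "a"), (0.9, "b"), (1.5, "c"), (2.1, "d"), (2.8, "e")]
batches = split_into_batches(events, batch_interval=1)
```

Each inner list plays the role of one RDD: the first holds the data from time 0 to 1, the second from time 1 to 2, and so on.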

Slide 12: Resilient Distributed Datasets (RDD)
Spark revolves around RDDs
•  A read-only collection of elements

Slide 13: Resilient Distributed Datasets (RDD)
Spark revolves around RDDs
•  A read-only collection of elements
•  Processed in parallel
•  Cached in memory
   –  Or on disk
•  Fault tolerant

Slide 14: RDD operations
    textFile = sc.textFile("SomeFile.txt")
(Diagram: a single RDD.)

Slide 15: RDD operations
    linesRDD = sc.textFile("LogFile.txt")
    linesWithErrorRDD = linesRDD.filter(lambda line: "ERROR" in line)
(Diagram: transformations turn one RDD into another RDD.)

Slide 16: RDD operations
    linesRDD = sc.textFile("SomeFile.txt")
    linesWithErrorRDD = linesRDD.filter(lambda line: "ERROR" in line)
    linesWithErrorRDD.count()   # 6
    linesWithErrorRDD.first()   # Error line
(Diagram: transformations turn RDDs into RDDs; an action turns an RDD into a value.)
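The difference between a transformation (lazy, yields another dataset) and an action (forces evaluation, yields a value) can be imitated with Python's built-in lazy `filter`. This is a rough analogy rather than Spark code: the log lines and the `has_error` helper are made up for illustration, and unlike an RDD a Python iterator can be consumed only once.

```python
log_lines = [
    "INFO service started",
    "ERROR disk full",
    "INFO heartbeat",
    "ERROR connection lost",
]

evaluated = []  # records which lines the predicate has actually inspected

def has_error(line):
    evaluated.append(line)
    return "ERROR" in line

# "Transformation": builds a lazy pipeline; no line is inspected yet.
lines_with_error = filter(has_error, log_lines)
evaluated_before_action = len(evaluated)

# "Action": consuming the pipeline forces evaluation and produces a value.
error_count = sum(1 for _ in lines_with_error)
evaluated_after_action = len(evaluated)
```

Nothing is read until the count is requested, which is the same reason `linesRDD.filter(...)` returns immediately while `count()` triggers the actual work.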

Slide 17: Processing a DStream
•  Process the stream using transformations
   –  Each one creates new RDDs
(Diagram: transformations such as map, reduceByValue, and count are applied to each RDD of the input DStream, producing the matching RDD of a new DStream. The data from time 0 to 1, 1 to 2, and 2 to 3 becomes RDD @ time 1, RDD @ time 2, and RDD @ time 3 in both DStreams.)

Slide 18: Key concepts
•  Data sources
•  Transformations: create a new DStream
   –  Standard RDD operations: map, filter, union, reduce, join, ...
   –  Stateful operations: updateStateByKey(function), countByValueAndWindow, ...
•  Output operations
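To make "stateful operation" concrete, here is a plain-Python sketch of the idea behind `updateStateByKey`: state is carried from one batch to the next and updated per key. The helper `update_state_by_key`, the `running_count` function, and the pump keys are hypothetical names used only for illustration. One simplification is assumed: only keys that appear in the current batch are updated, whereas Spark also calls the update function for existing keys that received no new values.

```python
def update_state_by_key(state, batch, update_func):
    """Return the new per-key state after applying one batch of (key, value) pairs.

    update_func(new_values, previous_state) has the same shape as the function
    passed to updateStateByKey; previous_state is None for unseen keys.
    """
    grouped = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)
    new_state = dict(state)  # leave the previous state object untouched
    for key, values in grouped.items():
        new_state[key] = update_func(values, state.get(key))
    return new_state

def running_count(new_values, previous_count):
    # Add the number of new events to the count carried over from earlier batches.
    return (previous_count or 0) + len(new_values)

# Three hypothetical batches of (key, value) pairs
batches = [
    [("pump-a", 1), ("pump-b", 1)],
    [("pump-a", 1)],
    [("pump-b", 1), ("pump-b", 1)],
]
state = {}
for batch in batches:
    state = update_state_by_key(state, batch, running_count)
```

After the loop, `state` holds the event count per key across all batches, not just the last one.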

Slide 19: Spark Streaming architecture
•  Processing results are output in batches
(Diagram: input data stream → Spark Streaming → DStream of RDD batches → Spark → batches of processed results. The data from time 0 to 1, 1 to 2, and 2 to 3 becomes RDD @ time 1, RDD @ time 2, and RDD @ time 3.)

Slide 20: Key concepts
•  Data sources
•  Transformations
•  Output operations: these trigger the computation
   –  saveAsHadoopFiles: save to HDFS
   –  saveAsHadoopDataset: save to HBase
   –  saveAsTextFiles
   –  foreach: arbitrary processing performed on the RDD of each batch

Slide 22: Use case: time-series data
(Diagram: sensors → time-stamped data → Spark Streaming → processing with Spark → data for real-time monitoring, which is then read.)

Slide 23: Converting CSV rows into Sensor objects
    // One sensor reading, with one field per CSV column
    case class Sensor(resid: String, date: String, time: String,
                      hz: Double, disp: Double, flo: Double,
                      sedPPM: Double, psi: Double, chlPPM: Double)

    // Parse one comma-separated line into a Sensor
    def parseSensor(str: String): Sensor = {
      val p = str.split(",")
      Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
             p(6).toDouble, p(7).toDouble, p(8).toDouble)
    }
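For readers following along in Python, the same parsing step might look like the sketch below. It is an assumed equivalent, not code from the deck: the snake_case field names `sed_ppm` and `chl_ppm`, the field-count check, and the sample line are additions for illustration. In the sample, only `hz` (10.37) and `psi` (84) come from the schema slide; the other numeric values are made up.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sensor:
    resid: str
    date: str
    time: str
    hz: float
    disp: float
    flo: float
    sed_ppm: float
    psi: float
    chl_ppm: float

def parse_sensor(line: str) -> Sensor:
    """Parse one comma-separated line into a Sensor, like parseSensor in Scala."""
    p = line.split(",")
    if len(p) != 9:
        # Assumed behaviour: reject malformed rows instead of failing on an index.
        raise ValueError(f"expected 9 fields, got {len(p)}: {line!r}")
    # The first three fields stay strings; the remaining six are numeric.
    return Sensor(p[0], p[1], p[2], *(float(x) for x in p[3:9]))

# Hypothetical sample row
sample = "COHUTTA,3/10/14,1:01,10.37,1.79,0.99,0.17,84.0,1.47"
sensor = parse_sensor(sample)
```

The frozen dataclass plays the role of the immutable Scala case class.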

Slide 24: Schema
•  All events are stored in the data column family
   –  A data retention period may be set on this column family
•  Filtered alerts are stored in the alerts column family
•  Daily aggregates are stored in the stats column family

    Row key               | data: hz  …  psi | alerts: psi  … | stats: hz_avg  …  psi_min
    COHUTTA_3/10/14_1:01  | 10.37     …  84  | 0              |
    COHUTTA_3/10/14       |                  |                | 10            …  0

Slide 25: Basic steps of Spark Streaming code
The basic steps of Spark Streaming code are as follows:
1.  Initialize a Spark StreamingContext object
2.  Use the context to create a DStream
    –  It represents the streaming data from a source
3.  Apply transformations
    –  Each one produces a new DStream
4.  Apply output operations
    –  These persist or output the data
5.  Start receiving and processing data
    –  Use streamingContext.start()
6.  Wait for the processing to stop
    –  Use streamingContext.awaitTermination()

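Slide 26: Creating a DStream
    // Batch interval of 2 seconds
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    // Watch a directory for new files and read them line by line
    val linesDStream = ssc.textFileStream("/mapr/stream")
(Diagram: linesDStream shown as a row of batches labeled batch time 0-1, batch time 1-2, batch time 1-2.)
DStream: a continuous sequence of RDDs representing a data stream. Each batch is stored in memory as an RDD.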

Slide 27: Processing a DStream
    val linesDStream = ssc.textFileStream("directory path")
    // Parse every line of every batch into a Sensor
    val sensorDStream = linesDStream.map(parseSensor)
(Diagram: map is applied to the RDD of each batch in linesDStream, producing a new RDD for each batch in sensorDStream. The batches are labeled batch time 0-1, batch time 1-2, batch time 1-2.)

Slide 28: Processing a DStream
    // Process each RDD
    sensorDStream.foreachRDD { rdd =>
      // Keep only the sensor readings with low pressure
      val alertRDD = rdd.filter(sensor => sensor.psi < 5.0)
      . . .
    }
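The per-batch alert filter can be tried out without Spark using ordinary lists. This is a minimal sketch under stated assumptions: each batch is a list of dictionaries rather than an RDD of Sensor objects, and the station names and pressure values are invented. Only the threshold of 5.0 comes from the slide.

```python
PSI_ALERT_THRESHOLD = 5.0  # same threshold as the filter on the slide

def find_alerts(batch):
    """Return the readings in one batch whose pressure is below the threshold."""
    return [reading for reading in batch if reading["psi"] < PSI_ALERT_THRESHOLD]

# Three hypothetical batches, standing in for three RDDs of a DStream
batches = [
    [{"resid": "COHUTTA", "psi": 84.0}, {"resid": "NANTAHALLA", "psi": 3.2}],
    [{"resid": "COHUTTA", "psi": 4.9}],
    [{"resid": "NANTAHALLA", "psi": 80.5}],
]

# Equivalent in spirit to running the filter inside foreachRDD, once per batch
alerts_per_batch = [find_alerts(batch) for batch in batches]
```

A batch with no low-pressure readings simply yields an empty alert list, just as the filtered RDD would be empty.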

Slide 29: DataFrame and SQL operations
    // For each RDD, parse and filter the sensor objects
    sensorDStream.foreachRDD { rdd =>
      . . .
      // Expose the alerts as a temporary table for SQL
      alertRdd.toDF().registerTempTable("alert")
      // Join the alert data with the pump maintenance information
      val alertViewDF = sqlContext.sql(
        "select s.resid, s.psi, p.pumpType from alert s join pump p on s.resid = p.resid join maint m on p.resid=m.resid")
      . . .
    }
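The join query itself can be exercised with Python's built-in `sqlite3` module standing in for Spark SQL. This is only a sketch: the column layouts of the `pump` and `maint` tables (beyond `resid` and `pumpType`, which appear in the query) and all the rows are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schemas; only resid, psi, and pumpType appear in the slide's query.
conn.executescript("""
    CREATE TABLE alert (resid TEXT, psi REAL);
    CREATE TABLE pump  (resid TEXT, pumpType TEXT);
    CREATE TABLE maint (resid TEXT, eventDate TEXT);
""")
conn.executemany("INSERT INTO alert VALUES (?, ?)",
                 [("COHUTTA", 3.2), ("LAGNAPPE", 4.1)])
conn.executemany("INSERT INTO pump VALUES (?, ?)",
                 [("COHUTTA", "HYDROPUMP"), ("LAGNAPPE", "SUBMERSIBLE")])
conn.executemany("INSERT INTO maint VALUES (?, ?)",
                 [("COHUTTA", "3/1/14")])

# Same query text as on the slide
rows = conn.execute(
    "select s.resid, s.psi, p.pumpType "
    "from alert s join pump p on s.resid = p.resid "
    "join maint m on p.resid = m.resid"
).fetchall()
conn.close()
```

Because both joins are inner joins, an alert only survives if its `resid` has a row in both `pump` and `maint`; the LAGNAPPE alert above is dropped because it has no maintenance record.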

Slide 30: Saving to HBase
    // For each RDD, parse and filter the sensor objects
    sensorDStream.foreachRDD { rdd =>
      . . .
      // Convert the alerts to Put objects and write them to HBase
      rdd.map(Sensor.convertToPutAlert)
        .saveAsHadoopDataset(jobConfig)
    }

Slide 31: Saving to HBase
    rdd.map(Sensor.convertToPut).saveAsHadoopDataset(jobConfig)
(Diagram: map turns each linesRDD of the DStream into a sensorRDD, and each batch is then saved to HBase as Put objects. The batches are labeled batch time 0-1, batch time 1-2, batch time 1-2.)
Output operation: persists the data to external storage.

Slide 32: Starting data reception
    sensorDStream.foreachRDD { rdd =>
      . . .
    }
    // Start the processing
    ssc.start()
    // Wait for the processing to be stopped
    ssc.awaitTermination()

Slide 33: Using HBase as an input source and output destination
(Diagram: a Spark application reads from and writes to an HBase database.)
Examples: aggregation and storage, preprocessing, materialized views

Slide 34: Reading and writing HBase
    // Read: scan an HBase table into an RDD of (row key, Result) pairs
    val hBaseRDD = sc.newAPIHadoopRDD(
      conf, classOf[TableInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])

    // Write: convert each (key, value) pair to a Put and save it to HBase
    keyStatsRDD.map { case (k, v) => convertToPut(k, v) }.saveAsHadoopDataset(jobConfig)
(Diagram: HBase → Scan → newAPIHadoopRDD → (row key, Result); (key, Put) → saveAsHadoopDataset → HBase.)

Slide 35: Reading data from HBase
    // Load an RDD of (rowkey, Result) tuples from the HBase table
    val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])
    // Keep only the Result of each tuple
    val resultRDD = hBaseRDD.map(tuple => tuple._2)
    // Transform into an RDD of (RowKey, ColumnValue)
    val keyValueRDD = resultRDD.map(
      result => (Bytes.toString(result.getRow()).split(" ")(0), Bytes.toDouble(result.value)))
    // Group by rowkey and compute statistics over the column values
    val keyStatsRDD = keyValueRDD.groupByKey().mapValues(list => StatCounter(list))
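The last step, grouping by key and summarizing the values, can be sketched in plain Python. This stands in for `groupByKey().mapValues(list => StatCounter(list))` only loosely: the function name `stats_by_key` is hypothetical, the readings are made up, and the result is a plain dictionary with count, mean, min, and max rather than a real `StatCounter`.

```python
from collections import defaultdict
from statistics import mean

def stats_by_key(pairs):
    """Group (key, value) pairs by key and summarize each group's values."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {
        key: {
            "count": len(values),
            "mean": mean(values),
            "min": min(values),
            "max": max(values),
        }
        for key, values in grouped.items()
    }

# Hypothetical (row key, psi) pairs, keyed by station and date
psi_readings = [
    ("COHUTTA_3/10/14", 84.0),
    ("COHUTTA_3/10/14", 80.0),
    ("COHUTTA_3/10/14", 4.0),
    ("LAGNAPPE_3/10/14", 70.0),
]
key_stats = stats_by_key(psi_readings)
```

Each entry of `key_stats` corresponds to one daily row of the stats column family, for example the minimum pressure per station and day.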

Slide 36: Writing data to HBase
    // Configure the job to save to the HBase table
    val jobConfig: JobConf = new JobConf(conf, this.getClass)
    jobConfig.setOutputFormat(classOf[TableOutputFormat])
    jobConfig.set(TableOutputFormat.OUTPUT_TABLE, tableName)
    // Convert the pressure statistics to Puts and write them to the
    // stats column family of the HBase table
    keyStatsRDD.map { case (k, v) => convertToPut(k, v) }.saveAsHadoopDataset(jobConfig)
