Oracle Cloud Hangout Cafe - Cloud Native × Streaming: First Steps

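These are the session slides from Oracle Cloud Hangout Cafe (OCHaCafe).
They introduce Apache Spark Streaming, from an overview through to advanced usage.
- Introduction - Overview of Spark and Spark Streaming
- Hands-on - Getting Spark Streaming up and running
- Advanced - Window aggregation techniques & demo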

(Session recording)
http://tiny.cc/ochacafe2-6-video

(Event page)
https://ochacafe.connpass.com/event/169396/

oracle4engineer

May 13, 2020

Transcript

  1. 1.
  2. 3.

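    Agenda
    1 Introduction - Overview of Spark and Spark Streaming
    2 Hands-on - Getting Spark Streaming up and running
    3 Advanced - Window aggregation techniques & demo
    Speakers: Kyotaro Nonaka, Kenichi Sonoda, Tadahisa Kotegawa
    Copyright © 2020, Oracle and/or its affiliates. All rights reserved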
  3. 4.

    Oracle Cloud Hangout Cafe 2 #6 - Cloud Native x Streaming
    Kenichi Sonoda, Senior Solution Engineer, Oracle Corporation Japan
    May 13th, 2020
  4. 5.

    Apache Spark
    [Bullet text lost in extraction. Figure labels: 1 hour, 3 hours, 1 hour, 1 hour]
  5. 6.

    MapReduce
    [Figure: MapReduce data flow on HDFS. Initial data, intermediate data and result data are connected by Read/Write arrows to HDFS; the arrows are labelled HDFS I/O.]
  6. 7.

    Spark
    [Figure: Spark data flow. Initial data comes from HDFS and result data goes to HDFS, while intermediate data is held in memory (cache); only the ends are labelled HDFS I/O.]
    https://spark.apache.org/
  7. 8.

    RDD (Resilient Distributed Dataset)
    [Bullet text lost in extraction. Surviving keywords: RDD, Spark, LRU]
    [Figure labels: DRAM, HDFS, RDD; three Workers, each holding Partitions and Blocks and applying Filter/Map/Reduce]
  8. 9.

    [Bullet text lost in extraction. Recovered timeline of the Spark APIs:]
    API                         Year   Spark version
    RDD (filter/map/reduce)     2011   Spark 1.0
    DataFrame (SQL)             2013   Spark 1.3
    Dataset (SQL)               2015   Spark 2.0
  9. 10.

    [Bullet text lost in extraction. Surviving keywords: Spark (RDD), Spark (spark-submit)]
    [Figure labels: Driver Program; Spark Cluster Manager (Standalone, YARN, Mesos, Kubernetes); six Executors, each running Tasks that apply Filter/Map/Reduce to Partitions]
  10. 11.

    Spark Streaming
    [Bullet text lost in extraction.]
    [Figure labels: batches at T, T + 1, T + 2 passed to Spark]
  11. 12.

    Spark Streaming: DStream
    • A stream is handled as a continuous sequence of RDDs (a DStream)
    • [Remaining bullet text lost in extraction.]
    scala> val ssc = new StreamingContext(sc, Seconds(5))
    [Figure: each DStream is drawn as a series of RDDs, one per 5-second batch interval.]
  12. 13.

    Structured Streaming
    • Built on the DataFrame/Dataset API of Spark SQL
    • [Remaining bullet text lost in extraction. Surviving keywords: Spark Streaming, Structured Streaming]
    https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
  13. 14.

    Structured Streaming
    • Data Source
      - Kafka/Flume/HDFS/S3/Kinesis/Twitter
    • Input (Input Source)
      - Incoming data, handled as a DataFrame/Dataset
    • Query (Streaming Query)
    • Result (Result Table)
    • Output (Data Sink/Output Sink)
      - Destination of the results (Kafka/HDFS/S3/MySQL/Cassandra etc.)
      - Output modes:
        - Complete: the entire result table is written
        - Append: only rows newly added to the result table are written
        - Update: only rows updated in the result table are written
    https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
  14. 15.

    Structured Streaming
    https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
    [Figure labels: Data Source, Input source, Query, Output Mode, Data Sink]

    // Input: create a streaming DataFrame from the socket source
    import spark.implicits._   // needed for .as[String]
    val lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()

    // Data Source: netcat listening on localhost:9999
    $ nc -lk 9999
    cat dog
    ………

    // Query: split each line into words and count occurrences per word
    val words = lines.as[String].flatMap(_.split(" "))
    val wordCounts = words.groupBy("value").count()

    // Output: print the complete counts to the console
    val query = wordCounts.writeStream.outputMode("complete").format("console").start()
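    To make the "complete" output mode of the word count concrete, here is a minimal sketch in plain Scala collections. It does not use Spark; `WordCountSketch`, `update` and the sample batches are hypothetical names and data chosen for illustration. It only imitates how the running counts (the result table) grow as each micro-batch of lines arrives.

    ```scala
    // Plain Scala collections only; this imitates the query, it does not call Spark.
    object WordCountSketch {
      // Fold one micro-batch of lines into the running counts (the "result table").
      def update(counts: Map[String, Long], batch: Seq[String]): Map[String, Long] =
        batch
          .flatMap(_.split(" "))   // same split as the streaming query
          .filter(_.nonEmpty)      // ignore empty tokens from repeated spaces
          .foldLeft(counts) { (acc, word) =>
            acc.updated(word, acc.getOrElse(word, 0L) + 1L)
          }

      def main(args: Array[String]): Unit = {
        // Hypothetical input: two micro-batches typed into netcat.
        val batches = Seq(Seq("cat dog", "dog"), Seq("cat owl"))
        // "complete" mode: after every batch the whole table is emitted again.
        batches.scanLeft(Map.empty[String, Long])(update).tail.foreach(println)
      }
    }
    ```

    With "update" mode only the rows whose count changed in the latest batch would be emitted instead of the whole table.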
  15. 17.

    Oracle Cloud Hangout Cafe 2 #6 - Cloud Native x Streaming
    Kyotaro Nonaka, Solution Engineer, Oracle Corporation Japan
    May 13th, 2020
  16. 18.

    [Speaker profile; text lost in extraction. Surviving keywords: EC, WebSphere, Seasar2, Flash]
    @non_kyon
  17. 19.

    spark-submit
    • [Text lost in extraction] Spark Cluster Manager (Standalone, Mesos, YARN, Kubernetes)
    • Spark Standalone (local): spark-submit, spark-shell
    • spark-shell: [text lost in extraction] trial & error
  18. 20.

    [Bullet text lost in extraction. Surviving keywords: Spark (RDD), Spark (spark-submit)]
    [Figure labels: Driver Program; Spark Cluster Manager (Standalone, YARN, Mesos, Kubernetes); six Executors, each running Tasks that apply Filter/Map/Reduce to Partitions]
  19. 21.

    spark-shell, spark-submit: the master option
    Specifies the Master (Cluster Manager).
    • --master local (default)
    • --master yarn
    [Remaining text lost in extraction. Surviving fragments: "Standalone ⇒ 1 ... Spark Master", "yarn ⇒ ..."]
    [Figure labels, per setting: Driver, Executor, Executor, local machine; Spark yarn master]
  20. 22.

    spark-shell, spark-submit: the deploy-mode option
    Specifies where the Driver (the Spark application) runs.
    • --deploy-mode client (default): the Driver runs on the local machine
    • --deploy-mode cluster: the Driver runs on a Worker
    [Figure labels, per setting: Driver, Executor, Executor, local machine, Spark]
  21. 23.

    • Run a word count written in Python on local / Standalone
    ※ Third-party libraries:
    • Python: pass them with spark-submit --py-files (.zip/.egg)
    • Java, Scala: package them into a JAR with Maven or sbt
  22. 24.

    Oracle Big Data Service
    Data Lake Platform on Oracle Gen 2 Cloud
    • Managed PaaS based on Cloudera Enterprise Data Hub
    • Cloud Console: [text lost in extraction]
    • Oracle Cloud Platform: [text lost in extraction]
    • Cloud SQL: [text lost in extraction] Oracle SQL
    [Figure labels: Infrastructure (CPU, GPU, Storage, Network); Data Management (Database, Data Lake); Access, Integration, Preparation; OCI Console; Cloudera Manager]
  23. 25.

    Oracle Big Data Service (Cloudera)
    • Run with spark-submit
    • --master yarn, --deploy-mode client
    • Spark [text lost in extraction] Cloudera
    [Figure labels: Driver, Executor, Executor, Utility, Worker, yarn master]
  24. 26.

    Oracle Cloud Hangout Cafe 2 #6 - Cloud Native x Streaming
    Tadahisa Kotegawa, Senior Director, Oracle Corporation Japan
    May 13th, 2020
  25. 27.

    [Speaker profile; text partly lost in extraction. Surviving fragments: "Cloud Pursuit", 2020]
    • [CodeZine] Helidon: https://codezine.jp/article/detail/12054
    • @tkotegaw
    • tkote / oracle-japan
  26. 28.

    Oracle Global Hackathon - Chatbot
    "Team OCHaCafe" [text lost in extraction: APAC 2]
    [Architecture figure labels: Digital Assistant, Mobile, Internet; Helidon pods and a CronJob pod (Weather Batch, Service API) on Oracle Container Engine for Kubernetes; external sources Twitter, Google, Tokyo Metro, Dark Sky, Minato City; Spark Streaming and MLlib on Big Data Service; Object Storage, Autonomous Database, Oracle APEX, Oracle Functions, Events on Oracle Cloud]
  27. 29.

    Event Time vs. Processing Time
    • Event Time: the time the event actually occurred (the Timestamp carried in the data)
    • Processing Time: the time the event is processed by the system
    • [Text lost in extraction] Event Time
    Windowing
    • [Text lost in extraction] Streaming
    [Figure: three window types on a timeline: Fixed Window, Sliding Window, Session Window]
  28. 30.

    Window aggregation
    • Window aggregations available in Structured Streaming

    Fixed Window Aggregation (10-minute windows)
      groupBy( window($"ts", "10 minutes") [, $"..." …] )
      [Figure: event time axis 0, 5, 10, 15, 20, 25 covered by window #1 to window #3]

    Sliding Window Aggregation (10-minute windows, sliding every 5 minutes)
      groupBy( window($"ts", "10 minutes", "5 minutes") [, $"..." …] )
      [Figure: event time axis 0, 5, 10, 15, 20, 25 covered by window #1 to window #5]
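    As a rough illustration of how an event is assigned to fixed and sliding windows, the following plain-Scala sketch lists the windows that contain a given event time. It is not Spark code: `WindowSketch` and `windowsFor` are hypothetical names, time is simplified to whole minutes, and window starts are assumed to be aligned to multiples of the slide interval.

    ```scala
    object WindowSketch {
      // Windows are half-open intervals [start, end), expressed in minutes.
      final case class Window(start: Long, end: Long)

      // All windows of length `size`, starting every `slide` minutes, that contain t.
      // A fixed window is the special case slide == size.
      def windowsFor(t: Long, size: Long, slide: Long): List[Window] = {
        val lastStart = Math.floorDiv(t, slide) * slide  // latest aligned start <= t
        Iterator.iterate(lastStart)(_ - slide)           // walk back through earlier starts
          .takeWhile(start => start + size > t)          // stop once the window no longer covers t
          .map(start => Window(start, start + size))
          .toList
          .reverse                                       // earliest window first
      }

      def main(args: Array[String]): Unit = {
        println(windowsFor(12, 10, 10)) // fixed: one window
        println(windowsFor(12, 10, 5))  // sliding: two overlapping windows
      }
    }
    ```

    With a 10-minute window sliding every 5 minutes, every event falls into two windows, which is why a sliding aggregation emits more rows than a fixed one for the same input.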
  29. 31.

    Watermarking
    • Watermark: a threshold, based on Event Time, for how late data is still accepted
    • Data whose Event Time is older than the Watermark may be dropped
    • Output mode must be "append" or "update"
      - In "complete" mode all aggregates have to be kept, so the Watermark cannot be used to drop state
    Example from "Watermarking" in the Structured Streaming Programming Guide:
    val windowedCounts = words
      .withWatermark("timestamp", "10 minutes")
      .groupBy(
        window($"timestamp", "10 minutes", "5 minutes"),
        $"word")
      .count()
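    The effect of a watermark can be sketched in plain Scala as follows. This is a simplified model, not Spark's implementation: it assumes the watermark is recomputed after every single event as (largest event time seen so far) minus the allowed delay, whereas Spark advances it per micro-batch and compares it against window boundaries. `WatermarkSketch` and `filterLate` are hypothetical names.

    ```scala
    object WatermarkSketch {
      // Pairs each event time with a flag: true = accepted, false = too late.
      // An event is too late when it is older than (max event time seen so far - delay).
      def filterLate(eventTimes: Seq[Long], delay: Long): Vector[(Long, Boolean)] = {
        val start = (Option.empty[Long], Vector.empty[(Long, Boolean)])
        val (_, result) = eventTimes.foldLeft(start) {
          case ((maxSeen, acc), t) =>
            // No watermark exists before the first event, so it is always accepted.
            val keep = maxSeen.forall(m => t >= m - delay)
            val newMax = maxSeen.fold(t)(m => math.max(m, t))
            (Some(newMax), acc :+ (t -> keep))
        }
        result
      }

      def main(args: Array[String]): Unit = {
        // Event times in minutes, arriving out of order; allowed delay is 10 minutes.
        filterLate(Seq(10L, 21L, 12L, 9L, 25L), 10L).foreach(println)
      }
    }
    ```

    In this sample the event at minute 12 arrives late but within the 10-minute allowance and is kept, while the event at minute 9 is older than the watermark (21 - 10 = 11) and is rejected.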
  30. 32.

    Event Time Window ([text lost in extraction])
    • [Example; text lost in extraction. Surviving fragments: 3, 100]
    • Arbitrary Stateful Processing (Scala and Java only)
      - mapGroupsWithState: returns exactly 1 record per group
      - flatMapGroupsWithState: returns 0 or more records per group
  31. 33.

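    Demo: temperature monitoring
    • Status becomes WARNING when the temperature stays above the threshold for m sec
    • Status becomes NORMAL when the temperature stays below the threshold for n sec
    • A notification is sent to Slack when the status changes
    • Settings: m = n = 30, threshold = 100℃, [text lost in extraction] 5
    [Figure: Status timeline labelled NORMAL, NORMAL, WARNING with two Notification marks]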
  32. 34.

    Approach 1/2 - Sliding Window
    • Aggregate the maximum and minimum temperature per Sliding Window and decide NORMAL/WARNING per window:
      - max < threshold → NORMAL
      - min > threshold → WARNING
      - otherwise → TRANSIENT
      - UNKNOWN: [condition lost in extraction]
    [Figure: overlapping Sliding Windows along the timeline, each labelled with its status]
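    The per-window classification can be sketched as a small pure function. This is an illustrative guess, not the deck's actual `checkState`: the threshold of 100 comes from the demo settings, the status strings follow the demo output, and the rule that a window with fewer than six readings is "Unknown" is an assumption (a 30-second window with one reading every 5 seconds), since the original condition for UNKNOWN is not recoverable.

    ```scala
    object CheckStateSketch {
      val Threshold = 100.0   // demo threshold in degrees Celsius
      val ExpectedCount = 6L  // assumption: 30 s window / 5 s reporting interval

      // Classify one window from its aggregated max, min and reading count.
      def checkState(maxTemp: Double, minTemp: Double, count: Long): String =
        if (count < ExpectedCount) "Unknown"     // assumption: window not fully populated
        else if (maxTemp < Threshold) "Normal"   // every reading was below the threshold
        else if (minTemp > Threshold) "Warning"  // every reading was above the threshold
        else "Transient"                         // readings on both sides of the threshold

      def main(args: Array[String]): Unit = {
        println(checkState(85.0, 85.0, 6))
        println(checkState(109.0, 85.0, 6))
        println(checkState(109.0, 109.0, 6))
      }
    }
    ```

    To use a function like this on DataFrame columns it would have to be wrapped as a user-defined function, for example with `org.apache.spark.sql.functions.udf`.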
  33. 35.

    Approach 2/2 - Arbitrary Stateful Processing
    • [Text lost in extraction. Surviving fragments: Window = NORMAL/WARNING, → NORMAL, → WARNING, = WARNING]
  34. 36.

    Demo: OCI Streaming + Spark Streaming
    [Architecture figure labels: simulator (Temp. Reporter, Sensor Manager; Helidon; Compute) → Kafka API → stream (OCI Streaming) → Spark Streaming with event time processing (#1 SW, #2 ASP; Big Data Service) → stream → Kafka API → Slack Alerter, Monitor App (Helidon) → slack; all on Oracle Cloud]
    Sensor Data:
    { "rackId":"rack-03", "temperature":95.0, "timestamp":"2020-04-13T15:22:09+09:00" }
  35. 37.

    Oracle Cloud Infrastructure - Streaming
    • [Text lost in extraction] - IoT, SNS [text lost in extraction]
    • Apache Kafka and Streaming
      - Kafka Broker [text lost in extraction] Kafka API
      - Kafka Connect [text lost in extraction]
    [Figure labels: IoT, Mobile/Web Activities, App, Kafka Client (Producer/Consumer), API Gateway, Oracle Functions, Events → Streaming → Object Storage, Database System]
  36. 39.

    [Demo slide; text lost in extraction. Surviving labels: streaming, streaming, Helidon, Helidon, Apache Spark, Slack]
  37. 40.

    Sliding Window
    Kafka value:
    { "rackId":"rack-03", "temperature":95.0, "timestamp":"2020-04-13T15:22:09+09:00" }

    .selectExpr("CAST(value AS STRING)")
    .select(from_json($"value", monitorDataSchema).as("rackInfo"))
    .select(
      $"rackInfo.rackId".as("rackId"),
      $"rackInfo.temperature".as("temperature"),
      $"rackInfo.timestamp".as("timestamp")
    )
    .withWatermark("timestamp", "0 second")
    // 30-second windows sliding every 5 seconds, per rack
    .groupBy($"rackId", window($"timestamp", "30 seconds", "5 seconds"))
    .agg(
      max("temperature").as("maxTemp"),
      min("temperature").as("minTemp"),
      count("temperature").as("count")   // alias added so that $"count" below resolves
    )
    .withColumn("status", checkState($"maxTemp", $"minTemp", $"count"))
  38. 42.

    Sliding Window
    [Slide callout: Alert]

    -------------------------------------------
    Batch: 97
    -------------------------------------------
    +-------+-------------------+-------------------+------+----+----+-----+
    |rackId |start              |end                |status|max |min |count|
    +-------+-------------------+-------------------+------+----+----+-----+
    |rack-02|2020-05-04 13:31:25|2020-05-04 13:31:55|Normal|85.0|85.0|6    |
    |rack-02|2020-05-04 13:31:30|2020-05-04 13:32:00|Normal|85.0|85.0|6    |
    |rack-02|2020-05-04 13:31:35|2020-05-04 13:32:05|Normal|85.0|85.0|6    |
    +-------+-------------------+-------------------+------+----+----+-----+

    -------------------------------------------
    Batch: 98
    -------------------------------------------
    +-------+-------------------+-------------------+---------+-----+----+-----+
    |rackId |start              |end                |status   |max  |min |count|
    +-------+-------------------+-------------------+---------+-----+----+-----+
    |rack-02|2020-05-04 13:31:40|2020-05-04 13:32:10|Transient|109.0|85.0|6    |
    |rack-02|2020-05-04 13:31:45|2020-05-04 13:32:15|Transient|109.0|85.0|6    |
    +-------+-------------------+-------------------+---------+-----+----+-----+

    -------------------------------------------
    Batch: 99
    -------------------------------------------
    +-------+-------------------+-------------------+---------+-----+----+-----+
    |rackId |start              |end                |status   |max  |min |count|
    +-------+-------------------+-------------------+---------+-----+----+-----+
    |rack-02|2020-05-04 13:31:50|2020-05-04 13:32:20|Transient|109.0|85.0|6    |
    |rack-02|2020-05-04 13:31:55|2020-05-04 13:32:25|Transient|109.0|85.0|6    |
    +-------+-------------------+-------------------+---------+-----+----+-----+

    -------------------------------------------
    Batch: 100
    -------------------------------------------
    +-------+-------------------+-------------------+---------+-----+-----+-----+
    |rackId |start              |end                |status   |max  |min  |count|
    +-------+-------------------+-------------------+---------+-----+-----+-----+
    |rack-02|2020-05-04 13:32:00|2020-05-04 13:32:30|Transient|109.0|85.0 |6    |
    |rack-02|2020-05-04 13:32:05|2020-05-04 13:32:35|Warning  |109.0|109.0|6    |
    +-------+-------------------+-------------------+---------+-----+-----+-----+

    -------------------------------------------
    Batch: 101
    -------------------------------------------
    +-------+-------------------+-------------------+-------+-----+-----+-----+
    |rackId |start              |end                |status |max  |min  |count|
    +-------+-------------------+-------------------+-------+-----+-----+-----+
    |rack-02|2020-05-04 13:32:15|2020-05-04 13:32:45|Warning|109.0|109.0|6    |
    |rack-02|2020-05-04 13:32:10|2020-05-04 13:32:40|Warning|109.0|109.0|6    |
    +-------+-------------------+-------------------+-------+-----+-----+-----+
  39. 43.

    Arbitrary Stateful Processing

    case class RackInfo(rackId:String, temperature:Double, timestamp:Timestamp)
    case class RackState(var rackId:String, var status:String, var prevStatus:String,
      var eventTS:Timestamp, var ts:Timestamp, var temperature:Double)

    .selectExpr("CAST(value AS STRING)").as("value")
    .select(from_json($"value", monitorDataSchema).as("rackInfo"))
    .select(
      $"rackInfo.rackId".as("rackId"),
      $"rackInfo.temperature".as("temperature"),
      $"rackInfo.timestamp".as("timestamp")
    )
    .as[RackInfo]
    // one state object per rack ID
    .groupByKey(_.rackId)
    .mapGroupsWithState[RackState, RackState](GroupStateTimeout.NoTimeout)(updateAcrossAllRackStatus)
    // emit only when the status has changed
    .where($"status" =!= $"prevStatus")

    def updateAcrossAllRackStatus(
      rackId : String,
      inputs : Iterator[RackInfo],
      oldState : GroupState[RackState]
    ) : RackState = {
      // …
    }

    Example state: RackState(rack-03,Normal,Normal,2020-04-06 11:36:18.0,2020-04-06 11:36:43.0,102.0)
    [Slide notes partly lost in extraction: "ID", "※RackState"]
  40. 44.

    Arbitrary Stateful Processing - [subtitle lost in extraction]
  41. 45.

    Arbitrary Stateful Processing
    [Slide callout: Alert]

    [rack-02] >> updateAcrossAllRackStatus
    RackInfo(rack-02,81.0,2020-05-10 15:03:16.0) = Below 100.0 for 0 msec since changed.
    +-----+--------+-----+--------+--------+---------------------+---------------------+
    |     |Rack ID |Temp.|Current |Previous|TS of Initial Event  |TS of Last Event     |
    +-----+--------+-----+--------+--------+---------------------+---------------------+
    |Last |rack-02 | 81.0|Normal  |Normal  |                 null|2020-05-10 15:03:11.0|
    |New  |rack-02 | 81.0|Normal  |Normal  |                 null|2020-05-10 15:03:16.0|
    +-----+--------+-----+--------+--------+---------------------+---------------------+

    [rack-02] >> updateAcrossAllRackStatus
    RackInfo(rack-02,106.0,2020-05-10 15:03:21.0) = Above 100.0 for 0 msec since changed.
    +-----+--------+-----+--------+--------+---------------------+---------------------+
    |     |Rack ID |Temp.|Current |Previous|TS of Initial Event  |TS of Last Event     |
    +-----+--------+-----+--------+--------+---------------------+---------------------+
    |Last |rack-02 | 81.0|Normal  |Normal  |                 null|2020-05-10 15:03:16.0|
    |New  |rack-02 |106.0|Normal  |Normal  |2020-05-10 15:03:21.0|2020-05-10 15:03:21.0|
    +-----+--------+-----+--------+--------+---------------------+---------------------+

    [rack-02] >> updateAcrossAllRackStatus
    RackInfo(rack-02,106.0,2020-05-10 15:03:26.0) = Above 100.0 for 5000 msec since changed.
    +-----+--------+-----+--------+--------+---------------------+---------------------+
    |     |Rack ID |Temp.|Current |Previous|TS of Initial Event  |TS of Last Event     |
    +-----+--------+-----+--------+--------+---------------------+---------------------+
    |Last |rack-02 |106.0|Normal  |Normal  |2020-05-10 15:03:21.0|2020-05-10 15:03:21.0|
    |New  |rack-02 |106.0|Normal  |Normal  |2020-05-10 15:03:21.0|2020-05-10 15:03:26.0|
    +-----+--------+-----+--------+--------+---------------------+---------------------+

    [rack-02] >> updateAcrossAllRackStatus
    RackInfo(rack-02,106.0,2020-05-10 15:03:31.0) = Above 100.0 for 10000 msec since changed.
    +-----+--------+-----+--------+--------+---------------------+---------------------+
    |     |Rack ID |Temp.|Current |Previous|TS of Initial Event  |TS of Last Event     |
    +-----+--------+-----+--------+--------+---------------------+---------------------+
    |Last |rack-02 |106.0|Normal  |Normal  |2020-05-10 15:03:21.0|2020-05-10 15:03:26.0|
    |New  |rack-02 |106.0|Normal  |Normal  |2020-05-10 15:03:21.0|2020-05-10 15:03:31.0|
    +-----+--------+-----+--------+--------+---------------------+---------------------+

    ( ...)

    [rack-02] >> updateAcrossAllRackStatus
    RackInfo(rack-02,106.0,2020-05-10 15:03:46.0) = Above 100.0 for 25000 msec since changed.
    +-----+--------+-----+--------+--------+---------------------+---------------------+
    |     |Rack ID |Temp.|Current |Previous|TS of Initial Event  |TS of Last Event     |
    +-----+--------+-----+--------+--------+---------------------+---------------------+
    |Last |rack-02 |106.0|Normal  |Normal  |2020-05-10 15:03:21.0|2020-05-10 15:03:41.0|
    |New  |rack-02 |106.0|Normal  |Normal  |2020-05-10 15:03:21.0|2020-05-10 15:03:46.0|
    +-----+--------+-----+--------+--------+---------------------+---------------------+

    [rack-02] >> updateAcrossAllRackStatus
    RackInfo(rack-02,106.0,2020-05-10 15:03:51.0) = Above 100.0 for 30000 msec since changed.
    +-----+--------+-----+--------+--------+---------------------+---------------------+
    |     |Rack ID |Temp.|Current |Previous|TS of Initial Event  |TS of Last Event     |
    +-----+--------+-----+--------+--------+---------------------+---------------------+
    |Last |rack-02 |106.0|Normal  |Normal  |2020-05-10 15:03:21.0|2020-05-10 15:03:46.0|
    |New  |rack-02 |106.0|Warning |Normal  |                 null|2020-05-10 15:03:51.0|
    +-----+--------+-----+--------+--------+---------------------+---------------------+
    !!!!! Status has changed !!!!!

    [rack-02] >> updateAcrossAllRackStatus
    RackInfo(rack-02,106.0,2020-05-10 15:03:56.0) = Above 100.0 for 0 msec since changed.
    +-----+--------+-----+--------+--------+---------------------+---------------------+
    |     |Rack ID |Temp.|Current |Previous|TS of Initial Event  |TS of Last Event     |
    +-----+--------+-----+--------+--------+---------------------+---------------------+
    |Last |rack-02 |106.0|Warning |Normal  |                 null|2020-05-10 15:03:51.0|
    |New  |rack-02 |106.0|Warning |Warning |                 null|2020-05-10 15:03:56.0|
    +-----+--------+-----+--------+--------+---------------------+---------------------+
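    The state transitions visible in this log can be approximated with a small pure-Scala state machine. This is a sketch reconstructed from the log output, not the body of `updateAcrossAllRackStatus`, which the slides elide. `RackStateSketch`, `State`, `update` and `run` are hypothetical names, timestamps are simplified to epoch milliseconds, and treating a reading exactly at the threshold as "Normal" is an arbitrary choice.

    ```scala
    object RackStateSketch {
      val Threshold = 100.0    // degrees Celsius, from the demo settings
      val HoldMillis = 30000L  // m = n = 30 seconds, from the demo settings

      // Simplified stand-in for the deck's RackState.
      // `since` is the time of the first reading that disagreed with `status`.
      final case class State(status: String, prevStatus: String, since: Option[Long])

      def update(old: State, temperature: Double, ts: Long): State = {
        val observed = if (temperature > Threshold) "Warning" else "Normal"
        if (observed == old.status) {
          // The reading agrees with the current status: clear any pending timer.
          State(old.status, old.status, None)
        } else {
          val start = old.since.getOrElse(ts)
          if (ts - start >= HoldMillis) State(observed, old.status, None) // status flips
          else State(old.status, old.status, Some(start))                 // keep waiting
        }
      }

      // Apply a series of (timestamp, temperature) readings, returning every new state.
      def run(initial: State, readings: Seq[(Long, Double)]): List[State] =
        readings.scanLeft(initial) { case (s, (ts, temp)) => update(s, temp, ts) }.toList.tail

      def main(args: Array[String]): Unit = {
        val readings = Seq(16000L -> 81.0) ++ (21000L to 56000L by 5000L).map(_ -> 106.0)
        run(State("Normal", "Normal", None), readings)
          .filter(s => s.status != s.prevStatus) // same idea as .where($"status" =!= $"prevStatus")
          .foreach(println)
      }
    }
    ```

    Because `prevStatus` is overwritten on every update, a status change is visible for exactly one update, which is what allows a filter on status differing from prevStatus to emit a single alert per transition.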
  42. 46.

    • Event Time and Processing Time
      - Processing based on Event Time ([text lost in extraction]) - Windowing
      - [Text lost in extraction] → Watermark
    • Event Time Windows in Spark Streaming
      - Fixed Window
      - Sliding Window
    • Arbitrary Stateful Processing
      - [Text lost in extraction] Session Window
  43. 47.

    • https://github.com/oracle-japan/ochacafe-spark-streaming
    • Spark Streaming: Structured Streaming Programming Guide
      https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
    • Beginning Apache Spark 2: With Resilient Distributed Datasets, Spark SQL, Structured Streaming and Spark Machine Learning library, by Hien Luu
  44. 48.