Spark Streaming Snippets
AGAWA Koji
March 14, 2016
Programming
Transcript
Spark Streaming Snippets @atty303
Spark Streaming apps I have built so far
• rtg -- data generation for real-time retargeting
• pixelwriter -- writes mark data for Dynamic Creative
• feedsync -- feed synchronization
• segment:elastic -- real-time segmentation
• (logblend -- joins heterogeneous event logs)
I pulled out the pieces of these apps that seemed worth sharing:
• SparkBoot
• Cron
• UpdatableBroadcast
• Connector
• SparkStreamingSpec
SparkBoot https://gist.github.com/atty303/c83f3c8cb8a930951be0
• Implements the main entry point of a Spark app
• Provides the SparkContext / StreamingContext
• Manages configuration
• Customize per deployment by shipping an application.conf with spark-submit --files
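The gist linked above has the full implementation. As a minimal sketch of the idea (everything beyond the SparkApp / SparkBatchBoot / mkApp / appName / appConfig names taken from the slides is my assumption, not the gist's actual code), the batch flavor could look roughly like this:

import com.typesafe.config.{Config, ConfigFactory}
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Try

trait SparkApp {
  def run(): Try[Int]
}

trait SparkBatchBoot {
  def appName: String
  def mkApp(sc: SparkContext, args: Array[String]): SparkApp

  // An application.conf shipped with `spark-submit --files application.conf`
  // lands in the container's working directory, which YARN puts on the
  // classpath, so a plain ConfigFactory.load() can pick it up.
  lazy val appConfig: Config = ConfigFactory.load()

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName(appName))
    val exitCode = mkApp(sc, args).run().getOrElse(1)  // non-zero on failure
    sc.stop()
    sys.exit(exitCode)
  }
}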
For a batch app:

object TrainingBatchApp extends SparkBatchBoot {
  val appName = "TrainingBatchApp"

  override def mkApp(sc: SparkContext, args: Array[String]): SparkApp =
    new TrainingBatchApp(sc, appConfig)
}

class TrainingBatchApp(
  sc: SparkContext,
  appConfig: Config) extends SparkApp {

  def run(): Try[Int] = Try { 0 }
}
For a streaming app:

object PredictStreamingApp extends SparkStreamingBoot {
  val appName = "PredictStreamingApp"

  override val checkpointPath: String = "app.training.streaming.checkpoint-path"
  override val batchDurationPath: String = "app.training.streaming.batch-duration"

  override def mkApp(sc: SparkContext, ssc: StreamingContext, args: Array[String]): SparkApp =
    new PredictStreamingApp(ssc, batchDuration, appConfig)
}

class PredictStreamingApp(
  ssc: StreamingContext,
  batchDuration: Duration,
  appConfig: Config) extends SparkApp {

  val sparkContext = ssc.sparkContext

  def run(): Try[Int] = Try { 0 }
}
Doing cron-like things
• Some work needs to run periodically at an interval longer than the batch-duration
• e.g. refreshing master data that is read from an external data store
def repeatedly(streamingContext: StreamingContext, interval: Duration)
              (f: (SparkContext, Time) => Unit): Unit = {
  // A DStream that generates one trigger element per batch
  val s = streamingContext.queueStream(
    mutable.Queue.empty[RDD[Unit]],
    oneAtATime = true,
    defaultRDD = streamingContext.sparkContext.makeRDD(Seq(())))
    .repartition(1)
  // Re-window the trigger stream so it slides once per `interval`,
  // then run f on the driver each time it fires
  s.window(s.slideDuration, interval)
    .foreachRDD { (rdd, time) =>
      f(rdd.context, time)
      rdd.foreach(_ => ())  // force evaluation of the trigger RDD
    }
}
Usage:

repeatedly(streamingContext, Durations.seconds(300)) { (sc, time) =>
  // @driver: work executed every 5 minutes
}
An updatable Broadcast
• We want to update a Broadcast after the streaming job has started
• A simple wrapper that holds a plain Broadcast
/**
 * A Broadcast whose value can be updated (re-broadcast).
 *
 * https://gist.github.com/Reinvigorate/040a362ca8100347e1a6
 * @author Reinvigorate
 */
case class UpdatableBroadcast[T: ClassTag](
  @transient private val ssc: StreamingContext,
  @transient private val _v: T) {

  @transient private var v = ssc.sparkContext.broadcast(_v)

  def update(newValue: T, blocking: Boolean = false): Unit = {
    v.unpersist(blocking)
    v = ssc.sparkContext.broadcast(newValue)
  }

  def value: T = v.value

  // Serialize only the current Broadcast handle, so closures shipped to
  // executors always carry the latest reference.
  private def writeObject(out: ObjectOutputStream): Unit = {
    out.writeObject(v)
  }

  private def readObject(in: ObjectInputStream): Unit = {
    v = in.readObject().asInstanceOf[Broadcast[T]]
  }
}
Usage:

def loadModel(): Model = ???

val ub: UpdatableBroadcast[Model] =
  UpdatableBroadcast(streamingContext, loadModel())

StreamingUtil.repeatedly(streamingContext, refreshInterval) { (_, _) =>
  ub.update(loadModel())
}

dstream.foreachRDD { rdd =>
  val model = ub.value
  // use model
}
Abstracting external connections
• When Spark accesses an external resource, each executor has to maintain its own connection
• The connection itself cannot be sent from the driver to the executors (it is not serializable)
• So the "how to connect" is abstracted as a Connector trait
trait Connector[A] extends java.io.Closeable with Serializable {
  def get: A
  def close(): Unit
  def using[B](f: A => B): B = f(get)
}
case class PoolAerospikeConnector(name: Symbol, config: AerospikeConfig)
  extends Connector[AerospikeClient] {

  def get: AerospikeClient =
    PoolAerospikeConnector.defaultHolder.getOrCreate(name, mkClient)

  def close(): Unit =
    PoolAerospikeConnector.defaultHolder.remove(name)(
      AerospikeConnector.aerospikeClientClosable)

  private val mkClient: () => AerospikeClient = () =>
    new AerospikeClient(config.clientPolicy.underlying, config.asHosts:_*)
}

object PoolAerospikeConnector {
  private val defaultHolder = new DefaultResourceHolder[AerospikeClient]
}
case class ScalikeJdbcConnector(name: Symbol, config: Config)
  extends Connector[Unit] {

  def get: Unit = {
    if (!ConnectionPool.isInitialized(name)) {
      // Load MySQL JDBC driver class
      Class.forName("com.mysql.jdbc.Driver")
      ConnectionPool.add(name,
        config.getString("url"),
        config.getString("user"),
        config.getString("password"))
    }
  }

  def close(): Unit = ConnectionPool.close(name)
}
case class KafkaProducerConnector[K :ClassTag, V :ClassTag](
  name: Symbol,
  config: java.util.Map[String, AnyRef])
  extends Connector[ScalaKafkaProducer[K, V]] {

  def get: ScalaKafkaProducer[K, V] =
    KafkaProducerConnector.defaultHolder.getOrCreate(name, mkResource)
      .asInstanceOf[ScalaKafkaProducer[K, V]]

  def close(): Unit =
    KafkaProducerConnector.defaultHolder.remove(name)(
      KafkaProducerConnector.kafkaProducerClosable)

  private val mkResource = () => {
    val keySer = mkDefaultSerializer[K](ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG)
    val valueSer = mkDefaultSerializer[V](ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG)
    new ScalaKafkaProducer[K, V](
      new KafkaProducer[K, V](config, keySer.orNull, valueSer.orNull))
  }

  private def mkDefaultSerializer[A :ClassTag](configKey: String): Option[Serializer[A]] = {
    if (!config.containsKey(configKey)) {
      implicitly[ClassTag[A]].runtimeClass match {
        case c if c == classOf[Array[Byte]] =>
          Some(new ByteArraySerializer().asInstanceOf[Serializer[A]])
        case c if c == classOf[String] =>
          Some(new StringSerializer().asInstanceOf[Serializer[A]])
        case _ => None
      }
    } else None
  }
}
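Putting a connector to work: it is constructed once on the driver, captured by the closure, and shipped to the executors; the real client only comes to life lazily on each executor. A usage sketch under those assumptions (Record, writeRecord, aerospikeConfig and wireUp are hypothetical stand-ins, not part of the original code):

import com.aerospike.client.AerospikeClient
import org.apache.spark.streaming.dstream.DStream

case class Record(userId: Long, segment: String)                // hypothetical
def writeRecord(client: AerospikeClient, r: Record): Unit = ???  // hypothetical

def wireUp(dstream: DStream[Record], aerospikeConfig: AerospikeConfig): Unit = {
  // Created once on the driver. The case class is Serializable and holds no
  // client yet, so Spark can ship it inside the foreachPartition closure.
  val connector = PoolAerospikeConnector('ads, aerospikeConfig)

  dstream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // Runs on an executor: using/get lazily creates (or reuses) the pooled
      // client that is local to this JVM.
      connector.using { client =>
        records.foreach(r => writeRecord(client, r))
      }
    }
  }
}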
Testing Spark Streaming https://gist.github.com/atty303/18e64e718f0cf3261c0e
class CountProductSpec extends SpecWithJUnit with SparkStreamingSpec {
  val batchDuration: Duration = Duration(1000)

  "Count" >> {
    val (sourceQueue, resultQueue) = startQueueStream[Product, (Product, Long)] { inStream =>
      // The streaming logic under test
      CountProduct(inStream).run(sc)
    }

    // Feed test data into the input queue
    sourceQueue += sc.parallelize(Seq(
      Product(1, "id"), Product(1, "id"), Product(2, "id")))

    // Advance virtual time by one batch
    advance()

    // Assert on the emitted data
    resultQueue.dequeue must eventually(contain(exactly(
      Product(1, "id") -> 2L,
      Product(2, "id") -> 1L
    )))
  }
}
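The SparkStreamingSpec trait itself lives in the gist above. As a minimal sketch of what it has to provide (the ClockWrapper trick, the config key, and all internals here are my assumptions about Spark 1.x mechanics, not the gist's actual code):

// Both snippets live under org.apache.spark.streaming so they can reach the
// private[streaming] scheduler and the private[spark] ManualClock.
package org.apache.spark.streaming

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.util.ManualClock
import scala.collection.mutable
import scala.reflect.ClassTag

class ClockWrapper(ssc: StreamingContext) {
  def advance(by: Duration): Unit =
    ssc.scheduler.clock.asInstanceOf[ManualClock].advance(by.milliseconds)
}

trait SparkStreamingSpec {
  def batchDuration: Duration

  // Swap the wall clock for a manual one so the test controls batch boundaries.
  lazy val ssc = new StreamingContext(
    new SparkConf()
      .setMaster("local[2]")
      .setAppName(getClass.getSimpleName)
      .set("spark.streaming.clock", "org.apache.spark.util.ManualClock"),
    batchDuration)
  lazy val sc = ssc.sparkContext
  private lazy val clock = new ClockWrapper(ssc)

  // Advance virtual time by one batch interval.
  def advance(): Unit = clock.advance(batchDuration)

  // Wire a queue-backed input stream through the logic under test and
  // collect each output batch into a queue the test can dequeue from.
  def startQueueStream[In: ClassTag, Out: ClassTag](
      mkStream: DStream[In] => DStream[Out])
      : (mutable.Queue[RDD[In]], mutable.Queue[Seq[Out]]) = {
    val source = mutable.Queue.empty[RDD[In]]
    val sink = mutable.Queue.empty[Seq[Out]]
    mkStream(ssc.queueStream(source)).foreachRDD(rdd => sink += rdd.collect().toSeq)
    ssc.start()
    (source, sink)
  }
}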