Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Spark Streaming Snippets
Search
AGAWA Koji
March 14, 2016
Programming
1.1k
0
Share
Embed
Copy iframe code
Copy JS code
Copy link
Start on current slide
Spark Streaming Snippets
AGAWA Koji
March 14, 2016
More Decks by AGAWA Koji
See All by AGAWA Koji
Software Architecture in an AI-Driven World
atty303
79
47k
PipeCDプラグインへの期待 / Anticipating PipeCD Plugins
atty303
0
120
EmscriptenでC/C++アプリをWASM化してブラウザで動かしてみた
atty303
0
660
良いソフトウェアとコードレビュー / Good software and code review
atty303
38
18k
Scala + Caliban で作るGraphQL バックエンド / Making GraphQL Backend with Scala + Caliban
atty303
0
600
Scala.jsとAndroidでドメイン層を共有しよう / Scala.js and Android
atty303
0
810
もう一つのビルドツール mill で作る Docker イメージ / Build docker image with mill the yet another build tool
atty303
2
2.6k
Case of Ad Delivery System is Implemented by Scala and DDD
atty303
4
3.7k
ログのメトリックを取ってみる話
atty303
0
1k
Other Decks in Programming
See All in Programming
代数的データ型って何が嬉しいの? #frontend_phpcon_do
kajitack
8
3.3k
OSもどきOS
arkw
0
480
LLM Plugin for Node-REDの利用方法と開発について
404background
0
170
Agentic UI
manfredsteyer
PRO
0
130
タクシーアプリ『GO』の バックエンド開発のおける AI利活用と若者のすべて
pyama86
3
1.9k
「エンジニアインターン、どうやって取った?」準備のリアルを語るLT会 Progate BAR
akiomatic
0
130
Skillsは効率化、Agentsは"自分の拡張"——Builder時代のエージェント編成(CC Night 2026)
wemra
1
120
Semantic Version 単位で戦略を柔軟に変えて、パッケージアップデートを自動化する
daitasu
0
200
Language Server 使ってる? 〜VSCode と Zed の場合〜 / Are you using a Language Server? ~For VS Code and Zed~
handlename
0
780
New "Type" system on PicoRuby
pocke
1
790
TAKTでAI駆動開発の品質を設計する
j5ik2o
6
1.1k
These Five Tricks Can Make Your Apps Greener, Cheaper, & Nicer
hollycummins
0
280
Featured
See All Featured
Art, The Web, and Tiny UX
lynnandtonic
304
22k
The innovator’s Mindset - Leading Through an Era of Exponential Change - McGill University 2025
jdejongh
PRO
1
200
Save Time (by Creating Custom Rails Generators)
garrettdimon
PRO
32
3.4k
Bridging the Design Gap: How Collaborative Modelling removes blockers to flow between stakeholders and teams @FastFlow conf
baasie
0
580
svc-hook: hooking system calls on ARM64 by binary rewriting
retrage
2
290
It's Worth the Effort
3n
188
29k
Fight the Zombie Pattern Library - RWD Summit 2016
marcelosomers
234
17k
Build The Right Thing And Hit Your Dates
maggiecrowley
39
3.2k
The Director’s Chair: Orchestrating AI for Truly Effective Learning
tmiket
1
190
Neural Spatial Audio Processing for Sound Field Analysis and Control
skoyamalab
0
330
Navigating Weather and Climate Data
rabernat
0
220
Making the Leap to Tech Lead
cromwellryan
135
9.9k
Transcript
Spark Streaming Snippets @a#y303
ࠓ·Ͱ࡞ͬͨ Spark Streaming ΞϓϦ • rtg -- ϦΞϧλΠϜϦλή༻σʔλੜ • pixelwriter
-- Dynamic Crea4ve ༻ϚʔΫσʔλॻ͖ࠐΈ • feedsync -- ϑΟʔυಉظ • segment:elas4c -- ϦΞϧλΠϜηάϝϯτԽ • (logblend -- ҟͳΔΠϕϯτϩάͷ Join)
͜ΕΒͷΞϓϦ͔Βదʹڞ༗͢Δͱخͦ͠͏ͳͱ͜ ΖΛൈ͍ͯΈͨ • SparkBoot • Cron • UpdatableBroadcast • Connector
• SparkStreamingSpec
SparkBoot h"ps:/ /gist.github.com/a"y303/c83f3c8cb8a930951be0 • Spark ΞϓϦͷ main ࣮ • SparkContext
/ StreamingContext Λఏڙ͢Δ • Configura4on ཧ • spark-submit ͷ --files Ͱ applica4on.conf ΛૹͬͯΧελ ϚΠζ
όονͷ߹ object TrainingBatchApp extends SparkBatchBoot { val appName = "TrainingBatchApp"
override def mkApp(sc: SparkContext, args: Array[String]): SparkApp = new TrainingBatchApp(sc, appConfig) } class TrainingBatchApp( sc: SparkContext, appConfig: Config) extends SparkApp { def run(): Try[Int] = Try { 0 } }
ετϦʔϛϯάͷ߹ object PredictStreamingApp extends SparkStreamingBoot { val appName = "PredictStreamingApp"
override val checkpointPath: String = "app.training.streaming.checkpoint-path" override val batchDurationPath: String = "app.training.streaming.batch-duration" override def mkApp(sc: SparkContext, ssc: StreamingContext, args: Array[String]): SparkApp = new PredictStreamingApp(ssc, batchDuration, appConfig) } class PredictStreamingApp( ssc: StreamingContext, appConfig: Config) extends SparkApp { val sparkContext = ssc.sparkContext def run(): Try[Int] = Try { 0 } }
Cron తͳ͜ͱΛΔ • batch-dura+on ΑΓִ͍ؒͰఆظ࣮ߦ͍ͨ͠ॲཧ͕͋Δ • ֎෦σʔλετΞ͔ΒಡΜͰ͍ΔϚελσʔλͷϦϑϨογϡ ͳͲ
def repeatedly(streamingContext: StreamingContext, interval: Duration) (f: (SparkContext, Time) => Unit):
Unit = { // τϦΨʔΛੜ͢Δ DStream val s = streamingContext.queueStream( mutable.Queue.empty[RDD[Unit]], oneAtATime = true, defaultRDD = streamingContext.sparkContext.makeRDD(Seq(()))) .repartition(1) s.window(s.slideDuration, interval) .foreachRDD { (rdd, time) => f(rdd.context, time) rdd.foreach(_ => ()) } }
͍ํ repetedly(streamingContext, Durations.seconds(300)) { (sc, time) => // @driver: 5
ຖʹ࣮ߦ͢Δॲཧ }
ߋ৽Մೳͳ Broadcast • Streaming ͕ಈ͖࢝Ίͨޙʹ Broadcast Λߋ৽͍ͨ͠ • ୯ͳΔ Broadcast
Λอ࣋͢Δϥούʔ
/** * ͷߋ৽(࠶ϒϩʔυΩϟετ)͕Մೳͳ Broadcast * * https://gist.github.com/Reinvigorate/040a362ca8100347e1a6 * @author Reinvigorate
*/ case class UpdatableBroadcast[T: ClassTag]( @transient private val ssc: StreamingContext, @transient private val _v: T) { @transient private var v = ssc.sparkContext.broadcast(_v) def update(newValue: T, blocking: Boolean = false): Unit = { v.unpersist(blocking) v = ssc.sparkContext.broadcast(newValue) } def value: T = v.value private def writeObject(out: ObjectOutputStream): Unit = { out.writeObject(v) } private def readObject(in: ObjectInputStream): Unit = { v = in.readObject().asInstanceOf[Broadcast[T]] } }
͍ํ def loadModel(): Model = ??? val ub: UpdatableBroadcast[Model] =
UpdatableBroadcast(streamingContext, loadModel()) StreamingUtil.repeatedly(streamingContext, refreshInterval) { (_, _) => ub.update(loadModel()) } dstream.foreachRDD { rdd => val model = ub.value // use model }
֎෦ଓͷநԽ • Spark Ͱ֎෦ϦιʔεʹΞΫηε͢Δͱ͖ɺݸʑͷ Executor ͕ ଓΛҡ࣋͢Δඞཁ͕͋Δ • Driver ͔Β
Executor ʹʮଓͦͷͷʯΛૹ৴͢Δ͜ͱͰ͖ ͳ͍ • ʮଓ͢Δํ๏ʯΛ Connector trait ͱͯ͠நԽ͍ͯ͠Δ
trait Connector[A] extends java.io.Closeable with Serializable { def get: A
def close(): Unit def using[B](f: A => B): B = f(get) }
case class PoolAerospikeConnector(name: Symbol, config: AerospikeConfig) extends Connector[AerospikeClient] { def
get: AerospikeClient = PoolAerospikeConnector.defaultHolder.getOrCreate(name, mkClient) def close(): Unit = PoolAerospikeConnector.defaultHolder.remove(name)( AerospikeConnector.aerospikeClientClosable) private val mkClient: () => AerospikeClient = () => new AerospikeClient(config.clientPolicy.underlying, config.asHosts:_*) } object PoolAerospikeConnector { private val defaultHolder = new DefaultResourceHolder[AerospikeClient] }
case class ScalikeJdbcConnector(name: Symbol, config: Config) extends Connector[Unit] { def
get: Unit = { if (!ConnectionPool.isInitialized(name)) { // Load MySQL JDBC driver class Class.forName("com.mysql.jdbc.Driver") ConnectionPool.add(name, config.getString("url"), config.getString("user"), config.getString("password")) } } def close(): Unit = ConnectionPool.close(name) }
case class KafkaProducerConnector[K :ClassTag, V :ClassTag]( name: Symbol, config: java.util.Map[String,
AnyRef]) extends Connector[ScalaKafkaProducer[K, V]] { def get: ScalaKafkaProducer[K, V] = KafkaProducerConnector.defaultHolder.getOrCreate(name, mkResource) .asInstanceOf[ScalaKafkaProducer[K, V]] def close(): Unit = KafkaProducerConnector.defaultHolder.remove(name)( KafkaProducerConnector.kafkaProducerClosable) private val mkResource = () => { val keySer = mkDefaultSerializer[K](ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG) val valueSer = mkDefaultSerializer[V](ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG) new ScalaKafkaProducer[K, V]( new KafkaProducer[K, V](config, keySer.orNull, valueSer.orNull)) } private def mkDefaultSerializer[A :ClassTag](configKey: String): Option[Serializer[A]] = { if (!config.containsKey(configKey)) { implicitly[ClassTag[A]].runtimeClass match { case c if c == classOf[Array[Byte]] => Some(new ByteArraySerializer().asInstanceOf[Serializer[A]]) case c if c == classOf[String] => Some(new StringSerializer().asInstanceOf[Serializer[A]]) case _ => None } } else None } }
Spark Streaming ͷςετ h"ps:/ /gist.github.com/a"y303/18e64e718f0cf3261c0e
class CountProductSpec extends SpecWithJUnit with SparkStreamingSpec { val batchDuration: Duration
= Duration(1000) "Count" >> { val (sourceQueue, resultQueue) = startQueueStream[Product, (Product, Long)] { inStream => // ςετରͷ Streaming ॲཧ CountProduct(inStream).run(sc) } // ೖྗΩϡʔʹςετσʔλΛೖ͢Δ sourceQueue += sc.parallelize(Seq( Product(1, "id"), Product(1, "id"), Product(2, "id"))) // ࣌ؒΛਐΊΔ advance() // ग़ྗ͞ΕΔσʔλΛςετ͢Δ resultQueue.dequeue must eventually(contain(exactly( Product(1, "id") -> 2L, Product(2, "id") -> 1L ))) } }