
Scala + Google Dataflow = Serverless Spark

Do you like Spark's concise and powerful Scala syntax, but feel that running your own Spark cluster involves too many settings to figure out? Let's take a look at Scio, the Scala API for Google Cloud Dataflow developed by Spotify.

Pishen Tsai

June 21, 2018

Transcript

  1. Google Dataflow? a Spark/Hadoop-like service with Serverless design on Google

    Cloud Platform https://cloud.google.com/dataflow/
  2. Traditional Cluster EMR Spark EC2 Subnet setting? SSH setting? Firewall

    setting? Jobs How to see the dashboard? Memory setting? Routing table? What ports? Where is my key? How many GB do I have? Jumper? Tunneling?
  3. Google Dataflow Job GCE (Google's EC2) Subnet setting? SSH setting?

    Firewall setting? How to see the dashboard? Memory setting? Routing table? What ports? Where is my key? How many GB do I have? Jumper? Tunneling?
  4. Google Dataflow Job GCE (Google's EC2) Subnet setting? SSH setting?

    Firewall setting? How to see the dashboard? Memory setting? Routing table? What ports? Where is my key? How many GB do I have? Jumper? Tunneling?
  5. def main() = { ... sc.textFile() .map() .reduce() ... }

    Traditional Submission EMR Spark assembly run! spark-submit
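
    A minimal sketch of the traditional flow this slide describes, assuming a Spark word count packaged with sbt-assembly and submitted to an EMR cluster with spark-submit (bucket paths and the object name are hypothetical):

      import org.apache.spark.sql.SparkSession

      object WordCount {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().appName("word-count").getOrCreate()
          val counts = spark.sparkContext
            .textFile("s3://bucket/input/*.txt")       // hypothetical input
            .flatMap(_.split(" "))
            .map(word => (word, 1L))
            .reduceByKey(_ + _)
          counts.saveAsTextFile("s3://bucket/output/")  // hypothetical output
          spark.stop()
        }
      }

      // Build the fat jar with `sbt assembly`, then submit it to the cluster:
      //   spark-submit --class WordCount wordcount-assembly.jar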
  6. Traditional Planning source rdd r1 r2 val rdd = source.map(...)

    val r1 = rdd.count() val r2 = rdd.reduce(...)
  7. Traditional Planning source rdd r1 r2 val rdd = source.map(...)

    val r1 = rdd.count() val r2 = rdd.reduce(...)
  8. Traditional Planning source rdd r1 r2 val rdd = source.map(...)

    val r1 = rdd.count() val r2 = rdd.reduce(...)
  9. Traditional Planning source rdd r1 r2 val rdd = source.map(...)

    val r1 = rdd.count() val r2 = rdd.reduce(...)
  10. Traditional Planning source rdd r1 r2 val rdd = source.map(...)

    val r1 = rdd.count() val r2 = rdd.reduce(...) actions
  11. Traditional Planning source rdd r1 r2 val rdd = source.map(...)

    val r1 = rdd.count() val r2 = rdd.reduce(...) rdd.cache()
  12. Traditional Planning source rdd r1 r2 val rdd = source.map(...)

    val r1 = rdd.count() val r2 = rdd.reduce(...) rdd.cache()
  13. Traditional Planning source rdd r1 r2 val rdd = source.map(...)

    val r1 = rdd.count() val r2 = rdd.reduce(...) rdd.cache()
  14. Traditional Planning source rdd r1 r2 val rdd = source.map(...)

    val r1 = rdd.count() val r2 = rdd.reduce(...) rdd.cache()
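
    A small Spark sketch of the planning shown above: map is lazy, count and reduce are actions that each trigger a job, and cache() keeps rdd in memory so the second action does not recompute everything from source (sc here is a SparkContext; the path is hypothetical).

      val source = sc.textFile("s3://bucket/logs/*")   // hypothetical path
      val rdd    = source.map(_.length)                // lazy: nothing runs yet

      rdd.cache()                                      // without this, each action below
                                                       // re-reads and re-maps source
      val r1: Long = rdd.count()                       // action: submits a job, returns a value
      val r2: Int  = rdd.reduce(_ + _)                 // action: submits a second job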
  15. Google Dataflow source rdd r1 r2 val rdd = source.map(...)

    val r1 = rdd.count() val r2 = rdd.reduce(...) sc.submit()
  16. Google Dataflow source rdd r1 r2 val rdd = source.map(...)

    val r1 = rdd.count() val r2 = rdd.reduce(...) sc.submit()
  17. Google Dataflow source rdd r1 r2 val rdd = source.map(...)

    val r1 = rdd.count() val r2 = rdd.reduce(...) sc.submit()
  18. Google Dataflow source rdd r1 r2 val rdd = source.map(...)

    val r1 = rdd.count() val r2 = rdd.reduce(...) sc.submit()
  19. Google Dataflow source rdd r1 r2 val rdd = source.map(...)

    val r1 = rdd.count() val r2 = rdd.reduce(...) No actions at all! sc.submit() Doesn't need cache()
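
    The same shape in Scio, matching the slides above: count and reduce only add nodes to the plan (they return SCollections, not values), and the whole graph is shipped to Dataflow in one go when the context is closed, so there is nothing to cache (a sketch; sc is a ScioContext and the paths are hypothetical).

      import com.spotify.scio.values.SCollection

      val source = sc.textFile("gs://bucket/logs/*")       // hypothetical path
      val data   = source.map(_.length)

      val r1: SCollection[Long] = data.count               // not an action, just another node
      val r2: SCollection[Int]  = data.reduce(_ + _)       // same here

      r1.saveAsTextFile("gs://bucket/out/count/")
      r2.saveAsTextFile("gs://bucket/out/sum/")
      sc.close()   // the full graph is submitted as a single Dataflow job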
  20. The Ugly Part

    Spark:
      sc.textFile("s3://bucket/*").flatMap(_.split("[^\\p{L}]+"))
    Beam Java:
      p.apply(TextIO.read().from("gs://bucket/*"))
       .apply("ExtractWords", FlatMapElements.into(TypeDescriptors.strings())
         .via((String word) -> Arrays.asList(word.split("[^\\p{L}]+"))))
    Beam Python:
      (p | beam.io.ReadFromText('gs://bucket/*')
         | 'ExtractWords' >> beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x)))
    Beam Go:
      lines := textio.Read(s, "gs://bucket/*")
      words := beam.ParDo(s, func(line string, emit func(string)) {
        for _, word := range wordRE.FindAllString(line, -1) { emit(word) }
      }, lines)
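
    For contrast with the Beam APIs above, the same word extraction in Scio stays a Spark-like one-liner (a sketch, not on the slide; sc is a ScioContext):

      sc.textFile("gs://bucket/*").flatMap(_.split("[^\\p{L}]+"))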
  21. RDD: .collect .count .distinct .filter .flatMap .groupBy .keyBy .map .max .reduce .sortBy .take .zipWithIndex (some of these are actions)

    SCollection: .count .distinct .filter .flatMap .groupBy .keyBy .map .max .reduce .take (no actions)
  22. PairRDDFunctions: .countByKey .groupByKey .join .mapValues .reduceByKey .rightOuterJoin

    PairSCollectionFunctions: .countByKey .groupByKey .intersectByKey .join .mapValues .maxByKey .reduceByKey .rightOuterJoin .sumByKey
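
    A short sketch using a few of the methods listed on these two slides; the key-value ones come from PairSCollectionFunctions, available on any SCollection of pairs (sc is a ScioContext; the input format, paths, and threshold are hypothetical):

      val clicks = sc.textFile("gs://bucket/clicks/*")           // hypothetical "user<TAB>ad" lines
        .map(_.split("\t"))
        .map(cols => (cols(0), 1L))                              // SCollection[(String, Long)]

      val perUser  = clicks.reduceByKey(_ + _)                   // PairSCollectionFunctions
      val topUsers = perUser.filter { case (_, n) => n > 100 }   // plain SCollection method

      topUsers
        .map { case (user, n) => s"$user\t$n" }
        .saveAsTextFile("gs://bucket/out/top-users/")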
  23. Spark:
      val size: Long = rdd1.count
      val bSize: Broadcast[Long] = sc.broadcast(size)
      rdd2.map(x => x + bSize.value)

    Scio:
      val size: SCollection[Long] = scollection1.count
      val sSize: SideInput[Long] = size.asSingletonSideInput
      scollection2
        .withSideInputs(sSize)
        .map { (x, s) => x + s(sSize) }
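
    One practical detail worth adding: withSideInputs returns a wrapped collection, so after the map you usually convert back with toSCollection to keep chaining ordinary operations (a sketch continuing the Scio snippet above):

      val result =
        scollection2
          .withSideInputs(sSize)
          .map { (x, s) => x + s(sSize) }
          .toSCollection            // back to a plain SCollection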
  24. The Power of Scala Macros

    @BigQueryType.fromQuery("SELECT user, ad FROM click_logs")
    class Row

    sc.typedBigQuery[Row]()
      .groupBy(row => row.user)
      .mapValues(_.map(row => row.ad).sorted.takeRight(100))
  25. The Power of Scala Macros

    @BigQueryType.fromQuery("SELECT user, ad FROM click_logs")
    class Row(user: String, ad: Long)

    sc.typedBigQuery[Row]()
      .groupBy(row => row.user)
      .mapValues(_.map(row => row.ad).sorted.takeRight(100))
  26. The Power of Scala Macros

    @BigQueryType.fromQuery("SELECT user, ab FROM click_logs")
    class Row

    sc.typedBigQuery[Row]()
      .groupBy(row => row.user)
      .mapValues(_.map(row => row.ab).sorted.takeRight(100))

    Compile error: Field 'ab' not found.
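
    A rough end-to-end sketch around the snippet above, assuming the field types shown on the slide and the Scio version used in this deck; @BigQueryType.toTable and saveAsTypedBigQuery are the companion pieces for writing typed rows back out, and the output case class and table names are hypothetical.

      import com.spotify.scio.ScioContext
      import com.spotify.scio.bigquery._

      object TopAdsJob {
        // Macro-annotated classes live at object level, not inside a method.
        @BigQueryType.fromQuery("SELECT user, ad FROM click_logs")
        class Row

        @BigQueryType.toTable
        case class TopAds(user: String, ads: List[Long])   // hypothetical output schema

        def build(sc: ScioContext): Unit =
          sc.typedBigQuery[Row]()
            .groupBy(_.user)
            .mapValues(_.map(_.ad).toList.sorted.takeRight(100))
            .map { case (user, ads) => TopAds(user, ads) }
            .saveAsTypedBigQuery("your-project:your_dataset.top_ads")   // hypothetical table
      }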
  27. Step 0

    1. Open a GCP project
    2. Give your credit card to Google ($300 free trial~!)
    3. Enable the Dataflow API
  28. Step 1

    Install Google Cloud SDK:
      brew cask install google-cloud-sdk
    Set up gcloud:
      gcloud init
      gcloud auth application-default login
  29. Step 2

    libraryDependencies ++= Seq(
      "com.spotify" %% "scio-core" % "0.5.2",
      "org.apache.beam" % "beam-runners-google-cloud-dataflow-java" % "2.4.0"
    )
    https://github.com/spotify/scio/wiki/Getting-Started#sbt-project-setup
  30. Step 3

    def main(args: Array[String]): Unit = {
      val sc = ScioContext {
        val opts = PipelineOptionsFactory.as(classOf[DataflowPipelineOptions])
        opts.setRegion("asia-east1")
        opts.setProject("your-project-id")
        opts.setRunner(classOf[DataflowRunner])
        opts
      }
      sc.textFile("gs://your-bucket/*.txt")
        .flatMap(_.split(" "))
        .countByValue()
        .saveAsTextFile("gs://output-bucket/output-folder/")
      sc.close()
    }
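
    Not from the deck, but handy while iterating: Beam's DirectRunner runs the same pipeline on the local machine before you pay for a Dataflow job; add the beam-runners-direct-java dependency and swap only the runner.

      // build.sbt: "org.apache.beam" % "beam-runners-direct-java" % "2.4.0"
      import com.spotify.scio.ScioContext
      import org.apache.beam.runners.direct.DirectRunner
      import org.apache.beam.sdk.options.PipelineOptionsFactory

      val sc = ScioContext {
        val opts = PipelineOptionsFactory.create()
        opts.setRunner(classOf[DirectRunner])   // local execution instead of Dataflow
        opts
      }
      // ... same textFile / flatMap / countByValue / saveAsTextFile pipeline as above ...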