Slide 1

Slide 1 text

Scala + Google Dataflow = Serverless Spark @pishen

Slide 2

Slide 2 text

Google Dataflow? A Spark/Hadoop-like service with a serverless design on Google Cloud Platform. https://cloud.google.com/dataflow/

Slide 3

Slide 3 text

Traditional Cluster
EMR / Spark / EC2 / Jobs
Subnet setting? SSH setting? Firewall setting? How to see the dashboard? Memory setting? Routing table? What ports? Where is my key? How many GB do I have? Jumper? Tunneling?

Slide 4

Slide 4 text

Google Dataflow Job GCE (Google's EC2)

Slide 5

Slide 5 text

Google Dataflow
Job → GCE (Google's EC2)
Subnet setting? SSH setting? Firewall setting? How to see the dashboard? Memory setting? Routing table? What ports? Where is my key? How many GB do I have? Jumper? Tunneling?

Slide 6

Slide 6 text

Google Dataflow
Job → GCE (Google's EC2)
Subnet setting? SSH setting? Firewall setting? How to see the dashboard? Memory setting? Routing table? What ports? Where is my key? How many GB do I have? Jumper? Tunneling?

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

Google Dataflow

Slide 10

Slide 10 text

Traditional Submission
def main() = {
  ...
  sc.textFile()
    .map()
    .reduce()
  ...
}
assembly → spark-submit → EMR Spark, run!

Slide 11

Slide 11 text

Google Dataflow
def main() = {
  ...
  sc.textFile()
    .map()
    .reduce()
  ...
  sc.submit()
}
run! → JSON DAG → Dataflow API → Job

Slide 12

Slide 12 text

Traditional Planning
source → rdd → r1
val rdd = source.map(...)
val r1 = rdd.count()

Slide 13

Slide 13 text

Traditional Planning
source → rdd → r1
val rdd = source.map(...)
val r1 = rdd.count()

Slide 14

Slide 14 text

Traditional Planning
source → rdd → r1
val rdd = source.map(...)
val r1 = rdd.count()

Slide 15

Slide 15 text

Traditional Planning
source → rdd → r1
val rdd = source.map(...)
val r1 = rdd.count()

Slide 16

Slide 16 text

Traditional Planning
source → rdd → r1
val rdd = source.map(...)
val r1 = rdd.count()

Slide 17

Slide 17 text

Traditional Planning
source → rdd → r1, r2
val rdd = source.map(...)
val r1 = rdd.count()
val r2 = rdd.reduce(...)

Slide 18

Slide 18 text

Traditional Planning
source → rdd → r1, r2
val rdd = source.map(...)
val r1 = rdd.count()
val r2 = rdd.reduce(...)

Slide 19

Slide 19 text

Traditional Planning
source → rdd → r1, r2
val rdd = source.map(...)
val r1 = rdd.count()
val r2 = rdd.reduce(...)

Slide 20

Slide 20 text

Traditional Planning
source → rdd → r1, r2
val rdd = source.map(...)
val r1 = rdd.count()
val r2 = rdd.reduce(...)

Slide 21

Slide 21 text

Traditional Planning
source → rdd → r1, r2
val rdd = source.map(...)
val r1 = rdd.count()
val r2 = rdd.reduce(...)
(actions)
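A minimal Spark sketch, not taken from the slides (the path and the map transform are illustrative), of why count() and reduce() are called actions: map() only records lineage, and each action launches a job, so without cache() the map step runs once per action.

import org.apache.spark.{SparkConf, SparkContext}

val sc  = new SparkContext(new SparkConf().setAppName("planning").setMaster("local[*]"))  // local run for illustration
val rdd = sc.textFile("s3://bucket/*").map(_.length)  // lazy: lineage only, nothing runs yet
val r1  = rdd.count()        // action #1: reads the source and runs map
val r2  = rdd.reduce(_ + _)  // action #2: reads the source and runs map again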

Slide 22

Slide 22 text

Traditional Planning
source → rdd → r1
val rdd = source.map(...)
val r1 = rdd.count()
rdd.cache()

Slide 23

Slide 23 text

Traditional Planning
source → rdd → r1
val rdd = source.map(...)
val r1 = rdd.count()
rdd.cache()

Slide 24

Slide 24 text

Traditional Planning
source → rdd → r1
val rdd = source.map(...)
val r1 = rdd.count()
rdd.cache()

Slide 25

Slide 25 text

Traditional Planning
source → rdd → r1
val rdd = source.map(...)
val r1 = rdd.count()
rdd.cache()

Slide 26

Slide 26 text

Traditional Planning
source → rdd → r1
val rdd = source.map(...)
val r1 = rdd.count()
rdd.cache()

Slide 27

Slide 27 text

Traditional Planning
source → rdd → r1, r2
val rdd = source.map(...)
val r1 = rdd.count()
val r2 = rdd.reduce(...)
rdd.cache()

Slide 28

Slide 28 text

Traditional Planning
source → rdd → r1, r2
val rdd = source.map(...)
val r1 = rdd.count()
val r2 = rdd.reduce(...)
rdd.cache()

Slide 29

Slide 29 text

Traditional Planning
source → rdd → r1, r2
val rdd = source.map(...)
val r1 = rdd.count()
val r2 = rdd.reduce(...)
rdd.cache()

Slide 30

Slide 30 text

Traditional Planning
source → rdd → r1, r2
val rdd = source.map(...)
val r1 = rdd.count()
val r2 = rdd.reduce(...)
rdd.cache()
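Continuing the illustrative sketch above, now with cache(): the first action materializes rdd in executor memory, so the second action reuses the cached partitions instead of recomputing from the source.

val rdd = sc.textFile("s3://bucket/*").map(_.length)
rdd.cache()                  // mark for caching (still lazy)
val r1 = rdd.count()         // action #1: computes rdd and populates the cache
val r2 = rdd.reduce(_ + _)   // action #2: served from the cached partitions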

Slide 31

Slide 31 text

Google Dataflow
source → rdd → r1, r2
val rdd = source.map(...)
val r1 = rdd.count()
val r2 = rdd.reduce(...)
sc.submit()

Slide 32

Slide 32 text

Google Dataflow
source → rdd → r1, r2
val rdd = source.map(...)
val r1 = rdd.count()
val r2 = rdd.reduce(...)
sc.submit()

Slide 33

Slide 33 text

Google Dataflow
source → rdd → r1, r2
val rdd = source.map(...)
val r1 = rdd.count()
val r2 = rdd.reduce(...)
sc.submit()

Slide 34

Slide 34 text

Google Dataflow
source → rdd → r1, r2
val rdd = source.map(...)
val r1 = rdd.count()
val r2 = rdd.reduce(...)
sc.submit()

Slide 35

Slide 35 text

Google Dataflow
source → rdd → r1, r2
val rdd = source.map(...)
val r1 = rdd.count()
val r2 = rdd.reduce(...)
sc.submit()
No actions at all! Doesn't need cache().
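A hedged sketch of the same idea in Scio, the Scala API introduced on later slides (paths are illustrative, and sc is a ScioContext set up as in Step 3 near the end of the deck): every call only adds a node to the DAG, and nothing executes until the whole graph is submitted, so there is no notion of an action and no cache() to manage. What the slide calls sc.submit() corresponds to closing the context in the Scio version used later.

val data = sc.textFile("gs://bucket/*.txt").map(_.length)
val r1   = data.count          // SCollection[Long], not computed yet
val r2   = data.reduce(_ + _)  // SCollection[Int], not computed yet
r1.map(_.toString).saveAsTextFile("gs://bucket/count/")
r2.map(_.toString).saveAsTextFile("gs://bucket/sum/")
sc.close()                     // the DAG is built and submitted here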

Slide 36

Slide 36 text

Other Features
● Cloud Shuffle
● Streaming
● Built-in I/O Transforms

Slide 37

Slide 37 text

I/O Transforms
Cloud Storage / Datastore / BigQuery / Pub/Sub → Dataflow → Cloud Storage / Datastore / BigQuery / Pub/Sub

Slide 38

Slide 38 text

Looks perfect, right?

Slide 39

Slide 39 text

The Ugly Part
Dataflow, Apache Beam
https://beam.apache.org/

Slide 40

Slide 40 text

The Ugly Part
Dataflow ← Beam Java (1st priority) / Beam Python / Beam Go

Slide 41

Slide 41 text

The Ugly Part Beam Java Beam Python Beam Go

Slide 42

Slide 42 text

The Ugly Part

Spark:
sc.textFile("s3://bucket/*").flatMap(_.split("[^\\p{L}]+"))

Beam Java:
p.apply(TextIO.read().from("gs://bucket/*"))
 .apply("ExtractWords", FlatMapElements.into(TypeDescriptors.strings())
   .via((String word) -> Arrays.asList(word.split("[^\\p{L}]+"))))

Beam Python:
(p | beam.io.ReadFromText('gs://bucket/*')
   | 'ExtractWords' >> beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x)))

Beam Go:
lines := textio.Read(s, "gs://bucket/*")
words := beam.ParDo(s, func(line string, emit func(string)) {
  for _, word := range wordRE.FindAllString(line, -1) {
    emit(word)
  }
}, lines)

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

Dataflow ← Beam Java / Beam Python / Beam Go, with Scio on top of Beam Java

Slide 46

Slide 46 text

sc.textFile("s3://bucket/*").flatMap(_.split("[^\\p{L}]+")) Spark sc.textFile("gs://bucket/*").flatMap(_.split("[^\\p{L}]+")) Scio

Slide 47

Slide 47 text

sc.textFile("s3://bucket/*").flatMap(_.split("[^\\p{L}]+")) Spark sc.textFile("gs://bucket/*").flatMap(_.split("[^\\p{L}]+")) Scio SparkContext ScioContext RDD[String] SCollection[String]

Slide 48

Slide 48 text

RDD: .collect .count .distinct .filter .flatMap .groupBy .keyBy .map .max .reduce .sortBy .take .zipWithIndex
SCollection: .count .distinct .filter .flatMap .groupBy .keyBy .map .max .reduce .take
actions

Slide 49

Slide 49 text

RDD: def count(): Long
SCollection: def count(): SCollection[Long] (an SCollection with a single element)
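Because count returns a deferred SCollection[Long] rather than a plain Long, the result is consumed inside the pipeline, for example written out or turned into a side input (as on a later slide). A small sketch under the same assumptions as above (sc is a ScioContext, paths are hypothetical):

val lines: SCollection[String] = sc.textFile("gs://bucket/*.txt")
val n: SCollection[Long] = lines.count                      // a single-element SCollection
n.map(c => s"line count: $c").saveAsTextFile("gs://bucket/count/")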

Slide 50

Slide 50 text

PairRDDFunctions: .countByKey .groupByKey .join .mapValues .reduceByKey .rightOuterJoin
PairSCollectionFunctions: .countByKey .groupByKey .intersectByKey .join .mapValues .maxByKey .reduceByKey .rightOuterJoin .sumByKey
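A hedged sketch of the key-value side (file layout and field names are hypothetical; sc is a ScioContext as above): an SCollection of pairs picks up PairSCollectionFunctions such as reduceByKey and join, mirroring PairRDDFunctions in Spark.

val clicks: SCollection[(String, Int)] =
  sc.textFile("gs://bucket/clicks/*").map { line =>
    val Array(user, n) = line.split(",")    // "user,count" lines (illustrative format)
    (user, n.toInt)
  }
val names: SCollection[(String, String)] =
  sc.textFile("gs://bucket/users/*").map { line =>
    val Array(user, name) = line.split(",")
    (user, name)
  }
val totals: SCollection[(String, Int)] = clicks.reduceByKey(_ + _)     // sum clicks per user
val report: SCollection[(String, (Int, String))] = totals.join(names)  // attach user names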

Slide 51

Slide 51 text

Spark:
val size: Long = rdd1.count
val bSize: Broadcast[Long] = sc.broadcast(size)
rdd2.map(x => x + bSize.value)

Scio:
val size: SCollection[Long] = scollection1.count
val sSize: SideInput[Long] = size.asSingletonSideInput
scollection2
  .withSideInputs(sSize)
  .map { (x, s) => x + s(sSize) }

(In Spark, count is an action that yields a concrete Long you can broadcast; in Scio the count is itself a deferred SCollection, so it is handed to the next transform as a SideInput instead.)

Slide 52

Slide 52 text

The Power of Scala Macros
@BigQueryType.fromQuery("SELECT user, ad FROM click_logs")
class Row

sc.typedBigQuery[Row]()
  .groupBy(row => row.user)
  .mapValues(_.map(row => row.ad).sorted.takeRight(100))

Slide 53

Slide 53 text

The Power of Scala Macros
@BigQueryType.fromQuery("SELECT user, ad FROM click_logs")
class Row(user: String, ad: Long)

sc.typedBigQuery[Row]()
  .groupBy(row => row.user)
  .mapValues(_.map(row => row.ad).sorted.takeRight(100))

Slide 54

Slide 54 text

The Power of Scala Macros
@BigQueryType.fromQuery("SELECT user, ab FROM click_logs")
class Row

sc.typedBigQuery[Row]()
  .groupBy(row => row.user)
  .mapValues(_.map(row => row.ab).sorted.takeRight(100))

Compile error: Field 'ab' not found.

Slide 55

Slide 55 text

5 steps to run a Scio job

Slide 56

Slide 56 text

Step 0
1. Open a GCP project
2. Give your credit card to Google ($300 free trial~!)
3. Enable Dataflow API

Slide 57

Slide 57 text

Step 1
Install Google Cloud SDK:
brew cask install google-cloud-sdk
Setup gcloud:
gcloud init
gcloud auth application-default login

Slide 58

Slide 58 text

Step 2
libraryDependencies ++= Seq(
  "com.spotify" %% "scio-core" % "0.5.2",
  "org.apache.beam" % "beam-runners-google-cloud-dataflow-java" % "2.4.0"
)
https://github.com/spotify/scio/wiki/Getting-Started#sbt-project-setup
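For context, a minimal build.sbt around these two dependencies might look like the sketch below; the Scala version is an assumption (Scio 0.5.x was published for Scala 2.11 and 2.12), and the dependency lines are the ones from the slide.

scalaVersion := "2.12.6"  // assumed; any Scala version supported by Scio 0.5.x works

libraryDependencies ++= Seq(
  "com.spotify" %% "scio-core" % "0.5.2",
  "org.apache.beam" % "beam-runners-google-cloud-dataflow-java" % "2.4.0"
)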

Slide 59

Slide 59 text

Step 3
def main(args: Array[String]): Unit = {
  val sc = ScioContext {
    val opts = PipelineOptionsFactory.as(classOf[DataflowPipelineOptions])
    opts.setRegion("asia-east1")
    opts.setProject("your-project-id")
    opts.setRunner(classOf[DataflowRunner])
    opts
  }
  sc.textFile("gs://your-bucket/*.txt")
    .flatMap(_.split(" "))
    .countByValue()
    .saveAsTextFile("gs://output-bucket/output-folder/")
  sc.close()
}

Slide 60

Slide 60 text

Step 4 sbt run
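If you would rather not hardcode the project, region, and runner as in Step 3, Scio's ContextAndArgs (available via import com.spotify.scio._) builds the ScioContext from command-line flags, so the same job can be launched with something like sbt "run --project=your-project-id --region=asia-east1 --runner=DataflowRunner". A hedged sketch of that variant, with the output format made explicit:

def main(cmdlineArgs: Array[String]): Unit = {
  val (sc, _) = ContextAndArgs(cmdlineArgs)  // parses --project, --region, --runner into pipeline options
  sc.textFile("gs://your-bucket/*.txt")
    .flatMap(_.split(" "))
    .countByValue
    .map { case (word, count) => s"$word\t$count" }
    .saveAsTextFile("gs://output-bucket/output-folder/")
  sc.close()
}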

Slide 61

Slide 61 text

Step 5 Check the Dataflow web console on GCP

Slide 62

Slide 62 text

sc.thankYou()