Slide 1

Apache Spark Arnon Rotem-Gal-Oz

Slide 2

• Distributed data processing
• Declarative
• Lazy
• Multiple languages
• Highly scalable
• Quite complex
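The "lazy" bullet can be illustrated with a rough pure-Python analogy: transformations only build a pipeline, and nothing executes until an action pulls results through it.

```python
# A rough analogy for Spark's lazy evaluation using plain Python
# generators: transformations return a lazy pipeline; only an
# "action" (here, sum) actually drives the computation.

def transform(data):
    # "map"-like transformation: lazy, returns a generator
    return (x + 2 for x in data)

pipeline = transform(range(3))   # nothing computed yet
result = sum(pipeline)           # action triggers the whole chain
```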

Slide 3

2004: MapReduce: Simplified Data Processing on Large Clusters
Jeff Dean, Sanjay Ghemawat, Google, Inc.
https://research.google.com/archive/mapreduce-osdi04-slides/index.html
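The programming model from the 2004 paper can be sketched in a few lines of plain Python: a map function emits (key, value) pairs, the framework groups them by key, and a reduce function folds each group. A minimal word-count sketch (the canonical example from the paper):

```python
# Minimal sketch of the MapReduce model: map emits (key, value)
# pairs, the framework shuffles (groups by key), reduce folds
# each group into a final value.
from collections import defaultdict

def map_fn(doc):
    for word in doc.split():
        yield word, 1

def reduce_fn(key, values):
    return key, sum(values)

def map_reduce(docs):
    groups = defaultdict(list)
    for doc in docs:
        for k, v in map_fn(doc):      # map phase
            groups[k].append(v)       # shuffle: group by key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())  # reduce phase

counts = map_reduce(["big data", "big clusters"])
```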

Slide 4

No content

Slide 5

• Re-execute on failure
• Skip bad records
• Redundant execution (copies of tasks)
• Data locality optimization
• Combiners (map-side reduce)
• Compression of data
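The combiner bullet is worth a sketch: partial aggregation happens on each mapper before the shuffle, so far less data crosses the network. A toy pure-Python illustration:

```python
# Sketch of a combiner ("map-side reduce"): each mapper emits one
# (word, partial_count) pair per distinct word instead of one
# (word, 1) pair per occurrence, shrinking the shuffle.
from collections import Counter

def mapper_with_combiner(doc):
    # map + local combine in one step
    return Counter(doc.split())

def reduce_partials(partials):
    total = Counter()
    for p in partials:
        total.update(p)   # final reduce merges map-side partials
    return total

partials = [mapper_with_combiner(d) for d in ["a a b", "a b b"]]
totals = reduce_partials(partials)
```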

Slide 6

No content

Slide 7

Clojure:  (reduce + (map #(+ % 2) (range 0 10)))
Scala:    (0 until 10).map(_ + 2).reduce(_ + _)
C#:       Enumerable.Range(0, 10).Select(x => x + 2).Aggregate(0, (acc, x) => acc + x);
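The same map-then-reduce expression in Python, using the exclusive range 0..9 as in the Clojure and C# versions:

```python
# Same pipeline in Python: add 2 to each of 0..9, then sum.
from functools import reduce

result = reduce(lambda acc, x: acc + x,
                map(lambda x: x + 2, range(0, 10)))
```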

Slide 8

Sort is shuffle 
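Why sort implies a shuffle: rows must first be moved into key-range partitions (so partition i holds only keys below partition i+1), and only then can each partition be sorted locally. A toy sketch, with the range boundaries chosen by hand (a real engine such as Spark samples the data to pick them):

```python
# Toy illustration of sort-as-shuffle: range-partition the rows,
# then sort each partition locally. Concatenating the partitions
# in order yields a globally sorted result.

def shuffle_then_sort(rows, boundaries):
    parts = [[] for _ in range(len(boundaries) + 1)]
    for r in rows:
        # range partitioning: route each row to its key range
        idx = sum(r > b for b in boundaries)
        parts[idx].append(r)
    return [sorted(p) for p in parts]  # local sort per partition

parts = shuffle_then_sort([9, 1, 7, 3, 5], boundaries=[4])
```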


Slide 9

Microsoft DryadLINQ / LINQ to HPC (2009-2011)
• DAG
• Compiled to an execution plan, not run directly
https://www.microsoft.com/en-us/research/project/dryadlinq/

Slide 10

AMPLab's Spark
• Born as a way to test Mesos
• Open sourced in 2010

Slide 11

import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql._

case class Data(InvoiceNo: String, StockCode: String, Description: String,
                Quantity: Long, InvoiceDate: String, UnitPrice: Double,
                CustomerID: String, Country: String)
val schema = Encoders.product[Data].schema

val df = spark.read.option("header", true).schema(schema).csv("./data.csv")
val clean = df.na.drop(Seq("CustomerID")).dropDuplicates()

val data = clean
  .withColumn("total", when($"StockCode" =!= "D", $"UnitPrice" * $"Quantity").otherwise(0))
  .withColumn("Discount", when($"StockCode" === "D", $"UnitPrice" * $"Quantity").otherwise(0))
  .withColumn("Postage", when($"StockCode" === "P", 1).otherwise(0))
  .withColumn("Invoice", regexp_replace($"InvoiceNo", "^C", ""))
  .withColumn("Cancelled", when(substring($"InvoiceNo", 0, 1) === "C", 1).otherwise(0))

val aggregated = data.groupBy($"Invoice", $"Country", $"CustomerID")
  .agg(sum($"Discount").as("Discount"), sum($"total").as("Total"),
       max($"Cancelled").as("Cancelled"))

val customers = aggregated.groupBy($"CustomerID")
  .agg(sum($"Total").as("Total"), sum($"Discount").as("Discount"),
       sum($"Cancelled").as("Cancelled"), count($"Invoice").as("Invoices"))

import org.apache.spark.ml.feature.VectorAssembler
val assembler = new VectorAssembler()
  .setInputCols(Array("Total", "Discount", "Cancelled", "Invoices"))
  .setOutputCol("features")
val features = assembler.transform(customers)

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator

val Array(test, train) = features.randomSplit(Array(0.3, 0.7))
val kmeans = new KMeans().setK(12).setFeaturesCol("features").setPredictionCol("prediction")
val model = kmeans.fit(train)
model.clusterCenters.foreach(println)
val predictions = model.transform(test)

Slide 12

Spark Component

Slide 13

Resilient Distributed Dataset

Slide 14

DataFrame and Dataset
• Higher abstraction
• More like a database table than an array
• Adds optimizers

Slide 15

Parsed plan → Logical plan → Optimized plan → Physical plan
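A hypothetical miniature of what the optimizer stage does: rewrite the plan tree, for example by pushing a filter beneath a projection so fewer rows flow upward. The names and plan structure below are illustrative only, not Catalyst's actual classes:

```python
# Toy logical plan as nested tuples: ("filter"/"project"/"scan", ...).
# The rule pushes a filter below a projection - the kind of rewrite
# an optimizer performs between the parsed and optimized plan.

def push_down_filter(plan):
    if plan[0] == "filter" and plan[2][0] == "project":
        pred, (_, cols, child) = plan[1], plan[2]
        return ("project", cols, ("filter", pred, child))
    return plan

parsed = ("filter", "price > 10", ("project", ["price"], ("scan", "t")))
optimized = push_down_filter(parsed)
```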

Slide 16

Spark UI

Slide 17

With batch, all the data is already there.
https://www2.slideshare.net/VadimSolovey/dataflow-a-unified-model-for-batch-and-streaming-data-processing

Slide 18

Streaming – event by event

Slide 19

No content

Slide 20

Streaming challenges
Watermarks describe event-time progress.
Events earlier than the watermark are ignored (a watermark that lags too far adds delay; one that advances too fast produces more late events).
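A minimal sketch of the watermark mechanics, assuming the common scheme where the watermark trails the maximum event time seen by a fixed delay:

```python
# Sketch of watermark handling: the watermark trails the max event
# time by `delay`; events with timestamps at or below the current
# watermark are treated as too late and dropped.

def process(events, delay):
    watermark = float("-inf")
    accepted, dropped = [], []
    for ts, payload in events:
        if ts <= watermark:
            dropped.append(payload)        # arrived behind the watermark
        else:
            accepted.append(payload)
            watermark = max(watermark, ts - delay)
    return accepted, dropped

accepted, dropped = process([(10, "a"), (12, "b"), (3, "late")], delay=5)
```

Tuning `delay` is exactly the trade-off on the slide: a larger delay tolerates more lateness but holds results back longer.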

Slide 21

• Spark Streaming
• Spark Structured Streaming (unified code for batch & streaming)
• Demo - Dimensio

Slide 22

Data warehouse ? Data lake ? Lakehouse??! 
 Oh my

Slide 23

Enterprise data warehouse: Landing → ODS → DW → Data marts

Slide 24

Newfangled “Medallion architecture”
Landing → Bronze → Silver → Gold (aggregated)
(Delta Live Tables)

Slide 25

Block Storage (e.g. EBS) vs Object Storage (e.g. S3)
• How data is stored: blocks (like local disks) vs objects with metadata
• Structure: none (the OS layer provides order) vs flat with buckets (can be hierarchical on Azure)
• Scalability: TBs+, limited IO vs PBs+, almost unlimited (limits are usually per prefix)
• Latency: very low vs low
• Cost: high vs cost-effective*

Slide 26

Parquet
https://www.oreilly.com/library/view/operationalizing-the-data/9781492049517/

Slide 27

https://dkharazi.github.io/blog/parquet

Slide 28

Delta

Slide 29

Caveat emptor: the fine print
• Also bugs (Spark 1.6)

Slide 30

Bugs.. • https://issues.apache.org/jira/browse/SPARK-8406

Slide 31

Long DAGs

Slide 32

Data Skew
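One common mitigation worth sketching is "salting": spreading a hot key's rows across several partitions by appending a synthetic sub-key (the salted partials are combined in a second aggregation step). The `bucket` function below is a deterministic stand-in for a hash partitioner, purely for illustration:

```python
def bucket(x, n):
    # stand-in for a hash partitioner (deterministic for the sketch)
    return x % n

def partition_sizes(keys, n_parts, salt=1):
    sizes = [0] * n_parts
    for i, k in enumerate(keys):
        # salt > 1 spreads a single key over `salt` synthetic sub-keys
        sizes[bucket(k * 31 + i % salt, n_parts)] += 1
    return sizes

keys = [7] * 1000 + [1, 2, 3]              # one hot key dominates
skewed = partition_sizes(keys, 4)          # hot key lands in one partition
salted = partition_sizes(keys, 4, salt=4)  # hot key spread across partitions
```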

Slide 33

Understanding what’s running on the driver and what’s running on the executor

Slide 34

Moving from local dev to running on all data

Slide 35

Big Data 2004 != Big Data 2023 1.1 Billion Taxi rides https://tech.marksblogg.com/benchmarks.html

Slide 36

Let’s play

Slide 37

Let’s play
• Image
docker run -p 8888:8888 -p 4040:4040 -v /$HOME/projects/spark-demo:/home/jovyan/work/demo jupyter/pyspark-notebook:latest
docker run -it -v /$HOME/projects/spark-demo:/opt/spark/work-dir/demo -p 4040:4040 apache/spark:3.4.1-scala2.12-java11-python3-r-ubuntu /opt/spark/bin/pyspark
• Data
https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
• Resources
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html
https://sparkbyexamples.com/pyspark-tutorial/

Slide 38

No content

Slide 39

Spark
• Lots of things out of the box
• Batch (RDD, DataFrames, Datasets)
• Streaming
• Structured Streaming (unifies batch and streaming)
• Graph
• (“Classic”) ML
• Runs on Hadoop, Mesos, Kubernetes, Standalone, or managed (Databricks)

Slide 40

Extensible
• Spark NLP - John Snow Labs
• Blaze (OSS) / Photon (Databricks) native execution engines
• Spark Deep Learning - Databricks, Intel (BigDL), DeepLearning4j, H2O
• Connectors to any DB that respects itself
• (Even Hades ☺)