
Apache Spark - Overview

Lap around Apache Spark capabilities, history and design

Arnon Rotem-Gal-Oz

May 20, 2022


Transcript

  1. 2004 MapReduce: Simplified Data Processing on Large Clusters
     Jeff Dean, Sanjay Ghemawat, Google, Inc.
     https://research.google.com/archive/mapreduce-osdi04-slides/index.html
  2. • Re-execute on failure
     • Skip bad records
     • Redundant execution (copies of tasks)
     • Data locality optimization
     • Combiners (map-side reduce; see the sketch below)
     • Compression of data
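     Combiners are worth a concrete illustration. Spark's RDD API carries the same idea forward:
     reduceByKey merges values on each mapper before the shuffle, much like a MapReduce combiner.
     A minimal sketch (the input path is an illustrative assumption):

     val counts = sc.textFile("input.txt")
       .flatMap(_.split("\\s+"))
       .map(word => (word, 1))
       .reduceByKey(_ + _)   // merged map-side before shuffling, like a combiner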
  3. The same computation expressed in three languages:
     Clojure: (reduce + (map #(+ % 2) (range 0 10)))
     Scala:   (0 until 10).map(_ + 2).reduce(_ + _)
     C#:      Enumerable.Range(0, 10).Select(x => x + 2).Aggregate(0, (acc, x) => acc + x);
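     For comparison, a minimal sketch of the same map/reduce shape in Spark's RDD API (assuming a
     live SparkContext named sc, e.g. in spark-shell):

     sc.parallelize(0 until 10).map(_ + 2).reduce(_ + _)   // 65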
  4. Microsoft DryadLINQ / LINQ to HPC (2009-2011)
     • DAG execution model
     • Queries are compiled, not run directly
     https://www.microsoft.com/en-us/research/project/dryadlinq/
  5. import spark.implicits._
     import org.apache.spark.sql.functions._
     import org.apache.spark.sql._

     case class Data(InvoiceNo: String, StockCode: String, Description: String,
                     Quantity: Long, InvoiceDate: String, UnitPrice: Double,
                     CustomerID: String, Country: String)

     // Derive the schema from the case class instead of inferring it from the CSV
     val schema = Encoders.product[Data].schema
     val df = spark.read.option("header", true).schema(schema).csv("./data.csv")

     // Drop rows with no customer, then remove duplicates
     val clean = df.na.drop(Seq("CustomerID")).dropDuplicates()

     // Per-line features: totals, discounts, postage flag, cancellations
     val data = clean
       .withColumn("total", when($"StockCode" =!= "D", $"UnitPrice" * $"Quantity").otherwise(0))
       .withColumn("Discount", when($"StockCode" === "D", $"UnitPrice" * $"Quantity").otherwise(0))
       .withColumn("Postage", when($"StockCode" === "P", 1).otherwise(0))
       .withColumn("Invoice", regexp_replace($"InvoiceNo", "^C", ""))
       .withColumn("Cancelled", when(substring($"InvoiceNo", 0, 1) === "C", 1).otherwise(0))

     // Roll up to invoice level, then to customer level
     val aggregated = data.groupBy($"Invoice", $"Country", $"CustomerID")
       .agg(sum($"Discount").as("Discount"), sum($"total").as("Total"),
            max($"Cancelled").as("Cancelled"))
     val customers = aggregated.groupBy($"CustomerID")
       .agg(sum($"Total").as("Total"), sum($"Discount").as("Discount"),
            sum($"Cancelled").as("Cancelled"), count($"Invoice").as("Invoices"))

     // Assemble the numeric columns into a single feature vector
     import org.apache.spark.ml.feature.VectorAssembler
     val assembler = new VectorAssembler()
       .setInputCols(Array("Total", "Discount", "Cancelled", "Invoices"))
       .setOutputCol("features")
     val features = assembler.transform(customers)

     // Cluster customers with k-means
     import org.apache.spark.ml.clustering.KMeans
     import org.apache.spark.ml.evaluation.ClusteringEvaluator
     val Array(test, train) = features.randomSplit(Array(0.3, 0.7))
     val kmeans = new KMeans().setK(12).setFeaturesCol("features").setPredictionCol("prediction")
     val model = kmeans.fit(train)
     model.clusterCenters.foreach(println)
     val predictions = model.transform(test)
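     The ClusteringEvaluator imported above goes unused on the slide; a minimal follow-up sketch
     to score the clustering (silhouette is the evaluator's default metric):

     val evaluator = new ClusteringEvaluator()
       .setFeaturesCol("features")
       .setPredictionCol("prediction")
     val silhouette = evaluator.evaluate(predictions)   // closer to 1.0 means tighter clusters
     println(s"Silhouette = $silhouette")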
  6. DataFrame and Dataset
     • Higher abstraction
     • More like a database table than an array
     • Adds optimizers (Catalyst)
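     Reusing df and the Data case class from slide 5, a minimal sketch of the two views over the
     same data; explain() exposes the optimizer at work:

     val ds: Dataset[Data] = df.as[Data]          // typed view: fields checked at compile time
     ds.filter(_.Quantity > 0)                    // lambda over the case class
     df.groupBy($"Country").count().explain()     // untyped API; prints the Catalyst-optimized plan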
  7. Streaming challenges
     Watermarks describe event-time progress. Events with an event time earlier than the watermark
     are ignored. The trade-off: a watermark that advances too slowly adds delay; one that advances
     too fast drops more late events.
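     A minimal sketch of declaring a watermark in Structured Streaming (the rate source and the
     window sizes are illustrative assumptions; window comes from org.apache.spark.sql.functions,
     imported on slide 5):

     val events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
     val counts = events
       .withWatermark("timestamp", "10 minutes")      // events >10 min behind the watermark are dropped
       .groupBy(window($"timestamp", "5 minutes"))
       .count()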
  8. Block Storage (e.g. EBS) vs. Object Storage (e.g. S3)
     How data is stored: Blocks (like local disks)      | Objects with metadata
     Structure:          None (OS layer provides order) | Flat with buckets (can be hierarchical on Azure)
     Scalability:        TBs+, limited IO               | PBs+, almost unlimited (limits are usually per prefix)
     Latency:            Very low                       | Low
     Cost:               High                           | Cost-effective*
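     To Spark, both sit behind the same filesystem abstraction; a minimal sketch of reading the
     same dataset from a local (block) path and from S3 (the bucket name and a configured s3a
     connector are assumptions):

     val local  = spark.read.parquet("/data/trips.parquet")       // block storage path
     val remote = spark.read.parquet("s3a://my-bucket/trips/")    // object storage via Hadoop's s3a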
  9. Big Data 2004 != Big Data 2023
     1.1 billion taxi rides: https://tech.marksblogg.com/benchmarks.html
  10. Let’s play
      • Image
        docker run -p 8888:8888 -p 4040:4040 -v /$HOME/projects/spark-demo:/home/jovyan/work/demo jupyter/pyspark-notebook:latest
        docker run -it -v /$HOME/projects/spark-demo:/opt/spark/work-dir/demo -p 4040:4040 apache/spark:3.4.1-scala2.12-java11-python3-r-ubuntu /opt/spark/bin/pyspark
      • Data
        https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
      • Resources
        https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html
        https://sparkbyexamples.com/pyspark-tutorial/
  11. Spark
      • Lots of things out of the box
      • Batch (RDD, DataFrames, Datasets)
      • Streaming
      • Structured Streaming (unifies batch and streaming)
      • Graph
      • (“Classic”) ML
      • Runs on Hadoop, Mesos, Kubernetes, standalone, or managed (Databricks)
  12. Extensible
      • Spark NLP - John Snow Labs
      • Blaze (OSS) / Photon (Databricks) native execution engines
      • Spark Deep Learning - Databricks, Intel (BigDL), DeepLearning4j, H2O
      • Connectors to any DB that respects itself
      • (Even Hades ☺)