
Apache Spark - Overview

Lap around Apache Spark capabilities, history and design

Arnon Rotem-Gal-Oz

May 20, 2022


Transcript

  1. 2004 MapReduce: Simplified Data Processing on Large Clusters
     Jeff Dean, Sanjay Ghemawat, Google, Inc.
     https://research.google.com/archive/mapreduce-osdi04-slides/index.html
  2. • Re-execute on failure
     • Skip bad records
     • Redundant execution (copies of tasks)
     • Data locality optimization
     • Combiners (map-side reduce; see the sketch below)
     • Compression of data
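     Combiners are worth a concrete illustration. Spark's RDD API carries the same idea forward:
     reduceByKey merges values on each mapper before the shuffle, much like a MapReduce combiner.
     A minimal sketch (the input path is an illustrative assumption):

     val counts = sc.textFile("input.txt")
       .flatMap(_.split("\\s+"))
       .map(word => (word, 1))
       .reduceByKey(_ + _)   // merged map-side before shuffling, like a combiner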
  3. The same computation expressed in three languages:
     Clojure: (reduce + (map #(+ % 2) (range 0 10)))
     Scala:   (0 until 10).map(_ + 2).reduce(_ + _)
     C#:      Enumerable.Range(0, 10).Select(x => x + 2).Aggregate(0, (acc, x) => acc + x);
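     For comparison, a minimal sketch of the same map/reduce shape in Spark's RDD API (assuming a
     live SparkContext named sc, e.g. in spark-shell):

     sc.parallelize(0 until 10).map(_ + 2).reduce(_ + _)   // 65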
  4. Microsoft DryadLINQ / LINQ to HPC (2009-2011)
     • DAG execution model
     • Queries are compiled, not run directly
     https://www.microsoft.com/en-us/research/project/dryadlinq/
  5. import spark.implicits._
     import org.apache.spark.sql.functions._
     import org.apache.spark.sql._

     case class Data(InvoiceNo: String, StockCode: String, Description: String,
                     Quantity: Long, InvoiceDate: String, UnitPrice: Double,
                     CustomerID: String, Country: String)

     // Derive the schema from the case class instead of inferring it from the CSV
     val schema = Encoders.product[Data].schema
     val df = spark.read.option("header", true).schema(schema).csv("./data.csv")

     // Drop rows with no customer, then remove duplicates
     val clean = df.na.drop(Seq("CustomerID")).dropDuplicates()

     // Per-line features: totals, discounts, postage flag, cancellations
     val data = clean
       .withColumn("total", when($"StockCode" =!= "D", $"UnitPrice" * $"Quantity").otherwise(0))
       .withColumn("Discount", when($"StockCode" === "D", $"UnitPrice" * $"Quantity").otherwise(0))
       .withColumn("Postage", when($"StockCode" === "P", 1).otherwise(0))
       .withColumn("Invoice", regexp_replace($"InvoiceNo", "^C", ""))
       .withColumn("Cancelled", when(substring($"InvoiceNo", 0, 1) === "C", 1).otherwise(0))

     // Roll up to invoice level, then to customer level
     val aggregated = data.groupBy($"Invoice", $"Country", $"CustomerID")
       .agg(sum($"Discount").as("Discount"), sum($"total").as("Total"),
            max($"Cancelled").as("Cancelled"))
     val customers = aggregated.groupBy($"CustomerID")
       .agg(sum($"Total").as("Total"), sum($"Discount").as("Discount"),
            sum($"Cancelled").as("Cancelled"), count($"Invoice").as("Invoices"))

     // Assemble the numeric columns into a single feature vector
     import org.apache.spark.ml.feature.VectorAssembler
     val assembler = new VectorAssembler()
       .setInputCols(Array("Total", "Discount", "Cancelled", "Invoices"))
       .setOutputCol("features")
     val features = assembler.transform(customers)

     // Cluster customers with k-means
     import org.apache.spark.ml.clustering.KMeans
     import org.apache.spark.ml.evaluation.ClusteringEvaluator
     val Array(test, train) = features.randomSplit(Array(0.3, 0.7))
     val kmeans = new KMeans().setK(12).setFeaturesCol("features").setPredictionCol("prediction")
     val model = kmeans.fit(train)
     model.clusterCenters.foreach(println)
     val predictions = model.transform(test)
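     The ClusteringEvaluator imported above goes unused on the slide; a minimal follow-up sketch
     to score the clustering (silhouette is the evaluator's default metric):

     val evaluator = new ClusteringEvaluator()
       .setFeaturesCol("features")
       .setPredictionCol("prediction")
     val silhouette = evaluator.evaluate(predictions)   // closer to 1.0 means tighter clusters
     println(s"Silhouette = $silhouette")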
  6. DataFrame and Dataset
     • Higher abstraction
     • More like a database table than an array
     • Adds optimizers (Catalyst)
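     Reusing df and the Data case class from slide 5, a minimal sketch of the two views over the
     same data; explain() exposes the optimizer at work:

     val ds: Dataset[Data] = df.as[Data]          // typed view: fields checked at compile time
     ds.filter(_.Quantity > 0)                    // lambda over the case class
     df.groupBy($"Country").count().explain()     // untyped API; prints the Catalyst-optimized plan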
  7. Streaming challenges
     Watermarks describe event-time progress. Events with an event time earlier than the watermark
     are ignored. The trade-off: a watermark that advances too slowly adds delay; one that advances
     too fast drops more late events.
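     A minimal sketch of declaring a watermark in Structured Streaming (the rate source and the
     window sizes are illustrative assumptions; window comes from org.apache.spark.sql.functions,
     imported on slide 5):

     val events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
     val counts = events
       .withWatermark("timestamp", "10 minutes")      // events >10 min behind the watermark are dropped
       .groupBy(window($"timestamp", "5 minutes"))
       .count()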
  8. Block Storage (e.g. EBS) vs. Object Storage (e.g. S3)
     How data is stored: Blocks (like local disks)      | Objects with metadata
     Structure:          None (OS layer provides order) | Flat with buckets (can be hierarchical on Azure)
     Scalability:        TBs+, limited IO               | PBs+, almost unlimited (limits are usually per prefix)
     Latency:            Very low                       | Low
     Cost:               High                           | Cost-effective*
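     To Spark, both sit behind the same filesystem abstraction; a minimal sketch of reading the
     same dataset from a local (block) path and from S3 (the bucket name and a configured s3a
     connector are assumptions):

     val local  = spark.read.parquet("/data/trips.parquet")       // block storage path
     val remote = spark.read.parquet("s3a://my-bucket/trips/")    // object storage via Hadoop's s3a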
  9. Big Data 2004 != Big Data 2023
     1.1 billion taxi rides: https://tech.marksblogg.com/benchmarks.html
  10. Let’s play
      • Image
        docker run -p 8888:8888 -p 4040:4040 -v /$HOME/projects/spark-demo:/home/jovyan/work/demo jupyter/pyspark-notebook:latest
        docker run -it -v /$HOME/projects/spark-demo:/opt/spark/work-dir/demo -p 4040:4040 apache/spark:3.4.1-scala2.12-java11-python3-r-ubuntu /opt/spark/bin/pyspark
      • Data
        https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
      • Resources
        https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html
        https://sparkbyexamples.com/pyspark-tutorial/
  11. Spark
      • Lots of things out of the box
      • Batch (RDD, DataFrames, Datasets)
      • Streaming
      • Structured Streaming (unifies batch and streaming)
      • Graph
      • (“Classic”) ML
      • Runs on Hadoop, Mesos, Kubernetes, standalone, or managed (Databricks)
  12. Extensible
      • Spark NLP - John Snow Labs
      • Blaze (OSS) / Photon (Databricks) native execution engines
      • Spark Deep Learning - Databricks, Intel (BigDL), DeepLearning4j, H2O
      • Connectors to any DB that respects itself
      • (Even Hades ☺)