Apache Spark—the light at the end of the tunnel?

Talk at 6th Data Science Day in Berlin, 2014-05-08

Michael Hausenblas

May 08, 2014

Transcript

  1. Apache Spark—the light at the end of the tunnel?
     Michael Hausenblas, Chief Data Engineer EMEA, MapR Technologies
     6th Data Science Day, Berlin, 2014-05-07
  2. Apache Spark
     • Originally developed in 2009 in UC Berkeley’s AMP Lab
     • Fully open sourced in 2010
     • Top-level Apache Project as of 2014
     https://spark.apache.org/
  3. Most Active Open Source Project in Big Data
     [Bar chart: project contributors in the past year, comparing Spark with Giraph, Storm, and Tez; axis 0–140]
  4. Easy & Fast Big Data
     Easy to develop: • Rich APIs in Java, Scala, Python • Interactive shell
     Fast to run: • General execution graphs • In-memory storage
  5. Easy & Fast Big Data
     Easy to develop (2-5× less code): • Rich APIs in Java, Scala, Python • Interactive shell
     Fast to run (up to 10× faster on disk, 100× in memory): • General execution graphs • In-memory storage
     https://amplab.cs.berkeley.edu/benchmark/
  6. Easy: Get Started Immediately • Multi-language support • Interactive shell
     Python:
       lines = sc.textFile(...)
       lines.filter(lambda s: "ERROR" in s).count()
     Scala:
       val lines = sc.textFile(...)
       lines.filter(x => x.contains("ERROR")).count()
     Java:
       JavaRDD<String> lines = sc.textFile(...);
       lines.filter(new Function<String, Boolean>() {
         public Boolean call(String s) {
           return s.contains("ERROR");
         }
       }).count();
  7. Easy: Clean API
     Resilient Distributed Datasets (RDDs)
     • Collections of objects spread across a cluster, stored in RAM or on disk
     • Built through parallel transformations
     • Automatically rebuilt on failure
     Operations
     • Transformations (e.g. map, filter, groupBy)
     • Actions (e.g. count, collect, save)
     Write programs in terms of transformations on distributed datasets (see the sketch below).
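     A minimal sketch of the transformation/action split described above, assuming an existing SparkContext `sc`; the "hdfs://..." path is a placeholder, as on the slides:
       // Transformations are lazy; actions trigger execution.
       val lines  = sc.textFile("hdfs://...")             // base RDD
       val errors = lines.filter(_.contains("ERROR"))     // transformation: nothing runs yet
       errors.cache()                                     // keep the filtered partitions in memory
       val total  = errors.count()                        // action: materializes (and caches) the RDD
       val sample = errors.take(5)                        // second action: reuses the cached partitions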
  8. Easy: Expressive API
     map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
     (A short example combining a few of these operators follows.)
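     A hedged sketch exercising a few of the listed operators (map, reduceByKey, join, collect), again assuming an existing SparkContext `sc`; the data is made up:
       val visits = sc.parallelize(Seq(("index.html", "1.2.3.4"),
                                       ("about.html", "3.4.5.6"),
                                       ("index.html", "1.3.3.1")))
       val pageNames = sc.parallelize(Seq(("index.html", "Home"),
                                          ("about.html", "About")))

       val counts = visits.map { case (url, _) => (url, 1) }   // map to (key, 1) pairs
                          .reduceByKey(_ + _)                  // per-URL visit counts
       val joined = counts.join(pageNames)                     // ("index.html", (2, "Home")), ...
       joined.collect().foreach(println)                       // action: bring results to the driver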
  9. Easy: Example – Word Count
     Hadoop MapReduce:
       public static class WordCountMapClass extends MapReduceBase
           implements Mapper<LongWritable, Text, Text, IntWritable> {

         private final static IntWritable one = new IntWritable(1);
         private Text word = new Text();

         public void map(LongWritable key, Text value,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
           String line = value.toString();
           StringTokenizer itr = new StringTokenizer(line);
           while (itr.hasMoreTokens()) {
             word.set(itr.nextToken());
             output.collect(word, one);
           }
         }
       }

       public static class WordCountReduce extends MapReduceBase
           implements Reducer<Text, IntWritable, Text, IntWritable> {

         public void reduce(Text key, Iterator<IntWritable> values,
                            OutputCollector<Text, IntWritable> output,
                            Reporter reporter) throws IOException {
           int sum = 0;
           while (values.hasNext()) {
             sum += values.next().get();
           }
           output.collect(key, new IntWritable(sum));
         }
       }
     Spark:
       val spark = new SparkContext(master, appName, [sparkHome], [jars])
       val file = spark.textFile("hdfs://...")
       val counts = file.flatMap(line => line.split(" "))
                        .map(word => (word, 1))
                        .reduceByKey(_ + _)
       counts.saveAsTextFile("hdfs://...")
  10. Easy: Example – Word Count
      val spark = new SparkContext(master, appName, [sparkHome], [jars])
      val file = spark.textFile("hdfs://...")
      val counts = file.flatMap(line => line.split(" "))
                       .map(word => (word, 1))
                       .reduceByKey(_ + _)
      counts.saveAsTextFile("hdfs://...")
  11. Easy: Works Well With Hadoop
      Data Compatibility
      • Access your existing Hadoop data (HDFS, HBase, S3, etc.)
      • Use the same data formats
      • Adheres to data locality for efficient processing
      Deployment Models
      • “Standalone” deployment
      • YARN-based deployment
      • Mesos-based deployment
      • Deploy on an existing Hadoop cluster or side-by-side
      (A sketch of how the deployment choice appears in code follows.)
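     As referenced above, a hedged sketch of how the deployment model surfaces in application code: it is essentially the master URL handed to the SparkContext. Host names and ports below are placeholders.
       import org.apache.spark.{SparkConf, SparkContext}

       val conf = new SparkConf().setAppName("deployment-demo")
         // .setMaster("local[4]")               // local testing
         // .setMaster("spark://master:7077")    // "standalone" Spark cluster
         // .setMaster("mesos://master:5050")    // Mesos-based deployment
         .setMaster("yarn-client")               // YARN-based deployment (client mode)
       val sc = new SparkContext(conf)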
  12. Example: Logistic Regression
      data = spark.textFile(...).map(readPoint).cache()
      w = numpy.random.rand(D)
      for i in range(iterations):
          gradient = data.map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x) \
                         .reduce(lambda x, y: x + y)
          w -= gradient
      print "Final w: %s" % w
  13. Fast: Logistic Regression Performance
      [Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark. Hadoop: 110 s per iteration; Spark: 80 s for the first iteration, 1 s for further iterations]
  14. Fast: Using RAM, Operator Graphs
      In-memory caching • Data partitions read from RAM instead of disk
      Operator graphs • Scheduling optimizations • Fault tolerance
      [Diagram: an operator DAG (map, join, filter, groupBy) over RDDs A–F, split into Stages 1–3, with cached partitions marked]
  15. Easy: Fault Recovery
      RDDs track lineage information that can be used to efficiently recompute lost data
      msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
                     .map(lambda s: s.split("\t")[2])
      [Lineage diagram: HDFS file → filtered RDD (filter, func = startswith(...)) → mapped RDD (map, func = split(...))]
  16. Easy: Unified Platform
      Spark (general execution engine) with Spark SQL (SQL), Spark Streaming (streaming), MLlib (machine learning), and GraphX (graph computation) on top
      Continued innovation bringing new functionality, e.g.:
      • BlinkDB (approximate queries)
      • SparkR (R wrapper for Spark)
      • Tachyon (off-heap RDD caching)
  17. [Stack diagram: Spark (general execution engine) with Spark SQL (SQL), Spark Streaming (streaming), MLlib (machine learning), and GraphX (graph computation) on top]
  18. Hive Compatibility
      Interfaces to access data and code in the Hive ecosystem:
      • Support for writing queries in HQL
      • Catalog that interfaces with the Hive MetaStore
      • Tablescan operator that uses Hive SerDes
      • Wrappers for Hive UDFs, UDAFs, UDTFs
      (A small HiveContext sketch follows.)
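     A hedged sketch of the Hive integration, assuming an existing SparkContext `sc` and the HiveContext of the early Spark SQL releases, which exposed HiveQL through an `hql` method; the `users` table is hypothetical:
       import org.apache.spark.sql.hive.HiveContext

       val hiveCtx = new HiveContext(sc)
       // Query a table registered in the Hive MetaStore using HiveQL,
       // reading the data through its Hive SerDe.
       val adults = hiveCtx.hql("SELECT name, age FROM users WHERE age >= 18")
       adults.collect().foreach(println)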
  19. Parquet Support
      Native support for reading data stored in Parquet:
      • Columnar storage avoids reading unneeded data
      • Currently only supports flat structures (nested data on the short-term roadmap)
      • RDDs can be written to Parquet files, preserving the schema
      http://parquet.io/
      (A short read/write sketch follows.)
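     A hedged read/write sketch against the early Spark SQL Parquet API, assuming an existing SparkContext `sc`; the case class, paths, and data are made up:
       import org.apache.spark.sql.SQLContext

       case class Person(name: String, age: Int)

       val sqlCtx = new SQLContext(sc)
       import sqlCtx.createSchemaRDD          // implicit conversion RDD[Product] -> SchemaRDD

       val people = sc.parallelize(Seq(Person("Ada", 36), Person("Linus", 44)))
       people.saveAsParquetFile("hdfs://.../people.parquet")   // schema is preserved in the files

       val loaded = sqlCtx.parquetFile("hdfs://.../people.parquet")
       loaded.registerAsTable("people")
       sqlCtx.sql("SELECT name FROM people WHERE age > 40").collect().foreach(println)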
  20. Mixing SQL and Machine Learning
      val trainingDataTable = sql("""
        SELECT e.action, u.age, u.latitude, u.longitude
        FROM Users u
        JOIN Events e ON u.userId = e.userId""")
      // Since `sql` returns an RDD, the results can easily be used in MLlib
      val trainingData = trainingDataTable.map { row =>
        val features = Array[Double](row(1), row(2), row(3))
        LabeledPoint(row(0), features)
      }
      val model = new LogisticRegressionWithSGD().run(trainingData)
  21. Relationship to Shark
      Borrows:
      • Hive data loading code / in-memory columnar representation
      • Hardened Spark execution engine
      Adds:
      • RDD-aware optimizer / query planner
      • Execution engine
      • Language interfaces
      Catalyst/Spark SQL is a nearly-from-scratch rewrite that leverages the best parts of Shark
  22. Spark Streaming
      Run a streaming computation as a series of very small, deterministic batch jobs
      • Chop up the live stream into batches of ½ second or more and leverage RDDs for micro-batch processing
      • Use the same familiar Spark APIs to process streams
      • Combine your batch and online processing in a single system
      • Guarantee exactly-once semantics
      [Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]
      (A minimal streaming word-count sketch follows.)
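     As referenced above, a minimal word-count sketch of the micro-batch model; host and port are placeholders:
       import org.apache.spark.SparkConf
       import org.apache.spark.streaming.{Seconds, StreamingContext}
       import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations

       val conf = new SparkConf().setAppName("streaming-wordcount").setMaster("local[2]")
       val ssc  = new StreamingContext(conf, Seconds(1))       // 1-second micro-batches

       val lines  = ssc.socketTextStream("localhost", 9999)    // live text stream over TCP
       val counts = lines.flatMap(_.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _)                   // same RDD-style API, applied per batch
       counts.print()

       ssc.start()
       ssc.awaitTermination()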
  23. Window-based Transformations
      val tweets = ssc.twitterStream()
      val hashTags = tweets.flatMap(status => getTags(status))
      val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()
      [Diagram: sliding window operation over a DStream of data, showing window length and sliding interval]
  24. MLlib – Machine Learning library
      Classification: Logistic Regression, Linear SVM (+L1, L2), Decision Trees, Naive Bayes
      Regression: Linear Regression (+Lasso, Ridge)
      Collaborative Filtering: Alternating Least Squares
      Clustering / Exploration: K-Means, SVD
      Optimization Primitives: SGD, Parallel Gradient
      Interoperability: Scala, Java, PySpark (0.9)
      (A small training sketch follows.)
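     A hedged training sketch against the 0.9-era MLlib API (where feature vectors are plain Array[Double]), assuming an existing SparkContext `sc`; the data is made up:
       import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
       import org.apache.spark.mllib.regression.LabeledPoint

       val training = sc.parallelize(Seq(
         LabeledPoint(1.0, Array(2.0, 1.5)),
         LabeledPoint(0.0, Array(0.5, 0.2))
       )).cache()                                                  // iterative SGD benefits from caching

       val model = LogisticRegressionWithSGD.train(training, 100)  // 100 iterations
       val label = model.predict(Array(1.0, 1.0))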
  25. The GraphX Unified Approach
      Enabling users to easily and efficiently express the entire graph analytics pipeline
      • New API: blurs the distinction between tables and graphs
      • New system: combines data-parallel and graph-parallel systems
      (A tiny GraphX sketch follows.)
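     A hedged GraphX sketch, assuming an existing SparkContext `sc`: the vertex and edge collections are the "table" view, PageRank is the graph-parallel computation, and the final join goes back to a table-style result; the data is made up:
       import org.apache.spark.graphx.{Edge, Graph}

       val users   = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
       val follows = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

       val graph = Graph(users, follows)                    // build the property graph
       val ranks = graph.pageRank(0.001).vertices           // graph-parallel: PageRank per vertex
       ranks.join(users).collect().foreach(println)         // data-parallel: join ranks back to names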
  26. Interactive Exploratory Analytics
      • Leverage Spark’s in-memory caching and efficient execution to explore large distributed datasets
      • Use Spark’s APIs to explore any kind of data (structured, unstructured, semi-structured, etc.) and combine programming models
      • Execute arbitrary code using a fully-functional interactive programming environment
      • Connect external tools via SQL drivers
  27. Machine Learning
      • Improve performance of iterative algorithms by caching frequently accessed datasets
      • Develop programs that are easy to reason about, using a fully-capable functional programming style
      • Refine algorithms using the interactive REPL
      • Use carefully curated algorithms out of the box with MLlib
  28. Power Real-time Dashboards
      • Use Spark Streaming to perform low-latency window-based aggregations
      • Combine offline models with streaming data for online clustering and classification within the dashboard
      • Use Spark’s core APIs and/or Spark SQL to give users large-scale, low-latency drill-down capabilities in exploring dashboard data
  29. Faster ETL
      • Leverage Spark’s optimized scheduling for more efficient I/O on large datasets, and in-memory processing for aggregations, shuffles, and more
      • Use Spark SQL to perform ETL using a familiar SQL interface
      • Easily port Pig scripts to Spark’s API
      • Run existing Hive queries directly on Spark SQL or Shark
  30. Lambda Architecture
      [Diagram: new data is streamed to both the batch layer and the speed layer. The batch layer holds the immutable master data and precomputes batch views (batch recompute); the speed layer processes the stream and increments real-time views; the serving layer merges batch views and real-time views to answer queries]