Slide 1

Slide 1 text

The Spark Ecosystem
Fast and Expressive Big Data Analytics in Scala
Matei Zaharia and Reynold Xin
University of California, Berkeley
www.spark-project.org

Slide 2

Slide 2 text

What is Spark?
Fast and expressive cluster computing system interoperable with Apache Hadoop
Improves efficiency through:
» In-memory computing primitives
» General computation graphs
Improves usability through:
» Rich APIs in Scala, Java, Python
» Interactive shell
Up to 100× faster (2-10× on disk); often 5× less code

Slide 3

Slide 3 text

Project History
Spark started in 2009, open sourced in 2010
In use at Intel, Yahoo!, Adobe, Quantifind, Conviva, Ooyala, Bizo and others
17 companies now contributing code

Slide 4

Slide 4 text

A Growing Stack
Part of the Berkeley Data Analytics Stack (BDAS) project to build an open source next-gen analytics system
(Stack diagram: Spark at the core, with Shark (SQL), Spark Streaming (real-time), GraphX (graph), MLbase (machine learning), … on top.)

Slide 5

Slide 5 text

This Talk
Spark introduction & use cases
GraphX: graph computation
Shark: SQL over Spark
See tomorrow for a talk on Streaming!

Slide 6

Slide 6 text

Why a New Programming Model?
MapReduce greatly simplified big data analysis
But as soon as it got popular, users wanted more:
» More complex, multi-pass analytics (e.g. ML, graph)
» More interactive ad-hoc queries
» More real-time stream processing
All 3 need faster data sharing across parallel jobs

Slide 7

Slide 7 text

Data Sharing in MapReduce
(Diagram: iterative jobs write to and re-read from HDFS between each iteration; separate queries each re-read the input from HDFS.)
Slow due to replication, serialization, and disk IO

Slide 8

Slide 8 text

Data Sharing in Spark
(Diagram: after one-time processing of the input, iterations and queries share data through distributed memory.)
10-100× faster than network and disk

Slide 9

Slide 9 text

Spark Programming Model
Key idea: resilient distributed datasets (RDDs)
» Distributed collections of objects that can be cached in memory across the cluster
» Manipulated through parallel operators
» Automatically recomputed on failure
Programming interface
» Functional APIs in Scala, Java, Python
» Interactive use from the Scala shell

Slide 10

Slide 10 text

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")          // base RDD
errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
messages = errors.map(_.split('\t')(2))
messages.cache()

messages.filter(_.contains("foo")).count      // action
messages.filter(_.contains("bar")).count
. . .

(Diagram: the driver ships tasks to workers and collects results; each worker scans an HDFS block and caches its partition of messages in memory.)

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)

Slide 11

Slide 11 text

Fault Tolerance
RDDs track the series of transformations used to build them (their lineage) to recompute lost data
E.g.:

messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))

(Lineage diagram: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…)))
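
In later Spark releases you can print this lineage directly from the shell with toDebugString (a sketch reusing the slide's example; the exact output format varies by version):

// Build the same chain as above, then dump its lineage.
val messages = spark.textFile("hdfs://...")
                    .filter(_.contains("error"))
                    .map(_.split('\t')(2))
println(messages.toDebugString)  // shows MappedRDD <- FilteredRDD <- HadoopRDD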

Slide 12

Slide 12 text

Example: Logistic Regression
Goal: find best line separating two sets of points
(Diagram: points labeled + and –; a random initial line is iteratively adjusted toward the target separating line.)

Slide 13

Slide 13 text

Example:  Logistic  Regression   val data = spark.textFile(...).map(readPoint).cache() var w = Vector.random(D) for (i <- 1 to ITERATIONS) { val gradient = data.map(p => (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x ).reduce(_ + _) w -= gradient } println("Final w: " + w)

Slide 14

Slide 14 text

Logistic Regression Performance
(Chart: running time (s) vs. number of iterations, 1 to 30, for Hadoop and Spark.)
Hadoop: 110 s / iteration
Spark: first iteration 80 s, further iterations 1 s

Slide 15

Slide 15 text

Demo  

Slide 16

Slide 16 text

Supported  Operators   map filter groupBy sort union join leftOuterJoin rightOuterJoin reduce count fold reduceByKey groupByKey cogroup cross zip sample take first partitionBy mapWith pipe save ...

Slide 17

Slide 17 text

Other Engine Features
General operator graphs (e.g. map-reduce-reduce)
Hash-based reduces (faster than Hadoop's sort)
Controlled data partitioning to lower communication
(Chart: PageRank iteration time — Hadoop 171 s, basic Spark 72 s, Spark + controlled partitioning 23 s.)
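
A sketch of controlled partitioning (hypothetical data; HashPartitioner lived in the spark package in early releases, org.apache.spark later). Hash-partitioning the link table once and caching it lets PageRank's repeated joins reuse the same placement instead of reshuffling every iteration:

import spark.HashPartitioner

// Partition links by URL once; cached partitions stay put across iterations.
val links = spark.textFile("hdfs://.../links.tsv")
                 .map { line => val f = line.split('\t'); (f(0), f(1)) }
                 .groupByKey(new HashPartitioner(100))
                 .cache()
// ranks inherits the partitioner, so joins against links avoid a shuffle.
var ranks = links.mapValues(_ => 1.0)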

Slide 18

Slide 18 text

Spark Community
700+ meetup members
30+ external contributors
17 companies contributing

Slide 19

Slide 19 text

This Talk
Spark introduction & use cases
GraphX: graph computation
Shark: SQL over Spark

Slide 20

Slide 20 text

Graphs are Essential to Data Mining
Identify influential people and information
Find communities
Target ads and products
Model complex data dependencies

Slide 21

Slide 21 text

Specialized Graph Systems
(Slide shows logos of specialized graph systems, e.g. Pregel.)

Slide 22

Slide 22 text

Specialized Graph Systems
1. APIs to capture complex dependencies (i.e. graph parallelism vs. data parallelism)
2. Exploit graph structure to reduce communication and computation
(Diagram: a small example graph with vertices A-F.)

Slide 23

Slide 23 text

How is GraphX different from ____?
Answer: Simplicity

Slide 24

Slide 24 text

Simplicity
Integration with Spark: no disparate system
» ETL (extract, transform, load)
» Consumption of graph output
» Fault tolerance
» Use the Scala REPL for interactive graph mining
Programmability: leveraging the Scala/Spark API
» Implemented GraphLab / Pregel APIs in 20 lines of code
» PageRank in 5 lines of code

Slide 25

Slide 25 text

Resilient Distributed Graphs
An extension of Spark RDDs
» Immutable, partitioned set of vertices and edges
» Constructed using RDD[Edge] and RDD[Vertex]
Additional set of primitives (3 functions) for graph computations
» Able to express most graph algorithms (PageRank, Shortest Path, Connected Components, ALS, …)
» Implemented GraphLab / Pregel in 20 lines of code

Slide 26

Slide 26 text

vertices = spark.textFile("hdfs://path/pages.csv")
edges = spark.textFile("hdfs://path/to/links.csv")
             .map(line => new Edge(line.split('\t')))
g = new Graph(vertices, edges).cache

println(g.vertices.count)
println(g.edges.count)

g1 = g.filterVertices(_.split('\t')(2) == "Berkeley")

ranks = Analytics.pageRank(g1, numIter = 10)
println(ranks.vertices.sum)

Slide 27

Slide 27 text

ranks = Analytics.pageRank(g1, numIter = 10)
println(ranks.vertices.sum)

Slide 28

Slide 28 text

GraphX
(Stack diagram: Resilient Distributed Graph at the base; the Pregel and GraphLab APIs on top of it; algorithms built on those: PageRank, Shortest Path, Connected Components, ALS.)

Slide 29

Slide 29 text

Early Performance
Benefits from Spark's:
» In-memory caching
» Hash-based operators
» Controlled data partitioning
(Chart: PageRank on 16 nodes — Hadoop 1340 s vs. GraphX 165 s.)
Alpha coming in June / July!

Slide 30

Slide 30 text

This Talk
Spark introduction & use cases
GraphX: graph computation
Shark: SQL over Spark

Slide 31

Slide 31 text

What is Shark?
Columnar SQL analytics engine for Spark
» Supports both SQL and complex analytics
» Up to 100× faster than Apache Hive
Compatible with Apache Hive
» HiveQL, UDF/UDAF, SerDes, scripts
» Runs on existing Hive warehouses
In use at Yahoo! for fast in-memory OLAP

Slide 32

Slide 32 text

Performance
(Chart: runtime in seconds on queries Q1-Q4 for Shark in-memory (1.1 s, 0.8 s, 0.7 s, 1.0 s), Shark on disk, and Hive.)
1.7 TB warehouse data on 100 EC2 nodes

Slide 33

Slide 33 text

Spark Integration
Unified system for SQL, graph processing, machine learning
All share the same set of workers and caches

def logRegress(points: RDD[Point]): Vector = {
  var w = Vector(D, _ => 2 * rand.nextDouble - 1)
  for (i <- 1 to ITERATIONS) {
    val gradient = points.map { p =>
      val denom = 1 + exp(-p.y * (w dot p.x))
      (1 / denom - 1) * p.y * p.x
    }.reduce(_ + _)
    w -= gradient
  }
  w
}

val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid = u.uid")

val features = users.mapRows { row =>
  new Vector(extractFeature1(row.getInt("age")),
             extractFeature2(row.getStr("country")), ...)
}

val trainedVector = logRegress(features.cache())

Slide 34

Slide 34 text

Teaser: Spark Streaming

sc.twitterStream(...)
  .flatMap(_.getText.split(" "))
  .map(word => (word, 1))
  .reduceByWindow("5s", _ + _)

Come see our talk tomorrow at 2:30!

Slide 35

Slide 35 text

Getting Started
Visit www.spark-project.org for:
» Video tutorials
» Online exercises (EC2)
» Docs and API guides
Easy to run in local mode, standalone clusters, Apache Mesos, YARN or EC2
Training camp at Berkeley in August
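
For instance, a minimal local-mode program (a sketch against the early, pre-SparkConf API, where SparkContext took a master string and an app name; the class lived in the spark package then, org.apache.spark later):

import spark.SparkContext

object LocalApp {
  def main(args: Array[String]) {
    // "local[4]" runs Spark in-process with 4 worker threads - no cluster needed.
    val sc = new SparkContext("local[4]", "LocalApp")
    println(sc.parallelize(1L to 1000000L).reduce(_ + _))
    sc.stop()
  }
}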

Slide 36

Slide 36 text

Conclusion
Big data analytics is evolving to include:
» More complex analytics (e.g. machine learning)
» More interactive ad-hoc queries
» More real-time stream processing
Spark is a fast, unified platform for these apps
Look for our training camp at Berkeley this August!
spark-project.org

Slide 37

Slide 37 text

Backup  Slides  

Slide 38

Slide 38 text

Behavior with Not Enough RAM
(Chart: iteration time (s) vs. % of working set in memory — cache disabled: 68.8 s; 25%: 58.1 s; 50%: 40.7 s; 75%: 29.7 s; fully cached: 11.5 s.)
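
If dropping partitions and recomputing them is too slow, a storage level that spills to local disk is an alternative (a sketch; StorageLevel was under spark.storage in early releases, org.apache.spark.storage later):

import spark.storage.StorageLevel

// Partitions that don't fit in memory go to local disk and are re-read
// from there, instead of being recomputed from lineage.
val points = spark.textFile("hdfs://.../points.csv")
points.persist(StorageLevel.MEMORY_AND_DISK)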

Slide 39

Slide 39 text

Fault  Tolerance   file.map(rec => (rec.type, 1)) .reduceByKey(_ + _) .filter((type, count) => count > 10) filter   reduce   map   Input  file   RDDs  track  lineage  information  to  rebuild  on  failure  

Slide 40

Slide 40 text

Fault Tolerance
RDDs track lineage information to rebuild on failure

file.map(rec => (rec.`type`, 1))
    .reduceByKey(_ + _)
    .filter { case (_, count) => count > 10 }

(Lineage diagram: Input file → map → reduce → filter.)