Slide 1

Slide 1 text

Apache Spark as a Cross-over Hit for Data Science
Data Science London
Ian Buss / Solutions Architect / Cloudera

Slide 2

Slide 2 text

Investigative vs Operational Analytics

Slide 3

Slide 3 text

Tools of the Trade

Slide 4

Slide 4 text

Trade-offs of the Tools

              Investigative                         Operational
Data          Historical subset; sample             Production data; large-scale
Context       Workstation; ad hoc investigation     Shared cluster; continuous operation
Metrics       Offline; accuracy                     Online; throughput (QPS)
Library       Many, sophisticated                   Few, simple
Language      Scripting, high level                 Systems language
              (ease of development)                 (performance)

Slide 5

Slide 5 text

R
(positioned on the same trade-off table as Slide 4)

Slide 6

Slide 6 text

Python + scikit
(positioned on the same trade-off table as Slide 4)

Slide 7

Slide 7 text

MapReduce, Crunch, Mahout
(positioned on the same trade-off table as Slide 4)

Slide 8

Slide 8 text

Spark: Something For Everyone

• Now an Apache TLP
  • UC Berkeley, Databricks
  • Mesos to YARN
• Scala-based
  • Expressive, efficient
  • JVM-based
• Scala-like API
  • Distributed works like local
  • As Crunch is Collection-like
• REPL
  • Interactive
• Distributed
  • Hadoop-friendly
  • Integrates with where the data and cluster already are
  • ETL no longer separate
• MLlib
  • GraphX, Streaming, SQL

Slide 9

Slide 9 text

Spark
(positioned on the same trade-off table as Slide 4)

Slide 10

Slide 10 text

Spark – Selected Features

• Arbitrary computation DAG – relax, it's just MapReduce… but more flexible
• Simple, expressive programming – Scala, Python, Java
• Exploit RAM – RDDs, iterations – ML!
• Combine processing approaches – batch, streaming, SQL…
• Scalable, reliable, integrated – built on Hadoop

Slide 11

Slide 11 text

Spark – Simple Example

val wifi = sc.textFile("/var/log/wifi.log", 3)
val networks = wifi.filter(_.contains("_doAutoJoin"))
  .map(l => (l.split("[“”]")(1), 1))
  .reduceByKey(_ + _)
  .map(n => (n._2, n._1))
  .sortByKey(false)
  .groupByKey

(DAG diagram: filter → map → reduceByKey → map → sortByKey → groupByKey)
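Because the RDD API mirrors ordinary Scala collections (the "distributed works like local" point from Slide 8), the same chain can be sanity-checked on a plain List. A minimal sketch; the log lines below are invented examples, and `reduceByKey` is emulated locally with `groupBy` plus a sum:

```scala
// Same transformation chain on a local List instead of an RDD.
// wifi.log wraps network names in curly quotes, hence the split on [“”].
val wifi = List(
  "airportd _doAutoJoin: Already associated to “HomeNet”.",
  "airportd _doAutoJoin: Already associated to “CoffeeShop”.",
  "airportd _doAutoJoin: Already associated to “HomeNet”.",
  "kernel: some unrelated log line"
)

// Count auto-join events per network
val networks = wifi.filter(_.contains("_doAutoJoin"))
  .map(l => (l.split("[“”]")(1), 1))
  .groupBy(_._1)
  .map { case (net, ones) => (net, ones.map(_._2).sum) }
// networks: Map(HomeNet -> 2, CoffeeShop -> 1)
```

The only real change is the counting step: on an RDD, `reduceByKey(_ + _)` does the grouping and summing in one shuffle.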

Slide 12

Slide 12 text

Stack Overflow Tag Recommender Demo

• Questions have tags like java or mysql
• Recommend new tags to questions
• Available as data dump
  • Jan 20 2014 Posts.xml
  • 24.4GB
  • 2.1M questions
  • 9.3M tags (34K unique)

Slide 13

Slide 13 text


Slide 14

Slide 14 text

Stack Overflow Tag Recommender Demo

• CDH 5.0.0
  • Spark 0.9.0
  • Standalone mode
  • Install libgfortran
• 6-node cluster
  • 24 cores
  • 64GB RAM

Slide 15

Slide 15 text


Slide 16

Slide 16 text

val postsXML = sc.textFile(
  "hdfs:///user/ibuss/SparkDemo/Posts.xml")

postsXML: org.apache.spark.rdd.RDD[String] =
  MappedRDD[13] at textFile at <console>:15

postsXML.count
...
res1: Long = 18066983

Slide 17

Slide 17 text

(4,"c#")
(4,"winforms")
...
(4,3104,1.0)
(4,2148819,1.0)
...

Slide 18

Slide 18 text

val postIDTags = postsXML.flatMap { line =>
  val idTagRegex =
    "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
  val tagRegex = "&lt;([^&]+)&gt;".r
  idTagRegex.findFirstMatchIn(line) match {
    case None => None
    case Some(m) => {
      val postID = m.group(1).toInt
      val tagsString = m.group(2)
      val tags =
        tagRegex.findAllMatchIn(tagsString)
          .map(_.group(1)).toList
      if (tags.size >= 4) tags.map((postID, _))
      else None
    }
  }
}
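The parsing can be checked without a cluster, since the same regexes work on a plain String. A sketch against a made-up row shaped like a raw Posts.xml line; note that in the raw dump the Tags attribute arrives HTML-escaped, so the tag delimiters are the literal text &lt; and &gt;:

```scala
// Same regexes as the flatMap above, applied to one sample line locally
val idTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
val tagRegex = "&lt;([^&]+)&gt;".r

// Hypothetical row shaped like a raw Posts.xml line
val line =
  """<row Id="4" Tags="&lt;c#&gt;&lt;winforms&gt;&lt;type-conversion&gt;&lt;decimal&gt;" />"""

val pairs = idTagRegex.findFirstMatchIn(line) match {
  case Some(m) =>
    val postID = m.group(1).toInt
    tagRegex.findAllMatchIn(m.group(2)).map(t => (postID, t.group(1))).toList
  case None => Nil
}
// pairs: List((4,c#), (4,winforms), (4,type-conversion), (4,decimal))
```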

Slide 19

Slide 19 text

def nnHash(tag: String) =
  tag.hashCode & 0x7FFFFF
val tagHashes =
  postIDTags.map(_._2).distinct.map(tag =>
    (nnHash(tag), tag))

import org.apache.spark.mllib.recommendation._
val alsInput = postIDTags.map(t =>
  Rating(t._1, nnHash(t._2), 1.0))

val model = ALS.trainImplicit(alsInput, 40, 10)
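The & 0x7FFFFF mask matters because String.hashCode can be negative, while ALS wants a non-negative Int product ID per tag. A quick local check, using a few sample tag strings chosen for illustration:

```scala
// Keep only the low 23 bits, so the result always lands in [0, 0x7FFFFF]
def nnHash(tag: String): Int = tag.hashCode & 0x7FFFFF

val sample = List("java", "mysql", "scala", "postgresql")
val hashed = sample.map(nnHash)
// every hashed value is non-negative, unlike raw hashCode
```

With 34K unique tags mapped into 2^23 buckets, a handful of collisions are possible but rare enough not to matter for the demo.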

Slide 20

Slide 20 text


Slide 21

Slide 21 text

def recommend(questionID: Int, howMany: Int = 5):
    Array[(String, Double)] = {
  val predictions = model.predict(
    tagHashes.map(t => (questionID, t._1)))
  val topN =
    predictions.top(howMany)(
      Ordering.by[Rating, Double](_.rating))
  topN.map(r =>
    (tagHashes.lookup(r.product)(0), r.rating))
}

recommend(7122697).foreach(println)
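RDD.top with a custom Ordering behaves like sorting a local collection descending and taking the first few elements. A sketch of that selection step, with a stand-in Rating case class (the real one is org.apache.spark.mllib.recommendation.Rating) and made-up prediction values, so it runs without Spark:

```scala
// Stand-in for org.apache.spark.mllib.recommendation.Rating
case class Rating(user: Int, product: Int, rating: Double)

// Hypothetical predictions for one question
val predictions = Seq(
  Rating(7122697, 1, 0.16),
  Rating(7122697, 2, 0.05),
  Rating(7122697, 3, 0.14))

// Local equivalent of predictions.top(2)(Ordering.by[Rating, Double](_.rating))
val top2 = predictions
  .sorted(Ordering.by[Rating, Double](_.rating).reverse)
  .take(2)
// top2 holds the two highest-rated products, best first
```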

Slide 22

Slide 22 text

(sql,0.1666023080230586)
(database,0.14425980384610013)
(oracle,0.09742911781766687)
(ruby-on-rails,0.06623183702418671)
(sqlite,0.05568507618047555)

"I have a large table with a text field, and want to make queries to this table, to find records that contain a given substring, using ILIKE. It works perfectly on small tables, but in my case it is a rather time-consuming operation, and I need it work fast, because I use it in a live-search field in my website. Any ideas would be appreciated..."

Tags: postgresql query-optimization substring text-search

stackoverflow.com/questions/7122697/how-to-make-substring-matching-query-work-fast-on-a-large-table

Slide 23

Slide 23 text

blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/
goo.gl/4K5YEI

Slide 24

Slide 24 text