
Apache Spark as a Cross-over Hit for Data Science


Talk by Ian Buss, Solutions Architect at Cloudera, at the Data Science London (@ds_ldn) meetup

Data Science London

May 06, 2014



Transcript

  1. Apache Spark as a Cross-over Hit for Data Science
     Data Science London
     Ian Buss / Solutions Architect / Cloudera
  2. Trade-offs of the Tools

                 Operational                      Investigative
     Data        Production Data, Large-Scale     Historical Subset, Sample
     Context     Shared Cluster, Continuous       Workstation, Ad Hoc
                 Operation, Online                Investigation, Offline
     Metrics     Throughput, QPS; Performance     Accuracy; Ease of Development
     Library     Few, Simple                      Many, Sophisticated
     Language    Systems Language                 Scripting, High Level
  3. R
     (the same Operational vs. Investigative matrix, repeated to show where R sits)
  4. Python + scikit
     (the same matrix, repeated to show where Python + scikit sits)
  5. MapReduce, Crunch, Mahout
     (the same matrix, repeated to show where MapReduce, Crunch and Mahout sit)
  6. Spark: Something For Everyone
     • Now an Apache TLP; from UC Berkeley and Databricks; Mesos to YARN
     • Scala-based: expressive, efficient, JVM-based
     • Scala-like API: distributed works like local (as Crunch is Collection-like)
     • REPL: interactive and distributed
     • Hadoop-friendly: integrates with where the data and cluster already are; ETL no longer separate
     • MLlib, plus GraphX, Streaming and SQL
  7. Spark
     (the same matrix, repeated to show where Spark sits)
  8. Spark – Selected Features
     • Arbitrary computation DAG: relax, it's just MapReduce… but more flexible
     • Simple, expressive programming: Scala, Python, Java
     • Exploit RAM: RDDs, iterations – ML! (see the sketch below)
     • Combine processing approaches: batch, streaming, SQL…
     • Scalable, reliable, integrated: built on Hadoop
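
     A minimal sketch of what "exploit RAM" means for iterative ML, assuming
     an illustrative dataset path and a toy update step (none of this is from
     the deck):

         // Cache an RDD in memory so each iteration rereads it from RAM,
         // not HDFS. The path and the computation are placeholders.
         val points = sc.textFile("hdfs:///data/points.csv")
           .map(_.split(",").map(_.toDouble))
           .cache()

         var weight = 0.0
         for (i <- 1 to 10) {
           // every pass after the first hits the cached partitions
           weight += points.map(p => p(0) * 0.01).sum() / points.count()
         }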
  9. Spark – Simple Example

         // read a local log as an RDD of lines, in 3 partitions
         val wifi = sc.textFile("/var/log/wifi.log", 3)
         // count auto-join events per wifi network, most-joined first
         val networks = wifi.filter(_.contains("_doAutoJoin"))  // keep auto-join lines
           .map(l => (l.split("[“”]")(1), 1))  // network name sits between curly quotes
           .reduceByKey(_ + _)                 // count occurrences per network
           .map(n => (n._2, n._1))             // swap to (count, name)
           .sortByKey(false)                   // descending by count
           .groupByKey                         // group names sharing a count

     (slide diagram: filter → map → reduceByKey → map → sortByKey → groupByKey)
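
     To see the result in the REPL, collect and print it (the output below is
     shape only; the values are invented):

         networks.collect().foreach(println)
         // e.g. (42,ArrayBuffer(SomeNetwork)) -- one (count, networks) pair per line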
  10. Stack Overflow Tag Recommender Demo
      • Questions have tags like java or mysql
      • Goal: recommend new tags to questions
      • Available as a data dump: Posts.xml (Jan 20 2014)
      • 24.4 GB; 2.1M questions; 9.3M tags (34K unique)
  11. A row from Posts.xml:

          <row Id="4" PostTypeId="1" AcceptedAnswerId="7"
            CreationDate="2008-07-31T21:42:52.667" Score="251" ViewCount="15207"
            Body="&lt;p&gt;I want to use a track-bar to change a form's
            opacity.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;This is my
            code:&lt;/p&gt;&#xA;&#xA;&lt;pre&gt;&lt;code&gt;decimal trans =
            trackBar1.Value / 5000;&#xA;this.Opacity =
            trans;&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&#xA;&lt;p&gt;When I try to
            build it, I get this
            error:&lt;/p&gt;&#xA;&#xA;&lt;blockquote&gt;&#xA;
            &lt;p&gt;Cannot implicitly convert type 'decimal' to
            'double'.&lt;/p&gt;&#xA;&lt;/blockquote&gt;&#xA;&#xA;&lt;p&gt;I tried
            making &lt;strong&gt;trans&lt;/strong&gt; to
            &lt;strong&gt;double&lt;/strong&gt;, but then the control doesn't
            work. This code has worked fine for me in VB.NET in the past.
            &lt;/p&gt;&#xA;"
            OwnerUserId="8" LastEditorUserId="2648239"
            LastEditorDisplayName="Rich B" LastEditDate="2014-01-03T02:42:54.963"
            LastActivityDate="2014-01-03T02:42:54.963"
            Title="When setting a form's opacity should I use a decimal or double?"
            Tags="&lt;c#&gt;&lt;winforms&gt;&lt;forms&gt;&lt;type-conversion&gt;&lt;opacity&gt;"
            AnswerCount="13" CommentCount="25" FavoriteCount="23"
            CommunityOwnedDate="2012-10-31T16:42:47.213" />
  12. Stack Overflow Tag Recommender Demo
      • CDH 5.0.0; Spark 0.9.0
      • Standalone mode
      • Install libgfortran
      • 6-node cluster
      • 24 cores
      • 64 GB RAM
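
      For context, wiring a shell or app to a standalone master in Spark 0.9
      looks roughly like this; the master hostname, app name, and memory
      setting are placeholders, not from the deck:

          import org.apache.spark.{SparkConf, SparkContext}

          // hypothetical standalone-mode setup; all values are illustrative
          val conf = new SparkConf()
            .setMaster("spark://master-host:7077")  // standalone master URL
            .setAppName("TagRecommender")
            .set("spark.executor.memory", "8g")
          val sc = new SparkContext(conf)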
  13. (figure-only slide; no transcript text)

  14. Load the dump and count the lines:

          // one String per line of the 24.4 GB XML dump in HDFS
          val postsXML = sc.textFile(
            "hdfs:///user/ibuss/SparkDemo/Posts.xml")

          postsXML: org.apache.spark.rdd.RDD[String] =
            MappedRDD[13] at textFile at <console>:15

          // an action: triggers a full scan of the file
          postsXML.count
          ...
          res1: Long = 18066983
  15. Extract (postID, tag) pairs:

          val postIDTags = postsXML.flatMap { line =>
            // rows with both an Id and a Tags attribute are questions
            val idTagRegex =
              "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
            // tags are HTML-escaped in the dump: &lt;tag&gt;
            val tagRegex = "&lt;([^&]+)&gt;".r
            idTagRegex.findFirstMatchIn(line) match {
              case None => None              // no match: contribute nothing
              case Some(m) => {
                val postID = m.group(1).toInt
                val tagsString = m.group(2)
                val tags =
                  tagRegex.findAllMatchIn(tagsString)
                    .map(_.group(1)).toList
                // keep only questions with at least 4 tags
                if (tags.size >= 4) tags.map((postID, _))
                else None
              }
            }
          }
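
      A quick REPL sanity check on the parse (illustrative; this output is
      not from the slides):

          postIDTags.take(3).foreach(println)
          // e.g. (4,c#) -- one (question ID, tag) pair per element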
  16. Hash tags to integer IDs and train ALS:

          // MLlib's Rating needs integer IDs, so hash each tag into a
          // non-negative 23-bit int (~8.4M values for 34K unique tags)
          def nnHash(tag: String) = tag.hashCode & 0x7FFFFF
          // reverse mapping from hash back to tag text
          var tagHashes =
            postIDTags.map(_._2).distinct.map(tag =>
              (nnHash(tag), tag))

          import org.apache.spark.mllib.recommendation._
          // one implicit "rating" per (question, tag) co-occurrence
          val alsInput = postIDTags.map(t =>
            Rating(t._1, nnHash(t._2), 1.0))

          // factorization rank 40, 10 iterations
          val model = ALS.trainImplicit(alsInput, 40, 10)
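
      nnHash is not guaranteed collision-free, so a worthwhile check (not in
      the deck) is to compare distinct tags against distinct hashes:

          // illustrative check: equal counts mean no collisions in this dataset
          val distinctTags = postIDTags.map(_._2).distinct.count
          val distinctHashes = tagHashes.map(_._1).distinct.count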
  17. (figure-only slide; no transcript text)

  18. Recommend tags for a question:

          def recommend(questionID: Int, howMany: Int = 5):
              Array[(String, Double)] = {
            // score this question against every known tag hash
            val predictions = model.predict(
              tagHashes.map(t => (questionID, t._1)))
            // take the highest-scoring tags
            val topN = predictions.top(howMany)(
              Ordering.by[Rating, Double](_.rating))
            // translate hashes back to tag text on the driver
            topN.map(r =>
              (tagHashes.lookup(r.product)(0), r.rating))
          }

          recommend(7122697).foreach(println)
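
      A practical aside (mine, not the deck's): recommend re-evaluates
      tagHashes once for the predictions and again per returned tag via
      lookup, so caching the mapping pays off if you call it repeatedly:

          tagHashes.cache()  // keep the hash-to-tag mapping in memory across calls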
  19. Recommendations for question 7122697:

          (sql,0.1666023080230586)
          (database,0.14425980384610013)
          (oracle,0.09742911781766687)
          (ruby-on-rails,0.06623183702418671)
          (sqlite,0.05568507618047555)

      The question: "I have a large table with a text field, and want to make
      queries to this table, to find records that contain a given substring,
      using ILIKE. It works perfectly on small tables, but in my case it is a
      rather time-consuming operation, and I need it work fast, because I use
      it in a live-search field in my website. Any ideas would be appreciated..."
      Its actual tags: postgresql, query-optimization, substring, text-search

      stackoverflow.com/questions/7122697/how-to-make-substring-matching-query-work-fast-on-a-large-table