Slide 1

Slide 1 text

Apache Spark as a Cross-over Hit for Data Science
Data Science London
Ian Buss / Solutions Architect / Cloudera

Slide 2

Slide 2 text

Investigative vs Operational Analytics

Slide 3

Slide 3 text

Tools of the Trade

Slide 4

Slide 4 text

Trade-offs of the Tools

              Investigative                         Operational
Data          Historical subset; sample             Production data; large-scale
Context       Workstation; ad hoc investigation     Shared cluster; continuous operation
Metrics       Offline; accuracy                     Online; throughput (QPS)
Library       Many, sophisticated                   Few, simple
Language      Scripting, high level                 Systems language
              (ease of development)                 (performance)

Slide 5

Slide 5 text

R
(positioned on the same trade-off table as Slide 4)

Slide 6

Slide 6 text

Python + scikit
(positioned on the same trade-off table as Slide 4)

Slide 7

Slide 7 text

MapReduce, Crunch, Mahout
(positioned on the same trade-off table as Slide 4)

Slide 8

Slide 8 text

Spark: Something For Everyone

• Now an Apache TLP
  • UC Berkeley, Databricks
  • Mesos to YARN
• Scala-based
  • Expressive, efficient
  • JVM-based
• Scala-like API
  • Distributed works like local
  • As Crunch is Collection-like
• REPL
  • Interactive
• Distributed
  • Hadoop-friendly
  • Integrates with where the data and cluster already are
  • ETL no longer separate
• MLlib
  • GraphX, Streaming, SQL

Slide 9

Slide 9 text

Spark
(positioned on the same trade-off table as Slide 4)

Slide 10

Slide 10 text

Spark – Selected Features

• Arbitrary computation DAG – relax, it's just MapReduce… but more flexible
• Simple, expressive programming – Scala, Python, Java
• Exploit RAM – RDDs, iterations – ML!
• Combine processing approaches – batch, streaming, SQL…
• Scalable, reliable, integrated – built on Hadoop

Slide 11

Slide 11 text

Spark – Simple Example

val wifi = sc.textFile("/var/log/wifi.log", 3)
val networks = wifi.filter(_.contains("_doAutoJoin"))
  .map(l => (l.split("[“”]")(1), 1))
  .reduceByKey(_ + _)
  .map(n => (n._2, n._1))
  .sortByKey(false)
  .groupByKey

(DAG diagram: filter → map → reduceByKey → map → sortByKey → groupByKey)
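Because the RDD API mirrors ordinary Scala collections (the "distributed works like local" point from Slide 8), the same chain can be sanity-checked on a plain List. A minimal sketch; the log lines below are invented examples, and `reduceByKey` is emulated locally with `groupBy` plus a sum:

```scala
// Same transformation chain on a local List instead of an RDD.
// wifi.log wraps network names in curly quotes, hence the split on [“”].
val wifi = List(
  "airportd _doAutoJoin: Already associated to “HomeNet”.",
  "airportd _doAutoJoin: Already associated to “CoffeeShop”.",
  "airportd _doAutoJoin: Already associated to “HomeNet”.",
  "kernel: some unrelated log line"
)

// Count auto-join events per network
val networks = wifi.filter(_.contains("_doAutoJoin"))
  .map(l => (l.split("[“”]")(1), 1))
  .groupBy(_._1)
  .map { case (net, ones) => (net, ones.map(_._2).sum) }
// networks: Map(HomeNet -> 2, CoffeeShop -> 1)
```

The only real change is the counting step: on an RDD, `reduceByKey(_ + _)` does the grouping and summing in one shuffle.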

Slide 12

Slide 12 text

Stack Overflow Tag Recommender Demo

• Questions have tags like java or mysql
• Recommend new tags to questions
• Available as data dump
  • Jan 20 2014 Posts.xml
  • 24.4GB
  • 2.1M questions
  • 9.3M tags (34K unique)

Slide 13

Slide 13 text


Slide 14

Slide 14 text

Stack Overflow Tag Recommender Demo

• CDH 5.0.0
  • Spark 0.9.0
  • Standalone mode
  • Install libgfortran
• 6-node cluster
  • 24 cores
  • 64GB RAM

Slide 15

Slide 15 text


Slide 16

Slide 16 text

val postsXML = sc.textFile(
  "hdfs:///user/ibuss/SparkDemo/Posts.xml")

postsXML: org.apache.spark.rdd.RDD[String] =
  MappedRDD[13] at textFile at <console>:15

postsXML.count
...
res1: Long = 18066983

Slide 17

Slide 17 text

(4,"c#")
(4,"winforms")
...
(4,3104,1.0)
(4,2148819,1.0)
...

Slide 18

Slide 18 text

val postIDTags = postsXML.flatMap { line =>
  val idTagRegex =
    "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
  val tagRegex = "&lt;([^&]+)&gt;".r
  idTagRegex.findFirstMatchIn(line) match {
    case None => None
    case Some(m) => {
      val postID = m.group(1).toInt
      val tagsString = m.group(2)
      val tags =
        tagRegex.findAllMatchIn(tagsString)
          .map(_.group(1)).toList
      if (tags.size >= 4) tags.map((postID, _))
      else None
    }
  }
}
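The parsing can be checked without a cluster, since the same regexes work on a plain String. A sketch against a made-up row shaped like a raw Posts.xml line; note that in the raw dump the Tags attribute arrives HTML-escaped, so the tag delimiters are the literal text &lt; and &gt;:

```scala
// Same regexes as the flatMap above, applied to one sample line locally
val idTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
val tagRegex = "&lt;([^&]+)&gt;".r

// Hypothetical row shaped like a raw Posts.xml line
val line =
  """<row Id="4" Tags="&lt;c#&gt;&lt;winforms&gt;&lt;type-conversion&gt;&lt;decimal&gt;" />"""

val pairs = idTagRegex.findFirstMatchIn(line) match {
  case Some(m) =>
    val postID = m.group(1).toInt
    tagRegex.findAllMatchIn(m.group(2)).map(t => (postID, t.group(1))).toList
  case None => Nil
}
// pairs: List((4,c#), (4,winforms), (4,type-conversion), (4,decimal))
```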

Slide 19

Slide 19 text

def nnHash(tag: String) =
  tag.hashCode & 0x7FFFFF
val tagHashes =
  postIDTags.map(_._2).distinct.map(tag =>
    (nnHash(tag), tag))

import org.apache.spark.mllib.recommendation._
val alsInput = postIDTags.map(t =>
  Rating(t._1, nnHash(t._2), 1.0))

val model = ALS.trainImplicit(alsInput, 40, 10)
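The & 0x7FFFFF mask matters because String.hashCode can be negative, while ALS wants a non-negative Int product ID per tag. A quick local check, using a few sample tag strings chosen for illustration:

```scala
// Keep only the low 23 bits, so the result always lands in [0, 0x7FFFFF]
def nnHash(tag: String): Int = tag.hashCode & 0x7FFFFF

val sample = List("java", "mysql", "scala", "postgresql")
val hashed = sample.map(nnHash)
// every hashed value is non-negative, unlike raw hashCode
```

With 34K unique tags mapped into 2^23 buckets, a handful of collisions are possible but rare enough not to matter for the demo.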

Slide 20

Slide 20 text


Slide 21

Slide 21 text

def recommend(questionID: Int, howMany: Int = 5):
    Array[(String, Double)] = {
  val predictions = model.predict(
    tagHashes.map(t => (questionID, t._1)))
  val topN =
    predictions.top(howMany)(
      Ordering.by[Rating, Double](_.rating))
  topN.map(r =>
    (tagHashes.lookup(r.product)(0), r.rating))
}

recommend(7122697).foreach(println)
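RDD.top with a custom Ordering behaves like sorting a local collection descending and taking the first few elements. A sketch of that selection step, with a stand-in Rating case class (the real one is org.apache.spark.mllib.recommendation.Rating) and made-up prediction values, so it runs without Spark:

```scala
// Stand-in for org.apache.spark.mllib.recommendation.Rating
case class Rating(user: Int, product: Int, rating: Double)

// Hypothetical predictions for one question
val predictions = Seq(
  Rating(7122697, 1, 0.16),
  Rating(7122697, 2, 0.05),
  Rating(7122697, 3, 0.14))

// Local equivalent of predictions.top(2)(Ordering.by[Rating, Double](_.rating))
val top2 = predictions
  .sorted(Ordering.by[Rating, Double](_.rating).reverse)
  .take(2)
// top2 holds the two highest-rated products, best first
```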

Slide 22

Slide 22 text

(sql,0.1666023080230586)
(database,0.14425980384610013)
(oracle,0.09742911781766687)
(ruby-on-rails,0.06623183702418671)
(sqlite,0.05568507618047555)

"I have a large table with a text field, and want to make queries to this table, to find records that contain a given substring, using ILIKE. It works perfectly on small tables, but in my case it is a rather time-consuming operation, and I need it work fast, because I use it in a live-search field in my website. Any ideas would be appreciated..."

Tags: postgresql query-optimization substring text-search

stackoverflow.com/questions/7122697/how-to-make-substring-matching-query-work-fast-on-a-large-table

Slide 23

Slide 23 text

blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/
goo.gl/4K5YEI

Slide 24

Slide 24 text