
Apache Spark as a Cross-over Hit for Data Science


Talk by Ian Buss, Solutions Architect at Cloudera, at the Data Science London (@ds_ldn) meetup

Data Science London

May 06, 2014



Transcript

  1. Apache Spark as a Cross-over Hit for Data Science
     Data Science London
     Ian Buss / Solutions Architect / Cloudera
  2. Trade-offs of the Tools

                 Operational                      Investigative
     Data        Production Data, Large-Scale     Historical Subset, Sample
     Context     Shared Cluster, Continuous       Workstation, Ad Hoc
                 Operation, Online                Investigation, Offline
     Metrics     Throughput, QPS; Performance     Accuracy; Ease of Development
     Library     Few, Simple                      Many, Sophisticated
     Language    Systems Language                 Scripting, High Level
  3. R
     (the same Operational vs. Investigative matrix, repeated to show where R sits)
  4. Python + scikit
     (the same matrix, repeated to show where Python + scikit sits)
  5. MapReduce, Crunch, Mahout
     (the same matrix, repeated to show where MapReduce, Crunch and Mahout sit)
  6. Spark: Something For Everyone
     • Now an Apache TLP; from UC Berkeley and Databricks; Mesos to YARN
     • Scala-based: expressive, efficient, JVM-based
     • Scala-like API: distributed works like local (as Crunch is Collection-like)
     • REPL: interactive and distributed
     • Hadoop-friendly: integrates with where the data and cluster already are; ETL no longer separate
     • MLlib, plus GraphX, Streaming and SQL
  7. Spark
     (the same matrix, repeated to show where Spark sits)
  8. Spark – Selected Features
     • Arbitrary computation DAG: relax, it's just MapReduce… but more flexible
     • Simple, expressive programming: Scala, Python, Java
     • Exploit RAM: RDDs, iterations – ML! (see the sketch below)
     • Combine processing approaches: batch, streaming, SQL…
     • Scalable, reliable, integrated: built on Hadoop
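
     A minimal sketch of what "exploit RAM" means for iterative ML, assuming
     an illustrative dataset path and a toy update step (none of this is from
     the deck):

         // Cache an RDD in memory so each iteration rereads it from RAM,
         // not HDFS. The path and the computation are placeholders.
         val points = sc.textFile("hdfs:///data/points.csv")
           .map(_.split(",").map(_.toDouble))
           .cache()

         var weight = 0.0
         for (i <- 1 to 10) {
           // every pass after the first hits the cached partitions
           weight += points.map(p => p(0) * 0.01).sum() / points.count()
         }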
  9. Spark – Simple Example

         // read a local log as an RDD of lines, in 3 partitions
         val wifi = sc.textFile("/var/log/wifi.log", 3)
         // count auto-join events per wifi network, most-joined first
         val networks = wifi.filter(_.contains("_doAutoJoin"))  // keep auto-join lines
           .map(l => (l.split("[“”]")(1), 1))  // network name sits between curly quotes
           .reduceByKey(_ + _)                 // count occurrences per network
           .map(n => (n._2, n._1))             // swap to (count, name)
           .sortByKey(false)                   // descending by count
           .groupByKey                         // group names sharing a count

     (slide diagram: filter → map → reduceByKey → map → sortByKey → groupByKey)
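
     To see the result in the REPL, collect and print it (the output below is
     shape only; the values are invented):

         networks.collect().foreach(println)
         // e.g. (42,ArrayBuffer(SomeNetwork)) -- one (count, networks) pair per line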
  10. Stack Overflow Tag Recommender Demo
      • Questions have tags like java or mysql
      • Goal: recommend new tags to questions
      • Available as a data dump: Posts.xml (Jan 20 2014)
      • 24.4 GB; 2.1M questions; 9.3M tags (34K unique)
  11. A row from Posts.xml:

          <row Id="4" PostTypeId="1" AcceptedAnswerId="7"
            CreationDate="2008-07-31T21:42:52.667" Score="251" ViewCount="15207"
            Body="&lt;p&gt;I want to use a track-bar to change a form's
            opacity.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;This is my
            code:&lt;/p&gt;&#xA;&#xA;&lt;pre&gt;&lt;code&gt;decimal trans =
            trackBar1.Value / 5000;&#xA;this.Opacity =
            trans;&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&#xA;&lt;p&gt;When I try to
            build it, I get this
            error:&lt;/p&gt;&#xA;&#xA;&lt;blockquote&gt;&#xA;
            &lt;p&gt;Cannot implicitly convert type 'decimal' to
            'double'.&lt;/p&gt;&#xA;&lt;/blockquote&gt;&#xA;&#xA;&lt;p&gt;I tried
            making &lt;strong&gt;trans&lt;/strong&gt; to
            &lt;strong&gt;double&lt;/strong&gt;, but then the control doesn't
            work. This code has worked fine for me in VB.NET in the past.
            &lt;/p&gt;&#xA;"
            OwnerUserId="8" LastEditorUserId="2648239"
            LastEditorDisplayName="Rich B" LastEditDate="2014-01-03T02:42:54.963"
            LastActivityDate="2014-01-03T02:42:54.963"
            Title="When setting a form's opacity should I use a decimal or double?"
            Tags="&lt;c#&gt;&lt;winforms&gt;&lt;forms&gt;&lt;type-conversion&gt;&lt;opacity&gt;"
            AnswerCount="13" CommentCount="25" FavoriteCount="23"
            CommunityOwnedDate="2012-10-31T16:42:47.213" />
  12. Stack Overflow Tag Recommender Demo
      • CDH 5.0.0; Spark 0.9.0
      • Standalone mode
      • Install libgfortran
      • 6-node cluster
      • 24 cores
      • 64 GB RAM
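
      For context, wiring a shell or app to a standalone master in Spark 0.9
      looks roughly like this; the master hostname, app name, and memory
      setting are placeholders, not from the deck:

          import org.apache.spark.{SparkConf, SparkContext}

          // hypothetical standalone-mode setup; all values are illustrative
          val conf = new SparkConf()
            .setMaster("spark://master-host:7077")  // standalone master URL
            .setAppName("TagRecommender")
            .set("spark.executor.memory", "8g")
          val sc = new SparkContext(conf)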
  13. (figure-only slide; no transcript text)

  14. Load the dump and count the lines:

          // one String per line of the 24.4 GB XML dump in HDFS
          val postsXML = sc.textFile(
            "hdfs:///user/ibuss/SparkDemo/Posts.xml")

          postsXML: org.apache.spark.rdd.RDD[String] =
            MappedRDD[13] at textFile at <console>:15

          // an action: triggers a full scan of the file
          postsXML.count
          ...
          res1: Long = 18066983
  15. Extract (postID, tag) pairs:

          val postIDTags = postsXML.flatMap { line =>
            // rows with both an Id and a Tags attribute are questions
            val idTagRegex =
              "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
            // tags are HTML-escaped in the dump: &lt;tag&gt;
            val tagRegex = "&lt;([^&]+)&gt;".r
            idTagRegex.findFirstMatchIn(line) match {
              case None => None              // no match: contribute nothing
              case Some(m) => {
                val postID = m.group(1).toInt
                val tagsString = m.group(2)
                val tags =
                  tagRegex.findAllMatchIn(tagsString)
                    .map(_.group(1)).toList
                // keep only questions with at least 4 tags
                if (tags.size >= 4) tags.map((postID, _))
                else None
              }
            }
          }
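
      A quick REPL sanity check on the parse (illustrative; this output is
      not from the slides):

          postIDTags.take(3).foreach(println)
          // e.g. (4,c#) -- one (question ID, tag) pair per element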
  16. Hash tags to integer IDs and train ALS:

          // MLlib's Rating needs integer IDs, so hash each tag into a
          // non-negative 23-bit int (~8.4M values for 34K unique tags)
          def nnHash(tag: String) = tag.hashCode & 0x7FFFFF
          // reverse mapping from hash back to tag text
          var tagHashes =
            postIDTags.map(_._2).distinct.map(tag =>
              (nnHash(tag), tag))

          import org.apache.spark.mllib.recommendation._
          // one implicit "rating" per (question, tag) co-occurrence
          val alsInput = postIDTags.map(t =>
            Rating(t._1, nnHash(t._2), 1.0))

          // factorization rank 40, 10 iterations
          val model = ALS.trainImplicit(alsInput, 40, 10)
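
      nnHash is not guaranteed collision-free, so a worthwhile check (not in
      the deck) is to compare distinct tags against distinct hashes:

          // illustrative check: equal counts mean no collisions in this dataset
          val distinctTags = postIDTags.map(_._2).distinct.count
          val distinctHashes = tagHashes.map(_._1).distinct.count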
  17. (figure-only slide; no transcript text)

  18. Recommend tags for a question:

          def recommend(questionID: Int, howMany: Int = 5):
              Array[(String, Double)] = {
            // score this question against every known tag hash
            val predictions = model.predict(
              tagHashes.map(t => (questionID, t._1)))
            // take the highest-scoring tags
            val topN = predictions.top(howMany)(
              Ordering.by[Rating, Double](_.rating))
            // translate hashes back to tag text on the driver
            topN.map(r =>
              (tagHashes.lookup(r.product)(0), r.rating))
          }

          recommend(7122697).foreach(println)
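
      A practical aside (mine, not the deck's): recommend re-evaluates
      tagHashes once for the predictions and again per returned tag via
      lookup, so caching the mapping pays off if you call it repeatedly:

          tagHashes.cache()  // keep the hash-to-tag mapping in memory across calls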
  19. Recommendations for question 7122697:

          (sql,0.1666023080230586)
          (database,0.14425980384610013)
          (oracle,0.09742911781766687)
          (ruby-on-rails,0.06623183702418671)
          (sqlite,0.05568507618047555)

      The question: "I have a large table with a text field, and want to make
      queries to this table, to find records that contain a given substring,
      using ILIKE. It works perfectly on small tables, but in my case it is a
      rather time-consuming operation, and I need it work fast, because I use
      it in a live-search field in my website. Any ideas would be appreciated..."
      Its actual tags: postgresql, query-optimization, substring, text-search

      stackoverflow.com/questions/7122697/how-to-make-substring-matching-query-work-fast-on-a-large-table