Slide 1

No content

Slide 2

Apache Spark: Next Generation Big Data Technology

Slide 3

About • @MLnick • Co-founder @graphflow - big data & machine learning applied to recommendations, consumer behaviour & insights • Apache Spark committer • Author of “Machine Learning with Spark” • Packt RAW: http://www.packtpub.com/machine-learning-with-spark/book

Slide 4

Agenda • BIG DATA! • Herding Elephants • Making Sparks • Conclusion

Slide 5

Big Data Everywhere

Slide 6

Big Data Everywhere

Slide 7

Big Data Everywhere

Slide 8

Big Data Everywhere

Slide 9

Big Data Everywhere

Slide 10

Cutting through the Hype • Massive and growing amount of data collected • Moore’s law => cost of compute, disk, RAM decreasing rapidly • But data still growing faster than single-node performance can handle • Factors: • ease of collection • cost of storage • mobile • IoT • science (CERN, SKA)

Slide 11

Herding Elephants - Hadoop • Google was doing big data before it was cool… • Google File System Paper (2003) -> Apache Hadoop Distributed Filesystem (HDFS) • Google MapReduce Paper (2004) -> Apache Hadoop MapReduce • BigTable (2006) -> Apache HBase • Dremel (2010) -> Apache Drill (and others)

Slide 12

Herding Elephants - Hadoop • Hadoop started at Yahoo • Open sourced -> Apache Hadoop • Spawned Cloudera, MapR, Hortonworks… and an entire big data industry • Old scaling == vertical, big tin • New scaling == horizontal, shared nothing, data parallel, commodity hardware, embrace failure!

Slide 13

Herding Elephants - Hadoop [architecture diagram: HDFS (a NameNode with several DataNodes) and MapReduce (a JobTracker with several Task Trackers)]

Slide 14

Herding Elephants - Hadoop [architecture diagram: HDFS (NameNode, DataNodes) and MapReduce (JobTracker, Task Trackers)] • HDFS • Replication • Fault tolerance

Slide 15

Herding Elephants - Hadoop [architecture diagram: HDFS (NameNode, DataNodes) and MapReduce (JobTracker, Task Trackers); a file's blocks are placed on DataNodes] • HDFS • Replication • Fault tolerance

Slide 16

Herding Elephants - Hadoop [architecture diagram: HDFS (NameNode, DataNodes) and MapReduce (JobTracker, Task Trackers); each block is replicated across several DataNodes] • HDFS • Replication • Fault tolerance

Slide 17

Herding Elephants - Hadoop [architecture diagram as before, now with one DataNode lost; the replicated blocks remain available on the other DataNodes] • HDFS • Replication • Fault tolerance

Slide 18

Herding Elephants - Hadoop [architecture diagram: HDFS (NameNode, DataNodes) and MapReduce (JobTracker, Task Trackers), with blocks replicated across DataNodes] • HDFS • Replication • Fault tolerance • MapReduce • Data locality • Fault tolerance

Slide 19

Herding Elephants - Hadoop [architecture diagram: HDFS (NameNode, DataNodes) and MapReduce (JobTracker, Task Trackers), with blocks replicated across DataNodes] • HDFS • Replication • Fault tolerance • MapReduce • Data locality • Fault tolerance

Slide 20

Herding Elephants - Hadoop [architecture diagram as before, now with one Task Tracker lost; the remaining Task Trackers can still run the job] • HDFS • Replication • Fault tolerance • MapReduce • Data locality • Fault tolerance

Slide 21

MapReduce

Slide 22

MapReduce: Counting Words • “Hadoop is a distributed system for counting words” (Scalding GitHub) • Map

public class WordCount {
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

Slide 23

MapReduce: Counting Words • Reduce

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

Slide 24

MapReduce: Counting Words • Job Setup

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}

Slide 25

Hadoop Issues • Pros • Reliable in face of failure (will happen at scale) - disk, network, node, rack … • Very scalable: ~40,000 nodes at Yahoo! • Cons • Disk I/O for every job • Unwieldy API (hence Cascading, Scalding, Crunch, …)
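Not from the deck: to give a sense of the higher-level APIs that grew around the unwieldy MapReduce API, the classic word count from the Scalding README looks roughly like the following sketch (my transcription; field names and the tokenize helper are as in that example).

import com.twitter.scalding._

// Classic Scalding word count: read lines, split into words, group and count,
// then write tab-separated (word, count) pairs.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))

  // lower-case, strip punctuation and split on whitespace
  def tokenize(text: String): Array[String] =
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
}

The same map / shuffle / reduce happens underneath; the plumbing from slides 22 to 24 just disappears behind the library.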

Slide 26

So Why Spark? • In-memory caching == fast! • Broadcast variables and accumulator primitives built-in • Resilient Distributed Datasets (RDD) - recomputed on the fly in case of failure • Hadoop compatible • Rich, functional API in Scala, Python, Java and R • One platform for multiple use cases: • Shark / SparkSQL - SQL on Spark • Spark Streaming - Real time processing • Machine Learning - MLlib • Graph Processing - GraphX
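Not on the slide: a minimal sketch of the caching, broadcast and accumulator primitives listed above, using the Spark 1.0-era Scala API (the stop-word set and empty-line counter are made-up examples; the HDFS path is a placeholder).

import org.apache.spark.SparkContext

val sc = new SparkContext("local[4]", "Primitives sketch")

val lines = sc.textFile("hdfs://...").cache()        // keep the RDD in memory for reuse across jobs

val stopWords = sc.broadcast(Set("a", "the", "is"))  // read-only value shipped to each worker once
val emptyLines = sc.accumulator(0)                   // counter that tasks can only add to

val words = lines.flatMap { line =>
  if (line.trim.isEmpty) emptyLines += 1
  line.split(" ").filterNot(stopWords.value.contains)
}
println(words.count())
println(emptyLines.value)                            // accumulator value read back on the driver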

Slide 27

Spark Word Count

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

• Power of functional constructs and Scala language
• Same code locally or on a cluster

val sparkLocal = new SparkContext("local[4]", "Local 4 Threads")
val sparkOnCluster = new SparkContext("spark://...:7077", "Cluster of 1000 Nodes")

Slide 28

Spark Word Count • Python

file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")

Slide 29

Spark Machine Learning

val points = spark.textFile(...).map(parsePoint).cache()
var w = Vector.random(D) // initial weight vector

for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

• Caching data in memory allows subsequent iterations to be faster
• Scala allows concise code, closer to the actual maths!
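parsePoint is not shown on the slide; a minimal sketch of what it might look like, assuming each input line is a label followed by D space-separated features and the simple Vector helper used in the early Spark examples.

// Hypothetical point class and parser assumed by the snippet above.
case class DataPoint(x: Vector, y: Double)

def parsePoint(line: String): DataPoint = {
  val nums = line.split(' ').map(_.toDouble)
  // first number is the label y (+1 / -1), the remaining D numbers are the features
  DataPoint(new Vector(nums.drop(1)), nums(0))
}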

Slide 30

Spark Machine Learning • Small benchmark dataset - 1 million rows • 12x speed up • The Spark version implements a far more complex and efficient algorithm, yet is still only ~50% of the code size

Model           Hadoop ALS (Mahout)               MLlib ALS
Runtime         6m39s                             24.8s
Lines of Code   1374 + 137 + 121 + 116 = ~1750    ~880
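Not on the slide: calling MLlib's ALS from user code is only a few lines. A rough sketch (the input path, rank, iteration count and regularisation value are illustrative, not from the benchmark):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// each input line is assumed to be "user,product,rating"
val ratings = sc.textFile("hdfs://.../ratings.csv").map { line =>
  val Array(user, product, rating) = line.split(',')
  Rating(user.toInt, product.toInt, rating.toDouble)
}.cache()

// rank = 10 latent factors, 10 iterations, lambda = 0.01
val model = ALS.train(ratings, 10, 10, 0.01)

val predicted = model.predict(1, 123)   // predicted rating of product 123 by user 1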

Slide 31

Spark vs Hadoop

Hadoop / Mahout (Java):

public class ImplicitFeedbackAlternatingLeastSquaresSolver {

  private final int numFeatures;
  private final double alpha;
  private final double lambda;

  private final OpenIntObjectHashMap Y;
  private final Matrix YtransposeY;

  public ImplicitFeedbackAlternatingLeastSquaresSolver(int numFeatures, double lambda, double alpha,
      OpenIntObjectHashMap Y) {
    this.numFeatures = numFeatures;
    this.lambda = lambda;
    this.alpha = alpha;
    this.Y = Y;
    YtransposeY = getYtransposeY(Y);
  }

  public Vector solve(Vector ratings) {
    return solve(YtransposeY.plus(getYtransponseCuMinusIYPlusLambdaI(ratings)), getYtransponseCuPu(ratings));
  }

  private static Vector solve(Matrix A, Matrix y) {
    return new QRDecomposition(A).solve(y).viewColumn(0);
  }

  double confidence(double rating) {
    return 1 + alpha * rating;
  }

  /* Y' Y */
  private Matrix getYtransposeY(OpenIntObjectHashMap Y) {

    IntArrayList indexes = Y.keys();
    indexes.quickSort();
    int numIndexes = indexes.size();

    double[][] YtY = new double[numFeatures][numFeatures];

    // Compute Y'Y by dot products between the 'columns' of Y
    for (int i = 0; i < numFeatures; i++) {
      for (int j = i; j < numFeatures; j++) {
        double dot = 0;
        for (int k = 0; k < numIndexes; k++) {
          Vector row = Y.get(indexes.getQuick(k));
          dot += row.getQuick(i) * row.getQuick(j);
        }
        YtY[i][j] = dot;
        if (i != j) {
          YtY[j][i] = dot;
        }
      }
    }
    return new DenseMatrix(YtY, true);
  }

  /** Y' (Cu - I) Y + λ I */
  private Matrix getYtransponseCuMinusIYPlusLambdaI(Vector userRatings) {
    Preconditions.checkArgument(userRatings.isSequentialAccess(), "need sequential access to ratings!");

    /* (Cu -I) Y */
    OpenIntObjectHashMap CuMinusIY = new OpenIntObjectHashMap(userRatings.getNumNondefaultElements());
    for (Element e : userRatings.nonZeroes()) {
      CuMinusIY.put(e.index(), Y.get(e.index()).times(confidence(e.get()) - 1));
    }

    Matrix YtransponseCuMinusIY = new DenseMatrix(numFeatures, numFeatures);

    /* Y' (Cu -I) Y by outer products */
    for (Element e : userRatings.nonZeroes()) {
      for (Vector.Element feature : Y.get(e.index()).all()) {
        Vector partial = CuMinusIY.get(e.index()).times(feature.get());
        YtransponseCuMinusIY.viewRow(feature.index()).assign(partial, Functions.PLUS);
      }
    }

    /* Y' (Cu - I) Y + λ I   add lambda on the diagonal */
    for (int feature = 0; feature < numFeatures; feature++) {
      YtransponseCuMinusIY.setQuick(feature, feature, YtransponseCuMinusIY.getQuick(feature, feature) + lambda);
    }

    return YtransponseCuMinusIY;
  }

  /** Y' Cu p(u) */
  private Matrix getYtransponseCuPu(Vector userRatings) {
    Preconditions.checkArgument(userRatings.isSequentialAccess(), "need sequential access to ratings!");

    Vector YtransponseCuPu = new DenseVector(numFeatures);

    for (Element e : userRatings.nonZeroes()) {
      YtransponseCuPu.assign(Y.get(e.index()).times(confidence(e.get())), Functions.PLUS);
    }

    return columnVectorAsMatrix(YtransponseCuPu);
  }

  private Matrix columnVectorAsMatrix(Vector v) {
    double[][] matrix = new double[numFeatures][1];
    for (Vector.Element e : v.all()) {
      matrix[e.index()][0] = e.get();
    }
    return new DenseMatrix(matrix, true);
  }

}

Spark (Scala + Breeze):

def updateFactorsImplicit(
    UorI: mutable.Map[Int, DenseVector[Double]],
    ratings: Vector[Double],
    YtY: DenseMatrix[Double]) = {

  // set up required intermediate data structures
  val nui = ratings.activeSize
  val UorIMat = DenseMatrix.zeros[Double](nui, numF)
  val CuMinusIY = DenseMatrix.zeros[Double](nui, numF)
  val Cup = DenseVector.zeros[Double](nui)
  var j = 0

  ratings.activeIterator.foreach { case (i, v) =>
    CuMinusIY(j, ::) := UorI(i) :* alpha :* v
    Cup(j) = alpha * v + 1
    UorIMat(j, ::) := UorI(i)
    j += 1
  }

  val YtCuY = YtY + UorIMat.t * CuMinusIY + (DenseMatrix.eye[Double](numF) :* lambda)
  val YtCup = UorIMat.t * Cup
  YtCuY \ YtCup
}

(Slide callouts on the Scala version: Matrix / vector multiplication; Element-wise operations)
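Not from the slides: the "matrix / vector multiplication" and "element-wise operations" called out above are plain Breeze operators; a tiny illustration with made-up values.

import breeze.linalg.{DenseMatrix, DenseVector}

val A = DenseMatrix((1.0, 2.0), (3.0, 4.0))
val x = DenseVector(1.0, 2.0)

val y = A * x        // matrix / vector multiplication
val z = x :* 2.0     // element-wise multiplication
val w = A \ x        // solve the linear system A w = x (as in YtCuY \ YtCup above)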

Slide 32

SparkSQL • SparkSQL

val events = rdd.map { case (_, m) =>
  Event(m("time").toLong, m("event").toString, ...)
}

// complex join and filter in Spark ...

events.registerAsTable("events")
val aggs = hql(
  "select from_unixtime(cast(time/1000.0 as bigint), 'yyyy-MM-dd HH:00:00') hour, event, count(1) from events ..."
)

// save results
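Not on the slide: a minimal, self-contained sketch of the pieces the snippet assumes, using the Spark 1.0 API. The Event case class, the local SparkContext and the stand-in rdd are hypothetical; the slide's own query is elided above so a trivial aggregation is used here.

import org.apache.spark.SparkContext

case class Event(time: Long, event: String)   // trimmed-down version of the Event above

val sc = new SparkContext("local[2]", "SparkSQL sketch")
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._   // brings hql(...) and the implicit that turns a case-class RDD into a table

// stand-in for the (key, Map) pair RDD on the slide
val rdd = sc.parallelize(Seq(("k1", Map("time" -> "1400000000000", "event" -> "click"))))

val events = rdd.map { case (_, m) => Event(m("time").toLong, m("event")) }
events.registerAsTable("events")

val aggs = hql("select event, count(1) from events group by event")
aggs.collect().foreach(println)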

Slide 33

Shark (SQL) Performance

Slide 34

Shark (SQL) Performance

Slide 35

Successor to MapReduce • Apache Spark 1.0.0! • All Hadoop providers announced support / partnerships: Cloudera, MapR, Hortonworks • Databricks Cloud (http://databricks.com/cloud/) • Try it out: http://spark.apache.org/ • Local mode on your laptop • Cluster on Amazon EC2 via the Spark launch scripts

Slide 36

Live Demo Time

Slide 37

No content

Slide 38

We’re hiring!

Slide 39

nick@graphflow.com