
Apache Spark - Next Generation Big Data Technology

MLnick
July 17, 2014

A presentation I gave at the PHP Cape Town User Group meetup

Transcript


  2. Apache Spark: Next Generation Big Data Technology

  3. About
    • @MLnick
    • Co-founder @graphflow - big data & machine learning applied to recommendations, consumer behaviour & insights
    • Apache Spark committer
    • Author of “Machine Learning with Spark”
      • Packt RAW: http://www.packtpub.com/machine-learning-with-spark/book

  4. Agenda
    • BIG DATA!
    • Herding Elephants
    • Making Sparks
    • Conclusion


  5-9. Big Data Everywhere

  10. Cutting through the Hype
    • Massive and growing amount of data collected
    • Moore’s law => cost of compute, disk, RAM decreasing rapidly
    • But data is still growing faster than single-node performance can handle
    • Factors:
      • ease of collection
      • cost of storage
      • mobile
      • IoT
      • science (CERN, SKA)

  11. Herding Elephants - Hadoop
    • Google was doing big data before it was cool…
    • Google File System paper (2003) -> Apache Hadoop Distributed File System (HDFS)
    • Google MapReduce paper (2004) -> Apache Hadoop MapReduce
    • BigTable (2006) -> Apache HBase
    • Dremel (2010) -> Apache Drill (and others)

  12. Herding Elephants - Hadoop
    • Hadoop started at Yahoo
    • Open sourced -> Apache Hadoop
    • Spawned Cloudera, MapR, Hortonworks… and an entire big data industry
    • Old scaling == vertical, big tin
    • New scaling == horizontal, shared nothing, data parallel, commodity hardware, embrace failure!

  13. Herding Elephants - Hadoop
    Diagram: HDFS (NameNode + 4 DataNodes), MapReduce (JobTracker + 4 Task Trackers)

  14. Herding Elephants - Hadoop
    Diagram: HDFS (NameNode + 4 DataNodes), MapReduce (JobTracker + 4 Task Trackers)
    • HDFS
      • Replication
      • Fault tolerance

  15. Herding Elephants - Hadoop
    Diagram: HDFS (NameNode + 4 DataNodes, 2 Blocks), MapReduce (JobTracker + 4 Task Trackers)
    • HDFS
      • Replication
      • Fault tolerance

  16. Herding Elephants - Hadoop
    Diagram: HDFS (NameNode + 4 DataNodes, 8 Blocks), MapReduce (JobTracker + 4 Task Trackers)
    • HDFS
      • Replication
      • Fault tolerance

  17. Herding Elephants - Hadoop
    Diagram: HDFS (NameNode + 3 DataNodes, 8 Blocks), MapReduce (JobTracker + 4 Task Trackers)
    • HDFS
      • Replication
      • Fault tolerance

  18. Herding Elephants - Hadoop
    Diagram: HDFS (NameNode + 3 DataNodes, 8 Blocks), MapReduce (JobTracker + 4 Task Trackers)
    • HDFS
      • Replication
      • Fault tolerance
    • Map Reduce
      • Data locality
      • Fault tolerance


  20. Herding Elephants - Hadoop
    Diagram: HDFS (NameNode + 3 DataNodes, 8 Blocks), MapReduce (JobTracker + 3 Task Trackers)
    • HDFS
      • Replication
      • Fault tolerance
    • Map Reduce
      • Data locality
      • Fault tolerance
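
    The replication and fault-tolerance idea in the diagram can be sketched in a few lines of plain Python. This is a toy model only: the round-robin placement, the replication factor of 3 and the names `place_blocks` and `handle_failure` are assumptions made for illustration, not the real HDFS placement policy or API.

```python
# Toy model of HDFS-style block replication (not the real placement policy).
REPLICATION = 3  # assumed replication factor


def place_blocks(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = {nodes[(i + r) % len(nodes)] for r in range(replication)}
    return placement


def handle_failure(placement, failed, live_nodes, replication=REPLICATION):
    """Drop the failed node, then copy under-replicated blocks to live nodes."""
    for holders in placement.values():
        holders.discard(failed)
        for node in live_nodes:
            if len(holders) >= replication:
                break
            holders.add(node)  # no-op if this node already holds the block
    return placement


nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_blocks(["b1", "b2", "b3"], nodes)

# One DataNode disappears, as in the slide build; the survivors take over.
live = [n for n in nodes if n != "dn4"]
placement = handle_failure(placement, "dn4", live)
```

    After the failure every block is back at three replicas on the remaining DataNodes, which is the behaviour the slide build illustrates.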


  21. MapReduce


  22. MapReduce:
    Counting Words
    • “Hadoop is a distributed system for counting words” (Scalding GitHub)
    • Map
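    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.*;
    import org.apache.hadoop.mapreduce.lib.input.*;
    import org.apache.hadoop.mapreduce.lib.output.*;

    public class WordCount {
      public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emit (word, 1) for every token in the input line
        public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
          }
        }
      }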

  23. MapReduce:
    Counting Words
    • Reduce
      public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        // Sum the counts emitted for each word
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

  24. MapReduce:
    Counting Words
    • Job Setup
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        // Output types, mapper/reducer classes and I/O formats
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // Input and output paths come from the command line
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
      }
    }
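
    To make the three phases behind the Java word count easier to see, here is a minimal single-process sketch in plain Python. It is not Hadoop code: `map_phase`, `shuffle` and `reduce_phase` are hypothetical names, and the shuffle that the framework performs across the cluster is simulated with a dictionary.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit (word, 1) for every token in the line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


lines = ["to be or not to be", "to do"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

    In a real job the mapped pairs are written to disk and moved over the network before the reducers run, which is where the per-job disk I/O cost comes from.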


  25. Hadoop Issues
    • Pros
      • Reliable in the face of failure (which will happen at scale) - disk, network, node, rack …
      • Very scalable: ~40,000 nodes at Yahoo!
    • Cons
      • Disk I/O for every job
      • Unwieldy API (hence Cascading, Scalding, Crunch, …)

  26. So Why Spark?
    • In-memory caching == fast!
    • Broadcast variables and accumulator primitives built-in
    • Resilient Distributed Datasets (RDD) - recomputed on the fly in case of failure
    • Hadoop compatible
    • Rich, functional API in Scala, Python, Java and R
    • One platform for multiple use cases:
      • Shark / SparkSQL - SQL on Spark
      • Spark Streaming - Real time processing
      • Machine Learning - MLlib
      • Graph Processing - GraphX
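
    The caching and "recomputed on the fly" points can be illustrated with a toy lineage model in plain Python. This is a sketch under stated assumptions, not Spark's implementation: the `Dataset` class and its `lose_cache` method are hypothetical, and a real RDD tracks lineage per partition across many machines.

```python
class Dataset:
    """Hypothetical toy dataset that remembers how it was derived (its lineage)."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source        # base data, only set on the root dataset
        self.parent = parent        # lineage: the dataset this one derives from
        self.fn = fn                # lineage: the function applied to the parent
        self.cache_enabled = False
        self.cached = None
        self.computations = 0       # how often the data was actually computed

    def map(self, fn):
        return Dataset(parent=self, fn=fn)

    def cache(self):
        self.cache_enabled = True
        return self

    def collect(self):
        if self.cached is not None:
            return self.cached      # fast path: served from memory
        self.computations += 1
        if self.parent is None:
            data = list(self.source)
        else:
            data = [self.fn(x) for x in self.parent.collect()]
        if self.cache_enabled:
            self.cached = data
        return data

    def lose_cache(self):
        """Simulate losing the in-memory copy, e.g. when a node fails."""
        self.cached = None


squares = Dataset(source=range(5)).map(lambda x: x * x).cache()
first = squares.collect()            # computed from lineage, then cached
second = squares.collect()           # served from the cache
computed_before_loss = squares.computations
squares.lose_cache()
third = squares.collect()            # recomputed from lineage after the loss
```

    The result is the same in all three calls; only the amount of work differs, which is the trade the slide describes.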


  27. Spark Word Count
    val file = spark.textFile("hdfs://...")
    val counts = file.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")
    • Power of functional constructs and the Scala language
    • Same code locally or on a cluster
    val sparkLocal = new SparkContext("local[4]", "Local 4 Threads")
    val sparkOnCluster = new SparkContext("spark://...:7077", "Cluster of 1000 Nodes")

  28. Spark Word Count
    • Python
    file = spark.textFile("hdfs://...")
    counts = file.flatMap(lambda line: line.split(" ")) \
                 .map(lambda word: (word, 1)) \
                 .reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile("hdfs://...")
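
    For readers without a cluster at hand, the three transformations used above can be mimicked with a small in-memory stand-in. `LocalRDD` is a hypothetical class written for this note and is not part of Spark; it only borrows the method names `flatMap`, `map`, `reduceByKey` and `collect` to show what each step produces.

```python
class LocalRDD:
    """Hypothetical in-memory stand-in for an RDD; not part of Spark."""

    def __init__(self, items):
        self.items = list(items)

    def flatMap(self, f):
        # Apply f to every item and flatten the resulting sequences
        return LocalRDD(y for x in self.items for y in f(x))

    def map(self, f):
        # Apply f to every item, one output per input
        return LocalRDD(f(x) for x in self.items)

    def reduceByKey(self, f):
        # Merge the values of (key, value) pairs that share a key
        merged = {}
        for key, value in self.items:
            merged[key] = f(merged[key], value) if key in merged else value
        return LocalRDD(merged.items())

    def collect(self):
        return self.items


file = LocalRDD(["a b a", "b c"])
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
result = dict(counts.collect())
```

    The chained calls are written exactly as on the slide; only the object they run against differs.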


  29. Spark Machine Learning
    val points = spark.textFile(...).map(parsePoint).cache()
    var w = Vector.random(D) // initial weight vector
    for (i <- 1 to ITERATIONS) {
      val gradient = points.map(p =>
        (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      w -= gradient
    }
    • Caching data in memory allows subsequent iterations to be faster
    • Scala allows concise code, closer to the actual maths!
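
    The same update rule can be run without Spark as a plain-Python sketch. The four data points, the zero starting vector (the slide starts from a random one) and the helper names `dot` and `gradient` are assumptions chosen to keep the example small and reproducible; labels are +1 / -1 as the formula expects.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def gradient(w, points):
    """Sum of per-point logistic-loss gradients, same formula as the slide."""
    total = [0.0] * len(w)
    for x, y in points:
        scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
        total = [t + scale * xi for t, xi in zip(total, x)]
    return total


# Tiny linearly separable dataset: (features, label)
points = [
    ((1.0, 2.0), 1.0),
    ((2.0, 1.5), 1.0),
    ((-1.0, -1.5), -1.0),
    ((-2.0, -1.0), -1.0),
]

w = [0.0, 0.0]  # assumed fixed start instead of a random vector
for _ in range(10):
    g = gradient(w, points)
    w = [wi - gi for wi, gi in zip(w, g)]

# A point is classified correctly when its label and score have the same sign
all_correct = all(y * dot(w, x) > 0 for x, y in points)
```

    Every iteration re-reads all of `points`, which is why keeping them cached in memory pays off in the Spark version.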


  30. Spark Machine Learning
    • Small benchmark dataset - 1 million rows
    • 12x speed up
    • Spark version is a far more complex and efficient algorithm, and is still 50% code size

    Model          | Hadoop ALS (Mahout)             | MLlib ALS
    Runtime        | 6m39s                           | 24.8s
    Lines of Code  | 1374 + 137 + 121 + 116 = ~1750  | ~880

  31. Spark vs Hadoop
    public class ImplicitFeedbackAlternatingLeastSquaresSolver {

      private final int numFeatures;
      private final double alpha;
      private final double lambda;

      private final OpenIntObjectHashMap<Vector> Y;
      private final Matrix YtransposeY;

      public ImplicitFeedbackAlternatingLeastSquaresSolver(int numFeatures, double lambda, double alpha,
          OpenIntObjectHashMap<Vector> Y) {
        this.numFeatures = numFeatures;
        this.lambda = lambda;
        this.alpha = alpha;
        this.Y = Y;
        YtransposeY = getYtransposeY(Y);
      }

      public Vector solve(Vector ratings) {
        return solve(YtransposeY.plus(getYtransponseCuMinusIYPlusLambdaI(ratings)), getYtransponseCuPu(ratings));
      }

      private static Vector solve(Matrix A, Matrix y) {
        return new QRDecomposition(A).solve(y).viewColumn(0);
      }

      double confidence(double rating) {
        return 1 + alpha * rating;
      }

      /* Y' Y */
      private Matrix getYtransposeY(OpenIntObjectHashMap<Vector> Y) {

        IntArrayList indexes = Y.keys();
        indexes.quickSort();
        int numIndexes = indexes.size();

        double[][] YtY = new double[numFeatures][numFeatures];

        // Compute Y'Y by dot products between the 'columns' of Y
        for (int i = 0; i < numFeatures; i++) {
          for (int j = i; j < numFeatures; j++) {
            double dot = 0;
            for (int k = 0; k < numIndexes; k++) {
              Vector row = Y.get(indexes.getQuick(k));
              dot += row.getQuick(i) * row.getQuick(j);
            }
            YtY[i][j] = dot;
            if (i != j) {
              YtY[j][i] = dot;
            }
          }
        }
        return new DenseMatrix(YtY, true);
      }

      /** Y' (Cu - I) Y + λ I */
      private Matrix getYtransponseCuMinusIYPlusLambdaI(Vector userRatings) {
        Preconditions.checkArgument(userRatings.isSequentialAccess(), "need sequential access to ratings!");

        /* (Cu -I) Y */
        OpenIntObjectHashMap<Vector> CuMinusIY = new OpenIntObjectHashMap<Vector>(userRatings.getNumNondefaultElements());
        for (Element e : userRatings.nonZeroes()) {
          CuMinusIY.put(e.index(), Y.get(e.index()).times(confidence(e.get()) - 1));
        }

        Matrix YtransponseCuMinusIY = new DenseMatrix(numFeatures, numFeatures);

        /* Y' (Cu -I) Y by outer products */
        for (Element e : userRatings.nonZeroes()) {
          for (Vector.Element feature : Y.get(e.index()).all()) {
            Vector partial = CuMinusIY.get(e.index()).times(feature.get());
            YtransponseCuMinusIY.viewRow(feature.index()).assign(partial, Functions.PLUS);
          }
        }

        /* Y' (Cu - I) Y + λ I add lambda on the diagonal */
        for (int feature = 0; feature < numFeatures; feature++) {
          YtransponseCuMinusIY.setQuick(feature, feature, YtransponseCuMinusIY.getQuick(feature, feature) + lambda);
        }

        return YtransponseCuMinusIY;
      }

      /** Y' Cu p(u) */
      private Matrix getYtransponseCuPu(Vector userRatings) {
        Preconditions.checkArgument(userRatings.isSequentialAccess(), "need sequential access to ratings!");

        Vector YtransponseCuPu = new DenseVector(numFeatures);

        for (Element e : userRatings.nonZeroes()) {
          YtransponseCuPu.assign(Y.get(e.index()).times(confidence(e.get())), Functions.PLUS);
        }

        return columnVectorAsMatrix(YtransponseCuPu);
      }

      private Matrix columnVectorAsMatrix(Vector v) {
        double[][] matrix = new double[numFeatures][1];
        for (Vector.Element e : v.all()) {
          matrix[e.index()][0] = e.get();
        }
        return new DenseMatrix(matrix, true);
      }

    }
    vs

    def updateFactorsImplicit(
        UorI: mutable.Map[Int, DenseVector[Double]],
        ratings: Vector[Double],
        YtY: DenseMatrix[Double]) = {

      // set up required intermediate data structures
      val nui = ratings.activeSize
      val UorIMat = DenseMatrix.zeros[Double](nui, numF)
      val CuMinusIY = DenseMatrix.zeros[Double](nui, numF)
      val Cup = DenseVector.zeros[Double](nui)
      var j = 0

      ratings.activeIterator.foreach{ case(i, v) => {
        CuMinusIY(j, ::) := UorI(i) :* alpha :* v
        Cup(j) = alpha * v + 1
        UorIMat(j, ::) := UorI(i)
        j += 1
      }}

      val YtCuY =
        YtY + UorIMat.t * CuMinusIY + (DenseMatrix.eye[Double](numF) :* lambda)
      val YtCup = UorIMat.t * Cup
      YtCuY \ YtCup
    }

    Callouts on the Scala code: Matrix / vector multiplication, Element-wise operations
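
    Both listings solve the same per-user linear system, which can be read off the comments in the Java code (Y' Y, Y' (Cu - I) Y + λ I, Y' Cu p(u)) and the `confidence` function. Written out, with the symbols x_u, C^u, r_{ui} and p(u) being the usual notation for implicit-feedback ALS rather than names that appear on the slide:

```latex
\begin{aligned}
c_{ui} &= 1 + \alpha \, r_{ui} \\
\left( Y^{\top} Y + Y^{\top} (C^{u} - I)\, Y + \lambda I \right) x_u &= Y^{\top} C^{u} \, p(u)
\end{aligned}
```

    Here C^u is the diagonal matrix of confidences c_{ui} for user u. The Scala version states this almost literally in its last three lines, while the Java version spreads it over several helper methods.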


  32. SparkSQL
    • SparkSQL
    val events = rdd.map { case (_, m) =>
      Event(m("time").toLong, m("event").toString, ...)
    }

    // complex join and filter in Spark
    ...

    events.registerAsTable("events")
    val aggs = hql(
      """select
           from_unixtime(cast(time/1000.0 as bigint), 'yyyy-MM-dd HH:00:00') hour,
           event,
           count(1)
         from events ..."""
    )

    // save results
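
    What the query computes can be shown on a handful of records in plain Python. This is a sketch with assumptions: the tail of the query is elided on the slide, so grouping by hour and event is a guess at its intent, the sample events are made up, and UTC is assumed where `from_unixtime` would use the session time zone.

```python
from collections import Counter
from datetime import datetime, timezone


def hour_bucket(time_ms):
    """Truncate an epoch-millisecond timestamp to its hour (UTC assumed)."""
    ts = datetime.fromtimestamp(time_ms // 1000, tz=timezone.utc)
    return ts.strftime("%Y-%m-%d %H:00:00")


# Made-up sample events; "time" is in milliseconds since the epoch
events = [
    {"time": 0, "event": "click"},
    {"time": 1_800_000, "event": "click"},
    {"time": 3_600_000, "event": "view"},
]

# Count events per (hour, event) pair, assuming that is the elided grouping
aggs = Counter((hour_bucket(e["time"]), e["event"]) for e in events)
```

    The first two clicks fall in the same hour bucket and are counted together; the view one hour later gets its own row.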


  33. Shark (SQL) Performance


  34. Shark (SQL) Performance


  35. Successor to MapReduce
    • Apache Spark 1.0.0!
    • All Hadoop providers announced support / partnerships: Cloudera, MapR, Hortonworks
    • Databricks Cloud (http://databricks.com/cloud/)
    • Try it out: http://spark.apache.org/
      • Local mode on your laptop
      • Cluster in Amazon EC2 via Spark launch scripts

  36. Live Demo Time



  38. We’re hiring!


  39. nick@graphflow.com
