Slide 1

Slide 1 text

SCALA + BIG DATA
Paris Scala Meetup, 05/29/2013
Sam Bessalah

Slide 2

Slide 2 text

Outline
- Scala in the Hadoop world
  - Hadoop and MapReduce basics
  - Scalding
  - A word about other Scala DSLs: Scrunch and Scoobi
- Spark and Co.
  - Spark
  - Spark Streaming
- More projects using Scala for data analysis

Slide 3

Slide 3 text

SCALA and HADOOP

Slide 4

Slide 4 text

The new darling of data crunchers at scale

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

Hadoop  Redundant , fault tolerant data storage  Parallel computation framework  Job coordination

Slide 7

Slide 7 text

MapReduce  A programming model for expressing distributed computations at massive scale  An execution framework for organizing and performing those computations in an efficient and fault tolerant way,  Bundled within the hadoop framework

Slide 8

Slide 8 text

MapReduce redux
- At a high level, a job implements two functions:
  Map(k1, v1) → List(k2, v2)
  Reduce(k2, List(v2)) → List(k3, v3)
- The framework takes care of all the plumbing: distribution, sorting, shuffling, ...
- Values with the same key flow to the same reducer
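
To make the model concrete, here is a minimal sketch (plain Scala, not the Hadoop Java API) of the two functions a word-count job would implement; the framework, not shown here, handles grouping the emitted pairs by key:

object WordCountModel {
  // Map(k1, v1) -> List(k2, v2): k1 is a line offset, v1 the line's text
  def map(offset: Long, line: String): List[(String, Int)] =
    line.toLowerCase.split("\\W+").filter(_.nonEmpty).map(w => (w, 1)).toList

  // Reduce(k2, List(v2)) -> List(k3, v3): all counts for one word arrive together
  def reduce(word: String, counts: List[Int]): List[(String, Int)] =
    List((word, counts.sum))
}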

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

- Way too long for simple word counting
- This gave birth to new tools like Hive and Pig
- Pig: a scripting language for data flows

text = LOAD 'text' USING TextLoader();
tokens = FOREACH text GENERATE FLATTEN(TOKENIZE($0)) AS word;
wordcount = FOREACH (GROUP tokens BY word)
            GENERATE group AS word, COUNT_STAR($1) AS ct;

Slide 13

Slide 13 text

Cascading  Open source created by Chris Wensel, now developped at @Concurrent.  Written in Java, evolves around the concept of Pipes or Data flow eventually transformed into MapReduce jobs

Slide 14

Slide 14 text

- Cascading changes the MR programming model into a generic, data-flow-oriented programming model
- A Flow is composed of a Source, a Sink, and a Pipe connecting them
- A Pipe is a set of transformations over the input data
- Pipes can be combined to create more complex workflows
- Contains a flow optimizer that converts a user data flow into an optimized data flow, which can in turn be converted into an efficient MapReduce job
- We can think of pipes as distributed collections

Slide 15

Slide 15 text

Word Count redux ..

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

But...
- Cascading makes use of FP idioms
- Functions are wrapped in objects
- Constructors (new) define composition between pipes
- The MapReduce paradigm itself derives from FP
So why not use functional programming?

Slide 18

Slide 18 text

SCALDING
- A Scala DSL on top of Cascading
- Open source project developed at Twitter by
  Avi Bryant (@avibryant)
  Oscar Boykin (@posco)
  Argyris Zymnis (@argyris)
- https://github.com/twitter/scalding

Slide 19

Slide 19 text

Scalding
- Two APIs:
  * Fields API: the primary API, using Cascading Fields; dynamic, with errors at runtime
  * Typed (type-safe) API: uses Scala types, with errors at compile time. We'll focus on this one.
- The two can be bridged using pipe.typed and TypedPipe.from

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

Scalding word count
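
The original slide shows the code as an image; below is a plausible reconstruction of a typed-API word count (a sketch, not the slide's exact code), with names chosen to line up with the next slide's groupedWords and countedWords:

import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  val lines: TypedPipe[String] = TypedPipe.from(TextLine(args("input")))

  val groupedWords = lines
    .flatMap(_.toLowerCase.split("\\s+"))
    .filter(_.nonEmpty)
    .groupBy(identity)                  // Grouped[String, String]

  val countedWords = groupedWords.size  // (word, count) pairs

  countedWords
    .toTypedPipe
    .write(TypedTsv[(String, Long)](args("output")))
}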

Slide 22

Slide 22 text

In reality:

val countedWords = groupedWords.size
val countedWords = groupedWords.mapValues(x => 1L).sum
val countedWords = groupedWords.mapValues(x => 1L)
                               .reduce((l, r) => implicitly[Monoid[Long]].plus(l, r))

Slide 23

Slide 23 text

Fields-based API
- pipe.flatMap(existingFields -> additionalFields){function}
- pipe.map(existingFields -> additionalFields){function}
- pipe.project(fields)
- pipe.discard(fields)
- pipe.mapTo(existingFields -> additionalFields){function}
- pipe.groupBy(fields){ group => ... }
- group.reduce(field){function}
- group.foldLeft(field){function}
- ...
https://github.com/twitter/scalding/wiki/Fields-based-API-Reference

Slide 24

Slide 24 text

Grouping and mapping
GroupBuilder: a builder-pattern object that operates over groups of rows in a pipe. Used by groupBy; adds fields that are reductions of existing ones. Helps build several parallel aggregations (counting, summing, ...) in one pass. Great for stream aggregation.
mapReduceMap: map-side aggregation, derived from Cascading, using combiners instead of reducers. Gotcha: it doesn't work with foldLeft, which is pushed to the reducers.
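
A small fields-API sketch of a GroupBuilder computing several reductions in a single pass over each group (the input source and field names here are made up for illustration):

import com.twitter.scalding._

class UserStatsJob(args: Args) extends Job(args) {
  Tsv(args("events"), ('userId, 'duration)).read
    .groupBy('userId) { group =>
      group
        .size('events)                    // count rows per user
        .average('duration -> 'avgTime)   // several aggregations, one pass
    }
    .write(Tsv(args("output")))
}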

Slide 25

Slide 25 text

Type-safe API
Two concepts:
TypedPipe[T]
- Wraps a Cascading Pipe object. Instances are distributed on the cluster, and transformations happen on top of them.
- Similar interface to scala.collection.Iterator[T]
KeyedList[K, V]
- A sharded collection of key-value objects. Two implementations:
  Grouped[K, V]: the usual grouping on a key K
  CoGrouped[K, V, W, Result]: a co-group over two grouped pipes, used for joins

Slide 26

Slide 26 text

Optimized joins
joinWithTiny: map-side join. Left-side asymmetric join with a smaller pipe. Uses Cascading's HashJoin, a non-blocking asymmetric join where the smaller side fits in memory.
blockJoinWithSmaller: performs a block join, replicating data.
skewJoinWithSmaller|Larger: deals with skewed pipes.
crossWithTiny: cross product with a moderately sized pipe; can create a huge output.
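
For example, a hypothetical map-side join in the fields API, where the titles pipe is assumed small enough to be held in memory on each mapper (paths and field names are invented):

import com.twitter.scalding._

class PageTitleJoinJob(args: Args) extends Job(args) {
  val views  = Tsv(args("views"),  ('url, 'visitor)).read   // large pipe
  val titles = Tsv(args("titles"), ('url2, 'title)).read    // small pipe

  views
    .joinWithTiny('url -> 'url2, titles)   // Cascading HashJoin: no reduce phase
    .project('url, 'visitor, 'title)
    .write(Tsv(args("output")))
}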

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

MATRIX API
Generic Matrix API built on abstract algebra (monoids, rings, ...)
- Value operations: mapValues, filterValues, binarizeAs
- Vector operations: getRow, reduceRowVectors, mapRows, rowL2Normalize, rowMeanCentering, ...
- Usual matrix operations: transpose, product, ...
- pipe.toMatrix, pipe.flatMapToMatrix(fields) with a mapping function, ...

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

Scalding is not the only Scala DSL for MR
- Scrunch: built on top of Crunch, an MR pipelining library in Java developed at Cloudera
- Scoobi: built at NICTA. Same idea as Crunch, except fully written in Scala; uses distributed lists (DList) to mimic pipelines

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

Scrunch style

object WordCount extends PipelineApp {
  def countWords(file: String) = {
    read(from.textFile(file))
      .flatMap(_.split("\\W+").filter(!_.isEmpty()))
      .count
  }

  val counts = join(countWords(args(0)), countWords(args(1)))
  write(counts, to.textFile(args(2)))
}

Slide 34

Slide 34 text

Spark
In-memory, interactive and real-time analytics for large datasets
Sam Bessalah @samklr
Slides adapted from Matei Zaharia, UC Berkeley

Slide 35

Slide 35 text

What is Spark?
Fast, expressive cluster computing system compatible with Apache Hadoop
- Works with any Hadoop-supported storage system (HDFS, S3, Avro, ...)
Improves efficiency through:
- In-memory computing primitives
- General computation graphs
→ Up to 100× faster
Improves usability through:
- Rich APIs in Java, Scala, Python
- Interactive shell
→ Often 2-10× less code

Slide 36

Slide 36 text

Key idea: work with distributed collections as if they were local
Concept: Resilient Distributed Datasets (RDDs)
- Immutable collections of objects spread across a cluster
- Built through parallel transformations (map, filter, etc.)
- Automatically rebuilt on failure
- Controllable persistence (like caching in RAM)

Slide 37

Slide 37 text

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns.

lines = spark.textFile("hdfs://...")          // base RDD
errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count    // action
cachedMsgs.filter(_.contains("bar")).count
...

[Diagram: the driver sends tasks to workers; each worker reads an HDFS block, caches its partition of the messages RDD, and returns results.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data).

Slide 38

Slide 38 text

Fault Tolerance
RDDs track lineage information that can be used to efficiently reconstruct lost partitions.

Ex: messages = textFile(...).filter(_.startsWith("ERROR"))
                            .map(_.split('\t')(2))

Lineage: HDFS File → (filter) → FilteredRDD → (map) → MappedRDD

Slide 39

Slide 39 text

Spark in Java and Scala

Java API:
JavaRDD<String> lines = spark.textFile(...);
JavaRDD<String> errors = lines.filter(
  new Function<String, Boolean>() {
    public Boolean call(String s) {
      return s.contains("ERROR");
    }
  });
errors.count();

Scala API:
val lines = spark.textFile(...)
val errors = lines.filter(s => s.contains("ERROR"))
// can also write filter(_.contains("ERROR"))
errors.count

Slide 40

Slide 40 text

Which Language Should I Use?
- Standalone programs can be written in any of them, but the console is only available for Python and Scala
- Python developers: can stay with Python for both
- Java developers: consider using Scala for the console (to learn the API)
- Performance: Java / Scala will be faster (statically typed), but Python can do well for numerical work with NumPy

Slide 41

Slide 41 text

Scala Cheat Sheet

Variables:
var x: Int = 7
var x = 7        // type inferred
val y = "hi"     // read-only

Functions:
def square(x: Int): Int = x * x
def square(x: Int): Int = {
  x * x          // last line returned
}

Collections and closures:
val nums = Array(1, 2, 3)
nums.map((x: Int) => x + 2)   // => Array(3, 4, 5)
nums.map(x => x + 2)          // => same
nums.map(_ + 2)               // => same
nums.reduce((x, y) => x + y)  // => 6
nums.reduce(_ + _)            // => 6

Slide 42

Slide 42 text

Learning Spark
Easiest way: the Spark interpreter (spark-shell or pyspark)
- Special Scala and Python consoles for cluster use
Runs in local mode on 1 thread by default, but can be controlled with the MASTER environment variable:
MASTER=local ./spark-shell              # local, 1 thread
MASTER=local[2] ./spark-shell           # local, 2 threads
MASTER=spark://host:port ./spark-shell  # Spark standalone cluster

Slide 43

Slide 43 text

First Stop: SparkContext
- Main entry point to Spark functionality
- Created for you in Spark shells as the variable sc
- In standalone programs, you'd make your own (see later for details)
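
For reference, a minimal sketch of a standalone program creating its own SparkContext, using the pre-Apache, Spark 0.7-era package and constructor that these slides assume (master URL and app name are placeholders):

import spark.SparkContext
import spark.SparkContext._

object StandaloneApp {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[2]", "StandaloneApp")
    val nums = sc.parallelize(1 to 1000)
    println("multiples of 7: " + nums.filter(_ % 7 == 0).count())
    sc.stop()
  }
}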

Slide 44

Slide 44 text

Creating RDDs

# Turn a local collection into an RDD
sc.parallelize([1, 2, 3])

# Load text file from local FS, HDFS, or S3
sc.textFile("file.txt")
sc.textFile("directory/*.txt")
sc.textFile("hdfs://namenode:9000/path/file")

# Use any existing Hadoop InputFormat
sc.hadoopFile(keyClass, valClass, inputFmt, conf)

Slide 45

Slide 45 text

Basic Transformations

nums = sc.parallelize([1, 2, 3])

# Pass each element through a function
squares = nums.map(lambda x: x*x)            # => {1, 4, 9}

# Keep elements passing a predicate
even = squares.filter(lambda x: x % 2 == 0)  # => {4}

# Map each element to zero or more others
nums.flatMap(lambda x: range(0, x))          # => {0, 0, 1, 0, 1, 2}
# range(0, x) is the sequence of numbers 0, 1, ..., x-1

Slide 46

Slide 46 text

Basic Actions

nums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collection
nums.collect()    # => [1, 2, 3]

# Return first K elements
nums.take(2)      # => [1, 2]

# Count number of elements
nums.count()      # => 3

# Merge elements with an associative function
nums.reduce(lambda x, y: x + y)  # => 6

# Write elements to a text file
nums.saveAsTextFile("hdfs://file.txt")

Slide 47

Slide 47 text

Working with Key-Value Pairs
Spark's "distributed reduce" transformations act on RDDs of key-value pairs.

Python:
pair = (a, b)
pair[0]  # => a
pair[1]  # => b

Scala:
val pair = (a, b)
pair._1  // => a
pair._2  // => b

Java:
Tuple2 pair = new Tuple2(a, b);  // class scala.Tuple2
pair._1  // => a
pair._2  // => b

Slide 48

Slide 48 text

Some Key-Value Operations

pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])

pets.reduceByKey(lambda x, y: x + y)  # => {(cat, 3), (dog, 1)}
pets.groupByKey()                     # => {(cat, Seq(1, 2)), (dog, Seq(1))}
pets.sortByKey()                      # => {(cat, 1), (cat, 2), (dog, 1)}

reduceByKey also automatically implements combiners on the map side.

Slide 49

Slide 49 text

Multiple Datasets

visits = sc.parallelize([("index.html", "1.2.3.4"),
                         ("about.html", "3.4.5.6"),
                         ("index.html", "1.3.3.1")])

pageNames = sc.parallelize([("index.html", "Home"),
                            ("about.html", "About")])

visits.join(pageNames)
# ("index.html", ("1.2.3.4", "Home"))
# ("index.html", ("1.3.3.1", "Home"))
# ("about.html", ("3.4.5.6", "About"))

visits.cogroup(pageNames)
# ("index.html", (Seq("1.2.3.4", "1.3.3.1"), Seq("Home")))
# ("about.html", (Seq("3.4.5.6"), Seq("About")))

Slide 50

Slide 50 text

Closure Mishap Example

class MyCoolRddApp {
  val param = 3.14
  val log = new Log(...)
  ...
  def work(rdd: RDD[Int]) {
    rdd.map(x => x + param)
       .reduce(...)
  }
}
// Throws NotSerializableException: MyCoolRddApp (or Log)

How to get around it:

class MyCoolRddApp {
  ...
  def work(rdd: RDD[Int]) {
    val param_ = param            // references only a local variable instead of this.param
    rdd.map(x => x + param_)
       .reduce(...)
  }
}

Slide 51

Slide 51 text

SPARK INTERNALS

Slide 52

Slide 52 text

Components

Your program:
sc = new SparkContext
f = sc.textFile("...")
f.filter(...).count()
...

[Diagram: the Spark client (app master) builds the RDD graph and holds the scheduler, block tracker and shuffle tracker; a cluster manager assigns work to Spark workers, each running task threads and a block manager over HDFS, HBase, ...]

Slide 53

Slide 53 text

Example Job

val sc = new SparkContext("spark://...", "MyJob", home, jars)

val file = sc.textFile("hdfs://...")            // resilient distributed
val errors = file.filter(_.contains("ERROR"))   // datasets (RDDs)
errors.cache()
errors.count()                                  // action

Slide 54

Slide 54 text

RDD Graph

Dataset-level view:
file:   HadoopRDD (path = hdfs://...)
errors: FilteredRDD (func = _.contains(...), shouldCache = true)

Partition-level view: one task per partition (Task 1, Task 2, ...)

Slide 55

Slide 55 text

Data Locality
- First run: data not in cache, so use the HadoopRDD's locality preferences (from HDFS)
- Second run: FilteredRDD is in cache, so use its locations
- If something falls out of cache, go back to HDFS

Slide 56

Slide 56 text

Broadcast Variables When one creates a broadcast variable b with a value v, v is saved to a file in a shared file system. The serialized form of b is a path to this file. When b’s value is queried on a worker node, Spark first checks whether v is in a local cache, and reads it from the file system if it isn’t.
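
A short sketch of how that looks in code, assuming a SparkContext named sc as in the shell examples above (the lookup table and RDD contents are illustrative):

val pageNames = Map("index.html" -> "Home", "about.html" -> "About")
val bNames = sc.broadcast(pageNames)   // shipped once per node, not once per task

val visits = sc.parallelize(Seq(("index.html", "1.2.3.4"), ("about.html", "3.4.5.6")))
val titled = visits.map { case (url, ip) => (bNames.value.getOrElse(url, "unknown"), ip) }
titled.collect().foreach(println)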

Slide 57

Slide 57 text

Accumulators Each accumulator is given a unique ID when it is created. When the accumulator is saved, its serialized form contains its ID and the “zero” value for its type. On the workers, a separate copy of the accumulator is created for each thread that runs a task using thread-local variables, and is reset to zero when a task begins. After each task runs, the worker sends a message to the driver program containing the updates it made to various accumulators.
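
A sketch of a typical use, counting malformed records while parsing; sc is assumed to be a SparkContext and the parsing logic is invented:

val badRecords = sc.accumulator(0)

val lines = sc.parallelize(Seq("1,2", "3,4", "oops"))
val pairs = lines.flatMap { line =>
  line.split(",") match {
    case Array(a, b) => Some((a.toInt, b.toInt))
    case _           => badRecords += 1; None
  }
}

pairs.count()                                // an action forces the updates
println("bad records: " + badRecords.value)  // only the driver reads the value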

Slide 58

Slide 58 text

Scheduling Process

rdd1.join(rdd2)
    .groupBy(...)
    .filter(...)

RDD Objects: build the operator DAG
DAGScheduler: splits the graph into stages of tasks and submits each stage as it becomes ready (agnostic to operators)
TaskScheduler: launches tasks via the cluster manager and retries failed or straggling tasks (doesn't know about stages)
Worker: executes tasks, stores and serves blocks (block manager, task threads)

Slide 59

Slide 59 text

Example: HadoopRDD
partitions = one per HDFS block
dependencies = none
compute(partition) = read corresponding block
preferredLocations(part) = HDFS block location
partitioner = none
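
This slide and the next two (FilteredRDD, JoinedRDD) fill in the same five properties; roughly, the interface they instantiate looks like the following simplified sketch (placeholder types, not Spark's real classes):

trait Partition { def index: Int }
trait Dependency
trait Partitioner

abstract class SketchRDD[T] {
  def partitions: Seq[Partition]                      // how the data is split
  def dependencies: Seq[Dependency]                   // links to parent RDDs
  def compute(p: Partition): Iterator[T]              // compute one partition
  def preferredLocations(p: Partition): Seq[String]   // locality hints
  def partitioner: Option[Partitioner]                // hash/range partitioning, if any
}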

Slide 60

Slide 60 text

Example: FilteredRDD
partitions = same as parent RDD
dependencies = "one-to-one" on parent
compute(partition) = compute parent and filter it
preferredLocations(part) = none (ask parent)
partitioner = none

Slide 61

Slide 61 text

Example: JoinedRDD
partitions = one per reduce task
dependencies = "shuffle" on each parent
compute(partition) = read and join shuffled data
preferredLocations(part) = none
partitioner = HashPartitioner(numTasks)
Spark will now know this data is hashed!

Slide 62

Slide 62 text

Dependency Types
"Narrow" dependencies: map, filter, union, join with co-partitioned inputs
"Wide" (shuffle) dependencies: groupByKey, join with inputs not co-partitioned

Slide 63

Slide 63 text

DAG Scheduler
Interface: receives a "target" RDD, a function to run on each partition, and a listener for results
Roles:
- Build stages of Task objects (code + preferred locations)
- Submit them to the TaskScheduler as they become ready
- Resubmit failed stages if outputs are lost

Slide 64

Slide 64 text

Scheduler Optimizations
- Pipelines narrow operations within a stage
- Picks join algorithms based on partitioning (minimizes shuffles)
- Reuses previously cached data

[Diagram: RDDs A-G split into Stages 1-3 around map, union, groupBy and join; previously computed partitions are reused rather than recomputed]

Slide 65

Slide 65 text

Example: K-Means Clustering using Spark

Slide 66

Slide 66 text

Clustering
Grouping data according to similarity, e.g. finds from an archaeological dig plotted by distance east vs. distance north.

Slide 67

Slide 67 text

Clustering
Grouping data according to similarity, e.g. finds from an archaeological dig plotted by distance east vs. distance north.

Slide 68

Slide 68 text

K-Means Algorithm
Benefits:
- Popular
- Fast
- Conceptually straightforward

Slide 69

Slide 69 text

K-Means: preliminaries
Data: a collection of values (feature vectors)

data = lines.map(line => parseVector(line))

Slide 70

Slide 70 text

K-Means: preliminaries
Dissimilarity: squared Euclidean distance

dist = p.squaredDist(q)

Slide 71

Slide 71 text

K-Means: preliminaries
K = number of clusters
Data assignments to clusters: S_1, S_2, ..., S_K

Slide 72

Slide 72 text

K-Means: preliminaries
K = number of clusters
Data assignments to clusters: S_1, S_2, ..., S_K

Slide 73

Slide 73 text

K-Means Algorithm
- Initialize K cluster centers
- Repeat until convergence:
  Assign each data point to the cluster with the closest center.
  Assign each cluster center to be the mean of its cluster's data points.

Slide 74

Slide 74 text

K-Means Algorithm
- Initialize K cluster centers
- Repeat until convergence:
  Assign each data point to the cluster with the closest center.
  Assign each cluster center to be the mean of its cluster's data points.

Slide 75

Slide 75 text

K-Means Algorithm
- Initialize K cluster centers
- Repeat until convergence:
  Assign each data point to the cluster with the closest center.
  Assign each cluster center to be the mean of its cluster's data points.

centers = data.takeSample(false, K, seed)

Slide 76

Slide 76 text

K-Means Algorithm
- Initialize K cluster centers
- Repeat until convergence:
  Assign each data point to the cluster with the closest center.
  Assign each cluster center to be the mean of its cluster's data points.

centers = data.takeSample(false, K, seed)

Slide 77

Slide 77 text

K-Means Algorithm
- Initialize K cluster centers
- Repeat until convergence:
  Assign each data point to the cluster with the closest center.
  Assign each cluster center to be the mean of its cluster's data points.

centers = data.takeSample(false, K, seed)

Slide 78

Slide 78 text

K-Means Algorithm
- Initialize K cluster centers
- Repeat until convergence:
  Assign each cluster center to be the mean of its cluster's data points.

centers = data.takeSample(false, K, seed)
closest = data.map(p => (closestPoint(p, centers), p))

Slide 79

Slide 79 text

K-Means Algorithm
- Initialize K cluster centers
- Repeat until convergence:
  Assign each cluster center to be the mean of its cluster's data points.

centers = data.takeSample(false, K, seed)
closest = data.map(p => (closestPoint(p, centers), p))

Slide 80

Slide 80 text

K-Means Algorithm
- Initialize K cluster centers
- Repeat until convergence:
  Assign each cluster center to be the mean of its cluster's data points.

centers = data.takeSample(false, K, seed)
closest = data.map(p => (closestPoint(p, centers), p))

Slide 81

Slide 81 text

K-Means Algorithm
- Initialize K cluster centers
- Repeat until convergence:

centers = data.takeSample(false, K, seed)
closest = data.map(p => (closestPoint(p, centers), p))
pointsGroup = closest.groupByKey()

Slide 82

Slide 82 text

K-Means Algorithm
- Initialize K cluster centers
- Repeat until convergence:

centers = data.takeSample(false, K, seed)
closest = data.map(p => (closestPoint(p, centers), p))
pointsGroup = closest.groupByKey()
newCenters = pointsGroup.mapValues(ps => average(ps))

Slide 83

Slide 83 text

K-Means Algorithm
- Initialize K cluster centers
- Repeat until convergence:

centers = data.takeSample(false, K, seed)
closest = data.map(p => (closestPoint(p, centers), p))
pointsGroup = closest.groupByKey()
newCenters = pointsGroup.mapValues(ps => average(ps))

Slide 84

Slide 84 text

K-Means Algorithm
- Initialize K cluster centers
- Repeat until convergence:

centers = data.takeSample(false, K, seed)
closest = data.map(p => (closestPoint(p, centers), p))
pointsGroup = closest.groupByKey()
newCenters = pointsGroup.mapValues(ps => average(ps))

Slide 85

Slide 85 text

K-Means Algorithm
- Initialize K cluster centers
- Repeat until convergence:

centers = data.takeSample(false, K, seed)
closest = data.map(p => (closestPoint(p, centers), p))
pointsGroup = closest.groupByKey()
newCenters = pointsGroup.mapValues(ps => average(ps))
while (dist(centers, newCenters) > ɛ)

Slide 86

Slide 86 text

K-Means Algorithm
- Initialize K cluster centers
- Repeat until convergence:

centers = data.takeSample(false, K, seed)
closest = data.map(p => (closestPoint(p, centers), p))
pointsGroup = closest.groupByKey()
newCenters = pointsGroup.mapValues(ps => average(ps))
while (dist(centers, newCenters) > ɛ)

Slide 87

Slide 87 text

K-Means Source

centers = data.takeSample(false, K, seed)
while (d > ɛ) {
  closest = data.map(p => (closestPoint(p, centers), p))
  pointsGroup = closest.groupByKey()
  newCenters = pointsGroup.mapValues(ps => average(ps))
  d = distance(centers, newCenters)
  centers = newCenters.map(_)
}

Slide 88

Slide 88 text

Ease of use
- Interactive shell: useful for featurization and pre-processing of data
- Lines of code for K-Means:
  Spark: ~90 lines (part of the hands-on tutorial!)
  Hadoop/Mahout: 4 files, > 300 lines

Slide 89

Slide 89 text

Example: PageRank

Slide 90

Slide 90 text

Why PageRank?
- Good example of a more complex algorithm: multiple stages of map & reduce
- Benefits from Spark's in-memory caching: multiple iterations over the same data

Slide 91

Slide 91 text

Basic Idea
Give pages ranks (scores) based on links to them:
- Links from many pages → high rank
- Link from a high-rank page → high rank
Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png

Slide 92

Slide 92 text

Algorithm
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs
[Diagram: a four-page link graph, every page starting at rank 1.0]

Slide 93

Slide 93 text

Algorithm
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs
[Diagram: first-iteration contributions (0.5 and 1.0) flowing along the links]

Slide 94

Slide 94 text

Algorithm
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs
[Diagram: ranks after the first iteration: 1.85, 1.0, 0.58, 0.58]

Slide 95

Slide 95 text

Algorithm
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs
[Diagram: second-iteration contributions flowing along the links]

Slide 96

Slide 96 text

Algorithm
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs
[Diagram: ranks after the second iteration: 1.72, 1.31, 0.58, 0.39, ...]

Slide 97

Slide 97 text

Algorithm
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs
Final state: ranks converge to 1.44, 1.37, 0.73, 0.46

Slide 98

Slide 98 text

Scala Implementation

val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)

Slide 99

Slide 99 text

PageRank Performance

Slide 100

Slide 100 text

SPARK STREAMING

Slide 101

Slide 101 text

What is Spark Streaming?
- Framework for large-scale stream processing
- Scales to hundreds of nodes
- Can achieve second-scale latencies
- Integrates with Spark's batch and interactive processing
- Provides a simple, batch-like API for implementing complex algorithms
- Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.
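
A minimal sketch of setting up a streaming job with the 0.7-era API these slides use; the master, app name, batch interval and socket source are placeholders:

import spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]) {
    val ssc = new StreamingContext("local[2]", "StreamingSketch", Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)  // any live text source
    lines.flatMap(_.split(" ")).countByValue().print()
    ssc.start()
  }
}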

Slide 102

Slide 102 text

Requirements
- Scalable to large clusters
- Second-scale latencies
- Simple programming model

Slide 103

Slide 103 text

Requirements
- Scalable to large clusters
- Second-scale latencies
- Simple programming model
- Integrated with batch & interactive processing

Slide 104

Slide 104 text

Stateful Stream Processing
Traditional streaming systems have an event-driven, record-at-a-time processing model:
- Each node has mutable state
- For each record, update state & send new records
- State is lost if a node dies!
Making stateful stream processing fault-tolerant is challenging.
[Diagram: input records flowing through nodes 1-3, each holding mutable state]

Slide 105

Slide 105 text

Existing Streaming Systems
Storm:
- Replays a record if it is not processed by a node
- Processes each record at least once
- May update mutable state twice!
- Mutable state can be lost due to failure!
Trident: uses transactions to update state
- Processes each record exactly once
- Per-state transaction updates are slow

Slide 106

Slide 106 text

Requirements
- Scalable to large clusters
- Second-scale latencies
- Simple programming model
- Integrated with batch & interactive processing
- Efficient fault tolerance in stateful computations

Slide 107

Slide 107 text

Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs:
- Chop up the live stream into batches of X seconds
- Spark treats each batch of data as an RDD and processes it using RDD operations
- Finally, the processed results of the RDD operations are returned in batches
[Diagram: a live data stream enters Spark Streaming as batches of X seconds; Spark processes them and emits results in batches]

Slide 108

Slide 108 text

Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs:
- Batch sizes as low as ½ second, latency ~1 second
- Potential for combining batch processing and streaming processing in the same system

Slide 109

Slide 109 text

Example – Get hashtags from Twitter

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

tweets is a DStream: a sequence of RDDs representing a stream of data, pulled from the Twitter Streaming API and stored in memory as RDDs (immutable, distributed), one per batch (batch @ t, t+1, t+2, ...).

Slide 110

Slide 110 text

Example – Get hashtags from Twitter

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))

flatMap is a transformation: it modifies the data in one DStream to create another DStream. New RDDs of the hashTags DStream (e.g. [#cat, #dog, ...]) are created for every batch.

Slide 111

Slide 111 text

Example – Get hashtags from Twitter

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

saveAsHadoopFiles is an output operation: it pushes data to external storage; every batch of the hashTags DStream is saved to HDFS.

Slide 112

Slide 112 text

Java Example

Scala:
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

Java:
JavaDStream<Status> tweets = ssc.twitterStream(<Twitter username>, <Twitter password>);
JavaDStream<String> hashTags = tweets.flatMap(new Function<...>() { });  // Function object defines the transformation
hashTags.saveAsHadoopFiles("hdfs://...");

Slide 113

Slide 113 text

Fault-tolerance
- RDDs remember the sequence of operations that created them from the original fault-tolerant input data
- Batches of input data are replicated in memory across multiple worker nodes, and are therefore fault-tolerant
- Data lost due to worker failure can be recomputed from the input data
[Diagram: input data replicated in memory; lost partitions of the hashTags RDD are recomputed from the tweets RDD on other workers via flatMap]

Slide 114

Slide 114 text

Key concepts
- DStream – sequence of RDDs representing a stream of data
  (Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets)
- Transformations – modify data from one DStream to another
  Standard RDD operations – map, countByValue, reduce, join, ...
  Stateful operations – window, countByValueAndWindow, ...
- Output operations – send data to an external entity
  saveAsHadoopFiles – saves to HDFS
  foreach – do anything with each batch of results
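
For instance, a windowed count over the hashTags DStream from the earlier example (a sketch; the durations are illustrative):

val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()

This recomputes the hashtag counts over the last 10 minutes of data every second.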

Slide 115

Slide 115 text

Example 2 – Count the hashtags

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.countByValue()

For each batch, flatMap, map and reduceByKey run to produce the tagCounts DStream, e.g. [(#cat, 10), (#dog, 25), ...].

Slide 116

Slide 116 text

Fault-tolerant Stateful Processing
All intermediate data are RDDs, and hence can be recomputed if lost.
[Diagram: hashTags and tagCounts RDDs across batches t-1 through t+3]

Slide 117

Slide 117 text

Fault-tolerant Stateful Processing
- State data is not lost even if a worker node dies
  (it does not change the value of your result)
- Exactly-once semantics for all transformations
  (no double counting!)

Slide 118

Slide 118 text

Other Interesting Operations
Maintaining arbitrary state, e.g. tracking sessions:
- Maintain per-user mood as state, and update it with his/her tweets:
  tweets.updateStateByKey(tweet => updateMood(tweet))

Doing arbitrary Spark RDD computations within a DStream:
- Join incoming tweets with a spam file to filter out bad tweets:
  tweets.transform(tweetsRDD => {
    tweetsRDD.join(spamHDFSFile).filter(...)
  })

Slide 119

Slide 119 text

Performance
Can process 6 GB/sec (60M records/sec) of data on 100 nodes at sub-second latency.
Tested with 100 streams of data on 100 EC2 instances with 4 cores each.

Slide 120

Slide 120 text

OTHER PROJECTS
- Scoobi
- Scrunch
- ScalaNLP / Breeze
- Saddle
- Factorie
- ...
- Scala Notebook

Slide 121

Slide 121 text

THANKS

Slide 122

Slide 122 text

Bibliography
Slides for Scalding shamelessly inspired by:
- Mario Pastorelli, "Scalding: Programming Model for Hadoop" – http://fr.slideshare.net/melrief/scalding-programming-model-for-hadoop
- Dean Wampler (@deanwampler), Scalding workshop – code: https://github.com/ThinkBigAnalytics/scalding-workshop – slides: http://polyglotprogramming.com/papers/ScaldingForHadoop.pdf
SPARK: http://spark-project.org/documentation/