Beyond Shuffling - Tips and Tricks for scaling Apache Spark (Vancouver 2015)

Beyond Shuffling: tips & tricks for scaling Apache Spark
Vancouver Spark 2015

Who am I?
• My name is Holden Karau
• Preferred pronouns are she/her
• I’m a Software Engineer at IBM
• Previously Alpine, Databricks, Google, Foursquare & Amazon
• Co-author of Learning Spark & Fast Data Processing with Spark
  ◦ Co-author of a new book focused on Spark performance coming out next year*
• @holdenkarau
• SlideShare: http://www.slideshare.net/hkarau
• LinkedIn: https://www.linkedin.com/in/holdenkarau
• GitHub: https://github.com/holdenk
• Spark Videos: http://bit.ly/holdenSparkVideos

What is going to be covered:
• What I think I might know about you
• RDD re-use (caching, persistence levels, and checkpointing)
• Working with key/value data
  ◦ Why groupByKey is evil and what we can do about it
• Best practices for Spark accumulators*
• When Spark SQL can be amazing and wonderful
• A quick detour into some future performance work in Spark MLlib

Who I think you wonderful humans are?
• Nice* people
• Know some Apache Spark
• Want to scale your Apache Spark jobs
Photo: Lori Erickson

Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
Photo from Cocoa Dream

RDD re-use - sadly not magic
• If we know we are going to re-use the RDD what should we do?
  ◦ If it fits nicely in memory: caching in memory
  ◦ Persisting at another level
    ▪ MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER
  ◦ Checkpointing
• Noisy clusters
  ◦ The _2 (replicated) storage levels & checkpointing can help
Photo: Richard Gillin
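A minimal sketch of the re-use options listed above, assuming a Spark build with the standard `RDD.persist`, `RDD.checkpoint`, and `SparkContext.setCheckpointDir` calls. The names `ReuseSketch`, `persistForReuse`, `checkpointForReuse`, and the `checkpointDir` argument are illustrative and do not come from the talk.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object ReuseSketch {
  // Persist an RDD we plan to use more than once. The level is a parameter so
  // the caller can pick a serialized (_SER) or replicated (_2) level.
  def persistForReuse[T](
      rdd: RDD[T],
      level: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER): RDD[T] =
    rdd.persist(level)

  // Checkpointing writes the RDD to the checkpoint directory and truncates
  // its lineage. checkpointDir is a hypothetical path supplied by the caller.
  def checkpointForReuse[T](
      sc: SparkContext,
      rdd: RDD[T],
      checkpointDir: String): RDD[T] = {
    sc.setCheckpointDir(checkpointDir)
    // Persist first so the checkpoint job does not recompute the RDD.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    // checkpoint() only marks the RDD; the write happens on the first action.
    rdd.checkpoint()
    rdd
  }
}
```

On a noisy cluster one could pass `StorageLevel.MEMORY_AND_DISK_2` to `persistForReuse` so that a lost executor does not force a recompute.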

Considerations for Key/Value Data
• What does the distribution of keys look like?
• What type of aggregations do we need to do?
• Do we want our data in any particular order?
• Are we joining with another RDD?
• What’s our partitioner?
  ◦ If we don’t have an explicit one: what is the partition structure?
Photo: eleda 1
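The partitioner question can be checked directly on a pair RDD. The sketch below assumes the standard `RDD.partitioner` field and `HashPartitioner` class; `PartitionerSketch`, `describe`, and `hashPartition` are illustrative names only.

```scala
import scala.reflect.ClassTag
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PartitionerSketch {
  // Report whether a pair RDD carries an explicit partitioner.
  def describe[K, V](pairs: RDD[(K, V)]): String =
    pairs.partitioner match {
      case Some(p) => s"explicit partitioner with ${p.numPartitions} partitions"
      case None    => s"no partitioner, ${pairs.partitions.length} partitions"
    }

  // Give the RDD an explicit hash partitioner (this triggers a shuffle).
  def hashPartition[K: ClassTag, V: ClassTag](
      pairs: RDD[(K, V)],
      numPartitions: Int): RDD[(K, V)] =
    pairs.partitionBy(new HashPartitioner(numPartitions))
}
```

An RDD built with `sc.parallelize` has no partitioner, so `describe` reports only the partition count until `hashPartition` (or a key-based operation such as `reduceByKey`) assigns one.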

What is key skew and why do we care?
• Keys aren’t evenly distributed
  ◦ Sales by zip code, or records by city, etc.
• groupByKey will explode (but it's pretty easy to break)
• We can have really unbalanced partitions
  ◦ If we have enough key skew sortByKey could even fail
  ◦ Stragglers (uneven sharding can make some tasks take much longer)
Photo: Mitchell Joyce
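One hedged way to look for key skew before running an expensive job is to count records per key and inspect the heaviest ones. `SkewSketch` and `heaviestKeys` are made-up names; the calls used (`reduceByKey`, `top`) are standard RDD operations.

```scala
import scala.reflect.ClassTag
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SkewSketch {
  // Return the n keys with the most records, heaviest first.
  // The per-key counts are combined map-side, so only one (key, count)
  // pair per key and partition is shuffled.
  def heaviestKeys[K: ClassTag, V](pairs: RDD[(K, V)], n: Int): List[(K, Long)] =
    pairs
      .map { case (k, _) => (k, 1L) }
      .reduceByKey(_ + _)
      .top(n)(Ordering.by[(K, Long), Long](_._2))
      .toList
}
```

On very large inputs one might call this on `pairs.sample(withReplacement = false, fraction = 0.01)` instead of the full RDD; the fraction here is an arbitrary illustration.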

groupByKey - just how evil is it?
• Pretty evil
• Groups all of the records with the same key into a single record
  ◦ Even if we immediately reduce it (e.g. sum it or similar)
  ◦ This can be too big to fit in memory, then our job fails
• Unless we are in SQL then happy pandas
Photo: geckoam

So what does that look like?
(94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F)
(94110, A, R) (10003, A, R) (94110, D, R) (94110, E, R)
(94110, E, R) (67843, T, R) (94110, T, R) (94110, T, R)
(67843, T, R) (10003, A, R)
(94110, [(A, B), (A, C), (E, F), (A, R), (D, R), (E, R), (E, R), (T, R), (T, R)])

Let’s revisit wordcount with groupByKey
    val words = rdd.flatMap(_.split(" "))
    val wordPairs = words.map((_, 1))
    val grouped = wordPairs.groupByKey()
    grouped.mapValues(_.sum)

And now back to the “normal” version
    val words = rdd.flatMap(_.split(" "))
    val wordPairs = words.map((_, 1))
    val wordCounts = wordPairs.reduceByKey(_ + _)
    wordCounts

Let’s see what it looks like when we run the two
Quick pastebin of the code for the two: http://pastebin.com/CKn0bsqp
    val rdd = sc.textFile("python/pyspark/*.py", 20) // Make sure we have many partitions
    // Evil group by key version
    val words = rdd.flatMap(_.split(" "))
    val wordPairs = words.map((_, 1))
    val grouped = wordPairs.groupByKey()
    val evilWordCounts = grouped.mapValues(_.sum)
    evilWordCounts.take(5)
    // Less evil version
    val wordCounts = wordPairs.reduceByKey(_ + _)
    wordCounts.take(5)

GroupByKey

reduceByKey

So what did we do instead?
• reduceByKey
  ◦ Works when the types are the same (e.g. in our summing version)
• aggregateByKey
  ◦ Doesn’t require the types to be the same (e.g. computing stats model or similar)
Allows Spark to pipeline the reduction & skip making the list
We also got a map-side reduction (note the difference in shuffled read)
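The slides show reduceByKey but not aggregateByKey. As a small sketch of the "types differ" case, the following computes a per-key mean: the input values are `Double`, while the intermediate aggregate is a `(sum, count)` pair. `AggregateSketch` and `meanByKey` are illustrative names.

```scala
import scala.reflect.ClassTag
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object AggregateSketch {
  // Per-key mean without building a per-key list of values.
  def meanByKey[K: ClassTag](pairs: RDD[(K, Double)]): RDD[(K, Double)] =
    pairs
      .aggregateByKey((0.0, 0L))(
        // seqOp: fold one value into the partition-local (sum, count)
        (acc: (Double, Long), v: Double) => (acc._1 + v, acc._2 + 1L),
        // combOp: merge (sum, count) pairs coming from different partitions
        (a: (Double, Long), b: (Double, Long)) => (a._1 + b._1, a._2 + b._2))
      .mapValues { case (sum, count) => sum / count }
}
```

Like reduceByKey, this combines values on the map side, so only one `(sum, count)` pair per key and partition crosses the shuffle.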

So why did we read in python/*.py?
If we just read in the standard README.md file there aren’t enough duplicated keys for the reduceByKey & groupByKey difference to be really apparent
Which is why groupByKey can be safe sometimes

Can just the shuffle cause problems?
• Sorting by key can put all of the records in the same partition
• We can run into partition size limits (around 2GB)
• Or just get bad performance
• To handle data like the records below, we can add some “junk” to our key
(94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F)
(94110, A, R) (10003, A, R) (94110, D, R) (94110, E, R)
(94110, E, R) (67843, T, R) (94110, T, R) (94110, T, R)
Photo: Todd Klassy
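A hedged sketch of the "junk in the key" idea: append a small salt to every key so that records for one hot key no longer all share a single key. The talk does not show how the junk is chosen; deriving it from the record index via `zipWithIndex`, and the names `SaltSketch`, `salt`, and `unsalt`, are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SaltSketch {
  // Turn key k into (k, s) with s in [0, buckets). The salt comes from the
  // record's index, so it is deterministic for a given RDD.
  def salt[K, V](pairs: RDD[(K, V)], buckets: Int): RDD[((K, Int), V)] =
    pairs.zipWithIndex().map { case ((k, v), i) => ((k, (i % buckets).toInt), v) }

  // Drop the salt again once the skew-sensitive step is done.
  def unsalt[K, V](salted: RDD[((K, Int), V)]): RDD[(K, V)] =
    salted.map { case ((k, _), v) => (k, v) }
}
```

With salted keys, `sortByKey()` still orders records by the original key first (tuples compare element by element), but the range partitioner can now split one hot key across several partitions.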

Shuffle explosions :(
(94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F)
(94110, A, R) (10003, A, R) (94110, D, R) (94110, E, R)
(94110, E, R) (67843, T, R) (94110, T, R) (94110, T, R)

(94110, A, B) (94110, A, C) (94110, E, F) (94110, A, R) (94110, D, R) (94110, E, R) (94110, E, R) (94110, T, R) (94110, T, R)
(67843, T, R)
(10003, A, R)

Spark accumulators
• Really “great” way for keeping track of failed records
• Double counting makes things really tricky
  ◦ Jobs which worked “fine” don’t continue to work “fine” when minor changes happen
• Relative rules can save us* under certain conditions
Photo: Found Animals Foundation
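To make the double-counting problem concrete, here is a small sketch in which an accumulator is updated inside a `map` and the resulting RDD is then used by two actions. It assumes the `SparkContext.longAccumulator` API from Spark 2.0 and later, not the older `sc.accumulator` call that appears in this talk; `AccumulatorSketch` and `countTwice` are illustrative names.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorSketch {
  // Runs two actions over the same mapped RDD and returns the accumulator
  // value observed after the first and after the second action.
  def countTwice(sc: SparkContext, cache: Boolean): (Long, Long) = {
    val seen = sc.longAccumulator("seen")
    val mapped = sc.parallelize(1 to 10, 2).map { x => seen.add(1); x }
    if (cache) mapped.cache()
    mapped.count() // first action: the map function runs once per record
    val afterFirst = seen.sum
    mapped.count() // second action: the map runs again unless it is cached
    (afterFirst, seen.sum)
  }
}
```

Without caching the second action recomputes the map, so ten records are counted as twenty. Caching usually hides the problem, but cached partitions can be evicted and tasks can be retried, which is why a job that worked "fine" can stop doing so after a minor change.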

Using an accumulator for validation:
    val (ok, bad) = (sc.accumulator(0), sc.accumulator(0))
    val records = input.map { x =>
      if (isValid(x)) ok += 1 else bad += 1
      // Actual parse logic here
    }
    // An action (e.g. count, save, etc.)
    if (bad.value > 0.1 * ok.value) {
      throw new Exception("bad data - do not use results")
      // Optional cleanup
    }
    // Mark as safe
P.S: If you are interested in this check out spark-validator (still early stages).
Photo: Found Animals Foundation

Using a library: simple historic validation
    val vc = new ValidationConf(jobHistoryPath, "1", true,
      List[ValidationRule](new AvgRule("acc", 0.001, Some(200))))
    val v = Validation(sc, vc)
    // Some job logic
    // Register an accumulator (optional)
    val acc = sc.accumulator(0)
    v.registerAccumulator(acc, "acc")
    // More Job logic goes here
    if (v.validate(jobId)) {
      // Success logic goes here
    } else sadness()
Photo by Dvortygirl

With a Spark internal counter...
    val vc = new ValidationConf(tempPath, "1", true,
      List[ValidationRule](
        new AbsoluteSparkCounterValidationRule("recordsRead", Some(30), Some(1000))))
    val sqlCtx = new SQLContext(sc)
    val v = Validation(sc, sqlCtx, vc)
    // Do work here....
    assert(v.validate(5) === true)
Photo by Dvortygirl

Where can Spark SQL benefit perf?
• Structured or semi-structured data
• OK with having less* complex operations available to us
• We may only need to operate on a subset of the data
  ◦ The fastest data to process isn’t even read
• Remember that non-magic cat? It’s got some magic** now
  ◦ In part from peeking inside of boxes
• non-JVM (aka Python & R) users: saved from double serialization cost! :)
**Magic may cause stack overflow. Not valid in all states. Consult local magic bureau before attempting magic
Photo: Matti Mattila

Why is Spark SQL good for those things?
• Space efficient columnar cached representation
• Able to push down operations to the data store
• Optimizer is able to look inside of our operations
  ◦ Regular Spark can’t see inside our operations to spot the difference between (min(_, _)) and (append(_, _))
Photo: Matti Mattila
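As a rough illustration of an operation the optimizer can look inside, the sketch below expresses a filter and a `min` aggregation through the DataFrame API instead of opaque lambdas. It assumes the `SparkSession` entry point from Spark 2.0 and later (the talk itself uses `SQLContext`); `SqlSketch`, `minAmountForZip`, and the `zip` and `amount` column names are made up for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, min}

object SqlSketch {
  // Minimum amount for one zip code. Both the filter and the aggregate are
  // declarative expressions, so the optimizer knows this is a min over a
  // filtered subset and can plan accordingly.
  def minAmountForZip(sales: DataFrame, zip: Int): DataFrame =
    sales
      .filter(col("zip") === zip)
      .groupBy("zip")
      .agg(min("amount").as("min_amount"))
}
```

Calling `.explain()` on the returned DataFrame prints the plan Spark chose, which is one way to check whether a filter was pushed toward the data source.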

Preview: bringing codegen to Spark ML
• Based on Spark SQL’s code generation
  ◦ First draft using quasiquotes
  ◦ Switch to janino for Java compilation
• Initial draft for Gradient Boosted Trees
  ◦ Based on DB’s work
  ◦ First draft with QuasiQuotes
    ▪ Moved to Java for speed
  ◦ See SPARK-10387 for the details
Photo: Jon

What the generated code looks like:
    @Override
    public double call(Vector input) throws Exception {
      if (input.apply(1) <= 1.0) {
        return 0.1;
      } else {
        if (input.apply(0) <= 0.5) {
          return 0.0;
        } else {
          return 2.0;
        }
      }
    }
[Tree diagram: node (1, 1.0) with leaf 0.1 and child node (0, 0.5); child node has leaves 0.0 and 2.0]
Photo: Glenn Simmons

Everyone* needs reduce, let’s make it faster!
• reduce & aggregate have “tree” versions
• we already had free map-side reduction
• but now we can get even better!**
**And we might be able to make even cooler versions
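The "tree" versions mentioned above are `treeReduce` and `treeAggregate` on `RDD`. A small sketch of both follows; `TreeSketch`, `treeSum`, and `treeMean` are illustrative names, and the default depth of 2 mirrors the library default.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object TreeSketch {
  // Sum with a multi-level combine instead of sending every partition's
  // result straight to the driver.
  def treeSum(rdd: RDD[Int], depth: Int = 2): Int =
    rdd.treeReduce(_ + _, depth)

  // Mean via treeAggregate: the accumulator type (sum, count) differs from
  // the element type, just as with aggregate.
  def treeMean(rdd: RDD[Double], depth: Int = 2): Double = {
    val (sum, count) = rdd.treeAggregate((0.0, 0L))(
      (acc: (Double, Long), v: Double) => (acc._1 + v, acc._2 + 1L),
      (a: (Double, Long), b: (Double, Long)) => (a._1 + b._1, a._2 + b._2),
      depth)
    sum / count
  }
}
```

The tree variants tend to help when there are many partitions or large per-partition results, since partial results are merged on executors before reaching the driver.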

Additional Resources
• Programming guide (along with JavaDoc, PyDoc, ScalaDoc, etc.)
  ◦ http://spark.apache.org/docs/latest/
• Books
• Videos
• Denny’s meetup on Wednesday :)
• Spark Office Hours
  ◦ follow me on twitter for future ones - https://twitter.com/holdenkarau
  ◦ fill out this survey to choose the next date - http://bit.ly/spOffice1
Photo: raider of gin

Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Coming soon: Spark in Action

And the next book…..
Still being written - sign up to be notified when it is available:
• http://www.highperformancespark.com
• https://twitter.com/highperfspark

Q&A OR A quick detour into Spark testing?
• It's like a choose your own adventure novel, but with voting
• But more like the voting in High School since if we are running out of time we might just skip it

Spark Videos
• Apache Spark YouTube Channel
• My Spark videos on YouTube -
  ◦ http://bit.ly/holdenSparkVideos
• Spark Summit 2014 training
• Paco’s Introduction to Apache Spark

k thnx bye!
If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark
Will tweet results “eventually” @holdenkarau
Cat wave photo by Quinn Dombrowski
