Slide 63
Why did I start gorillalabs/sparkling?
First, there was clj-spark from The Climate Corporation. Very basic, and not maintained anymore.
Then I found out about flambo from yieldbot. It looked promising at first: a fresh release, maybe even used in production at yieldbot.
Small jobs were quick to develop with Spark.
I ran into sooooo many problems (running on Spark Standalone, moving to YARN, fighting with low memory). Most had nothing to do with flambo, but with understanding the nuts and bolts of Spark, YARN, and the other elements of my infrastructure. OK, some had to do with serializing my Clojure data structures.
Scaling up the amount of data led me straight into hell. My system was way slower than our existing solution. Was Spark the wrong way to go? I felt exactly like the author of http://blog.explainmydata.com/2014/05/spark-should-be-better-than-mapreduce.html: "Spark should be better than MapReduce (if only it worked)".
After some thinking, I found out what happened: flambo promises to keep you in Clojure-land. Therefore, it uses a map operation to convert a Scala Tuple2 into a Clojure vector and back again wherever necessary. But map loses your Partitioner information. Remember my point? So flambo broke Einstein's "as simple as possible, but no simpler".
I fixed the library and incorporated a different take on serializing functions (one without reflection). That's when I released gorillalabs/sparkling.
I needed to tweak the data model to use the same partitioner all over the place, or to fall back to hand-crafted data structures and broadcasts for the data not fitting my model (see the sketch below). I ended up with code generating an index structure from an RDD, sorted tree-sets for date-ranged data, and so forth. And everything is fully unit-tested, because that's the only way to go.
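Here is a sketch of what that data-model tweak buys, again in Spark's Scala API and with purely illustrative names, not sparkling's actual code: with one shared partitioner a join becomes partition-local (no extra shuffle), and small side data travels as a broadcast instead of being joined in.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object SharedPartitionerDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("shared-partitioner-demo"))

    // One partitioner, shared by every pair RDD in the data model.
    val part   = new HashPartitioner(4)
    val users  = sc.parallelize(Seq(1 -> "alice", 2 -> "bob")).partitionBy(part)
    val orders = sc.parallelize(Seq(1 -> "book", 2 -> "mug")).partitionBy(part)

    // Co-partitioned inputs make the join partition-local: no shuffle,
    // and the result keeps the shared partitioner.
    val joined = users.join(orders)
    println(joined.partitioner)  // Some(HashPartitioner)

    // Small side data that does not fit the partitioned model travels
    // as a broadcast instead of being joined in.
    val labels = sc.broadcast(Map(1 -> "retail", 2 -> "wholesale"))
    joined.map { case (id, (name, item)) =>
      (id, name, item, labels.value.getOrElse(id, "unknown"))
    }.collect().foreach(println)

    sc.stop()
  }
}
```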
Now my system outperforms a much bigger MySQL-based system even on a local master, and it scales almost linearly with the number of cores on a cluster. HURRAY!