Big Data Processing using Apache Spark and Clojure

Slide 1

Slide 1 text

Big Data Processing using Apache Spark and Clojure
Dr. Paulus Esterhazy and Dr. Christian Betz
January 2015
https://github.com/pesterhazy/, @pesterhazy
https://github.com/chrisbetz/, @chris_betz

Slide 2

Slide 2 text

Who uses Clojure?

Slide 3

Slide 3 text

Who's getting paid to use Clojure?

Slide 4

Slide 4 text

Who uses BigData?

Slide 5

Slide 5 text

Who uses Hadoop?

Slide 6

Slide 6 text

Who uses Spark?

Slide 7

Slide 7 text

About us

Slide 8

Slide 8 text

Paulus red pineapple media GmbH

Slide 9

Slide 9 text

Chris

Slide 10

Slide 10 text

WTF is Spark?
Disclaimer: We reuse stuff from
• Spark Performance (Patrick Wendell, Databricks)
• Common Patterns and Pitfalls for Implementing Algorithms in Spark (Hossein Falaki, @mhfalaki)
• Advanced Spark (Reynold Xin, July 2, 2014 @ Spark Summit Training)

Slide 11

Slide 11 text

Apache Spark - an Overview
"Apache Spark™ is a fast and general engine for large-scale data processing."
Value proposition?
• Spark keeps stuff in memory where possible, so intermediate results do not need I/O.
• Spark allows a quicker development cycle with proper unit tests (see later).
• Spark allows you to define your own data sources (JDBC in our case).
• Spark allows you to work with any data structures (though some are better than others).

Slide 12

Slide 12 text

Two Questions
• "I like Clojure, why might I be interested in Spark?"
• "Granted that Spark is useful, why program it in Clojure?"

Slide 13

Slide 13 text

Two Questions
• "I like Clojure, why might I be interested in Spark?"
• "Granted that Spark is useful, why program it in Clojure?"
That's you!

Slide 14

Slide 14 text

How Big Data is processed today
• Large amounts of data to process
• Hadoop is the de facto standard
• Hadoop = MapReduce + HDFS

Slide 15

Slide 15 text

However, Hadoop has some limitations
• Pain point: performance
• Writing to disk after each map/reduce step
• That's especially bad for chains of map/reduce steps and iterative algorithms (machine learning, PageRank)
• Identified bottleneck: HDD I/O

Slide 16

Slide 16 text

Spark's Answer
• Major innovation: data sharing between processing steps
• In-memory processing

Slide 17

Slide 17 text

Resilient Distributed Datasets (RDDs)
• Datasets: Collection of elements.
• Distributed: Could be on any node in the cluster.
• Resilient: Could get lost (or partially lost), doesn't matter. Spark will recompute.

Slide 18

Slide 18 text

Slide 19

Slide 19 text

Different types of RDDs, all the same interface
Scientific Answer: RDD is an Interface!
1. Set of partitions ("splits" in Hadoop)
2. List of dependencies on parent RDDs
3. Function to compute a partition (as an Iterator) given its parent(s)
4. (Optional) partitioner (hash, range)
5. (Optional) preferred location(s) for each partition
Items 1-3 make up the "lineage"; items 4-5 allow optimized execution.

Example: HadoopRDD
• partitions = one per HDFS block
• dependencies = none
• compute(part) = read corresponding block
• preferredLocations(part) = HDFS block location
• partitioner = none
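To make the five-part interface concrete, here is a minimal toy model in plain Clojure. It is only a sketch: the protocol and record names (ToyRDD, ParallelizedRDD, MappedRDD) are made up for illustration and are not Spark's or Sparkling's API. It shows that a derived RDD stores only its parent and a function, which is what lets a lost partition be recomputed from lineage.

```clojure
;; Toy model of the RDD interface. All names are hypothetical,
;; this is NOT the Spark or Sparkling API.
(defprotocol ToyRDD
  (partitions [rdd] "Indices of the partitions.")
  (dependencies [rdd] "Parent RDDs (the lineage).")
  (compute [rdd part] "Seq of the elements in one partition.")
  (partitioner [rdd] "Optional partitioner, or nil.")
  (preferred-locations [rdd part] "Optional host hints for one partition."))

;; Source RDD: a local collection that was split into chunks up front.
(defrecord ParallelizedRDD [chunks]
  ToyRDD
  (partitions [_] (range (count chunks)))
  (dependencies [_] [])
  (compute [_ part] (seq (nth chunks part)))
  (partitioner [_] nil)
  (preferred-locations [_ part] []))

;; Derived RDD: keeps only its parent and a function, never the data itself.
(defrecord MappedRDD [parent f]
  ToyRDD
  (partitions [_] (partitions parent))
  (dependencies [_] [parent])
  (compute [_ part] (map f (compute parent part)))
  (partitioner [_] nil)
  (preferred-locations [_ part] (preferred-locations parent part)))

;; "Action": compute every partition and gather the results.
(defn collect [rdd]
  (vec (mapcat #(compute rdd %) (partitions rdd))))

(def numbers (->ParallelizedRDD [[1 2] [3 4]]))
(def doubled (->MappedRDD numbers #(* 2 %)))
```

Calling (collect doubled) walks the lineage: each partition of doubled is recomputed from the matching partition of numbers, so nothing has to be stored to recover from a loss.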

Slide 20

Slide 20 text

Slide 21

Slide 21 text

How are RDDs handled?
1. You create an RDD from a data source, e.g. an HDFS file, a Cassandra DB query, or a JDBC query.
2. You transform RDDs (with map, filter, ...), which gives you new RDDs.
3. You perform an action on one RDD (like first, take, collect, count, ...) to get the results from that RDD into your "driver".
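The split between transformations and actions can be felt in plain Clojure, whose lazy sequences behave similarly. The sketch below is an analogy only and uses no Spark at all; the names calls, parse, source and lengths are made up. One difference to keep in mind: a Clojure lazy seq remembers its realized elements, whereas an RDD is recomputed on every action unless it is cached.

```clojure
;; Counts how often the "user function" actually runs.
(def calls (atom 0))

(defn parse [line]
  (swap! calls inc)
  (count line))

;; "Source": a local collection.
(def source ["a" "bb" "ccc"])

;; "Transformation": only describes the work, nothing runs yet.
(def lengths (map parse source))
(def calls-before @calls)

;; "Action": forces the computation and brings a result back.
(def total (reduce + lengths))
(def calls-after @calls)
```

Here calls-before is 0 and calls-after is 3: the function runs only once an action demands a result.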

Slide 22

Slide 22 text

Slide 27

Slide 27 text

Basic Building Blocks: RDDs (Resilient Distributed Datasets)
Spark follows a functional approach: You define collections (RDDs) and functions on collections.
Sources for RDDs:
• Local collections parallelized
• HDFS files
• Your own (e.g. JDBC-RDD)
Transformations (only a selection):
• map
• filter
Actions (only a selection):
• reduce (fn)
• count
Diagram: JdbcRDD (Query), HDFS-File (Path) → map, filter → join → filter → reduce
• Sources define the basic RDDs you're working on.
• Transformations create new RDDs. You provide your own functions in here!
• Actions spit a result to the Driver.

Slide 28

Slide 28 text

RDDs in Practice
Example code: https://github.com/gorillalabs/ClojureD

Slide 29

Slide 29 text

In Practice 1: line count

(defn line-count [lines]
  (->> lines
       count))

(defn process [f]
  (with-open [rdr (clojure.java.io/reader "in.log")]
    (let [result (f (line-seq rdr))]
      ;; realize lazy results before the reader is closed
      (if (seq? result)
        (doall result)
        result))))

(process line-count)

Slide 30

Slide 30 text

In Practice 2: line count cont'd

;; Aliases assumed here (not shown on the slide):
;; (require '[sparkling.core :as s] '[sparkling.conf :as s-conf])

(defn line-count* [lines]
  (->> lines
       s/count))

(defn new-spark-context []
  (let [c (-> (s-conf/spark-conf)
              (s-conf/master "local[*]")
              (s-conf/app-name "sparkling")
              (s-conf/set "spark.akka.timeout" "300")
              (s-conf/set conf) ; conf: further settings, not defined on the slide
              (s-conf/set-executor-env {"spark.executor.memory" "4G"
                                        "spark.files.overwrite" "true"}))]
    (s/spark-context c)))

(defonce sc (delay (new-spark-context)))

(defn process* [f]
  (let [lines-rdd (s/text-file @sc "in.log")]
    (f lines-rdd)))

For comparison, the plain Clojure version:

(defn line-count [lines]
  (->> lines
       count))

(defn process [f]
  (with-open [rdr (clojure.java.io/reader "in.log")]
    (let [result (f (line-seq rdr))]
      (if (seq? result)
        (doall result)
        result))))

(process line-count)

Slide 31

Slide 31 text

Only go on when your tests are green!

(deftest test-line-count*
  (let [conf (test-conf)]
    (spark/with-context sc conf
      (testing "no lines return 0"
        (is (= 0 (line-count* (spark/parallelize sc [])))))

      (testing "a single line returns 1"
        (is (= 1 (line-count* (spark/parallelize sc ["this is a single line"])))))

      (testing "multiple lines count correctly"
        (is (= 10 (line-count* (spark/parallelize sc (repeat 10 "this is a single line")))))))))

Slide 32

Slide 32 text

What's an RDD? What's in it? Take e.g. a JdbcRDD (we all know relational databases...):

Slide 33

Slide 33 text

Slide 34

Slide 34 text

Slide 35

Slide 35 text

Slide 36

Slide 36 text

Slide 37

Slide 37 text

What's an RDD? What's in it? Take e.g. a JdbcRDD (we all know relational databases...):

That's your table:

     campaign_id  from        to          active
  1  123          2014-01-01  2014-01-31  true
  2  234          2014-01-06  2014-01-14  true
  3  345          2014-02-01  2014-03-31  false
  4  456          2014-02-10  2014-03-09  true

RDDs are lists of objects:

[{:campaign-id 123 :active true}
 {:campaign-id 234 :active true}
 {:campaign-id 345 :active false}
 {:campaign-id 456 :active true}]

[#t[123 {:campaign-id 123 :active true}]
 #t[234 {:campaign-id 234 :active true}]]
[#t[345 {:campaign-id 345 :active false}]
 #t[456 {:campaign-id 456 :active true}]]
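The step from table rows to keyed tuples can be mimicked with ordinary Clojure collections. This is a sketch without Spark: plain vectors stand in for the #t[...] tuples, the helper key-by is made up for this example, and the chunk size of 2 is an arbitrary assumption that simply reproduces the two groups shown above.

```clojure
;; The table rows as Clojure maps (only the columns shown on the slide).
(def rows
  [{:campaign-id 123 :active true}
   {:campaign-id 234 :active true}
   {:campaign-id 345 :active false}
   {:campaign-id 456 :active true}])

;; Hypothetical helper: pair every element with its key.
;; Plain vectors stand in for Scala Tuple2 values here.
(defn key-by [k coll]
  (map (juxt k identity) coll))

(def keyed (key-by :campaign-id rows))

;; Split the keyed pairs into chunks of two, standing in for partitions.
(def parts (mapv vec (partition-all 2 keyed)))
```

Each element of parts is one chunk of [key value] pairs, which is roughly the shape a keyed RDD has once it is spread over partitions.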

Slide 38

Slide 38 text

Slide 39

Slide 39 text

In Practice 3: status codes

;; common-log-regex and transform-log-entry are defined in the example code,
;; not on the slide.
(defn parse-line [line]
  (some->> line
           (re-matches common-log-regex)
           rest
           (zipmap [:ip :timestamp :request :status :length :referer :ua :duration])
           transform-log-entry))

(defn group-by-status-code [lines]
  (->> lines
       (map parse-line)
       (map (fn [entry] [(:status entry) 1]))
       (reduce (fn [a [k v]] (update-in a [k] #((fnil + 0) % v))) {})
       (map identity)))

Slide 40

Slide 40 text

In Practice 4: status codes cont'd

(defn parse-line [line]
  (some->> line
           (re-matches common-log-regex)
           rest
           (zipmap [:ip :timestamp :request :status :length :referer :ua :duration])
           transform-log-entry))

(defn group-by-status-code [lines]
  (->> lines
       (map parse-line)
       (map (fn [entry] [(:status entry) 1]))
       (reduce (fn [a [k v]] (update-in a [k] #((fnil + 0) % v))) {})
       (map identity)))

The Spark version:

(defn group-by-status-code* [lines]
  (-> lines
      (s/map parse-line)
      (s/map-to-pair (fn [entry] (s/tuple (:status entry) 1)))
      (s/reduce-by-key +)
      (s/map (sd/key-value-fn vector))
      (s/collect)))
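The slides do not show common-log-regex or transform-log-entry, so the plain Clojure variant cannot be run as printed. The following self-contained sketch fills the gap with a deliberately simplified regex (simple-log-regex, an assumption of this example, covering only five fields) and uses frequencies instead of the hand-written reduce. It is not the code from the example repository.

```clojure
;; Simplified stand-in for the regex that is not shown on the slides.
;; Matches lines like: 1.2.3.4 - - [time] "GET / HTTP/1.1" 200 123
(def simple-log-regex
  #"^(\S+) \S+ \S+ \[([^\]]+)\] \"([^\"]*)\" (\d{3}) (\S+)$")

(defn parse-line [line]
  (some->> line
           (re-matches simple-log-regex)
           rest
           (zipmap [:ip :timestamp :request :status :length])))

;; Counts entries per status code; lines that do not parse are dropped.
(defn count-by-status [lines]
  (->> lines
       (keep parse-line)
       (map :status)
       frequencies))

(def sample-lines
  ["127.0.0.1 - - [10/Oct/2014:13:55:36 +0200] \"GET /index.html HTTP/1.1\" 200 2326"
   "127.0.0.1 - - [10/Oct/2014:13:55:37 +0200] \"GET /missing HTTP/1.1\" 404 512"
   "127.0.0.1 - - [10/Oct/2014:13:55:38 +0200] \"GET /about.html HTTP/1.1\" 200 1024"
   "not a log line"])
```

(count-by-status sample-lines) returns a map from status code to count, the same information group-by-status-code produces as a sequence of pairs.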

Slide 41

Slide 41 text

In Practice 5: details
RDD
• Lazy evaluation is explicitly forced
• Transformations vs. actions
• Serialization of Clojure functions

Slide 42

Slide 42 text

In Practice 6: data sources and destinations
• Writing to HDFS
• Reading from HDFS
• HDFS is versatile: text files, S3, Cassandra
• Parallelizing regular Clojure collections

Slide 43

Slide 43 text

In Practice 7: top errors

(defn top-errors [lines]
  (->> lines
       (map parse-line)
       (filter (fn [entry] (not= "200" (:status entry))))
       (map (fn [entry] [(:uri entry) 1]))
       (reduce (fn [a [k v]] (update-in a [k] #((fnil + 0) % v))) {})
       (sort-by val >)
       (take 10)))

Slide 44

Slide 44 text

In Practice 8: top errors cont'd

(defn top-errors* [lines]
  (-> lines
      (s/map parse-line)
      (s/filter (fn [entry] (not= "200" (:status entry))))
      s/cache
      (s/map-to-pair (fn [entry] (s/tuple (:uri entry) 1)))
      (s/reduce-by-key +)
      ;; flip
      (s/map-to-pair (sd/key-value-fn (fn [a b] (s/tuple b a))))
      (s/sort-by-key false) ;; descending order
      ;; flip
      (s/map-to-pair (sd/key-value-fn (fn [a b] (s/tuple b a))))
      (s/map (sd/key-value-fn vector))
      (s/take 10)))

Slide 45

Slide 45 text

In Practice 9: caching
• enables data sharing
• avoids data (de)serialization
• performance degrades gracefully

Slide 46

Slide 46 text

Why Use Clojure to Write Spark Jobs?

Slide 47

Slide 47 text

Spark and Functional Programming
• Spark is inspired by FP
• Not surprising – Scala is a functional programming language
• RDDs are immutable values
• Resilience: caches can be discarded
• DAG of transformations
• Philosophically close to Clojure

Slide 48

Slide 48 text

Slide 49

Slide 49 text

Slide 50

Slide 50 text

Processing RDDs
So your application
• defines (source) RDDs,
• transforms them (which creates new RDDs with dependencies on the source RDDs)
• and runs actions on them to get results back to the driver.
This defines a Directed Acyclic Graph (DAG) of operators. Spark compiles this DAG of operators into a set of stages, where the boundary between two stages is a shuffle phase. Each stage contains tasks, working on one partition each.

Example:

sc.textFile("/some-hdfs-data")               // RDD[String]
  .map(line => line.split("\t"))             // RDD[Array[String]]
  .map(parts => (parts(0), parts(1).toInt))  // RDD[(String, Int)]
  .reduceByKey(_ + _, 3)                     // RDD[(String, Int)]
  .collect()                                 // Array[(String, Int)]

Execution Graph:
• Stage 1: textFile → map → map
• Stage 2: reduceByKey → collect
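How a chain of operators falls apart into stages can be sketched in a few lines of plain Clojure. This is a simplified model, not Spark's scheduler: the pipeline is assumed to be linear, each operator is a hypothetical map with a :shuffle? flag, and a new stage simply starts at every operator that needs a shuffle.

```clojure
;; The example pipeline as data. :shuffle? marks operators that need
;; their input repartitioned (an assumption of this toy model).
(def pipeline
  [{:op :textFile    :shuffle? false}
   {:op :map         :shuffle? false}
   {:op :map         :shuffle? false}
   {:op :reduceByKey :shuffle? true}
   {:op :collect     :shuffle? false}])

;; Walk the operators in order; open a new stage at every shuffle,
;; otherwise append the operator to the current stage.
(defn stages [ops]
  (reduce (fn [acc {:keys [op shuffle?]}]
            (if (or shuffle? (empty? acc))
              (conj acc [op])
              (conj (pop acc) (conj (peek acc) op))))
          []
          ops))
```

(stages pipeline) yields two stages: the narrow operators textFile, map, map are pipelined together, and reduceByKey opens the second stage.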

Slide 51

Slide 51 text

Slide 52

Slide 52 text

Dynamic Types for Data Processing
• Clojure's strength: developer-friendly wrapper for a complex interior
• Static types everywhere
• Imperfect data
• For this use case, static typing can get in the way
• Jobs naturally represented as transformations of Clojure data structures

Slide 53

Slide 53 text

Data Exploration
• Working in real time with big datasets
• Great for data mining
• Clojure's powerful REPL
• Gorilla REPL for live plotting?

Slide 54

Slide 54 text

Summary: Why Spark(ling)
• Data sharing: Hadoop is built for a single map-reduce pass and needs to write intermediate results out to HDFS.
• Interactive data exploration: Spark keeps data in memory, opening the possibility of interactively working with TBs of data.
• Hadoop (and Hive and Pig) lacks an (easy) way to implement unit tests. So writing your own code is error-prone and the development cycle is slooooow.

Slide 55

Slide 55 text

Practical tips

Slide 56

Slide 56 text

Running your Spark code
• Run locally: e.g. inside tests. Use "local" or "local[*]" as SparkMaster.
• Run on cluster: either directly addressing Spark or (our case) run on top of YARN.
Both open a Web interface on http://host:4040/.
• Using the REPL: Open a SparkContext, define RDDs and store them in vars, perform transformations on these. Develop stuff in the REPL, then transfer your REPL stuff into tests.
• Run inside of tests: Open a local SparkContext, feed mock data, run jobs. Therefore: design for testability!
• Submit a Spark job using "spark-submit" with proper arguments (see upload.sh, run.sh).

Slide 57

Slide 57 text

Best Practices / Dos and Don'ts
Shuffling is very expensive, so try to avoid it:
• Never, ever, let go of your Partitioner - this has a huuuuuuge performance impact. Use map-values instead of map, keep the partitioning when re-keying for a join, etc.
• This equals: Keep your execution plan slim. There are some tricks for this, all boiling down to proper design of your data models.
Use broadcasting where necessary.
You need to monitor memory usage, as the inability to store stuff in memory will cause spills to disk (e.g. while shuffling). This will kill you. Tune total memory and/or cache/shuffle ratios.
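Why map-values is preferable to map can be shown with a toy model in plain Clojure. Everything here is hypothetical: an "RDD" is just a map with :partitioner and :data, and map-values*, map* and needs-shuffle? are invented names, not Sparkling functions. The idea being modelled is that a function which may change keys forces the engine to forget how the data was partitioned, so a later join has to shuffle.

```clojure
;; Only values change, so the keys stay where they are:
;; the partitioner can be kept.
(defn map-values* [f rdd]
  (update rdd :data (fn [pairs] (mapv (fn [[k v]] [k (f v)]) pairs))))

;; An arbitrary function might change the keys:
;; the partitioner has to be dropped.
(defn map* [f rdd]
  (-> rdd
      (update :data #(mapv f %))
      (assoc :partitioner nil)))

;; A join can skip the shuffle only if both sides share a known partitioner.
(defn needs-shuffle? [left right]
  (not (and (some? (:partitioner left))
            (= (:partitioner left) (:partitioner right)))))

(def campaigns {:partitioner :hash-by-id
                :data [[1 {:clicks 3}] [2 {:clicks 5}]]})
(def budgets   {:partitioner :hash-by-id
                :data [[1 100] [2 200]]})

(def kept (map-values* :clicks campaigns))
(def lost (map* (fn [[k v]] [k (:clicks v)]) campaigns))
```

kept and lost hold the same data, but only kept can be joined with budgets without a shuffle in this model.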

Slide 58

Slide 58 text

Example

Slide 59

Slide 59 text

Example: Matrix Multiplication
• Repeatedly multiply sparse matrix and vector
Diagram: Links (url, neighbors) and Ranks (url, rank) feed iteration 1, iteration 2, iteration 3, …
Same file read over and over

Slide 60

Slide 60 text

Example: Matrix Multiplication
• Repeatedly multiply sparse matrix and vector
Diagram: Links (url, neighbors) and Ranks (url, rank) feed iteration 1, iteration 2, iteration 3, …
Same file read over and over

Spark can do much better
• Using cache(), keep neighbors in memory
• Do not write intermediate results on disk
Diagram: Links (url, neighbors) and Ranks (url, rank) → join → join → join → …
Grouping same RDD over and over
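The Links/Ranks iteration is the PageRank computation mentioned earlier in the deck. As a rough in-memory sketch in plain Clojure (no Spark, no cluster): links is read once and reused in every iteration, which is the role cache() plays for the neighbors RDD. The tiny graph and the 0.15/0.85 damping constants are assumptions of this example, not taken from the slides.

```clojure
;; url -> neighbors; bound once and reused in every iteration.
(def links {"a" ["b" "c"]
            "b" ["c"]
            "c" ["a"]})

;; Every page splits its current rank evenly among its neighbors.
(defn contributions [links ranks]
  (for [[url neighbors] links
        neighbor neighbors]
    [neighbor (/ (get ranks url) (count neighbors))]))

;; One iteration: sum the contributions per page, then apply damping.
(defn step [links ranks]
  (let [sums (reduce (fn [acc [url c]] (update acc url (fnil + 0.0) c))
                     {}
                     (contributions links ranks))]
    (into {} (for [url (keys links)]
               [url (+ 0.15 (* 0.85 (get sums url 0.0)))]))))

;; Start every page at rank 1.0 and apply n iterations.
(defn page-rank [links n]
  (let [initial (zipmap (keys links) (repeat 1.0))]
    (nth (iterate (partial step links) initial) n)))
```

(page-rank links 10) returns a map from url to rank. Only the small ranks map changes between iterations, while the link structure stays put.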

Slide 61

Slide 61 text

Slide 62

Slide 62 text

Some anecdotes

Slide 63

Slide 63 text

Why did I start gorillalabs/sparkling?

First, there was clj-spark from The Climate Corporation. Very basic, not maintained anymore.

Then, I found out about flambo from yieldbot. Looked promising at first, fresh release, maybe used in production at yieldbot. Small jobs were developed fast with Spark.

I ran into sooooo many problems (running on Spark Standalone, moving to YARN, fighting with low memory). Nothing to do with flambo, but with understanding the nuts and bolts of Spark, YARN and other elements of my infrastructure. Ok, some with serializing my Clojure data structures.

Scaling up the amount of data led me directly into hell. My system was way slower than our existing solution. Was Spark the wrong way? I was completely like this guy: http://blog.explainmydata.com/2014/05/spark-should-be-better-than-mapreduce.html: "Spark should be better than MapReduce (if only it worked)"

After some thinking, I found out what happened: flambo promised to keep me in Clojure-land. Therefore, it uses a map operation to convert Scala Tuple2 to Clojure vector and back again where necessary. But map loses your Partitioner information. Remember my point? So, flambo broke Einstein's "as simple as possible but no simpler".

I fixed the library and incorporated a different take on serializing functions (without reflection). That's where I released gorillalabs/sparkling.

I needed to tweak the data model to have the same partitioner all over the place, or use hand-crafted data structures and broadcasts for those not fitting my model. I now ended up with code generating an index structure from an RDD, sorted-tree-sets for date-ranged data, and so forth. And everything is fully unit-tested, cause that's the only way to go.

Now, my system outperforms a much bigger MySQL-based system on a local master and scales almost linearly wrt cores on a cluster. HURRAY!

Slide 64

Slide 64 text

Having nREPL / GorillaREPL is so nice!
Having an nREPL open on my cluster is so nice, since I can inspect stuff in my computation. Ever wondered what that intermediate RDD contains? Just (spark/take rdd 10) it.
Using GorillaREPL, it's like a visual workbench for big data analysis. See for yourself: http://bit.ly/1C7sSK4

Slide 65

Slide 65 text

References

Slide 66

Slide 66 text

Online
• Sparkling: https://github.com/gorillalabs/sparkling
• Flambo: https://github.com/yieldbot/flambo
• flambo-example: https://github.com/pesterhazy/flambo-example

Slide 67

Slide 67 text

References
• http://lintool.github.io/SparkTutorial/ (where you can find the slides used in this presentation)
• https://speakerdeck.com/ecepoi/apache-spark-at-viadeo
• https://speakerdeck.com/ecepoi/viadeos-segmentation-platform-with-spark-on-mesos
• https://speakerdeck.com/rxin/advanced-spark-at-spark-summit-2014

Slide 68

Slide 68 text

Sources
• Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113.
• Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., ... & Stoica, I. (2012, April). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (pp. 2-2). USENIX Association.
(Both available as PDFs)

Slide 69

Slide 69 text

Questions?