Slide 1

From Big Data to Fast Data
An introduction to Apache Spark
Stefano Baghino
Codemotion Milan 2015

Slide 2

From Big Data to Fast Data with Functional Reactive Containerized Microservices and AI-driven Monads in a galaxy far, far away…

Slide 3

Hello! I am Stefano Baghino
Software Engineer @ DATABIZ
[email protected]
@stefanobaghino
Favorite PL: Scala
My hero: XKCD’s Beret Guy
What I fear: [object Object]

Slide 4

Agenda u Big Data? u Fast Data? u What do we have now? u How can we do better? u What is Spark? u What does it do? u How does it work? And also code, somewhere here and there.

Slide 5

1. What is Big Data? More than a buzzword, I guess

Slide 6

Slide 7

Really, what is it? u Data that cannot be stored on a single box u Requires horizontal scalability u Requires a shift from traditional solutions

Slide 8

2. What is Fast Data? More than yet another buzzword

Slide 9

Basically: streaming
The need to process huge quantities of incoming data in real time

Slide 10

Let’s look at MapReduce
- Disk I/O all the time: each step reads its input from and writes its output to disk
- Limited model: it’s difficult to fit all algorithms into the MapReduce model

Slide 11

OK, so what is so good about Spark?
- It may sit on top of an existing Hadoop deployment.
- It builds heavily on simple functional programming ideas.
- It computes and caches data in-memory to deliver blazing performance.

Slide 12

Fast? Really? Yes!

              Hadoop 102.5 TB   Spark 100 TB   Spark 1 PB
Elapsed Time  72 min            23 min         234 min
# Cores       50400             6592           6080
Rate/Node     0.67 GB/min       20.7 GB/min    22.5 GB/min

Source: https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html

Slide 13

So, where can I use it? Java, Scala, Python

Slide 14

Momentum
- 700+ contributors
- 50+ companies

Slide 15

3. What is Spark? Let’s get to the point

Slide 16

The architecture

Slide 17

Deploy on the cluster manager of your choice:
- Local (127.0.0.1)
- Standalone
- Hadoop
- Mesos
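As a sketch of how this choice can show up in code, here is a hypothetical way to set the master programmatically via SparkConf (in practice it is usually passed to spark-submit instead); the host names, ports, and app name are placeholders, and the YARN syntax shown is the Spark 1.x one:

import org.apache.spark.{SparkConf, SparkContext}

// Pick one master URL depending on where you deploy (hosts and ports are placeholders):
val master = "local[*]"                      // local: use all cores on this machine
// val master = "spark://master-host:7077"   // Spark standalone cluster
// val master = "yarn-client"                // Hadoop YARN (Spark 1.x syntax)
// val master = "mesos://master-host:5050"   // Apache Mesos

val conf = new SparkConf().setAppName("DeployExample").setMaster(master)
val sc   = new SparkContext(conf)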

Slide 18

Working with Spark
- Resilient Distributed Dataset (RDD)
- Closely resembles a Scala collection
- Very natural to use for Scala devs
From the user’s point of view, the RDD is effectively a collection, hiding all the details of its distribution throughout the cluster.
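As a quick sketch of how closely the API mirrors a plain Scala collection (assuming an already-created SparkContext named sc):

// The same operations, first on a plain Scala collection...
val local  = List("spark", "scala", "hadoop")
val upperL = local.map(_.toUpperCase).filter(_.startsWith("S"))

// ...and then on an RDD: the same look and feel, but distributed across the cluster.
val rdd    = sc.parallelize(local)
val upperR = rdd.map(_.toUpperCase).filter(_.startsWith("S"))

println(upperL)                    // List(SPARK, SCALA)
println(upperR.collect().toList)   // List(SPARK, SCALA): collect() brings the results to the driver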

Slide 19

Example: Word Count
Let’s get our hands a little bit dirty
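The word count code itself is not on the slide, so here is a minimal sketch of how it typically looks with the core API, assuming an existing SparkContext named sc and using the same placeholder HDFS path as the later slides:

// Word count: count how many times each word appears in a text file.
val counts = sc.textFile("hdfs://...")        // RDD[String], one element per line
  .flatMap(line => line.split("\\s+"))        // split every line into words
  .map(word => (word, 1))                     // pair each word with an initial count of 1
  .reduceByKey(_ + _)                         // sum the counts for each distinct word

counts.collect().foreach(println)             // action: run the job and print (word, count) pairs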

Slide 20

The anatomy of a Resilient Distributed Dataset

Slide 21

What about resilience? Let’s learn what RDDs really are and how Spark works in order to get it

Slide 22

What is an RDD, really?
(Diagram: a graph of operations, with two create nodes feeding filter, filter, and join steps that end in a collect)

Slide 23

What can I do with an RDD?
- Transformations: produce a new RDD, extending the execution graph at each step (e.g. map, flatMap, filter)
- Actions: “terminal” operations, actually calling for the execution to extract a value (e.g. collect, reduce)
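A small sketch of the difference, assuming an existing SparkContext named sc: the transformations below only build up the graph, and nothing runs until an action is called.

// Transformations are lazy: no work happens here, Spark only records the lineage.
val numbers = sc.parallelize(1 to 1000000)   // create an RDD from a local collection
val evens   = numbers.filter(_ % 2 == 0)     // transformation: a new RDD, nothing computed yet
val doubled = evens.map(_ * 2)               // transformation: still nothing computed

// Actions trigger the execution of the whole graph and return a value to the driver.
val total = doubled.reduce(_ + _)            // action: computes the sum
val first = doubled.take(5)                  // action: returns the first five elements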

Slide 24

The execution model
1. Create a DAG of RDDs to represent the computation
2. Create a logical execution plan for the DAG
3. Schedule and execute individual tasks

Slide 25

The execution model in action
Let’s count distinct names grouped by their initial:

sc.textFile("hdfs://...")
  .map(n => (n.charAt(0), n))
  .groupByKey()
  .mapValues(n => n.toSet.size)
  .collect()

Slide 26

Step 1: Create the logical DAG
(Diagram: sc.textFile creates a HadoopRDD, map a MappedRDD, groupByKey a ShuffledRDD, mapValues a MappedValuesRDD, and collect finally returns an Array[(Char, Int)])

Slide 27

Step 2: Create the execution plan
- Pipeline as much as possible
- Split into “stages” based on the need to “shuffle” data
(Diagram: Stage 1 pipelines the read and map steps, turning “Alice”, “Bob”, “Andy” into (A, Alice), (B, Bob), (A, Andy); the shuffle required by groupByKey starts Stage 2, which produces (A, 2) and (B, 1) and collects them as res0 = [(A, 2), (B, 1)])
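If you want to see where Spark draws these stage boundaries yourself, one way (a sketch, assuming the same pipeline and an existing SparkContext named sc) is to print the RDD’s lineage; in the output of toDebugString, a change in indentation marks a shuffle boundary:

val grouped = sc.textFile("hdfs://...")      // same placeholder path as the slide
  .map(n => (n.charAt(0), n))
  .groupByKey()
  .mapValues(n => n.toSet.size)

// Print the lineage; indentation changes mark the shuffle (stage) boundaries.
println(grouped.toDebugString)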

Slide 28

So, how is it a Resilient Distributed Dataset?
Being lazy, immutable representations of a computation rather than actual collections of data, RDDs achieve resiliency by simply being re-executed when their results are lost*.
* Because distributed systems and Murphy’s Law are best buddies.
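A small sketch of what this means in practice, assuming an existing SparkContext named sc and a placeholder input path: the RDD is just a recipe, so lost partitions can always be recomputed from the lineage, and cache() is merely a hint to keep the computed data around.

val words = sc.textFile("hdfs://...")        // placeholder path
  .flatMap(_.split("\\s+"))

// Nothing has run yet: `words` is only a description of the computation.
val cached = words.cache()                   // hint: keep the computed partitions in memory

println(cached.count())   // first action: computes the RDD and fills the cache
println(cached.count())   // second action: served from the cache; if an executor is lost,
                          // the missing partitions are simply recomputed from the lineage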

Slide 29

The ecosystem
- Spark SQL: structured data
- Spark Streaming: real-time
- MLlib: machine learning
- GraphX: graph processing
- SparkR: statistical analysis
All built on Spark Core, which runs on the standalone scheduler, YARN, or Mesos.

Slide 30

What we’ll see today: Spark Streaming
(Same ecosystem diagram as before, with Spark Streaming highlighted)

Slide 31

Let’s get to Spark Streaming It’s Fast Data time!

Slide 32

Surprise! You already know everything you need

Slide 33

Spark Streaming
(Diagram: a live data stream enters Spark Streaming, which slices it into “mini-batches” that Spark processes to produce the result)

Slide 34

“Mini-batches” are DStreams
These “mini-batches” are DStreams, or discretized streams, and they are basically collections of RDDs. DStreams can be created from streaming sources or by applying transformations to an existing DStream.
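As a minimal sketch of the idea (the socket source, host, port, batch interval, and app name below are illustrative choices, written against the Spark 1.x streaming API): a StreamingContext slices the live stream into mini-batches, and the familiar RDD-style transformations are applied to every batch.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingSketch")
val ssc  = new StreamingContext(conf, Seconds(5))        // 5-second mini-batches

// A DStream created from a streaming source (a TCP socket here; host and port are placeholders).
val lines = ssc.socketTextStream("localhost", 9999)

// The same collection-like transformations, applied to every mini-batch.
val counts = lines
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.print()             // output operation: print a sample of each batch's result

ssc.start()                // start receiving and processing data
ssc.awaitTermination()     // keep the application running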

Slide 35

Example: Twitter streaming
“Sentiment analysis” for dummies
Sure, it’s on GitHub! https://github.com/stefanobaghino/spark-twitter-stream-example
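For flavor, here is a heavily simplified sketch of how such a stream is usually opened with the external spark-streaming-twitter module; the OAuth values, the positive-word list, the batch interval, and the app name are placeholders, and the real example lives in the linked repository.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

// Twitter credentials are read by twitter4j from system properties (placeholders here).
System.setProperty("twitter4j.oauth.consumerKey", "...")
System.setProperty("twitter4j.oauth.consumerSecret", "...")
System.setProperty("twitter4j.oauth.accessToken", "...")
System.setProperty("twitter4j.oauth.accessTokenSecret", "...")

val ssc    = new StreamingContext(new SparkConf().setAppName("TwitterSketch"), Seconds(10))
val tweets = TwitterUtils.createStream(ssc, None)         // DStream of twitter4j Status objects

// Naive "sentiment": count tweets containing a hand-picked positive word.
val positive = tweets.map(_.getText.toLowerCase)
  .filter(text => Seq("love", "great", "happy").exists(word => text.contains(word)))

positive.count().print()   // print how many "positive" tweets arrived in each batch

ssc.start()
ssc.awaitTermination()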

Slide 36

A lot more to be said!
- Caching
- Shared variables (see the sketch below)
- Partitioning optimization
- DataFrames
- A huge API
- A huge ecosystem
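As a quick taste of one of these, here is a sketch of shared variables in the Spark 1.x API (assuming an existing SparkContext named sc and a placeholder input path): a broadcast variable ships a read-only value to every executor once, and an accumulator collects counts back on the driver.

val stopWords  = sc.broadcast(Set("the", "a", "of"))   // read-only copy shipped to every executor
val emptyLines = sc.accumulator(0)                      // counter the executors can only add to

val words = sc.textFile("hdfs://...")                   // placeholder path
  .flatMap { line =>
    if (line.trim.isEmpty) emptyLines += 1              // update the accumulator as a side effect
    line.split("\\s+")
  }
  .filter(word => !stopWords.value.contains(word))      // read the broadcast value on the executors

words.count()                                           // action: triggers the computation
println(s"Empty lines seen: ${emptyLines.value}")       // read the accumulator back on the driver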

Slide 37

Tomorrow at Codemotion!
(Same ecosystem diagram: Spark SQL, Spark Streaming, MLlib, GraphX, and SparkR on top of Spark Core, running on the standalone scheduler, YARN, or Mesos)

Slide 38

Thanks! Any questions?
You can find me at:
@stefanobaghino
[email protected]