
From Big Data to Fast Data

An introduction to Apache Spark.

Stefano Baghino

June 25, 2015

Transcript

  1. Really, what is it? ◎ Data that cannot be stored on a single box ◎ Requires horizontal scalability ◎ Requires a shift from traditional solutions
  2. Let's look at MapReduce ◎ Disk I/O all the time: each step reads its input from and writes its output to disk ◎ Limited model: it's difficult to fit all algorithms into the MapReduce model
  3. OK, so what is so good about Spark? It may sit on top of an existing Hadoop deployment. It builds heavily on simple functional programming ideas. It computes and caches data in-memory to deliver blazing performance.
  4. Fast? Really? Yes!
     Hadoop (102.5 TB): elapsed time 72 min, 50,400 cores, 0.67 GB/min per node
     Spark (100 TB): elapsed time 23 min, 6,592 cores, 20.7 GB/min per node
     Spark (1 PB): elapsed time 234 min, 6,080 cores, 22.5 GB/min per node
     Source: https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
  5. Working with Spark ◎ Resilient Distributed Dataset ◎ Closely resembles a Scala collection ◎ Very natural to use for Scala devs. From the user's point of view, the RDD is effectively a collection, hiding all the details of its distribution throughout the cluster.
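
     To get a feel for that collection-like API, here is a minimal sketch (assuming a SparkContext named sc, as in the spark-shell; the sample names are purely illustrative):

         // Distribute a small local collection as an RDD
         val names = sc.parallelize(Seq("Alice", "Bob", "Andy"))
         // The same higher-order functions you would use on a Scala collection
         val longNames = names.map(_.toUpperCase).filter(_.length > 3)
         longNames.collect()  // materializes the result on the driver: Array("ALICE", "ANDY")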
  6. What can I do with an RDD? Transformations produce a new RDD, extending the execution graph at each step, e.g. ◎ map ◎ flatMap ◎ filter. Actions are “terminal” operations that actually trigger the execution to extract a value, e.g. ◎ collect ◎ reduce.
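
     A small illustration of that split, again assuming the sc from the spark-shell; transformations are lazy, so nothing runs until the action is called:

         val names = sc.parallelize(Seq("Alice", "Bob", "Andy"))
         // Transformations only describe the computation and return a new RDD
         val initials = names.map(n => n.charAt(0))
         // The action triggers the distributed execution and brings back a value
         initials.collect()  // Array('A', 'B', 'A')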
  7. The execution model 1. Create a DAG of RDDs to represent the computation 2. Create a logical execution plan for the DAG 3. Schedule and execute individual tasks
  8. The execution model in action Let's count distinct names grouped by their initial:

     sc.textFile("hdfs://...")
       .map(n => (n.charAt(0), n))
       .groupByKey()
       .mapValues(n => n.toSet.size)
       .collect()
  9. Step 1: Create the logical DAG: sc.textFile produces a HadoopRDD, map(n => (n.charAt(0), ...) a MappedRDD, groupByKey() a ShuffledRDD, mapValues(n => n.toSet...) a MappedValuesRDD, and collect() finally returns an Array[(Char, Int)].
  10. Step 2: Create the execution plan ◎ Pipeline as much as possible ◎ Split into “stages” based on the need to “shuffle” data. Stage 1 pipelines the HadoopRDD and MappedRDD (e.g. Alice, Bob, Andy become (A, Alice), (B, Bob), (A, Andy)); after the shuffle, Stage 2 runs the ShuffledRDD and MappedValuesRDD ((A, (Alice, Andy)) and (B, Bob) become (A, 2) and (B, 1)), so collect() returns res0 = [(A, 2), (B, 1)].
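
     To see those stage boundaries for yourself, one option is RDD.toDebugString; a hedged sketch (the HDFS path is a placeholder, as on the slide):

         val counts = sc.textFile("hdfs://...")
           .map(n => (n.charAt(0), n))
           .groupByKey()
           .mapValues(n => n.toSet.size)
         // Prints the lineage without running the job; the indentation
         // in the output shows where the shuffle splits the two stages
         println(counts.toDebugString)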
  11. So, how is it a Resilient Distributed Dataset? Being a lazy, immutable representation of computation, rather than an actual collection of data, RDDs achieve resiliency by simply being re-executed when their results are lost*. * Because distributed systems and Murphy's Law are best buddies.
  12. The ecosystem ◎ Spark SQL (structured data) ◎ Spark Streaming (real-time) ◎ MLLib (machine learning) ◎ GraphX (graph processing) ◎ Spark R (statistical analysis), all built on Spark Core, which runs on the Standalone Scheduler, YARN, or Mesos.
  13. What we'll see today: Spark Streaming, the real-time component of the ecosystem just shown.
  14. Spark Streaming splits the live stream into “mini-batches”. These “mini-batches” are DStreams, or discretized streams, and they are basically collections of RDDs. DStreams can be created from streaming sources or by applying transformations to an existing DStream.
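
     A minimal DStream sketch in the spirit of the talk (Spark 1.x APIs; the application name, socket source, host and port are made-up examples):

         import org.apache.spark.SparkConf
         import org.apache.spark.streaming.{Seconds, StreamingContext}

         val conf = new SparkConf().setAppName("InitialsOverTheWire")
         // Group the incoming data into 5-second mini-batches
         val ssc = new StreamingContext(conf, Seconds(5))
         // A DStream created from a streaming source (here, a plain text socket)
         val names = ssc.socketTextStream("localhost", 9999)
         // Transformations on the DStream are applied to each underlying RDD
         val countsByInitial = names.map(n => (n.charAt(0), 1)).reduceByKey(_ + _)
         countsByInitial.print()
         ssc.start()
         ssc.awaitTermination()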
  15. Example: Twitter streaming, “sentiment analysis” for dummies. Sure, it's on GitHub! https://github.com/stefanobaghino/spark-twitter-stream-example
  16. A lot more to be said! ◎ Caching ◎ Shared variables ◎ Partitioning optimization ◎ DataFrames ◎ A huge API ◎ A huge ecosystem
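
     Two of those topics in a nutshell, as a hedged sketch (cache() and broadcast are the relevant APIs; the path and the lookup table are made up for illustration):

         // Caching: keep an RDD in memory across multiple actions
         val names = sc.textFile("hdfs://...").cache()
         names.count()    // the first action computes the RDD and caches it
         names.collect()  // later actions reuse the cached partitions

         // Shared variables: broadcast a read-only value to every executor once
         val nicknames = sc.broadcast(Map("Alice" -> "Ali", "Andrew" -> "Andy"))
         val informal = names.map(n => nicknames.value.getOrElse(n, n))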
  17. The ecosystem: next time! Spark SQL, MLLib, GraphX, and Spark R, on the same Spark Core stack shown earlier.