From Big Data to Fast Data (Codemotion)

An introduction to Apache Spark

Stefano Baghino

November 20, 2015

Transcript

  1. From Big Data to Fast Data: An introduction to Apache Spark. Stefano Baghino, Codemotion Milan 2015
  2. From Big Data to Fast Data with Functional Reactive Containerized Microservices and AI-driven Monads in a galaxy far far away…
  3. Hello! I am Stefano Baghino, Software Engineer @ DATABIZ. [email protected], @stefanobaghino. Favorite PL: Scala. My hero: XKCD's Beret Guy. What I fear: [object Object]
  4. Agenda u Big Data? u Fast Data? u What do we have now?

    u How can we do better? u What is Spark? u What does it do? u How does it work? And also code, somewhere here and there.
  5. Really, what is it?
    ◎ Data that cannot be stored on a single box
    ◎ Requires horizontal scalability
    ◎ Requires a shift from traditional solutions
  6. Let's look at MapReduce
    Disk I/O all the time: each step reads its input from and writes its output to disk.
    Limited model: it's difficult to fit all algorithms into the MapReduce model.
  7. Ok, so what is so good about Spark?
    May sit on top of an existing Hadoop deployment.
    Builds heavily on simple functional programming ideas.
    Computes and caches data in-memory to deliver blazing performance.
  8. Fast? Really? Yes!
                    Hadoop (102.5 TB)   Spark (100 TB)   Spark (1 PB)
    Elapsed time    72 min              23 min           234 min
    # Cores         50,400              6,592            6,080
    Rate/node       0.67 GB/min         20.7 GB/min      22.5 GB/min
    Source: https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
  9. Working with Spark
    ◎ Resilient Distributed Dataset
    ◎ Closely resembles a Scala collection
    ◎ Very natural to use for Scala devs
    From the user's point of view, the RDD is effectively a collection, hiding all the details of its distribution throughout the cluster.
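    (Not from the deck, just to make the point concrete: a minimal, self-contained Scala sketch of how collection-like the RDD API reads. The app name, local master URL and sample data are illustrative assumptions.)

        import org.apache.spark.{SparkConf, SparkContext}

        object RddLikeACollection {
          def main(args: Array[String]): Unit = {
            // A local SparkContext, just for experimenting on a laptop
            val sc = new SparkContext(new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))

            // An RDD built from an in-memory collection; the code reads like plain Scala
            val names = sc.parallelize(Seq("Alice", "Bob", "Andy"))
            val longNames = names.map(_.trim).filter(_.length > 3).collect()

            println(longNames.mkString(", "))
            sc.stop()
          }
        }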
  10. What can I do with an RDD?
    Transformations: produce a new RDD, extending the execution graph at each step
      e.g. map, flatMap, filter
    Actions: "terminal" operations that actually trigger the execution and extract a value
      e.g. collect, reduce
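    (A quick illustration of the split, assuming an existing SparkContext named sc as in the sketch above: transformations only describe the computation, and the action at the end is what actually runs it.)

        val numbers = sc.parallelize(1 to 10)

        // Transformations: lazily build a new RDD, no job is launched yet
        val doubledEvens = numbers.filter(_ % 2 == 0).map(_ * 2)

        // Actions: trigger the execution and extract a value
        val total   = doubledEvens.reduce(_ + _)   // 60
        val asArray = doubledEvens.collect()       // Array(4, 8, 12, 16, 20)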
  11. The execution model
    1. Create a DAG of RDDs to represent the computation
    2. Create a logical execution plan for the DAG
    3. Schedule and execute individual tasks
  12. The execution model in action
    Let's count distinct names grouped by their initial:

        sc.textFile("hdfs://...")
          .map(n => (n.charAt(0), n))
          .groupByKey()
          .mapValues(n => n.toSet.size)
          .collect()
  13. Step 1: Create the logical DAG
        sc.textFile("hdfs://...")       -> HadoopRDD
          .map(n => (n.charAt(0), n))   -> MappedRDD
          .groupByKey()                 -> ShuffledRDD
          .mapValues(n => n.toSet.size) -> MappedValuesRDD
          .collect()                    -> Array[(Char, Int)]
  14. Step 2: Create the execution plan
    ◎ Pipeline as much as possible
    ◎ Split into "stages" based on the need to "shuffle" data
    [Diagram: Stage 1 pipelines HadoopRDD -> MappedRDD (Alice, Bob, Andy become (A, Alice), (B, Bob), (A, Andy)); the shuffle into ShuffledRDD starts Stage 2, where MappedValuesRDD yields (A, 2) and (B, 1) and collect() returns Res0 = [(A, 2), (B, 1)]]
  15. So, how is it a Resilient Distributed Dataset?
    Being a lazy, immutable representation of computation, rather than an actual collection of data, RDDs achieve resiliency by simply being re-executed when their results are lost*.
    * because distributed systems and Murphy's Law are best buddies.
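    (A small sketch of the "RDD as a recipe" idea, with sample data as an assumption: toDebugString prints the lineage Spark would replay to rebuild lost partitions, and nothing is computed until an action is called.)

        val counts = sc.parallelize(Seq("Alice", "Bob", "Andy"))
          .map(n => (n.charAt(0), n))
          .groupByKey()
          .mapValues(names => names.toSet.size)

        // No data has been materialized yet; this is the recipe Spark keeps around
        println(counts.toDebugString)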
  16. The ecosystem
    Spark SQL (structured data), Spark Streaming (real-time), MLlib (machine learning), GraphX (graph processing), SparkR (statistical analysis)
    All on top of Spark Core, running on the Standalone Scheduler, YARN, or Mesos.
  17. What we'll see today: Spark Streaming
    [The same ecosystem diagram, with Spark Streaming (real-time) highlighted]
  18. "Mini-batches" are DStreams
    These "mini-batches" are DStreams, or discretized streams, and they are basically a collection of RDDs. DStreams can be created from streaming sources or by applying transformations to an existing DStream.
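    (A minimal word-count sketch of the DStream API; the socket source, port, 5-second batch interval and local master are illustrative assumptions, not taken from the deck.)

        import org.apache.spark.SparkConf
        import org.apache.spark.streaming.{Seconds, StreamingContext}

        object DStreamSketch {
          def main(args: Array[String]): Unit = {
            val conf = new SparkConf().setAppName("dstream-demo").setMaster("local[2]")
            // Every 5-second batch becomes one RDD inside the DStream
            val ssc = new StreamingContext(conf, Seconds(5))

            // A DStream created from a streaming source (a plain TCP socket here)
            val lines = ssc.socketTextStream("localhost", 9999)

            // Transformations on a DStream produce new DStreams, one RDD per batch
            val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
            counts.print()

            ssc.start()
            ssc.awaitTermination()
          }
        }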
  19. Example: Twitter streaming, "sentiment analysis" for dummies
    Sure, it's on GitHub! https://github.com/stefanobaghino/spark-twitter-stream-example
  20. A lot more to be said!
    ◎ Caching
    ◎ Shared variables
    ◎ Partitioning optimization
    ◎ DataFrames
    ◎ A huge API
    ◎ A huge ecosystem
  21. Tomorrow at Codemotion!
    [The ecosystem diagram again: Spark SQL, Spark Streaming, MLlib, GraphX, SparkR on Spark Core]