An Introduction to Apache Spark

Slide 1

Slide 1 text

An Introduction to Apache Spark
Ankit Bahuguna
22.10.2014
Teradata, Munich

Slide 2

Slide 2 text

Agenda
• Apache Spark
• Features
• Hadoop + Spark
• Spark Ecosystem
• Where it fits? (Lambda Architecture)
• Stream Processing
• Interesting Examples – Spark Streaming
• Spark Streaming vs. Apache Storm

Slide 3

Slide 3 text

Apache Spark
• Apache Spark™ is a fast and general engine for large-scale data processing.
• A powerful open source processing engine for Hadoop data built around speed, ease of use, and sophisticated analytics.
• It was originally developed in 2009 in UC Berkeley’s AMPLab, and open sourced in 2010.
• Over 200 contributors from 50+ organizations.

Slide 4

Slide 4 text

Features
• Speed
  – Enables applications in Hadoop clusters to run up to 100x faster than Hadoop MapReduce in memory, and 10x faster on disk.
• Ease of Use
  – Lets you quickly write applications in Java, Scala, or Python.
  – Provides a built-in set of over 80 high-level operators.
  – Lets you query data interactively from the shell.

Slide 5

Slide 5 text

Features
• Sophisticated Analytics
  – In addition to simple “map” and “reduce” operations, Spark supports:
    • SQL queries,
    • Streaming data, and
    • Complex analytics such as machine learning and graph algorithms out-of-the-box.
  – Better yet, users can combine all these capabilities seamlessly in a single workflow.

Slide 6

Slide 6 text

Spark + Hadoop
• Hadoop
  – Scales out computation and storage across cheap commodity servers and allows other applications to run on top of both. Spark is one of these applications.
  – Spark runs on top of existing Hadoop clusters to provide enhanced and additional functionality.

Slide 7

Slide 7 text

Spark + Hadoop
• Unlocking Hadoop data with Spark
  – Although Hadoop is effective for storing vast amounts of data cheaply, the computations it enables with MapReduce (MR) are highly limited.
  – MR is only able to execute simple computations and uses a high-latency batch model.
  – Spark provides a more general and powerful alternative to Hadoop’s MR, offering rich functionality such as stream processing, machine learning, and graph computations.
  – Built on Hadoop Storage: Spark is 100% compatible with HDFS, HBase, and any Hadoop storage system, so existing data is immediately usable in Spark.

Slide 8

Slide 8 text

Spark Deployment

Slide 9

Slide 9 text

Spark Stack

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

Spark Core Engine
• Spark Core is the underlying general execution engine for the Spark platform, on top of which all other functionality is built.
• It provides in-memory computing capabilities to deliver speed, a generalized execution model to support a wide variety of applications, and Java, Scala, and Python APIs for ease of development.

Slide 12

Slide 12 text

Structured Data: Spark SQL
• Spark SQL is an engine for Hive data that enables unmodified Hadoop Hive queries to run up to 100x faster on existing deployments and data.
• It also provides powerful integration with the rest of the Spark ecosystem (e.g., integrating SQL query processing with machine learning).
• Spark SQL queries can be run from an interactive shell!

Slide 13

Slide 13 text

Streaming Analytics: Spark Streaming
• Many applications need the ability to process and analyze not only batch data, but also streams of new data in real-time.
• Running on top of Spark, Spark Streaming enables powerful interactive and analytical applications across both streaming and historical data, while inheriting Spark’s ease of use and fault tolerance characteristics.
• Readily integrates with a wide variety of popular data sources, including HDFS, Flume, Kafka, and Twitter.

Slide 14

Slide 14 text

Machine Learning: MLlib
• Machine learning has quickly emerged as a critical piece in mining Big Data for actionable insights.
• Built on top of Spark, MLlib is a scalable machine learning library that delivers both high-quality algorithms (e.g., multiple iterations to increase accuracy) and blazing speed (up to 100x faster than MapReduce).
• The library is usable in Java, Scala, and Python.

Slide 15

Slide 15 text

Other Projects
• BlinkDB: An approximate query engine for interactive SQL queries that allows users to trade off query accuracy for response time. This enables interactive queries over massive data by using data samples and presenting results annotated with meaningful error bars.
• GraphX: A graph computation engine built on top of Spark that enables users to interactively build, transform and reason about graph-structured data at scale.
• SparkR: A package for the R statistical language that enables R users to leverage Spark functionality interactively from within the R shell.
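The accuracy-for-speed trade-off described for BlinkDB can be illustrated with a minimal plain-Python sketch. It is not BlinkDB's implementation or API: the function `approximate_mean`, the synthetic `population`, and the 95% normal-approximation error bar are all assumptions made for illustration.

```python
import math
import random
import statistics

def approximate_mean(values, sample_size, seed=0):
    """Estimate the mean from a random sample and return (estimate, margin).

    The margin is a 95% error bar under a normal approximation
    (an assumption of this sketch, not something the slides specify).
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / math.sqrt(sample_size)
    return mean, 1.96 * stderr

# Synthetic data standing in for a large table column (hypothetical).
population = [float(i % 100) for i in range(100_000)]

# Read only 1% of the rows instead of scanning everything.
estimate, margin = approximate_mean(population, sample_size=1_000)
exact = statistics.fmean(population)

print(f"approximate: {estimate:.2f} +/- {margin:.2f}")
print(f"exact:       {exact:.2f}")
```

A larger `sample_size` narrows the error bar at the cost of reading more rows, which is the trade-off the slide refers to.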

Slide 16

Slide 16 text

Where it fits?

Slide 17

Slide 17 text

Spark within Lambda Architecture

Slide 18

Slide 18 text

Real Time, Big Data Stream Processing
Spark Streaming and Apache Storm

Slide 19

Slide 19 text

Stateful Stream Processing
• Traditional streaming systems have an event-driven, record-at-a-time processing model.
  – Each node has a mutable state.
  – For each record, the node updates its state and sends new records.
• State is lost when a node dies!
• Making stateful stream processing fault-tolerant is challenging!

Slide 20

Slide 20 text

Existing Streaming Systems
• Apache Storm
  – Replays records if they are not processed by the nodes.
  – Processes each record at least once.
  – May update mutable state twice.
  – Mutable state can be lost due to failure.
• Trident
  – Uses transactions to update state!
  – Processes each record exactly once.
  – Per-state transactional updates are slow.
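The difference between the two guarantees can be sketched in plain Python. This is a toy model and not Storm or Trident code: the function names and the idea of tagging each record with a transaction id `txid` are assumptions made for illustration.

```python
from collections import Counter

def count_at_least_once(deliveries):
    """Update mutable state on every delivery, including replays."""
    counts = Counter()
    for _txid, word in deliveries:
        counts[word] += 1
    return counts

def count_exactly_once(deliveries):
    """Apply each transaction id only once, skipping replayed deliveries."""
    counts = Counter()
    applied = set()  # transaction ids whose update is already in the state
    for txid, word in deliveries:
        if txid in applied:
            continue
        applied.add(txid)
        counts[word] += 1
    return counts

# Record 2 is delivered twice, as if it had been replayed after a failure.
deliveries = [(1, "spark"), (2, "storm"), (2, "storm"), (3, "spark")]

at_least_once = count_at_least_once(deliveries)
exactly_once = count_exactly_once(deliveries)

print(dict(at_least_once))  # the replayed record is counted twice
print(dict(exactly_once))   # the replayed record is counted once
```

Tracking applied transaction ids is the extra bookkeeping that makes exactly-once updates more expensive than at-least-once updates.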

Slide 21

Slide 21 text

Spark Streaming
• Enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
• Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis or plain old TCP sockets.
• Data can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window.
• Finally, processed data can be pushed out to file systems, databases, and live dashboards. One can also apply Spark’s machine learning and graph processing algorithms to data streams.
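Of the high-level functions listed above, window is the least self-explanatory. The following plain-Python sketch shows the idea on per-batch word counts. It does not use the Spark API: `windowed_counts` and its parameters are hypothetical names, and unlike a real streaming system it only emits results for complete windows.

```python
def windowed_counts(batch_counts, window_length, slide_interval):
    """Merge per-batch counts over a sliding window of batches.

    window_length and slide_interval are measured in batches here
    (an assumption of this sketch).
    """
    results = []
    for end in range(window_length, len(batch_counts) + 1, slide_interval):
        merged = {}
        for counts in batch_counts[end - window_length:end]:
            for word, n in counts.items():
                merged[word] = merged.get(word, 0) + n
        results.append(merged)
    return results

# One dictionary of word counts per batch interval.
batch_counts = [
    {"spark": 2},
    {"storm": 1},
    {"spark": 1, "storm": 1},
    {"spark": 3},
]

windows = windowed_counts(batch_counts, window_length=2, slide_interval=1)
for w in windows:
    print(w)
```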

Slide 22

Slide 22 text

Discretized Stream Processing I

Slide 23

Slide 23 text

Discretized Stream Processing II

Slide 24

Slide 24 text

Spark Streaming: Internally
Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
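The batching step can be sketched in plain Python. This is a simplified model and not Spark code: `discretize` and `process_batch` are hypothetical names, records are assumed to carry a timestamp in seconds, and intervals with no records are simply skipped.

```python
from collections import defaultdict

def discretize(records, batch_interval):
    """Group (timestamp, value) records into fixed-length time batches."""
    batches = defaultdict(list)
    for timestamp, value in records:
        batch_index = int(timestamp // batch_interval)
        batches[batch_index].append(value)
    # Return the batches in time order; empty intervals are omitted here.
    return [batches[i] for i in sorted(batches)]

def process_batch(batch):
    """Stand-in for the batch job run on each batch: count words."""
    counts = {}
    for line in batch:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

# (arrival time in seconds, text line) pairs standing in for a live stream.
records = [
    (0.2, "spark streaming"),
    (0.7, "spark"),
    (1.1, "storm"),
    (2.5, "spark storm"),
]

batches = discretize(records, batch_interval=1.0)
results = [process_batch(b) for b in batches]
for r in results:
    print(r)
```

Each entry in `results` corresponds to one batch of the output stream, so the result is itself a stream of batches.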

Slide 25

Slide 25 text

Spark Streaming: Internally
• It provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data.
• DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs (Resilient Distributed Datasets).
• RDDs are distributed data sets that can stay in memory and fall back to disk gracefully. A lost RDD can be rebuilt from a graph that records how it was derived. RDDs work well when you want to hold a data set in memory and run a series of queries against it, which is faster than fetching the data from disk every time.
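The idea of rebuilding lost data from a graph of derivations can be sketched with a toy class in plain Python. `LineageDataset` is a hypothetical, single-machine stand-in and not the RDD API; it only shows that a dataset which remembers its parent and its transformation can recompute itself after its in-memory copy is lost.

```python
class LineageDataset:
    """Toy dataset that remembers how to recompute itself from its parent."""

    def __init__(self, source=None, parent=None, transform=None):
        self.source = source        # raw data, only set on the root dataset
        self.parent = parent        # dataset this one was derived from
        self.transform = transform  # function applied to the parent's rows
        self.cache = None           # in-memory copy, may be lost
        self.computations = 0       # how often this dataset was (re)built

    def map(self, fn):
        return LineageDataset(parent=self,
                              transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return LineageDataset(parent=self,
                              transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        if self.cache is None:
            self.computations += 1
            if self.parent is None:
                self.cache = list(self.source)
            else:
                self.cache = self.transform(self.parent.collect())
        return self.cache

    def lose(self):
        """Simulate losing the in-memory copy, e.g. when a node dies."""
        self.cache = None

base = LineageDataset(source=range(10))
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = evens_squared.collect()    # computed and kept in memory
second = evens_squared.collect()   # served from memory, no recomputation
evens_squared.lose()
rebuilt = evens_squared.collect()  # recomputed by replaying the lineage

print(first, rebuilt, evens_squared.computations)
```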

Slide 26

Slide 26 text

Discretized Streams / DStreams
• Transformation functions on a DStream:
  – map; flatMap; filter; repartition; union; count; reduce; countByValue; reduceByKey; join; cogroup; transform; and updateStateByKey
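Two of these transformations, reduceByKey and updateStateByKey, differ in whether they look at one batch or carry state across batches. The plain-Python sketch below mimics their semantics on lists of key-value pairs. It is not the Spark API: `reduce_by_key`, `update_state_by_key` and `running_count` are hypothetical helpers, and the update function is assumed to take the new values for a key plus the previous state (or `None`).

```python
def reduce_by_key(pairs, fn):
    """Combine the values of each key within a single batch."""
    out = {}
    for key, value in pairs:
        out[key] = fn(out[key], value) if key in out else value
    return out

def update_state_by_key(state, batch_pairs, update_fn):
    """Carry per-key state from one batch to the next."""
    grouped = {}
    for key, value in batch_pairs:
        grouped.setdefault(key, []).append(value)
    new_state = {}
    # Keys with existing state are updated even if the batch has no new values.
    for key in set(state) | set(grouped):
        result = update_fn(grouped.get(key, []), state.get(key))
        if result is not None:  # returning None drops the key
            new_state[key] = result
    return new_state

def running_count(new_values, previous):
    return sum(new_values) + (previous or 0)

batches = [["a", "b", "a"], ["b", "c"], ["a"]]
state = {}
history = []
for batch in batches:
    pairs = [(word, 1) for word in batch]
    per_batch = reduce_by_key(pairs, lambda x, y: x + y)
    state = update_state_by_key(state, pairs, running_count)
    history.append((per_batch, dict(state)))

for per_batch, running in history:
    print(per_batch, running)
```

The per-batch counts reset with every batch, while the running counts keep growing, which is the stateful behavior the earlier slides describe as hard to make fault-tolerant.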

Slide 27

Slide 27 text

Examples

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

Other Interesting Operations

Slide 35

Slide 35 text

Spark Streaming vs Storm I
• Comparing:
  – Processing Model; Latency; Fault Tolerance (Every Record Processed)
• Summary: Storm is a good choice if you need sub-second latency and no data loss. Spark Streaming is better if you need stateful computation, with the guarantee that each event is processed exactly once. Spark Streaming programming logic may also be easier because it is similar to batch programming, in that you are working with batches (albeit very small ones).

Slide 36

Slide 36 text

Spark Streaming vs Storm II
• Comparing:
  – Origin; Implemented in; API Language; Batch Framework Integration
• Summary: Two advantages of Spark Streaming:
  – It is not implemented in Clojure.
  – It is well integrated with the Spark batch computation framework.

Slide 37

Slide 37 text

Spark Streaming vs Storm III
• Comparing:
  – Production Use; Hadoop Distribution; Support Available; Resource Manager Integration
• Summary: Storm has run in production much longer than Spark Streaming. However, Spark Streaming has two advantages:
  – It has a company dedicated to supporting it (Databricks), and
  – It is compatible with YARN.

Slide 38

Slide 38 text

Thank You