
Deep dive into Spark Streaming

Santosh Sahoo
November 18, 2015

Presented at Bellevue Big Data Meetup

Transcript

  1. Apache Spark: a fast, general-purpose framework for big data
     processing, with built-in modules for streaming, SQL, machine
     learning, and graph processing. The stack: Spark Core; Spark SQL
     (structured data); Spark Streaming (real-time); MLlib (machine
     learning); GraphX (graph data). Its Hadoop-ecosystem counterparts:
     MapReduce, Hive, Pig, Mahout on HDFS.
  2. How is Spark faster? RDD - a Resilient Distributed Dataset, the
     basic abstraction in Spark: an immutable, partitioned collection
     of elements that can be operated on in parallel. In-memory
     caching plus the DAG execution model is enough to run jobs
     efficiently, and combining libraries into one program is much
     faster than chaining separate tools. DataFrames are RDDs with a
     schema (schema-RDD).
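
     To make the caching and DAG points concrete, a minimal sketch
     (the app name and numbers are illustrative, not from the deck)
     of reusing a cached RDD across two actions:

       import org.apache.spark.{SparkConf, SparkContext}

       val sc = new SparkContext(
         new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))

       // Transformations only build a lazy DAG; nothing runs yet.
       val evens = sc.parallelize(1 to 1000000)
         .filter(_ % 2 == 0)
         .cache() // keep the partitions in memory after first use

       // Each action executes the DAG; the second reuses the cached
       // partitions instead of recomputing the filter.
       println(evens.count())
       println(evens.sum())

       sc.stop()
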
  3. Streaming use cases • Stock market • Clickstream analysis •
     Fraud detection • Real-time bidding • Trend analysis •
     Real-time data warehousing • ...
  4. Streaming data pipeline: Source -> Flow manager -> Streaming
     processor -> Storage -> Dashboard.
     • Sources: applications, mobile devices, IoT sensors, database
       CDC, log scraping, async actors (Akka)
     • Flow managers (message queues): Kafka, Flume, Azure Event Hub,
       AWS Kinesis, HDFS
     • Streaming processors: Storm, Spark Streaming, Azure Stream
       Analytics, Samza, Flink, Heron
     • Storage: RDBMS, NoSQL, HDFS, DW/Redshift
     • Dashboards: custom apps, D3, Tableau, Cognos, Excel
  5. Spark Streaming: a data processing framework to build streaming
     applications. Why? 1. Scalable 2. Fault-tolerant 3. Simpler
     4. Modular 5. Code reuse
  6. But Spark vs Storm..? • Storm is a stream processing framework
     that also does micro-batching (Trident). • Spark is a batch
     processing framework that also does micro-batching (Spark
     Streaming). Also read:
     https://www.quora.com/What-are-the-differences-between-Apache-Spark-and-Apache-Flink/answer/Santosh-Sahoo
  7. Stream.scala

       import kafka.serializer.StringDecoder
       import org.apache.spark.{SparkConf, SparkContext}
       import org.apache.spark.streaming.{Seconds, StreamingContext}
       import org.apache.spark.streaming.kafka.KafkaUtils

       val conf = new SparkConf().setAppName("demoapp").setMaster("local[1]")
       val sc = new SparkContext(conf)
       // 2-second batch interval: each micro-batch becomes one RDD.
       val ssc = new StreamingContext(sc, Seconds(2))
       val kafkaConfig = Map("metadata.broker.list" -> "localhost:9092")
       val topics = Set("topic1")
       // Receiver-less direct stream; yields (key, message) pairs.
       val wordstream = KafkaUtils.createDirectStream[
         String, String, StringDecoder, StringDecoder](ssc, kafkaConfig, topics)
       wordstream.print()
       ssc.start()
       ssc.awaitTermination()
  8. DStream operations: 1. map(func) 2. flatMap(func) 3. filter(func)
     4. repartition(numPartitions) 5. union(otherStream) 6. count()
     7. reduce(func) 8. countByValue() 9. reduceByKey(func, [numTasks])
     10. join(otherStream, [numTasks]) 11. cogroup(otherStream,
     [numTasks]) 12. transform(func) 13. updateStateByKey(func)
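
     Of the operations listed, updateStateByKey is the one that
     carries state across batches. A minimal sketch, assuming the
     wordstream and ssc from the previous slide (the checkpoint path
     is illustrative):

       // Stateful operators require a checkpoint directory.
       ssc.checkpoint("/tmp/checkpoints")

       val totals = wordstream
         .flatMap(_._2.split(" "))   // message text -> words
         .map(word => (word, 1))
         .updateStateByKey[Int] { (batch: Seq[Int], state: Option[Int]) =>
           // Fold this batch's counts into the running total per key.
           Some(batch.sum + state.getOrElse(0))
         }
       totals.print()
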
  9. Word count

       // The direct stream yields (key, message) pairs, so extract
       // and split the message text before counting.
       val words = wordstream.flatMap(_._2.split(" "))
       val pairs = words.map(word => (word, 1))
       val wordCounts = pairs.reduceByKey(_ + _)
       wordCounts.print()
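
     A natural extension, not on the slide: count over a sliding
     window instead of per 2-second batch. A sketch using the pairs
     stream above; window and slide durations must be multiples of
     the batch interval:

       val windowedCounts = pairs.reduceByKeyAndWindow(
         (a: Int, b: Int) => a + b, // combine counts inside the window
         Seconds(30),               // window length
         Seconds(10))               // slide interval
       windowedCounts.print()
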
  10. Spark Streaming architecture: the Driver (on the Master)
      schedules tasks on Executors across the Workers. A Receiver
      running in one executor ingests data from the source as blocks
      (D1, D2, D3, D4), replicates them to another executor, and can
      write them to a write-ahead log (WAL) in the data store.
      DStream - discretized stream of RDDs. RDD - Resilient
      Distributed Datasets.
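
      The WAL shown in the diagram is off by default for
      receiver-based input. A minimal sketch of enabling it (the
      config key is Spark's own; the checkpoint path is illustrative):

        val conf = new SparkConf()
          .setAppName("demoapp")
          // Write received blocks to the WAL before processing, so
          // buffered input survives a driver or receiver failure.
          .set("spark.streaming.receiver.writeAheadLog.enable", "true")

        val ssc = new StreamingContext(conf, Seconds(2))
        // The WAL and recovery metadata live under the checkpoint
        // directory, which should be fault-tolerant storage like HDFS.
        ssc.checkpoint("hdfs:///checkpoints/demoapp")
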
  11. Running the application:

        # use --master local[*] instead for a local run
        spark-submit \
          --class AppMain \
          --master spark://192.168.10.21:7077 \
          --executor-memory 20G \
          --total-executor-cores 100 \
          /path/to/code.jar \
          1000
  12. How it worked..? A producer pushes random() events to a Kafka
      topic; Spark workers consume it, writing results to Redis and
      Parquet files on HDFS; a Node.js server streams the Redis
      results over SSE to an HTML5/D3.js dashboard.
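
      The Spark-to-Redis hop is the custom glue in that demo. A
      minimal sketch of the usual pattern, assuming the Jedis client
      and the wordCounts stream from slide 9 (the hash key is
      illustrative):

        import redis.clients.jedis.Jedis

        wordCounts.foreachRDD { rdd =>
          rdd.foreachPartition { partition =>
            // One connection per partition: Jedis is not serializable,
            // so it cannot be created on the driver and shipped out.
            val jedis = new Jedis("localhost", 6379)
            partition.foreach { case (word, count) =>
              // Accumulate counts in a hash for the dashboard to read.
              jedis.hincrBy("wordcounts", word, count.toLong)
            }
            jedis.close()
          }
        }
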
  13. Composite example - pseudocode combining SQL, MLlib, and
      streaming in one program:

        // Load data using SQL
        points = ctx.sql("select latitude, longitude from hive_tweets")

        // Train a machine learning model
        model = KMeans.train(points, 10)

        // Apply it to a stream
        sc.twitterStream(...)
          .map(lambda t: (model.predict(t.location), 1))
          .reduceByWindow("5s", lambda a, b: a + b)
  14. Apache Kafka - a no-nonsense logging platform:
      • ~100K msgs/s throughput vs ~20K for RabbitMQ
      • Log compaction
      • Durable persistence
      • Partition tolerance
      • Replication
      • Best-in-class integration with Spark:
        http://spark.apache.org/docs/latest/streaming-kafka-integration.html
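
      For completeness, a minimal producer sketch feeding the topic1
      stream from slide 7 with the standard Kafka client (broker
      address and payloads are the demo's assumptions):

        import java.util.Properties
        import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092")
        props.put("key.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")

        val producer = new KafkaProducer[String, String](props)
        // Emit a few random values, like the demo's random() producer.
        for (i <- 1 to 10) {
          producer.send(new ProducerRecord[String, String](
            "topic1", s"value-${scala.util.Random.nextInt(100)}"))
        }
        producer.close()
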
  15. Example from Netflix: BDT318 - Netflix Keystone: How Netflix
      Handles Data Streams Up to 8 Million Events Per Second
  16. [Architecture diagram: "Expense Travel TTX API Concur Next Gen".
      OLTP sources and imports (FTP/HTTP/SMTP; Protobuf, JSON) feed a
      Kafka broker; Spark is the stream processor (normalization,
      extraction, compensation; data quality, correction, analytics);
      results land in HDFS, Hive/Spark SQL, and Tachyon, and replicate
      to HANA OLAP (load-balanced, with failover) for reporting via
      Cognos and Tableau, migrating methods via API/SQL and a service
      bus.]