
Deep dive into Spark Streaming

Santosh Sahoo
November 18, 2015

Presented at Bellevue Big Data Meetup

Transcript

  1. Apache Spark: a fast, general-purpose framework for big data
     processing, with built-in modules for streaming, SQL, machine
     learning, and graph processing. The stack: Spark Core; Spark SQL
     (structured data); Spark Streaming (real-time); MLlib (machine
     learning); GraphX (graph data). Its Hadoop-ecosystem counterparts:
     MapReduce, Hive, Pig, Mahout on HDFS.
  2. How is Spark faster? RDD - a Resilient Distributed Dataset, the
     basic abstraction in Spark: an immutable, partitioned collection
     of elements that can be operated on in parallel. In-memory
     caching plus the DAG execution model is enough to run jobs
     efficiently, and combining libraries into one program is much
     faster than chaining separate tools. DataFrames are RDDs with a
     schema (schema-RDD).
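
     To make the caching and DAG points concrete, a minimal sketch
     (the app name and numbers are illustrative, not from the deck)
     of reusing a cached RDD across two actions:

       import org.apache.spark.{SparkConf, SparkContext}

       val sc = new SparkContext(
         new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))

       // Transformations only build a lazy DAG; nothing runs yet.
       val evens = sc.parallelize(1 to 1000000)
         .filter(_ % 2 == 0)
         .cache() // keep the partitions in memory after first use

       // Each action executes the DAG; the second reuses the cached
       // partitions instead of recomputing the filter.
       println(evens.count())
       println(evens.sum())

       sc.stop()
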
  3. Streaming use cases • Stock market • Clickstream analysis •
     Fraud detection • Real-time bidding • Trend analysis •
     Real-time data warehousing • ...
  4. Streaming data pipeline: Source -> Flow manager -> Streaming
     processor -> Storage -> Dashboard.
     • Sources: applications, mobile devices, IoT sensors, database
       CDC, log scraping, async actors (Akka)
     • Flow managers (message queues): Kafka, Flume, Azure Event Hub,
       AWS Kinesis, HDFS
     • Streaming processors: Storm, Spark Streaming, Azure Stream
       Analytics, Samza, Flink, Heron
     • Storage: RDBMS, NoSQL, HDFS, DW/Redshift
     • Dashboards: custom apps, D3, Tableau, Cognos, Excel
  5. Spark Streaming: a data processing framework to build streaming
     applications. Why? 1. Scalable 2. Fault-tolerant 3. Simpler
     4. Modular 5. Code reuse
  6. But Spark vs Storm..? • Storm is a stream processing framework
     that also does micro-batching (Trident). • Spark is a batch
     processing framework that also does micro-batching (Spark
     Streaming). Also read:
     https://www.quora.com/What-are-the-differences-between-Apache-Spark-and-Apache-Flink/answer/Santosh-Sahoo
  7. Stream.scala

       import kafka.serializer.StringDecoder
       import org.apache.spark.{SparkConf, SparkContext}
       import org.apache.spark.streaming.{Seconds, StreamingContext}
       import org.apache.spark.streaming.kafka.KafkaUtils

       val conf = new SparkConf().setAppName("demoapp").setMaster("local[1]")
       val sc = new SparkContext(conf)
       // 2-second batch interval: each micro-batch becomes one RDD.
       val ssc = new StreamingContext(sc, Seconds(2))
       val kafkaConfig = Map("metadata.broker.list" -> "localhost:9092")
       val topics = Set("topic1")
       // Receiver-less direct stream; yields (key, message) pairs.
       val wordstream = KafkaUtils.createDirectStream[
         String, String, StringDecoder, StringDecoder](ssc, kafkaConfig, topics)
       wordstream.print()
       ssc.start()
       ssc.awaitTermination()
  8. DStream operations: 1. map(func) 2. flatMap(func) 3. filter(func)
     4. repartition(numPartitions) 5. union(otherStream) 6. count()
     7. reduce(func) 8. countByValue() 9. reduceByKey(func, [numTasks])
     10. join(otherStream, [numTasks]) 11. cogroup(otherStream,
     [numTasks]) 12. transform(func) 13. updateStateByKey(func)
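
     Of the operations listed, updateStateByKey is the one that
     carries state across batches. A minimal sketch, assuming the
     wordstream and ssc from the previous slide (the checkpoint path
     is illustrative):

       // Stateful operators require a checkpoint directory.
       ssc.checkpoint("/tmp/checkpoints")

       val totals = wordstream
         .flatMap(_._2.split(" "))   // message text -> words
         .map(word => (word, 1))
         .updateStateByKey[Int] { (batch: Seq[Int], state: Option[Int]) =>
           // Fold this batch's counts into the running total per key.
           Some(batch.sum + state.getOrElse(0))
         }
       totals.print()
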
  9. Word count

       // The direct stream yields (key, message) pairs, so extract
       // and split the message text before counting.
       val words = wordstream.flatMap(_._2.split(" "))
       val pairs = words.map(word => (word, 1))
       val wordCounts = pairs.reduceByKey(_ + _)
       wordCounts.print()
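
     A natural extension, not on the slide: count over a sliding
     window instead of per 2-second batch. A sketch using the pairs
     stream above; window and slide durations must be multiples of
     the batch interval:

       val windowedCounts = pairs.reduceByKeyAndWindow(
         (a: Int, b: Int) => a + b, // combine counts inside the window
         Seconds(30),               // window length
         Seconds(10))               // slide interval
       windowedCounts.print()
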
  10. Spark Streaming architecture: the Driver (on the Master)
      schedules tasks on Executors across the Workers. A Receiver
      running in one executor ingests data from the source as blocks
      (D1, D2, D3, D4), replicates them to another executor, and can
      write them to a write-ahead log (WAL) in the data store.
      DStream - discretized stream of RDDs. RDD - Resilient
      Distributed Datasets.
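
      The WAL shown in the diagram is off by default for
      receiver-based input. A minimal sketch of enabling it (the
      config key is Spark's own; the checkpoint path is illustrative):

        val conf = new SparkConf()
          .setAppName("demoapp")
          // Write received blocks to the WAL before processing, so
          // buffered input survives a driver or receiver failure.
          .set("spark.streaming.receiver.writeAheadLog.enable", "true")

        val ssc = new StreamingContext(conf, Seconds(2))
        // The WAL and recovery metadata live under the checkpoint
        // directory, which should be fault-tolerant storage like HDFS.
        ssc.checkpoint("hdfs:///checkpoints/demoapp")
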
  11. Running the application:

        # use --master local[*] instead for a local run
        spark-submit \
          --class AppMain \
          --master spark://192.168.10.21:7077 \
          --executor-memory 20G \
          --total-executor-cores 100 \
          /path/to/code.jar \
          1000
  12. How it worked..? A producer pushes random() events to a Kafka
      topic; Spark workers consume it, writing results to Redis and
      Parquet files on HDFS; a Node.js server streams the Redis
      results over SSE to an HTML5/D3.js dashboard.
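
      The Spark-to-Redis hop is the custom glue in that demo. A
      minimal sketch of the usual pattern, assuming the Jedis client
      and the wordCounts stream from slide 9 (the hash key is
      illustrative):

        import redis.clients.jedis.Jedis

        wordCounts.foreachRDD { rdd =>
          rdd.foreachPartition { partition =>
            // One connection per partition: Jedis is not serializable,
            // so it cannot be created on the driver and shipped out.
            val jedis = new Jedis("localhost", 6379)
            partition.foreach { case (word, count) =>
              // Accumulate counts in a hash for the dashboard to read.
              jedis.hincrBy("wordcounts", word, count.toLong)
            }
            jedis.close()
          }
        }
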
  13. Composite example - pseudocode combining SQL, MLlib, and
      streaming in one program:

        // Load data using SQL
        points = ctx.sql("select latitude, longitude from hive_tweets")

        // Train a machine learning model
        model = KMeans.train(points, 10)

        // Apply it to a stream
        sc.twitterStream(...)
          .map(lambda t: (model.predict(t.location), 1))
          .reduceByWindow("5s", lambda a, b: a + b)
  14. Apache Kafka - a no-nonsense logging platform:
      • ~100K msgs/s throughput vs ~20K for RabbitMQ
      • Log compaction
      • Durable persistence
      • Partition tolerance
      • Replication
      • Best-in-class integration with Spark:
        http://spark.apache.org/docs/latest/streaming-kafka-integration.html
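
      For completeness, a minimal producer sketch feeding the topic1
      stream from slide 7 with the standard Kafka client (broker
      address and payloads are the demo's assumptions):

        import java.util.Properties
        import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092")
        props.put("key.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")

        val producer = new KafkaProducer[String, String](props)
        // Emit a few random values, like the demo's random() producer.
        for (i <- 1 to 10) {
          producer.send(new ProducerRecord[String, String](
            "topic1", s"value-${scala.util.Random.nextInt(100)}"))
        }
        producer.close()
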
  15. Example from Netflix: BDT318 - Netflix Keystone: How Netflix
      Handles Data Streams Up to 8 Million Events Per Second
  16. [Architecture diagram: "Expense Travel TTX API Concur Next Gen".
      OLTP sources and imports (FTP/HTTP/SMTP; Protobuf, JSON) feed a
      Kafka broker; Spark is the stream processor (normalization,
      extraction, compensation; data quality, correction, analytics);
      results land in HDFS, Hive/Spark SQL, and Tachyon, and replicate
      to HANA OLAP (load-balanced, with failover) for reporting via
      Cognos and Tableau, migrating methods via API/SQL and a service
      bus.]