Slide 54
3. Products - Spark Streaming. Concepts
Source: http://spark.apache.org/docs/latest/streaming-programming-guide.html
•RDD (Resilient Distributed Datasets):
Distributed memory abstraction to perform in-memory
computations on large clusters in a fault-tolerant manner by
logging the transformations used to build a dataset (its lineage)
rather than the actual data*.
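The lineage idea above can be sketched in plain Python (this is a conceptual illustration, not the Spark API): instead of checkpointing the derived data, only the base data and the log of transformations are kept, so a lost partition can be recomputed.

```python
# Conceptual sketch of lineage-based fault tolerance (not the Spark API):
# keep the base data plus the logged transformations; any lost result
# can be rebuilt by replaying the log.

base = [1, 2, 3, 4]                                  # original dataset
lineage = [
    lambda xs: [x * 2 for x in xs],                  # logged transformation 1 (map)
    lambda xs: [x for x in xs if x > 4],             # logged transformation 2 (filter)
]

def recompute(base, lineage):
    """Rebuild the derived dataset by replaying the transformation log."""
    data = base
    for transform in lineage:
        data = transform(data)
    return data

print(recompute(base, lineage))  # [6, 8] -- recomputed, not read from a cache
```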
•DStream: sequence of RDDs representing a stream of data.
Input DStream coming from Twitter, HDFS, Kafka, Flume,
ZeroMQ, Akka Actor, TCP sockets…
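A DStream can be pictured as a sequence of micro-batches, one per time interval; a transformation on the stream is applied to every batch. A minimal sketch in plain Python (again conceptual, not the Spark API):

```python
# Conceptual sketch: a DStream modeled as a sequence of micro-batches,
# one per time interval. A stream transformation maps over every batch.

dstream = [["spark", "streaming"], ["spark"], []]  # three time intervals

def map_stream(dstream, f):
    """Apply f to every element of every micro-batch."""
    return [[f(x) for x in batch] for batch in dstream]

upper = map_stream(dstream, str.upper)
print(upper)  # [['SPARK', 'STREAMING'], ['SPARK'], []]
```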
•Transformations: modify data from one DStream to another.
•Standard RDD operations: map, countByValue, reduce, join
•Stateful operations: window, countByValueAndWindow
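The stateful case can be sketched as follows (plain Python, hypothetical helper names, not Spark's actual implementation): a windowed count looks at the last few micro-batches rather than only the current one.

```python
from collections import Counter

# Conceptual sketch of a stateful windowed operation in the spirit of
# countByValueAndWindow: count element occurrences over the last
# `window` micro-batches at each interval, not just the newest batch.

batches = [["a", "b"], ["b"], ["a", "a"]]

def count_by_value_and_window(batches, window):
    out = []
    for i in range(len(batches)):
        span = batches[max(0, i - window + 1): i + 1]   # last `window` batches
        out.append(Counter(x for batch in span for x in batch))
    return out

counts = count_by_value_and_window(batches, window=2)
print(counts[-1])  # at the last interval: Counter({'a': 2, 'b': 1})
```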
•Output operations: send data to external entity
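An output operation can likewise be sketched as pushing each processed batch to an external system (the in-memory `sink` list below is a stand-in for a database, file, or dashboard):

```python
# Conceptual sketch: an output operation delivers every micro-batch to
# an external entity. Here `sink` stands in for that external system.

sink = []

def for_each_batch(dstream, action):
    """Run the output action once per micro-batch."""
    for batch in dstream:
        action(batch)

for_each_batch([[1, 2], [3]], sink.extend)
print(sink)  # [1, 2, 3]
```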
Christine Doig. Víctor Herrero. June 2014