Mining of Data Streams

Mining Data Streams

Motivation
• Not possible to store data in main memory since it is received too fast. (In 2010 Twitter was generating 8TB of data every day; in 2012 Facebook was generating 500TB of data each day.)
• Latency constraints. (Trending Topics)

Stream Computing Model
[Diagram: Input Stream, Stream Processor, Output Stream; the Stream Processor is shown with Standing Queries, Ad-hoc queries, Working Memory, and Archival Storage.]

Stream Queries
• Standing queries: permanently executing queries that generate output at appropriate times. Example: 95th percentile latency.
• Ad-hoc queries: questions asked about the current state of a stream. Usually supported by storing a summary of the stream if we know what kind of queries will be asked, or by storing a sliding window.
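The two query types above can be illustrated with a small sketch. It assumes the stream is a sequence of numeric latency values and that a fixed-size sliding window is an acceptable summary; the class name `SlidingWindowPercentile` and the nearest-rank percentile rule are choices made for this example, not something the slides specify.

```python
from collections import deque


class SlidingWindowPercentile:
    """Keeps the last `window_size` values and reports a percentile over them."""

    def __init__(self, window_size, percentile=95):
        # deque with maxlen drops the oldest value automatically
        self.window = deque(maxlen=window_size)
        self.percentile = percentile

    def update(self, value):
        """Standing query: consume one element, return the current percentile."""
        self.window.append(value)
        return self.query()

    def query(self):
        """Ad-hoc query against the stored window (nearest-rank method)."""
        if not self.window:
            return None
        ordered = sorted(self.window)
        # integer ceiling of percentile * n / 100, avoiding float rounding
        rank = (self.percentile * len(ordered) + 99) // 100
        return ordered[max(rank, 1) - 1]
```

Calling `update` on every arriving element plays the role of the standing query, while `query` can be called at any time as an ad-hoc query. Sorting the window on each call is fine for a sketch but would be replaced by a cheaper summary in practice.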

Frameworks for Processing Data Streams

Spark Streaming

Storm

Summingbird / Scalding

http://github.com/vikhyat/StormyCloud

Clustering of Data Streams

Clustering of Data Streams
• Model: Sliding window of N points.
• We can ask for the centroids/clustroids of the best clusters formed by the last m of these points, for any m ≤ N.

Stream Computing Model
• Each stream element is a point in some space.
• Sliding window consists of the N most recent points.
• We want to pre-cluster points in the stream so that we can quickly get the clusters of the last m points, where m ≤ N.

Stream Clustering: BDMO
• Similar to DGIM.
• Stream points are partitioned and summarized by buckets whose sizes are powers of two.
• There are one or two buckets of each size up to a limit.
• Bucket sizes are non-decreasing as we go back in time, so the number of buckets is O(log N).

Initializing Buckets

Merging Buckets

Answering Queries
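The bucket bookkeeping can be sketched as follows. This is a minimal illustration under several assumptions that the slides do not state: points are 1-D numbers, each bucket is summarized by a single sum (so one centroid per bucket rather than a set of clusters), a new bucket is started every `base_size` points, and merging follows the DGIM-style rule of combining the two oldest buckets whenever three share a size. The names `Bucket`, `BucketWindow`, and `covering` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    size: int      # number of points summarized: base_size times a power of two
    newest: int    # timestamp of the most recent point in the bucket
    total: float   # sum of the points; total / size gives a centroid


class BucketWindow:
    def __init__(self, base_size, window_size):
        self.p = base_size
        self.n = window_size
        self.buckets = []   # newest bucket first
        self.pending = []   # points not yet placed in a bucket
        self.clock = 0

    def add(self, x):
        """Initialize a bucket every p points, then merge and expire."""
        self.clock += 1
        self.pending.append(x)
        if len(self.pending) == self.p:
            self.buckets.insert(0, Bucket(self.p, self.clock, sum(self.pending)))
            self.pending = []
            self._merge()
        # drop buckets whose newest point has left the window
        self.buckets = [b for b in self.buckets if b.newest > self.clock - self.n]

    def _merge(self):
        """While three buckets share a size, merge the two oldest of them."""
        i = 0
        while i + 2 < len(self.buckets):
            a, b, c = self.buckets[i:i + 3]
            if not (a.size == b.size == c.size):
                break
            # b is newer than c, so the merged bucket keeps b's timestamp
            merged = Bucket(b.size + c.size, b.newest, b.total + c.total)
            self.buckets[i + 1:i + 3] = [merged]
            i += 1  # the merge may have created three buckets of the next size

    def covering(self, m):
        """Return the newest buckets that together hold at least m points.

        Points still in `pending` are ignored in this sketch.
        """
        chosen, count = [], 0
        for b in self.buckets:
            if count >= m:
                break
            chosen.append(b)
            count += b.size
        return chosen
```

A query for the last m points would pool the summaries of the buckets returned by `covering(m)`; those buckets may hold somewhat more than m points, which is the price of storing only O(log N) summaries.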

Clustering in a Parallel Environment
• We use map-reduce with a single reduce task.
• Each map task is given a subset of the points and emits a set of key-value pairs where the key is 1 and the value describes a single cluster.
• The reduce task merges the clusters.
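The map and reduce steps above can be simulated in plain Python. The sketch below makes assumptions the slides leave open: points are 1-D numbers, a cluster is described by a `(count, sum)` summary, and both the per-partition clustering and the merge use a simple gap threshold. The function names and the `gap` parameter are hypothetical and stand in for whatever clustering method is actually used.

```python
from itertools import chain


def cluster_1d(points, gap):
    """Group sorted 1-D points; a jump larger than `gap` starts a new cluster.

    Returns a list of (count, sum) summaries, one per cluster.
    """
    summaries = []
    prev = None
    for x in sorted(points):
        if prev is not None and x - prev <= gap:
            n, s = summaries[-1]
            summaries[-1] = (n + 1, s + x)
        else:
            summaries.append((1, x))
        prev = x
    return summaries


def map_task(points, gap):
    """Emit (1, summary) pairs; the constant key sends all to one reducer."""
    return [(1, summary) for summary in cluster_1d(points, gap)]


def reduce_task(key, summaries, gap):
    """Merge summaries whose centroids lie within `gap` of the previous one."""
    merged = []
    for n, s in sorted(summaries, key=lambda c: c[1] / c[0]):
        if merged and s / n - merged[-1][1] / merged[-1][0] <= gap:
            m, t = merged[-1]
            merged[-1] = (m + n, t + s)
        else:
            merged.append((n, s))
    return merged


def run(partitions, gap):
    """Simulate the job: map each partition, group by the single key, reduce."""
    pairs = chain.from_iterable(map_task(p, gap) for p in partitions)
    values = [value for _key, value in pairs]
    return reduce_task(1, values, gap)
```

Because every pair carries the same key, all cluster summaries arrive at one reduce task, which is what lets it see and merge clusters that were split across map tasks.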


vikhyat
