Mining of Data Streams

vikhyat

May 07, 2015

Transcript

  1. Motivation
     • The data cannot all be stored in main memory because it arrives too fast. (In 2010 Twitter was generating 8TB of data every day; in 2012 Facebook was generating 500TB of data each day.)
     • Latency constraints. (Example: Trending Topics)
  2. Stream Computing Model
     [Diagram with labels: Input Stream, Stream Processor, Standing Queries, Ad-hoc Queries, Working Memory, Archival Storage, Output Stream]
  3. Stream Queries
     • Standing queries: permanently executing queries that generate output at appropriate times. Example: 95th percentile latency.
     • Ad-hoc queries: questions asked about the current state of a stream. These are usually supported by storing a summary of the stream (if we know what kind of queries will be asked) or by storing a sliding window.
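The two query types above can be sketched in a few lines. This is only an illustration, assuming the stream is a sequence of latency values and that a sliding window of raw values fits in memory; the names `SlidingWindowPercentile`, `standing_query`, `window_size`, and `emit_every` are hypothetical and do not come from the deck.

```python
from collections import deque


class SlidingWindowPercentile:
    """Hypothetical helper: keeps the most recent `window_size` latencies."""

    def __init__(self, window_size, percentile=95):
        # A bounded deque discards the oldest value once it is full.
        self.window = deque(maxlen=window_size)
        self.percentile = percentile  # integer between 1 and 100

    def add(self, latency_ms):
        self.window.append(latency_ms)

    def query(self):
        """Ad-hoc query: nearest-rank percentile of the current window."""
        if not self.window:
            return None
        ordered = sorted(self.window)
        # Integer ceiling of (percentile / 100) * n, avoiding float rounding.
        rank = (self.percentile * len(ordered) + 99) // 100
        return ordered[rank - 1]


def standing_query(stream, window_size, emit_every):
    """Standing query: yields the 95th percentile every `emit_every` elements."""
    summary = SlidingWindowPercentile(window_size, percentile=95)
    for i, latency_ms in enumerate(stream, start=1):
        summary.add(latency_ms)
        if i % emit_every == 0:
            yield summary.query()
```

Sorting the window on every query is the simplest choice here; a real system would more likely keep a compact summary of the stream instead of the raw values.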

  4. Clustering of Data Streams
     • Model: sliding window of N points.
     • We can ask for the centroids/clustroids of the best clusters formed by the last m of these points, for any m ≤ N.
  5. Stream Computing Model
     • Each stream element is a point in some space.
     • The sliding window consists of the N most recent points.
     • We want to pre-cluster points in the stream so that we can quickly get the clusters of the last m points, where m ≤ N.
  6. Stream Clustering: BDMO
     • Similar to DGIM.
     • Stream points are partitioned and summarized by buckets whose sizes are powers of two.
     • There are one or two buckets of each size, up to a limit.
     • Bucket sizes are non-decreasing as we go back in time, so the number of buckets is O(log N).
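The bucket bookkeeping described above can be sketched as follows. This is a simplified illustration, not the full BDMO algorithm: it assumes each bucket keeps a single summary (a point count and a coordinate-wise sum), whereas BDMO buckets hold a set of cluster summaries that are themselves merged. The names `Bucket`, `BucketWindow`, and `centroid_of_last` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    size: int      # number of points summarized (a power of two)
    newest: int    # timestamp of the most recent point in the bucket
    total: tuple   # coordinate-wise sum of the points


class BucketWindow:
    def __init__(self, window_size):
        self.window_size = window_size
        self.time = 0
        self.buckets = []  # ordered newest first

    def add(self, point):
        self.time += 1
        # Drop buckets whose most recent point has left the window.
        while self.buckets and self.buckets[-1].newest <= self.time - self.window_size:
            self.buckets.pop()
        self.buckets.insert(0, Bucket(1, self.time, tuple(point)))
        # Whenever three buckets share a size, merge the two oldest of them.
        i = 0
        while i + 2 < len(self.buckets) and self.buckets[i].size == self.buckets[i + 2].size:
            newer, older = self.buckets[i + 1], self.buckets[i + 2]
            merged = Bucket(
                newer.size + older.size,
                newer.newest,
                tuple(a + b for a, b in zip(newer.total, older.total)),
            )
            self.buckets[i + 1:i + 3] = [merged]
            i += 1  # the merge may create three buckets of the next size

    def centroid_of_last(self, m):
        """Centroid of the fewest newest buckets covering at least m points."""
        count, total = 0, None
        for bucket in self.buckets:
            total = bucket.total if total is None else tuple(
                a + b for a, b in zip(total, bucket.total))
            count += bucket.size
            if count >= m:
                break
        return tuple(x / count for x in total), count
```

Because whole buckets are used, a query for the last m points may be answered from somewhat more than m points; the returned count makes that overshoot visible.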
  7. Clustering in a Parallel Environment
     • We use map-reduce with a single reduce task.
     • Each map task is given a subset of the points and emits a set of key-value pairs where the key is 1 and the value describes a single cluster.
     • The reduce task merges the clusters.
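The map and reduce steps above might look like the sketch below, run in a single process for illustration. The deck does not say how clusters are formed or merged, so this assumes a simple greedy rule: a summary joins the first cluster whose centroid lies within a fixed `radius`. The functions `absorb`, `map_task`, `reduce_task`, and `run`, and the `radius` parameter, are all hypothetical.

```python
import math
from collections import defaultdict


def centroid(summary):
    count, total = summary
    return tuple(x / count for x in total)


def absorb(clusters, summary, radius):
    """Add `summary` to the first cluster within `radius`, else start a new one."""
    count, total = summary
    center = centroid(summary)
    for i, (c_count, c_total) in enumerate(clusters):
        if math.dist(centroid((c_count, c_total)), center) <= radius:
            clusters[i] = (c_count + count,
                           tuple(a + b for a, b in zip(c_total, total)))
            return
    clusters.append(summary)


def map_task(points, radius):
    """Cluster one subset of points; emit (1, cluster summary) pairs."""
    clusters = []
    for p in points:
        absorb(clusters, (1, tuple(p)), radius)
    return [(1, c) for c in clusters]


def reduce_task(key, summaries, radius):
    """Merge the cluster summaries emitted by all map tasks."""
    merged = []
    for s in summaries:
        absorb(merged, s, radius)
    return merged


def run(chunks, radius):
    grouped = defaultdict(list)
    for chunk in chunks:
        for key, value in map_task(chunk, radius):
            grouped[key].append(value)
    # Every pair carries the key 1, so exactly one reduce call does all merging.
    return reduce_task(1, grouped[1], radius)
```

Each cluster is described by a (count, coordinate-wise sum) pair, so merging two clusters is a pair of additions and the centroid can be recovered at any time.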