Data Stream Clustering Implementation

vikhyat

May 07, 2015

Transcript

  1. Why use a separate State Manager component?

    • Spark Streaming does not have support for ad-hoc queries.
    • It is also very inconvenient to maintain state in the way the BDMO algorithm requires using Spark Streaming. (It can be done using updateStateByKey, but the resulting state cannot be queried externally.)
  2. Spark Component

    • Reads tweet information from Twitter’s streaming API.
    • Filters out tweets without geolocation information.
    • Clusters the locations in each 30s window of tweets using k-means.
    • Passes the results of the clustering to the State Manager component.
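
The filter-then-cluster step for one window can be sketched as follows. This is a minimal illustration in plain Python rather than the deck's Spark code: the `coordinates` field name, the helper names, and seeding k-means with the first k points are all assumptions made for the example.

```python
import math

def has_geo(tweet):
    # Hypothetical tweet shape: a dict with an optional "coordinates" pair.
    return tweet.get("coordinates") is not None

def kmeans(points, k, iterations=10):
    """Plain Lloyd's k-means. Seeding with the first k points is an
    assumption chosen to keep the sketch deterministic."""
    if not points:
        return []
    k = min(k, len(points))
    centers = [points[i] for i in range(k)]
    counts = [0] * k
    for _ in range(iterations):
        sums = [[0.0, 0.0] for _ in range(k)]
        counts = [0] * k
        # Assignment step: each point goes to its nearest center.
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            sums[j][0] += p[0]
            sums[j][1] += p[1]
            counts[j] += 1
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its previous center).
        centers = [
            (sums[j][0] / counts[j], sums[j][1] / counts[j]) if counts[j] else centers[j]
            for j in range(k)
        ]
    # Each result pairs a center with its point count from the last assignment pass.
    return list(zip(centers, counts))

def cluster_window(tweets, k):
    """Drop tweets without geolocation, then cluster the remaining locations."""
    points = [tuple(t["coordinates"]) for t in tweets if has_geo(t)]
    return kmeans(points, k)
```

The list of (center, count) pairs returned by `cluster_window` stands in for what would be passed to the State Manager for each window.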
  3. State Manager Component

    • Implements the core BDMO algorithm.
    • Creates a bucket for every set of clusters received.
    • Allows ad-hoc queries.
  4. Bucket Merge Heuristic

    • For each point in the smaller bucket, find the closest point in the larger bucket and update it to the size-weighted average position of the two points.
    • Meaningful when cluster parameters change gradually rather than abruptly, which is generally the case in a stream.
    • Can be enhanced by weighting by the number of points associated with each cluster rather than by the bucket size.
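
A minimal sketch of the merge heuristic described on this slide, assuming a bucket is represented as a `(size, centers)` tuple. That representation and the function name are hypothetical. The slide does not say whether the nearest point is searched among the original or the already-updated positions of the larger bucket, so this sketch searches the running, updated list.

```python
import math

def merge_buckets(a, b):
    """Merge two buckets, each a (size, [center, ...]) tuple.

    Every center of the smaller bucket is folded into its nearest center
    in the larger bucket, using the bucket sizes as weights.
    """
    (small_size, small_pts), (large_size, large_pts) = sorted(
        [a, b], key=lambda bucket: bucket[0]
    )
    total = small_size + large_size
    merged = list(large_pts)
    for p in small_pts:
        # Nearest center in the larger bucket (as updated so far).
        j = min(range(len(merged)), key=lambda i: math.dist(p, merged[i]))
        q = merged[j]
        # Size-weighted average position of the two points.
        merged[j] = tuple(
            (large_size * qc + small_size * pc) / total for qc, pc in zip(q, p)
        )
    return (total, merged)
```

For example, merging a size-1 bucket holding `(0.0, 0.0)` into a size-3 bucket holding `(4.0, 0.0)` moves that center to `(3.0, 0.0)`, a quarter of the way toward the smaller bucket's point. The enhancement mentioned in the last bullet would replace the two bucket sizes with per-cluster point counts.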
  5. Performance (Spark)

    • The Spark component is completely horizontally scalable: throughput will scale linearly with the number of nodes.
    • This is important because the Spark component is what consumes data from the stream and needs to be able to cope with sudden increases in load.
    • Processes 3000-4000 tweets/second on a single node. (For context, on a regular day Twitter gets 5700 tweets/second on average.)
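
As a rough worked example of what these figures imply, the following sketch estimates the node count for a given stream rate. It takes the linear-scaling claim at face value and ignores headroom for load spikes. The function name is hypothetical.

```python
import math

def nodes_needed(stream_rate, per_node_rate):
    """Smallest node count whose combined throughput covers the stream,
    assuming throughput scales linearly with the number of nodes."""
    return math.ceil(stream_rate / per_node_rate)

# Average Twitter rate from the slide against both ends of the
# single-node throughput range.
print(nodes_needed(5700, 3000))  # 2
print(nodes_needed(5700, 4000))  # 2
```

Under these assumptions, two nodes would cover the average rate at either end of the measured range.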
  6. Performance (State Manager)

    • The rate at which it needs to process requests is controlled by the Spark Streaming window size.
    • We have an established upper bound on the amount of work needed to fulfil a single request.
    • Can be sharded if needed, though it would be easier to sacrifice latency instead.