Data Stream Clustering Implementation

Data Stream Clustering Vikhyat Korrapati

Overview
• BDMO algorithm.
• Implemented using Spark Streaming.
• Dataset is a live stream of tweet locations.

Architecture
(Diagram components: Twitter Stream, Spark, State Manager, Ad-Hoc Queries)

Why use a separate State Manager component?
• Spark Streaming does not support ad-hoc queries.
• Maintaining state in the way the BDMO algorithm requires is also very inconvenient in Spark Streaming. (It can be done with updateStateByKey, but the resulting state cannot be queried externally.)

Spark Component
• Reads tweet information from Twitter’s streaming API.
• Filters out tweets without geolocation information.
• Clusters the locations in each 30-second window of tweets using k-means.
• Passes the results of the clustering to the State Manager component.
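The per-window steps above can be sketched in plain Python. This is an illustration only, not the deck's Spark code: the `coordinates` field name, the function names, and the choice to treat (lat, lon) pairs as flat Euclidean points are all assumptions, and a simple Lloyd's k-means stands in for whatever k-means implementation the Spark job actually used.

```python
import math
import random

def has_location(tweet):
    # 'coordinates' is a hypothetical field name for the tweet's (lat, lon) pair.
    return tweet.get("coordinates") is not None

def kmeans(points, k, iterations=10, seed=0):
    """Plain Lloyd's k-means over 2-D points. Requires len(points) >= k.

    Returns (centers, counts), where counts come from the last assignment step.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g)) if g else c
            for g, c in zip(groups, centers)
        ]
    counts = [len(g) for g in groups]
    return centers, counts

def cluster_window(tweets, k):
    """Drop tweets without a location, then cluster the remaining points."""
    points = [tuple(t["coordinates"]) for t in tweets if has_location(t)]
    return kmeans(points, k)
```

In the architecture described here, the (centers, counts) pair for each window is what would be handed to the State Manager.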

State Manager Component
• Implements the core BDMO algorithm.
• Creates a bucket for every set of clusters received.
• Allows ad-hoc queries.
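A minimal sketch of the bucket bookkeeping, under assumptions the slides do not confirm: it follows the common textbook description of BDMO, in which each new bucket has size 1 (one window), at most two buckets may share a size, and a third triggers a merge of the two oldest into one bucket of double the size. The `StateManager` class, its method names, and the pluggable `merge` function are hypothetical.

```python
class StateManager:
    def __init__(self, merge):
        # Buckets are kept newest first; each is (size_in_windows, clusters).
        self.buckets = []
        self.merge = merge  # function combining two cluster sets into one

    def add(self, clusters):
        """Create a size-1 bucket for a new window, then restore the size rule."""
        self.buckets.insert(0, (1, clusters))
        i = 0
        # Sizes are non-decreasing from newest to oldest, so three buckets share
        # a size exactly when positions i and i + 2 hold the same size.
        while i + 2 < len(self.buckets) and self.buckets[i][0] == self.buckets[i + 2][0]:
            size_a, a = self.buckets[i + 1]
            size_b, b = self.buckets[i + 2]
            self.buckets[i + 1:i + 3] = [(size_a + size_b, self.merge(a, b))]
            i += 1  # the merged bucket may now collide with the next size up

    def sizes(self):
        return [size for size, _ in self.buckets]

    def query(self, last_n):
        """Ad-hoc query: cluster sets of the newest buckets covering >= last_n windows."""
        covered, result = 0, []
        for size, clusters in self.buckets:
            if covered >= last_n:
                break
            result.append(clusters)
            covered += size
        return result
```

Under these assumptions each `add` triggers at most one merge per bucket size, so the work per request grows only logarithmically with the number of windows seen.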

Bucket Merge Heuristic
• For each point in the smaller bucket, find the closest point in the larger bucket and update it to the size-weighted average position of the two points.
• Meaningful when cluster parameters change gradually rather than abruptly, which is generally the case in a stream.
• Can be enhanced by weighting with the number of points associated with each cluster rather than the bucket size.
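The heuristic above can be sketched as follows. The `(size, points)` bucket representation and the function name are assumptions made for illustration; the weighting uses the bucket sizes, as the first bullet describes.

```python
import math

def merge_buckets(a, b):
    """Merge two buckets, each given as (size, points) with points as (x, y) tuples."""
    small, large = (a, b) if a[0] <= b[0] else (b, a)
    small_size, small_points = small
    large_size, large_points = large
    total = small_size + large_size
    merged = list(large_points)
    for p in small_points:
        # Find the closest point in the larger bucket ...
        j = min(range(len(merged)), key=lambda i: math.dist(p, merged[i]))
        q = merged[j]
        # ... and move it to the size-weighted average of the two positions.
        merged[j] = tuple(
            (small_size * pc + large_size * qc) / total for pc, qc in zip(p, q)
        )
    return total, merged
```

For example, merging a size-1 bucket holding (0, 0) into a size-3 bucket holding (4, 0) and (100, 100) moves (4, 0) to (3, 0) and leaves the distant point untouched. Note that in this sketch two points from the smaller bucket may update the same point in the larger one.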

Performance (Spark)
• The Spark component is completely horizontally scalable: throughput scales linearly with the number of nodes.
• This is important because the Spark component is what consumes data from the stream, so it needs to cope with sudden increases in load.
• Processes 3000-4000 tweets/second on a single node. (For context, on a regular day Twitter gets 5700 tweets/second on average.)
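As a rough capacity check using only the figures on this slide, and assuming throughput really does scale linearly with node count, the number of nodes needed for a given stream rate is the rate divided by per-node throughput, rounded up. The function name is hypothetical.

```python
import math

def nodes_needed(stream_rate, per_node_rate):
    """Nodes required to keep up with stream_rate, assuming linear scaling."""
    return math.ceil(stream_rate / per_node_rate)

# Average Twitter load (5700 tweets/second) at both ends of the measured range.
print(nodes_needed(5700, 3000))  # 2
print(nodes_needed(5700, 4000))  # 2
```

So by this estimate two nodes would cover the average rate, with extra nodes added for spikes.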

Performance (State Manager)
• The rate at which it needs to process requests is controlled by the Spark Streaming window size.
• There is an established upper bound on the amount of work needed to fulfil a single request.
• Can be sharded if needed, though it would be easier to sacrifice latency instead.

Questions?
