Hierarchical Clustering in Improving Microblog Stream Summarization
Andrei Olariu
University of Bucharest, Faculty of Mathematics and Computer Science
CICLING 2013

Outline
1 Context
  - Microblogging
  - Previous Work
  - Motivation
2 Our Summarizing System
  - Approach Outline
  - Message Clustering based on Events
  - Hierarchical Event Analysis
  - Summarization
3 Results
  - Corpus
  - Metrics
  - Summarization Results
4 Conclusions and Future Work

What is Microblogging
microblogging: a form of blogging characterized by very short posts
microblogging platforms: Twitter, Tumblr, Facebook
Twitter's main highlights:
- over 500 million posts per day
- data is publicly accessible (unlike Facebook)
- posts are mainly text (unlike Tumblr, which is mostly images)
- posts are limited to 140 characters
- specific vocabulary (internet slang), abbreviations, misspelled words

What is Microblogging
Data on Twitter is organized as a stream (a sequence of posts).

Microblog Event Detection
- detect the main topics in a stream
- model an event based on a stream of related posts
- cluster similar messages
- detect words that experience an increased frequency

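The last bullet, detecting words whose frequency increases, can be illustrated with a small sketch. It is not taken from the talk: the function names, the whitespace tokenizer, the add-one smoothing, and the `min_count` and `min_ratio` thresholds are all assumptions chosen for illustration.

```python
from collections import Counter

def tokenize(post):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return post.lower().split()

def bursty_words(current_posts, background_posts, min_count=2, min_ratio=3.0):
    """Return words whose share of posts rose sharply versus the background."""
    # Document frequency: count each word once per post.
    cur = Counter(w for p in current_posts for w in set(tokenize(p)))
    bg = Counter(w for p in background_posts for w in set(tokenize(p)))
    n_cur = max(len(current_posts), 1)
    n_bg = max(len(background_posts), 1)
    result = {}
    for word, count in cur.items():
        if count < min_count:
            continue
        cur_rate = count / n_cur
        # Add-one smoothing so unseen background words do not divide by zero.
        bg_rate = (bg[word] + 1) / (n_bg + 1)
        ratio = cur_rate / bg_rate
        if ratio >= min_ratio:
            result[word] = ratio
    return result
```

Calling `bursty_words(todays_posts, last_weeks_posts)` returns a dictionary mapping each bursty word to its frequency ratio; words common in both periods, such as greetings, are filtered out by the ratio test.
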
Multi-sentence Compression
multi-sentence compression: generate a short sentence that summarizes a group of related sentences
Example:
- The wife of a former U.S. president Bill Clinton Hillary Clinton visited China last Monday.
- Hillary Clinton wanted to visit China last month but postponed her plans till Monday last week.
- Hillary Clinton paid a visit to the People Republic of China on Monday.
- Last week the Secretary of State Ms. Clinton visited Chinese officials.

Multi-sentence Compression
The Multi-sentence Compression algorithm finds a path minimizing a cost function in a word graph.
[Figure: word graph]

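To make the word-graph idea concrete, here is a heavily simplified sketch. It merges identical lowercase words into one node, assumes a cost of 1 / (transition count) per edge, and runs Dijkstra from a start marker to an end marker. The cost function, the node-merging rule, and all names are assumptions for illustration; the published method uses a more elaborate weighting and additional constraints.

```python
import heapq
from collections import defaultdict

START, END = "<s>", "</s>"

def build_word_graph(sentences):
    """Count how often each word is directly followed by another."""
    edge_counts = defaultdict(int)
    for sentence in sentences:
        tokens = [START] + sentence.lower().split() + [END]
        for a, b in zip(tokens, tokens[1:]):
            edge_counts[(a, b)] += 1
    return edge_counts

def cheapest_path(edge_counts):
    """Dijkstra from START to END; frequent transitions are cheap."""
    graph = defaultdict(list)
    for (a, b), count in edge_counts.items():
        graph[a].append((b, 1.0 / count))  # assumed cost: 1 / count
    queue = [(0.0, [START])]
    visited = set()
    while queue:
        cost, path = heapq.heappop(queue)
        node = path[-1]
        if node == END:
            return path[1:-1], cost  # drop the START and END markers
        if node in visited:
            continue
        visited.add(node)
        for nxt, weight in graph[node]:
            if nxt not in visited:
                heapq.heappush(queue, (cost + weight, path + [nxt]))
    return [], float("inf")
```

With this cost, the path follows the transitions that most sentences agree on, so the output tends to be the shared core of the input sentences.
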
Summarizing Microblogging Streams
approached in two ways:
- choose a post that best describes the input stream
- generate a short sentence based on the stream (Phrase Reinforcement algorithm)
both approaches have been developed for streams of messages related to a given event

Phrase Reinforcement
Phrase Reinforcement: an algorithm that generates a summary starting from a given keyphrase and a stream of posts related to that keyphrase
Example:
- A tragedy: Ted Kennedy died today of cancer
- Ted Kennedy died today
- Ted Kennedy was a leader
- Ted Kennedy died at age 77

Phrase Reinforcement
The graph built starting from the keyphrase Ted Kennedy:
[Figure: graph for the keyphrase Ted Kennedy]

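A rough sketch of the idea, using the Ted Kennedy posts as input: starting from the keyphrase, keep appending the word that most posts agree on, first to the right and then to the left. This greedy variant, the `min_support` cutoff, and the function names are assumptions for illustration; the original algorithm builds a weighted graph and scores whole paths.

```python
from collections import Counter

def find_sub(tokens, phrase):
    """Index of the first occurrence of phrase in tokens, or -1."""
    n = len(phrase)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            return i
    return -1

def extend(sequences, min_support):
    """Greedily follow the most frequent next word while enough posts agree."""
    path = []
    while True:
        counts = Counter(seq[0] for seq in sequences if seq)
        if not counts:
            break
        word, count = counts.most_common(1)[0]
        if count < min_support:
            break
        path.append(word)
        sequences = [seq[1:] for seq in sequences if seq and seq[0] == word]
    return path

def reinforce(posts, keyphrase, min_support=2):
    phrase = keyphrase.lower().split()
    before, after = [], []
    for post in posts:
        tokens = post.lower().split()
        i = find_sub(tokens, phrase)
        if i >= 0:
            before.append(tokens[:i][::-1])  # reversed: nearest word first
            after.append(tokens[i + len(phrase):])
    left = extend(before, min_support)[::-1]
    right = extend(after, min_support)
    return " ".join(left + phrase + right)

posts = [
    "A tragedy: Ted Kennedy died today of cancer",
    "Ted Kennedy died today",
    "Ted Kennedy was a leader",
    "Ted Kennedy died at age 77",
]
print(reinforce(posts, "Ted Kennedy"))  # prints "ted kennedy died today"
```

Three of the four posts continue with "died" and two of those with "today"; no further word is shared by at least two posts, so the extension stops there.
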
Motivation
All previous summarizing techniques require as input a stream of related posts:
- posts are filtered based on a given set of keywords
- keywords are manually selected to match a specific event/topic
Yet, most streams are not about a specific event/topic and suffer from a large amount of noise.
How can we approach summarizing any kind of stream?

Motivation
Contributions:
- developed a system for summarizing unfiltered streams
- adapted the Phrase Reinforcement algorithm in order to integrate it into our system

Approach Outline
[Figure: overview of the approach]

Message Clustering based on Events
event detection
- detect words that show an unusual increase in frequency
- cluster words based on how often they appear together in posts
- each cluster of words represents an event
message clustering
- for each message, determine the word cluster most similar to it
- if the similarity is above a threshold, assign it to the event, otherwise consider it noise

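The two stages above can be sketched as follows, assuming the bursty words have already been detected. The slides do not specify the co-occurrence measure or the message-to-cluster similarity, so this sketch assumes Jaccard overlap of the posts containing each word (grouped with union-find) and, for messages, the fraction of a cluster's words present in the post. The thresholds and names are hypothetical.

```python
from itertools import combinations

def cooccurrence_clusters(posts, words, min_jaccard=0.5):
    """Group bursty words that tend to appear in the same posts."""
    token_sets = [set(p.lower().split()) for p in posts]
    containing = {w: {i for i, t in enumerate(token_sets) if w in t} for w in words}
    parent = {w: w for w in words}

    def find(w):
        # Union-find lookup with path compression.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for a, b in combinations(sorted(words), 2):
        union = containing[a] | containing[b]
        shared = containing[a] & containing[b]
        if union and len(shared) / len(union) >= min_jaccard:
            parent[find(a)] = find(b)
    clusters = {}
    for w in words:
        clusters.setdefault(find(w), set()).add(w)
    return list(clusters.values())

def assign_posts(posts, clusters, threshold=0.3):
    """Label each post with its most similar word cluster, or None for noise."""
    labels = []
    for post in posts:
        tokens = set(post.lower().split())
        best, best_sim = None, 0.0
        for idx, cluster in enumerate(clusters):
            # Assumed similarity: share of the cluster's words found in the post.
            sim = len(tokens & cluster) / len(cluster)
            if sim > best_sim:
                best, best_sim = idx, sim
        labels.append(best if best_sim >= threshold else None)
    return labels
```

Posts labeled `None` are the noise that the later summarization steps never see, which is the point of clustering before summarizing.
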
Hierarchical Event Analysis
- group very similar messages together in information blocks
- apply agglomerative clustering on the information blocks
- we use cosine similarity based on word n-grams

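A minimal sketch of agglomerative clustering with cosine similarity over word n-grams. It assumes each information block is represented by a single text, uses unigrams and bigrams, and merges clusters by summing their n-gram vectors; the linkage rule, `max_n`, and all names are assumptions, since the slides do not give these details.

```python
import math
from collections import Counter

def ngram_vector(text, max_n=2):
    """Bag of word n-grams (unigrams up to max_n-grams)."""
    tokens = text.lower().split()
    vec = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            vec[" ".join(tokens[i:i + n])] += 1
    return vec

def cosine(a, b):
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def agglomerate(blocks, target=1):
    """Merge the two most similar clusters until `target` clusters remain.

    Returns (clusters, merges): each cluster is (member_indices, vector),
    each merge is (left_members, right_members, similarity).
    """
    clusters = [([i], ngram_vector(text)) for i, text in enumerate(blocks)]
    merges = []
    while len(clusters) > max(target, 1):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i][1], clusters[j][1])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        sim, i, j = best
        merges.append((clusters[i][0], clusters[j][0], sim))
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges
```

The `merges` log records the order and similarity of each merge, which is enough to rebuild the cluster tree. The pairwise search is quadratic per step, which is acceptable for a sketch but not for large streams.
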
Summarization Approaches
We test two different approaches:
- Multi-sentence Compression (MSC)
- Frequent Phrase Summarization (FPS)
  - an adaptation of Phrase Reinforcement that does not require a starting keyphrase
  - the algorithm retrieves a popular sequence of words from the input stream
  - one of our contributions

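The slides do not describe how FPS selects its phrase, so the following is only a guess at the general idea of retrieving a popular word sequence without a starting keyphrase. It assumes a score of (number of posts containing the phrase) × (phrase length); the scoring rule, `max_len`, `min_count`, and the function name are all hypothetical.

```python
from collections import Counter

def frequent_phrase(posts, max_len=8, min_count=2):
    """Return the word sequence maximizing (post count * length), or ''."""
    counts = Counter()
    for post in posts:
        tokens = post.lower().split()
        seen = set()
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                seen.add(tuple(tokens[i:i + n]))
        counts.update(seen)  # count each phrase once per post
    candidates = [
        (count * len(phrase), len(phrase), phrase)
        for phrase, count in counts.items()
        if count >= min_count
    ]
    if not candidates:
        return ""
    return " ".join(max(candidates)[2])
```

Multiplying by length favors long phrases that are still shared by several posts, which is consistent with FPS picking long, heavily retweeted phrases on unclustered input.
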
Corpus
- we used the Twitter API to retrieve recent tweets
- we experimented on 1.6 million tweets collected between the 4th and the 8th of July 2012
- we used another 1.7 million tweets (collected during the previous week) as background data

Events
The Event Detection module discovered an average of 20 events per day.
Examples of events:
- real
  - sporting events (wrestling, basketball, football)
  - Independence Day
  - celebrity news
  - other: finding the Higgs boson, the European debt crisis
- virtual
  - memes: thingsidislike
  - popular hashtags
  - popular retweets

Metrics
the summaries were rated regarding:
- completeness: how much information the summary expresses relative to the detected event
- grammaticality: the degree of grammatical and syntactical correctness
- redundancy: whether a multi-sentence summary repeats the same information

Metrics
hierarchical summarization procedure:
- each cluster tree was cut to the level where it has 4 clusters
- the 4 clusters were summarized, generating a multi-sentence summary
- trees with fewer than 4 clusters were removed from the analysis
- we were left with 50 sets of summaries
- a group of 4 volunteers assigned ratings on a scale of 1 to 5

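Cutting a cluster tree to a fixed number of clusters can be sketched as below. The `Node` class, its fields, and the rule of always splitting the loosest (lowest-similarity) internal node are assumptions for illustration; the sketch expects a binary tree in which every internal node has both children.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(eq=False)  # compare nodes by identity, not by field values
class Node:
    posts: List[str]                 # all posts under this node
    similarity: float = 1.0          # similarity at which the children merged
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_leaf(self):
        return self.left is None and self.right is None

def cut_tree(root, k=4):
    """Split the loosest internal node until k clusters exist.

    Returns None when the tree cannot yield k clusters, mirroring the rule
    that such trees are removed from the analysis.
    """
    clusters = [root]
    while len(clusters) < k:
        splittable = [c for c in clusters if not c.is_leaf]
        if not splittable:
            return None
        loosest = min(splittable, key=lambda c: c.similarity)
        clusters.remove(loosest)
        clusters.extend([loosest.left, loosest.right])
    return clusters
```

Each of the returned nodes would then be summarized separately, and the resulting sentences concatenated into the multi-sentence summary.
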
Metrics
Examples of ratings:

Summary: there is nothing wrong with america that cannot be cured by what is right with america. ~ bill clinton happy4th
Ratings: Completeness: 3, Grammaticality: 5

Summary: happy birthday 'merica they call me happy4th happy 4th of july merica there is nothing wrong with america that cannot be cured by what is right with america. ~ bill clinton happy4th
Ratings: Completeness: 4, Grammaticality: 4, Non-redundancy: 3

Summarization without Clustering
- MSC generates a meaningless summary
- 4th of July summary, MSC: rt TWID you to the TWID URL
- summaries generated by MSC receive a grammaticality rating of 1 and a completeness rating of 1

Summarization without Clustering
- FPS picks a long and frequent phrase (usually the one that was retweeted the most)
- 4th of July summary, FPS: rt TWID dear mom&dad thank you for everything you've done to me i can never pay back all of them but i'm trying to be the best for both of you
- summaries generated by FPS receive a grammaticality rating of 5 and a completeness rating of 1

Conclusions
We showed that summarizing streams can be significantly improved by clustering messages together and removing noise.
The steps of the summarizing algorithm are:
- detecting the events people are talking about
- clustering posts related to those events
- applying classical summarizing algorithms to each cluster of posts

Future Work
- fast online processing of streams
- develop a visual interface for rendering hierarchical summaries and investigate how large streams can be analyzed by users

Thank You
Thank you for your time. Do you have any questions?
Contact: [email protected]
http://andrei.olariu.org