Spotting first stories in Twitter using Storm

Michael Vogiatzis, Software Engineer @SocialArtisan. Talk at Data Science London @ds_ldn

Data Science London

October 29, 2013

Transcript

  1. How to spot first stories on Twitter using Storm
     Michael Vogiatzis - @mvogiatzis, Software Engineer
  2. The Task
     Find the first document in a stream of documents that discusses a specific event.
  3. Twitter
     • Spam
       ◦ It’s Cooooooooooooolddd !! Brrrrrr…
     • Neutral
       ◦ #nowplaying ♫ Live At The BBC – Dire Straits
     • Events
       ◦ The 6.4-magnitude quake struck just after 9.20pm (CST) on Sunday in the Banda Sea northeast of East Timor.
  4. TF-IDF
     • Split text into words
     • Term Frequency * Inverse Document Frequency
     • More frequent words get less weight
     • Remove out-of-vocabulary words, e.g. “lol”, “the”
     • Remove URLs and mentions (@)
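
The TF-IDF slide can be illustrated with a minimal Python sketch. It is not the talk's implementation: the `STOP_WORDS` set, the function names, the length-normalised term frequency and the `log(N / df)` form of IDF are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical word list; the slide only names "lol" and "the" as examples.
STOP_WORDS = {"lol", "the"}


def tokenize(tweet):
    """Lowercase, drop URLs and @mentions, split into words, drop listed words."""
    text = re.sub(r"https?://\S+|@\w+", " ", tweet.lower())
    return [w for w in re.findall(r"[a-z0-9']+", text) if w not in STOP_WORDS]


def tf_idf(tweets):
    """Return one sparse {term: weight} vector per tweet.

    Assumed weighting: (count / tweet length) * log(N / document frequency),
    so terms that appear in many tweets receive less weight.
    """
    docs = [tokenize(t) for t in tweets]
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors
```

Calling `tf_idf` on a small batch of tweets returns one dictionary per tweet; a term that occurs in every tweet of the batch gets weight 0 under this weighting.
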
  5. Algorithm
     • TF-IDF on input Tweet
     • Convert it to Vector
     • Find N nearest neighbours
       ◦ Locality Sensitive Hashing
  6. Locality Sensitive Hashing
     • Data Clustering: Near neighbour search
     • Buckets: Hash Tables for similar documents
     • Random projection creates a hash
     • Identical hash -> nearest neighbour candidate
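
As a rough illustration of the random-projection idea on this slide, the sketch below hashes a sparse `{term: weight}` vector to a bit string, one bit per random hyperplane, and files documents into buckets keyed by that string. The names `plane_component`, `lsh_hash` and `LSHTable`, the 8-bit default and the single-table design are assumptions; the talk does not say how many hyperplanes or hash tables were used.

```python
import random
from collections import defaultdict


def plane_component(seed, plane, term):
    """Deterministic Gaussian component of hyperplane `plane` along `term`.

    Seeding a generator per (seed, plane, term) avoids storing a dense matrix
    over an open-ended vocabulary. This is an illustrative shortcut.
    """
    return random.Random(f"{seed}:{plane}:{term}").gauss(0.0, 1.0)


def lsh_hash(vector, n_bits=8, seed=0):
    """One bit per hyperplane: '1' if the vector is on its non-negative side."""
    bits = []
    for plane in range(n_bits):
        dot = sum(w * plane_component(seed, plane, t) for t, w in vector.items())
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)


class LSHTable:
    """A single hash table whose buckets hold ids of similarly hashed vectors."""

    def __init__(self, n_bits=8, seed=0):
        self.n_bits = n_bits
        self.seed = seed
        self.buckets = defaultdict(list)

    def add(self, doc_id, vector):
        self.buckets[lsh_hash(vector, self.n_bits, self.seed)].append(doc_id)

    def candidates(self, vector):
        """Ids stored under the same hash: nearest-neighbour candidates."""
        key = lsh_hash(vector, self.n_bits, self.seed)
        return list(self.buckets.get(key, []))
```

Vectors that point in the same direction always share a bucket, while vectors at a wide angle are likely, though not guaranteed, to land in different ones. Using several tables with different seeds is a common way to reduce missed neighbours.
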
  7. Algorithm
     • TF-IDF on input Tweet
     • Convert it to Vector
     • Find N nearest neighbours
       ◦ Locality Sensitive Hashing
     • Compare distances and find the closest
     • If distance < threshold -> not a first story
  8. Extra Step
     • If the distance to the closest tweet in the buckets is not short enough
     • Compare with a fixed number of recent tweets
     • Check again
  9. Algorithm
     • TF-IDF on input Tweet
     • Convert it to Vector
     • Find N nearest neighbours
       ◦ Locality Sensitive Hashing
     • Compare distances and find the closest
     • If distance < threshold -> not a first story
     • Else compare with X most recent tweets (optimization)
     • If new_distance > threshold -> first story!
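
The decision logic of the complete algorithm can be sketched as follows. This assumes cosine distance over sparse `{term: weight}` vectors and a threshold of 0.5; the slides name neither the distance measure nor the threshold value, and the function names are hypothetical.

```python
import math
from collections import deque


def cosine_distance(a, b):
    """1 - cosine similarity for sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def nearest_distance(vector, others):
    """Distance to the closest vector in `others` (1.0 if there are none)."""
    return min((cosine_distance(vector, o) for o in others), default=1.0)


def is_first_story(vector, bucket_candidates, recent, threshold=0.5):
    """Apply the two checks from the slide.

    bucket_candidates: vectors returned by the LSH bucket lookup.
    recent: the X most recent tweet vectors, e.g. a deque with maxlen=X.
    """
    # Close to something already in the buckets: not a first story.
    if nearest_distance(vector, bucket_candidates) < threshold:
        return False
    # Otherwise check again against the most recent tweets.
    return nearest_distance(vector, recent) > threshold
```

A caller would typically append each processed vector to `recent` after the check, so that a `deque(maxlen=X)` keeps only the X newest tweets.
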
  10. Storm
      • Distributed real-time computation system
      • Fault tolerant
      • Fast
      • Scalable
      • Guaranteed message processing
      • Open source
      • Multilang capabilities
  11. Elements
      • Streams
        ◦ Set of tuples
        ◦ Unbounded sequence of data
      • Spout
        ◦ Source of streams
      • Bolts
        ◦ Application logic
        ◦ Functions
        ◦ Streaming aggregations, joins, DB ops
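
To make the spout and bolt roles concrete, here is a single-process analogy built from Python generators. It does not use Storm's API, and the names `tweet_spout`, `split_bolt` and `count_bolt` are invented for illustration. In Storm the stream is unbounded and the components run distributed across worker processes; here the input is a finite list.

```python
from collections import Counter


def tweet_spout(tweets):
    """Spout: the source of the stream, emitting (tweet_id, text) tuples."""
    for tweet_id, text in enumerate(tweets):
        yield (tweet_id, text)


def split_bolt(stream):
    """Bolt applying a function: emits one (tweet_id, word) tuple per word."""
    for tweet_id, text in stream:
        for word in text.lower().split():
            yield (tweet_id, word)


def count_bolt(stream):
    """Bolt doing a streaming aggregation: emits a running count per word."""
    counts = Counter()
    for _tweet_id, word in stream:
        counts[word] += 1
        yield (word, counts[word])
```

Chaining the generators, as in `count_bolt(split_bolt(tweet_spout(tweets)))`, plays the part of wiring a topology: each tuple flows through the pipeline as soon as the spout emits it.
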
  12. Results
      • Input Tweet: @Real_Liam_Payne i wanna be your female pal
        ◦ Stored Tweet: i. wanna be your best friend so follow me
        ◦ Similarity score: 0.385
      • Input Tweet: RT @damnitstrue: Life is for living, not for stressing.
        ◦ Stored Tweet: RT Life is for living, not for stressing.
        ◦ Similarity score: 0.99
      • Input Tweet: The 6.4-magnitude quake struck just after 9.20pm (CST) on Sunday in the Banda Sea northeast of East Timor. http://t.co/UhfwCS2xPp
        ◦ Stored Tweet: Yay Sunday!
        ◦ Similarity score: 0.129
  13. Evaluation
      • Evaluation on speed-up metric
        ◦ 1381% vs single threaded
        ◦ 372% vs multi threaded (4 threads)
      • Having humans label tweets is hard!
      • Implementation tested on newswire and broadcast news
      • False alarms
  14. Future work
      • Reduce false alarms by using threads for topics
      • Image similarity detection
      • Audio similarity?
        ◦ Hello Shazam!
  15. Michael Vogiatzis
      • Twitter: @mvogiatzis
      • Code on GitHub
      • http://micvog.com
        ◦ Next post: “7 Lessons Learned at a London startup”