How to find spam on Twitter

This is a Master's thesis presentation on how to detect spam on Twitter by evaluating properties of the Twitter social graph.

The full thesis, with more details, is available at: http://mourjo.me/assets/master_thesis_MourjoSen.pdf

Mourjo Sen

July 08, 2015

Transcript

  1. How to find spam on Twitter?
    Mourjo Sen
    Under the guidance of Arnaud Legout, Maksym Gabielkov
  2. Outline
    ◎ Background, problem statement, workflow
    ◎ Definition of our metric of trust
    ◎ Spam detection methodology
    ◎ Testing our method
    ◎ Conclusion: Next Steps
  3. #JeSuisCharlie
    Mentioned 6,500 times per minute, 3.4 million times in a day
  4. The dark side of social media
    ◎ A hacker starts an online rumour about a plane crash on Twitter
    ◎ The “news” goes viral and the airline’s stock plummets
    ◎ The hacker makes a fortune on stock short sales
  6. Real-world influence of Twitter
    ◎ Political campaigns
    ◎ Marketing campaigns + promotions
    ◎ Stock markets
    ◎ Journalism: TV, Books, Newspapers…
    ◎ Customer satisfaction
    ◎ Awareness programs
    A strong incentive to manipulate tweets
  7. The problem
    ◎ No one knows if tweets can be trusted
    ◎ Not even Twitter itself
      ◦ Researchers from Twitter
      ◦ Discussion with Vigiglobe
    ◎ Goal: Robust, on-the-fly spam detection
  8. The workflow
    Master 1 PFE ✓, Master 2 PFE ✓, Master 2 Internship
    Defining a metric of trust, Analyzing the trust metric, Classifying tweets by using the metric
  10. Do we need a trust metric?
    ◎ Twitter has manually verified ~113K users
    ◎ But 99.99% of users are not verified
  11. The trust score

  13. Retweet chain
    (Figure labels: Creator of the tweet, Retweeters)
  14. The power of retweets
    ◎ Non-repudiation: Public statement of one’s approval of the content
    ◎ Not duplication: Gives credit to the original content publisher
    ◎ Long retweet chain = High visibility = Affects a lot of people
    ◎ Public opinion: Retweets influence trends
  15. Spam detection method: Quality of retweets
    Trusted users in the retweet chain indicate the authenticity of the tweet
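
The idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the transcript does not define the trust score, so the set of trusted users, the `min_trusted` threshold, and both function names are hypothetical and are not taken from the thesis.

```python
from typing import Iterable, Set


def trusted_retweeters(retweet_chain: Iterable[str], trusted_users: Set[str]) -> int:
    """Count the distinct trusted users among the retweeters of one tweet."""
    return len(set(retweet_chain) & trusted_users)


def looks_authentic(retweet_chain: Iterable[str], trusted_users: Set[str],
                    min_trusted: int = 1) -> bool:
    """Hypothetical rule: enough trusted retweeters suggests an authentic tweet."""
    return trusted_retweeters(retweet_chain, trusted_users) >= min_trusted


# Hypothetical data: who is "trusted" would come from the trust metric.
trusted = {"alice", "bob"}
chain_a = ["carol", "alice", "dave", "alice"]  # contains one trusted user
chain_b = ["eve", "mallory"]                   # contains none
```

With these sample chains, `looks_authentic(chain_a, trusted)` is true and `looks_authentic(chain_b, trusted)` is false. The threshold is the part most likely to need tuning against real data.
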
  16. How is it robust and on the fly?
    ◎ Easy to send many tweets
    ◎ Difficult to change the follow-relationship
    ◎ If we have the tweet, we can obtain the list of retweets, i.e. the retweet chain
  18. Testing our method of spam detection
    ◎ No test set
    ◎ Manual verification too slow
    ◎ Need other methods
      1. Suspicious keywords
      2. Periodic tweets
      3. Content copying
  19. Method 1: Keyword analysis
    ◎ Collection of 27M topic-related tweets
    ◎ Unrelated/derogatory keywords = spam?
  21. Method 1: Keyword analysis
    ◎ Collection of 27M topic-related tweets
    ◎ Unrelated/derogatory keywords = spam?
    con sin hot bra fuck lol giveaway mom kill sale girl hack upgrade bf prom hole exe sex fuckin leak gratis fucking suck followers torrent cheap tops fucked password girls ninja retweets kick male killing dude bitch recent kills gay baby nights hackers cute repair discount pirates rumor teen sexy porn followme fake death finger giveaways wife playroom dick died hiring subscribe multiplayer rear spy midnight dumb upgrades pissed peek freak killer webcam shirt sponsor models cheapest wallpaper installation
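
A keyword check of this kind might look like the sketch below. It uses a small subset of the keywords listed on the slide; the tokenisation rule and the function name `suspicious_keywords` are assumptions made for illustration, and the slide itself poses the keyword-to-spam link as an open question rather than a rule.

```python
import re

# A small subset of the keywords shown on the slide.
SUSPICIOUS = frozenset({"giveaway", "followers", "torrent",
                        "cheap", "discount", "followme"})

# Assumed tokenisation: lowercase runs of letters and digits.
_WORD = re.compile(r"[a-z0-9]+")


def suspicious_keywords(text, keywords=SUSPICIOUS):
    """Return the suspicious keywords that occur as whole words in a tweet."""
    tokens = set(_WORD.findall(text.lower()))
    return tokens & keywords
```

For example, `suspicious_keywords("Cheap followers, join the GIVEAWAY now!")` returns the three matching words, while a tweet with no listed word returns an empty set. Matching whole words avoids flagging a word such as "discounted" for containing "discount".
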
  23. Method 2: Finding periodic tweets
    ◎ Twitter bots often tweet periodically
    ◎ Difficult to detect periods in a large collection of tweets
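
One simple way to look for periodic posting, shown here only as a sketch and not as the method used in the thesis, is to examine the gaps between consecutive tweets of a single account and test whether they are nearly constant. The coefficient-of-variation test, the `max_cv` and `min_tweets` defaults, and the function name are all assumptions.

```python
from statistics import mean, pstdev


def is_periodic(timestamps, max_cv=0.1, min_tweets=5):
    """Heuristic: True if the gaps between tweets are nearly constant.

    timestamps: posting times of one account, in seconds.
    max_cv: largest allowed coefficient of variation (std / mean) of the gaps.
    """
    if len(timestamps) < min_tweets:
        return False  # too few tweets to judge
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    average = mean(gaps)
    if average == 0:
        return False  # all tweets share one timestamp; no period to measure
    return pstdev(gaps) / average <= max_cv


# Hypothetical accounts: one posts about every 10 minutes, one irregularly.
bot_like = [0, 600, 1200, 1801, 2400, 3000]
human_like = [0, 45, 900, 1000, 7200, 7300]
```

Here `is_periodic(bot_like)` is true and `is_periodic(human_like)` is false. A per-account check like this sidesteps part of the difficulty named on the slide, but it would miss bots that post on several interleaved schedules.
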
  24. Method 3: Replication of tweet content
    ◎ Some tweets have the same content
    ◎ Same content → Spam property
    ◎ Retweets → Non-spam property
  25. Method 3: Replication of tweet content
    (Figure labels: All tweets in the dataset; Set 1: Tweets with more retweets than copies; Set 2: Tweets which have not been retweeted; Non-spam set; Spam set)
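
The two sets named on this slide could be built roughly as follows. Several points are assumptions rather than facts from the transcript: that a "copy" is another tweet with the same text after lowercasing and whitespace normalisation, that Set 2 is limited to tweets with at least one copy, and that Set 1 corresponds to the non-spam set and Set 2 to the spam set. The `Tweet` type and function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Tweet:
    tweet_id: int
    text: str
    retweets: int  # number of retweets this tweet received


def normalise(text):
    """Assumed notion of 'same content': case and spacing are ignored."""
    return " ".join(text.lower().split())


def split_sets(tweets):
    """Return (set1_ids, set2_ids) following the slide's two definitions."""
    counts = Counter(normalise(t.text) for t in tweets)
    set1, set2 = [], []
    for t in tweets:
        copies = counts[normalise(t.text)] - 1  # other tweets with this text
        if t.retweets > copies:
            set1.append(t.tweet_id)   # more retweets than copies
        elif t.retweets == 0 and copies >= 1:
            set2.append(t.tweet_id)   # copied but never retweeted
    return set1, set2


sample = [
    Tweet(1, "Breaking news from Paris", 120),
    Tweet(2, "WIN a free phone", 0),
    Tweet(3, "win a  free phone", 0),
    Tweet(4, "Win a free phone", 0),
    Tweet(5, "hello world", 0),
]
```

On this sample, tweet 1 lands in Set 1, tweets 2 to 4 land in Set 2, and tweet 5 falls in neither set because it has no copies and no retweets.
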
  26. Method 3: Replication of tweet content
    (Plot: CDF of the number of trusted users in the retweet chain, log scale)
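
For readers who want to reproduce a plot like this one, an empirical CDF can be computed as sketched below. The sample counts are invented for illustration and are not results from the thesis; the function name is likewise an assumption.

```python
from bisect import bisect_right


def empirical_cdf(values):
    """Return a function x -> fraction of values that are <= x."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one value")
    total = len(ordered)

    def cdf(x):
        # bisect_right gives the number of entries <= x in a sorted list.
        return bisect_right(ordered, x) / total

    return cdf


# Invented counts of trusted users per retweet chain, one list per set.
non_spam_counts = [0, 1, 3, 10, 50]
spam_counts = [0, 0, 0, 1, 2]

non_spam_cdf = empirical_cdf(non_spam_counts)
spam_cdf = empirical_cdf(spam_counts)
```

Evaluating both functions at the same points, for instance at 0, shows what share of each set has no trusted retweeters at all, which is the kind of separation such a plot is meant to reveal.
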
  27. Method 3: Replication of tweet content
    (Plot axes: Number of copies, Number of users in retweet chain)
  29. Next steps: Plan for the next two months
    ◎ Testing our method in other datasets
    ◎ Correlation with other methods
    ◎ Spam detection as a service/API
  30. Conclusion
    ◎ On-the-fly spam detection
    ◎ Help prevent manipulation of public opinion on Twitter
    Making social networks safer and more authentic
  31. How to find spam on Twitter?
    Mourjo Sen
    Under the guidance of Arnaud Legout, Maksym Gabielkov
    Thank you!