How to find spam on Twitter

This is a Master's thesis presentation on how to detect spam on Twitter by evaluating properties of the Twitter social graph.

The full thesis is available here for more details: http://mourjo.me/assets/master_thesis_MourjoSen.pdf

Mourjo Sen

July 08, 2015
Transcript

  1. How to find spam on Twitter?
    Mourjo Sen
    Under the guidance of Arnaud Legout, Maksym Gabielkov
  2. Outline
    ◎ Background, problem statement, workflow
    ◎ Definition of our metric of trust
    ◎ Spam detection methodology
    ◎ Testing our method
    ◎ Conclusion: Next Steps
  3. The dark side of social media
    ◎ A hacker starts an online rumour about a plane crash on Twitter
    ◎ The “news” goes viral and the airline’s stock plummets
    ◎ The hacker makes a fortune on stock short sales
  4. 5

  5. Real-world influence of Twitter
    ◎ Political campaigns
    ◎ Marketing campaigns + promotions
    ◎ Stock markets
    ◎ Journalism: TV, Books, Newspapers…
    ◎ Customer satisfaction
    ◎ Awareness programs
    A strong incentive to manipulate tweets
  6. The problem
    ◎ No one knows if tweets can be trusted
    ◎ Not even Twitter themselves
      ◦ Researchers from Twitter
      ◦ Discussion with Vigiglobe
    ◎ Goal: Robust, on-the-fly spam detection
  7. The workflow
    Master 1 PFE ✓: Defining a metric of trust
    Master 2 PFE ✓: Analyzing the trust metric
    Master 2 Internship: Classifying tweets by using the metric
  8. Outline
    ◎ Background, problem statement, workflow
    ◎ Definition of our metric of trust
    ◎ Spam detection methodology
    ◎ Testing our method
    ◎ Conclusion: Next Steps
  9. Do we need a trust metric?
    ◎ Twitter has manually verified ~113K users
    ◎ But 99.99% of users are not verified
  10. Outline
    ◎ Background, problem statement, workflow
    ◎ Definition of our metric of trust
    ◎ Spam detection methodology
    ◎ Testing our method
    ◎ Conclusion: Next Steps
  11. The power of retweets
    ◎ Non-repudiation: Public statement of one’s approval of the content
    ◎ Not duplication: Gives credit to the original content publisher
    ◎ Long retweet chain = High visibility = Affects a lot of people
    ◎ Public opinion: Retweets influence trends
  12. Spam detection method: Quality of retweets
    Trusted users in the retweet chain indicate the authenticity of the tweet
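A minimal sketch of the idea on this slide, assuming a set of trusted user IDs is already available (the slides do not spell out how the trust metric produces it). The function names and the `min_trusted` threshold are hypothetical, chosen only for illustration:

```python
from typing import Iterable, Set


def trusted_retweeter_count(retweet_chain: Iterable[str], trusted_users: Set[str]) -> int:
    """Count the distinct trusted users that appear in a tweet's retweet chain."""
    return len(set(retweet_chain) & trusted_users)


def looks_authentic(retweet_chain: Iterable[str], trusted_users: Set[str],
                    min_trusted: int = 1) -> bool:
    """Treat a tweet as likely authentic if enough trusted users retweeted it.

    min_trusted is an assumed threshold, not a value taken from the thesis.
    """
    return trusted_retweeter_count(retweet_chain, trusted_users) >= min_trusted


if __name__ == "__main__":
    trusted = {"alice", "carol"}          # hypothetical trusted user IDs
    chain = ["alice", "bob", "alice"]     # hypothetical retweet chain
    print(trusted_retweeter_count(chain, trusted), looks_authentic(chain, trusted))
```

Counting distinct users rather than raw retweets keeps one account from inflating the score by appearing several times in the chain.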
  13. How is it robust and on the fly?
    ◎ Easy to send many tweets
    ◎ Difficult to change the follow-relationship
    ◎ If we have the tweet, we can obtain the list of retweets, i.e. the retweet chain
  14. Outline
    ◎ Background, problem statement, workflow
    ◎ Definition of our metric of trust
    ◎ Spam detection methodology
    ◎ Testing our method
    ◎ Conclusion: Next Steps
  15. Testing our method of spam detection
    ◎ No test set
    ◎ Manual verification too slow
    ◎ Need other methods
      1. Suspicious keywords
      2. Periodic tweets
      3. Content copying
  16. Method 1: Keyword analysis
    ◎ Collection of 27M topic-related tweets
    ◎ Unrelated/derogatory keywords = spam?
  18. Method 1: Keyword analysis
    ◎ Collection of 27M topic-related tweets
    ◎ Unrelated/derogatory keywords = spam?
    con sin hot bra fuck lol giveaway mom kill sale girl hack upgrade bf prom hole exe sex fuckin leak gratis fucking suck followers torrent cheap tops fucked password girls ninja retweets kick male killing dude bitch recent kills gay baby nights hackers cute repair discount pirates rumor teen sexy porn followme fake death finger giveaways wife playroom dick died hiring subscribe multiplayer rear spy midnight dumb upgrades pissed peek freak killer webcam shirt sponsor models cheapest wallpaper installation
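The keyword check could be sketched as follows. This is an illustration only: the keyword subset is picked from the list on the slide, while the tokenization and the function name `suspicious_keywords_in` are assumptions rather than the thesis implementation.

```python
import re

# A small subset of the slide's keyword list, chosen for illustration.
SUSPICIOUS_KEYWORDS = {"giveaway", "torrent", "followers", "cheap", "discount", "followme"}


def suspicious_keywords_in(text, keywords=SUSPICIOUS_KEYWORDS):
    """Return the suspicious keywords that occur as whole tokens in a tweet."""
    # Lowercase and split into alphanumeric tokens so matching is case-insensitive
    # and ignores punctuation.
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return tokens & set(keywords)


if __name__ == "__main__":
    print(suspicious_keywords_in("Cheap followers, FREE giveaway!"))
    print(suspicious_keywords_in("New console announced today"))
```

Matching whole tokens rather than substrings matters for short keywords: a substring match on "con" from the slide's list would also fire on "console".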
  20. Method 2: Finding periodic tweets
    ◎ Twitter bots often tweet periodically
    ◎ Difficult to detect periods in a large collection of tweets
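The slides do not say how periods were detected. One simple heuristic, swapped in here purely as an illustration, is to check whether the gaps between a single account's tweets are nearly constant, using the coefficient of variation of the gaps. The function name and the `max_cv` and `min_tweets` thresholds are assumptions:

```python
from statistics import mean, pstdev


def is_periodic(timestamps, max_cv=0.1, min_tweets=5):
    """Flag an account whose tweets arrive at nearly constant intervals.

    timestamps: posting times in seconds for one account.
    max_cv: assumed upper bound on the coefficient of variation of the gaps.
    min_tweets: assumed minimum number of tweets needed to judge periodicity.
    """
    if len(timestamps) < min_tweets:
        return False
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    average_gap = mean(gaps)
    if average_gap == 0:
        return False  # all tweets share one timestamp; no period to measure
    return pstdev(gaps) / average_gap <= max_cv


if __name__ == "__main__":
    print(is_periodic([0, 600, 1200, 1801, 2400, 3000]))  # roughly every 10 minutes
    print(is_periodic([0, 5, 900, 950, 4000, 4100]))      # irregular posting
```

This works per account; it does not address the harder problem the slide mentions of finding periods across a large mixed collection of tweets.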
  21. Method 3: Replication of tweet content
    ◎ Some tweets have the same content
    ◎ Same content → Spam property
    ◎ Retweets → Non-spam property
  22. Method 3: Replication of tweet content
    All tweets in the dataset
    Set 1: Tweets with more retweets than copies (Non-spam set)
    Set 2: Tweets which have not been retweeted (Spam set)
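A rough sketch of how such a split might be computed, under one reading of this slide: tweets with more retweets than copies go to the non-spam set, and tweets whose content is copied elsewhere but which were never retweeted go to the spam set. The tweet fields (`id`, `text`, `retweets`), the text normalization, and the function name are all hypothetical:

```python
from collections import defaultdict


def split_by_replication(tweets):
    """Split tweets into (non_spam_ids, spam_ids) using copies versus retweets.

    tweets: list of dicts with hypothetical keys "id", "text", "retweets".
    """
    # Group tweets whose text is identical after lowercasing and
    # collapsing whitespace (an assumed notion of "same content").
    groups = defaultdict(list)
    for tweet in tweets:
        groups[" ".join(tweet["text"].lower().split())].append(tweet)

    non_spam, spam = [], []
    for group in groups.values():
        copies = len(group) - 1  # other tweets carrying the same content
        for tweet in group:
            if tweet["retweets"] > copies:
                non_spam.append(tweet["id"])   # more retweets than copies
            elif copies > 0 and tweet["retweets"] == 0:
                spam.append(tweet["id"])       # copied content, never retweeted
    return non_spam, spam


if __name__ == "__main__":
    sample = [
        {"id": 1, "text": "Win a free phone", "retweets": 0},
        {"id": 2, "text": "win a  FREE phone", "retweets": 0},
        {"id": 3, "text": "Launch event at 5pm", "retweets": 12},
        {"id": 4, "text": "hello", "retweets": 0},
    ]
    print(split_by_replication(sample))
```

Tweets that fit neither rule, such as a unique tweet with no retweets, are left unlabeled in this sketch.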
  23. Method 3: Replication of tweet content
    [Plot: CDF of the number of trusted users in the retweet chain (log scale)]
  24. Outline
    ◎ Background, problem statement, workflow
    ◎ Definition of our metric of trust
    ◎ Spam detection methodology
    ◎ Testing our method
    ◎ Conclusion: Next Steps
  25. Next steps: Plan for the next two months
    ◎ Testing our method in other datasets
    ◎ Correlation with other methods
    ◎ Spam detection as a service/API
  26. Conclusion
    ◎ On-the-fly spam detection
    ◎ Help prevent manipulation of public opinion on Twitter
    Making social networks safer and more authentic
  27. How to find spam on Twitter?
    Mourjo Sen
    Under the guidance of Arnaud Legout, Maksym Gabielkov
    Thank you!