Slide 1

Slide 1 text

How to find spam on Twitter?
Mourjo Sen
Under the guidance of Arnaud Legout and Maksym Gabielkov

Slide 2

Slide 2 text

Outline
◎ Background, problem statement, workflow
◎ Definition of our metric of trust
◎ Spam detection methodology
◎ Testing our method
◎ Conclusion: Next Steps

Slide 3

Slide 3 text

#JeSuisCharlie
Mentioned 6,500 times per minute
3.4 million times in a day

Slide 4

Slide 4 text

The dark side of social media
◎ A hacker starts an online rumour about a plane crash on Twitter
◎ The “news” goes viral and the airline’s stock plummets
◎ The hacker makes a fortune on stock short sales

Slide 5

Slide 5 text


Slide 6

Slide 6 text

Real-world influence of Twitter
◎ Political campaigns
◎ Marketing campaigns + promotions
◎ Stock markets
◎ Journalism: TV, Books, Newspapers…
◎ Customer satisfaction
◎ Awareness programs
A strong incentive to manipulate tweets

Slide 7

Slide 7 text

The problem
◎ No one knows if tweets can be trusted
◎ Not even Twitter themselves
  ○ Researchers from Twitter
  ○ Discussion with Vigiglobe
◎ Goal: Robust, on-the-fly spam detection

Slide 8

Slide 8 text

The workflow
◎ Master 1 PFE ✓: Defining a metric of trust
◎ Master 2 PFE ✓: Analyzing the trust metric
◎ Master 2 Internship: Classifying tweets by using the metric

Slide 9

Slide 9 text

Outline
◎ Background, problem statement, workflow
◎ Definition of our metric of trust
◎ Spam detection methodology
◎ Testing our method
◎ Conclusion: Next Steps

Slide 10

Slide 10 text

Do we need a trust metric?
◎ Twitter has manually verified ~113K users
◎ But 99.99% of users are not verified

Slide 11

Slide 11 text

The trust score

Slide 12

Slide 12 text

Outline
◎ Background, problem statement, workflow
◎ Definition of our metric of trust
◎ Spam detection methodology
◎ Testing our method
◎ Conclusion: Next Steps

Slide 13

Slide 13 text

Retweet chain
[Figure labels: Creator of the tweet; Retweeters]

Slide 14

Slide 14 text

The power of retweets
◎ Non-repudiation: Public statement of one’s approval of the content
◎ Not duplication: Gives credit to the original content publisher
◎ Long retweet chain = High visibility = Affects a lot of people
◎ Public opinion: Retweets influence trends

Slide 15

Slide 15 text

Spam detection method: Quality of retweets
Trusted users in the retweet chain indicate the authenticity of the tweet

Slide 16

Slide 16 text

How is it robust and on the fly?
◎ Easy to send many tweets
◎ Difficult to change the follow-relationship
◎ If we have the tweet, we can obtain the list of retweets, i.e. the retweet chain
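
The idea of judging a tweet by the trusted users in its retweet chain could be sketched roughly as follows. This is a minimal illustration only: the slides do not give the formula for the trust score, so the set `trusted_users`, the function names, and the threshold `min_trusted` are all hypothetical stand-ins.

```python
from typing import Iterable, Set


def trusted_retweeter_count(retweet_chain: Iterable[str],
                            trusted_users: Set[str]) -> int:
    """Count the distinct users in the retweet chain that are trusted.

    `trusted_users` is an assumed, precomputed set; how trust is
    actually scored is not specified in the slides.
    """
    return len(set(retweet_chain) & trusted_users)


def looks_authentic(retweet_chain: Iterable[str],
                    trusted_users: Set[str],
                    min_trusted: int = 1) -> bool:
    """Hypothetical rule: enough trusted retweeters suggests authenticity."""
    return trusted_retweeter_count(retweet_chain, trusted_users) >= min_trusted


if __name__ == "__main__":
    chain = ["alice", "bob", "bob", "carol"]  # retweeters of one tweet
    trusted = {"bob", "carol", "dave"}        # assumed trusted accounts
    print(trusted_retweeter_count(chain, trusted))
    print(looks_authentic(chain, trusted))
```

Only the tweet's retweet chain and a trusted-user set are needed, which is in keeping with the on-the-fly goal stated above.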

Slide 17

Slide 17 text

Outline
◎ Background, problem statement, workflow
◎ Definition of our metric of trust
◎ Spam detection methodology
◎ Testing our method
◎ Conclusion: Next Steps

Slide 18

Slide 18 text

Testing our method of spam detection
◎ No test set
◎ Manual verification too slow
◎ Need other methods
  1. Suspicious keywords
  2. Periodic tweets
  3. Content copying

Slide 19

Slide 19 text

Method 1: Keyword analysis
◎ Collection of 27M topic-related tweets
◎ Unrelated/derogatory keywords = spam?

Slide 20

Slide 20 text

Method 1: Keyword analysis
◎ Collection of 27M topic-related tweets
◎ Unrelated/derogatory keywords = spam?

Slide 21

Slide 21 text

Method 1: Keyword analysis
◎ Collection of 27M topic-related tweets
◎ Unrelated/derogatory keywords = spam?
con, sin, hot, bra, fuck, lol, giveaway, mom, kill, sale, girl, hack, upgrade, bf, prom, hole, exe, sex, fuckin, leak, gratis, fucking, suck, followers, torrent, cheap, tops, fucked, password, girls, ninja, retweets, kick, male, killing, dude, bitch, recent, kills, gay, baby, nights, hackers, cute, repair, discount, pirates, rumor, teen, sexy, porn, followme, fake, death, finger, giveaways, wife, playroom, dick, died, hiring, subscribe, multiplayer, rear, spy, midnight, dumb, upgrades, pissed, peek, freak, killer, webcam, shirt, sponsor, models, cheapest, wallpaper, installation
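
A keyword check of this kind might look like the sketch below. It uses a small subset of the keywords listed on the slide. The tokenization rule and the function name are assumptions made for illustration, not the method actually used on the 27M-tweet collection.

```python
import re

# Small subset of the slide's keyword list, chosen for illustration.
SUSPICIOUS_KEYWORDS = {
    "giveaway", "followers", "torrent", "cheap",
    "password", "discount", "followme", "retweets",
}


def suspicious_keywords_in(text, keywords=SUSPICIOUS_KEYWORDS):
    """Return the suspicious keywords that appear as whole words in `text`.

    Assumed tokenization: lowercase the text and split it into runs of
    ASCII letters and digits, so "#giveaway" yields the token "giveaway".
    """
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return tokens & set(keywords)


if __name__ == "__main__":
    print(suspicious_keywords_in("Cheap followers! #giveaway"))
    print(suspicious_keywords_in("Thoughts are with Paris today"))
```

A match is only a hint that a tweet is unrelated to the topic, which is why the slide phrases the rule as a question.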

Slide 22

Slide 22 text

Method 1: Keyword analysis
◎ Collection of 27M topic-related tweets
◎ Unrelated/derogatory keywords = spam?

Slide 23

Slide 23 text

Method 2: Finding periodic tweets
◎ Twitter bots often tweet periodically
◎ Difficult to detect periods in a large collection of tweets
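
One simple way to look for periodic posting, shown here purely as an assumed illustration, is to measure how regular the gaps between one account's tweets are. The slides do not say which periodicity test was used. The coefficient-of-variation heuristic, the function names, and the threshold `max_cv` below are hypothetical, and the sketch works per account rather than over a whole collection.

```python
from statistics import mean, pstdev


def interval_variation(timestamps):
    """Coefficient of variation of the gaps between consecutive tweets.

    `timestamps` are posting times in seconds for a single account.
    Returns None when there are fewer than two gaps or the mean gap is 0.
    """
    ts = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    if len(gaps) < 2:
        return None
    avg = mean(gaps)
    if avg == 0:
        return None
    return pstdev(gaps) / avg


def looks_periodic(timestamps, max_cv=0.1):
    """Hypothetical rule: near-constant gaps suggest automated posting."""
    cv = interval_variation(timestamps)
    return cv is not None and cv <= max_cv


if __name__ == "__main__":
    print(looks_periodic([0, 600, 1200, 1800, 2400]))  # a tweet every 10 minutes
    print(looks_periodic([0, 30, 900, 1000, 5000]))    # irregular posting
```

This heuristic would miss bots that add jitter or interleave several schedules, which is one reason period detection is hard at scale.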

Slide 24

Slide 24 text

Method 3: Replication of tweet content
◎ Some tweets have the same content
◎ Same content → Spam property
◎ Retweets → Non-spam property

Slide 25

Slide 25 text

Method 3: Replication of tweet content
[Figure: all tweets in the dataset, with two subsets marked]
◎ Set 1: Tweets with more retweets than copies
◎ Set 2: Tweets which have not been retweeted
◎ Labels: Non-spam set, Spam set
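
The split into the two sets could be sketched as below. Several points are assumptions rather than statements from the slides: that Set 1 corresponds to the non-spam set and Set 2 to the spam set, that Set 2 additionally requires at least one copy (`min_copies`), and that "same content" means equal text after lowercasing and collapsing whitespace. The dictionary keys `text` and `retweets` are hypothetical field names.

```python
from collections import Counter


def normalize(text):
    """Assumed notion of 'same content': case- and whitespace-insensitive."""
    return " ".join(text.lower().split())


def split_by_replication(tweets, min_copies=1):
    """Split tweets into (non_spam, spam) candidate sets.

    Each tweet is a dict with a 'text' string and a 'retweets' count.
    A copy is any other tweet in the dataset with the same normalized text.
    """
    counts = Counter(normalize(t["text"]) for t in tweets)
    non_spam, spam = [], []
    for t in tweets:
        copies = counts[normalize(t["text"])] - 1
        if t["retweets"] > copies:
            # Set 1: more retweets than copies.
            non_spam.append(t)
        elif t["retweets"] == 0 and copies >= min_copies:
            # Set 2: never retweeted (and, by assumption, copied).
            spam.append(t)
    return non_spam, spam


if __name__ == "__main__":
    sample = [
        {"id": 1, "text": "Win a free phone", "retweets": 0},
        {"id": 2, "text": "win a  free phone", "retweets": 0},
        {"id": 3, "text": "Breaking: march in Paris", "retweets": 12},
        {"id": 4, "text": "hello", "retweets": 0},
    ]
    ok, bad = split_by_replication(sample)
    print([t["id"] for t in ok], [t["id"] for t in bad])
```

Tweets that fall into neither set, such as a unique tweet that was never retweeted, are left unlabelled in this sketch.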

Slide 26

Slide 26 text

Method 3: Replication of tweet content
[Figure: CDF of the number of trusted users in the retweet chain (log scale)]

Slide 27

Slide 27 text

Method 3: Replication of tweet content
[Figure: axes labelled "Number of copies" and "Number of users in retweet chain"]

Slide 28

Slide 28 text

Outline
◎ Background, problem statement, workflow
◎ Definition of our metric of trust
◎ Spam detection methodology
◎ Testing our method
◎ Conclusion: Next Steps

Slide 29

Slide 29 text

Next steps: Plan for the next two months
◎ Testing our method on other datasets
◎ Correlation with other methods
◎ Spam detection as a service/API

Slide 30

Slide 30 text

Conclusion
◎ On-the-fly spam detection
◎ Help prevent manipulation of public opinion on Twitter
Making social networks safer and more authentic

Slide 31

Slide 31 text

How to find spam on Twitter?
Mourjo Sen
Under the guidance of Arnaud Legout and Maksym Gabielkov
Thank you!