Alceu Costa
September 19, 2015

RSC: Mining and Modeling Temporal Activity in Social Media

Slides of my presentation at KDD 2015 of the paper:

RSC: Mining and Modeling Temporal Activity in Social Media
Alceu Ferraz Costa, Yuto Yamaguchi, Agma Juci Machado Traina, Caetano Traina Jr., and Christos Faloutsos
The 21st SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2015

Transcript

  1. RSC: Mining and Modeling Temporal Activity in Social Media

    Alceu F. Costa*, Yuto Yamaguchi, Agma J. M. Traina, Caetano Traina Jr., Christos Faloutsos
    Universidade de São Paulo
    KDD 2015 – Sydney, Australia
    *[email protected]
  2. Introduction

    Users generate sequences of time-stamps when they use a social media Web site.
    What can we learn from time-stamps?
    Are there common patterns?
    Can we tell if a user is a bot or a human?
    [Figure: sequence of tweet time-stamps; each bar is one tweet time-stamp]
  3. Outline

    Pattern Mining: What patterns can we discover from the temporal activities of social media users?
    Modeling
    Bot Detection
    Experiments
    Conclusion
  4. Pattern Mining: Datasets

    Reddit dataset: time-stamps from comments; 21,198 users; 20 million time-stamps
    Twitter dataset: time-stamps from tweets; 6,790 users; 16 million time-stamps
    For each user we have:
      Sequence of posting time-stamps: T = (t1, t2, t3, …)
      Inter-arrival times (IAT) of postings: (∆1, ∆2, ∆3, …)
    [Figure: timeline with time-stamps t1, t2, t3, t4 and the gaps ∆1, ∆2, ∆3 between them]
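As a minimal sketch of the definition above, the IAT sequence can be derived from a user's time-stamps by differencing consecutive values. This assumes ∆i = t(i+1) − ti, as the timeline figure suggests, and time-stamps expressed in seconds; the function name `inter_arrival_times` is illustrative and not taken from the authors' code.

```python
def inter_arrival_times(timestamps):
    """Return the IAT sequence (Δ1, Δ2, ...) for a list of posting time-stamps."""
    # Sort first so that out-of-order input still yields non-negative gaps.
    ordered = sorted(timestamps)
    # Each IAT is the gap between one posting and the next.
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# Four postings (in seconds) give three inter-arrival times.
example_iat = inter_arrival_times([100, 160, 400, 1000])
```

A user with n postings yields n − 1 inter-arrival times, so a user with a single posting contributes none.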
  5. Pattern Mining

    Pattern 1: The distribution of IAT is heavy-tailed
    Users can be inactive for long periods of time before making new postings.
    [Figure: IAT complementary cumulative distribution function (CCDF) on log-log axes, one panel each for Reddit users and Twitter users; x-axis: ∆, IAT (seconds), with 1 day and 10 days marked; y-axis: CCDF, P(x ≥ ∆)]
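The CCDF plotted on this slide can be estimated empirically as the fraction of observed IAT that are at least ∆. The sketch below is one straightforward way to compute it, not necessarily how the authors produced their plots; `empirical_ccdf` is a hypothetical helper name.

```python
def empirical_ccdf(iat):
    """Return (Δ, P(X >= Δ)) pairs for every distinct IAT value, in increasing Δ."""
    n = len(iat)
    points = []
    for value in sorted(set(iat)):
        # Fraction of observations greater than or equal to this value.
        at_least = sum(1 for x in iat if x >= value)
        points.append((value, at_least / n))
    return points


# The smallest observed IAT always has CCDF 1.0.
example_ccdf = empirical_ccdf([1, 2, 2, 4])
```

Plotting these pairs on log-log axes makes a heavy tail visible as a slowly decaying curve rather than the sharp drop an exponential distribution would show.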
  6. Pattern Mining

    Pattern 2: Bimodal IAT distribution
    Users have highly active sessions and resting periods.
    [Figure: log-binned histogram of postings IAT for Twitter users; x-axis: ∆, IAT (seconds); y-axis: PDF; 1st mode (1 min), 2nd mode (3 h)]
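A log-binned histogram groups IAT into bins whose edges are evenly spaced on a logarithmic axis, so that a 1-minute mode and a 3-hour mode are both resolved. The sketch below assumes base-10 bins; the parameter `bins_per_decade` and the function name are illustrative choices, not values from the slides.

```python
import math


def log_binned_histogram(iat, bins_per_decade=5):
    """Count IAT values in logarithmically spaced bins.

    Bin k covers [10**(k / bins_per_decade), 10**((k + 1) / bins_per_decade)).
    Returns a dict mapping bin index k to the number of IAT in that bin.
    """
    counts = {}
    for delta in iat:
        if delta <= 0:
            continue  # log10 is undefined for non-positive gaps
        index = math.floor(math.log10(delta) * bins_per_decade)
        counts[index] = counts.get(index, 0) + 1
    return counts


# 50 s and 60 s fall in the same bin; 3 h (10,800 s) falls in a much later one.
example_hist = log_binned_histogram([50, 60, 10800])
```

Dividing each count by the total number of IAT gives the relative frequencies that a PDF-style plot would show.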
  7. Pattern Mining

    Pattern 3: Periodic spikes in the IAT distribution
    Caused by daily sleeping intervals.
    [Figure: log-binned IAT histograms; x-axis: ∆, IAT (seconds); y-axis: PDF; the Reddit users panel marks spikes at 7h, 12h, 24h, 48h and 72h]
  8. Pattern Mining

    Pattern 4: Consecutive IAT are correlated
    Long IAT are likely to be followed by long IAT, and short IAT by short IAT.
    [Figure: heat-map of pairs of consecutive IAT for all Reddit users]
    Concentration of pairs along the diagonal: positive correlation
  9. RSC Model

    Can we generate synthetic time-stamps that match real data patterns?
    Proposed model: Rest-Sleep-and-Comment (RSC)
    Models compared: Poisson Process; Queue Based (Barabási, 2005); CNPP (Malmgren et al., 2009); SFP (Vaz de Melo et al., 2013); RSC (proposed model)

    Pattern                Poisson   Queue Based   CNPP   SFP   RSC
    Heavy Tails                      ✔                    ✔     ✔
    Bimodal Distribution                           ✔            ✔
    Periodic Spikes                                             ✔
    IAT Correlation                                       ✔     ✔
  10. RSC Model

    Base model: Self-Correlated Process (SCorr)
    Definition: A stochastic process is a SCorr process with base rate λ and correlation ρ if:
      ∆_i ~ Exp(ρ∆_{i-1} + 1/λ)
    Consecutive IAT are correlated: the i-th IAT ∆_i depends on the previous, (i-1)-th IAT ∆_{i-1}.
    ρ controls the correlation strength: if ρ = 0, SCorr reduces to an exponential distribution.
    X ~ Exp(1/λ): exponential random variable with rate λ
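The SCorr definition can be turned into a short generator. This is a sketch under stated assumptions, not the authors' implementation: Exp(β) is read as an exponential distribution with mean β (the backup slide on SCorr calls β the mean inter-arrival time), and the first IAT is drawn with mean 1/λ because the slide does not say how the sequence starts. The name `generate_scorr` is hypothetical.

```python
import random


def generate_scorr(n, base_rate, rho, rng=None):
    """Draw n inter-arrival times from a SCorr process.

    Assumption: Exp(beta) denotes an exponential distribution with MEAN beta,
    so each IAT has mean rho * previous_iat + 1 / base_rate.
    """
    if rng is None:
        rng = random.Random()
    mean = 1.0 / base_rate  # assumed starting point: no previous IAT yet
    deltas = []
    for _ in range(n):
        # random.expovariate takes a rate, i.e. 1 / mean.
        delta = rng.expovariate(1.0 / mean)
        deltas.append(delta)
        # The next mean grows with the IAT just drawn (self-correlation).
        mean = rho * delta + 1.0 / base_rate
    return deltas


# Mean base wait of 60 s with moderately strong correlation.
example_scorr = generate_scorr(1000, base_rate=1 / 60, rho=0.7, rng=random.Random(0))
```

With `rho=0` every draw uses the same mean 1/λ, so the generator reduces to independent exponential waits, which matches the ρ = 0 case stated on the slide.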
  11. RSC Model

    SCorr Process
    ✔ Correlated IAT
    ✔ Heavy Tail
    ✗ Bimodal Distribution
    ✗ Periodic Spikes
    [Figure: consecutive IAT distribution for SCorr synthetic data, λ = 20h, ρ = 0.7; x-axis: ∆n, IAT (seconds); y-axis: ∆n+1, IAT (seconds)]
  12. RSC Model

    SCorr Process, λ = 20h, ρ = 0.7
    ✔ Correlated IAT
    ✔ Heavy Tail
    ✗ Bimodal Distribution
    ✗ Periodic Spikes
    [Figure: IAT CCDF of Reddit data versus SCorr; x-axis: ∆, IAT (seconds); y-axis: CCDF, P(X ≥ ∆)]
  13. RSC Model

    SCorr Process, λ = 20m, ρ = 1.0
    ✔ Correlated IAT
    ✔ Heavy Tail
    ✗ Bimodal Distribution
    ✗ Periodic Spikes
    [Figure: IAT log-binned histogram of data versus SCorr; x-axis: ∆, IAT (seconds); y-axis: PDF]
  14. RSC Model

    Model states
    Active:
      1. Wait δ ~ SCorr(λA, ρA)
      2. Post with probability ppost
      3. Transition
    Rest:
      1. Wait δ ~ SCorr(λR, ρR)
      2. Transition
    Base rates: λA > λR
    The average wait time in the active state is smaller than in the rest state.
    [Figure: state-transition diagram between the Active and Rest states, with edge labels pA, 1-pA, pR and 1-pR]
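The two-state loop can be sketched as a small simulation. Several details here are assumptions rather than facts from the slides: the transition diagram does not survive in the transcript, so this sketch treats pA and pR as the probabilities of staying in the active and rest states; each state remembers its own previous wait for the SCorr draw; and Exp(β) is taken to have mean β. All names (`generate_posts`, the `params` keys, `EXAMPLE_PARAMS`) are hypothetical.

```python
import random


def scorr_wait(prev_wait, base_rate, rho, rng):
    """One SCorr draw, assuming Exp(beta) has mean beta = rho * prev + 1 / rate."""
    mean = rho * prev_wait + 1.0 / base_rate
    return rng.expovariate(1.0 / mean)


def generate_posts(n_posts, params, rng=None):
    """Simulate active/rest states until n_posts posting time-stamps exist."""
    if rng is None:
        rng = random.Random()
    state = "active"
    prev_wait = {"active": 0.0, "rest": 0.0}  # assumed: one memory per state
    clock = 0.0
    posts = []
    while len(posts) < n_posts:
        # Step 1 in both states: wait a SCorr-distributed amount of time.
        wait = scorr_wait(prev_wait[state], params["lambda_" + state],
                          params["rho_" + state], rng)
        prev_wait[state] = wait
        clock += wait
        # Step 2, active state only: post with probability p_post.
        if state == "active" and rng.random() < params["p_post"]:
            posts.append(clock)
        # Final step: transition (assumed: stay with probability p_stay_<state>).
        if rng.random() >= params["p_stay_" + state]:
            state = "rest" if state == "active" else "active"
    return posts


# Illustrative values only: mean waits of 1 min (active) and 3 h (rest).
EXAMPLE_PARAMS = {
    "lambda_active": 1 / 60, "rho_active": 0.5,
    "lambda_rest": 1 / 10800, "rho_rest": 0.5,
    "p_post": 0.8, "p_stay_active": 0.7, "p_stay_rest": 0.3,
}
example_posts = generate_posts(200, EXAMPLE_PARAMS, random.Random(1))
```

Because waits accumulate across both states until a post happens, the gaps between posts mix short active-state waits with long rest-state waits, which is what produces two modes in the IAT histogram.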
  15. RSC Model

    SCorr + Rest and Active States
    ✔ Heavy Tail
    ✔ Correlated IAT
    ✔ Bimodal Distribution
    ✗ Periodic Spikes
    [Figure: IAT log-binned histogram of data versus synthetic data; x-axis: ∆, IAT (seconds); y-axis: PDF]
  16. RSC Model

    Modeling periodic spikes: sleep state
    Keep track of the current time with a tclock variable, 0:00h < tclock < 23:59h
    Update tclock after each wait time δ
    Enter the sleep state if: current state = rest and (tclock < twake or tclock > tsleep)
    In the sleep state:
      1. Wait until the next wake-up time, twake
      2. Transition to the rest state
    [Figure: daily timeline marking tsleep, twake and tclock, with Sleep and Awake intervals]
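The sleep rule can be written as two small helpers. This sketch assumes clock times are stored as seconds since midnight and that twake falls earlier in the day than tsleep; the function names are illustrative.

```python
DAY = 24 * 3600  # seconds in one day


def should_sleep(state, t_clock, t_wake, t_sleep):
    """Sleep-state entry rule: only from rest, and only outside waking hours."""
    return state == "rest" and (t_clock < t_wake or t_clock > t_sleep)


def seconds_until_wake(t_clock, t_wake):
    """Time to wait until the next wake-up time, wrapping past midnight."""
    return (t_wake - t_clock) % DAY


# A resting user at 23:30 with a 07:00 wake-up time sleeps for 7.5 hours.
example_sleep = seconds_until_wake(23 * 3600 + 1800, 7 * 3600)
```

Because every sleeping user resumes at the same clock time each day, the resulting gaps cluster around fixed durations, which is one way to read the periodic spikes the slide attributes to sleeping intervals.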
  17. RSC Model

    Complete RSC Model
    ✔ Heavy Tail
    ✔ Correlated IAT
    ✔ Bimodal Distribution
    ✔ Periodic Spikes
    Parameter estimation uses the Levenberg-Marquardt algorithm.
    [Figure: IAT log-binned histogram of data versus RSC; x-axis: ∆, IAT (seconds); y-axis: PDF]
  18. Outline

    Pattern Mining
    Modeling
    Bot Detection: Can we spot automated behavior based only on time-stamp data?
    Experiments
    Conclusion
  19. Bot Detection

    Problem: Given labeled time-stamp data from a set of users {U1, U2, U3, …}, decide if an unknown user Ui is a human or a bot.
    Solution: RSC-Spotter
      Compare the user's IAT to synthetic IAT generated by the RSC model
      If they are not similar to RSC, then the user is likely to be a bot
    [Figure: sequence of time-stamps from a single user; x-axis: Time (days), 0 to 70]
    Is the user that produced the time-stamps a human or a bot?
  20. RSC-Spotter

    Comparing time-stamps
    Estimate the RSC parameters using time-stamps from all users
    For each user:
      1. Compute the IAT histogram, using log-spaced bins
      2. Generate synthetic time-stamps using RSC; RSC can generate the same number of time-stamps as the user
      3. Compare the user and synthetic IAT histograms
    Dissimilarity: D = Σi |ci – či|, where ci are the IAT bin counts of the user data and či are the IAT bin counts of the synthetic data
    Cost-sensitive classification is used to decide if a user is a bot given the dissimilarity D
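The dissimilarity D = Σi |ci – či| is a sum of absolute differences between matching histogram bins. The sketch below assumes both histograms are stored as dicts from bin index to count, with missing bins counted as zero; that representation and the name `dissimilarity` are illustrative choices.

```python
def dissimilarity(user_counts, synthetic_counts):
    """Sum of absolute bin-count differences between two IAT histograms."""
    # Visit every bin that appears in either histogram.
    bins = set(user_counts) | set(synthetic_counts)
    return sum(
        abs(user_counts.get(b, 0) - synthetic_counts.get(b, 0)) for b in bins
    )


# Bin 0: |3 - 0| = 3, bin 1: |2 - 5| = 3, bin 2: |0 - 1| = 1, so D = 7.
example_d = dissimilarity({0: 3, 1: 2}, {1: 5, 2: 1})
```

Since the synthetic sequence has the same number of time-stamps as the user's, the two histograms have equal totals and D measures only how differently the IAT are spread across bins.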
  21. Outline

    Pattern Mining
    Modeling
    Bot Detection
    Experiments: Can RSC match real data? How well can RSC-Spotter detect bots?
    Conclusion
  22. Experiments: Can RSC Match Real Data?

    Models: RSC (proposed model), CNPP (Malmgren et al.), SFP (Vaz de Melo et al.)
    [Figure: IAT CCDF of Data, SFP, CNPP and RSC, one panel each for Reddit users and Twitter users; x-axis: ∆, IAT (seconds); y-axis: CCDF, P(X ≥ ∆)]
    CNPP fails to match the heavy tail.
    Heavy Tail: CNPP ✗, SFP ✔, RSC ✔
  23. Experiments: Can RSC Match Real Data?

    CNPP (Malmgren et al.), Reddit users: two modes, no periodic spikes
    [Figure: IAT histogram of Data versus CNPP; x-axis: ∆, IAT (seconds); y-axis: PDF]
    Heavy Tail: CNPP ✗, SFP ✔, RSC ✔
    Bimodal: CNPP ✔
    Spikes: CNPP ✗
  24. Experiments: Can RSC Match Real Data?

    SFP (Vaz de Melo et al.), Reddit users: single mode, no periodic spikes
    [Figure: IAT histogram of Data versus SFP; x-axis: ∆, IAT (seconds); y-axis: PDF]
    Heavy Tail: CNPP ✗, SFP ✔, RSC ✔
    Bimodal: CNPP ✔, SFP ✗
    Spikes: CNPP ✗, SFP ✗
  25. Experiments: Can RSC Match Real Data?

    RSC (proposed model), Reddit users: two modes, periodic spikes
    [Figure: IAT histogram of Data versus RSC; x-axis: ∆, IAT (seconds); y-axis: PDF]
    Heavy Tail: CNPP ✗, SFP ✔, RSC ✔
    Bimodal: CNPP ✔, SFP ✗, RSC ✔
    Spikes: CNPP ✗, SFP ✗, RSC ✔
  26. Experiments: Can RSC Match Real Data?

    CNPP (Malmgren et al.): no IAT correlation
    [Figure: heat-maps of consecutive IAT pairs, Twitter data versus CNPP fit; x-axis: ∆n, IAT (seconds); y-axis: ∆n+1, IAT (seconds)]
    Heavy Tail: CNPP ✗, SFP ✔, RSC ✔
    Bimodal: CNPP ✔, SFP ✗, RSC ✔
    Spikes: CNPP ✗, SFP ✗, RSC ✔
    IAT Correlation: CNPP ✗
  27. Experiments: Can RSC Match Real Data?

    SFP (Vaz de Melo et al.): IAT correlation (but too strong!)
    [Figure: heat-maps of consecutive IAT pairs, Twitter data versus SFP fit; x-axis: ∆n, IAT (seconds); y-axis: ∆n+1, IAT (seconds)]
    Heavy Tail: CNPP ✗, SFP ✔, RSC ✔
    Bimodal: CNPP ✔, SFP ✗, RSC ✔
    Spikes: CNPP ✗, SFP ✗, RSC ✔
    IAT Correlation: CNPP ✗, SFP ✔
  28. Experiments: Can RSC Match Real Data?

    RSC (proposed model): IAT correlation
    [Figure: heat-maps of consecutive IAT pairs, Twitter data versus RSC fit; x-axis: ∆n, IAT (seconds); y-axis: ∆n+1, IAT (seconds)]
    Heavy Tail: CNPP ✗, SFP ✔, RSC ✔
    Bimodal: CNPP ✔, SFP ✗, RSC ✔
    Spikes: CNPP ✗, SFP ✗, RSC ✔
    IAT Correlation: CNPP ✗, SFP ✔, RSC ✔
  29. Outline

    Pattern Mining
    Modeling
    Bot Detection
    Experiments: Can RSC match real data? How well can RSC-Spotter detect bots?
    Conclusion
  30. Experiments: Can RSC-Spotter Detect Bots?

    Methodology
    Datasets: users were manually labeled as bots or humans
      Reddit: 1,963 humans, 37 bots
      Twitter: 1,353 humans, 64 bots
    Training: same size for train and test subsets (preserved class distribution)
    Baseline features:
      1. IAT Histogram: log-binned IAT histogram
      2. Entropy: entropy of the IAT histogram
      3. Week Hist.: number of postings for each day of the week
      4. All features: combination of 1, 2 and 3
  31. Experiments: Can RSC-Spotter Detect Bots?

    Precision vs. sensitivity curves, Twitter dataset
    Good performance: curve close to the top
    [Figure: precision versus sensitivity (recall) curves for RSC-Spotter, IAT Hist., Entropy [6], Weekday Hist. and All Features]
    RSC-Spotter: precision > 94%, sensitivity > 70%
    With strongly imbalanced datasets: # humans >> # bots
  32. Experiments: Can RSC-Spotter Detect Bots?

    Precision vs. sensitivity curves, Reddit dataset
    Good performance: curve close to the top
    [Figure: precision versus sensitivity (recall) curves for RSC-Spotter, IAT Hist., Entropy [6], Weekday Hist. and All Features]
    RSC-Spotter: precision > 96%, sensitivity > 47%
    With strongly imbalanced datasets: # humans >> # bots
  33. Conclusion

    Pattern Mining: discovered four activity patterns
    RSC-Model: model that matches the postings IAT distribution of social media users
    RSC-Spotter: can tell if a user is a bot based only on time-stamp data
    [Figures: IAT histogram (PDF versus ∆, IAT in seconds); IAT histogram of Data versus RSC; precision versus sensitivity (recall) curve]
  34. Thank you!

    Alceu F. Costa*, Yuto Yamaguchi, Agma J. M. Traina, Caetano Traina Jr., Christos Faloutsos
    Universidade de São Paulo
    *[email protected]
    Datasets and Code: https://github.com/alceufc/rsc_model
  35. RSC Spotter – Training

    Goal: decide if a dissimilarity D is big enough to say that a user is a bot
    Input: training set of labeled users
      Positive examples: bots
      Negative examples: humans
    1. Estimate pbot = P(user is a bot | D)
      Naive-Bayes classifier
      Dissimilarity D is a feature
    2. Estimate a probability threshold pthresh
      Cost-sensitive classification: assign costs to False Positive and False Negative errors
      Minimize the weighted harmonic mean between FP and FN errors
      Uses only training data
  36. Self-Correlated Process (SCorr)

    Exponential distribution: ∆_i ~ Exp(β)
      PDF: f(x) = βe^(-xβ)
      β: mean inter-arrival time
    Self-Correlated Process:
      Similar to the exponential distribution…
      …however β depends on the previous IAT
      β_i = ρ∆_{i-1} + 1/λ