

Data Intelligence
June 28, 2017


Antonia Gogoglou, Aristotle University of Thessaloniki, SignalGeneriX Ltd Cyprus
Audience level: Intermediate
Topic area: Modeling
The ENCASE project aims to leverage the latest advances in web security and privacy to design and implement a browser-based architecture for the protection of minors from malicious actors in online social networks, by exploiting sentiment and affective analysis along with graph mining.



  1. Online Social Networks (OSNs) (2)

    Data Intelligence Conference, June 23-25 2017
    o 42% of teenagers in the USA aged 15-17 actively use Twitter
    o Popularity among younger kids has also risen, with 21% of 13-14 year olds using Twitter in 2015
    o Young users often have different protection needs online and are more susceptible to bad influence
  2. Hard to thwart malicious behavior in OSNs

    o OSNs allow cyberbullies/predators/criminals to
      ▪ Reach countless children and spread inappropriate/malicious content
      ▪ Remain relatively anonymous
      ▪ Avoid being held accountable for their actions
  3. ENCASE’s proposition

    o Equip users with a set of tools
      ▪ Safeguard their security and privacy
      ▪ Raise awareness regarding potential threats
    o Extend and use the following techniques
      ▪ Sentiment and affective analysis
      ▪ Fake activity detection in OSNs
      ▪ Content detection and protection
  4. Approaches for OSN Analysis

    o Complex network analysis:
      ▪ A graph with non-trivial topological features
      ▪ Often used to represent real-world networks, such as social networks
    o Graph decomposition:
      ▪ Break down a complex graph into subparts
      ▪ Identify communities, spreading mechanisms, topological order, etc.
    o Feature selection:
      ▪ Relevant and influential factors that distinguish users
    o Machine learning:
      ▪ Classification of users, pattern recognition
  5. Motivation

    o Scenario:
      ▪ A kid makes a new connection in an OSN
      ▪ If that new connection exhibits spam/bullying/aggressive behavior or is in contact with such malicious users → danger
    o Criminal hubs, social promoters, social butterflies, spam neighborhoods and other influential groups:
      ▪ May affect youngsters differently than adults
      ▪ Need to be identified and flagged even before they share explicitly malicious content
    o Our contribution:
      ▪ Early detection framework to identify dangerous groups based on limited information
      ▪ Identified a new category of potentially dangerous users: Social Bridges
  6. Dataset Description (1)

    o We chose Twitter for our experiments due to its openness, accessibility and popularity
    o Original dataset from MPI [1] with more than 50 million users
      ▪ We sampled a graph starting from 500 spammers
    [1] http://socialnetworks.mpi-sws.org/datasets.html
  7. Dataset Description (2)

    o The graph was extracted based on the following relationships amongst users
    o The resulting dataset contains 303,999 unique users and 1,002,316 links, directed and unweighted
  8. Social Graph Analysis (1)

    o The number of connected components is a topological invariant of a graph
    o A connected component is a subgraph in which any two nodes are connected to each other by a path, and to which no further vertices or edges of G can be added without breaking that property
      ▪ Strongly connected component (SCC): every node is reachable from every other node
      ▪ Tarjan's algorithm: single-pass DFS
  9. Social Graph Analysis (2)

    o Tarjan's algorithm:
      ▪ Nodes are placed on a stack in the order they are visited
      ▪ When the recursive call that visits node v and its descendants returns, v is left on the stack if there is a path from v to an earlier node on the stack
      ▪ If not, v is popped together with the nodes above it → v is the root of an SCC, whose nodes have no path back to earlier ones
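The stack discipline described on this slide can be sketched in a few lines of Python. This is a minimal recursive illustration, not the implementation used in the talk; the function name and the toy graph are hypothetical, and the graph is assumed to be a dictionary mapping each node to its list of successors.

```python
from typing import Dict, Hashable, List


def strongly_connected_components(
    graph: Dict[Hashable, List[Hashable]]
) -> List[List[Hashable]]:
    """Return the SCCs of a directed graph given as {node: [successors]}."""
    index = {}        # discovery order of each visited node
    lowlink = {}      # smallest discovery index the node is known to reach on the stack
    on_stack = set()
    stack = []
    components = []
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = lowlink[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                visit(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                # w is an earlier node still on the stack: v can reach back to it
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:
            # No path to an earlier node: v is the root of an SCC.
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            components.append(component)

    for node in graph:
        if node not in index:
            visit(node)
    return components


# Hypothetical toy graph: a 3-cycle, a 2-cycle reachable from it, and an isolated node.
toy = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": ["e"], "e": ["d"], "f": []}
sccs = strongly_connected_components(toy)
```

For a graph the size of the one in this deck, a recursive version like this would likely exceed Python's default recursion limit, so an iterative variant or a graph library would be the practical choice.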
  10. Social Graph Analysis (3)

    o Force Atlas layout by Jacomy et al.
    o Core SCC
    o Peripheral SCC
  11. Social Graph Analysis (4)

    o K-core decomposition: partition a graph into layers from external to more central vertices
    o A subgraph Gk of G is the k-core if and only if it is the maximal subgraph in which every node has degree greater than or equal to k
    o The k-shell contains all vertices belonging to the k-th core but not the (k+1)-th
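The layering can be computed by repeatedly peeling off the node of lowest remaining degree. The sketch below assumes an undirected view of the graph (direction ignored) stored as adjacency sets; the function name and the toy graph are hypothetical, and the quadratic-time node selection is chosen for readability rather than speed.

```python
from typing import Dict, Hashable, Set


def core_numbers(adj: Dict[Hashable, Set[Hashable]]) -> Dict[Hashable, int]:
    """Core number of every node, computed by min-degree peeling."""
    degree = {v: len(neighbours) for v, neighbours in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Remove the node with the smallest remaining degree.
        v = min(remaining, key=degree.__getitem__)
        k = max(k, degree[v])   # the current layer never decreases
        core[v] = k
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1
    return core


# Hypothetical toy graph: a 4-clique (a, b, c, d), a node e tied to a and b,
# and a pendant node f attached to e.
toy = {
    "a": {"b", "c", "d", "e"},
    "b": {"a", "c", "d", "e"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c"},
    "e": {"a", "b", "f"},
    "f": {"e"},
}
core = core_numbers(toy)

# Group nodes into k-shells: same core number, same shell.
shells = {}
for node, k in core.items():
    shells.setdefault(k, set()).add(node)
```

Here the clique ends up in the 3-shell, e in the 2-shell and f in the 1-shell, matching the "external to central" ordering on the slide.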
  12. Social Graph Analysis (5)

    o kmax is the maximum value of k for which Gk is non-empty
    o Nodes belonging to the kmax-core form the nucleus, a strongly connected, globally distributed subset (73% malicious, 28% honest)
    o The largest connected component of the (kmax - 1)-shell is the peer component (88% honest)
  13. Social Graph Analysis (6)

    o Force Atlas layout
    o Core-periphery
    o Loosely connected nodes
  14. Social Graph Analysis (7)

    o Modularity: the fraction of the edges that fall within given groups (communities) minus the expected fraction if edges were distributed at random
    o Taxonomy of vertices according to their modularity class (community membership)
    o Result depends on boundaries between communities (strict or loose), randomization method, etc.
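As a worked illustration of the definition, the sketch below evaluates the standard undirected form Q = sum over communities c of [ L_c / m - (d_c / 2m)^2 ], where m is the number of edges, L_c the number of edges inside c and d_c the total degree of c. The undirected simplification, the function name and the toy graph are assumptions made for this example; the deck's graph is directed and the tool used there may apply a different variant.

```python
def modularity(edges, community):
    """Modularity of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once
    community: dict mapping every node to a community label
    """
    m = len(edges)
    internal = {}     # L_c: edges with both ends in community c
    degree_sum = {}   # d_c: sum of node degrees in community c
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Hypothetical toy graph: two triangles joined by a single edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
two_groups = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
one_group = {n: "A" for n in range(1, 7)}

q_split = modularity(edges, two_groups)   # 2 * (3/7 - (7/14)**2) = 5/14
q_single = modularity(edges, one_group)   # 7/7 - (14/14)**2 = 0
```

Splitting along the bridge scores higher than lumping everything together, which is the behaviour a community detection method tries to exploit.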
  15. Social Graph Analysis (8)

    o Detailed community segmentation
    o Yifan Hu layout
      ▪ The repulsive forces on one node from a cluster of distant nodes are approximated by a Barnes-Hut calculation, which treats them as one super node
  16. Social Bridges (1)

    o Denser representation
    o Black: least populated community
    o Red: intermediate community
    o Green: largest well-connected community
  17. Social Bridges (2)

    o 92% of the least populated community are spammers
    o 87% of the intermediate community are spam followers
    o 96% of the largest connected community are honest users
    o Spammers (black) use the intermediate community (red) to penetrate the core of honest users (green)
  18. Social Bridges (3)

    o The spam followers that make up the largest connected component of the intermediate community are the social bridges of spammers
    o They can prove dangerous to vulnerable, impressionable users (e.g. children) and are hard to identify
    o From here on, social bridges and spammers are referred to collectively as malicious users
  19. Social Bridges (4)

    o Social bridges removed
      ▪ Spammers become disconnected from the major community
    o Evidence of the importance of social bridges
    o Innocent users become part of the expansive network of malicious users
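The removal experiment can be mimicked on a toy graph with a breadth-first reachability check. Everything in this sketch is hypothetical (node names, edges, and the choice to ignore edge direction); it only illustrates the kind of test the slide describes.

```python
from collections import deque


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def reachable(adj, start, removed=frozenset()):
    """Nodes reachable from start when the nodes in `removed` are deleted."""
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w not in seen and w not in removed:
                seen.add(w)
                queue.append(w)
    return seen


# Hypothetical toy network: spammers s1, s2; bridges b1, b2; honest users h1..h3.
edges = [
    ("s1", "b1"), ("s2", "b1"), ("s2", "b2"),
    ("b1", "h1"), ("b2", "h2"),
    ("h1", "h2"), ("h2", "h3"),
]
adj = build_adjacency(edges)
honest = {"h1", "h2", "h3"}
bridges = {"b1", "b2"}

with_bridges = reachable(adj, "s1") & honest
without_bridges = reachable(adj, "s1", removed=bridges) & honest
```

With the bridges in place the spammer reaches every honest user; once they are removed it reaches none.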
  20. Feature Selection (1)

    o Network features to investigate graph topology and categorize users:
      ▪ In-Degree: incoming connections (followers)
      ▪ Out-Degree: outgoing connections (whom a user follows)
      ▪ Betweenness Centrality: number of shortest paths between all pairs of users that pass through that specific node
      ▪ Closeness Centrality: based on the average distance from a specific node to all others (here the harmonic mean)
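A minimal sketch of three of these features on a toy follower graph is given below. It assumes an edge (u, v) means "u follows v", and it takes the harmonic variant of closeness to be the mean of 1/distance over all other nodes, with unreachable nodes contributing 0; both conventions, and all names, are assumptions for illustration.

```python
from collections import deque


def degrees(edges):
    """In-degree (followers) and out-degree (followees) of every node."""
    in_deg, out_deg = {}, {}
    for u, v in edges:
        out_deg[u] = out_deg.get(u, 0) + 1
        in_deg[v] = in_deg.get(v, 0) + 1
        in_deg.setdefault(u, 0)
        out_deg.setdefault(v, 0)
    return in_deg, out_deg


def harmonic_closeness(edges, source):
    """Mean of 1/d(source, other) over all other nodes, following edge direction."""
    successors = {}
    nodes = set()
    for u, v in edges:
        successors.setdefault(u, []).append(v)
        nodes.update((u, v))
    distance = {source: 0}
    queue = deque([source])
    while queue:                      # breadth-first search gives hop distances
        u = queue.popleft()
        for v in successors.get(u, []):
            if v not in distance:
                distance[v] = distance[u] + 1
                queue.append(v)
    total = sum(1.0 / d for n, d in distance.items() if n != source)
    return total / (len(nodes) - 1)   # unreachable nodes add 0 to the sum


# Hypothetical toy graph: a 3-cycle a -> b -> c -> a, plus d following a.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
in_deg, out_deg = degrees(edges)
h_a = harmonic_closeness(edges, "a")   # (1/1 + 1/2 + 0) / 3
h_d = harmonic_closeness(edges, "d")   # (1/1 + 1/2 + 1/3) / 3
```

The harmonic form stays well defined when some nodes cannot be reached, which is why it suits graphs with disconnected parts.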
  21. Feature Selection (2)

      ▪ Eigenvector Centrality: extended in-degree centrality that awards higher importance to links coming from more relevant nodes
      ▪ K-core number: the largest integer k such that the node belongs to a subgraph in which all vertices have degree >= k
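The "links from more relevant nodes count more" idea can be sketched with power iteration: each node's score is repeatedly replaced by the sum of its followers' scores and then renormalized. The version below adds the node's own previous score in each step (iterating with A + I), a common trick to help convergence; that choice, the fixed iteration count and the toy graph are assumptions of this sketch.

```python
import math


def eigenvector_centrality(edges, iterations=100):
    """Power iteration on a directed graph; edge (u, v) means u links to v."""
    nodes = sorted({n for edge in edges for n in edge})
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        new = dict(score)             # keep the old score: iterate with A + I
        for u, v in edges:
            new[v] += score[u]        # v inherits importance from its in-link u
        norm = math.sqrt(sum(s * s for s in new.values()))
        score = {n: s / norm for n, s in new.items()}
    return score


# Hypothetical toy graph: c exchanges links with a and b; d links to c
# but receives no links itself.
edges = [("a", "c"), ("c", "a"), ("b", "c"), ("c", "b"), ("d", "c")]
cent = eigenvector_centrality(edges)
```

In this example c ends up with the highest score, a and b tie, and d, which nobody links to, tends to zero even though its out-degree matches that of a and b.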
  22. Experiments and Results (1)

    o Utilizing social graph topology to alert users early on about the dangers of a new connection
      ▪ Classification (malicious vs. honest)
    o Highly imbalanced classes → 4 approaches to deal with the skewed dataset
  23. Experiments and Results (2)

    o Approach 1: SMOTE, oversampling the minority class and undersampling the majority one (potential overfitting)
    o Approach 2: Cost-sensitive learning, a higher penalty for misclassified points of the minority class
    o Approach 3: Combination of 1 & 2
    o Approach 4: Rejection sampling with majority voting; each point is independently included or not with a probability determined by the misclassification cost of its class and the maximum misclassification cost
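Approach 4 can be sketched as follows, assuming the acceptance probability of a point is the misclassification cost of its class divided by the maximum cost. The cost values, class sizes and function names are hypothetical; the deck does not state the costs it used.

```python
import random
from collections import Counter


def rejection_sample(points, labels, cost, rng):
    """Keep each point independently with probability cost[label] / max cost."""
    max_cost = max(cost.values())
    kept = []
    for x, y in zip(points, labels):
        if rng.random() < cost[y] / max_cost:
            kept.append((x, y))
    return kept


def majority_vote(votes):
    """Final label from the predictions of classifiers trained on different samples."""
    return Counter(votes).most_common(1)[0][0]


# Hypothetical setup: 1000 honest vs 50 malicious users, with a 10x higher
# cost for misclassifying a malicious one.
rng = random.Random(0)
labels = ["honest"] * 1000 + ["malicious"] * 50
points = list(range(len(labels)))          # stand-ins for feature vectors
cost = {"honest": 1.0, "malicious": 10.0}

sample = rejection_sample(points, labels, cost, rng)
kept = Counter(y for _, y in sample)
```

Every malicious point is kept (its acceptance probability is 1), while roughly one honest point in ten survives, so each sample is far more balanced than the original data. Repeating the draw, training one classifier per sample and combining their predictions with `majority_vote` gives the voting step named on the slide.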
  24. Experiments and Results (3)

    o 70% sampled from the majority class (honest) and 70% from the minority class (malicious)
    o 30% with natural imbalance
    o C-SVM classifier from LibSVM [1]
    o Performance metrics: F-measure and sensitivity of the minority class instead of accuracy
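The reason for preferring minority-class metrics over accuracy shows up in a small calculation. The sketch below uses hypothetical labels and counts; it is not the evaluation code from the talk.

```python
def minority_metrics(y_true, y_pred, positive="malicious"):
    """Sensitivity (recall) and F-measure of the positive (minority) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    if precision + sensitivity == 0:
        return sensitivity, 0.0
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, f_measure


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical test set: 95 honest and 5 malicious users.
y_true = ["honest"] * 95 + ["malicious"] * 5

# A classifier that labels everyone honest looks good on accuracy only.
always_honest = ["honest"] * 100
acc_trivial = accuracy(y_true, always_honest)
sens_trivial, f_trivial = minority_metrics(y_true, always_honest)

# A classifier that finds 3 of the 5 malicious users with 1 false alarm.
better = ["malicious"] + ["honest"] * 94 + ["malicious"] * 3 + ["honest"] * 2
sens_better, f_better = minority_metrics(y_true, better)
```

The trivial classifier reaches 95% accuracy with zero sensitivity and zero F-measure, whereas the second one scores a sensitivity of 0.6 and an F-measure of 2/3.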
  25. Experiments and Results (4)

    o Why these approaches?
      ▪ A form of sampling is necessary to deal with imbalanced classes
      ▪ Cost-sensitive learning mitigates the skewness of the data
  26. Experiments and Results (6)

    o Appropriate sampling yields a larger increase in performance than cost assignment
    o The combination of cost-sensitive learning and SMOTE with all features gives the highest performance
    o Most important feature: closeness centrality
      ▪ For graphs containing disconnected nodes it is the most representative metric for connectivity and topological position
    o Adding k-core numbers also gives a 27% improvement
  27. Future Work

    o Explore more groups that may prove dangerous
    o Add more features to improve performance (text, multimedia, etc.)
    o In-depth study of social dynamics to determine spreading mechanisms in OSNs in real time (streaming)
  28. A Sneak Peek

    o User. Basic, i.e., avg. # posts, account age, # subscribed lists, interarrival time
    o Text. Basic, i.e., # hashtags, # uppercases, # emoticons, # URLs, sentiment, hate and curse words
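A few of the basic text features listed here can be counted with regular expressions. The patterns below are deliberately simple assumptions (for instance, "# uppercases" is read as the number of uppercase letters, and only a handful of ASCII emoticons are matched); the function name and the sample post are hypothetical.

```python
import re

HASHTAG = re.compile(r"#\w+")
URL = re.compile(r"https?://\S+")
EMOTICON = re.compile(r"[:;=][-']?[)(DPp]")   # e.g. :) ;-) =D :P


def basic_text_features(text):
    """Count hashtags, uppercase letters, emoticons and URLs in one post."""
    return {
        "hashtags": len(HASHTAG.findall(text)),
        "uppercase": sum(ch.isupper() for ch in text),
        "emoticons": len(EMOTICON.findall(text)),
        "urls": len(URL.findall(text)),
    }


# Hypothetical post.
features = basic_text_features("CHECK this out :) #free #win http://example.com")
```

Sentiment and hate or curse word features would need a lexicon or a trained model and are left out of this sketch.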

    This work was funded by the European Union's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No 691025
  30. Some good reads!

    • Jacomy M, Venturini T, Heymann S, Bastian M (2014) ForceAtlas2: A Continuous Graph Layout Algorithm for Handy Network Visualization Designed for the Gephi Software. PLOS ONE 9(6): e98679.
    • Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, Volume 2008, October 2008. IOP Publishing Ltd.
    • Heymann S (2014) Gephi. Encyclopedia of Social Network Analysis and Mining, Springer New York, pp. 612-625.
    • Hu YF (2005) Efficient and high quality force-directed graph drawing. Math J 10: 37-71.