Finding Communities in Networks

Emaad Manzoor
November 11, 2013

90-minute presentation on community detection in networks to the CS340 (Computational Methods in Data Mining) class of Fall 2013, at KAUST.

Transcript

  1. Finding Communities in Networks

    AMCS/CS 340: Data Mining. Emaad Ahmed Manzoor, November 11, 2013.
  2. Algorithms

    Girvan and Newman: Girvan, Michelle, and Mark E. J. Newman. "Community structure in social and biological networks." Proceedings of the National Academy of Sciences 99.12 (2002): 7821-7826.
    Metis: Karypis, George, and Vipin Kumar. "Multilevel graph partitioning schemes." ICPP (3). 1995.
    Metis+MQI: Lang, Kevin, and Satish Rao. "A flow-based method for improving the expansion or conductance of graph cuts." Integer Programming and Combinatorial Optimization. Springer Berlin Heidelberg, 2004. 325-337.

    Surveys

    Leskovec, Jure, Kevin J. Lang, and Michael Mahoney. "Empirical comparison of algorithms for network community detection." Proceedings of the 19th international conference on World wide web. ACM, 2010.
    Fortunato, Santo. "Community detection in graphs." Physics Reports 486.3 (2010): 75-174.
  3. Network Communities/Clusters

    Graph $G(V, E)$. A community is a subset $V' \subset V$ such that the quality or number of links among the members of $V'$ is better or greater than that between $V'$ and $V - V'$.
  4. Network Communities/Clusters

    Community detection requires that the graph be sparsely connected. Intuitively, a densely connected graph cannot be split into communities.
  5. Applications

    Online ad exchanges: cluster advertiser-keyword graphs to recommend keywords for advertisers to bid on.
    Deepayan Chakrabarti. "Clustering Applications at Yahoo!". http://www.slideserve.com/Gabriel/clustering-applications-at-yahoo Retrieved November 3, 2013.
  6. Applications

    The dolphin network of Doubtful Sound, NZ.
    D. Lusseau, K. Schneider, O. J. Boisseau, P. Haase, E. Slooten, and S. M. Dawson, Behavioral Ecology and Sociobiology 54, 396-405 (2003).
  7. Finding Communities/Clusters

    Define an objective function for the network that quantifies the quality of a network partitioning. Optimize the partitioning with the community detection algorithm. The objective is typically NP-hard to optimize.
  8. Girvan & Newman, 2002

    Divisive hierarchical clustering: detect the edges that connect communities and remove them. Detection is based on edge betweenness, the number of shortest paths that pass through an edge.
    Figure (edge betweenness): Figure 10 of Fortunato, Santo. Community detection in graphs. Physics Reports 486.3 (2010): 75-174.
  9. Girvan & Newman, 2002

    1. Compute betweenness for all edges
    2. Remove the edge with the largest betweenness
    3. Recalculate betweenness for all edges
    4. Go to step 2

    Figure (hierarchical graph): Figure 7 of Fortunato, Santo. Community detection in graphs. Physics Reports 486.3 (2010): 75-174.
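The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the graph is assumed to be unweighted and undirected, stored as a dict of neighbour sets, and the names `edge_betweenness`, `components`, and `girvan_newman_split` are made up for this sketch. Betweenness is computed with a Brandes-style BFS accumulation, and the loop stops at the first split instead of building the full dendrogram.

```python
from collections import deque

def edge_betweenness(adj):
    """Edge betweenness of an unweighted, undirected graph (dict of neighbour sets)."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path counts from the farthest vertices back towards s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # Every pair (s, t) was counted once from each endpoint.
    return {e: b / 2.0 for e, b in score.items()}

def components(adj):
    """Connected components as a list of vertex sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def girvan_newman_split(adj):
    """Remove highest-betweenness edges until the graph gains a component."""
    adj = {v: set(ns) for v, ns in adj.items()}      # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        bet = edge_betweenness(adj)                  # recomputed after every removal
        if not bet:
            break                                    # no edges left to remove
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Two triangles joined by the bridge 2-3 (a made-up example graph).
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(girvan_newman_split(graph))
```

On this example the bridge 2-3 carries every shortest path between the two triangles, so it is removed first and the triangles come out as the two communities.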
  10. Girvan & Newman, 2002

    Output: a dendrogram.
    Girvan, Michelle, and Mark E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences 99.12 (2002): 7821-7826.
  11. Girvan & Newman, 2002

    Issues: Edge betweenness is slow to compute ($O(|V||E|)$ using Newman's algorithm). Cannot detect overlapping communities. The hierarchy may not make sense.
  12. Algorithms

    Graph partitioning: minimize a function of the cut size of the partitioning. Requires the number of clusters a priori. Requires the cluster size a priori. Examples: Kernighan-Lin, spectral partitioning, multilevel algorithms.
  13. Measuring Partition Quality

    Intuitively: the ratio of the number of edges leaving the cluster to the number of edges inside it. Lower is better. In the figure, $\phi(A) = 2/6$ and $\phi(B) = 1/5$.
    Figure 1: Leskovec, Jure, Kevin J. Lang, and Michael Mahoney. "Empirical comparison of algorithms for network community detection." Proceedings of the 19th international conference on World wide web. ACM, 2010.
  14. Measuring Partition Quality

    Intuitively: but this ratio alone would encourage large clusters that include most vertices in the graph. We need to penalize clusters that are too large or too small by normalizing with the component size.
    Figure 1: Leskovec, Jure, Kevin J. Lang, and Michael Mahoney. "Empirical comparison of algorithms for network community detection." Proceedings of the 19th international conference on World wide web. ACM, 2010.
  15. Measuring Partition Quality

    Conductance of a graph cut: how community-like is a cluster $S$?
    $C_S$ = cut size, the number of edges leaving $S$.
    $\mathrm{Vol}(S) = \sum_{u \in S} \mathrm{degree}(u)$
    Figure 1: Leskovec, Jure, Kevin J. Lang, and Michael Mahoney. "Empirical comparison of algorithms for network community detection." Proceedings of the 19th international conference on World wide web. ACM, 2010.
  16. Measuring Partition Quality

    Conductance of a graph cut:
    $\phi(S) = \dfrac{C_S}{\min(\mathrm{Vol}(S), \mathrm{Vol}(V - S))}$
    Penalizes over-large clusters: $\mathrm{Vol}(V - S)$ is small.
    Penalizes extra-small clusters: $\mathrm{Vol}(S)$ is small.
    Figure 1: Leskovec, Jure, Kevin J. Lang, and Michael Mahoney. "Empirical comparison of algorithms for network community detection." Proceedings of the 19th international conference on World wide web. ACM, 2010.
  17. Measuring Partition Quality

    Conductance of a graph cut: in the figure, $\phi(A) = 2/14$ and $\phi(B) = 1/11$. Note that $\phi(S) = \phi(V - S)$.
    Figure 1: Leskovec, Jure, Kevin J. Lang, and Michael Mahoney. "Empirical comparison of algorithms for network community detection." Proceedings of the 19th international conference on World wide web. ACM, 2010.
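The conductance formula from the preceding slides can be checked on a small graph. The sketch below assumes an unweighted, undirected graph stored as a dict of neighbour sets. The function name `conductance` and the example graph are illustrative and are not taken from the figure in the slides. It also assumes both sides of the cut have nonzero volume.

```python
def conductance(adj, S):
    """phi(S) = C_S / min(Vol(S), Vol(V - S)) for an unweighted, undirected graph."""
    S = set(S)
    cut = sum(1 for u in S for v in adj[u] if v not in S)       # C_S
    vol_S = sum(len(adj[u]) for u in S)                         # Vol(S)
    vol_rest = sum(len(adj[u]) for u in adj if u not in S)      # Vol(V - S)
    return cut / min(vol_S, vol_rest)

# Two triangles joined by the bridge 2-3 (a made-up example graph).
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
S = {0, 1, 2}
rest = set(graph) - S
print(conductance(graph, S), conductance(graph, rest))
```

Here $C_S = 1$ and both volumes are 7, so both calls give 1/7, which also illustrates the symmetry $\phi(S) = \phi(V - S)$.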
  18. Measuring Partition Quality

    Conductance is one of many tradeoff metrics. Another is expansion:
    $\phi(S) = \dfrac{C_S}{\min(|S|, |V - S|)}$
    There are also hard-balance constraints, such as the 50-50 vertex bipartition (Metis).
  19. Network Community Profile

    Characterizes the quality of network communities as a function of their size. If $f(S)$ = cluster quality (e.g. conductance), then
    $\phi(k) = \min_{|S| = k} f(S), \quad 1 \le k \le |V|/2$
    is the quality of the best cluster having exactly $k$ vertices. The profile plots $\phi(k)$ against $k$.
    Figure 1: Leskovec, Jure, Kevin J. Lang, and Michael Mahoney. "Empirical comparison of algorithms for network community detection." Proceedings of the 19th international conference on World wide web. ACM, 2010.
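For very small graphs the profile can be computed directly from the definition by trying every vertex subset of each size. The sketch below does exactly that, with conductance as the quality function. It is exponential in the graph size, so it only illustrates the definition and is not how the cited paper computes profiles. The function names and the example graph are assumptions of this sketch.

```python
from itertools import combinations

def conductance(adj, S):
    """phi(S) = C_S / min(Vol(S), Vol(V - S)) for an unweighted, undirected graph."""
    S = set(S)
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    vol_S = sum(len(adj[u]) for u in S)
    vol_rest = sum(len(adj[u]) for u in adj if u not in S)
    return cut / min(vol_S, vol_rest)

def community_profile(adj, quality=conductance):
    """Map each size k in 1..|V|/2 to the best (lowest) quality over all |S| = k."""
    n = len(adj)
    return {
        k: min(quality(adj, S) for S in combinations(adj, k))
        for k in range(1, n // 2 + 1)
    }

# Two triangles joined by the bridge 2-3 (a made-up example graph).
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
for k, best in community_profile(graph).items():
    print(k, round(best, 3))
```

On this graph the profile drops from 1.0 at k = 1 to 0.5 at k = 2 and 1/7 at k = 3, the size of the triangles.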
  20. Other Applications of Graph Partitioning

    Low-level vision: segmentation, restoration.
    Figure 5: Boykov, Yuri, and Vladimir Kolmogorov. "An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision." Pattern Analysis and Machine Intelligence, IEEE Transactions on 26.9 (2004): 1124-1137.
  21. Other Applications of Graph Partitioning

    Distributed systems: partition data while minimizing communication overhead.
    Khayyat, Zuhair, et al. Mizan: a system for dynamic load balancing in large-scale graph processing. Proceedings of the 8th ACM European Conference on Computer Systems. ACM, 2013.
  22. METIS

    An implementation of multiresolution partitioning. The goal is to find a balanced bisection such that the cut size is minimized. Very fast!
  23. METIS

    G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar. Multilevel Hypergraph Partitioning: Applications in the VLSI Domain. Presentation at the University of Minnesota.
  24. METIS

    Coarsening phase. $G_0$ is transformed into a sequence of smaller graphs $G_1, G_2, \ldots, G_m$ such that $|V_0| > |V_1| > \cdots > |V_m|$.
    Partitioning phase. A 2-way partition $P_m$ of the graph $G_m$ is computed that partitions $V_m$ into 2 parts, each containing half the vertices of $G_0$.
    Uncoarsening phase. $P_m$ is projected back to $G_0$ by going through intermediate partitions $P_{m-1}, P_{m-2}, \ldots, P_0$.
  25. METIS

    Coarsening: condense multiple nodes $V_i^v$ in $G_i$ to form the multinode $v$ of $G_{i+1}$.
    Weight of $v$ = sum of the weights of the vertices in $V_i^v$.
    Edges of $v$ are the union of the edges of $V_i^v$.
    Multiedges are combined into one, with weight equal to the sum of the component edge weights.
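One coarsening step following these rules can be sketched as below, assuming the pairs of vertices to merge have already been chosen. The graph is assumed to be a symmetric dict of dicts, `adj[u][v]` = edge weight, with vertex weights in `vweight`. The function name `contract` and the integer ids given to multinodes are assumptions of this sketch, not METIS's data structures.

```python
def contract(adj, vweight, matching):
    """Collapse each matched pair (u, v) into one multinode of the coarser graph."""
    coarse_of = {}
    for i, (u, v) in enumerate(matching):
        coarse_of[u] = coarse_of[v] = i
    next_id = len(matching)
    for u in adj:                          # unmatched vertices are carried over alone
        if u not in coarse_of:
            coarse_of[u] = next_id
            next_id += 1

    coarse_adj, coarse_weight = {}, {}
    for u in adj:
        cu = coarse_of[u]
        # Multinode weight = sum of the weights of the vertices it contains.
        coarse_weight[cu] = coarse_weight.get(cu, 0) + vweight[u]
        coarse_adj.setdefault(cu, {})
        for v, w in adj[u].items():
            cv = coarse_of[v]
            if cv != cu:                   # edges inside a multinode disappear
                # Parallel edges merge into one edge whose weight is their sum.
                coarse_adj[cu][cv] = coarse_adj[cu].get(cv, 0) + w
    return coarse_adj, coarse_weight, coarse_of

# A 4-cycle a-b-c-d-a with unit weights; merge (a, b) and (c, d).
square = {
    "a": {"b": 1, "d": 1}, "b": {"a": 1, "c": 1},
    "c": {"b": 1, "d": 1}, "d": {"c": 1, "a": 1},
}
weights = {v: 1 for v in square}
print(contract(square, weights, [("a", "b"), ("c", "d")]))
```

The two multinodes each get weight 2, and the two original edges b-c and d-a merge into a single coarse edge of weight 2.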
  26. METIS

    Coarsening: which vertices do we combine?
  27. METIS

    Matching: a set of edges, no two of which are incident on the same vertex. A maximal matching is one to which no further edge can be added without two edges sharing a vertex.
    Figure: Matching. Retrieved from http://www.cs.indiana.edu/classes/b673/notes/GraphPartitioning.pdf
  28. METIS

    Matching: randomized matching.
    Select vertices in random order.
    If a vertex u has not been matched yet, randomly select one of its unmatched adjacent vertices.
    If such a vertex v exists, add edge (u, v) to the matching and mark u and v as matched.
  29. METIS

    Matching: heavy edge matching.
    Select a matching that has the maximum sum of edge weights, to minimize the number of coarsening levels.
    Heuristic algorithm (no guarantees, good in practice).
    Randomly select u as before, but select the unmatched adjacent vertex v such that (u, v) has the maximum weight among all such v.
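The heavy edge heuristic described above can be sketched as follows. The graph is assumed to be a symmetric dict of dicts, `adj[u][v]` = edge weight. The function name `heavy_edge_matching` and the `seed` parameter are assumptions of this sketch. Replacing the `max` call with a random choice among the free neighbours would give the randomized matching from the previous slide.

```python
import random

def heavy_edge_matching(adj, seed=0):
    """Greedy matching: visit vertices in random order, pair each with its heaviest free neighbour."""
    rng = random.Random(seed)
    order = list(adj)
    rng.shuffle(order)                       # select vertices in random order
    matched, matching = set(), []
    for u in order:
        if u in matched:
            continue
        free = [v for v in adj[u] if v not in matched and v != u]
        if not free:
            continue                         # u stays unmatched at this level
        v = max(free, key=lambda x: adj[u][x])   # heaviest incident edge wins
        matching.append((u, v))
        matched.update((u, v))
    return matching

# A path a-b-c-d with a light middle edge (a made-up example graph).
path = {
    "a": {"b": 5}, "b": {"a": 5, "c": 1},
    "c": {"b": 1, "d": 5}, "d": {"c": 5},
}
print(heavy_edge_matching(path))
```

On this path every visiting order pairs a with b and c with d, so the light edge b-c is the only one that survives into the coarser graph.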
  30. METIS

    Matching: light edge matching. Results in coarse graphs of higher average degree. Such graphs are easier to partition with certain heuristics like Kernighan-Lin.
  31. METIS

    Coarsening: which vertices do we combine? Coarsen using maximal size matchings.
    Figure: Matching. Retrieved from http://www.cs.indiana.edu/classes/b673/notes/GraphPartitioning.pdf
  32. METIS

    Partitioning phase: compute a minimum edge-cut bisection of the coarse graph, such that each part contains roughly half the weight of the original graph. Use any high-quality partitioning algorithm on the coarse graph. Since this graph is small, it doesn't take much time.
  33. METIS

    Partitioning phase: can also employ graph growing heuristics for partitioning.
    Randomly select a vertex and grow a region around it using BFS until half the total vertex weight has been included.
    Or randomly select a vertex and grow a region around it by selecting vertices that lead to a smaller increase in the edge cut.
    Use multiple trials with different initial vertices.
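The BFS variant of graph growing, with multiple trials, might look like the sketch below. It assumes a connected graph stored as a symmetric dict of dicts (`adj[u][v]` = edge weight) with vertex weights in `vweight`. The function names, the `trials` and `seed` parameters, and the example graph are all assumptions of this sketch.

```python
import random
from collections import deque

def grow_bisection(adj, vweight, start):
    """Grow one part by BFS from start until it holds half the total vertex weight."""
    total = sum(vweight.values())
    part, weight = set(), 0
    queue, seen = deque([start]), {start}
    while queue and weight < total / 2:
        u = queue.popleft()
        part.add(u)
        weight += vweight[u]
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return part

def cut_size(adj, part):
    """Total weight of the edges with exactly one endpoint in part."""
    return sum(w for u in part for v, w in adj[u].items() if v not in part)

def best_grown_bisection(adj, vweight, trials=4, seed=0):
    """Run several trials from different random start vertices, keep the smallest cut."""
    rng = random.Random(seed)
    starts = rng.sample(list(adj), min(trials, len(adj)))
    return min((grow_bisection(adj, vweight, s) for s in starts),
               key=lambda p: cut_size(adj, p))

# Two triangles joined by the bridge 2-3, unit weights (a made-up example graph).
graph = {
    0: {1: 1, 2: 1}, 1: {0: 1, 2: 1}, 2: {0: 1, 1: 1, 3: 1},
    3: {2: 1, 4: 1, 5: 1}, 4: {3: 1, 5: 1}, 5: {3: 1, 4: 1},
}
weights = {v: 1 for v in graph}
best = best_grown_bisection(graph, weights, trials=6)
print(best, cut_size(graph, best))
```

With all six vertices tried as starting points, at least one trial recovers a triangle as one half, giving a cut of size 1.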
  34. METIS

    Uncoarsening phase: every multinode in $G_{i+1}$ contains a distinct subset of nodes from $G_i$.
    Obtain $P_i$ from $P_{i+1}$ by simply assigning the nodes collapsed to $v \in G_{i+1}$ to the partition $P_{i+1}[v]$.
    Since $G_i$ has more degrees of freedom, we can refine the partitions $P_i$.
  35. METIS

    Uncoarsening phase, refining partitions: select 2 subsets of vertices, one from each part. Swapping these vertices should result in a partition with a smaller cut size.
  36. METIS

    Uncoarsening phase, refining partitions: based on Kernighan-Lin partitioning.
    Compute a gain for every vertex: the decrease (or increase) in cut size if the vertex is moved to the other part.
    In each iteration, move the vertex with the largest gain out of the larger part, and mark it as used.
    Terminate when x consecutive vertex moves do not decrease the cut size.
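A simplified version of this refinement loop is sketched below. It follows the slide's description (single vertex moves out of the larger part, used vertices locked, stop after a number of non-improving moves) and is not METIS's actual implementation. The graph is assumed to be a symmetric dict of dicts with integer edge weights, `side` maps each vertex to part 0 or 1, and `max_bad_moves` plays the role of x. All names are assumptions of this sketch.

```python
def gain(adj, side, u):
    """Decrease in cut size if u moves to the other part (external minus internal weight)."""
    external = sum(w for v, w in adj[u].items() if side[v] != side[u])
    internal = sum(w for v, w in adj[u].items() if side[v] == side[u])
    return external - internal

def cut_size(adj, side):
    """Total weight of edges whose endpoints lie in different parts."""
    return sum(w for u in adj for v, w in adj[u].items() if side[u] != side[v]) / 2

def refine(adj, side, max_bad_moves=3):
    """Move high-gain vertices out of the larger part; keep the best partition seen."""
    side = dict(side)
    best, best_cut = dict(side), cut_size(adj, side)
    used, bad = set(), 0
    while bad < max_bad_moves:
        sizes = [0, 0]
        for s in side.values():
            sizes[s] += 1
        larger = 0 if sizes[0] >= sizes[1] else 1
        movable = [u for u in adj if side[u] == larger and u not in used]
        if not movable:
            break
        u = max(movable, key=lambda x: gain(adj, side, x))
        side[u] = 1 - side[u]                # move u to the other part
        used.add(u)                          # each vertex moves at most once
        current = cut_size(adj, side)
        if current < best_cut:
            best, best_cut, bad = dict(side), current, 0
        else:
            bad += 1
    return best, best_cut

# Two triangles joined by the bridge 2-3, starting from a poor bisection.
graph = {
    0: {1: 1, 2: 1}, 1: {0: 1, 2: 1}, 2: {0: 1, 1: 1, 3: 1},
    3: {2: 1, 4: 1, 5: 1}, 4: {3: 1, 5: 1}, 5: {3: 1, 4: 1},
}
start = {0: 0, 1: 0, 3: 0, 2: 1, 4: 1, 5: 1}     # cut size 5
print(refine(graph, start))
```

Starting from a cut of size 5, the first two moves (vertex 3, then vertex 2) bring the cut down to the bridge alone. The later moves only make it worse, so the loop stops and returns the best partition seen.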
  37. METIS

    Uncoarsening phase, refining partitions: based on Kernighan-Lin partitioning. Kernighan-Lin is effective in finding locally optimal partitionings when it starts with a fairly good initial partition. It terminates in a few iterations in practice.
  38. MQI

    Max-flow Quotient cut Improvement. Optimizes the expansion (a quotient cut metric) of the graph:
    $\phi(S) = \dfrac{C_S}{\min(|S|, |V - S|)}$
    For a given cut $(A, B)$, finds the best improvement among all cuts $(A', B')$ such that $A' \subset A$.
  39. MQI

    Given a cut $(A, B)$ with $|A| = a$, $|B| = b$, $a \le b$, cut size $c$, and quotient score $q = c/a$.
  40. MQI

    Does any $A' \subset A$ define a cut $(A', B')$ with score $q'$ better than $q$? There are exponentially many subsets $A'$ to consider.
  41. MQI

    There is an exact polynomial-time algorithm that solves this.
    Uriel Feige and Robert Krauthgamer, A polylogarithmic approximation of the minimum bisection, FOCS 2000.
    Chris Harrelson, Kirsten Hildrum, and Satish Rao, A polynomial-time tree decomposition to minimize congestion, SPAA 2003.
  42. MQI

    A call to MQI returns an improved quotient cut, if one exists. We can find the best reachable improved quotient cut by repeatedly feeding the output of MQI back to itself.
    However, finding any cut whose small side (A') contains the small side of the global best quotient cut is NP-hard.
  43. Metis + MQI

    MQI always reduces the balance of a partition. A maximally balanced partition will be a good initial cut. Use Metis to provide this balanced partition.
  44. Metis + MQI

    MQI cannot reach all possible cuts from a given initial cut. Use Metis to provide multiple different starting cuts.
  45. MQI

    Convert this to an S-T max flow problem. Solve it to obtain the total max-flow, and the max-flow in each edge, in near-linear time with hi_pr. Use the S-T problem and its solution to obtain the improved cut.
    Boris V. Cherkassky and Andrew V. Goldberg. On implementing the push-relabel method for the maximum flow problem. Algorithmica, 19(4):390-410, 1997.
  46. MQI

    Convert this to an S-T max flow problem:
    1. Discard all B-side nodes
    2. Discard every edge that used to connect a pair of B-side nodes
    3. Replace every edge that used to connect a pair of A-side nodes with a pair of directed edges, one in each direction, each with capacity a
    4. Add a source S and a sink T
    5. Discard each edge that used to connect a B-side node with an A-side node x, replacing it with a directed edge from S to x, with capacity a
    6. Add a single directed edge from every A-side node to T, with capacity c
  47. MQI

    Convert this to an S-T max flow problem.
    Figure 1: Lang, Kevin, and Satish Rao. A flow-based method for improving the expansion or conductance of graph cuts. Integer Programming and Combinatorial Optimization. Springer Berlin Heidelberg, 2004. 325-337.
  48. MQI

    Given: the input graph, an initial quotient cut $(A, B)$ with $|A| \le |B|$, and the constructed max-flow solution.
    Theorem: there is an improved quotient cut $(A', B')$ with $A' \subset A$ if and only if the maximum flow is $< ca$.
  49. MQI

    Proof (forward): assume the improved quotient cut $(A', B')$ exists. We can show that the maximum flow is $< ca$.
    Let $|A'| = a'$, $|B'| = b'$, and $c'$ = the improved cut size. Then
    $c'/a' < c/a \Rightarrow c'a < a'c$ (1)
  50. MQI

    Proof (forward): consider the net flow into $A'$ on non-sink edges. There are $c'$ such edges, each with capacity $a$, so the net flow is at most $c'a$ (2)
  51. MQI

    Proof (forward): to saturate the sink edges of $A'$, note that there are $a'$ sink edges, each with capacity $c$, so we need a flow of $a'c$ (3)
  52. MQI

    Proof (forward):
    $c'a < a'c$ (1)
    Net flow into $A'$ on non-sink edges $\le c'a$ (2)
    Flow required to saturate the sink edges of $A'$ $= a'c$ (3)
    So the flow entering $A'$ cannot saturate all of its sink edges: at least one sink edge is unsaturated. Since the sink edges of $A$ have total capacity $ca$, the total flow is $< ca$.
  53. MQI

    Proof (backward): assume max-flow $< ca$. We can construct a new cut that has an improved quotient cut score.
    Total capacity of all sink edges $= ca$, and total flow $< ca$, so at least one sink edge is unsaturated.
  54. MQI

    Proof (backward): perform a backwards directed DFS from the sink, moving along an edge only if it is unsaturated. The vertices reachable this way are $A'$. Let $|A'| = a'$; then $a' > 0$.
  55. MQI

    Proof (backward): since there is at least one unsaturated edge from $A'$ to the sink, the total flow from $A'$ into the sink is $F_s < a'c$.
  56. MQI

    Proof (backward): also, every edge feeding into $A'$ from outside $A'$ must be saturated. If not, we could reach the node at its tail via our backward DFS and add it to $A'$. If there are $c'$ such edges, the flow into $A'$ is $F_i = c'a$.
  57. MQI

    Proof (backward): since flow is conserved, all flow out of $A'$ must be carried by the sink edges, so $F_i = F_s$. With $F_i = c'a$ and $F_s < a'c$:
    $c'a < a'c \Rightarrow c'/a' < c/a$
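The construction, the theorem, and the backward search from the proof can be tried end to end on a small graph. The sketch below is an illustration under stated assumptions, not the authors' implementation: it uses a simple Edmonds-Karp max-flow in place of the hi_pr push-relabel solver cited in the slides, assumes an unweighted, undirected graph stored as a dict of neighbour sets, and assumes that the labels "S" and "T" are not already vertex names. All function names are made up for this sketch.

```python
from collections import deque

def build_network(adj, A):
    """Residual capacities res[u][v] of the S-T network for the cut (A, V - A)."""
    A = set(A)
    a = len(A)
    c = sum(1 for u in A for v in adj[u] if v not in A)    # cut size
    res = {x: {} for x in A}
    res["S"], res["T"] = {}, {}        # assumption: "S", "T" are unused labels

    def add_arc(u, v, cap):
        res[u][v] = res[u].get(v, 0) + cap
        res[v].setdefault(u, 0)        # reverse arc for the residual graph

    for u in A:
        for v in adj[u]:
            if v in A:
                add_arc(u, v, a)       # A-A edge: one arc per direction, capacity a
            else:
                add_arc("S", u, a)     # cut edge: replaced by an arc S -> u, capacity a
        add_arc(u, "T", c)             # sink arc, capacity c
    return res, a, c

def max_flow(res, s, t):
    """Edmonds-Karp: augment along shortest residual paths. Mutates res."""
    total = 0
    while True:
        parent, queue = {s: None}, deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, r in res[u].items():
                if r > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return total
        push, v = float("inf"), t
        while parent[v] is not None:               # bottleneck along the path
            push = min(push, res[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:               # update residual capacities
            u = parent[v]
            res[u][v] -= push
            res[v][u] += push
            v = u
        total += push

def mqi_step(adj, A):
    """Return A' (a subset of A with a better quotient score), or None if there is none."""
    res, a, c = build_network(adj, A)
    if max_flow(res, "S", "T") >= a * c:
        return None                                # theorem: no improvement inside A
    reach, stack = {"T"}, ["T"]                    # backward search from the sink
    while stack:
        v = stack.pop()
        for u in res:
            if u not in reach and res[u].get(v, 0) > 0:
                reach.add(u)
                stack.append(u)
    return reach & set(A)

def mqi(adj, A):
    """Feed the output of mqi_step back to itself until no improvement remains."""
    A = set(A)
    while True:
        better = mqi_step(adj, A)
        if better is None:
            return A
        A = better

# Triangle {0,1,2} hangs off vertex 3, which connects to a 4-cycle {4,5,6,7}.
graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5},
    4: {3, 5, 7}, 5: {3, 4, 6}, 6: {5, 7}, 7: {4, 6},
}
print(mqi(graph, {0, 1, 2, 3}))
```

Starting from A = {0, 1, 2, 3} with a = 4 and c = 2, the maximum flow is 6 < ca = 8, and the backward search yields A' = {0, 1, 2} with score 1/3 < 2/4. A second step on {0, 1, 2} saturates every sink arc (flow 3 = ca), so the loop stops there.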
  58. Metis + MQI

    The empirical complexity of multi-try Metis + MQI will be linear, because:
    Metis ~ linear in practice
    Max-flow solver hi_pr ~ linear in practice
    MQI loop ~ sublogarithmic in practice
  59. Summary

    What is a community?
    Community detection algorithms
    Hierarchical clustering: Girvan & Newman
    Graph partitioning
    Measuring partition quality: conductance, NCP
    Metis
    Metis + MQI