Community Identification in Networks


Gianluca Campanella

July 24, 2009

Transcript

  1. Community Identification in Networks. Gianluca Campanella, July 24, 2009

  2. Contents

    1. Introduction
    2. Community identification methods: edge removal methods, spectral optimization
    3. The concept of modularity: classical definition, new definition
  3. Introduction: Introduction and motivation

    Complex networks are now studied in many fields of knowledge: biology, computer and information sciences, ecology, epidemiology, management sciences, medicine, and sociology.
  4. Introduction: Graphs

    Adjacency matrix $A = [a_{ij}]$, $i, j = 1, \dots, n$. Example (graph on vertices 1 to 5): $a_{12} = 1$, $a_{13} = 0$.
  5. Introduction: Graphs

    Adjacency matrix $A = [a_{ij}]$, $i, j = 1, \dots, n$. Vertex degrees $a_i = \sum_j a_{ij}$. Example (graph on vertices 1 to 5): $a_2 = 4$.
  6. Introduction: Graphs

    Adjacency matrix $A = [a_{ij}]$, $i, j = 1, \dots, n$. Vertex degrees $a_i = \sum_j a_{ij}$. Graph size $a = \sum_i a_i = 2m$. Example (graph on vertices 1 to 5): $a = 14$.
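
The three definitions above can be sketched in a few lines of plain Python. The edge list is a hypothetical 5-vertex graph chosen only so that it reproduces the example values quoted on the slides ($a_{12} = 1$, $a_{13} = 0$, $a_2 = 4$, $a = 14$); the graph actually drawn in the figure is not recoverable from the transcript.

```python
# Hypothetical 5-vertex graph (an assumption, not the slide's figure),
# chosen to match the quoted values a_12 = 1, a_13 = 0, a_2 = 4, a = 14.
edges = [(1, 2), (2, 3), (2, 4), (2, 5), (3, 4), (4, 5), (1, 5)]
n = 5

# Adjacency matrix A = [a_ij], stored 0-based (vertex v -> index v - 1).
A = [[0] * n for _ in range(n)]
for u, v in edges:
    A[u - 1][v - 1] = 1
    A[v - 1][u - 1] = 1

# Vertex degrees a_i = sum_j a_ij (row sums of A).
degrees = [sum(row) for row in A]

# Graph size a = sum_i a_i, which equals 2m for an undirected graph.
a = sum(degrees)
m = len(edges)

print(degrees)   # degree of each vertex, in order 1..5
print(a, 2 * m)  # the two numbers agree
```

Storing the graph as a dense matrix mirrors the notation of the slides; for large sparse networks an adjacency list would be the more usual choice.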
  9. Community identification methods: Edge removal methods. Girvan-Newman algorithm

    M. Girvan and M. E. J. Newman, Proc. Natl. Acad. Sci. 99 (2002).

    The shortest path between two vertices is a sequence of connected vertices that minimizes the number of its constituent edges. For each edge, the edge betweenness is defined as the number of shortest paths passing through it.

    Algorithm for identifying the communities: at each step, remove the edge with the highest betweenness.

    [Figure: example graph on vertices 1 to 7]
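
A minimal sketch of the edge removal procedure follows, in plain Python. It is not the authors' code: the function names, the 7-vertex example graph, and the choice to split credit equally among tied shortest paths are all assumptions made for illustration. Betweenness is accumulated with one breadth-first search per source vertex.

```python
from collections import deque


def edge_betweenness(adj):
    """Edge betweenness of an undirected, unweighted graph.

    adj maps each vertex to the set of its neighbours. When a pair of
    vertices has several shortest paths, each path gets an equal
    fraction (an assumption the slide does not spell out).
    """
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path fractions from the farthest vertices back to s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += share
                delta[v] += share
    # Every unordered pair of vertices was visited from both endpoints.
    return {e: b / 2.0 for e, b in bet.items()}


def components(adj):
    """Connected components as a list of vertex sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = {s}, [s]
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if w not in comp:
                    comp.add(w)
                    stack.append(w)
        seen |= comp
        comps.append(comp)
    return comps


def girvan_newman_split(adj):
    """Remove highest-betweenness edges until the graph gains a component."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    start = len(components(adj))
    removed = []
    while len(components(adj)) == start and any(adj.values()):
        bet = edge_betweenness(adj)  # recomputed after every removal
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
        removed.append(frozenset((u, v)))
    return components(adj), removed


# Hypothetical 7-vertex graph: two dense groups joined by the edge 4-5.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4),
         (4, 5), (5, 6), (5, 7), (6, 7)]
adj = {v: set() for v in range(1, 8)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

bet = edge_betweenness(adj)
communities, removed = girvan_newman_split(adj)
print(bet[frozenset((4, 5))])
print(communities, removed)
```

In this example every shortest path between the two groups crosses the edge 4-5, so that edge has the highest betweenness and is removed first, which already splits the graph in two.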
  10. Community identification methods: Edge removal methods. The karate club case study I

    Girvan-Newman algorithm: K = 2 communities.
  12. Community identification methods: Spectral optimization. Cut size minimization

    The cut size $R$ is the number of inter-group edges (in this case, the number of edges between the two groups of vertices). The index vector $h$ has elements

    $$h_i = \begin{cases} +1 & \text{if vertex } i \text{ belongs to the first group} \\ -1 & \text{if vertex } i \text{ belongs to the second group} \end{cases}$$

    Example (graph on vertices 1 to 7): $R = 1$, $h = (1, 1, 1, 1, -1, -1, -1)$.
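
As a small illustration, the cut size can be read off the index vector directly: an edge is inter-group exactly when its endpoints carry opposite signs. The edge list below is a hypothetical 7-vertex graph (an assumption, since the figure is lost) that is consistent with the example values $R = 1$ and $h = (1, 1, 1, 1, -1, -1, -1)$.

```python
# Hypothetical 7-vertex graph with groups {1, 2, 3, 4} and {5, 6, 7}.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4),
         (4, 5), (5, 6), (5, 7), (6, 7)]

# Index vector h: +1 for the first group, -1 for the second.
h = {1: 1, 2: 1, 3: 1, 4: 1, 5: -1, 6: -1, 7: -1}

# Direct count of edges whose endpoints lie in different groups.
R = sum(1 for u, v in edges if h[u] != h[v])

# Same count in closed form: (1 - h_u h_v) / 2 is 1 across groups, 0 within.
R_closed = sum((1 - h[u] * h[v]) // 2 for u, v in edges)

print(R, R_closed)
```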
  14. Community identification methods: Spectral optimization. Spectral cut size minimization

    We define the Laplacian matrix $L$ with elements

    $$l_{ij} = \begin{cases} a_i & i = j \\ -a_{ij} & i \neq j \end{cases}$$

    The cut size $R$ can then be written in terms of the index vector $h$ and the Laplacian matrix $L$ as

    $$R = \frac{1}{4} h^T L h,$$

    which is equivalent to

    $$R = \frac{1}{4} \sum_i \alpha_i^2 \lambda_i,$$

    where $\lambda_i$ is the eigenvalue of $L$ corresponding to the orthonormal eigenvector $u_i$, and we assume $0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n$. Here $h = \sum_i \alpha_i u_i$ with $\alpha_i = u_i^T h$, subject to the constraint $\sum_i \alpha_i^2 = n$.
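
The quadratic form can be checked numerically without an eigensolver. The sketch below builds the Laplacian of a hypothetical 7-vertex graph (an assumption; it is not the slide's figure), evaluates $\frac{1}{4} h^T L h$, and also checks two facts behind the spectral form: the all-ones vector is an eigenvector with eigenvalue 0, and $h^T h = n$, which is where the constraint $\sum_i \alpha_i^2 = n$ comes from when the $u_i$ are orthonormal.

```python
# Hypothetical 7-vertex graph with groups {1, 2, 3, 4} and {5, 6, 7}.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4),
         (4, 5), (5, 6), (5, 7), (6, 7)]
n = 7
h = [1, 1, 1, 1, -1, -1, -1]  # index vector, 0-based storage

# Adjacency matrix and degrees.
A = [[0] * n for _ in range(n)]
for u, v in edges:
    A[u - 1][v - 1] = 1
    A[v - 1][u - 1] = 1
deg = [sum(row) for row in A]

# Laplacian: l_ii = a_i on the diagonal, l_ij = -a_ij elsewhere.
L = [[deg[i] if i == j else -A[i][j] for j in range(n)] for i in range(n)]

# Cut size as the quadratic form R = (1/4) h^T L h.
Lh = [sum(L[i][j] * h[j] for j in range(n)) for i in range(n)]
R = sum(h[i] * Lh[i] for i in range(n)) / 4

# Each row of L sums to zero, so L applied to the all-ones vector is 0.
L_ones = [sum(row) for row in L]

# Squared norm of h, which equals n because every entry is +1 or -1.
h_norm_sq = sum(x * x for x in h)

print(R, L_ones, h_norm_sq)
```

The identity holds because $h^T L h = \sum_{\text{edges } (u,v)} (h_u - h_v)^2$, and each inter-group edge contributes $(\pm 2)^2 = 4$.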
  15. The concept of modularity: Classical definition. Modularity

    Definition (Modularity). The modularity $Q$ is defined as

    $$Q = \frac{1}{a} \sum_{i,j} (a_{ij} - p_{ij}) \, \delta_c(i, j) \quad \text{with} \quad p_{ij} = \frac{a_i a_j}{a},$$

    where:
    $a_{ij}$ describes the actual number of edges between vertices $v_i$ and $v_j$;
    $p_{ij}$ describes the expected number of edges between vertices $v_i$ and $v_j$, as a function of their degrees $a_i$ and $a_j$, respectively;
    the matrix $P = [p_{ij}]$ describes the so-called null model of the graph;
    $\delta_c(i, j) = 1$ if the two vertices $v_i$ and $v_j$ belong to the same community, and 0 otherwise.
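
The definition translates almost literally into code. The sketch below is illustrative only: the function name `modularity`, the label-list encoding of communities, and the 7-vertex example graph are assumptions, not part of the talk.

```python
def modularity(A, community):
    """Classical modularity Q = (1/a) sum_ij (a_ij - a_i a_j / a) delta_c(i, j).

    A is a symmetric 0/1 adjacency matrix; community[i] is the label of
    the community that vertex i belongs to.
    """
    n = len(A)
    deg = [sum(row) for row in A]
    a = sum(deg)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if community[i] == community[j]:  # delta_c(i, j) = 1
                total += A[i][j] - deg[i] * deg[j] / a
    return total / a


# Hypothetical 7-vertex graph: two dense groups joined by the edge 4-5.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4),
         (4, 5), (5, 6), (5, 7), (6, 7)]
n = 7
A = [[0] * n for _ in range(n)]
for u, v in edges:
    A[u - 1][v - 1] = 1
    A[v - 1][u - 1] = 1

q_two = modularity(A, [0, 0, 0, 0, 1, 1, 1])  # groups {1,2,3,4} and {5,6,7}
q_one = modularity(A, [0] * n)                # everything in one community

print(q_two, q_one)
```

Placing every vertex in a single community always gives $Q = 0$, because both $\sum_{i,j} a_{ij}$ and $\sum_{i,j} p_{ij}$ equal $a$.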
  16. The concept of modularity: Classical definition. Community decomposition of the modularity

    Given a partition $\{S_I, I = 1, \dots, K\}$ of the network into $K$ communities, define the symmetric matrix $E$ of order $K$ with elements

    $$e_{IJ} = \frac{1}{a} \sum_{i \in S_I} \sum_{j \in S_J} a_{ij}.$$

    For $I \neq J$, $e_{IJ}$ is half the fraction of all edges connecting vertices in community $S_I$ to vertices in community $S_J$.

    Theorem (Community decomposition of the modularity). The modularity of the network can be written as

    $$Q = \sum_I \left( e_{II} - e_I^2 \right),$$

    where $e_{II}$ corresponds to the fraction of all edges within community $S_I$, and $e_I = \sum_J e_{IJ}$ corresponds to the sum of $e_{II}$ and $\sum_{J \neq I} e_{IJ}$.
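
The theorem can be checked on a small example by computing $Q$ both ways with exact rational arithmetic. The graph and the partition below are hypothetical (assumptions for illustration); `Fraction` is used so that the two results can be compared for exact equality.

```python
from fractions import Fraction

# Hypothetical 7-vertex graph, stored 0-based (vertex v -> index v - 1).
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4),
         (4, 5), (5, 6), (5, 7), (6, 7)]
n = 7
A = [[0] * n for _ in range(n)]
for u, v in edges:
    A[u - 1][v - 1] = 1
    A[v - 1][u - 1] = 1
deg = [sum(row) for row in A]
a = sum(deg)

# Partition into K = 2 communities (indices are 0-based).
parts = [[0, 1, 2, 3], [4, 5, 6]]

# e_IJ = (1/a) * sum of a_ij over i in S_I and j in S_J.
e = [[Fraction(sum(A[i][j] for i in SI for j in SJ), a) for SJ in parts]
     for SI in parts]

# e_I = sum_J e_IJ (row sums of E).
e_row = [sum(row) for row in e]

# Decomposed form: Q = sum_I (e_II - e_I^2).
q_decomposed = sum(e[I][I] - e_row[I] ** 2 for I in range(len(parts)))

# Direct form: Q = (1/a) sum_ij (a_ij - a_i a_j / a) delta_c(i, j).
label = {v: I for I, S in enumerate(parts) for v in S}
q_direct = sum(Fraction(A[i][j]) - Fraction(deg[i] * deg[j], a)
               for i in range(n) for j in range(n)
               if label[i] == label[j]) / a

print(q_decomposed, q_direct)
```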
  17. The concept of modularity: Classical definition. Agglomerative hierarchical clustering

    M. E. J. Newman, Phys. Rev. E 69 (2004).

    Algorithm for identifying the communities: at each step, compute the change in $Q$ that would result from joining any two communities, denoted $\Delta Q_{IJ}$. Then join the pair producing the largest increase.

    Example: [Figure: dendrogram, annotated with $Q$]
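
A compact sketch of this greedy procedure is given below. It is an illustration under stated assumptions, not the published implementation: it uses the identity $\Delta Q_{IJ} = 2 (e_{IJ} - e_I e_J)$, which follows from the community decomposition $Q = \sum_I (e_{II} - e_I^2)$, only considers pairs of communities joined by at least one edge, and runs on a hypothetical 7-vertex graph. The function name and return values are invented for the example.

```python
def greedy_modularity(A):
    """Greedy agglomeration: repeatedly join the pair with the largest Delta Q.

    Returns the best partition found, its modularity, and the value of Q
    after every step (starting from one community per vertex).
    """
    n = len(A)
    a = sum(map(sum, A))
    comms = {i: [i] for i in range(n)}  # community id -> member vertices
    # e[I][J] for the current communities; entries of ids that have been
    # merged away become stale and are never read again.
    e = {i: {j: A[i][j] / a for j in range(n)} for i in range(n)}
    q = sum(e[i][i] - sum(e[i].values()) ** 2 for i in comms)
    history = [q]
    best_q, best = q, [list(c) for c in comms.values()]
    while len(comms) > 1:
        row = {I: sum(e[I][J] for J in comms) for I in comms}  # e_I
        candidates = [(2 * (e[I][J] - row[I] * row[J]), I, J)
                      for I in comms for J in comms
                      if I < J and e[I][J] > 0]
        if not candidates:  # remaining communities are disconnected
            break
        dq, I, J = max(candidates)
        # Merge J into I by adding the corresponding rows and columns.
        for K in comms:
            if K != I and K != J:
                e[I][K] += e[J][K]
                e[K][I] = e[I][K]
        e[I][I] += e[J][J] + 2 * e[I][J]
        comms[I] += comms.pop(J)
        q += dq
        history.append(q)
        if q > best_q:
            best_q, best = q, [list(c) for c in comms.values()]
    return best, best_q, history


# Hypothetical 7-vertex graph, stored 0-based (vertex v -> index v - 1).
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4),
         (4, 5), (5, 6), (5, 7), (6, 7)]
n = 7
A = [[0] * n for _ in range(n)]
for u, v in edges:
    A[u - 1][v - 1] = 1
    A[v - 1][u - 1] = 1

best, best_q, history = greedy_modularity(A)
print(best, best_q)
print(history)
```

The list `history` plays the role of the $Q$ axis of the dendrogram: the partition is read off at the step where $Q$ peaks, and the final join of all vertices into one community brings $Q$ back to 0.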
  18. The concept of modularity: Classical definition. Agglomerative hierarchical clustering (improved version)

    Aaron Clauset, M. E. J. Newman, and Cristopher Moore, Phys. Rev. E 70 (2004).

    Maintain and update a matrix $\Delta$ with elements $\delta_{IJ} = \Delta Q_{IJ}$, indexing the largest element in each row and in the whole matrix for fast retrieval.

    Initialization rule for $\Delta$: initially, for all pairs of communities $S_I \neq S_J$ (each consisting of a single vertex, $v_i$ and $v_j$ respectively) we have

    $$\delta_{IJ} = \frac{2 a_{ij}}{a} - \frac{2 a_i a_j}{a^2}.$$

    Updating rule for $\Delta$: upon joining communities $S_I$ and $S_J$ into $S_{I \cup J}$, for all communities $S_K \neq S_I, S_J$ we have

    $$\delta_{(I \cup J)K} = \delta_{IK} + \delta_{JK}.$$
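
Both rules can be verified on a small example by comparing them with $\Delta Q$ computed from scratch. The sketch below assumes the identity $\Delta Q_{IJ} = 2 (e_{IJ} - e_I e_J)$, which follows from the community decomposition of $Q$, stores $\Delta$ as a dense dictionary over all vertex pairs, and uses a hypothetical 7-vertex graph; none of the names come from the talk.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical 7-vertex graph, stored 0-based (vertex v -> index v - 1).
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4),
         (4, 5), (5, 6), (5, 7), (6, 7)]
n = 7
A = [[0] * n for _ in range(n)]
for u, v in edges:
    A[u - 1][v - 1] = 1
    A[v - 1][u - 1] = 1
deg = [sum(row) for row in A]
a = sum(deg)


def delta_q(SI, SJ):
    """Change in Q from joining vertex sets SI and SJ: 2 (e_IJ - e_I e_J)."""
    e_ij = Fraction(sum(A[i][j] for i in SI for j in SJ), a)
    e_i = Fraction(sum(deg[i] for i in SI), a)
    e_j = Fraction(sum(deg[j] for j in SJ), a)
    return 2 * (e_ij - e_i * e_j)


# Initialization rule: delta_IJ = 2 a_ij / a - 2 a_i a_j / a^2.
delta = {}
for i, j in combinations(range(n), 2):
    delta[i, j] = Fraction(2 * A[i][j], a) - Fraction(2 * deg[i] * deg[j], a * a)
init_ok = all(delta[i, j] == delta_q([i], [j]) for i, j in delta)


def d(x, y):
    """Look up the symmetric entry delta_xy."""
    return delta[min(x, y), max(x, y)]


# Updating rule: join vertices 6 and 7 (indices 5 and 6) and compare
# delta_IK + delta_JK with Delta Q recomputed for the merged community.
I, J = 5, 6
update_ok = all(d(I, K) + d(J, K) == delta_q([I, J], [K])
                for K in range(n) if K not in (I, J))

print(init_ok, update_ok, delta[5, 6])
```

The updating rule is additive because both $e_{IK}$ and $e_I$ simply add up when two communities are merged.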
  19. The concept of modularity: Classical definition. The karate club case study II

    Agglomerative hierarchical clustering: K = 3 communities. All triangles but vertex 9 were previously squares.
  20. The concept of modularity: New definition. A new definition of modularity

    Self-loops, which we assume to be absent in the original graph, contribute negatively to the modularity $Q$: since $a_{ii} = 0$,

    $$\left( \frac{a_{ii}}{a} - \frac{a_i a_i}{a^2} \right) \delta_c(i, i) = -\left( \frac{a_i}{a} \right)^2.$$

    Definition (New null model matrix). We can apply our diagonal diffusion operator to the matrix $P$, obtaining a new null model matrix $R$ with null diagonal elements and off-diagonal elements $r_{ij}$ of the form

    $$r_{ij} = \frac{a_i a_j}{a} + \frac{1}{n - 2} \left[ \frac{a_i^2 + a_j^2}{a} - \frac{1}{n - 1} \sum_k \frac{a_k^2}{a} \right], \quad i \neq j.$$

    The transformation from $P$ to $R$ preserves row (and column) sums.
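
The row-sum property can be checked exactly for any degree sequence. The sketch below uses a hypothetical list of degrees (an assumption; only the degrees enter the null model) and builds $P$ and $R$ with $r_{ij} = \frac{a_i a_j}{a} + \frac{1}{n-2}\left[\frac{a_i^2 + a_j^2}{a} - \frac{1}{n-1}\sum_k \frac{a_k^2}{a}\right]$ for $i \neq j$ and $r_{ii} = 0$.

```python
from fractions import Fraction

# Hypothetical degree sequence a_i of a 7-vertex graph.
deg = [2, 3, 3, 3, 3, 2, 2]
n = len(deg)
a = sum(deg)
sum_sq = sum(x * x for x in deg)  # sum_k a_k^2

# Classical null model: p_ij = a_i a_j / a (diagonal included).
P = [[Fraction(deg[i] * deg[j], a) for j in range(n)] for i in range(n)]


def r(i, j):
    """Entry of the new null model: zero diagonal, diffused elsewhere."""
    if i == j:
        return Fraction(0)
    spread = (Fraction(deg[i] ** 2 + deg[j] ** 2, a)
              - Fraction(sum_sq, (n - 1) * a))
    return P[i][j] + spread / (n - 2)


R = [[r(i, j) for j in range(n)] for i in range(n)]

# Row sums of P and R both equal the vertex degrees.
rows_preserved = all(sum(R[i]) == sum(P[i]) == deg[i] for i in range(n))
symmetric = all(R[i][j] == R[j][i] for i in range(n) for j in range(n))

print(rows_preserved, symmetric)
```

Intuitively, the mass $a_i^2 / a$ removed from each diagonal entry of $P$ is spread over the off-diagonal entries in such a way that every row still sums to $a_i$.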
  21. The concept of modularity: New definition. Agglomerative hierarchical clustering, new modularity

    As for the classical modularity definition, we can decompose the new modularity at the single community level. Moreover, we can devise initialization and updating rules for the improved version of the agglomerative hierarchical clustering algorithm.

    Initialization rule for $\Delta$: initially, for all pairs of communities $S_{I=i} \neq S_{J=j}$ we have

    $$\delta_{IJ} = \frac{2 a_{ij}}{a} - \frac{2 a_i a_j}{a^2} - \frac{2}{a^2 (n - 1)(n - 2)} \left[ (n - 1) \left( a_i^2 + a_j^2 \right) - \sum_k a_k^2 \right].$$

    Updating rule for $\Delta$: the same as before!
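
The initialization rule for the new modularity follows from the new null model in the same way as in the classical case: joining the singleton communities of $v_i$ and $v_j$ adds the two symmetric terms $(i, j)$ and $(j, i)$ to $Q$. A short derivation, starting from $r_{ij} = \frac{a_i a_j}{a} + \frac{1}{n-2}\left[\frac{a_i^2 + a_j^2}{a} - \frac{1}{n-1}\sum_k \frac{a_k^2}{a}\right]$ and writing $k$ for the summation index (a notational choice made here to avoid a clash with the free index $i$):

```latex
\begin{align*}
\delta_{IJ}
  &= \frac{2}{a} \left( a_{ij} - r_{ij} \right) \\
  &= \frac{2 a_{ij}}{a} - \frac{2 a_i a_j}{a^2}
     - \frac{2}{a (n - 2)} \left[ \frac{a_i^2 + a_j^2}{a}
     - \frac{1}{n - 1} \sum_k \frac{a_k^2}{a} \right] \\
  &= \frac{2 a_{ij}}{a} - \frac{2 a_i a_j}{a^2}
     - \frac{2}{a^2 (n - 1)(n - 2)}
       \left[ (n - 1) \left( a_i^2 + a_j^2 \right) - \sum_k a_k^2 \right].
\end{align*}
```

The first two terms are the classical initialization rule; the third is the correction introduced by moving the diagonal of $P$ onto the off-diagonal entries.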
  22. The concept of modularity: New definition. The karate club case study III

    Agglomerative hierarchical clustering, new modularity: K = 4 communities. All hexagons were previously circles.
  24. End: Conclusions

    Our refined modularity seems to be a better heuristic for the agglomerative hierarchical clustering algorithm than the classical one.

    Directions for future research: other modularity optimization methods; modularity as an objective function; dynamic complex networks.
  25. End Thank you for your attention!

  26. Additional material: A case study. Scientific collaboration network: computer science in Italy

    Data has been collected from two publicly accessible databases, namely:
    1. Cerca Università
    2. The DBLP Computer Science Bibliography

    The graph obtained has:
    n = 1741 vertices;
    m = 3878 edges;
    89 connected components, with a giant central component encompassing more than 80% of the whole graph;
    66 small components of order 2 and 3.
  27. Additional material A case study Communities of the giant connected

    component Overview Nodes represent communities
  28. Additional material A case study Communities of the giant connected

    component One of the central communities in detail
  34. Additional material: A case study. Universities and SSDs inside communities

    [Figure: number of communities by number of different universities]

    There are 12 communities containing authors from one single university, and there are 44 communities containing authors from two different universities. Around 45% of the communities contain authors from at most two universities.

    [Figure: distribution of the number of different SSDs]

    There are 38 communities containing authors from one single SSD, and there are 39 communities containing authors from two different SSDs. More than 60% of the communities contain authors belonging to at most two SSDs.
  35. Additional material: Algorithmic complexity. Girvan-Newman algorithm

    The betweenness can be computed in unweighted graphs in time $O(mn)$ using the fast algorithm of Newman. Since this calculation has to be repeated once for the removal of each edge, the entire algorithm runs in worst-case time $O(m^2 n)$.
  36. Additional material: Algorithmic complexity. Agglomerative hierarchical clustering

    We need only consider pairs of connected communities, of which there will be at any time at most $m$. The change in $Q$ upon joining two communities can be computed in constant time. Following a join, we will need to update up to $n$ of the matrix elements $e_{IJ}$ by adding together the rows and columns corresponding to the joined communities.

    Thus, each step of the algorithm takes worst-case time $O(m + n)$. There are a maximum of $n - 1$ join operations necessary to construct the complete dendrogram, and hence the entire algorithm runs in time $O((m + n) n)$, or $O(n^2)$ on a sparse graph.
  37. Additional material: Algorithmic complexity. Agglomerative hierarchical clustering (improved version)

    This version of the algorithm is extremely efficient, scaling as $O(md \log n)$, where $d$ is the depth of the dendrogram. Furthermore, since many real-world networks are sparse and hierarchical, with $m \propto n$ and $d \propto \log n$, this reduces essentially to linear time, $O(n \log^2 n)$.

    This algorithm has been successfully applied in the study of a recommender network of books from a large on-line retailer, with more than 400,000 vertices and 2 million edges.
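
To get a feel for the gap between the three bounds, one can plug in the size of the book network quoted above. This is back-of-the-envelope arithmetic only: constant factors are ignored, the logarithm is taken in base 2 (an assumption), and the improved algorithm is evaluated with the sparse, hierarchical estimate $n \log^2 n$ rather than a measured dendrogram depth.

```python
import math

# Sizes quoted on the slide for the book recommender network.
n = 400_000      # vertices
m = 2_000_000    # edges

# Leading-order operation counts, ignoring all constant factors.
girvan_newman = m ** 2 * n           # O(m^2 n)
greedy = (m + n) * n                 # O((m + n) n)
improved = n * math.log2(n) ** 2     # O(n log^2 n), assuming m ∝ n, d ∝ log n

print(f"{girvan_newman:.1e}  {greedy:.1e}  {improved:.1e}")
```

The three estimates differ by several orders of magnitude each, which is why only the improved version is practical at this scale.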