# Community Identification in Networks

July 24, 2009

## Transcript

2. ### Contents

1 Introduction; 2 Community identification methods (edge removal methods, spectral optimization); 3 The concept of modularity (classical definition, new definition)
3. ### Introduction: Introduction and motivation

Complex networks are now studied in many fields of knowledge: biology, computer and information sciences, ecology, epidemiology, management sciences, medicine, and sociology.
4. ### Introduction: Graphs

Adjacency matrix $A = [a_{ij}]$, $i, j = 1, \dots, n$. Example (graph on vertices 1 to 5): $a_{12} = 1$, $a_{13} = 0$.
5. ### Introduction: Graphs

Vertex degrees $a_i = \sum_j a_{ij}$. Example (same graph): $a_2 = 4$.
6. ### Introduction: Graphs

Graph size $a = \sum_i a_i = 2m$. Example (same graph): $a = 14$.
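As a minimal sketch of these three definitions, the Python below builds the adjacency matrix of a small graph and derives the vertex degrees and the graph size from it. The edge list is hypothetical: it was chosen only so that the values quoted on the slides ($a_{12} = 1$, $a_{13} = 0$, $a_2 = 4$, $a = 14$) come out, and it is not necessarily the graph drawn in the original figure. The function name `adjacency_matrix` is likewise an illustration.

```python
def adjacency_matrix(n, edges):
    """Return the symmetric 0/1 adjacency matrix of an undirected graph.

    Vertices are labelled 1..n, as on the slides.
    """
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u - 1][v - 1] = 1
        A[v - 1][u - 1] = 1
    return A


# Hypothetical 5-vertex graph with m = 7 edges (an assumption, see above).
edges = [(1, 2), (2, 3), (2, 4), (2, 5), (3, 4), (4, 5), (1, 5)]
A = adjacency_matrix(5, edges)

# Vertex degrees a_i = sum_j a_ij (row sums of A).
degrees = [sum(row) for row in A]

# Graph size a = sum_i a_i, which equals 2m because each edge is counted twice.
size = sum(degrees)

print(degrees, size)
```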
7. ### Community identification methods: Edge removal methods

Girvan-Newman algorithm (M. Girvan and M. E. J. Newman, Proc. Natl. Acad. Sci. 99 (2002)). The shortest path between two vertices is a sequence of connected vertices that minimizes the number of its constituent edges. [Figure: example graph on vertices 1 to 7]
8. ### Community identification methods: Edge removal methods

For each edge, the edge betweenness is defined as the number of shortest paths passing through it.
9. ### Community identification methods: Edge removal methods

Algorithm for identifying the communities: at each step, remove the edge with the highest betweenness.
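The removal loop described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: when several shortest paths join the same pair of vertices, each path is given an equal fractional weight (a common convention that the slides do not spell out), the loop simply stops once the graph has split into `k` connected components, and the 7-vertex edge list is hypothetical. All function names are made up for this example.

```python
from collections import deque


def build_adjacency(edges):
    """Map each vertex to the set of its neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def edge_betweenness(adj):
    """Shortest-path betweenness of every edge (BFS from each source)."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist = {s: 0}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in adj}  # predecessors on shortest paths
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path counts back towards the source, farthest first.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += share
                delta[v] += share
    # Every vertex pair was visited from both endpoints, so halve the totals.
    return {e: b / 2.0 for e, b in bet.items()}


def components(adj):
    """Connected components as a list of vertex sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def girvan_newman_split(edges, k=2):
    """Remove highest-betweenness edges until k components remain."""
    adj = build_adjacency(edges)
    while len(components(adj)) < k:
        bet = edge_betweenness(adj)   # recomputed after every removal
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Hypothetical graph: a dense group {1, 2, 3, 4}, a triangle {5, 6, 7},
# and a single bridge 4-5 between them.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4),
         (4, 5), (5, 6), (5, 7), (6, 7)]
communities = girvan_newman_split(edges, k=2)
print(sorted(sorted(c) for c in communities))
```

In this example every shortest path between the two groups crosses the bridge 4-5, so that edge has the highest betweenness and is removed first.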
10. ### Community identification methods: Edge removal methods

The karate club case study I. Girvan-Newman algorithm: $K = 2$ communities.
11. ### Community identification methods: Spectral optimization

Cut size minimization. The cut size $R$ is the number of inter-group edges (in this case, the number of edges between the two groups of vertices). Example (graph on vertices 1 to 7): $R = 1$.
12. ### Community identification methods: Spectral optimization

The index vector $h$ has elements $h_i = 1$ if vertex $i$ belongs to the first group and $h_i = -1$ if vertex $i$ belongs to the second group. Example (same graph): $R = 1$, $h = (1, 1, 1, 1, -1, -1, -1)$.
13. ### Community identification methods: Spectral optimization

Spectral cut size minimization. We define the Laplacian matrix $L$ with elements $l_{ij} = a_i$ if $i = j$ and $l_{ij} = -a_{ij}$ if $i \neq j$. The cut size $R$ can then be written in terms of the index vector $h$ and the Laplacian matrix $L$ as $R = \frac{1}{4} h^T L h$, which is equivalent to $R = \frac{1}{4} \sum_i \alpha_i^2 \lambda_i$, where $\lambda_i$ is the eigenvalue of $L$ corresponding to the orthonormal eigenvector $u_i$, and we assume $0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n$.
14. ### Community identification methods: Spectral optimization

Here $h = \sum_i \alpha_i u_i$ with $\alpha_i = u_i^T h$, subject to the constraint $\sum_i \alpha_i^2 = n$.
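The identity $R = \frac{1}{4} h^T L h$ holds because $h^T L h = \sum_{\text{edges } (i,j)} (h_i - h_j)^2$, and each inter-group edge contributes $(\pm 2)^2 = 4$ while each intra-group edge contributes 0. The sketch below checks this numerically on a hypothetical 7-vertex graph (vertices relabelled 0 to 6) whose groups match the index vector quoted on the slides; the edge list itself is an assumption, as are the function names. The constraint $\sum_i \alpha_i^2 = n$ is checked through $h^T h$, which it equals because the $u_i$ are orthonormal.

```python
def laplacian(n, edges):
    """Laplacian L: degrees on the diagonal, -a_ij off the diagonal."""
    L = [[0] * n for _ in range(n)]
    for u, v in edges:
        L[u][u] += 1
        L[v][v] += 1
        L[u][v] -= 1
        L[v][u] -= 1
    return L


def cut_size_quadratic(L, h):
    """Cut size via the quadratic form R = (1/4) h^T L h."""
    n = len(h)
    return sum(h[i] * L[i][j] * h[j] for i in range(n) for j in range(n)) / 4


def cut_size_direct(edges, h):
    """Cut size by counting edges whose endpoints lie in different groups."""
    return sum(1 for u, v in edges if h[u] != h[v])


# Hypothetical graph: groups {0, 1, 2, 3} and {4, 5, 6}, one edge between them.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3),
         (3, 4), (4, 5), (4, 6), (5, 6)]
h = [1, 1, 1, 1, -1, -1, -1]
L = laplacian(7, edges)

print(cut_size_direct(edges, h), cut_size_quadratic(L, h))
print(sum(x * x for x in h))  # h^T h, equal to n for any +1/-1 vector
```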
15. ### The concept of modularity: Classical definition

Definition (Modularity). The modularity $Q$ is defined as $Q = \frac{1}{a} \sum_{i,j} (a_{ij} - p_{ij})\, \delta_c(i, j)$ with $p_{ij} = \frac{a_i a_j}{a}$, where $a_{ij}$ describes the actual number of edges between vertices $v_i$ and $v_j$; $p_{ij}$ describes the expected number of edges between vertices $v_i$ and $v_j$, as a function of their degrees $a_i$ and $a_j$, respectively; the matrix $P = [p_{ij}]$ describes the so-called null model of the graph; and $\delta_c(i, j) = 1$ if the two vertices $v_i$ and $v_j$ belong to the same community, and 0 otherwise.
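A direct transcription of this definition into Python might look as follows. It is a minimal sketch: the function name `modularity`, the 0-based vertex labels, and the test graph (two triangles joined by one edge) are assumptions made for the illustration.

```python
def modularity(n, edges, community):
    """Q = (1/a) * sum_{i,j} (a_ij - a_i a_j / a) * delta_c(i, j).

    `community[i]` is the community label of vertex i.
    """
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = 1
        A[v][u] = 1
    deg = [sum(row) for row in A]
    a = sum(deg)
    q = 0.0
    for i in range(n):
        for j in range(n):
            if community[i] == community[j]:  # delta_c(i, j) = 1
                q += A[i][j] - deg[i] * deg[j] / a
    return q / a


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]

q_split = modularity(6, edges, [0, 0, 0, 1, 1, 1])  # one community per triangle
q_whole = modularity(6, edges, [0, 0, 0, 0, 0, 0])  # everything in one community
print(q_split, q_whole)
```

Placing every vertex in a single community gives $Q = 0$, since $\sum_{i,j} a_{ij} = \sum_{i,j} p_{ij} = a$; splitting along the bridge gives $Q = 5/14$ for this graph.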
16. ### The concept of modularity: Classical definition

Community decomposition of the modularity. Given a partition $\{S_I, I = 1, \dots, K\}$ of the network into $K$ communities, define the symmetric matrix $E$ of order $K$ with elements $e_{IJ} = \frac{1}{a} \sum_{i \in S_I} \sum_{j \in S_J} a_{ij}$. For $I \neq J$, $e_{IJ}$ is half the fraction of all edges connecting vertices in community $S_I$ to vertices in community $S_J$. Theorem (Community decomposition of the modularity). The modularity of the network can be written as $Q = \sum_I \left( e_{II} - e_I^2 \right)$, where $e_{II}$ corresponds to the fraction of all edges within community $S_I$, and $e_I = \sum_J e_{IJ}$ corresponds to the sum of $e_{II}$ and $\sum_{J \neq I} e_{IJ}$.
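The decomposition can be checked on a small case. The sketch below builds $E$ from an edge list and evaluates $Q = \sum_I (e_{II} - e_I^2)$; the function names and the two-triangle test graph are assumptions for the illustration. An intra-community edge adds $2/a$ to $e_{II}$ because both $a_{ij}$ and $a_{ji}$ fall inside the double sum, whereas an inter-community edge adds $1/a$ to each of $e_{IJ}$ and $e_{JI}$.

```python
def community_matrix(edges, community, k):
    """E with e_IJ = (1/a) * sum_{i in S_I} sum_{j in S_J} a_ij."""
    a = 2 * len(edges)
    E = [[0.0] * k for _ in range(k)]
    for u, v in edges:
        I, J = community[u], community[v]
        E[I][J] += 1 / a   # contribution of a_uv
        E[J][I] += 1 / a   # contribution of a_vu
    return E


def modularity_from_matrix(E):
    """Q = sum_I (e_II - e_I^2), with e_I the I-th row sum of E."""
    return sum(E[I][I] - sum(E[I]) ** 2 for I in range(len(E)))


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
community = [0, 0, 0, 1, 1, 1]

E = community_matrix(edges, community, k=2)
Q = modularity_from_matrix(E)
print(E, Q)
```

Here $e_{11} = e_{22} = 3/7$ (three of the seven edges lie inside each triangle) and $e_{12} = 1/14$, half of the fraction $1/7$ of edges running between the two communities.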
17. ### The concept of modularity: Classical definition

Agglomerative hierarchical clustering (M. E. J. Newman, Phys. Rev. E 69 (2004)). Algorithm for identifying the communities: at each step, compute the change in $Q$ that would result from joining each pair of communities, denoted $\Delta Q_{IJ}$. Then join the pair producing the largest increase. [Figure: example dendrogram, annotated with $Q$]
18. ### The concept of modularity: Classical definition

Agglomerative hierarchical clustering, improved version (Aaron Clauset, M. E. J. Newman, and Cristopher Moore, Phys. Rev. E 70 (2004)). Maintain and update a matrix $\Delta$ with elements $\delta_{IJ} = \Delta Q_{IJ}$, indexing the largest element in each row and in the whole matrix for fast retrieval. Initialization rule for $\Delta$: initially each community consists of a single vertex ($S_I = \{v_i\}$, $S_J = \{v_j\}$), and for all pairs of communities $S_I \neq S_J$ we have $\delta_{IJ} = \frac{2 a_{ij}}{a} - \frac{2 a_i a_j}{a^2}$. Updating rule for $\Delta$: upon joining communities $S_I$ and $S_J$ into $S_{I \cup J}$, for all communities $S_K \neq S_I, S_J$ we have $\delta_{(I \cup J)K} = \delta_{IK} + \delta_{JK}$.
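The two rules can be exercised with a deliberately simple implementation. The sketch below is not the data structure of the cited paper: it stores $\Delta$ as a plain dictionary over all pairs and scans it for the maximum at every step, rather than indexing the largest elements, and it stops at the first step where no join increases $Q$ instead of building the complete dendrogram. The function name and the test graph are assumptions.

```python
def greedy_modularity(n, edges):
    """Greedy agglomeration using the initialization and updating rules."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = 1
        A[v][u] = 1
    deg = [sum(row) for row in A]
    a = sum(deg)

    # Every vertex starts as its own community, labelled by the vertex index.
    members = {i: [i] for i in range(n)}

    # Initialization rule: delta_IJ = 2 a_ij / a - 2 a_i a_j / a^2.
    delta = {(i, j): 2 * A[i][j] / a - 2 * deg[i] * deg[j] / a ** 2
             for i in range(n) for j in range(i + 1, n)}

    # Modularity of the all-singletons partition.
    q = -sum((d / a) ** 2 for d in deg)

    while delta:
        (i, j), best = max(delta.items(), key=lambda kv: kv[1])
        if best <= 0:          # simplification: stop when no join improves Q
            break
        q += best
        members[i] += members.pop(j)   # merged community keeps label i
        del delta[(i, j)]
        # Updating rule: delta_(I u J)K = delta_IK + delta_JK.
        for k in members:
            if k == i:
                continue
            key_i = (min(i, k), max(i, k))
            key_j = (min(j, k), max(j, k))
            delta[key_i] += delta.pop(key_j)

    return sorted(sorted(c) for c in members.values()), q


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
communities, q = greedy_modularity(6, edges)
print(communities, q)
```

A join with $\delta_{IJ} > 0$ always involves connected communities, because an unconnected pair has $\delta_{IJ} = -2 e_I e_J < 0$; the early stop therefore never merges unconnected communities.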
19. ### The concept of modularity: Classical definition

The karate club case study II. Agglomerative hierarchical clustering: $K = 3$ communities. All triangles but vertex 9 were previously squares.
20. ### The concept of modularity: New definition

A new definition of modularity. Self-loops, which we assume to be absent in the original graph, contribute negatively to the modularity $Q$: since $a_{ii} = 0$, we have $\left( \frac{a_{ii}}{a} - \frac{a_i a_i}{a^2} \right) \delta_c(i, i) = -\left( \frac{a_i}{a} \right)^2$. Definition (New null model matrix). We can apply our diagonal diffusion operator to the matrix $P$, obtaining a new null model matrix $R$ with null diagonal elements and off-diagonal elements $r_{ij}$ of the form

$$r_{ij} = \frac{a_i a_j}{a} + \frac{1}{n - 2} \left( \frac{a_i^2 + a_j^2}{a} - \frac{1}{n - 1} \sum_k \frac{a_k^2}{a} \right), \quad i \neq j.$$

The transformation from $P$ to $R$ preserves row (and column) sums.
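The claim that row sums are preserved can be verified numerically. The sketch below builds $P$ and $R$ from a degree sequence, reading the off-diagonal formula as $r_{ij} = p_{ij} + \frac{1}{n-2}\left(\frac{a_i^2 + a_j^2}{a} - \frac{1}{n-1}\sum_k \frac{a_k^2}{a}\right)$; that grouping of terms is a reconstruction from the transcript, chosen because it is the one for which the row sums come out equal. The diffusion operator itself is not defined in the slides and is not implemented here; the function name and the degree sequence are assumptions, and the formula requires $n \geq 3$.

```python
def null_models(deg):
    """Return (P, R): classical null model and its zero-diagonal variant."""
    n = len(deg)
    a = sum(deg)
    s2 = sum(d * d for d in deg)   # sum_k a_k^2

    # Classical null model: p_ij = a_i a_j / a, diagonal included.
    P = [[deg[i] * deg[j] / a for j in range(n)] for i in range(n)]

    # New null model: diagonal set to zero, its mass spread off the diagonal.
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                spread = (deg[i] ** 2 + deg[j] ** 2) / a - s2 / ((n - 1) * a)
                R[i][j] = P[i][j] + spread / (n - 2)
    return P, R


# Degree sequence of a hypothetical graph (two triangles joined by one edge).
deg = [2, 2, 3, 3, 2, 2]
P, R = null_models(deg)

for i, d in enumerate(deg):
    print(d, sum(P[i]), sum(R[i]))   # both row sums should equal the degree
```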
21. ### The concept of modularity: New definition

Agglomerative hierarchical clustering, new modularity. As for the classical modularity definition, we can decompose the new modularity at the single community level. Moreover, we can devise initialization and updating rules for the improved version of the agglomerative hierarchical clustering algorithm. Initialization rule for $\Delta$: initially, for all pairs of single-vertex communities $S_{I=i} \neq S_{J=j}$ we have

$$\delta_{IJ} = \frac{2 a_{ij}}{a} - \frac{2 a_i a_j}{a^2} - \frac{2}{a^2 (n - 1)(n - 2)} \left[ (n - 1) \left( a_i^2 + a_j^2 \right) - \sum_k a_k^2 \right].$$

Updating rule for $\Delta$: the same as before!
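The initialization rule can be recovered in a few lines, under the assumption (not stated explicitly in the slides) that the new modularity has the same form as the classical one with $R$ in place of $P$, that is $Q' = \frac{1}{a} \sum_{i,j} (a_{ij} - r_{ij})\, \delta_c(i, j)$. Joining the single-vertex communities $\{v_i\}$ and $\{v_j\}$ then adds the two terms $(i, j)$ and $(j, i)$ to the sum:

```latex
\begin{align*}
\delta_{IJ}
  &= \frac{2}{a} \left( a_{ij} - r_{ij} \right) \\
  &= \frac{2 a_{ij}}{a} - \frac{2 a_i a_j}{a^2}
     - \frac{2}{a^2 (n - 2)}
       \left[ \left( a_i^2 + a_j^2 \right) - \frac{1}{n - 1} \sum_k a_k^2 \right] \\
  &= \frac{2 a_{ij}}{a} - \frac{2 a_i a_j}{a^2}
     - \frac{2}{a^2 (n - 1)(n - 2)}
       \left[ (n - 1) \left( a_i^2 + a_j^2 \right) - \sum_k a_k^2 \right].
\end{align*}
```

Under the same assumption the updating rule is unchanged because $\delta_{(I \cup J)K} = \frac{2}{a} \sum_{i \in S_I \cup S_J} \sum_{k \in S_K} (a_{ik} - r_{ik})$ splits into the sum over $S_I$ plus the sum over $S_J$.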
22. ### The concept of modularity: New definition

The karate club case study III. Agglomerative hierarchical clustering, new modularity: $K = 4$ communities. All hexagons were previously circles.
23. ### End: Conclusions

Our refined modularity seems to be a better heuristic for the agglomerative hierarchical clustering algorithm than the classical one.
24. ### End: Conclusions

Directions for future research: other modularity optimization methods; modularity as an objective function; dynamic complex networks.

26. ### Additional material: A case study

Scientific collaboration network: computer science in Italy. Data has been collected from two publicly accessible databases, namely (1) Cerca Università and (2) The DBLP Computer Science Bibliography. The graph obtained has $n = 1741$ vertices; $m = 3878$ edges; 89 connected components, with a giant central component encompassing more than 80% of the whole graph; and 66 small components of order 2 and 3.
27. ### Additional material: A case study

Communities of the giant connected component: overview. Nodes represent communities.
28. ### Additional material: A case study

Communities of the giant connected component: one of the central communities in detail.
29. ### Additional material: A case study

Universities and SSDs inside communities. [Histogram: number of different universities vs. number of communities]
30. ### Additional material: A case study

There are 12 communities containing authors from one single university, and there are 44 communities containing authors from two different universities.
31. ### Additional material: A case study

Around 45% of the communities contain authors from at most two universities.
32. ### Additional material: A case study

[Histogram: number of different SSDs vs. number of communities]
33. ### Additional material: A case study

There are 38 communities containing authors from one single SSD, and there are 39 communities containing authors from two different SSDs.
34. ### Additional material: A case study

More than 60% of the communities contain authors belonging to at most two SSDs.
35. ### Additional material: Algorithmic complexity

Girvan-Newman algorithm. The betweenness can be computed in unweighted graphs in time $O(mn)$ using the fast algorithm of Newman. Since this calculation has to be repeated once for the removal of each edge, the entire algorithm runs in worst-case time $O(m^2 n)$.
36. ### Additional material: Algorithmic complexity

Agglomerative hierarchical clustering. We need only consider pairs of connected communities, of which there will be at any time at most $m$. The change in $Q$ upon joining two communities can be computed in constant time. Following a join, we will need to update up to $n$ of the matrix elements $e_{IJ}$ by adding together the rows and columns corresponding to the joined communities. Thus, each step of the algorithm takes worst-case time $O(m + n)$. There are a maximum of $n - 1$ join operations necessary to construct the complete dendrogram, and hence the entire algorithm runs in time $O((m + n) n)$, or $O(n^2)$ on a sparse graph.
37. ### Additional material: Algorithmic complexity

Agglomerative hierarchical clustering (improved version). This version of the algorithm is extremely efficient, scaling as $O(m d \log n)$, where $d$ is the depth of the dendrogram. Furthermore, since many real-world networks are sparse and hierarchical, with $m \propto n$ and $d \propto \log n$, this reduces to essentially linear time, $O(n \log^2 n)$. This algorithm has been successfully applied in the study of a recommender network of books from a large on-line retailer, with more than 400000 vertices and 2 million edges.
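To give a feeling for how far apart these three bounds are, the sketch below plugs the case-study graph ($n = 1741$, $m = 3878$) into each expression. The figures are only indicative: big-O bounds hide constant factors, the logarithm is taken in base 2, and the dendrogram depth is set to $d = \log_2 n$, which is an assumption rather than a measured value.

```python
import math

n, m = 1741, 3878          # case-study graph from the slides

log_n = math.log2(n)       # assumption: base-2 logarithm
d = log_n                  # assumption: dendrogram depth d ~ log n

costs = {
    "Girvan-Newman, m^2 n": m * m * n,
    "agglomerative, (m + n) n": (m + n) * n,
    "improved agglomerative, m d log n": m * d * log_n,
}

# Print each bound as a rough order of magnitude (constants ignored).
for name, value in costs.items():
    print(f"{name}: about {value:.1e}")
```

Under these assumptions each bound is smaller than the previous one by more than an order of magnitude for this graph.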