Community Identification in Networks

Community Identification in Networks, Gianluca Campanella, July 24, 2009

Contents
1 Introduction
2 Community identification methods: edge removal methods, spectral optimization
3 The concept of modularity: classical definition, new definition

Introduction Introduction and motivation

Complex networks are now studied in many fields of knowledge: biology, computer and information sciences, ecology, epidemiology, management sciences, medicine, and sociology.

Introduction Graphs

Adjacency matrix: $A = [a_{ij}]$, $i, j = 1, \dots, n$. In the example graph on vertices 1 to 5, $a_{12} = 1$ and $a_{13} = 0$.

Vertex degrees: $a_i = \sum_j a_{ij}$. In the example, $a_2 = 4$.

Graph size: $a = \sum_i a_i = 2m$, where $m$ is the number of edges. In the example, $a = 14$.
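The three quantities above can be computed directly from an edge list. The following is a minimal sketch: the edge list is an assumption (the slide's figure is not recoverable), chosen only so that it agrees with the values quoted above, and vertices are 0-indexed, so the slide's vertex 1 is index 0.

```python
# Hypothetical undirected graph on n = 5 vertices; the edge list is an
# assumption that merely reproduces the values quoted on the slide.
n = 5
edges = [(0, 1), (0, 4), (1, 2), (1, 3), (1, 4), (2, 3), (3, 4)]

# Symmetric adjacency matrix A = [a_ij].
A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = 1
    A[j][i] = 1

# Vertex degrees a_i = sum_j a_ij, and graph size a = sum_i a_i = 2m.
degrees = [sum(row) for row in A]
size = sum(degrees)
m = len(edges)

print(A[0][1], A[0][2])  # a_12 and a_13 in the slide's 1-based numbering
print(degrees[1])        # a_2
print(size, 2 * m)       # a and 2m agree
```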

Community identification methods Edge removal methods Girvan-Newman algorithm

M. Girvan and M. E. J. Newman, Proc. Natl. Acad. Sci. 99 (2002)

The shortest path between two vertices is a sequence of connected vertices that minimizes the number of its constituent edges. For each edge, the edge betweenness is defined as the number of shortest paths passing through it.

Algorithm for identifying the communities: at each step, remove the edge with the highest betweenness.

[Figure: example graph on vertices 1 to 7]
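The edge removal procedure can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: betweenness is accumulated with a breadth-first search from every vertex, ties between several shortest paths are split fractionally (one common convention), and the 7-vertex graph and the function names are hypothetical.

```python
from collections import deque

def edge_betweenness(adj):
    """Shortest-path betweenness of every edge; adj maps vertex -> set of neighbours."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, sigma = {s: 0}, {v: 0 for v in adj}
        sigma[s] = 1
        preds, order, queue = {v: [] for v in adj}, [], deque([s])
        while queue:  # BFS counting shortest paths from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):  # push path counts back towards s
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += c
                delta[v] += c
    # Each unordered vertex pair was visited from both endpoints.
    return {e: b / 2.0 for e, b in bet.items()}

def components(adj):
    """Connected components as a list of vertex sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def girvan_newman(adj, k=2):
    """Remove highest-betweenness edges until at least k components remain."""
    adj = {v: set(ns) for v, ns in adj.items()}
    while len(components(adj)) < k:
        bet = edge_betweenness(adj)  # recomputed after every removal
        if not bet:
            break
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Hypothetical example: two dense groups joined by the single edge (3, 4).
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
adj = {v: set() for v in range(7)}
for i, j in edges:
    adj[i].add(j)
    adj[j].add(i)

bridge = edge_betweenness(adj)[frozenset((3, 4))]
communities = girvan_newman(adj, k=2)
print(bridge, sorted(sorted(c) for c in communities))
```

In this example all 4 × 3 shortest paths between the two groups cross the bridge, so it is removed first and the graph splits into the two groups.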

Community identification methods Edge removal methods The karate club case study I

Girvan-Newman algorithm: K = 2 communities

Community identification methods Spectral optimization Cut size minimization

The cut size $R$ is the number of inter-group edges (in this case, the number of edges between the two groups of vertices).

The index vector $h$ has elements $h_i = 1$ if vertex $i$ belongs to the first group and $h_i = -1$ if vertex $i$ belongs to the second group.

Example (graph on vertices 1 to 7): $R = 1$, $h = (1, 1, 1, 1, -1, -1, -1)$.
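As a small illustration, the cut size can be read off an edge list and an index vector. The graph below is a hypothetical stand-in for the slide's figure (0-indexed vertices, two groups joined by one edge), and `cut_size` is an illustrative name.

```python
def cut_size(edges, h):
    """Number of edges whose endpoints lie in different groups."""
    return sum(1 for i, j in edges if h[i] != h[j])

# Hypothetical graph: groups {0, 1, 2, 3} and {4, 5, 6} joined by edge (3, 4).
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
h = [1, 1, 1, 1, -1, -1, -1]

R = cut_size(edges, h)

# Equivalent form: (1 - h_i h_j) / 2 is 1 across the cut and 0 inside a group.
R_alt = sum((1 - h[i] * h[j]) // 2 for i, j in edges)
print(R, R_alt)
```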

Community identification methods Spectral optimization Spectral cut size minimization

We define the Laplacian matrix $L$ with elements

$$l_{ij} = \begin{cases} a_i & i = j \\ -a_{ij} & i \neq j \end{cases}$$

The cut size $R$ can then be written in terms of the index vector $h$ and the Laplacian matrix $L$ as

$$R = \frac{1}{4} h^T L h$$

which is equivalent to

$$R = \frac{1}{4} \sum_i \alpha_i^2 \lambda_i$$

where $\lambda_i$ is the eigenvalue of $L$ corresponding to the orthonormal eigenvector $u_i$, and we assume $0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n$. Here $h = \sum_i \alpha_i u_i$ with $\alpha_i = u_i^T h$, subject to the constraint $\sum_i \alpha_i^2 = n$.
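The identity $R = \frac{1}{4} h^T L h$ can be checked numerically. The sketch below builds the Laplacian of a hypothetical 7-vertex graph (an assumption, not the slide's figure) in plain Python and evaluates the quadratic form; the eigen-decomposition itself is left out to keep the example dependency-free.

```python
def laplacian(n, edges):
    """L has the degree a_i on the diagonal and -a_ij off the diagonal."""
    L = [[0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1
        L[j][j] += 1
        L[i][j] -= 1
        L[j][i] -= 1
    return L

def quadratic_form(L, h):
    """Compute h^T L h."""
    n = len(h)
    return sum(h[i] * L[i][j] * h[j] for i in range(n) for j in range(n))

# Hypothetical graph: groups {0, 1, 2, 3} and {4, 5, 6} joined by edge (3, 4).
n = 7
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
h = [1, 1, 1, 1, -1, -1, -1]

L = laplacian(n, edges)
R = quadratic_form(L, h) / 4

# Every row of L sums to zero, so the constant vector is an eigenvector
# with eigenvalue 0, which is why lambda_1 = 0.
row_sums = [sum(row) for row in L]

# h^T h = n; because the u_i are orthonormal this equals sum_i alpha_i^2.
norm_sq = sum(x * x for x in h)
print(R, row_sums, norm_sq)
```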

The concept of modularity Classical definition Modularity

Definition (Modularity). The modularity $Q$ is defined as

$$Q = \frac{1}{a} \sum_{i,j} (a_{ij} - p_{ij}) \, \delta_c(i, j) \quad \text{with} \quad p_{ij} = \frac{a_i a_j}{a}$$

where

$a_{ij}$ describes the actual number of edges between vertices $v_i$ and $v_j$;

$p_{ij}$ describes the expected number of edges between vertices $v_i$ and $v_j$, as a function of their degrees $a_i$ and $a_j$, respectively. The matrix $P = [p_{ij}]$ describes the so-called null model of the graph;

$\delta_c(i, j) = 1$ if the two vertices $v_i$ and $v_j$ belong to the same community, and 0 otherwise.
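The definition translates almost literally into code. The following sketch evaluates $Q$ for a given assignment of vertices to communities; the example graph, the community labels, and the function name are assumptions made for illustration.

```python
def modularity(A, community):
    """Q = (1/a) * sum_ij (a_ij - a_i a_j / a) * delta_c(i, j)."""
    n = len(A)
    deg = [sum(row) for row in A]  # a_i
    a = sum(deg)                   # graph size a = 2m
    total = 0.0
    for i in range(n):
        for j in range(n):
            if community[i] == community[j]:  # delta_c(i, j) = 1
                total += A[i][j] - deg[i] * deg[j] / a
    return total / a

# Hypothetical graph: groups {0, 1, 2, 3} and {4, 5, 6} joined by edge (3, 4).
n = 7
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = A[j][i] = 1

Q_two_groups = modularity(A, [0, 0, 0, 0, 1, 1, 1])
Q_single = modularity(A, [0] * n)  # one community containing every vertex
print(Q_two_groups, Q_single)
```

With every vertex in a single community the sum runs over all pairs and the observed and expected terms cancel, so $Q = 0$; the two-group split scores $118/324 \approx 0.364$ on this graph.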

The concept of modularity Classical definition Community decomposition of the modularity

Given a partition $\{S_I, I = 1, \dots, K\}$ of the network into $K$ communities, define the symmetric matrix $E$ of order $K$ with elements

$$e_{IJ} = \frac{1}{a} \sum_{i \in S_I} \sum_{j \in S_J} a_{ij}$$

For $I \neq J$, $e_{IJ}$ is half the fraction of all edges connecting vertices in community $S_I$ to vertices in community $S_J$.

Theorem (Community decomposition of the modularity). The modularity of the network can be written as

$$Q = \sum_I \left( e_{II} - e_I^2 \right)$$

$e_{II}$ corresponds to the fraction of all edges within community $S_I$.

$e_I = \sum_J e_{IJ}$ corresponds to the sum of $e_{II}$ and $\sum_{J \neq I} e_{IJ}$.
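A short sketch of the decomposition: build the matrix $E$ for a partition and sum $e_{II} - e_I^2$ over communities. The graph and the partition are hypothetical, and `community_matrix` is an illustrative name.

```python
def community_matrix(edges, labels, k):
    """E[I][J] = (1/a) * sum of a_ij over i in S_I and j in S_J."""
    a = 2 * len(edges)
    E = [[0.0] * k for _ in range(k)]
    for i, j in edges:
        # Each undirected edge contributes a_ij and a_ji.
        E[labels[i]][labels[j]] += 1 / a
        E[labels[j]][labels[i]] += 1 / a
    return E

# Hypothetical graph: groups {0, 1, 2, 3} and {4, 5, 6} joined by edge (3, 4).
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
labels = [0, 0, 0, 0, 1, 1, 1]
K = 2

E = community_matrix(edges, labels, K)
e_row = [sum(E[I]) for I in range(K)]                # e_I = sum_J e_IJ
Q = sum(E[I][I] - e_row[I] ** 2 for I in range(K))   # theorem above

# One of the 9 edges crosses the partition, so e_01 is half of 1/9.
print(E[0][1], Q)
```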

The concept of modularity Classical definition Agglomerative hierarchical clustering

M. E. J. Newman, Phys. Rev. E 69 (2004)

Algorithm for identifying the communities: at each step, compute the change in $Q$ should any two communities be joined, denoted $\Delta Q_{IJ}$. Then, join the pair producing the largest increase.

[Figure: example dendrogram with the modularity Q]
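The greedy procedure can be sketched as below. For brevity this version recomputes $Q$ for each candidate join instead of tracking $\Delta Q_{IJ}$ incrementally, which selects the same pair but is slower; it also keeps joining until no connected pair is left and reports the best level of the dendrogram. The graph and all function names are assumptions for illustration.

```python
def modularity(edges, labels):
    """Q = sum_I (e_II - e_I^2) for the partition given by labels."""
    a = 2 * len(edges)
    within, degsum = {}, {}
    for i, j in edges:
        degsum[labels[i]] = degsum.get(labels[i], 0) + 1
        degsum[labels[j]] = degsum.get(labels[j], 0) + 1
        if labels[i] == labels[j]:
            within[labels[i]] = within.get(labels[i], 0) + 2
    return sum(within.get(c, 0) / a - (d / a) ** 2 for c, d in degsum.items())

def merge(labels, I, J):
    """Relabel community J as I."""
    return [I if c == J else c for c in labels]

def greedy_agglomeration(n, edges):
    labels = list(range(n))  # start from singleton communities
    best_q, best_labels = modularity(edges, labels), labels[:]
    while True:
        # Only pairs of communities joined by at least one edge are candidates.
        pairs = {(min(labels[i], labels[j]), max(labels[i], labels[j]))
                 for i, j in edges if labels[i] != labels[j]}
        if not pairs:
            break
        # Largest Q after the join is the same as largest Delta Q_IJ.
        best_pair = max(sorted(pairs),
                        key=lambda p: modularity(edges, merge(labels, *p)))
        labels = merge(labels, *best_pair)
        q = modularity(edges, labels)
        if q > best_q:
            best_q, best_labels = q, labels[:]
    return best_q, best_labels

# Hypothetical graph: groups {0, 1, 2, 3} and {4, 5, 6} joined by edge (3, 4).
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
best_q, best_labels = greedy_agglomeration(7, edges)
print(best_q, best_labels)
```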

The concept of modularity Classical definition Agglomerative hierarchical clustering (improved version)

Aaron Clauset, M. E. J. Newman, and Cristopher Moore, Phys. Rev. E 70 (2004)

Maintain and update a matrix $\Delta$ with elements $\delta_{IJ} = \Delta Q_{IJ}$, indexing the largest element in each row and in the whole matrix for fast retrieval.

Initialization rule for $\Delta$: initially, for all pairs of communities $S_I \neq S_J$ we have

$$\delta_{IJ} = \frac{2 a_{ij}}{a} - \frac{2 a_i a_j}{a^2}$$

Updating rule for $\Delta$: upon joining communities $S_I$ and $S_J$ into $S_{I \cup J}$, for all communities $S_K \neq S_I, S_J$ we have

$$\delta_{(I \cup J)K} = \delta_{IK} + \delta_{JK}$$
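The two rules can be illustrated with a dense dictionary holding $\delta_{IJ}$ for every ordered pair of communities. This is only a sketch of the rules as stated on the slide: the per-row and global indexing of the largest element, which gives the improved version its speed, is omitted, and the graph and function names are hypothetical.

```python
def init_delta(A):
    """delta_IJ = 2 a_ij / a - 2 a_i a_j / a^2 for singleton communities."""
    n = len(A)
    deg = [sum(row) for row in A]
    a = sum(deg)
    return {(i, j): 2 * A[i][j] / a - 2 * deg[i] * deg[j] / a ** 2
            for i in range(n) for j in range(n) if i != j}

def join(delta, I, J):
    """Join J into I using delta_(I u J)K = delta_IK + delta_JK."""
    others = {K for (_, K) in delta if K != I and K != J}
    merged = {pair: v for pair, v in delta.items()
              if I not in pair and J not in pair}
    for K in others:
        v = delta[(I, K)] + delta[(J, K)]
        merged[(I, K)] = v
        merged[(K, I)] = v
    return merged

# Hypothetical graph: groups {0, 1, 2, 3} and {4, 5, 6} joined by edge (3, 4).
n = 7
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = A[j][i] = 1

delta = init_delta(A)
best = max(delta, key=delta.get)  # pair with the largest Delta Q
after = join(delta, 5, 6)         # join vertices 5 and 6; label 5 is kept
print(sorted(best), delta[best], after[(5, 4)])
```

Here the best first join is the pair {5, 6}, and after that join the entry for the new community and vertex 4 is the sum of the two old entries.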

The concept of modularity Classical definition The karate club case study II

Agglomerative hierarchical clustering: K = 3 communities

All triangles but vertex 9 were previously squares

The concept of modularity New definition A new definition of modularity

Self-loops, which we assume to be absent in the original graph, contribute negatively to the modularity $Q$, since $a_{ii} = 0$:

$$\left( \frac{a_{ii}}{a} - \frac{a_i a_i}{a^2} \right) \delta_c(i, i) = -\left( \frac{a_i}{a} \right)^2$$

Definition (New null model matrix). We can apply our diagonal diffusion operator to the matrix $P$, obtaining a new null model matrix $R$ with null diagonal elements and off-diagonal elements $r_{ij}$ of the form

$$r_{ij} = \frac{a_i a_j}{a} + \frac{1}{n - 2} \left( \frac{a_i^2 + a_j^2}{a} - \frac{1}{n - 1} \sum_k \frac{a_k^2}{a} \right), \quad i \neq j$$

The transformation from $P$ to $R$ preserves row (and column) sums.
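The row-sum property can be checked numerically. The sketch below builds $P$ and $R$ from a degree sequence using the formula for $r_{ij}$ above and compares row sums; the degree sequence is a hypothetical example, `null_models` is an illustrative name, and the formula requires $n > 2$.

```python
def null_models(deg):
    """Return the classical null model P and the zero-diagonal null model R."""
    n = len(deg)
    a = sum(deg)
    s2 = sum(d * d for d in deg)  # sum_k a_k^2
    P = [[deg[i] * deg[j] / a for j in range(n)] for i in range(n)]
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                spread = (deg[i] ** 2 + deg[j] ** 2) / a - s2 / ((n - 1) * a)
                R[i][j] = P[i][j] + spread / (n - 2)
    return P, R

# Hypothetical degree sequence (a = 18).
deg = [2, 3, 3, 3, 3, 2, 2]
P, R = null_models(deg)

p_rows = [sum(row) for row in P]
r_rows = [sum(row) for row in R]
print(p_rows)
print(r_rows)
```

Both lists reproduce the degree sequence up to rounding, so the diagonal mass $a_i^2 / a$ removed from $P$ has been redistributed over the off-diagonal entries of each row.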

The concept of modularity New definition Agglomerative hierarchical clustering

New modularity: as for the classical modularity definition, we can decompose the new modularity at the single community level. Moreover, we can devise initialization and updating rules for the improved version of the agglomerative hierarchical clustering algorithm.

Initialization rule for $\Delta$: initially, for all pairs of communities $S_{I=i} \neq S_{J=j}$ we have

$$\delta_{IJ} = \frac{2 a_{ij}}{a} - \frac{2 a_i a_j}{a^2} - \frac{2}{a^2 (n - 1)(n - 2)} \left[ (n - 1) \left( a_i^2 + a_j^2 \right) - \sum_k a_k^2 \right]$$

Updating rule for $\Delta$: the same as before!
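The initialization rule can be recovered from the null model matrix $R$. The derivation below is a sketch that assumes the new modularity has the classical form with $P$ replaced by $R$, that is $Q = \frac{1}{a} \sum_{i,j} (a_{ij} - r_{ij}) \, \delta_c(i, j)$ with $r_{ii} = 0$. Joining the singleton communities $\{v_i\}$ and $\{v_j\}$ then adds exactly the $(i, j)$ and $(j, i)$ terms to the sum:

```latex
\begin{align*}
\delta_{IJ}
  &= \frac{2}{a} \left( a_{ij} - r_{ij} \right) \\
  &= \frac{2 a_{ij}}{a}
     - \frac{2}{a} \left[ \frac{a_i a_j}{a}
     + \frac{1}{n - 2} \left( \frac{a_i^2 + a_j^2}{a}
     - \frac{1}{n - 1} \sum_k \frac{a_k^2}{a} \right) \right] \\
  &= \frac{2 a_{ij}}{a} - \frac{2 a_i a_j}{a^2}
     - \frac{2}{a^2 (n - 1)(n - 2)}
       \left[ (n - 1) \left( a_i^2 + a_j^2 \right) - \sum_k a_k^2 \right]
\end{align*}
```

The first two terms are the classical initialization rule; the last term is the correction introduced by removing the diagonal of the null model.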

The concept of modularity New definition The karate club case study III

Agglomerative hierarchical clustering, new modularity: K = 4 communities

All hexagons were previously circles

End Conclusions

Our refined modularity seems to be a better heuristic for the agglomerative hierarchical clustering algorithm than the classical one.

Directions for future research: other modularity optimization methods, modularity as an objective function, dynamic complex networks.

End Thank you for your attention!

Additional material A case study Scientific collaboration network: computer science in Italy

Data has been collected from two publicly accessible databases, namely:
1 Cerca Università
2 The DBLP Computer Science Bibliography

The graph obtained has:
n = 1741 vertices
m = 3878 edges
89 connected components, with a giant central component encompassing more than 80% of the whole graph
66 small components of order 2 and 3

Additional material A case study Communities of the giant connected
component Overview Nodes represent communities

Additional material A case study Communities of the giant connected
component One of the central communities in detail

Additional material A case study Universities and SSDs inside communities

[Figure: number of communities by number of different universities]

There are 12 communities containing authors from one single university, and there are 44 communities containing authors from two different universities. Around 45% of the communities contain authors from at most two universities.

[Figure: number of communities by number of different SSDs]

There are 38 communities containing authors from one single SSD, and there are 39 communities containing authors from two different SSDs. More than 60% of the communities contain authors belonging to at most two SSDs.

Additional material Algorithmic complexity Girvan-Newman algorithm

The betweenness can be computed in unweighted graphs in time $O(mn)$ using the fast algorithm of Newman. Since this calculation has to be repeated once for the removal of each edge, the entire algorithm runs in worst-case time $O(m^2 n)$.

Additional material Algorithmic complexity Agglomerative hierarchical clustering

We need only consider pairs of connected communities, of which there will be at any time at most $m$. The change in $Q$ upon joining two communities can be computed in constant time. Following a join, we will need to update up to $n$ of the matrix elements $e_{IJ}$ by adding together the rows and columns corresponding to the joined communities.

Thus, each step of the algorithm takes worst-case time $O(m + n)$. There are a maximum of $n - 1$ join operations necessary to construct the complete dendrogram, and hence the entire algorithm runs in time $O((m + n) n)$, or $O(n^2)$ on a sparse graph.

Additional material Algorithmic complexity Agglomerative hierarchical clustering (improved version)

This version of the algorithm is extremely efficient, scaling as $O(md \log n)$, where $d$ is the depth of the dendrogram. Furthermore, since many real-world networks are sparse and hierarchical, with $m \propto n$ and $d \propto \log n$, this reduces essentially to linear time, $O(n \log^2 n)$.

This algorithm has been successfully applied in the study of a recommender network of books from a large on-line retailer, with more than 400000 vertices and 2 million edges.
