Slide 1

Slide 1 text

Community Identification in Networks
Gianluca Campanella
July 24, 2009

Slide 2

Slide 2 text

Contents
1 Introduction
2 Community identification methods
   Edge removal methods
   Spectral optimization
3 The concept of modularity
   Classical definition
   New definition

Slide 3

Slide 3 text

Introduction: Introduction and motivation
Complex networks are now studied in many fields of knowledge: biology, computer and information sciences, ecology, epidemiology, management sciences, medicine, and sociology.

Slide 4

Slide 4 text

Introduction: Graphs
Adjacency matrix $A = [a_{ij}]$, $i, j = 1, \dots, n$
[Figure: example graph on vertices 1 to 5.] Example entries: $a_{12} = 1$, $a_{13} = 0$

Slide 5

Slide 5 text

Introduction: Graphs
Vertex degrees $a_i = \sum_j a_{ij}$
[Figure: example graph on vertices 1 to 5.] Example: $a_2 = 4$

Slide 6

Slide 6 text

Introduction: Graphs
Graph size $a = \sum_i a_i = 2m$, where $m$ is the number of edges
[Figure: example graph on vertices 1 to 5.] Example: $a = 14$
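The three quantities on the Graphs slides can be computed directly from an edge list. The sketch below assumes a hypothetical 5-vertex graph, chosen only so that it agrees with the values quoted on the slides ($a_{12} = 1$, $a_{13} = 0$, $a_2 = 4$, $a = 14$); the actual edges of the figure cannot be recovered from the transcript.

```python
# Hypothetical edge list (an assumption): 5 vertices and 7 edges, consistent
# with the slide values a12 = 1, a13 = 0, a2 = 4 and a = 14.
EDGES = [(1, 2), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (4, 5)]
N = 5


def adjacency_matrix(edges, n):
    """Symmetric 0/1 adjacency matrix A = [a_ij] of an undirected simple graph."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u - 1][v - 1] = 1  # vertices are numbered from 1 on the slides
        A[v - 1][u - 1] = 1
    return A


def degrees(A):
    """Vertex degrees a_i = sum_j a_ij, i.e. the row sums of A."""
    return [sum(row) for row in A]


A = adjacency_matrix(EDGES, N)
deg = degrees(A)
size = sum(deg)  # graph size a = sum_i a_i = 2m

print("a12 =", A[0][1], " a13 =", A[0][2])
print("a2 =", deg[1])
print("a =", size, " 2m =", 2 * len(EDGES))
```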

Slide 7

Slide 7 text

Community identification methods: Edge removal methods
Girvan-Newman algorithm
M. Girvan and M. E. J. Newman, Proc. Natl. Acad. Sci. 99 (2002)
The shortest path between two vertices is a sequence of connected vertices that minimizes the number of its constituent edges.
[Figure: example graph on vertices 1 to 7.]

Slide 8

Slide 8 text

Community identification methods: Edge removal methods
Girvan-Newman algorithm
For each edge, the edge betweenness is defined as the number of shortest paths passing through it.
[Figure: example graph on vertices 1 to 7.]

Slide 9

Slide 9 text

Community identification methods: Edge removal methods
Girvan-Newman algorithm
Algorithm for identifying the communities: at each step, remove the edge with the highest betweenness.
[Figure: example graph on vertices 1 to 7.]
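The edge-removal procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: betweenness is accumulated with one breadth-first search per source vertex, splitting credit evenly when several shortest paths tie, and edges are removed until the graph falls apart into `k` components. The 7-vertex graph `ADJ` and the function names are hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """Edge betweenness of an undirected, unweighted graph.

    adj maps each vertex to the set of its neighbours. Tied shortest paths
    share their unit of credit equally (an assumption about tie handling).
    """
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1.0
        preds = {v: [] for v in adj}    # predecessors on shortest paths
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):       # farthest vertices first
            for v in preds[w]:
                credit = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += credit
                delta[v] += credit
    # Every unordered pair of endpoints was visited from both ends.
    return {e: b / 2.0 for e, b in bet.items()}


def components(adj):
    """Connected components as a list of vertex sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def girvan_newman_split(adj, k=2):
    """Remove highest-betweenness edges until there are k components."""
    adj = {v: set(nb) for v, nb in adj.items()}   # work on a copy
    while len(components(adj)) < k:
        bet = edge_betweenness(adj)               # recomputed after each removal
        if not bet:
            break
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Hypothetical example: two dense groups joined by the single edge 4-5.
EDGES = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6), (5, 7), (6, 7)]
ADJ = {v: set() for v in range(1, 8)}
for a, b in EDGES:
    ADJ[a].add(b)
    ADJ[b].add(a)

bet = edge_betweenness(ADJ)
parts = girvan_newman_split(ADJ, k=2)
print(bet[frozenset((4, 5))], sorted(sorted(p) for p in parts))
```

Recomputing the betweenness after every removal is what makes the method expensive on large graphs.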

Slide 10

Slide 10 text

Community identification methods: Edge removal methods
The karate club case study I
Girvan-Newman algorithm: $K = 2$ communities

Slide 11

Slide 11 text

Community identification methods: Spectral optimization
Cut size minimization
The cut size $R$ is the number of inter-group edges (in this case, the number of edges between the two groups of vertices).
Example: [Figure: graph on vertices 1 to 7 split into two groups.] $R = 1$

Slide 12

Slide 12 text

Community identification methods: Spectral optimization
Cut size minimization
The index vector $h$ has elements
$$h_i = \begin{cases} 1 & \text{if vertex } i \text{ belongs to the first group} \\ -1 & \text{if vertex } i \text{ belongs to the second group} \end{cases}$$
Example: [Figure: graph on vertices 1 to 7 split into two groups.] $R = 1$, $h = (1, 1, 1, 1, -1, -1, -1)$

Slide 13

Slide 13 text

Community identification methods: Spectral optimization
Spectral cut size minimization
We define the Laplacian matrix $L$ with elements
$$l_{ij} = \begin{cases} a_i & i = j \\ -a_{ij} & i \neq j \end{cases}$$
The cut size $R$ can then be written in terms of the index vector $h$ and the Laplacian matrix $L$ as
$$R = \frac{1}{4} h^T L h$$
which is equivalent to
$$R = \frac{1}{4} \sum_i \alpha_i^2 \lambda_i$$
where $\lambda_i$ is the eigenvalue of $L$ corresponding to the orthonormal eigenvector $u_i$, and we assume $0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n$.

Slide 14

Slide 14 text

Community identification methods: Spectral optimization
Spectral cut size minimization
Here $h = \sum_i \alpha_i u_i$ with $\alpha_i = u_i^T h$, subject to the constraint $\sum_i \alpha_i^2 = n$.
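The identity $R = \frac{1}{4} h^T L h$ is easy to check numerically. The sketch below assumes a hypothetical 7-vertex graph whose two groups are joined by a single edge, matching the $R = 1$ and $h = (1, 1, 1, 1, -1, -1, -1)$ of the example; the edge list itself is an assumption. The constraint $\sum_i \alpha_i^2 = n$ is checked in its equivalent form $h^T h = n$, which holds because the eigenvectors $u_i$ are orthonormal. The eigendecomposition itself is not computed here.

```python
# Hypothetical edge list (an assumption): groups {1,2,3,4} and {5,6,7}
# connected only by the edge 4-5.
EDGES = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6), (5, 7), (6, 7)]
N = 7
h = [1, 1, 1, 1, -1, -1, -1]  # index vector from the example


def laplacian(edges, n):
    """Laplacian L with l_ii = a_i (degree) and l_ij = -a_ij for i != j."""
    L = [[0] * n for _ in range(n)]
    for u, v in edges:
        i, j = u - 1, v - 1
        L[i][i] += 1
        L[j][j] += 1
        L[i][j] -= 1
        L[j][i] -= 1
    return L


def cut_size_by_counting(edges, h):
    """Number of edges whose endpoints lie in different groups."""
    return sum(1 for u, v in edges if h[u - 1] != h[v - 1])


def cut_size_quadratic(L, h):
    """Cut size computed as (1/4) h^T L h."""
    n = len(h)
    return sum(h[i] * L[i][j] * h[j] for i in range(n) for j in range(n)) / 4


L = laplacian(EDGES, N)
r_count = cut_size_by_counting(EDGES, h)
r_quad = cut_size_quadratic(L, h)
print(r_count, r_quad, sum(x * x for x in h))
```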

Slide 15

Slide 15 text

The concept of modularity: Classical definition
Modularity
Definition (Modularity). The modularity $Q$ is defined as
$$Q = \frac{1}{a} \sum_{i,j} (a_{ij} - p_{ij}) \, \delta_c(i, j) \quad \text{with} \quad p_{ij} = \frac{a_i a_j}{a}$$
where
$a_{ij}$ describes the actual number of edges between vertices $v_i$ and $v_j$;
$p_{ij}$ describes the expected number of edges between vertices $v_i$ and $v_j$, as a function of their degrees $a_i$ and $a_j$, respectively;
the matrix $P = [p_{ij}]$ describes the so-called null model of the graph;
$\delta_c(i, j) = 1$ if the two vertices $v_i$ and $v_j$ belong to the same community, and 0 otherwise.
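A direct transcription of the definition into Python may help fix the notation. The sketch below is a quadratic-time illustration, not an efficient implementation; the 7-vertex graph and the two-community labelling are hypothetical.

```python
def modularity(A, community):
    """Q = (1/a) * sum_ij (a_ij - a_i a_j / a) * delta_c(i, j).

    A is a symmetric 0/1 adjacency matrix and community[i] is the label of
    the community containing vertex i.
    """
    n = len(A)
    deg = [sum(row) for row in A]   # a_i
    a = sum(deg)                    # graph size a = 2m
    total = 0.0
    for i in range(n):
        for j in range(n):
            if community[i] == community[j]:        # delta_c(i, j) = 1
                total += A[i][j] - deg[i] * deg[j] / a
    return total / a


# Hypothetical example graph: groups {1,2,3,4} and {5,6,7} joined by edge 4-5.
EDGES = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6), (5, 7), (6, 7)]
N = 7
A = [[0] * N for _ in range(N)]
for u, v in EDGES:
    A[u - 1][v - 1] = A[v - 1][u - 1] = 1

labels = [0, 0, 0, 0, 1, 1, 1]
q_two = modularity(A, labels)
q_one = modularity(A, [0] * N)      # everything in a single community
print(q_two, q_one)
```

Placing all vertices in one community gives $Q = 0$, because the row sums of $A$ and $P$ coincide.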

Slide 16

Slide 16 text

The concept of modularity: Classical definition
Community decomposition of the modularity
Given a partition $\{S_I, I = 1, \dots, K\}$ of the network into $K$ communities, define the symmetric matrix $E$ of order $K$ with elements
$$e_{IJ} = \frac{1}{a} \sum_{i \in S_I} \sum_{j \in S_J} a_{ij}$$
For $I \neq J$, $e_{IJ}$ is half the fraction of all edges connecting vertices in community $S_I$ to vertices in community $S_J$.
Theorem (Community decomposition of the modularity). The modularity of the network can be written as
$$Q = \sum_I \left( e_{II} - e_I^2 \right)$$
$e_{II}$ corresponds to the fraction of all edges within community $S_I$.
$e_I = \sum_J e_{IJ}$ corresponds to the sum of $e_{II}$ and $\sum_{J \neq I} e_{IJ}$.
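The decomposition can be checked on a small example by building $E$ explicitly. The sketch below uses exact rational arithmetic; the graph and the labelling are hypothetical, and the function names are illustrative only.

```python
from fractions import Fraction


def community_matrix(A, community, K):
    """E = [e_IJ] with e_IJ = (1/a) * sum over i in S_I, j in S_J of a_ij."""
    a = sum(sum(row) for row in A)
    E = [[Fraction(0)] * K for _ in range(K)]
    for i, row in enumerate(A):
        for j, a_ij in enumerate(row):
            E[community[i]][community[j]] += Fraction(a_ij, a)
    return E


def modularity_from_E(E):
    """Q = sum_I (e_II - e_I^2), where e_I is the I-th row sum of E."""
    return sum(E[I][I] - sum(E[I]) ** 2 for I in range(len(E)))


# Hypothetical example graph: groups {1,2,3,4} and {5,6,7} joined by edge 4-5.
EDGES = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6), (5, 7), (6, 7)]
N = 7
A = [[0] * N for _ in range(N)]
for u, v in EDGES:
    A[u - 1][v - 1] = A[v - 1][u - 1] = 1

E = community_matrix(A, [0, 0, 0, 0, 1, 1, 1], K=2)
Q = modularity_from_E(E)
print(E, Q)
```

In this example $e_{11} = 10/18$ (five of the nine edges are internal to the first group) and $e_{12} = 1/18$ (half of the one-ninth of edges that cross).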

Slide 17

Slide 17 text

The concept of modularity: Classical definition
Agglomerative hierarchical clustering
M. E. J. Newman, Phys. Rev. E 69 (2004)
Algorithm for identifying the communities: at each step, compute the change in $Q$ that would result from joining each pair of communities, denoted $\Delta Q_{IJ}$. Then join the pair producing the largest increase.
Example: [Figure: dendrogram, with the modularity $Q$ shown alongside.]
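The change in $Q$ for a single join follows from the community decomposition $Q = \sum_I (e_{II} - e_I^2)$. Joining $S_I$ and $S_J$ produces a community with internal weight $e_{II} + e_{JJ} + 2 e_{IJ}$ (using the symmetry of $E$) and row sum $e_I + e_J$, while all other terms of the sum are unchanged. A short derivation:

```latex
\begin{align*}
\Delta Q_{IJ}
  &= \bigl[(e_{II} + e_{JJ} + 2 e_{IJ}) - (e_I + e_J)^2\bigr]
   - \bigl[(e_{II} - e_I^2) + (e_{JJ} - e_J^2)\bigr] \\
  &= 2 e_{IJ} - 2 e_I e_J \\
  &= 2\,(e_{IJ} - e_I e_J).
\end{align*}
```

In particular, joining two communities with no edges between them ($e_{IJ} = 0$) can never increase $Q$.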

Slide 18

Slide 18 text

The concept of modularity: Classical definition
Agglomerative hierarchical clustering (improved version)
Aaron Clauset, M. E. J. Newman, and Cristopher Moore, Phys. Rev. E 70 (2004)
Maintain and update a matrix $\Delta$ with elements $\delta_{IJ} = \Delta Q_{IJ}$, indexing the largest element in each row and in the whole matrix for fast retrieval.
Initialization rule for $\Delta$: initially every community is a single vertex ($S_I = \{i\}$, $S_J = \{j\}$), and for all pairs of communities $S_I \neq S_J$ we have
$$\delta_{IJ} = \frac{2 a_{ij}}{a} - \frac{2 a_i a_j}{a^2}$$
Updating rule for $\Delta$: upon joining communities $S_I$ and $S_J$ into $S_{I \cup J}$, for all communities $S_K \neq S_I, S_J$ we have
$$\delta_{(I \cup J) K} = \delta_{IK} + \delta_{JK}$$
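The initialization and updating rules can be exercised with a small dense sketch. This is a simplification, not the data structure of the cited paper: $\Delta$ is stored for every pair of communities in a plain dictionary and the maximum is found by a linear scan, with no sparse rows or heaps. The function name, the example graph, and the 1-based vertex labels are assumptions.

```python
from itertools import combinations


def greedy_modularity(edges, n):
    """Greedy agglomeration driven by the matrix Delta (dense sketch).

    Vertices are labelled 1..n. Returns (best_q, best_partition).
    """
    deg = {v: 0 for v in range(1, n + 1)}
    edge_set = set()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
        edge_set.add(frozenset((u, v)))
    a = sum(deg.values())

    # Initialization rule: delta_IJ = 2 a_ij / a - 2 a_i a_j / a^2.
    delta = {}
    for i, j in combinations(range(1, n + 1), 2):
        a_ij = 1 if frozenset((i, j)) in edge_set else 0
        delta[frozenset((i, j))] = 2 * a_ij / a - 2 * deg[i] * deg[j] / a ** 2

    members = {v: [v] for v in deg}              # community label -> vertices
    q = -sum((d / a) ** 2 for d in deg.values())  # Q of the all-singletons partition
    best_q, best = q, [list(m) for m in members.values()]

    while len(members) > 1:
        pair = max(delta, key=delta.get)         # largest Delta Q_IJ
        I, J = sorted(pair)
        q += delta.pop(pair)
        # Updating rule: delta_(I u J)K = delta_IK + delta_JK.
        for K in members:
            if K != I and K != J:
                delta[frozenset((I, K))] += delta.pop(frozenset((J, K)))
        members[I] += members.pop(J)             # the merged community keeps label I
        if q > best_q:
            best_q, best = q, [sorted(m) for m in members.values()]
    return best_q, best


# Hypothetical example graph: groups {1,2,3,4} and {5,6,7} joined by edge 4-5.
EDGES = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6), (5, 7), (6, 7)]
best_q, best = greedy_modularity(EDGES, 7)
print(best_q, sorted(best))
```

Cutting the dendrogram at the level of maximum $Q$, as the function does when it records `best`, is one common way to pick the number of communities.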

Slide 19

Slide 19 text

The concept of modularity: Classical definition
The karate club case study II
Agglomerative hierarchical clustering: $K = 3$ communities
All triangles except vertex 9 were previously squares.

Slide 20

Slide 20 text

The concept of modularity: New definition
A new definition of modularity
Self-loops, which we assume to be absent in the original graph, contribute negatively to the modularity $Q$: since $a_{ii} = 0$,
$$\left( \frac{a_{ii}}{a} - \frac{a_i a_i}{a^2} \right) \delta_c(i, i) = -\left( \frac{a_i}{a} \right)^2$$
Definition (New null model matrix). We can apply our diagonal diffusion operator to the matrix $P$, obtaining a new null model matrix $R$ with null diagonal elements and off-diagonal elements $r_{ij}$ of the form
$$r_{ij} = \frac{a_i a_j}{a} + \frac{1}{n - 2} \left( \frac{a_i^2 + a_j^2}{a} - \frac{1}{n - 1} \sum_k \frac{a_k^2}{a} \right), \quad i \neq j$$
The transformation from $P$ to $R$ preserves row (and column) sums.
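The row-sum claim can be verified exactly on a small degree sequence. The sketch below assumes the new null model has off-diagonal entries $a_i a_j / a + \frac{1}{n-2}\bigl[(a_i^2 + a_j^2)/a - \frac{1}{n-1}\sum_k a_k^2 / a\bigr]$ and a zero diagonal; the degree sequence and function names are hypothetical, and $n > 2$ is required.

```python
from fractions import Fraction


def classical_null_model(deg):
    """P = [p_ij] with p_ij = a_i a_j / a."""
    a = sum(deg)
    return [[Fraction(di * dj, a) for dj in deg] for di in deg]


def new_null_model(deg):
    """Null model with zero diagonal and the off-diagonal entries r_ij above."""
    n, a = len(deg), sum(deg)
    s = sum(d * d for d in deg)                 # sum_k a_k^2
    R = [[Fraction(0)] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                correction = Fraction(deg[i] ** 2 + deg[j] ** 2, a) - Fraction(s, (n - 1) * a)
                R[i][j] = Fraction(deg[i] * deg[j], a) + Fraction(1, n - 2) * correction
    return R


# Hypothetical degree sequence (an assumption), with a = 18.
DEG = [2, 3, 3, 3, 3, 2, 2]
P = classical_null_model(DEG)
R = new_null_model(DEG)

print([sum(row) for row in P] == [sum(row) for row in R])
```

The weight removed from each diagonal entry $p_{ii} = a_i^2 / a$ is spread over the off-diagonal entries of the same row, which is why every row of $R$ still sums to $a_i$.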

Slide 21

Slide 21 text

The concept of modularity: New definition
Agglomerative hierarchical clustering, new modularity
As for the classical modularity definition, we can decompose the new modularity at the single community level. Moreover, we can devise initialization and updating rules for the improved version of the agglomerative hierarchical clustering algorithm.
Initialization rule for $\Delta$: initially, for all pairs of communities $S_{I=i} \neq S_{J=j}$ we have
$$\delta_{IJ} = \frac{2 a_{ij}}{a} - \frac{2 a_i a_j}{a^2} - \frac{2}{a^2 (n - 1)(n - 2)} \left[ (n - 1) \left( a_i^2 + a_j^2 \right) - \sum_k a_k^2 \right]$$
Updating rule for $\Delta$: the same as before!
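The new initialization rule should agree with $2 (a_{ij} - r_{ij}) / a$, the change obtained by joining two single-vertex communities. That reading assumes the new modularity keeps the classical form with the new null model entries $r_{ij}$ substituted for $p_{ij}$. The sketch below checks the agreement exactly on a hypothetical graph; all function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations


def r_entry(ai, aj, n, a, s):
    """Off-diagonal entry r_ij of the new null model (s = sum_k a_k^2)."""
    correction = Fraction(ai * ai + aj * aj, a) - Fraction(s, (n - 1) * a)
    return Fraction(ai * aj, a) + Fraction(1, n - 2) * correction


def delta_from_null_model(a_ij, ai, aj, n, a, s):
    """Change in Q on joining {i} and {j}: the (i, j) and (j, i) terms of Q."""
    return 2 * (a_ij - r_entry(ai, aj, n, a, s)) / a


def delta_closed_form(a_ij, ai, aj, n, a, s):
    """Closed-form initialization rule for the new modularity."""
    bracket = (n - 1) * (ai * ai + aj * aj) - s
    return (Fraction(2 * a_ij, a) - Fraction(2 * ai * aj, a * a)
            - Fraction(2, a * a * (n - 1) * (n - 2)) * bracket)


# Hypothetical example graph: groups {1,2,3,4} and {5,6,7} joined by edge 4-5.
EDGES = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6), (5, 7), (6, 7)]
N = 7
deg = {v: 0 for v in range(1, N + 1)}
for u, v in EDGES:
    deg[u] += 1
    deg[v] += 1
a = sum(deg.values())
s = sum(d * d for d in deg.values())
edge_set = {frozenset(e) for e in EDGES}

mismatches = []
for i, j in combinations(range(1, N + 1), 2):
    a_ij = 1 if frozenset((i, j)) in edge_set else 0
    lhs = delta_from_null_model(a_ij, deg[i], deg[j], N, a, s)
    rhs = delta_closed_form(a_ij, deg[i], deg[j], N, a, s)
    if lhs != rhs:
        mismatches.append((i, j))

print(mismatches)
```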

Slide 22

Slide 22 text

The concept of modularity: New definition
The karate club case study III
Agglomerative hierarchical clustering, new modularity: $K = 4$ communities
All hexagons were previously circles.

Slide 23

Slide 23 text

End: Conclusions
Our refined modularity seems to be a better heuristic for the agglomerative hierarchical clustering algorithm than the classical one.

Slide 24

Slide 24 text

End: Conclusions
Directions for future research: other modularity optimization methods; modularity as an objective function; dynamic complex networks.

Slide 25

Slide 25 text

End Thank you for your attention!

Slide 26

Slide 26 text

Additional material: A case study
Scientific collaboration network: computer science in Italy
Data has been collected from two publicly accessible databases, namely (1) Cerca Università and (2) The DBLP Computer Science Bibliography.
The graph obtained has $n = 1741$ vertices and $m = 3878$ edges; 89 connected components, with a giant central component encompassing more than 80% of the whole graph; and 66 small components of order 2 and 3.

Slide 27

Slide 27 text

Additional material: A case study
Communities of the giant connected component: overview
[Figure: nodes represent communities.]

Slide 28

Slide 28 text

Additional material: A case study
Communities of the giant connected component
[Figure: one of the central communities in detail.]

Slide 29

Slide 29 text

Additional material: A case study
Universities and SSDs inside communities
[Figure: number of communities against number of different universities; both axes run from 0 to 50.]

Slide 30

Slide 30 text

Additional material: A case study
Universities and SSDs inside communities
There are 12 communities containing authors from a single university, and 44 communities containing authors from two different universities.

Slide 31

Slide 31 text

Additional material: A case study
Universities and SSDs inside communities
Around 45% of the communities contain authors from at most two universities.

Slide 32

Slide 32 text

Additional material: A case study
Universities and SSDs inside communities
[Figure: number of communities (axis from 0 to 50) against number of different SSDs (axis from 0 to 10).]

Slide 33

Slide 33 text

Additional material: A case study
Universities and SSDs inside communities
There are 38 communities containing authors from a single SSD, and 39 communities containing authors from two different SSDs.

Slide 34

Slide 34 text

Additional material: A case study
Universities and SSDs inside communities
More than 60% of the communities contain authors belonging to at most two SSDs.

Slide 35

Slide 35 text

Additional material: Algorithmic complexity
Girvan-Newman algorithm
The betweenness can be computed in unweighted graphs in time $O(mn)$ using the fast algorithm of Newman. Since this calculation has to be repeated once for the removal of each edge, the entire algorithm runs in worst-case time $O(m^2 n)$.

Slide 36

Slide 36 text

Additional material: Algorithmic complexity
Agglomerative hierarchical clustering
We need only consider pairs of connected communities, of which there will be at most $m$ at any time.
The change in $Q$ upon joining two communities can be computed in constant time.
Following a join, we will need to update up to $n$ of the matrix elements $e_{IJ}$ by adding together the rows and columns corresponding to the joined communities.
Thus, each step of the algorithm takes worst-case time $O(m + n)$. There are a maximum of $n - 1$ join operations necessary to construct the complete dendrogram, and hence the entire algorithm runs in time $O((m + n) n)$, or $O(n^2)$ on a sparse graph.

Slide 37

Slide 37 text

Additional material: Algorithmic complexity
Agglomerative hierarchical clustering (improved version)
This version of the algorithm is extremely efficient, scaling as $O(md \log n)$, where $d$ is the depth of the dendrogram. Furthermore, since many real-world networks are sparse and hierarchical, with $m \propto n$ and $d \propto \log n$, this reduces essentially to linear time, $O(n \log^2 n)$.
This algorithm has been successfully applied in the study of a recommender network of books from a large on-line retailer, with more than 400,000 vertices and 2 million edges.