Slide 1

Slide 1 text

Graph network analysis of Wikipedia: The structure behind millions of nodes of knowledge
DIMITRIS SPATHIS & ELIAS KOUSLIS
Social Network Analysis, MSc in Computer Science, Aristotle University of Thessaloniki

Slide 2

Slide 2 text

Motivation

Why: Wikipedia is the largest, most carefully indexed collection of human knowledge ever amassed. More than information about a topic, Wikipedia is a web of naturally emerging relationships.

What: By retrieving the hyperlinks inside each article, we algorithmically construct an undirected network from all 240,000 articles of the Greek edition of Wikipedia. The resulting graph has 1,405,709 nodes (articles) and 3,975,431 edges (links).

Slide 3

Slide 3 text

Research questions
RQ1: How does a local edition of Wikipedia compare structurally to the English one?
RQ2: Does the degree distribution follow a known pattern?
RQ3: Which community detection algorithm is more suitable for a Wikipedia graph?
RQ4: Which are the top articles when ranked by different centrality measures?
RQ5: How can the graph structure of local Wikipedia editions be transferred to augment NLP tasks in more languages?
RQ6: How can we visualize community graphs of millions of nodes?

Slide 4

Slide 4 text

Workflow

Data retrieval and XML parsing: The first step in data retrieval was to get the Wikipedia dump from the official website. After inspecting the available data, we downloaded a 1.2 GB snapshot of Greek Wikipedia as of June 1st, 2016.

Graph metrics (centralities and stats): average degree, average density, average diameter, power law degree estimation, eigenvector centrality, degree centrality, PageRank.

Communities: modularity, Louvain modularity (fast unfolding), asynchronous label propagation.

Slide 5

Slide 5 text

Tools used
Python: NetworkX, igraph, powerlaw
Java: Gephi

Slide 6

Slide 6 text

XML Parsing
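
The parsing step can be illustrated with a minimal sketch. It is not the authors' code: it assumes the dump follows the usual MediaWiki export layout (page, title, revision, text elements) and that links appear as [[Target]] or [[Target|label]] in the wikitext. The sample XML, the regular expression, and the function name extract_edges are all hypothetical, and a real dump declares an XML namespace and is far too large to load in one string, so a streaming parser would be needed there.

```python
import re
import xml.etree.ElementTree as ET

# Tiny hand-written stand-in for a MediaWiki export file (hypothetical content).
SAMPLE_DUMP = """<mediawiki>
  <page>
    <title>Athens</title>
    <revision><text>Capital of [[Greece]]. See [[Acropolis of Athens|the Acropolis]].</text></revision>
  </page>
  <page>
    <title>Greece</title>
    <revision><text>Its capital is [[Athens]]. [[File:Flag.svg]]</text></revision>
  </page>
</mediawiki>"""

# Captures the link target of [[Target]], [[Target|label]] and [[Target#Section]].
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)")


def extract_edges(xml_text):
    """Return (source_title, target_title) pairs, one per wikilink found."""
    edges = []
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        title = page.findtext("title")
        text = page.findtext("revision/text") or ""
        for target in WIKILINK.findall(text):
            target = target.strip()
            # Crude assumption: a colon marks a non-article namespace
            # (File:, Category:, ...), so such links are skipped.
            if target and ":" not in target:
                edges.append((title, target))
    return edges


edges = extract_edges(SAMPLE_DUMP)
for source, target in edges:
    print(source, "->", target)
```

Each pair is one hyperlink, so the list of pairs is already an edge list for the article graph.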

Slide 8

Slide 8 text

Graph edgelist
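
As a rough illustration of the edge list stage, the sketch below writes (source, target) pairs as tab-separated lines and reads them back into an undirected adjacency structure. The file layout, the helper names, and the choice to drop self-links are assumptions made for the example, not details taken from the slides. NetworkX, listed among the tools, offers read_edgelist for the same purpose.

```python
import io
from collections import defaultdict


def write_edgelist(edges, handle):
    """Write one 'source<TAB>target' line per edge."""
    for source, target in edges:
        handle.write(f"{source}\t{target}\n")


def read_undirected(handle):
    """Read an edge list into a dict mapping each node to its neighbor set."""
    adjacency = defaultdict(set)
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        source, target = line.split("\t")
        if source == target:
            continue  # assumption: self-links carry no structural information
        # Storing both directions makes the graph undirected and merges
        # reciprocal links (A -> B and B -> A) into a single edge.
        adjacency[source].add(target)
        adjacency[target].add(source)
    return dict(adjacency)


edges = [
    ("Athens", "Greece"),
    ("Greece", "Athens"),
    ("Athens", "Acropolis of Athens"),
]
buffer = io.StringIO()  # stands in for a file on disk
write_edgelist(edges, buffer)
buffer.seek(0)
graph = read_undirected(buffer)

num_nodes = len(graph)
num_edges = sum(len(neighbors) for neighbors in graph.values()) // 2
print(num_nodes, "nodes,", num_edges, "edges")
```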

Slide 10

Slide 10 text

Metrics
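
The basic statistics named in the workflow (average degree, density, diameter) can be sketched on a toy graph as follows. This is a dependency-free illustration under the assumption of a connected, undirected graph stored as a dict of neighbor sets. The function names and the toy graph are hypothetical. The exact diameter needs one breadth-first search per node, which is costly on a graph of over a million nodes, so a sampled estimate is a common substitute there.

```python
from collections import deque


def average_degree(adj):
    """Mean number of neighbors per node, equal to 2m / n."""
    return sum(len(nb) for nb in adj.values()) / len(adj)


def density(adj):
    """Fraction of possible undirected edges that exist: 2m / (n (n - 1))."""
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) / 2
    return 2 * m / (n * (n - 1))


def eccentricity(adj, start):
    """Longest shortest-path distance from start, found by BFS."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in distance:
                distance[nb] = distance[node] + 1
                queue.append(nb)
    return max(distance.values())


def diameter(adj):
    """Largest eccentricity; assumes the graph is connected."""
    return max(eccentricity(adj, node) for node in adj)


# Hypothetical toy graph: a triangle b-c-d with a pendant node a attached to b.
toy = {
    "a": {"b"},
    "b": {"a", "c", "d"},
    "c": {"b", "d"},
    "d": {"b", "c"},
}
print(average_degree(toy), density(toy), diameter(toy))
```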

Slide 12

Slide 12 text

Power law: degree frequency plot and degree rank plot
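
The data behind a degree frequency plot and a degree rank plot can be derived from the degree sequence alone. The sketch below shows one way to do it; the degree values are made up for illustration and the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical degree sequence, one entry per node.
degrees = [3, 1, 2, 1, 1, 2, 5]

# Degree frequency: how many nodes have each degree. Plotted on log-log
# axes, a power law shows up as an approximately straight line.
frequency = sorted(Counter(degrees).items())

# Degree rank: degrees sorted from largest to smallest, plotted against
# their position (rank) in that ordering.
rank = sorted(degrees, reverse=True)

print(frequency)
print(rank)
```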

Slide 16

Slide 16 text

Power law estimation
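
A hedged sketch of how a power law exponent can be estimated from a degree sequence: the function below uses the standard approximate maximum-likelihood estimator for discrete data, alpha = 1 + n / sum(ln(x_i / (xmin - 0.5))), with xmin fixed by hand. The degree values, the function name, and the fixed xmin are assumptions for illustration. The powerlaw package listed in the tools also selects xmin and compares candidate distributions through its Fit class, which this sketch does not attempt.

```python
import math


def estimate_alpha(degrees, xmin=1):
    """Approximate MLE of the power law exponent for discrete data."""
    tail = [d for d in degrees if d >= xmin]
    log_sum = sum(math.log(d / (xmin - 0.5)) for d in tail)
    return 1 + len(tail) / log_sum


# Hypothetical degree sequence with a few hubs and many low-degree nodes.
degrees = [1, 1, 1, 1, 2, 2, 3, 4, 8, 16]
alpha = estimate_alpha(degrees, xmin=1)
print(round(alpha, 3))
```

The approximation is known to be less accurate for small xmin, so the value is best read as a starting point rather than a final fit.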

Slide 19

Slide 19 text

Top-k articles
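
To illustrate how top-k lists arise from different centrality measures, the sketch below ranks the nodes of a toy graph by degree centrality and by PageRank computed with plain power iteration. It is a simplified stand-in, not the authors' pipeline: the graph, the damping factor of 0.85, the fixed iteration count, and the helper names are assumptions, and the PageRank loop assumes every node has at least one neighbor.

```python
def degree_centrality(adj):
    """Degree divided by the maximum possible degree, n - 1."""
    n = len(adj)
    return {node: len(nb) / (n - 1) for node, nb in adj.items()}


def pagerank(adj, damping=0.85, iterations=100):
    """Power iteration on an undirected graph without isolated nodes."""
    n = len(adj)
    rank = {node: 1.0 / n for node in adj}
    for _ in range(iterations):
        new_rank = {node: (1 - damping) / n for node in adj}
        for node, neighbors in adj.items():
            # Each node splits its damped score evenly among its neighbors.
            share = damping * rank[node] / len(neighbors)
            for nb in neighbors:
                new_rank[nb] += share
        rank = new_rank
    return rank


def top_k(scores, k):
    """Names of the k highest-scoring nodes; ties are broken by name."""
    ordered = sorted(scores, key=lambda node: (-scores[node], node))
    return ordered[:k]


# Hypothetical toy graph of article titles.
toy = {
    "Greece": {"Athens", "Thessaloniki", "Crete"},
    "Athens": {"Greece", "Thessaloniki"},
    "Thessaloniki": {"Greece", "Athens"},
    "Crete": {"Greece"},
}
print(top_k(degree_centrality(toy), 2))
print(top_k(pagerank(toy), 1))
```

On an undirected graph PageRank tends to track degree closely, so the measures diverge more when link direction is kept.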

Slide 22

Slide 22 text

Modularity
Community detection
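
As a worked example of the modularity score, the sketch below evaluates the standard definition Q = sum over communities of (L_c / m - (d_c / 2m)^2), where L_c is the number of edges inside community c, d_c the total degree of its nodes, and m the number of edges in the graph. The toy graph (two triangles joined by one edge) and the function name are hypothetical.

```python
def modularity(adj, communities):
    """Modularity of a partition of an undirected, unweighted graph."""
    m = sum(len(nb) for nb in adj.values()) / 2
    q = 0.0
    for community in communities:
        members = set(community)
        # Every internal edge is seen from both endpoints, hence the halving.
        internal = sum(1 for u in members for v in adj[u] if v in members) / 2
        degree_sum = sum(len(adj[u]) for u in members)
        q += internal / m - (degree_sum / (2 * m)) ** 2
    return q


# Two triangles (0, 1, 2) and (3, 4, 5) joined by the edge 2-3.
toy = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2, 4, 5},
    4: {3, 5},
    5: {3, 4},
}
score = modularity(toy, [{0, 1, 2}, {3, 4, 5}])
print(round(score, 4))
```

Here each triangle has 3 internal edges and total degree 7 out of m = 7 edges, so Q = 2 * (3/7 - 1/4) = 5/14. Placing all nodes in a single community gives Q = 0.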

Slide 23

Slide 23 text

Community detection algorithms

Louvain: This is a bottom-up algorithm. Initially every vertex belongs to a separate community, and vertices are moved between communities iteratively in a way that maximizes each vertex's local contribution to the overall modularity score. When a consensus is reached (i.e. no single move would increase the modularity score), every community in the original graph is shrunk to a single vertex and the process repeats on the aggregated graph. Modularity: 0.577

Label Propagation: The algorithm is probabilistic, so the communities found may vary between executions. After initializing each node with a unique label, the algorithm repeatedly sets the label of a node to the label that appears most frequently among that node's neighbors. It is asynchronous because each node is updated without waiting for updates on the remaining nodes. Modularity: 0.101
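
The label propagation procedure described above can be sketched in a few lines. This is a simplified illustration, not the implementation used for the reported scores: the random tie-breaking, the seed parameter, the sweep limit, and the toy graph are assumptions. Labels are updated in place, so later nodes in a sweep already see the new labels of earlier ones, which is what makes the scheme asynchronous.

```python
import random
from collections import Counter


def async_label_propagation(adj, seed=0, max_sweeps=100):
    """Return a list of node sets found by asynchronous label propagation."""
    rng = random.Random(seed)
    labels = {node: node for node in adj}  # every node starts with its own label
    nodes = list(adj)
    for _ in range(max_sweeps):
        rng.shuffle(nodes)  # visit nodes in a fresh random order each sweep
        changed = False
        for node in nodes:
            if not adj[node]:
                continue  # isolated nodes keep their own label
            counts = Counter(labels[nb] for nb in adj[node])
            best = max(counts.values())
            candidates = sorted(lab for lab, c in counts.items() if c == best)
            if labels[node] in candidates:
                continue  # current label is already among the most frequent
            labels[node] = rng.choice(candidates)
            changed = True
        if not changed:
            break
    groups = {}
    for node, label in labels.items():
        groups.setdefault(label, set()).add(node)
    return list(groups.values())


# Two triangles joined by the edge 2-3 (hypothetical toy graph).
toy = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2, 4, 5},
    4: {3, 5},
    5: {3, 4},
}
communities = async_label_propagation(toy, seed=0)
print(communities)
```

Different seeds can yield different partitions of the same graph, which matches the note that results may vary between executions.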

Slide 25

Slide 25 text

Visualization

Slide 29

Slide 29 text

Thank you