
Wikipedia graph network analysis

Wikipedia is the largest, most carefully indexed collection of human knowledge ever amassed. More than information about a topic, Wikipedia is a web of naturally emerging relationships. By retrieving the hyperlinks inside each article, we algorithmically construct an undirected network of all 240,000 articles of the Greek edition of Wikipedia. We show that the resulting network graph of 1,405,709 nodes (articles) and 3,975,431 edges (links) follows a power-law degree distribution. An empirical comparison of centrality measures (eigenvector, degree and PageRank) reveals the top-k articles. We find semantically related subgraphs via community detection, evaluating the Louvain method and asynchronous label propagation. We discuss how the graph structure of local Wikipedias can be leveraged to augment NLP tasks in underrepresented languages.

Dimitris Spathis

June 30, 2016


Transcript

  1. Graph network analysis of Wikipedia: The structure behind millions of nodes of knowledge

    Dimitris Spathis & Elias Kouslis. Social Network Analysis, MSc in Computer Science, Aristotle University of Thessaloniki
  2. Motivation

    Why: Wikipedia is the largest, most carefully indexed collection of human knowledge ever amassed. More than information about a topic, Wikipedia is a web of naturally emerging relationships.
    What: By retrieving the hyperlinks inside each article, we algorithmically construct an undirected network of all 240,000 articles of the Greek edition of Wikipedia, resulting in a network graph of 1,405,709 nodes (articles) and 3,975,431 edges (links).
  3. Research questions

    RQ1: How does a local version of Wikipedia compare structurally to the English one?
    RQ2: Does the degree distribution follow a known pattern?
    RQ3: Which community detection algorithm is more suitable for a Wikipedia graph?
    RQ4: Which are the top articles when estimated with different centrality measures?
    RQ5: How can we transfer the graph structure of local Wikipedia versions to augment NLP tasks in more languages?
    RQ6: How can we visualize community graphs of millions of nodes?
  4. Workflow

    Data retrieval (XML parsing): The first step in data retrieval was to get the Wikipedia dump from the official website. After inspecting the offered data, we downloaded a 1.2 GB snapshot of Greek Wikipedia as of June 1st, 2016.
    Graph metrics (centralities and stats): average degree, average density, average diameter, power-law degree estimation, eigenvector centrality, degree centrality, PageRank.
    Communities (modularity): Louvain Modularity Fast Unfolding, asynchronous label propagation.
  5. Community detection algorithms

    Louvain: This is a bottom-up algorithm: initially every vertex belongs to a separate community, and vertices are moved between communities iteratively in a way that maximizes the vertices' local contribution to the overall modularity score. When a consensus is reached (i.e. no single move would increase the modularity score), every community in the original graph is shrunk to a single vertex.
    Label Propagation: The algorithm is probabilistic and the found communities may vary on different executions. After initializing each node with a unique label, the algorithm repeatedly sets the label of a node to be the label that appears most frequently among that node's neighbors. It is asynchronous because each node is updated without waiting for updates on the remaining nodes.
    0.577 Modularity; 0.101 Modularity
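
The graph construction described in slide 2 (hyperlinks inside each article become undirected edges) can be illustrated with a minimal Python sketch. This is not the authors' code. It assumes the raw wikitext of every page is already available as a string; the `pages` dictionary, the regular expression, and the function names are hypothetical choices for illustration. Namespace filtering and redirect resolution are deliberately left out.

```python
import re
from collections import defaultdict

# Captures the target of a wikitext link such as [[Target]],
# [[Target|label]] or [[Target#Section]]. Pure section links like
# [[#Section]] are skipped because the target would be empty.
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)")


def build_undirected_graph(pages):
    """Build an adjacency map from a {title: wikitext} dictionary.

    `pages` is a hypothetical input format, not the dump's real layout.
    """
    adjacency = defaultdict(set)
    for title, text in pages.items():
        adjacency[title]  # articles without links still become nodes
        for match in LINK_RE.finditer(text):
            target = match.group(1).strip()
            if target and target != title:  # ignore self-links
                adjacency[title].add(target)
                adjacency[target].add(title)  # undirected: both directions
    return adjacency


def edge_count(adjacency):
    """Each undirected edge is stored twice, once per endpoint."""
    return sum(len(neighbours) for neighbours in adjacency.values()) // 2


sample_pages = {
    "Athens": "Capital of [[Greece]]. See [[Acropolis of Athens|Acropolis]].",
    "Greece": "Its capital is [[Athens]].",
    "Sparta": "A city in [[Greece#History]].",
}
graph = build_undirected_graph(sample_pages)
```

In this sketch a link target that has no page of its own ("Acropolis of Athens" above) still becomes a node, and a pair of articles that link to each other is stored as a single edge.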
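
Several of the graph metrics listed in slide 4 can be sketched in plain Python. The deck does not say which tools or estimators were used, so everything below is an assumption: the function names are hypothetical, PageRank is a textbook power iteration with a damping factor of 0.85, and the power-law exponent uses the common discrete maximum-likelihood approximation alpha ≈ 1 + n / Σ ln(k_i / (k_min − 0.5)). The graph is an adjacency map from node to a set of neighbours.

```python
import math


def average_degree(num_nodes, num_edges):
    """Average degree of an undirected graph: 2m / n."""
    return 2.0 * num_edges / num_nodes


def density(num_nodes, num_edges):
    """Fraction of possible undirected edges present: 2m / (n (n - 1))."""
    return 2.0 * num_edges / (num_nodes * (num_nodes - 1))


def degree_centrality(adjacency):
    """Degree of each node divided by the maximum possible degree n - 1."""
    n = len(adjacency)
    return {v: len(nb) / (n - 1) for v, nb in adjacency.items()}


def pagerank(adjacency, damping=0.85, iterations=50):
    """Power-iteration PageRank; the parameter values are assumptions."""
    n = len(adjacency)
    rank = {v: 1.0 / n for v in adjacency}
    for _ in range(iterations):
        # Rank held by isolated nodes is spread evenly over all nodes.
        dangling = sum(rank[v] for v, nb in adjacency.items() if not nb)
        new_rank = {v: (1.0 - damping + damping * dangling) / n
                    for v in adjacency}
        for v, neighbours in adjacency.items():
            if neighbours:
                share = damping * rank[v] / len(neighbours)
                for u in neighbours:
                    new_rank[u] += share
        rank = new_rank
    return rank


def power_law_exponent(degrees, k_min=1):
    """Approximate discrete MLE of the power-law exponent for k >= k_min."""
    tail = [k for k in degrees if k >= k_min]
    return 1.0 + len(tail) / sum(math.log(k / (k_min - 0.5)) for k in tail)


# A star graph: node 0 is linked to four leaves.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
star_rank = pagerank(star)

# Node and edge counts quoted in the deck give an average degree of about 5.66.
greek_avg_degree = average_degree(1_405_709, 3_975_431)
```

On the star graph, the hub gets the highest score under both degree centrality and PageRank, which is the kind of agreement or disagreement between measures that RQ4 asks about.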
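
The label propagation procedure from slide 5, together with the modularity score used to compare the two algorithms, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the deck: the function names, the tie-breaking rule (keep the current label when it is among the most frequent, otherwise pick at random), the `seed` and `max_sweeps` parameters, and the toy graph are all hypothetical. Louvain is not implemented here; the `modularity` function only scores a given partition with Q = Σ_c [L_c / m − (d_c / 2m)²], where L_c is the number of edges inside community c, d_c the sum of its degrees, and m the total number of edges.

```python
import random


def async_label_propagation(adjacency, seed=0, max_sweeps=100):
    """Asynchronous label propagation on an adjacency map {node: set}."""
    rng = random.Random(seed)
    # Every node starts with its own unique integer label.
    labels = {v: i for i, v in enumerate(adjacency)}
    nodes = list(adjacency)
    for _ in range(max_sweeps):
        rng.shuffle(nodes)  # visit nodes in a fresh random order
        changed = False
        for v in nodes:
            if not adjacency[v]:
                continue  # isolated nodes keep their own label
            counts = {}
            for u in adjacency[v]:
                counts[labels[u]] = counts.get(labels[u], 0) + 1
            best = max(counts.values())
            candidates = sorted(l for l, c in counts.items() if c == best)
            # Asynchronous: the update is visible to later nodes in this sweep.
            if labels[v] not in candidates:
                labels[v] = rng.choice(candidates)
                changed = True
        if not changed:
            break
    return labels


def modularity(adjacency, labels):
    """Modularity of a partition given as {node: community label}."""
    two_m = sum(len(nb) for nb in adjacency.values())  # equals 2m
    if two_m == 0:
        return 0.0
    internal = {}  # per community: internal edges, each counted twice
    degree = {}    # per community: sum of node degrees
    for v, neighbours in adjacency.items():
        c = labels[v]
        degree[c] = degree.get(c, 0) + len(neighbours)
        for u in neighbours:
            if labels[u] == c:
                internal[c] = internal.get(c, 0) + 1
    return sum(internal.get(c, 0) / two_m - (degree[c] / two_m) ** 2
               for c in degree)


# Toy graph: two triangles joined by the single edge 2-3.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
two_triangles = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
found = async_label_propagation(toy)
```

Because the algorithm is randomized, the labels in `found` depend on the seed, which matches the slide's remark that the communities may vary between executions. Scoring each result with `modularity` gives a single number per partition, the same kind of figure as the modularity values quoted on the slide.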