Network Graph Analysis Using Open Source

An overview of network graph analysis with a few examples built with Gephi and Sigma.js

Ken Cherven

May 22, 2016
Transcript

  1. Analyzing Complex Networks Using Open Source Software
    @ODSC Open Data Science Conference, Boston | May 20-22, 2016
    Ken Cherven, @kc2519, visual-baseball.com, visualidity.com
  2. Network Graph Analysis, also known as Social Network Analysis (SNA), is the study of connections (links) between actors (nodes) within a network.
    [Diagram: example network with its nodes labeled]
  3. Network Graph Analysis has many use cases, ranging from the familiar SNA (Facebook, Twitter networks) to the more specialized visual and statistical investigation of political, criminal, or terrorist networks.
  4. The use cases for Network Graph Analysis are almost endless: any dataset in which relationships can be mapped can be analyzed both statistically and visually. All we need are nodes and links.
  5. We have two primary approaches to assess patterns in a network:
    • Statistical measures are used to understand the underlying structure and relationships between nodes
    • Visual assessment allows us to leverage size, color, spacing, and structure to understand patterns at a network level
  6. Statistical measures are employed to understand structural patterns within the network:
    • Degree (number of connections)
    • Centrality (influence)
    • Density (level of network connectedness)
    • Homophily (common groupings)
    • Diameter (maximum distance between nodes)
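Three of the measures on slide 6 (degree, density, diameter) can be sketched in plain Python. This is a minimal illustration on a hypothetical five-node undirected graph, not the deck's data; the function names are made up for the example, and tools such as Gephi report these statistics directly.

```python
from collections import deque

# Hypothetical toy network: undirected graph stored as an adjacency map.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}

def degrees(g):
    """Degree = number of connections per node."""
    return {node: len(neighbors) for node, neighbors in g.items()}

def density(g):
    """Share of possible undirected links that actually exist."""
    n = len(g)
    edge_count = sum(len(neighbors) for neighbors in g.values()) // 2
    return 2 * edge_count / (n * (n - 1)) if n > 1 else 0.0

def shortest_path_lengths(g, source):
    """Breadth-first search: hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbor in g[current]:
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    return dist

def diameter(g):
    """Longest shortest path; assumes the graph is connected."""
    return max(max(shortest_path_lengths(g, s).values()) for s in g)

print(degrees(graph), density(graph), diameter(graph))
```

Here node C has the highest degree, half of the possible links exist, and the farthest pair of nodes is three hops apart.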
  7. Visual assessment allows us to use our visual sense to interpret network patterns:
    • Node location to represent related nodes
    • Node sizes to represent degrees
    • Node coloring to represent common groupings (clusters, categories)
    • Edge weights that show the strength of connections between nodes
  8. Some open source network graph tools:
    • Gephi (http://gephi.org)
    • Cytoscape (http://cytoscape.org)
    • GraphViz (http://graphviz.org)
    • Sigma.js (http://sigmajs.org)
    • NodeXL (http://nodexl.codeplex.com/)
    • Pajek (http://mrvar.fdv.uni-lj.si/pajek/)
    • Tulip (http://tulip.labri.fr/TulipDrupal/)
  9. We’ll use Gephi and Sigma.js for the following examples:
    • Miles Davis album network (tripartite network)
    • Boston Red Sox player network
    • GDELT event networks
  10. The goal of the Miles Davis network is to understand the multiple phases within his long and varied career, and to see the shifting patterns in his musical partnerships and styles.
    http://visual-baseball.com/gephi/jazz/miles_davis/#
  11. Five Album Clusters to Investigate
    [Network image with clusters labeled 1 through 5]
    What do these clusters represent?
  12. Five Album Clusters Revealed
    • Early 60s Big Bands
    • Mid-60s small group
    • 1950s small groups
    • 1970s fusion, electric sounds
    • Late career: 1980s, experimentation, eclectic instrumentation
  13. A quick exploration of the network reveals information about the elements of time, instrumentation, number of musicians, and types of instruments. With just a few minutes of traversing the network, we gain a greater understanding of Miles Davis’ musical career.
  14. The goal for the Red Sox player network is to understand connections between players across eras, and to understand influence and groupings within the network, as defined by degrees and other centrality measures.
    http://visual-baseball.com/gephi/teams/redsox_network/
  15. Red Sox Network Topology
    • Player nodes are sized and colored based on number of years with the team and cluster assignment
    • Players are positioned based on common years with the team
    • Links are built using the number of seasons two players were on the team roster together
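The link-building rule on slide 15 (one link per pair of players, weighted by shared seasons) can be sketched as follows. The deck built its edges with SQL against the Lahman Database; this Python version is only an illustration, and the roster rows, player names, and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical (season, player) roster rows; names and years are
# illustrative only, not taken from the Lahman Database.
rosters = [
    (2001, "Player A"), (2001, "Player B"), (2001, "Player C"),
    (2002, "Player A"), (2002, "Player B"),
    (2003, "Player B"), (2003, "Player C"),
]

def shared_season_edges(rows):
    """Edge weight = number of seasons two players shared a roster."""
    by_season = {}
    for season, player in rows:
        by_season.setdefault(season, set()).add(player)
    weights = Counter()
    for players in by_season.values():
        # Sorting gives each pair one canonical (undirected) key.
        for pair in combinations(sorted(players), 2):
            weights[pair] += 1
    return weights

def seasons_per_player(rows):
    """Years with the team, usable as a node-size attribute."""
    return Counter(player for _season, player in set(rows))

edges = shared_season_edges(rosters)
sizes = seasons_per_player(rosters)
print(dict(edges), dict(sizes))
```

Each key in `edges` is one undirected link, and its count is the edge weight a layout tool can use; `sizes` plays the role of the years-with-team attribute.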
  16. A quick look at three prominent players shows some readily observable differences in centrality measures:
    • Despite playing several fewer seasons than either Williams or Yastrzemski, Varitek has the most connections, but Yastrzemski could get you to more players faster because he is very central to the network structure
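The contrast on slide 16 between having the most connections and reaching other players fastest corresponds to the difference between degree and a distance-based measure. The slide does not say which centrality measure was used; the sketch below assumes closeness centrality and uses a hypothetical 12-node graph (not the Red Sox data) in which the two measures pick different nodes.

```python
from collections import deque

# Hypothetical graph: H is a hub with many leaf neighbors, while M sits
# between H and two smaller groups (X and Y).
edges = [
    ("H", "a"), ("H", "b"), ("H", "c"), ("H", "d"), ("H", "M"),
    ("M", "X"), ("M", "Y"),
    ("X", "x1"), ("X", "x2"), ("Y", "y1"), ("Y", "y2"),
]

def build_graph(edge_list):
    """Turn an undirected edge list into an adjacency map."""
    g = {}
    for u, v in edge_list:
        g.setdefault(u, set()).add(v)
        g.setdefault(v, set()).add(u)
    return g

def closeness(g, source):
    """(reachable nodes) / (sum of shortest-path lengths from source)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbor in g[current]:
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

graph = build_graph(edges)
degree = {node: len(neighbors) for node, neighbors in graph.items()}
close = {node: closeness(graph, node) for node in graph}

most_connected = max(degree, key=degree.get)
most_central = max(close, key=close.get)
print(most_connected, most_central)
```

H has the most direct links, but M needs fewer total hops to reach everyone, which mirrors the pattern the slide describes.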
  17. GDELT data offers many opportunities for viewing network data based on published accounts of news events around the world. Our exploration focuses on US Government threats reported between March 1 and April 30, 2016.
  18. GDELT Network Topology (Geo Layout)
    • Connections are between Actor1 and Actor2 within a specific event instance; Actor1 is often the Protagonist, Actor2 the Target
    • Nodes are positioned by lat/lon coordinates; most are concentrated in the Northeast US
    • Node and edge colors are based on the GDELT GoldsteinScale variable; darker colors are indicative of higher destabilization potential
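The Actor1-to-Actor2 connections described on slide 18 amount to grouping event rows by actor pair. The sketch below is an illustration under assumptions: the column names `Actor1Name` and `Actor2Name` and the sample rows are hypothetical stand-ins, and the deck's own refinement was done in SQL.

```python
from collections import defaultdict

# Hypothetical event rows. Actor1Name / Actor2Name are assumed column
# names; the actors and scores are placeholders, not GDELT records.
events = [
    {"Actor1Name": "ACTOR_A", "Actor2Name": "ACTOR_B", "GoldsteinScale": -4.0},
    {"Actor1Name": "ACTOR_A", "Actor2Name": "ACTOR_B", "GoldsteinScale": -6.0},
    {"Actor1Name": "ACTOR_B", "Actor2Name": "ACTOR_C", "GoldsteinScale": 2.0},
    {"Actor1Name": "ACTOR_A", "Actor2Name": "", "GoldsteinScale": 1.0},
]

def actor_edges(rows):
    """Directed Actor1 -> Actor2 edges with event count and mean score."""
    scores = defaultdict(list)
    for row in rows:
        source, target = row.get("Actor1Name"), row.get("Actor2Name")
        if not source or not target:
            continue  # an event without both actors cannot form a link
        scores[(source, target)].append(float(row["GoldsteinScale"]))
    return {
        pair: {"events": len(vals), "mean_goldstein": sum(vals) / len(vals)}
        for pair, vals in scores.items()
    }

edge_table = actor_edges(events)
print(edge_table)
```

Edges are kept directed here because the slide distinguishes the protagonist from the target; the mean score per pair is one possible value to drive edge color.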
  19. GDELT Network Topology (Dual Circle Layout)
    • Prominent nodes are positioned in the inner circle, based on the number of articles on cumulative events (speeches, press conferences, negotiations, etc.)
    • Secondary nodes are positioned around the outer circle; these may be either primary or secondary actors in an event
    • Node colors are again based on the GDELT GoldsteinScale variable
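The idea behind slide 19's two-ring arrangement can be sketched by ranking nodes on article count and spacing them evenly on an inner and an outer circle. This is not Gephi's Dual Circle implementation; the function names, radii, and sample counts are assumptions made for illustration.

```python
import math

def circle_positions(nodes, radius):
    """Place nodes at equal angles on a circle of the given radius."""
    count = len(nodes)
    return {
        node: (radius * math.cos(2 * math.pi * i / count),
               radius * math.sin(2 * math.pi * i / count))
        for i, node in enumerate(nodes)
    }

def dual_circle_layout(article_counts, inner_count,
                       inner_radius=1.0, outer_radius=2.0):
    """Top `inner_count` nodes by article count go on the inner ring."""
    ranked = sorted(article_counts, key=lambda n: (-article_counts[n], n))
    positions = circle_positions(ranked[:inner_count], inner_radius)
    positions.update(circle_positions(ranked[inner_count:], outer_radius))
    return positions

# Hypothetical article counts per actor node.
counts = {"A": 50, "B": 40, "C": 5, "D": 3, "E": 1}
layout = dual_circle_layout(counts, inner_count=2)
for node, (x, y) in layout.items():
    print(node, round(x, 2), round(y, 2))
```

Nodes A and B land on the unit circle and the remaining three on the outer ring, so prominence is readable from distance to the center.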
  20. A few minutes of network exploration reveals topic patterns based on news reporting, and shows which actors are directing actions against others and the tone of those actions. Tracking these measures over time will enable us to spot both positive and negative trends.
  21. Conclusions
    • Network graph analysis is a powerful tool for visually and statistically assessing complex networks
    • Network graphs are proliferating, due to the availability of multiple open source tools and increasing amounts of open data
    • Network graph analysis can be used to tell powerful stories wherever connected data is present
  22. Miles Davis network specs:
    • Data sourced from Wikipedia
    • Nodes and edges created in Excel
    • Graph created in Gephi using the Yifan Hu Proportional algorithm
    • Exported to Sigma.js (JSON format)
    • 348 nodes, 596 edges
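For the JSON export step on slide 22, the sketch below writes a node and edge list in the `nodes` / `edges` shape (with `id`, `label`, `x`, `y`, `size`, `source`, `target`) that Sigma.js JSON loaders commonly accept. The exact schema of the deck's export is not shown, so treat this layout, the function name, and the sample records as assumptions.

```python
import json

def to_sigma_json(nodes, edges):
    """Serialize nodes and edges into a Sigma.js-style JSON document."""
    payload = {
        "nodes": [
            {"id": n["id"], "label": n["label"],
             "x": n["x"], "y": n["y"], "size": n["size"]}
            for n in nodes
        ],
        "edges": [
            {"id": f"e{i}", "source": source, "target": target}
            for i, (source, target) in enumerate(edges)
        ],
    }
    return json.dumps(payload, indent=2)

# Hypothetical records; coordinates would normally come from the layout.
sample_nodes = [
    {"id": "album-1", "label": "Album 1", "x": 0.0, "y": 0.0, "size": 3},
    {"id": "musician-1", "label": "Musician 1", "x": 1.0, "y": 0.5, "size": 1},
]
sample_edges = [("album-1", "musician-1")]

document = to_sigma_json(sample_nodes, sample_edges)
print(document)
```

Gephi can produce such a file through its export tooling; writing it by hand like this is mainly useful for checking what the browser-side code expects.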
  23. Red Sox Player Network specs:
    • Data sourced from the Lahman Database at seanlahman.com
    • Nodes and edges created using SQL code in Toad for MySQL
    • Graphs created in Gephi using the ARF layout algorithm
    • JSON file exported to Sigma.js
    • 1,668 nodes, 51,223 edges
  24. GDELT classifications:
    • Type refers to groupings such as Government, Media, Education, and many more
    • Event codes reference the type of event: riots, protests, sanctions, and so on
    • The GoldsteinScale runs from -10 to 10 in describing the relative destabilizing potential of the event
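The -10 to 10 GoldsteinScale range on slide 24 can be mapped to a color ramp in which darker means more destabilizing. The sketch below assumes the negative end of the scale is the more destabilizing one and uses a simple grayscale ramp; the deck's actual palette is not specified, and the function name is hypothetical.

```python
def goldstein_to_gray(score):
    """Map a GoldsteinScale value to a grayscale hex color.

    Assumption: -10 (most destabilizing) maps to black and +10 to white.
    Values outside the range are clamped.
    """
    clamped = max(-10.0, min(10.0, float(score)))
    level = round((clamped + 10.0) / 20.0 * 255)
    return f"#{level:02x}{level:02x}{level:02x}"

for value in (-10, -5, 5, 10):
    print(value, goldstein_to_gray(value))
```

The same function can color both nodes and edges, for example from a per-node or per-edge mean score.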
  25. GDELT Network specs:
    • Data sourced from the GDELT event database at gdeltproject.org (March 1 to April 30, 2016)
    • Nodes and edges refined using SQL code in Toad for MySQL
    • Graphs created in Gephi using the Geo Layout and Dual Circle algorithms
    • GEXF files exported for use with Sigma.js
    • 414 nodes, 11,975 edges