Graph Algorithms: Predict Real-World Behavior

Learn how graph algorithms can help you predict real-world behavior and why an averages approach fails. Find out which algorithms to apply for various types of data analysis. From this session, you will gain the knowledge to recognize whether you have a graph analytics problem and how you can get started.

Jennifer Reif

March 14, 2019

Transcript

  1. Graph Algorithms The Right Tool for Real-World Networks

    Jennifer Reif
    Software Engineer, Neo4j
    - Developer
    - Blogger
    - Conference speaker
    Love cats, coffee, and traveling :)
    [email protected] @jmhreif
  2. Do You Have a Graph Analytics Problem?

    Requires Understanding Relationships and Structures
    Flow & Dynamics
    Interactions & Resiliency
    Propagation Pathways
    Forecast Behavior & Prescribe Action
  3. Averages Aren’t Reality

    Chart: Average Distribution - Random - (axes: Nodes, Relationships)
    “There is No Network in Nature that we know of that would be described by the Random network model.” –Albert-László Barabási
  4. Averages Aren’t Reality

    Chart: Average Distribution - Random - (axes: Nodes, Relationships)
    Chart: Power Law Distribution - Scale-Free - (axes: Nodes, Relationships)
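The contrast between the two charts can be reproduced with a small simulation. The sketch below is illustrative only and is not taken from the slides: it assumes a uniformly random graph as the "average" model and a simple preferential-attachment growth rule as the scale-free model, and the function names, graph sizes, and seed are arbitrary choices.

```python
import random


def random_graph_degrees(n, num_edges, rng):
    """Degrees of a graph whose edges join uniformly random node pairs."""
    degree = [0] * n
    edges = set()
    while len(edges) < num_edges:
        a, b = rng.sample(range(n), 2)
        edge = (min(a, b), max(a, b))
        if edge not in edges:
            edges.add(edge)
            degree[a] += 1
            degree[b] += 1
    return degree


def preferential_attachment_degrees(n, m, rng):
    """Degrees of a graph grown by preferential attachment.

    Each new node links to m distinct existing nodes, picked with
    probability proportional to their current degree.
    """
    degree = [0] * n
    endpoints = []  # every node appears once per edge it touches
    # Start from a small fully connected core of m + 1 nodes.
    for a in range(m + 1):
        for b in range(a + 1, m + 1):
            degree[a] += 1
            degree[b] += 1
            endpoints += [a, b]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for target in targets:
            degree[new] += 1
            degree[target] += 1
            endpoints += [new, target]
    return degree


rng = random.Random(42)  # arbitrary seed, only for repeatability
scale_free = preferential_attachment_degrees(2000, 2, rng)
# Same number of relationships, but wired uniformly at random.
uniform = random_graph_degrees(2000, sum(scale_free) // 2, rng)

print("average degree (both):", sum(scale_free) / len(scale_free))
print("max degree, random:", max(uniform))
print("max degree, scale-free:", max(scale_free))
```

Both graphs share the same average degree, yet the preferential-attachment graph should show a few heavily connected hubs that the average hides.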
  5. And You’ll Never See the Structures

    Chart: Average Distribution - Random - (axes: Nodes, Relationships)
    Most nodes have the same number of links
    No highly connected nodes
    - Scale-Free -
    - Small World -
  6. Graph Algorithms: Extract Structure and Infer Behavior

    Source: “Communities, modules and large-scale structure in networks” - Mark Newman
    Source: “Hierarchical structure and the prediction of missing links in networks”; “Structure and inference in annotated networks” - A. Clauset, C. Moore, and M.E.J. Newman.
  7. Graph Algorithms

    Pathfinding & Search: Finds the optimal path or evaluates route availability and quality
    Centrality: Determines the importance of distinct nodes in the network
    Community Detection: Evaluates how a group is clustered or partitioned
  8. Algorithms - Pathfinding & Search

    • Single-Source Shortest Path ◦ Calculates “shortest” path between a node and all other nodes
    • All-Pairs Shortest Path ◦ Finds all shortest paths between all nodes
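As a rough illustration of the two shortest-path variants, here is a minimal sketch using Dijkstra's algorithm on a small weighted graph. The slides do not say which algorithm the Neo4j procedures use internally; the `roads` data and function names are made up for the example, and all-pairs is done naively by repeating the single-source search from every node.

```python
import heapq


def single_source_shortest_paths(graph, source):
    """Dijkstra: cheapest total weight from source to each reachable node.

    graph maps node -> {neighbor: non-negative weight}.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale queue entry, a cheaper route was already found
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


def all_pairs_shortest_paths(graph):
    """Run the single-source search once per node."""
    return {node: single_source_shortest_paths(graph, node) for node in graph}


# Hypothetical undirected road network (each edge listed in both directions).
roads = {
    "A": {"B": 4, "C": 1},
    "B": {"A": 4, "C": 2, "D": 5},
    "C": {"A": 1, "B": 2, "D": 8},
    "D": {"B": 5, "C": 8},
}

from_a = single_source_shortest_paths(roads, "A")
all_pairs = all_pairs_shortest_paths(roads)
print(from_a)
```

Note that the cheapest route from A to B goes through C (cost 3) rather than using the direct relationship (cost 4), which is why “shortest” is in quotes on the slide: it means lowest total weight, not fewest hops.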
  9. Algorithms - Pathfinding & Search

    • Single-Source Shortest Path ◦ Calculates the shortest path between a node and all other nodes
    • All-Pairs Shortest Path ◦ Calculates the group of all shortest paths between nodes
    • Minimum Weight Spanning Tree ◦ Calculates the tree with the smallest total value that reaches all nodes
    Least Cost Routing
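A minimum weight spanning tree can be sketched with Prim's algorithm. This is one standard way to compute it and is an assumption here, since the slides do not name the algorithm; the `roads` graph is hypothetical.

```python
import heapq


def minimum_spanning_tree(graph, start):
    """Prim's algorithm on an undirected, connected, weighted graph.

    graph maps node -> {neighbor: weight}.
    Returns (total_weight, list of (from, to, weight) edges).
    """
    visited = {start}
    heap = [(weight, start, nbr) for nbr, weight in graph[start].items()]
    heapq.heapify(heap)
    edges, total = [], 0
    while heap and len(visited) < len(graph):
        weight, a, b = heapq.heappop(heap)
        if b in visited:
            continue  # this edge would close a cycle
        visited.add(b)
        edges.append((a, b, weight))
        total += weight
        for nbr, w in graph[b].items():
            if nbr not in visited:
                heapq.heappush(heap, (w, b, nbr))
    return total, edges


# Hypothetical undirected network (each edge listed in both directions).
roads = {
    "A": {"B": 4, "C": 1},
    "B": {"A": 4, "C": 2, "D": 5},
    "C": {"A": 1, "B": 2, "D": 8},
    "D": {"B": 5, "C": 8},
}

total, tree_edges = minimum_spanning_tree(roads, "A")
print(total, tree_edges)
```

The result always has one edge fewer than the number of nodes, which is what makes it a tree rather than a single path.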
  10. Algorithms - Pathfinding & Search

    • Single-Source Shortest Path ◦ Calculates the shortest path between a node and all other nodes
    • All-Pairs Shortest Path ◦ Calculates the group of all shortest paths between nodes
    • Minimum Weight Spanning Tree ◦ Calculates the tree with the smallest total value that reaches all nodes
  11. Algorithms - Pathfinding & Search

    • Parallel Breadth-First Search & Depth-First Search ◦ Traverses a tree structure by exploring nearest neighbors first (BFS) or by going down each branch (DFS)
    • Single-Source Shortest Path ◦ Calculates the shortest path between a node and all other nodes
    • All-Pairs Shortest Path ◦ Calculates the group of all shortest paths between nodes
    • Minimum Weight Spanning Tree ◦ Calculates the tree with the smallest total value that reaches all nodes
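The difference between the two traversal orders is easiest to see on a tiny tree. The sketch below is sequential, not the parallel variant named on the slide, and the `tree` data is invented for the example.

```python
from collections import deque


def bfs_order(graph, start):
    """Visit nodes level by level: nearest neighbors first."""
    visited = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in graph.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                visited.append(nbr)
                queue.append(nbr)
    return visited


def dfs_order(graph, start):
    """Follow each branch to its end before backtracking."""
    visited, seen = [], set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        visited.append(node)
        # Reverse so the first-listed child is explored first.
        stack.extend(reversed(graph.get(node, [])))
    return visited


# Hypothetical tree: node -> list of children.
tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"]}

print(bfs_order(tree, "root"))  # ['root', 'a', 'b', 'a1', 'a2', 'b1']
print(dfs_order(tree, "root"))  # ['root', 'a', 'a1', 'a2', 'b', 'b1']
```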
  12. Algorithms - Centralities

    • PageRank ◦ Which nodes have the most overall influence
    • Betweenness ◦ Which nodes are the bridges between different clusters (most shortest paths)
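To make "overall influence" concrete, here is a minimal power-iteration sketch of PageRank. The damping factor of 0.85 is the commonly used value, the iteration count is an arbitrary choice, and the `links` graph is made up; none of these come from the slides.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    graph maps node -> list of nodes it links to.
    Returns node -> score, with scores summing to 1.
    """
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not graph.get(node))
        base = (1 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node, targets in graph.items():
            for target in targets:
                new_rank[target] += damping * rank[node] / len(targets)
        rank = new_rank
    return rank


# Hypothetical link graph: A, B and D all point at C; C points back at A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}

scores = pagerank(links)
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 3))
```

C ranks highest because three nodes link to it, and A comes second even though only one node links to it, because that one node is the influential C.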
  14. Algorithms - Centralities

    • PageRank ◦ Which nodes have the most overall influence
    • Closeness ◦ Which nodes are able to reach the entire group the fastest
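Closeness can be sketched as the inverse of a node's average hop distance to the nodes it can reach. The version below assumes an unweighted, undirected graph; the `chain` data is hypothetical.

```python
from collections import deque


def closeness_centrality(graph):
    """Closeness = (reachable nodes) / (sum of hop distances to them).

    graph maps node -> list of neighbors (undirected, unweighted).
    """
    scores = {}
    for source in graph:
        # Breadth-first search gives hop distances from the source.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total = sum(dist.values())
        scores[source] = (len(dist) - 1) / total if total else 0.0
    return scores


# Hypothetical chain of five people: A - B - C - D - E.
chain = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}

closeness = closeness_centrality(chain)
print(closeness)
```

The middle node C scores highest because it reaches everyone in at most two hops, while the end nodes need up to four.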
  15. Algorithms - Centralities

    • PageRank ◦ Which nodes have the most overall influence
    • Closeness ◦ Which nodes are able to reach the entire group the fastest
    • Betweenness ◦ Which nodes are the bridges between different clusters (most shortest paths)
    Source: Maven 7
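Betweenness counts how many shortest paths pass through each node. A minimal sketch using Brandes' algorithm follows; that choice is an assumption, as the slides do not name an algorithm, and the two-cluster `graph` is invented to show a bridge node.

```python
from collections import deque


def betweenness_centrality(graph):
    """Brandes' algorithm for an unweighted, undirected graph.

    graph maps node -> list of neighbors.
    """
    score = {node: 0.0 for node in graph}
    for source in graph:
        # Breadth-first search that also counts shortest paths (sigma).
        order = []
        preds = {node: [] for node in graph}
        sigma = {node: 0 for node in graph}
        sigma[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
                if dist[nbr] == dist[node] + 1:
                    sigma[nbr] += sigma[node]
                    preds[nbr].append(node)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {node: 0.0 for node in graph}
        for node in reversed(order):
            for p in preds[node]:
                delta[p] += sigma[p] / sigma[node] * (1 + delta[node])
            if node != source:
                score[node] += delta[node]
    # Each undirected pair was counted once from each endpoint.
    return {node: value / 2 for node, value in score.items()}


# Hypothetical graph: triangle A-B-C and triangle E-F-G joined through D.
graph = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C", "E"],
    "E": ["D", "F", "G"],
    "F": ["E", "G"],
    "G": ["E", "F"],
}

betweenness = betweenness_centrality(graph)
print(betweenness)
```

D has only two connections but sits on every shortest path between the two clusters, so it scores highest.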
  16. Algorithms - Centralities

    • PageRank ◦ Which nodes have the most overall influence
    • Closeness ◦ Which nodes are able to reach the entire group the fastest
    • Betweenness ◦ Which nodes are the bridges between different clusters (most shortest paths)
    • Degree ◦ The number of connections in/out of a node
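Degree is the simplest of the four measures: count the relationships pointing into and out of each node. A small sketch, with a made-up list of directed "follows" relationships:

```python
from collections import Counter


def degree_centrality(edges):
    """Count incoming and outgoing relationships per node.

    edges is a list of (source, target) pairs.
    Returns (in_degree, out_degree) as Counters.
    """
    in_degree, out_degree = Counter(), Counter()
    for source, target in edges:
        out_degree[source] += 1
        in_degree[target] += 1
    return in_degree, out_degree


# Hypothetical directed relationships: (follower, followed).
follows = [("ann", "bob"), ("cat", "bob"), ("bob", "ann")]

in_degree, out_degree = degree_centrality(follows)
print("in:", dict(in_degree))
print("out:", dict(out_degree))
```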
  17. Understanding Influence

    Preventing Cascading Failures with 4 Nodes Removed
    Source: “Robustness of the European power grids under intentional attack.” - R.V. Sole, M. Rosas-Casals, B. Corominas-Murtra, and S. Valverde.
    Source: “Network Science” - Barabasi
  18. Algorithms – Community Detection

    • Label Propagation ◦ Spreads labels based on neighbors to infer clusters
    • Union Find / Weakly Connected Components ◦ Finds groups of nodes that all have a path to each other
    • Strongly Connected Components ◦ Finds groups of nodes that are all connected to each other following the direction of relationships
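The Union Find approach to weakly connected components can be sketched in a few lines: every relationship merges the groups of its two endpoints, and direction is ignored. The node and edge data below are hypothetical.

```python
def weakly_connected_components(nodes, edges):
    """Group nodes that are linked by any path, ignoring direction."""
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent links to the group's representative node.
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # union the two groups

    components = {}
    for node in nodes:
        components.setdefault(find(node), set()).add(node)
    return list(components.values())


# Hypothetical data: two separate groups.
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("B", "C"), ("D", "E")]

groups = weakly_connected_components(nodes, edges)
print(groups)
```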
  19. Algorithms – Community Detection

    • Label Propagation ◦ Spreads labels based on neighbors to infer clusters
    • Union Find / Weakly Connected Components ◦ Finds groups of nodes that all have a path to each other
    • Strongly Connected Components ◦ Finds groups of nodes that are all connected to each other following the direction of relationships
    • Louvain Modularity ◦ Finds communities by maximizing modularity, which measures the presumed accuracy of a community grouping
    Source: “Fast unfolding of communities in large networks” – Blondel, Guillaume, Lambiotte, Lefebvre
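The modularity score itself is straightforward to compute for a given grouping. The sketch below only evaluates the score for an unweighted, undirected graph; it does not implement the Louvain optimization that searches for the grouping with the highest score. The graph and the two candidate groupings are made up.

```python
def modularity(edges, community_of):
    """Modularity Q of a grouping on an unweighted, undirected graph.

    Q = sum over communities of
        (edges inside the community / m) - (degree sum / 2m) ** 2
    where m is the total number of edges.
    """
    m = len(edges)
    internal = {}    # community -> number of edges fully inside it
    degree_sum = {}  # community -> sum of degrees of its nodes
    for a, b in edges:
        ca, cb = community_of[a], community_of[b]
        degree_sum[ca] = degree_sum.get(ca, 0) + 1
        degree_sum[cb] = degree_sum.get(cb, 0) + 1
        if ca == cb:
            internal[ca] = internal.get(ca, 0) + 1
    return sum(
        internal.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Hypothetical graph: two triangles joined by a single edge C-D.
edges = [
    ("A", "B"), ("B", "C"), ("A", "C"),
    ("D", "E"), ("E", "F"), ("D", "F"),
    ("C", "D"),
]

two_groups = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
one_group = {node: 0 for node in two_groups}

q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
print(round(q_two, 4), round(q_one, 4))
```

Splitting along the bridge scores higher than lumping everything together, which matches the intuition that the two triangles are separate communities.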
  20. Algorithms – Community Detection

    • Label Propagation ◦ Spreads labels based on neighbors to infer clusters
    • Union Find / Weakly Connected Components ◦ Finds groups of nodes that all have a path to each other
    • Strongly Connected Components ◦ Finds groups of nodes that are all connected to each other following the direction of relationships
    • Louvain Modularity ◦ Finds communities by maximizing modularity, which measures the presumed accuracy of a community grouping
    • Triangle-Count & Clustering Coefficient ◦ Measures the degree to which nodes tend to cluster together
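Triangle count and the local clustering coefficient can be computed together: for each node, count how many pairs of its neighbors are themselves connected. A minimal sketch on a hypothetical undirected graph:

```python
from itertools import combinations


def triangles_and_clustering(graph):
    """Per-node triangle count and local clustering coefficient.

    graph maps node -> set of neighbors (undirected).
    Returns node -> (triangles, coefficient).
    """
    result = {}
    for node, nbrs in graph.items():
        # A triangle exists for every pair of neighbors that are linked.
        triangles = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
        k = len(nbrs)
        possible = k * (k - 1) / 2
        coefficient = triangles / possible if possible else 0.0
        result[node] = (triangles, coefficient)
    return result


# Hypothetical graph: triangle A-B-C with D hanging off C.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}

stats = triangles_and_clustering(graph)
print(stats)
```

A coefficient of 1.0 means all of a node's neighbors know each other; C scores lower because its neighbor D is connected to nobody else.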
  21. 18 Graph Algorithms Apply to All Real-World Networks Where You Need to Predict Complex Interactions

    • Anti Money Laundering
    • Recommendations
    • Terrorist Networks
    • Credit-Checks
    • Fraud Prevention
    • Cybersecurity
    • Network Design
    • PoS Profitability
    • Alternate Routing
    • Urban Resource Placement
    • Theory Generation
    • ML Feature Extraction
    • Disease Spread
    • Rippling Travel / Logistic Delays
    • Drug Gene-Targeting
  22. “Graphs are one of the Unifying Themes of computer science . . . That so many different structures can be modeled using a single formalism is a Source of Great Power to the educated programmer.”

    - Steven S. Skiena
  23. Many Moving Parts!

    Example Workflow Pipeline based on John Swain’s Twitter Analysis
    Pipeline diagram: Twitter Streaming API, Python Tweet Collection (includes user data), RabbitMQ, MongoDB, Neo4j, R Scripts (Graph Stats, Community Detection), MySQL, Graph .graphml, Tableau, Graph Visualization
    • Moved from Twitter Search API to Streaming API
    • Replaced Python Twitter libraries (Tweepy) with raw API calls
    • Streaming tweets in message queue
    • Full tweets and user data stored in MongoDB
    • Built graph for analysis in Neo4j from tweets persisted in MongoDB
    • Analysis in R
    • iGraph libraries for algorithms
    • Some text analysis e.g. LDA topics
    • Results published in MySQL for Tableau
    • GraphML for import to Gephi with stats precalculated
  24. Our Goal is to Simplify

    Example Workflow Pipeline based on John Swain’s Twitter Analysis
    Pipeline diagram: Twitter Streaming API, Python Tweet Collection (includes user data), RabbitMQ, MongoDB, Neo4j, R Scripts (Graph Stats, Community Detection), MySQL, Graph .graphml, Tableau, Graph Visualization
  25. Neo4j

    Native Graph Database
    Analytics Integrations
    Cypher Query Language
    Wide Range of APOC Procedures
    Optimized Graph Algorithms
  26. How To…

    1. Call as Cypher procedure
    2. Pass in specification (Label, Prop, Query) and configuration
    3. Execute and return results
    A. ~.stream variant returns (a lot of) results
        CALL algo.<name>.stream('Label','TYPE',{conf})
        YIELD nodeId, score
    B. non-stream variant writes results to graph and returns statistics
        CALL algo.<name>('Label','TYPE',{conf})
  27. Cypher Projection

    Pass in Cypher statement for node- and relationship-lists.
        CALL algo.<name>(
          'MATCH ... RETURN id(n)',
          'MATCH (n)-->(m)
           RETURN id(n) as source,
                  id(m) as target',
          {graph:'cypher'})
  29. Datasets

    Game of Thrones • 800 nodes • 400 relationships • Sandbox: :play data_science
    Yelp Business Graph • 5m nodes • 17m relationships • GitHub: https://github.com/neo4j-contrib/neo4j-data-science-yelp/blob/master/notebooks/neo4j_yelp_00_data_load.ipynb
    Neo4j Community Graph • 280k nodes • 1.4m relationships • GitHub: https://github.com/community-graph/documentation Browser: http://138.197.15.1:7474 username: "all" pwd: "readonly"
    DBPedia • 11m nodes • 116m relationships • https://github.com/jexp/graphipedia