Graph algorithms for improving ML predictions

Data Science DC Meetup, 4/15

Next 30 Mins
• Graph Analytics & Algorithms
• Graph enhanced machine learning
• ML process efficiencies
• Connected feature extraction
• Connected feature selection
• Link prediction

Amy E. Hodler
Graph Analytics & AI Program Manager, Neo4j
@amyhodler #Neo4j

Free O’Reilly Book
neo4j.com/graph-algorithms-book
• Spark & Neo4j Examples
• Machine Learning Chapter

Relationships Are Often the Strongest Predictors of Behavior
“Increasingly we're learning that you can make better predictions about people by getting all the information from their friends and their friends’ friends than you can from the information you have about the person themselves” - Dr. James Fowler

Random - No Structure
Average Distribution: most nodes have the same number of relationships.
“There is No Network in Nature that we know of that would be described by the Random network model.” - Albert-László Barabási

Structured
Power-Law Distribution: most nodes have very few relationships while some have many more.
• Many statistical models erroneously focus on the average
• Underinvested area: where most nodes really exist

(Chart axes: Number of Nodes, Number of Relationships)

Graph Algorithms

Graph Algorithms Extract Structure and Infer Behavior
Source: “Communities, modules and large-scale structure in networks” - Mark Newman
Source: “Hierarchical structure and the prediction of missing links in networks”; “Structure and inference in annotated networks” - A. Clauset, C. Moore, and M.E.J. Newman

Extract Predictive Elements using Relations in the Bigger Picture

Local Patterns: Query (e.g. Cypher/Python)
• Real-time, local decisioning and pattern matching
• You know what you’re looking for and making a decision

Global Computation: Graph Algorithms Libraries
• Global analysis and iterations
• You’re learning the overall structure of a network, updating data, and predicting

Common Types of Graph Algorithms
Classic Graph Algorithms Areas and Other Common Categories:
• Pathfinding & Search
• Centrality / Importance
• Community Detection+
• Similarity
• Link Prediction
• ML Workflow
• Network Flow & Percolation
• Decomposition, Covering & Coloring
• Subgraph & Isomorphism
• Basic Stats
• Assortative Mixing
• So many others!

Graph & ML Algorithms in Neo4j
+35
neo4j.com/graph-algorithms-book/
• Pathfinding & Search: finds optimal paths or evaluates route availability and quality
• Centrality / Importance: determines the importance of distinct nodes in the network
• Community Detection: detects group clustering or partition options
• Similarity: evaluates how alike nodes are
• Link Prediction: estimates the likelihood of nodes forming a future relationship

Graph Enhanced AI & ML

Better Decisions
Graphs add highly predictive features to models, adding accuracy and efficiencies without altering current workflows.
Traditional methods based on “flat data” simplify, or leave out entirely, predictive relationship and network data.
(Diagram labels: Machine Learning Pipeline, Decisions $, Better Decisions)

Graph accelerated ML uses context for efficiency

Graph Accelerated ML
56% of enterprise CIOs say iterative model training is the largest ML challenge

Graph Accelerated ML
Graph filtering is quite efficient, especially compared to typical manual sub-setting or statistical inference

Graph Filtering - Example Algorithms

Community Detection: Filter Groups
• Strongly Connected Components: all nodes are connected in the direction of relationships
• CC/Union Find: disregards direction

Centrality: Filter Top Influencer
• Betweenness Centrality: sums the % of shortest paths that pass through a node, calculated by pairs
• Closeness Centrality: which nodes can reach all other nodes the fastest
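To make the betweenness description concrete, here is a minimal sketch in plain Python. It assumes an unweighted, undirected graph stored as a dictionary of neighbor sets, and it uses Brandes' algorithm, which is one common way to compute the measure and not necessarily what any particular library does. The node names and the `betweenness` helper are illustrative only.

```python
from collections import deque

# Hypothetical toy graph: a - b - c, with d and e hanging off c.
graph = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d", "e"},
    "d": {"c"},
    "e": {"c"},
}

def betweenness(graph):
    """For each node, sum over node pairs the fraction of their shortest paths through it."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        # Breadth-first search from s, counting shortest paths (sigma)
        # and remembering each node's predecessors on those paths.
        order = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path fractions.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every undirected pair was visited from both endpoints, so halve the totals.
    return {v: total / 2 for v, total in score.items()}

scores = betweenness(graph)
top_influencer = max(scores, key=scores.get)
```

In this toy graph the node `c` sits on the most shortest paths, so a "filter top influencer" step would keep it first.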

Graph Accelerated ML - Research
Running Machine Learning within a Graph

Enhance Your Predictions
Connected Features add context to ML for improved accuracy, precision, and recall

Improve the Predictive Power of ML
Example in Fighting Financial Crimes
• Transaction Fraud
• Anti-money laundering (AML)
• Claims Fraud
• Credit Fraud
• Compliance and investigation

Machine Learning can help uncover & learn common traits so we can build more predictive models. Unfortunately, many machine learning methods rely on flat data structures and tables.
(Diagram labels: Data, Machine Learning Pipeline)

Combat Financial Crimes using Connected Features
Engineering connected features improves Machine Learning by calculating relationship metrics when you know what’s predictive. For example, adding how many fraudsters are in someone’s network is faster and simpler using connections.
Typically a query, but more advanced situations might call for graph algorithms.
(Diagram nodes: ACCOUNT HOLDER, BANK ACCOUNT, SSN/ID NUMBER, UNSECURED LOAN, PHONE NUMBER, CREDIT CARD, ADDRESS, $ APPLICATION)
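The "how many fraudsters are in someone's network" feature can be sketched as a bounded breadth-first search. This is only an illustration in plain Python: the graph, the `flagged` set, the node names, and the `fraudsters_within` helper are all made up here, and in practice this would more likely be a graph query.

```python
from collections import deque

# Hypothetical graph: account holders connected through shared identifiers.
graph = {
    "holder_a": {"phone_1", "addr_1"},
    "holder_b": {"phone_1", "ssn_9"},
    "holder_c": {"addr_1"},
    "holder_d": {"ssn_9"},
    "phone_1": {"holder_a", "holder_b"},
    "addr_1": {"holder_a", "holder_c"},
    "ssn_9": {"holder_b", "holder_d"},
}
flagged = {"holder_b", "holder_d"}  # assumed known fraudsters

def fraudsters_within(graph, start, flagged, max_hops=2):
    """Count flagged nodes reachable from start in at most max_hops steps."""
    seen = {start}
    frontier = deque([(start, 0)])
    count = 0
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                if neighbor in flagged:
                    count += 1
                frontier.append((neighbor, hops + 1))
    return count

near = fraudsters_within(graph, "holder_a", flagged, max_hops=2)
far = fraudsters_within(graph, "holder_a", flagged, max_hops=4)
```

Each count can then be attached to the account holder as one more column in an otherwise flat feature table.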

Connected Feature Engineering - Maybe Algorithms
Pathfinding & Search: find/score items on a particular route

Combat Financial Crimes using Connected Features
Connected feature extraction and selection using graph algorithms improves accuracy and precision by uncovering more predictive elements to feed into ML models. For example, finding anomalies of tight communities that might be money laundering networks, or identifying which attributes are most predictive of fraud.

Connected Feature Extraction - Example Algorithms

Community Detection: Scoring Connectedness
• Triangle Count: number of triangles passing through a node
• Clustering Coefficient: probability that neighbors of a particular node are connected; can be normalized globally
• Example: node u has Triangles = 2, CC = 0.33

Classification
• Label Propagation: adopts labels based on neighbors to infer clusters
• Great choice for fast grouping at scale and data preprocessing
• Well suited where groupings are less clear and weights can be used
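As a worked illustration of triangle count and the local clustering coefficient, the sketch below builds a small graph chosen so that node `u` comes out with 2 triangles and a coefficient of about 0.33. The graph is an assumption made for this example, not a reconstruction of the slide's figure, and the helper names are hypothetical.

```python
from itertools import combinations

# Hypothetical graph: u has four neighbors; only a-b and c-d are connected.
graph = {
    "u": {"a", "b", "c", "d"},
    "a": {"u", "b"},
    "b": {"u", "a"},
    "c": {"u", "d"},
    "d": {"u", "c"},
}

def triangle_count(graph, node):
    """Count pairs of the node's neighbors that are themselves connected."""
    return sum(1 for x, y in combinations(graph[node], 2) if y in graph[x])

def clustering_coefficient(graph, node):
    """Connected neighbor pairs divided by all possible neighbor pairs."""
    degree = len(graph[node])
    if degree < 2:
        return 0.0
    possible_pairs = degree * (degree - 1) / 2
    return triangle_count(graph, node) / possible_pairs

triangles_u = triangle_count(graph, "u")
cc_u = clustering_coefficient(graph, "u")
```

Here `u` has 6 possible neighbor pairs and 2 of them are connected, which gives 2 / 6.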

Connected Feature Selection - Example Algorithm
Centrality: Cut-out less predictive features
e.g. Graph centrality algorithms can identify influential features in our models so we can eliminate less important features and reduce overfitting.
PageRank: measures the transitive (directional) influence of nodes and considers the influence of neighbors and their neighbors. Personalized PR works well for contextual ranking.
(Diagram labels: ADDRESSES, PHONES, LOANS, SSN/IDs)
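The transitive influence that PageRank measures can be sketched with a short power iteration. This is a simplified version under stated assumptions: a directed graph given as `out_links`, a damping factor of 0.85, a fixed number of iterations, and dangling nodes that spread their rank evenly. None of these choices come from the slides.

```python
# Hypothetical directed graph: a and b point to c, c points to d, d has no out-links.
out_links = {
    "a": ["c"],
    "b": ["c"],
    "c": ["d"],
    "d": [],
}

def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank; every link target must also be a key of out_links."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node starts each round with the same small baseline share.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # A node without out-links spreads its rank over all nodes.
            targets = out_links[v] or nodes
            share = damping * rank[v] / len(targets)
            for w in targets:
                new_rank[w] += share
        rank = new_rank
    return rank

ranks = pagerank(out_links)
ordered = sorted(ranks, key=ranks.get, reverse=True)
```

Node `d` ends up ranked above `c` even though it has a single incoming link, because that link comes from an influential node. For feature selection, the lowest-ranked nodes would be the candidates to cut.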

Connected Feature Selection - Possible Algorithm
Similarity: Feature Overlap?
Overlap Similarity: ideal choice for finding hierarchy in data and developing super and sub-categories. The overlap similarity coefficient represents the co-occurrence of items between groups.
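The overlap coefficient is usually defined as the size of the intersection divided by the size of the smaller set. A minimal sketch, with made-up feature sets and a hypothetical `overlap_similarity` helper:

```python
def overlap_similarity(a, b):
    """Shared items divided by the size of the smaller set."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

# Hypothetical feature groups.
group_a = {"phone", "address", "ssn"}
group_b = {"phone", "address"}
group_c = {"phone", "loan"}

a_vs_b = overlap_similarity(group_a, group_b)
a_vs_c = overlap_similarity(group_a, group_c)
```

A score of 1.0 means the smaller set is fully contained in the larger one, which is why the measure is handy for spotting super and sub-categories.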

Link Prediction
A Goal, an Approach & an Algorithm Category
Can we infer which new interactions are likely to occur in the future?
“We formalize this question as the link prediction problem, and develop approaches to link prediction based on measures for analyzing the ‘proximity’ of nodes in a network.” - Jon Kleinberg and David Liben-Nowell

What can we use this approach for?
• future associations in a terrorist network
• co-authorships in a citation network
• associations between molecules in a biology network
• interest in an artist or artwork

What's common across all these use cases?
Predicting a link means that we are predicting some future behaviour or an unobserved fact. For example, in a citation network, we’re actually predicting the action of two people collaborating on a paper.

Graph Algorithms used with Link Prediction
• Link Prediction: computes a score for a pair of nodes that can be considered a measure of proximity or “similarity” between those nodes based on the graph topology
• Other Algorithms (Community Detection, Similarity): it’s common when our goal is link prediction to use a variety of algorithm types to extract features and use them together in a machine learning model

Common Neighbors
Based on the number of potential triangles / closing triangles. The concept is that if 2 strangers have a friend/colleague in common, they are more likely to be introduced.
http://be.amazd.com/link-prediction/
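A minimal sketch of the common-neighbors score, assuming an undirected graph stored as neighbor sets. The friendship graph and the `rank_candidates` helper are invented for illustration.

```python
from itertools import combinations

# Hypothetical friendship graph.
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"bob", "cat", "eve"},
    "eve": {"dan"},
}

def common_neighbors(graph, x, y):
    """Number of nodes adjacent to both x and y."""
    return len(graph[x] & graph[y])

def rank_candidates(graph):
    """Order currently unconnected pairs by how many neighbors they share."""
    pairs = [(x, y) for x, y in combinations(sorted(graph), 2) if y not in graph[x]]
    return sorted(pairs, key=lambda pair: common_neighbors(graph, *pair), reverse=True)

candidates = rank_candidates(graph)
best_pair = candidates[0]
```

Here `ann` and `dan` share two friends, so linking them would close two triangles and they come out as the top candidate.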

Adamic Adar (weighted common neighbors)
Refines the simple counting of common features by weighting rarer features more heavily. Formalizes the intuitive notion that rare features are more telling; if we both like GoT, that’s less predictive than a preference for 16th century poetry.
http://be.amazd.com/link-prediction/
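The usual Adamic Adar formula sums 1 / log(degree) over the neighbors two nodes share, so a rarely shared neighbor adds more than a popular one. The sketch below follows that formula on an invented people-and-interests graph; the node names and the `adamic_adar` helper are assumptions for illustration.

```python
import math

# Hypothetical graph: people linked to the interests they like.
graph = {
    "ann": {"got", "poetry"},
    "bob": {"got", "poetry"},
    "cat": {"got"},
    "dan": {"got"},
    "got": {"ann", "bob", "cat", "dan"},
    "poetry": {"ann", "bob"},
}

def adamic_adar(graph, x, y):
    """Sum 1 / log(degree) over the neighbors shared by two distinct nodes.

    A shared neighbor is linked to both x and y, so its degree is at least 2
    and the logarithm is always positive.
    """
    return sum(1.0 / math.log(len(graph[z])) for z in graph[x] & graph[y])

score_ann_bob = adamic_adar(graph, "ann", "bob")  # share "got" and "poetry"
score_cat_dan = adamic_adar(graph, "cat", "dan")  # share only "got"
```

The rare interest contributes 1 / log(2), twice the 1 / log(4) contributed by the popular one.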

Preferential Attachment
Multiplies the number of connections two nodes have. Reflects the tendency in real-world networks for highly connected nodes to become more connected (rich get richer, the popular get more friends; hub-spoke structure).
http://be.amazd.com/link-prediction/
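Since this score is just the product of the two nodes' degrees, a sketch is very short. The hub-and-spoke graph below is made up for illustration.

```python
# Hypothetical hub-and-spoke graph with one extra link between a and b.
graph = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub", "b"},
    "b": {"hub", "a"},
    "c": {"hub"},
    "d": {"hub"},
}

def preferential_attachment(graph, x, y):
    """Product of the two nodes' connection counts."""
    return len(graph[x]) * len(graph[y])

hub_score = preferential_attachment(graph, "hub", "a")
spoke_score = preferential_attachment(graph, "c", "d")
```

Any pair that includes the hub scores high regardless of shared neighbors, which is the "rich get richer" effect the measure encodes.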

What to do with algorithm scores?
• Use the measures directly
  ◦ Set a threshold value used to predict a link between nodes
• Use the measures as features to train a ML model
  ◦ e.g. A binary classifier that predicts which nodes will be linked

node1  node2  commonNeighbors  preferentialAttachment  label
1      2      4                15                      1
3      4      7                12                      1
5      6      1                1                       0
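Both options can be sketched on the three example rows from the slide. The threshold of 2 common neighbors is an arbitrary value picked for illustration, and `predict_link` is a hypothetical helper. The `X` and `y` lists show the shape a binary classifier would typically expect, without training one.

```python
# The three example rows from the slide's table.
rows = [
    {"node1": 1, "node2": 2, "commonNeighbors": 4, "preferentialAttachment": 15, "label": 1},
    {"node1": 3, "node2": 4, "commonNeighbors": 7, "preferentialAttachment": 12, "label": 1},
    {"node1": 5, "node2": 6, "commonNeighbors": 1, "preferentialAttachment": 1, "label": 0},
]

def predict_link(row, min_common=2):
    """Option 1: use a score directly by comparing it to a threshold (value assumed)."""
    return int(row["commonNeighbors"] >= min_common)

predictions = [predict_link(row) for row in rows]
accuracy = sum(p == row["label"] for p, row in zip(predictions, rows)) / len(rows)

# Option 2: hand the scores to a classifier as a feature matrix and label vector.
X = [[row["commonNeighbors"], row["preferentialAttachment"]] for row in rows]
y = [row["label"] for row in rows]
```

Three rows are far too few to evaluate anything; the point is only to show how the scores move from the graph into a model.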

Train and Test Datasets
Results look too good to be true?! Consider time-based splits.
Be careful of data leakage in graph data, especially when we randomly split the dataset. This can easily happen when working with graphs because pairs of nodes in our training set may be connected to those in the test set.
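A time-based split can be sketched as follows, assuming each relationship carries a timestamp (here a year). The edge list, the cutoff, and the `time_split` helper are hypothetical.

```python
# Hypothetical co-authorship edges: (author, author, year of first collaboration).
edges = [
    ("ann", "bob", 2015),
    ("bob", "cat", 2016),
    ("ann", "cat", 2018),
    ("cat", "dan", 2019),
]

def time_split(edges, cutoff):
    """Relationships before the cutoff go to training, the rest to testing."""
    train = [(u, v) for u, v, year in edges if year < cutoff]
    test = [(u, v) for u, v, year in edges if year >= cutoff]
    return train, test

train_edges, test_edges = time_split(edges, cutoff=2018)
```

With this split, training features would be computed only on the graph as it looked before the cutoff, so later relationships cannot leak into them.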

Neo4j Resources for Data Scientists
• Sandbox: Data & examples, neo4j.com/sandbox
• Community: Ask Anything, Community.neo4j.com
• Neuler: Run Algorithms Code-Free, neo4j.com/developer/graph-algorithms/

Free O’Reilly Book
neo4j.com/graph-algorithms-book
• Spark & Neo4j Examples
• Machine Learning Chapter
@amyhodler #Neo4j #GraphAnalytics

Graph and ML Algorithms in Neo4j (Updated April 2019)

Pathfinding & Search
• Parallel Breadth First Search & DFS
• Shortest Path
• Single-Source Shortest Path
• All Pairs Shortest Path
• Minimum Spanning Tree
• A* Shortest Path
• Yen’s K Shortest Path
• K-Spanning Tree (MST)
• Random Walk

Centrality / Importance
• Degree Centrality
• Closeness Centrality
• CC Variations: Harmonic, Dangalchev, Wasserman & Faust
• Betweenness Centrality
• Approximate Betweenness Centrality
• PageRank
• Personalized PageRank
• ArticleRank
• Eigenvector Centrality

Community Detection
• Triangle Count
• Clustering Coefficients
• Connected Components (Union Find)
• Strongly Connected Components
• Label Propagation
• Louvain Modularity: 1 Step & Multi-Step
• Balanced Triad (identification)

Similarity
• Euclidean Distance
• Cosine Similarity
• Jaccard Similarity
• Overlap Similarity
• Pearson Similarity

Link Prediction
• Adamic Adar
• Common Neighbors
• Preferential Attachment
• Resource Allocations
• Same Community
• Total Neighbors
