Slide 1

Slide 1 text

GraphFrames
GraphFrames speeds up big data prototype development by allowing us to perform mixed graph and non-graph data analysis in pySpark.
Lightning talk by: Elena Nemtseva at the 32nd PyData London Meetup
https://graphframes.github.io/
People behind GraphFrames:
● Ankur Dave (UC Berkeley AMPLab)
● Alekh Jindal (Microsoft)
● Li Erran Li (Uber)
● Reynold Xin (Databricks)
● Joseph Gonzalez (UC Berkeley)
● Matei Zaharia (MIT and Databricks)

Slide 2

Slide 2 text

Graph Data Structures
● SQL (some structure, requires joins): SQL Server, Data Frames
● NoSQL (unstructured, semi-structured): MongoDB, Redis
● Graph (highly connected, all required joins ‘built in’): Neo4j, NetworkX, GraphLab, GraphX, GraphFrames
Graph components:
● Nodes/Vertices (e.g. people or entities)
● Edges/Links - links between nodes (e.g. ‘friends with’, ‘member of’)
● Node and edge attributes – additional information about nodes and edges (e.g. age, url)
● Edge weights representing metrics (e.g. link strength)
Example:
● Node ID: 01, Attributes: {name: Elena, type: person}
● Node ID: 100, Attributes: {name: PyData, type: meetup}
● Edge Source: 01, Destination: 100, Attributes: {Type: ‘member of’, Weight: 0.7}
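
The node and edge example on this slide can be written down as plain Python records. This is only a sketch to make the components concrete: the dictionary layout and the `neighbours` helper are assumptions for illustration, not part of any graph library.

```python
# Plain-Python sketch of the example graph from this slide.
# The dict layout and helper name are illustrative assumptions.
nodes = [
    {"id": "01", "name": "Elena", "type": "person"},
    {"id": "100", "name": "PyData", "type": "meetup"},
]
edges = [
    # An edge links a source node to a destination node and carries
    # its own attributes, including a weight.
    {"src": "01", "dst": "100", "type": "member of", "weight": 0.7},
]


def neighbours(node_id, edge_list):
    """Return the IDs reachable from node_id over one outgoing edge."""
    return [e["dst"] for e in edge_list if e["src"] == node_id]


print(neighbours("01", edges))
```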

Slide 3

Slide 3 text

DS Prototype Problem
Need to quickly see if prototype works (gives expected results? limitations? extra data needed?)
Ideally want to perform combined Graph and non-Graph analysis in the same pySpark session:
● Data Prep / Feature Extraction
● ML on Graph
● Graph Algorithms
● Graph pattern queries
Input: text data about various people with interest in some sports
Data Frame:
● Spark SQL
● Spark ML
● pure Python UDFs
● nltk inside UDF to extract specific ‘interests’
Graph Frame (nodes, edges):
● Nodes: People, interests
● Edges: People > People, People > interests
Graph Pattern Queries: (person)>[in]>(group5) & (person)>[likes]>(interest) & (person)>[might like]>(interest)
Graph ML: ALS. Recommend new interests to people
ML: K-means. Cluster people into groups with similar interests
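
The ‘pure Python UDFs’ step could look roughly like the sketch below. It is not the code from the talk: the `SPORTS` vocabulary, the function names and the `text` / `interests` column names are hypothetical, and a simple regular expression stands in for nltk so that the example has no extra dependencies.

```python
import re

# Hypothetical vocabulary of interests to look for (not from the talk).
SPORTS = {"cycling", "running", "tennis"}


def extract_interests(text, vocabulary=SPORTS):
    """Return the sorted vocabulary words found in text (case-insensitive)."""
    if text is None:
        return []
    # Regex tokenisation used here in place of nltk.
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(tokens & set(vocabulary))


def add_interests_column(df, text_col="text"):
    """Wrap the plain Python function in a Spark UDF and apply it to df."""
    # Imported lazily so the pure-Python part runs without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    extract_udf = F.udf(extract_interests, ArrayType(StringType()))
    return df.withColumn("interests", extract_udf(F.col(text_col)))
```

Keeping the extraction logic in an ordinary function means it can be checked on a few strings before it is ever run on a cluster.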

Slide 4

Slide 4 text

Building a graph
Create graph from ‘nodes’ and ‘edges’ Spark Data Frames.
The data frames were created in pySpark using ordinary SparkSQL, with ‘edge weight’ metrics pre-calculated using Spark ML.
(Note: Spark’s UDFs can be wrapped around most Python functions.)
g = GraphFrame(nodes, edges)
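
A minimal sketch of this step, assuming a running Spark session with the graphframes package available. GraphFrames expects an `id` column in the nodes data frame and `src` / `dst` columns in the edges data frame; the sample rows, the other column names and the `dangling_edges` check are illustrative assumptions, not data from the talk.

```python
# Hypothetical sample rows: (id, type) for nodes,
# (src, dst, relationship, weight) for edges.
NODE_ROWS = [("p1", "person"), ("p2", "person"), ("cycling", "interest")]
EDGE_ROWS = [
    ("p1", "p2", "similar to", 0.8),
    ("p1", "cycling", "likes", 0.9),
]


def dangling_edges(node_rows, edge_rows):
    """Return edges whose source or destination is not a known node ID."""
    ids = {row[0] for row in node_rows}
    return [e for e in edge_rows if e[0] not in ids or e[1] not in ids]


def build_graph(spark):
    """Create the two data frames and wrap them in a GraphFrame."""
    # Imported lazily: requires the graphframes package on the Spark session.
    from graphframes import GraphFrame

    nodes = spark.createDataFrame(NODE_ROWS, ["id", "type"])
    edges = spark.createDataFrame(
        EDGE_ROWS, ["src", "dst", "relationship", "weight"]
    )
    return GraphFrame(nodes, edges)
```

Checking for dangling edges before building the graph is a cheap way to catch join mistakes in the data prep stage.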

Slide 5

Slide 5 text

ML on Graph
Product recommendations: code snippet from Ankur Dave’s paper (shown as an image on the slide).

Slide 6

Slide 6 text

Graph Queries
Query Example: For each pair of similar people, list top keywords they have in common
People shown in the example: Kurt Thomas, Aileen Morrison, Zita Szabo, Aisha Gerber, Alberto Contador, Giuseppe Guerini
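
One way such a query might be expressed with GraphFrames motif finding is sketched below. This is not the query from the talk: it assumes the edges carry hypothetical `relationship` and `weight` columns, that similar people are linked by ‘similar to’ edges and keywords by ‘likes’ edges, and that summing the two edge weights is an acceptable ranking score.

```python
def build_motif(triples):
    """Build a GraphFrames motif string from (src, edge, dst) name triples."""
    return "; ".join(f"({s})-[{e}]->({d})" for s, e, d in triples)


# a is similar to b, and both a and b link to the same keyword kw.
COMMON_KEYWORD_MOTIF = build_motif(
    [("a", "s", "b"), ("a", "ka", "kw"), ("b", "kb", "kw")]
)


def common_keywords(g, top_n=3):
    """Return up to top_n shared keywords for each pair of similar people."""
    # Imported lazily so the motif helper runs without Spark installed.
    from pyspark.sql import Window
    from pyspark.sql import functions as F

    matches = (
        g.find(COMMON_KEYWORD_MOTIF)
        .filter("s.relationship = 'similar to'")
        .filter("ka.relationship = 'likes' AND kb.relationship = 'likes'")
        .select(
            F.col("a.id").alias("person_a"),
            F.col("b.id").alias("person_b"),
            F.col("kw.id").alias("keyword"),
            # Assumed score: sum of the two 'likes' edge weights.
            (F.col("ka.weight") + F.col("kb.weight")).alias("score"),
        )
    )
    # Rank keywords within each pair and keep the strongest ones.
    per_pair = Window.partitionBy("person_a", "person_b").orderBy(
        F.col("score").desc()
    )
    return (
        matches.withColumn("rank", F.row_number().over(per_pair))
        .filter(F.col("rank") <= top_n)
        .drop("rank")
    )
```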

Slide 7

Slide 7 text

Thank you