GraphFrames for Faster Data Science Prototypes in pySpark

GraphFrames
GraphFrames speeds up big data prototype development by allowing us to perform mixed graph and non-graph data analysis in pySpark.
Lightning talk by: Elena Nemtseva at the 32nd PyData London Meetup
https://graphframes.github.io/
People behind GraphFrames:
• Ankur Dave (UC Berkeley AMPLab)
• Alekh Jindal (Microsoft)
• Li Erran Li (Uber)
• Reynold Xin (Databricks)
• Joseph Gonzalez (UC Berkeley)
• Matei Zaharia (MIT and Databricks)

Graph Data Structures
SQL (some structure, requires joins): SQL Server, Data Frames
NoSQL (unstructured, semi-structured): MongoDB, Redis
Graph (highly connected, all required joins ‘built in’): Neo4j, NetworkX, GraphLab, GraphX, GraphFrames
Graph components:
• Nodes/Vertices (e.g. people or entities)
• Edges/Links: links between nodes (e.g. ‘friends with’, ‘member of’)
• Node and edge attributes: additional information about nodes and edges (e.g. age, url)
• Edge weights representing metrics (e.g. link strength)
Example:
Node ID: 01 Attributes: {name: Elena, type: person}
Node ID: 100 Attributes: {name: PyData, type: meetup}
Edge Source: 01 Destination: 100 Attributes: {Type: ‘member of’, Weight: 0.7}
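The node and edge records above can be written down as plain Python rows before any Spark code is involved. The sketch below is illustrative only: the column names "id", "src" and "dst" are the ones GraphFrames expects, while the helper dangling_edges is a hypothetical name introduced here to check that every edge points at a known node.

```python
# Rows mirroring the example on the slide. "id", "src" and "dst" are the
# column names GraphFrames expects; the other keys are free-form attributes.
nodes_rows = [
    {"id": "01", "name": "Elena", "type": "person"},
    {"id": "100", "name": "PyData", "type": "meetup"},
]
edges_rows = [
    {"src": "01", "dst": "100", "type": "member of", "weight": 0.7},
]


def dangling_edges(nodes, edges):
    """Return edges whose source or destination is not a known node id.

    Hypothetical helper for illustration; not part of GraphFrames.
    """
    known = {n["id"] for n in nodes}
    return [e for e in edges if e["src"] not in known or e["dst"] not in known]


print(dangling_edges(nodes_rows, edges_rows))  # []
```

Checking for dangling edges early is cheap on prototype-sized data and avoids confusing empty results later in graph queries.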

DS Prototype Problem
Need to quickly see if the prototype works (gives expected results? limitations? extra data needed?)
Ideally want to perform combined Graph and non-Graph analysis in the same pySpark session:
• Data Prep / Feature Extraction
• ML on Graph
• Graph Algorithms
• Graph pattern queries
Example workflow:
Input: text data about various people with an interest in some sports
Data Frame:
• Spark SQL
• Spark ML
• pure Python UDFs
• nltk inside a UDF to extract specific ‘interests’
Graph Frame (nodes, edges):
• Nodes: people, interests
• Edges: people > people, people > interests
Graph Pattern Queries: (person)>[in]>(group5) & (person)>[likes]>(interest) & (person)>[might like]>(interest)
Graph ML (ALS): recommend new interests to people
ML (K-means): cluster people into groups with similar interests
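The "pure Python UDF" step for extracting interests might look roughly like the sketch below. It is not the talk's code: the talk mentions nltk, which is swapped here for a plain keyword match so the example has no dependencies, and the SPORTS vocabulary and both function names are assumptions made for illustration.

```python
import re

# Hypothetical vocabulary. The talk used nltk for this step; a simple
# keyword match stands in for it here.
SPORTS = {"cycling", "running", "tennis", "swimming"}


def extract_interests(text):
    """Return the sorted sports keywords mentioned in a piece of free text."""
    if text is None:
        return []
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(tokens & SPORTS)


def as_spark_udf():
    """Wrap extract_interests as a Spark UDF (needs pyspark installed)."""
    # Imported lazily so the plain function can be tested without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    return udf(extract_interests, ArrayType(StringType()))


print(extract_interests("Loves cycling and open-water swimming"))
# ['cycling', 'swimming']
```

With a DataFrame df that has a text column (both names hypothetical), the wrapped function could be applied as df.withColumn("interests", as_spark_udf()(df["text"])).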

Building a graph
Create the graph from ‘nodes’ and ‘edges’ Spark Data Frames.
The data frames were created in pySpark using ordinary SparkSQL, with ‘edge weight’ metrics pre-calculated using Spark ML.
(Note: Spark’s UDFs can be wrapped around most Python functions.)
g = GraphFrame(nodes, edges)
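A minimal sketch of the construction step, assuming an existing SparkSession and that the graphframes package is available to it. The row values reuse the Elena/PyData example from the earlier slide; NODE_ROWS, EDGE_ROWS and build_graph are hypothetical names, and the talk's real data frames came from SparkSQL rather than hard-coded rows.

```python
# Illustrative rows only; the talk's data frames came from SparkSQL.
NODE_ROWS = [("01", "Elena", "person"), ("100", "PyData", "meetup")]
EDGE_ROWS = [("01", "100", "member of", 0.7)]


def build_graph(spark):
    """Build a GraphFrame from two small DataFrames on an existing SparkSession."""
    # Imported lazily so the row definitions above can be inspected
    # without the graphframes package installed.
    from graphframes import GraphFrame

    # GraphFrames requires an "id" column on vertices and
    # "src"/"dst" columns on edges; other columns are attributes.
    nodes = spark.createDataFrame(NODE_ROWS, ["id", "name", "type"])
    edges = spark.createDataFrame(EDGE_ROWS, ["src", "dst", "type", "weight"])
    return GraphFrame(nodes, edges)
```

The resulting object exposes the two inputs again as g.vertices and g.edges, so ordinary DataFrame operations remain available next to the graph ones.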

ML on Graph
Product recommendations code snippet from Ankur Dave’s paper (the snippet itself is not captured in this transcript).

Graph Queries
Query example: for each pair of similar people, list the top keywords they have in common.
Names appearing in the example output: Kurt Thomas, Aileen Morrison, Zita Szabo, Aisha Gerber, Alberto Contador, Giuseppe Guerini
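The query itself is not in the transcript, so the following is only a guess at its general shape using GraphFrames motif finding. It assumes people and keywords are both vertices with a name column and that a person-to-keyword edge means "has this keyword". It does not implement the "similar people" or "top keywords" parts of the slide's query. shared_keywords and SHARED_KEYWORD_MOTIF are hypothetical names.

```python
# Two vertices (a, b) that each have an edge to the same vertex (k).
# GraphFrames motifs use the "(a)-[e]->(b)" syntax, with ";" joining parts.
SHARED_KEYWORD_MOTIF = "(a)-[e1]->(k); (b)-[e2]->(k)"


def shared_keywords(g):
    """Return one row per (person_a, person_b, keyword) for a GraphFrame g.

    Assumes vertices carry a "name" column; edge types are not filtered.
    """
    return (
        g.find(SHARED_KEYWORD_MOTIF)
        # Keep each unordered pair once and drop self-pairs.
        .filter("a.id < b.id")
        .selectExpr(
            "a.name AS person_a",
            "b.name AS person_b",
            "k.name AS keyword",
        )
    )
```

Because find returns an ordinary DataFrame, ranking keywords or restricting to a particular edge type can be done afterwards with regular Spark SQL operations.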

Thank you
