
GraphFrames for Faster Data Science Prototypes in pySpark

GraphFrames speeds up big data prototype development by allowing us to perform mixed graph and non-graph data analysis in pySpark

Elena Nemtseva

March 07, 2017

Transcript

  1. GraphFrames

    GraphFrames speeds up big data prototype development by allowing us to perform mixed graph and non-graph data analysis in pySpark.

    Lightning talk by: Elena Nemtseva at the 32nd PyData London Meetup
    https://graphframes.github.io/

    People behind GraphFrames:
    • Ankur Dave (UC Berkeley AMPLab)
    • Alekh Jindal (Microsoft)
    • Li Erran Li (Uber)
    • Reynold Xin (Databricks)
    • Joseph Gonzalez (UC Berkeley)
    • Matei Zaharia (MIT and Databricks)
  2. Graph Data Structures

    SQL (some structure, requires joins): SQL Server, Data Frames
    NoSQL (unstructured, semi-structured): MongoDB, Redis
    Graph (highly connected, all required joins ‘built in’): Neo4j, networkX, GraphLab, GraphX, GraphFrames

    Graph components:
    • Nodes/Vertices (e.g. people or entities)
    • Edges/Links: links between nodes (e.g. ‘friends with’, ‘member of’)
    • Node and edge attributes: additional information about nodes and edges (e.g. age, url)
    • Edge weights representing metrics (e.g. link strength)

    Example:
    Node ID: 01, Attributes: {name: Elena, type: person}
    Node ID: 100, Attributes: {name: PyData, type: meetup}
    Edge Source: 01, Destination: 100, Attributes: {Type: ‘member of’, Weight: 0.7}
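The node and edge example above can be written out as plain Python data. This is only an illustrative sketch: the dictionary layout, the `src`/`dst` key names and the `neighbours` helper are assumptions made for this example, not part of any graph library.

```python
# Plain-Python sketch of the example graph from the slide.
# The layout and helper name are illustrative assumptions, not a library API.
nodes = {
    "01": {"name": "Elena", "type": "person"},
    "100": {"name": "PyData", "type": "meetup"},
}

edges = [
    {"src": "01", "dst": "100", "type": "member of", "weight": 0.7},
]


def neighbours(node_id, edge_type=None):
    """Return the names of nodes reached by one outgoing edge from node_id.

    If edge_type is given, only edges of that type are followed.
    """
    return [
        nodes[e["dst"]]["name"]
        for e in edges
        if e["src"] == node_id and (edge_type is None or e["type"] == edge_type)
    ]


print(neighbours("01", "member of"))
```

Because the relationship is stored on the edge itself, following it needs no join: the lookup is a single pass over the edges that start at the node.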
  3. DS Prototype Problem

    Need to quickly see if the prototype works (gives expected results? limitations? extra data needed?)

    Ideally want to perform combined Graph and non-Graph analysis in the same pySpark session:
    • Data Prep / Feature Extraction
    • ML on Graph
    • Graph Algorithms
    • Graph pattern queries

    Example workflow:
    • Input: text data about various people with an interest in some sports
    • Data Frame: Spark SQL, Spark ML, pure Python UDFs, nltk inside a UDF to extract specific ‘interests’
    • Graph Frame (nodes, edges): Nodes: people, interests. Edges: people > people, people > interests
    • Graph Pattern Queries: (person)>[in]>(group5) & (person)>[likes]>(interest) & (person)>[might like]>(interest)
    • Graph ML: ALS, to recommend new interests to people
    • ML: K-means, to cluster people into groups with similar interests
  4. Building a graph

    Create a graph from ‘nodes’ and ‘edges’ Spark Data Frames.

    The data frames were created in pySpark using ordinary SparkSQL, with ‘edge weight’ metrics pre-calculated using Spark ML. (Note: Spark’s UDFs can be wrapped around most Python functions.)

    g = GraphFrame(nodes, edges)
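A minimal sketch of how the two data frames might be put together before calling `GraphFrame(nodes, edges)`. GraphFrames expects an `id` column on the nodes data frame and `src` and `dst` columns on the edges data frame. The sample rows, the other column names and the `build_graph` helper are assumptions for illustration. In the talk the data frames came from SparkSQL and Spark ML, not from hard-coded rows.

```python
# Hypothetical sample rows standing in for the SparkSQL output.
# GraphFrames requires "id" on nodes and "src"/"dst" on edges;
# the remaining columns are free-form attributes.
NODE_COLUMNS = ["id", "name", "type"]
NODE_ROWS = [
    ("01", "Elena", "person"),
    ("100", "PyData", "meetup"),
]

EDGE_COLUMNS = ["src", "dst", "type", "weight"]
EDGE_ROWS = [
    ("01", "100", "member of", 0.7),
]


def build_graph(spark):
    """Build a GraphFrame from the sample rows, given an existing SparkSession.

    The import is done here so this module can be loaded on a machine
    without the graphframes package installed.
    """
    from graphframes import GraphFrame

    nodes = spark.createDataFrame(NODE_ROWS, NODE_COLUMNS)
    edges = spark.createDataFrame(EDGE_ROWS, EDGE_COLUMNS)
    return GraphFrame(nodes, edges)
```

With a SparkSession `spark` that has the graphframes package available, `g = build_graph(spark)` returns the graph, and `g.vertices` and `g.edges` give back ordinary data frames, so Spark SQL and Spark ML can still be applied to them in the same session.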
  5. Graph Queries

    Query example: for each pair of similar people, list the top keywords they have in common.

    People shown on the slide: Kurt Thomas, Aileen Morrison, Zita Szabo, Aisha Gerber, Alberto Contador, Giuseppe Guerini
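The transcript does not include the query itself, so the following is only a sketch of one way such a query could look. It assumes a graph with person > keyword edges and uses the GraphFrames motif syntax to find two people pointing at the same keyword. The sample edges, the function names and the ranking rule (the weaker of the two edge weights) are all assumptions made for illustration. The pure-Python function mirrors the same logic so it can be checked without Spark.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (person, keyword, weight) edges; not data from the talk.
EDGES = [
    ("anna", "cycling", 0.9),
    ("anna", "running", 0.4),
    ("ben", "cycling", 0.8),
    ("ben", "running", 0.7),
    ("ben", "tennis", 0.2),
    ("cara", "tennis", 0.6),
]

# GraphFrames motif: two vertices a and b that both have an edge to k.
MOTIF = "(a)-[e1]->(k); (b)-[e2]->(k)"


def common_keywords_graphframes(g):
    """Run the motif on a GraphFrame g; 'a.id < b.id' keeps each pair once."""
    return g.find(MOTIF).filter("a.id < b.id")


def common_keywords(edges, top_n=2):
    """Pure-Python reference: top shared keywords for each pair of people.

    Shared keywords are ranked by the weaker of the two edge weights
    (an assumed ranking rule), with ties broken alphabetically.
    """
    by_person = defaultdict(dict)
    for person, keyword, weight in edges:
        by_person[person][keyword] = weight

    result = {}
    for a, b in combinations(sorted(by_person), 2):
        shared = set(by_person[a]) & set(by_person[b])
        ranked = sorted(
            shared,
            key=lambda k: (-min(by_person[a][k], by_person[b][k]), k),
        )
        if ranked:
            result[(a, b)] = ranked[:top_n]
    return result


print(common_keywords(EDGES))
```

The motif result is an ordinary data frame with one column per named element (`a`, `e1`, `k`, `b`, `e2`), so the ranking step could equally be done with Spark SQL on that result.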