GraphFrames for Faster Data Science Prototypes in pySpark

GraphFrames speeds up big data prototype development by allowing us to perform mixed graph and non-graph data analysis in pySpark

Elena Nemtseva

March 07, 2017

Transcript

  1. GraphFrames
    GraphFrames speeds up big data prototype development by allowing us to
    perform mixed graph and non-graph data analysis in pySpark
    Lightning talk by Elena Nemtseva at the 32nd PyData London Meetup
    https://graphframes.github.io/
    People behind GraphFrames:
    Ankur Dave (UC Berkeley AMPLab)
    Alekh Jindal (Microsoft)
    Li Erran Li (Uber)
    Reynold Xin (Databricks)
    Joseph Gonzalez (UC Berkeley)
    Matei Zaharia (MIT and Databricks)

  2. Graph Data Structures
    SQL (some structure, requires joins): SQL Server, Data Frames
    NoSQL (unstructured, semi-structured): MongoDB, Redis
    Graph (highly connected, all required joins ‘built in’): Neo4j, NetworkX, GraphLab, GraphX, GraphFrames
    Graph components:
    ● Nodes/Vertices (e.g. people or entities)
    ● Edges/Links - links between nodes (e.g. ‘friends with’, ‘member of’)
    ● Node and edge attributes – additional information about nodes and edges (e.g. age, url)
    ● Edge weights representing metrics (e.g. link strength)
    Example:
    Node: ID: 01, Attributes: {name: Elena, type: person}
    Node: ID: 100, Attributes: {name: PyData, type: meetup}
    Edge: Source: 01, Destination: 100, Attributes: {Type: ‘member of’, Weight: 0.7}
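
    The node and edge records on this slide can be sketched as plain Python dictionaries. This is only an illustration of the data model, not GraphFrames code; the field names ("id", "src", "dst", "attrs") and the describe helper are assumptions made for the example.

```python
# Plain-Python stand-in for the node and edge records on this slide.
# Field names ("id", "src", "dst", "attrs") are illustrative assumptions.
nodes = [
    {"id": "01", "attrs": {"name": "Elena", "type": "person"}},
    {"id": "100", "attrs": {"name": "PyData", "type": "meetup"}},
]
edges = [
    {"src": "01", "dst": "100", "attrs": {"type": "member of", "weight": 0.7}},
]

# Index nodes by ID so an edge can be resolved to the records it connects.
nodes_by_id = {node["id"]: node for node in nodes}


def describe(edge):
    """Render one edge as '<source name> <edge type> <destination name>'."""
    src = nodes_by_id[edge["src"]]["attrs"]["name"]
    dst = nodes_by_id[edge["dst"]]["attrs"]["name"]
    return f"{src} {edge['attrs']['type']} {dst}"


descriptions = [describe(e) for e in edges]
```

    Keeping the relationship as its own record, with its own attributes, is what lets a graph store answer "who is connected to what" without an explicit join.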

  3. DS Prototype Problem
    Need to quickly see if prototype works (gives expected results? limitations? extra data needed?)
    Ideally want to perform combined Graph and non-Graph analysis in the same pySpark session:
    ● Data Prep / Feature Extraction
    ● ML on Graph
    ● Graph Algorithms
    ● Graph pattern queries
    Pipeline shown on the slide:
    ● Input: text data about various people with an interest in some sports
    ● Data Frame (Spark SQL, Spark ML, pure Python UDFs): nltk inside a UDF to extract specific ‘interests’
    ● Graph Frame (nodes, edges):
      Nodes: people, interests
      Edges: people > people, people > interests
    ● Graph pattern queries: (person)>[in]>(group5) & (person)>[likes]>(interest) & (person)>[might like]>(interest)
    ● Graph ML (ALS): recommend new interests to people
    ● ML (K-means): cluster people into groups with similar interests
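
    The feature-extraction step can be sketched as a plain Python function that is later wrapped as a Spark UDF. This is a minimal sketch under assumptions: the SPORTS keyword set, the function names and the regex tokeniser are hypothetical stand-ins for the nltk-based extraction mentioned on the slide, and as_spark_udf only works where pyspark is installed.

```python
import re

# Hypothetical keyword list; the talk used nltk inside the UDF instead.
SPORTS = {"cycling", "running", "swimming", "tennis"}


def extract_interests(text):
    """Return the sorted sports keywords mentioned in a piece of text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(words & SPORTS)


def as_spark_udf():
    """Wrap the plain function as a Spark UDF (needs pyspark installed)."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    return udf(extract_interests, ArrayType(StringType()))


# The plain function can be tried out without a Spark session.
sample = extract_interests("Loves cycling and a bit of tennis at weekends")
```

    With a data frame that has a text column (say a hypothetical "text" column), the wrapped function could be applied as df.withColumn("interests", as_spark_udf()("text")). Keeping the logic in an ordinary function makes it easy to check before running it on a cluster.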

  4. Building a graph
    Create graph from ‘nodes’ and ‘edges’ Spark Data Frames
    The data frames were created in pySpark using ordinary SparkSQL, with ‘edge weight’ metrics
    pre-calculated using Spark ML (Note: Spark’s UDFs can be wrapped around most Python functions)
    from graphframes import GraphFrame
    g = GraphFrame(nodes, edges)
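
    A slightly fuller sketch of the same step, reusing the example node and edge from the graph components slide. GraphFrames expects the nodes data frame to have an "id" column and the edges data frame to have "src" and "dst" columns; the other column names, the row values and the build_graph helper are assumptions for illustration, and the function needs an active SparkSession with the graphframes package available.

```python
# Rows mirror the example node and edge from the graph components slide.
node_rows = [("01", "Elena", "person"), ("100", "PyData", "meetup")]
node_cols = ["id", "name", "type"]  # GraphFrames expects an "id" column
edge_rows = [("01", "100", "member of", 0.7)]
edge_cols = ["src", "dst", "type", "weight"]  # and "src"/"dst" on edges

# Simple sanity data: every edge endpoint should refer to a known node ID.
node_ids = {row[0] for row in node_rows}
dangling = [row for row in edge_rows if row[0] not in node_ids or row[1] not in node_ids]


def build_graph(spark):
    """Build a GraphFrame from the rows above.

    Assumes `spark` is an active SparkSession started with the
    graphframes package available.
    """
    from graphframes import GraphFrame

    nodes = spark.createDataFrame(node_rows, node_cols)
    edges = spark.createDataFrame(edge_rows, edge_cols)
    return GraphFrame(nodes, edges)
```

    Checking for dangling edges before building the graph is optional, but it is a cheap way to catch ID mismatches between the two data frames during prototyping.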

  5. ML on Graph
    Product recommendations code snippet from Ankur Dave’s paper:

  6. Graph Queries
    Query Example: For each pair of similar people, list top keywords they have in common
    Names appearing on the slide: Kurt Thomas, Aileen Morrison, Zita Szabo, Aisha Gerber, Alberto Contador, Giuseppe Guerini
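
    The logic of this query can be sketched in plain Python. This is not the query from the talk: the people, keywords and counts below are hypothetical, and for simplicity the sketch compares every pair rather than only pairs already judged similar.

```python
from collections import Counter
from itertools import combinations

# Hypothetical keyword counts per person; not data from the talk.
keywords = {
    "person_a": Counter({"cycling": 5, "climbing": 3, "tour": 2}),
    "person_b": Counter({"cycling": 4, "tour": 4, "tennis": 1}),
    "person_c": Counter({"tennis": 6, "running": 2}),
}


def top_common_keywords(left, right, n=2):
    """Return up to n shared keywords, ranked by the smaller of the two counts."""
    shared = left & right  # Counter intersection keeps min(count) per key
    return [word for word, _ in shared.most_common(n)]


# Map each unordered pair of people to their top shared keywords.
common = {
    pair: top_common_keywords(keywords[pair[0]], keywords[pair[1]])
    for pair in combinations(sorted(keywords), 2)
}
```

    In GraphFrames itself, a pattern like this would more likely be expressed as a motif query through g.find, for example with a motif string such as "(a)-[e1]->(k); (b)-[e2]->(k)" that matches two people linked to the same keyword node; the vertex and edge names in that string are placeholders.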
