Graph Database Patterns in Python

Creating and using models for a graph database can differ considerably from doing so for row-, column- or document-oriented databases: the same query pattern can vary significantly in both structure and performance.

Elizabeth Ramirez, software engineer on the Search, Archive and Semantics Team, showed how to create Python models for Titan property graphs that let you manipulate graphs as if you were querying with the Gremlin DSL.

(Presented at PyCon 2015. Watch video from the talk.)

Transcript

  1. Graph Database Patterns
    Elizabeth Ramirez
    @eramirem

  2. What this talk is about
    ●  Property graph definition
    ●  Graphs at scale with Titan, Cassandra and Elasticsearch
    ●  Gremlin Query Language
    ●  Python Patterns for Titan Models

  3. What this talk is not about
    ●  Graph Theory
    ●  Titan Data Model
    ●  Existing libraries
    ●  Best practices

  4. The Property Graph Model
    G = (V, E, λ)
    V: set of vertices
    E: set of edges, each a pair of vertex identifiers
    λ: set of properties attached to vertices and edges
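
A minimal in-memory sketch of the definition above, for intuition only. The class and method names are made up for illustration and are not Titan's or Blueprints' API.

```python
class PropertyGraph:
    """Toy property graph: V and E are dicts, the property maps play the role of λ."""

    def __init__(self):
        self.vertices = {}  # vertex id -> property dict
        self.edges = {}     # edge id -> (out vertex id, in vertex id, label, property dict)
        self._next_id = 0

    def _new_id(self):
        self._next_id += 1
        return self._next_id

    def add_vertex(self, **props):
        vid = self._new_id()
        self.vertices[vid] = dict(props)
        return vid

    def add_edge(self, out_v, in_v, label, **props):
        if out_v not in self.vertices or in_v not in self.vertices:
            raise KeyError("both endpoints must already exist")
        eid = self._new_id()
        self.edges[eid] = (out_v, in_v, label, dict(props))
        return eid

    def out(self, vid, label=None):
        """Ids of vertices reachable over outgoing edges, optionally filtered by label."""
        return [i for (o, i, l, _) in self.edges.values()
                if o == vid and (label is None or l == label)]


g = PropertyGraph()
ny = g.add_vertex(namespace="concept", concept_name="New York")
us = g.add_vertex(namespace="concept", concept_name="United States")
g.add_edge(ny, us, "taxonomy", taxonomy_relation="BT")
print(g.out(ny, "taxonomy"))
```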

  5. Why a Graph Database?
    ●  Semantic Web: Structured Knowledge Representation
    ●  Index-free adjacency: Like memory pointers, but on disk
    ●  Navigation between nodes in constant time
    ●  Graph != No schema
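
A rough illustration of the index-free adjacency idea, assuming a toy in-memory model: each vertex keeps direct references to its neighbours, so a hop reads only that vertex's own edge list instead of consulting a global index. The `Vertex` class is hypothetical and says nothing about how Titan lays data out on disk.

```python
class Vertex:
    def __init__(self, name):
        self.name = name
        self.out_edges = {}  # edge label -> list of neighbouring Vertex objects

    def link(self, label, other):
        self.out_edges.setdefault(label, []).append(other)

    def out(self, label):
        # Only this vertex's adjacency list is touched, however large the graph is
        return self.out_edges.get(label, [])


article = Vertex("article")
nyc = Vertex("New York City")
usa = Vertex("United States")
article.link("location", nyc)
nyc.link("taxonomy", usa)

# Two hops: article -> location -> broader term
hops = [v.name for city in article.out("location") for v in city.out("taxonomy")]
print(hops)
```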

  6. TinkerPop Stack
    ●  Vendor agnostic
    ●  Blueprints: Collection of Java interfaces for representing Graphs
    ●  Pipes: Extension of Iterator, Iterable, chained together (Filter, Aggregation, SideEffect, etc.)
    ●  Groovy: Superset of Java, exposes full JDK to Gremlin
    Blueprints → Pipes → Gremlin

  7. Why Titan?
    ●  Multiple options for storage backend (Cassandra, HBase, BerkeleyDB)
    ●  Multiple options for index backend (Lucene, Elasticsearch)
    ●  Based on Blueprints API and TinkerPop Stack
    ●  Locking control to ensure consistency
    ●  Edge compression, Vertex-Centric Indices
    ●  Expressive querying using Gremlin:
    outV = g.v(4); inV = g.v(400);
    g.addEdge(null, outV, inV, 'BT');

  8. Architecture (I)

  9. Architecture (II)
    $ nodetool status titan
    expr: syntax error
    Datacenter: us-east
    ===================
    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving
    --  Address   Load       Tokens  Owns (effective)  Host ID                              Rack
    UN  10.x.x.x  133.09 MB  256     32.1%             9xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx  1e
    UN  10.x.x.x  145.92 MB  256     33.8%             fxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx  1d
    UN  10.x.x.x  135.59 MB  256     34.1%             bxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx  1d

  10. Gremlin + Rexster: Remote Query Execution
    Gremlin: Graph Query Language
    Uses Groovy as host language

    ./gremlin.sh
    gremlin> g = TitanFactory.open('conf/titan-cassandra.properties');
    ==>titangraph[cassandrathrift:127.0.0.1]
    gremlin> g.v(27512).outE('link')
    ==>e[1ffB3-79K-aG][27512-link->1497496]
    ==>e[1ffB5-79K-aG][27512-link->1497500]

    Rexster: Graph Server

    RexsterClient client = RexsterClientFactory.open("localhost", "titan");
    List<Map<String, Object>> results = client.execute("g.v(4).map");

  11. Graph of Semantic Knowledge

  12. Simple Traversals
    gremlin> g.v(20000).has('namespace', 'concept')
    ==>v[20000]
    gremlin> g.V('concept_name', 'California').has('concept_type', 'nytd_geo')
    ==>v[23716]
    gremlin> g.v(27512).out('location')
    ==>v[4]
    gremlin> g.v(27512).outE
    ==>e[1ak0F-79K-bI][27512-teragram->1796728]
    ==>e[1ak0z-79K-bI][27512-teragram->1796712]
    ==>e[1ak0x-79K-bI][27512-teragram->1804588]
    ==>e[1ak0D-79K-bI][27512-teragram->1796716]
    ==>e[1ak0B-79K-bI][27512-teragram->1796720]
    ==>e[1ak0H-79K-bI][27512-teragram->1796724]
    ==>e[1c96H-79K-bY][27512-mapping->1536760]
    ==>e[1c96J-79K-bY][27512-mapping->1655936]
    ==>e[1c96F-79K-bY][27512-mapping->1536756]
    ==>e[1e0RP-79K-cm][27512-location->4]

  13. More complex traversals (I)

  14. More complex traversals (II)
    gremlin> m = [];
    gremlin> g.v(12808).as('x').outE('taxonomy').has('taxonomy_relation', 'BT').inV().store(m).loop('x'){it.object.outE('taxonomy').has('taxonomy_relation', 'BT').inV().count() != 0}.iterate()
    ==>null
    gremlin> m
    ==>v[21812]
    ==>v[16492]
    ==>v[10176]
    ==>v[19584]
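
The traversal above keeps following 'BT' taxonomy edges and stores every vertex it reaches. A rough Python equivalent over a toy adjacency dict might look like the sketch below. The data and the `broader_terms` name are invented for illustration, and the `seen` set is an extra guard against cycles that the Gremlin query does not have.

```python
from collections import deque

# Hypothetical toy data: vertex id -> list of (edge label, edge properties, target vertex id)
edges = {
    1: [("taxonomy", {"taxonomy_relation": "BT"}, 2)],
    2: [("taxonomy", {"taxonomy_relation": "BT"}, 3),
        ("taxonomy", {"taxonomy_relation": "RT"}, 9)],
    3: [],
    9: [],
}


def broader_terms(start):
    """Collect every vertex reachable from `start` by repeatedly following BT edges."""
    collected = []
    seen = {start}
    frontier = deque([start])
    while frontier:
        current = frontier.popleft()
        for label, props, target in edges.get(current, []):
            is_bt = label == "taxonomy" and props.get("taxonomy_relation") == "BT"
            if is_bt and target not in seen:
                seen.add(target)
                collected.append(target)   # plays the role of store(m)
                frontier.append(target)    # plays the role of loop('x')
    return collected


print(broader_terms(1))
```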

  15. More complex traversals (III)
    gremlin> g.V.has('geocode_waypoint', WITHIN, Geoshape.circle(40.714, -74.0059, 1.0))
    ==>v[4]
    ==>v[320]
    ==>v[2756]
    ==>v[3252]
    ==>v[1348]
    ==>v[1084]
    ==>v[8140]
    gremlin> g.V.has('concept_name', CONTAINS, 'Barack').has('concept_name', CONTAINS, 'Obama').filter({it.concept_status=='Active'})
    ==>v[59360]
    ==>v[714092]
    ==>v[1105536]

  16. Pipes Traversal Pattern
    ●  transform
    ●  filter
    ●  sideEffect
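
The three pipe kinds can be imitated with plain Python generators. This is only an analogy for the pattern, not how the Pipes library is implemented, and the function names are chosen here for illustration.

```python
def transform(source, fn):
    """Emit fn(item) for every incoming item."""
    for item in source:
        yield fn(item)


def filter_pipe(source, predicate):
    """Let through only the items for which predicate(item) is true."""
    for item in source:
        if predicate(item):
            yield item


def side_effect(source, fn):
    """Call fn(item) for its effect, then pass the item along unchanged."""
    for item in source:
        fn(item)
        yield item


seen = []
numbers = range(1, 7)
pipeline = transform(
    side_effect(filter_pipe(numbers, lambda n: n % 2 == 0), seen.append),
    lambda n: n * 10,
)
result = list(pipeline)  # nothing runs until the pipeline is consumed
print(result, seen)
```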

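  17. Pipes and Filters in Python (I)
    @classmethod
    def v(cls, id):
        # Start a Gremlin script at the vertex with this id
        cls.pipe = "g.v({0})".format(id)
        # Returning the class lets further steps be chained
        return cls

    @classmethod
    def e(cls, id):
        # Edge ids are strings, so repr() adds the quotes
        cls.pipe = "g.e({0})".format(repr(id))
        return cls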
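
A sketch of how chaining could work once more steps are added in the same style. Only `v` comes from the slide; the `Traversal` class name and the `out` and `has` steps are assumptions made for illustration. Nothing is sent to a server here, the class only accumulates a script string.

```python
class Traversal:
    """Accumulates a Gremlin script as a string, one step per method call."""
    pipe = ""

    @classmethod
    def v(cls, id):
        cls.pipe = "g.v({0})".format(id)
        return cls

    @classmethod
    def out(cls, label):
        # Hypothetical step: append an out() hop over edges with this label
        cls.pipe += ".out({0!r})".format(label)
        return cls

    @classmethod
    def has(cls, key, value):
        # Hypothetical step: append a property filter
        cls.pipe += ".has({0!r}, {1!r})".format(key, value)
        return cls


script = Traversal.v(27512).out("location").has("namespace", "concept").pipe
print(script)
```

Because the script lives on the class rather than on an instance, two traversals built at the same time would overwrite each other, which is worth keeping in mind before reusing this shape in concurrent code.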

  18. Pipes and Filters in Python (II)
    @classmethod
    def addV(cls, **kwargs):
        # Build a Gremlin script that creates a vertex and sets its properties
        cls.pipe = "v = g.addVertex()\n"
        cls.pipe += "v.setProperty('namespace', '{0}')\n".format(cls.namespace)
        for p, v in kwargs.items():
            # Property keys are prefixed with the model's namespace
            cls.pipe += "v.setProperty('{0}', {1})\n".format("_".join([cls.namespace, p]), repr(v))
        cls.pipe += "return v"
        return cls
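
To see what script a method like `addV` ends up producing, here is the same string-building logic as a standalone function. The `build_add_vertex` name is made up, and sorting the properties is an addition so the output order is predictable.

```python
def build_add_vertex(namespace, **props):
    """Return the Gremlin script text for creating a vertex with namespaced properties."""
    lines = [
        "v = g.addVertex()",
        "v.setProperty('namespace', {0!r})".format(namespace),
    ]
    for name, value in sorted(props.items()):
        key = "_".join([namespace, name])  # e.g. concept + name -> concept_name
        lines.append("v.setProperty({0!r}, {1!r})".format(key, value))
    lines.append("return v")
    return "\n".join(lines)


script = build_add_vertex("concept", name="California", status="Active")
print(script)
```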

  19. Pipes and Filters in Python (III)
    @classmethod
    def addE(cls, outV, inV, **kwargs):
        # Look up both endpoints, then create an edge labelled with the model's namespace
        cls.pipe = "outV = g.v({0}); inV = g.v({1}); ".format(outV, inV)
        cls.pipe += "e = g.addEdge(null, outV, inV, '{0}'); ".format(cls.namespace)
        for p, v in kwargs.items():
            cls.pipe += 'e.setProperty("{0}", {1}); '.format("_".join([cls.namespace, p]), repr(v))
        cls.pipe += 'return e'
        return cls

  20. Factory Pattern (I)
    class GraphFactory(type):
        """ Metaclass for graph elements: vertices and edges """

        def __new__(cls, name, bases, dct):
            if 'namespace' not in dct:
                dct['namespace'] = camel_to_snake(name)
            return super(GraphFactory, cls).__new__(cls, name, bases, dct)
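
A runnable sketch of the metaclass above. The slides do not show `camel_to_snake`, so the regex version here is a guess at what such a helper does. The example uses Python 3 syntax, where the metaclass is passed as a keyword instead of through `__metaclass__`, and the `Concept` class is invented for the demonstration.

```python
import re


def camel_to_snake(name):
    """Hypothetical helper: 'ExtractionRule' -> 'extraction_rule'."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


class GraphFactory(type):
    """Metaclass that gives every model a default namespace derived from its class name."""

    def __new__(mcs, name, bases, dct):
        if "namespace" not in dct:
            dct["namespace"] = camel_to_snake(name)
        return super().__new__(mcs, name, bases, dct)


class ExtractionRule(metaclass=GraphFactory):
    pass


class Concept(metaclass=GraphFactory):
    namespace = "nytd_concept"  # an explicit namespace is left untouched


print(ExtractionRule.namespace, Concept.namespace)
```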

  21. Factory Pattern (II)
    class VertexElement(object):
        @classmethod
        def v(cls, id):
            cls.pipe = "g.v({0})".format(id)
            return cls

    class EdgeElement(object):
        @classmethod
        def e(cls, id):
            cls.pipe = "g.e({0})".format(id)
            return cls

  22. Factory Pattern (III)
    class GraphFactory(type):
        def __call__(cls):
            results = execute_query(cls.pipe)
            if isinstance(results, list):
                return map(deserialize, results)
            else:
                return deserialize(results)
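
Defining `__call__` on the metaclass means that "calling" a model class runs its accumulated script instead of creating an instance. The sketch below shows the mechanism with stand-ins: `execute_query` and `deserialize` are stubs invented here, not the real client code, and the syntax is Python 3.

```python
def execute_query(script):
    """Stub for a real client call: echoes the script back inside a result list."""
    return [{"script": script}]


def deserialize(result):
    """Stub: a real version would turn a raw result into a model object."""
    return result["script"]


class GraphFactory(type):
    def __call__(cls):
        # Invoked by SomeModel(...), so the trailing () of a chain triggers execution
        results = execute_query(cls.pipe)
        if isinstance(results, list):
            return [deserialize(r) for r in results]
        return deserialize(results)


class VertexElement(metaclass=GraphFactory):
    pipe = ""

    @classmethod
    def v(cls, id):
        cls.pipe = "g.v({0})".format(id)
        return cls


rows = VertexElement.v(4)()
print(rows)
```

One consequence of this design is that classes using the metaclass can no longer be instantiated in the usual way, since the call is repurposed for query execution.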

  23. Models
    class ExtractionRule(VertexElement):
        __metaclass__ = GraphFactory

        # EQUAL, CONTAINS, utils and concept_type are defined elsewhere (not shown on the slides)
        @classmethod
        def get_by_id(cls, id):
            return cls.v(id).has('namespace', EQUAL, cls.namespace)()

        @classmethod
        def get_for_variant(cls, variant, **filters):
            results = cls.V().has('extraction_rule_trigger_term', CONTAINS, *utils.tokenize(variant))
            results.inV('teragram').filter(type=concept_type).dedup().limit()()
            return results

  24. Models (II)
    @classmethod
    def search(cls, id=None, **filters):
        # Filters listed earlier in `order` are preferred for the initial index lookup
        order = ['trigger_term', 'condition', 'descriptor', 'condition_type']
        if id:
            results = cls.get_by_id(id)
            return results
        else:
            # Sort the filters so that popitem() yields the highest-priority one
            parsed = OrderedDict(sorted(filters.items(), key=lambda t: order.index(t[0]), reverse=True))
            index, value = parsed.popitem()
            results = cls.V('_'.join([cls.namespace, index]), value)
        # The remaining filters are applied as ordinary filter steps
        results.filter(**parsed).limit()
        return results()
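
The sort-then-`popitem()` step in `search` is easy to misread, so here it is in isolation, assuming the filters arrive as a plain dict. The `split_filters` name and the sample values are invented for illustration.

```python
from collections import OrderedDict

order = ["trigger_term", "condition", "descriptor", "condition_type"]


def split_filters(filters):
    """Pick the filter that comes first in `order` for the index lookup; return the rest."""
    # Sorting in reverse puts the highest-priority key last, where popitem() takes it
    parsed = OrderedDict(
        sorted(filters.items(), key=lambda t: order.index(t[0]), reverse=True)
    )
    index, value = parsed.popitem()
    return index, value, parsed


index, value, rest = split_filters({"descriptor": "Politics", "trigger_term": "senate"})
print(index, value, dict(rest))
```

A key that is not listed in `order` makes `order.index` raise `ValueError`, so callers would need to validate filter names first.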

  25. Conclusions
    - Factories are the most universal design patterns.
    - Don't delegate the creation of types to your code.
    - For bulk imports, use a JVM language
    - Patterns that don’t do well: SELECT *
