Graph Database Patterns in Python

Creating and using models for a graph database can be quite different from doing so for row-, column- or document-oriented databases: the same query patterns can differ significantly in structure and performance.

Elizabeth Ramirez, software engineer on the Search, Archive and Semantics Team, showed how to create models in Python for Titan property graphs that let you manipulate graphs as if you were querying with the Gremlin DSL.

(Presented at PyCon 2015. Watch video from the talk.)

Transcript

  1. What this talk is about
     •  Property graph definition
     •  Graphs at scale with Titan, Cassandra and Elasticsearch
     •  Gremlin Query Language
     •  Python Patterns for Titan Models
  2. What this talk is not about
     •  Graph Theory
     •  Titan Data Model
     •  Existing libraries
     •  Best practices
  3. The Property Graph Model
     G = (V, E, λ)
     V: set of vertices
     E: set of edges, each connecting a pair of vertex identifiers
     λ: set of properties
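To make the definition on slide 3 concrete, here is a minimal in-memory sketch in Python. The names `Vertex`, `Edge` and `PropertyGraph` are invented for this illustration and are not part of Titan or Blueprints; the data is a toy example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Vertex:
    id: int
    # The part of λ attached to this vertex.
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    out_v: int  # id of the vertex the edge leaves
    in_v: int   # id of the vertex the edge enters
    label: str
    # The part of λ attached to this edge.
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class PropertyGraph:
    vertices: Dict[int, Vertex] = field(default_factory=dict)  # V
    edges: List[Edge] = field(default_factory=list)            # E

    def add_vertex(self, id, **props):
        self.vertices[id] = Vertex(id, props)
        return self.vertices[id]

    def add_edge(self, out_v, in_v, label, **props):
        # An edge may only connect vertices that already exist in V.
        if out_v not in self.vertices or in_v not in self.vertices:
            raise KeyError("both endpoints must be in V")
        edge = Edge(out_v, in_v, label, props)
        self.edges.append(edge)
        return edge


# Toy data: two vertices joined by one labelled edge with a property.
g = PropertyGraph()
g.add_vertex(1, namespace="concept")
g.add_vertex(2, namespace="concept")
g.add_edge(1, 2, "taxonomy", taxonomy_relation="BT")
```

Both vertices and edges carry their own key/value properties, which is what separates a property graph from a plain adjacency list.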
  4. Why a Graph Database?
     •  Semantic Web: Structured Knowledge Representation
     •  Index-free adjacency: Like memory pointers, but on disk
     •  Navigation between nodes in constant time
     •  Graph != No schema
  5. TinkerPop Stack
     Blueprints → Pipes → Gremlin
     •  Vendor agnostic
     •  Blueprints: Collection of Java interfaces for representing Graphs
     •  Pipes: Extension of Iterator, Iterable, chained together (Filter, Aggregation, SideEffect, etc.)
     •  Groovy: Superset of Java, exposes full JDK to Gremlin
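Pipes itself is a Java library, but the idea of chaining iterators can be sketched with Python generators. This is only an analogy: the function names below are made up for the sketch and do not correspond to the Pipes API.

```python
def filter_pipe(items, predicate):
    """Pass through only the items that satisfy the predicate."""
    for item in items:
        if predicate(item):
            yield item


def side_effect_pipe(items, action):
    """Run an action on each item, then pass the item through unchanged."""
    for item in items:
        action(item)
        yield item


def transform_pipe(items, func):
    """Emit func(item) for each incoming item."""
    for item in items:
        yield func(item)


seen = []
pipeline = transform_pipe(
    side_effect_pipe(
        filter_pipe(range(10), lambda n: n % 2 == 0),  # keep even numbers
        seen.append,                                   # record what passed the filter
    ),
    lambda n: n * n,                                   # square each survivor
)

# Nothing runs until the pipeline is consumed, one item at a time.
result = list(pipeline)
```

After this runs, `result` is `[0, 4, 16, 36, 64]` and `seen` is `[0, 2, 4, 6, 8]`. Each stage pulls from the one before it, so no intermediate list is built.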
  6. Why Titan?
     •  Multiple options for storage backend (Cassandra, HBase, BerkeleyDB)
     •  Multiple options for index backend (Lucene, Elasticsearch)
     •  Based on Blueprints API and TinkerPop Stack
     •  Locking control to ensure consistency
     •  Edge compression, Vertex-Centric Indices
     •  Expressive querying using Gremlin:
         outV = g.v(4); inV = g.v(400);
         g.addEdge(null, outV, inV, 'BT');
  7. Architecture (II)
         $ nodetool status titan
         expr: syntax error
         Datacenter: us-east
         ===================
         Status=Up/Down
         |/ State=Normal/Leaving/Joining/Moving
         --  Address   Load       Tokens  Owns (effective)  Host ID                              Rack
         UN  10.x.x.x  133.09 MB  256     32.1%             9xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx  1e
         UN  10.x.x.x  145.92 MB  256     33.8%             fxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx  1d
         UN  10.x.x.x  135.59 MB  256     34.1%             bxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx  1d
  8. Gremlin + Rexster: Remote Query Execution
     Gremlin: Graph Query Language. Uses Groovy as host language
     Rexster: Graph Server
         RexsterClient client = RexsterClientFactory.open("localhost", "titan");
         List<Map<String, Object>> results = client.execute("g.v(4).map");

         ./gremlin.sh
         gremlin> g = TitanFactory.open('conf/titan-cassandra.properties');
         ==>titangraph[cassandrathrift:127.0.0.1]
         gremlin> g.v(27512).outE('link')
         ==>e[1ffB3-79K-aG][27512-link->1497496]
         ==>e[1ffB5-79K-aG][27512-link->1497500]
  9. Simple Traversals
         gremlin> g.v(20000).has('namespace', 'concept')
         ==>v[20000]
         gremlin> g.V('concept_name', 'California').has('concept_type', 'nytd_geo')
         ==>v[23716]
         gremlin> g.v(27512).out('location')
         ==>v[4]
         gremlin> g.v(27512).outE
         ==>e[1ak0F-79K-bI][27512-teragram->1796728]
         ==>e[1ak0z-79K-bI][27512-teragram->1796712]
         ==>e[1ak0x-79K-bI][27512-teragram->1804588]
         ==>e[1ak0D-79K-bI][27512-teragram->1796716]
         ==>e[1ak0B-79K-bI][27512-teragram->1796720]
         ==>e[1ak0H-79K-bI][27512-teragram->1796724]
         ==>e[1c96H-79K-bY][27512-mapping->1536760]
         ==>e[1c96J-79K-bY][27512-mapping->1655936]
         ==>e[1c96F-79K-bY][27512-mapping->1536756]
         ==>e[1e0RP-79K-cm][27512-location->4]
         gremlin>
  10. More complex traversals (II)
          gremlin> m=[];
          gremlin> g.v(12808).as('x').outE('taxonomy').has('taxonomy_relation', 'BT').inV().store(m).loop('x')
                       {it.object.outE('taxonomy').has('taxonomy_relation', 'BT').inV().count() != 0}.iterate()
          ==>null
          gremlin> m
          ==>v[21812]
          ==>v[16492]
          ==>v[10176]
          ==>v[19584]
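Read as plain code, the traversal on slide 10 keeps following 'BT' taxonomy edges and stores every vertex it reaches until no further 'BT' edge exists. The sketch below mimics that over a toy dictionary; `broader_terms` and `bt_edges` are hypothetical names, the ids are invented, and the cycle guard is an addition that the Gremlin query does not have.

```python
def broader_terms(bt_edges, start):
    """Follow 'BT' edges upward from start, collecting every vertex reached.

    bt_edges maps a vertex id to the ids reachable over one 'BT' edge.
    """
    collected = []
    seen = {start}
    frontier = [start]
    while frontier:
        next_frontier = []
        for vertex in frontier:
            for parent in bt_edges.get(vertex, []):
                if parent in seen:
                    continue  # guard against cycles and repeated visits
                seen.add(parent)
                collected.append(parent)
                next_frontier.append(parent)
        frontier = next_frontier
    return collected


# Toy taxonomy: 1 -> 2 -> 3 -> 4, where each arrow is a 'BT' edge.
chain = {1: [2], 2: [3], 3: [4]}
ancestors = broader_terms(chain, 1)

# A vertex with two broader terms that share a common ancestor.
diamond = {1: [2, 3], 2: [4], 3: [4]}
diamond_ancestors = broader_terms(diamond, 1)
```

The loop stops when a frontier has no outgoing 'BT' edges, which plays the role of the `count() != 0` condition in the Gremlin closure.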
  11. More complex traversals (III)
          gremlin> g.V.has('geocode_waypoint', WITHIN, Geoshape.circle(40.714, -74.0059, 1.0))
          ==>v[4]
          ==>v[320]
          ==>v[2756]
          ==>v[3252]
          ==>v[1348]
          ==>v[1084]
          ==>v[8140]
          gremlin> g.V.has('concept_name', CONTAINS, 'Barack').has('concept_name', CONTAINS, 'Obama').filter({it.concept_status=='Active'})
          ==>v[59360]
          ==>v[714092]
          ==>v[1105536]
          gremlin>
  12. Pipes and Filters in Python (I)
          @classmethod
          def v(cls, id):
              cls.pipe = "g.v({0})".format(id)
              return cls

          @classmethod
          def e(cls, id):
              cls.pipe = "g.e({0})".format(repr(id))
              return cls
  13. Pipes and Filters in Python (II)
          @classmethod
          def addV(cls, **kwargs):
              cls.pipe = "v = g.addVertex()\n"
              cls.pipe += "v.setProperty('namespace', '{0}')\n".format(cls.namespace)
              for p, v in kwargs.items():
                  cls.pipe += "v.setProperty('{0}', {1})\n".format("_".join([cls.namespace, p]), repr(v))
              cls.pipe += "return v"
              return cls
  14. Pipes and Filters in Python (III)
          @classmethod
          def addE(cls, outV, inV, **kwargs):
              cls.pipe = "outV = g.v({0}); inV = g.v({1}); ".format(outV, inV)
              cls.pipe += "e = g.addEdge(null, outV, inV, '{0}'); ".format(cls.namespace)
              for p, v in kwargs.iteritems():
                  cls.pipe += 'e.setProperty("{0}", {1}); '.format("_".join([cls.namespace, p]), repr(v))
              cls.pipe += 'return e'
              return cls
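To see what these builders actually produce, here is a small self-contained variant of the `addV` pattern from slide 13 that can be run without a graph server. The `Concept` class and its property values are placeholders chosen for the sketch; the keyword arguments are sorted only to make the generated script deterministic.

```python
class Concept(object):
    """Toy element class; `namespace` prefixes every property name."""

    namespace = "concept"
    pipe = ""

    @classmethod
    def addV(cls, **kwargs):
        # Build a Groovy script line by line instead of talking to the server.
        lines = [
            "v = g.addVertex()",
            "v.setProperty('namespace', '{0}')".format(cls.namespace),
        ]
        for name, value in sorted(kwargs.items()):
            key = "_".join([cls.namespace, name])
            lines.append("v.setProperty('{0}', {1})".format(key, repr(value)))
        lines.append("return v")
        cls.pipe = "\n".join(lines)
        return cls


script = Concept.addV(name="California", status="Active").pipe
print(script)
```

This prints:

    v = g.addVertex()
    v.setProperty('namespace', 'concept')
    v.setProperty('concept_name', 'California')
    v.setProperty('concept_status', 'Active')
    return v

One caveat with the approach: `repr()` happens to yield valid Groovy literals for plain strings and numbers, but not for values such as `None` or `True`, which Groovy spells `null` and `true`.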
  15. Factory Pattern (I)
          class GraphFactory(type):
              """ Metaclass for graph elements: vertices and edges """

              def __new__(cls, name, bases, dct):
                  if 'namespace' not in dct:
                      dct['namespace'] = camel_to_snake(name)
                  return super(GraphFactory, cls).__new__(cls, name, bases, dct)
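The slide does not show `camel_to_snake`, so the version below is an assumed implementation using a common two-pass regular expression. The metaclass is rewritten in Python 3 syntax (`metaclass=` instead of `__metaclass__`), and the `Concept` class with its explicit namespace is a made-up example.

```python
import re


def camel_to_snake(name):
    """Convert a CamelCase class name such as 'ExtractionRule' to 'extraction_rule'."""
    # First pass: split before a capital that starts a lowercase run.
    partial = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    # Second pass: split between a lowercase letter or digit and a capital.
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", partial).lower()


class GraphFactory(type):
    """Metaclass that gives every element class a default namespace."""

    def __new__(mcs, name, bases, dct):
        if "namespace" not in dct:
            dct["namespace"] = camel_to_snake(name)
        return super().__new__(mcs, name, bases, dct)


class ExtractionRule(metaclass=GraphFactory):
    pass  # namespace is derived from the class name


class Concept(metaclass=GraphFactory):
    namespace = "nyt_concept"  # an explicit namespace is left untouched
```

Here `ExtractionRule.namespace` is `'extraction_rule'`, which matches the `extraction_rule_trigger_term` property name used on the Models slide, while `Concept.namespace` stays `'nyt_concept'`.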
  16. Factory Pattern (II)
          class VertexElement(object):
              @classmethod
              def v(cls, id):
                  cls.pipe = "g.v({0})".format(id)
                  return cls

          class EdgeElement(object):
              @classmethod
              def e(cls, id):
                  cls.pipe = "g.e({0})".format(id)
                  return cls
  17. Factory Pattern (III)
          class GraphFactory(type):
              def __call__(cls):
                  results = execute_query(cls.pipe)
                  if isinstance(results, list):
                      return map(deserialize, results)
                  else:
                      return deserialize(results)
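Defining `__call__` on the metaclass means that "calling" a model class runs the accumulated query instead of creating an instance. The sketch below shows that mechanism end to end in Python 3. `execute_query` and `deserialize` are stubs standing in for the real helpers, which the slides do not show, and the `out` step and the canned result are assumptions made for the example.

```python
sent = []  # scripts "sent" to the server, recorded for inspection


def execute_query(script):
    """Stub for a Rexster round trip: records the script, returns canned rows."""
    sent.append(script)
    return [{"_id": 4}]


def deserialize(row):
    """Stub: reduce a result row to its vertex id."""
    return row["_id"]


class GraphFactory(type):
    def __call__(cls):
        # Calling the class executes the query built so far.
        results = execute_query(cls.pipe)
        if isinstance(results, list):
            # A list comprehension, since map() is lazy in Python 3.
            return [deserialize(r) for r in results]
        return deserialize(results)


class Vertex(metaclass=GraphFactory):
    pipe = ""

    @classmethod
    def v(cls, id):
        cls.pipe = "g.v({0})".format(id)
        return cls

    @classmethod
    def out(cls, label):
        cls.pipe += ".out({0!r})".format(label)
        return cls


# Each step returns the class, so the trailing () triggers execution.
ids = Vertex.v(27512).out("location")()
```

After this runs, `sent` holds the single script `g.v(27512).out('location')` and `ids` is `[4]`. Because `pipe` lives on the class, the query state is shared by every caller, which is worth keeping in mind under concurrency.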
  18. Models
          class ExtractionRule(VertexElement):
              __metaclass__ = GraphFactory

              @classmethod
              def get_by_id(self, id):
                  return self.v(id).has('namespace', EQUAL, self.namespace)()

              @classmethod
              def get_for_variant(self, variant, **filters):
                  results = self.V().has('extraction_rule_trigger_term', CONTAINS, *utils.tokenize(variant))
                  results.inV('teragram').filter(type=concept_type).dedup().limit()()
                  return results
  19. Models (II)
          @classmethod
          def search(self, id=None, **filters):
              order = ['trigger_term', 'condition', 'descriptor', 'condition_type']
              if id:
                  results = self.get_by_id(id)
                  return results
              else:
                  parsed = OrderedDict(sorted(parsed.items(), key=lambda t: order.index(t[0]), reverse=True))
                  index, value = parsed.popitem()
                  results = self.V('_'.join([self.namespace, index]), value)
              results.filter(**parsed).limit()
              return results()
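The sorting step in `search` picks one property for the initial index lookup and leaves the rest as in-traversal filters. A standalone sketch of just that step follows. It assumes `parsed` on the slide is derived from the `filters` keyword arguments, which the slide does not show; `split_filters` is a hypothetical helper and the filter values are placeholders.

```python
from collections import OrderedDict

# Lookup preference, earliest entry first (list taken from the slide).
ORDER = ["trigger_term", "condition", "descriptor", "condition_type"]


def split_filters(filters, order=ORDER):
    """Return (index_key, index_value, remaining_filters)."""
    # Sort by descending position in `order`, so the most preferred key ends up last.
    ranked = OrderedDict(
        sorted(filters.items(), key=lambda kv: order.index(kv[0]), reverse=True)
    )
    # popitem() removes the last entry: the key that appears earliest in `order`.
    index, value = ranked.popitem()
    return index, value, ranked


index, value, rest = split_filters(
    {"descriptor": "d", "trigger_term": "t", "condition_type": "c"}
)
```

With these inputs, `index` is `'trigger_term'` and `rest` keeps `condition_type` and `descriptor`, in that order. A key that is not listed in `order` makes `order.index` raise `ValueError`, so unknown filter names fail early.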
  20. Conclusions
      - Factories are the most universal design patterns.
      - Don't delegate the creation of types to your code.
      - For bulk imports, use a JVM language
      - Patterns that don't do well: SELECT *