Elizabeth Ramirez - Graph Database Patterns in Python

Creating and using models for a graph database can be quite different from doing so for row-, column-, or document-oriented databases: the same query patterns can differ significantly in structure and performance. This session presents how to create models in Python for Titan property graphs that let you manipulate graphs as if you were querying with the Gremlin DSL.



PyCon 2015

April 18, 2015


  1. Graph Database Patterns Elizabeth Ramirez @eramirem

  2. What is this talk about
    • Property graph definition
    • Graphs at scale with Titan, Cassandra and Elasticsearch
    • Gremlin Query Language
    • Python Patterns for Titan Models
  3. What is not this talk about
    • Graph Theory
    • Titan Data Model
    • Existing libraries
    • Best practices
  4. The Property Graph Model
    G = (V, E, λ)
    V: set of vertices
    E: set of edges, each identified by the vertices it connects
    λ: set of properties
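As a rough illustration of the definition above, a property graph can be sketched in plain Python as a set of vertex identifiers, a set of labeled edges between those identifiers, and a mapping λ from each element to its properties. The class, its method names, and the property values are hypothetical; this is not Titan's or TinkerPop's API.

```python
from collections import defaultdict


class PropertyGraph:
    """Toy in-memory property graph G = (V, E, lambda); illustration only."""

    def __init__(self):
        self.vertices = set()                # V: vertex identifiers
        self.edges = set()                   # E: (out_id, label, in_id) triples
        self.properties = defaultdict(dict)  # lambda: element -> {key: value}

    def add_vertex(self, vid, **props):
        self.vertices.add(vid)
        self.properties[vid].update(props)
        return vid

    def add_edge(self, out_v, label, in_v, **props):
        # An edge refers to its endpoints by vertex identifier.
        if out_v not in self.vertices or in_v not in self.vertices:
            raise KeyError("both endpoints must already be vertices")
        edge = (out_v, label, in_v)
        self.edges.add(edge)
        self.properties[edge].update(props)
        return edge

    def out(self, vid, label=None):
        """Sorted identifiers of vertices reached over outgoing edges of vid."""
        return sorted(i for (o, l, i) in self.edges
                      if o == vid and (label is None or l == label))


# Made-up sample data, loosely shaped like the traversals shown later.
g = PropertyGraph()
g.add_vertex(4, namespace="concept")
g.add_vertex(27512, namespace="concept")
g.add_edge(27512, "location", 4, weight=1)
print(g.out(27512, "location"))
```

Storing properties in one mapping keyed by element keeps vertices and edges symmetric: both are just elements that λ can describe.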
  5. Why a Graph Database?
    • Semantic Web: Structured Knowledge Representation
    • Index-free adjacency: Like memory pointers, but on disk
    • Navigation between nodes in constant time
    • Graph != No schema
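To make the index-free adjacency point concrete, here is a minimal in-memory sketch, assuming a toy `Vertex` class invented for this example. Each vertex keeps direct references to its neighbours, so following an edge does not require consulting a global index or scanning an edge table.

```python
class Vertex:
    """Toy vertex that stores direct references to its neighbours."""

    def __init__(self, vid):
        self.id = vid
        self.out = []  # references to adjacent Vertex objects, no index involved

    def link(self, other):
        self.out.append(other)


def neighbours_via_scan(edge_list, vid):
    # Without adjacency pointers: inspect every edge in the graph.
    return [dst for src, dst in edge_list if src == vid]


a, b, c = Vertex("a"), Vertex("b"), Vertex("c")
a.link(b)
a.link(c)
edge_list = [("a", "b"), ("a", "c"), ("b", "c")]

# Both approaches agree, but the first only touches the edges of "a".
print([v.id for v in a.out])
print(neighbours_via_scan(edge_list, "a"))
```

The cost of a hop in the first form depends on the degree of the vertex, not on the size of the whole graph, which is the property the slide refers to.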
  6. TinkerPop Stack
    • Vendor agnostic
    • Blueprints: Collection of Java interfaces for representing Graphs
    • Pipes: Extension of Iterator, Iterable, chained together (Filter, Aggregation, SideEffect, etc.)
    • Groovy: Superset of Java, exposes full JDK to Gremlin
    Blueprints → Pipes → Gremlin
  7. Why Titan?
    • Multiple options for storage backend (Cassandra, HBase, BerkeleyDB)
    • Multiple options for index backend (Lucene, Elasticsearch)
    • Based on Blueprints API and TinkerPop Stack
    • Locking control to ensure consistency
    • Edge compression, Vertex-Centric Indices
    • Expressive querying using Gremlin:

        outV = g.v(4); inV = g.v(400);
        g.addEdge(null, outV, inV, 'BT');
  8. Architecture (I)

  9. Architecture (II)

      $ nodetool status titan
      expr: syntax error
      Datacenter: us-east
      ===================
      Status=Up/Down
      |/ State=Normal/Leaving/Joining/Moving
      --  Address   Load       Tokens  Owns (effective)  Host ID                              Rack
      UN  10.x.x.x  133.09 MB  256     32.1%             9xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx  1e
      UN  10.x.x.x  145.92 MB  256     33.8%             fxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx  1d
      UN  10.x.x.x  135.59 MB  256     34.1%             bxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx  1d
  10. Gremlin + Rexster: Remote Query Execution
    Gremlin: Graph Query Language. Uses Groovy as host language.
    Rexster: Graph Server

        RexsterClient client = RexsterClientFactory.open("localhost", "titan");
        List<Map<String, Object>> results = client.execute("g.v(4).map");

        ./gremlin.sh
        gremlin> g = TitanFactory.open('conf/titan-cassandra.properties');
        ==>titangraph[cassandrathrift:]
        gremlin> g.v(27512).outE('link')
        ==>e[1ffB3-79K-aG][27512-link->1497496]
        ==>e[1ffB5-79K-aG][27512-link->1497500]
  11. Graph of Semantic Knowledge

  12. Simple Traversals

        gremlin> g.v(20000).has('namespace', 'concept')
        ==>v[20000]
        gremlin> g.V('concept_name', 'California').has('concept_type', 'nytd_geo')
        ==>v[23716]
        gremlin> g.v(27512).out('location')
        ==>v[4]
        gremlin> g.v(27512).outE
        ==>e[1ak0F-79K-bI][27512-teragram->1796728]
        ==>e[1ak0z-79K-bI][27512-teragram->1796712]
        ==>e[1ak0x-79K-bI][27512-teragram->1804588]
        ==>e[1ak0D-79K-bI][27512-teragram->1796716]
        ==>e[1ak0B-79K-bI][27512-teragram->1796720]
        ==>e[1ak0H-79K-bI][27512-teragram->1796724]
        ==>e[1c96H-79K-bY][27512-mapping->1536760]
        ==>e[1c96J-79K-bY][27512-mapping->1655936]
        ==>e[1c96F-79K-bY][27512-mapping->1536756]
        ==>e[1e0RP-79K-cm][27512-location->4]
        gremlin>
  13. More complex traversals (I)

  14. More complex traversals (II)

        gremlin> m=[];
        gremlin> g.v(12808).as('x').outE('taxonomy').has('taxonomy_relation', 'BT').inV().store(m).loop('x'){it.object.outE('taxonomy').has('taxonomy_relation', 'BT').inV().count() != 0}.iterate()
        ==>null
        gremlin> m
        ==>v[21812]
        ==>v[16492]
        ==>v[10176]
        ==>v[19584]
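The loop traversal above keeps following 'BT' taxonomy edges from a start vertex and stores every vertex it reaches. A minimal in-memory sketch of the same idea follows, assuming that 'BT' marks a broader-term relation and that the four stored vertices form a simple chain; neither assumption is stated on the slide, and the `taxonomy` dictionary is invented for illustration. Unlike the Gremlin version, this sketch also skips vertices it has already seen.

```python
from collections import deque


def broader_terms(taxonomy, start):
    """Collect every vertex reached by repeatedly following 'BT' edges."""
    collected = []
    seen = set()
    frontier = deque([start])
    while frontier:
        current = frontier.popleft()
        # taxonomy maps a vertex id to the ids at the end of its 'BT' edges.
        for parent in taxonomy.get(current, []):
            if parent not in seen:
                seen.add(parent)
                collected.append(parent)
                frontier.append(parent)
    return collected


# Hypothetical edge layout: a single chain starting at vertex 12808.
taxonomy = {12808: [21812], 21812: [16492], 16492: [10176], 10176: [19584]}
m = broader_terms(taxonomy, 12808)
print(m)
```

The traversal stops on its own once a vertex has no further 'BT' edges, which is the role of the `count() != 0` condition in the Gremlin loop.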
  15. More complex traversals (III)

        gremlin> g.V.has('geocode_waypoint', WITHIN, Geoshape.circle(40.714, -74.0059, 1.0))
        ==>v[4]
        ==>v[320]
        ==>v[2756]
        ==>v[3252]
        ==>v[1348]
        ==>v[1084]
        ==>v[8140]
        gremlin> g.V.has('concept_name', CONTAINS, 'Barack').has('concept_name', CONTAINS, 'Obama').filter({it.concept_status=='Active'})
        ==>v[59360]
        ==>v[714092]
        ==>v[1105536]
        gremlin>
  16. Pipes Traversal Pattern
    • transform
    • filter
    • sideEffect
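The three pipe kinds can be sketched with Python generators, which chain lazily in much the same way that Pipes chains iterators. The function names and the sample vertices are hypothetical and only illustrate the pattern; they are not the Pipes API.

```python
def transform(source, fn):
    """Emit fn(item) for every incoming item."""
    for item in source:
        yield fn(item)


def filter_pipe(source, predicate):
    """Emit only the items for which predicate(item) is true."""
    for item in source:
        if predicate(item):
            yield item


def side_effect(source, fn):
    """Call fn(item) for its effect, then pass the item through unchanged."""
    for item in source:
        fn(item)
        yield item


# Made-up vertices represented as dictionaries.
vertices = [
    {"id": 1, "status": "Active"},
    {"id": 2, "status": "Inactive"},
    {"id": 3, "status": "Active"},
]

seen = []
pipeline = transform(
    side_effect(
        filter_pipe(vertices, lambda v: v["status"] == "Active"),
        seen.append,
    ),
    lambda v: v["id"],
)
result = list(pipeline)  # nothing runs until the pipeline is consumed
print(result)
```

Because each stage is a generator, no vertex is touched until the final `list(...)` call pulls items through the chain.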

  17. Pipes and Filters in Python (I)

        @classmethod
        def v(cls, id):
            cls.pipe = "g.v({0})".format(id)
            return cls

        @classmethod
        def e(cls, id):
            cls.pipe = "g.e({0})".format(repr(id))
            return cls
  18. Pipes and Filters in Python (II)

        @classmethod
        def addV(cls, **kwargs):
            cls.pipe = "v = g.addVertex()\n"
            cls.pipe += "v.setProperty('namespace', '{0}')\n".format(cls.namespace)
            for p, v in kwargs.items():
                cls.pipe += "v.setProperty('{0}', {1})\n".format("_".join([cls.namespace, p]), repr(v))
            cls.pipe += "return v"
            return cls
  19. Pipes and Filters in Python (III)

        @classmethod
        def addE(cls, outV, inV, **kwargs):
            cls.pipe = "outV = g.v({0}); inV = g.v({1}); ".format(outV, inV)
            cls.pipe += "e = g.addEdge(null, outV, inV, '{0}'); ".format(cls.namespace)
            for p, v in kwargs.iteritems():
                cls.pipe += 'e.setProperty("{0}", {1}); '.format("_".join([cls.namespace, p]), repr(v))
            cls.pipe += 'return e'
            return cls
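The methods above build a Groovy script as a string and rely on `repr()` to render property values. That works for simple strings and numbers, but Python's `None`, `True` and `False` are not valid Groovy literals. The following Python 3 sketch shows one way the script generation could handle those cases; `groovy_literal` and `add_edge_script` are hypothetical helpers and the property names are made up.

```python
def groovy_literal(value):
    """Render a simple Python value as a Groovy literal (sketch, not exhaustive)."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool subclasses int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    # Single-quoted Groovy strings do not interpolate; escape \ and '.
    text = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return "'{0}'".format(text)


def add_edge_script(namespace, out_v, in_v, **props):
    """Build a script that adds an edge labeled with the namespace."""
    lines = [
        "outV = g.v({0}); inV = g.v({1});".format(int(out_v), int(in_v)),
        "e = g.addEdge(null, outV, inV, {0});".format(groovy_literal(namespace)),
    ]
    for key, value in sorted(props.items()):
        name = "_".join([namespace, key])  # prefix properties with the namespace
        lines.append("e.setProperty({0}, {1});".format(
            groovy_literal(name), groovy_literal(value)))
    lines.append("return e")
    return "\n".join(lines)


script = add_edge_script("taxonomy", 4, 400, relation="BT", active=True)
print(script)
```

Casting the vertex ids with `int()` and escaping string values also narrows the room for script injection when values come from user input.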
  20. Factory Pattern (I)

        class GraphFactory(type):
            """ Metaclass for graph elements: vertices and edges """

            def __new__(cls, name, bases, dct):
                if 'namespace' not in dct:
                    dct['namespace'] = camel_to_snake(name)
                return super(GraphFactory, cls).__new__(cls, name, bases, dct)
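The metaclass above derives a default `namespace` from the class name through `camel_to_snake`, which the slides do not show. A minimal Python 3 sketch follows, assuming a regex-based implementation of that helper; the model classes and the `custom_namespace` value are invented for the example.

```python
import re


def camel_to_snake(name):
    """Assumed behaviour of the helper: 'ExtractionRule' -> 'extraction_rule'."""
    partial = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", partial).lower()


class GraphFactory(type):
    """Metaclass that fills in a default namespace at class-creation time."""

    def __new__(mcs, name, bases, dct):
        if "namespace" not in dct:
            dct["namespace"] = camel_to_snake(name)
        return super().__new__(mcs, name, bases, dct)


class ExtractionRule(metaclass=GraphFactory):
    pass  # namespace is derived from the class name


class Concept(metaclass=GraphFactory):
    namespace = "custom_namespace"  # an explicit value is left untouched


print(ExtractionRule.namespace)
print(Concept.namespace)
```

Because the check looks at the class body (`dct`) rather than inherited attributes, a subclass that does not set `namespace` gets its own derived value instead of its parent's.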
  21. Factory Pattern (II)

        class VertexElement(object):
            @classmethod
            def v(cls, id):
                cls.pipe = "g.v({0})".format(id)
                return cls

        class EdgeElement(object):
            @classmethod
            def e(cls, id):
                # Edge ids are strings, so quote them as on slide 17.
                cls.pipe = "g.e({0})".format(repr(id))
                return cls
  22. Factory Pattern (III)

        class GraphFactory(type):
            def __call__(cls):
                results = execute_query(cls.pipe)
                if isinstance(results, list):
                    return map(deserialize, results)
                else:
                    return deserialize(results)
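Putting the pieces together: each classmethod appends to `cls.pipe` and returns the class, and the metaclass `__call__` turns a trailing `()` into query execution. The Python 3 sketch below shows that flow end to end under stated assumptions: `execute_query` and `deserialize` are stubs standing in for the talk's functions, the returned row is made up, and `has` is a simplified two-argument version written for this example.

```python
executed = []  # scripts "sent" to the server, recorded by the stub below


def execute_query(script):
    """Stub: a real implementation would send the script to the graph server."""
    executed.append(script)
    return [{"_id": 4, "namespace": "concept"}]  # made-up response


def deserialize(result):
    """Stub: copy the raw result into a plain dictionary."""
    return dict(result)


class GraphFactory(type):
    def __call__(cls):
        # Calling the class runs the accumulated pipe instead of instantiating.
        results = execute_query(cls.pipe)
        if isinstance(results, list):
            return [deserialize(r) for r in results]
        return deserialize(results)


class VertexElement(metaclass=GraphFactory):
    pipe = ""

    @classmethod
    def v(cls, id):
        cls.pipe = "g.v({0})".format(id)
        return cls

    @classmethod
    def has(cls, key, value):
        cls.pipe += ".has({0!r}, {1!r})".format(key, value)
        return cls


class Concept(VertexElement):
    namespace = "concept"


rows = Concept.v(4).has("namespace", Concept.namespace)()
print(executed[0])
print(rows)
```

One consequence of this design is that the pipe lives on the class, so it is shared state: two threads building queries on the same model would overwrite each other's pipe, and the models cannot be instantiated in the usual way.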
  23. Models

        class ExtractionRule(VertexElement):
            __metaclass__ = GraphFactory

            @classmethod
            def get_by_id(cls, id):
                return cls.v(id).has('namespace', EQUAL, cls.namespace)()

            @classmethod
            def get_for_variant(cls, variant, **filters):
                results = cls.V().has('extraction_rule_trigger_term', CONTAINS, *utils.tokenize(variant))
                # concept_type is not defined in this excerpt.
                # The trailing () executes the pipe; return what it yields.
                return results.inV('teragram').filter(type=concept_type).dedup().limit()()
  24. Models (II)

        @classmethod
        def search(cls, id=None, **filters):
            order = ['trigger_term', 'condition', 'descriptor', 'condition_type']
            if id:
                return cls.get_by_id(id)
            # Sort the filters so the highest-priority key ends up last,
            # then pop it and use it for the index lookup.
            parsed = OrderedDict(sorted(filters.items(), key=lambda t: order.index(t[0]), reverse=True))
            index, value = parsed.popitem()
            results = cls.V('_'.join([cls.namespace, index]), value)
            results.filter(**parsed).limit()
            return results()
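The search method picks one filter for the index lookup and applies the rest as in-traversal filters. The selection step can be isolated as follows, assuming the filters arrive as keyword arguments; `split_filters` is a hypothetical name and the filter values are made up.

```python
from collections import OrderedDict

ORDER = ["trigger_term", "condition", "descriptor", "condition_type"]


def split_filters(filters, order=ORDER):
    """Return (index_key, index_value, remaining_filters).

    The key that appears earliest in `order` is used for the index lookup.
    Raises ValueError if a filter key is not listed in `order`.
    """
    parsed = OrderedDict(
        sorted(filters.items(), key=lambda t: order.index(t[0]), reverse=True)
    )
    # With the reversed sort, the highest-priority key is the last item.
    index, value = parsed.popitem()
    return index, value, dict(parsed)


index, value, rest = split_filters(
    {"condition_type": "AND", "trigger_term": "obama", "descriptor": "x"}
)
print(index, value, rest)
```

Starting the traversal from the most selective indexed property keeps the initial vertex set small, so the remaining filters run over fewer elements.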
  25. Conclusions
    - Factories are the most universal design pattern.
    - Don't delegate the creation of types to your code.
    - For bulk imports, use a JVM language.
    - Patterns that don’t do well: SELECT *
  26. Thank You!