Slide 1

Slide 1 text

Graph Database Patterns
Elizabeth Ramirez, @eramirem

Slide 2

Slide 2 text

What this talk is about
●  Property graph definition
●  Graphs at scale with Titan, Cassandra and Elasticsearch
●  Gremlin Query Language
●  Python Patterns for Titan Models

Slide 3

Slide 3 text

What this talk is not about
●  Graph Theory
●  Titan Data Model
●  Existing libraries
●  Best practices

Slide 4

Slide 4 text

The Property Graph Model
G = (V, E, λ)
V: set of vertices
E: set of edges, each identified by the pair of vertex identifiers it connects
λ: set of properties (key-value pairs) attached to vertices and edges
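The definition G = (V, E, λ) can be sketched in a few lines of Python. This is only an illustration of the model, not how Titan stores data. The class name, the method names and the property values are assumptions made for the example. The vertex ids 27512 and 4 and the 'location' label are borrowed from the traversal slides later in the deck.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyGraph:
    """Toy model of G = (V, E, lambda). Names here are illustrative only."""
    vertices: set = field(default_factory=set)       # V: vertex identifiers
    edges: set = field(default_factory=set)          # E: (out_vertex, label, in_vertex)
    properties: dict = field(default_factory=dict)   # lambda: element -> {key: value}

    def add_vertex(self, vid, **props):
        self.vertices.add(vid)
        self.properties[vid] = dict(props)
        return vid

    def add_edge(self, out_v, label, in_v, **props):
        # An edge is only valid if both endpoints are already in V.
        if out_v not in self.vertices or in_v not in self.vertices:
            raise KeyError("both endpoints must already be vertices")
        edge = (out_v, label, in_v)
        self.edges.add(edge)
        self.properties[edge] = dict(props)
        return edge


g = PropertyGraph()
g.add_vertex(4, concept_name="New York")        # property values are made up
g.add_vertex(27512, concept_name="Brooklyn")
e = g.add_edge(27512, "location", 4)
```

Keeping the property map separate from V and E mirrors the triple in the definition. A real store would attach properties to the element records themselves.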

Slide 5

Slide 5 text

Why a Graph Database?
●  Semantic Web: structured knowledge representation
●  Index-free adjacency: like memory pointers, but on disk
●  Navigation between adjacent nodes in constant time
●  Graph != no schema
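Index-free adjacency can be illustrated with a small sketch. Each vertex keeps direct references to its neighbours, so one hop costs time proportional to the vertex's degree rather than to the size of the graph. The scan over a global edge table is shown only for contrast. All names below are hypothetical and the structures live in memory, whereas a graph database keeps the equivalent references on disk.

```python
class Vertex:
    """A vertex that holds direct references to its outgoing neighbours."""

    def __init__(self, vid):
        self.vid = vid
        self.out = {}  # edge label -> list of Vertex references

    def add_out(self, label, other):
        self.out.setdefault(label, []).append(other)


def neighbors_index_free(vertex, label):
    # Follow the references stored on the vertex itself: no global lookup.
    return vertex.out.get(label, [])


def neighbors_by_scan(edge_table, vid, label):
    # Contrast: scan every edge in the graph to find the matching ones.
    return [dst for (src, lbl, dst) in edge_table if src == vid and lbl == label]


a, b, c = Vertex(1), Vertex(2), Vertex(3)
a.add_out("link", b)
a.add_out("link", c)
edge_table = [(1, "link", 2), (1, "link", 3), (2, "link", 3)]

fast = [v.vid for v in neighbors_index_free(a, "link")]
slow = neighbors_by_scan(edge_table, 1, "link")
```

Both calls return the same neighbours. Only the first one stays cheap as the edge table grows.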

Slide 6

Slide 6 text

TinkerPop Stack
●  Vendor agnostic
●  Blueprints: collection of Java interfaces for representing graphs
●  Pipes: extension of Iterator and Iterable; pipes are chained together (Filter, Aggregation, SideEffect, etc.)
●  Groovy: superset of Java; exposes the full JDK to Gremlin
Blueprints → Pipes → Gremlin

Slide 7

Slide 7 text

Why Titan?
●  Multiple options for storage backend (Cassandra, HBase, BerkeleyDB)
●  Multiple options for index backend (Lucene, Elasticsearch)
●  Based on Blueprints API and TinkerPop Stack
●  Locking control to ensure consistency
●  Edge compression, Vertex-Centric Indices
●  Expressive querying using Gremlin:

    outV = g.v(4); inV = g.v(400);
    g.addEdge(null, outV, inV, 'BT');

Slide 8

Slide 8 text

Architecture (I)

Slide 9

Slide 9 text

Architecture (II)

    $ nodetool status titan
    expr: syntax error
    Datacenter: us-east
    ===================
    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving
    --  Address   Load       Tokens  Owns (effective)  Host ID                              Rack
    UN  10.x.x.x  133.09 MB  256     32.1%             9xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx  1e
    UN  10.x.x.x  145.92 MB  256     33.8%             fxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx  1d
    UN  10.x.x.x  135.59 MB  256     34.1%             bxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx  1d

Slide 10

Slide 10 text

Gremlin + Rexster: Remote Query Execution
Gremlin: graph query language; uses Groovy as host language
Rexster: graph server

    RexsterClient client = RexsterClientFactory.open("localhost", "titan");
    List<Map<String, Object>> results = client.execute("g.v(4).map");

    ./gremlin.sh
    gremlin> g = TitanFactory.open('conf/titan-cassandra.properties');
    ==>titangraph[cassandrathrift:127.0.0.1]
    gremlin> g.v(27512).outE('link')
    ==>e[1ffB3-79K-aG][27512-link->1497496]
    ==>e[1ffB5-79K-aG][27512-link->1497500]

Slide 11

Slide 11 text

Graph of Semantic Knowledge

Slide 12

Slide 12 text

Simple Traversals

    gremlin> g.v(20000).has('namespace', 'concept')
    ==>v[20000]
    gremlin> g.V('concept_name', 'California').has('concept_type', 'nytd_geo')
    ==>v[23716]
    gremlin> g.v(27512).out('location')
    ==>v[4]
    gremlin> g.v(27512).outE
    ==>e[1ak0F-79K-bI][27512-teragram->1796728]
    ==>e[1ak0z-79K-bI][27512-teragram->1796712]
    ==>e[1ak0x-79K-bI][27512-teragram->1804588]
    ==>e[1ak0D-79K-bI][27512-teragram->1796716]
    ==>e[1ak0B-79K-bI][27512-teragram->1796720]
    ==>e[1ak0H-79K-bI][27512-teragram->1796724]
    ==>e[1c96H-79K-bY][27512-mapping->1536760]
    ==>e[1c96J-79K-bY][27512-mapping->1655936]
    ==>e[1c96F-79K-bY][27512-mapping->1536756]
    ==>e[1e0RP-79K-cm][27512-location->4]
    gremlin>

Slide 13

Slide 13 text

More complex traversals (I)

Slide 14

Slide 14 text

More complex traversals (II)

    gremlin> m=[];
    gremlin> g.v(12808).as('x').outE('taxonomy').has('taxonomy_relation', 'BT').inV().store(m).loop('x'){it.object.outE('taxonomy').has('taxonomy_relation', 'BT').inV().count() != 0}.iterate()
    ==>null
    gremlin> m
    ==>v[21812]
    ==>v[16492]
    ==>v[10176]
    ==>v[19584]
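The looping traversal above keeps following 'BT' taxonomy edges from the start vertex, stores every vertex it reaches, and stops once a vertex has no further 'BT' edges. A rough Python equivalent over an in-memory adjacency map might look like the sketch below. The function name and the shape of bt_edges are assumptions, and the chain of edges is invented so that it reproduces the ids printed on the slide. The real graph may branch.

```python
def broader_terms(bt_edges, start):
    """Collect every vertex reachable from `start` via BT edges, in visit order."""
    collected = []
    seen = {start}          # guard against cycles (the Gremlin version has none)
    frontier = [start]
    while frontier:
        next_frontier = []
        for vertex in frontier:
            for parent in bt_edges.get(vertex, []):
                if parent not in seen:
                    seen.add(parent)
                    collected.append(parent)
                    next_frontier.append(parent)
        frontier = next_frontier  # loop again only while something was reached
    return collected


# Hypothetical BT edges, arranged as a simple chain for illustration.
bt_edges = {12808: [21812], 21812: [16492], 16492: [10176], 10176: [19584]}
ancestors = broader_terms(bt_edges, 12808)
```

Unlike the Gremlin one-liner, this sketch tracks visited vertices, which would matter if the taxonomy contained a cycle.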

Slide 15

Slide 15 text

More complex traversals (III)

    gremlin> g.V.has('geocode_waypoint', WITHIN, Geoshape.circle(40.714, -74.0059, 1.0))
    ==>v[4]
    ==>v[320]
    ==>v[2756]
    ==>v[3252]
    ==>v[1348]
    ==>v[1084]
    ==>v[8140]
    gremlin> g.V.has('concept_name', CONTAINS, 'Barack').has('concept_name', CONTAINS, 'Obama').filter({it.concept_status=='Active'})
    ==>v[59360]
    ==>v[714092]
    ==>v[1105536]
    gremlin>

Slide 16

Slide 16 text

Pipes Traversal Pattern
●  transform
●  filter
●  sideEffect
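The three kinds of pipe can be mimicked with Python generators, each consuming an iterator and producing another so that they chain lazily. This is a sketch of the pattern only, not the TinkerPop Pipes API. The function names and the sample vertex dictionaries are assumptions for illustration.

```python
def transform(fn, source):
    """Emit fn(item) for each incoming item."""
    for item in source:
        yield fn(item)


def filter_pipe(pred, source):
    """Emit only the items for which pred(item) is true."""
    for item in source:
        if pred(item):
            yield item


def side_effect(fn, source):
    """Call fn(item) for its effect, then pass the item through unchanged."""
    for item in source:
        fn(item)
        yield item


# Hypothetical sample data standing in for vertices.
vertices = [
    {"name": "Barack Obama", "status": "Active"},
    {"name": "Old Concept", "status": "Retired"},
    {"name": "California", "status": "Active"},
]

seen = []
pipeline = transform(
    lambda v: v["name"],
    side_effect(
        seen.append,
        filter_pipe(lambda v: v["status"] == "Active", vertices),
    ),
)
names = list(pipeline)  # nothing runs until the pipeline is consumed
```

Because every stage is a generator, items flow through one at a time, which is the same lazy evaluation that chained pipes rely on.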

Slide 17

Slide 17 text

Pipes and Filters in Python (I)

    @classmethod
    def v(cls, id):
        cls.pipe = "g.v({0})".format(id)
        return cls

    @classmethod
    def e(cls, id):
        cls.pipe = "g.e({0})".format(repr(id))
        return cls

Slide 18

Slide 18 text

Pipes and Filters in Python (II)

    @classmethod
    def addV(cls, **kwargs):
        cls.pipe = "v = g.addVertex()\n"
        cls.pipe += "v.setProperty('namespace', '{0}')\n".format(cls.namespace)
        for p, v in kwargs.items():
            cls.pipe += "v.setProperty('{0}', {1})\n".format("_".join([cls.namespace, p]), repr(v))
        cls.pipe += "return v"
        return cls

Slide 19

Slide 19 text

Pipes and Filters in Python (III)

    @classmethod
    def addE(cls, outV, inV, **kwargs):
        cls.pipe = "outV = g.v({0}); inV = g.v({1}); ".format(outV, inV)
        cls.pipe += "e = g.addEdge(null, outV, inV, '{0}'); ".format(cls.namespace)
        # items() works on both Python 2 and 3, and matches addV
        for p, v in kwargs.items():
            cls.pipe += 'e.setProperty("{0}", {1}); '.format("_".join([cls.namespace, p]), repr(v))
        cls.pipe += 'return e'
        return cls
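To see what addE actually produces, the method can be dropped into a small stand-alone class and the generated Gremlin script inspected. The class name Taxonomy and its namespace value are assumptions chosen so that the property key comes out as taxonomy_relation, the key used in the traversal slides. Nothing is sent to a server here.

```python
class Taxonomy:
    """Hypothetical edge model used only to show the generated script."""
    namespace = "taxonomy"
    pipe = ""

    @classmethod
    def addE(cls, outV, inV, **kwargs):
        cls.pipe = "outV = g.v({0}); inV = g.v({1}); ".format(outV, inV)
        cls.pipe += "e = g.addEdge(null, outV, inV, '{0}'); ".format(cls.namespace)
        for p, v in kwargs.items():
            # Property keys are prefixed with the namespace; values go through repr().
            cls.pipe += 'e.setProperty("{0}", {1}); '.format(
                "_".join([cls.namespace, p]), repr(v)
            )
        cls.pipe += "return e"
        return cls


script = Taxonomy.addE(4, 400, relation="BT").pipe
print(script)
```

Note that repr() happens to yield valid Groovy literals for plain strings and integers, but it is not an escaping mechanism. Values that come from untrusted input would need proper parameter binding instead of string formatting.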

Slide 20

Slide 20 text

Factory Pattern (I)

    class GraphFactory(type):
        """ Metaclass for graph elements: vertices and edges """

        def __new__(cls, name, bases, dct):
            if 'namespace' not in dct:
                dct['namespace'] = camel_to_snake(name)
            return super(GraphFactory, cls).__new__(cls, name, bases, dct)
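The slide does not show camel_to_snake, so the version below is a guess at what such a helper might do. It is paired with the same metaclass written in Python 3 syntax (the slides use Python 2). The model classes and the explicit namespace value "nytd_concept" are made up for the example.

```python
import re


def camel_to_snake(name):
    """Hypothetical helper: 'ExtractionRule' -> 'extraction_rule'."""
    # Insert an underscore before every capital letter except the first.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


class GraphFactory(type):
    """Metaclass that gives each model class a default namespace."""

    def __new__(mcs, name, bases, dct):
        if "namespace" not in dct:
            dct["namespace"] = camel_to_snake(name)
        return super().__new__(mcs, name, bases, dct)


class ExtractionRule(metaclass=GraphFactory):
    pass  # namespace is derived from the class name


class Concept(metaclass=GraphFactory):
    namespace = "nytd_concept"  # an explicit namespace is left untouched
```

With this in place, ExtractionRule.namespace is "extraction_rule", which lines up with property keys such as extraction_rule_trigger_term used in the model slides.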

Slide 21

Slide 21 text

Factory Pattern (II)

    class VertexElement(object):
        @classmethod
        def v(cls, id):
            cls.pipe = "g.v({0})".format(id)
            return cls

    class EdgeElement(object):
        @classmethod
        def e(cls, id):
            # Edge ids are strings (e.g. '1ffB3-79K-aG'), so quote them with repr()
            cls.pipe = "g.e({0})".format(repr(id))
            return cls

Slide 22

Slide 22 text

Factory Pattern (III)

    class GraphFactory(type):
        def __call__(cls):
            results = execute_query(cls.pipe)
            if isinstance(results, list):
                return map(deserialize, results)
            else:
                return deserialize(results)
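Overriding __call__ on the metaclass means that "calling" a model class runs the accumulated query instead of creating an instance. The sketch below shows that mechanism end to end in Python 3. execute_query and deserialize are stand-ins with canned data, since the slides do not show them, and the "_id" and "_type" keys are assumptions about the result format.

```python
def execute_query(script):
    """Stand-in for the remote call (for example to Rexster); returns canned data."""
    canned = {"g.v(4)": {"_id": 4, "_type": "vertex"}}
    return canned.get(script, [])


def deserialize(result):
    """Stand-in deserializer: copies the raw result into a plain dict."""
    return dict(result)


class GraphFactory(type):
    def __call__(cls):
        # Calling the class executes its pending query instead of instantiating it.
        results = execute_query(cls.pipe)
        if isinstance(results, list):
            return [deserialize(r) for r in results]
        return deserialize(results)


class VertexElement(metaclass=GraphFactory):
    pipe = ""

    @classmethod
    def v(cls, id):
        cls.pipe = "g.v({0})".format(id)
        return cls


vertex = VertexElement.v(4)()    # build the query, then execute it
missing = VertexElement.v(99)()  # no canned data, so an empty list comes back
```

One consequence of this design is that the query string lives on the class itself, so it is shared state. Two callers building queries on the same model at the same time would overwrite each other's pipe.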

Slide 23

Slide 23 text

Models

    class ExtractionRule(VertexElement):
        __metaclass__ = GraphFactory

        @classmethod
        def get_by_id(cls, id):
            return cls.v(id).has('namespace', EQUAL, cls.namespace)()

        @classmethod
        def get_for_variant(cls, variant, **filters):
            results = cls.V().has('extraction_rule_trigger_term', CONTAINS, *utils.tokenize(variant))
            # concept_type is not defined on the slide; presumably it comes from filters
            results.inV('teragram').filter(type=concept_type).dedup().limit()()
            return results

Slide 24

Slide 24 text

Models (II)

    @classmethod
    def search(cls, id=None, **filters):
        order = ['trigger_term', 'condition', 'descriptor', 'condition_type']
        if id:
            results = cls.get_by_id(id)
            return results
        else:
            # parsed is not defined on the slide; presumably it is derived from filters
            parsed = OrderedDict(sorted(parsed.items(), key=lambda t: order.index(t[0]), reverse=True))
            index, value = parsed.popitem()
            results = cls.V('_'.join([cls.namespace, index]), value)
            results.filter(**parsed).limit()
            return results()

Slide 25

Slide 25 text

Conclusions
- Factories are the most universal design patterns.
- Don't delegate the creation of types to your code.
- For bulk imports, use a JVM language.
- Patterns that don't do well: SELECT *

Slide 26

Slide 26 text

Thank You!