Slide 1

Slide 1 text

Chemical Structure Handling in MongoDB
Matt Swain
Cavendish Laboratory, University of Cambridge

Slide 2

Slide 2 text

Cambridge Cheminformatics Meeting, 26/11/2014, Matt Swain

Overview
• What is MongoDB?
• Chemical similarity searching
• Optimising screening methods
• Benchmarking query performance
• Scaling up to multiple servers

Slide 3

Slide 3 text

Relational database (four tables):
  Molecule: id, smiles
  Synonym:  mol_id, name
  Vendor:   id, name
  Mol-Vend: mol_id, vendor_id, vid

MongoDB (one collection of documents):
  Molecule: id, smiles, name [], vendors [{name, vid}]

Slide 4

Slide 4 text

Document Example

    document = {
        _id: "CHEMBL25",
        smiles: "CC(=O)Oc1ccccc1C(=O)O",
        name: ["Aspirin", "2-Acetoxybenzoic acid", "Ecotrin"],
        rdmol: BinData(...),
        vendors: [
            {name: "Sigma-Aldrich", vid: "A2093_SIGMA"},
            {name: "Acros Organics", vid: "15818"}
        ]
    }

    db.mols.insert(document)
    db.mols.find({name: "Aspirin"})
    db.mols.find({"vendors.name": "Sigma-Aldrich"})

Slide 5

Slide 5 text

Terminology

Relational Database    MongoDB
Table                  Collection
Row                    Document
Column                 Field
Join                   Embed (or Reference)
Primary Key            _id
Partition              Shard

http://docs.mongodb.org/manual/reference/sql-comparison/

Slide 6

Slide 6 text

Why MongoDB?
• Easy to get started with small projects
• Schema-less – easy prototyping, heterogeneous data
• Language drivers – C, C++, Java, Python, R, Ruby, …
• High performance and scalability for large projects

Slide 7

Slide 7 text

Why not MongoDB?
• No multi-document ACID transactions
• No joins
• Relatively new, lacking a proven track record
• PostgreSQL is rapidly adding JSON and schema-less features

Slide 8

Slide 8 text

The Database Landscape

[Figure: databases placed on two axes, "Performance and Scalability" and "Features and Functionality".]
• Key-value store: Redis, Memcached
• Document-oriented: MongoDB, CouchDB
• Relational database: PostgreSQL, Oracle

Slide 9

Slide 9 text

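Similarity Search Basics
• Fingerprints: a bit string such as 01001011…, stored sparsely as the positions of its 1-bits: [2, 5, 7, 8…]
• Tanimoto coefficient: T = Nab / (Na + Nb − Nab), where Na and Nb are the numbers of 1-bits in the two fingerprints and Nab is the number of 1-bits they share
• Typical tasks:
  • Return all molecules where T > 0.8 with query molecule
  • Return top 20 molecules ranked by T with query molecule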
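
A minimal Python sketch of the Tanimoto coefficient on sparse fingerprints, assuming each fingerprint is held as a list of 1-bit positions. The function name `tanimoto` and the second fingerprint are illustrative, not taken from the talk.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as iterables of 1-bit positions."""
    a, b = set(fp_a), set(fp_b)
    common = len(a & b)
    union = len(a) + len(b) - common
    # Two empty fingerprints are treated as dissimilar (an assumption of this sketch).
    return common / union if union else 0.0


query = [2, 5, 7, 8]
other = [2, 5, 8, 9, 11]
score = tanimoto(query, other)  # 3 shared bits out of 6 distinct bits
```

Working on sets of bit positions rather than full bit strings keeps the cost proportional to the number of 1-bits, which suits long, sparse fingerprints.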

Slide 10

Slide 10 text

Screening: 1-bit Count Bounds
• Use easily searchable fingerprint properties to filter
• Number of 1-bits in a result must lie within a range set by the query

T·Na ≤ Nb ≤ Na / T

T = Tanimoto threshold
Na = Number of 1-bits in query fingerprint
Nb = Number of 1-bits in eligible result fingerprint

Swamidass, S. J., & Baldi, P. J. Chem. Inf. Model. 2007, 47, 302–317. 10.1021/ci600358f
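
The count-bound screen can be sketched in Python as follows, taking the lower bound as ⌈T·Na⌉ and the upper bound as ⌊Na/T⌋. The function name `count_bounds` and the example values are assumptions for illustration.

```python
from math import ceil, floor


def count_bounds(na, threshold):
    """Return the (min, max) number of 1-bits an eligible result fingerprint may have."""
    lower = int(ceil(na * threshold))
    upper = int(floor(na / threshold))
    return lower, upper


# A query fingerprint with 23 one-bits and a Tanimoto threshold of 0.8.
qmin, qmax = count_bounds(23, 0.8)
```

Any fingerprint whose bit count falls outside `[qmin, qmax]` cannot reach the threshold, so it can be discarded using only an indexed integer field.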

Slide 11

Slide 11 text

Screening: Required Bits
• At least one bit must be in common between the query fingerprint and a result fingerprint
• What is the smallest subset of the query fingerprint where this is still true?
• Any Na – TNa + 1 query bits will do: if a result shares none of them, then even if the remaining TNa – 1 bits are all in common, it wouldn't be enough

Davy Suvee, 2011, http://datablend.be/?p=254
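
The size of the required-bit subset can be worked out in a few lines. A hit must share at least ⌈T·Na⌉ bits with the query, so any Na − ⌈T·Na⌉ + 1 query bits must include at least one shared bit. The sketch below simply takes the lowest-numbered bits; the function name `required_bits` and the example fingerprint are assumptions.

```python
from math import ceil


def required_bits(qfp, threshold):
    """Return a subset of query bits of which every hit must contain at least one."""
    qn = len(qfp)
    min_common = int(ceil(qn * threshold))  # Fewest shared bits a hit can have
    ncommon = qn - min_common + 1           # Subset size that guarantees one shared bit
    return sorted(qfp)[:ncommon]


qfp = [11, 23, 33, 64, 80, 138, 175, 183, 193]
reqbits = required_bits(qfp, 0.8)
```

Which bits go into the subset does not affect correctness, only how many candidates survive the screen.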

Slide 12

Slide 12 text

Fingerprints in MongoDB
• RDKit Morgan Fingerprint (circular, ECFP-like)
• Pre-calculate fingerprint bits and count. Add indexes.

    {
        _id: "CHEMBL25",
        count: 24,
        bits: [11, 23, 33, 64, 80, 138, 175, 183, 193, 214, 239, 295, ...]
    }

• Screening:

    db.mols.find({count: {$gte: qmin, $lte: qmax}})  // Count bounds
    db.mols.find({bits: {$in: reqbits}})             // Required bits

Slide 13

Slide 13 text

Optimising search performance
• Screening methods: count bounds, (rarest) required bits
• Fingerprint radius: 2, 3, 4
• Fingerprint folding: none, 2048, 1024, 512

Bit counts collection (number of molecules containing each bit):

    {"_id": 1301, "count": 1},
    {"_id": 5338, "count": 3},
    {"_id": 20821, "count": 8}

Selecting the rarest required bits:

    reqbits = [c['_id'] for c in count_collection.find({'_id': {'$in': qfp}})
                                                 .sort('count', 1)
                                                 .limit(ncommon)]
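
The idea of choosing the rarest required bits can be tried without a database. The sketch below uses an in-memory `collections.Counter` as a stand-in for the bit counts collection; the function names and the toy fingerprints are assumptions for illustration.

```python
from collections import Counter
from math import ceil


def build_bit_counts(fingerprints):
    """Count how many fingerprints contain each bit (stand-in for the counts collection)."""
    counts = Counter()
    for fp in fingerprints:
        counts.update(set(fp))
    return counts


def rarest_required_bits(qfp, threshold, counts):
    """Pick the ncommon query bits that occur least often in the database."""
    bits = set(qfp)
    qn = len(bits)
    ncommon = qn - int(ceil(qn * threshold)) + 1
    # Sort by database frequency, breaking ties by bit id for a stable result.
    return sorted(bits, key=lambda bit: (counts[bit], bit))[:ncommon]


database = [[1, 2, 3], [1, 2, 4], [1, 3, 5], [1, 2, 3, 5]]
counts = build_bit_counts(database)
reqbits = rarest_required_bits([1, 2, 3, 4, 5], 0.7, counts)
```

Rare bits match few documents, so requiring one of them leaves fewer candidates for the exact Tanimoto calculation than an arbitrary choice of bits would.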

Slide 14

Slide 14 text

Screening Methods

Slide 15

Slide 15 text

Fingerprint Folding
Unfolded fingerprints have 499,695 different bits in ChEMBL
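
Folding can be sketched as mapping each unfolded bit identifier into a fixed number of positions. The example below assumes folding by taking the identifier modulo the folded length; this is an illustrative assumption and may differ in detail from how RDKit folds Morgan fingerprints. The function name `fold` and the bit identifiers are hypothetical.

```python
def fold(bits, length):
    """Fold sparse bit identifiers into `length` positions; colliding bits merge."""
    return sorted({bit % length for bit in bits})


unfolded = [11, 523, 1035, 4097, 98765]
folded = fold(unfolded, 512)  # 11, 523 and 1035 all collide at position 11
```

Collisions shrink the set of distinct bits, so each folded bit is shared by more molecules and is less selective as a required-bit screen.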

Slide 16

Slide 16 text

Fingerprint Radius

Radius    Unique bits
2         499,695
3         2,590,940
4         6,314,755

Slide 18

Slide 18 text

Aggregation Framework
• Perform a pipeline of processing tasks on the server
• Compute new fields; filter, transform and group documents

Pipeline stages:
1. Filter using screening methods
2. Calculate bit intersection with query fp
3. Calculate exact Tanimoto coefficient
4. Filter based on Tanimoto threshold

Slide 19

Slide 19 text

Aggregation Pipeline

    from math import ceil

    qn = len(qfp)                     # Num bits in query fingerprint
    qmin = int(ceil(qn * threshold))  # Min num bits in results fingerprints
    qmax = int(qn / threshold)        # Max num bits in results fingerprints
    ncommon = qn - qmin + 1           # Num bits of which at least 1 must be in common

    # Rarest ncommon query bits, extracted as a list of bit ids
    reqbits = [c['_id'] for c in count_collection.find({'_id': {'$in': qfp}}).sort('count', 1).limit(ncommon)]

    aggregate = [
        {'$match': {'count': {'$gte': qmin, '$lte': qmax}, 'bits': {'$in': reqbits}}},
        {'$project': {
            'tanimoto': {'$let': {
                'vars': {'common': {'$size': {'$setIntersection': ['$bits', qfp]}}},
                'in': {'$divide': ['$$common', {'$subtract': [{'$add': [qn, '$count']}, '$$common']}]}
            }},
        }},
        {'$match': {'tanimoto': {'$gte': threshold}}}
    ]
    response = fp_collection.aggregate(aggregate)
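
To check the logic of the pipeline without a running server, the same stages can be emulated over an in-memory list of documents. This is a sketch for illustration only, not how MongoDB executes the pipeline; `similarity_search` and the sample documents are hypothetical.

```python
from math import ceil


def similarity_search(docs, qfp, threshold, reqbits):
    """Emulate the $match / $project / $match stages over a list of dicts."""
    qn = len(qfp)
    qmin = int(ceil(qn * threshold))
    qmax = int(qn / threshold)
    qset, required = set(qfp), set(reqbits)
    results = []
    for doc in docs:
        # First $match: screen by count bounds and required bits.
        if not (qmin <= doc['count'] <= qmax):
            continue
        if not required.intersection(doc['bits']):
            continue
        # $project: exact Tanimoto coefficient from the bit intersection.
        common = len(qset.intersection(doc['bits']))
        tanimoto = common / (qn + doc['count'] - common)
        # Second $match: keep only hits at or above the threshold.
        if tanimoto >= threshold:
            results.append({'_id': doc['_id'], 'tanimoto': tanimoto})
    return results


docs = [
    {'_id': 'A', 'count': 4, 'bits': [1, 2, 3, 4]},
    {'_id': 'B', 'count': 5, 'bits': [1, 2, 3, 4, 5]},
    {'_id': 'C', 'count': 4, 'bits': [1, 2, 6, 7]},
    {'_id': 'D', 'count': 9, 'bits': [1, 2, 3, 4, 5, 6, 7, 8, 9]},
]
hits = similarity_search(docs, qfp=[1, 2, 3, 4, 5], threshold=0.7, reqbits=[4, 5])
```

In this toy data, document C is rejected by the required-bits screen and D by the count bounds, so the exact coefficient is computed only for A and B.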

Slide 20

Slide 20 text

Benchmarks: Folding

Slide 21

Slide 21 text

Benchmarks: Radius

Slide 22

Slide 22 text

Benchmarks: PostgreSQL

Slide 23

Slide 23 text

Scaling Up
• Hard limit on the CPU, RAM and storage of a single server
• Awful performance if working set doesn't fit in RAM
• Sharding: horizontal scaling to multiple servers
• Range-based sharding vs hash-based sharding
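
The difference between range-based and hash-based sharding can be illustrated with a toy shard-assignment function. This is a conceptual sketch only: the shard boundaries, the use of MD5, and the function names are assumptions for illustration and do not reproduce MongoDB's internal chunking or hash function.

```python
import hashlib
from bisect import bisect_right


def range_shard(key, boundaries):
    """Assign a key to a shard by comparing it with sorted range boundaries."""
    return bisect_right(boundaries, key)


def hash_shard(key, nshards):
    """Assign a key to a shard by hashing it, which spreads nearby keys apart."""
    digest = hashlib.md5(str(key).encode('utf-8')).hexdigest()
    return int(digest, 16) % nshards


# Shard key here is the fingerprint bit count of each molecule.
bit_counts = [18, 19, 20, 21, 22, 45]
by_range = [range_shard(c, boundaries=[15, 30]) for c in bit_counts]
by_hash = [hash_shard(c, nshards=3) for c in bit_counts]
```

With a range-based key, a count-bounds query only needs the shards whose ranges overlap the bounds, but similar keys pile up on one shard. A hashed key balances the load more evenly, at the cost of sending range queries to every shard.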

Slide 24

Slide 24 text

Future work
• Sharding – can we handle 100 million molecules?
• "kD grid" method for improved screening
• Substructure searches
• "Top K hits" searches
• Count-based fingerprints
• Multi-molecule queries, compound queries

Kristensen, T. G. et al. Algorithms Mol. Biol. 2010, 5, 9. 10.1186/1748-7188-5-9

Slide 25

Slide 25 text

Conclusions
• Possible to use MongoDB for a chemical database
• Comparable performance to PostgreSQL
• Use sparse, unfolded fingerprints for best performance
• Potential for horizontal scaling for huge databases

Slide 26

Slide 26 text

Source Code
github.com/mcs07/mongodb-chemistry

    $ mchem load mymols.sdf
    $ mchem addfp
    $ mchem countfp
    $ mchem similar "O=C(Oc1ccccc1C(=O)O)C"
    CHEMBL25 CHEMBL350343

Slide 27

Slide 27 text

http://try.mongodb.org

Slide 28

Slide 28 text

Further Information
• http://blog.matt-swain.com/post/87093745652/
• Swamidass, S. J., & Baldi, P. Bounds and Algorithms for Fast Exact Searches of Chemical Fingerprints in Linear and Sublinear Time. J. Chem. Inf. Model. 2007, 47, 302–317. 10.1021/ci600358f
• Rogers, D., & Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754. 10.1021/ci100050t
• Kristensen, T. G. et al. A tree-based method for the rapid screening of chemical fingerprints. Algorithms Mol. Biol. 2010, 5, 9. 10.1186/1748-7188-5-9
• RDKit – http://www.rdkit.org/docs/GettingStartedInPython.html
• MongoDB – http://docs.mongodb.org/manual/