Chemical Structure Handling in MongoDB

Matt Swain
November 26, 2014

An investigation into using MongoDB for a chemical database, with a focus on chemical similarity searching.

Transcript

  1. Cambridge Cheminformatics Meeting, 26/11/2014, Matt Swain

     Overview
     • What is MongoDB?
     • Chemical similarity searching
     • Optimising screening methods
     • Benchmarking query performance
     • Scaling up to multiple servers
  2. Relational database vs MongoDB

     Relational database (four tables):
     • Molecule: id, smiles
     • Synonym: mol_id, name
     • Vendor: id, name
     • Mol-Vend: mol_id, vendor_id, vid

     MongoDB (one collection):
     • Molecule: id, smiles, name [], vendors [{name, vid}]
  3. Document example

     document = {
       _id: "CHEMBL25",
       smiles: "CC(=O)Oc1ccccc1C(=O)O",
       name: ["Aspirin", "2-Acetoxybenzoic acid", "Ecotrin"],
       rdmol: BinData(...),
       vendors: [
         {name: "Sigma-Aldrich", vid: "A2093_SIGMA"},
         {name: "Acros Organics", vid: "15818"}
       ]
     }

     db.mols.insert(document)
     db.mols.find({name: "Aspirin"})
     db.mols.find({"vendors.name": "Sigma-Aldrich"})  // dotted field paths must be quoted
  4. Terminology

     Relational database    MongoDB
     Table                  Collection
     Row                    Document
     Column                 Field
     Join                   Embed (or Reference)
     Primary key            _id
     Partition              Shard

     Source: http://docs.mongodb.org/manual/reference/sql-comparison/
  5. Why MongoDB?

     • Easy to get started with small projects
     • Schema-less: easy prototyping, heterogeneous data
     • Language drivers: C, C++, Java, Python, R, Ruby, …
     • High performance and scalability for large projects
  6. Why not MongoDB?

     • No multi-document ACID transactions
     • No joins
     • Relatively new, lacking a proven track record
     • PostgreSQL is rapidly adding JSON and schema-less features
  7. The Database Landscape

     [Figure: databases placed on two axes, "Performance and Scalability" and "Features and Functionality", grouped as key-value stores, document-oriented databases and relational databases. Systems shown: Memcached, Redis, MongoDB, CouchDB, PostgreSQL, Oracle.]
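  8. Similarity Search Basics

     • Fingerprints: a bit string such as 01001011…, stored as the positions of its 1-bits, [2, 5, 7, 8…]
     • Tanimoto coefficient: T = Nab / (Na + Nb − Nab), where Na and Nb are the numbers of 1-bits in the two fingerprints and Nab is the number they have in common
     • Typical tasks:
       - Return all molecules where T > 0.8 with query molecule
       - Return top 20 molecules ranked by T with query molecule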
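
The fingerprint representation and the Tanimoto coefficient can be sketched in a few lines of Python. This is a minimal illustration rather than code from the talk: the helper names `on_bits` and `tanimoto` are hypothetical, and it assumes the standard definition of the coefficient, the number of shared 1-bits divided by the number of distinct 1-bits across both fingerprints.

```python
def on_bits(bitstring):
    """Convert a dense bit string such as '01001011' to 1-based on-bit positions."""
    return [i for i, bit in enumerate(bitstring, start=1) if bit == "1"]


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    common = len(a & b)
    union = len(a) + len(b) - common
    # Two empty fingerprints have no bits to compare; treat them as dissimilar here.
    return common / union if union else 0.0


query = on_bits("01001011")   # [2, 5, 7, 8]
other = on_bits("01001101")   # [2, 5, 6, 8]
print(query, other, tanimoto(query, other))  # 3 shared bits out of 5 distinct -> 0.6
```

Storing only the on-bit positions keeps sparse fingerprints compact and turns the intersection into a plain set operation.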
  9. Screening: 1-bit Count Bounds

     • Use easily searchable fingerprint properties to filter
     • Number of 1-bits in results must be in range of query: T·Na ≤ Nb ≤ Na / T

     T = Tanimoto threshold
     Na = Number of 1-bits in query fingerprint
     Nb = Number of 1-bits in eligible result fingerprint

     Swamidass, S. J., & Baldi, P. J. Chem. Inf. Model. 2007, 47, 302–317. 10.1021/ci600358f
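
As a rough sketch of the count-bound screen, the range of eligible 1-bit counts follows directly from T, Na and Nb as defined above: a result needs at least T·Na and at most Na/T 1-bits. The function name `count_bounds` is an assumption made for this example.

```python
from math import ceil


def count_bounds(na, threshold):
    """Return (min, max) 1-bit counts a fingerprint can have and still reach the threshold."""
    qmin = int(ceil(na * threshold))  # Nb >= T * Na, rounded up to a whole bit
    qmax = int(na / threshold)        # Nb <= Na / T, rounded down to a whole bit
    return qmin, qmax


# A query with 24 on-bits at T = 0.8 can only match fingerprints with 20 to 30 on-bits.
print(count_bounds(24, 0.8))  # (20, 30)
```

Because the bounds depend only on the pre-computed count, an index on that field can discard most of the collection before any fingerprint is compared.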
  10. Screening: Required Bits

     • At least one bit must be in common between the query fingerprint and a result fingerprint
     • What is the smallest subset of the query fingerprint where this is still true?
     • Any Na − ⌈T·Na⌉ + 1 query bits will do: if a result contains none of them, then even if the remaining ⌈T·Na⌉ − 1 bits are all in common, it wouldn't be enough to reach the threshold

     Davy Suvee, 2011, http://datablend.be/?p=254
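
The required-bits argument can be made concrete with a small sketch. It assumes that a hit must share at least ceil(T × Na) bits with the query, so any subset of Na − ceil(T × Na) + 1 query bits must overlap every hit. The function names and the example fingerprint are illustrative only.

```python
from math import ceil


def required_bit_count(na, threshold):
    """Size of a query-bit subset that every hit must overlap in at least one bit."""
    qmin = int(ceil(na * threshold))  # fewest common bits any hit can have
    return na - qmin + 1


def passes_required_bits(candidate_bits, required_bits):
    """Screen: keep the candidate only if it contains at least one required bit."""
    return not set(required_bits).isdisjoint(candidate_bits)


query = [11, 23, 33, 64, 80, 138, 175, 183, 193, 214]
n_required = required_bit_count(len(query), 0.8)   # 10 - 8 + 1 = 3
required = query[:n_required]                      # any 3 query bits would do

print(n_required, required)
print(passes_required_bits([11, 500, 900], required))  # True: shares bit 11
print(passes_required_bits([64, 80, 138], required))   # False: shares none of the required bits
```

Passing the screen does not make a candidate a hit; it only means the exact Tanimoto coefficient still has to be computed for it.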
  11. Fingerprints in MongoDB

     • RDKit Morgan fingerprint (circular, ECFP-like)
     • Pre-calculate fingerprint bits and count. Add indexes.
     • Screening:

     {
       _id: "CHEMBL25",
       count: 24,
       bits: [11, 23, 33, 64, 80, 138, 175, 183, 193, 214, 239, 295, ...]
     }

     db.mols.find({count: {$gte: qmin, $lte: qmax}})  // Count bounds
     db.mols.find({bits: {$in: reqbits}})             // Required bits
  12. Optimising search performance

     • Screening methods: count bounds, (rarest) required bits
     • Fingerprint radius: 2, 3, 4
     • Fingerprint folding: none, 2048, 1024, 512

     Count collection, one document per fingerprint bit:

     {"_id": 1301, "count": 1},
     {"_id": 5338, "count": 3},
     {"_id": 20821, "count": 8}

     # Choose the ncommon rarest bits of the query fingerprint as the required bits
     cursor = count_collection.find({'_id': {'$in': qfp}}).sort('count', 1).limit(ncommon)
     reqbits = [doc['_id'] for doc in cursor]
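
Choosing the rarest query bits as the required bits makes the screen as selective as possible, since few stored fingerprints contain them. The sketch below mimics that selection in plain Python, with a dictionary standing in for the count collection. The name `rarest_required_bits` and the sample counts for bits 64 and 80 are invented for illustration, and treating an unseen bit as count 0 is an assumption of this sketch.

```python
def rarest_required_bits(query_bits, bit_counts, n_required):
    """Pick the n_required query bits that occur in the fewest database fingerprints."""
    # Bits missing from bit_counts occur in no stored fingerprint, so they count as 0.
    ranked = sorted(query_bits, key=lambda bit: (bit_counts.get(bit, 0), bit))
    return ranked[:n_required]


# Hypothetical stand-in for the count collection: bit index -> number of molecules.
bit_counts = {1301: 1, 5338: 3, 20821: 8, 64: 950, 80: 1200}
query_bits = [64, 80, 1301, 5338, 20821]

print(rarest_required_bits(query_bits, bit_counts, 3))  # [1301, 5338, 20821]
```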
  13. Aggregation Framework

     • Perform a pipeline of processing tasks on the server
     • Compute new fields; filter, transform and group documents

     Pipeline stages:
     1. Filter using screening methods
     2. Calculate bit intersection with query fp
     3. Calculate exact Tanimoto coefficient
     4. Filter based on Tanimoto threshold
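
To make the four stages easier to follow, here is an in-memory Python sketch that applies them to a list of fingerprint documents shaped like those in the deck (`_id`, `count`, `bits`). It is not the server-side pipeline itself: the function `similarity_search` and the toy data are assumptions for illustration.

```python
from math import ceil


def similarity_search(query_bits, molecules, threshold, required_bits):
    """Run the four pipeline stages over an in-memory list of fingerprint documents."""
    qfp = set(query_bits)
    qn = len(qfp)
    qmin, qmax = int(ceil(qn * threshold)), int(qn / threshold)
    hits = []
    for mol in molecules:
        # Stage 1: screen on 1-bit count bounds and required bits.
        if not qmin <= mol["count"] <= qmax:
            continue
        if set(required_bits).isdisjoint(mol["bits"]):
            continue
        # Stage 2: bit intersection with the query fingerprint.
        common = len(qfp.intersection(mol["bits"]))
        # Stage 3: exact Tanimoto coefficient.
        tanimoto = common / (qn + mol["count"] - common)
        # Stage 4: filter on the threshold.
        if tanimoto >= threshold:
            hits.append({"_id": mol["_id"], "tanimoto": tanimoto})
    return hits


molecules = [
    {"_id": "A", "count": 5, "bits": [1, 2, 3, 4, 5]},   # identical to the query
    {"_id": "B", "count": 5, "bits": [1, 2, 3, 4, 9]},   # 4 of 6 distinct bits shared
    {"_id": "C", "count": 2, "bits": [1, 2]},            # too few bits: fails stage 1
    {"_id": "D", "count": 5, "bits": [1, 2, 3, 7, 8]},   # passes screening, fails stage 4
]
query = [1, 2, 3, 4, 5]
# At T = 0.5: qmin = 3, so 5 - 3 + 1 = 3 required bits; here simply the first three.
hits = similarity_search(query, molecules, 0.5, required_bits=query[:3])
print([hit["_id"] for hit in hits])  # ['A', 'B']
```

Molecule D shows why the final filter is still needed: screening removes only candidates that cannot reach the threshold, not every candidate that falls short of it.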
  14. Aggregation Pipeline

     from math import ceil

     # qfp: list of 1-bit positions in the query fingerprint; threshold: Tanimoto cutoff
     qn = len(qfp)                         # Num bits in query fingerprint
     qmin = int(ceil(qn * threshold))      # Min num bits in results fingerprints
     qmax = int(qn / threshold)            # Max num bits in results fingerprints
     ncommon = qn - qmin + 1               # Num bits where at least 1 must be in common

     # The ncommon rarest query bits; $in needs a list of bit ids, not a cursor
     cursor = count_collection.find({'_id': {'$in': qfp}}).sort('count', 1).limit(ncommon)
     reqbits = [doc['_id'] for doc in cursor]

     aggregate = [
         # Screen on count bounds and required bits
         {'$match': {'count': {'$gte': qmin, '$lte': qmax}, 'bits': {'$in': reqbits}}},
         # Tanimoto = common / (qn + count - common)
         {'$project': {
             'tanimoto': {'$let': {
                 'vars': {'common': {'$size': {'$setIntersection': ['$bits', qfp]}}},
                 'in': {'$divide': ['$$common', {'$subtract': [{'$add': [qn, '$count']}, '$$common']}]}
             }},
         }},
         # Keep only results at or above the threshold
         {'$match': {'tanimoto': {'$gte': threshold}}}
     ]
     response = fp_collection.aggregate(aggregate)
  15. Scaling Up

     • Hard limit on the CPU, RAM and storage of a single server
     • Awful performance if working set doesn't fit in RAM
     • Sharding: horizontal scaling to multiple servers
     • Range-based sharding vs hash-based sharding
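
The difference between the two sharding strategies can be illustrated with a toy model. This sketch is not MongoDB's implementation: the shard boundaries, the use of MD5 and the modulo step are all assumptions chosen to show the general idea. Range-based placement keeps neighbouring keys together, while hash-based placement scatters them.

```python
import hashlib
from bisect import bisect_right


def range_shard(key, boundaries):
    """Range-based: shard i holds keys from boundaries[i-1] up to (not including) boundaries[i]."""
    return bisect_right(boundaries, key)


def hash_shard(key, n_shards):
    """Hash-based: hash the key first, then map the hash onto one of n_shards."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards


boundaries = [1000, 2000, 3000]   # 4 shards: <1000, 1000-1999, 2000-2999, >=3000
new_ids = range(3000, 3100)       # a batch of sequential, newly inserted ids

print({range_shard(i, boundaries) for i in new_ids})  # {3}: every insert hits one shard
print(sorted({hash_shard(i, 4) for i in new_ids}))    # spread over several shards
```

In this toy model, range-based placement serves a query over a contiguous key range from few shards, whereas hash-based placement spreads sequential inserts more evenly.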
  16. Future work

     • Sharding – can we handle 100 million molecules?
     • "kD grid" method for improved screening
     • Substructure searches
     • "Top K hits" searches
     • Count-based fingerprints
     • Multi-molecule queries, compound queries

     Kristensen, T. G. et al. Algorithms Mol. Biol. 2010, 5, 9. 10.1186/1748-7188-5-9
  17. Conclusions

     • Possible to use MongoDB for a chemical database
     • Comparable performance to PostgreSQL
     • Use sparse, unfolded fingerprints for best performance
     • Potential for horizontal scaling for huge databases
  18. Source Code

     github.com/mcs07/mongodb-chemistry

     $ mchem load mymols.sdf
     $ mchem addfp
     $ mchem countfp
     $ mchem similar "O=C(Oc1ccccc1C(=O)O)C"
     CHEMBL25
     CHEMBL350343
  19. Further Information

     • http://blog.matt-swain.com/post/87093745652/
     • Swamidass, S. J., & Baldi, P. Bounds and Algorithms for Fast Exact Searches of Chemical Fingerprints in Linear and Sublinear Time. J. Chem. Inf. Model. 2007, 47, 302–317. 10.1021/ci600358f
     • Rogers, D., & Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754. 10.1021/ci100050t
     • Kristensen, T. G. et al. A tree-based method for the rapid screening of chemical fingerprints. Algorithms Mol. Biol. 2010, 5, 9. 10.1186/1748-7188-5-9
     • RDKit – http://www.rdkit.org/docs/GettingStartedInPython.html
     • MongoDB – http://docs.mongodb.org/manual/