Chemical Structure Handling in MongoDB

Matt Swain
November 26, 2014

An investigation into using MongoDB for a chemical database, with a focus on chemical similarity searching.


Transcript

  1. Cambridge Cheminformatics Meeting 26/11/2014 Matt Swain

     Overview
     • What is MongoDB?
     • Chemical similarity searching
     • Optimising screening methods
     • Benchmarking query performance
     • Scaling up to multiple servers
  2. Relational database vs MongoDB

     Relational database (four tables):
     • Molecule: id, smiles
     • Synonym: mol_id, name
     • Vendor: id, name
     • Mol-Vend: mol_id, vendor_id, vid

     MongoDB (one collection):
     • Molecule: id, smiles, name [], vendors [{name, vid}]
  3. Document

     Example:

         document = {
             _id: "CHEMBL25",
             smiles: "CC(=O)Oc1ccccc1C(=O)O",
             name: ["Aspirin", "2-Acetoxybenzoic acid", "Ecotrin"],
             rdmol: BinData(...),
             vendors: [
                 {name: "Sigma-Aldrich", vid: "A2093_SIGMA"},
                 {name: "Acros Organics", vid: "15818"}
             ]
         }
         db.mols.insert(document)
         db.mols.find({name: "Aspirin"})
         db.mols.find({"vendors.name": "Sigma-Aldrich"})
  4. Terminology

     Relational Database    MongoDB
     Table                  Collection
     Row                    Document
     Column                 Field
     Join                   Embed (or Reference)
     Primary Key            _id
     Partition              Shard

     http://docs.mongodb.org/manual/reference/sql-comparison/
  5. Why MongoDB?
     • Easy to get started with small projects
     • Schema-less: easy prototyping, heterogeneous data
     • Language drivers: C, C++, Java, Python, R, Ruby, …
     • High performance and scalability for large projects
  6. Why not MongoDB?
     • No multi-document ACID transactions
     • No joins
     • Relatively new, lacking a proven track record
     • PostgreSQL is rapidly adding JSON and schema-less features
  7. The Database Landscape

     [Figure: database types positioned by performance and scalability against features and functionality. Labels: key-value store (Redis, Memcached), document-oriented (MongoDB, CouchDB), relational database (PostgreSQL, Oracle).]
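  8. Similarity Search Basics
     • Fingerprints: a bit string (01001011…) or, equivalently, a sparse list of the positions of its 1-bits ([2, 5, 7, 8…])
     • Tanimoto coefficient: T = Nab / (Na + Nb - Nab), where Na and Nb are the numbers of 1-bits in the two fingerprints and Nab is the number of 1-bits they have in common
     • Typical tasks:
         • Return all molecules where T > 0.8 with query molecule
         • Return top 20 molecules ranked by T with query molecule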
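To make the Tanimoto coefficient concrete, here is a minimal Python sketch that operates on sparse fingerprints, i.e. lists of 1-bit positions. The function name `tanimoto` and the example bit lists are illustrative assumptions, not taken from the talk.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as lists of 1-bit positions."""
    a, b = set(bits_a), set(bits_b)
    common = len(a & b)                  # 1-bits present in both fingerprints
    union = len(a) + len(b) - common     # 1-bits present in either fingerprint
    # Two empty fingerprints have no defined similarity; 0.0 is an arbitrary choice here.
    return common / union if union else 0.0


# Hypothetical fingerprints: 3 bits in common out of 5 distinct bits.
query = [2, 5, 7, 8]
candidate = [2, 5, 8, 11]
print(tanimoto(query, candidate))
```

Working on sets of bit positions rather than full bit strings is what lets the same calculation be expressed later with array operations in the database.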
  9. Screening: 1-bit Count Bounds
     • Use easily searchable fingerprint properties to filter
     • Number of 1-bits in results must be in range of query: T × Na ≤ Nb ≤ Na / T

     T = Tanimoto threshold
     Na = Number of 1-bits in query fingerprint
     Nb = Number of 1-bits in eligible result fingerprint

     Swamidass, S. J., & Baldi, P. J. Chem. Inf. Model. 2007, 47, 302–317. 10.1021/ci600358f
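The count bounds follow from the fact that two fingerprints can share at most min(Na, Nb) bits, which gives T × Na ≤ Nb ≤ Na / T. A small Python sketch of the screen, under the assumption that each candidate is represented only by its 1-bit count (the helper names `count_bounds` and `screen_by_count` are hypothetical):

```python
from math import ceil, floor


def count_bounds(na, threshold):
    """Return (qmin, qmax), the inclusive range of 1-bit counts a result may have."""
    qmin = int(ceil(na * threshold))     # fewer bits than this cannot reach the threshold
    qmax = int(floor(na / threshold))    # more bits than this cannot reach it either
    return qmin, qmax


def screen_by_count(counts, na, threshold):
    """Keep only the candidate counts that fall inside the bounds."""
    qmin, qmax = count_bounds(na, threshold)
    return [n for n in counts if qmin <= n <= qmax]


# Query with 24 one-bits and a threshold of 0.7 (illustrative values).
print(count_bounds(24, 0.7))
print(screen_by_count([10, 17, 24, 34, 35], 24, 0.7))
```

Because the screen needs only a single integer per molecule, it maps directly onto an indexed range query.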
  10. Screening: Required Bits
     • At least one bit must be in common between the query fingerprint and a result fingerprint
     • What is the smallest subset of the query fingerprint where this is still true?
     • Any Na - TNa + 1 query bits (TNa rounded up to an integer): if none of them is in common, then even if the remaining TNa - 1 bits are all in common, it wouldn't be enough to reach T

     Davy Suvee, 2011, http://datablend.be/?p=254
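The required-bits argument can be checked numerically. The sketch below assumes the subset size Na - ceil(T × Na) + 1 and, for simplicity, takes the first bits of the query as the subset; the helper names are hypothetical and the bit values are only examples.

```python
from math import ceil


def required_bits(query_bits, threshold):
    """Return a subset of the query of which any hit must share at least one bit."""
    na = len(query_bits)
    qmin = int(ceil(na * threshold))     # minimum number of bits a hit must share
    ncommon = na - qmin + 1              # size of the required subset
    return query_bits[:ncommon]


def best_tanimoto_without(query_bits, reqbits):
    """Upper bound on T for a result that shares none of the required bits."""
    na = len(query_bits)
    max_common = na - len(reqbits)       # at most the remaining query bits are shared
    # Best case: the result contains exactly those shared bits and nothing else,
    # so T = max_common / (na + max_common - max_common).
    return max_common / na


query = [11, 23, 33, 64, 80, 138, 175, 183]
reqbits = required_bits(query, 0.75)
print(reqbits)
print(best_tanimoto_without(query, reqbits))
```

With 8 query bits and T = 0.75, three bits are required, and a result that misses all three can reach at most 5/8, which is below the threshold.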
  11. Fingerprints in MongoDB
     • RDKit Morgan Fingerprint (circular, ECFP-like)
     • Pre-calculate fingerprint bits and count. Add indexes.

         {
             _id: "CHEMBL25",
             count: 24,
             bits: [11, 23, 33, 64, 80, 138, 175, 183, 193, 214, 239, 295, ...]
         }

     • Screening:

         db.mols.find({count: {$gte: qmin, $lte: qmax}})  // Count bounds
         db.mols.find({bits: {$in: reqbits}})             // Required bits
  12. Optimising search performance
     • Screening methods: count bounds, (rarest) required bits
     • Fingerprint radius: 2, 3, 4
     • Fingerprint folding: none, 2048, 1024, 512

     Count collection (number of molecules containing each bit):

         {"_id": 1301, "count": 1},
         {"_id": 5338, "count": 3},
         {"_id": 20821, "count": 8}

     Selecting the rarest query bits as required bits:

         reqbits = count_collection.find({'_id': {'$in': qfp}}).sort('count', 1).limit(ncommon)
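The idea of choosing the rarest required bits can be sketched without a database. Here a plain dictionary stands in for the count collection; the function name `rarest_required_bits` and the entry for bit 42 are assumptions made for illustration, while the other three counts are the ones shown on the slide.

```python
from math import ceil


def rarest_required_bits(query_bits, threshold, bit_counts):
    """Pick the required bits that occur in the fewest molecules."""
    na = len(query_bits)
    ncommon = na - int(ceil(na * threshold)) + 1
    # Sort query bits by how many molecules contain them, rarest first.
    by_rarity = sorted(query_bits, key=lambda bit: bit_counts.get(bit, 0))
    return by_rarity[:ncommon]


# bit -> number of molecules containing it (bit 42 is a made-up common bit)
bit_counts = {1301: 1, 5338: 3, 20821: 8, 42: 500}
query = [42, 1301, 5338, 20821]
print(rarest_required_bits(query, 0.75, bit_counts))
```

Rare bits make the screen more selective: a query on bits found in only a handful of molecules leaves far fewer candidates for the exact Tanimoto calculation than a query on bits found in most of the collection.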
  13. Aggregation Framework
     • Perform a pipeline of processing tasks on the server
     • Compute new fields, filter, transform, group documents

     Pipeline stages:
     1. Filter using screening methods
     2. Calculate bit intersection with query fp
     3. Calculate exact Tanimoto coefficient
     4. Filter based on Tanimoto threshold
  14. Aggregation Pipeline

         from math import ceil

         qn = len(qfp)                      # Number of bits in query fingerprint
         qmin = int(ceil(qn * threshold))   # Minimum number of bits in result fingerprints
         qmax = int(qn / threshold)         # Maximum number of bits in result fingerprints
         ncommon = qn - qmin + 1            # Number of query bits of which at least 1 must be in common
         # Required bits: the ncommon rarest query bits, taken from the count documents
         reqbits = [count['_id'] for count in
                    count_collection.find({'_id': {'$in': qfp}}).sort('count', 1).limit(ncommon)]
         aggregate = [
             # Screen using count bounds and required bits
             {'$match': {'count': {'$gte': qmin, '$lte': qmax}, 'bits': {'$in': reqbits}}},
             # Calculate the exact Tanimoto coefficient from the bit intersection
             {'$project': {
                 'tanimoto': {'$let': {
                     'vars': {'common': {'$size': {'$setIntersection': ['$bits', qfp]}}},
                     'in': {'$divide': ['$$common', {'$subtract': [{'$add': [qn, '$count']}, '$$common']}]}
                 }},
             }},
             # Keep only results at or above the threshold
             {'$match': {'tanimoto': {'$gte': threshold}}}
         ]
         response = fp_collection.aggregate(aggregate)
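For checking results on a small data set, the same three stages can be mirrored in plain Python. This is only a sketch under stated assumptions: molecules are held in a list of dictionaries with `_id`, `bits` and `count` fields as in the slides, a dictionary replaces the count collection, and the function name `similarity_search` and the sample data are hypothetical.

```python
from math import ceil


def similarity_search(molecules, qfp, threshold, bit_counts):
    """In-memory equivalent of the screening and Tanimoto pipeline."""
    qn = len(qfp)
    qmin = int(ceil(qn * threshold))
    qmax = int(qn / threshold)
    ncommon = qn - qmin + 1
    rarest = sorted(qfp, key=lambda bit: bit_counts.get(bit, 0))[:ncommon]
    reqbits, qset = set(rarest), set(qfp)
    hits = []
    for mol in molecules:
        # Stage 1: screen by count bounds and required bits
        if not qmin <= mol['count'] <= qmax:
            continue
        if not reqbits.intersection(mol['bits']):
            continue
        # Stage 2: exact Tanimoto coefficient from the bit intersection
        common = len(qset.intersection(mol['bits']))
        tanimoto = common / (qn + mol['count'] - common)
        # Stage 3: filter on the threshold
        if tanimoto >= threshold:
            hits.append({'_id': mol['_id'], 'tanimoto': tanimoto})
    return hits


# Made-up molecules and bit frequencies for illustration only.
molecules = [
    {'_id': 'mol-a', 'bits': [1, 2, 3, 4], 'count': 4},
    {'_id': 'mol-b', 'bits': [1, 2, 3, 9], 'count': 4},
    {'_id': 'mol-c', 'bits': [1, 2, 3], 'count': 3},
    {'_id': 'mol-d', 'bits': [7, 8], 'count': 2},
]
bit_counts = {1: 3, 2: 3, 3: 3, 4: 1, 7: 1, 8: 1, 9: 1}
for hit in similarity_search(molecules, [1, 2, 3, 4], 0.75, bit_counts):
    print(hit)
```

In this example `mol-d` is rejected by the count bounds, `mol-b` passes both screens but fails the exact threshold test, and `mol-a` and `mol-c` are returned. The screens only remove molecules that cannot be hits, so the final filter is still needed.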
  15. Scaling Up
     • Hard limit on the CPU, RAM, storage of a single server
     • Awful performance if working set doesn't fit in RAM
     • Sharding: Horizontal scaling to multiple servers
     • Range-based sharding vs Hash-based sharding
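The difference between the two sharding strategies can be illustrated with a toy Python sketch. It is not MongoDB's implementation: the split points, the use of the fingerprint `count` as a range key, and the MD5-based hash are all assumptions chosen for illustration.

```python
import bisect
import hashlib


def range_shard(key, boundaries):
    """Shard index for a range-partitioned key; boundaries are sorted split points."""
    # Keys below the first split point go to shard 0, and so on.
    return bisect.bisect_right(boundaries, key)


def hash_shard(key, nshards):
    """Shard index derived from a hash of the key, ignoring key order."""
    digest = hashlib.md5(str(key).encode('utf-8')).hexdigest()
    return int(digest, 16) % nshards


# Range sharding keeps neighbouring counts together on the same shard.
print([range_shard(count, [20, 40]) for count in (18, 19, 20, 21, 45)])
# Hash sharding scatters the same keys across shards.
print([hash_shard(count, 3) for count in (18, 19, 20, 21, 45)])
```

Under range-based sharding a count-bounds query would touch only the shards whose ranges overlap the bounds, whereas hash-based sharding tends to spread data more evenly at the cost of sending range queries to every shard.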
  16. Future work
     • Sharding: can we handle 100 million molecules?
     • "kD grid" method for improved screening
     • Substructure searches
     • "Top K hits" searches
     • Count-based fingerprints
     • Multi-molecule queries, compound queries

     Kristensen, T. G. et al. Algorithms Mol. Biol. 2010, 5, 9. 10.1186/1748-7188-5-9
  17. Conclusions
     • Possible to use MongoDB for a chemical database
     • Comparable performance to PostgreSQL
     • Use sparse, unfolded fingerprints for best performance
     • Potential for horizontal scaling for huge databases
  18. Source Code

     github.com/mcs07/mongodb-chemistry

         $ mchem load mymols.sdf
         $ mchem addfp
         $ mchem countfp
         $ mchem similar "O=C(Oc1ccccc1C(=O)O)C"
         CHEMBL25 CHEMBL350343
  19. Further Information
     • http://blog.matt-swain.com/post/87093745652/
     • Swamidass, S. J., & Baldi, P. Bounds and Algorithms for Fast Exact Searches of Chemical Fingerprints in Linear and Sublinear Time. J. Chem. Inf. Model. 2007, 47, 302–317. 10.1021/ci600358f
     • Rogers, D., & Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754. 10.1021/ci100050t
     • Kristensen, T. G. et al. A tree-based method for the rapid screening of chemical fingerprints. Algorithms Mol. Biol. 2010, 5, 9. 10.1186/1748-7188-5-9
     • RDKit: http://www.rdkit.org/docs/GettingStartedInPython.html
     • MongoDB: http://docs.mongodb.org/manual/