Chemical Structure Handling in MongoDB

Matt Swain
Cavendish Laboratory, University of Cambridge

Cambridge Cheminformatics Meeting 26/11/2014 Matt Swain

Overview
• What is MongoDB?
• Chemical similarity searching
• Optimising screening methods
• Benchmarking query performance
• Scaling up to multiple servers

Relational database
• Molecule: id, smiles
• Synonym: mol_id, name
• Vendor: id, name
• Mol-Vend: mol_id, vendor_id, vid

MongoDB
• Molecule: id, smiles, name [], vendors [{name, vid}]

Document example

    document = {
        _id: "CHEMBL25",
        smiles: "CC(=O)Oc1ccccc1C(=O)O",
        name: ["Aspirin", "2-Acetoxybenzoic acid", "Ecotrin"],
        rdmol: BinData(...),
        vendors: [
            {name: "Sigma-Aldrich", vid: "A2093_SIGMA"},
            {name: "Acros Organics", vid: "15818"}
        ]
    }

    db.mols.insert(document)
    db.mols.find({name: "Aspirin"})
    db.mols.find({"vendors.name": "Sigma-Aldrich"})
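The two `find` queries rely on MongoDB matching a value against any element of an array field, and on dotted paths reaching into embedded documents. The sketch below is a deliberately simplified plain-Python model of that matching behaviour, not MongoDB's query engine; the helper names `field_values` and `matches` are hypothetical, and the `rdmol` binary field is left out.

```python
def field_values(doc, path):
    """Collect every value reachable at a dotted path, descending into arrays."""
    values = [doc]
    for key in path.split("."):
        next_values = []
        for value in values:
            # An array of embedded documents is searched element by element.
            items = value if isinstance(value, list) else [value]
            for item in items:
                if isinstance(item, dict) and key in item:
                    next_values.append(item[key])
        values = next_values
    return values


def matches(doc, path, wanted):
    """Simplified equality match: true if the field, or any array element, equals wanted."""
    for value in field_values(doc, path):
        if value == wanted:
            return True
        if isinstance(value, list) and wanted in value:
            return True
    return False


document = {
    "_id": "CHEMBL25",
    "smiles": "CC(=O)Oc1ccccc1C(=O)O",
    "name": ["Aspirin", "2-Acetoxybenzoic acid", "Ecotrin"],
    "vendors": [
        {"name": "Sigma-Aldrich", "vid": "A2093_SIGMA"},
        {"name": "Acros Organics", "vid": "15818"},
    ],
}

by_name = matches(document, "name", "Aspirin")
by_vendor = matches(document, "vendors.name", "Sigma-Aldrich")
```

Because synonyms and vendors are embedded in the molecule document, both lookups touch a single document rather than joining several tables.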

Terminology
http://docs.mongodb.org/manual/reference/sql-comparison/

Relational Database    MongoDB
Table                  Collection
Row                    Document
Column                 Field
Join                   Embed (or Reference)
Primary Key            _id
Partition              Shard

Why MongoDB?
• Easy to get started with small projects
• Schema-less: easy prototyping, heterogeneous data
• Language drivers: C, C++, Java, Python, R, Ruby, …
• High performance and scalability for large projects

Why not MongoDB?
• No multi-document ACID transactions
• No joins
• Relatively new, lacking a proven track record
• PostgreSQL is rapidly adding JSON and schema-less features

The Database Landscape
[Figure: databases placed on two axes, "Performance and Scalability" and "Features and Functionality". Key-value stores: Redis, Memcached. Document-oriented: MongoDB, CouchDB. Relational databases: PostgreSQL, Oracle.]

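Similarity Search Basics
• Fingerprints: a bit string such as 01001011…, or equivalently the list of its 1-bit positions, [2, 5, 7, 8…]
• Tanimoto coefficient: T = Nab / (Na + Nb - Nab), where Na and Nb are the numbers of 1-bits in the two fingerprints and Nab is the number of 1-bits they have in common
• Typical tasks:
  Return all molecules where T > 0.8 with query molecule
  Return top 20 molecules ranked by T with query molecule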
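A minimal sketch of the two typical tasks on sparse fingerprints stored as lists of 1-bit positions. The function names and the tiny in-memory `library` are illustrative assumptions; the threshold test uses `>=`, matching the `$gte` used in the aggregation pipeline later in the deck rather than the strict `>` shown on this slide.

```python
import heapq


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as iterables of 1-bit positions."""
    a, b = set(a), set(b)
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def search_threshold(query, library, threshold):
    """Return (id, T) for every molecule whose T with the query is at least threshold."""
    hits = []
    for mol_id, bits in library.items():
        t = tanimoto(query, bits)
        if t >= threshold:
            hits.append((mol_id, t))
    return hits


def top_k(query, library, k):
    """Return the k molecules with the highest T, best first."""
    scored = ((mol_id, tanimoto(query, bits)) for mol_id, bits in library.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical three-molecule library for illustration only.
library = {"m1": [2, 5, 7, 8], "m2": [2, 5, 8, 9], "m3": [1, 3]}
query = [2, 5, 7, 8]
hits = search_threshold(query, library, 0.5)
best = top_k(query, library, 2)
```

Both functions scan every fingerprint, which is exactly the cost the screening methods on the following slides try to avoid.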

Screening: 1-bit Count Bounds
• Use easily searchable fingerprint properties to filter
• Number of 1-bits in results must be in range of query: T·Na ≤ Nb ≤ Na / T
  T = Tanimoto threshold
  Na = Number of 1-bits in query fingerprint
  Nb = Number of 1-bits in eligible result fingerprint
Swamidass, S. J., & Baldi, P. J. Chem. Inf. Model. 2007, 47, 302–317. 10.1021/ci600358f
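The count bounds can be checked with a few lines of arithmetic. This sketch computes the bounds the same way the pipeline later in the deck does (`ceil` for the lower bound, truncation for the upper) and compares them with the best Tanimoto score any fingerprint of a given size could reach, which occurs when one fingerprint is a subset of the other. The function names are hypothetical, and a threshold of 0.75 is used because it is exactly representable as a float.

```python
from math import ceil


def count_bounds(qn, threshold):
    """Range of 1-bit counts a result may have for a query with qn 1-bits."""
    qmin = int(ceil(qn * threshold))  # fewer bits than this cannot reach the threshold
    qmax = int(qn / threshold)        # more bits than this cannot reach it either
    return qmin, qmax


def best_possible_tanimoto(na, nb):
    """Upper limit on T for fingerprints with na and nb 1-bits (one a subset of the other)."""
    return min(na, nb) / max(na, nb)


qmin, qmax = count_bounds(24, 0.75)
```

For a 24-bit query at T = 0.75 only fingerprints with 18 to 32 bits can qualify, so an index on `count` rules out everything else before any bits are compared.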

Screening: Required Bits
• At least one bit must be in common between the query fingerprint and a result fingerprint
• What is the smallest subset of the query fingerprint where this is still true? Any Na - T·Na + 1 bits (rounding T·Na up)
• If a result shares none of those bits, then even if the remaining T·Na - 1 bits are all in common, it wouldn't be enough
Davy Suvee, 2011, http://datablend.be/?p=254
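A small worked check of the required-bits argument, assuming the subset size is computed as in the pipeline later in the deck (`qn - qmin + 1`). The names are illustrative, and the required subset here is simply the first bits of the query; any subset of that size works for correctness.

```python
from math import ceil


def num_required_bits(qn, threshold):
    """Size of a query-bit subset of which every hit must contain at least one bit."""
    qmin = int(ceil(qn * threshold))
    return qn - qmin + 1


def passes_required_bits(candidate_bits, reqbits):
    """True if the candidate shares at least one of the required bits."""
    return bool(set(candidate_bits) & set(reqbits))


threshold = 0.75
qfp = list(range(24))                      # hypothetical 24-bit query fingerprint
ncommon = num_required_bits(len(qfp), threshold)
reqbits = qfp[:ncommon]

# Worst case for the screen: a candidate holding every query bit EXCEPT the required ones.
candidate = qfp[ncommon:]
common = len(set(candidate) & set(qfp))
best_t = common / (len(qfp) + len(candidate) - common)
```

Here `ncommon` is 7, the candidate shares the other 17 query bits, and its score of 17/24 still falls short of 0.75, so rejecting it without computing the exact coefficient loses nothing.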

Fingerprints in MongoDB
• RDKit Morgan fingerprint (circular, ECFP-like)
• Pre-calculate fingerprint bits and count. Add indexes.

    {
        _id: "CHEMBL25",
        count: 24,
        bits: [11, 23, 33, 64, 80, 138, 175, 183, 193, 214, 239, 295, ...]
    }

• Screening:

    db.mols.find({count: {$gte: qmin, $lte: qmax}})  // Count bounds
    db.mols.find({bits: {$in: reqbits}})             // Required bits

Optimising search performance
• Screening methods: count bounds, (rarest) required bits
• Fingerprint radius: 2, 3, 4
• Fingerprint folding: none, 2048, 1024, 512

A separate collection records how many molecules contain each bit, so the rarest query bits can be chosen as the required bits:

    {"_id": 1301, "count": 1}, {"_id": 5338, "count": 3}, {"_id": 20821, "count": 8}

    cursor = count_collection.find({'_id': {'$in': qfp}}).sort('count', 1).limit(ncommon)
    reqbits = [doc['_id'] for doc in cursor]  # $in needs a list of bit ids, not a cursor
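The idea behind choosing the rarest required bits can be sketched without a database: count how many fingerprints contain each bit, then take the query bits with the lowest counts. The function names and sample data are assumptions for illustration. One difference from the MongoDB query is that bits absent from the counts are treated here as count 0, whereas `find` would simply not return them.

```python
from collections import Counter


def count_bits(fingerprints):
    """Map each bit id to the number of fingerprints that contain it."""
    counts = Counter()
    for bits in fingerprints:
        counts.update(set(bits))  # set() so a repeated bit id counts once per molecule
    return counts


def rarest_bits(qfp, counts, ncommon):
    """Pick the ncommon query bits that occur in the fewest molecules (ties by bit id)."""
    ranked = sorted(set(qfp), key=lambda bit: (counts.get(bit, 0), bit))
    return ranked[:ncommon]


# Hypothetical three-molecule collection.
fingerprints = [[1, 2, 3], [1, 2], [1, 4]]
counts = count_bits(fingerprints)
reqbits = rarest_bits([1, 2, 3], counts, 2)
```

Rare bits make a stronger filter because few stored fingerprints contain any of them, so the `$in` match passes fewer candidates on to the exact Tanimoto calculation.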

Screening Methods [figure]

Fingerprint Folding
The unfolded fingerprint has 499,695 different bits in ChEMBL
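For readers unfamiliar with folding, the sketch below shows one common convention: map each sparse bit id into a fixed-length fingerprint by taking it modulo the folded length. This is an assumption for illustration and may differ in detail from how RDKit or the author's code folds fingerprints; `fold_bits` is a hypothetical helper.

```python
def fold_bits(bits, length=None):
    """Fold sparse bit ids into a fingerprint of the given length (None leaves it unfolded)."""
    if length is None:
        return sorted(set(bits))
    # Distinct bit ids can collide after folding, so the result may have fewer 1-bits.
    return sorted({bit % length for bit in bits})


unfolded = [11, 2059, 4100]
folded = fold_bits(unfolded, 2048)
```

In this example bits 11 and 2059 collide, so three unfolded bits become two folded bits and any stored `count` would have to be recomputed after folding. Collisions also make individual bits less rare, which is relevant to how selective the required-bits screen can be.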

Fingerprint Radius

Radius    Unique bits
2         499,695
3         2,590,940
4         6,314,755

Aggregation Framework
• Perform a pipeline of processing tasks on the server
• Compute new fields; filter, transform and group documents

Pipeline stages:
1. Filter using screening methods
2. Calculate bit intersection with query fp
3. Calculate exact Tanimoto coefficient
4. Filter based on Tanimoto threshold

Aggregation Pipeline

    from math import ceil

    # qfp (list of query bit ids), threshold, count_collection and
    # fp_collection are defined elsewhere.
    qn = len(qfp)                     # Num bits in query fingerprint
    qmin = int(ceil(qn * threshold))  # Min num bits in results fingerprints
    qmax = int(qn / threshold)        # Max num bits in results fingerprints
    ncommon = qn - qmin + 1           # Num bits of which at least 1 must be in common
    # Rarest ncommon query bits, as a list of bit ids for $in
    cursor = count_collection.find({'_id': {'$in': qfp}}).sort('count', 1).limit(ncommon)
    reqbits = [doc['_id'] for doc in cursor]
    aggregate = [
        # Screen with count bounds and required bits
        {'$match': {'count': {'$gte': qmin, '$lte': qmax}, 'bits': {'$in': reqbits}}},
        # tanimoto = common / (qn + count - common)
        {'$project': {
            'tanimoto': {'$let': {
                'vars': {'common': {'$size': {'$setIntersection': ['$bits', qfp]}}},
                'in': {'$divide': ['$$common', {'$subtract': [{'$add': [qn, '$count']}, '$$common']}]}
            }},
        }},
        # Keep only results at or above the threshold
        {'$match': {'tanimoto': {'$gte': threshold}}}
    ]
    response = fp_collection.aggregate(aggregate)
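To make the pipeline's logic easy to test without a server, here is a plain-Python sketch that mirrors its three stages over an in-memory list of documents. It is an illustration under stated assumptions, not the author's implementation: `similarity_search`, the `bit_counts` dictionary (standing in for the counts collection) and the sample documents are all hypothetical.

```python
from math import ceil


def similarity_search(docs, qfp, threshold, bit_counts):
    """In-memory equivalent of: $match (screen) -> $project (tanimoto) -> $match (threshold)."""
    qfp = set(qfp)
    qn = len(qfp)
    qmin = int(ceil(qn * threshold))
    qmax = int(qn / threshold)
    ncommon = qn - qmin + 1
    # Rarest query bits first; ties broken by bit id so the choice is deterministic.
    ranked = sorted(qfp, key=lambda bit: (bit_counts.get(bit, 0), bit))
    reqbits = set(ranked[:ncommon])

    hits = []
    for doc in docs:
        bits = set(doc["bits"])
        # Stage 1: count bounds and required bits.
        if not (qmin <= doc["count"] <= qmax and bits & reqbits):
            continue
        # Stage 2: exact Tanimoto from the bit intersection.
        common = len(bits & qfp)
        tanimoto = common / (qn + doc["count"] - common)
        # Stage 3: threshold filter.
        if tanimoto >= threshold:
            hits.append({"_id": doc["_id"], "tanimoto": tanimoto})
    return hits


docs = [
    {"_id": "A", "count": 4, "bits": [1, 2, 3, 4]},
    {"_id": "B", "count": 4, "bits": [1, 2, 3, 5]},
    {"_id": "C", "count": 4, "bits": [7, 8, 9, 10]},
    {"_id": "D", "count": 2, "bits": [1, 2]},
]
bit_counts = {1: 3, 2: 3, 3: 2, 4: 1, 5: 1, 7: 1, 8: 1, 9: 1, 10: 1}
loose = similarity_search(docs, [1, 2, 3, 4], 0.5, bit_counts)
strict = similarity_search(docs, [1, 2, 3, 4], 0.75, bit_counts)
```

At a threshold of 0.5 molecule C is rejected by the required-bits screen alone, and at 0.75 molecule D is rejected by the count bounds, so neither needs an exact Tanimoto calculation.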

Benchmarks: Folding [figure]

Benchmarks: Radius [figure]

Benchmarks: PostgreSQL [figure]

Scaling Up
• Hard limit on the CPU, RAM and storage of a single server
• Performance degrades badly if the working set doesn't fit in RAM
• Sharding: horizontal scaling to multiple servers
• Range-based sharding vs hash-based sharding
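The difference between the two sharding strategies can be illustrated with a toy assignment function for each. This is only a conceptual sketch: it does not reproduce MongoDB's chunk management or its hashing function, and the boundaries, key values and function names are assumptions.

```python
import bisect
import hashlib


def range_shard(key, boundaries):
    """Range-based: shard i holds keys between boundaries[i-1] (inclusive) and boundaries[i]."""
    return bisect.bisect_right(boundaries, key)


def hash_shard(key, num_shards):
    """Hash-based: hash the key, then spread the hash values evenly across shards."""
    digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


boundaries = [100000, 200000]  # hypothetical split points giving three shards
range_assignment = [range_shard(k, boundaries) for k in (25, 150000, 350343)]
hash_assignment = [hash_shard(k, 3) for k in (25, 26, 27)]
```

Range-based assignment keeps neighbouring keys on the same shard, which helps range queries (such as a bound on `count`) but can concentrate load; hash-based assignment scatters neighbouring keys, which balances load but forces range queries to visit every shard.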

Future work
• Sharding: can we handle 100 million molecules?
• "kD grid" method for improved screening
• Substructure searches
• "Top K hits" searches
• Count-based fingerprints
• Multi-molecule queries, compound queries
Kristensen, T. G. et al. Algorithms Mol. Biol. 2010, 5, 9. 10.1186/1748-7188-5-9

Conclusions
• Possible to use MongoDB for a chemical database
• Comparable performance to PostgreSQL
• Use sparse, unfolded fingerprints for best performance
• Potential for horizontal scaling for huge databases

Source Code
github.com/mcs07/mongodb-chemistry

    $ mchem load mymols.sdf
    $ mchem addfp
    $ mchem countfp
    $ mchem similar "O=C(Oc1ccccc1C(=O)O)C"
    CHEMBL25 CHEMBL350343

http://try.mongodb.org

Further Information
• http://blog.matt-swain.com/post/87093745652/
• Swamidass, S. J., & Baldi, P. Bounds and Algorithms for Fast Exact Searches of Chemical Fingerprints in Linear and Sublinear Time. J. Chem. Inf. Model. 2007, 47, 302–317. 10.1021/ci600358f
• Rogers, D., & Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754. 10.1021/ci100050t
• Kristensen, T. G. et al. A tree-based method for the rapid screening of chemical fingerprints. Algorithms Mol. Biol. 2010, 5, 9. 10.1186/1748-7188-5-9
• RDKit: http://www.rdkit.org/docs/GettingStartedInPython.html
• MongoDB: http://docs.mongodb.org/manual/
