Approaches to Managing Genomic (DNA) Data

These slides outline, at a high level, the research we've done at Agile Medicine to determine a minimum viable solution for quick and easy genomic data management.

Donn Felker

April 02, 2014

Transcript

  1. 1
    Large Scale Data Analysis
    with MongoDB

  2. 2
    Large Scale Data Analysis
    with MongoDB

  3. 3
    Approaches to Managing
    Genomic Data (DNA)

  4. 4
    Approaches to Managing
    Genomic Data (DNA)
    With MongoDB

  5. 5
    Donn Felker
    Partner, Agile Medicine
    Author. National Speaker. Tech Entrepreneur. Software Developer.

  6. 6
    The process of scientific discovery is, in effect, a continual flight from wonder.
    Albert Einstein

  7. 7
    Problem: What is the Best Way To Store Genome Data?

  8. 8
    Approach/Stance

  9. 9
    Academia & Research Status Quo
    Most popular tools used for analysis in Medical Research and Academia
    ... and sometimes Python

  10. 10
    Better Alternative?

  11. 11
    Traditional DNA Research
    High Level Intro

  12. 13
    Single Nucleotide Polymorphisms (SNPs)
    Understanding "snips"
    Source: https://www.23andme.com/gen101/snps/
    SNPs are Copying Errors
    To make new cells, existing cells divide in two. Before dividing, the DNA of a cell is copied. During that copying process, mistakes are sometimes made, kind of like typos. These mistakes lead to variations in the DNA sequence at particular locations, called Single Nucleotide Polymorphisms, or SNPs (pronounced "snips").
    Consequence of SNPs
    SNPs can generate biological variation between people by causing differences in the recipes for proteins that are written in genes. Those differences can in turn influence a variety of traits such as appearance, disease susceptibility, or response to drugs. While some SNPs lead to differences in health or physical appearance, most SNPs seem to lead to no observable differences between people at all.
    SNPs as a Measure of Genetic Similarity
    DNA is passed from parent to child, so you inherit your SNP versions from your parents. You will be a match with your siblings, grandparents, aunts, uncles, and cousins at many of these SNPs. But you will have far fewer matches with people to whom you are only distantly related. The number of SNPs where you match another person can therefore be used to tell how closely related you are.

  13. 17
    Genotype Processing
    SNP Processing via chip.

  14. 18
    Reference Genotype and Annotations
    SNP Variation: ~0.1% - 0.4% variance in each human genome*.
    Common Variances
    A 97% similarity is shared between every human. The differences lie in the remaining percentage. SNPs are identified by comparing the DNA against a reference genotype.
    The reference genotype is simply the average genome across human genomes. This is the combined average that all SNPs are compared against.
    Your SNPs are determined by analyzing the differences of your SNPs compared to the reference genotype.
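    Conceptually, identifying your SNPs this way amounts to a per-position comparison against the reference. A minimal illustrative sketch in Python, with made-up rsids and genotype calls (not real reference data):

    # Illustrative only: compare a person's genotype calls to a reference
    # genotype and keep the positions that differ (the SNPs).
    reference = {"rs0000001": "AA", "rs0000002": "GG", "rs0000003": "CC"}  # made-up
    person    = {"rs0000001": "AA", "rs0000002": "AG", "rs0000003": "CC"}  # made-up

    snps = {rsid: call for rsid, call in person.items()
            if reference.get(rsid) != call}

    print(snps)  # {'rs0000002': 'AG'} -- the only position that differs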

  15. 19
    Genotype Data Size in Mongo
    0.1% - 0.4% is not much, but at scale, it's a lot.
    Genotype sequenced data (~0.1% - 0.4%): ~12 GB of data per person
    Shared genome data (~99%): ~400+ GB of data per person

  16. 20
    Genotyping 320 People ... ~3.8 TB
    Full DNA Sequence for 320 People ... ~140 TB
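    These totals follow from the per-person figures on the previous slide: 320 people x ~12 GB of genotype data is roughly 3.8 TB, and 320 people x ~400-450 GB of full-sequence data is roughly 130-140 TB.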

  17. 22
    Genome Storage
    Our new adventure into DNA and large individual datasets
    MySQL (Relational Database)
    Person A: HUGE DATASETS of raw DNA data (shown on the slide as long strings of sequence data)

  18. 23
    Genome Storage
    Our new adventure into DNA and large individual datasets
    MySQL (Relational Database)
    Person B: DROP TABLE; INSERT INTO...
    Delete old data. Insert new data. Need old data? Do it again.
    Not enough space/performance in MySQL.

  19. 24
    Genome Storage
    Our new adventure into DNA and large individual datasets
    MongoDB (NoSQL Database)
    MySQL (Relational Database)
    Default installations with a few indexes. Minimal changes to the stock install.

  20. 25
    WHY STANDARD CONFIG?

  21. 26
    MEDCO ABC
    Small research teams. At best, one of them is a tech.

  22. 27
    Two Popular Data Stores
    MongoDB (NoSQL Database)
    MySQL (Relational Database)
    MongoDB for document storage and MySQL for relational storage
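    To make the contrast concrete, here is one hypothetical way the same SNP call could be shaped for each store (not the actual schema used in these tests):

    # A single SNP call as a MongoDB document (hypothetical shape).
    snp_document = {
        "person_id": 1,
        "rsid": "rs3094315",
        "chromosome": "1",
        "position": 752566,
        "genotype": "AG",
    }

    # The relational equivalent typically splits this across tables joined
    # by foreign keys, e.g. (hypothetical DDL, shown here as a string):
    create_tables = """
    CREATE TABLE person (id INT PRIMARY KEY AUTO_INCREMENT, name VARCHAR(64));
    CREATE TABLE snp (
        id BIGINT PRIMARY KEY AUTO_INCREMENT,
        person_id INT NOT NULL REFERENCES person(id),
        rsid VARCHAR(16), chromosome VARCHAR(2),
        position BIGINT, genotype CHAR(2));
    """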

  23. 28
    Some Goals
    A new data store needs to have a few things ...
    Relatively Simple
    Meant for Big Data
    Cheap(er)
    HIPAA Compliant

  24. 29
    Test Common Use Cases to Determine a Clear Advantage
    Goals: Speed. Cost. Flexibility. Tooling.
    Adapts to Change
    Genome data has a tendency to change schema quite often due to new discoveries. Changes should be easy to implement and should not require a rebuild of the entire datastore.
    Query/Indexing
    Inserting data should be fast. VERY FAST. Genotype data is quite large for each person, and inserting data and querying for data should be quick.
    A Tool for Research
    The primary use case of this is for research and academia. Time is valuable and is normally spent focused on researching the medical issue at hand, not on implementing technical solutions around the datastore. The datastore is a TOOL, not the solution, and it should help facilitate solving the issue, not become the issue.
    Cheaper to Operate
    Traditionally this type of research was done on university supercomputers with massive processing power. Unfortunately these machines require advance reservations and cost a lot to maintain. Future research should not be held up by waiting for the university supercomputer and should be cheap to operate.

  25. Testing and Research

  26. 31
    EC2 XL Instances
    EBS PIOPS (Provisioned IOPS) volumes

  27. 33
    Chromosome Flat Files (chromosomes 1-22, X, Y, MT)
    Each chromosome flat file is generated for easy processing.
    Each chromosome is given its own file. Some are larger than others.
    These are used for loading and analyzing the data load process.

  28. 34
    Data Load Evaluation Process
    Inserting from flat files
    1. Python opens and loads a flat file for preprocessing and insertion to the db.
    2. Insert/import the data into the datastore (MongoDB or MySQL). Measure import time.
    3. Create indexes on the data. Measure index creation time.
    4. Run test queries against the new data in the data store. Measure query times.
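    As a rough sketch of that four-step loop on the MongoDB side with PyMongo (the collection name, document shape, and tab-separated rsid/chromosome/position/genotype file layout are assumptions for illustration, not the exact setup used in these tests):

    import time
    from pymongo import MongoClient, ASCENDING

    col = MongoClient("mongodb://localhost:27017")["snp_research"]["snps"]  # "snps" is a hypothetical collection name

    # 1. Open and preprocess a chromosome flat file
    #    (assumed layout: rsid <tab> chromosome <tab> position <tab> genotype).
    def parse(path, person_id):
        with open(path) as f:
            for line in f:
                rsid, chrom, pos, genotype = line.rstrip("\n").split("\t")
                yield {"person_id": person_id, "rsid": rsid, "chromosome": chrom,
                       "position": int(pos), "genotype": genotype}

    # 2. Insert the documents and measure import time.
    start = time.perf_counter()
    col.insert_many(parse("chr1.tsv", person_id=1), ordered=False)
    print("import took", time.perf_counter() - start, "seconds")

    # 3. Create an index and measure index-creation time.
    start = time.perf_counter()
    col.create_index([("person_id", ASCENDING), ("rsid", ASCENDING)])
    print("indexing took", time.perf_counter() - start, "seconds")

    # 4. Run a test query against the new data and measure query time.
    start = time.perf_counter()
    col.find_one({"person_id": 1, "rsid": "rs3094315"})
    print("query took", time.perf_counter() - start, "seconds")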

  29. 35
    How We Imported Data
    Testing different insertion methods for comprehensive analysis:
    Single record insertion
    Bulk insertion
    Import
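    A hedged sketch of the three methods with PyMongo (collection name and document shape are again hypothetical); the import path runs outside the driver entirely:

    from pymongo import MongoClient

    col = MongoClient()["snp_research"]["snps"]  # hypothetical collection name

    def make_docs():
        # Dummy documents for illustration only.
        return [{"person_id": 2, "rsid": "rs%d" % i, "genotype": "AA"}
                for i in range(1000)]

    # Single record insertion: one round trip per document (slowest).
    for doc in make_docs():
        col.insert_one(doc)

    # Bulk insertion: many documents per round trip; unordered so one bad
    # document does not abort the rest of the batch.
    col.insert_many(make_docs(), ordered=False)

    # Import: load the flat file directly with the mongoimport CLI, e.g.:
    #   mongoimport --db snp_research --collection snps \
    #               --type tsv --headerline --file chr1.tsv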

  30. 37
    Individual Record Write Speeds

  31. 38
    MongoDB Import

  32. 39
    No MySQL Import
    Data is coming from a non-DB source (flat files/etc.), and FK relationships would have to be created and populated in order to import properly.
    Other ways? Yes... but ...
    Outside of mysqlimport and disabling FK checks, we were limited to importing the first file, keeping the keys in memory, and then inserting the second table/etc. Complex data in flat files = complex problems importing into relational stores.
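    The two-pass workaround described above might look roughly like this with PyMySQL; the table names, columns, and connection details are hypothetical:

    import pymysql

    conn = pymysql.connect(host="localhost", user="research",
                           password="secret", database="snp_research")
    cur = conn.cursor()
    cur.execute("SET foreign_key_checks = 0")  # skip FK validation during the load

    # Pass 1: insert parent rows and keep the generated keys in memory.
    person_ids = {}
    for name in ("person_a", "person_b"):
        cur.execute("INSERT INTO person (name) VALUES (%s)", (name,))
        person_ids[name] = cur.lastrowid

    # Pass 2: insert child rows, resolving the FK from the in-memory map.
    rows = [(person_ids["person_a"], "rs3094315", "AG")]  # dummy data
    cur.executemany(
        "INSERT INTO snp (person_id, rsid, genotype) VALUES (%s, %s, %s)", rows)

    cur.execute("SET foreign_key_checks = 1")
    conn.commit()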

  33. 40
    Index Creation Time

  34. 41
    Query Times

  35. 42
    MongoDB db.stats() After Import
    mongoimport:
    db.stats()
    {
    "db" : "snp_research",
    "collections" : 3,
    "objects" : 61268671,
    "avgObjSize" : 207.3250924603865,
    "dataSize" : 12702532880,
    "storageSize" : 15123124160,
    "numExtents" : 31,
    "indexes" : 4,
    "indexSize" : 6059462528,
    "fileSize" : 25691160576,
    "nsSizeMB" : 16,
    "dataFileVersion" : {
    "major" : 4,
    "minor" : 5
    },
    "ok" : 1
    }
    Bulk insert:
    db.stats()
    {
    "db" : "snp_research",
    "collections" : 3,
    "objects" : 61268671,
    "avgObjSize" : 207.3250924603865,
    "dataSize" : 12702532880,
    "storageSize" : 15123124160,
    "numExtents" : 31,
    "indexes" : 4,
    "indexSize" : 6058947440,
    "fileSize" : 25691160576,
    "nsSizeMB" : 16,
    "dataFileVersion" : {
    "major" : 4,
    "minor" : 5
    },
    "ok" : 1
    }
    Data size: 11.83 GB

  36. 43
    MongoDB clearly outperforms MySQL out of the box when working with genome data.

  37. 44
    Additional Considerations
    Scale & Growth
    Traditional relational stores hold a large set of data, but nowhere near the size of data that Mongo is meant to hold out of the box. Being able to keep the additional data easily available makes research and re-analysis of genome data easier.
    Adapts to Change
    Genome schema changes are frequent and require tooling that can adapt to change. A schema-less system is a perfect fit for genomic data. Adding and removing documents is built into the system. This is much more difficult with traditional relational stores.
    Research Programming
    MongoDB is built to be controlled with JavaScript. Many high-performance applications are being built with JavaScript. Younger researchers are much more likely to be familiar with the language.

  38. 45
    Thank You
    @donnfelker
    donnfelker.com
    [email protected]
