make new cells, exis:ng cells divide in two. Before dividing the DNA of a cell is copied. During that copying process some:mes mistakes are made -‐ kind of like typos. These mistakes lead to varia:ons in the DNA sequence at par:cular loca:ons called Single Nucleo:de Polymorphisms or SNPs (pronounced “snips”). SNPs are Copying Errors SNPs can generate biological varia:on between people by causing diﬀerences in the recipes for proteins that are wriTen in genes. Those diﬀerences can in turn inﬂuence a variety of traits such as appearance, disease suscep:bility or response to drugs. While some SNPs lead to diﬀerences in health or physical appearance, most SNPs seem to lead to no observable diﬀerences between people at all. Consequence of SNPs DNA is passed from parent to child, so you inherit your SNPs versions from your parents. You will be a match with your siblings, grandparents, aunts, uncles, and cousins at many of these SNPs. But you will have far fewer matches with people to whom you are only distantly related. The number of SNPs where you match another person can therefore be used to tell how closely related you are. SNPs as a Measure of Genetic Similarity
variances in each human genome*. Common Variances A 97% Similarity is shared between every human. The diﬀerences lie in the remaining %. SNPs are iden:ﬁed by comparing the DNA against a reference Genotype. Reference Genotype is simply the average genome across the human genomes. This is the combined average that all SNPs are compared against. Your SNPs are determined by analyzing the diﬀerences of your SNP’s compared to the references genotype.
Goals: Speed Cost. Flexibility. Tooling. Genome data has a tendency to change schema quite o_en due to new discoveries. Changes should be easy to implement and should not require a rebuild of the en:re datastore. Adapts to Change Inser:ng data should be fast. VERY FAST. Genotype data is quite large for each person and inser:ng data and querying for data should be quick. Query/Indexing The primary use case of this is for research and academia. Time is valuable and is normally spent focused on researching the medical issue at hand and not implemen:ng technical solu:ons around the datastore. The datastore is a TOOL, not the solu:on and it should help facilitate solving the issue, not become the issue. A Tool for Research Tradi:onally this type of research was done on university super computers with massive processing power. Unfortunately these machines require advanced reserva:ons and cost a lot to maintain. Future research should not be held up by wai:ng for the university super computer and should be cheap to operate. Cheaper to Operate
10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT Chromosome Flat Files Each chromosome !at "le is generated for easy processing. Each chromosome is given its own "le. Some larger than others. These are used for loading and analyzing the data load process.
preprocessing and inser:on to db 2 Insert/Import data into the datastore (MongoDB or MySQL) Measure import :me. 4 Run test queries against new data in the data store. Measure query :mes. 3 Create indexes on data Measure index crea:on :me. Data Load Evaluation Process Inserting from !at "les
source (!at "les/etc) and FK relationships would have to be created and run in order to properly. Other ways? Yes... but ... Outside of mysqlimport and disabling fk checks we were limited to importing the "rst "le and then keeping the keys in memory and then inserting the second table/etc. Complex data in !at "les = complex problems importing into Relational stores.