Approaches to Managing Genomic (DNA) Data

These slides outline, at a high level, the research we've done at Agile Medicine to determine a best-case minimum viable solution for quick and easy genomic data management.

Donn Felker

April 02, 2014


  1. 1 Large Scale Data Analysis with MongoDB

  2. 2 Large Scale Data Analysis with MongoDB

  3. 3 Approaches to Managing Genomic Data (DNA)

  4. 4 Approaches to Managing Genomic Data (DNA) With MongoDB

  5. 5 Donn Felker, Partner, Agile Medicine

    Author. National Speaker. Tech Entrepreneur. Software Developer.
  6. 6 The process of scientific discovery is, in effect, a

    continual flight from wonder. Albert Einstein
  7. 7 Problem: What is the Best Way To Store Genome

  8. 8 Approach/Stance

  9. 9 Academia & Research Status Quo Most popular tools used

    for analysis in Medical Research and Academia ... and sometimes Python
  10. 10 Better Alternative?

  11. 11 Traditional DNA Research High Level Intro

  12. 12

  13. 13 Single Nucleotide Polymorphisms (SNPs): Understanding “snips”. Source: https://www.23andme.com/gen101/snps/

    SNPs are Copying Errors: To make new cells, existing cells divide in two. Before dividing, the DNA of a cell is copied. During that copying process sometimes mistakes are made, kind of like typos. These mistakes lead to variations in the DNA sequence at particular locations called Single Nucleotide Polymorphisms or SNPs (pronounced “snips”).

    Consequence of SNPs: SNPs can generate biological variation between people by causing differences in the recipes for proteins that are written in genes. Those differences can in turn influence a variety of traits such as appearance, disease susceptibility or response to drugs. While some SNPs lead to differences in health or physical appearance, most SNPs seem to lead to no observable differences between people at all.

    SNPs as a Measure of Genetic Similarity: DNA is passed from parent to child, so you inherit your SNPs versions from your parents. You will be a match with your siblings, grandparents, aunts, uncles, and cousins at many of these SNPs. But you will have far fewer matches with people to whom you are only distantly related. The number of SNPs where you match another person can therefore be used to tell how closely related you are.
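The idea under "SNPs as a Measure of Genetic Similarity" can be sketched as a count of positions where two people carry the same genotype. This is only a toy illustration, not how a genotyping service actually estimates relatedness; the SNP identifiers and genotype values are made-up sample data, and `count_matching_snps` is a hypothetical helper.

```python
def count_matching_snps(person_a, person_b):
    """Count SNP positions where both people carry the same genotype.

    Only positions present in both dicts are compared.
    """
    shared_positions = person_a.keys() & person_b.keys()
    return sum(1 for snp_id in shared_positions if person_a[snp_id] == person_b[snp_id])


# Made-up genotypes keyed by made-up SNP identifiers.
you = {"rs0001": "AA", "rs0002": "AG", "rs0003": "CC", "rs0004": "TT"}
sibling = {"rs0001": "AA", "rs0002": "AG", "rs0003": "CT", "rs0004": "TT"}
stranger = {"rs0001": "AG", "rs0002": "GG", "rs0003": "CT", "rs0004": "TT"}

print(count_matching_snps(you, sibling))   # 3
print(count_matching_snps(you, stranger))  # 1
```

A higher count suggests a closer relationship, which is the intuition the slide describes.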
  14. 14

  15. 15

  16. 16

  17. 17 Genotype Processing SNP Processing via chip.

  18. 18 Reference Genotype and Annotations

    SNP Variation: ~0.1% - 0.4% variances in each human genome*.

    Common Variances: A 97% similarity is shared between every human. The differences lie in the remaining %. SNPs are identified by comparing the DNA against a reference genotype.

    The reference genotype is simply the average genome across the human genomes. This is the combined average that all SNPs are compared against.

    Your SNPs are determined by analyzing the differences between your DNA and the reference genotype.
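Comparing a sample against a reference genotype can be sketched as a position-by-position diff. This is a minimal sketch under simplifying assumptions: one base per position, made-up positions and bases, and a hypothetical `find_variants` helper. Real genotype data has two alleles per position and much richer annotations.

```python
def find_variants(reference, sample):
    """Return {position: (reference_base, sample_base)} where the sample differs."""
    return {
        pos: (ref_base, sample[pos])
        for pos, ref_base in reference.items()
        if pos in sample and sample[pos] != ref_base
    }


# Made-up reference and sample, keyed by made-up positions.
reference = {101: "A", 102: "C", 103: "G", 104: "T", 105: "A"}
sample = {101: "A", 102: "T", 103: "G", 104: "T", 105: "G"}

variants = find_variants(reference, sample)
print(variants)  # {102: ('C', 'T'), 105: ('A', 'G')}
```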
  19. 19 Genotype Data Size in Mongo

    0.1% - 0.4% is not much, but at scale, it's a lot. Genotype Sequenced Data (0.1-0.4%): ~12 GB of data per person. Shared Genome Data (~99%): ~400+ GB of data per person.
  20. 20 Genotyping 320 People ... ~3.8 TB Full DNA Sequence

    for 320 People ... ~140 TB
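The totals on this slide follow from the per-person sizes quoted earlier in the deck. The sketch below reproduces the arithmetic, assuming decimal units (1 TB = 1000 GB); the deck does not say which convention was used.

```python
PEOPLE = 320
GENOTYPE_GB_PER_PERSON = 12    # "~12 GB of Data per person" from the deck
FULL_SEQUENCE_TB_TOTAL = 140   # "~140 TB" for 320 people, from the deck
GB_PER_TB = 1000               # assumption: decimal units

genotype_total_tb = PEOPLE * GENOTYPE_GB_PER_PERSON / GB_PER_TB
implied_full_gb_per_person = FULL_SEQUENCE_TB_TOTAL * GB_PER_TB / PEOPLE

print(f"Genotype data for {PEOPLE} people: ~{genotype_total_tb:.1f} TB")
print(f"Implied full-sequence size per person: ~{implied_full_gb_per_person:.1f} GB")
```

The first figure matches the ~3.8 TB on the slide; the second (about 437.5 GB) is consistent with the "~400+ GB" per person quoted for full sequence data.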
  21. 21

  22. 22 Genome Storage: Our new adventure into DNA and large individual datasets

    MySQL (Relational Database). Person A: DNA ... HUGE DATASETS
  23. 23 Genome Storage: Our new adventure into DNA and large individual datasets

    MySQL (Relational Database). Person B: DROP TABLE; INSERT INTO... Delete old data. Insert new data. Need old data? Do it again. Not enough space/performance in MySQL.
  24. 24 Genome Storage: Our new adventure into DNA and large individual datasets

    MongoDB (NoSQL Database) and MySQL (Relational Database). Default installations with a few indexes. Minimal changes to the stock install.

  26. 26 MEDCO ABC Small Research Teams. At Best, one of

    them is a Tech
  27. 27 Two Popular Data Stores

    MongoDB (NoSQL Database) and MySQL (Relational Database). MongoDB for document storage and MySQL for relational storage.
  28. 28 Some Goals

    A new data store needs to have a few things ... Needs to be Relatively Simple. Meant for BigData. Cheap(er). HIPAA Compliant.
  29. 29 Test Common Use Cases to Determine a Clear Advantage

    Goals: Speed. Cost. Flexibility. Tooling.

    Adapts to Change: Genome data has a tendency to change schema quite often due to new discoveries. Changes should be easy to implement and should not require a rebuild of the entire datastore.

    Query/Indexing: Inserting data should be fast. VERY FAST. Genotype data is quite large for each person, and inserting data and querying for data should be quick.

    A Tool for Research: The primary use case of this is for research and academia. Time is valuable and is normally spent focused on researching the medical issue at hand and not implementing technical solutions around the datastore. The datastore is a TOOL, not the solution, and it should help facilitate solving the issue, not become the issue.

    Cheaper to Operate: Traditionally this type of research was done on university super computers with massive processing power. Unfortunately these machines require advanced reservations and cost a lot to maintain. Future research should not be held up by waiting for the university super computer and should be cheap to operate.
  30. Testing and Research

  31. 31 EC2 XL Instances EBS PIOPS EBS PIOPS

  32. 32 SSD

  33. 33 Chromosome Flat Files

    Chromosomes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT. Each chromosome flat file is generated for easy processing. Each chromosome is given its own file, some larger than others. These are used for loading and analyzing the data load process.
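The one-file-per-chromosome layout can be sketched by enumerating the 25 chromosome labels listed on the slide. The `chr<label>.txt` naming is a hypothetical convention chosen for illustration; the deck does not say how the files were actually named.

```python
# Chromosome labels as listed on the slide: 1-22, X, Y and MT.
CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y", "MT"]


def flat_file_name(chromosome):
    # Hypothetical naming convention, not taken from the deck.
    return f"chr{chromosome}.txt"


file_names = [flat_file_name(c) for c in CHROMOSOMES]
print(len(file_names))                # 25
print(file_names[0], file_names[-1])  # chr1.txt chrMT.txt
```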
  34. 34 Data Load Evaluation Process: Inserting from flat files

    1. Python: open and load flat file for preprocessing and insertion to db.
    2. Insert/import data into the datastore (MongoDB or MySQL). Measure import time.
    3. Create indexes on data. Measure index creation time.
    4. Run test queries against new data in the data store. Measure query times.

    Data Testing: different insertion methods for comprehensive analysis.
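The four measured steps of the Data Load Evaluation Process can be sketched as a small timing harness. This is not the harness used for the research: `sqlite3` appears only as a stand-in datastore so the sketch runs without a MongoDB or MySQL server, the flat-file rows are made up, and `timed` is a hypothetical helper.

```python
import csv
import io
import sqlite3
import time


def timed(label, func, timings):
    """Run func(), record elapsed wall-clock seconds under label, return its result."""
    start = time.perf_counter()
    result = func()
    timings[label] = time.perf_counter() - start
    return result


# Stand-in for one chromosome flat file (made-up rows: snp id, position, genotype).
flat_file = io.StringIO("rs0001\t1001\tAA\nrs0002\t1002\tAG\nrs0003\t1003\tCC\n")

timings = {}
conn = sqlite3.connect(":memory:")  # stand-in datastore, not MongoDB or MySQL
conn.execute("CREATE TABLE snps (snp_id TEXT, position INTEGER, genotype TEXT)")

# Step 1: open and preprocess the flat file.
rows = timed(
    "load",
    lambda: [(r[0], int(r[1]), r[2]) for r in csv.reader(flat_file, delimiter="\t")],
    timings,
)
# Step 2: insert into the datastore and measure import time.
timed("import", lambda: conn.executemany("INSERT INTO snps VALUES (?, ?, ?)", rows), timings)
# Step 3: create indexes and measure index creation time.
timed("index", lambda: conn.execute("CREATE INDEX idx_snp_id ON snps (snp_id)"), timings)
# Step 4: run a test query and measure query time.
matches = timed(
    "query",
    lambda: conn.execute("SELECT genotype FROM snps WHERE snp_id = ?", ("rs0002",)).fetchall(),
    timings,
)

print(matches)  # [('AG',)]
for step, seconds in timings.items():
    print(f"{step}: {seconds:.6f}s")
```

Keeping each step behind its own label makes it straightforward to compare import, index creation and query times separately across datastores.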
  36. 36 Results

  37. 37 Individual Record Write Speeds

  38. 38 MongoDB Import

  39. 39 No MySQL Import

    Data is coming from a non-DB source (flat files/etc.) and FK relationships would have to be created and run in order to import properly. Other ways? Yes... but ... Outside of mysqlimport and disabling FK checks, we were limited to importing the first file, keeping the keys in memory, and then inserting the second table/etc. Complex data in flat files = complex problems importing into relational stores.
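The "import the first file, keep the keys in memory, then insert the second table" workaround can be sketched as follows. The table names, columns and sample rows are made up for illustration, and `sqlite3` is used only as a stand-in relational store so the sketch runs without a MySQL server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a relational store such as MySQL
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, external_id TEXT UNIQUE)")
conn.execute(
    "CREATE TABLE genotype ("
    "id INTEGER PRIMARY KEY, "
    "person_id INTEGER NOT NULL REFERENCES person(id), "
    "snp_id TEXT, value TEXT)"
)

# Made-up contents of two flat files that reference each other by external id.
person_file = ["P-001", "P-002"]
genotype_file = [("P-001", "rs0001", "AA"), ("P-002", "rs0001", "AG"), ("P-001", "rs0002", "CC")]

# Import the first file and keep the generated keys in memory.
key_by_external_id = {}
for external_id in person_file:
    cursor = conn.execute("INSERT INTO person (external_id) VALUES (?)", (external_id,))
    key_by_external_id[external_id] = cursor.lastrowid

# Import the second file, resolving each foreign key from the in-memory map.
conn.executemany(
    "INSERT INTO genotype (person_id, snp_id, value) VALUES (?, ?, ?)",
    [(key_by_external_id[ext], snp, value) for ext, snp, value in genotype_file],
)

row_count = conn.execute("SELECT COUNT(*) FROM genotype").fetchone()[0]
print(row_count)  # 3
```

With tens of millions of rows the in-memory key map itself becomes a cost, which is part of the complexity the slide points at.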
  40. 40 Index Creation Time

  41. 41 Query Times

  42. 42 MongoDB Db Stats After Import (mongoimport, mongobulk insert). Data size: 11.83 GB

    db.stats() { "db" : "snp_research", "collections" : 3, "objects" : 61268671, "avgObjSize" : 207.3250924603865, "dataSize" : 12702532880, "storageSize" : 15123124160, "numExtents" : 31, "indexes" : 4, "indexSize" : 6059462528, "fileSize" : 25691160576, "nsSizeMB" : 16, "dataFileVersion" : { "major" : 4, "minor" : 5 }, "ok" : 1 }

    db.stats() { "db" : "snp_research", "collections" : 3, "objects" : 61268671, "avgObjSize" : 207.3250924603865, "dataSize" : 12702532880, "storageSize" : 15123124160, "numExtents" : 31, "indexes" : 4, "indexSize" : 6058947440, "fileSize" : 25691160576, "nsSizeMB" : 16, "dataFileVersion" : { "major" : 4, "minor" : 5 }, "ok" : 1 }
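The "11.83 GB" figure can be checked directly from the `dataSize` field. The sketch below converts the byte counts from the first `db.stats()` output, assuming the slide divides by 1024^3, and recomputes `avgObjSize` as `dataSize / objects`.

```python
# Values copied from the first db.stats() output on the slide.
stats = {"objects": 61268671, "dataSize": 12702532880, "indexSize": 6059462528}

BYTES_PER_GIB = 1024 ** 3  # assumption: the slide's "GB" uses binary units

data_gib = stats["dataSize"] / BYTES_PER_GIB
index_gib = stats["indexSize"] / BYTES_PER_GIB
avg_obj_size = stats["dataSize"] / stats["objects"]

print(f"data size:  {data_gib:.2f} GB")         # data size:  11.83 GB
print(f"index size: {index_gib:.2f} GB")        # index size: 5.64 GB
print(f"avgObjSize: {avg_obj_size:.2f} bytes")  # avgObjSize: 207.33 bytes
```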
  43. 43 MongoDB Clearly Outperforms MySQL out of the box when working

    with Genome data.
  44. 44 Additional Considerations

    Scale & Growth: Traditional relational stores hold a large set of data but nowhere near the size of data that Mongo is meant to hold out of the box. Being able to keep the additional data easily available makes research and re-analysis of genome data easier.

    Adapts to Change: Genome schema changes are frequent and require tooling that can adapt to change. A schema-less system is a perfect fit for genomic data. Adding and removing documents is built into the system. This is much more difficult with traditional relational stores.

    Research Programming: MongoDB is built to be controlled with JavaScript. Many high performance applications are being built with JavaScript. Younger researchers are much more likely to be familiar with the language.
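The schema-less point can be illustrated with plain Python dicts standing in for documents: records with different shapes sit side by side, and a new field needs no rebuild of existing data. The field names and values are made up for illustration and do not reflect the schema used in the research.

```python
# Two "documents" with different shapes in the same collection (made-up data).
collection = [
    {"snp_id": "rs0001", "chromosome": "1", "genotype": "AA"},
    {"snp_id": "rs0002", "chromosome": "X", "genotype": "AG",
     "annotation": {"trait": "example"}},  # new field added after a "discovery"
]

# Older documents simply lack the new field; nothing has to be migrated.
with_annotation = [doc["snp_id"] for doc in collection if "annotation" in doc]
without_annotation = [doc["snp_id"] for doc in collection if "annotation" not in doc]

print(with_annotation)     # ['rs0002']
print(without_annotation)  # ['rs0001']
```

In a relational store the same change would typically mean altering a table or adding a new one before any annotated row could be inserted.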
  45. 45 Thank You @donnfelker donnfelker.com donn@agilemedicine.com