Approaches to Managing Genomic (DNA) Data

These slides outline, at a high level, the research we've done at Agile Medicine to determine a minimum viable solution for quick and easy genomic data management.

Donn Felker

April 02, 2014

Transcript

  1. 1
    Large Scale Data Analysis
    with MongoDB

  2. 2
    Large Scale Data Analysis
    with MongoDB

  3. 3
    Approaches to Managing
    Genomic Data (DNA)

  4. 4
    Approaches to Managing
    Genomic Data (DNA)
    With MongoDB

  5. 5
    Donn Felker
    Partner, Agile Medicine
    Author. National Speaker. Tech Entrepreneur. Software Developer.

  6. 6
    The process of scientific discovery is, in effect, a continual flight from wonder.
    Albert Einstein

  7. 7
    Problem: What is the Best Way To Store Genome Data?

  8. 8
    Approach/Stance

  9. 9
    Academia & Research Status Quo
    Most popular tools used for analysis in Medical Research and Academia
    ... and sometimes Python

  10. 10
    Better Alternative?

  11. 11
    Traditional DNA Research
    High Level Intro

  12. 13
    Single Nucleotide Polymorphisms (SNPs)
    Understanding "snips"
    Source: https://www.23andme.com/gen101/snps/
    SNPs are Copying Errors
    To make new cells, existing cells divide in two. Before dividing, the DNA of a cell is copied. During that copying process, mistakes are sometimes made, kind of like typos. These mistakes lead to variations in the DNA sequence at particular locations, called Single Nucleotide Polymorphisms, or SNPs (pronounced "snips").
    Consequence of SNPs
    SNPs can generate biological variation between people by causing differences in the recipes for proteins that are written in genes. Those differences can in turn influence a variety of traits such as appearance, disease susceptibility, or response to drugs. While some SNPs lead to differences in health or physical appearance, most SNPs seem to lead to no observable differences between people at all.
    SNPs as a Measure of Genetic Similarity
    DNA is passed from parent to child, so you inherit your SNP versions from your parents. You will be a match with your siblings, grandparents, aunts, uncles, and cousins at many of these SNPs. But you will have far fewer matches with people to whom you are only distantly related. The number of SNPs where you match another person can therefore be used to tell how closely related you are.

  13. 17
    Genotype Processing
    SNP Processing via chip.

  14. 18
    Reference Genotype and Annotations
    SNP Variation: ~0.1% - 0.4% variance in each human genome*.
    Common Variances
    A 97% similarity is shared between every human. The differences lie in the remaining percentage. SNPs are identified by comparing the DNA against a reference genotype.
    The reference genotype is simply the average genome across human genomes. This is the combined average that all SNPs are compared against.
    Your SNPs are determined by analyzing the differences of your SNPs compared to the reference genotype.
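    Conceptually, identifying your SNPs this way amounts to a per-position comparison against the reference. A minimal illustrative sketch in Python, with made-up rsids and genotype calls (not real reference data):

    # Illustrative only: compare a person's genotype calls to a reference
    # genotype and keep the positions that differ (the SNPs).
    reference = {"rs0000001": "AA", "rs0000002": "GG", "rs0000003": "CC"}  # made-up
    person    = {"rs0000001": "AA", "rs0000002": "AG", "rs0000003": "CC"}  # made-up

    snps = {rsid: call for rsid, call in person.items()
            if reference.get(rsid) != call}

    print(snps)  # {'rs0000002': 'AG'} -- the only position that differs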

  15. 19
    Genotype Data Size in Mongo
    0.1% - 0.4% is not much, but at scale, it's a lot.
    Genotype sequenced data (~0.1% - 0.4%): ~12 GB of data per person
    Shared genome data (~99%): ~400+ GB of data per person

  16. 20
    Genotyping 320 People ... ~3.8 TB
    Full DNA Sequence for 320 People ... ~140 TB
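    These totals follow from the per-person figures on the previous slide: 320 people x ~12 GB of genotype data is roughly 3.8 TB, and 320 people x ~400-450 GB of full-sequence data is roughly 130-140 TB.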

  17. 22
    Genome Storage
    Our new adventure into DNA and large individual datasets
    MySQL (Relational Database)
    Person A: HUGE DATASETS of raw DNA data (shown on the slide as long strings of sequence data)

  18. 23
    Genome Storage
    Our new adventure into DNA and large individual datasets
    MySQL (Relational Database)
    Person B: DROP TABLE; INSERT INTO...
    Delete old data. Insert new data. Need old data? Do it again.
    Not enough space/performance in MySQL.

  19. 24
    Genome Storage
    Our new adventure into DNA and large individual datasets
    MongoDB (NoSQL Database)
    MySQL (Relational Database)
    Default installations with a few indexes. Minimal changes to the stock install.

  20. 25
    WHY STANDARD CONFIG?

  21. 26
    MEDCO ABC
    Small research teams. At best, one of them is a tech.

  22. 27
    Two Popular Data Stores
    MongoDB (NoSQL Database)
    MySQL (Relational Database)
    MongoDB for document storage and MySQL for relational storage
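    To make the contrast concrete, here is one hypothetical way the same SNP call could be shaped for each store (not the actual schema used in these tests):

    # A single SNP call as a MongoDB document (hypothetical shape).
    snp_document = {
        "person_id": 1,
        "rsid": "rs3094315",
        "chromosome": "1",
        "position": 752566,
        "genotype": "AG",
    }

    # The relational equivalent typically splits this across tables joined
    # by foreign keys, e.g. (hypothetical DDL, shown here as a string):
    create_tables = """
    CREATE TABLE person (id INT PRIMARY KEY AUTO_INCREMENT, name VARCHAR(64));
    CREATE TABLE snp (
        id BIGINT PRIMARY KEY AUTO_INCREMENT,
        person_id INT NOT NULL REFERENCES person(id),
        rsid VARCHAR(16), chromosome VARCHAR(2),
        position BIGINT, genotype CHAR(2));
    """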

  23. 28
    Some Goals
    A new data store needs to have a few things ...
    Relatively Simple
    Meant for Big Data
    Cheap(er)
    HIPAA Compliant

  24. 29
    Test Common Use Cases to Determine a Clear Advantage
    Goals: Speed. Cost. Flexibility. Tooling.
    Adapts to Change
    Genome data has a tendency to change schema quite often due to new discoveries. Changes should be easy to implement and should not require a rebuild of the entire datastore.
    Query/Indexing
    Inserting data should be fast. VERY FAST. Genotype data is quite large for each person, and inserting data and querying for data should be quick.
    A Tool for Research
    The primary use case of this is for research and academia. Time is valuable and is normally spent focused on researching the medical issue at hand, not on implementing technical solutions around the datastore. The datastore is a TOOL, not the solution, and it should help facilitate solving the issue, not become the issue.
    Cheaper to Operate
    Traditionally this type of research was done on university supercomputers with massive processing power. Unfortunately these machines require advance reservations and cost a lot to maintain. Future research should not be held up by waiting for the university supercomputer and should be cheap to operate.

  25. Testing and Research

  26. 31
    EC2 XL Instances
    EBS PIOPS (Provisioned IOPS) volumes

  27. 33
    Chromosome Flat Files (chromosomes 1-22, X, Y, MT)
    Each chromosome flat file is generated for easy processing.
    Each chromosome is given its own file. Some are larger than others.
    These are used for loading and analyzing the data load process.

  28. 34
    Data Load Evaluation Process
    Inserting from flat files
    1. Python opens and loads a flat file for preprocessing and insertion to the db.
    2. Insert/import the data into the datastore (MongoDB or MySQL). Measure import time.
    3. Create indexes on the data. Measure index creation time.
    4. Run test queries against the new data in the data store. Measure query times.
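    As a rough sketch of that four-step loop on the MongoDB side with PyMongo (the collection name, document shape, and tab-separated rsid/chromosome/position/genotype file layout are assumptions for illustration, not the exact setup used in these tests):

    import time
    from pymongo import MongoClient, ASCENDING

    col = MongoClient("mongodb://localhost:27017")["snp_research"]["snps"]  # "snps" is a hypothetical collection name

    # 1. Open and preprocess a chromosome flat file
    #    (assumed layout: rsid <tab> chromosome <tab> position <tab> genotype).
    def parse(path, person_id):
        with open(path) as f:
            for line in f:
                rsid, chrom, pos, genotype = line.rstrip("\n").split("\t")
                yield {"person_id": person_id, "rsid": rsid, "chromosome": chrom,
                       "position": int(pos), "genotype": genotype}

    # 2. Insert the documents and measure import time.
    start = time.perf_counter()
    col.insert_many(parse("chr1.tsv", person_id=1), ordered=False)
    print("import took", time.perf_counter() - start, "seconds")

    # 3. Create an index and measure index-creation time.
    start = time.perf_counter()
    col.create_index([("person_id", ASCENDING), ("rsid", ASCENDING)])
    print("indexing took", time.perf_counter() - start, "seconds")

    # 4. Run a test query against the new data and measure query time.
    start = time.perf_counter()
    col.find_one({"person_id": 1, "rsid": "rs3094315"})
    print("query took", time.perf_counter() - start, "seconds")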

  29. 35
    How We Imported Data
    Testing different insertion methods for comprehensive analysis:
    Single record insertion
    Bulk insertion
    Import
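    A hedged sketch of the three methods with PyMongo (collection name and document shape are again hypothetical); the import path runs outside the driver entirely:

    from pymongo import MongoClient

    col = MongoClient()["snp_research"]["snps"]  # hypothetical collection name

    def make_docs():
        # Dummy documents for illustration only.
        return [{"person_id": 2, "rsid": "rs%d" % i, "genotype": "AA"}
                for i in range(1000)]

    # Single record insertion: one round trip per document (slowest).
    for doc in make_docs():
        col.insert_one(doc)

    # Bulk insertion: many documents per round trip; unordered so one bad
    # document does not abort the rest of the batch.
    col.insert_many(make_docs(), ordered=False)

    # Import: load the flat file directly with the mongoimport CLI, e.g.:
    #   mongoimport --db snp_research --collection snps \
    #               --type tsv --headerline --file chr1.tsv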

  30. 37
    Individual Record Write Speeds

  31. 38
    MongoDB Import

  32. 39
    No MySQL Import
    Data is coming from a non-DB source (flat files/etc.), and FK relationships would have to be created and populated in order to import properly.
    Other ways? Yes... but ...
    Outside of mysqlimport and disabling FK checks, we were limited to importing the first file, keeping the keys in memory, and then inserting the second table/etc. Complex data in flat files = complex problems importing into relational stores.
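    The two-pass workaround described above might look roughly like this with PyMySQL; the table names, columns, and connection details are hypothetical:

    import pymysql

    conn = pymysql.connect(host="localhost", user="research",
                           password="secret", database="snp_research")
    cur = conn.cursor()
    cur.execute("SET foreign_key_checks = 0")  # skip FK validation during the load

    # Pass 1: insert parent rows and keep the generated keys in memory.
    person_ids = {}
    for name in ("person_a", "person_b"):
        cur.execute("INSERT INTO person (name) VALUES (%s)", (name,))
        person_ids[name] = cur.lastrowid

    # Pass 2: insert child rows, resolving the FK from the in-memory map.
    rows = [(person_ids["person_a"], "rs3094315", "AG")]  # dummy data
    cur.executemany(
        "INSERT INTO snp (person_id, rsid, genotype) VALUES (%s, %s, %s)", rows)

    cur.execute("SET foreign_key_checks = 1")
    conn.commit()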

  33. 40
    Index Creation Time

  34. 41
    Query Times

  35. 42
    MongoDB db.stats() After Import
    mongoimport:
    db.stats()
    {
    "db" : "snp_research",
    "collections" : 3,
    "objects" : 61268671,
    "avgObjSize" : 207.3250924603865,
    "dataSize" : 12702532880,
    "storageSize" : 15123124160,
    "numExtents" : 31,
    "indexes" : 4,
    "indexSize" : 6059462528,
    "fileSize" : 25691160576,
    "nsSizeMB" : 16,
    "dataFileVersion" : {
    "major" : 4,
    "minor" : 5
    },
    "ok" : 1
    }
    Bulk insert:
    db.stats()
    {
    "db" : "snp_research",
    "collections" : 3,
    "objects" : 61268671,
    "avgObjSize" : 207.3250924603865,
    "dataSize" : 12702532880,
    "storageSize" : 15123124160,
    "numExtents" : 31,
    "indexes" : 4,
    "indexSize" : 6058947440,
    "fileSize" : 25691160576,
    "nsSizeMB" : 16,
    "dataFileVersion" : {
    "major" : 4,
    "minor" : 5
    },
    "ok" : 1
    }
    Data size: 11.83 GB

  36. 43
    MongoDB clearly outperforms MySQL out of the box when working with genome data.

  37. 44
    Additional Considerations
    Scale & Growth
    Traditional relational stores hold a large set of data, but nowhere near the size of data that Mongo is meant to hold out of the box. Being able to keep the additional data easily available makes research and re-analysis of genome data easier.
    Adapts to Change
    Genome schema changes are frequent and require tooling that can adapt to change. A schema-less system is a perfect fit for genomic data. Adding and removing documents is built into the system. This is much more difficult with traditional relational stores.
    Research Programming
    MongoDB is built to be controlled with JavaScript. Many high-performance applications are being built with JavaScript. Younger researchers are much more likely to be familiar with the language.

  38. 45
    Thank You
    @donnfelker
    donnfelker.com
    [email protected]
