The New Genomics

Matt Wood
October 02, 2012

The value of reproducing, reusing and remixing scientific research. Slides from Strata London.

Transcript

  1. The New Genomics matthew@amazon.com Dr. Matt Wood

  2. Hello

  3. Hello

  4. Data

  5. DNA

  6. Chromosome 11 : ACTN3 : rs1815739

  7. Chromosome X : rs6625163

  8. Chromosome 19 : FUT2 : rs601338

  9. +0.25 Chromosome 15 : rs2472297

  10. Chromosome 2 : rs10427255

  11. TYPE II Chromosome 10 : rs7903146

  12. Chromosome 1 : rs4481887

  13. I know this, because...

  14. None
  15. A T C G G T C C A G / G

  16. A T C G G T C C A G / G A G C C A G G U C C (Transcription)

  17. A T C G G T C C A G / G A G C C A G G U C C (Transcription) / Ser Glu Val (Translation)
  18. None
  19. None
  20. Chromosome 11 : ACTN3 : rs1815739

  21. Chromosome X : rs6625163

  22. Chromosome 19 : FUT2 : rs601338

  23. +0.25 Chromosome 15 : rs2472297

  24. Chromosome 2 : rs10427255

  25. TYPE II Chromosome 10 : rs7903146

  26. Chromosome 1 : rs4481887

  27. I know all that, because...

  28. Human Genome Project

  29. 40 species ensembl.org

  30. Compare species

  31. Biological importance

  32. Step change

  33. Less time. Lower cost.

  34. None
  35. None
  36. Compare individuals

  37. None
  38. Data generation costs are falling (pretty much everywhere)

  39. Sequencing challenge X

  40. Amazona vittata

  41. Analytics challenge

  42. Lots of data, lots of uses, lots of users, lots of locations
  43. Cost

  44. Analytics challenge X

  45. Accessibility challenge

  46. The New Genomics

  47. Graceful. Beautiful.

  48. Impossible to re-create

  49. Snowflake Science

  50. Reproducibility

  51. Reproducibility scales science

  52. Reproduce. Reuse. Remix.

  53. Value++

  54. None
  55. How do we get from here to there? 5 PRINCIPLES OF REPRODUCIBILITY
  56. 1. Use the gravity of data (5 PRINCIPLES OF REPRODUCIBILITY)

  57. Increasingly large data collections

  58. 1000 Genomes Project: 200 TB

  59. Challenging to obtain and manage

  60. Expensive to experiment

  61. Large barrier to reproducibility

  62. Data size will increase

  63. Data integration will increase

  64. Move data to the users

  65. Move data to the users X

  66. Move tools to the data

  67. Place data where it can be consumed by tools

  68. Place tools where they can access data

  69. None
  70. None
  71. None
  72. Canonical source

  73. None
  74. More data, more users, more uses, more locations

  75. Cost and complexity

  76. Cost and complexity kill reproducibility

  77. Utility computing

  78. Availability

  79. Intel Xeon E5 NVIDIA Tesla GPUs

  80. 90 - 120k IOPS on SSDs

  81. Pay-as-you-go

  82. 100% Reserved capacity

  83. 100% Reserved capacity On-demand

  84. 100% Reserved capacity On-demand

  85. Spot instances

  86. Name-your-price

  87. None
  88. 2. Ease of use is a prerequisite (5 PRINCIPLES OF REPRODUCIBILITY)
  89. http://headrush.typepad.com/creating_passionate_users/2005/10/getting_users_p.html

  90. Help overcome the suck threshold

  91. Easy to embrace and extend

  92. Choose the right abstraction for the user

  93. $ ec2-run-instances

  94. $ starcluster start

  95. None
  96. None
  97. Package and automate

  98. Package and automate: Amazon machine images, VM import

  99. Package and automate: Amazon machine images, VM import; deployment scripts, CloudFormation, Chef, Puppet
  100. Expert-as-a-service

  101. None
  102. None
  103. 1000 Genomes Cloud BioLinux

  104. None
  105. Your HiSeq data Illumina BaseSpace

  106. DNA and RNA sequences Genomespace, Broad Institute at MIT

  107. Data as a programmable resource

  108. 3. Reuse is as important as reproduction (5 PRINCIPLES OF REPRODUCIBILITY)
  109. Seven Deadly sins of Bioinformatics: http://www.slideshare.net/dullhunk/the-seven-deadly-sins-of-bioinformatics

  110. Seven Deadly sins of Bioinformatics: http://www.slideshare.net/dullhunk/the-seven-deadly-sins-of-bioinformatics

  111. Infonauts are hackers

  112. They have their own way of working

  113. The ‘Big Red Button’

  114. Fire-and-forget reproduction is a good first step, but limits longer-term value.
  115. Monolithic, one-stop-shop

  116. Work well for intended purpose

  117. Challenging to install, dependency heavy

  118. Inflexible

  119. Embrace infonauts as hackers

  120. Small things. Loosely coupled.

  121. Easier to reuse

  122. Easier to integrate

  123. Scale out

  124. Cancer drug discovery: 50,000 cores, < $1000 an hour (Schrödinger and CycleServer)
  125. 4. Build for collaboration (5 PRINCIPLES OF REPRODUCIBILITY)

  126. Workflows are memes

  127. Reproduction is just the first step

  128. Bill of materials: code, data, configuration, infrastructure

  129. Full definition for reproduction

  130. Utility computing provides a playground for data science

  131. Code + AMI + custom datasets + public datasets + databases + compute + result data

  132. Code + AMI + custom datasets + public datasets + databases + compute + result data

  133. Code + AMI + custom datasets + public datasets + databases + compute + result data

  134. Code + AMI + custom datasets + public datasets + databases + compute + result data
  135. Package, automate, contribute.

  136. Utility platform provides scale for production runs

  137. 5. Provenance is a first class object (5 PRINCIPLES OF REPRODUCIBILITY)
  138. Versioning becomes really important

  139. Especially in an active community

  140. Doubly so with loosely coupled tools

  141. Provenance metadata is a first class entity

  142. Distributed provenance

  143. 5 PRINCIPLES OF REPRODUCIBILITY

  144. Remove constraints (5 PRINCIPLES OF REPRODUCIBILITY)

  145. Accelerate science (5 PRINCIPLES OF REPRODUCIBILITY)

  146. Chromosome 11 : ACTN3 : rs1815739

  147. Chromosome X : rs6625163

  148. Chromosome 19 : FUT2 : rs601338

  149. +0.25 Chromosome 15 : rs2472297

  150. Chromosome 2 : rs10427255

  151. TYPE II Chromosome 10 : rs7903146

  152. Chromosome 1 : rs4481887

  153. Thank you aws.amazon.com @mza matthew@amazon.com
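
Slides 128 to 141 describe a "bill of materials" (code, data, configuration, infrastructure) and treat provenance metadata as a first class entity. The sketch below is one possible way to record that idea in Python; it is not taken from the talk. The class name `BillOfMaterials`, its field names, and the placeholder values `abc123` and `ami-EXAMPLE` are all assumptions made for illustration. It stores a checksum for each dataset and derives a single fingerprint for the whole analysis, so two runs can be compared for identical inputs.

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict


def sha256_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class BillOfMaterials:
    """Hypothetical record of what is needed to re-run an analysis."""
    code_version: str                              # e.g. a version-control commit id
    machine_image: str                             # e.g. a machine image identifier
    config: dict = field(default_factory=dict)     # tool parameters
    datasets: dict = field(default_factory=dict)   # dataset name -> content checksum

    def add_dataset(self, name: str, content: bytes) -> None:
        # Store a checksum of the content, not the content itself.
        self.datasets[name] = sha256_bytes(content)

    def to_json(self) -> str:
        # sort_keys keeps the serialisation deterministic across runs.
        return json.dumps(asdict(self), sort_keys=True)

    def fingerprint(self) -> str:
        # One digest that changes if any recorded component changes.
        return sha256_bytes(self.to_json().encode("utf-8"))


# Placeholder values, for illustration only.
bom = BillOfMaterials(
    code_version="abc123",
    machine_image="ami-EXAMPLE",
    config={"threads": 8},
)
bom.add_dataset("reads", b"ATCGGTCCAG")
print(bom.fingerprint())
```

Two records built from the same code version, image, configuration and data produce the same fingerprint; changing any one of them produces a different one. A real pipeline would checksum files in chunks rather than holding them in memory, and would likely record tool versions as well.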