Scaling Science

Matt Wood
November 21, 2012

Introducing five principles for reproducibility.

Transcript

  1. Scaling Science. Dr. Matt Wood, matthew@amazon.com

  2. Hello

  3. Science

  4. Beautiful, unique.

  5. Impossible to re-create

  6. Snowflake Science

  7. Reproducibility

  8. Reproducibility scales science

  9. Reproduce. Reuse. Remix.

  10. Value++

  12. How do we get from here to there? 5 Principles of Reproducibility

  13. 1. Data has gravity (5 Principles of Reproducibility)

  14. Increasingly large data collections

  15. 1000 Genomes Project: 200 TB

  16. Challenging to obtain and manage

  17. Expensive to experiment

  18. Large barrier to reproducibility

  19. Data size will increase

  20. Data integration will increase

  21. Data dependencies will increase

  22. Move data to the users

  23. Move data to the users (crossed out)

  24. Move tools to the data

  25. Place data where it can be consumed by tools

  26. Place tools where they can access data

  30. Canonical source

  32. More data, more users, more uses, more locations

  33. Cost

  34. Force multiplier

  35. Cost

  36. Complexity

  37. Cost and complexity kill reproducibility

  38. Utility computing

  39. Availability

  40. Pay-as-you-go

  41. Flexibility

  42. Performance

  43. CPU + IO

  44. Intel Xeon E5, NVIDIA Tesla GPUs

  45. 240 TFLOPS

  46. 90 - 120k IOPS on SSDs

  47. Performance through productivity

  48. Cost

  49. On-demand access

  50. Reserved capacity

  51. 100% Reserved capacity

  52. 100% Reserved capacity On-demand

  54. Spot instances

  55. Utility computing enhances reproducibility

  57. 2. Ease of use is a pre-requisite (5 Principles of Reproducibility)

  58. http://headrush.typepad.com/creating_passionate_users/2005/10/getting_users_p.html

  59. Help overcome the suck threshold

  60. Easy to embrace and extend

  61. Choose the right abstraction for the user

  62. $ ec2-run-instances

  63. $ starcluster start

  65. Package and automate

  66. Package and automate: Amazon machine images, VM import

  67. Package and automate: Amazon machine images, VM import; deployment scripts, CloudFormation, Chef, Puppet

  68. Expert-as-a-service

  71. 1000 Genomes, Cloud BioLinux

  73. Your HiSeq data: Illumina BaseSpace

  74. Architectural freedom

  75. Freedom of abstraction

  76. 3. Reuse is as important as reproduction (5 Principles of Reproducibility)

  77. Seven Deadly Sins of Bioinformatics: http://www.slideshare.net/dullhunk/the-seven-deadly-sins-of-bioinformatics

  79. Infonauts are hackers

  80. They have their own way of working

  81. The ‘Big Red Button’

  82. Fire-and-forget reproduction is a good first step, but limits longer-term value.

  83. Monolithic, one-stop-shop

  84. Work well for intended purpose

  85. Challenging to install, dependency heavy

  86. Difficult to grok

  87. Inflexible

  88. Infonauts are hackers: embrace it.

  89. Small things. Loosely coupled.

  90. Easier to grok

  91. Easier to reuse

  92. Easier to integrate

  93. Lower barrier to entry

  94. Scale out

  95. Build for reuse. Be remix friendly. Maximize value.

  96. 4. Build for collaboration (5 Principles of Reproducibility)

  97. Workflows are memes

  98. Reproduction is just the first step

  99. Bill of materials: code, data, configuration, infrastructure

  100. Full definition for reproduction

  101. Utility computing provides a playground for bioinformatics

  102. Code + AMI + custom datasets + public datasets + databases + compute + result data

  106. Package, automate, contribute.

  107. Utility platform provides scale for production runs

  108. Drug discovery on 50k cores: Less than $1000

  109. 5. Provenance is a first class object (5 Principles of Reproducibility)

  110. Versioning becomes really important

  111. Especially in an active community

  112. Doubly so with loosely coupled tools

  113. Provenance metadata is a first class entity

  114. Distributed provenance

  115. 5 Principles of Reproducibility:

    1. Data has gravity
    2. Ease of use is a pre-requisite
    3. Reuse is as important as reproduction
    4. Build for collaboration
    5. Provenance is a first class object

  117. Thank you. aws.amazon.com, @mza, matthew@amazon.com
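
Slides 99 and 113 call for a bill of materials (code, data, configuration, infrastructure) and for provenance metadata as a first class entity. The following is a minimal sketch of one way that could look in Python. It is not taken from the talk: the class names, field names, and every identifier value (commit ids, the AMI id, dataset names) are hypothetical placeholders chosen for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class BillOfMaterials:
    """Inputs needed to re-run an analysis. All names here are illustrative."""
    code_version: str      # e.g. a version-control commit id
    machine_image: str     # e.g. an AMI id (placeholder value used below)
    datasets: tuple        # identifiers of custom and public datasets
    configuration: tuple   # (key, value) pairs, kept as a tuple so the record is immutable

    def fingerprint(self) -> str:
        """SHA-256 over a canonical JSON form, so equal inputs give equal ids."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class ProvenanceRecord:
    """Ties a named result to the exact bill of materials that produced it."""
    result_name: str
    bom_fingerprint: str
    derived_from: tuple = ()  # fingerprints of upstream results, if any


def record_result(result_name, bom, parents=()):
    """Create a provenance record, linking it to any upstream records."""
    upstream = tuple(p.bom_fingerprint for p in parents)
    return ProvenanceRecord(result_name, bom.fingerprint(), upstream)


# Hypothetical values: none of these identifiers refer to real resources.
bom_v1 = BillOfMaterials(
    code_version="commit-aaaa",
    machine_image="ami-00000000",
    datasets=("example-public-dataset", "example-custom-dataset"),
    configuration=(("threads", 8),),
)
# A new code version yields a new bill of materials and a new fingerprint.
bom_v2 = replace(bom_v1, code_version="commit-bbbb")

aligned = record_result("aligned-reads", bom_v1)
variants = record_result("variant-calls", bom_v2, parents=(aligned,))

print(aligned.bom_fingerprint != variants.bom_fingerprint)
print(variants.derived_from == (aligned.bom_fingerprint,))
```

Because the fingerprint is derived only from the recorded inputs, two people holding the same bill of materials compute the same id independently, which is one possible reading of the "distributed provenance" on slide 114. Versioning (slide 110) shows up as a changed fingerprint whenever any component changes.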