Scaling Science

Matt Wood
November 21, 2012

Introducing five principles for reproducibility.

Transcript

  1. Scaling Science
    [email protected]
    Dr. Matt Wood

  2. Hello

  3. Science

  4. Beautiful, unique.

  5. Impossible to re-create

  6. Snowflake Science

  7. Reproducibility

  8. Reproducibility scales science

  9. Reproduce. Reuse. Remix.

  10. Value++

  11. [no text]

  12. How do we get from here to there?
    5 PRINCIPLES OF REPRODUCIBILITY

  13. 1. Data has Gravity
    5 PRINCIPLES OF REPRODUCIBILITY

  14. Increasingly large data collections

  15. 1000 Genomes Project: 200 TB

  16. Challenging to obtain and manage

  17. Expensive to experiment

  18. Large barrier to reproducibility

  19. Data size will increase

  20. Data integration will increase

  21. Data dependencies will increase

  22. Move data to the users

  23. Move data to the users
    X

  24. Move tools to the data

  25. Place data where it can be consumed by tools

  26. Place tools where they can access data

  27. [no text]

  28. [no text]

  29. [no text]

  30. Canonical source

  31. [no text]

  32. More data,
    more users,
    more uses,
    more locations

  33. Cost

  34. Force multiplier

  35. Cost

  36. Complexity

  37. Cost and complexity kill reproducibility

  38. Utility computing

  39. Availability

  40. Pay-as-you-go

  41. Flexibility

  42. Performance

  43. CPU + IO

  44. Intel Xeon E5
    NVIDIA Tesla GPUs

  45. 240 TFLOPS

  46. 90 - 120k IOPS on SSDs

  47. Performance through productivity

  48. Cost

  49. On-demand access

  50. Reserved capacity

  51. 100%
    Reserved capacity

  52. 100%
    Reserved capacity
    On-demand

  53. 100%
    Reserved capacity
    On-demand

  54. Spot instances

  55. Utility computing enhanced reproducibility

  56. [no text]

  57. 2. Ease of use is a prerequisite
    5 PRINCIPLES OF REPRODUCIBILITY

  58. http://headrush.typepad.com/creating_passionate_users/2005/10/getting_users_p.html

  59. Help overcome the suck threshold

  60. Easy to embrace and extend

  61. Choose the right abstraction for the user

  62. $ ec2-run-instances

  63. $ starcluster start

  64. [no text]

  65. Package and automate

  66. Package and automate
    Amazon Machine Images,
    VM import

  67. Package and automate
    Amazon Machine Images,
    VM import
    Deployment scripts,
    CloudFormation, Chef, Puppet

  68. Expert-as-a-service

  69. [no text]

  70. [no text]

  71. 1000 Genomes
    Cloud BioLinux

  72. [no text]

  73. Your HiSeq data
    Illumina BaseSpace

  74. Architectural freedom

  75. Freedom of abstraction

  76. 3. Reuse is as important as reproduction
    5 PRINCIPLES OF REPRODUCIBILITY

  77. Seven Deadly Sins of Bioinformatics: http://www.slideshare.net/dullhunk/the-seven-deadly-sins-of-bioinformatics

  78. Seven Deadly Sins of Bioinformatics: http://www.slideshare.net/dullhunk/the-seven-deadly-sins-of-bioinformatics

  79. Infonauts are hackers

  80. They have their own way of working

  81. The ‘Big Red Button’

  82. Fire-and-forget reproduction is a good first step, but it limits longer-term value.

  83. Monolithic, one-stop-shop

  84. Work well for intended purpose

  85. Challenging to install, dependency heavy

  86. Difficult to grok

  87. Inflexible

  88. Infonauts are hackers: embrace it.

  89. Small things. Loosely coupled.

  90. Easier to grok

  91. Easier to reuse

  92. Easier to integrate

  93. Lower barrier to entry

  94. Scale out

  95. Build for reuse.
    Be remix friendly.
    Maximize value.
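
Slides 89 to 95 argue for small, loosely coupled tools that are easy to reuse and remix. The following is a minimal sketch of that idea, not anything taken from the talk: each step is a small function over a stream of records, and a pipeline is just a composition of steps. The record layout, the step names (`uppercase`, `min_length`, `drop_ambiguous`) and the sample reads are all hypothetical.

```python
from functools import reduce
from typing import Callable, Iterable, Iterator, Tuple

# Hypothetical record layout: a (name, sequence) pair.
Record = Tuple[str, str]
Step = Callable[[Iterable[Record]], Iterator[Record]]


def uppercase(records: Iterable[Record]) -> Iterator[Record]:
    """Normalize every sequence to upper case."""
    return ((name, seq.upper()) for name, seq in records)


def min_length(n: int) -> Step:
    """Build a step that keeps only sequences with at least n bases."""
    def step(records: Iterable[Record]) -> Iterator[Record]:
        return (r for r in records if len(r[1]) >= n)
    return step


def drop_ambiguous(records: Iterable[Record]) -> Iterator[Record]:
    """Drop sequences that contain the ambiguity code N."""
    return (r for r in records if "N" not in r[1])


def pipeline(*steps: Step) -> Step:
    """Compose steps left to right into a single reusable step."""
    def run(records: Iterable[Record]) -> Iterator[Record]:
        return reduce(lambda acc, step: step(acc), steps, iter(records))
    return run


# Made-up sample data for illustration only.
reads = [("r1", "acgt"), ("r2", "ac"), ("r3", "acnt"), ("r4", "ggccaa")]

clean = pipeline(uppercase, min_length(4), drop_ambiguous)
result = list(clean(reads))
print(result)
```

Because every step has the same shape, a different analysis can drop, reorder, or add steps (for example `pipeline(uppercase, drop_ambiguous)`) without touching the others, which is the remix-friendly property the slides describe.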

  96. 4. Build for collaboration
    5 PRINCIPLES OF REPRODUCIBILITY

  97. Workflows are memes

  98. Reproduction is just the first step

  99. Bill of materials: code, data, configuration, infrastructure

  100. Full definition for reproduction

  101. Utility computing provides a playground for bioinformatics

  102. Code + AMI +
    custom datasets + public datasets +
    databases + compute + result data

  103. Code + AMI +
    custom datasets + public datasets +
    databases + compute + result data

  104. Code + AMI +
    custom datasets + public datasets +
    databases + compute + result data

  105. Code + AMI +
    custom datasets + public datasets +
    databases + compute + result data

  106. Package, automate, contribute.

  107. Utility platform provides scale for production runs

  108. Drug discovery on 50k cores: less than $1000
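
Slides 99 to 105 describe a bill of materials (code, data, configuration, infrastructure) as the full definition needed for reproduction. One way to picture that, as a rough sketch under assumptions rather than the speaker's implementation, is a small manifest object that can be serialized and fingerprinted. The class name `BillOfMaterials`, its field names, and every value below (repository URL, image identifier, bucket paths, instance label) are hypothetical placeholders.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class BillOfMaterials:
    """Hypothetical manifest of what is needed to re-run an analysis."""
    code: str                                   # e.g. repository plus commit id
    machine_image: str                          # e.g. a machine image identifier
    datasets: List[str] = field(default_factory=list)
    configuration: Dict[str, str] = field(default_factory=dict)
    compute: str = ""                           # e.g. instance type and count

    def to_json(self) -> str:
        """Serialize with sorted keys so equal manifests give equal text."""
        return json.dumps(asdict(self), sort_keys=True, indent=2)

    def fingerprint(self) -> str:
        """SHA-256 of the serialized manifest, usable as a short identity."""
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()


# All values are placeholders, not real resources.
bom = BillOfMaterials(
    code="https://example.org/lab/pipeline.git@abc123",
    machine_image="ami-EXAMPLE",
    datasets=["s3://example-bucket/custom/reads",
              "s3://example-bucket/public/reference"],
    configuration={"aligner_threads": "8"},
    compute="16 x example.large",
)

same = BillOfMaterials(**asdict(bom))
changed = BillOfMaterials(**{**asdict(bom), "machine_image": "ami-OTHER"})

print(bom.fingerprint() == same.fingerprint())
print(bom.fingerprint() == changed.fingerprint())
```

Sharing the serialized manifest alongside a result gives collaborators a single document to start from, and the fingerprint makes it cheap to tell whether two runs were defined identically.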

  109. 5. Provenance is a first-class object
    5 PRINCIPLES OF REPRODUCIBILITY

  110. Versioning becomes really important

  111. Especially in an active community

  112. Doubly so with loosely coupled tools

  113. Provenance metadata is a first-class entity

  114. Distributed provenance
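
Slides 110 to 114 treat versioning and provenance metadata as first-class concerns, especially when loosely coupled tools are chained together. The sketch below is one possible reading of that, not a method described in the talk: each step emits a provenance record next to its output, and records link to one another by content hash, so a lineage can be rebuilt from whatever records are at hand. The names `Provenance`, `run_step`, and `lineage`, and the toy tools and version strings, are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


def digest(data: bytes) -> str:
    """Content hash used to identify an artifact."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class Provenance:
    """Hypothetical record of how one artifact was produced."""
    tool: str
    version: str
    params: Tuple[Tuple[str, str], ...]
    inputs: Tuple[str, ...]   # digests of the input artifacts
    output: str               # digest of the produced artifact


def run_step(tool: str, version: str, func: Callable[..., bytes],
             inputs: List[bytes], **params) -> Tuple[bytes, Provenance]:
    """Run func on the inputs and return its result with a provenance record."""
    result = func(*inputs, **params)
    record = Provenance(
        tool=tool,
        version=version,
        params=tuple(sorted((k, str(v)) for k, v in params.items())),
        inputs=tuple(digest(i) for i in inputs),
        output=digest(result),
    )
    return result, record


def lineage(output_digest: str,
            records: Dict[str, Provenance]) -> List[Provenance]:
    """Walk back from an output through every record that contributed to it."""
    trail: List[Provenance] = []
    pending = [output_digest]
    seen = set()
    while pending:
        d = pending.pop()
        record = records.get(d)
        if record is None or d in seen:
            continue  # raw input with no recorded producer, or already visited
        seen.add(d)
        trail.append(record)
        pending.extend(record.inputs)
    return trail


# Toy two-step workflow with made-up tools and versions.
raw = b"acgtnacgt"
upper, p1 = run_step("upper", "1.0.0", lambda b: b.upper(), [raw])
trimmed, p2 = run_step("trim", "0.2.1", lambda b, n: b[:n], [upper], n=4)

records = {p.output: p for p in (p1, p2)}
trail = lineage(p2.output, records)
print([(p.tool, p.version) for p in trail])
```

Keying records by content hash rather than by a central identifier is one way to approach the distributed case: records produced in different places can be merged into one dictionary and still link up.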

  115. 1. Data has gravity
    2. Ease of use is a prerequisite
    3. Reuse is as important as reproduction
    4. Build for collaboration
    5. Provenance is a first-class object
    5 PRINCIPLES OF REPRODUCIBILITY

  116. [no text]

  117. Thank you
    aws.amazon.com
    @mza
    [email protected]