Enabling Research in the Cloud

Matt Wood
April 28, 2012

From a talk at Bio-IT World Expo 2012, these slides introduce how the cloud enables modern research, from data collection, computation, and collaboration to scientific reproducibility and reuse.

Transcript

  1. Enabling Research in the Cloud Matt Wood Data Intensive & High Performance Computing

  2. Hello.

  3. None
  4. None
  5. Thank you

  6. Infrastructure building blocks

  7. None
  8. Consumer business Seller business

  9. Decades of experience Operations, management and scale

  10. Programmatic access

  11. Unexpected innovation

  12. Blinding flash of the obvious

  13. None
  14. 6 years young Amazon S3 launched on March 14th, 2006

  15. Objects in S3 [chart: billions of objects per quarter, Q4 2006 to Q1 2012] 906B objects. 600k+ peak transactions per second

  16. 99.999999999% durability

  17. Life sciences

  18. A T C G

  19. Storage Compute Databases Tools

  20. Collection Computation Collaboration

  21. Collection Computation Collaboration

  22. Collection Computation Collaboration

  23. Collection Computation Collaboration

  24. Availability

  25. Availability Programmable On-demand

  26. Flexibility

  27. Elasticity

  28. Collection Computation Collaboration

  29. Collection Computation Collaboration

  30. None
  31. Data stays local

  32. Availability zones Design for durability

  33. Shared responsibility

  34. Data movement

  35. Data movement Upload with large object support

  36. Data movement Upload with large object support Multi-part, parallel uploads

  37. Data movement Upload with large object support Multi-part, parallel uploads Physical media

  38. Data movement Upload with large object support Multi-part, parallel uploads Physical media Private network connection

  39. AWS Direct Connect

  40. Direct connection to AWS regions

  41. Consistent network performance

  42. Private connectivity

  43. Elastic 1 Gbps and 10 Gbps

  44. Reduced bandwidth costs ISP and lower Direct Connect pricing

  45. Globus Online 3.8 PB moved (as of this morning!)

  46. Aspera

  47. Public Datasets

  48. 1000 Genomes Project aws.amazon.com/1000genomes

  49. Collection Computation Collaboration

  50. Scale

  51. Scale How much can I get? What size will get me time most quickly?

  52. Value How much do I need? What value does it have for me?

  53. Economies of scale

  54. 19 price drops Committed to passing savings to customers

  55. Utilisation

  56. Achieving economies of scale 100% Time

  57. Achieving economies of scale 100% Reserved capacity

  58. Achieving economies of scale 100% Reserved capacity On-demand

  59. Achieving economies of scale 100% Reserved capacity On-demand

  60. Spot market Choose your own price for compute

  61. Scale out

  62. 30k cores On the spot market. $1279 per hour.

  63. 50k cores Schrodinger and Cycle Computing

  64. 51,132 cores Schrodinger and Cycle Computing 6742 instances. $4828 per hour

  65. Elastic MapReduce Myrna. Crossbow.

  66. Scale up

  67. CC2 Tightly coupled workflows

  68. 240 TFLOPS 42nd fastest supercomputer

  69. Scale cores

  70. GPU on demand AMBER

  71. Scale?

  72. Getting stuff done

  73. StarCluster

  74. Cloud BioLinux Ready to roll with 1000 Genomes data

  75. Collection Computation Collaboration

  76. Galaxy

  77. None
  78. synapse.sagebase.org Collaboration platform for clinical genomic datasets

  79. None
  80. AWS for Education aws.amazon.com/education

  81. Storage Compute Databases Tools

  82. Materials Methods Hypotheses Results

  83. Data Code Pipeline Infrastructure

  84. Data Code Pipeline Infrastructure

  85. Fully defined Data sources. Infrastructure stack. Metadata.

  86. Data Code Pipeline Infrastructure Results

  87. Data Code Pipeline Infrastructure Results Data Code Pipeline Infrastructure Results

  88. Data Code Pipeline Infrastructure Results Data Code Pipeline Infrastructure New results

  89. Data Code Pipeline Infrastructure Results v2 Data Code Pipeline Infrastructure New results

  90. Reproduce. Remix. Reuse.

  91. Enabled by programmable infrastructure

  92. Enabling science

  93. aws.amazon.com/genomics

  94. Airbnb, CapitalIQ, Marketshare, Bioproximity, Schrodinger and MIT http://aws.amazon.com/big-data-and-hpc-event/boston/

  95. Thank you!

  96. Q & A matthew@amazon.com

  97. Introducing the panel...

  98. Angel Pizarro - U. Pennsylvania; Anushka Brownley - Complete Genomics; Stephen Litster - Novartis