Data-driven Innovation

Matt Wood
October 10, 2012

Slides from my session at the #AWS Public Sector Summit, 2012.

Transcript

  1. Data-driven innovation matthew@amazon.com Dr. Matt Wood @mza

  2. Hello

  3. Hello

  4. Data

  5. DNA

  6. Chromosome 11 : ACTN3 : rs1815739

  7. Chromosome X : rs6625163

  8. Chromosome 19 : FUT2 : rs601338

  9. Chromosome 2 : rs10427255

  10. TYPE II Chromosome 10 : rs7903146

  11. +0.25 Chromosome 15 : rs2472297

  12. I know this, because...

  13. None
  14. A T C G G T C C A G G

  15. A T C G G T C C A G G A G C C A G G U C C Transcription

  16. A T C G G T C C A G G A G C C A G G U C C Translation Ser Glu Val Transcription

  17. None
  18. None
  19. Chromosome 11 : ACTN3 : rs1815739

  20. Chromosome X : rs6625163

  21. Chromosome 19 : FUT2 : rs601338

  22. Chromosome 2 : rs10427255

  23. TYPE II Chromosome 10 : rs7903146

  24. +0.25 Chromosome 15 : rs2472297

  25. I know all that, because...

  26. Human Genome Project

  27. 40 species ensembl.org

  28. Compare

  29. Change

  30. Less

  31. None
  32. None
  33. Compare

  34. Transformative

  35. None
  36. Data generation costs are falling everywhere

  37. Customer segmentation, financial modeling, system analysis, line of sight, business intelligence.

  38. Opportunity

  39. Transformation

  40. Innovation

  41. Generation Collection & storage Analytics & computation Collaboration & sharing

  42. Generation Collection & storage Analytics & computation Collaboration & sharing lower cost, increased throughput

  43. Generation Collection & storage Analytics & computation Collaboration & sharing lower cost, increased throughput highly constrained

  44. Barrier

  45. Data generation challenge X

  46. Analytics challenge

  47. Accessibility challenge

  48. Enter the AWS Cloud

  49. Utility

  50. Remove constraints

  51. Data-driven innovation

  52. Distributed

  53. 2

  54. 2 Software for distributed storage & analysis

  55. 2 Software for distributed storage & analysis Infrastructure for distributed storage & analysis

  56. Software Frameworks for data-intensive workloads. Distributed by design.

  57. Infrastructure Platform for data-intensive workloads. Distributed by design.

  58. Support the data timeline

  59. Generation Collection & storage Analytics & computation Collaboration & sharing highly constrained

  60. Generation Collection & storage Analytics & computation Collaboration & sharing

  61. Lower the barrier to entry

  62. Agility

  63. Responsive

  64. Generation Collection & storage Analytics & computation Collaboration & sharing

  65. Generation DynamoDB Analytics & computation Collaboration & sharing

  66. Generation DynamoDB EC2, Elastic MapReduce Collaboration & sharing

  67. Generation DynamoDB EC2, Elastic MapReduce S3, Public Datasets

  68. Tools and techniques for working productively with data

  69. Scale

  70. Secure

  71. 2 Software for distributed storage & analysis Infrastructure for distributed storage & analysis

  72. Amazon EC2

  73. Scale out systems Embarrassingly parallel Queue based distribution Small, medium and high scale

  74. High performance

  75. High performance Compute performance

  76. Cluster Compute Intel Xeon E5-2670 10 gigabit, non-blocking network 60.5 GB Placement groupings

  77. Cluster Compute Intel Xeon E5-2670 10 gigabit, non-blocking network 60.5 GB Placement groupings +GPU

  78. 240 TFLOPS

  79. High performance Compute performance IO performance

  80. Unstructured

  81. Variable

  82. Amazon DynamoDB Predictable, consistent performance Unlimited storage Single digit millisecond latencies No schema. Zero admin.

  83. ...and SSDs for all

  84. hi1.4xlarge 2 x 1 TB SSD storage 10 gigabit networking HVM: 90k IOPS read, 9k to 75k write PV: 120k IOPS read, 10k to 85k write

  85. Netflix “The hi1.4xlarge configuration is about half the system cost for the same throughput.” http://techblog.netflix.com/2012/07/benchmarking-high-performance-io-with.html

  86. Provisioned IOPS Provision required IO performance EBS optimized instances

  87. Cost optimization

  88. Reserved capacity

  89. Reserved capacity On-demand

  90. Reserved capacity On-demand

  91. Spot instances

  92. None
  93. $0.2530 vs $2.40

  94. 2 Software for distributed storage & analysis Infrastructure for distributed storage & analysis

  95. map/reduce

  96. Map. Reduce.

  97. Write functions. Scale up.

  98. Hadoop

  99. Undifferentiated heavy lifting

  100. Amazon Elastic MapReduce Managed Hadoop Clusters Easy to provision and monitor Write two functions. Scale up. Choice of Hadoop flavors

  101. Amazon Elastic MapReduce Integrates with S3 Analytics for DynamoDB Perfect for Spot pricing

  102. Input data S3

  103. Elastic MapReduce Code Input data S3

  104. Elastic MapReduce Code Name node Input data S3

  105. Elastic MapReduce Code Name node Input data S3 Elastic cluster

  106. Elastic MapReduce Code Name node Input data S3 Elastic cluster HDFS

  107. Elastic MapReduce Code Name node Input data S3 Elastic cluster HDFS Queries + BI Via JDBC, Pig, Hive

  108. Elastic MapReduce Code Name node Output S3 + SimpleDB Input data S3 Elastic cluster HDFS Queries + BI Via JDBC, Pig, Hive

  109. Output S3 + SimpleDB Input data S3

  110. CDC Centers for Disease Control and Prevention

  111. “BioSense 2.0 protects the health of the American people by providing timely insight into the health of communities, regions, and the nation by offering a variety of features to improve data collection, standardization, storage, analysis, and collaboration”

  112. Health data Collection & storage Analytics & computation Collaboration & sharing

  113. Health data Collection & storage Analytics & computation Collaboration & sharing highly constrained

  114. HIPAA, HITECH, FISMA Moderate

  115. GovCloud

  116. Beyond a definition of Big Data

  117. Chromosome 11 : ACTN3 : rs1815739

  118. Chromosome X : rs6625163

  119. Chromosome 19 : FUT2 : rs601338

  120. Chromosome 2 : rs10427255

  121. TYPE II Chromosome 10 : rs7903146

  122. +0.25 Chromosome 15 : rs2472297

  123. Thank you aws.amazon.com @mza matthew@amazon.com
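
Slides 95 to 100 reduce map/reduce to "Write two functions. Scale up." As an illustration only, the sketch below shows what those two functions might look like in plain Python, with a tiny in-process driver standing in for the framework. It does not use Hadoop or Elastic MapReduce; the names `map_fn`, `reduce_fn`, `run_map_reduce` and the sample reads are all hypothetical and not taken from the talk.

```python
from collections import defaultdict
from itertools import chain


def map_fn(read):
    """Map step: emit a (base, 1) pair for every base in one read."""
    return [(base, 1) for base in read]


def reduce_fn(base, counts):
    """Reduce step: sum all counts emitted for one base."""
    return base, sum(counts)


def run_map_reduce(records, mapper, reducer):
    """In-process stand-in for the framework: map, group by key, reduce."""
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(mapper(r) for r in records):
        grouped[key].append(value)
    return dict(reducer(k, v) for k, v in sorted(grouped.items()))


# Hypothetical input reads, invented for this example.
reads = ["ATCG", "GGTC", "CAGG"]
base_counts = run_map_reduce(reads, map_fn, reduce_fn)
print(base_counts)
```

Because each call to `map_fn` looks at a single record and each call to `reduce_fn` looks at a single key, a real framework is free to spread those calls across many machines, which is the "scale up" half of the slide.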