A Platform for Big Data

Matt Wood

June 11, 2013

Transcript

  1. Building a PLATFORM for DATA, a presentation by DR. MATT WOOD
  2. Hello, and THANK YOU

  3. SEVEN years young

  4. Broad and deep SERVICES to support virtually any workload

  5. None
  6. Add enough server capacity EVERY DAY to power amazon.com in 2003

  7. Computing delivered as a UTILITY

  8. Take advantage of the ECONOMIES of scale to lower prices

  9. [Chart: quarterly time series, Q4 2006 to Q1 2013] 2 TRILLION OBJECTS
  10. [Chart: time series, 5/22/2010 to 4/6/2013] 5.5 MILLION HADOOP CLUSTERS
  11. GENERATION COLLECTION & STORAGE COMPUTATION & ANALYTICS COLLABORATION & SHARING

  12. GENERATION COLLECTION & STORAGE COMPUTATION & ANALYTICS COLLABORATION & SHARING

    Lower cost, Higher throughput
  13. GENERATION COLLECTION & STORAGE COMPUTATION & ANALYTICS COLLABORATION & SHARING

    Lower cost, Higher throughput
  14. GENERATION COLLECTION & STORAGE COMPUTATION & ANALYTICS COLLABORATION & SHARING

    Lower cost, Higher throughput Highly constrained
  15. The Data Analysis Gap [Chart: data volume, 1990 to 2020; labels: Enterprise Data, Data in Warehouse, Generated data, Available for analysis]

    Sources: Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
  16. Utility computing REMOVES resource constraints

  17. GENERATION COLLECTION & STORAGE COMPUTATION & ANALYTICS COLLABORATION & SHARING

  18. Technologies and techniques for working productively with data, at any

    scale.
  19. Technologies and techniques for working productively with data, at any

    scale.
  20. GENERATION COLLECTION & STORAGE COMPUTATION & ANALYSIS COLLABORATION & SHARING

  21. GENERATION COLLECTION & STORAGE COMPUTATION & ANALYSIS COLLABORATION & SHARING

    AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AWS IMPORT/EXPORT
  22. GENERATION COLLECTION & STORAGE COMPUTATION & ANALYSIS COLLABORATION & SHARING

    AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AMAZON CC2 AMAZON HS1 AMAZON CR1 AWS IMPORT/EXPORT AMAZON CG1
  23. GENERATION COLLECTION & STORAGE COMPUTATION & ANALYSIS COLLABORATION & SHARING

    AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AMAZON CC2 AMAZON HS1 AMAZON CR1 AWS DATA PIPELINE AMAZON SWF AWS CLOUDFORMATION AWS IMPORT/EXPORT AMAZON CG1
  24. GENERATION COLLECTION & STORAGE COMPUTATION & ANALYSIS COLLABORATION & SHARING

    AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AMAZON CC2 AMAZON HS1 AMAZON CR1 AWS DATA PIPELINE AMAZON SWF AWS CLOUDFORMATION AWS IMPORT/EXPORT AMAZON REDSHIFT AMAZON ELASTIC MAPREDUCE AMAZON CG1
  25. GENERATION COLLECTION & STORAGE COMPUTATION & ANALYSIS COLLABORATION & SHARING

    AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AMAZON CC2 AMAZON HS1 AMAZON CR1 AWS DATA PIPELINE AMAZON SWF AWS CLOUDFORMATION AWS IMPORT/EXPORT AMAZON REDSHIFT AMAZON ELASTIC MAPREDUCE AMAZON CG1
  26. Managed NOSQL DATASTORE

  27. Virtually UNLIMITED throughput and scale

  28. Single digit MILLISECOND latencies

  29. Running on SOLID STATE drives

  30. Storing data with DURABILITY across data centers and availability zones

  31. ZERO ADMIN

  32. Store KEYS & VALUES without requiring a schema

  33. KEYS & VALUES AMAZON DYNAMODB ORDER ID DATE ORDER TOTAL

    MERCHANT
  34. KEYS & VALUES AMAZON DYNAMODB ORDER ID DATE ORDER TOTAL

    MERCHANT Hash key
  35. KEYS & VALUES AMAZON DYNAMODB ORDER ID DATE ORDER TOTAL

    MERCHANT Hash key Range key
  36. KEYS & VALUES AMAZON DYNAMODB ORDER ID DATE ORDER TOTAL

    MERCHANT Hash key Range key Secondary index
  37. KEYS & VALUES AMAZON DYNAMODB ORDER ID DATE ORDER TOTAL

    MERCHANT Hash key Range key Secondary index Projected attribute
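
As a rough sketch of how the terms on slides 33 to 37 fit together, the dictionary below follows the shape of a DynamoDB CreateTable request. The slides only name the roles, so the mapping of columns to roles is an assumption: `OrderId` as hash key, `Date` as range key, `OrderTotal` as the key of a local secondary index, and `Merchant` as a projected attribute. The table name, index name, and capacity numbers are hypothetical.

```python
# Sketch of a CreateTable request for the orders example.
# Column-to-role mapping, names, and capacity values are illustrative assumptions.
create_table_request = {
    "TableName": "Orders",  # hypothetical table name
    "AttributeDefinitions": [
        {"AttributeName": "OrderId", "AttributeType": "S"},
        {"AttributeName": "Date", "AttributeType": "S"},
        {"AttributeName": "OrderTotal", "AttributeType": "N"},
    ],
    # Primary key: hash key plus range key.
    "KeySchema": [
        {"AttributeName": "OrderId", "KeyType": "HASH"},
        {"AttributeName": "Date", "KeyType": "RANGE"},
    ],
    # Local secondary index: same hash key, alternative range key.
    "LocalSecondaryIndexes": [
        {
            "IndexName": "OrderTotalIndex",  # hypothetical index name
            "KeySchema": [
                {"AttributeName": "OrderId", "KeyType": "HASH"},
                {"AttributeName": "OrderTotal", "KeyType": "RANGE"},
            ],
            # Projected attribute: copied into the index so queries can read it there.
            "Projection": {
                "ProjectionType": "INCLUDE",
                "NonKeyAttributes": ["Merchant"],
            },
        }
    ],
    "ProvisionedThroughput": {"ReadCapacityUnits": 10, "WriteCapacityUnits": 5},
}


def undefined_key_attributes(request):
    """Return key attribute names that are missing from AttributeDefinitions."""
    defined = {a["AttributeName"] for a in request["AttributeDefinitions"]}
    used = {k["AttributeName"] for k in request["KeySchema"]}
    for index in request.get("LocalSecondaryIndexes", []):
        used |= {k["AttributeName"] for k in index["KeySchema"]}
    return sorted(used - defined)


print(undefined_key_attributes(create_table_request))
```

With boto3, a dictionary of this shape could be passed as `client.create_table(**create_table_request)`; that call is left out here so the sketch runs without AWS credentials.
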
  38. API (AMAZON DYNAMODB): CreateTable, UpdateTable, DeleteTable, DescribeTable, ListTables, Query, Scan, PutItem, GetItem, UpdateItem, DeleteItem, BatchGetItem, BatchWriteItem
  39. READS, WRITES, UPDATES (AMAZON DYNAMODB): Item level transactions only. Conditional and atomic updates. Counts. Top/bottom n values. Results paged to 1MB in size.
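
The idea of conditional and atomic updates on a single item can be illustrated with a toy in-memory store. This is not the DynamoDB API; the class and method names are hypothetical, and the sketch only shows the semantics: an update is applied only if an attribute still holds the expected value, and a counter is incremented in one step.

```python
class ConditionFailed(Exception):
    """Raised when the stored value does not match the expected value."""


class ItemStore:
    """Toy in-memory key-value store (illustration only, not a real client)."""

    def __init__(self):
        self._items = {}

    def put(self, key, item):
        self._items[key] = dict(item)

    def get(self, key):
        return dict(self._items[key])

    def conditional_update(self, key, attribute, expected, new_value):
        """Set attribute to new_value only if it currently equals expected."""
        item = self._items[key]
        if item.get(attribute) != expected:
            raise ConditionFailed(f"{attribute} is {item.get(attribute)!r}, expected {expected!r}")
        item[attribute] = new_value

    def atomic_add(self, key, attribute, delta):
        """Increment (or decrement) a numeric attribute and return the new value."""
        item = self._items[key]
        item[attribute] = item.get(attribute, 0) + delta
        return item[attribute]


store = ItemStore()
store.put("item-1", {"views": 0, "version": 1})
store.atomic_add("item-1", "views", 5)
store.conditional_update("item-1", "version", expected=1, new_value=2)
print(store.get("item-1"))
```

A second writer that still expects `version == 1` would get `ConditionFailed` and has to re-read the item before retrying, which is the essence of optimistic concurrency control.
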
  40. Provisioned THROUGHPUT

  41. PROVISIONED THROUGHPUT (AMAZON DYNAMODB): Provision the IO your application needs. Pay per unit of provisioned capacity. Consistent, predictable performance, irrespective of scale. Designed for uniform workload.
  42. YOUR APP DYNAMODB

  43. YOUR APP DYNAMODB READ THROUGHPUT

  44. READ THROUGHPUT (AMAZON DYNAMODB): IO per 4 KB item. Strong and eventual consistency. Mix and match consistency.
  45. YOUR APP DYNAMODB READ THROUGHPUT WRITE THROUGHPUT

  46. WRITE THROUGHPUT (AMAZON DYNAMODB): IO per 1 KB item. Atomic increment and decrement. Optimistic concurrency control.
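
The per-item accounting on slides 44 and 46 can be turned into a small capacity estimate. The sketch below assumes the unit sizes given on the slides (4 KB per read unit, 1 KB per write unit) and additionally assumes, following the DynamoDB documentation rather than the slides, that an eventually consistent read costs half of a strongly consistent one. Function names and the example workload are hypothetical.

```python
import math

READ_UNIT_BYTES = 4 * 1024   # one read unit covers up to 4 KB (slide 44)
WRITE_UNIT_BYTES = 1 * 1024  # one write unit covers up to 1 KB (slide 46)


def read_capacity_units(item_bytes, reads_per_second, strongly_consistent=True):
    """Estimate the read throughput to provision for a uniform workload."""
    units_per_read = math.ceil(item_bytes / READ_UNIT_BYTES)
    total = units_per_read * reads_per_second
    if not strongly_consistent:
        total = total / 2  # assumption: eventual consistency costs half
    return math.ceil(total)


def write_capacity_units(item_bytes, writes_per_second):
    """Estimate the write throughput to provision for a uniform workload."""
    return math.ceil(item_bytes / WRITE_UNIT_BYTES) * writes_per_second


# Hypothetical workload: 6000-byte items, 100 reads/s, 10 writes/s of 1500 bytes.
print(read_capacity_units(6000, 100))
print(read_capacity_units(6000, 100, strongly_consistent=False))
print(write_capacity_units(1500, 10))
```

Item sizes round up to the next unit boundary, so a 6000-byte item counts as two read units even though it is well under 8 KB.
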
  47. YOUR APP DYNAMODB READ THROUGHPUT WRITE THROUGHPUT

  48. YOUR APP DYNAMODB READ THROUGHPUT WRITE THROUGHPUT

  49. YOUR APP DYNAMODB READ THROUGHPUT WRITE THROUGHPUT 14.2% 14.2% 14.2%

    14.2% 14.2% 14.2% 14.2% THROUGHPUT
  50. YOUR APP DYNAMODB READ THROUGHPUT WRITE THROUGHPUT 14.2% 14.2% 14.2%

    14.2% 14.2% 14.2% 14.2% THROUGHPUT KEY ACCESS 14.2% 14.2% 14.2% 14.2% 14.2% 14.2% 14.2%
  51. YOUR APP DYNAMODB READ THROUGHPUT WRITE THROUGHPUT 14.2% 14.2% 14.2%

    14.2% 14.2% 14.2% 14.2% THROUGHPUT KEY ACCESS 0% 50% 0% 50% 0% 0% 0%
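
Slides 49 to 51 contrast evenly spread key access with access concentrated on two keys. A simplified model of why that matters: assume provisioned throughput is split evenly across seven partitions (the seven 14.2% figures), and that a partition throttles once its own share is exhausted. Under those assumptions, the usable fraction of the total is limited by the busiest partition. The function name is hypothetical.

```python
def sustainable_fraction(capacity_shares, access_shares):
    """Fraction of total provisioned throughput usable before any partition throttles.

    capacity_shares: share of provisioned throughput held by each partition.
    access_shares: share of requests that land on each partition.
    """
    limits = [
        capacity / access
        for capacity, access in zip(capacity_shares, access_shares)
        if access > 0
    ]
    return min(1.0, min(limits))


partitions = 7
even_capacity = [1 / partitions] * partitions

uniform_access = [1 / partitions] * partitions
hot_access = [0.0, 0.5, 0.0, 0.5, 0.0, 0.0, 0.0]  # access pattern from slide 51

print(sustainable_fraction(even_capacity, uniform_access))
print(sustainable_fraction(even_capacity, hot_access))
```

With uniform access the whole provisioned throughput is usable; with half of the requests on each of two partitions, only about two sevenths of it is, which matches the "designed for uniform workload" note on slide 41.
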
  52. None
  53. GENERATION COLLECTION & STORAGE COMPUTATION & ANALYSIS COLLABORATION & SHARING

    AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AMAZON CC2 AMAZON HS1 AMAZON CR1 AWS DATA PIPELINE AMAZON SWF AWS CLOUDFORMATION AWS IMPORT/EXPORT AMAZON REDSHIFT AMAZON ELASTIC MAPREDUCE AMAZON CG1
  54. GENERATION COLLECTION & STORAGE COMPUTATION & ANALYSIS COLLABORATION & SHARING

    AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AMAZON CC2 AMAZON HS1 AMAZON CR1 AWS DATA PIPELINE AMAZON SWF AWS CLOUDFORMATION AWS IMPORT/EXPORT AMAZON REDSHIFT AMAZON ELASTIC MAPREDUCE AMAZON CG1
  55. Managed HADOOP CLUSTERS

  56. Hadoop with ELASTICITY

  57. Input data; S3, DynamoDB, Redshift

  58. Elastic MapReduce; Code; Input data; S3, DynamoDB, Redshift

  59. Elastic MapReduce; Code; Name node; Input data; S3, DynamoDB, Redshift

  60. Elastic MapReduce; Code; Name node; Elastic cluster; S3/HDFS; Input data; S3, DynamoDB, Redshift

  61. Elastic MapReduce; Code; Name node; Elastic cluster; S3/HDFS; Queries + BI via JDBC, Pig, Hive; Input data; S3, DynamoDB, Redshift

  62. Elastic MapReduce; Code; Name node; Elastic cluster; S3/HDFS; Queries + BI via JDBC, Pig, Hive; Output; Input data; S3, DynamoDB, Redshift

  63. Output; Input data; S3, DynamoDB, Redshift

  64. 10 HOURS ELASTIC MAPREDUCE

  65. 6 HOURS ELASTIC MAPREDUCE

  66. PEAK CAPACITY ELASTIC MAPREDUCE

  67. HADOOP ALL THE WAY DOWN (ELASTIC MAPREDUCE): Pig, Hive, Mesos, Avro, Spark, Shark, MapR, Informatica, Mahout, Nutch, Flume, Accumulo, Cascading, Oozie, HBase, Sqoop
  68. Built for SPOT

  69. On demand instance: $0.50 per hour. Today: $0.0350, 7% of on-demand price. “Overclock” by 14x
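
The arithmetic behind slide 69 can be checked directly. Reading the $0.0350 figure as the Spot price for the same instance (an interpretation of slides 68 and 69, not stated in one place), the two headline numbers follow from the two prices. Function names are hypothetical.

```python
def price_fraction(on_demand_price, spot_price):
    """Spot price as a fraction of the on-demand price."""
    return spot_price / on_demand_price


def instances_for_same_budget(on_demand_price, spot_price):
    """How many Spot instances the price of one on-demand instance buys."""
    return on_demand_price / spot_price


on_demand = 0.50  # dollars per hour, from the slide
spot = 0.0350     # dollars per hour, from the slide

print(f"{price_fraction(on_demand, spot):.0%} of on-demand price")
print(f"{instances_for_same_budget(on_demand, spot):.1f}x instances for the same spend")
```

The "overclock" figure is the second number rounded down: the same hourly budget covers roughly 14 instances instead of one.
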
  70. None
  71. GENERATION COLLECTION & STORAGE COMPUTATION & ANALYSIS COLLABORATION & SHARING

    AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AMAZON CC2 AMAZON HS1 AMAZON CR1 AWS DATA PIPELINE AMAZON SWF AWS CLOUDFORMATION AWS IMPORT/EXPORT AMAZON REDSHIFT AMAZON ELASTIC MAPREDUCE AMAZON CG1
  72. GENERATION COLLECTION & STORAGE COMPUTATION & ANALYSIS COLLABORATION & SHARING

    AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AMAZON CC2 AMAZON HS1 AMAZON CR1 AWS DATA PIPELINE AMAZON SWF AWS CLOUDFORMATION AWS IMPORT/EXPORT AMAZON REDSHIFT AMAZON ELASTIC MAPREDUCE AMAZON CG1
  73. Managed, petabyte scale DATA WAREHOUSE

  74. Scale from 100s GB to 1.6PB

  75. COLUMNAR STORE (REDSHIFT): Designed for columnar access. Automatic data compression. Large block size. Best practices for data loading. Continual incremental backup to S3.

  76. PARALLEL PROCESSING (REDSHIFT): Fully bisectional 10 gigE network. 128GB RAM. Xeon E5 platform. 16TB across 24 spindles.
  77. LEADER COMPUTE COMPUTE COMPUTE S3 BI TOOLS

  78. LEADER COMPUTE COMPUTE COMPUTE S3 BI TOOLS READ ONLY LEADER

    COMPUTE COMPUTE COMPUTE S3 COMPUTE COMPUTE
  79. $999 PER TB PER YEAR

  80. HS1 ON EC2: 2.4 GB/s of 2MiB sequential reads. 2.6 GB/s for sequential writes.

  81. HI1 ON EC2: 2 x 1TB SSDs. 4 KB random reads: 120k IOPS. 4 KB random writes: 10k - 80k IOPS.
  82. None
  83. Technologies and techniques for working productively with data, at any

    scale.
  84. http://headrush.typepad.com/creating_passionate_users/2005/10/getting_users_p.html

  85. Ease of use is a PRE-REQUISITE

  86. Move data to the COMPUTE

  87. Move tools to the DATA

  88. Place data where it can be CONSUMED by those tools

  89. Expose data at the RIGHT LEVEL

  90. None
  91. None
  92. None
  93. None
  94. S3 DYNAMODB EMR EMR REDSHIFT DYNAMODB DATA PIPELINE

  95. S3 DYNAMODB EMR EMR REDSHIFT DATA PIPELINE DYNAMODB

  96. create external table items_db (id string, votes bigint, views bigint)

    stored by 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' tblproperties ("dynamodb.table.name" = "items", "dynamodb.column.mapping" = "id:id,votes:votes,views:views");
  97. select id, votes, views from items_db order by views desc;

  98. CREATE EXTERNAL TABLE orders_s3_new_export ( order_id string, customer_id string, order_date

    int, total double ) PARTITIONED BY (year string, month string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 's3://export_bucket'; INSERT OVERWRITE TABLE orders_s3_new_export PARTITION (year='2012', month='01') SELECT * from orders_ddb_2012_01;
  99. S3 DYNAMODB EMR EMR REDSHIFT DATA PIPELINE DYNAMODB

  100. Fast, optimized data movement with S3DISTCOPY

  101. Work with S3 as HDFS

  102. S3 DYNAMODB EMR EMR REDSHIFT DYNAMODB DATA PIPELINE

  103. Read and load data in parallel with COPY

  104. Use READRATIO to limit throughput consumption
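
To make the COPY and READRATIO slides concrete, the helper below assembles a Redshift `COPY ... FROM 'dynamodb://...'` statement as a string. The helper name, table names, and role ARN are hypothetical; the 1 to 200 range check reflects the READRATIO limits as described in the Redshift COPY documentation and should be treated as an assumption here. The statement is only built, not executed, and the string formatting is not safe for untrusted input.

```python
def build_copy_from_dynamodb(target_table, dynamodb_table, iam_role, read_ratio):
    """Build a Redshift COPY statement that loads from a DynamoDB table.

    read_ratio is the percentage of the DynamoDB table's provisioned read
    throughput that the load is allowed to consume.
    """
    if not 1 <= read_ratio <= 200:
        raise ValueError("read_ratio must be between 1 and 200")
    return (
        f"COPY {target_table} "
        f"FROM 'dynamodb://{dynamodb_table}' "
        f"IAM_ROLE '{iam_role}' "
        f"READRATIO {read_ratio};"
    )


statement = build_copy_from_dynamodb(
    target_table="items",                                         # hypothetical
    dynamodb_table="items",                                       # hypothetical
    iam_role="arn:aws:iam::123456789012:role/ExampleCopyRole",    # hypothetical
    read_ratio=50,
)
print(statement)
```

A lower ratio leaves more of the table's read throughput for the live application while the load runs, at the price of a slower load.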

  105. S3 DYNAMODB EMR EMR REDSHIFT DATA PIPELINE DYNAMODB

  106. Reliable, scheduled DATA INTENSIVE workflows

  107. INPUT DATA ACTIVITY OUTPUT

  108. INPUT DATA ACTIVITY OUTPUT Precondition checks Failure and delay notifications
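
The input, activity, output flow with precondition checks and notifications can be sketched generically. This is not the AWS Data Pipeline API or its definition format; all names below are hypothetical, and the sketch only shows the control flow the slide describes: check a precondition, run the activity, and notify on delay or failure.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Activity:
    name: str
    precondition: Callable[[], bool]  # e.g. "does the input exist yet?"
    run: Callable[[Any], Any]         # transforms input data into output


def run_activity(activity, input_data, notify):
    """Run one activity; report a delay or failure through notify()."""
    if not activity.precondition():
        notify(f"{activity.name}: precondition not met, run delayed")
        return None
    try:
        return activity.run(input_data)
    except Exception as exc:
        notify(f"{activity.name}: failed with {exc!r}")
        return None


messages = []
total_orders = Activity(
    name="total-orders",
    precondition=lambda: True,  # stand-in for a real input check
    run=lambda rows: sum(rows),
)
print(run_activity(total_orders, [10, 20, 30], messages.append))
```

A scheduler would call `run_activity` on each period and retry delayed runs; that part is omitted to keep the sketch short.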

  109. None
  110. S3 DYNAMODB EMR EMR REDSHIFT DYNAMODB DATA PIPELINE

  111. Amazon S3 http://www.youtube.com/watch?v=oGcZ7WVx6EI Legacy data warehousing Cassandra Aegisthus Hadoop, Hive,

    Pig
  112. Amazon S3 http://www.youtube.com/watch?v=oGcZ7WVx6EI Legacy data warehousing Cassandra Aegisthus Hadoop, Hive,

    Pig Microstrategy Sting R
  113. None
  114. None
  115. 98% time saved for clinical trial simulations

    | Metric | Internal System | AWS |
    | --- | --- | --- |
    | Individual Clinical Trial Simulation Run Time (Min) | 56 | 56 |
    | Total Number of Clinical Trial Simulations | 2000 | 2000 |
    | No. Servers | 2 | 256 |
    | No. CPUs | 32 | 2048 |
    | Total Analysis Run Time (hr) | 60 | 1.2 |
    | Cost | ?? | $336 |
  116. Reduced burden on pediatric subjects

    | Measure | Traditional Design | Design Optimized Using Clinical Trial Simulation |
    | --- | --- | --- |
    | # of subjects | 60 | 40 |
    | # of blood samples per subject | 12 | 5 |
    | Length of stay per subject | 72 hours | 26 hours |
    | Length of study | 2.5 years | 1.7 years |
    | Total study cost | $700K | $250K |

    Length and cost projected based on historical data in pediatric subjects
  117. Anurag Gupta (awgupta@amazon.com), David Lang (davelang@amazon.com), Matt Wood (matthew@amazon.com), Jon Einkauf (jeinkauf@amazon.com)