A Platform for Big Data

Matt Wood

June 11, 2013

Transcript

  1. Building a PLATFORM for DATA
    a presentation by DR. MATT WOOD

    View Slide

  2. Hello, and THANK YOU

    View Slide

  3. SEVEN
    years young

    View Slide

  4. Broad and deep SERVICES
    to support virtually any workload

    View Slide

  5. View Slide

  6. Add enough server capacity
    to power amazon.com in 2003, EVERY DAY

    View Slide

  7. Computing delivered as a UTILITY

    View Slide

  8. Take advantage of the ECONOMIES
    of scale to lower prices

    View Slide

  9. 2 TRILLION OBJECTS
    [Chart: quarterly time axis, Q4 2006 to Q1 2013]

    View Slide

  10. 5.5 MILLION HADOOP CLUSTERS
    [Chart: time axis from 5/22/2010 to 4/6/2013, in three-week steps]

    View Slide

  11. GENERATION
    COLLECTION &
    STORAGE
    COMPUTATION &
    ANALYTICS
    COLLABORATION &
    SHARING

    View Slide

  12. GENERATION
    COLLECTION &
    STORAGE
    COMPUTATION &
    ANALYTICS
    COLLABORATION &
    SHARING
    Lower cost,
    Higher throughput

    View Slide

  13. GENERATION
    COLLECTION &
    STORAGE
    COMPUTATION &
    ANALYTICS
    COLLABORATION &
    SHARING
    Lower cost,
    Higher throughput

    View Slide

  14. GENERATION
    COLLECTION &
    STORAGE
    COMPUTATION &
    ANALYTICS
    COLLABORATION &
    SHARING
    Lower cost,
    Higher throughput
    Highly constrained

    View Slide

  15. The Data Analysis Gap
    [Chart: data volume over time, 1990 to 2020]
    Chart labels: Enterprise Data, Data in Warehouse, Generated data, Available for analysis
    Sources:
    Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
    IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares

    View Slide

  16. Utility computing REMOVES
    resource constraints

    View Slide

  17. GENERATION
    COLLECTION &
    STORAGE
    COMPUTATION &
    ANALYTICS
    COLLABORATION &
    SHARING

    View Slide

  18. Technologies and techniques
    for working productively with
    data, at any scale.

    View Slide

  19. Technologies and techniques
    for working productively with
    data, at any scale.

    View Slide

  20. GENERATION
    COLLECTION &
    STORAGE
    COMPUTATION &
    ANALYSIS
    COLLABORATION &
    SHARING

    View Slide

  21. GENERATION
    COLLECTION &
    STORAGE
    COMPUTATION &
    ANALYSIS
    COLLABORATION &
    SHARING
    AMAZON S3
    AMAZON DYNAMODB
    AMAZON GLACIER
    AMAZON RDS
    AWS IMPORT/EXPORT

    View Slide

  22. GENERATION
    COLLECTION &
    STORAGE
    COMPUTATION &
    ANALYSIS
    COLLABORATION &
    SHARING
    AMAZON S3
    AMAZON DYNAMODB
    AMAZON GLACIER
    AMAZON RDS
    AMAZON CC2
    AMAZON HS1
    AMAZON CR1
    AWS IMPORT/EXPORT
    AMAZON CG1

    View Slide

  23. GENERATION
    COLLECTION &
    STORAGE
    COMPUTATION &
    ANALYSIS
    COLLABORATION &
    SHARING
    AMAZON S3
    AMAZON DYNAMODB
    AMAZON GLACIER
    AMAZON RDS
    AMAZON CC2
    AMAZON HS1
    AMAZON CR1
    AWS DATA PIPELINE
    AMAZON SWF
    AWS CLOUDFORMATION
    AWS IMPORT/EXPORT
    AMAZON CG1

    View Slide

  24. GENERATION
    COLLECTION &
    STORAGE
    COMPUTATION &
    ANALYSIS
    COLLABORATION &
    SHARING
    AMAZON S3
    AMAZON DYNAMODB
    AMAZON GLACIER
    AMAZON RDS
    AMAZON CC2
    AMAZON HS1
    AMAZON CR1
    AWS DATA PIPELINE
    AMAZON SWF
    AWS CLOUDFORMATION
    AWS IMPORT/EXPORT
    AMAZON REDSHIFT
    AMAZON ELASTIC MAPREDUCE
    AMAZON CG1

    View Slide

  25. GENERATION
    COLLECTION &
    STORAGE
    COMPUTATION &
    ANALYSIS
    COLLABORATION &
    SHARING
    AMAZON S3
    AMAZON DYNAMODB
    AMAZON GLACIER
    AMAZON RDS
    AMAZON CC2
    AMAZON HS1
    AMAZON CR1
    AWS DATA PIPELINE
    AMAZON SWF
    AWS CLOUDFORMATION
    AWS IMPORT/EXPORT
    AMAZON REDSHIFT
    AMAZON ELASTIC MAPREDUCE
    AMAZON CG1

    View Slide

  26. Managed NOSQL DATASTORE

    View Slide

  27. Virtually UNLIMITED
    throughput and scale

    View Slide

  28. Single digit MILLISECOND latencies

    View Slide

  29. Running on SOLID STATE drives

    View Slide

  30. Storing data with DURABILITY
    across data centers and availability zones

    View Slide

  31. ZERO ADMIN

    View Slide

  32. Store KEYS & VALUES
    without requiring a schema

    View Slide

  33. KEYS & VALUES
    AMAZON DYNAMODB
    ORDER ID DATE
    ORDER TOTAL
    MERCHANT

    View Slide

  34. KEYS & VALUES
    AMAZON DYNAMODB
    ORDER ID DATE
    ORDER TOTAL
    MERCHANT
    Hash key

    View Slide

  35. KEYS & VALUES
    AMAZON DYNAMODB
    ORDER ID DATE
    ORDER TOTAL
    MERCHANT
    Hash key Range key

    View Slide

  36. KEYS & VALUES
    AMAZON DYNAMODB
    ORDER ID DATE
    ORDER TOTAL
    MERCHANT
    Hash key Range key
    Secondary index

    View Slide

  37. KEYS & VALUES
    AMAZON DYNAMODB
    ORDER ID DATE
    ORDER TOTAL
    MERCHANT
    Hash key Range key
    Secondary index
    Projected attribute

    View Slide
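
The key layout built up over the preceding slides can be sketched as a table definition. This is a minimal sketch, not taken from the deck: the table name `orders`, the attribute names and types, the index name `by_order_total`, the capacity numbers, and the choice of which attribute is indexed and which is projected are all assumptions for illustration. The dictionary follows the parameter shape accepted by boto3's `create_table`; nothing here calls AWS.

```python
# Hypothetical order table. All names, types and capacity values are assumptions.
def order_table_definition(read_units=10, write_units=5):
    """Build CreateTable parameters with a hash key, a range key and a
    local secondary index that projects one extra attribute."""
    return {
        "TableName": "orders",
        "AttributeDefinitions": [
            {"AttributeName": "order_id", "AttributeType": "S"},
            {"AttributeName": "date", "AttributeType": "S"},
            {"AttributeName": "order_total", "AttributeType": "N"},
        ],
        # Hash key picks the partition, range key orders items within it.
        "KeySchema": [
            {"AttributeName": "order_id", "KeyType": "HASH"},
            {"AttributeName": "date", "KeyType": "RANGE"},
        ],
        # Secondary index: same hash key, alternative range key.
        "LocalSecondaryIndexes": [
            {
                "IndexName": "by_order_total",
                "KeySchema": [
                    {"AttributeName": "order_id", "KeyType": "HASH"},
                    {"AttributeName": "order_total", "KeyType": "RANGE"},
                ],
                # Projected attribute: copied into the index for cheap reads.
                "Projection": {
                    "ProjectionType": "INCLUDE",
                    "NonKeyAttributes": ["merchant"],
                },
            }
        ],
        "ProvisionedThroughput": {
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    }


params = order_table_definition()
# With boto3 installed and credentials configured, this could be sent with
# boto3.client("dynamodb").create_table(**params)
```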

  38. API
    AMAZON DYNAMODB
    CreateTable
    UpdateTable
    DeleteTable
    DescribeTable
    ListTables
    Query
    Scan
    PutItem
    GetItem
    UpdateItem
    DeleteItem
    BatchGetItem
    BatchWriteItem

    View Slide

  39. READS, WRITES, UPDATES
    AMAZON DYNAMODB
    Item level transactions only.
    Conditional and atomic updates.
    Counts. Top/bottom n values.
    Results paged to 1MB in size.

    View Slide

  40. Provisioned THROUGHPUT

    View Slide

  41. PROVISIONED THROUGHPUT
    AMAZON DYNAMODB
    Provision the IO your application needs.
    Pay per unit of provisioned capacity.
    Consistent predictable performance,
    irrespective of scale.
    Designed for uniform workload.

    View Slide

  42. YOUR APP
    DYNAMODB

    View Slide

  43. YOUR APP
    DYNAMODB
    READ THROUGHPUT

    View Slide

  44. READ THROUGHPUT
    AMAZON DYNAMODB
    IO per 4 KB item.
    Strong and eventual consistency.
    Mix and match consistency.

    View Slide
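
The read throughput rule on this slide can be turned into a small estimate. This is a sketch under stated assumptions: the function name and the example workload (6,000-byte items, 100 reads per second) are hypothetical, and the rule that an eventually consistent read costs half of a strongly consistent one comes from the DynamoDB documentation rather than from the slide.

```python
import math


def read_capacity_units(item_size_bytes, reads_per_second, strongly_consistent=True):
    """Estimate provisioned read units: one unit covers one strongly
    consistent read per second of an item up to 4 KB, and an eventually
    consistent read costs half as much."""
    io_per_read = math.ceil(item_size_bytes / 4096)
    units = io_per_read * reads_per_second
    if not strongly_consistent:
        units = units / 2
    return math.ceil(units)


# Hypothetical workload: a 6,000-byte item spans two 4 KB blocks.
strong = read_capacity_units(6000, 100)
eventual = read_capacity_units(6000, 100, strongly_consistent=False)
```

Mixing consistency levels per request, as the slide suggests, means the cheaper eventual reads can be used wherever slightly stale data is acceptable.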

  45. YOUR APP
    DYNAMODB
    READ THROUGHPUT WRITE THROUGHPUT

    View Slide

  46. WRITE THROUGHPUT
    AMAZON DYNAMODB
    IO per 1 KB item.
    Atomic increment and decrement.
    Optimistic concurrency control.

    View Slide
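
Optimistic concurrency control can be illustrated without any AWS dependency. The sketch below is an in-memory stand-in, not a DynamoDB client: `TinyStore`, `put_if_version`, `add_vote` and the `version` attribute are hypothetical names. It shows the general pattern of a conditional write that succeeds only if the item has not changed since it was read.

```python
class ConditionFailed(Exception):
    """Raised when the stored version differs from the expected one."""


class TinyStore:
    """In-memory stand-in for a key-value table (not a DynamoDB client)."""

    def __init__(self):
        self._items = {}

    def get(self, key):
        # Return a copy so callers cannot mutate stored state directly.
        return dict(self._items.get(key, {"version": 0, "votes": 0}))

    def put_if_version(self, key, item, expected_version):
        # Conditional write: refuse if someone else wrote in the meantime.
        current = self._items.get(key, {"version": 0})
        if current["version"] != expected_version:
            raise ConditionFailed(key)
        self._items[key] = dict(item, version=expected_version + 1)


def add_vote(store, key, retries=3):
    """Read, modify, and write back only if the version is unchanged."""
    for _ in range(retries):
        item = store.get(key)
        item["votes"] += 1
        try:
            store.put_if_version(key, item, item["version"])
            return True
        except ConditionFailed:
            continue  # somebody else won the race: re-read and retry
    return False
```

No lock is held between the read and the write, so readers never block; the cost is an occasional retry when two writers collide.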

  47. YOUR APP
    DYNAMODB
    READ THROUGHPUT WRITE THROUGHPUT

    View Slide

  48. YOUR APP
    DYNAMODB
    READ THROUGHPUT WRITE THROUGHPUT

    View Slide

  49. YOUR APP
    DYNAMODB
    READ THROUGHPUT WRITE THROUGHPUT
    14.2% 14.2% 14.2% 14.2% 14.2% 14.2% 14.2%
    THROUGHPUT

    View Slide

  50. YOUR APP
    DYNAMODB
    READ THROUGHPUT WRITE THROUGHPUT
    14.2% 14.2% 14.2% 14.2% 14.2% 14.2% 14.2%
    THROUGHPUT
    KEY ACCESS 14.2% 14.2% 14.2% 14.2% 14.2% 14.2% 14.2%

    View Slide

  51. YOUR APP
    DYNAMODB
    READ THROUGHPUT WRITE THROUGHPUT
    14.2% 14.2% 14.2% 14.2% 14.2% 14.2% 14.2%
    THROUGHPUT
    KEY ACCESS 0% 50% 0% 50% 0% 0% 0%

    View Slide
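
The effect of skewed key access on usable throughput can be worked through with simple arithmetic. This is a sketch under stated assumptions: seven partitions are inferred from the seven 14.2% shares on the slide, the 700 provisioned units are a made-up figure, and the function name is hypothetical.

```python
def sustainable_throughput(provisioned_units, access_share_by_partition):
    """Provisioned throughput is split evenly across partitions, so the
    busiest partition reaches its ceiling first and caps the whole table."""
    per_partition = provisioned_units / len(access_share_by_partition)
    hottest_share = max(access_share_by_partition)
    return per_partition / hottest_share


# Uniform key access: every partition receives 1/7 of the requests.
uniform = sustainable_throughput(700, [1 / 7] * 7)

# Skewed key access from the slide: two partitions take 50% each.
skewed = sustainable_throughput(700, [0, 0.5, 0, 0.5, 0, 0, 0])
```

With the skewed pattern only about 200 of the 700 provisioned units are usable before requests start being throttled, which is why the service is described as designed for a uniform workload.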

  52. View Slide

  53. GENERATION
    COLLECTION &
    STORAGE
    COMPUTATION &
    ANALYSIS
    COLLABORATION &
    SHARING
    AMAZON S3
    AMAZON DYNAMODB
    AMAZON GLACIER
    AMAZON RDS
    AMAZON CC2
    AMAZON HS1
    AMAZON CR1
    AWS DATA PIPELINE
    AMAZON SWF
    AWS CLOUDFORMATION
    AWS IMPORT/EXPORT
    AMAZON REDSHIFT
    AMAZON ELASTIC MAPREDUCE
    AMAZON CG1

    View Slide

  54. GENERATION
    COLLECTION &
    STORAGE
    COMPUTATION &
    ANALYSIS
    COLLABORATION &
    SHARING
    AMAZON S3
    AMAZON DYNAMODB
    AMAZON GLACIER
    AMAZON RDS
    AMAZON CC2
    AMAZON HS1
    AMAZON CR1
    AWS DATA PIPELINE
    AMAZON SWF
    AWS CLOUDFORMATION
    AWS IMPORT/EXPORT
    AMAZON REDSHIFT
    AMAZON ELASTIC MAPREDUCE
    AMAZON CG1

    View Slide

  55. Managed HADOOP CLUSTERS

    View Slide

  56. Hadoop with ELASTICITY

    View Slide

  57. Input data
    S3, DynamoDB, Redshift

    View Slide

  58. Elastic MapReduce
    Code
    Input data
    S3, DynamoDB, Redshift

    View Slide

  59. Elastic MapReduce
    Code
    Name node
    Input data
    S3, DynamoDB, Redshift

    View Slide

  60. Elastic MapReduce
    Code
    Name node
    Input data
    S3, DynamoDB, Redshift
    Elastic cluster
    S3/HDFS

    View Slide

  61. Elastic MapReduce
    Code
    Name node
    Input data
    S3, DynamoDB, Redshift
    Elastic cluster
    S3/HDFS
    Queries + BI
    Via JDBC, Pig, Hive

    View Slide

  62. Elastic MapReduce
    Code
    Name node
    Input data
    S3, DynamoDB, Redshift
    Elastic cluster
    S3/HDFS
    Queries + BI
    Via JDBC, Pig, Hive
    Output

    View Slide

  63. Output
    Input data
    S3, DynamoDB, Redshift

    View Slide

  64. 10 HOURS
    ELASTIC MAPREDUCE

    View Slide

  65. 6 HOURS
    ELASTIC MAPREDUCE

    View Slide

  66. PEAK CAPACITY
    ELASTIC MAPREDUCE

    View Slide

  67. HADOOP ALL THE WAY DOWN
    ELASTIC MAPREDUCE
    Pig, Hive, Mesos, Avro, Spark, Shark
    MapR, Informatica
    Mahout, Nutch, Flume
    Accumulo, Cascading, Oozie
    HBase, Sqoop

    View Slide

  68. Built for SPOT

    View Slide

  69. On-demand instance: $0.50 per hour
    Spot price today: $0.0350 per hour, 7% of the on-demand price.
    “Overclock” by 14x

    View Slide
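
The 7% and 14x figures follow directly from the two prices on the slide. A minimal sketch of the arithmetic, with a hypothetical function name:

```python
def spot_leverage(on_demand_price, spot_price):
    """Return the spot price as a fraction of on-demand, and how many
    spot instance-hours the price of one on-demand hour buys."""
    return spot_price / on_demand_price, on_demand_price / spot_price


# Prices from the slide, in dollars per instance-hour.
fraction, multiplier = spot_leverage(0.50, 0.035)
print(f"{fraction:.0%} of on-demand, about {multiplier:.1f}x the capacity")
```

Read this way, "overclocking" means running roughly fourteen times as many instances for the same spend, at whatever the spot price happens to be at the time.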

  70. View Slide

  71. GENERATION
    COLLECTION &
    STORAGE
    COMPUTATION &
    ANALYSIS
    COLLABORATION &
    SHARING
    AMAZON S3
    AMAZON DYNAMODB
    AMAZON GLACIER
    AMAZON RDS
    AMAZON CC2
    AMAZON HS1
    AMAZON CR1
    AWS DATA PIPELINE
    AMAZON SWF
    AWS CLOUDFORMATION
    AWS IMPORT/EXPORT
    AMAZON REDSHIFT
    AMAZON ELASTIC MAPREDUCE
    AMAZON CG1

    View Slide

  72. GENERATION
    COLLECTION &
    STORAGE
    COMPUTATION &
    ANALYSIS
    COLLABORATION &
    SHARING
    AMAZON S3
    AMAZON DYNAMODB
    AMAZON GLACIER
    AMAZON RDS
    AMAZON CC2
    AMAZON HS1
    AMAZON CR1
    AWS DATA PIPELINE
    AMAZON SWF
    AWS CLOUDFORMATION
    AWS IMPORT/EXPORT
    AMAZON REDSHIFT
    AMAZON ELASTIC MAPREDUCE
    AMAZON CG1

    View Slide

  73. Managed, petabyte scale DATA WAREHOUSE

    View Slide

  74. Scale from 100s GB to 1.6PB

    View Slide

  75. COLUMNAR STORE
    REDSHIFT
    Designed for columnar access.
    Automatic data compression.
    Large block size.
    Best practices for data loading.
    Continual incremental backup to S3.

    View Slide

  76. PARALLEL PROCESSING
    REDSHIFT
    Fully bisectional 10 gigE network.
    128GB RAM.
    Xeon E5 platform.
    16TB across 24 spindles.

    View Slide

  77. LEADER
    COMPUTE
    COMPUTE
    COMPUTE
    S3
    BI TOOLS

    View Slide

  78. LEADER
    COMPUTE
    COMPUTE
    COMPUTE
    S3
    BI TOOLS
    READ ONLY
    LEADER
    COMPUTE
    COMPUTE
    COMPUTE
    S3
    COMPUTE
    COMPUTE

    View Slide

  79. $999
    PER TB PER YEAR

    View Slide

  80. HS1 ON EC2
    2.4 GB/s of 2MiB sequential reads.
    2.6 GB/s for sequential writes.

    View Slide

  81. HI1 ON EC2
    2 x 1TB SSDs
    4 KB random reads: 120k IOPS
    4 KB random writes: 10k - 80k IOPS

    View Slide

  82. View Slide

  83. Technologies and techniques
    for working productively with
    data, at any scale.

    View Slide

  84. http://headrush.typepad.com/creating_passionate_users/2005/10/getting_users_p.html

    View Slide

  85. Ease of use is a PRE-REQUISITE

    View Slide

  86. Move data to the COMPUTE

    View Slide

  87. Move tools to the DATA

    View Slide

  88. Place data where it can be CONSUMED
    by those tools

    View Slide

  89. Expose data at the RIGHT LEVEL

    View Slide

  90. View Slide

  91. View Slide

  92. View Slide

  93. View Slide

  94. S3
    DYNAMODB EMR EMR REDSHIFT DYNAMODB
    DATA PIPELINE

    View Slide

  95. S3
    DYNAMODB EMR EMR REDSHIFT
    DATA PIPELINE
    DYNAMODB

    View Slide

  96. create external table items_db
      (id string, votes bigint, views bigint)
    stored by 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
    tblproperties (
      "dynamodb.table.name" = "items",
      "dynamodb.column.mapping" = "id:id,votes:votes,views:views"
    );

    View Slide

  97. select id, votes, views
    from items_db
    order by views desc;

    View Slide

  98. CREATE EXTERNAL TABLE orders_s3_new_export (
      order_id string, customer_id string, order_date int, total double )
    PARTITIONED BY (year string, month string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://export_bucket';
    INSERT OVERWRITE TABLE orders_s3_new_export
    PARTITION (year='2012', month='01')
    SELECT * FROM orders_ddb_2012_01;

    View Slide

  99. S3
    DYNAMODB EMR EMR REDSHIFT
    DATA PIPELINE
    DYNAMODB

    View Slide

  100. Fast, optimized data movement with S3DISTCOPY

    View Slide

  101. Work with S3 as HDFS

    View Slide

  102. S3
    DYNAMODB EMR EMR REDSHIFT DYNAMODB
    DATA PIPELINE

    View Slide

  103. Read and load data in parallel with COPY

    View Slide

  104. Use READRATIO
    to limit throughput consumption

    View Slide
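
A Redshift COPY from DynamoDB with a READRATIO limit might look like the statement built below. This is a sketch under stated assumptions: the helper name, the table names and the credentials placeholder are hypothetical, and the statement is only assembled as a string, never executed. READRATIO is given as a percentage of the DynamoDB table's provisioned read throughput.

```python
def copy_from_dynamodb(target_table, dynamodb_table, credentials, read_ratio=50):
    """Build a Redshift COPY statement that loads from a DynamoDB table.
    READRATIO is the percentage of the source table's provisioned read
    throughput that the load is allowed to consume."""
    if read_ratio <= 0:
        raise ValueError("read_ratio must be a positive percentage")
    return (
        f"COPY {target_table} FROM 'dynamodb://{dynamodb_table}' "
        f"CREDENTIALS '{credentials}' READRATIO {int(read_ratio)};"
    )


# Placeholder credentials string: supply real values outside source control.
statement = copy_from_dynamodb("items", "items", "<aws-credentials>", read_ratio=25)
```

A low ratio leaves most of the provisioned reads for the live application while the load runs; a higher one finishes the load sooner at the application's expense.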

  105. S3
    DYNAMODB EMR EMR REDSHIFT
    DATA PIPELINE
    DYNAMODB

    View Slide

  106. Reliable, scheduled DATA INTENSIVE workflows

    View Slide

  107. INPUT DATA
    ACTIVITY
    OUTPUT

    View Slide

  108. INPUT DATA
    ACTIVITY
    OUTPUT
    Precondition checks
    Failure and delay notifications

    View Slide
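
The input, activity, output shape of a workflow step, with precondition checks and failure notifications, can be sketched in plain Python. This is an illustration of the pattern only, not the AWS Data Pipeline API: `run_activity` and every name below are hypothetical, and delay notifications are left out for brevity.

```python
def run_activity(read_input, activity, write_output, preconditions=(), notify=print):
    """Run one workflow step: check preconditions, transform the input,
    write the output, and send a notification on any failure."""
    for check in preconditions:
        if not check():
            notify(f"precondition failed: {check.__name__}")
            return False
    try:
        write_output(activity(read_input()))
    except Exception as error:
        notify(f"activity failed: {error}")
        return False
    return True


def input_exists():
    # Stand-in for a real check, such as "does the input file exist?"
    return True


results, messages = [], []
ok = run_activity(
    read_input=lambda: [3, 1, 2],
    activity=sorted,
    write_output=results.append,
    preconditions=[input_exists],
    notify=messages.append,
)
```

Checking preconditions before the activity starts keeps a missing input from turning into a half-written output.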

  109. View Slide

  110. S3
    DYNAMODB EMR EMR REDSHIFT DYNAMODB
    DATA PIPELINE

    View Slide

  111. Amazon S3
    http://www.youtube.com/watch?v=oGcZ7WVx6EI
    Legacy data warehousing
    Cassandra Aegisthus Hadoop, Hive, Pig

    View Slide

  112. Amazon S3
    http://www.youtube.com/watch?v=oGcZ7WVx6EI
    Legacy data warehousing
    Cassandra Aegisthus Hadoop, Hive, Pig
    Microstrategy
    Sting
    R

    View Slide

  113. View Slide

  114. View Slide

  115. 98% time saved for clinical trial simulations
                                                           Internal System   AWS
    Individual Clinical Trial Simulation Run Time (Min)    56                56
    Total Number of Clinical Trial Simulations             2000              2000
    No. Servers                                            2                 256
    No. CPUs                                               32                2048
    Total Analysis Run Time (hr)                           60                1.2
    Cost                                                   ??                $336

    View Slide
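
The 98% headline can be checked against the figures in the table. A minimal sketch of the arithmetic; the function and variable names are hypothetical, and the "ideal" run time assumes perfect scaling with one simulation per CPU at a time, which the slide does not claim.

```python
def time_saved(baseline_hours, new_hours):
    """Fraction of wall-clock time saved relative to the baseline."""
    return (baseline_hours - new_hours) / baseline_hours


# Total analysis run time from the table: 60 hours internally, 1.2 on AWS.
saved = time_saved(60, 1.2)

# Total work: 2000 simulations at 56 minutes each, expressed in CPU-hours.
cpu_hours_of_work = 2000 * 56 / 60

# Lower bound on run time with 2048 CPUs under perfect scaling.
ideal_hours_on_aws = cpu_hours_of_work / 2048
```

The measured 1.2 hours sits close to the roughly 0.9 hour lower bound, so most of the speed-up comes from running the independent simulations side by side.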

  116. Reduced burden on pediatric subjects
                                       Traditional Design   Design Optimized Using Clinical Trial Simulation
    # of subjects                      60                   40
    # of blood samples per subject     12                   5
    Length of stay per subject         72 hours             26 hours
    Length of study                    2.5 years            1.7 years
    Total study cost                   $700K                $250K
    Length and cost projected based on historical data in pediatric subjects

    View Slide

  117. View Slide