Slide 1

Slide 1 text

Building a DATA PLATFORM, a presentation by DR. MATT WOOD

Slide 2

Slide 2 text

Hello, and THANK YOU

Slide 3

Slide 3 text

SEVEN years young

Slide 4

Slide 4 text

Broad and deep SERVICES to support virtually any workload

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

Add enough server capacity EVERY DAY to power amazon.com in 2003

Slide 7

Slide 7 text

Computing delivered as a UTILITY

Slide 8

Slide 8 text

Take advantage of the ECONOMIES of scale to lower prices

Slide 9

Slide 9 text

2 TRILLION OBJECTS (chart, quarterly axis from Q4 2006 to Q1 2013)

Slide 10

Slide 10 text

5.5 MILLION HADOOP CLUSTERS (chart, time axis from 5/22/2010 to 4/6/2013)

Slide 11

Slide 11 text

GENERATION COLLECTION & STORAGE COMPUTATION & ANALYTICS COLLABORATION & SHARING

Slide 12

Slide 12 text

GENERATION COLLECTION & STORAGE COMPUTATION & ANALYTICS COLLABORATION & SHARING Lower cost, Higher throughput

Slide 13

Slide 13 text

GENERATION COLLECTION & STORAGE COMPUTATION & ANALYTICS COLLABORATION & SHARING Lower cost, Higher throughput

Slide 14

Slide 14 text

GENERATION COLLECTION & STORAGE COMPUTATION & ANALYTICS COLLABORATION & SHARING Lower cost, Higher throughput Highly constrained

Slide 15

Slide 15 text

The Data Analysis Gap
Chart: data volume over time (1990, 2000, 2010, 2020). Labels: Enterprise Data, Data in Warehouse, Generated data, Available for analysis.
Sources: Gartner, User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC, Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares

Slide 16

Slide 16 text

Utility computing REMOVES resource constraints

Slide 17

Slide 17 text

GENERATION COLLECTION & STORAGE COMPUTATION & ANALYTICS COLLABORATION & SHARING

Slide 18

Slide 18 text

Technologies and techniques for working productively with data, at any scale.

Slide 19

Slide 19 text

Technologies and techniques for working productively with data, at any scale.

Slide 20

Slide 20 text

GENERATION COLLECTION & STORAGE COMPUTATION & ANALYSIS COLLABORATION & SHARING

Slide 21

Slide 21 text

GENERATION COLLECTION & STORAGE COMPUTATION & ANALYSIS COLLABORATION & SHARING AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AWS IMPORT/EXPORT

Slide 22

Slide 22 text

GENERATION COLLECTION & STORAGE COMPUTATION & ANALYSIS COLLABORATION & SHARING AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AMAZON CC2 AMAZON HS1 AMAZON CR1 AWS IMPORT/EXPORT AMAZON CG1

Slide 23

Slide 23 text

GENERATION COLLECTION & STORAGE COMPUTATION & ANALYSIS COLLABORATION & SHARING AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AMAZON CC2 AMAZON HS1 AMAZON CR1 AWS DATA PIPELINE AMAZON SWF AWS CLOUDFORMATION AWS IMPORT/EXPORT AMAZON CG1

Slide 24

Slide 24 text

GENERATION COLLECTION & STORAGE COMPUTATION & ANALYSIS COLLABORATION & SHARING AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AMAZON CC2 AMAZON HS1 AMAZON CR1 AWS DATA PIPELINE AMAZON SWF AWS CLOUDFORMATION AWS IMPORT/EXPORT AMAZON REDSHIFT AMAZON ELASTIC MAPREDUCE AMAZON CG1

Slide 25

Slide 25 text

GENERATION COLLECTION & STORAGE COMPUTATION & ANALYSIS COLLABORATION & SHARING AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AMAZON CC2 AMAZON HS1 AMAZON CR1 AWS DATA PIPELINE AMAZON SWF AWS CLOUDFORMATION AWS IMPORT/EXPORT AMAZON REDSHIFT AMAZON ELASTIC MAPREDUCE AMAZON CG1

Slide 26

Slide 26 text

Managed NOSQL DATASTORE

Slide 27

Slide 27 text

Virtually UNLIMITED throughput and scale

Slide 28

Slide 28 text

Single digit MILLISECOND latencies

Slide 29

Slide 29 text

Running on SOLID STATE drives

Slide 30

Slide 30 text

Storing data with DURABILITY across data centers and availability zones

Slide 31

Slide 31 text

ZERO ADMIN

Slide 32

Slide 32 text

Store KEYS & VALUES without requiring a schema

Slide 33

Slide 33 text

KEYS & VALUES AMAZON DYNAMODB ORDER ID DATE ORDER TOTAL MERCHANT

Slide 34

Slide 34 text

KEYS & VALUES AMAZON DYNAMODB ORDER ID DATE ORDER TOTAL MERCHANT Hash key

Slide 35

Slide 35 text

KEYS & VALUES AMAZON DYNAMODB ORDER ID DATE ORDER TOTAL MERCHANT Hash key Range key

Slide 36

Slide 36 text

KEYS & VALUES AMAZON DYNAMODB ORDER ID DATE ORDER TOTAL MERCHANT Hash key Range key Secondary index

Slide 37

Slide 37 text

KEYS & VALUES AMAZON DYNAMODB ORDER ID DATE ORDER TOTAL MERCHANT Hash key Range key Secondary index Projected attribute
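The hash key and range key labels on these slides can be illustrated with a small in-memory model. This is only a sketch, not the DynamoDB API: the class `KeyedTable` and its methods are hypothetical, and it assumes ORDER ID is the hash key and DATE is the range key, which the slide text does not state explicitly. The hash key selects a group of items, and the range key keeps that group sorted so a range of values can be read with a binary search.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class KeyedTable:
    """Toy in-memory table addressed by a hash key plus a sorted range key."""

    def __init__(self):
        # For each hash key: a sorted list of range keys and a parallel list of items.
        self._range_keys = defaultdict(list)
        self._items = defaultdict(list)

    def put(self, hash_key, range_key, attributes):
        """Insert an item, or overwrite it if the full key already exists."""
        keys = self._range_keys[hash_key]
        items = self._items[hash_key]
        pos = bisect_left(keys, range_key)
        if pos < len(keys) and keys[pos] == range_key:
            items[pos] = attributes
        else:
            keys.insert(pos, range_key)
            items.insert(pos, attributes)

    def query(self, hash_key, low, high):
        """Return items for one hash key whose range key lies in [low, high]."""
        keys = self._range_keys.get(hash_key, [])
        items = self._items.get(hash_key, [])
        return items[bisect_left(keys, low):bisect_right(keys, high)]


# Hypothetical data: ISO date strings sort in calendar order.
table = KeyedTable()
table.put("order-1", "2013-01-05", {"total": 10.0, "merchant": "A"})
table.put("order-1", "2013-02-10", {"total": 25.0, "merchant": "B"})
table.put("order-1", "2013-03-15", {"total": 40.0, "merchant": "A"})
table.put("order-2", "2013-02-01", {"total": 99.0, "merchant": "C"})

february_onwards = table.query("order-1", "2013-02-01", "2013-03-31")
```

Here `february_onwards` holds the two `order-1` items dated in February and March; the `order-2` item is never touched because it lives under a different hash key.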

Slide 38

Slide 38 text

API AMAZON DYNAMODB CreateTable UpdateTable DeleteTable DescribeTable ListTables Query Scan PutItem GetItem UpdateItem DeleteItem BatchGetItem BatchWriteItem

Slide 39

Slide 39 text

READS, WRITES, UPDATES AMAZON DYNAMODB Item level transactions only. Conditional and atomic updates. Counts. Top/bottom n values. Results paged to 1MB in size.
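As a rough illustration of what a conditional update buys you, the sketch below models a write that only succeeds if the item is still at the version the caller last read. It is an in-memory stand-in, not DynamoDB client code: `VersionedStore`, `put_if_version`, `ConditionFailed` and the `version` attribute are all hypothetical names chosen for this example.

```python
class ConditionFailed(Exception):
    """Raised when the stored version does not match the expected one."""


class VersionedStore:
    """Toy item store whose writes are conditional on a version number."""

    def __init__(self):
        self._items = {}

    def get(self, key):
        # Return a copy so callers cannot mutate stored state directly.
        return dict(self._items[key])

    def put_if_version(self, key, item, expected_version):
        """Write item only if the stored version equals expected_version.

        Use expected_version=None to create an item that must not exist yet.
        """
        current = self._items.get(key)
        current_version = current["version"] if current else None
        if current_version != expected_version:
            raise ConditionFailed(f"{key}: expected {expected_version}, found {current_version}")
        stored = dict(item)
        stored["version"] = (expected_version or 0) + 1
        self._items[key] = stored


def increment_views(store, key, retries=3):
    """Read, modify, and conditionally write back, retrying on conflicts."""
    for _ in range(retries):
        item = store.get(key)
        version = item["version"]
        item["views"] += 1
        try:
            store.put_if_version(key, item, version)
            return item["views"]
        except ConditionFailed:
            continue  # someone else wrote first; re-read and try again
    raise ConditionFailed(f"{key}: gave up after {retries} attempts")


store = VersionedStore()
store.put_if_version("item-1", {"views": 0}, None)
views_after = increment_views(store, "item-1")
```

The design choice is that no lock is held between the read and the write; a conflicting writer is detected at write time and the loser simply retries.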

Slide 40

Slide 40 text

Provisioned THROUGHPUT

Slide 41

Slide 41 text

PROVISIONED THROUGHPUT AMAZON DYNAMODB Provision the IO your application needs. Pay per unit of provisioned capacity. Consistent predictable performance, irrespective of scale. Designed for uniform workload.

Slide 42

Slide 42 text

YOUR APP DYNAMODB

Slide 43

Slide 43 text

YOUR APP DYNAMODB READ THROUGHPUT

Slide 44

Slide 44 text

READ THROUGHPUT AMAZON DYNAMODB IO per 4kb item. Strong and eventual consistency. Mix and match consistency.

Slide 45

Slide 45 text

YOUR APP DYNAMODB READ THROUGHPUT WRITE THROUGHPUT

Slide 46

Slide 46 text

WRITE THROUGHPUT AMAZON DYNAMODB IO per 1kb item. Atomic increment and decrement. Optimistic concurrency control.
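The read and write unit sizes on these slides (4kb per read IO, 1kb per write IO) can be turned into a rough capacity estimate. The sketch below is a back-of-the-envelope calculation, not a pricing tool: the function names are hypothetical, and the assumption that an eventually consistent read costs half of a strongly consistent one is not stated on the slides.

```python
import math

READ_UNIT_BYTES = 4 * 1024   # slide: read IO per 4kb item
WRITE_UNIT_BYTES = 1 * 1024  # slide: write IO per 1kb item


def read_units(item_bytes, reads_per_second, strongly_consistent=True):
    """Estimate read capacity needed for a steady read rate."""
    units_per_read = math.ceil(item_bytes / READ_UNIT_BYTES)
    total = units_per_read * reads_per_second
    if not strongly_consistent:
        # Assumption (not on the slides): eventual consistency halves the cost.
        total = total / 2
    return math.ceil(total)


def write_units(item_bytes, writes_per_second):
    """Estimate write capacity needed for a steady write rate."""
    return math.ceil(item_bytes / WRITE_UNIT_BYTES) * writes_per_second


# Hypothetical workload: 3,000-byte items, 100 reads/s and 20 writes/s.
strong_reads = read_units(3000, 100)
eventual_reads = read_units(3000, 100, strongly_consistent=False)
writes = write_units(3000, 20)
```

Because item sizes round up to whole units, a 3,000-byte item costs one read unit but three write units, which is why write-heavy workloads tend to dominate the provisioning estimate.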

Slide 47

Slide 47 text

YOUR APP DYNAMODB READ THROUGHPUT WRITE THROUGHPUT

Slide 48

Slide 48 text

YOUR APP DYNAMODB READ THROUGHPUT WRITE THROUGHPUT

Slide 49

Slide 49 text

YOUR APP DYNAMODB READ THROUGHPUT WRITE THROUGHPUT 14.2% 14.2% 14.2% 14.2% 14.2% 14.2% 14.2% THROUGHPUT

Slide 50

Slide 50 text

YOUR APP DYNAMODB READ THROUGHPUT WRITE THROUGHPUT 14.2% 14.2% 14.2% 14.2% 14.2% 14.2% 14.2% THROUGHPUT KEY ACCESS 14.2% 14.2% 14.2% 14.2% 14.2% 14.2% 14.2%

Slide 51

Slide 51 text

YOUR APP DYNAMODB READ THROUGHPUT WRITE THROUGHPUT 14.2% 14.2% 14.2% 14.2% 14.2% 14.2% 14.2% THROUGHPUT KEY ACCESS 0% 50% 0% 50% 0% 0% 0%
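The contrast between an even throughput split (14.2% each) and skewed key access (50% on two keys) can be quantified with a simple model. This is a sketch under stated assumptions: provisioned throughput is divided equally across partitions, a partition cannot borrow unused capacity from its neighbours, and the function name is hypothetical.

```python
def sustainable_throughput(provisioned, access_shares):
    """Highest total request rate before the busiest partition is throttled.

    provisioned   -- total provisioned throughput for the table
    access_shares -- fraction of requests that land on each partition
    """
    per_partition = provisioned / len(access_shares)
    hottest = max(access_shares)
    if hottest == 0:
        return 0.0
    # The busiest partition hits its slice first; cap at the provisioned total.
    return min(provisioned, per_partition / hottest)


uniform = [1 / 7] * 7
skewed = [0, 0.5, 0, 0.5, 0, 0, 0]  # access pattern from the slide

usable_uniform = sustainable_throughput(700, uniform)
usable_skewed = sustainable_throughput(700, skewed)
```

Under these assumptions, 700 provisioned units deliver roughly all 700 with uniform access but only 200 with the skewed pattern, which is the point behind "designed for uniform workload".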

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

GENERATION COLLECTION & STORAGE COMPUTATION & ANALYSIS COLLABORATION & SHARING AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AMAZON CC2 AMAZON HS1 AMAZON CR1 AWS DATA PIPELINE AMAZON SWF AWS CLOUDFORMATION AWS IMPORT/EXPORT AMAZON REDSHIFT AMAZON ELASTIC MAPREDUCE AMAZON CG1

Slide 54

Slide 54 text

GENERATION COLLECTION & STORAGE COMPUTATION & ANALYSIS COLLABORATION & SHARING AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AMAZON CC2 AMAZON HS1 AMAZON CR1 AWS DATA PIPELINE AMAZON SWF AWS CLOUDFORMATION AWS IMPORT/EXPORT AMAZON REDSHIFT AMAZON ELASTIC MAPREDUCE AMAZON CG1

Slide 55

Slide 55 text

Managed HADOOP CLUSTERS

Slide 56

Slide 56 text

Hadoop with ELASTICITY

Slide 57

Slide 57 text

Input data S3, DynamoDB, Redshift

Slide 58

Slide 58 text

Elastic MapReduce Code Input data S3, DynamoDB, Redshift

Slide 59

Slide 59 text

Elastic MapReduce Code Name node Input data S3, DynamoDB, Redshift

Slide 60

Slide 60 text

Elastic MapReduce Code Name node Input data Elastic cluster S3, DynamoDB, Redshift S3/HDFS

Slide 61

Slide 61 text

Elastic MapReduce Code Name node Input data S3/HDFS Queries + BI Via JDBC, Pig, Hive S3, DynamoDB, Redshift Elastic cluster

Slide 62

Slide 62 text

Elastic MapReduce Code Name node Output Input data Queries + BI Via JDBC, Pig, Hive S3, DynamoDB, Redshift Elastic cluster S3/HDFS

Slide 63

Slide 63 text

Output Input data S3, DynamoDB, Redshift

Slide 64

Slide 64 text

10 HOURS ELASTIC MAPREDUCE

Slide 65

Slide 65 text

6 HOURS ELASTIC MAPREDUCE

Slide 66

Slide 66 text

PEAK CAPACITY ELASTIC MAPREDUCE

Slide 67

Slide 67 text

HADOOP ALL THE WAY DOWN ELASTIC MAPREDUCE Pig, Hive, Mesos, Avro, Spark, Shark, MapR, Informatica, Mahout, Nutch, Flume, Accumulo, Cascading, Oozie, HBase, Sqoop

Slide 68

Slide 68 text

Built for SPOT

Slide 69

Slide 69 text

On demand instance: $0.50 per hour. Today: $0.0350, 7% of on-demand price. “Overclock” by 14x.
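The two figures on this slide follow directly from the two prices. A minimal check, using `Decimal` to avoid floating-point noise (the function name is hypothetical):

```python
from decimal import Decimal


def spot_comparison(on_demand_price, spot_price):
    """Return (spot price as a fraction of on-demand, capacity multiplier per dollar)."""
    fraction = spot_price / on_demand_price
    multiplier = on_demand_price / spot_price
    return fraction, multiplier


fraction, multiplier = spot_comparison(Decimal("0.50"), Decimal("0.0350"))
```

Here `fraction` is 0.07 (7% of the on-demand price) and `multiplier` is a little over 14, so the same hourly budget buys about 14 times the instances at that spot price.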

Slide 70

Slide 70 text

No content

Slide 71

Slide 71 text

GENERATION COLLECTION & STORAGE COMPUTATION & ANALYSIS COLLABORATION & SHARING AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AMAZON CC2 AMAZON HS1 AMAZON CR1 AWS DATA PIPELINE AMAZON SWF AWS CLOUDFORMATION AWS IMPORT/EXPORT AMAZON REDSHIFT AMAZON ELASTIC MAPREDUCE AMAZON CG1

Slide 72

Slide 72 text

GENERATION COLLECTION & STORAGE COMPUTATION & ANALYSIS COLLABORATION & SHARING AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AMAZON CC2 AMAZON HS1 AMAZON CR1 AWS DATA PIPELINE AMAZON SWF AWS CLOUDFORMATION AWS IMPORT/EXPORT AMAZON REDSHIFT AMAZON ELASTIC MAPREDUCE AMAZON CG1

Slide 73

Slide 73 text

Managed, petabyte scale DATA WAREHOUSE

Slide 74

Slide 74 text

Scale from 100s GB to 1.6PB

Slide 75

Slide 75 text

COLUMNAR STORE REDSHIFT Designed for columnar access. Automatic data compression. Large block size. Best practices for data loading. Continual incremental backup to S3.

Slide 76

Slide 76 text

PARALLEL PROCESSING REDSHIFT Fully bisectional 10 gigE network. 128GB RAM. Xeon E5 platform. 16TB across 24 spindles.

Slide 77

Slide 77 text

LEADER COMPUTE COMPUTE COMPUTE S3 BI TOOLS

Slide 78

Slide 78 text

LEADER COMPUTE COMPUTE COMPUTE S3 BI TOOLS READ ONLY LEADER COMPUTE COMPUTE COMPUTE S3 COMPUTE COMPUTE

Slide 79

Slide 79 text

$999 PER TB PER YEAR

Slide 80

Slide 80 text

HS1 ON EC2 2.4 GB/s of 2MiB sequential reads. 2.6 GB/s for sequential writes.

Slide 81

Slide 81 text

HI1 ON EC2 2 x 1TB SSDs. 4kb random reads: 120k IOPS. 4kb random writes: 10k - 80k IOPS.

Slide 82

Slide 82 text

No content

Slide 83

Slide 83 text

Technologies and techniques for working productively with data, at any scale.

Slide 84

Slide 84 text

http://headrush.typepad.com/creating_passionate_users/2005/10/getting_users_p.html

Slide 85

Slide 85 text

Ease of use is a PRE-REQUISITE

Slide 86

Slide 86 text

Move data to the COMPUTE

Slide 87

Slide 87 text

Move tools to the DATA

Slide 88

Slide 88 text

Place data where it can be CONSUMED by those tools

Slide 89

Slide 89 text

Expose data at the RIGHT LEVEL

Slide 90

Slide 90 text

No content

Slide 91

Slide 91 text

No content

Slide 92

Slide 92 text

No content

Slide 93

Slide 93 text

No content

Slide 94

Slide 94 text

S3 DYNAMODB EMR EMR REDSHIFT DYNAMODB DATA PIPELINE

Slide 95

Slide 95 text

S3 DYNAMODB EMR EMR REDSHIFT DATA PIPELINE DYNAMODB

Slide 96

Slide 96 text

create external table items_db (id string, votes bigint, views bigint)
stored by 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
tblproperties ("dynamodb.table.name" = "items",
  "dynamodb.column.mapping" = "id:id,votes:votes,views:views");

Slide 97

Slide 97 text

select id, votes, views from items_db order by views desc;

Slide 98

Slide 98 text

CREATE EXTERNAL TABLE orders_s3_new_export (order_id string, customer_id string, order_date int, total double)
PARTITIONED BY (year string, month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://export_bucket';
INSERT OVERWRITE TABLE orders_s3_new_export PARTITION (year='2012', month='01')
SELECT * from orders_ddb_2012_01;

Slide 99

Slide 99 text

S3 DYNAMODB EMR EMR REDSHIFT DATA PIPELINE DYNAMODB

Slide 100

Slide 100 text

Fast, optimized data movement with S3DISTCOPY

Slide 101

Slide 101 text

Work with S3 as HDFS

Slide 102

Slide 102 text

S3 DYNAMODB EMR EMR REDSHIFT DYNAMODB DATA PIPELINE

Slide 103

Slide 103 text

Read and load data in parallel with COPY

Slide 104

Slide 104 text

Use READRATIO to limit throughput consumption

Slide 105

Slide 105 text

S3 DYNAMODB EMR EMR REDSHIFT DATA PIPELINE DYNAMODB

Slide 106

Slide 106 text

Reliable, scheduled DATA INTENSIVE workflows

Slide 107

Slide 107 text

INPUT DATA ACTIVITY OUTPUT

Slide 108

Slide 108 text

INPUT DATA ACTIVITY OUTPUT Precondition checks Failure and delay notifications

Slide 109

Slide 109 text

No content

Slide 110

Slide 110 text

S3 DYNAMODB EMR EMR REDSHIFT DYNAMODB DATA PIPELINE

Slide 111

Slide 111 text

Amazon S3 http://www.youtube.com/watch?v=oGcZ7WVx6EI Legacy data warehousing Cassandra Aegisthus Hadoop, Hive, Pig

Slide 112

Slide 112 text

Amazon S3 http://www.youtube.com/watch?v=oGcZ7WVx6EI Legacy data warehousing Cassandra Aegisthus Hadoop, Hive, Pig Microstrategy Sting R

Slide 113

Slide 113 text

No content

Slide 114

Slide 114 text

No content

Slide 115

Slide 115 text

98% time saved for clinical trial simulations
                                                     Internal System   AWS
Individual Clinical Trial Simulation Run Time (Min)  56                56
Total Number of Clinical Trial Simulations           2000              2000
No. Servers                                          2                 256
No. CPUs                                             32                2048
Total Analysis Run Time (hr)                         60                1.2
Cost                                                 ??                $336
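The headline figure can be checked from the table's own numbers. The sketch below also computes an idealized run time, assuming one simulation per CPU and perfect parallelism; that assumption and the function names are mine, not the slide's.

```python
def fraction_time_saved(before_hours, after_hours):
    """Share of wall-clock time saved when moving from before to after."""
    return 1 - after_hours / before_hours


def ideal_runtime_hours(simulations, minutes_each, cpus):
    """Run time if simulations spread perfectly over CPUs, one per CPU at a time."""
    return simulations * minutes_each / cpus / 60


saved = fraction_time_saved(60, 1.2)                 # reported run times
ideal_internal = ideal_runtime_hours(2000, 56, 32)   # internal system
ideal_aws = ideal_runtime_hours(2000, 56, 2048)      # AWS
```

`saved` rounds to 0.98, matching the 98% on the slide. The idealized times come out near 58 hours and 0.9 hours, close to the reported 60 and 1.2 hours, with the difference plausibly down to scheduling and startup overhead.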

Slide 116

Slide 116 text

Reduced burden on pediatric subjects
                                 Traditional Design   Design Optimized Using Clinical Trial Simulation
# of subjects                    60                   40
# of blood samples per subject   12                   5
Length of stay per subject       72 hours             26 hours
Length of study                  2.5 years            1.7 years
Total study cost                 $700K                $250K
Length and cost projected based on historical data in pediatric subjects

Slide 117

Slide 117 text

Anurag Gupta, David Lang, Matt Wood, Jon Einkauf