A Platform for Big Data

Building a PLATFORM for DATA
A presentation by DR. MATT WOOD

Hello, and THANK YOU

SEVEN years young

Broad and deep SERVICES to support virtually any workload

Add enough server capacity EVERY DAY to power amazon.com in 2003

Computing delivered as a UTILITY

Take advantage of the ECONOMIES of scale to lower prices

[Chart, Q4 2006 to Q1 2013: 2 TRILLION OBJECTS]

[Chart, 5/22/2010 to 4/6/2013: 5.5 MILLION HADOOP CLUSTERS]

GENERATION COLLECTION & STORAGE COMPUTATION & ANALYTICS COLLABORATION & SHARING

Lower cost, Higher throughput

Lower cost, Higher throughput Highly constrained

[Chart: The Data Analysis Gap. Time axis: 1990 to 2020. Vertical axis: data volume. Labels: Enterprise Data, Data in Warehouse, Generated data, Available for analysis.]
Sources: Gartner, User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC, Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares

Utility computing REMOVES resource constraints

Technologies and techniques for working productively with data, at any
scale.

GENERATION COLLECTION & STORAGE COMPUTATION & ANALYSIS COLLABORATION & SHARING

AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AWS IMPORT/EXPORT

AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AMAZON CC2 AMAZON HS1 AMAZON CR1 AWS IMPORT/EXPORT AMAZON CG1

AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AMAZON CC2 AMAZON HS1 AMAZON CR1 AWS DATA PIPELINE AMAZON SWF AWS CLOUDFORMATION AWS IMPORT/EXPORT AMAZON CG1

AMAZON S3 AMAZON DYNAMODB AMAZON GLACIER AMAZON RDS AMAZON CC2 AMAZON HS1 AMAZON CR1 AWS DATA PIPELINE AMAZON SWF AWS CLOUDFORMATION AWS IMPORT/EXPORT AMAZON REDSHIFT AMAZON ELASTIC MAPREDUCE AMAZON CG1

Managed NOSQL DATASTORE

Virtually UNLIMITED throughput and scale

Single digit MILLISECOND latencies

Running on SOLID STATE drives

Storing data with DURABILITY across data centers and availability zones

ZERO ADMIN

Store KEYS & VALUES without requiring a schema

KEYS & VALUES (AMAZON DYNAMODB)
Example attributes: ORDER ID, DATE, ORDER TOTAL, MERCHANT

MERCHANT Hash key

MERCHANT Hash key Range key

MERCHANT Hash key Range key Secondary index

MERCHANT Hash key Range key Secondary index Projected attribute

API (AMAZON DYNAMODB)
CreateTable, UpdateTable, DeleteTable, DescribeTable, ListTables, Query, Scan, PutItem, GetItem, UpdateItem, DeleteItem, BatchGetItem, BatchWriteItem

READS, WRITES, UPDATES (AMAZON DYNAMODB)
Item-level transactions only. Conditional and atomic updates. Counts. Top/bottom n values. Results paged to 1MB in size.

Provisioned THROUGHPUT

PROVISIONED THROUGHPUT (AMAZON DYNAMODB)
Provision the IO your application needs. Pay per unit of provisioned capacity. Consistent, predictable performance, irrespective of scale. Designed for uniform workloads.

YOUR APP DYNAMODB

YOUR APP DYNAMODB READ THROUGHPUT

READ THROUGHPUT (AMAZON DYNAMODB)
IO per 4kb item. Strong and eventual consistency. Mix and match consistency.

YOUR APP DYNAMODB READ THROUGHPUT WRITE THROUGHPUT

WRITE THROUGHPUT (AMAZON DYNAMODB)
IO per 1kb item. Atomic increment and decrement. Optimistic concurrency control.
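
The READ THROUGHPUT and WRITE THROUGHPUT slides above size IO per 4kb item for reads and per 1kb item for writes. A minimal sketch of that sizing arithmetic follows. The function names are hypothetical, 1kb is taken to mean 1024 bytes, and the half-cost treatment of eventually consistent reads is an assumption drawn from DynamoDB's documented capacity model rather than from the slides.

```python
import math

READ_UNIT_BYTES = 4096   # slide: "IO per 4kb item" (assuming 1kb = 1024 bytes)
WRITE_UNIT_BYTES = 1024  # slide: "IO per 1kb item"


def read_capacity_units(item_size_bytes, reads_per_second, strongly_consistent=True):
    """Estimate read units to provision for a steady read rate."""
    units_per_read = math.ceil(item_size_bytes / READ_UNIT_BYTES)
    total = units_per_read * reads_per_second
    # Assumption (not on the slide): an eventually consistent read
    # costs half of a strongly consistent one.
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(item_size_bytes, writes_per_second):
    """Estimate write units to provision for a steady write rate."""
    return math.ceil(item_size_bytes / WRITE_UNIT_BYTES) * writes_per_second


if __name__ == "__main__":
    # A 6000-byte item spans 2 read units and 6 write units.
    print(read_capacity_units(6000, 100))
    print(read_capacity_units(6000, 100, strongly_consistent=False))
    print(write_capacity_units(6000, 100))
```

Under these assumptions, keeping items small matters more for writes than for reads, because the write unit is a quarter the size of the read unit.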

YOUR APP DYNAMODB READ THROUGHPUT WRITE THROUGHPUT

YOUR APP DYNAMODB READ THROUGHPUT WRITE THROUGHPUT
THROUGHPUT: 14.2% 14.2% 14.2% 14.2% 14.2% 14.2% 14.2%

THROUGHPUT: 14.2% 14.2% 14.2% 14.2% 14.2% 14.2% 14.2%
KEY ACCESS: 14.2% 14.2% 14.2% 14.2% 14.2% 14.2% 14.2%

THROUGHPUT: 14.2% 14.2% 14.2% 14.2% 14.2% 14.2% 14.2%
KEY ACCESS: 0% 50% 0% 50% 0% 0% 0%
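
The THROUGHPUT and KEY ACCESS figures above contrast evenly divided throughput with key access concentrated on two of seven slices. The sketch below works through one reading of that picture. It assumes each 14.2% figure is an equal, independently enforced slice of the provisioned throughput, which is a simplified model and not something the slides state. The function name is hypothetical.

```python
def max_sustainable_fraction(access_shares):
    """Fraction of total provisioned throughput an application can use
    before its hottest slice saturates.

    access_shares: fraction of requests landing on each slice (sums to 1).
    Assumption: capacity is split evenly, so each slice holds 1/n of it.
    """
    n = len(access_shares)
    slice_capacity = 1.0 / n
    hottest = max(access_shares)
    if hottest <= 0:
        raise ValueError("at least one slice must receive traffic")
    # The hottest slice saturates first: hottest * load <= slice_capacity.
    return min(1.0, slice_capacity / hottest)


if __name__ == "__main__":
    uniform = [1 / 7] * 7
    skewed = [0.0, 0.5, 0.0, 0.5, 0.0, 0.0, 0.0]
    print(max_sustainable_fraction(uniform))  # all provisioned capacity usable
    print(max_sustainable_fraction(skewed))   # 2/7, roughly 29%
```

In this model the skewed pattern can reach only about 29% of what was provisioned, which is one way to read the earlier note that provisioned throughput is designed for uniform workloads.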

Managed HADOOP CLUSTERS

Hadoop with ELASTICITY

Input data: S3, DynamoDB, Redshift

Input data: S3, DynamoDB, Redshift. Elastic MapReduce: Code

Input data: S3, DynamoDB, Redshift. Elastic MapReduce: Code, Name node

Input data: S3, DynamoDB, Redshift. Elastic MapReduce: Code, Name node, Elastic cluster, S3/HDFS

Input data: S3, DynamoDB, Redshift. Elastic MapReduce: Code, Name node, Elastic cluster, S3/HDFS. Queries + BI: via JDBC, Pig, Hive

Input data: S3, DynamoDB, Redshift. Elastic MapReduce: Code, Name node, Elastic cluster, S3/HDFS. Queries + BI: via JDBC, Pig, Hive. Output

Input data: S3, DynamoDB, Redshift. Output

10 HOURS ELASTIC MAPREDUCE

6 HOURS ELASTIC MAPREDUCE

PEAK CAPACITY ELASTIC MAPREDUCE

HADOOP ALL THE WAY DOWN (ELASTIC MAPREDUCE)
Pig, Hive, Mesos, Avro, Spark, Shark, MapR, Informatica, Mahout, Nutch, Flume, Accumulo, Cascading, Oozie, HBase, Sqoop

Built for SPOT

On-demand instance: $0.50 per hour. Spot price today: $0.0350 per hour, 7% of the on-demand price. “Overclock” by 14x.
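
The 7% and 14x figures follow directly from the two prices quoted above. A short sketch of the arithmetic, with hypothetical function names and the prices taken from the slide:

```python
def spot_fraction(on_demand_price, spot_price):
    """Spot price as a fraction of the on-demand price."""
    return spot_price / on_demand_price


def capacity_multiplier(on_demand_price, spot_price):
    """How many spot instance-hours the price of one on-demand hour buys."""
    return on_demand_price / spot_price


if __name__ == "__main__":
    on_demand = 0.50   # dollars per hour, from the slide
    spot = 0.0350      # dollars per hour, from the slide
    print(f"{spot_fraction(on_demand, spot):.0%} of on-demand price")
    print(f"{capacity_multiplier(on_demand, spot):.1f}x capacity for the same spend")
```

The multiplier comes out slightly above 14, so the slide's "14x" is the rounded-down figure for the same hourly spend.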

Managed, petabyte scale DATA WAREHOUSE

Scale from 100s of GB to 1.6PB

COLUMNAR STORE (REDSHIFT)
Designed for columnar access. Automatic data compression. Large block size. Best practices for data loading. Continual incremental backup to S3.

PARALLEL PROCESSING (REDSHIFT)
Fully bisectional 10 gigE network. 128GB RAM. Xeon E5 platform. 16TB across 24 spindles.

LEADER COMPUTE COMPUTE COMPUTE S3 BI TOOLS

LEADER COMPUTE COMPUTE COMPUTE S3 BI TOOLS READ ONLY LEADER
COMPUTE COMPUTE COMPUTE S3 COMPUTE COMPUTE

$999 PER TB PER YEAR

HS1 ON EC2
2.4 GB/s of 2MiB sequential reads. 2.6 GB/s for sequential writes.

HI1 ON EC2
2 x 1TB SSDs. 4kb random reads: 120k IOPS. 4kb random writes: 10k - 80k IOPS.

Technologies and techniques for working productively with data, at any
scale.

http://headrush.typepad.com/creating_passionate_users/2005/10/getting_users_p.html

Ease of use is a PRE-REQUISITE

Move data to the COMPUTE

Move tools to the DATA

Place data where it can be CONSUMED by those tools

Expose data at the RIGHT LEVEL

S3 DYNAMODB EMR EMR REDSHIFT DYNAMODB DATA PIPELINE

S3 DYNAMODB EMR EMR REDSHIFT DATA PIPELINE DYNAMODB

create external table items_db (id string, votes bigint, views bigint)
stored by 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
tblproperties (
  "dynamodb.table.name" = "items",
  "dynamodb.column.mapping" = "id:id,votes:votes,views:views"
);

select id, votes, views from items_db order by views desc;

CREATE EXTERNAL TABLE orders_s3_new_export (
  order_id string,
  customer_id string,
  order_date int,
  total double
)
PARTITIONED BY (year string, month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://export_bucket';

INSERT OVERWRITE TABLE orders_s3_new_export
PARTITION (year='2012', month='01')
SELECT * from orders_ddb_2012_01;

Fast, optimized data movement with S3DISTCOPY

Work with S3 as HDFS

Read and load data in parallel with COPY

Use READRATIO to limit throughput consumption

Reliable, scheduled DATA INTENSIVE workflows

INPUT DATA ACTIVITY OUTPUT

INPUT DATA ACTIVITY OUTPUT
Precondition checks. Failure and delay notifications.

Amazon S3 http://www.youtube.com/watch?v=oGcZ7WVx6EI Legacy data warehousing Cassandra Aegisthus Hadoop, Hive,
Pig

Amazon S3 http://www.youtube.com/watch?v=oGcZ7WVx6EI Legacy data warehousing Cassandra Aegisthus Hadoop, Hive,
Pig Microstrategy Sting R

98% time saved for clinical trial simulations

                                                      Internal System   AWS
Individual Clinical Trial Simulation Run Time (Min)   56                56
Total Number of Clinical Trial Simulations            2000              2000
No. Servers                                           2                 256
No. CPUs                                              32                2048
Total Analysis Run Time (hr)                          60                1.2
Cost                                                  ??                $336
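
The 98% headline can be checked against the run times in the comparison above. The sketch below is only that arithmetic. The function names are hypothetical and the inputs are the figures from the slide.

```python
def time_saved_fraction(baseline_hours, new_hours):
    """Fraction of wall-clock time saved relative to the baseline."""
    return 1.0 - new_hours / baseline_hours


def speedup(baseline_hours, new_hours):
    """How many times faster the new run completes."""
    return baseline_hours / new_hours


if __name__ == "__main__":
    internal_hours, aws_hours = 60, 1.2   # Total Analysis Run Time (hr)
    internal_cpus, aws_cpus = 32, 2048    # No. CPUs
    print(f"time saved: {time_saved_fraction(internal_hours, aws_hours):.0%}")
    print(f"speedup: {speedup(internal_hours, aws_hours):.0f}x")
    print(f"CPU ratio: {aws_cpus / internal_cpus:.0f}x")
```

The speedup is somewhat below the increase in CPU count, which is what one would expect if part of the total run time does not scale with the number of CPUs.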

Reduced burden on pediatric subjects

                                   Traditional Design   Design Optimized Using Clinical Trial Simulation
# of subjects                      60                   40
# of blood samples per subject     12                   5
Length of stay per subject         72 hours             26 hours
Length of study                    2.5 years            1.7 years
Total study cost                   $700K                $250K

Length and cost projected based on historical data in pediatric subjects

Anurag Gupta [email protected]
David Lang [email protected]
Matt Wood [email protected]
Jon Einkauf [email protected]
