Slide 1

Slide 1 text

The DATA life cycle: a presentation by DR. MATT WOOD

Slide 2

Slide 2 text

Hello, and THANK YOU

Slide 3

Slide 3 text

SEVEN years young

Slide 4

Slide 4 text

Broad and deep SERVICES to support virtually any workload

Slide 5

Slide 5 text

2007: 9; 2008: 24; 2009: 48; 2010: 61; 2011: 82; 2012: 159

Slide 6

Slide 6 text

Comprehensive SECURITY capabilities

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

Add enough server capacity EVERY DAY to power amazon.com in 2003

Slide 9

Slide 9 text

Computing delivered as a UTILITY

Slide 10

Slide 10 text

Take advantage of the ECONOMIES of scale to lower prices

Slide 11

Slide 11 text

Free steak campaign; Facebook page; Mars exploration ops; Consumer social app; Ticket pricing optimization; SAP & Sharepoint; Securities Trading; Data Archiving; Marketing web site; Interactive TV apps; Financial markets analytics; Consumer social app; Big data analytics; Web site & media sharing; Disaster recovery; Media streaming; Web and mobile apps; Streaming webcasts; Facebook app; Consumer social app; Business line of sight; Mobile analytics; IT operations; Digital media; Core IT and media; Ground campaign

Slide 12

Slide 12 text

2 TRILLION OBJECTS (chart: quarterly, Q4 2006 to Q1 2013)

Slide 13

Slide 13 text

5.5 MILLION HADOOP CLUSTERS (chart: 5/22/2010 to 4/6/2013)

Slide 14

Slide 14 text

Let’s talk about DATA

Slide 15

Slide 15 text

GENERATION COLLECTION & STORAGE ANALYTICS & COMPUTATION COLLABORATION & SHARING

Slide 16

Slide 16 text

Decreasing cost of DATA generation

Slide 17

Slide 17 text

GENERATION COLLECTION & STORAGE ANALYTICS & COMPUTATION COLLABORATION & SHARING

Slide 18

Slide 18 text

GENERATION COLLECTION & STORAGE ANALYTICS & COMPUTATION COLLABORATION & SHARING Lower cost, Higher throughput

Slide 19

Slide 19 text

GENERATION COLLECTION & STORAGE ANALYTICS & COMPUTATION COLLABORATION & SHARING Lower cost, Higher throughput Highly constrained

Slide 20

Slide 20 text

The Data Analysis Gap (chart: data volume, 1990 to 2020; generated enterprise data versus data in warehouse available for analysis)
Sources: Gartner, User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC, Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares

Slide 21

Slide 21 text

Utility computing REMOVES resource constraints

Slide 22

Slide 22 text

GENERATION COLLECTION & STORAGE ANALYTICS & COMPUTATION COLLABORATION & SHARING Lower cost, Higher throughput Highly constrained

Slide 23

Slide 23 text

GENERATION COLLECTION & STORAGE ANALYTICS & COMPUTATION COLLABORATION & SHARING

Slide 24

Slide 24 text

Technologies and techniques for working productively with data, at any scale.

Slide 25

Slide 25 text

Let’s talk about CELL PHONES

Slide 26

Slide 26 text

CELL PHONES: incredible data generators
Number of calls; Call duration; Airtime purchase frequency and size; Mobility patterns

Slide 27

Slide 27 text

Mobile airtime purchases (chart: number of purchases versus size of purchase; lower income household, higher income household)

Slide 28

Slide 28 text

Women: More calls; Longer calls; Larger social network; More personal calls
Men: Fewer calls; Shorter calls; Smaller social network; More work-related calls

Slide 29

Slide 29 text

Let’s talk about HEALTH CARE

Slide 30

Slide 30 text

Average daily number of cells that moved out from the communal sections. Linus Bengtsson et al. PLoS Medicine, 2011

Slide 31

Slide 31 text

SOCIAL NETWORKS: incredible data generators
Discussion topics; Sentiment and context; Social graph; Interactions

Slide 32

Slide 32 text

Tweeting about Flu
You Are What You Tweet: Analyzing Twitter for Public Health. M. J. Paul and M. Dredze, 2011

Slide 33

Slide 33 text

Discussing unemployment: Ireland

Slide 34

Slide 34 text

Discussing unemployment: America

Slide 35

Slide 35 text

Tweeting about Food

Slide 36

Slide 36 text

Tweeting about Food (chart: tweets about the price of rice; official food price inflation)

Slide 37

Slide 37 text

Let’s talk about VIDEO GAMES

Slide 38

Slide 38 text

WEB APPLICATIONS: incredible data generators
Search results; Ad placement; Buying history; Page views

Slide 39

Slide 39 text

“Who buys video games?”

Slide 40

Slide 40 text

Per day: 3.5 billion records; 13 TB of click stream logs; 71 million unique cookies

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

Results: 500% return on ad spend; 17,000% reduction in procurement time

Slide 44

Slide 44 text

Let’s talk about GALAXIES

Slide 45

Slide 45 text

“How do galaxies form?”

Slide 46

Slide 46 text

No content

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

No content

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

No content

Slide 51

Slide 51 text

Let’s talk about ME

Slide 52

Slide 52 text

Chromosome 11 : ACTN3 : rs1815739

Slide 53

Slide 53 text

Chromosome X : rs6625163

Slide 54

Slide 54 text

Chromosome 19 : FUT2 : rs601338

Slide 55

Slide 55 text

Chromosome 2 : rs10427255

Slide 56

Slide 56 text

TYPE II Chromosome 10 : rs7903146

Slide 57

Slide 57 text

+0.25 Chromosome 15 : rs2472297

Slide 58

Slide 58 text

No content

Slide 59

Slide 59 text

No content

Slide 60

Slide 60 text

No content

Slide 61

Slide 61 text

No content

Slide 62

Slide 62 text

No content

Slide 63

Slide 63 text

GENERATION COLLECTION & STORAGE ANALYTICS & COMPUTATION COLLABORATION & SHARING

Slide 64

Slide 64 text

Technologies and techniques for working productively with data, at any scale.

Slide 65

Slide 65 text

Challenges: Speeding server provisioning for R&D apps; Extending capacity for internal grid environments; Slowing internally hosted compute infrastructure growth; On-boarding security, validation services and compliance; Hosting research data; Reducing cost while extending capabilities

Slide 66

Slide 66 text

Research portfolio
Primary uses: Clinical pharmacology and pharmacometrics; Molecular dynamics; Computational genomics

Slide 67

Slide 67 text

98% time saved for clinical trial simulations

                                                      Internal System   AWS
Individual Clinical Trial Simulation Run Time (Min)   56                56
Total Number of Clinical Trial Simulations            2000              2000
No. Servers                                           2                 256
No. CPUs                                              32                2048
Total Analysis Run Time (hr)                          60                1.2
Cost                                                  ??                $336
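The 98% figure follows from the two total run times on this slide. A minimal Python sketch of that arithmetic, using only the slide's numbers. The idealized estimate assumes one simulation per CPU with no scheduling overhead; that model is an illustrative assumption, not something the slide states.

```python
import math

# Figures copied from the slide.
SIM_MINUTES = 56                      # run time of one simulation
N_SIMS = 2000                         # total simulations
CPUS = {"internal": 32, "aws": 2048}
REPORTED_HOURS = {"internal": 60.0, "aws": 1.2}


def time_saved_fraction(before_hr, after_hr):
    """Fraction of wall-clock time saved going from before_hr to after_hr."""
    return (before_hr - after_hr) / before_hr


def ideal_hours(n_sims, sim_minutes, cpus):
    """Idealized wall-clock hours, ASSUMING one simulation per CPU
    and no scheduling overhead (hypothetical model, not from the slide)."""
    waves = math.ceil(n_sims / cpus)  # batches needed to run every simulation
    return waves * sim_minutes / 60


saved = time_saved_fraction(REPORTED_HOURS["internal"], REPORTED_HOURS["aws"])
print(f"time saved: {saved:.0%}")
for name, cpus in CPUS.items():
    estimate = ideal_hours(N_SIMS, SIM_MINUTES, cpus)
    print(f"{name}: about {estimate:.1f} hr idealized, {REPORTED_HOURS[name]} hr reported")
```

Under that assumption the internal estimate lands close to the reported 60 hours, while the AWS estimate comes out somewhat below the reported 1.2 hours, which might reflect start-up or coordination overhead.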

Slide 68

Slide 68 text

Reduced burden on pediatric subjects

                                 Traditional Design   Design Optimized Using Clinical Trial Simulation
# of subjects                    60                   40
# of blood samples per subject   12                   5
Length of stay per subject       72 hours             26 hours
Length of study                  2.5 years            1.7 years
Total study cost                 $700K                $250K

Length and cost projected based on historical data in pediatric subjects
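One way to read "reduced burden" is to combine the per-subject figures into study-wide totals. The sketch below does that in Python. The totals are derived here, not shown on the slide, and they assume every subject completes the full protocol.

```python
# Figures copied from the slide: (traditional design, optimized design).
SUBJECTS = (60, 40)
SAMPLES_PER_SUBJECT = (12, 5)
STAY_HOURS_PER_SUBJECT = (72, 26)


def pct_reduction(before, after):
    """Percentage drop from before to after."""
    return 100 * (before - after) / before


# ASSUMPTION (not on the slide): every subject completes the full protocol,
# so study-wide totals are simply subjects times the per-subject figure.
total_samples = tuple(n * s for n, s in zip(SUBJECTS, SAMPLES_PER_SUBJECT))
total_stay_hours = tuple(n * h for n, h in zip(SUBJECTS, STAY_HOURS_PER_SUBJECT))

print("blood samples:", total_samples, f"{pct_reduction(*total_samples):.0f}% fewer")
print("subject-hours:", total_stay_hours, f"{pct_reduction(*total_stay_hours):.0f}% fewer")
```

The same `pct_reduction` helper (a name chosen for this sketch) can be applied to the study length and total cost rows.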

Slide 69

Slide 69 text

Let’s talk about A PLATFORM for data and analytics

Slide 70

Slide 70 text

Technologies and techniques for working productively with data, at any scale.

Slide 71

Slide 71 text

AMAZON EC2; AMAZON REDSHIFT; AMAZON EMR; AMAZON DYNAMODB; AMAZON S3; AMAZON RDS; AWS STORAGE GATEWAY; AMAZON SWF; AMAZON GLACIER; AWS DATA PIPELINE; AMAZON MACHINE IMAGES; AMAZON PUBLIC DATASETS; CLUSTER COMPUTE INSTANCES; HIGH STORAGE INSTANCES; HIGH I/O INSTANCES; GPU INSTANCES (NO PHI YET, SORRY)

Slide 72

Slide 72 text

Technologies and techniques for working productively with data, at any scale.

Slide 73

Slide 73 text

http://headrush.typepad.com/creating_passionate_users/2005/10/getting_users_p.html

Slide 74

Slide 74 text

Ease of use is a PRE-REQUISITE

Slide 75

Slide 75 text

Expose data at the RIGHT LEVEL

Slide 76

Slide 76 text

GENERATION COLLECTION & STORAGE COMPUTATION & ANALYTICS COLLABORATION & SHARING

Slide 77

Slide 77 text

GENERATION COLLECTION & STORAGE COMPUTATION & ANALYTICS COLLABORATION & SHARING JASPERSOFT ON AMAZON EC2 AMAZON REDSHIFT AMAZON EMR AMAZON DYNAMODB

Slide 78

Slide 78 text

GENERATION COLLECTION & STORAGE COMPUTATION & ANALYTICS COLLABORATION & SHARING JASPERSOFT ON AMAZON EC2 AMAZON REDSHIFT AMAZON EMR AMAZON DYNAMODB AMAZON RDS AMAZON RDS AMAZON EC2

Slide 79

Slide 79 text

GENERATION COLLECTION & STORAGE COMPUTATION & ANALYTICS COLLABORATION & SHARING JASPERSOFT ON AMAZON EC2 AMAZON REDSHIFT AMAZON EMR AMAZON DYNAMODB AMAZON RDS AMAZON RDS AMAZON EC2 AMAZON S3

Slide 80

Slide 80 text

Amazon S3 http://www.youtube.com/watch?v=oGcZ7WVx6EI Legacy data warehousing Cassandra Aegisthus Hadoop, Hive, Pig

Slide 81

Slide 81 text

Amazon S3 http://www.youtube.com/watch?v=oGcZ7WVx6EI Legacy data warehousing Cassandra Aegisthus Hadoop, Hive, Pig Microstrategy Sting R

Slide 82

Slide 82 text

No content

Slide 83

Slide 83 text

Technologies and techniques for working productively with data, at any scale.

Slide 84

Slide 84 text

12.5 years; 3 hours

Slide 85

Slide 85 text

12.5 years; 3 hours; $20M; $4k

Slide 86

Slide 86 text

No content

Slide 87

Slide 87 text

$1k Less than today

Slide 88

Slide 88 text

1,000,000+ core hours

Slide 89

Slide 89 text

matthew@amazon.com aws.amazon.com