Big Data Analytics - Speaker Deck

Slide 1

Slide 1 text

Big Data Analytics with Amazon Web Services. Dr. Matt Wood. An Online Seminar for Partners. Wednesday 1st August.

Slide 2

Slide 2 text

Hello, and thank you.

Slide 3

Slide 3 text

Big Data Analytics: An introduction.

Slide 4

Slide 4 text

Big Data Analytics: An introduction. The story of analytics on AWS.

Slide 5

Slide 5 text

Big Data Analytics: An introduction. The story of analytics on AWS. Integrating partners.

Slide 6

Slide 6 text

Big Data Analytics: An introduction. The story of analytics on AWS. Integrating partners. Partner success stories.

Slide 7

Slide 7 text

1. INTRODUCING BIG DATA

Slide 8

Slide 8 text

Data for competitive advantage.

Slide 9

Slide 9 text

Using data: Customer segmentation, financial modeling, system analysis, line-of-sight, business intelligence.

Slide 10

Slide 10 text

Generation -> Collection & storage -> Analytics & computation -> Collaboration & sharing

Slide 11

Slide 11 text

Cost of data generation is falling.

Slide 12

Slide 12 text

Generation -> Collection & storage -> Analytics & computation -> Collaboration & sharing. Lower cost, increased throughput.

Slide 13

Slide 13 text

Generation -> Collection & storage -> Analytics & computation -> Collaboration & sharing. HIGHLY CONSTRAINED.

Slide 14

Slide 14 text

Very high barrier to turning data into information.

Slide 15

Slide 15 text

Move from a data generation challenge to analytics challenge.

Slide 16

Slide 16 text

Enter the Cloud.

Slide 17

Slide 17 text

Remove the constraints.

Slide 18

Slide 18 text

Enable data-driven innovation.

Slide 19

Slide 19 text

Move to a distributed data approach.

Slide 20

Slide 20 text

Maturation of two things.

Slide 21

Slide 21 text

Maturation of two things. Software for distributed storage and analysis.

Slide 22

Slide 22 text

Maturation of two things. Software for distributed storage and analysis. Infrastructure for distributed storage and analysis.

Slide 23

Slide 23 text

Software: Frameworks for data-intensive workloads. Distributed by design.

Slide 24

Slide 24 text

Infrastructure: Platform for data-intensive workloads. Distributed by design.

Slide 25

Slide 25 text

Support the data timeline.

Slide 26

Slide 26 text

Generation -> Collection & storage -> Analytics & computation -> Collaboration & sharing. HIGHLY CONSTRAINED.

Slide 27

Slide 27 text

Generation -> Collection & storage -> Analytics & computation -> Collaboration & sharing

Slide 28

Slide 28 text

Lower the barrier to entry.

Slide 29

Slide 29 text

Accelerate time to market and increase agility.

Slide 30

Slide 30 text

Enable new business opportunities.

Slide 31

Slide 31 text

Washington Post, Pinterest, NASA

Slide 32

Slide 32 text

“AWS enables Pfizer to explore difficult or deep scientific questions in a timely, scalable manner and helps us make better decisions more quickly” Michael Miller, Pfizer

Slide 33

Slide 33 text

2. THE STORY OF ANALYTICS

Slide 34

Slide 34 text

EC2: Utility computing. 6 years young.

Slide 35

Slide 35 text

Scale-out systems: Embarrassingly parallel problems. Queue-based distribution. Small, medium and high scale.
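Queue-based distribution of an embarrassingly parallel problem can be sketched roughly as follows. This is a minimal single-machine illustration, not the setup used in the talk: the in-process `queue.Queue` stands in for a distributed queue service, threads stand in for worker instances, and the names `run_workers` and `square` are hypothetical.

```python
import queue
import threading


def run_workers(tasks, work_fn, num_workers=4):
    """Distribute independent tasks to workers through a shared queue."""
    task_queue = queue.Queue()
    for index, task in enumerate(tasks):
        task_queue.put((index, task))

    # One result slot per task; each slot is written by exactly one worker.
    results = [None] * len(tasks)

    def worker():
        while True:
            try:
                index, task = task_queue.get_nowait()
            except queue.Empty:
                return  # queue drained, so this worker exits
            results[index] = work_fn(task)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def square(n):
    """Placeholder for an independent unit of work (assumption)."""
    return n * n


if __name__ == "__main__":
    print(run_workers(list(range(8)), square))
```

Because the tasks do not depend on each other, adding workers only changes how quickly the queue drains, which is the property that lets the same pattern run at small, medium and high scale.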

Slide 39

Slide 39 text

EC2: Utility computing. 6 years young. Cost optimization.

Slide 40

Slide 40 text

Achieving economies of scale. Chart axes: 100%, Time.

Slide 41

Slide 41 text

Achieving economies of scale. Chart axes: 100%, Time. Labels: Reserved capacity.

Slide 42

Slide 42 text

Achieving economies of scale. Chart axes: 100%, Time. Labels: Reserved capacity, On-demand.

Slide 43

Slide 43 text

Achieving economies of scale. Chart axes: 100%, Time. Labels: Reserved capacity, On-demand, UNUSED CAPACITY.

Slide 44

Slide 44 text

Spot Instances: Bid on unused EC2 capacity. Very large discount. Perfect for batch runs. Balance cost and scale.

Slide 45

Slide 45 text

$650 per hour

Slide 46

Slide 46 text

Map/reduce: Pattern for distributed computing. Software frameworks such as Hadoop. Write two functions. Scale up.
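The "write two functions" idea can be illustrated with a word count, sketched here in plain Python. This is not Hadoop code: `map_words`, `reduce_counts` and the single-process driver `run_map_reduce` are hypothetical names, and the driver only imitates the group-by-key step that a framework such as Hadoop performs across a cluster.

```python
from collections import defaultdict


def map_words(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(word, counts):
    """Reduce: combine all the counts emitted for one word."""
    return word, sum(counts)


def run_map_reduce(lines, mapper, reducer):
    """Tiny single-process driver: map, group by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data analytics", "Big data on AWS"]
    print(run_map_reduce(sample, map_words, reduce_counts))
```

Only the mapper and reducer are specific to the problem. In this sketch the driver is a stand-in for the part a framework supplies, which is why the same two functions can in principle be applied to much larger inputs.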

Slide 47

Slide 47 text

Map/reduce: Pattern for distributed computing. Software frameworks such as Hadoop. Write two functions. Scale up. Complex cluster configuration and management.

Slide 48

Slide 48 text

Amazon Elastic MapReduce: Managed Hadoop clusters. Easy to provision and monitor. Write two functions. Scale up. Optimized for S3 access.

Slide 49

Slide 49 text

UNDER THE HOOD: Input data (S3).

Slide 50

Slide 50 text

UNDER THE HOOD: Elastic MapReduce. Input data (S3). Code.

Slide 51

Slide 51 text

UNDER THE HOOD: Elastic MapReduce. Input data (S3). Code. Name node.

Slide 52

Slide 52 text

UNDER THE HOOD: Elastic MapReduce. Input data (S3). Code. Name node. Elastic cluster.

Slide 53

Slide 53 text

UNDER THE HOOD: Elastic MapReduce. Input data (S3). Code. Name node. Elastic cluster. HDFS.

Slide 54

Slide 54 text

UNDER THE HOOD: Elastic MapReduce. Input data (S3). Code. Name node. Elastic cluster. HDFS. Queries + BI (via JDBC, Pig, Hive).

Slide 55

Slide 55 text

UNDER THE HOOD: Elastic MapReduce. Input data (S3). Code. Name node. Elastic cluster. HDFS. Queries + BI (via JDBC, Pig, Hive). Output (S3 + SimpleDB).

Slide 56

Slide 56 text

UNDER THE HOOD: Input data (S3). Output (S3 + SimpleDB).

Slide 71

Slide 71 text

Performance

Slide 72

Slide 72 text

Performance: Compute performance

Slide 73

Slide 73 text

UNDER THE HOOD: Cluster Compute. Intel Xeon E5-2670. 60.5 GB. 10 GigE non-blocking network. Placement groupings.

Slide 74

Slide 74 text

UNDER THE HOOD: Cluster Compute. Intel Xeon E5-2670. 60.5 GB. 10 GigE non-blocking network. Placement groupings. + GPU enabled instances.

Slide 75

Slide 75 text

Performance: Compute performance

Slide 76

Slide 76 text

Performance: Compute performance. IO performance.

Slide 77

Slide 77 text

NoSQL: Unstructured data storage.

Slide 78

Slide 78 text

DynamoDB: Predictable, consistent performance. Unlimited storage. No schema for unstructured data. Single-digit millisecond latencies. Backed by solid state drives.

Slide 79

Slide 79 text

...and SSDs for all. New Hi1 storage instances.

Slide 80

Slide 80 text

UNDER THE HOOD: hi1.4xlarge. 2 x 1 TB SSDs. 10 GigE network. HVM: 90k IOPS read, 9k to 75k write. PV: 120k IOPS read, 10k to 85k write.

Slide 81

Slide 81 text

Netflix: “The hi1.4xlarge configuration is about half the system cost for the same throughput.” http://techblog.netflix.com/2012/07/benchmarking-high-performance-io-with.html

Slide 82

Slide 82 text

EBS: Elastic Block Store

Slide 83

Slide 83 text

Provisioned IOPS: Provision required IO performance.

Slide 84

Slide 84 text

Provisioned IOPS: Provision required IO performance. + EBS-optimized instances with dedicated throughput.

Slide 85

Slide 85 text

Generation -> Collection & storage -> Analytics & computation -> Collaboration & sharing

Slide 86

Slide 86 text

Performance + ease of use

Slide 87

Slide 87 text

3. PARTNER INTEGRATION

Slide 88

Slide 88 text

Extend platform with partners

Slide 89

Slide 89 text

Innovate on behalf of customers

Slide 90

Slide 90 text

Remove undifferentiated heavy lifting

Slide 91

Slide 91 text

MapR distribution for EMR: Rolled the Amazon Hadoop optimizations into MapR. Choice for EMR customers. Easy deployment for MapR customers.

Slide 92

Slide 92 text

MapR distribution for EMR: Hadoop distribution. Integrated into EMR. NFS and ODBC drivers. High availability and cluster mirroring.

Slide 93

Slide 93 text

Informatica on EMR: Enterprise data toolchain. “Swiss army knife” for data formats. Data integration. Available to all on EMR.

Slide 94

Slide 94 text

AWS Marketplace: Karmasphere, Marketshare, Acunu Cassandra, Metamarkets, Aspera and more. aws.amazon.com/marketplace

Slide 95

Slide 95 text

4. PARTNER SUCCESS STORIES

Slide 96

Slide 96 text

Razorfish

Slide 97

Slide 97 text

3.5 billion records. 71MM unique cookies. 1.7MM targeted ads per day.

Slide 98

Slide 98 text

3.5 billion records. 71MM unique cookies. 1.7MM targeted ads per day. 500% improvement in return on ad spend.

Slide 99

Slide 99 text

Cycle Computing + Schrodinger

Slide 100

Slide 100 text

30k cores, $4200 an hour (compared to $10+ million)
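As a rough worked figure, the hourly price can be expressed per core. The sketch below assumes "30k" means exactly 30,000 cores and that the $4200 covers the whole cluster for one hour; the function name `cost_per_core_hour` is hypothetical.

```python
def cost_per_core_hour(total_cost_per_hour, cores):
    """Return the hourly cost attributable to a single core."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return total_cost_per_hour / cores


if __name__ == "__main__":
    # Figures from the slide: $4200 an hour for 30k cores (assumed 30,000).
    print(f"${cost_per_core_hour(4200, 30_000):.2f} per core-hour")
```

Under those assumptions the cluster works out to about $0.14 per core-hour.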

Slide 101

Slide 101 text

Marketshare + Ticketmaster: Optimize live event pricing.

Slide 102

Slide 102 text

Reduced developer infrastructure management time by 3 hours a day

Slide 103

Slide 103 text

Thank you!

Slide 104

Slide 104 text

Q & A [email protected] @mza on Twitter