
Big Data Analytics

Matt Wood
August 01, 2012

An introduction to Big Data Analytics in the cloud.

Transcript

  1. Big Data Analytics with Amazon Web Services. Dr. Matt Wood. An Online Seminar for Partners. Wednesday 1st August.
  2. Hello, and thank you.

  3. Big Data Analytics: An introduction.

  4. Big Data Analytics: An introduction. The story of analytics on AWS.

  5. Big Data Analytics: An introduction. The story of analytics on AWS. Integrating partners.

  6. Big Data Analytics: An introduction. The story of analytics on AWS. Integrating partners. Partner success stories.
  7. INTRODUCING BIG DATA 1

  8. Data for competitive advantage.

  9. Using data: customer segmentation, financial modeling, system analysis, line-of-sight, business intelligence.
  10. Generation → Collection & storage → Analytics & computation → Collaboration & sharing.

  11. Cost of data generation is falling.

  12. Generation → Collection & storage → Analytics & computation → Collaboration & sharing. Lower cost, increased throughput.

  13. Generation → Collection & storage → Analytics & computation → Collaboration & sharing. HIGHLY CONSTRAINED.
  14. Very high barrier to turning data into information.

  15. Move from a data generation challenge to analytics challenge.

  16. Enter the Cloud.

  17. Remove the constraints.

  18. Enable data-driven innovation.

  19. Move to a distributed data approach.

  20. Maturation of two things.

  21. Maturation of two things. Software for distributed storage and analysis

  22. Maturation of two things. Software for distributed storage and analysis. Infrastructure for distributed storage and analysis.
  23. Software: frameworks for data-intensive workloads. Distributed by design.

  24. Infrastructure: platform for data-intensive workloads. Distributed by design.

  25. Support the data timeline.

  26. Generation → Collection & storage → Analytics & computation → Collaboration & sharing. HIGHLY CONSTRAINED.

  27. Generation → Collection & storage → Analytics & computation → Collaboration & sharing.

  28. Lower the barrier to entry.

  29. Accelerate time to market and increase agility.

  30. Enable new business opportunities.

  31. Washington Post, Pinterest, NASA.

  32. “AWS enables Pfizer to explore difficult or deep scientific questions in a timely, scalable manner and helps us make better decisions more quickly.” Michael Miller, Pfizer
  33. THE STORY OF ANALYTICS 2

  34. EC2 Utility computing. 6 years young.

  35. Embarrassingly parallel problems. Scale-out systems. Queue-based distribution. Small, medium and high scale.
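A minimal sketch of the queue-based distribution pattern above, assuming boto3 and a hypothetical SQS queue named "work-items" (neither appears in the deck): a producer enqueues independent work items, and any number of EC2 workers drain the queue.

```python
# Sketch only: queue-based distribution of embarrassingly parallel work.
# boto3, the queue name and the work items are illustrative assumptions.
import boto3

def process(body):
    print("processing item", body)   # placeholder for the real work

sqs = boto3.resource("sqs")
queue = sqs.create_queue(QueueName="work-items")   # hypothetical queue

# Producer: one message per independent work item.
for item_id in range(100):
    queue.send_message(MessageBody=str(item_id))

# Worker loop (run on any number of EC2 instances): pull, process, delete.
while True:
    messages = queue.receive_messages(MaxNumberOfMessages=10, WaitTimeSeconds=5)
    if not messages:
        break
    for msg in messages:
        process(msg.body)
        msg.delete()
```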
  36–38. [Image-only slides; no transcript text.]
  39. EC2 Utility computing. 6 years young. Cost optimization.

  40. Achieving economies of scale. [Chart: capacity utilization (to 100%) over time.]

  41. Achieving economies of scale: reserved capacity.

  42. Achieving economies of scale: reserved capacity, on-demand.

  43. Achieving economies of scale: reserved capacity, on-demand, unused capacity.
  44. Spot Instances: bid on unused EC2 capacity. Very large discount. Perfect for batch runs. Balance cost and scale.
  45. $650 per hour
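As a hedged illustration of the Spot model above, the sketch below places a bid for unused capacity with boto3; the AMI ID, instance type, count and bid price are placeholders, not values from the deck.

```python
# Sketch only: bidding on unused EC2 capacity with a Spot request.
# AMI ID, instance type, count and bid price are placeholders.
import boto3

ec2 = boto3.client("ec2")
response = ec2.request_spot_instances(
    SpotPrice="0.50",        # maximum hourly bid, in USD
    InstanceCount=10,        # batch runs tolerate interruption, so scale wide
    LaunchSpecification={
        "ImageId": "ami-12345678",    # placeholder AMI
        "InstanceType": "c1.xlarge",  # placeholder instance type
    },
)
for request in response["SpotInstanceRequests"]:
    print(request["SpotInstanceRequestId"], request["State"])
```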

  46. Map/reduce: a pattern for distributed computing. Software frameworks such as Hadoop. Write two functions. Scale up.

  47. Map/reduce: a pattern for distributed computing. Software frameworks such as Hadoop. Write two functions. Scale up. Complex cluster configuration and management.
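To make "write two functions" concrete, here is a minimal word-count mapper and reducer for Hadoop Streaming. Word count is an illustrative choice rather than an example from the deck, and in practice the two functions would live in separate mapper.py and reducer.py scripts.

```python
# Sketch only: the two functions of a Hadoop Streaming word count.
import sys

def mapper():
    # Emit one "word<TAB>1" record per word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Hadoop delivers records sorted by key, so counts can be summed in one pass.
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")
```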
  48. Amazon Elastic MapReduce: managed Hadoop clusters. Easy to provision and monitor. Write two functions. Scale up. Optimized for S3 access.
  49. Under the hood: input data in S3.

  50. Under the hood: input data in S3; Elastic MapReduce code.

  51. Under the hood: input data in S3; Elastic MapReduce code; name node.

  52. Under the hood: input data in S3; Elastic MapReduce code; name node; elastic cluster.

  53. Under the hood: input data in S3; Elastic MapReduce code; name node; elastic cluster; HDFS.

  54. Under the hood: input data in S3; Elastic MapReduce code; name node; elastic cluster; HDFS; queries + BI via JDBC, Pig, Hive.

  55. Under the hood: input data in S3; Elastic MapReduce code; name node; elastic cluster; HDFS; queries + BI via JDBC, Pig, Hive; output to S3 + SimpleDB.

  56. Under the hood: input data in S3; output to S3 + SimpleDB.
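A hedged sketch of the pipeline built up in slides 49–55, launched with boto3's EMR client: input read from S3, a Hadoop Streaming step run on a managed cluster, results written back to S3. The bucket names, script paths, instance types, counts and release label are placeholders.

```python
# Sketch only: a managed Hadoop cluster on Elastic MapReduce running one
# Hadoop Streaming step. All names, paths and sizes are placeholders.
import boto3

emr = boto3.client("emr")
response = emr.run_job_flow(
    Name="streaming-wordcount",
    ReleaseLabel="emr-5.36.0",
    Applications=[{"Name": "Hadoop"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 4,
        "KeepJobFlowAliveWhenNoSteps": False,   # terminate when the step finishes
    },
    Steps=[{
        "Name": "wordcount",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", "s3://my-bucket/mapper.py,s3://my-bucket/reducer.py",
                "-mapper", "mapper.py",
                "-reducer", "reducer.py",
                "-input", "s3://my-bucket/input/",    # input data in S3
                "-output", "s3://my-bucket/output/",  # results written back to S3
            ],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```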
  57–70. [Image-only slides; no transcript text.]
  71. Performance

  72. Performance Compute performance

  73. Under the hood: Cluster Compute instances. Intel Xeon E5-2670. 10 GigE non-blocking network. Placement groups. 60.5 GB memory.

  74. Under the hood: Cluster Compute instances. Intel Xeon E5-2670. 10 GigE non-blocking network. Placement groups. 60.5 GB memory. Plus GPU-enabled instances.
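A minimal sketch of using placement groups with boto3, so tightly coupled nodes share the low-latency, non-blocking network; the group name, AMI and instance type are placeholders.

```python
# Sketch only: launching compute nodes into a cluster placement group.
# Group name, AMI ID and instance type are placeholders.
import boto3

ec2 = boto3.client("ec2")
ec2.create_placement_group(GroupName="hpc-cluster", Strategy="cluster")
ec2.run_instances(
    ImageId="ami-12345678",       # placeholder AMI
    InstanceType="cc2.8xlarge",   # illustrative cluster compute instance type
    MinCount=8,
    MaxCount=8,
    Placement={"GroupName": "hpc-cluster"},
)
```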
  75. Performance Compute performance

  76. Performance Compute performance IO performance

  77. NoSQL Unstructured data storage.

  78. DynamoDB: predictable, consistent performance. Unlimited storage. No schema required for unstructured data. Single-digit millisecond latencies. Backed by solid state drives.
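A small sketch of working with DynamoDB from boto3, assuming a hypothetical table named "clicks" with a single hash key "user_id"; only the key attribute is fixed, the rest of each item is schema-free.

```python
# Sketch only: single-item write and read against a hypothetical DynamoDB
# table ("clicks") keyed on "user_id".
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("clicks")   # hypothetical, pre-created table

# Only the key attribute is required; other attributes need no schema.
table.put_item(Item={"user_id": "u-123", "page": "/home", "latency_ms": 7})

response = table.get_item(Key={"user_id": "u-123"})
print(response.get("Item"))
```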
  79. ...and SSDs for all. New Hi1 storage instances.

  80. Under the hood: hi1.4xlarge. 2 x 1 TB SSDs. 10 GigE network. HVM: 90k IOPS read, 9k to 75k IOPS write. PV: 120k IOPS read, 10k to 85k IOPS write.
  81. Netflix: “The hi1.4xlarge configuration is about half the system cost for the same throughput.” http://techblog.netflix.com/2012/07/benchmarking-high-performance-io-with.html
  82. EBS: Elastic Block Store.

  83. Provisioned IOPS: provision required IO performance.

  84. Provisioned IOPS: provision required IO performance. Plus EBS-optimized instances with dedicated throughput.
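A hedged sketch of provisioning an io1 (provisioned IOPS) volume and attaching it to an EBS-optimized instance with boto3; the size, IOPS figure, Availability Zone and instance ID are placeholders.

```python
# Sketch only: creating a provisioned-IOPS EBS volume and attaching it.
# Size, IOPS, Availability Zone and instance ID are placeholders.
import boto3

ec2 = boto3.client("ec2")
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=200,           # GiB
    VolumeType="io1",   # provisioned-IOPS volume type
    Iops=4000,          # IOPS provisioned for this volume
)

# Wait until the volume is available before attaching it.
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",   # placeholder instance ID
    Device="/dev/sdf",
)
```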
  85. Generation → Collection & storage → Analytics & computation → Collaboration & sharing.

  86. Performance + ease of use

  87. PARTNER INTEGRATION 3

  88. Extend platform with partners

  89. Innovate on behalf of customers

  90. Remove undifferentiated heavy lifting

  91. MapR distribution for EMR: rolled the Amazon Hadoop optimizations into MapR. Choice for EMR customers. Easy deployment for MapR customers.
  92. MapR distribution for EMR: a Hadoop distribution integrated into EMR. NFS and ODBC drivers. High availability and cluster mirroring.
  93. Informatica on EMR: enterprise data toolchain. A “Swiss army knife” for data formats. Data integration. Available to all on EMR.
  94. AWS Marketplace: Karmasphere, Marketshare, Acunu Cassandra, Metamarkets, Aspera and more. aws.amazon.com/marketplace
  95. PARTNER SUCCESS STORIES 4

  96. Razorfish

  97. 3.5 billion records. 71MM unique cookies. 1.7MM targeted ads per day.

  98. 3.5 billion records. 71MM unique cookies. 1.7MM targeted ads per day. 500% improvement in return on ad spend.
  99. Cycle Computing + Schrodinger

  100. 30k cores, $4200 an hour (compared to $10+ million)

  101. Marketshare + Ticketmaster: optimize live event pricing.

  102. Reduced developer infrastructure management time by 3 hours a day

  103. Thank you!

  104. Q & A matthew@amazon.com @mza on Twitter