Under the Covers of DynamoDB

Data modeling and performance tuning for scalable applications with DynamoDB. Includes tips and tricks from a real-world example at Localytics.

Matt Wood

April 18, 2013

Transcript

  1. Under the Covers of DynamoDB Matt Wood Principal Data Scientist

    @mza
  2. Hello.

  3. Overview: 1. Getting started 2. Data modeling 3. Partitioning 4. Replication

    & Analytics 5. Customer story: Localytics
  4. Getting started 1

  5. DynamoDB is a managed NoSQL database service. Store and retrieve

    any amount of data. Serve any level of request traffic.
  6. Without the operational burden.

  7. Consistent, predictable performance. Single-digit millisecond latency. Backed by solid-state

    drives.
  8. Flexible data model. Key/attribute pairs. No schema required. Easy to

    create. Easy to adjust.
  9. Seamless scalability. No table size limits. Unlimited storage. No downtime.

  10. Durable. Consistent, disk only writes. Replication across data centers and

    availability zones.
  11. Without the operational burden.

  12. Focus on your app.

  13. Two decisions + three clicks = ready for use

  14. Two decisions + three clicks = ready for use. The two decisions: primary

    keys and level of throughput.
  16. Provisioned throughput. Reserve IOPS for reads and writes. Scale up

    or down at any time.
  17. Pay per capacity unit. Priced per hour of provisioned throughput.

  18. Write throughput. Provisioned units = size of item x writes per second. $0.0065

    per hour for 10 write units.
  19. Consistent writes. Atomic increment and decrement. Optimistic concurrency control: conditional

    writes.
  20. Transactions. Item level transactions only. Puts, updates and deletes are

    ACID.
  21. Read throughput. Strong or eventual consistency

  22. Read throughput. Strong or eventual consistency Provisioned units = size

    of item x reads per second $0.0065 per hour for 50 units
  23. Read throughput. Strong or eventual consistency Provisioned units = size

    of item x reads per second $0.0065 per hour for 100 units 2
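The throughput formula on the slides above (units = size of item x operations per second) can be sketched as a small calculation. This is only an illustration: rounding the item size up to whole kilobytes and scaling the price linearly are assumptions, and the prices are the 2013 figures quoted on the slides, not current ones.

```python
import math

# Prices quoted on the slides (2013); illustrative only.
PRICE_PER_BLOCK_HOUR = 0.0065
WRITE_UNITS_PER_BLOCK = 10
STRONG_READ_UNITS_PER_BLOCK = 50
EVENTUAL_READ_UNITS_PER_BLOCK = 100


def capacity_units(item_size_kb: float, ops_per_second: int) -> int:
    """Units = item size (assumed rounded up to whole KB) x operations per second."""
    return math.ceil(item_size_kb) * ops_per_second


def hourly_cost(units: int, units_per_block: int) -> float:
    """Assumes cost scales linearly with the number of provisioned units."""
    return units / units_per_block * PRICE_PER_BLOCK_HOUR


# Hypothetical workload: 1.5 KB items, 100 writes/s and 500 reads/s.
write_units = capacity_units(item_size_kb=1.5, ops_per_second=100)
read_units = capacity_units(item_size_kb=1.5, ops_per_second=500)

write_cost = hourly_cost(write_units, WRITE_UNITS_PER_BLOCK)
strong_read_cost = hourly_cost(read_units, STRONG_READ_UNITS_PER_BLOCK)
eventual_read_cost = hourly_cost(read_units, EVENTUAL_READ_UNITS_PER_BLOCK)
```

Under these assumptions, eventually consistent reads cost half as much per hour as strongly consistent reads for the same workload, which is the trade-off the two read-throughput slides contrast.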
  24. Read throughput. Strong or eventual consistency Same latency expectations. Mix

    and match at ‘read time’.
  25. Provisioned throughput is managed by DynamoDB.

  26. Data is partitioned and managed by DynamoDB.

  27. Indexed data storage. $0.25 per GB per month. Tiered bandwidth

    pricing: aws.amazon.com/dynamodb/pricing
  28. Reserved capacity. Save up to 53% with a 1 year reservation. Save up

    to 76% with a 3 year reservation.
  29. Authentication. Session based to minimize latency. Uses the Amazon Security

    Token Service. Handled by AWS SDKs. Integrates with IAM.
  30. Monitoring. CloudWatch metrics: latency, consumed read and write throughput, errors

    and throttling.
  31. Libraries, mappers and mocks. ColdFusion, Django, Erlang, Java, .Net, Node.js,

    Perl, PHP, Python, Ruby http://j.mp/dynamodb-libs
  32. Data modeling 2

  33. id = 100 date = 2012-05-16-09-00-10 total = 25.00

    id = 101 date = 2012-05-15-15-00-11 total = 35.00
    id = 101 date = 2012-05-16-12-00-10 total = 100.00
  34. id = 100 date = 2012-05-16-09-00-10 total = 25.00

    id = 101 date = 2012-05-15-15-00-11 total = 35.00
    id = 101 date = 2012-05-16-12-00-10 total = 100.00
    Table
  35. id = 100 date = 2012-05-16-09-00-10 total = 25.00

    id = 101 date = 2012-05-15-15-00-11 total = 35.00
    id = 101 date = 2012-05-16-12-00-10 total = 100.00
    Item
  36. id = 100 date = 2012-05-16-09-00-10 total = 25.00

    id = 101 date = 2012-05-15-15-00-11 total = 35.00
    id = 101 date = 2012-05-16-12-00-10 total = 100.00
    Attribute
  37. Where is the schema? Tables do not require a formal

    schema. Items are an arbitrarily sized hash.
  38. Indexing. Items are indexed by primary and secondary keys. Primary

    keys can be composite. Secondary keys are local to the table.
  39. ID Date Total

  40. ID Date Total Hash key

  41. ID Date Total Hash key Range key Composite primary key

  42. ID Date Total Hash key Range key Secondary range key

  43. Programming DynamoDB. Small but perfectly formed API.

  44. CreateTable UpdateTable DeleteTable DescribeTable ListTables Query Scan PutItem GetItem UpdateItem

    DeleteItem BatchGetItem BatchWriteItem
  47. Conditional updates. PutItem, UpdateItem, DeleteItem can take optional conditions for

    operation. UpdateItem performs atomic increments.
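Conditional writes and atomic increments can be illustrated without touching the service. The sketch below is a toy in-memory stand-in, not the DynamoDB API: the class, its method names, and the `version` attribute are all hypothetical, and it only mimics the behaviour the slide describes (a write succeeds only if the expected attribute values still hold).

```python
class ConditionFailed(Exception):
    """Raised when a write's precondition does not hold."""


class InMemoryTable:
    """Toy table keyed by a single hash key (hypothetical, not the DynamoDB API)."""

    def __init__(self):
        self._items = {}

    def put_item(self, key, item, expected=None):
        """Write the item only if every attribute in `expected` currently matches."""
        current = self._items.get(key, {})
        for attr, value in (expected or {}).items():
            if current.get(attr) != value:
                raise ConditionFailed(f"{attr} is {current.get(attr)!r}, expected {value!r}")
        self._items[key] = dict(item)

    def increment(self, key, attr, delta=1):
        """Add `delta` to a numeric attribute, creating it at 0 if missing."""
        item = self._items.setdefault(key, {})
        item[attr] = item.get(attr, 0) + delta
        return item[attr]

    def get_item(self, key):
        return dict(self._items.get(key, {}))


table = InMemoryTable()
table.put_item("mza", {"score": 10, "version": 1})

# Optimistic concurrency: the update applies only if nobody changed the version.
table.put_item("mza", {"score": 20, "version": 2}, expected={"version": 1})

# A second writer still holding version 1 is rejected instead of overwriting.
try:
    table.put_item("mza", {"score": 30, "version": 2}, expected={"version": 1})
    stale_write_rejected = False
except ConditionFailed:
    stale_write_rejected = True

plays = table.increment("mza", "plays")
plays_after_five_more = table.increment("mza", "plays", 5)
```

In the real service the check and the write happen as one operation on the server; the toy version is single-threaded and only shows the control flow a client sees.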
  48. One API call, multiple items BatchGet returns multiple items by

    key. Throughput is measured by IO, not API calls. BatchWrite performs up to 25 put or delete operations.
  49. CreateTable UpdateTable DeleteTable DescribeTable ListTables Query Scan PutItem GetItem UpdateItem

    DeleteItem BatchGetItem BatchWriteItem
  50. Query vs Scan Query returns items by key. Scan reads

    the whole table sequentially.
  51. Query patterns Retrieve all items by hash key. Range key

    conditions: ==, <, >, >=, <=, begins with, between. Counts. Top and bottom n values. Paged responses.
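The query patterns listed above follow from keeping items sorted by range key within each hash key. Here is a minimal in-memory sketch of that idea, using the order items from the data modeling slides; the class and its parameters are hypothetical and say nothing about how DynamoDB is implemented internally.

```python
from bisect import bisect_left, bisect_right, insort
from collections import defaultdict


class CompositeKeyTable:
    """Toy table with a hash key and a sorted range key (illustrative only)."""

    def __init__(self):
        self._range_keys = defaultdict(list)  # hash key -> sorted range keys
        self._items = {}                      # (hash key, range key) -> item

    def put(self, hash_key, range_key, item):
        if (hash_key, range_key) not in self._items:
            insort(self._range_keys[hash_key], range_key)
        self._items[(hash_key, range_key)] = item

    def query(self, hash_key, low=None, high=None, limit=None, descending=False):
        """Return items for one hash key, optionally bounded by range key."""
        keys = self._range_keys.get(hash_key, [])
        start = 0 if low is None else bisect_left(keys, low)
        stop = len(keys) if high is None else bisect_right(keys, high)
        selected = keys[start:stop]
        if descending:
            selected = selected[::-1]
        if limit is not None:
            selected = selected[:limit]
        return [self._items[(hash_key, rk)] for rk in selected]


orders = CompositeKeyTable()
orders.put(100, "2012-05-16-09-00-10", {"total": 25.00})
orders.put(101, "2012-05-15-15-00-11", {"total": 35.00})
orders.put(101, "2012-05-16-12-00-10", {"total": 100.00})

# Retrieve all items by hash key.
all_for_101 = orders.query(101)

# "Begins with" expressed as a range: every key starting with the prefix.
prefix = "2012-05-16"
on_may_16 = orders.query(101, low=prefix, high=prefix + "\uffff")

# Top-n: newest order first, limited to one result.
latest = orders.query(101, limit=1, descending=True)
```

A scan, by contrast, would have to walk every entry in `_items` regardless of key, which is why the previous slide distinguishes the two.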
  52. Example 1: Mapping relationships.
  53. Players

    user_id = mza location = Cambridge joined = 2011-07-04
    user_id = jeffbarr location = Seattle joined = 2012-01-20
    user_id = werner location = Worldwide joined = 2011-05-15
  54. Players

    user_id = mza location = Cambridge joined = 2011-07-04
    user_id = jeffbarr location = Seattle joined = 2012-01-20
    user_id = werner location = Worldwide joined = 2011-05-15

    Scores

    user_id = mza game = angry-birds score = 11,000
    user_id = mza game = tetris score = 1,223,000
    user_id = werner game = bejewelled score = 55,000
  55. Players

    user_id = mza location = Cambridge joined = 2011-07-04
    user_id = jeffbarr location = Seattle joined = 2012-01-20
    user_id = werner location = Worldwide joined = 2011-05-15

    Scores

    user_id = mza game = angry-birds score = 11,000
    user_id = mza game = tetris score = 1,223,000
    user_id = werner game = bejewelled score = 55,000

    Leader boards

    game = angry-birds score = 11,000 user_id = mza
    game = tetris score = 1,223,000 user_id = mza
    game = tetris score = 9,000,000 user_id = jeffbarr
  56. Players

    user_id = mza location = Cambridge joined = 2011-07-04
    user_id = jeffbarr location = Seattle joined = 2012-01-20
    user_id = werner location = Worldwide joined = 2011-05-15

    Scores

    user_id = mza game = angry-birds score = 11,000
    user_id = mza game = tetris score = 1,223,000
    user_id = werner game = bejewelled score = 55,000

    Leader boards

    game = angry-birds score = 11,000 user_id = mza
    game = tetris score = 1,223,000 user_id = mza
    game = tetris score = 9,000,000 user_id = jeffbarr

    Query for scores by user
  57. Players

    user_id = mza location = Cambridge joined = 2011-07-04
    user_id = jeffbarr location = Seattle joined = 2012-01-20
    user_id = werner location = Worldwide joined = 2011-05-15

    Scores

    user_id = mza game = angry-birds score = 11,000
    user_id = mza game = tetris score = 1,223,000
    user_id = werner game = bejewelled score = 55,000

    Leader boards

    game = angry-birds score = 11,000 user_id = mza
    game = tetris score = 1,223,000 user_id = mza
    game = tetris score = 9,000,000 user_id = jeffbarr

    High scores by game
  58. Example 2: Storing large items.
  59. Unlimited storage. Unlimited attributes per item. Unlimited items per table.

    Maximum of 64k per item.
  60. Split across items.

    message_id = 1 part = 1 message = <first 64k>
    message_id = 1 part = 2 message = <second 64k>
    message_id = 1 part = 3 message = <third 64k>
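Splitting a large message across items amounts to chunking the body and using the part number as the range key. A minimal sketch follows, with hypothetical function names; note that it treats the whole 64k limit as available for the message body, whereas a real item would also need room for keys and attribute names.

```python
MAX_PART_BYTES = 64 * 1024  # per-item limit from the slide; ignores key and attribute overhead


def split_message(message_id, body: bytes, part_size=MAX_PART_BYTES):
    """Return one item per chunk, keyed by (message_id, part)."""
    return [
        {"message_id": message_id, "part": n + 1, "message": body[offset:offset + part_size]}
        for n, offset in enumerate(range(0, len(body), part_size))
    ]


def join_message(parts):
    """Reassemble the body from parts, whatever order they arrive in."""
    return b"".join(p["message"] for p in sorted(parts, key=lambda p: p["part"]))


body = bytes(150 * 1024)       # a 150 KB payload of zero bytes, for illustration
parts = split_message(1, body)
restored = join_message(reversed(parts))
```

Because all parts share the same `message_id` hash key, a single query by hash key would return them in part order, which is what makes the reassembly cheap.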
  61. Store a pointer to S3.

    message_id = 1 message = http://s3.amazonaws.com...
    message_id = 2 message = http://s3.amazonaws.com...
    message_id = 3 message = http://s3.amazonaws.com...
  62. Example 3: Time series data.
  63. Hot and cold tables.

    April
    event_id = 1000 timestamp = 2013-04-16-09-59-01 key = value
    event_id = 1001 timestamp = 2013-04-16-09-59-02 key = value
    event_id = 1002 timestamp = 2013-04-16-09-59-02 key = value

    March
    event_id = 1000 timestamp = 2013-03-01-09-59-01 key = value
    event_id = 1001 timestamp = 2013-03-01-09-59-02 key = value
    event_id = timestamp = key =
  64. April March February January December

  65. Archive data. Move old data to S3: lower cost. Still

    available for analytics. Run queries across hot and cold data with Elastic MapReduce.
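The hot and cold table pattern needs two small pieces of logic: routing an event to the table for its month, and deciding which older tables are candidates for archiving to S3. The sketch below assumes a hypothetical `events_YYYY_MM` naming scheme and a two-month hot window; neither comes from the slides.

```python
from datetime import datetime


def table_for(timestamp: str, prefix: str = "events") -> str:
    """Map a slide-style timestamp (YYYY-MM-DD-HH-MM-SS) to its monthly table name."""
    ts = datetime.strptime(timestamp, "%Y-%m-%d-%H-%M-%S")
    return f"{prefix}_{ts:%Y_%m}"


def tables_to_archive(existing, now: datetime, keep_months: int = 2, prefix: str = "events"):
    """Return the tables that fall outside the hot window ending at `now`."""
    month_index = now.year * 12 + (now.month - 1)
    hot = set()
    for back in range(keep_months):
        year, month0 = divmod(month_index - back, 12)
        hot.add(f"{prefix}_{year:04d}_{month0 + 1:02d}")
    return sorted(t for t in existing if t not in hot)


existing_tables = [
    "events_2012_12", "events_2013_01", "events_2013_02",
    "events_2013_03", "events_2013_04",
]
target = table_for("2013-04-16-09-59-01")
cold = tables_to_archive(existing_tables, now=datetime(2013, 4, 18))
```

Keeping the month in the table name rather than in the key means old data can be dropped or exported a whole table at a time.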
  66. Partitioning 3

  67. Uniform workload. Data stored across multiple partitions. Data is primarily

    distributed by primary key. Provisioned throughput is divided evenly across partitions.
  68. To achieve and maintain full provisioned throughput, spread workload evenly

    across hash keys.
  69. Non-Uniform workload. Might be throttled, even at high levels of

    throughput.
  70. Best practice 1: Distinct values for hash keys.

    Hash key elements should have a high number of distinct values.
  71. user_id = mza first_name = Matt last_name = Wood

    user_id = jeffbarr first_name = Jeff last_name = Barr
    user_id = werner first_name = Werner last_name = Vogels
    user_id = simone first_name = Simone last_name = Brunozzi
    ... ... ...
    Lots of users with unique user_id. Workload well distributed across hash key.
  72. Best practice 2: Avoid limited hash key values.

    Hash key elements should have a high number of distinct values.
  73. status = 200 date = 2012-04-01-00-00-01

    status = 404 date = 2012-04-01-00-00-01
    status = 404 date = 2012-04-01-00-00-01
    status = 404 date = 2012-04-01-00-00-01
    Small number of status codes. Uneven, non-uniform workload.
  74. Best practice 3: Model for even distribution.

    Access by hash key value should be evenly distributed across the dataset.
  75. mobile_id = 100 access_date = 2012-04-01-00-00-01

    mobile_id = 100 access_date = 2012-04-01-00-00-02
    mobile_id = 100 access_date = 2012-04-01-00-00-03
    mobile_id = 100 access_date = 2012-04-01-00-00-04
    ... ...
    Large number of devices. Small number which are much more popular than others. Workload unevenly distributed.
  76. mobile_id = 100.1 access_date = 2012-04-01-00-00-01

    mobile_id = 100.2 access_date = 2012-04-01-00-00-02
    mobile_id = 100.3 access_date = 2012-04-01-00-00-03
    mobile_id = 100.4 access_date = 2012-04-01-00-00-04
    ... ...
    Sample access pattern. Workload randomized by hash key.
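Randomizing the hash key for a hot device, as in the `100.1` to `100.4` example above, can be sketched as a suffix chosen at write time plus a fan-out at read time. The shard count of four and the function names are assumptions made for illustration.

```python
import random

SHARDS = 4  # hypothetical number of suffixes per hot key


def write_key(mobile_id, rng=random):
    """Pick one of the suffixed hash keys at random for each write."""
    return f"{mobile_id}.{rng.randint(1, SHARDS)}"


def read_keys(mobile_id):
    """Every suffixed hash key that must be queried to read the device's full history."""
    return [f"{mobile_id}.{n}" for n in range(1, SHARDS + 1)]


all_keys = read_keys(100)
sample_key = write_key(100, random.Random(0))
```

The trade-off is that reads now cost one query per suffix, so this suits keys that are written far more often than they are read.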
  77. Replication & Analytics 4

  78. Seamless scale. Scalable methods for data processing. Scalable methods for

    backup/restore.
  79. Amazon Elastic MapReduce. Managed Hadoop service for data-intensive workflows. aws.amazon.com/emr

  80. create external table items_db (id string, votes bigint, views bigint)

    stored by 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' tblproperties ("dynamodb.table.name" = "items", "dynamodb.column.mapping" = "id:id,votes:votes,views:views");
  81. select id, votes, views from items_db order by views desc;

  82. 5

  83. Mohit Dilawari Director of Engineering @mdilawari DynamoDB @ Localytics

  84. About Localytics

    • Mobile App Analytics Service
    • 750+ Million Devices and over 20,000 Apps
    • Customers Include: …and many more.
  85. About the Development Team

    • Small team of four managing entire AWS infrastructure - 100 EC2 Instances
    • Experts in BigData
    • Leveraging Amazon's services has been the key to our success
    • Large scale users of: SQS, S3, ELB, RDS, Route53, ElastiCache, EMR …and of course DynamoDB
  86. Why DynamoDB? Set it and Forget it

  87. Our use-case: Dedup Data

    • Each datapoint includes a globally unique ID
    • Mobile traffic over 2G/3G will upload periodic duplicate data
    • We accept data up to a 28 day window
  88. First Design for Dedup table

    Unique ID: aaaaaaaaaaaaaaaaaaaaaaaaa333333333333333
    Table Name = dedup_table
    ID
    aaaaaaaaaaaaaaaaaaaaaaaaa111111111111111
    aaaaaaaaaaaaaaaaaaaaaaaaa222222222222222
    aaaaaaaaaaaaaaaaaaaaaaaaa333333333333333 ("Test and Set" in a single operation)
  89. Optimization One - Data Aging

    • Partition by Month
    • Create the new table the day before the month starts
    • Need to keep two months of data
  90. Optimization One - Data Aging

    Unique ID: bbbbbbbbbbbbbbbbbbbbbbbbb333333333333333
    Check Previous month: Table Name = March2013_dedup
    ID
    aaaaaaaaaaaaaaaaaaaaaaaaa111111111111111
    aaaaaaaaaaaaaaaaaaaaaaaaa222222222222222
    Not Here!
  91. Optimization One - Data Aging

    Unique ID: bbbbbbbbbbbbbbbbbbbbbbbbb333333333333333
    Test and Set in current month: Table Name = April2013_dedup
    ID
    bbbbbbbbbbbbbbbbbbbbbbbbb111111111111111
    bbbbbbbbbbbbbbbbbbbbbbbbb222222222222222
    bbbbbbbbbbbbbbbbbbbbbbbbb333333333333333 (Inserted)
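The two-step lookup in the data aging design (check the previous month's table, then test-and-set in the current month's table) can be sketched with plain Python sets standing in for the monthly tables. The function name is hypothetical, and the IDs are only shaped like the ones on the slides; in the real system the second step would be a single conditional write rather than a separate check and insert.

```python
def is_duplicate(tables, current_month, previous_month, unique_id):
    """Return True if unique_id was already seen; otherwise record it and return False."""
    # Step 1: a read against last month's table.
    if unique_id in tables.get(previous_month, set()):
        return True
    # Step 2: "test and set" against this month's table.
    current = tables.setdefault(current_month, set())
    if unique_id in current:
        return True
    current.add(unique_id)
    return False


# Illustrative IDs: a 25-character prefix followed by a 15-character suffix.
old_id = "a" * 25 + "1" * 15
new_id = "b" * 25 + "3" * 15

tables = {"March2013_dedup": {old_id}, "April2013_dedup": set()}

first_sighting = is_duplicate(tables, "April2013_dedup", "March2013_dedup", new_id)
second_sighting = is_duplicate(tables, "April2013_dedup", "March2013_dedup", new_id)
seen_last_month = is_duplicate(tables, "April2013_dedup", "March2013_dedup", old_id)
```

The extra read against the previous month is where the Read Costs column in the recap comes from for this plan.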
  92. Optimization Two

    • Reduce the index size - Reduces costs
    • Each item has a 100 byte overhead which is substantial
    • Combine multiple IDs together to one record
    • Split each ID into two halves
      o First half is the key. Second half is added to the set
  93. Optimization Two - Use Sets

    Unique ID: ccccccccccccccccccccccccccc999999999999999
    Prefix                        Values
    aaaaaaaaaaaaaaaaaaaaaaaaa     [111111111111111, 222222222222222, 333333333333333]
    bbbbbbbbbbbbbbbbbbbbbbbbb     [444444444444444, 555555555555555, 666666666666666]
    ccccccccccccccccccccccccccc   [777777777777777, 888888888888888, ] <- 999999999999999
  94. Optimization Three - Combine Months

    • Go back to a single table

    Prefix           March2013                            April2013
    aaaaaaaaaa...    [111111111111111, 22222222222...     [1212121212121212, 3434343434....
    bbbbbbbbbb...    [444444444444444, 555555555....      [4545454545454545, 6767676767.....
    ccccccccccc...   [777777777777777, 888888888...       [8989898989898989, 1313131313....

    One Operation
    1. Delete February2013 Field
    2. Check ID in March2013
    3. Test and Set into April 2013
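The combined design (one row per ID prefix, one set-valued attribute per month) can be sketched with nested dictionaries. Everything here is illustrative: the function names are hypothetical, the 25-character prefix length is an assumption taken from the shape of the example IDs, and the three steps run as ordinary Python rather than as the single service operation the slide describes.

```python
def split_id(unique_id, prefix_len=25):
    """Split an ID into the row key (prefix) and the value stored in the set (suffix)."""
    return unique_id[:prefix_len], unique_id[prefix_len:]


def record(table, unique_id, current, previous, expired):
    """Return True if the ID is new (and store it), False if it is a duplicate."""
    prefix, suffix = split_id(unique_id)
    row = table.setdefault(prefix, {})
    row.pop(expired, None)                   # 1. delete the expired month's field
    if suffix in row.get(previous, set()):   # 2. check the ID in the previous month
        return False
    seen = row.setdefault(current, set())    # 3. test and set into the current month
    if suffix in seen:
        return False
    seen.add(suffix)
    return True


prefix_a = "a" * 25
table = {prefix_a: {"February2013": {"0" * 15}, "March2013": {"1" * 15}}}
months = {"current": "April2013", "previous": "March2013", "expired": "February2013"}

is_new = record(table, prefix_a + "9" * 15, **months)
is_new_again = record(table, prefix_a + "9" * 15, **months)
seen_in_march = record(table, prefix_a + "1" * 15, **months)
```

Folding the month into an attribute name is what lets the previous-month check ride along with the write, which matches the zero Read Costs shown for this plan in the recap.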
  95. Recap: Compare Plans for 20 Billion IDs per month

    Plan                   Storage Costs   Read Costs   Write Costs   Total    Savings
    Naive (after a year)   $8400           0            $4000         $12400
    Data Age               $900            $350         $4000         $5250    57%
    Using Sets             $150            $350         $4000         $4500    64%
    Multiple Months        $150            0            $4000         $4150    67%
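The totals and savings in the recap can be re-derived from the three cost columns. The short calculation below uses only the figures from the slide and measures savings against the naive plan; the computed percentages land within one percentage point of the slide's figures, with the small differences presumably down to rounding.

```python
# (storage, read, write) costs in dollars, as listed on the recap slide.
plans = {
    "Naive (after a year)": (8400, 0, 4000),
    "Data Age": (900, 350, 4000),
    "Using Sets": (150, 350, 4000),
    "Multiple Months": (150, 0, 4000),
}

totals = {name: sum(costs) for name, costs in plans.items()}
baseline = totals["Naive (after a year)"]

# Savings as a fraction of the naive plan's total.
savings = {name: 1 - total / baseline for name, total in totals.items()}

for name, total in totals.items():
    print(f"{name:22s} ${total:>6,d}  {savings[name]:.1%}")
```

Write costs are identical across all four plans, so every saving comes from shrinking storage and removing the extra read.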
  96. Thank You @mdilawari

  97. Summary: 1. Getting started 2. Data modeling 3. Partitioning 4. Replication

    & Analytics 5. Customer story: Localytics
  98. Free tier.

  99. aws.amazon.com/dynamodb

  100. Thank you! matthew@amazon.com @mza