Under the Covers of DynamoDB

Data modeling and performance tuning for scalable applications with DynamoDB. Includes tips and tricks from a real world example from Localytics.

Matt Wood

April 18, 2013

Transcript

  1. Overview. 1. Getting started 2. Data modeling 3. Partitioning 4. Replication & Analytics 5. Customer story: Localytics
  2. DynamoDB is a managed NoSQL database service. Store and retrieve any amount of data. Serve any level of request traffic.
  3. Read throughput. Strong or eventual consistency. Provisioned units = size of item x reads per second. $0.0065 per hour for 50 units.
  4. Read throughput. Strong or eventual consistency. Provisioned units = size of item x reads per second. $0.0065 per hour for 100 units.
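
The formula on slides 3 and 4 can be sketched as a small calculation. This is a rough sketch under stated assumptions, not the service's billing logic: `unit_kb` (the item size covered by one unit) and the per-block rounding in `hourly_cost` are assumptions, and the halving for eventually consistent reads is one reading of why the same price buys 50 units on one slide and 100 on the next.

```python
import math

def read_units(item_size_kb, reads_per_second, unit_kb=4.0, eventually_consistent=False):
    """Estimate read capacity units: size of item x reads per second.

    unit_kb is an assumption; check the current DynamoDB documentation
    for the item size that one unit actually covers.
    """
    # Each read consumes a whole number of units, so round the item size up.
    units_per_read = math.ceil(item_size_kb / unit_kb)
    units = units_per_read * reads_per_second
    if eventually_consistent:
        # Assumption: an eventually consistent read costs half a strong read.
        units = math.ceil(units / 2)
    return units

def hourly_cost(units, price_per_block=0.0065, block_size=50):
    """Hourly price from the slide's $0.0065 per 50 units.

    Rounding up to whole blocks is an assumption made for illustration.
    """
    return math.ceil(units / block_size) * price_per_block
```

For example, `read_units(6, 10)` rounds a 6 KB item up to two 4 KB units and returns 20 units for ten reads per second.
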
  5. Indexed data storage. $0.25 per GB per month. Tiered bandwidth pricing: aws.amazon.com/dynamodb/pricing
  6. Authentication. Session based to minimize latency. Uses the Amazon Security Token Service. Handled by AWS SDKs. Integrates with IAM.
  7. Libraries, mappers and mocks. ColdFusion, Django, Erlang, Java, .Net, Node.js, Perl, PHP, Python, Ruby. http://j.mp/dynamodb-libs
  8. id = 100, date = 2012-05-16-09-00-10, total = 25.00; id = 101, date = 2012-05-15-15-00-11, total = 35.00; id = 101, date = 2012-05-16-12-00-10, total = 100.00
  9. id = 100, date = 2012-05-16-09-00-10, total = 25.00; id = 101, date = 2012-05-15-15-00-11, total = 35.00; id = 101, date = 2012-05-16-12-00-10, total = 100.00. Table.
  10. id = 100, date = 2012-05-16-09-00-10, total = 25.00; id = 101, date = 2012-05-15-15-00-11, total = 35.00; id = 101, date = 2012-05-16-12-00-10, total = 100.00. Item.
  11. id = 100, date = 2012-05-16-09-00-10, total = 25.00; id = 101, date = 2012-05-15-15-00-11, total = 35.00; id = 101, date = 2012-05-16-12-00-10, total = 100.00. Attribute.
  12. Where is the schema? Tables do not require a formal schema. Items are an arbitrarily sized hash.
  13. Indexing. Items are indexed by primary and secondary keys. Primary keys can be composite. Secondary keys are local to the table.
  14. One API call, multiple items. BatchGet returns multiple items by key. Throughput is measured by IO, not API calls. BatchWrite performs up to 25 put or delete operations.
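
Because a batch write carries at most 25 put or delete operations, a larger workload has to be cut into batches first. The sketch below only shows the chunking step; the request dictionaries are hypothetical, and sending each batch is left to whichever client library is in use.

```python
def chunk(requests, size=25):
    """Yield consecutive slices of at most `size` requests."""
    for start in range(0, len(requests), size):
        yield requests[start:start + size]

# Hypothetical put requests; a real client would send each batch in one call.
puts = [{"id": i, "total": 10.0 * i} for i in range(60)]
batches = list(chunk(puts))
```

Sixty requests become three batches of 25, 25 and 10. Batching reduces the number of API calls, not the throughput consumed, since throughput is measured by IO.
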
  15. Query patterns. Retrieve all items by hash key. Range key conditions: ==, <, >, >=, <=, begins with, between. Counts. Top and bottom n values. Paged responses.
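
These query patterns can be illustrated with an in-memory stand-in for a table whose hash key is `id` and whose range key is `date`, using the order items from the earlier slides. The `query` helper is hypothetical and does not reproduce any SDK call; it only shows what each pattern selects.

```python
orders = [
    {"id": 100, "date": "2012-05-16-09-00-10", "total": 25.00},
    {"id": 101, "date": "2012-05-15-15-00-11", "total": 35.00},
    {"id": 101, "date": "2012-05-16-12-00-10", "total": 100.00},
]

def query(items, hash_value, range_condition=lambda d: True, limit=None, descending=False):
    """Items for one hash key, filtered and ordered by the range key."""
    matches = sorted(
        (i for i in items if i["id"] == hash_value and range_condition(i["date"])),
        key=lambda i: i["date"],
        reverse=descending,
    )
    return matches if limit is None else matches[:limit]

all_101 = query(orders, 101)                                       # all items by hash key
may_16 = query(orders, 101, lambda d: d.startswith("2012-05-16"))  # begins with
between = query(orders, 101, lambda d: "2012-05-15" <= d <= "2012-05-15-23-59-59")
latest = query(orders, 101, limit=1, descending=True)              # top n by range key
count = len(query(orders, 101))                                    # count
```

Storing dates as zero-padded strings is what makes the range conditions work here, because string order then matches chronological order.
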
  16. Players: user_id = mza, location = Cambridge, joined = 2011-07-04; user_id = jeffbarr, location = Seattle, joined = 2012-01-20; user_id = werner, location = Worldwide, joined = 2011-05-15.
  17. Players: user_id = mza, location = Cambridge, joined = 2011-07-04; user_id = jeffbarr, location = Seattle, joined = 2012-01-20; user_id = werner, location = Worldwide, joined = 2011-05-15. Scores: user_id = mza, game = angry-birds, score = 11,000; user_id = mza, game = tetris, score = 1,223,000; user_id = werner, game = bejewelled, score = 55,000.
  18. Players: user_id = mza, location = Cambridge, joined = 2011-07-04; user_id = jeffbarr, location = Seattle, joined = 2012-01-20; user_id = werner, location = Worldwide, joined = 2011-05-15. Scores: user_id = mza, game = angry-birds, score = 11,000; user_id = mza, game = tetris, score = 1,223,000; user_id = werner, game = bejewelled, score = 55,000. Leader boards: game = angry-birds, score = 11,000, user_id = mza; game = tetris, score = 1,223,000, user_id = mza; game = tetris, score = 9,000,000, user_id = jeffbarr.
  19. Players: user_id = mza, location = Cambridge, joined = 2011-07-04; user_id = jeffbarr, location = Seattle, joined = 2012-01-20; user_id = werner, location = Worldwide, joined = 2011-05-15. Scores: user_id = mza, game = angry-birds, score = 11,000; user_id = mza, game = tetris, score = 1,223,000; user_id = werner, game = bejewelled, score = 55,000. Leader boards: game = angry-birds, score = 11,000, user_id = mza; game = tetris, score = 1,223,000, user_id = mza; game = tetris, score = 9,000,000, user_id = jeffbarr. Query for scores by user.
  20. Players: user_id = mza, location = Cambridge, joined = 2011-07-04; user_id = jeffbarr, location = Seattle, joined = 2012-01-20; user_id = werner, location = Worldwide, joined = 2011-05-15. Scores: user_id = mza, game = angry-birds, score = 11,000; user_id = mza, game = tetris, score = 1,223,000; user_id = werner, game = bejewelled, score = 55,000. Leader boards: game = angry-birds, score = 11,000, user_id = mza; game = tetris, score = 1,223,000, user_id = mza; game = tetris, score = 9,000,000, user_id = jeffbarr. High scores by game.
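
The two access patterns in these slides, scores by user and high scores by game, come from storing the same scores under two different keys. A minimal in-memory sketch, assuming the Scores table is keyed by user_id then game and the Leader boards table by game then score; the rows merge the sample data from both tables, and the helper names are hypothetical.

```python
from collections import defaultdict

# Sample rows merged from the Scores and Leader boards slides.
rows = [
    ("mza", "angry-birds", 11000),
    ("mza", "tetris", 1223000),
    ("werner", "bejewelled", 55000),
    ("jeffbarr", "tetris", 9000000),
]

scores_by_user = defaultdict(dict)  # hash key user_id, range key game
leaderboard = defaultdict(list)     # hash key game, range key score

for user_id, game, score in rows:
    scores_by_user[user_id][game] = score
    leaderboard[game].append((score, user_id))

def scores_for(user_id):
    """Query for scores by user."""
    return dict(scores_by_user[user_id])

def high_scores(game, n=10):
    """High scores by game, best first."""
    return sorted(leaderboard[game], reverse=True)[:n]
```

Writing each score twice costs extra write throughput, but each read then touches a single hash key instead of scanning.
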
  21. message_id = 1, part = 1, message = <first 64k>; message_id = 1, part = 2, message = <second 64k>; message_id = 1, part = 3, message = <third 64k>. Split across items.
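
Splitting a large value across items can be sketched as follows, with `message_id` as the hash key and `part` as the range key. The 64 KB part size mirrors the slide; a real item limit also counts key and attribute names, so practical parts would need to be somewhat smaller. The function names are hypothetical.

```python
PART_SIZE = 64 * 1024  # bytes per part, matching the slide's 64k pieces

def split_message(message_id, payload, part_size=PART_SIZE):
    """Cut a payload into numbered items that share one message_id."""
    return [
        {"message_id": message_id, "part": n + 1, "message": payload[start:start + part_size]}
        for n, start in enumerate(range(0, len(payload), part_size))
    ]

def join_message(items):
    """Reassemble the payload by ordering the items on the part number."""
    return b"".join(i["message"] for i in sorted(items, key=lambda i: i["part"]))
```

Reading the message back is then a single query on the hash key, ordered by the range key.
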
  22. message_id = 1, message = http://s3.amazonaws.com...; message_id = 2, message = http://s3.amazonaws.com...; message_id = 3, message = http://s3.amazonaws.com... Store a pointer to S3.
  23. Hot and cold tables. April: event_id = 1000, timestamp = 2013-04-16-09-59-01, key = value; event_id = 1001, timestamp = 2013-04-16-09-59-02, key = value; event_id = 1002, timestamp = 2013-04-16-09-59-02, key = value. March: event_id = 1000, timestamp = 2013-03-01-09-59-01, key = value; event_id = 1001, timestamp = 2013-03-01-09-59-02, key = value.
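
With one table per month, the writer has to pick the table from the event timestamp. A small sketch, assuming the timestamp format shown on the slide; the `events_YYYY_MM` naming scheme is hypothetical.

```python
from datetime import datetime

def table_for(timestamp, prefix="events"):
    """Table name for an event, for example events_2013_04."""
    when = datetime.strptime(timestamp, "%Y-%m-%d-%H-%M-%S")
    return f"{prefix}_{when:%Y_%m}"
```

The current month's table can then be given high throughput while older tables are scaled down or archived.
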
  24. Archive data. Move old data to S3: lower cost. Still available for analytics. Run queries across hot and cold data with Elastic MapReduce.
  25. Uniform workload. Data stored across multiple partitions. Data is primarily distributed by primary key. Provisioned throughput is divided evenly across partitions.
  26. Distinct values for hash keys. Best practice 1: Hash key elements should have a high number of distinct values.
  27. user_id = mza, first_name = Matt, last_name = Wood; user_id = jeffbarr, first_name = Jeff, last_name = Barr; user_id = werner, first_name = Werner, last_name = Vogels; user_id = simone, first_name = Simone, last_name = Brunozzi; ... Lots of users with unique user_id. Workload well distributed across hash key.
  28. Avoid limited hash key values. Best practice 2: Hash key elements should have a high number of distinct values.
  29. status = 200, date = 2012-04-01-00-00-01; status = 404, date = 2012-04-01-00-00-01; status = 404, date = 2012-04-01-00-00-01; status = 404, date = 2012-04-01-00-00-01. Small number of status codes. Uneven, non-uniform workload.
  30. Model for even distribution. Best practice 3: Access by hash key value should be evenly distributed across the dataset.
  31. mobile_id = 100, access_date = 2012-04-01-00-00-01; mobile_id = 100, access_date = 2012-04-01-00-00-02; mobile_id = 100, access_date = 2012-04-01-00-00-03; mobile_id = 100, access_date = 2012-04-01-00-00-04; ... Large number of devices. A small number are much more popular than others. Workload unevenly distributed.
  32. mobile_id = 100.1, access_date = 2012-04-01-00-00-01; mobile_id = 100.2, access_date = 2012-04-01-00-00-02; mobile_id = 100.3, access_date = 2012-04-01-00-00-03; mobile_id = 100.4, access_date = 2012-04-01-00-00-04; ... Sample access pattern. Workload randomized by hash key.
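
The suffixed keys on the slide (100.1, 100.2, ...) suggest spreading a hot device across several hash key values. A minimal sketch of that idea, assuming a random suffix on write; the shard count of 4 and the function names are hypothetical.

```python
import random

SHARDS = 4  # hypothetical number of suffixes per hot key

def write_key(mobile_id, rng=random):
    """Spread writes for one device across SHARDS hash key values."""
    return f"{mobile_id}.{rng.randint(1, SHARDS)}"

def read_keys(mobile_id):
    """Every hash key that must be queried to read the device back."""
    return [f"{mobile_id}.{n}" for n in range(1, SHARDS + 1)]
```

The trade-off is on the read side: reading one device now means querying every suffix and merging the results.
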
  33. create external table items_db (id string, votes bigint, views bigint)
      stored by 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
      tblproperties ("dynamodb.table.name" = "items",
                     "dynamodb.column.mapping" = "id:id,votes:votes,views:views");
  34. 5

  35. About Localytics • Mobile app analytics service • 750+ million devices and over 20,000 apps • Customers include: …and many more.
  36. About the development team • Small team of four managing the entire AWS infrastructure (100 EC2 instances) • Experts in big data • Leveraging Amazon's services has been the key to our success • Large scale users of: SQS, S3, ELB, RDS, Route53, ElastiCache, EMR …and of course DynamoDB
  37. Our use case: dedup data • Each datapoint includes a globally unique ID • Mobile traffic over 2G/3G periodically uploads duplicate data • We accept data within a 28-day window
  38. First design for dedup table. Unique ID: aaaaaaaaaaaaaaaaaaaaaaaaa333333333333333. Table name = dedup_table. IDs: aaaaaaaaaaaaaaaaaaaaaaaaa111111111111111, aaaaaaaaaaaaaaaaaaaaaaaaa222222222222222. "Test and Set" in a single operation: aaaaaaaaaaaaaaaaaaaaaaaaa333333333333333.
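
The "Test and Set" step can be pictured as a conditional write: insert the ID only if it is absent, and report whether the insert happened. The class below is an in-memory stand-in written for illustration, not the Localytics code; against the real service this would be a single conditional put.

```python
class DedupTable:
    """In-memory stand-in for a table keyed by unique ID."""

    def __init__(self):
        self._ids = set()

    def test_and_set(self, unique_id):
        """Insert the ID and return True only if it was not already present."""
        if unique_id in self._ids:
            return False
        self._ids.add(unique_id)
        return True

dedup_table = DedupTable()
first_seen = dedup_table.test_and_set("a" * 25 + "3" * 15)
seen_again = dedup_table.test_and_set("a" * 25 + "3" * 15)
```

Here `first_seen` is True and `seen_again` is False, so the second upload of the same datapoint can be dropped.
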
  39. Optimization One: data aging • Partition by month • Create the new table the day before the month starts • Need to keep two months of data
  40. Optimization One: data aging. Unique ID: bbbbbbbbbbbbbbbbbbbbbbbbb333333333333333. Check previous month: table name = March2013_dedup. IDs: aaaaaaaaaaaaaaaaaaaaaaaaa111111111111111, aaaaaaaaaaaaaaaaaaaaaaaaa222222222222222. Not here!
  41. Optimization One: data aging. Unique ID: bbbbbbbbbbbbbbbbbbbbbbbbb333333333333333. Test and Set in current month: inserted. Table name = April2013_dedup. IDs: bbbbbbbbbbbbbbbbbbbbbbbbb111111111111111, bbbbbbbbbbbbbbbbbbbbbbbbb222222222222222, bbbbbbbbbbbbbbbbbbbbbbbbb333333333333333.
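
The two-step lookup on these slides, a read against the previous month followed by a test and set in the current month, can be sketched with two sets standing in for the monthly tables. The function name is hypothetical, and the sample IDs follow the slides.

```python
def is_new(unique_id, previous_month, current_month):
    """Return True if the ID has not been seen in either monthly table."""
    if unique_id in previous_month:   # read against last month's table
        return False
    if unique_id in current_month:    # a conditional put would fail here
        return False
    current_month.add(unique_id)      # test and set in the current month
    return True

march = {"a" * 25 + "1" * 15, "a" * 25 + "2" * 15}
april = {"b" * 25 + "1" * 15, "b" * 25 + "2" * 15}
inserted = is_new("b" * 25 + "3" * 15, march, april)
```

The extra read against the previous month is what shows up as the read cost in the later comparison of plans.
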
  42. Optimization Two • Reduce the index size, which reduces costs • Each item has a 100 byte overhead, which is substantial • Combine multiple IDs into one record • Split each ID into two halves: the first half is the key, the second half is added to the set
  43. Optimization Two: use sets. Unique ID: ccccccccccccccccccccccccccc999999999999999
    Prefix                       Values
    aaaaaaaaaaaaaaaaaaaaaaaaa    [111111111111111, 222222222222222, 333333333333333]
    bbbbbbbbbbbbbbbbbbbbbbbbb    [444444444444444, 555555555555555, 666666666666666]
    ccccccccccccccccccccccccccc  [777777777777777, 888888888888888, ] with 999999999999999 added to the set
  44. Optimization Three: combine months • Go back to a single table
    Prefix          March2013                          April2013
    aaaaaaaaaa...   [111111111111111, 22222222222...   [1212121212121212, 3434343434....
    bbbbbbbbbb...   [444444444444444, 555555555....    [4545454545454545, 6767676767.....
    ccccccccccc...  [777777777777777, 888888888...     [8989898989898989, 1313131313....
    One operation: 1. Delete February2013 field 2. Check ID in March2013 3. Test and Set into April2013
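
The three steps listed on this slide can be sketched against a single record that holds one set per month. The dictionary stands in for one item; the slide states that the real table handles all three steps in one operation, which this in-memory version does not try to prove. The function name and short sample suffixes are hypothetical.

```python
def dedup(item, suffix, current="April2013", previous="March2013", expired="February2013"):
    """item maps a month name to the set of suffixes seen in that month."""
    item.pop(expired, None)                      # 1. delete the expired month's field
    if suffix in item.get(previous, set()):      # 2. check the ID in the previous month
        return False
    current_set = item.setdefault(current, set())
    if suffix in current_set:                    # 3. test and set into the current month
        return False
    current_set.add(suffix)
    return True

record = {"February2013": {"000"}, "March2013": {"111"}}
seen_in_march = dedup(record, "111")
new_in_april = dedup(record, "222")
```

After these calls the February field is gone, `seen_in_march` is False and `new_in_april` is True.
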
  45. Recap. Compare plans for 20 billion IDs per month.
    Plan                  Storage costs  Read costs  Write costs  Total   Savings
    Naive (after a year)  $8400          0           $4000        $12400
    Data Age              $900           $350        $4000        $5250   57%
    Using Sets            $150           $350        $4000        $4500   64%
    Multiple Months       $150           0           $4000        $4150   67%
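
The totals and savings in the recap follow directly from the three cost columns. The short calculation below recomputes them from the slide's figures, taking the naive plan as the baseline; the variable names are illustrative.

```python
# (storage, read, write) costs in dollars, copied from the recap slide.
plans = {
    "Naive (after a year)": (8400, 0, 4000),
    "Data Age": (900, 350, 4000),
    "Using Sets": (150, 350, 4000),
    "Multiple Months": (150, 0, 4000),
}

totals = {name: sum(costs) for name, costs in plans.items()}
baseline = totals["Naive (after a year)"]

# Percentage saved relative to the naive plan, to one decimal place.
savings = {name: round(100 * (1 - total / baseline), 1) for name, total in totals.items()}
```

The computed savings are 57.7%, 63.7% and 66.5%, which the slide shows as 57%, 64% and 67%. Write costs are identical in every plan, so all of the savings come from storage and reads.
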
  46. Summary. 1. Getting started 2. Data modeling 3. Partitioning 4. Replication & Analytics 5. Customer story: Localytics