Under the Covers of DynamoDB

Data modeling and performance tuning for scalable applications with DynamoDB. Includes tips and tricks from a real-world example at Localytics.

Matt Wood

April 18, 2013

Transcript

  1. Under the Covers of DynamoDB
    Matt Wood
    Principal Data Scientist
    @mza

  2. Hello.

  3. Overview
    1. Getting started
    2. Data modeling
    3. Partitioning
    4. Replication & Analytics
    5. Customer story: Localytics

  4. Getting started
    1

  5. DynamoDB is a managed
    NoSQL database service.
    Store and retrieve any amount of data.
    Serve any level of request traffic.

  6. Without the operational burden.

  7. Consistent, predictable performance.
    Single digit millisecond latency.
    Backed by solid-state drives.

  8. Flexible data model.
    Key/attribute pairs. No schema required.
    Easy to create. Easy to adjust.

  9. Seamless scalability.
    No table size limits. Unlimited storage.
    No downtime.

  10. Durable.
    Consistent, disk only writes.
    Replication across data centers and availability zones.

  11. Without the operational burden.

  12. Focus on your app.

  13. Two decisions + three clicks
    = ready for use

  14. Two decisions + three clicks
    = ready for use
    Primary keys
    Level of throughput

  15. Two decisions + three clicks
    = ready for use
    Primary keys
    Level of throughput

  16. Provisioned throughput.
    Reserve IOPS for reads and writes.
    Scale up or down at any time.

  17. Pay per capacity unit.
    Priced per hour of provisioned throughput.

  18. Write throughput.
    Size of item x writes per second
    $0.0065 per hour for 10 write units

  19. Consistent writes.
    Atomic increment and decrement.
    Optimistic concurrency control: conditional writes.

  20. Transactions.
    Item level transactions only.
    Puts, updates and deletes are ACID.

  21. Read throughput.
    Strong or eventual consistency

  22. Read throughput.
    Strong or eventual consistency
    Provisioned units = size of item x reads per second
    $0.0065 per hour for 50 units

  23. Read throughput.
    Strong or eventual consistency
    Provisioned units = (size of item x reads per second) / 2
    $0.0065 per hour for 100 units

  24. Read throughput.
    Strong or eventual consistency
    Same latency expectations.
    Mix and match at ‘read time’.
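
The arithmetic on the write and read throughput slides can be sketched as follows. This is a minimal illustration, not an official calculator: the function names are hypothetical, rounding the item size up to a whole KB is an assumption, and the prices are the 2013 figures quoted on the slides.

```python
import math

# Prices quoted on the slides (2013): dollars per hour per block of units.
WRITE_PRICE_PER_10_UNITS = 0.0065
READ_PRICE_PER_50_UNITS = 0.0065


def write_units(item_size_kb, writes_per_second):
    """Provisioned write units = size of item x writes per second."""
    # Assumption: item size is rounded up to a whole number of KB.
    return math.ceil(item_size_kb) * writes_per_second


def read_units(item_size_kb, reads_per_second, strongly_consistent=True):
    """Provisioned read units = size of item x reads per second.

    Eventually consistent reads are modeled as needing half as many units,
    matching the "/ 2" and the 50 vs 100 unit prices on the slides.
    """
    units = math.ceil(item_size_kb) * reads_per_second
    return units if strongly_consistent else math.ceil(units / 2)


def hourly_cost(read_capacity, write_capacity):
    """Hourly cost in dollars for the given provisioned capacity."""
    read_cost = (read_capacity / 50) * READ_PRICE_PER_50_UNITS
    write_cost = (write_capacity / 10) * WRITE_PRICE_PER_10_UNITS
    return read_cost + write_cost
```

For example, `read_units(1, 100, strongly_consistent=False)` asks for half the units of the strongly consistent case, which is why the same price buys twice the eventually consistent read rate.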

  25. Provisioned throughput is
    managed by DynamoDB.

  26. Data is partitioned and
    managed by DynamoDB.

  27. Indexed data storage.
    $0.25 per GB per month.
    Tiered bandwidth pricing:
    aws.amazon.com/dynamodb/pricing

  28. Reserved capacity.
    Up to 53% savings with a 1-year reservation.
    Up to 76% savings with a 3-year reservation.

  29. Authentication.
    Session based to minimize latency.
    Uses the Amazon Security Token Service.
    Handled by AWS SDKs.
    Integrates with IAM.

  30. Monitoring.
    CloudWatch metrics:
    latency, consumed read and write throughput,
    errors and throttling.

  31. Libraries, mappers and mocks.
    ColdFusion, Django, Erlang, Java, .NET,
    Node.js, Perl, PHP, Python, Ruby
    http://j.mp/dynamodb-libs

  32. Data modeling
    2

  33. id = 100   date = 2012-05-16-09-00-10   total = 25.00
    id = 101   date = 2012-05-15-15-00-11   total = 35.00
    id = 101   date = 2012-05-16-12-00-10   total = 100.00

  34. id = 100   date = 2012-05-16-09-00-10   total = 25.00
    id = 101   date = 2012-05-15-15-00-11   total = 35.00
    id = 101   date = 2012-05-16-12-00-10   total = 100.00
    Table

  35. id = 100   date = 2012-05-16-09-00-10   total = 25.00
    id = 101   date = 2012-05-15-15-00-11   total = 35.00
    id = 101   date = 2012-05-16-12-00-10   total = 100.00
    Item

  36. id = 100   date = 2012-05-16-09-00-10   total = 25.00
    id = 101   date = 2012-05-15-15-00-11   total = 35.00
    id = 101   date = 2012-05-16-12-00-10   total = 100.00
    Attribute

  37. Where is the schema?
    Tables do not require a formal schema.
    Items are an arbitrarily sized hash.

  38. Indexing.
    Items are indexed by primary and secondary keys.
    Primary keys can be composite.
    Secondary keys are local to the table.

  39. ID Date Total

  40. ID Date Total
    Hash key

  41. ID Date Total
    Hash key Range key
    Composite primary key

  42. ID Date Total
    Hash key Range key Secondary range key

  43. Programming DynamoDB.
    Small but perfectly formed API.

  44. CreateTable
    UpdateTable
    DeleteTable
    DescribeTable
    ListTables
    Query
    Scan
    PutItem
    GetItem
    UpdateItem
    DeleteItem
    BatchGetItem
    BatchWriteItem

  45. CreateTable
    UpdateTable
    DeleteTable
    DescribeTable
    ListTables
    Query
    Scan
    PutItem
    GetItem
    UpdateItem
    DeleteItem
    BatchGetItem
    BatchWriteItem

  46. CreateTable
    UpdateTable
    DeleteTable
    DescribeTable
    ListTables
    Query
    Scan
    PutItem
    GetItem
    UpdateItem
    DeleteItem
    BatchGetItem
    BatchWriteItem

  47. Conditional updates.
    PutItem, UpdateItem, DeleteItem can take
    optional conditions for operation.
    UpdateItem performs atomic increments.
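
The semantics of conditional writes and atomic increments can be illustrated with a toy in-memory table. This is only a sketch of the behavior described on the slide: `ToyTable`, `ConditionFailed` and their methods are hypothetical names, not part of any AWS SDK, and the real service evaluates the condition and the write as one operation on the server.

```python
class ConditionFailed(Exception):
    """Raised when a conditional write's expectation does not hold."""


class ToyTable:
    """In-memory stand-in for a table keyed by a single hash key."""

    def __init__(self):
        self.items = {}

    def put_item(self, key, item, expected=None):
        """Store item; if expected is given, the current item must match it.

        expected maps attribute names to required values; a value of None
        means the attribute must not exist yet.
        """
        current = self.items.get(key, {})
        for name, value in (expected or {}).items():
            if current.get(name) != value:
                raise ConditionFailed(name)
        self.items[key] = dict(item)

    def increment(self, key, attribute, amount=1):
        """Add amount to a numeric attribute, creating it at 0 if missing."""
        item = self.items.setdefault(key, {})
        item[attribute] = item.get(attribute, 0) + amount
        return item[attribute]
```

Optimistic concurrency control then amounts to reading an item's version, writing back with `expected={"version": old_version}`, and retrying when `ConditionFailed` is raised.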

  48. One API call, multiple items
    BatchGetItem returns multiple items by key.
    Throughput is measured by IO, not API calls.
    BatchWriteItem performs up to 25 put or delete operations.
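
Because a batch write carries at most 25 put or delete operations, a client with more work has to chunk it. A minimal sketch, where `batches` is a hypothetical helper and the requests can be any list:

```python
def batches(requests, batch_size=25):
    """Yield successive lists of at most batch_size requests."""
    for start in range(0, len(requests), batch_size):
        yield requests[start:start + batch_size]
```

Each yielded list would be sent as one batch call; the consumed throughput still depends on the items written, not on the number of calls.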

  49. CreateTable
    UpdateTable
    DeleteTable
    DescribeTable
    ListTables
    Query
    Scan
    PutItem
    GetItem
    UpdateItem
    DeleteItem
    BatchGetItem
    BatchWriteItem

  50. Query vs Scan
    Query returns items by key.
    Scan reads the whole table sequentially.

  51. Query patterns
    Retrieve all items by hash key.
    Range key conditions:
    ==, <, >, >=, <=, begins with, between.
    Counts. Top and bottom n values.
    Paged responses.
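
These query patterns follow from keeping range keys sorted within each hash key. The toy model below is an assumption-laden sketch of that idea, not DynamoDB's implementation; `ToyIndex` and its method names are hypothetical.

```python
from bisect import bisect_left, bisect_right, insort


class ToyIndex:
    """In-memory model of a composite (hash key, range key) primary key."""

    def __init__(self):
        self.partitions = {}  # hash key -> sorted list of range keys

    def put(self, hash_key, range_key):
        keys = self.partitions.setdefault(hash_key, [])
        if range_key not in keys:
            insort(keys, range_key)

    def query_between(self, hash_key, low, high):
        """Range keys with low <= key <= high, in ascending order."""
        keys = self.partitions.get(hash_key, [])
        return keys[bisect_left(keys, low):bisect_right(keys, high)]

    def query_begins_with(self, hash_key, prefix):
        """Range keys starting with prefix (string range keys only)."""
        keys = self.partitions.get(hash_key, [])
        start = bisect_left(keys, prefix)
        end = start
        while end < len(keys) and keys[end].startswith(prefix):
            end += 1
        return keys[start:end]

    def top(self, hash_key, n):
        """The n largest range keys, largest first."""
        keys = self.partitions.get(hash_key, [])
        return keys[::-1][:n]
```

With the order data from the earlier slides, `query_begins_with(101, "2012-05-16")` would return only the orders that customer 101 placed on 16 May.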

  52. Example 1: Mapping relationships.

  53. Players
    user_id = mza        location = Cambridge   joined = 2011-07-04
    user_id = jeffbarr   location = Seattle     joined = 2012-01-20
    user_id = werner     location = Worldwide   joined = 2011-05-15

  54. Players
    user_id = mza        location = Cambridge   joined = 2011-07-04
    user_id = jeffbarr   location = Seattle     joined = 2012-01-20
    user_id = werner     location = Worldwide   joined = 2011-05-15
    Scores
    user_id = mza      game = angry-birds   score = 11,000
    user_id = mza      game = tetris        score = 1,223,000
    user_id = werner   game = bejewelled    score = 55,000

  55. Players
    user_id = mza        location = Cambridge   joined = 2011-07-04
    user_id = jeffbarr   location = Seattle     joined = 2012-01-20
    user_id = werner     location = Worldwide   joined = 2011-05-15
    Scores
    user_id = mza      game = angry-birds   score = 11,000
    user_id = mza      game = tetris        score = 1,223,000
    user_id = werner   game = bejewelled    score = 55,000
    Leader boards
    game = angry-birds   score = 11,000      user_id = mza
    game = tetris        score = 1,223,000   user_id = mza
    game = tetris        score = 9,000,000   user_id = jeffbarr

  56. Players
    user_id = mza        location = Cambridge   joined = 2011-07-04
    user_id = jeffbarr   location = Seattle     joined = 2012-01-20
    user_id = werner     location = Worldwide   joined = 2011-05-15
    Scores
    user_id = mza      game = angry-birds   score = 11,000
    user_id = mza      game = tetris        score = 1,223,000
    user_id = werner   game = bejewelled    score = 55,000
    Leader boards
    game = angry-birds   score = 11,000      user_id = mza
    game = tetris        score = 1,223,000   user_id = mza
    game = tetris        score = 9,000,000   user_id = jeffbarr
    Query for scores by user

  57. Players
    user_id = mza        location = Cambridge   joined = 2011-07-04
    user_id = jeffbarr   location = Seattle     joined = 2012-01-20
    user_id = werner     location = Worldwide   joined = 2011-05-15
    Scores
    user_id = mza      game = angry-birds   score = 11,000
    user_id = mza      game = tetris        score = 1,223,000
    user_id = werner   game = bejewelled    score = 55,000
    Leader boards
    game = angry-birds   score = 11,000      user_id = mza
    game = tetris        score = 1,223,000   user_id = mza
    game = tetris        score = 9,000,000   user_id = jeffbarr
    High scores by game

  58. Example 2: Storing large items.

  59. Unlimited storage.
    Unlimited attributes per item.
    Unlimited items per table.
    Maximum of 64 KB per item.

  60. message_id = 1   part = 1   message =
    message_id = 1   part = 2   message =
    message_id = 1   part = 3   message =
    Split across items.
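
Splitting a large message across items keyed by (message_id, part) might look like the sketch below. The helper names and the 60 KB chunk size are assumptions chosen for illustration; the only figure taken from the deck is the 64 KB item limit.

```python
MAX_ITEM_BYTES = 64 * 1024  # per-item limit quoted on the slide


def split_message(message_id, body, chunk_bytes=60 * 1024):
    """Split body (bytes) into items keyed by (message_id, part)."""
    # Assumption: chunk_bytes leaves headroom under the limit for the key
    # attributes stored alongside each chunk.
    if chunk_bytes >= MAX_ITEM_BYTES:
        raise ValueError("chunk_bytes must leave room under the item limit")
    return [
        {"message_id": message_id, "part": index + 1,
         "message": body[offset:offset + chunk_bytes]}
        for index, offset in enumerate(range(0, len(body), chunk_bytes))
    ]


def join_message(items):
    """Reassemble the body from its parts, ordered by part number."""
    ordered = sorted(items, key=lambda item: item["part"])
    return b"".join(item["message"] for item in ordered)
```

With message_id as the hash key and part as the range key, one query by hash key returns every part of a message in order.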

  61. message_id = 1   message = http://s3.amazonaws.com...
    message_id = 2   message = http://s3.amazonaws.com...
    message_id = 3   message = http://s3.amazonaws.com...
    Store a pointer to S3.

  62. Example 3: Time series data

  63. Hot and cold tables.
    April
    event_id = 1000   timestamp = 2013-04-16-09-59-01   key = value
    event_id = 1001   timestamp = 2013-04-16-09-59-02   key = value
    event_id = 1002   timestamp = 2013-04-16-09-59-02   key = value
    March
    event_id = 1000   timestamp = 2013-03-01-09-59-01   key = value
    event_id = 1001   timestamp = 2013-03-01-09-59-02   key = value
    event_id =   timestamp =   key =

  64. April
    March
    February
    January
    December

  65. Archive data.
    Move old data to S3: lower cost.
    Still available for analytics.
    Run queries across hot and cold data
    with Elastic MapReduce.
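
Routing time series writes to one table per month can be sketched as below. The `events_YYYY_MM` naming scheme and the helper names are assumptions for illustration; the deck only shows tables labeled by month.

```python
from datetime import date


def table_for(day, prefix="events"):
    """Name of the monthly table that holds events for the given date."""
    return f"{prefix}_{day.year:04d}_{day.month:02d}"


def previous_month(day):
    """First day of the month before the one containing day."""
    if day.month == 1:
        return date(day.year - 1, 12, 1)
    return date(day.year, day.month - 1, 1)


def recent_tables(today, months, prefix="events"):
    """Names of the tables for the last `months` months, newest first."""
    names, cursor = [], date(today.year, today.month, 1)
    for _ in range(months):
        names.append(table_for(cursor, prefix))
        cursor = previous_month(cursor)
    return names
```

The newest table is the hot one and keeps high provisioned throughput; tables that fall out of `recent_tables` are candidates for lower throughput or for archiving to S3.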

  66. Partitioning
    3

  67. Uniform workload.
    Data stored across multiple partitions.
    Data is primarily distributed by primary key.
    Provisioned throughput is divided evenly across partitions.

  68. To achieve and maintain full
    provisioned throughput, spread
    workload evenly across hash keys.

  69. Non-Uniform workload.
    Might be throttled, even at high levels of throughput.
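
Why a skewed workload can be throttled even with plenty of total throughput can be shown with a small model. Everything here is illustrative: the MD5-based `partition_of` is a stand-in, not DynamoDB's actual hashing, and the partition count is chosen by the caller rather than by the service.

```python
import hashlib


def partition_of(hash_key, partitions):
    """Map a hash key to a partition number (illustrative hashing only)."""
    digest = hashlib.md5(str(hash_key).encode("utf-8")).hexdigest()
    return int(digest, 16) % partitions


def throttled_partitions(requests_per_key, provisioned_units, partitions):
    """Partitions whose request rate exceeds their even share of throughput."""
    share = provisioned_units / partitions
    load = [0] * partitions
    for key, rate in requests_per_key.items():
        load[partition_of(key, partitions)] += rate
    return [index for index, rate in enumerate(load) if rate > share]
```

With 1,000 provisioned units over 4 partitions, a single key receiving 900 requests per second overloads its partition's share of 250, although the table as a whole is under its limit.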

  70. Best practice 1: Distinct values for hash keys.
    Hash key elements should have a
    high number of distinct values.

  71. user_id = mza        first_name = Matt     last_name = Wood
    user_id = jeffbarr   first_name = Jeff     last_name = Barr
    user_id = werner     first_name = Werner   last_name = Vogels
    user_id = simone     first_name = Simone   last_name = Brunozzi
    ... ... ...
    Lots of users with unique user_id.
    Workload well distributed across hash key.

  72. Best practice 2: Avoid limited hash key values.
    Hash key elements should have a
    high number of distinct values.

  73. status = 200   date = 2012-04-01-00-00-01
    status = 404   date = 2012-04-01-00-00-01
    status = 404   date = 2012-04-01-00-00-01
    status = 404   date = 2012-04-01-00-00-01
    Small number of status codes.
    Uneven, non-uniform workload.

  74. Best practice 3: Model for even distribution.
    Access by hash key value should be evenly
    distributed across the dataset.

  75. mobile_id = 100   access_date = 2012-04-01-00-00-01
    mobile_id = 100   access_date = 2012-04-01-00-00-02
    mobile_id = 100   access_date = 2012-04-01-00-00-03
    mobile_id = 100   access_date = 2012-04-01-00-00-04
    ... ...
    Large number of devices.
    A small number are much more popular than others.
    Workload unevenly distributed.

  76. mobile_id = 100.1   access_date = 2012-04-01-00-00-01
    mobile_id = 100.2   access_date = 2012-04-01-00-00-02
    mobile_id = 100.3   access_date = 2012-04-01-00-00-03
    mobile_id = 100.4   access_date = 2012-04-01-00-00-04
    ... ...
    Sample access pattern.
    Workload randomized by hash key.
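
Randomizing a hot hash key with a suffix, as in the 100.1 to 100.4 rows above, could be sketched like this. The shard count of 4 and the helper names are assumptions; readers pay for the spread by querying every suffix.

```python
import random

SHARDS = 4  # assumption: number of suffixes per hot key


def sharded_key(mobile_id, rng=random):
    """Pick one of SHARDS suffixed hash keys for a write, e.g. '100.3'."""
    return f"{mobile_id}.{rng.randint(1, SHARDS)}"


def all_shards(mobile_id):
    """Every hash key a reader must query to see all data for mobile_id."""
    return [f"{mobile_id}.{n}" for n in range(1, SHARDS + 1)]
```

Writes call `sharded_key(100)` to land on a random shard, while a read for device 100 issues one query per key in `all_shards(100)` and merges the results.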

  77. Replication & Analytics
    4

  78. Seamless scale.
    Scalable methods for data processing.
    Scalable methods for backup/restore.

  79. Amazon Elastic MapReduce.
    Managed Hadoop service for
    data-intensive workflows.
    aws.amazon.com/emr

  80. create external table items_db
    (id string, votes bigint, views bigint) stored by
    'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
    tblproperties
    ("dynamodb.table.name" = "items",
    "dynamodb.column.mapping" =
    "id:id,votes:votes,views:views");

  81. select id, votes, views
    from items_db
    order by views desc;

  82. Customer story: Localytics
    5

  83. Mohit Dilawari
    Director of Engineering
    @mdilawari
    DynamoDB @ Localytics

  84. About Localytics
    • Mobile App Analytics Service
    • 750+ Million Devices and over 20,000 Apps
    • Customers Include:
    …and many more.

  85. About the Development Team
    • Small team of four managing the entire AWS infrastructure: 100 EC2 instances
    • Experts in big data
    • Leveraging Amazon's services has been the key to our success
    • Large-scale users of:
      • SQS
      • S3
      • ELB
      • RDS
      • Route 53
      • ElastiCache
      • EMR
    …and of course DynamoDB

  86. Why DynamoDB?
    Set it and Forget it

  87. Our use-case: Dedup Data
    • Each datapoint includes a globally unique ID
    • Mobile traffic over 2G/3G will periodically upload duplicate data
    • We accept data within a 28-day window

  88. First Design for Dedup table
    Unique ID: aaaaaaaaaaaaaaaaaaaaaaaaa333333333333333
    Table Name = dedup_table
    ID
    aaaaaaaaaaaaaaaaaaaaaaaaa111111111111111
    aaaaaaaaaaaaaaaaaaaaaaaaa222222222222222
    "Test and Set" in a single operation
    aaaaaaaaaaaaaaaaaaaaaaaaa333333333333333

  89. Optimization One - Data Aging
    • Partition by month
    • Create the new table the day before the month starts
    • Need to keep two months of data

  90. Optimization One - Data Aging
    Unique ID: bbbbbbbbbbbbbbbbbbbbbbbbb333333333333333
    Check Previous month:
    Table Name = March2013_dedup
    ID
    aaaaaaaaaaaaaaaaaaaaaaaaa111111111111111
    aaaaaaaaaaaaaaaaaaaaaaaaa222222222222222
    Not Here!

  91. Optimization One - Data Aging
    Unique ID: bbbbbbbbbbbbbbbbbbbbbbbbb333333333333333
    Test and Set in current month:
    Inserted
    Table Name = April2013_dedup
    ID
    bbbbbbbbbbbbbbbbbbbbbbbbb111111111111111
    bbbbbbbbbbbbbbbbbbbbbbbbb222222222222222
    bbbbbbbbbbbbbbbbbbbbbbbbb333333333333333
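
The two-step check across monthly tables can be sketched with plain sets standing in for the previous and current month's dedup tables. `is_new` is a hypothetical helper; in the real system the second step is a single conditional write rather than a separate check and add.

```python
def is_new(unique_id, previous_month, current_month):
    """Return True and record the ID if it has not been seen in either month.

    previous_month and current_month are sets standing in for the two
    monthly dedup tables.
    """
    if unique_id in previous_month:   # check previous month (read only)
        return False
    if unique_id in current_month:    # test ...
        return False
    current_month.add(unique_id)      # ... and set in the current month
    return True
```

This is where the read costs in the later recap come from: every incoming ID needs one read against the previous month before the write to the current one.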

  92. Optimization Two
    • Reduce the index size, which reduces costs
    • Each item has a 100-byte overhead, which is substantial
    • Combine multiple IDs into one record
    • Split each ID into two halves
      • First half is the key. Second half is added to the set

  93. Optimization Two - Use Sets
    Unique ID: ccccccccccccccccccccccccccc999999999999999
    Prefix Values
    aaaaaaaaaaaaaaaaaaaaaaaaa [111111111111111, 222222222222222, 333333333333333]
    bbbbbbbbbbbbbbbbbbbbbbbbb [444444444444444, 555555555555555, 666666666666666]
    ccccccccccccccccccccccccccc [777777777777777, 888888888888888, ]
    ccccccccccccccccccccccccccc 999999999999999
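
Storing ID suffixes in a set under a shared prefix might be modeled as below, with a dict of sets standing in for the table. The helper names are hypothetical, and the split point is a parameter because the example IDs on the slide are not split exactly in half.

```python
def split_id(unique_id, prefix_length=None):
    """Split an ID into (prefix, suffix); defaults to two equal halves."""
    if prefix_length is None:
        prefix_length = len(unique_id) // 2
    return unique_id[:prefix_length], unique_id[prefix_length:]


def check_and_add(table, unique_id, prefix_length=None):
    """Add the ID's suffix to its prefix's set; True if it was not present."""
    prefix, suffix = split_id(unique_id, prefix_length)
    suffixes = table.setdefault(prefix, set())
    if suffix in suffixes:
        return False
    suffixes.add(suffix)
    return True
```

Many IDs now share one item, so the per-item overhead is paid once per prefix instead of once per ID.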

  94. Optimization Three - Combine Months
    • Go back to a single table
    Prefix           March2013                          April2013
    aaaaaaaaaa...    [111111111111111, 22222222222...   [1212121212121212, 3434343434....
    bbbbbbbbbb...    [444444444444444, 555555555....    [4545454545454545, 6767676767.....
    ccccccccccc...   [777777777777777, 888888888...     [8989898989898989, 1313131313....
    One Operation:
    1. Delete February2013 Field
    2. Check ID in March2013
    3. Test and Set into April2013
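
The three steps applied to one prefix's item can be sketched as follows, with a dict mapping month names to sets standing in for the item. `dedup` is a hypothetical helper; the slide's point is that the real table performs all three steps in one write operation, which this single-threaded model does not capture.

```python
def dedup(item, suffix, expired, previous, current):
    """Apply the three steps to one prefix's item (a dict of month -> set).

    Returns True if suffix is new, i.e. absent from both live months.
    """
    item.pop(expired, None)                   # 1. delete the aged-out month
    if suffix in item.get(previous, set()):   # 2. check the previous month
        return False
    seen = item.setdefault(current, set())    # 3. test and set in current
    if suffix in seen:
        return False
    seen.add(suffix)
    return True
```

Because the previous month lives in the same item, no separate read is needed, which matches the zero read cost for this plan in the recap.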

  95. Recap
    Compare Plans for 20 Billion IDs per month
    Plan                   Storage Costs   Read Costs   Write Costs   Total    Savings
    Naive (after a year)   $8400           0            $4000         $12400
    Data Age               $900            $350         $4000         $5250    57%
    Using Sets             $150            $350         $4000         $4500    64%
    Multiple Months        $150            0            $4000         $4150    67%
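
The totals and savings in the recap can be re-derived from the per-category costs. This is a small sketch using only the figures on the slide; the `PLANS`, `totals` and `savings` names are hypothetical, and the computed savings agree with the slide's percentages to within about one point.

```python
# (storage, read, write) costs in dollars, as quoted on the slide.
PLANS = {
    "Naive (after a year)": (8400, 0, 4000),
    "Data Age": (900, 350, 4000),
    "Using Sets": (150, 350, 4000),
    "Multiple Months": (150, 0, 4000),
}


def totals(plans):
    """Total cost per plan: storage + read + write."""
    return {name: sum(costs) for name, costs in plans.items()}


def savings(plans, baseline="Naive (after a year)"):
    """Fractional saving of each plan's total relative to the baseline."""
    total = totals(plans)
    return {name: 1 - cost / total[baseline] for name, cost in total.items()}
```

Write costs are the same in every plan, so all of the savings come from storage and reads.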

  96. Thank You
    @mdilawari

  97. Summary
    1. Getting started
    2. Data modeling
    3. Partitioning
    4. Replication & Analytics
    5. Customer story: Localytics

  98. Free tier.

  99. aws.amazon.com/dynamodb

  100. Thank you!
    [email protected]
    @mza
