Redis Cluster for Write Intensive Workloads

NDC London 2021 (Remote)

When you are working with Redis for your write-intensive workloads, Redis Cluster is your friend. It gives you a built-in way to partition your data across instances so you can scale your writes without being bound to how much load a single instance can handle. However, data partitioning is always a challenge, and Redis Cluster's approach is no exception. At Deliveroo, we are using Redis Cluster in anger for handling write-intensive workloads (e.g. one use case has 10K writes per second while simultaneously handling 300K reads per second). During the time we have been using Redis Cluster, we have learned how the basics of Redis sharding work, how upfront design choices can have a tremendous impact on your performance, and how resharding is handled both on the Redis Cluster side and by Redis clients. In this session, I would like to share those invaluable learnings based on our battle-tested, real-world experiences. At the end of the session, you should have a far better idea of how you can scale Redis for your write-intensive workloads, and what type of surprises might be waiting for you.

https://ndc-london.com/agenda/redis-cluster-for-write-intensive-workloads-0xcp/0e9ytrmxsf1

Tugberk Ugurlu

January 29, 2021

Transcript

  1. Redis Cluster for Write
    Intensive Workloads

  2. Hello,
    I’m Tugberk!

  4. Sign up and get £10!
    https://roo.it/tugberku-dgfd

  5. careers.deliveroo.co.uk
    We're Growing! Unique challenges, amazing people and great food!

  7. Deliveroo Home Feed
    ● Dense areas with a lot of
    restaurants
    ● Making it hard for users to choose
    from the large selection
    ● Each user's needs are different

  8. Jonny Brooks-Bartlett - How Deliveroo improved the ranking of
    restaurants - PyData London 2019
    youtube.com/watch?v=bG95RmVOn0E
    ● Already algorithmically ranking the
    Restaurant List through a
    rudimentary Linear Regression
    model
    ● Desire to personalize this ranking
    for each user's needs
    ● Predicting which restaurant a user
    is more likely to order from

  9. ● Access to the aggregated user-specific data from the ranking
    service on production
    ● Costly to aggregate on production
    ● Needs to be in sync with the training pipeline and model serving
    ● Need a way to retrieve this data in optimum time for millions of
    users, while sustaining >1K rps, and keeping this data up to date
    within a reasonable data consistency lag

  11. ● Canoe - Aggregating User Features: the Canoe pipeline kicks in,
    aggregates the data for each user and serializes it in a protobuf
    format, in bundles of 20 users.
    ● S3 - Storing Protobuf{ed} Features: from the Canoe pipeline, we pick
    up the files that hold the protobuf data for 20 users and upload them
    to S3.
    ● SQS - Queuing the Work for Each S3 File: the mapping between S3 and
    SQS lets us queue a message into SQS whenever a file is uploaded to
    the S3 bucket.
    ● Lambda - Indexing Each User's Features into Redis Cluster: a Lambda
    is kicked off by the event source mapping between SQS and the Lambda,
    and handles indexing each user's features into the Redis Cluster.
    ● Redis Cluster - Storing the Data for O(1) Access per User: the Redis
    Cluster is available to serve reads and writes with 3 primary shards,
    each having 1 replica.
    ● Access - Reading the Data from the Redis Cluster: on production, we
    can access the user-specific features by issuing an O(1) query to the
    Redis Cluster.

  12. ● Data aggregation pipeline bundles 50 records per proto file, and
    uploads them to a known S3 bucket
    ● S3 object creation notification is enqueued to SQS
    ● Lambda instances dequeue from SQS, and write to Redis Cluster
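
    A minimal sketch of what such an indexing Lambda could look like in Go,
    assuming the aws-lambda-go and go-redis libraries; the cluster endpoint,
    key format, and the downloadAndDecode helper are hypothetical
    placeholders for the S3 download and protobuf decoding steps:

    package main

    import (
        "context"

        "github.com/aws/aws-lambda-go/events"
        "github.com/aws/aws-lambda-go/lambda"
        "github.com/go-redis/redis/v8"
    )

    // Cluster client shared across invocations; the endpoint is a placeholder.
    var rdb = redis.NewClusterClient(&redis.ClusterOptions{
        Addrs: []string{"my-redis-cluster.example.com:6379"},
    })

    // handler consumes the SQS messages produced by S3 object-creation
    // notifications and writes one Redis key per user record in each file.
    func handler(ctx context.Context, event events.SQSEvent) error {
        for _, msg := range event.Records {
            // msg.Body carries the S3 notification JSON; fetching the object
            // and decoding the protobuf bundle is elided behind this helper.
            users, err := downloadAndDecode(ctx, msg.Body)
            if err != nil {
                return err
            }
            for userID, payload := range users {
                if err := rdb.Set(ctx, "user-features:"+userID, payload, 0).Err(); err != nil {
                    return err
                }
            }
        }
        return nil
    }

    // downloadAndDecode is a hypothetical helper standing in for the S3
    // download and protobuf deserialization of a bundle file.
    func downloadAndDecode(ctx context.Context, body string) (map[string][]byte, error) {
        return map[string][]byte{}, nil
    }

    func main() { lambda.Start(handler) }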

  14. ● Allows you to scale the writes as well as the reads, which is
    especially good for unpredictable write workloads
    ● Allows you to increase capacity with zero downtime by adding new
    shard(s) and performing online resharding
    ● Reduces your blast radius, i.e. when a shard goes down, it only
    affects that portion of your data surface until a failover happens

  15. ● Redis installation where data is
    sharded across multiple Redis
    nodes
    ● These nodes still have the same
    capabilities as a normal Redis
    node, and they can have their own
    replica sets
    ● Redis assigns "slot" ranges (a.k.a.
    hash slots) for each master node
    within the cluster

  16. tugberkugurlu/redis-cluster usage
    https://github.com/tugberkugurlu/redis-cluster

  17. ● Redis comes with some out of the
    box commands to help you
    manage your cluster setup

  18. ● For a given Redis key, the hash slot for that key is the result of
    CRC16(key) modulo 16384, where CRC16 is Redis's implementation of the
    CRC16 checksum (a sketch of this calculation follows below)
    ● Redis clients can query which node is assigned to which slot range
    by using the CLUSTER SLOTS command
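
    To make the slot calculation concrete, here is a small, hedged sketch in
    Go of the CRC16 variant the Redis Cluster spec describes (CRC16-CCITT /
    XMODEM, polynomial 0x1021) together with the modulo-16384 mapping; the
    example key is illustrative:

    package main

    import "fmt"

    // crc16 is the bit-by-bit CRC16-CCITT (XMODEM) checksum: polynomial
    // 0x1021, initial value 0x0000, no reflection, no final XOR.
    func crc16(data []byte) uint16 {
        var crc uint16
        for _, b := range data {
            crc ^= uint16(b) << 8
            for i := 0; i < 8; i++ {
                if crc&0x8000 != 0 {
                    crc = (crc << 1) ^ 0x1021
                } else {
                    crc <<= 1
                }
            }
        }
        return crc
    }

    // hashSlot maps a key to one of the 16384 cluster hash slots.
    func hashSlot(key string) uint16 {
        return crc16([]byte(key)) % 16384
    }

    func main() {
        // The slot number determines which master node owns this key.
        fmt.Println(hashSlot("user-features:42"))
    }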

  22. https://www.tugberkugurlu.com/archive/redis-cluster-benefits-of-sharding-and-how-it-works

  24. ● Gives managed support for Redis Cluster mode (e.g. you don't need
    to worry about the operational handling of resharding, failover, etc.)
    ● Integrates well with our existing infrastructure stack at Deliveroo
    (e.g. AWS, Terraform, etc.)

  25. https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Replication.Redis-RedisCluster.html

  28. ● The READONLY command enables read queries for a connection to a
    Redis Cluster replica node.
    ● The RouteRandomly config option allows routing read-only commands to
    a random master or replica node.
    ● Together, these configurations let us distribute the read load
    across the master and all replicas at the cost of a potentially
    increased data consistency gap (see the sketch below).
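
    A hedged sketch of wiring these options up with the go-redis cluster
    client; the endpoint address is a placeholder:

    import (
        "github.com/go-redis/redis/v8"
    )

    func newClusterClient() *redis.ClusterClient {
        return redis.NewClusterClient(&redis.ClusterOptions{
            // Cluster configuration endpoint; placeholder address.
            Addrs: []string{"my-redis-cluster.example.com:6379"},

            // ReadOnly makes the client send READONLY on replica
            // connections, enabling reads from replicas.
            ReadOnly: true,

            // RouteRandomly routes read-only commands to a random master or
            // replica, spreading read load at the cost of replication lag.
            RouteRandomly: true,
        })
    }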

  29. ● Having tight timeouts allows us to reduce the impact of potential
    Redis issues on the rest of the application
    ● If we know what to expect from the Redis cluster in terms of
    response time, we can tune the timeouts to fail early, allowing the
    rest of the application to keep executing in case of potential issues.
    ● Timeout tuning is a half scientific and half finger in the air
    process...
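
    The go-redis cluster options expose the relevant knobs; the values below
    are illustrative, not the ones used in the talk:

    import (
        "time"

        "github.com/go-redis/redis/v8"
    )

    func newTightlyTunedClient() *redis.ClusterClient {
        return redis.NewClusterClient(&redis.ClusterOptions{
            Addrs: []string{"my-redis-cluster.example.com:6379"}, // placeholder

            // Fail fast if establishing a connection takes too long.
            DialTimeout: 200 * time.Millisecond,

            // Per-command socket timeouts; tune them to the latency you
            // expect from the cluster rather than the library defaults.
            ReadTimeout:  50 * time.Millisecond,
            WriteTimeout: 50 * time.Millisecond,

            // Cap retries so a struggling node doesn't stall the caller.
            MaxRetries: 1,
        })
    }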

  32. ● Simple Redis Set command
    ● The client knows which node to
    send this write request to thanks
    to its Redis Cluster knowledge
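
    A minimal sketch of that write in Go with go-redis; the key format and
    the zero (no-expiry) TTL are assumptions for illustration:

    import (
        "context"

        "github.com/go-redis/redis/v8"
    )

    // storeUserFeatures writes one user's serialized feature payload with a
    // plain SET. The cluster-aware client hashes the key to a slot and sends
    // the command straight to the master that owns that slot.
    func storeUserFeatures(ctx context.Context, rdb *redis.ClusterClient, userID string, payload []byte) error {
        return rdb.Set(ctx, "user-features:"+userID, payload, 0).Err()
    }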

  33. ● Simple Redis Get command
    ● The contract between the write and the read side is the user ID
    ● Checking whether the Redis error is of type "redis.Nil", which
    indicates the absence of the key.
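
    A hedged sketch of the corresponding read; the names mirror the
    write-side sketch above and are illustrative:

    import (
        "context"

        "github.com/go-redis/redis/v8"
    )

    // getUserFeatures fetches a user's feature payload. The boolean reports
    // whether the key existed.
    func getUserFeatures(ctx context.Context, rdb *redis.ClusterClient, userID string) ([]byte, bool, error) {
        payload, err := rdb.Get(ctx, "user-features:"+userID).Bytes()
        if err == redis.Nil {
            // redis.Nil signals that the key is absent, not that the call failed.
            return nil, false, nil
        }
        if err != nil {
            return nil, false, err
        }
        return payload, true, nil
    }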

  40. Restaurant features

  41. ● Multi-command operations such as MGET can only succeed if all of
    the keys belong to the same slot
    https://www.tugberkugurlu.com/archive/redis-cluster-benefits-of-sharding-and-how-it-works#hash-tags

  42. ● Hash tags allow us to force certain keys to be stored in the same
    hash slot.
    ● When the Redis key contains a "{...}" pattern, only the substring
    between { and } is hashed in order to obtain the hash slot (see the
    sketch below).
    https://www.tugberkugurlu.com/archive/redis-cluster-benefits-of-sharding-and-how-it-works#hash-tags
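
    A small, hedged sketch of what that buys us: because the keys below share
    the "{city:512}" hash tag, they hash to the same slot, so a multi-key
    command such as MGET can address them together. The city ID and key
    layout are illustrative:

    import (
        "context"

        "github.com/go-redis/redis/v8"
    )

    // fetchTwoUsersInCity shows a multi-key read that only works because both
    // keys carry the same "{city:...}" hash tag and therefore live in the
    // same hash slot.
    func fetchTwoUsersInCity(ctx context.Context, rdb *redis.ClusterClient) ([]interface{}, error) {
        return rdb.MGet(ctx,
            "{city:512}:user-features:1",
            "{city:512}:user-features:2",
        ).Result()
    }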

  43. ● None of our access patterns required us to go across a city
    boundary
    ● Therefore, we used the City ID as the hash tag value

  44. ● Same as on the write side, we use the City ID as the hash tag to
    influence shard selection and route us to the same node
    ● Bundling all Redis GET commands within a single TCP connection
    (pipelining) improves performance by saving round trips
    ● Pipelined requests run in order, but unlike MGET they do not block
    other connections (see the sketch below)
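
    A hedged sketch of that pipelined read path with go-redis; the key layout
    mirrors the earlier sketches and the shared hash tag keeps every GET on
    the same shard:

    import (
        "context"

        "github.com/go-redis/redis/v8"
    )

    // getCityUserFeatures fetches many users' features from one city in a
    // single pipelined round trip.
    func getCityUserFeatures(ctx context.Context, rdb *redis.ClusterClient, cityID string, userIDs []string) (map[string][]byte, error) {
        pipe := rdb.Pipeline()
        cmds := make(map[string]*redis.StringCmd, len(userIDs))
        for _, id := range userIDs {
            // The shared "{city:...}" hash tag routes every GET to the same slot.
            cmds[id] = pipe.Get(ctx, "{city:"+cityID+"}:user-features:"+id)
        }
        // Exec sends all queued commands at once; a redis.Nil from Exec only
        // means some keys were missing, which we handle per command below.
        if _, err := pipe.Exec(ctx); err != nil && err != redis.Nil {
            return nil, err
        }
        out := make(map[string][]byte, len(userIDs))
        for id, cmd := range cmds {
            payload, err := cmd.Bytes()
            if err == redis.Nil {
                continue // this user has no features stored
            }
            if err != nil {
                return nil, err
            }
            out[id] = payload
        }
        return out, nil
    }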

  45. ● ~850-1K queries per second
    ● ~9.72ms max p95 latency for the entire pipeline query

  48. ● Increasing the number of node groups for your ElastiCache cluster
    will kick off an online resharding operation
    ● The new node group(s) will inherit the same number of replicas as
    the other node groups

  50. ● You can increase/decrease the replica count independently of the
    shard count
    ● Note that there was a bug in the Terraform AWS provider regarding
    this, but it has been fixed; see
    github.com/hashicorp/terraform-provider-aws/issues/6184
    https://docs.aws.amazon.com/AmazonElastiCache/latest/APIReference/API_IncreaseReplicaCount.html

  53. https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/AutoFailover.html#auto-failover-test

  56. Software Engineer - Mid, Senior, Staff-level
    Engineering Manager
    Senior Software Engineer, Infrastructure
    Machine Learning Engineer - Mid, Senior, Staff-level
    Data Engineer
    Data Scientist - Mid, Senior, Staff-level
    Data Science Manager
    Locations: London, Remote UK, Remote Poland
    See the complete list at https://careers.deliveroo.co.uk/ !
