Slide 1

Slide 1 text

Redis Cluster for Write Intensive Workloads

Slide 2

Slide 2 text

Hello, I’m Tugberk!

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Sign up and get £10! https://roo.it/tugberku-dgfd

Slide 5

Slide 5 text

careers.deliveroo.co.uk We’re Growing! Unique challenges, amazing people and great food!

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

Deliveroo Home Feed
● Dense areas with a lot of restaurants
● Making it hard for users to choose from the large selection
● Each user's needs are different

Slide 8

Slide 8 text

Jonny Brooks-Bartlett - How Deliveroo improved the ranking of restaurants - PyData London 2019
youtube.com/watch?v=bG95RmVOn0E
● Already algorithmically ranking the Restaurant List through a rudimentary Linear Regression model
● Desire to personalize this ranking for each user's needs
● Predicting which restaurant a user is more likely to order from

Slide 9

Slide 9 text

● Access to the aggregated user-specific data from the ranking service on production
● Costly to aggregate on production
● Needs to be in sync with the training pipeline and model serving
● Need a way to retrieve this data in optimal time for millions of users, while sustaining >1K rps, and keep it up to date within a reasonable data consistency lag

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

Canoe - Aggregating User Features: the Canoe pipeline kicks in, aggregates the data for each user and serializes it in a protobuf format, in bundles of 20 users.
S3 - Storing Protobuf{ed} Features: from the Canoe pipeline, we pick up the files containing the protobuf data for 20 users and upload them to S3.
SQS - Queuing the Work for Each S3 File: the mapping between S3 and SQS allows us to queue a message into SQS whenever a file is uploaded to the S3 bucket.
Lambda - Indexing Each User's Features into the Redis Cluster: a Lambda is kicked off by the event source mapping between SQS and the Lambda, which handles the write.
Redis Cluster - Storing the Data for O(1) Access per User: the Redis Cluster is available to serve reads and writes with 3 primary shards, each having 1 replica.
Access - Reading the Data from the Redis Cluster: on production, we can access a user's features by issuing an O(1) query to the Redis Cluster.

Slide 12

Slide 12 text

● Data aggregation pipeline bundles 50 records per proto file, and uploads them to a known S3 bucket
● S3 object creation notification is enqueued to SQS
● Lambda instances dequeue from SQS, and write to the Redis Cluster
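
A minimal sketch of what such a Lambda handler could look like in Go, assuming the go-redis cluster client and the aws-lambda-go SQS/S3 event types; the S3 download and protobuf decoding are only stubbed with comments, and all names and endpoints here are illustrative, not the actual Deliveroo implementation.

package main

import (
	"context"
	"encoding/json"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
	"github.com/go-redis/redis/v8"
)

// Illustrative cluster client; the endpoint is a placeholder.
var rdb = redis.NewClusterClient(&redis.ClusterOptions{
	Addrs: []string{"redis-cluster-endpoint:6379"},
})

// handler dequeues S3 object-created notifications delivered via SQS and,
// in the real pipeline, would download each proto bundle and write every
// user's features into the Redis Cluster.
func handler(ctx context.Context, sqsEvent events.SQSEvent) error {
	for _, msg := range sqsEvent.Records {
		var s3Event events.S3Event
		if err := json.Unmarshal([]byte(msg.Body), &s3Event); err != nil {
			return err
		}
		for _, rec := range s3Event.Records {
			bucket, key := rec.S3.Bucket.Name, rec.S3.Object.Key
			// Download bucket/key from S3, decode the protobuf bundle,
			// then SET each user's serialized features, e.g.:
			//   rdb.Set(ctx, "user:"+userID, featureBytes, 0)
			_ = bucket
			_ = key
		}
	}
	return nil
}

func main() {
	lambda.Start(handler)
}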

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

● Allows you to scale the writes as well as the reads, which is good especially for unpredictable write workloads
● Allows you to increase the capacity with zero downtime by adding new shard(s) and performing online resharding
● Reduces your blast radius, i.e. when a shard goes down, it only affects that portion of your data surface until a failover happens

Slide 15

Slide 15 text

● Redis installation where data is sharded across multiple Redis nodes
● These nodes still have the same capabilities as a normal Redis node, and they can have their own replica sets
● Redis assigns "slot" ranges (a.k.a. hash slots) to each master node within the cluster

Slide 16

Slide 16 text

tugberkugurlu/redis-cluster usage https://github.com/tugberkugurlu/redis-cluster

Slide 17

Slide 17 text

● Redis comes with some out of the box commands to help you manage your cluster setup

Slide 18

Slide 18 text

● For a given Redis key, the hash slot for that key is the result of CRC16(key) modulo 16384, where CRC16 is an implementation of the CRC16 hash function
● Redis clients can query which node is assigned to which slot range by using the CLUSTER SLOTS command
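
As a small illustration (not from the talk), the go-redis cluster client can ask the cluster which slot a key hashes to and how slot ranges are distributed; CLUSTER KEYSLOT applies the same CRC16(key) mod 16384 computation server-side. The endpoint and key are placeholders.

package main

import (
	"context"
	"fmt"

	"github.com/go-redis/redis/v8"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs: []string{"redis-cluster-endpoint:6379"}, // placeholder
	})

	// CLUSTER KEYSLOT: the server computes CRC16(key) mod 16384.
	slot, err := rdb.ClusterKeySlot(ctx, "user:1001").Result()
	if err != nil {
		panic(err)
	}
	fmt.Println("user:1001 hashes to slot", slot)

	// CLUSTER SLOTS: which node owns which slot range.
	slots, err := rdb.ClusterSlots(ctx).Result()
	if err != nil {
		panic(err)
	}
	for _, s := range slots {
		fmt.Printf("slots %d-%d -> %s\n", s.Start, s.End, s.Nodes[0].Addr)
	}
}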

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

https://www.tugberkugurlu.com/archive/redis-cluster-benefits-of-sharding-and-how-it-works

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

● Gives us managed support for Redis Cluster mode (e.g. you don't need to worry about the operational handling of resharding, failover, etc.)
● Integrates well with our existing infrastructure stack at Deliveroo (e.g. AWS, Terraform, etc.)

Slide 25

Slide 25 text

https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Replication.Redis-RedisCluster.html

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

● The READONLY command enables read queries for a connection to a Redis Cluster replica node.
● The RouteRandomly config option allows routing read-only commands to a random master or replica node.
● These configurations allow us to distribute the read load across the master and all replicas in a random way, at the cost of a potentially increased data consistency gap.
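
In the go-redis cluster client these two behaviours map onto the ReadOnly and RouteRandomly options; a minimal sketch, where the package name and endpoint are placeholders:

package feature

import "github.com/go-redis/redis/v8"

// newClusterClient is a sketch: ReadOnly issues READONLY on replica
// connections so reads can be served by replicas, and RouteRandomly
// routes read-only commands to a random master or replica node.
func newClusterClient() *redis.ClusterClient {
	return redis.NewClusterClient(&redis.ClusterOptions{
		Addrs:         []string{"redis-cluster-endpoint:6379"}, // placeholder
		ReadOnly:      true,
		RouteRandomly: true,
	})
}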

Slide 29

Slide 29 text

● Having tight timeouts allows us to reduce the impact of potential Redis issues on the rest of the application
● If we know what to expect from the Redis Cluster in terms of response time, we can tune the timeouts to fail early, allowing the rest of the application to keep executing in case of potential issues
● Timeout tuning is a half-scientific and half finger-in-the-air process...
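
In go-redis these knobs are the dial/read/write timeout options on the cluster client; the values below are purely illustrative, not the ones used in production:

package feature

import (
	"time"

	"github.com/go-redis/redis/v8"
)

// newTightTimeoutClient is a sketch with deliberately tight timeouts so a
// misbehaving Redis Cluster fails fast instead of stalling the request path.
// The values are illustrative only and need tuning per workload.
func newTightTimeoutClient() *redis.ClusterClient {
	return redis.NewClusterClient(&redis.ClusterOptions{
		Addrs:        []string{"redis-cluster-endpoint:6379"}, // placeholder
		DialTimeout:  100 * time.Millisecond,
		ReadTimeout:  50 * time.Millisecond,
		WriteTimeout: 50 * time.Millisecond,
	})
}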

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

● Simple Redis SET command
● The client knows which node to send this write request to, thanks to its Redis Cluster knowledge
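
A minimal sketch of this write with go-redis; the "user:<id>" key format is an assumption for illustration, not the actual schema:

package feature

import (
	"context"

	"github.com/go-redis/redis/v8"
)

// storeUserFeatures writes a user's serialized (e.g. protobuf) features with a
// plain SET; the cluster client picks the right shard from the key's hash slot.
func storeUserFeatures(ctx context.Context, rdb *redis.ClusterClient, userID string, features []byte) error {
	return rdb.Set(ctx, "user:"+userID, features, 0).Err() // 0 = no expiry
}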

Slide 33

Slide 33 text

● Simple Redis GET command
● The contract between the write and read side is the userID
● Checking whether the Redis error is of type "redis.Nil", which indicates the absence of the key
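
And the matching read side, again as a sketch with the same assumed key format; redis.Nil is go-redis's sentinel error for a missing key:

package feature

import (
	"context"
	"errors"

	"github.com/go-redis/redis/v8"
)

// loadUserFeatures reads the serialized features back by userID and treats a
// missing key (redis.Nil) as "no features", not as a failure.
func loadUserFeatures(ctx context.Context, rdb *redis.ClusterClient, userID string) ([]byte, bool, error) {
	b, err := rdb.Get(ctx, "user:"+userID).Bytes()
	if errors.Is(err, redis.Nil) {
		return nil, false, nil // key absent
	}
	if err != nil {
		return nil, false, err
	}
	return b, true, nil
}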

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

No content

Slide 39

Slide 39 text

No content

Slide 40

Slide 40 text

Restaurant features

Slide 41

Slide 41 text

● Multi-command operations such as MGET can only succeed if all of the keys belong to the same slot
https://www.tugberkugurlu.com/archive/redis-cluster-benefits-of-sharding-and-how-it-works#hash-tags
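
For example (a sketch, not from the talk): on a cluster, an MGET over keys that hash to different slots is rejected by the server, while the same call over keys in one slot succeeds.

package feature

import (
	"context"

	"github.com/go-redis/redis/v8"
)

// crossSlotMGet illustrates the constraint: keys without a shared hash tag
// usually hash to different slots, so this MGET is likely to be rejected with
// "CROSSSLOT Keys in request don't hash to the same slot".
func crossSlotMGet(ctx context.Context, rdb *redis.ClusterClient) error {
	_, err := rdb.MGet(ctx, "user:1001", "user:2002").Result()
	return err
}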

Slide 42

Slide 42 text

● Hash tags allow us to force certain keys to be stored in the same hash slot.
● When the Redis key contains a "{...}" pattern, only the substring between { and } is hashed in order to obtain the hash slot.
https://www.tugberkugurlu.com/archive/redis-cluster-benefits-of-sharding-and-how-it-works#hash-tags
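
A small sketch of that: keys sharing the same {...} hash tag hash identically, which CLUSTER KEYSLOT can confirm; the "{city:42}" key format here is illustrative, not the actual schema.

package feature

import (
	"context"
	"fmt"

	"github.com/go-redis/redis/v8"
)

// sameSlotDemo shows that only the substring inside {...} is hashed, so both
// keys below map to the same hash slot and can be used together in multi-key
// operations such as MGET.
func sameSlotDemo(ctx context.Context, rdb *redis.ClusterClient) error {
	a, err := rdb.ClusterKeySlot(ctx, "{city:42}:user:1001").Result()
	if err != nil {
		return err
	}
	b, err := rdb.ClusterKeySlot(ctx, "{city:42}:user:2002").Result()
	if err != nil {
		return err
	}
	fmt.Println(a == b) // true: identical hash tag, identical slot
	return nil
}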

Slide 43

Slide 43 text

● None of our access patterns required us to go across a city boundary
● Therefore, we used the City ID as the hash tag value

Slide 44

Slide 44 text

● Same as on the write side, we use the City ID as the hash tag here to influence the shard selection and route us to the same node
● Bundling all Redis GET commands within a single TCP connection to improve performance by saving round trips
● Pipelined requests run in order, but unlike MGET they do not block other connections
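
A sketch of such a pipelined read in go-redis, assuming the hypothetical "{city:<id>}:user:<id>" key format from above; every key shares the city hash tag, so the whole pipeline targets a single shard in one round trip.

package feature

import (
	"context"
	"errors"
	"fmt"

	"github.com/go-redis/redis/v8"
)

// loadCityUserFeatures fetches features for many users of one city in a
// single pipelined round trip; all keys share the {city:<id>} hash tag so
// every GET hits the same shard.
func loadCityUserFeatures(ctx context.Context, rdb *redis.ClusterClient, cityID string, userIDs []string) (map[string][]byte, error) {
	cmds := make([]*redis.StringCmd, len(userIDs))
	_, err := rdb.Pipelined(ctx, func(pipe redis.Pipeliner) error {
		for i, id := range userIDs {
			cmds[i] = pipe.Get(ctx, fmt.Sprintf("{city:%s}:user:%s", cityID, id))
		}
		return nil
	})
	if err != nil && !errors.Is(err, redis.Nil) {
		return nil, err
	}
	out := make(map[string][]byte, len(userIDs))
	for i, cmd := range cmds {
		b, err := cmd.Bytes()
		if errors.Is(err, redis.Nil) {
			continue // this user has no features yet
		}
		if err != nil {
			return nil, err
		}
		out[userIDs[i]] = b
	}
	return out, nil
}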

Slide 45

Slide 45 text

● Around 850-1K queries per second
● ~9.72ms max p95 latency for the entire pipeline query

Slide 46

Slide 46 text

No content

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

● Increasing the number of node groups for your ElastiCache cluster will kick off an online resharding operation
● The new node group will inherit the same number of replicas as the other node groups

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

● You can increase/decrease the replica count independently of the shard count
● Note that there was a bug in Terraform regarding this, but it has been fixed; see github.com/hashicorp/terraform-provider-aws/issues/6184
https://docs.aws.amazon.com/AmazonElastiCache/latest/APIReference/API_IncreaseReplicaCount.html

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/AutoFailover.html#auto-failover-test

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

Software Engineer - Mid, Senior, Staff-level
Engineering Manager
Senior Software Engineer, Infrastructure
Machine Learning Engineer - Mid, Senior, Staff-level
Data Engineer
Data Scientist - Mid, Senior, Staff-level
Data Science Manager
Locations: London, Remote UK, Remote Poland
See the complete list at https://careers.deliveroo.co.uk/ !

Slide 57

Slide 57 text

No content

Slide 58

Slide 58 text

No content