Slide 1

Slide 1 text

Making Your Controllers Resilient: Workqueue To The Rescue
Madhav Jivrajani, Arsh Sharma - VMware

Slide 2

Slide 2 text

Queue

Slide 3

Slide 3 text

What types of faults can occur when you operate in a distributed setting?

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

In distributed systems, failure is the norm.

Slide 7

Slide 7 text

In distributed systems, failure is the norm, which is why we have to design for failure.

Slide 8

Slide 8 text

What is the cause of these failures?

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

Transient errors: “weelllll, maybe if I try again, it’ll work fine”

Slide 14

Slide 14 text

The concept of retrying is extremely useful in Distributed Systems. It can make or break your system.

Slide 15

Slide 15 text

Retry storms!

Slide 16

Slide 16 text

client-go
• client-go is a library used to communicate with a k8s cluster.
• https://github.com/kubernetes/client-go
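
A minimal sketch of what talking to a cluster with client-go looks like, assuming an out-of-cluster setup that reads the kubeconfig from its default location (an in-cluster controller would use rest.InClusterConfig() instead):

```go
package main

import (
	"context"
	"fmt"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Build a REST config from the local kubeconfig (out-of-cluster access).
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}

	// Create a typed clientset and list pods across all namespaces.
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("there are %d pods in the cluster\n", len(pods.Items))
}
```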

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

SharedInformer
• We can stay informed about events like pod creation, node joining, etc. by using a primitive called SharedInformer, which client-go exposes in its cache package.
• Previously, each controller had its own informer cache that it would use.
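
A sketch of wiring up a shared pod informer through a shared informer factory; `clientset` is assumed to be the client built in the previous sketch, and the handler bodies are just placeholders:

```go
package controller

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// startPodInformer wires a shared pod informer up with simple event handlers.
func startPodInformer(clientset kubernetes.Interface, stopCh <-chan struct{}) {
	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			fmt.Printf("pod added: %s/%s\n", pod.Namespace, pod.Name)
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			// compare oldObj and newObj and react to the change
		},
		DeleteFunc: func(obj interface{}) {
			// clean up after the deleted object
		},
	})

	factory.Start(stopCh)            // start all informers created by this factory
	factory.WaitForCacheSync(stopCh) // block until the initial LIST has filled the cache
}
```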

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

The Big Picture https://github.com/kubernetes/sample-controller/blob/master/docs/controller-client-go.md

Slide 21

Slide 21 text

Enter Queues
Here we just had a print statement in our handler function, but most of the time your `AddFunc` will simply push events to a work queue, as in the sketch below.
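
A sketch of what that looks like, replacing the print-only handlers from the previous sketch: derive a namespace/name key for the object and push it onto a rate-limited workqueue. `podInformer` and the usual imports (cache, workqueue) are assumed from the earlier sketches.

```go
// Create a rate-limited workqueue; keys (not whole objects) get queued.
queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc: func(obj interface{}) {
		// MetaNamespaceKeyFunc turns the object into a "<namespace>/<name>" key.
		if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
			queue.Add(key)
		}
	},
	UpdateFunc: func(oldObj, newObj interface{}) {
		if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
			queue.Add(key)
		}
	},
	DeleteFunc: func(obj interface{}) {
		// The deletion-handling variant also copes with tombstone objects.
		if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
			queue.Add(key)
		}
	},
})
```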

Slide 22

Slide 22 text

What is the workqueue and why is it important?

Slide 23

Slide 23 text

workqueue package https://pkg.go.dev/k8s.io/client-go/util/workqueue

Slide 24

Slide 24 text

To understand the functionality this package provides, we'll look at the following two things:
● How does the queue itself work?
● What are the different extensions to this queue that are provided?

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

How does enqueuing work with these 2 sets?
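
This is not the actual client-go implementation (which also handles locking and blocking Gets), but a simplified sketch of how a queue backed by a dirty set and a processing set behaves:

```go
// Item is whatever gets queued; in controllers it is typically a string key.
type Item string

// WorkQueue is a simplified model of the workqueue's internals.
type WorkQueue struct {
	queue      []Item            // order in which items are handed out
	dirty      map[Item]struct{} // items that need processing
	processing map[Item]struct{} // items currently being processed
}

func NewWorkQueue() *WorkQueue {
	return &WorkQueue{
		dirty:      make(map[Item]struct{}),
		processing: make(map[Item]struct{}),
	}
}

// Add marks an item dirty and enqueues it, unless it is already queued
// or is currently being processed.
func (q *WorkQueue) Add(item Item) {
	if _, ok := q.dirty[item]; ok {
		return // already waiting to be processed; don't enqueue twice
	}
	q.dirty[item] = struct{}{}
	if _, ok := q.processing[item]; ok {
		return // will be re-queued when Done is called
	}
	q.queue = append(q.queue, item)
}

// Get pops the next item and moves it from the dirty set to the processing set.
func (q *WorkQueue) Get() Item {
	item := q.queue[0]
	q.queue = q.queue[1:]
	q.processing[item] = struct{}{}
	delete(q.dirty, item)
	return item
}

// Done removes the item from the processing set; if it was re-added while
// being processed (i.e. it is dirty again), it goes back onto the queue.
func (q *WorkQueue) Done(item Item) {
	delete(q.processing, item)
	if _, ok := q.dirty[item]; ok {
		q.queue = append(q.queue, item)
	}
}
```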

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

Due to this queue pattern, you can have:
● Multiple producers producing items to be processed, and multiple consumers that pop these items out and process them (see the worker sketch below).
● This allows for “parallelizing” the work that needs to be done.
● The next question is: what happens when an item is done processing?
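
A sketch of that consumer side: a few worker goroutines all loop over the same queue. `queue` is assumed to be the rate-limited workqueue from the earlier sketch, `syncHandler` (the business logic) and `handleErr` (sketched with the best practices later) are hypothetical names, and the usual imports (workqueue, wait, time) are omitted.

```go
// runWorkers starts `workers` goroutines that all consume from the same queue.
func runWorkers(queue workqueue.RateLimitingInterface, workers int, stopCh <-chan struct{}) {
	for i := 0; i < workers; i++ {
		// wait.Until restarts the worker loop every second until stopCh closes.
		go wait.Until(func() {
			for processNextItem(queue) {
			}
		}, time.Second, stopCh)
	}
}

// processNextItem pops one key off the queue and hands it to the business logic.
func processNextItem(queue workqueue.RateLimitingInterface) bool {
	key, shutdown := queue.Get()
	if shutdown {
		return false
	}
	// Done must always be called, so the item can be re-queued if it was
	// re-added while we were processing it.
	defer queue.Done(key)

	err := syncHandler(key.(string)) // business logic
	handleErr(queue, err, key)       // requeue/forget decision lives elsewhere
	return true
}
```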

Slide 39

Slide 39 text

No content

Slide 40

Slide 40 text

No content

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

● It is possible that we have multiple instances of the same item waiting to be processed. How do we ensure that we process this item just once?
● With multiple concurrent consumers, how do we prevent the same item from being processed multiple times, concurrently?
○ This is important because if we process an item in duplicate, in the best case you end up doing additional work to rectify the mistake, and in the worst case you get a thrashing effect of processing and retrying.

Slide 45

Slide 45 text

No content

Slide 46

Slide 46 text

No content

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

No content

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

What if another item ‘1’ gets added?

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

A few things to note:
● Format of the key: <namespace>/<name>
● If the object has no namespace, the key is just <name>
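
For illustration, the key helpers in k8s.io/client-go/tools/cache look roughly like this in use (`obj` is assumed to be an object coming from the informer):

```go
// Build a key for an object; for namespaced objects this is "<namespace>/<name>",
// for cluster-scoped objects (e.g. Nodes) it is just "<name>".
key, err := cache.MetaNamespaceKeyFunc(obj)
if err != nil {
	// handle the error (e.g. log and skip the object)
}

// In the worker, split the key back into its parts; namespace is "" for
// cluster-scoped objects.
namespace, name, err := cache.SplitMetaNamespaceKey(key)
```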

Slide 56

Slide 56 text

What types of queues does the workqueue package have?

Slide 57

Slide 57 text

● Delaying queue
○ Extends the queue with the ability to add an element after a specified duration.
● Rate limiting queue
○ It uses the delaying queue to rate limit items being added to the queue.
○ The default rate limiter is a simple exponential rate limiter that rate limits per key.
■ If a key has had n re-queues, it will be added back to the queue after 2^n * someBaseDelay.
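
A short sketch of constructing both queue types; the base delay and max delay values here are illustrative choices, not something this slide prescribes:

```go
package main

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Delaying queue: the item only becomes available after the given duration.
	delaying := workqueue.NewDelayingQueue()
	delaying.AddAfter("default/my-pod", 5*time.Second)

	// Rate-limiting queue: built on the delaying queue, with per-key
	// exponential backoff. With a base delay of 5ms, the nth re-queue of a
	// key is delayed by roughly 2^n * 5ms, capped at the max delay.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second)
	rateLimited := workqueue.NewRateLimitingQueue(limiter)
	rateLimited.AddRateLimited("default/my-pod")
}
```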

Slide 58

Slide 58 text

https://engineering.bitnami.com/articles/a-deep-dive-into-kubernetes-controllers.html

Slide 59

Slide 59 text

A few practices to keep in mind (both are sketched below):
● Always handle errors outside your business logic.
○ Handling errors typically consists of re-queueing a work item.
● Start workers for processing items from the workqueue only after the cache has synced successfully.
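
A sketch of both practices together, reusing `runWorkers` and `syncHandler` from the earlier sketches; the retry limit of 5 is an arbitrary choice for illustration:

```go
package controller

import (
	"fmt"

	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/util/workqueue"
)

// run starts the informers and only launches workers once all caches have synced.
func run(factory informers.SharedInformerFactory, queue workqueue.RateLimitingInterface, stopCh <-chan struct{}) error {
	factory.Start(stopCh)

	for informerType, synced := range factory.WaitForCacheSync(stopCh) {
		if !synced {
			return fmt.Errorf("cache failed to sync for %v", informerType)
		}
	}

	runWorkers(queue, 2, stopCh) // safe to process items now
	<-stopCh
	return nil
}

// handleErr keeps the requeue/retry decision outside the business logic:
// syncHandler just returns an error, and handleErr decides what to do with it.
func handleErr(queue workqueue.RateLimitingInterface, err error, key interface{}) {
	if err == nil {
		queue.Forget(key) // clear the rate limiter's retry history for this key
		return
	}
	if queue.NumRequeues(key) < 5 {
		queue.AddRateLimited(key) // re-queue with exponential backoff
		return
	}
	queue.Forget(key) // give up after too many retries
	utilruntime.HandleError(err)
}
```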

Slide 60

Slide 60 text

Metrics that are exposed by the workqueue package as part of the /metrics endpoint

Slide 61

Slide 61 text

No content

Slide 62

Slide 62 text

Resources and References
● Kubernetes Controllers
● Kubernetes client-go workqueue example
● Kubernetes sample-controller
● Workqueue package

Slide 63

Slide 63 text

Thank You!
Arsh: Twitter @RinkiyaKeDad, K8s slack @arsh
Madhav: Twitter @MadhavJivrajani, K8s slack @madhav