Making Your Controllers Resilient: Workqueue To The Rescue

Madhav Jivrajani, Arsh Sharma - VMware

What types of faults can occur when you operate in a distributed setting?

In distributed systems, failure is the norm, which is why we have to design for failure.

What is the cause of these failures?

Transient errors: “weelllll, maybe if I try again, it’ll work fine”

The concept of retrying is extremely useful in distributed systems. It can make or break your system.

Retry storms!

client-go
• client-go is a library used to communicate with a k8s cluster.
• https://github.com/kubernetes/client-go

SharedInformer
• We can stay informed about events such as pod creation or a node joining by using SharedInformer, a primitive exposed by client-go in the cache package.
• Previously, each controller had its own informer cache; with a SharedInformer, controllers share a single cache.

The Big Picture
https://github.com/kubernetes/sample-controller/blob/master/docs/controller-client-go.md

Enter Queues
So far our handler function only had a print statement, but most of the time your `AddFunc` will just push the event onto a work queue.

What is the workqueue and why is it important?

workqueue package https://pkg.go.dev/k8s.io/client-go/util/workqueue

To understand the functionality this package provides, we’ll look at the following 2 things:
• How does the queue itself work?
• What are the different extensions to this queue that are provided?

How does enqueuing work with the queue's 2 sets (the dirty set and the processing set)?

Due to this queue pattern, you can have:
• Multiple producers producing items to be processed, and multiple consumers that pop these items out and process them.
• This allows for “parallelizing” the work that needs to be done.
• Next question is: what happens when an item is done processing?

• It is possible that we have multiple instances of the same item waiting to be processed. How do we ensure that we process this item just once?
• With multiple concurrent consumers, how do we prevent the same item from being processed multiple times, concurrently?
  ◦ This is important because if we process an item twice, the best case is that you do additional work to rectify the mistake, and the worst case is a thrashing effect of processing and retrying.

What if another item ‘1’ gets added?
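To make the question concrete, here is a minimal sketch of how a queue with a dirty set and a processing set can deduplicate items. It is a simplified, single-goroutine toy model written for this transcript, not client-go's code: the `toyQueue` type and its helpers are hypothetical, and the real queue additionally guards this state with a lock and blocks in `Get` until an item is available.

```go
package main

import "fmt"

// toyQueue is a hypothetical, simplified model of the workqueue bookkeeping.
type toyQueue struct {
	order      []string            // FIFO order in which items are handed out
	dirty      map[string]struct{} // items that still need processing
	processing map[string]struct{} // items currently held by a worker
}

func newToyQueue() *toyQueue {
	return &toyQueue{
		dirty:      map[string]struct{}{},
		processing: map[string]struct{}{},
	}
}

// Add marks item as needing work. Duplicates collapse into one entry, and an
// item that is being processed is not queued again until Done is called.
func (q *toyQueue) Add(item string) {
	if _, ok := q.dirty[item]; ok {
		return // already waiting, nothing to do
	}
	q.dirty[item] = struct{}{}
	if _, ok := q.processing[item]; ok {
		return // a worker holds it: remember it is dirty, re-queue on Done
	}
	q.order = append(q.order, item)
}

// Get hands out the next item and moves it from dirty to processing.
func (q *toyQueue) Get() (string, bool) {
	if len(q.order) == 0 {
		return "", false
	}
	item := q.order[0]
	q.order = q.order[1:]
	q.processing[item] = struct{}{}
	delete(q.dirty, item)
	return item, true
}

// Done releases item. If it was re-added while being processed, it is queued again.
func (q *toyQueue) Done(item string) {
	delete(q.processing, item)
	if _, ok := q.dirty[item]; ok {
		q.order = append(q.order, item)
	}
}

// Len reports how many items are waiting to be handed out.
func (q *toyQueue) Len() int { return len(q.order) }

func main() {
	q := newToyQueue()
	q.Add("1")
	q.Add("1") // duplicate while waiting: collapsed into one entry
	fmt.Println("waiting after two adds:", q.Len())

	item, _ := q.Get()
	q.Add("1") // re-added while a worker holds it: deferred
	fmt.Println("waiting while processing:", q.Len())

	q.Done(item) // the deferred add is queued now
	fmt.Println("waiting after Done:", q.Len())
}
```

In this model a second ‘1’ that arrives while the first is still waiting is simply dropped, and a ‘1’ that arrives while a worker holds the item is only handed out again after that worker calls `Done`, so no two workers ever process the same item at once.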

A few things to note:
• Format of the key: <resource_namespace>/<resource_name>
• If no namespace, it’ll just be <resource_name>
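As a rough illustration of that key format, the sketch below builds and splits such keys using only the standard library. The helpers `objectKey` and `splitKey` are hypothetical names made up for this example; in a real controller you would normally rely on client-go's `cache.MetaNamespaceKeyFunc` and `cache.SplitMetaNamespaceKey` instead.

```go
package main

import (
	"fmt"
	"strings"
)

// objectKey builds "<namespace>/<name>", or just "<name>" when the
// resource has no namespace (hypothetical helper for illustration).
func objectKey(namespace, name string) string {
	if namespace == "" {
		return name
	}
	return namespace + "/" + name
}

// splitKey reverses objectKey and rejects keys with more than one "/".
func splitKey(key string) (namespace, name string, err error) {
	parts := strings.Split(key, "/")
	switch len(parts) {
	case 1:
		return "", parts[0], nil
	case 2:
		return parts[0], parts[1], nil
	}
	return "", "", fmt.Errorf("unexpected key format: %q", key)
}

func main() {
	fmt.Println(objectKey("default", "nginx")) // namespaced resource
	fmt.Println(objectKey("", "node-1"))       // cluster-scoped resource

	ns, name, err := splitKey("default/nginx")
	fmt.Println(ns, name, err)
}
```

Queueing a small string key rather than the object itself keeps the queue cheap to deduplicate, and the worker can look up the latest state of the object from the informer cache when it actually processes the key.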

What type of queues does the workqueue package have?

• Delaying queue
  ◦ Extends the queue with the ability to add an element after a specified duration.
• Rate limiting queue
  ◦ It uses the delaying queue to rate limit items being added to the queue.
  ◦ The default rate limiter backs off exponentially per key (and additionally applies an overall token-bucket limit).
    ▪ If a key has had n re-queues, it will be added back to the queue after 2^n * someBaseDelay.
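The 2^n * someBaseDelay rule can be sketched in a few lines. This is a toy per-key limiter written for illustration, not the library's implementation: the `perKeyBackoff` type, the `demoBase` and `demoMax` values, and the idea of capping the delay at a maximum are all assumptions of this example.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative values only, chosen for this example.
const (
	demoBase = 5 * time.Millisecond
	demoMax  = 1 * time.Second
)

// perKeyBackoff is a hypothetical per-key exponential limiter.
type perKeyBackoff struct {
	base, max time.Duration
	requeues  map[string]int // how often each key has been re-queued
}

func newPerKeyBackoff(base, max time.Duration) *perKeyBackoff {
	return &perKeyBackoff{base: base, max: max, requeues: map[string]int{}}
}

// When returns how long to wait before re-adding key (base * 2^n, capped
// at max) and records one more re-queue for that key.
func (b *perKeyBackoff) When(key string) time.Duration {
	n := b.requeues[key]
	b.requeues[key] = n + 1

	delay := b.base
	for i := 0; i < n; i++ {
		delay *= 2
		if delay >= b.max {
			return b.max // stop early so the delay never overflows
		}
	}
	if delay > b.max {
		return b.max
	}
	return delay
}

// Forget resets the re-queue count, e.g. after the key was processed successfully.
func (b *perKeyBackoff) Forget(key string) { delete(b.requeues, key) }

func main() {
	b := newPerKeyBackoff(demoBase, demoMax)
	for i := 0; i < 4; i++ {
		fmt.Println("default/nginx retry", i, "waits", b.When("default/nginx"))
	}
	// A different key has its own, independent backoff.
	fmt.Println("default/other waits", b.When("default/other"))
}
```

Because the delay is tracked per key, one item that keeps failing backs off further and further without slowing down the other items in the queue, which is what helps keep retries from turning into a retry storm.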

https://engineering.bitnami.com/articles/a-deep-dive-into-kubernetes-controllers.html

A few practices to keep in mind:
• Always handle errors outside your business logic.
  ◦ Handling errors typically consists of requeueing a work item.
• Start the workers that process items from the workqueue only after the cache has synced successfully.
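One way to read the first practice: the business logic only returns an error, and a single place in the worker loop decides whether to requeue. The sketch below models that with a hypothetical in-memory `retryQueue`; its method names mirror the rate limiting queue's `AddRateLimited`, `Forget`, and `NumRequeues`, but the type itself, `processNext`, `flakyReconcile`, and the `maxRetries` limit are assumptions made for this example (the toy also re-adds immediately instead of after a delay).

```go
package main

import (
	"errors"
	"fmt"
)

// retryQueue is a hypothetical stand-in for a rate limiting workqueue.
type retryQueue struct {
	items    []string
	requeues map[string]int
}

func newRetryQueue() *retryQueue { return &retryQueue{requeues: map[string]int{}} }

func (q *retryQueue) Add(key string) { q.items = append(q.items, key) }

func (q *retryQueue) Get() (string, bool) {
	if len(q.items) == 0 {
		return "", false
	}
	key := q.items[0]
	q.items = q.items[1:]
	return key, true
}

// AddRateLimited re-adds key and counts the requeue (a real queue would delay it).
func (q *retryQueue) AddRateLimited(key string) {
	q.requeues[key]++
	q.items = append(q.items, key)
}

func (q *retryQueue) Forget(key string)          { delete(q.requeues, key) }
func (q *retryQueue) NumRequeues(key string) int { return q.requeues[key] }

// processNext pops one key, runs the business logic, and handles the
// outcome in one place. It returns false when the queue is empty.
func processNext(q *retryQueue, reconcile func(string) error, maxRetries int) bool {
	key, ok := q.Get()
	if !ok {
		return false
	}
	err := reconcile(key) // business logic knows nothing about the queue
	switch {
	case err == nil:
		q.Forget(key) // success: reset the retry history
	case q.NumRequeues(key) < maxRetries:
		q.AddRateLimited(key) // transient failure: try again later
	default:
		q.Forget(key) // give up after maxRetries requeues
	}
	return true
}

// flakyReconcile returns business logic that fails its first n calls,
// plus a pointer to its call counter.
func flakyReconcile(n int) (func(string) error, *int) {
	calls := 0
	return func(key string) error {
		calls++
		if calls <= n {
			return errors.New("transient failure")
		}
		return nil
	}, &calls
}

func main() {
	q := newRetryQueue()
	q.Add("default/nginx")
	reconcile, calls := flakyReconcile(2)
	for processNext(q, reconcile, 5) {
	}
	fmt.Println("reconcile calls:", *calls)
}
```

For the second practice, a real controller would typically start loops like this only after `cache.WaitForCacheSync` has returned true, so that workers never act on a partially filled cache.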

Metrics that are exposed by the workqueue package as part of the /metrics endpoint

Resources and References
• Kubernetes Controllers
• Kubernetes client-go workqueue example
• Kubernetes sample-controller
• Workqueue package

Thank You!
Arsh: Twitter @RinkiyaKeDad, K8s slack @arsh
Madhav: Twitter @MadhavJivrajani, K8s slack @madhav
