Slide 1

Slide 1 text

AWS SQS Queues & Kubernetes Autoscaling Pitfalls: Stories. Cloud Native Foundation meetup @dcard.tw @eric_khun

Slide 2

Slide 2 text

"Make it work, make it right, make it fast" - Kent Beck (Agile Manifesto, Extreme Programming)

Slide 3

Slide 3 text

"Make it work, make it right, make it fast" - Kent Beck (Agile Manifesto, Extreme Programming)

Slide 4

Slide 4 text

"Make it work, make it right, make it fast" - Kent Beck (Agile Manifesto, Extreme Programming)

Slide 5

Slide 5 text

Buffer

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

Buffer • 80 employees, 12 time zones, all remote

Slide 8

Slide 8 text

Quick intro

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

Main pipeline flow

Slide 11

Slide 11 text

It can look like this... (Golang talk @Maicoin):

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

How do we send posts to social media?

Slide 15

Slide 15 text

A bit of history...
2010 -> 2012: Joel (founder/CEO) - 1 cronjob on a Linode server, $20/mo, 512 MB of RAM
2012 -> 2017: Sunil (ex-CTO) - crons running on AWS Elastic Beanstalk / supervisord
2017 -> now: Kubernetes / CronJob controller

Slide 16

Slide 16 text

AWS Elastic Beanstalk: Kubernetes:

Slide 17

Slide 17 text

At what scale? ~ 3 million SQS messages per hour

Slide 18

Slide 18 text

Different patterns for many queues

Slide 19

Slide 19 text

Are our workers (consumers of the SQS queues) efficient?

Slide 20

Slide 20 text

Are our workers efficient?

Slide 21

Slide 21 text

Are our workers efficient?

Slide 22

Slide 22 text

Empty messages? > Workers try to pull messages from SQS, but receive “nothing” to process
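To make "empty messages" concrete: a minimal consumer sketch (not Buffer's actual worker; boto3, with a hypothetical queue URL) where every ReceiveMessage call is billed whether or not anything comes back.

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")  # region is an assumption
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # hypothetical

def process(message):
    # Placeholder for the real work done on one SQS message.
    print(message["Body"])

while True:
    # With the default short polling (WaitTimeSeconds=0), this returns
    # immediately. If the queue is empty there is no "Messages" key at all,
    # but the API call is still counted and charged: an "empty receive".
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
    for msg in resp.get("Messages", []):
        process(msg)
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])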

Slide 23

Slide 23 text

Number of empty messages per queue

Slide 24

Slide 24 text

Sum of empty messages on all queues

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

1,000,000 API calls to AWS cost $0.40. We make 7.2B calls/month for “empty messages”. It costs ~$25k/year. > Me:
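A quick sanity check of the order of magnitude, using only the numbers on this slide (how much is actually saved depends on how many of those empty calls go away):

# Back-of-the-envelope with the slide's figures.
cost_per_million_calls = 0.40            # USD per 1,000,000 SQS requests
empty_calls_per_month = 7_200_000_000    # 7.2B "empty message" calls per month

monthly = empty_calls_per_month / 1_000_000 * cost_per_million_calls
print(monthly)        # ~2,880 USD per month
print(monthly * 12)   # tens of thousands of USD per year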

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

AWS SQS Doc

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

Or in the AWS console
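The deck's text doesn't name the exact setting (it defers to the SQS docs and the console screenshot), but the usual fix for empty receives is long polling, i.e. raising ReceiveMessageWaitTimeSeconds from its default of 0. A minimal boto3 sketch of that assumption:

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")  # region is an assumption
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # hypothetical

# Queue-level long polling: ReceiveMessage now waits up to 20s for a message
# instead of immediately returning (and billing) an empty response.
sqs.set_queue_attributes(
    QueueUrl=queue_url,
    Attributes={"ReceiveMessageWaitTimeSeconds": "20"},  # default is "0" (short polling)
)

# The same thing can be requested per call, overriding the queue default:
resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20)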

Slide 31

Slide 31 text

Results?

Slide 32

Slide 32 text

empty messages

Slide 33

Slide 33 text

AWS

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

$120/day > $50/day: ~$70 saved daily > ~$2,000/month > ~$25,000/year (it’s USD, not TWD)

Slide 36

Slide 36 text

Paid for querying “nothing”

Slide 37

Slide 37 text

(for the past 8 years)

Slide 38

Slide 38 text

Benefits
- Saving money
- Less CPU usage (fewer empty requests)
- Less throttling (misleading)
- Fewer containers > better resource allocation: memory/CPU requests

Slide 39

Slide 39 text

Why did that happen?

Slide 40

Slide 40 text

Default options

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

Never questioning what’s working decently, or the way it’s always been done

Slide 43

Slide 43 text

What could have helped?
- Infra as code (explicit options / standardization)
- SLIs/SLOs (keep re-evaluating what’s important)
- AWS architecture reviews (tagging / recommendations from AWS solutions architects)
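As an illustration of "explicit options / standardization": declare every queue with its settings spelled out instead of inherited defaults. Shown with boto3 for brevity; real infrastructure-as-code would live in Terraform/CloudFormation, and all names and values below are assumptions.

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")  # region is an assumption
sqs.create_queue(
    QueueName="posts-to-publish",  # hypothetical queue name
    Attributes={
        "ReceiveMessageWaitTimeSeconds": "20",   # long polling, stated on purpose
        "VisibilityTimeout": "60",               # example value
        "MessageRetentionPeriod": "345600",      # 4 days, example value
    },
)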

Slide 44

Slide 44 text

Make it work, Make it right, Make it fast

Slide 45

Slide 45 text

Make it work, Make it right, Make it fast

Slide 46

Slide 46 text

Do you remember?

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

No content

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

Need to run analytics on Twitter/FB/IG/LKD… on millions of posts, faster

Slide 51

Slide 51 text

workers consuming time

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

What’s the problem?

Slide 54

Slide 54 text

Resources are allocated but not doing anything most of the time. Developers keep trying to find compromises on the number of workers.

Slide 55

Slide 55 text

How to solve it?

Slide 56

Slide 56 text

Autoscaling! (with Keda.sh) Supported by IBM / Red Hat / Microsoft

Slide 57

Slide 57 text

No content

Slide 58

Slide 58 text

Results

Slide 59

Slide 59 text

No content

Slide 60

Slide 60 text

But notice anything?

Slide 61

Slide 61 text

Before autoscaling

Slide 62

Slide 62 text

After autoscaling

Slide 63

Slide 63 text

After autoscaling

Slide 64

Slide 64 text

What’s happening?

Slide 65

Slide 65 text

Downscaling

Slide 66

Slide 66 text

Why?

Slide 67

Slide 67 text

Pod deletion lifecycle

Slide 68

Slide 68 text

What went wrong
- Workers didn’t handle the SIGTERM sent by k8s
- They kept processing messages
- Messages were halfway processed when the worker was killed
- Those messages were sent back to the queue again
- Fewer workers because of downscaling

Slide 69

Slide 69 text

Solution
- When receiving SIGTERM, stop processing new messages
- Set a grace period long enough to process the current message
if (SIGTERM) { // finish current processing and stop receiving new messages }
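Spelled out a bit more, the same idea as a Python sketch (not the deck's actual worker; queue URL and processing are placeholders): a signal handler flips a flag, the loop stops pulling new messages and drains the one in flight, and the pod's terminationGracePeriodSeconds must comfortably cover one message's processing time.

import signal
import boto3

shutting_down = False

def handle_sigterm(signum, frame):
    global shutting_down
    shutting_down = True  # Kubernetes is downscaling this pod

signal.signal(signal.SIGTERM, handle_sigterm)

sqs = boto3.client("sqs", region_name="us-east-1")  # region is an assumption
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # hypothetical

def process(message):
    # Placeholder for the real work done on one SQS message.
    print(message["Body"])

while not shutting_down:
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        process(msg)
        # Delete only after processing: if a worker is killed mid-way, the
        # message simply reappears after the visibility timeout instead of
        # being lost, which is the behaviour described above.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])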

Slide 70

Slide 70 text

No content

Slide 71

Slide 71 text

No content

Slide 72

Slide 72 text

And it can also help with SQS empty messages

Slide 73

Slide 73 text

Make it work, Make it right, Make it fast

Slide 74

Slide 74 text

Make it work, Make it right, Make it fast

Slide 75

Slide 75 text

Thanks!

Slide 76

Slide 76 text

Questions? monitory.io taiwangoldcard.com travelhustlers.co ✈