A bit of history...
- 2010 -> 2012: Joel (founder/CEO): 1 cronjob on a Linode server ($20/mo, 512 MB of RAM)
- 2012 -> 2017: Sunil (ex-CTO): crons running on AWS Elastic Beanstalk / supervisord
- 2017 -> now: Kubernetes / CronJob controller
What could have helped?
- Infra as code (explicit options / standardization)
- SLIs/SLOs (keep re-evaluating what's important)
- AWS architecture reviews (tagging/recommendations from AWS solutions architects)
What went wrong
- Workers didn't handle the SIGTERM sent by Kubernetes
- They kept processing messages
- Workers were killed with messages only halfway processed
- Those messages were sent back to the queue again
- Fewer workers were available because of downscaling
Solution
- On SIGTERM, stop picking up new messages
- Set a grace period long enough to finish processing the current message

if (SIGTERM) {
  // finish current processing and stop receiving new messages
}
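A minimal sketch of this pattern in Python (the `Worker` class and the message handling are illustrative, not the actual codebase): the SIGTERM handler flips a shutdown flag, and the worker loop checks the flag before picking up each new message, so in-flight work finishes cleanly.

```python
import signal


class Worker:
    """Toy worker loop illustrating graceful SIGTERM handling."""

    def __init__(self):
        self.shutting_down = False
        # Kubernetes sends SIGTERM before killing the pod;
        # flip a flag instead of exiting immediately.
        signal.signal(signal.SIGTERM, self._handle_sigterm)

    def _handle_sigterm(self, signum, frame):
        self.shutting_down = True

    def run(self, messages):
        processed = []
        for msg in messages:
            if self.shutting_down:
                # Stop taking new messages; the in-flight one
                # (if any) has already completed by this point.
                break
            processed.append(msg.upper())  # stand-in for real processing
        return processed
```

On the Kubernetes side, pair this with a `terminationGracePeriodSeconds` on the pod spec that is comfortably longer than the time needed to process one message, so the eventual SIGKILL never lands mid-message.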