Make it work, Make it right, Make it fast
Kent Beck (Agile Manifesto - Extreme Programming)
Slide 3
Make it work, Make it right, Make it fast
Kent Beck (Agile Manifesto - Extreme Programming)
Slide 4
Make it work, Make it right, Make it fast
Kent Beck (Agile Manifesto - Extreme Programming)
Slide 5
Buffer
Slide 6
No content
Slide 7
Buffer
• 80 employees, 12 time zones, all remote
Slide 8
Quick intro
Slide 9
No content
Slide 10
Main pipelines flow
Slide 11
It can look like...
golang Talk @Maicoin:
Slide 12
No content
Slide 13
No content
Slide 14
How do we send posts to social media?
Slide 15
A bit of history...
2010 -> 2012: Joel (founder/CEO), 1 cron job on a Linode server ($20/mo, 512 MB of RAM)
2012 -> 2017: Sunil (ex-CTO), crons running on AWS Elastic Beanstalk / supervisord
2017 -> now: Kubernetes / CronJob controller
Slide 16
AWS Elastic Beanstalk:
Kubernetes:
Slide 17
At what scale?
~3 million SQS messages per hour
Slide 18
Different patterns for many queues
Slide 19
Are our workers
(consumers of the SQS queues)
efficient?
Slide 20
Are our workers efficient?
Slide 21
Are our workers efficient?
Slide 22
Empty messages?
> Workers try to pull messages from SQS,
but receive “nothing” to process
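As a rough illustration (not Buffer's actual code), here is what such a worker loop can look like with aws-sdk-go and the default short polling; the queue URL and processing logic are placeholders:

package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

func main() {
	svc := sqs.New(session.Must(session.NewSession()))
	queueURL := "https://sqs.us-east-1.amazonaws.com/123456789012/posts" // placeholder

	for {
		// Default short polling (WaitTimeSeconds = 0): the call returns
		// immediately, often with zero messages, and every call is billed.
		out, err := svc.ReceiveMessage(&sqs.ReceiveMessageInput{
			QueueUrl:            aws.String(queueURL),
			MaxNumberOfMessages: aws.Int64(10),
		})
		if err != nil {
			log.Println("receive error:", err)
			continue
		}
		if len(out.Messages) == 0 {
			continue // an "empty message": a billed request with nothing to process
		}
		// ... process out.Messages ...
	}
}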
Slide 23
Number of empty messages per queue
Slide 24
Sum of empty messages on all queues
Slide 25
No content
Slide 26
1,000,000 API calls to AWS cost $0.40
We make 7.2B calls/month for “empty messages”
It costs ~$25k/year
> Me:
Slide 27
No content
Slide 28
AWS SQS Doc
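The usual fix, and presumably what this doc screenshot shows, is long polling: give ReceiveMessage a non-zero WaitTimeSeconds and SQS holds the request until messages arrive (or up to 20 seconds, the maximum) instead of answering empty right away. Continuing the hypothetical sketch above:

		// One long-polling call can replace dozens of empty short polls.
		out, err := svc.ReceiveMessage(&sqs.ReceiveMessageInput{
			QueueUrl:            aws.String(queueURL),
			MaxNumberOfMessages: aws.Int64(10),
			WaitTimeSeconds:     aws.Int64(20), // wait up to the 20s maximum
		})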
Slide 29
No content
Slide 30
Or in the AWS console
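Per-call WaitTimeSeconds works, but the console route sets it once for the whole queue via the ReceiveMessageWaitTimeSeconds attribute, so every consumer benefits. A minimal SDK equivalent, again with the hypothetical svc and queueURL from above:

	// Make long polling the queue-wide default.
	_, err := svc.SetQueueAttributes(&sqs.SetQueueAttributesInput{
		QueueUrl: aws.String(queueURL),
		Attributes: map[string]*string{
			"ReceiveMessageWaitTimeSeconds": aws.String("20"),
		},
	})
	if err != nil {
		log.Println("set attributes error:", err)
	}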
Slide 31
Results?
Slide 32
empty messages
Slide 33
AWS
Slide 34
No content
Slide 35
From ~$120 down to ~$50 spent daily
> $2,000 saved / month
> $25,000 saved / year
(it’s USD, not TWD)
Slide 36
Paid for querying “nothing”
Slide 37
(for the past 8 years)
Slide 38
Benefits
- Saving money
- Less CPU usage (fewer empty requests)
- Less throttling (misleading)
- Fewer containers
> Better resource allocation: memory/CPU requests
Slide 39
Why did that happen?
Slide 40
Default options
Slide 41
No content
Slide 42
Never questioning what’s working decently, or the way it’s always been done
Slide 43
What could have helped?
Infra as code (explicit options / standardization)
SLI/SLOs (keep re-evaluating what’s important)
AWS architecture reviews (tagging / recommendations from AWS solutions architects)
Slide 44
Make it work, Make it right, Make it fast
Slide 45
Make it work, Make it right, Make it fast
Slide 46
Do you remember?
Slide 47
No content
Slide 48
No content
Slide 49
No content
Slide 50
Need to run analytics on Twitter/FB/IG/LKD… on millions of posts, faster
Slide 51
workers consuming time
Slide 52
No content
Slide 53
What’s the problem?
Slide 54
Resources allocated but not doing anything most of the time
Developers trying to find a compromise on the number of workers
Slide 55
How to solve it?
Slide 56
Autoscaling!
(with KEDA, keda.sh)
Supported by IBM / Red Hat / Microsoft
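For context, KEDA scales a Deployment on queue depth through a ScaledObject manifest; a sketch with illustrative names and thresholds (not Buffer's actual config) might look like:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: posts-worker-scaler        # illustrative name
spec:
  scaleTargetRef:
    name: posts-worker             # the worker Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/posts
        queueLength: "100"         # target messages per replica
        awsRegion: us-east-1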
Slide 57
No content
Slide 58
Results
Slide 59
No content
Slide 60
But notice anything?
Slide 61
Before autoscaling
Slide 62
After autoscaling
Slide 63
After autoscaling
Slide 64
What’s happening?
Slide 65
Downscaling
Slide 66
Why?
Slide 67
Pod deletion lifecycle
Slide 68
What went wrong
- Workers didn’t handle the SIGTERM sent by k8s
- They kept processing messages
- Pods were killed with messages halfway processed
- Messages were sent back to the queue again
- Fewer workers left because of downscaling
Slide 69
Solution
- When receiving SIGTERM, stop processing new messages
- Set a grace period long enough to process the current message (Go sketch below)
if (SIGTERM) {
  // finish current processing and stop receiving new messages
}
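Fleshed out as a runnable Go sketch (processMessage is a hypothetical stand-in for one receive/process/delete cycle, not the actual worker code):

package main

import (
	"context"
	"log"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// ctx is cancelled when Kubernetes sends SIGTERM during downscaling.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
	defer stop()

	for {
		select {
		case <-ctx.Done():
			// Stop pulling new messages; the message in flight already
			// finished, because processMessage below runs to completion.
			log.Println("SIGTERM received, shutting down gracefully")
			return
		default:
			processMessage() // hypothetical: receive + handle + delete one message
		}
	}
}

func processMessage() {
	time.Sleep(2 * time.Second) // simulate work on one message
}

The pod's terminationGracePeriodSeconds then has to exceed the slowest single message, otherwise Kubernetes still sends SIGKILL mid-processing.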