A bit of history...
- 2010 -> 2012: Joel (founder/CEO): 1 cronjob on a Linode server ($20/mo, 512 MB of RAM)
- 2012 -> 2017: Sunil (ex-CTO): crons running on AWS Elastic Beanstalk / supervisord
- 2017 -> now: Kubernetes / CronJob controller
What could have helped?
- Infra as code (explicit options / standardization)
- SLIs/SLOs (keep re-evaluating what's important)
- AWS architecture reviews (tagging/recommendations from AWS solutions architects)
What went wrong
- Workers didn't handle the SIGTERM sent by Kubernetes
- They kept processing messages
- Workers were killed with messages only halfway processed
- Those messages were sent back to the queue again
- Fewer workers were available because of downscaling
Solution
- On SIGTERM, stop picking up new messages
- Set a grace period long enough to finish processing the current message

if (SIGTERM) {
  // finish current processing and stop receiving new messages
}
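A minimal sketch of this pattern in Python (the `Worker` class and the message handling are illustrative, not the actual codebase): the SIGTERM handler flips a shutdown flag, and the worker loop checks the flag before picking up each new message, so in-flight work finishes cleanly.

```python
import signal


class Worker:
    """Toy worker loop illustrating graceful SIGTERM handling."""

    def __init__(self):
        self.shutting_down = False
        # Kubernetes sends SIGTERM before killing the pod;
        # flip a flag instead of exiting immediately.
        signal.signal(signal.SIGTERM, self._handle_sigterm)

    def _handle_sigterm(self, signum, frame):
        self.shutting_down = True

    def run(self, messages):
        processed = []
        for msg in messages:
            if self.shutting_down:
                # Stop taking new messages; the in-flight one
                # (if any) has already completed by this point.
                break
            processed.append(msg.upper())  # stand-in for real processing
        return processed
```

On the Kubernetes side, pair this with a `terminationGracePeriodSeconds` on the pod spec that is comfortably longer than the time needed to process one message, so the eventual SIGKILL never lands mid-message.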