
AWS SQS queues & Kubernetes Autoscaling Pitfalls Stories

Eric Khun
October 26, 2020

Talk at the Cloud Native Computing Foundation meetup @dcard.tw

Transcript

  1. AWS SQS queues & Kubernetes Autoscaling
    Pitfalls Stories
    Cloud Native Computing Foundation meetup
    @dcard.tw
    @eric_khun


  2. Make it work, Make it right, Make it fast
    Kent Beck (Agile Manifesto - Extreme Programming)


  3. Make it work, Make it right, Make it fast
    Kent Beck (Agile Manifesto - Extreme Programming)


  4. Make it work, Make it right, Make it fast
    Kent Beck (Agile Manifesto - Extreme Programming)


  5. Buffer
    • 80 employees, 12 time zones, all remote


  6. Main pipelines flow


  7. It can look like...
    (Golang talk @Maicoin)


  8. How do we send posts to
    social media?


  9. A bit of history...
    2010 -> 2012: Joel (founder/CEO), 1 cron job on a Linode server ($20/mo, 512 MB of RAM)
    2012 -> 2017: Sunil (ex-CTO), crons running on AWS Elastic Beanstalk / supervisord
    2017 -> now: Kubernetes / CronJob controller


  10. AWS Elastic Beanstalk:
    Kubernetes:


  11. At what scale?
    ~ 3 million SQS messages per hour


  12. Different patterns for many queues


  13. Are our workers
    (consumers of the SQS queues)
    efficient?


  14. Are our workers efficient?


  15. Are our workers efficient?


  16. Empty messages?
    > Workers try to pull messages from SQS,
    but receive “nothing” to process
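
    What an empty receive looks like, as a minimal boto3 sketch (the queue URL is
    a placeholder, and Buffer's workers are not necessarily written in Python):

    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # placeholder

    # With the queue's default ReceiveMessageWaitTimeSeconds of 0 (short polling),
    # this call returns immediately. When the queue is empty the response has no
    # "Messages" key at all, yet the ReceiveMessage request is still billed.
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
    print(resp.get("Messages", []))  # often just []: an "empty receive"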



  17. Number of empty messages per queue



  18. Sum of empty messages on all queues


  19. 1,000,000 API calls to AWS cost $0.40
    We have 7.2B calls/month for “empty messages”
    It costs ~$25k/year
    > Me:


  20. Or in the AWS console


  21. empty messages


  22. $120 -> $50 spent daily
    > $2,000 saved / month
    > $25,000 saved / year
    (it’s USD, not TWD)
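
    A quick back-of-the-envelope check, using only the numbers on this slide:

    daily_before, daily_after = 120, 50        # USD spent on SQS requests per day
    daily_saving = daily_before - daily_after  # 70 USD/day
    monthly_saving = daily_saving * 30         # ~2,100 USD/month
    yearly_saving = daily_saving * 365         # ~25,550 USD/year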


  23. Paid for querying “nothing”


  24. (for the past 8 years)


  25. Benefits
    - Saving money
    - Less CPU usage (fewer empty requests)
    - Less throttling (misleading)
    - Fewer containers
    > Better resource allocation: memory/CPU requests


  26. Why did that happen?


  27. Default options
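
    One default that fits the story above is SQS short polling
    (ReceiveMessageWaitTimeSeconds = 0): every ReceiveMessage call returns
    immediately, whether or not there is anything to read. A sketch of enabling
    long polling with boto3 (the queue URL is a placeholder):

    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # placeholder

    # Make long polling the queue-level default...
    sqs.set_queue_attributes(
        QueueUrl=QUEUE_URL,
        Attributes={"ReceiveMessageWaitTimeSeconds": "20"},
    )

    # ...or override it per call: wait up to 20s for a message instead of
    # returning (and being billed) right away when the queue is empty.
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)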


  28. Never questioning what’s
    working decently, or the
    way it’s always been done


  29. What could have helped?
    Infra as code (explicit options / standardization; see the sketch below)
    SLI/SLOs (keep re-evaluating what’s important)
    AWS architecture reviews (tagging/recommendations from AWS solutions architects)
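
    For the "infra as code" point, a sketch of making the queue options explicit,
    here with Pulumi's Python SDK (the tool, queue name and values are
    illustrative assumptions, not Buffer's actual setup):

    import pulumi_aws as aws

    # Declaring the queue in code makes the polling behaviour explicit and
    # reviewable, instead of silently inheriting console defaults.
    queue = aws.sqs.Queue(
        "posts-to-publish",             # hypothetical queue name
        receive_wait_time_seconds=20,   # long polling for consumers
        visibility_timeout_seconds=60,  # illustrative value
    )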


  30. Make it work, Make it right, Make it fast


  31. Make it work, Make it right, Make it fast


  32. Do you remember?


  33. Need to run analytics on Twitter/FB/IG/LKD… on millions
    of posts, faster


  34. workers consuming time


  35. What’s the problem?


  36. Resources allocated and not doing anything most of
    the time
    Developers trying to find a compromise on the
    number of workers


  37. How to solve it?


  38. Autoscaling!
    (with Keda.sh)
    Supported by IBM / Red Hat / Microsoft


  39. But notice anything?


  40. Before autoscaling


  41. After autoscaling


  42. After autoscaling


  43. What’s happening?


  44. Pod deletion lifecycle


  45. what went wrong
    - Workers didn’t handle the SIGTERM sent by k8s
    - Kept processing messages
    - Messages were half-processed when the workers were killed
    - Messages were sent back to the queue again
    - Fewer workers because of downscaling


  46. Solution
    - When receiving SIGTERM, stop processing new
    messages
    - Set a grace period long enough to process the
    current message (fuller sketch below)
    import signal, threading

    stop_event = threading.Event()
    # on SIGTERM: finish the current message and stop receiving new ones
    signal.signal(signal.SIGTERM, lambda *_: stop_event.set())
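
    A fuller, self-contained sketch of the worker loop around that handler
    (illustrative Python; the queue URL, timings and process() body are
    assumptions, not Buffer's actual worker code):

    import signal
    import threading
    import boto3

    stop_event = threading.Event()
    signal.signal(signal.SIGTERM, lambda *_: stop_event.set())

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # placeholder

    def process(message):
        ...  # placeholder for the real work (publish a post, collect analytics, ...)

    # Keep pulling until Kubernetes asks the pod to stop; finish and delete the
    # message in hand, then exit cleanly instead of being killed mid-processing.
    while not stop_event.is_set():
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            process(msg)
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

    The pod's terminationGracePeriodSeconds then only needs to cover one long-poll
    wait plus the processing of a single message.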


  47. And it can also help with
    SQS empty messages


  48. Make it work, Make it right, Make it fast


  49. Make it work, Make it right, Make it fast


  50. Questions?
    monitory.io
    taiwangoldcard.com
    travelhustlers.co ✈
