
AWS SQS queues & Kubernetes Autoscaling Pitfalls Stories

Eric Khun
October 26, 2020

Talk at the Cloud Native Computing Foundation meetup @dcard.tw


Transcript

  1. AWS SQS queues & Kubernetes Autoscaling
    Pitfalls Stories
    Cloud Native Foundation meetup
    @dcard.tw
    @eric_khun


  2. Make it work, Make it right, Make it fast
    Kent Beck (Agile Manifesto - Extreme Programming)

  3. Make it work, Make it right, Make it fast
    Kent Beck (Agile Manifesto - Extreme Programming)

  4. Make it work, Make it right, Make it fast
    Kent Beck (Agile Manifesto - Extreme Programming)

  5. Buffer


  7. Buffer
    • 80 employees, 12 time zones, all remote

  8. Quick intro


  10. Main pipelines flow


  11. It can look like ...
    (golang talk @Maicoin):


  14. How do we send posts to
    social media?

  15. A bit of history...
    2010 -> 2012: Joel (founder/CEO), 1 cronjob on a Linode server
    ($20/mo, 512 MB of RAM)
    2012 -> 2017: Sunil (ex-CTO), crons running on
    AWS Elastic Beanstalk / supervisord
    2017 -> now: Kubernetes / CronJob controller

  16. AWS Elastic Beanstalk:
    Kubernetes:


  17. At what scale?
    ~ 3 million SQS messages per hour


  18. Different patterns for many queues


  19. Are our workers
    (consumers of the SQS queues)
    efficient?

  20. Are our workers efficient?

  21. Are our workers efficient?

  22. Empty messages?
    > Workers try to pull messages from SQS,
    but receive “nothing” to process


  23. Number of empty messages per queue



  24. Sum of empty messages on all queues


  26. 1,000,000 API calls to AWS cost $0.40
    We have 7.2B calls/month for “empty messages”
    It costs ~$25k/year
    > Me:

  28. AWS SQS Doc


  30. Or in the AWS console

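    The doc and console screenshots aren’t in this transcript, but the standard fix for paying for “empty receives” is SQS long polling: instead of the default short polling (return immediately, even when there is nothing to deliver), ReceiveMessage waits up to 20 seconds for a message. A minimal sketch, assuming the AWS SDK for Go v2 and a placeholder queue URL:

    package main

    import (
        "context"
        "fmt"
        "log"

        "github.com/aws/aws-sdk-go-v2/aws"
        "github.com/aws/aws-sdk-go-v2/config"
        "github.com/aws/aws-sdk-go-v2/service/sqs"
    )

    func main() {
        ctx := context.Background()
        cfg, err := config.LoadDefaultConfig(ctx)
        if err != nil {
            log.Fatal(err)
        }
        client := sqs.NewFromConfig(cfg)

        // Long polling: wait up to 20s for messages instead of returning
        // immediately, which is what generates the empty receives above.
        out, err := client.ReceiveMessage(ctx, &sqs.ReceiveMessageInput{
            QueueUrl:            aws.String("https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"), // placeholder
            MaxNumberOfMessages: 10,
            WaitTimeSeconds:     20,
        })
        if err != nil {
            log.Fatal(err)
        }
        for _, msg := range out.Messages {
            fmt.Println(aws.ToString(msg.Body))
        }
    }

    The same behaviour can also be made the queue’s default with the ReceiveMessageWaitTimeSeconds queue attribute (presumably the console setting shown on the slide), so every consumer long-polls without code changes.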

  31. Results?


  32. empty messages


  33. AWS


  35. $120 > $50 per day
    > $2,000 saved / month
    > $25,000 saved / year
    (it’s USD, not TWD)

  36. Paid for querying “nothing”


  37. (for the past 8 years)

  38. Benefits
    - Saving money
    - Less CPU usage (fewer empty requests)
    - Less throttling (misleading)
    - Fewer containers
    > Better resource allocation: memory/CPU requests

  39. Why did that happen?


  40. Default options


  42. Never questioning what’s
    working decently or the
    way it’s always been done

  43. What could have helped?
    Infra as code (explicit options / standardization)
    SLI/SLOs (keep re-evaluating what’s important)
    AWS architecture reviews (tagging / recommendations
    from AWS solutions architects)

  44. Make it work, Make it right, Make it fast


  45. Make it work, Make it right, Make it fast


  46. Do you remember?


  50. Need to run analytics on Twitter/FB/IG/LKD… on millions
    of posts, faster

  51. workers consuming time


  53. What’s the problem?


  54. Resources allocated and not doing anything most of
    the time
    Developers trying to find compromises on the
    number of workers

  55. How to solve it?


  56. Autoscaling!
    (with Keda.sh)
    Supported by IBM / Red Hat / Microsoft
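    Slide 57’s screenshot isn’t in the transcript; a minimal sketch of what a KEDA ScaledObject for an SQS-backed worker can look like (the deployment name, queue URL and thresholds are placeholders, not Buffer’s actual values):

    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: sqs-worker-scaler            # placeholder
    spec:
      scaleTargetRef:
        name: sqs-worker                 # Deployment running the queue consumers
      minReplicaCount: 1
      maxReplicaCount: 30
      triggers:
        - type: aws-sqs-queue
          metadata:
            queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/example-queue
            queueLength: "100"           # target backlog per replica
            awsRegion: us-east-1

    KEDA drives a HorizontalPodAutoscaler from the queue length, so workers scale up when the backlog grows and back down when it drains, which is exactly where the downscaling pitfall shown next comes from.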


  58. Results


  60. But notice anything?


  61. Before autoscaling


  62. After autoscaling


  63. After autoscaling


  64. What’s happening?


  65. Downscaling


  66. Why?


  67. Pod deletion lifecycle

  68. What went wrong
    - Workers didn’t handle the SIGTERM sent by k8s
    - They kept processing messages
    - Messages were halfway processed when the pod was killed
    - Messages were sent back to the queue again
    - Fewer workers because of downscaling

  69. Solution
    - When receiving SIGTERM, stop processing new
    messages (see the sketch below)
    - Set a grace period long enough to process the
    current message
    if (SIGTERM) {
    // finish current processing and stop receiving new messages
    }

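    A minimal sketch of that in Go (the deck doesn’t show the real worker code; processOneMessage below is a placeholder for “receive, handle and delete one SQS message”):

    package main

    import (
        "context"
        "log"
        "os/signal"
        "syscall"
    )

    // processOneMessage stands in for the real work: receive one SQS message
    // with long polling, process it, then delete it from the queue.
    func processOneMessage(ctx context.Context) {
        _ = ctx // placeholder
    }

    func main() {
        // ctx is cancelled when Kubernetes sends SIGTERM during a downscale.
        ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
        defer stop()

        for {
            select {
            case <-ctx.Done():
                // Stop pulling new messages; the message in flight was finished
                // by the previous loop iteration, so the pod can exit cleanly.
                log.Println("SIGTERM received, shutting down gracefully")
                return
            default:
                processOneMessage(ctx)
            }
        }
    }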
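    The “grace period long enough” part lives in the pod spec. A sketch with an illustrative value (it should exceed the worst-case time to process one message; the Kubernetes default is 30 seconds):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: sqs-worker                       # placeholder, the Deployment KEDA scales
    spec:
      selector:
        matchLabels:
          app: sqs-worker
      template:
        metadata:
          labels:
            app: sqs-worker
        spec:
          # Time between SIGTERM and SIGKILL: long enough to finish the
          # message currently being processed.
          terminationGracePeriodSeconds: 120
          containers:
            - name: worker
              image: example/sqs-worker:latest   # placeholder image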

  72. And it can also help with
    SQS empty messages

  73. Make it work, Make it right, Make it fast


  74. Make it work, Make it right, Make it fast


  75. Thanks!


  76. Questions?
    monitory.io
    taiwangoldcard.com
    travelhustlers.co ✈
