Badge Poser v3.0 - A DevOps Journey

This talk shares the whole journey. It starts with being handed the keys to Pandora's box, then wanders through the deep, dark forest of uncertainty and instability of hastily deployed systems. The goal is to declutter and reach a stable state where order reigns over chaos, where the poor guy can finally sleep at night and the pager goes silent for a while. By the end we reach the much-desired level of confidence needed to experiment, change things, and upgrade infrastructure without worry.

Fabio Cicerchia

November 25, 2020

Transcript

  1. View Slide

  2. View Slide

  3. Hello!
    I AM FABIO CICERCHIA
    SW & Cloud Engineer @
    You can find me at: @fabiocicerchia

    View Slide

  4. Disclaimer

    View Slide

  5. View Slide

  6. View Slide

  7. View Slide

  8. Let’s Start

    View Slide

  9. View Slide

  10. https://en.wikipedia.org/wiki/Bianco,_rosso_e_Verdone

    View Slide

  11. View Slide

  12. View Slide

  13. Step #1
    What did I get myself into?!

    View Slide

  14. Information Gathering

    View Slide

  15. Describe VM Config
    RAM: 2GB
    CPU: 2
    HDD: 50GB
    Software: Apache 2.4.10, PHP 5.6.19, Redis 2.8.17, MySQL 5.5.47

    View Slide

  16. Describe VM Config - Notes
    ● Apache v2.4.10
    ○ Released on 2014-07-19: Age 6 years
    ○ Available v2.4.43
    ● PHP v5.6.19
    ○ Released on 2016-03-03: Age 4 years
    ○ Available v7.4.5
    ○ EOL: 2018-12-31
    ● Redis v2.8.17
    ○ Released on 2014-09-19: Age 6 years
    ○ Available v6.0.1
    ● MySQL v5.5.47
    ○ Released on 2015-12-07: Age 5 years
    ○ Available v8.0.20
    http://archive.apache.org/dist/httpd/
    https://www.php.net/releases/index.php
    https://www.php.net/supported-versions.php
    https://github.com/redis/redis
    https://docs.redislabs.com/latest/rs/administering/product-lifecycle/

    View Slide

  17. Step #2
    What do I need to do?!

    View Slide

  18. Define a “plan”

    View Slide

  19. Step #3
    Find time to do it

    View Slide

  20. View Slide

  21. Just Start!
    ● Nginx v1.18.0
    ● PHP v7.4.7
    ● Redis v4.0.10
    https://www.nginx.com/
    https://www.php.net/
    https://redis.io/

    View Slide

  22. View Slide

  23. ...Then Refine
    ● Ansible → Provisioning
    ● Ansible Galaxy → Ansible’s Recipes Repo
    ● AWS CloudFormation → Infrastructure as Code*
    ● Let’s Encrypt → SSL/TLS Certificate**
    * Terraform is way cooler
    ** Yes, SSL is deprecated
    https://www.ansible.com/
    https://galaxy.ansible.com/
    https://aws.amazon.com/cloudformation/
    https://letsencrypt.org/

    View Slide

  24. View Slide

  25. https://github.com/PUGX/badge-poser/blob/master/sys/cloudformation/alpine-stack.yaml

    View Slide

  26. https://medium.com/@wintonjkt/ansible-101-getting-started-1daaff872b64

    View Slide

  27. Ansible: What’s It For?
    - Ansible is perfect for VMs (for example, EC2 in our scenario).
    - It is redundant for ECS with Fargate, since the underlying layer is fully managed by AWS.
    - It could be useful for ECS without Fargate, to provision the EC2 instances where the containers will run.
    - Useful for deploy and rollback.

    View Slide

  28. https://github.com/PUGX/badge-poser/blob/54cd440ebc91245cda4735db86dca897d024a838/sys/ansible/playbooks/setup.yml

    View Slide

  29. Wait for it...

    View Slide

  30. View Slide

  31. Step #4
    Start Fixing

    View Slide

  32. View Slide

  33. Start Throwing a Bunch of Things At It

    View Slide

  34. Workaround #1: Not Quite There Yet
    ● pm.max_children = 150
    pm.start_servers = 5
    pm.min_spare_servers = 5
    pm.max_spare_servers = 35
    ● emergency_restart_threshold = 10
    emergency_restart_interval = 1m
    process_control_timeout = 10s
    ● memory_limit = 192M
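
A quick back-of-the-envelope check suggests why these values were only a workaround. This is a minimal sketch, assuming every PHP-FPM worker could grow up to memory_limit at the same time (it is a per-request ceiling, not typical usage) and that roughly 1.5 GB of the 2 GB VM is left for PHP. The 1.5 GB figure and the function names are hypothetical, not taken from the talk.

```python
# Worst-case memory estimate for a PHP-FPM pool (illustrative only).


def worst_case_pool_mb(max_children: int, memory_limit_mb: int) -> int:
    """Upper bound if every worker reached memory_limit simultaneously."""
    return max_children * memory_limit_mb


def max_children_for(available_mb: int, memory_limit_mb: int) -> int:
    """Largest pm.max_children whose worst case still fits in available_mb."""
    return available_mb // memory_limit_mb


if __name__ == "__main__":
    # Values from the slide: pm.max_children = 150, memory_limit = 192M.
    print(worst_case_pool_mb(150, 192), "MB in the worst case")
    # Assumption: about 1.5 GB (1536 MB) of the 2 GB VM is free for PHP.
    print(max_children_for(1536, 192), "workers fit in the worst case")
```

With the slide's values the worst case is 28,800 MB, far above the VM's 2 GB of RAM, so the pool limit alone cannot protect the machine from running out of memory.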

    View Slide

  35. View Slide

  36. It Keeps Crashing: Need Visibility
    Added Logz.io & Filebeat
    Added UptimeRobot
    https://logz.io/
    https://www.elastic.co/beats/filebeat
    https://uptimerobot.com/

    View Slide

  37. https://medium.com/@mirzapour/centralized-logging-with-elasticsearch-kibana-logstash-and-filebeat-57fea01be5e7

    View Slide

  38. https://github.com/PUGX/badge-poser/blob/54cd440ebc91245cda4735db86dca897d024a838/sys/filebeat/filebeat.yml

    View Slide

  39. View Slide

  40. View Slide

  41. View Slide

  42. Step #5
    Shit Happens

    View Slide

  43. https://en.wikipedia.org/wiki/The_IT_Crowd
    FIRE!
    FIRE!
    FIRE!

    View Slide

  44. Moved to StatusCake

    View Slide

  45. Redis Down: OOM Killer
    http://turnoff.us/geek/oom-killer/

    View Slide

  46. https://en.wikipedia.org/wiki/Boris_(TV_series)

    View Slide

  47. View Slide

  48. Redis Down: OOM Killer: Workaround #2
    Handle Redis Daemon via Supervisor

    View Slide

  49. Zero CPU Credits

    View Slide

  50. Zero CPU Credits
    CPU capped at 20%
    https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/burstable-credits-baseline-concepts.html
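
The credit model behind this slide can be sketched with a little arithmetic. On burstable EC2 instances, one CPU credit equals one vCPU running at 100% for one minute, and once the balance hits zero the instance is held at its baseline. The sketch below assumes 2 vCPUs (from the VM description earlier) and the 20% baseline shown on the slide. The starting balance and the sustained utilization are hypothetical inputs, and the function names are made up for illustration.

```python
# CPU credit arithmetic for a burstable instance (illustrative only).
# One CPU credit = one vCPU running at 100% utilization for one minute.


def credits_earned_per_hour(vcpus: int, baseline_pct: float) -> float:
    """Credits accrued per hour at the given per-vCPU baseline."""
    return vcpus * baseline_pct / 100 * 60


def credits_spent_per_hour(vcpus: int, utilization_pct: float) -> float:
    """Credits consumed per hour at the given average per-vCPU utilization."""
    return vcpus * utilization_pct / 100 * 60


def hours_until_exhausted(balance, vcpus, baseline_pct, utilization_pct):
    """Hours until the balance reaches zero, or None if it never drains."""
    net = credits_spent_per_hour(vcpus, utilization_pct) - credits_earned_per_hour(
        vcpus, baseline_pct
    )
    if net <= 0:
        return None  # at or below baseline the balance does not shrink
    return balance / net


if __name__ == "__main__":
    # Hypothetical: 576 credits banked, sustained 50% utilization.
    print(hours_until_exhausted(576, 2, 20, 50))
```

Under these assumptions the instance earns 24 credits per hour and spends 60, so a bank of 576 credits lasts 16 hours before the CPU is capped.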

    View Slide

  51. View Slide

  52. View Slide

  53. CPU capped at 20%
    http://nginx.org/en/docs/http/ngx_http_fastcgi_module.html

    View Slide

  54. Zero CPU Credits

    View Slide

  55. View Slide

  56. Step #6
    Where Are We At Now?

    View Slide

  57. View Slide

  58. Step #8
    Ditch Everything

    View Slide

  59. View Slide

  60. Step #9
    Start Over

    View Slide

  61. Container - Part 1
    ● AWS ECS
    ● AWS ECR
    https://aws.amazon.com/ecs/
    https://aws.amazon.com/ecr/

    View Slide

  62. Shit Happens (Again)

    View Slide

  63. OOM Killer - The Revenge
    http://turnoff.us/geek/oom-killer/

    View Slide

  64. View Slide

  65. OOM Killer - The Revenge: Workaround #3
    ● Set Autoscaling fixed to min 1 running container

    View Slide

  66. Container - Part 2
    ● Split All-In-One Container in Multi Container
    ● Use Alpine

    View Slide

  67. OOM Killer - Highlander

    View Slide

  68. View Slide

  69. One Step Back

    View Slide

  70. View Slide

  71. View Slide

  72. https://github.com/aws/amazon-ecs-agent/issues/1187

    View Slide

  73. View Slide

  74. View Slide

  75. Despite the Working Fix...

    View Slide

  76. ...Alpine Wasn’t Quite Stable

    View Slide

  77. View Slide

  78. Switch Back to All-in-One Debian
    Since the multi-container setup on Alpine was unstable, I just switched back to the good ol’ working all-in-one container on Debian.

    View Slide

  79. View Slide

  80. View Slide

  81. Alpine: Trial & Error

    View Slide

  82. View Slide

  83. MADNESS
    ALPINE
    NGINX+LUA

    View Slide

  84. https://github.com/fabiocicerchia/nginx-lua

    View Slide

  85. Caching to the rescue

    View Slide

  86. NO STALE!

    View Slide

  87. Cache Statuses
    MISS – The response was not found in the cache and so was fetched from an origin server. The response might then have been cached.
    BYPASS – The response was fetched from the origin server instead of served from the cache because the request matched a proxy_cache_bypass directive. The response might then have been cached.
    EXPIRED – The entry in the cache has expired. The response contains fresh content from the origin server.
    https://www.nginx.com/blog/nginx-caching-guide/

    View Slide

  88. Cache Statuses
    STALE – The content is stale because the origin server is not responding correctly, and proxy_cache_use_stale was configured.
    UPDATING – The content is stale because the entry is currently being updated in response to a previous request, and proxy_cache_use_stale updating is configured.
    REVALIDATED – The proxy_cache_revalidate directive was enabled and NGINX verified that the current cached content was still valid (If-Modified-Since or If-None-Match).
    HIT – The response contains valid, fresh content direct from the cache.
    https://www.nginx.com/blog/nginx-caching-guide/
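
To see how often each status occurs, the statuses above can be tallied from an access log. This is a minimal sketch, assuming a custom log format whose last field is NGINX's $upstream_cache_status variable. The sample lines and helper names are hypothetical, and a real log format would need its own parsing.

```python
from collections import Counter

# Statuses described on the slides, as reported by $upstream_cache_status.
KNOWN_STATUSES = {
    "MISS", "BYPASS", "EXPIRED", "STALE", "UPDATING", "REVALIDATED", "HIT",
}
# Statuses where the client received cached content rather than a fresh
# response from the origin server.
SERVED_FROM_CACHE = {"HIT", "STALE", "UPDATING", "REVALIDATED"}


def tally(lines):
    """Count cache statuses, assuming the status is the last field."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields and fields[-1] in KNOWN_STATUSES:
            counts[fields[-1]] += 1
    return counts


def cache_ratio(counts):
    """Fraction of counted requests answered with cached content."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(counts[s] for s in SERVED_FROM_CACHE) / total


if __name__ == "__main__":
    # Hypothetical log lines; "-" means the request never touched the cache.
    sample = [
        "GET /a 200 HIT",
        "GET /b 200 MISS",
        "GET /c 200 STALE",
        "GET /d 200 -",
    ]
    counts = tally(sample)
    print(dict(counts), cache_ratio(counts))
```

A rising share of MISS or EXPIRED entries is a hint that the cache keys or expiry times need another look.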

    View Slide

  89. Step #10
    Observability

    View Slide

  90. Moving away from EC2 and from Logz.io.
    Again?

    View Slide

  91. Need to know the traffic trend
    CloudWatch

    View Slide

  92. Get More Metrics & Desiderata

    View Slide

  93. 0
    0

    View Slide

  94. Interlude #1
    Serverless

    View Slide

  95. https://bref.sh/

    View Slide

  96. View Slide

  97. https://github.com/brefphp/bref/issues/497

    View Slide

  98. View Slide

  99. https://aws.amazon.com/blogs/compute/introducing-the-new-serverless-lamp-stack/

    View Slide

  100. View Slide

  101. View Slide

  102. View Slide

  103. View Slide

  104. View Slide

  105. Interlude #2
    PHP8.0.0RC*

    View Slide

  106. https://wiki.php.net/todo/php80

    View Slide

  107. Rolling Updates
    https://dzone.com/articles/take-release-automation-to-the-next-level-episode-2

    View Slide

  108. Dark Canary
    10% / 25%
    100%
    https://landing.google.com/sre/workbook/chapters/canarying-releases/
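
The stages on this slide (10%, 25%, 100%) can be illustrated with a deterministic percentage split. This is a sketch of one common way to decide which requests reach the canary, not a description of how the talk's deployment routed traffic. The request keys and function names are hypothetical.

```python
import hashlib

# Rollout stages from the slide, as percentages of traffic.
STAGES = (10, 25, 100)


def bucket(key: str) -> int:
    """Map a request or user key to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def route(key: str, canary_percent: int) -> str:
    """Send the key to the canary if its bucket falls below the percentage."""
    return "canary" if bucket(key) < canary_percent else "stable"


if __name__ == "__main__":
    keys = [f"user-{i}" for i in range(1000)]  # hypothetical request keys
    for percent in STAGES:
        share = sum(route(k, percent) == "canary" for k in keys) / len(keys)
        print(f"stage {percent}%: {share:.1%} of keys routed to canary")
```

Because the bucket depends only on the key, anyone routed to the canary at 10% stays on it at 25% and 100%, which keeps each user's experience consistent as the rollout widens.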

    View Slide

  109. View Slide

  110. View Slide

  111. View Slide

  112. FORKED TRAFFIC

    View Slide

  113. View Slide

  114. View Slide

  115. https://github.com/PUGX/badge-poser/pull/431

    View Slide

  116. Interlude #3
    Full Page Caching w/ Redis

    View Slide

  117. https://github.com/fabiocicerchia/go-proxy-cache

    View Slide

  118. Step #11
    Uptime

    View Slide

  119. View Slide

  120. View Slide

  121. View Slide

  122. View Slide

  123. View Slide

  124. ...but in the end...

    View Slide

  125. View Slide

  126. View Slide

  127. View Slide

  128. https://uptime.is/99.97
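
The downtime budget behind a figure like 99.97% is simple arithmetic. A minimal sketch, assuming a 365-day year and a 30-day month; the function name is made up for illustration.

```python
# Allowed downtime for a given uptime percentage (illustrative only).


def allowed_downtime_minutes(uptime_pct: float, days: float) -> float:
    """Minutes of downtime permitted over `days` at the given uptime."""
    return (100.0 - uptime_pct) / 100.0 * days * 24 * 60


if __name__ == "__main__":
    print(f"per year:  {allowed_downtime_minutes(99.97, 365):.2f} min")
    print(f"per month: {allowed_downtime_minutes(99.97, 30):.2f} min")
```

For 99.97% this works out to roughly 157.7 minutes per year, or about 13 minutes per 30-day month.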

    View Slide

  129. Deploying during breakfast

    View Slide

  130. Confidence Level

    View Slide

  131. Step #12
    Billing

    View Slide

  132. View Slide

  133. View Slide

  134. Elastic Static IP with Global Accelerator

    View Slide

  135. https://www.vice.com/it/article/evdyj4/hackerino-computer-militare-video

    View Slide

  136. Auto Refreshing Dashboard

    View Slide

  137. View Slide

  138. https://www.vice.com/it/article/evdyj4/hackerino-computer-militare-video

    View Slide

  139. Reduce Costs!

    View Slide

  140. So what did I learn?!

    View Slide

  141. Key Takeaways
    - Never trust code
      - Never trust yourself
    - Take small steps
      - It’ll help you figure out what went wrong
    - Version everything
      - Commit as often as possible
    - Never use the latest tag
      - Use specific versions
    - Think outside the box
      - Don’t stick to playing by the manual
    - Prefer quick and easy fixes
      - Reduce the odds of breaking things
    - Use the tools to make your life easier
      - So choose them carefully
    - Monitor & Benchmark!
      - Your best friends for troubleshooting
    * in random order
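
The "use specific versions" takeaway can be checked mechanically. This is a minimal sketch, assuming the takeaway refers to container image references such as those pushed to ECR. The helper name and the example references are hypothetical, and the parsing is simplified rather than a full implementation of the image reference grammar.

```python
# Check whether a container image reference is pinned (illustrative only).


def is_pinned(image: str) -> bool:
    """True if the reference has a digest or an explicit tag other than 'latest'."""
    if "@" in image:
        return True  # digest reference, e.g. name@sha256:...
    # Look only at the last path component so that a registry host:port
    # prefix is not mistaken for a tag.
    last = image.rsplit("/", 1)[-1]
    if ":" not in last:
        return False  # no tag means the implicit 'latest'
    tag = last.rsplit(":", 1)[1]
    return tag not in ("", "latest")


if __name__ == "__main__":
    for ref in ("nginx:1.18.0", "nginx", "php:latest", "localhost:5000/app:3.0"):
        print(ref, "->", "pinned" if is_pinned(ref) else "NOT pinned")
```

A check like this could run in CI against image references in task definitions or Dockerfiles, so an unpinned image is caught before it is deployed.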

    View Slide

  142. Questions?

    View Slide

  143. Thank You!

    View Slide