Badge Poser v3.0 - A DevOps Journey

This talk shares the whole journey. It starts with being handed the keys to Pandora's box and wandering through the deep, dark forest of uncertainty and instability left behind by hastily deployed systems. It continues with decluttering until reaching a stable stage where order reigns over chaos, where the poor guy can finally sleep at night and the pager goes silent for a while. By the end we reach the much-desired level of confidence where experimenting, changing things, and upgrading infrastructure are no longer a worry.

Fabio Cicerchia

November 25, 2020

Transcript

  1. None
  2. None
  3. Hello! I AM FABIO CICERCHIA SW & Cloud Engineer @

    You can find me at: @fabiocicerchia
  4. Disclaimer

  5. None
  6. None
  7. None
  8. Let’s Start

  9. None
  10. https://en.wikipedia.org/wiki/Bianco,_rosso_e_Verdone

  11. None
  12. None
  13. Step #1 What did I get myself into?!

  14. Information Gathering

  15. Describe VM Config

    RAM: 2GB, CPU: 2, HDD: 50GB
    Software: Apache 2.4.10, PHP 5.6.19, Redis 2.8.17, MySQL 5.5.47
  16. Describe VM Config - Notes

    • Apache v2.4.10 ◦ Released on 2014-07-19: Age 6 years ◦ Available v2.4.43
    • PHP v5.6.19 ◦ Released on 2016-03-03: Age 4 years ◦ Available v7.4.5 ◦ EOL: 2018-12-31
    • Redis v2.8.17 ◦ Released on 2014-09-19: Age 6 years ◦ Available v6.0.1
    • MySQL v5.5.47 ◦ Released on 2015-12-07: Age 5 years ◦ Available v8.0.20
    http://archive.apache.org/dist/httpd/ https://www.php.net/releases/index.php https://www.php.net/supported-versions.php https://github.com/redis/redis https://docs.redislabs.com/latest/rs/administering/product-lifecycle/
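The ages quoted on slide 16 can be reproduced with a small sketch. It assumes the ages are plain calendar-year differences against 2020 (an inference from the numbers on the slide, not something the deck states), and the 2020-11-25 check date is simply the date of the talk.

```python
from datetime import date

# Assumption: ages on the slide are calendar-year differences against 2020.
REFERENCE_YEAR = 2020

# Release dates as listed on slide 16.
RELEASED = {
    "Apache 2.4.10": date(2014, 7, 19),
    "PHP 5.6.19": date(2016, 3, 3),
    "Redis 2.8.17": date(2014, 9, 19),
    "MySQL 5.5.47": date(2015, 12, 7),
}

# End-of-life date for PHP 5.6 as listed on slide 16.
PHP_56_EOL = date(2018, 12, 31)


def age_in_years(released, reference_year=REFERENCE_YEAR):
    """Age as a calendar-year difference (ignores month and day)."""
    return reference_year - released.year


ages = {name: age_in_years(released) for name, released in RELEASED.items()}

# Was PHP 5.6 already past its EOL on the day of the talk?
php_past_eol = date(2020, 11, 25) > PHP_56_EOL

for name, years in ages.items():
    print(f"{name}: {years} years old")
```

A table like this is cheap to regenerate before each upgrade round, which makes it easier to see which component has drifted furthest.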
  17. Step #2 What do I need to do?!

  18. Define a “plan”

  19. Step #3 Find time to do it

  20. None
  21. Just Start!

    • Nginx v1.18.0 • PHP v7.4.7 • Redis v4.0.10
    https://www.nginx.com/ https://www.php.net/ https://redis.io/
  22. None
  23. ...Then Refine

    • Ansible → Provisioning
    • Ansible Galaxy → Ansible’s Recipes Repo
    • AWS CloudFormation → Infrastructure as Code*
    • Let’s Encrypt → SSL/TLS Certificate**
    * Terraform is way cooler
    ** Yes, SSL is deprecated
    https://www.ansible.com/ https://galaxy.ansible.com/ https://aws.amazon.com/cloudformation/ https://letsencrypt.org/
  24. None
  25. https://github.com/PUGX/badge-poser/blob/master/sys/cloudformation/alpine-stack.yaml

  26. https://medium.com/@wintonjkt/ansible-101-getting-started-1daaff872b64

  27. Ansible: What’s it for?

    - Ansible is perfect for VMs (for example EC2 in our scenario).
    - It is redundant for ECS with Fargate, since the underlying layer is fully managed by AWS.
    - It could be useful for ECS without Fargate, where it would provision the EC2 instances on which the containers run.
    - Useful for deploy and rollback.
  28. https://github.com/PUGX/badge-poser/blob/54cd440ebc91245cda4735db86dca897d024a838/sys/ansible/playbooks/setup.yml

  29. Wait for it...

  30. None
  31. Step #4 Start Fixing

  32. None
  33. Start Throwing a Bunch of Things At It

  34. Workaround #1: Not Quite There Yet

    • pm.max_children = 150
      pm.start_servers = 5
      pm.min_spare_servers = 5
      pm.max_spare_servers = 35
    • emergency_restart_threshold 10
      emergency_restart_interval 1m
      process_control_timeout 10s
    • memory_limit = 192M
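One way to see why this workaround was "not quite there yet" is to multiply the numbers out. The sketch below uses only values from the slides (2GB RAM, pm.max_children = 150, memory_limit = 192M). It treats memory_limit as a worst-case per-worker footprint, which is an assumption: real workers usually use less, and the 512MB reserved for the other services on the VM is a hypothetical figure.

```python
RAM_MB = 2 * 1024        # VM RAM from slide 15
MAX_CHILDREN = 150       # pm.max_children from slide 34
MEMORY_LIMIT_MB = 192    # PHP memory_limit from slide 34


def worst_case_memory_mb(children, per_child_mb):
    """Upper bound if every worker hits its memory_limit at the same time."""
    return children * per_child_mb


def safe_max_children(ram_mb, per_child_mb, reserved_mb=0):
    """Workers that fit in RAM after reserving memory for other services."""
    return (ram_mb - reserved_mb) // per_child_mb


worst_case = worst_case_memory_mb(MAX_CHILDREN, MEMORY_LIMIT_MB)
fits_all_ram = safe_max_children(RAM_MB, MEMORY_LIMIT_MB)
# Hypothetical: keep 512MB back for Redis, MySQL and the OS on the same VM.
fits_with_reserve = safe_max_children(RAM_MB, MEMORY_LIMIT_MB, reserved_mb=512)

print(f"worst case: {worst_case} MB on a {RAM_MB} MB machine")
print(f"workers that fit: {fits_all_ram} (or {fits_with_reserve} with a reserve)")
```

Under these assumptions the configured ceiling is roughly fourteen times the available RAM, which is consistent with the OOM Killer showing up later in the deck.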
  35. None
  36. It Keeps Crashing: Need Visibility

    Added Logz.io & Filebeat
    Added UptimeRobot
    https://logz.io/ https://www.elastic.co/beats/filebeat https://uptimerobot.com/
  37. https://medium.com/@mirzapour/centralized-logging-with-elasticsearch-kibana-logstash-and-filebeat-57fea01be5e7

  38. https://github.com/PUGX/badge-poser/blob/54cd440ebc91245cda4735db86dca897d024a838/sys/filebeat/filebeat.yml

  39. None
  40. None
  41. None
  42. Step #5 Shit Happens

  43. https://en.wikipedia.org/wiki/The_IT_Crowd FIRE! FIRE! FIRE!

  44. Moved to StatusCake

  45. Redis Down: OOM Killer http://turnoff.us/geek/oom-killer/

  46. https://en.wikipedia.org/wiki/Boris_(TV_series)

  47. None
  48. Redis Down: OOM Killer: Workaround #2

    Handle Redis Daemon via Supervisor
  49. Zero CPU Credits

  50. Zero CPU Credits CPU capped at 20% https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/burstable-credits-baseline-concepts.html
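The credit arithmetic behind "Zero CPU Credits" can be sketched as follows. It relies on the AWS definition that one CPU credit equals one vCPU at 100% utilization for one minute, and assumes the 2 vCPUs from slide 15 and the 20% baseline from slide 50. The starting balance of 576 credits and the 50% load are hypothetical inputs chosen for illustration; the deck does not state them.

```python
# Assumptions: 2 vCPUs (slide 15) and a 20% per-vCPU baseline (slide 50).
VCPUS = 2
BASELINE_PCT = 20


def credits_earned_per_hour(vcpus=VCPUS, baseline_pct=BASELINE_PCT):
    """Credits accrued per hour: 60 minutes x vCPUs x baseline share."""
    return 60 * vcpus * baseline_pct / 100


def net_credit_change_per_hour(utilization_pct, vcpus=VCPUS,
                               baseline_pct=BASELINE_PCT):
    """Positive when running below baseline, negative when bursting above it."""
    return 60 * vcpus * (baseline_pct - utilization_pct) / 100


def hours_until_empty(balance, utilization_pct):
    """Hours until the balance reaches zero, or None if it never drains."""
    change = net_credit_change_per_hour(utilization_pct)
    if change >= 0:
        return None
    return balance / -change


earned = credits_earned_per_hour()
# Hypothetical scenario: 576 banked credits, sustained 50% average load.
hours = hours_until_empty(576, 50)
never = hours_until_empty(576, 15)

print(f"earned per hour: {earned}")
print(f"hours until throttled at 50% load: {hours}")
```

Once the balance is gone, an instance in standard mode is held at its baseline, which matches the "CPU capped at 20%" symptom on the slide.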

  51. None
  52. None
  53. CPU capped at 20% http://nginx.org/en/docs/http/ngx_http_fastcgi_module.html

  54. Zero CPU Credits

  55. None
  56. Step #6 Where Are We At Now?

  57. None
  58. Step #8 Ditch Everything

  59. None
  60. Step #9 Start Over

  61. Container - Part 1

    • AWS ECS • AWS ECR
    https://aws.amazon.com/ecs/ https://aws.amazon.com/ecr/
  62. Shit Happens (Again)

  63. OOM Killer - The Revenge http://turnoff.us/geek/oom-killer/

  64. None
  65. OOM Killer - The Revenge: Workaround #3

    • Set Autoscaling fixed to min 1 running container
  66. Container - Part 2

    • Split All-In-One Container into Multiple Containers
    • Use Alpine
  67. OOM Killer - Highlander

  68. None
  69. One Step Back

  70. None
  71. None
  72. https://github.com/aws/amazon-ecs-agent/issues/1187

  73. None
  74. None
  75. Despite the Working Fix...

  76. ...Alpine Wasn’t Quite Stable

  77. None
  78. Switch Back to All-in-One Debian

    Since the multi-container setup on Alpine was unstable, I just switched back to the good ol’ working one-container-has-all on Debian.
  79. None
  80. None
  81. Alpine: Trial & Errors

  82. None
  83. MADNESS ALPINE NGINX+LUA

  84. https://github.com/fabiocicerchia/nginx-lua

  85. Caching to the rescue

  86. NO STALE!

  87. Cache Statuses

    MISS – The response was not found in the cache and so was fetched from an origin server. The response might then have been cached.
    BYPASS – The response was fetched from the origin server instead of served from the cache because the request matched a proxy_cache_bypass directive (see “Can I Punch a Hole Through My Cache?” in the guide linked below). The response might then have been cached.
    EXPIRED – The entry in the cache has expired. The response contains fresh content from the origin server.
    https://www.nginx.com/blog/nginx-caching-guide/
  88. Cache Statuses

    STALE – The content is stale because the origin server is not responding correctly, and proxy_cache_use_stale was configured.
    UPDATING – The content is stale because the entry is currently being updated in response to a previous request, and proxy_cache_use_stale updating is configured.
    REVALIDATED – The proxy_cache_revalidate directive was enabled and NGINX verified that the current cached content was still valid (If-Modified-Since or If-None-Match).
    HIT – The response contains valid, fresh content direct from the cache.
    https://www.nginx.com/blog/nginx-caching-guide/
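To make these statuses actionable, they can be tallied into a cache hit ratio. The sketch below assumes the statuses have already been extracted from the access log (NGINX exposes them through the `$upstream_cache_status` variable); the sample list is made up for illustration, and counting STALE, UPDATING and REVALIDATED as cache-served is a grouping choice based on the definitions above, not something the deck prescribes.

```python
from collections import Counter

# Statuses where the body sent to the client came out of the cache.
SERVED_FROM_CACHE = {"HIT", "STALE", "UPDATING", "REVALIDATED"}


def cache_hit_ratio(statuses):
    """Share of requests answered with cached content (0.0 for no traffic)."""
    counts = Counter(statuses)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    served = sum(counts[status] for status in SERVED_FROM_CACHE)
    return served / total


# Hypothetical sample, one status per request.
sample = ["HIT", "HIT", "MISS", "EXPIRED", "HIT", "BYPASS", "STALE", "HIT"]
ratio = cache_hit_ratio(sample)

print(f"cache hit ratio: {ratio:.1%}")
```

Tracking this ratio over time shows whether a caching change actually reduced the load on the origin.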
  89. Step #10 Observability

  90. Moving away from EC2 and from Logz.io. Again?

  91. Need to know the traffic trend CloudWatch

  92. Get More Metrics & Desiderata

  93. 0 0

  94. Interlude #1 Serverless

  95. https://bref.sh/

  96. None
  97. https://github.com/brefphp/bref/issues/497

  98. None
  99. https://aws.amazon.com/blogs/compute/introducing-the-new-serverless-lamp-stack/

  100. None
  101. None
  102. None
  103. None
  104. None
  105. Interlude #2 PHP8.0.0RC*

  106. https://wiki.php.net/todo/php80

  107. Rolling Updates https://dzone.com/articles/take-release-automation-to-the-next-level-episode-2

  108. Dark Canary 10% / 25% 100% https://landing.google.com/sre/workbook/chapters/canarying-releases/
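A staged rollout like 10% / 25% / 100% needs some rule for deciding which requests reach the new version. The deck does not say how the traffic was split, so the following is only a minimal sketch of one common approach: hash a stable request key into one of 100 buckets and send the lowest buckets to the canary. The key format and function names are hypothetical.

```python
import hashlib

# Rollout stages from the slide.
STAGES = (10, 25, 100)


def bucket(request_key):
    """Map a stable key (e.g. a client id) to a bucket in the range 0..99."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def routes_to_canary(request_key, canary_percent):
    """True when the key falls inside the canary share of the buckets."""
    return bucket(request_key) < canary_percent


# Hypothetical client keys, used only to show the resulting split.
keys = [f"client-{n}" for n in range(10_000)]
for percent in STAGES:
    share = sum(routes_to_canary(key, percent) for key in keys) / len(keys)
    print(f"stage {percent}%: {share:.1%} of keys routed to the canary")
```

Because the bucket depends only on the key, a client that lands on the canary at 10% stays on it at 25% and 100%, which keeps behaviour consistent for that client as the rollout widens.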

  109. None
  110. None
  111. None
  112. FORKED TRAFFIC

  113. None
  114. None
  115. https://github.com/PUGX/badge-poser/pull/431

  116. Interlude #3 Full Page Caching w/ Redis

  117. https://github.com/fabiocicerchia/go-proxy-cache

  118. Step #11 Uptime

  119. None
  120. None
  121. None
  122. None
  123. None
  124. ...but at the end....

  125. None
  126. None
  127. None
  128. https://uptime.is/99.97
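For context on the 99.97% figure, the downtime budget it implies is simple arithmetic. The sketch below assumes a 365-day year and a 30-day month; other conventions shift the results slightly.

```python
def allowed_downtime_minutes(uptime_percent, period_days):
    """Minutes of downtime permitted by an uptime target over a period."""
    return (100 - uptime_percent) / 100 * period_days * 24 * 60


yearly = allowed_downtime_minutes(99.97, 365)   # about 157.68 minutes
monthly = allowed_downtime_minutes(99.97, 30)   # about 12.96 minutes

print(f"per year:  {yearly:.2f} minutes")
print(f"per month: {monthly:.2f} minutes")
```

That is a little over two and a half hours per year, or about thirteen minutes per month.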

  129. Deploying during breakfast

  130. Confidence Level

  131. Step #12 Billing

  132. None
  133. None
  134. Elastic Static IP with Global Accelerator

  135. https://www.vice.com/it/article/evdyj4/hackerino-computer-militare-video

  136. Auto Refreshing Dashboard

  137. None
  138. https://www.vice.com/it/article/evdyj4/hackerino-computer-militare-video

  139. Reduce Costs!

  140. So what did I learn?!

  141. Key Takeaways

    - Never trust code
      - Never trust yourself
    - Do small steps
      - It’ll help you figure out what went wrong
    - Version everything
      - Commit as often as possible
    - Never use latest tag
      - Use specific versions
    - Think outside the box
      - Don’t stick to playing by the manual
    - Prefer quick and easy fixes
      - Reduce the odds of breaking things
    - Use the tools to make your life easier
      - So choose them carefully
    - Monitor & Benchmark!
      - Your best friends for troubleshooting
    * random order
  142. Questions?

  143. Thank You!