Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

Hello!
I am Fabio Cicerchia
SW & Cloud Engineer @
You can find me at: @fabiocicerchia

Slide 4

Slide 4 text

Disclaimer

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

Let’s Start

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

https://en.wikipedia.org/wiki/Bianco,_rosso_e_Verdone

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

Step #1 What did I get myself into?!

Slide 14

Slide 14 text

Information Gathering

Slide 15

Slide 15 text

Describe VM Config
● RAM: 2GB
● CPU: 2
● HDD: 50GB
● Software: Apache 2.4.10, PHP 5.6.19, Redis 2.8.17, MySQL 5.5.47

Slide 16

Slide 16 text

Describe VM Config - Notes
● Apache v2.4.10
  ○ Released on 2014-07-19: Age 6 years
  ○ Available v2.4.43
● PHP v5.6.19
  ○ Released on 2016-03-03: Age 4 years
  ○ Available v7.4.5
  ○ EOL: 2018-12-31
● Redis v2.8.17
  ○ Released on 2014-09-19: Age 6 years
  ○ Available v6.0.1
● MySQL v5.5.47
  ○ Released on 2015-12-07: Age 5 years
  ○ Available v8.0.20
http://archive.apache.org/dist/httpd/
https://www.php.net/releases/index.php
https://www.php.net/supported-versions.php
https://github.com/redis/redis
https://docs.redislabs.com/latest/rs/administering/product-lifecycle/
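
The ages in these notes match a plain calendar-year difference. A minimal sketch of that arithmetic follows. The reference year 2020 is an assumption inferred from the "available" versions listed, not something stated on the slide, and the names `RELEASES` and `age_in_years` are illustrative.

```python
from datetime import date

# Release dates copied from the slide.
RELEASES = {
    "Apache 2.4.10": date(2014, 7, 19),
    "PHP 5.6.19": date(2016, 3, 3),
    "Redis 2.8.17": date(2014, 9, 19),
    "MySQL 5.5.47": date(2015, 12, 7),
}

REFERENCE_YEAR = 2020  # assumption: the year the notes were written


def age_in_years(released: date, reference_year: int = REFERENCE_YEAR) -> int:
    """Return the calendar-year difference, ignoring month and day."""
    return reference_year - released.year


for name, released in RELEASES.items():
    print(f"{name}: released {released.isoformat()}, age {age_in_years(released)} years")
```

Counting full elapsed years instead would give lower numbers for releases that had not yet reached their anniversary, so the figures should be read as rough ages.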

Slide 17

Slide 17 text

Step #2 What do I need to do?!

Slide 18

Slide 18 text

Define a “plan”

Slide 19

Slide 19 text

Step #3 Find time to do it

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

Just Start!
● Nginx v1.18.0
● PHP v7.4.7
● Redis v4.0.10
https://www.nginx.com/
https://www.php.net/
https://redis.io/

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

...Then Refine
● Ansible → Provisioning
● Ansible Galaxy → Ansible’s Recipes Repo
● AWS CloudFormation → Infrastructure as Code*
● Let’s Encrypt → SSL/TLS Certificate**
* Terraform is way cooler
** Yes, SSL is deprecated
https://www.ansible.com/
https://galaxy.ansible.com/
https://aws.amazon.com/cloudformation/
https://letsencrypt.org/

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

https://github.com/PUGX/badge-poser/blob/master/sys/cloudformation/alpine-stack.yaml

Slide 26

Slide 26 text

https://medium.com/@wintonjkt/ansible-101-getting-started-1daaff872b64

Slide 27

Slide 27 text

Ansible: What Is It For?
- Ansible is perfect for VMs (for example, EC2 in our scenario).
- It is redundant for ECS with Fargate, since the underlying layer is fully managed by AWS.
- It could be useful for ECS without Fargate, where it would provision the EC2 instances on which the containers run.
- Useful for deploy and rollback.

Slide 28

Slide 28 text

https://github.com/PUGX/badge-poser/blob/54cd440ebc91245cda4735db86dca897d024a838/sys/ansible/playbooks/setup.yml

Slide 29

Slide 29 text

Wait for it...

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

Step #4 Start Fixing

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

Start Throwing a Bunch of Things At It

Slide 34

Slide 34 text

Workaround #1: Not Quite There Yet
● pm.max_children = 150
  pm.start_servers = 5
  pm.min_spare_servers = 5
  pm.max_spare_servers = 35
● emergency_restart_threshold = 10
  emergency_restart_interval = 1m
  process_control_timeout = 10s
● memory_limit = 192M
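
One way to see why these values were "not quite there yet" is to multiply them out. The sketch below is a rough worst-case estimate, not a measurement. It assumes the host had about 2 GB of RAM, as listed for the original VM, which may not hold for the new instance. It treats `memory_limit` as the most a single worker could use and ignores everything else running on the box. The function names are illustrative.

```python
def worst_case_memory_mib(max_children: int, memory_limit_mib: int) -> int:
    """Memory needed if every PHP-FPM worker hit its memory_limit at once."""
    return max_children * memory_limit_mib


def max_children_that_fit(available_ram_mib: int, memory_limit_mib: int) -> int:
    """Largest worker count whose worst case still fits in the given RAM."""
    return available_ram_mib // memory_limit_mib


MAX_CHILDREN = 150        # pm.max_children from the slide
MEMORY_LIMIT_MIB = 192    # memory_limit from the slide
ASSUMED_RAM_MIB = 2048    # assumption: 2 GB, as on the original VM

print(worst_case_memory_mib(MAX_CHILDREN, MEMORY_LIMIT_MIB), "MiB worst case")
print(max_children_that_fit(ASSUMED_RAM_MIB, MEMORY_LIMIT_MIB), "workers fit in the assumed RAM")
```

Real workers usually stay well below `memory_limit`, so measuring the average resident size per worker gives a more realistic ceiling than this upper bound.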

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

It Keeps Crashing: Need Visibility
Added Logz.io & Filebeat
Added UptimeRobot
https://logz.io/
https://www.elastic.co/beats/filebeat
https://uptimerobot.com/

Slide 37

Slide 37 text

https://medium.com/@mirzapour/centralized-logging-with-elasticsearch-kibana-logstash-and-filebeat-57fea01be5e7

Slide 38

Slide 38 text

https://github.com/PUGX/badge-poser/blob/54cd440ebc91245cda4735db86dca897d024a838/sys/filebeat/filebeat.yml

Slide 39

Slide 39 text

No content

Slide 40

Slide 40 text

No content

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

Step #5 Shit Happens

Slide 43

Slide 43 text

FIRE! FIRE! FIRE!
https://en.wikipedia.org/wiki/The_IT_Crowd

Slide 44

Slide 44 text

Moved to StatusCake

Slide 45

Slide 45 text

Redis Down: OOM Killer
http://turnoff.us/geek/oom-killer/

Slide 46

Slide 46 text

https://en.wikipedia.org/wiki/Boris_(TV_series)

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

Redis Down: OOM Killer: Workaround #2
● Handle Redis Daemon via Supervisor

Slide 49

Slide 49 text

Zero CPU Credits

Slide 50

Slide 50 text

Zero CPU Credits
CPU capped at 20%
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/burstable-credits-baseline-concepts.html
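
The credit arithmetic behind that cap can be sketched in a few lines. AWS defines one CPU credit as one vCPU running at 100% for one minute. The 20% baseline comes from the slide. The single vCPU, the starting balance, and the function names are assumptions made for illustration only.

```python
from typing import Optional


def credits_earned_per_hour(vcpus: int, baseline_percent: int) -> float:
    """Credits accrued in one hour at the given baseline."""
    return vcpus * baseline_percent * 60 / 100


def credits_spent_per_hour(vcpus: int, utilization_percent: int) -> float:
    """Credits consumed in one hour at the given average utilization."""
    return vcpus * utilization_percent * 60 / 100


def hours_until_empty(balance: float, vcpus: int,
                      baseline_percent: int, utilization_percent: int) -> Optional[float]:
    """Hours until the balance reaches zero, or None if it never drains."""
    net = (credits_earned_per_hour(vcpus, baseline_percent)
           - credits_spent_per_hour(vcpus, utilization_percent))
    if net >= 0:
        return None
    return balance / -net


# Assumed example: 1 vCPU, 20% baseline, 60 credits in the bank, CPU pinned at 100%.
print(credits_earned_per_hour(1, 20))
print(hours_until_empty(60, 1, 20, 100))
```

Once the balance is empty, a standard (non-unlimited) burstable instance can only run at its baseline, which is the 20% cap shown on the slide.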

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

CPU capped at 20%
http://nginx.org/en/docs/http/ngx_http_fastcgi_module.html

Slide 54

Slide 54 text

Zero CPU Credits

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

Step #6 Where Are We At Now?

Slide 57

Slide 57 text

No content

Slide 58

Slide 58 text

Step #8 Ditch Everything

Slide 59

Slide 59 text

No content

Slide 60

Slide 60 text

Step #9 Start Over

Slide 61

Slide 61 text

Container - Part 1
● AWS ECS
● AWS ECR
https://aws.amazon.com/ecs/
https://aws.amazon.com/ecr/

Slide 62

Slide 62 text

Shit Happens (Again)

Slide 63

Slide 63 text

OOM Killer - The Revenge
http://turnoff.us/geek/oom-killer/

Slide 64

Slide 64 text

No content

Slide 65

Slide 65 text

OOM Killer - The Revenge: Workaround #3
● Set Autoscaling fixed to min 1 running container

Slide 66

Slide 66 text

Container - Part 2
● Split All-In-One Container into Multiple Containers
● Use Alpine

Slide 67

Slide 67 text

OOM Killer - Highlander

Slide 68

Slide 68 text

No content

Slide 69

Slide 69 text

One Step Back

Slide 70

Slide 70 text

No content

Slide 71

Slide 71 text

No content

Slide 72

Slide 72 text

https://github.com/aws/amazon-ecs-agent/issues/1187

Slide 73

Slide 73 text

No content

Slide 74

Slide 74 text

No content

Slide 75

Slide 75 text

Despite the Working Fix...

Slide 76

Slide 76 text

...Alpine Wasn’t Quite Stable

Slide 77

Slide 77 text

No content

Slide 78

Slide 78 text

Switch Back to All-in-One Debian
Since the multi-container setup on Alpine was unstable, I just switched back to the good ol’ working one-container-has-all setup on Debian.

Slide 79

Slide 79 text

No content

Slide 80

Slide 80 text

No content

Slide 81

Slide 81 text

Alpine: Trial & Error

Slide 82

Slide 82 text

No content

Slide 83

Slide 83 text

MADNESS ALPINE NGINX+LUA

Slide 84

Slide 84 text

https://github.com/fabiocicerchia/nginx-lua

Slide 85

Slide 85 text

Caching to the rescue

Slide 86

Slide 86 text

NO STALE!

Slide 87

Slide 87 text

Cache Statuses
● MISS – The response was not found in the cache and so was fetched from an origin server. The response might then have been cached.
● BYPASS – The response was fetched from the origin server instead of served from the cache because the request matched a proxy_cache_bypass directive (see “Can I Punch a Hole Through My Cache?” in the linked guide). The response might then have been cached.
● EXPIRED – The entry in the cache has expired. The response contains fresh content from the origin server.
https://www.nginx.com/blog/nginx-caching-guide/

Slide 88

Slide 88 text

Cache Statuses
● STALE – The content is stale because the origin server is not responding correctly, and proxy_cache_use_stale was configured.
● UPDATING – The content is stale because the entry is currently being updated in response to a previous request, and proxy_cache_use_stale updating is configured.
● REVALIDATED – The proxy_cache_revalidate directive was enabled and NGINX verified that the current cached content was still valid (If-Modified-Since or If-None-Match).
● HIT – The response contains valid, fresh content direct from the cache.
https://www.nginx.com/blog/nginx-caching-guide/
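
To turn these statuses into a number worth watching, they can be grouped by where the response body came from. The sketch below assumes the statuses have already been extracted from access logs, for instance by logging NGINX's `$upstream_cache_status` variable. The grouping follows the descriptions above, and the names are illustrative.

```python
from collections import Counter
from typing import Iterable

# Body served from the cache (possibly stale or after revalidation).
SERVED_FROM_CACHE = {"HIT", "STALE", "UPDATING", "REVALIDATED"}
# Body fetched from the origin server.
FETCHED_FROM_ORIGIN = {"MISS", "BYPASS", "EXPIRED"}


def cache_hit_ratio(statuses: Iterable[str]) -> float:
    """Share of responses served from the cache; unknown statuses are ignored."""
    counts = Counter(status.upper() for status in statuses)
    cached = sum(counts[s] for s in SERVED_FROM_CACHE)
    origin = sum(counts[s] for s in FETCHED_FROM_ORIGIN)
    total = cached + origin
    return cached / total if total else 0.0


sample = ["HIT", "HIT", "MISS", "EXPIRED"]
print(f"hit ratio: {cache_hit_ratio(sample):.2f}")
```

Counting STALE and UPDATING separately can also be useful, since a rising share of either suggests the origin is struggling even while users still get answers.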

Slide 89

Slide 89 text

Step #10 Observability

Slide 90

Slide 90 text

Moving away from EC2 and from Logz.io. Again?

Slide 91

Slide 91 text

Need to know the traffic trend
CloudWatch

Slide 92

Slide 92 text

Get More Metrics & Desiderata

Slide 93

Slide 93 text

No content

Slide 94

Slide 94 text

Interlude #1 Serverless

Slide 95

Slide 95 text

https://bref.sh/

Slide 96

Slide 96 text

No content

Slide 97

Slide 97 text

https://github.com/brefphp/bref/issues/497

Slide 98

Slide 98 text

No content

Slide 99

Slide 99 text

https://aws.amazon.com/blogs/compute/introducing-the-new-serverless-lamp-stack/

Slide 100

Slide 100 text

No content

Slide 101

Slide 101 text

No content

Slide 102

Slide 102 text

No content

Slide 103

Slide 103 text

No content

Slide 104

Slide 104 text

No content

Slide 105

Slide 105 text

Interlude #2 PHP8.0.0RC*

Slide 106

Slide 106 text

https://wiki.php.net/todo/php80

Slide 107

Slide 107 text

Rolling Updates
https://dzone.com/articles/take-release-automation-to-the-next-level-episode-2

Slide 108

Slide 108 text

Dark Canary 10% / 25% 100%
https://landing.google.com/sre/workbook/chapters/canarying-releases/
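
A percentage-based canary needs a stable way to decide which requests go to the new version. The sketch below shows one common approach, hashing a request key into 100 buckets. It is an illustration of the general idea and not the mechanism used in this project. The key format and function names are assumptions.

```python
import hashlib


def bucket(request_key: str) -> int:
    """Map a key (for example a client IP or user ID) to a stable bucket 0..99."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def goes_to_canary(request_key: str, canary_percent: int) -> bool:
    """True if this key falls inside the canary share of traffic."""
    return bucket(request_key) < canary_percent


keys = [f"client-{i}" for i in range(1000)]
for percent in (10, 25, 100):
    share = sum(goes_to_canary(k, percent) for k in keys)
    print(f"{percent}% stage: {share} of {len(keys)} sample keys routed to the canary")
```

Because the bucket depends only on the key, a client that lands on the canary at 10% stays on it at 25% and 100%, which keeps behavior consistent for that client as the rollout widens.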

Slide 109

Slide 109 text

No content

Slide 110

Slide 110 text

No content

Slide 111

Slide 111 text

No content

Slide 112

Slide 112 text

FORKED TRAFFIC

Slide 113

Slide 113 text

No content

Slide 114

Slide 114 text

No content

Slide 115

Slide 115 text

https://github.com/PUGX/badge-poser/pull/431

Slide 116

Slide 116 text

Interlude #3 Full Page Caching w/ Redis

Slide 117

Slide 117 text

https://github.com/fabiocicerchia/go-proxy-cache

Slide 118

Slide 118 text

Step #11 Uptime

Slide 119

Slide 119 text

No content

Slide 120

Slide 120 text

No content

Slide 121

Slide 121 text

No content

Slide 122

Slide 122 text

No content

Slide 123

Slide 123 text

No content

Slide 124

Slide 124 text

...but at the end....

Slide 125

Slide 125 text

No content

Slide 126

Slide 126 text

No content

Slide 127

Slide 127 text

No content

Slide 128

Slide 128 text

https://uptime.is/99.97
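
The downtime budget behind a 99.97% figure is simple arithmetic. The sketch below assumes a 30-day month and a 365-day year, so the linked calculator may show slightly different numbers if it uses other period lengths. The function name is illustrative.

```python
def allowed_downtime_seconds(uptime_percent: float, period_seconds: float) -> float:
    """Seconds of downtime permitted in a period at the given uptime percentage."""
    return period_seconds * (100.0 - uptime_percent) / 100.0


DAY = 24 * 60 * 60
PERIODS = {
    "day": DAY,
    "30-day month": 30 * DAY,   # assumption: 30 days
    "365-day year": 365 * DAY,  # assumption: 365 days
}

for label, seconds in PERIODS.items():
    budget = allowed_downtime_seconds(99.97, seconds)
    print(f"{label}: {budget:.1f} s ({budget / 60:.2f} min)")
```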

Slide 129

Slide 129 text

Deploying during breakfast

Slide 130

Slide 130 text

Confidence Level

Slide 131

Slide 131 text

Step #12 Billing

Slide 132

Slide 132 text

No content

Slide 133

Slide 133 text

No content

Slide 134

Slide 134 text

Elastic Static IP with Global Accelerator

Slide 135

Slide 135 text

https://www.vice.com/it/article/evdyj4/hackerino-computer-militare-video

Slide 136

Slide 136 text

Auto Refreshing Dashboard

Slide 137

Slide 137 text

No content

Slide 138

Slide 138 text

https://www.vice.com/it/article/evdyj4/hackerino-computer-militare-video

Slide 139

Slide 139 text

Reduce Costs!

Slide 140

Slide 140 text

So what did I learn?!

Slide 141

Slide 141 text

Key Takeaways
- Never trust code
  - Never trust yourself
- Do small steps
  - It’ll help you figure out what went wrong
- Version everything
  - Commit as often as possible
- Never use latest tag
  - Use specific versions
- Think outside the box
  - Don’t stick to playing by the manual
- Prefer quick and easy fixes
  - Reduce the odds of breaking things
- Use the tools to make your life easier
  - So choose them carefully
- Monitor & Benchmark!
  - Your best friends for troubleshooting
* random order
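
The "never use latest tag" takeaway is easy to check mechanically. The sketch below is a simplified check on container image references, not a full parser of the image reference grammar. The helper name `is_pinned` and the example references are hypothetical.

```python
def is_pinned(image: str) -> bool:
    """True if the image reference names a digest or a tag other than 'latest'."""
    if "@" in image:
        # A digest reference such as name@sha256:... is pinned.
        return True
    # Only look at the last path component so a registry port is not read as a tag.
    last_component = image.rsplit("/", 1)[-1]
    if ":" not in last_component:
        # No tag means the implicit 'latest'.
        return False
    tag = last_component.rsplit(":", 1)[1]
    return tag != "latest"


for ref in ("nginx", "nginx:latest", "nginx:1.18.0", "registry.example.com:5000/php"):
    print(f"{ref}: {'pinned' if is_pinned(ref) else 'NOT pinned'}")
```

A check like this could run in CI against Dockerfiles or task definitions so that an unpinned image fails the build instead of surprising you at deploy time.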

Slide 142

Slide 142 text

Questions?

Slide 143

Slide 143 text

Thank You!