Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

Hello! I AM FABIO CICERCHIA SW & Cloud Engineer @ You can find me at: @fabiocicerchia

Slide 3

Slide 3 text


Slide 4

Slide 4 text

What can we learn from the latest major cloud incident (i.e. the burning OVH datacenter)? Do not put all your eggs in one basket!

Slide 5

Slide 5 text

The only certain thing is that it's not a matter of IF there'll be a disaster, but rather WHEN. So better not be caught off guard.

Slide 6

Slide 6 text

Slide 7

Slide 7 text

Slide 8

Slide 8 text

Slide 9

Slide 9 text

Slide 10

Slide 10 text

Slide 11

Slide 11 text

Slide 12

Slide 12 text

Slide 13

Slide 13 text

Slide 14

Slide 14 text

Slide 15

Slide 15 text

So here are the details of each step I took to make my website disaster-proof while keeping my cloud spending on a tight leash (so you could do this too). Shit happens, deal with it! Better safe than sorry!

Slide 16

Slide 16 text

I'm running a bunch of very small websites (with a very simple infrastructure topology) and I wanted to put something into practice on a budget. So, I've decided to go multi-cloud.

Slide 17

Slide 17 text

Evolution I've started with an infrastructure that looked like this:

Slide 18

Slide 18 text

Evolution Then, I ended up with something like this:

Slide 19

Slide 19 text

OUTLINE This is the outline plan I followed to upgrade my infrastructure:
1. Secure the Database
2. Secure the Storage
3. Redundancy of Database
4. Redundancy of Storage
5. Redundancy of Web Servers
6. Redundancy of DNS
7. Billing Impact
8. Manual Configurations
9. Disaster Recovery Plan
10. Play with Providers

Slide 20

Slide 20 text

DISCLAIMER I wrote this ebook:

Slide 21

Slide 21 text


Slide 22

Slide 22 text

#1: Database - Backups Start doing the DB backups (with mysqldump or xtrabackup) and define a policy for RTO and RPO, so you'll know what loss is acceptable (there's always some loss, even if minimal). RTO defines how long the infrastructure can be down, and RPO defines how much data you can afford to lose (i.e. how old the latest backup is).
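As a minimal sketch of how the RPO could be checked in practice: the snippet below flags a breach when the newest backup is older than the accepted RPO. The function name `rpo_breached`, the timestamps and the 1-hour RPO are illustrative assumptions, not part of the original setup.

```python
from datetime import datetime, timedelta


def rpo_breached(latest_backup: datetime, now: datetime, rpo: timedelta) -> bool:
    """Return True when the newest backup is older than the accepted RPO."""
    return now - latest_backup > rpo


# Hypothetical example: a 1-hour RPO and a backup taken 90 minutes ago.
now = datetime(2021, 3, 10, 12, 0)
print(rpo_breached(datetime(2021, 3, 10, 10, 30), now, timedelta(hours=1)))
```

A check like this could run from a scheduled job and raise an alert before the data-loss window grows beyond what you have agreed to accept.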

Slide 23

Slide 23 text

#1: Database - ROTATION To rotate the DB backups we could simply use logrotate. We could start with a basic daily backup rotation (or any interval you have defined as RPO):
/var/backups/daily/alldb.sql.gz {
    notifempty
    daily
    rotate 7
    nocompress
    create 640 root adm
    dateext
    dateformat -%Y%m%d-%s
    postrotate
        mysqldump -u$USER -p$PASSWD --single-transaction --all-databases | gzip -9f > /var/backups/daily/alldb.sql.gz
    endscript
}
This will create the rotated DB backups on the same server where logrotate is running (most likely the same DB instance). We have seen that this is very wrong, so you must always store the backups somewhere else (and also offline).

Slide 24

Slide 24 text

#1: Database - Remote Storage With a simple change, we can upload to an AWS S3 bucket (using a cold storage class meant for rarely accessed data):
lastaction
    BUCKET="..."
    REGION="eu-west-1"
    aws s3 sync /var/backups/daily "s3://$BUCKET/daily/" --region $REGION --exclude "*" --include "*.gz-$FORMAT*" --storage-class GLACIER
endscript

Slide 25

Slide 25 text

#1: Database - Local Storage Just run rsync (better if scheduled) to download the backups locally to an external hard drive:
rsync -e "ssh -i $HOME/.ssh/id_rsa" --progress -auv @:/var/backups ./path/to/backups
There you go: you now have backups on-site (for faster restore), remote on another provider (for more reliability), and offline (for more peace of mind).

Slide 26

Slide 26 text

#1: Database - Security Remember the good practices, and do not forget about GDPR: the backups must be stored encrypted at rest (and use a key instead of a plain password).

Slide 27

Slide 27 text

#1: Database - Restore Once everything is backed up, you need to think about how to restore the dump properly, or at least switch the connection to the other node. I'll cover this in the Disaster Recovery Plan post.

Slide 28

Slide 28 text


Slide 29

Slide 29 text

#2: Storage - Backups Let's back them up on an external (cloud) storage disk. Why not offline? Because the burden of re-uploading all the files stored in a shared folder (which are usually quite a few) would make the restore process very slow.

Slide 30

Slide 30 text

#2: Storage - Option #1: Remote VM Let's use a simple cronjob every hour to sync the whole shared folder to a remote location:
rsync -auv --progress /path/to/shared/folder :/path/to/shared/folder
Some providers offer pluggable storage, and it would be perfect to detach it and reattach it to another node (only if using the same provider). Alternatively, the remote VM's folder could be exported and mounted over NFS (with some performance degradation). Cheap storage can undercut the cost of a cloud-native one: some providers, like TransIP or AlphaVPS, offer 2TB for ~$10/month. If you combine them you'll end up with a slightly higher cost (than using only one) but definitely greater redundancy.

Slide 31

Slide 31 text

#2: Storage - Option #2: Cloud-Native Storage Still with a simple cronjob, we could sync the whole shared folder to an S3 bucket (using cold storage access):
aws s3 sync --storage-class GLACIER /path/to/shared/folder s3:///
It is free to send data into AWS S3, but to take it out you pay roughly an extra $0.09 per GB. If you have lots of data, consider this very carefully: 1TB of data could cost you ~$23/month to store, plus ~$90 to restore. A cheaper provider for Cloud-Native Storage is Scaleway with ~0.002€/GB/month (1TB = ~€2.5). Also consider that file permissions are lost when saving to AWS S3, so when restoring you need to double-check that they are correct.
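The 1TB figures above can be reproduced with a small worked calculation. This is only a sketch: the $0.09/GB egress rate comes from the slide, while the $0.023/GB-month storage rate is an assumption back-derived from the quoted ~$23/month for 1TB (taken as 1000 GB); the function name `s3_cost_estimate` is hypothetical, and real prices vary by region and storage class.

```python
from typing import Tuple


def s3_cost_estimate(size_gb: float,
                     storage_rate: float = 0.023,  # assumed USD per GB-month
                     egress_rate: float = 0.09     # USD per GB transferred out
                     ) -> Tuple[float, float]:
    """Return (monthly storage cost, one-off restore cost) in USD."""
    monthly = round(size_gb * storage_rate, 2)
    restore = round(size_gb * egress_rate, 2)
    return monthly, restore


monthly, restore = s3_cost_estimate(1000)  # 1TB taken as 1000 GB
print(f"storage: ~${monthly}/month, restore: ~${restore}")
```

Plugging in your own data size and your provider's rates makes it easy to compare options before committing to one.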

Slide 32

Slide 32 text

#2: Storage - Restore Once everything is backed up, you need to think about how to restore the data properly, or at least switch the access on-the-fly. I'll cover this in the Disaster Recovery Plan post.

Slide 33

Slide 33 text

SAVE YOURSELF FROM A DISASTER #3 Redundancy of Database

Slide 34

Slide 34 text

#3: Database - Redundancy Create a cluster with at least a primary/secondary structure. 3 nodes are recommended, so we'll have the flexibility to do planned maintenance without affecting the performance of the whole cluster.

Slide 35

Slide 35 text

#3: Database - Spin up a secondary node Create another VM somewhere else (better if in another availability zone/region/provider), then configure a MySQL/MariaDB/Percona/... instance and plug it in as a secondary node. We can even give it fewer resources and make it the write-only node (if we have less write activity; otherwise, the read-only one).

Slide 36

Slide 36 text

#3: Database - Balancing requests I prefer to use something like HAProxy as a TCP load balancer, or (even better) ProxySQL (which has a nice query caching capability). I'd go with ProxySQL load balancing the 2 nodes created, then just change the database connection string in the application and the setup is done (we could even partition the queries and define to which node they should be sent). In my case, a primary/secondary topology could be more than enough, but I went for a primary/primary configuration (you can follow a simple tutorial or a more structured configuration) without balancing (because each web node accesses its local DB instance).

Slide 37

Slide 37 text

#3: Database - Security Replication must run over a secure connection, so you need to generate a certificate and use it.

Slide 38

Slide 38 text

SAVE YOURSELF FROM A DISASTER #4 Redundancy of Storage

Slide 39

Slide 39 text

#4: Storage - Distributed Storage We could use a distributed filesystem like Ceph, DRBD, GlusterFS, or ZFS, but then it won't be on a budget, and the complexity introduced by those tools will need to be addressed properly. I will not cover it here due to the cost of the extra nodes and extra configuration needed - your time has a cost too (but if your filesystem changes frequently this is your only option).

Slide 40

Slide 40 text

#4: Storage - Ad-Hoc Solutions
● How to build a Ceph Distributed Storage Cluster on CentOS 7
● How to Setup DRBD to Replicate Storage on Two CentOS 7 Servers
● How To Create a Redundant Storage Pool Using GlusterFS on Ubuntu 18.04
● An Introduction to the Z File System (ZFS) for Linux

Slide 41

Slide 41 text

#4: Storage - Quick & Dirty: Cross Sync Let's use a simple cronjob every hour to sync the whole shared folder to all remote locations.
Server #1:
rsync -e "ssh -i $HOME/.ssh/somekey" -auv --progress /path/to/shared/folder/ syncer@:/path/to/shared/folder
Server #2:
rsync -e "ssh -i $HOME/.ssh/somekey" -auv --progress /path/to/shared/folder/ syncer@:/path/to/shared/folder
Remember, this is not a proper distributed solution. rsync may look old-fashioned, but it has saved me lots of times. This approach is not feasible for "real-time" synchronization; it is only for (very) infrequent changes. Distributed filesystems like GlusterFS (or Ceph, or DRBD) are solutions for the long run.

Slide 42

Slide 42 text

#4: Storage - Security Remember to secure the connection between one host and the others (eg. with a firewall).

Slide 43

Slide 43 text

SAVE YOURSELF FROM A DISASTER #5 Redundancy of Web Servers

Slide 44

Slide 44 text

#5: Web Servers - Duplicate VM Nowadays many cloud providers (and virtualization platforms) give you the possibility to take a snapshot of the VM and then restore/clone it. I'll not cover it in this tutorial, as it would increase the overall cost of the infrastructure. Still, sometimes (depending on the application) cloning the VM can save a lot of time compared to the other method I'm proposing below.

Slide 45

Slide 45 text

#5: Web Servers - Docker We live in 2021, everyone is running containers and wishing to have a k8s cluster to play with. So, let's convert the simple applications into containers: there are a lot of ready-made images on Docker Hub.

Slide 46

Slide 46 text

#5: Web Servers - Docker Swarm Let's start nice and easy, with Docker Swarm (which eliminates the extra complexity of Kubernetes) on ONE node (then we can scale out as much as we like). First, set up your nodes. I'm going to use standard images for my dockerized infrastructure, no custom images (for now - I've got pretty simple configurations). I've picked bitnami images, as they cover a lot of scenarios and provide pre-packaged images for most of the popular server software (more reasons to pick them). If you really want to start using custom images you could publish them publicly for free on Docker Hub (which recently introduced some limitations) or on Canister. After the announcement from Docker Hub about rate-limiting pulls, AWS decided to offer public repositories (and they are almost free if you don't exceed 500GB/month when not logged in or 5TB/month when logged in).

Slide 47

Slide 47 text

#5: Web Servers - Docker Compose This is an example of a WordPress website configured with docker-compose:
version: "3.9"
services:
  wordpress:
    image: wordpress:5.7.0
    ports:
      - 8000:80
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
    extra_hosts:
      - "host.docker.internal:host-gateway"
    environment:
      WORDPRESS_DB_HOST: host.docker.internal:3306
      WORDPRESS_DB_USER: ***
      WORDPRESS_DB_PASSWORD: ***
      WORDPRESS_DB_NAME: ***
    volumes:
      - /path/to/wp-content:/var/www/html/wp-content
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost"]
      interval: 30s
      timeout: 10s
      retries: 3

Slide 48

Slide 48 text

#5: Web Servers - Ingress When using Docker Swarm with lots of containers and services (each bound to a dedicated port), you'll need an ingress system to route the requests to the right service. You could use one of the 2 most used solutions: Nginx or Traefik.

Slide 49

Slide 49 text

#5: Web Servers - Ingress I decided to use a simple bitnami/nginx with a custom config (pretty straightforward proxy):
version: "3.9"
services:
  client:
    image: bitnami/nginx:1.19.8
    ports:
      - 80:8080
      - 443:8443
    deploy:
      replicas: 2
      restart_policy:
        condition: on-failure
    extra_hosts:
      - "host.docker.internal:host-gateway"
    volumes:
      - /root/docker-compose/nginx/lb.conf:/opt/bitnami/nginx/conf/server_blocks/lb.conf:ro
      - /etc/letsencrypt:/etc/letsencrypt

Slide 50

Slide 50 text

#5: Web Servers - TLS Termination This is the tricky part. If you have already bought the certificates (eg. from SSLs) you're good for 1 year (at least). If you don't want to buy them and want to rely on Let's Encrypt, be ready to sweat a bit to set it up. Setting it up on one node is pretty simple, but if you need to replicate it on multiple nodes then you need to get creative. One possible solution is to have a primary node that generates (or renews) the certificate(s) and then spreads them to the other servers:
rsync -e "ssh -i $HOME/.ssh/somekey" -auv --progress /etc/letsencrypt/ syncerssl@:/etc/letsencrypt
rsync -e "ssh -i $HOME/.ssh/somekey" -auv --progress /etc/letsencrypt/ syncerssl@:/etc/letsencrypt

Slide 51

Slide 51 text

#5: Web Servers - Kubernetes Kubernetes is more complex than Swarm and requires more time to configure, but once done there is no vendor lock-in for you (as many providers offer managed k8s), and it is also more extensible. If you already have a Docker Swarm cluster and want to migrate, try following these guides:
● From Docker-Swarm to Kubernetes – the Easy Way!
● Translate a Docker Compose File to Kubernetes Resources
Remember to either use a dockerized database or rely on cloud-native managed solutions.

Slide 52

Slide 52 text


Slide 53

Slide 53 text

#6: DNS - DNS Round-Robin This is not really a practical solution, because whenever one server is down the traffic will still be routed to it, and your customers will be affected. If you have 2 A records pointing to 2 different servers, you could potentially lose 50% of your traffic. In case you need to (manually) remove the unresponsive server quickly, you need to take the DNS TTL into account. If it is set to a high value (like 24h or - even worse - a week) you cannot do anything other than wait. There are pros and cons to setting either a low or a high TTL. Usually, the DNS propagation time is around 24 hours, but it can also be around 72 hours: ISPs can override the TTL you have specified, so your changes can take longer than expected to propagate.
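To make the 50% figure concrete, here is a rough sketch of the arithmetic. It assumes resolvers spread requests evenly across the A records and that stale records may stay cached for the whole TTL; the function names and the 10 requests/second load are hypothetical.

```python
def fraction_lost(total_records: int, down_records: int) -> float:
    """Share of requests sent to dead servers, assuming an even spread across A records."""
    if total_records <= 0 or not 0 <= down_records <= total_records:
        raise ValueError("record counts are inconsistent")
    return down_records / total_records


def requests_lost_during_ttl(requests_per_second: float, ttl_seconds: int,
                             total_records: int, down_records: int) -> float:
    """Rough worst-case count of failed requests while stale records may still be cached."""
    return requests_per_second * ttl_seconds * fraction_lost(total_records, down_records)


# 2 A records, 1 server down, 24h TTL, hypothetical load of 10 requests/second.
print(fraction_lost(2, 1))
print(requests_lost_during_ttl(10, 24 * 3600, 2, 1))
```

Re-running the estimate with a lower TTL shows why the TTL matters so much when a server has to be pulled out of rotation by hand.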

Slide 54

Slide 54 text

#6: DNS - Secondary DNS By having multiple nameservers you can have a fallback in case your DNS provider is having issues (very unlikely but possible). Generally, you need to manually keep the records aligned between the two providers. Sometimes the DNS provider will let you manage those records by pulling the data from your primary provider, or give you API access so you can do it programmatically. RFC 1035 (Domain Names - Implementation And Specification), in fact, proposes having more than one nameserver configured.

Slide 55

Slide 55 text

#6: DNS - Secondary DNS First of all, you need to verify that your registrar has a good DNS management panel. Some services offering such functionality are, for example, FreeDNS (premium version ~$5/year), DNSMadeEasy, and many more. Cloudflare can act as Secondary DNS but the setup seems quite long; DNSimple has out-of-the-box integration with it (but you cannot use any of the functionality offered by CF - which makes it a bit of a loss). I went with PremiumDNS, which claims that it "keeps your website running, even when flooded with traffic. It secures the very deepest level of the Domain Name System (DNS), preventing Distributed Denial of Service (DDoS) attacks, and giving you 100% uptime, guaranteed." A great point is that you can buy it even for 3rd-party domains. You can check the nameservers by running:
dig +short NS

Slide 56

Slide 56 text

#6: DNS - Inspecting TTLs PremiumDNS has a TTL on the NS records of 30m, so you can be unavailable roughly for that amount of time (only if the ISP is not overriding the TTL). Cloudflare has a TTL on the NS records of 6 hours.

Slide 57

Slide 57 text

#6: DNS - Manual Switch When everything goes south (issues with DNS do happen sometimes) and you know you are going to be affected for too long, the last resort is to manually change the authoritative name servers registered on the domain (you can do that via your registrar) and point them to a fallback DNS (you could set up an offline clone of your records in Cloudflare).

Slide 58

Slide 58 text


Slide 59

Slide 59 text

#7: Billing - Original Cost
● DigitalOcean VPS: $5/mo x 12 = $60
● Setup hours: $0
Monthly Cost: $5
Annual Cost: $60

Slide 60

Slide 60 text

#7: Billing - Upgrade Cost
● PremiumDNS: $2.88/yr x 5 domains = $14.4/yr
● DigitalOcean VPS: $5/mo x 12 = $60
● Hetzner VPS: €3.04/mo x 12 = €36.48 ($43.07)
● AWS S3: ~$0.07 x 365 = $26
● Setup hours: $? (fill in here the cost of your time to follow this guide)
Monthly Cost: ~$12
Annual Cost: $143.47+
That's more than 2x the original price, you might say, and you wouldn't be far wrong. Obviously, for different original prices, it won't necessarily be 2x.
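The totals above can be double-checked with a few lines of arithmetic. This is just a sketch using the figures from the slide (with the Hetzner cost already converted to USD as quoted); the dictionary keys are illustrative labels, and setup hours are left out.

```python
# Annual costs in USD, taken from the slide.
costs = {
    "PremiumDNS (5 domains)": 2.88 * 5,
    "DigitalOcean VPS": 5 * 12,
    "Hetzner VPS": 43.07,
    "AWS S3": 26,
}

annual = round(sum(costs.values()), 2)
monthly = round(annual / 12, 2)
ratio = round(annual / 60, 2)  # compared to the original $60/year

print(f"annual: ${annual}, monthly: ~${monthly}, {ratio}x the original cost")
```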

Slide 61

Slide 61 text

#7: Billing - Cost Optimization Note: The following optimizations will only show up on the annual bill; if you take action immediately you will not reach the 2x cost. Let's start cutting what we know is not really necessary for our domains. First, we need to visualize the spending. Hetzner and DigitalOcean are the 2 biggest chunks of our spending (which was predictable). I'll try to cover some scenarios to optimize the cost.

Slide 62

Slide 62 text

#7: Billing - Cost Optimization

Slide 63

Slide 63 text

#7: Billing - Reduce Backup Retention My retention for AWS S3 was the following:
● 24 hourly backups
● 31 daily backups
● 12 weekly backups
● 3 monthly backups
I had 70 of them which were taking 3GB of space and costing just a few cents per year.
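A minimal sketch of how such a retention policy could be expressed and applied. The tier sizes come from the slide; the helper `split_by_retention` and the file names are hypothetical, and the sketch assumes backup names are already sorted from oldest to newest.

```python
from typing import List, Tuple

# Retention tiers taken from the slide.
RETENTION = {"hourly": 24, "daily": 31, "weekly": 12, "monthly": 3}


def split_by_retention(backups: List[str], keep: int) -> Tuple[List[str], List[str]]:
    """Split names sorted oldest-to-newest into (kept, expired)."""
    if keep <= 0:
        return [], list(backups)
    return list(backups[-keep:]), list(backups[:-keep])


print(sum(RETENTION.values()))  # total number of backups retained across tiers

# Hypothetical file names; a smaller 'keep' means fewer objects stored on S3.
kept, expired = split_by_retention(["db-01.gz", "db-02.gz", "db-03.gz"], keep=2)
print(kept, expired)
```

Lowering the numbers in `RETENTION` is the knob to turn when reducing the backup footprint.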

Slide 64

Slide 64 text

#7: Billing - Avoid the Static Clone This is the most space-consuming item on AWS S3, since it is a mirror of your websites. It is not necessary as we have redundancy on our services; it was done just as a last resort in case everything burns down, so at least we could serve the static content to users and let them access the information (even though they cannot interact with the dynamic part of the website).

Slide 65

Slide 65 text

#7: Billing - Replace DigitalOcean We could replace our $60 spending with an additional Hetzner VPS (in another region) and move from $96.96 to $86.14 (saving $10/yr).

Slide 66

Slide 66 text

#7: Billing - Replace DigitalOcean The concern is that, even if we have a VPS in Germany and another in Finland, we are relying on ONE provider (I know, there's vendor lock-in - but I have IaC fully configured so I can switch provider in a matter of minutes):

Slide 67

Slide 67 text

#7: Billing - DNS Fallback If you don't have very important, or profitable, applications/websites, and given how rarely a DNS provider goes down, you might want to save some money on this. If you have many websites the cost can quickly add up, even if it's a few dollars per domain. If you make money, it's advisable to have a fallback DNS (or a premium service with guaranteed uptime at 100%), especially given the low cost (and impact on your bill).

Slide 68

Slide 68 text

#7: Billing - Sum Up I don't run critical (nor very profitable) applications, so I can give up (at the moment) on having multiple cloud providers, which would bring strong HA, in favour of saving some money. This will increase my spending from $60 to $88* (rather than $112), which is not optimal (compared to the initial figure): it's an extra $28/year (~$2.5/month - just a couple of coffees) for peace of mind. *Note: I've got some free credits on AWS so the backups are free (at least for some time - not forever).

Slide 69

Slide 69 text

#7: Billing - Sum Up

Slide 70

Slide 70 text

SAVE YOURSELF FROM A DISASTER #8 Manual Configurations

Slide 71

Slide 71 text

#8: Manual Configs - Tools Ansible and Terraform are essential in our toolbox: these two will be your best friends in documenting the infrastructure and making everything replicable, so you can scale up/out easily. Both tools are vendor-agnostic, so they can work with any provider and help you avoid locking yourself into a vendor-specific tool like AWS CloudFormation / CDK. Other tools for provisioning are Puppet, Chef and SaltStack. Remember to always keep the Infrastructure as Code up-to-date and avoid any configuration drift whatsoever.

Slide 72

Slide 72 text

#8: Manual Configs - Creating VMs For creating the infrastructure we'll use Terraform. This is an example of how to create a new VM (or a Droplet, as DigitalOcean calls it):
# Create a web server
resource "digitalocean_droplet" "web" {
  image      = "ubuntu-20-04-x64"
  name       = "web-1"
  region     = "fra1"
  size       = "s-1vcpu-1gb"
  monitoring = "true"
  ssh_keys   = [digitalocean_ssh_key.default.fingerprint]
  depends_on = [
    digitalocean_ssh_key.default,
  ]
}
Just like that, we could copy & paste it to create many others (even though it is best practice to use the count argument).

Slide 73

Slide 73 text

#8: Manual Configs - Provisioning
---
- name: "Initial Provisioning"
  hosts: all
  become: true
  vars_files:
    - ../vars/init.yml
  roles:
    - oefenweb.swapfile
    - oefenweb.apt
    - ahuffman.resolv
    - ajsalminen.hosts
    - geerlingguy.ntp
    - geerlingguy.firewall
    - dev-sec.os-hardening
    - dev-sec.ssh-hardening
    - uzer.crontab
  tasks:
    - name: Add user manager
      ansible.builtin.user:
        name: "manager"
        shell: /bin/bash
        generate_ssh_key: yes
        ssh_key_type: rsa
        ssh_key_bits: 4096
    - name: Allow manager to have passwordless sudo
      lineinfile:
        dest: /etc/sudoers
        state: present
        insertafter: '^root'
        line: 'manager ALL=(ALL) NOPASSWD: ALL'
        validate: 'visudo -cf %s'
    - name: "Logrotate Configs"
      copy:
        src: "{{ item.src }}"
        dest: "{{ item.dst }}"
      with_items: "{{ app_logrotate_config_items }}"
    - name: Set the policy for the INPUT chain to DROP
      ansible.builtin.iptables:
        chain: INPUT
        policy: DROP

Slide 74

Slide 74 text


Slide 75

Slide 75 text

#9: DRP - Plan Ahead Try to answer, honestly, the following questions:
● What are your weaknesses?
● What are your SPOFs?
● What if the DNS provider goes down?
  ○ How do we switch name servers?
● What will you do if your HDD fails?
● What if you get hit by ransomware?
  ○ How do we make sure we don't fall victim to it?
● What needs to be restored?
● Do we need to point the DB to a fallback node?
● How do we restore the backups?
  ○ Where are the backups stored?
  ○ Who can access them?
● How do we serve static content when everything is lost?
These are just some questions to help you get your head around the Disaster Recovery Plan you'll outline.

Slide 76

Slide 76 text

#9: DRP - Possible Failures
● Application
● Network
● Data Center
● Citywide
● Regional
● National
● Multinational

Slide 77

Slide 77 text

#9: DRP - Outline What are the RTO and RPO for your plan?
● RTO, Recovery Time Objective: the time available to bring the service back online before the disruption becomes unacceptable for your users.
● RPO, Recovery Point Objective: the maximum span of time for which data loss is acceptable (a backup every hour gives an RPO of 1h).

Slide 78

Slide 78 text


Slide 79

Slide 79 text

#10: Play with Providers For example, now that you have everything in containers you could migrate to Kubernetes, or maybe to cloud-native solutions for containers like AWS ECS, GCP GCE, Azure ACI, or even to serverless (since AWS allows serving traffic from a Docker image).

Slide 80

Slide 80 text

#10: Play with Providers Is AWS better, or is Azure or GCP more convenient? Don't fall into vendor lock-in: mix them up. Yeah, but then they don't play along out-of-the-box. Who cares? Make them work FOR you. You might need to spend some more time to get it right, but in the long run (it's always about the long run - if you only focus on now, just don't experiment and run for cover) you'll get the benefit of ALL the services they can offer you.

Slide 81

Slide 81 text

#10: Play with Providers It doesn't have to be a big player: great solutions can also come from non-mainstream providers. I've had experiences with bare-metal, AWS, Contabo, Hetzner, DigitalOcean, Aruba, TransIP, Scaleway, Linode, FlareVM, Heroku, OVH.

Slide 82

Slide 82 text


Slide 83

Slide 83 text


Slide 84

Slide 84 text

2 LITTLE GIFTS
FREE ebook:
GITHUB REPO: