
Save Yourself From A Disaster

The only certain thing is that it's not a matter of IF there'll be a disaster but rather WHEN, so you'd better not be caught off guard. I'll show and guide you through the details of each step I took to make my websites disaster-proof, while keeping my cloud spending on a tight leash (so you can do this too).

Fabio Cicerchia

May 26, 2021

Transcript

  1. Hello! I AM FABIO CICERCHIA, SW & Cloud Engineer @
     You can find me at: @fabiocicerchia
  2. What can we learn from the latest major cloud incident (i.e. the burning OVH datacenter)? Do not put all your eggs in one basket!
  3. The only certain thing is that it's not a matter of IF there'll be a disaster, but rather WHEN. So you'd better not be caught off guard.
  4. So here are the details of each step I took to make my websites disaster-proof while keeping my cloud spending on a tight leash (so you can do this too). Shit happens, deal with it! Better safe than sorry!
  5. I'm running a bunch of very small websites (with a very simple infrastructure topology) and I wanted to put something into practice on a budget. So I decided to go multi-cloud.
  6. OUTLINE
     This is the outline plan I followed to upgrade my infrastructure:
     1. Secure the Database
     2. Secure the Storage
     3. Redundancy of Database
     4. Redundancy of Storage
     5. Redundancy of Web Servers
     6. Redundancy of DNS
     7. Billing Impact
     8. Manual Configurations
     9. Disaster Recovery Plan
     10. Play with Providers
  7. #1: Database - Backups
     Start doing the DB backups (with mysqldump or xtrabackup) and define a policy for RTO and RPO, so you'll know what the accepted loss is (there's always loss, even if very minimal). RTO defines how long the infrastructure can be down, and RPO defines how much data you can afford to lose (i.e. how old the latest backup is).
  8. #1: Database - ROTATION
     To rotate the DB backups we could simply use logrotate, starting with a basic daily backup rotation (or any interval you have defined as RPO):

       /var/backups/daily/alldb.sql.gz {
         notifempty
         daily
         rotate 7
         nocompress
         create 640 root adm
         dateext
         dateformat -%Y%m%d-%s
         postrotate
           mysqldump -u$USER -p$PASSWD --single-transaction --all-databases | gzip -9f > /var/backups/daily/alldb.sql.gz
         endscript
       }

     This will create the rotated DB backups on the same server where logrotate is running (most likely the DB instance itself). We have seen that this is very risky, so you must always store the backups somewhere else (and also offline).
  9. #1: Database - Remote Storage
     With a simple change, we can upload the backups to an AWS S3 bucket (using a cold storage class for rarely-accessed data):

       lastaction
         BUCKET="..."
         REGION="eu-west-1"
         aws s3 sync /var/backups/hourly "s3://$BUCKET/daily/" --region $REGION --exclude "*" --include "*.gz-$FORMAT*" --storage-class GLACIER
       endscript
  10. #1: Database - Local Storage
     Just do an rsync (better if scheduled) to download it locally to an external hard-drive:

       rsync -e "ssh -i $HOME/.ssh/id_rsa" --progress -auv <USER>@<IP>:/var/backups ./path/to/backups

     There you go: you now have backups on-site (for faster restore), remote on another provider (for more reliability), and offline (for more peace of mind).
  11. #1: Database - Security
     Remember the good practices, and do not forget about GDPR: the backups must be stored encrypted at rest (and use a key instead of a plain password).
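     A minimal sketch of what that could look like with GPG (the recipient key is my assumption, not from the deck):

       # Hypothetical example: encrypt the compressed dump with an asymmetric key,
       # so no plain password sits on the backup server
       gpg --encrypt --recipient backups@example.com /var/backups/daily/alldb.sql.gz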
  12. #1: Database - Restore
     Once everything is backed up, you need to think about how to restore the dump properly, or at least how to switch the connection to the other node. I'll cover this in the Disaster Recovery Plan post.
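     For reference, a restore from the rotated dump would look roughly like this (paths as in the earlier logrotate example):

       # Hedged sketch: decompress the latest dump and feed it back into MySQL
       zcat /var/backups/daily/alldb.sql.gz | mysql -u$USER -p$PASSWD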
  13. #2: Storage - Backups
     Let's back them up on an external (cloud) storage disk. Why not offline? Because the burden of re-uploading all the files stored in a shared folder (which are usually not so few) would make the restore process very slow.
  14. #2: Storage - Option #1: Remote VM
     Let's use a simple cronjob every hour to sync the whole shared folder to a remote location:

       rsync -auv --progress /path/to/shared/folder <IP>:/path/to/shared/folder

     Some providers offer pluggable storage, which would be perfect to detach and reattach to another node (only if using the same provider). Alternatively, the volume could be exported and mounted over NFS (with some performance degradation). By using some cheap storage you can undercut the cost of a cloud-native one: some providers offer 2TB for ~$10/month, like TransIP or AlphaVPS. If you combine them together you'll end up with a slightly higher cost (than using only one) but definitely greater redundancy.
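     A minimal sketch of that hourly cronjob (paths and key name are assumptions):

       # Hypothetical crontab entry: run the sync at minute 0 of every hour
       0 * * * * rsync -e "ssh -i $HOME/.ssh/somekey" -au /path/to/shared/folder <IP>:/path/to/shared/folder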
  15. #2: Storage - Option #2: Cloud-Native Storage
     Still with a simple cronjob, we could sync the whole shared folder to an S3 bucket (using cold storage access):

       aws s3 sync --storage-class GLACIER /path/to/shared/folder s3://<BUCKET>/

     It is free to send data into AWS S3, but to take it out you pay roughly an extra $0.09 per GB, so if you have lots of data you might want to consider this very carefully: keeping 1TB of data could cost you ~$23/month, plus ~$90 to restore it. A cheaper provider for cloud-native storage is Scaleway at ~€0.002/GB/month (1TB = ~€2.5). Also consider that file permissions are lost when saving to AWS S3, so when restoring you need to double-check that they are correct.
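     One possible workaround for the permissions issue (my suggestion, not from the deck) is to archive the folder before uploading, so ownership and modes survive the round trip:

       # Hedged sketch: tar keeps owners/permissions inside the archive; upload the single file instead
       tar -czpf /tmp/shared-folder.tar.gz /path/to/shared/folder
       aws s3 cp /tmp/shared-folder.tar.gz s3://<BUCKET>/ --storage-class GLACIER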
  16. #2: Storage - Restore
     Once everything is backed up, you need to think about how to restore the data properly, or at least switch the access on-the-fly. I'll cover this in the Disaster Recovery Plan post.
  17. #3: Database - Redundancy
     Create a cluster to have at least a primary/secondary structure; 3 nodes are recommended so we'll have the flexibility to do planned maintenance without suffering and/or affecting the performance of the whole cluster.
  18. #3: Database - Spin up a secondary node
     Create another VM somewhere else (better if in another availability zone/region/provider), then configure a MySQL/MariaDB/Percona/... instance and plug it in as a secondary node. We can even set it up with fewer resources and make it the write-only node (in case we have less write activity, otherwise the read-only one).
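     A minimal sketch of plugging in the secondary, assuming GTID-based MySQL replication and placeholder credentials:

       # On the primary: create a dedicated replication user (user/password are placeholders)
       mysql -e "CREATE USER 'repl'@'%' IDENTIFIED BY '***'; GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';"
       # On the secondary: point it at the primary and start replicating
       mysql -e "CHANGE MASTER TO MASTER_HOST='<PRIMARY_IP>', MASTER_USER='repl', MASTER_PASSWORD='***', MASTER_AUTO_POSITION=1; START SLAVE;"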
  19. #3: Database - Balancing requests
     I prefer to use something like HAProxy as a TCP load balancer, or (even better) ProxySQL (which has a nice query caching capability). I'd go with ProxySQL load balancing the 2 nodes created, then just change the database connection string in the application and the setup is done (we could even partition the queries and define to which node they should be sent). In my case, a primary/secondary topology would have been more than enough, but I went for a primary/primary configuration (you can follow a simple tutorial or a more structured configuration) without balancing (because each web node will access its local DB instance).
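     For reference, registering the two nodes in ProxySQL would look roughly like this (hostgroup, IPs and admin credentials are assumptions):

       # Hedged sketch: add both backends to a hostgroup via ProxySQL's admin interface on port 6032
       mysql -h127.0.0.1 -P6032 -uadmin -padmin -e "
         INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (10, '<PRIMARY_IP>', 3306);
         INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (10, '<SECONDARY_IP>', 3306);
         LOAD MYSQL SERVERS TO RUNTIME;
         SAVE MYSQL SERVERS TO DISK;"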
  20. #3: Database - Security
     The replication must be done over a secure connection, so you need to generate a certificate and use it.
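     A minimal sketch of enforcing that, assuming the certificates are already configured in my.cnf on both nodes:

       # On the primary: force the replication user to connect over TLS
       mysql -e "ALTER USER 'repl'@'%' REQUIRE SSL;"
       # On the secondary: tell the replica to use the encrypted channel
       mysql -e "STOP SLAVE; CHANGE MASTER TO MASTER_SSL=1; START SLAVE;"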
  21. #4: Storage - Distributed Storage
     Although we could use some distributed filesystems like Ceph, DRBD, GlusterFS, or ZFS, it wouldn't be on a budget, and the complexity introduced by those tools needs to be addressed properly. I will not cover them here due to the costs of extra nodes and the extra configuration needed - your time has a cost too (but if your filesystem changes frequently this is your only option).
  22. #4: Storage - Ad-Hoc Solutions
     • How to build a Ceph Distributed Storage Cluster on CentOS 7
     • How to Setup DRBD to Replicate Storage on Two CentOS 7 Servers
     • How To Create a Redundant Storage Pool Using GlusterFS on Ubuntu 18.04
     • An Introduction to the Z File System (ZFS) for Linux
  23. #4: Storage - Quick & Dirty: Cross Sync
     Let's use a simple cronjob every hour to sync the whole shared folder to all remote locations.

     Server #1:
       rsync -e "ssh -i $HOME/.ssh/somekey" -auv --progress /path/to/shared/folder/ syncer@<IP2>:/path/to/shared/folder
     Server #2:
       rsync -e "ssh -i $HOME/.ssh/somekey" -auv --progress /path/to/shared/folder/ syncer@<IP1>:/path/to/shared/folder

     Remember, this is not a proper distributed solution; rsync may look old-fashioned, but it has saved me plenty of times. This approach is not feasible for "real-time" synchronization, only for (very) infrequent changes. Distributed filesystems like GlusterFS (or Ceph, or DRBD) are the solutions for the long run.
  24. #4: Storage - Security
     Remember to secure the connection between one host and the others (e.g. with a firewall).
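     A minimal sketch with ufw, assuming the nodes only talk to each other over SSH/rsync (IPs are placeholders):

       # Allow SSH (used by rsync) only from the peer node, deny it from everywhere else
       ufw allow from <IP2> to any port 22 proto tcp
       ufw deny 22/tcp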
  25. #5: Web Servers - Duplicate VM
     Nowadays many cloud providers (and virtualization platforms) give you the ability to take a snapshot of a VM and then restore/clone it. I'll not cover it in this tutorial as it would increase the overall cost of the infrastructure. Although, sometimes (depending on the application) cloning the VM can be much more time-saving than the method I'm proposing below.
  26. #5: Web Servers - Docker
     We live in 2021: everyone is running containers and wishing they had a k8s cluster to play with. So, let's convert the simple applications into containers; there are a lot of ready-made images on Docker Hub.
  27. #5: Web Servers - Docker Swarm
     Let's start nice and easy, with Docker Swarm (which eliminates the extra complexity of Kubernetes) on ONE node (then we can scale out as much as we like). First, set up your nodes. I'm going to use standard images for my dockerized infrastructure, no custom images (for now - I've got pretty simple configurations). I've picked Bitnami images, as they cover a lot of scenarios and provide pre-packaged images for most of the popular server software (among other reasons to pick them). If you really want to start using custom images you could publish them publicly for free on Docker Hub (which has recently introduced some limitations) or on Canister. After the announcement from Docker Hub about rate-limiting pulls, AWS decided to offer public repositories (and they are almost free if you don't exceed 500GB/month anonymously or 5TB/month when logged in).
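     Bootstrapping the single-node swarm is a one-liner (the address is a placeholder):

       # Initialise the swarm; the command prints the token other nodes will use to join later
       docker swarm init --advertise-addr <NODE_IP>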
  28. #5: Web Servers - Docker Compose
     This is an example of a WordPress website configured with docker-compose:

       version: "3.9"
       services:
         wordpress:
           image: wordpress:5.7.0
           ports:
             - 8000:80
           deploy:
             replicas: 1
             restart_policy:
               condition: on-failure
           extra_hosts:
             - "host.docker.internal:host-gateway"
           environment:
             WORDPRESS_DB_HOST: host.docker.internal:3306
             WORDPRESS_DB_USER: ***
             WORDPRESS_DB_PASSWORD: ***
             WORDPRESS_DB_NAME: ***
           volumes:
             - /path/to/wp-content:/var/www/html/wp-content
           healthcheck:
             test: ["CMD", "curl", "-f", "http://localhost"]
             interval: 30s
             timeout: 10s
             retries: 3
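     To run it on the swarm, the deployment would look roughly like this (the stack name is my placeholder):

       # Deploy the compose file as a swarm stack
       docker stack deploy -c docker-compose.yml mysite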
  29. #5: Web Servers - Ingress
     When using Docker Swarm with lots of containers and services (each one binding a dedicated port), you'll need an ingress system to route the requests to the right service. You could use one of the 2 most popular solutions: Nginx or Traefik.
  30. #5: Web Servers - Ingress
     I decided to use a simple bitnami/nginx with a custom config (pretty straightforward proxy):

       version: "3.9"
       services:
         client:
           image: bitnami/nginx:1.19.8
           ports:
             - 80:8080
             - 443:8443
           deploy:
             replicas: 2
             restart_policy:
               condition: on-failure
           extra_hosts:
             - "host.docker.internal:host-gateway"
           volumes:
             - /root/docker-compose/nginx/lb.conf:/opt/bitnami/nginx/conf/server_blocks/lb.conf:ro
             - /etc/letsencrypt:/etc/letsencrypt
  31. #5: Web Servers - TLS Termination
     This is the tricky part. If you have already bought the certificates (e.g. from SSLs) you're good for 1 year (at least). If you don't want to buy them and want to rely on Let's Encrypt, you'll need to be ready to sweat a bit to set it up. Setting it up on one node is pretty simple, but if you need to replicate it across multiple nodes then you need to start being creative. One proposed solution is having a primary node that generates (or renews) the certificate(s) and then spreads them to the other servers:

       rsync -e "ssh -i $HOME/.ssh/somekey" -auv --progress /etc/letsencrypt/ syncerssl@<IP2>:/etc/letsencrypt
       rsync -e "ssh -i $HOME/.ssh/somekey" -auv --progress /etc/letsencrypt/ syncerssl@<IP3>:/etc/letsencrypt
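     One way to trigger that spread automatically (my suggestion; the script path is a placeholder) is certbot's deploy hook, which runs only after a successful renewal:

       # Run the rsync commands above from a small script whenever a certificate is actually renewed
       certbot renew --deploy-hook /usr/local/bin/sync-letsencrypt.sh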
  32. #5: Web Servers - Kubernetes
     Kubernetes is more complex and requires more time to configure, but once done there could be no vendor lock-in for you (as many providers offer managed k8s); it is also more extensible (but more complex than Swarm). If you already have a Docker Swarm cluster and want to migrate, try following these guides:
     • From Docker-Swarm to Kubernetes – the Easy Way!
     • Translate a Docker Compose File to Kubernetes Resources
     Remember to either use a dockerized database or rely on cloud-native managed solutions.
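     If I recall correctly, the second guide relies on kompose, so the translation itself would be roughly:

       # Generate Kubernetes manifests from the existing compose file
       kompose convert -f docker-compose.yml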
  33. #6: DNS - DNS Round-Robin
     This is not really a practical solution, because whenever one server is down the traffic will still be routed to it, and your customers will be affected. If you have 2 A records pointing to 2 different servers you could potentially lose 50% of your traffic. In case you need to quickly remove the unresponsive server (manually), you need to take into account the DNS TTL: if it is set to a high value (like 24h or, even worse, a week) you cannot do anything other than wait. There are pros and cons to setting either a low or a high TTL. Usually the DNS propagation time is around 24 hours, but it could also be around 72 hours, because ISPs can override the TTL you have specified and your changes can take longer than expected to propagate.
  34. #6: DNS - Secondary DNS
     By having multiple nameservers you have a fallback in case your DNS provider is having issues (very unlikely, but possible). Generally, you need to manually keep the records aligned between the two providers. Sometimes the DNS provider will give you the ability to manage those records by pulling the data from your primary provider, or by giving you API access so you can do it programmatically. RFC 1035 (Domain Names - Implementation And Specification), in fact, recommends having more than one nameserver configured.
  35. #6: DNS - Secondary DNS
     First of all, you need to verify that your registrar has a good DNS management panel. Some services offering such functionality are, for example, FreeDNS (premium version ~$5/year), DNSMadeEasy, and many more. Cloudflare can act as a Secondary DNS but the setup seems quite long; DNSimple has an out-of-the-box integration with it (but then you cannot use any of the functionality offered by CF, which makes it a bit of a loss). I went with PremiumDNS, which claims that it "keeps your website running, even when flooded with traffic. It secures the very deepest level of the Domain Name System (DNS), preventing Distributed Denial of Service (DDoS) attacks, and giving you 100% uptime, guaranteed." A great point is that you can buy it even for 3rd-party domains. You can check the nameservers by running:

       dig +short NS example.com
  36. #6: DNS - Inspecting TTLs
     PremiumDNS has a TTL of 30m on the NS records, so you could be unavailable for roughly that amount of time (only if the ISP is not overriding the TTL). Cloudflare has a TTL of 6 hours on the NS records.
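     You can inspect those TTLs yourself; the second column of the answer section is the TTL in seconds:

       # Show the NS records together with their TTLs
       dig NS example.com +noall +answer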
  37. #6: DNS - Manual Switch
     When everything goes south, it sometimes happens that you have issues with DNS and you know you are going to be affected for too long. The last resort is to manually change the authoritative name servers registered for the domain (you can do that via your registrar) and point them to a fallback DNS (you could keep an offline clone of your records in Cloudflare).
  38. #7: Billing - Original Cost
     • DigitalOcean VPS: $5/mo x 12 = $60
     • Setup hours: $0
     Monthly Cost: $5
     Annual Cost: $60
  39. #7: Billing - Upgrade Cost
     • PremiumDNS: $2.88/yr x 5 domains = $14.4/yr
     • DigitalOcean VPS: $5/mo x 12 = $60
     • Hetzner VPS: €3.04/mo x 12 = €36.48 ($43.07)
     • AWS S3: ~$0.07 x 365 = $26
     • Setup hours: $? (fill in your own time cost to follow this guide)
     Monthly Cost: ~$12
     Annual Cost: $143.47+
     That's more than 2x the original price, you might say, and you wouldn't be far wrong. Obviously, for different original prices it won't necessarily be 2x.
  40. #7: Billing - Cost Optimization
     Note: the following optimizations will only show up on the annual bill; if you take action immediately you won't reach the 2x cost. Let's start cutting down what we know is not really necessary for our domains. First, we need to visualize the spending: Hetzner and DigitalOcean are the 2 biggest chunks of it (which was predictable). I'll try to cover some scenarios to optimize the cost.
  41. #7: Billing - Reduce Backup Retention
     My retention for AWS S3 was the following:
     • 24 hourly backups
     • 31 daily backups
     • 12 weekly backups
     • 3 monthly backups
     I had 70 of them, which were taking 3GB of space and costing just a few cents per year.
  42. #7: Billing - Avoid the Static Clone
     This is the most space-consuming item on AWS S3, since it is a mirror of your websites. It is not really necessary, as we already have redundancy on our services; it was done just as a last resort in case everything burns down, so that at least we could serve the static content and users could still access the information (even though they cannot interact with the dynamic part of the website).
  43. #7: Billing - Replace DigitalOcean
     We could replace our $60 spending with an additional Hetzner VPS (in another region) and move from $96.96 to $86.14 (saving ~$10/yr).
  44. #7: Billing - Replace DigitalOcean
     The concern is that, even if we have a VPS in Germany and another in Finland, we are relying on ONE provider (I know, there's vendor lock-in, but I have IaC fully configured so I can switch providers in a matter of minutes).
  45. #7: Billing - DNS Fallback
     If you don't have very important, or profitable, applications/websites, and given how infrequently a DNS provider goes down, you might want to save some money on this. If you have many websites the cost can quickly become high, even at a few dollars per domain. If you do make money from them, it's advisable to have a fallback DNS (or a premium service with a guaranteed 100% uptime), also because of the low cost (and impact on your bill).
  46. #7: Billing - Sum Up
     I don't run critical (nor very profitable) applications, so I can give up (at the moment) on having multiple cloud providers to bring strong HA, in favour of saving some money. This will increase my spending from $60 to $88* (instead of $112), which is not optimal (compared to the initial figure): it's an extra $28/year (~$2.5/month - just like a couple of coffees) to have peace of mind.
     *Note: I've got some free credits on AWS so the backups are free (at least for some time - not forever).
  47. #8: Manual Configs - Tools
     Ansible and Terraform are essential in our toolbox: these two will be your best friends in documenting the infrastructure and making everything replicable, so you can scale up/out easily. These 2 tools are vendor-agnostic, so they can work with any provider and save you from locking yourself into a provider-specific configuration management tool like AWS CloudFormation / CDK. Other tools for provisioning are Puppet, Chef and SaltStack. Remember to keep the Infrastructure as Code always up-to-date and avoid any configuration drift whatsoever.
  48. #8: Manual Configs - Creating VMs
     For creating the infrastructure we'll use Terraform. This is an example of how to create a new VM (or, to be precise, a Droplet as they call it):

       # Create a web server
       resource "digitalocean_droplet" "web" {
         image      = "ubuntu-20-04-x64"
         name       = "web-1"
         region     = "fra1"
         size       = "s-1vcpu-1gb"
         monitoring = "true"
         ssh_keys   = [digitalocean_ssh_key.default.fingerprint]

         depends_on = [
           digitalocean_ssh_key.default,
         ]
       }

     Just like that we could simply copy & paste and create many others (even though it is best practice to use the count argument).
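     Applying it is the usual Terraform workflow (the DigitalOcean provider and its token are assumed to be already configured):

       # Preview and then create the droplet
       terraform init
       terraform plan
       terraform apply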
  49. #8: Manual Configs - Provisioning

       ---
       - name: "Initial Provisioning"
         hosts: all
         become: true
         vars_files:
           - ../vars/init.yml
         roles:
           - oefenweb.swapfile
           - oefenweb.apt
           - ahuffman.resolv
           - ajsalminen.hosts
           - geerlingguy.ntp
           - geerlingguy.firewall
           - dev-sec.os-hardening
           - dev-sec.ssh-hardening
           - uzer.crontab
         tasks:
           - name: Add user manager
             ansible.builtin.user:
               name: "manager"
               shell: /bin/bash
               generate_ssh_key: yes
               ssh_key_type: rsa
               ssh_key_bits: 4096
           - name: Allow manager to have passwordless sudo
             lineinfile:
               dest: /etc/sudoers
               state: present
               insertafter: '^root'
               line: 'manager ALL=(ALL) NOPASSWD: ALL'
               validate: 'visudo -cf %s'
           - name: "Logrotate Configs"
             copy:
               src: "{{ item.src }}"
               dest: "{{ item.dst }}"
             with_items: "{{ app_logrotate_config_items }}"
           - name: Set the policy for the INPUT chain to DROP
             ansible.builtin.iptables:
               chain: INPUT
               policy: DROP
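     Running it against the hosts is then (the file names are my placeholders):

       # Apply the provisioning playbook to every host in the inventory
       ansible-playbook -i inventory.ini provision.yml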
  50. #9: DRP - Plan Ahead
     Try to answer, in an honest way, the following questions:
     • What are your weaknesses?
     • What are your SPOFs?
     • What if the DNS provider goes down?
       ◦ How do we switch name servers?
     • What will you do if your HDD fails?
     • What if you get hit by ransomware?
       ◦ How do we make sure we don't have to pay a ransom?
     • What needs to be restored?
     • Do we need to point the DB to a fallback node?
     • How do we restore the backups?
       ◦ Where are the backups stored?
       ◦ Who can access them?
     • How do we serve static content when everything is lost?
     These are just some questions to get your head around the Disaster Recovery Plan you'll outline.
  51. #9: DRP - Possible Failures
     • Application
     • Network
     • Data Center
     • Citywide
     • Regional
     • National
     • Multinational
  52. #9: DRP - Outline
     What are the RTO and RPO for your plan?
     • RTO, Recovery Time Objective: the time needed to bring the service back online before the disruption becomes unacceptable for your users.
     • RPO, Recovery Point Objective: the maximum amount of data loss, measured in time (a backup every hour gives an RPO of 1h).
  53. #10: Play with Providers
     For example, now that you have everything in containers you could migrate to Kubernetes, or maybe to cloud-native solutions for containers like AWS ECS, GCP GCE, Azure ACI, or even to serverless (since AWS now allows serving traffic from a Docker image).
  54. #10: Play with Providers
     Is AWS better, or is Azure or GCP more convenient? Don't fall into vendor lock-in: mix them up. "Yeah, but then they don't play along out-of-the-box." Who cares? Make them work FOR you. You might need to spend some more time to get it right, but in the long run (it's always about the long run - if you focus only on now, just don't experiment and run for cover) you'll get the benefit of ALL the services they can offer you.
  55. #10: Play with Providers
     It doesn't have to be a big player: great solutions can also come from non-mainstream providers. I've had experience with bare-metal, AWS, Contabo, Hetzner, DigitalOcean, Aruba, TransIP, Scaleway, Linode, FlareVM, Heroku, OVH.