Save Yourself From A Disaster

Hello! I AM FABIO CICERCHIA SW & Cloud Engineer @
You can find me at: @fabiocicerchia

https://www.ovh.com/world/news/press/cpl1787.fire-our-strasbourg-site

What can we learn from the latest major cloud incident (i.e. the burning OVH datacenter)? Do not put all your eggs in one basket!

The only certain thing is that it's not a matter of IF there'll be a disaster, but rather WHEN. So better not be caught off guard.

https://slate.com/technology/2014/08/shark-attacks-threaten-google-s-undersea-internet-cables-video.html

https://www.reddit.com/r/DataHoarder/comments/bccfl6/forklift_accident/ekqwycj/

https://www.zdnet.com/article/company-shuts-down-because-of-ransomware-leaves-300-without-jobs-just-before-holidays/
https://www.cybersecurity-insiders.com/ransomware-might-likely-force-travelex-into-bankruptcy/
https://www.bankinfosecurity.com/hospital-ransomware-attacks-surge-so-now-what-a-8987

https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-today/

https://www.wired.com/story/far-right-extremist-allegedly-plotted-blow-up-amazon-data-centers/

https://www.reddit.com/r/cscareerquestions/comments/6ez8ag/accidentally_destroyed_production_database_on/

https://betterprogramming.pub/how-a-cache-stampede-caused-one-of-facebooks-biggest-outages-dbb964ffc8ed

https://twitter.com/fabiocicerchia/status/1338465077998071809

https://twitter.com/gitlabstatus/status/826591961444384768

So here are the details of each step I took to make my website disaster-proof while keeping my cloud spending on a tight leash (so you can do this too). Shit happens, deal with it! Better safe than sorry!

I'm running a bunch of very small websites (with a very simple infrastructure topology) and I wanted to put something into practice on a budget. So, I decided to go multi-cloud.

Evolution
I started with an infrastructure that looked like this:

Evolution
Then, I ended up with something like this:

OUTLINE
This is the outline plan I followed to upgrade my infrastructure:
1. Secure the Database
2. Secure the Storage
3. Redundancy of Database
4. Redundancy of Storage
5. Redundancy of Web Servers
6. Redundancy of DNS
7. Billing Impact
8. Manual Configurations
9. Disaster Recovery Plan
10. Play with Providers

DISCLAIMER I wrote this ebook: https://leanpub.com/savefromdisaster

SAVE YOURSELF FROM A DISASTER #1 Secure the Database

#1: Database - Backups
Start doing DB backups (with mysqldump or xtrabackup) and define a policy for RTO and RPO, so you know how much loss is acceptable (there's always some loss, even if minimal). RTO defines how long the infrastructure can be down, and RPO defines how much data you can afford to lose (i.e. how old the latest backup is).
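To make the two objectives concrete, here is a minimal Python sketch that checks them against timestamps. The function names and the sample dates are assumptions made for illustration; they are not part of the original setup.

```python
from datetime import datetime, timedelta, timezone


def rpo_violated(last_backup: datetime, now: datetime, rpo: timedelta) -> bool:
    """Return True when the newest backup is older than the RPO allows."""
    return (now - last_backup) > rpo


def rto_violated(outage_start: datetime, now: datetime, rto: timedelta) -> bool:
    """Return True when the outage has already lasted longer than the RTO."""
    return (now - outage_start) > rto


if __name__ == "__main__":
    now = datetime(2021, 4, 1, 12, 0, tzinfo=timezone.utc)
    last_backup = datetime(2021, 4, 1, 10, 30, tzinfo=timezone.utc)  # 1h30m old
    # With an RPO of 1 hour this backup is too old; with 24 hours it is fine.
    print(rpo_violated(last_backup, now, timedelta(hours=1)))
    print(rpo_violated(last_backup, now, timedelta(hours=24)))
```

A check like this could run from a scheduled job and raise an alert whenever the newest backup falls outside the agreed RPO.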

#1: Database - Rotation
To rotate the DB backups we could simply use logrotate. We could start with a basic daily backup rotation (or whatever interval you have defined as your RPO):

    /var/backups/daily/alldb.sql.gz {
        notifempty
        daily
        rotate 7
        nocompress
        create 640 root adm
        dateext
        dateformat -%Y%m%d-%s
        postrotate
            mysqldump -u$USER -p$PASSWD --single-transaction --all-databases | gzip -9f > /var/backups/daily/alldb.sql.gz
        endscript
    }

This creates the rotated DB backups on the same server where logrotate is running (most likely the DB instance itself). We have seen that this is very wrong, so you must always store the backups somewhere else (and also offline).
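As an illustration of what a "keep the newest 7" retention means for the dated files, here is a small Python sketch. It only models the selection; logrotate does the real work. The function name and sample file names are hypothetical, and it assumes every name shares the same prefix followed by a fixed-width date suffix, so that plain string sorting matches chronological order.

```python
from typing import List


def rotated_backups_to_delete(file_names: List[str], keep: int = 7) -> List[str]:
    """Return the rotated backups that fall outside the newest `keep` files.

    Assumes names such as 'alldb.sql.gz-20210401-1617235200', where the
    suffix sorts chronologically when compared as plain strings.
    """
    if keep < 0:
        raise ValueError("keep must be >= 0")
    newest_first = sorted(file_names, reverse=True)
    return newest_first[keep:]


if __name__ == "__main__":
    names = [f"alldb.sql.gz-202104{day:02d}-1617000000" for day in range(1, 11)]
    for name in rotated_backups_to_delete(names):
        print("would delete:", name)
```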

#1: Database - Remote Storage
With a simple change, we can upload to an AWS S3 bucket (using a cold storage class meant for rarely accessed data):

    lastaction
        BUCKET="..."
        REGION="eu-west-1"
        # $FORMAT must hold the date suffix of the rotated files to upload
        aws s3 sync /var/backups/daily "s3://$BUCKET/daily/" --region "$REGION" --exclude "*" --include "*.gz-$FORMAT*" --storage-class GLACIER
    endscript

#1: Database - Local Storage
Just run rsync (better if scheduled) to download the backups locally to an external hard drive:

    rsync -e "ssh -i $HOME/.ssh/id_rsa" --progress -auv <USER>@<IP>:/var/backups ./path/to/backups

There you go: you now have backups on-site (for faster restore), remote on another provider (for more reliability), and offline (for more peace of mind).

#1: Database - Security
Remember the good practices, and do not forget about GDPR: the backups must be stored encrypted at rest (using a key instead of a plain password).

#1: Database - Restore
Once everything is backed up, you need to think about how to restore the dump properly, or at least switch the connection to the other node. I'll cover this in the Disaster Recovery Plan post.

SAVE YOURSELF FROM A DISASTER #2 Secure the Storage

#2: Storage - Backups
Let's back the files up to an external (cloud) storage disk. Why not offline? Because the burden of re-uploading all the files stored in a shared folder (which are usually not so few) would make the restore process very slow.

#2: Storage - Option #1: Remote VM
Let's use a simple cronjob every hour to sync the whole shared folder to a remote location:

    rsync -auv --progress /path/to/shared/folder <IP>:/path/to/shared/folder

Some providers offer pluggable storage, which would be perfect to detach and reattach to another node (only if using the same provider). Alternatively, the VM's storage could be exported and mounted via NFS (with some performance degradation). By using cheap storage providers you could undercut the cost of a cloud-native one. Some providers offer 2TB for ~$10/month, like TransIP or AlphaVPS. If you combine them you'll end up with a slightly higher cost (than using only one) but definitely greater redundancy.

#2: Storage - Option #2: Cloud-Native Storage
Still with a simple cronjob, we could sync the whole shared folder to an S3 bucket (using a cold storage class):

    aws s3 sync --storage-class GLACIER /path/to/shared/folder s3://<BUCKET>/

It is free to send data into AWS S3, but to take it out you need to pay roughly an extra $0.09 per GB. So if you have lots of data, you might want to consider this very carefully: 1TB of data could cost you ~$23/month to store, plus ~$90 to restore. A cheaper provider for Cloud-Native Storage is Scaleway, with ~0.002€/GB/month (1TB = ~€2.5). You also need to consider that file permissions are lost when saving to AWS S3, so after restoring you need to double-check that they are correct.
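The arithmetic behind those figures can be sketched in a few lines of Python. The per-GB storage price of $0.023 is an assumption derived from the ~$23/month per TB quoted above (taking 1TB as 1000GB), and the function name is made up for this example; check the current provider pricing before relying on it.

```python
from typing import Tuple


def s3_restore_estimate(
    size_gb: float,
    storage_usd_per_gb_month: float = 0.023,  # assumed from ~$23/month per TB
    egress_usd_per_gb: float = 0.09,          # egress price quoted in the text
) -> Tuple[float, float]:
    """Return (monthly storage cost, one-off cost to download everything)."""
    monthly_storage = round(size_gb * storage_usd_per_gb_month, 2)
    one_off_egress = round(size_gb * egress_usd_per_gb, 2)
    return monthly_storage, one_off_egress


if __name__ == "__main__":
    storage, egress = s3_restore_estimate(1000)  # 1TB taken as 1000GB
    print(f"storage: ${storage}/month, full restore: ${egress}")
```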

#2: Storage - Restore
Once everything is backed up, you need to think about how to restore the data properly, or at least switch the access on-the-fly. I'll cover this in the Disaster Recovery Plan post.

SAVE YOURSELF FROM A DISASTER #3 Redundancy of Database

#3: Database - Redundancy
Create a cluster with at least a primary/secondary structure. 3 nodes are recommended, so that we have the flexibility to do planned maintenance without affecting the performance of the whole cluster.

#3: Database - Spin up a secondary node
Create another VM somewhere else (better if in another availability zone/region/provider), then configure a MySQL/MariaDB/Percona/... instance and plug it in as a secondary node. We can set it up with fewer resources and dedicate it to writes (if writes are the lighter workload; otherwise dedicate it to reads).

#3: Database - Balancing requests
I prefer to use something like HAProxy as a TCP load balancer, or (even better) ProxySQL (which has a nice query caching capability). I'd go with ProxySQL load balancing the 2 nodes created, then just change the database connection string in the application and the setup is done (we could even partition the queries and define which node they should be sent to). In my case, a primary/secondary topology would be more than enough, but I went for a primary/primary configuration (you can follow a simple tutorial or a more structured configuration) without balancing (because each web node accesses its local DB instance).

#3: Database - Security
Replication must happen over a secure connection, so you need to generate a certificate and use it.

SAVE YOURSELF FROM A DISASTER #4 Redundancy of Storage

#4: Storage - Distributed Storage
We could use distributed filesystems like Ceph, DRBD, GlusterFS, or ZFS, but that wouldn't be on a budget, and the complexity introduced by those tools would need to be addressed properly. I will not cover them here due to the cost of the extra nodes and the extra configuration needed: your time has a cost too (but if your filesystem changes frequently this is your only option).

#4: Storage - Ad-Hoc Solutions
• How to build a Ceph Distributed Storage Cluster on CentOS 7
• How to Setup DRBD to Replicate Storage on Two CentOS 7 Servers
• How To Create a Redundant Storage Pool Using GlusterFS on Ubuntu 18.04
• An Introduction to the Z File System (ZFS) for Linux

#4: Storage - Quick & Dirty: Cross Sync
Let's use a simple cronjob every hour to sync the whole shared folder to all remote locations.

Server #1:

    rsync -e "ssh -i $HOME/.ssh/somekey" -auv --progress /path/to/shared/folder/ syncer@<IP2>:/path/to/shared/folder

Server #2:

    rsync -e "ssh -i $HOME/.ssh/somekey" -auv --progress /path/to/shared/folder/ syncer@<IP1>:/path/to/shared/folder

Remember, this is not a proper distributed solution. rsync may look old-fashioned, but it has saved me lots of times. This approach is not feasible for "real-time" synchronization; it is only suitable for (very) infrequent changes. Distributed filesystems like GlusterFS (or Ceph, or DRBD) are solutions for the long run.

#4: Storage - Security
Remember to secure the connection between each host and the others (e.g. with a firewall).

SAVE YOURSELF FROM A DISASTER #5 Redundancy of Web Servers

#5: Web Servers - Duplicate VM
Nowadays many cloud providers (and virtualization platforms) let you take a snapshot of a VM and then restore/clone it. I won't cover it in this tutorial, as it would increase the overall cost of the infrastructure. Still, sometimes (depending on the application) cloning the VM can save a lot of time compared to the other method I'm proposing below.

#5: Web Servers - Docker
We live in 2021: everyone is running containers and wishing they had a k8s cluster to play with. So, let's convert the simple applications into containers; there are a lot of ready-made images on Docker Hub.

#5: Web Servers - Docker Swarm
Let's start nice and easy, with Docker Swarm (which avoids the extra complexity of Kubernetes) on ONE node (then we can scale out as much as we like). First, set up your nodes. I'm going to use standard images for my dockerized infrastructure, no custom images (for now, as I've got pretty simple configurations). I've picked bitnami images, as they cover a lot of scenarios and provide pre-packaged images for most of the popular server software (there are more reasons to pick them). If you really want to start using custom images you could publish them publicly for free on Docker Hub (which has recently introduced some limitations) or on Canister. After Docker Hub announced pull rate limits, AWS decided to offer public repositories (almost free if you don't exceed 500GB/month when not logged in, or 5TB/month when logged in).

#5: Web Servers - Docker Compose
This is an example of a WordPress website configured with docker-compose:

    version: "3.9"
    services:
      wordpress:
        image: wordpress:5.7.0
        ports:
          - 8000:80
        deploy:
          replicas: 1
          restart_policy:
            condition: on-failure
        extra_hosts:
          - "host.docker.internal:host-gateway"
        environment:
          WORDPRESS_DB_HOST: host.docker.internal:3306
          WORDPRESS_DB_USER: ***
          WORDPRESS_DB_PASSWORD: ***
          WORDPRESS_DB_NAME: ***
        volumes:
          - /path/to/wp-content:/var/www/html/wp-content
        healthcheck:
          test: ["CMD", "curl", "-f", "http://localhost"]
          interval: 30s
          timeout: 10s
          retries: 3
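The `retries: 3` setting in the healthcheck means the container is only marked unhealthy after three consecutive failed probes. The following Python sketch is a simplified model of that rule, not Docker's implementation: it ignores `interval`, `timeout`, and any start period, and the function name is made up for this example.

```python
from typing import Iterable


def health_status(probe_results: Iterable[bool], retries: int = 3) -> str:
    """Simplified model of a container health state machine.

    `probe_results` is the sequence of probe outcomes (True = success).
    The status only flips to 'unhealthy' after `retries` consecutive failures,
    and a single success resets the failure streak.
    """
    consecutive_failures = 0
    status = "starting"
    for ok in probe_results:
        if ok:
            consecutive_failures = 0
            status = "healthy"
        else:
            consecutive_failures += 1
            if consecutive_failures >= retries:
                status = "unhealthy"
    return status


if __name__ == "__main__":
    print(health_status([True, False, False, True]))   # streak broken by a success
    print(health_status([True, False, False, False]))  # three failures in a row
```

The practical consequence is that, with a 30s interval and 3 retries, a dead web server can go unnoticed for roughly a minute and a half before the orchestrator reacts.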

#5: Web Servers - Ingress
When using Docker Swarm with lots of containers and services (each bound to a dedicated port), you'll need an ingress system to route the requests to the right service. You could use one of the 2 most used solutions: Nginx or Traefik.

#5: Web Servers - Ingress
I decided to use a simple bitnami/nginx with a custom config (a pretty straightforward proxy):

    version: "3.9"
    services:
      client:
        image: bitnami/nginx:1.19.8
        ports:
          - 80:8080
          - 443:8443
        deploy:
          replicas: 2
          restart_policy:
            condition: on-failure
        extra_hosts:
          - "host.docker.internal:host-gateway"
        volumes:
          - /root/docker-compose/nginx/lb.conf:/opt/bitnami/nginx/conf/server_blocks/lb.conf:ro
          - /etc/letsencrypt:/etc/letsencrypt

#5: Web Servers - TLS Termination
This is the tricky part. If you have already bought the certificates (e.g. from SSLs) you're good for 1 year (at least). If you don't want to buy them and want to rely on Let's Encrypt, be ready to sweat a bit to set it up. Setting it up on one node is pretty simple, but if you need to replicate it on multiple nodes then you need to get creative. One possible solution is to have a primary node that generates (or renews) the certificate(s) and then spreads them to the other servers:

    rsync -e "ssh -i $HOME/.ssh/somekey" -auv --progress /etc/letsencrypt/ syncerssl@<IP2>:/etc/letsencrypt
    rsync -e "ssh -i $HOME/.ssh/somekey" -auv --progress /etc/letsencrypt/ syncerssl@<IP3>:/etc/letsencrypt
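When the list of secondary nodes grows, repeating the rsync line by hand gets error-prone. A small Python sketch of the same fan-out could look like this; it only builds the same rsync invocations shown above, one per host. The function name and the sample IPs (taken from the documentation range) are assumptions for illustration.

```python
from typing import List


def letsencrypt_sync_commands(
    hosts: List[str], key_path: str, user: str = "syncerssl"
) -> List[List[str]]:
    """Build one rsync argument list per secondary host."""
    return [
        [
            "rsync",
            "-e", f"ssh -i {key_path}",
            "-auv", "--progress",
            "/etc/letsencrypt/",
            f"{user}@{host}:/etc/letsencrypt",
        ]
        for host in hosts
    ]


if __name__ == "__main__":
    # Hypothetical secondary nodes; replace with the real addresses.
    for cmd in letsencrypt_sync_commands(["192.0.2.10", "192.0.2.11"], "/root/.ssh/somekey"):
        print(" ".join(cmd))
        # To actually run it: subprocess.run(cmd, check=True)
```

Building argument lists rather than shell strings keeps the commands safe to pass to `subprocess.run` without shell quoting issues.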

#5: Web Servers - Kubernetes
Kubernetes is more complex and requires more time to configure, but once that is done there is no vendor lock-in for you (as many providers offer managed k8s), and it is also more extensible than Swarm. If you already have a Docker Swarm cluster and want to migrate, try following these guides:
• From Docker-Swarm to Kubernetes – the Easy Way!
• Translate a Docker Compose File to Kubernetes Resources
Remember to either use a dockerized database or rely on cloud-native managed solutions.

SAVE YOURSELF FROM A DISASTER #6 Redundancy of DNS

#6: DNS - DNS Round-Robin
This is not really a practical solution, because whenever one server is down traffic will still be routed to it, and your customers will be affected. If you have 2 A records pointing to 2 different servers, you could potentially lose 50% of your traffic. If you need to (manually) remove the unresponsive server quickly, you have to take the DNS TTL into account. If it is set to a high value (like 24h or, even worse, a week) you cannot do anything other than wait. There are pros and cons to setting either a low or a high TTL. DNS propagation usually takes around 24 hours, but it can take up to around 72 hours: ISPs can override the TTL you have specified, so your changes may take longer than expected to propagate.
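The 50% figure follows from a simple ratio. Here is a minimal Python sketch, assuming resolvers spread requests evenly across the A records (a simplification, since real resolvers and clients may behave differently); the function name is made up for this example.

```python
def share_of_traffic_lost(total_records: int, down_records: int) -> float:
    """Fraction of requests hitting a dead server under even round-robin."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    if not 0 <= down_records <= total_records:
        raise ValueError("down_records must be between 0 and total_records")
    return down_records / total_records


if __name__ == "__main__":
    # 2 A records, 1 server down: half of the requests fail until the
    # record is removed AND the cached TTL has expired.
    print(share_of_traffic_lost(2, 1))
```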

#6: DNS - Secondary DNS
By having multiple nameservers you have a fallback in case your DNS provider is having issues (very unlikely, but possible). Generally, you need to manually keep the records aligned between the two providers. Sometimes the DNS provider will let you manage those records by pulling the data from your primary provider, or will give you API access so you can do it programmatically. RFC 1035 (Domain Names - Implementation And Specification), in fact, proposes having more than one nameserver configured.

#6: DNS - Secondary DNS
First of all, you need to verify that your registrar has a good DNS management panel. Some services offering such functionality are, for example, FreeDNS (premium version ~$5/year), DNSMadeEasy, and many more. Cloudflare can act as Secondary DNS but the setup seems quite long; DNSimple has out-of-the-box integration with it (but you cannot use any of the functionality offered by CF, which makes it a bit of a loss). I went with PremiumDNS, which claims that it "keeps your website running, even when flooded with traffic. It secures the very deepest level of the Domain Name System (DNS), preventing Distributed Denial of Service (DDoS) attacks, and giving you 100% uptime, guaranteed." A great point is that you can buy it even for 3rd-party domains. You can check the nameservers by running:

    dig +short NS example.com

#6: DNS - Inspecting TTLs
PremiumDNS has a TTL of 30m on the NS records, so you can be unavailable for roughly that amount of time (as long as the ISP is not overriding the TTL). Cloudflare has a TTL of 6 hours on the NS records.

#6: DNS - Manual Switch
When everything goes south and the DNS itself has issues, and you know you are going to be affected for too long, the last resort is to manually change the authoritative name servers registered on the domain (you can do that via your registrar) and point them to a fallback DNS (you could set up an offline clone of your records in Cloudflare).

SAVE YOURSELF FROM A DISASTER #7 Billing Impact

#7: Billing - Original Cost
• DigitalOcean VPS: $5/mo x 12 = $60
• Setup hours: $0
Monthly Cost: $5
Annual Cost: $60

#7: Billing - Upgrade Cost
• PremiumDNS: $2.88/yr x 5 domains = $14.4/yr
• DigitalOcean VPS: $5/mo x 12 = $60
• Hetzner VPS: €3.04/mo x 12 = €36.48 ($43.07)
• AWS S3: ~$0.07 x 365 = $26
• Setup hours: $? (fill in the cost of your time to follow this guide)
Monthly Cost: ~$12
Annual Cost: $143.47+
That's more than 2x the original price, you might say, and you wouldn't be wrong. Obviously, for different original prices, it won't necessarily be 2x.
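The totals above can be double-checked with a few lines of Python. The figures are copied from the list, with the Hetzner cost already converted to USD as quoted; the dictionary keys and function name are only labels chosen for this sketch.

```python
from typing import Dict

# Annual costs in USD, as listed above (setup hours excluded).
UPGRADE_COSTS_USD: Dict[str, float] = {
    "PremiumDNS (5 domains)": 2.88 * 5,
    "DigitalOcean VPS": 5 * 12,
    "Hetzner VPS": 43.07,
    "AWS S3": 26.0,
}
ORIGINAL_ANNUAL_USD = 60.0


def annual_total(costs: Dict[str, float]) -> float:
    """Sum the yearly costs, rounded to cents."""
    return round(sum(costs.values()), 2)


if __name__ == "__main__":
    total = annual_total(UPGRADE_COSTS_USD)
    print("annual:", total)
    print("monthly:", round(total / 12, 2))
    print("ratio vs original:", round(total / ORIGINAL_ANNUAL_USD, 2))
```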

#7: Billing - Cost Optimization
Note: The following optimizations will only show up on the annual bill; if you take action immediately you won't reach the 2x cost. Let's start cutting what we know is not really necessary for our domains. First, we need to visualize the spending. Hetzner and DigitalOcean are the 2 biggest chunks of our spending (which was predictable). I'll try to cover some scenarios to optimize the cost.


#7: Billing - Reduce Backup Retention
My retention for AWS S3 was the following:
• 24 hourly backups
• 31 daily backups
• 12 weekly backups
• 3 monthly backups
I had 70 of them, which were taking 3GB of space and costing just a few cents per year.

#7: Billing - Avoid the Static Clone
The static clone is the most space-consuming item on AWS S3, since it is a mirror of your websites. It is not necessary, as we have redundancy on our services. It was done just as a last resort in case everything burns down, so that we could at least serve users the static content to access the information (even though they cannot interact with the dynamic part of the website).

#7: Billing - Replace DigitalOcean
We could replace our $60 DigitalOcean spending with an additional Hetzner VPS (in another region) and move from $96.96 to $86.14 (saving ~$10/yr).

#7: Billing - Replace DigitalOcean
The concern is that, even if we have one VPS in Germany and another in Finland, we are relying on ONE provider (I know, that's vendor lock-in, but I have IaC fully configured so I can switch provider in a matter of minutes).

#7: Billing - DNS Fallback
If you don't have very important, or profitable, applications/websites, and given how rarely a DNS provider goes down, you might want to save some money on this. If you have many websites the cost can quickly become high, even if it's a few dollars per domain. If you make money, it's advisable to have a fallback DNS (or a premium service with 100% guaranteed uptime), not least because of the low cost (and low impact on your bill).

#7: Billing - Sum Up
I don't run critical (nor very profitable) applications, so I can give up (at the moment) on having multiple cloud providers for strong HA, in favour of saving some money. This will increase my spending from $60 to $88*, which is not optimal (compared to the initial figure): it's an extra $28/year (~$2.5/month, just like a couple of coffees) to have peace of mind.
*Note: I've got some free credits on AWS so the backups are free (at least for some time, not forever).


SAVE YOURSELF FROM A DISASTER #8 Manual Configurations

#8: Manual Configs - Tools
Ansible and Terraform are a must in our toolbox: these two will be your best friends in documenting the infrastructure and making everything replicable, so you can scale up/out easily. Both tools are vendor-agnostic, so they can work with any provider and save you from being locked in to a provider-specific tool like AWS CloudFormation / CDK. Other tools for provisioning are Puppet, Chef and SaltStack. Remember to always keep the Infrastructure as Code up to date, and avoid any configuration drift whatsoever.

#8: Manual Configs - Creating VMs
For creating the infrastructure we'll use Terraform. This is an example of how to create a new VM (or a Droplet, as they call it, to be precise):

    # Create a web server
    resource "digitalocean_droplet" "web" {
      image      = "ubuntu-20-04-x64"
      name       = "web-1"
      region     = "fra1"
      size       = "s-1vcpu-1gb"
      monitoring = true
      ssh_keys   = [digitalocean_ssh_key.default.fingerprint]

      depends_on = [
        digitalocean_ssh_key.default,
      ]
    }

Just like that, we could simply copy & paste and create many others (even though it is best practice to use the count argument).

#8: Manual Configs - Provisioning

    ---
    - name: "Initial Provisioning"
      hosts: all
      become: true
      vars_files:
        - ../vars/init.yml
      roles:
        - oefenweb.swapfile
        - oefenweb.apt
        - ahuffman.resolv
        - ajsalminen.hosts
        - geerlingguy.ntp
        - geerlingguy.firewall
        - dev-sec.os-hardening
        - dev-sec.ssh-hardening
        - uzer.crontab
      tasks:
        - name: Add user manager
          ansible.builtin.user:
            name: "manager"
            shell: /bin/bash
            generate_ssh_key: yes
            ssh_key_type: rsa
            ssh_key_bits: 4096

        - name: Allow manager to have passwordless sudo
          lineinfile:
            dest: /etc/sudoers
            state: present
            insertafter: '^root'
            line: 'manager ALL=(ALL) NOPASSWD: ALL'
            validate: 'visudo -cf %s'

        - name: "Logrotate Configs"
          copy:
            src: "{{ item.src }}"
            dest: "{{ item.dst }}"
          with_items: "{{ app_logrotate_config_items }}"

        - name: Set the policy for the INPUT chain to DROP
          ansible.builtin.iptables:
            chain: INPUT
            policy: DROP

SAVE YOURSELF FROM A DISASTER #9 Disaster Recovery Plan

#9: DRP - Plan Ahead
Try to answer the following questions honestly:
• What are your weaknesses?
• What are your SPOFs?
• What if the DNS provider goes down?
  ◦ How do we switch name servers?
• What will you do if your HDD fails?
• What if you get hit by ransomware?
  ◦ How do we make sure we don't end up paying a ransom?
• What needs to be restored?
• Do we need to point the DB to a fallback node?
• How do we restore the backups?
  ◦ Where are the backups stored?
  ◦ Who can access them?
• How do we serve static content when everything is lost?
These are just some questions to help you get your head around the Disaster Recovery Plan you'll outline.

#9: DRP - Possible Failures
• Application
• Network
• Data Center
• Citywide
• Regional
• National
• Multinational

#9: DRP - Outline
What are the RTO and RPO for your plan?
• RTO, Recovery Time Objective: the maximum time allowed to bring the service back online before the disruption becomes unacceptable for your users.
• RPO, Recovery Point Objective: the maximum window of time for which data loss is tolerated (a backup every hour gives an RPO of 1h).

SAVE YOURSELF FROM A DISASTER #10 Play with Providers

#10: Play with Providers
For example, now that you have everything in containers you could migrate to Kubernetes, or maybe to cloud-native container solutions like AWS ECS, GCP GCE, Azure ACI, or even to serverless (since AWS allows serving traffic from a Docker image).

#10: Play with Providers
Is AWS better, or are Azure or GCP more convenient? Don't fall into vendor lock-in: mix them up. Yeah, but then they don't play along out of the box... who cares? Make them work FOR you. You might need to spend some more time to get it right, but in the long run (it's always about the long run; if you only focus on now, just don't experiment and run for cover) you'll get the benefit of ALL the services they can offer you.

#10: Play with Providers
It doesn't have to be a big player: great solutions can also come from non-mainstream providers. I've had experiences with bare-metal, AWS, Contabo, Hetzner, DigitalOcean, Aruba, TransIP, Scaleway, Linode, FlareVM, Heroku, OVH.

THANK YOU

QUESTIONS?

2 LITTLE GIFTS
FREE ebook: https://leanpub.com/savefromdisaster
GITHUB REPO: https://github.com/fabiocicerchia/save-from-disaster
