Cloud Success: AWS Aurora for the rescue

While we were discussing at least some new server options with our hosting provider back then ...
... an AWS account manager was reaching out to me.

Johannes Nagl

February 22, 2024

Transcript

  1. Johannes Nagl
     • #proudDadOfAGirl, CEO (Head of Nightmare Projects), Developer at heart, Product Enthusiast
     • With Swat.io for 14 years
     • Follow on LinkedIn, follow on Bluesky: @jol.li
  2. Swat.io
     ✨ Bootstrapped Software-as-a-Service company
     ✨ Coordinated Social Media Communication
     ✨ 45 employees, some of them from Vienna
     ✨ 10 developers, 1 DevOps engineer
     ✨ Product almost 13 years in the making
  3. Swat.io
     ✨ Nature of the application: always on
     ✨ Write-intensive workloads
     ✨ 100s of incoming realtime updates per second
     ✨ We basically only delete data when we lose customers
  4. Swat.io
     ✨ Nature of the application: always on
     ✨ Write-intensive workloads
     ✨ 100s of incoming realtime updates per second
     ✨ We basically only delete data when we lose customers
     ✨ We rarely lose them
  5. The current, primary database as of now()
     ✨ 3.6 TB
     ✨ Biggest table: 1,350,000,000++ rows
     ✨ 79 tables
  6. The history, part 1
     • We started with MySQL / Percona (on a shared webspace)
     • We added UTF-8 / timezone support on the fly
     • 9 years ago we found out: adding new indices or columns on production was a massive pain
     • We also found out: adding multiple indices in parallel might blow up your database engine
  7. The history, part 2
     • Therefore: Postgres for the win!
     • We migrated to Postgres 9.6 and had a decent Master/Slave setup on physical hardware
     • We upgraded RAM + SSDs, but …
     • A couple of years later we found out: there's a physical limit to how much RAM and how many SSDs you can put into a machine (2 x 2xCPU 2.3 GHz 10-core, 128 GB RAM at the start, 384 GB RAM in the end)
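One concrete thing Postgres brought to the index-pain story above (my illustration, not from the talk; the table and index names are invented): an index can be built without locking out writes.

```shell
# Hypothetical example: CREATE INDEX CONCURRENTLY builds the index
# without blocking concurrent inserts/updates on the table. Note it
# cannot run inside a transaction block.
SQL="CREATE INDEX CONCURRENTLY idx_posts_created_at ON posts (created_at);"
# In a real run this would be passed to psql, e.g.:
#   psql -h db-primary -d swatio -c "$SQL"
echo "$SQL"
```

The trade-off is that a concurrent build takes longer and can fail halfway, leaving an invalid index to clean up, but production keeps writing the whole time.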
  8. The problem, 2 years ago
     • There was almost no space left on the SSDs (2.5 TB)
     • We were stuck on Postgres 9.6 because:
       • No space for duplicating the DB on the server with a newer Postgres version
       • No idea how long an in-place upgrade to a new version would take (downtime > 6h: impossible)
       • No secondary system to duplicate Swat.io to for upgrade tests
  9. While we were discussing at least some new server options with our hosting provider back then …
     … an AWS account manager was reaching out to me.
  10. The bright promise
     • We were already an AWS customer back then (only S3)
     • While talking about life in general, the sales manager asked about current challenges
     • When I mentioned "the database is stuck on Postgres 9.6 and we need a new server", he was alerted and saw a huge opportunity
     • He dropped the famous sentence: "Johannes, we have a tool called Database Migration Service - with that you can migrate to a newer Postgres database version on AWS on the fly, without downtime"
     • I laughed out loud and said: "Sure thing"
  11. AWS Aurora
     "Unparalleled high performance and availability at global scale with full MySQL and PostgreSQL compatibility, at 1/10th the cost of commercial databases. Aurora has 5x the throughput of MySQL and 3x of PostgreSQL."
     Built-in security, continuous backups, serverless compute, up to 15 read replicas, automated multi-Region replication.
     https://aws.amazon.com/rds/aurora/
  12. AWS Aurora, translated
     • It's basically Postgres (or MySQL)
     • But on steroids
     • As a managed service
     • Highly available
     • With tons of AWS optimisations for the cloud
  13. The bright future
     • Well, we gave it a try
     • We teamed up with an AWS partner (tecRacer) and agreed to go all in on AWS
     • We investigated the project and found out it should be doable
  14. The process, high-level
     → It took us another 9 months to finally fully migrate to AWS
     → We made multiple attempts and had some fuck-ups in the process
     → In the end, we really upgraded to Postgres 13 without any problems (on that day)
     → We lived happily ever after (almost)
  15. The process, in-depth: pre-requirements
     1. Set up a site-to-site VPN between the old hosting environment and AWS
     2. Keep an eye on disk space on the legacy on-prem servers
  16. The process, in-depth: runbook
     1. Stop Swat.io
     2. Create a replication slot
     3. Start a dump (total runtime: 19h)
     4. Start Swat.io
     5. When the dump is done: upload to S3 (1h)
     6. Import the dump on the AWS machine (16h)
     7. Start AWS DMS Change Data Capture (replication lag of 36h)
     8. At 0 latency: VACUUM ANALYZE
     9. Manually fix sequences (auto-increments are not properly set with the sync mode we chose)
     10. Turn off the old app / servers
     11. Turn on the AWS app / servers
     12. Cross your fingers
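The core of the runbook above can be sketched as a script. This is my reconstruction, not Swat.io's actual tooling: hostnames, database name, bucket, and slot name are placeholders, and the timings in the comments are the ones from the talk. It defaults to a dry run that only prints each command.

```shell
# Dry-run runbook sketch; set DRY_RUN=0 to actually execute.
set -eu
DRY_RUN="${DRY_RUN:-1}"

run() {
  # Print the command in dry-run mode, otherwise execute it.
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# 2. Create a logical replication slot so DMS CDC can replay later changes
run psql -h legacy-db -U postgres -d swatio \
  -c "SELECT pg_create_logical_replication_slot('dms_slot', 'test_decoding');"

# 3. Dump the database in custom format (this was the 19h step)
run pg_dump -h legacy-db -U postgres -Fc -f /backup/swatio.dump swatio

# 5. Upload the finished dump to S3 (1h)
run aws s3 cp /backup/swatio.dump s3://example-migration-bucket/swatio.dump

# 6. Restore it on the AWS side (16h)
run pg_restore -h aurora-endpoint -U postgres -d swatio /backup/swatio.dump

# 8. Once DMS reports 0 latency, refresh planner statistics on Aurora
run psql -h aurora-endpoint -U postgres -d swatio -c "VACUUM ANALYZE;"
```

Steps 1, 4, and 7 (stopping/starting the app and starting the DMS CDC task) happen outside this script, in the app deployment and the DMS console.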
  17. The fuck-ups
     • It took quite some tuning of the AWS DMS bastion host sizing and the AWS DMS configuration parameters
     • We had some tries where the replication lag did not go down
     • We did not know about the sequences in the beginning
     • We had some crashes of the old app because AWS DMS was overloading the system (basically a DoS attack)
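The sequence problem from runbook step 9 looks roughly like this: after the sync, each table's sequence still starts at its old value while the copied rows already use higher IDs, so new inserts collide. A minimal sketch of the fix, with invented table/column names, is to emit one setval statement per table:

```shell
# Build the setval statement for one table; feed the output to psql
# against the new cluster. pg_get_serial_sequence resolves the sequence
# that backs the serial/identity column, and COALESCE handles empty tables.
seq_fix_sql() {
  printf "SELECT setval(pg_get_serial_sequence('%s', '%s'), COALESCE(MAX(%s), 1)) FROM %s;\n" \
    "$1" "$2" "$2" "$1"
}

seq_fix_sql posts id
```

In practice you would loop this over all 79 tables (the list can be pulled from information_schema) rather than typing each one.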
  18. The advantages
     • Down-/upscale instances easily (we started with bigger machines and then downgraded within minutes to a more cost-sensitive instance type)
     • (Also while migrating!)
     • Snapshots all the way
     • No need to buy new SSDs
     • Spin up a new cluster anytime for tests / upgrades
     • In the meantime, we already upgraded to Aurora Postgres 15 without any problems (16 pending)
     • It's really just Postgres; we did not change a single query
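The "downscale within minutes" point boils down to a single AWS CLI call. Instance identifier and class here are invented examples, not the talk's real values:

```shell
# Hypothetical: move an Aurora instance to a cheaper class.
# --apply-immediately changes it now instead of waiting for the
# next maintenance window; readers fail over briefly while it happens.
CMD="aws rds modify-db-instance \
  --db-instance-identifier swatio-aurora-1 \
  --db-instance-class db.r6g.2xlarge \
  --apply-immediately"
echo "$CMD"   # printed instead of executed in this sketch
```

The same mechanism works upward, which is why starting big and shrinking later is cheap to undo.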
  19. The learnings
     • Listen to sales managers, but don't always blindly trust them
     • However, sometimes you need to aim for a moonshot
     • Be careful with Aurora pricing!
       • "I/O-Optimized" clusters are helpful for saving money
       • Boy, this thing is expensive! Way higher costs
     • RDS Proxy always lags behind in Postgres version support
     • You should turn off AWS instances when they are not needed