How to make a DR-ready system with Amazon Aurora Global Database
Minoru Onda (@minorun365)
Software engineer, KDDI Corporation & KDDI Agile Development Center Corporation
AWS Community Builders APJ Open Mic – May
Amazon Aurora is a cloud-native RDB
• Distributed instances
• Fast replication across storage in 3 AZs
• Autoscaling of replicas and volumes
docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Overview.html
Where I am using Aurora
Hosted on EKS and connecting with many on-prem systems
[Diagram: iPads in shops and on-prem workloads reach EKS on EC2 and Aurora, in a VPC in the Tokyo region, over private networks and Direct Connect (DX)]
Started DR planning after Direct Connect outage
We faced a huge Direct Connect (DX) outage in 2021 and started multi-region planning
[Diagram: the same Tokyo-region architecture, with DX links between the VPC and the private networks]
Basic strategies of disaster recovery (DR) on AWS
Select the plan that best suits your requirements
• Backup & restore
• Pilot light
• Warm standby
• Multi-site active/active
[Callout on one of the strategies: “We chose it!” ☝]
aws.amazon.com/jp/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-i-strategies-for-recovery-in-the-cloud/
Difficulties with multiple databases
When you run multiple DBs, for example:
• To get high availability
• To make the system DR-ready
• To enable Blue/Green deployment
“You build multi-DB, you run it consistently!”
(Photo: Werner Vogels, from Wikipedia. He never actually said this.)
Difficulties with multiple databases
Popular solutions for keeping DBs consistent:
From the application
• Two-phase commit (2PC)
• Saga pattern
From the database
• Logical replication
• Physical replication
☝ You can use it easily with Amazon Aurora!
Amazon Aurora Global Database
• Storage replication across regions
• High performance with physical volume replication
[Diagram: Aurora cluster (primary) in the Tokyo region, with writer & readers and cluster volumes; outbound replication to the Aurora cluster (secondary) in the Osaka region, with readers and cluster volumes]
Amazon Aurora Global Database
Supported engines:
On Aurora MySQL
• Ver 2.11+ (minor versions)
On Aurora PostgreSQL
• Ver 11.17+ (minor versions)
• Ver 12.12+ (minor versions)
• Ver 13.8+ (minor versions)
• Ver 14.5+ (minor versions)
What should you do on disaster?
We have 2 options for failing over between regions in Global Database
Planned failover (managed)
• One click on the console
• Cannot be used in an emergency
Unplanned failover
• Manual operation with several steps
• Available even during a disaster
👇 Use it first!
Operation steps on disaster
When you run Global Database on Tokyo (primary) and Osaka (secondary) ...
[Diagram: Aurora cluster (primary) in the Tokyo region; Aurora cluster (secondary) in the Osaka region]
Operation steps on disaster
Then you should remove the secondary cluster from Global Database
[Diagram: Aurora cluster (primary) in the Tokyo region; the Osaka cluster is removed from the Global DB and becomes an Aurora cluster (standalone)]
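The detach step could be scripted roughly as follows. This is a minimal sketch, assuming boto3's RDS client and its `remove_from_global_cluster` call; the function name `detach_secondary` and the identifiers in the example call are hypothetical. The function accepts any client-like object, so it can be dry-run with a stub before a real emergency.

```python
def detach_secondary(rds_client, global_cluster_id, secondary_cluster_arn):
    """Detach a secondary cluster from its global database.

    `rds_client` is expected to behave like boto3.client("rds") in the
    secondary region; any object with the same method works for a dry run.
    """
    response = rds_client.remove_from_global_cluster(
        GlobalClusterIdentifier=global_cluster_id,
        # The API expects the cluster ARN here, not the short identifier.
        DbClusterIdentifier=secondary_cluster_arn,
    )
    # The response describes the global cluster after the change.
    return response.get("GlobalCluster", {})


# Example call (needs AWS credentials; every name below is hypothetical):
#   import boto3
#   osaka_rds = boto3.client("rds", region_name="ap-northeast-3")
#   detach_secondary(
#       osaka_rds,
#       "my-global-db",
#       "arn:aws:rds:ap-northeast-3:123456789012:cluster:osaka-cluster",
#   )
```

Passing the client in, rather than creating it inside the function, keeps the step easy to rehearse without touching production.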
Operation steps after disaster
After the disaster has passed, you can rebuild Global Database from Osaka (the new primary)
[Diagram: Aurora cluster (old) in the Tokyo region; Aurora cluster (primary) in the Osaka region; a new Aurora cluster (secondary) added by rebuilding the Global Database]
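Rebuilding could look roughly like the sketch below. It assumes boto3's RDS client methods `create_global_cluster` and `create_db_cluster`; the function name and all identifiers are hypothetical. It leaves out waiting for the global cluster to become available, networking and encryption settings, and the creation of reader instances.

```python
def rebuild_global_database(primary_rds, secondary_rds, *, global_cluster_id,
                            new_primary_arn, new_secondary_id,
                            engine, engine_version):
    """Wrap the promoted cluster in a new global database, then add a secondary.

    `primary_rds` and `secondary_rds` are expected to behave like
    boto3.client("rds") in the new primary and new secondary regions.
    """
    # Step 1: create a global cluster whose primary is the existing cluster.
    primary_rds.create_global_cluster(
        GlobalClusterIdentifier=global_cluster_id,
        SourceDBClusterIdentifier=new_primary_arn,
    )
    # Step 2: in the other region, create a cluster that joins as a secondary.
    secondary_rds.create_db_cluster(
        DBClusterIdentifier=new_secondary_id,
        Engine=engine,                # e.g. "aurora-mysql" or "aurora-postgresql"
        EngineVersion=engine_version,
        GlobalClusterIdentifier=global_cluster_id,
    )
    # Reader instances for the new secondary are created separately.


# Example call (needs AWS credentials; every name below is hypothetical):
#   import boto3
#   rebuild_global_database(
#       boto3.client("rds", region_name="ap-northeast-3"),  # Osaka, new primary
#       boto3.client("rds", region_name="ap-northeast-1"),  # Tokyo, new secondary
#       global_cluster_id="my-global-db-2",
#       new_primary_arn="arn:aws:rds:ap-northeast-3:123456789012:cluster:osaka-cluster",
#       new_secondary_id="tokyo-cluster-2",
#       engine="aurora-postgresql",
#       engine_version="14.5",
#   )
```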
Operation steps after disaster
On a peaceful day, you can switch the regions back with managed planned failover
[Diagram: Aurora cluster (old) in the Tokyo region; after the managed failover, the Aurora cluster in the Osaka region becomes the secondary and the other cluster becomes the primary]
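The switch-back could be scripted along these lines. This is a sketch, assuming boto3's RDS client methods `describe_global_clusters` and `failover_global_cluster`; the function name `switch_back` and the identifiers are hypothetical. It checks that the target is a current non-writer member before asking for the managed failover.

```python
def switch_back(rds_client, global_cluster_id, target_cluster_arn):
    """Promote `target_cluster_arn` to primary via managed planned failover."""
    described = rds_client.describe_global_clusters(
        GlobalClusterIdentifier=global_cluster_id
    )
    members = described["GlobalClusters"][0]["GlobalClusterMembers"]
    by_arn = {member["DBClusterArn"]: member for member in members}

    if target_cluster_arn not in by_arn:
        raise ValueError("target cluster is not a member of the global database")
    if by_arn[target_cluster_arn]["IsWriter"]:
        raise ValueError("target cluster is already the primary")

    response = rds_client.failover_global_cluster(
        GlobalClusterIdentifier=global_cluster_id,
        TargetDbClusterIdentifier=target_cluster_arn,
    )
    return response.get("GlobalCluster", {})


# Example call (needs AWS credentials; every name below is hypothetical):
#   import boto3
#   rds = boto3.client("rds", region_name="ap-northeast-3")
#   switch_back(rds, "my-global-db-2",
#               "arn:aws:rds:ap-northeast-1:123456789012:cluster:tokyo-cluster-2")
```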
Make a DR operation flowchart beforehand!
• Disaster occurred!
• Decide to activate DR
• Check health of the Osaka cluster using SQL
• Shut requests out by enabling redirect
• Check replication lag
• Modify DNS record to switch Aurora endpoints
• Unplanned failover on Global Database
• Check health of the app on Osaka
(within RTO)
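One way to keep such a flow honest about its RTO is to time it during drills. The sketch below is an illustration only: `Step`, `run_runbook`, the placeholder actions, and the one-hour RTO are all assumptions, not part of the original flowchart tooling.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    name: str
    action: Callable[[], None]


def run_runbook(steps: List[Step], rto_seconds: float,
                clock: Callable[[], float] = time.monotonic):
    """Run steps in order; report elapsed time per step and whether RTO held."""
    start = clock()
    timeline: List[Tuple[str, float]] = []
    for step in steps:
        step.action()  # an exception here stops the runbook at the failing step
        timeline.append((step.name, clock() - start))
    total = timeline[-1][1] if timeline else 0.0
    return {"timeline": timeline, "within_rto": total <= rto_seconds}


if __name__ == "__main__":
    # Placeholder actions; real ones would call SQL, DNS and RDS APIs.
    names = [
        "Check health of the Osaka cluster using SQL",
        "Shut requests out by enabling redirect",
        "Check replication lag",
        "Modify DNS record to switch Aurora endpoints",
        "Unplanned failover on Global Database",
        "Check health of the app on Osaka",
    ]
    steps = [Step(name, lambda: None) for name in names]
    report = run_runbook(steps, rto_seconds=3600)  # the 1-hour RTO is an assumption
    for name, elapsed in report["timeline"]:
        print(f"{elapsed:8.3f}s  {name}")
    print("within RTO:", report["within_rto"])
```

The runner records an RTO miss instead of aborting, since stopping a recovery halfway is usually worse than finishing late.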
You will make mistakes in an emergency. Automate it!
In my case...
• Automated all the operations with Exastro (Japanese software)
• Separated the operations into groups, making it easy to adapt flexibly to the situation
• Used Ansible for the operations on on-prem network components
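The idea of grouping operations can be illustrated in a few lines. This sketch is not how Exastro or Ansible model it; the `GROUPS` mapping, its operation names, and the `plan` function are hypothetical placeholders.

```python
# Hypothetical groups; the operation names are placeholders, not the author's.
GROUPS = {
    "database": ["check_replication_lag", "unplanned_failover"],
    "dns": ["switch_aurora_endpoint_record"],
    "on_prem_network": ["reroute_on_prem_traffic"],
}


def plan(groups, selected):
    """Return the ordered operation names for the selected groups.

    Order follows the group definition, not the order of `selected`.
    """
    unknown = sorted(set(selected) - set(groups))
    if unknown:
        raise KeyError(f"unknown groups: {unknown}")
    return [op for name, ops in groups.items() if name in selected for op in ops]


if __name__ == "__main__":
    # Only the cloud side is affected, so the on-prem group is skipped.
    print(plan(GROUPS, {"database", "dns"}))
```

Selecting whole groups lets the operator skip, for example, the on-prem network changes when only the cloud side is affected.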
And practice DR operations regularly!
If you can, plan regular training with all the stakeholders
• Operations team
• Infrastructure developers (including DBA)
• Application developers
• Management (who can decide to activate DR)