How to make DR-ready system with Amazon Aurora Global Database

Minoru Onda @minorun365
Software engineer, KDDI Corporation & KDDI Agile Development Center Corporation
AWS Community Builders APJ Open Mic – May

$ whoami
Minoru Onda @minorun365
Software Engineer / KDDI & KAG (concurrently)
Co-lead in communities • JAWS-UG SRE Branch • JAWS DAYS 2022 Awards • KDDI Cloud SAMURAI 2021 • KDDI Cloud Ambassadors 2021

Do you love Amazon Aurora?

Of course I love it! 😍

Amazon Aurora is a cloud-native RDB
• Distributed instances
• Fast replication in 3-AZ storage
• Autoscaling of replicas and volumes
docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Overview.html

Where I am using Aurora
Customer service application in mobile shops

Where I am using Aurora
Hosted on EKS and connecting with many on-prem systems
[Diagram labels: VPC, Tokyo region, EKS on EC2, Aurora, iPads in shops, private network (×2), on-prem workloads, DX (×2)]

Started DR planning after Direct Connect outage
We faced a huge DX (Direct Connect) outage in 2021 and started multi-region planning
[Diagram: same architecture as on the previous slide]

Then, how to enable BCP?
© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Basic strategies of disaster recovery (DR) on AWS
Clarify your requirements!
• Recovery time objective (RTO)
• Recovery point objective (RPO)
aws.amazon.com/jp/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-i-strategies-for-recovery-in-the-cloud/
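
To make the two requirements concrete, here is a minimal sketch of how a DR drill result could be checked against them. The target values (1 hour, 5 minutes) and all names are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rto: timedelta  # maximum tolerable time until service is restored
    rpo: timedelta  # maximum tolerable window of lost data


def meets_objectives(objectives, recovery_time, data_loss_window):
    """Return True when a drill (or a real failover) stayed within both targets."""
    return (recovery_time <= objectives.rto
            and data_loss_window <= objectives.rpo)


# Hypothetical targets: back in service within 1 hour, at most 5 minutes of data lost.
targets = RecoveryObjectives(rto=timedelta(hours=1), rpo=timedelta(minutes=5))
drill_ok = meets_objectives(targets, timedelta(minutes=40), timedelta(seconds=2))
```

Writing the targets down as data like this makes it easier to compare each drill against the same numbers.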

Basic strategies of disaster recovery (DR) on AWS
Select the plan that best suits your requirements
• Backup & restore
• Pilot light
• Warm standby
• Multi-site active/active
We chose it! ☝ (pointing at one of the strategies above)
aws.amazon.com/jp/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-i-strategies-for-recovery-in-the-cloud/

And don’t worry! It’s easy to make your system multi-region 😚
...excluding the database 🥶

Why is the database difficult in multi-region?

Difficulties with multiple databases
When you have multiple DBs, for example...
• To get high availability
• To make the system DR-ready
• To enable Blue/Green deployment
“You build multi-DB, you run it consistently!”
(Photo: Werner Vogels, from Wikipedia; however, he never said the above)

Difficulties with multiple databases
Popular solutions for keeping DBs consistent:
From application
• 2-phase commit (2PC)
• Saga pattern
From database
• Logical replication
• Physical replication ☝ You can use it easily with Amazon Aurora!

Yes, that’s Global Database

What is Global Database?

Amazon Aurora Global Database
• Storage replication across regions
• High performance with physical volume replication
[Diagram: Aurora cluster (primary), Tokyo region, writer & readers, cluster volumes; outbound replication; Aurora cluster (secondary), Osaka region, readers, cluster volumes]

Amazon Aurora Global Database
Supported engines:
On Aurora MySQL
• Ver 2.11+ (minor versions)
On Aurora PostgreSQL
• Ver 11.17+ (minor versions)
• Ver 12.12+ (minor versions)
• Ver 13.8+ (minor versions)
• Ver 14.5+ (minor versions)

How to use Global Database
Just click “Add AWS Region” on an existing cluster, and that’s it 👍
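
The same step can also be scripted instead of clicked. The sketch below is an illustration only and assumes the two client arguments are boto3 RDS clients (for example `boto3.client("rds", region_name=...)`) for the primary and secondary regions. The function name and all identifiers are hypothetical.

```python
def enable_global_database(primary_rds, secondary_rds, *, global_id,
                           primary_cluster_arn, secondary_cluster_id, engine):
    """Turn an existing cluster into a global database and add a second region.

    primary_rds / secondary_rds are assumed to be boto3 RDS clients for the
    primary and secondary regions. Both are passed in so the function can be
    exercised with stub clients.
    """
    # Wrap the existing cluster in a new global cluster; it becomes the primary.
    primary_rds.create_global_cluster(
        GlobalClusterIdentifier=global_id,
        SourceDBClusterIdentifier=primary_cluster_arn,
    )
    # Create the secondary cluster in the other region, attached to the global cluster.
    secondary_rds.create_db_cluster(
        DBClusterIdentifier=secondary_cluster_id,
        Engine=engine,
        GlobalClusterIdentifier=global_id,
    )
```

A real setup would likely need more parameters (engine version, subnet group, encryption key) and at least one DB instance in the secondary cluster. These are omitted here.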

Demo 1: Enable Global Database

Congratulations 🎉 It’s global now. ...is that all?? 🥳

No, it’s just the beginning... What you should do in an actual disaster matters!

Two options for Global Database failover

What should you do in a disaster?
We have 2 options for failing over between regions in Global Database
Planned failover (managed)
• One click on the console
• Cannot be used in an emergency
Unplanned failover
• Manual operation with several steps
• Available even in a disaster
👇 Use it first!

Operation steps on disaster
When you run Global Database in Tokyo (primary) and Osaka (secondary)...
[Diagram: Aurora cluster (primary), Tokyo region; Aurora cluster (secondary), Osaka region]

Operation steps on disaster
Disaster occurred!
[Diagram: Aurora cluster (primary), Tokyo region; Aurora cluster (secondary), Osaka region]

Operation steps on disaster
Then you should remove the secondary cluster from Global Database
[Diagram: Aurora cluster (primary), Tokyo region; Aurora cluster (standalone), Osaka region; Remove from Global DB]
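
As a rough sketch, the detach step could be scripted as follows. It assumes `secondary_rds` is a boto3 RDS client for the secondary (Osaka) region. The function name and arguments are hypothetical, and this is not the exact procedure used in the talk.

```python
def detach_secondary(secondary_rds, *, global_id, secondary_cluster_arn):
    """Detach the secondary cluster so it becomes a standalone cluster.

    secondary_rds is assumed to be a boto3 RDS client for the secondary region.
    After detaching, the cluster no longer receives replicated changes and
    can accept writes on its own.
    """
    return secondary_rds.remove_from_global_cluster(
        GlobalClusterIdentifier=global_id,
        DbClusterIdentifier=secondary_cluster_arn,  # ARN of the cluster to detach
    )
```

Because the call is made against the secondary region, it does not depend on the primary region being reachable.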

Operation steps after disaster
After the disaster has passed, you can rebuild Global Database from Osaka (new primary)
[Diagram: Aurora cluster (old), Tokyo region; Aurora cluster (primary), Osaka region; Aurora cluster (secondary); Rebuild GDB]

Demo 2: Unplanned failover in an emergency

Operation steps after disaster
On a peaceful day, you can switch the regions back with managed planned failover
[Diagram: Aurora cluster (old), Tokyo region; Aurora cluster (secondary), Osaka region; Aurora cluster (primary); Managed failover]
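
For reference, the managed planned failover can also be triggered through the API rather than the console. This sketch assumes `rds` is a boto3 RDS client. The function name and identifiers are hypothetical.

```python
def switch_back(rds, *, global_id, target_cluster_arn):
    """Ask Aurora to make the target secondary cluster the new primary.

    rds is assumed to be a boto3 RDS client. target_cluster_arn is the ARN of
    the secondary cluster that should take over as primary.
    """
    return rds.failover_global_cluster(
        GlobalClusterIdentifier=global_id,
        TargetDbClusterIdentifier=target_cluster_arn,
    )
```

Unlike the unplanned procedure, this keeps both clusters inside the same global database, so no rebuild is needed afterwards.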

Demo 3: Planned failover (managed)

What I learned by using Global Database in production

Make a DR operation flowchart beforehand!
Disaster occurred!
• Decide to activate DR
• Check health of Osaka cluster using SQL
• Shut requests out by enabling redirect
• Check replication lag
• Modify DNS record to switch Aurora endpoints
• Unplanned failover on Global Database
• Check health of app on Osaka
(all within RTO)
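
The operational steps of this flowchart can be sketched as an ordered runbook that stops at the first failed step and reports whether the whole run stayed within the RTO. The step names come from the slide. The runner, its handler mapping and the RTO value passed in are hypothetical illustrations, not the tooling used in the talk.

```python
import time

# Operational steps from the flowchart, in the order they appear on the slide.
DR_STEPS = [
    "Check health of Osaka cluster using SQL",
    "Shut requests out by enabling redirect",
    "Check replication lag",
    "Modify DNS record to switch Aurora endpoints",
    "Unplanned failover on Global Database",
    "Check health of app on Osaka",
]


def run_runbook(steps, handlers, rto_seconds, clock=time.monotonic):
    """Run each step in order and stop at the first step that reports failure.

    handlers maps a step name to a zero-argument callable returning True on
    success. Returns (completed_steps, elapsed_seconds, within_rto).
    """
    start = clock()
    completed = []
    for step in steps:
        if not handlers[step]():
            break
        completed.append(step)
    elapsed = clock() - start
    within_rto = len(completed) == len(steps) and elapsed <= rto_seconds
    return completed, elapsed, within_rto
```

Keeping the steps as data means the same list can drive both the written flowchart and the automation, so the two are less likely to drift apart.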

You will make mistakes in an emergency. Automate it!
In my case...
• Automated all the operations with Exastro (Japanese software)
• Separated operations by group, making it easy to adapt flexibly to the situation
• Using Ansible for the included operations on on-prem network components
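
One flowchart step that lends itself to automation is the replication lag check. The sketch below assumes `cloudwatch` is a boto3 CloudWatch client for the secondary region and reads the `AuroraGlobalDBReplicationLag` metric (reported in milliseconds). The `DBClusterIdentifier` dimension, the 5-minute window and the function name are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def latest_replication_lag_ms(cloudwatch, secondary_cluster_id, now=None):
    """Return the newest replication lag datapoint in milliseconds, or None.

    cloudwatch is assumed to be a boto3 CloudWatch client. The dimension
    used to select the secondary cluster is an assumption.
    """
    now = now or datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="AuroraGlobalDBReplicationLag",
        Dimensions=[{"Name": "DBClusterIdentifier", "Value": secondary_cluster_id}],
        StartTime=now - timedelta(minutes=5),
        EndTime=now,
        Period=60,
        Statistics=["Maximum"],
    )
    datapoints = response.get("Datapoints", [])
    if not datapoints:
        return None  # no data in the window: treat as "unknown", not as zero lag
    newest = max(datapoints, key=lambda point: point["Timestamp"])
    return newest["Maximum"]
```

Returning `None` when no datapoints exist keeps "unknown" distinct from "no lag", which matters when the primary region itself is unreachable.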

And practice DR operations regularly!
If you can, plan regular training with all the stakeholders
• Operations team
• Infrastructure developers (including DBA)
• Application developers
• Management (who can decide to activate DR)

Thank you!
Give me your feedback!
Minoru Onda
How to make DR-ready system with Amazon Aurora Global Database
