Slide 1

Slide 1 text

How to make a DR-ready system with Amazon Aurora Global Database
Minoru Onda (@minorun365), Software Engineer, KDDI Corporation & KDDI Agile Development Center Corporation
AWS Community Builders APJ Open Mic – May

Slide 2

Slide 2 text

$ whoami
> Minoru Onda (@minorun365)
Software Engineer / KDDI & KAG (concurrently)
Co-lead in communities
• JAWS-UG SRE Branch
• JAWS DAYS 2022
Awards
• KDDI Cloud SAMURAI 2021
• KDDI Cloud Ambassadors 2021

Slide 3

Slide 3 text

Do you love Amazon Aurora?

Slide 4

Slide 4 text

Of course I do! 😍

Slide 5

Slide 5 text

Amazon Aurora is a cloud-native RDB
• Distributed instances
• Fast replication across storage in 3 AZs
• Autoscaling of replicas and volumes
docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Overview.html

Slide 6

Slide 6 text

Where I am using Aurora
Customer service application in mobile shops

Slide 7

Slide 7 text

Where I am using Aurora
Hosted on EKS and connected to many on-prem systems
[Diagram: VPC in Tokyo region with EKS on EC2 and Aurora; iPads in shops and on-prem workloads each reach the VPC over a private network and DX]

Slide 8

Slide 8 text

Started DR planning after a Direct Connect outage
We faced a huge DX outage in 2021 and started multi-region planning
[Diagram: VPC in Tokyo region with EKS on EC2 and Aurora; iPads in shops and on-prem workloads each reach the VPC over a private network and DX]

Slide 9

Slide 9 text

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Then, how to enable BCP?

Slide 10

Slide 10 text

Basic strategies of disaster recovery (DR) on AWS
Clarify your requirements!
• Recovery time objective (RTO)
• Recovery point objective (RPO)
aws.amazon.com/jp/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-i-strategies-for-recovery-in-the-cloud/
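
To make the two requirements concrete, here is a minimal Python sketch. RTO is read as the longest acceptable downtime and RPO as the longest acceptable window of lost data. The class name, field names, and all numbers are illustrative assumptions, not values from the talk.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DrRequirements:
    """Hypothetical container for the two DR objectives."""

    rto: timedelta  # recovery time objective: longest acceptable downtime
    rpo: timedelta  # recovery point objective: longest acceptable window of lost data

    def is_met_by(self, downtime: timedelta, data_loss_window: timedelta) -> bool:
        """Return True when a drill (or a real incident) stayed within both objectives."""
        return downtime <= self.rto and data_loss_window <= self.rpo


# Illustrative numbers only.
requirements = DrRequirements(rto=timedelta(hours=1), rpo=timedelta(minutes=5))
drill_ok = requirements.is_met_by(
    downtime=timedelta(minutes=40),
    data_loss_window=timedelta(seconds=2),
)
print(drill_ok)
```

Writing the objectives down as data like this makes it possible to check every DR drill against the same thresholds.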

Slide 11

Slide 11 text

Basic strategies of disaster recovery (DR) on AWS
Select the plan that best suits your requirements
• Backup & restore
• Pilot light
• Warm standby
• Multi-site active/active
aws.amazon.com/jp/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-i-strategies-for-recovery-in-the-cloud/

Slide 12

Slide 12 text

Basic strategies of disaster recovery (DR) on AWS
Select the plan that best suits your requirements
• Backup & restore
• Pilot light
• Warm standby
• Multi-site active/active
☝ We chose it!
aws.amazon.com/jp/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-i-strategies-for-recovery-in-the-cloud/

Slide 13

Slide 13 text

And don’t worry! It’s easy to make your system multi-region 😚

Slide 14

Slide 14 text

And don’t worry! It’s easy to make your system multi-region 😚 ...except for the database 🥶

Slide 15

Slide 15 text

Why is the database difficult in multi-region?

Slide 16

Slide 16 text

Difficulties with multiple databases
When you have multiple DBs, for example...
• To get high availability
• To be DR-ready
• To enable blue/green deployment
“You build multi-DB, you run it consistently!”
(Photo: Werner Vogels, from Wikipedia. He never actually said this.)

Slide 17

Slide 17 text

Difficulties with multiple databases
Popular solutions for keeping DBs consistent:
From the application
• Two-phase commit (2PC)
• Saga pattern
From the database
• Logical replication
• Physical replication

Slide 18

Slide 18 text

Difficulties with multiple databases
Popular solutions for keeping DBs consistent:
From the application
• Two-phase commit (2PC)
• Saga pattern
From the database
• Logical replication
• Physical replication ☝ You can use it easily with Amazon Aurora!

Slide 19

Slide 19 text

Yes, that’s Global Database

Slide 20

Slide 20 text

What is Global Database?

Slide 21

Slide 21 text

Amazon Aurora Global Database
• Storage replication across regions
• High performance with physical volume replication
[Diagram: Aurora cluster (primary) in Tokyo region with writer & readers and cluster volumes; outbound replication to Aurora cluster (secondary) in Osaka region with readers and cluster volumes]

Slide 22

Slide 22 text

Amazon Aurora Global Database
Supported engines:
On Aurora MySQL
• Ver 2.11+ (minor versions)
On Aurora PostgreSQL
• Ver 11.17+ (minor versions)
• Ver 12.12+ (minor versions)
• Ver 13.8+ (minor versions)
• Ver 14.5+ (minor versions)

Slide 23

Slide 23 text

How to use Global Database
Just click “Add AWS Region” on an existing cluster. That’s it 👍
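
For anyone who prefers scripting over the console, the same result can be sketched with the boto3 RDS client. This is an approximation under assumptions, not a record of what the console does internally: the function name and every identifier are placeholders, and real clusters usually need more parameters (subnet group, security groups, a KMS key for encrypted storage). The clients are passed in so the function can be exercised with stand-ins.

```python
def enable_global_database(rds_primary, rds_secondary, *, global_id,
                           source_cluster_arn, secondary_cluster_id,
                           engine, engine_version, instance_class):
    """Turn an existing Aurora cluster into a Global Database and add a secondary region.

    rds_primary and rds_secondary are boto3 RDS clients for each region.
    """
    # 1. Create the global cluster with the existing cluster as its primary.
    rds_primary.create_global_cluster(
        GlobalClusterIdentifier=global_id,
        SourceDBClusterIdentifier=source_cluster_arn,
    )
    # 2. Create a cluster in the second region and join it to the global cluster.
    rds_secondary.create_db_cluster(
        DBClusterIdentifier=secondary_cluster_id,
        Engine=engine,
        EngineVersion=engine_version,
        GlobalClusterIdentifier=global_id,
    )
    # 3. Add a reader instance so the secondary cluster can serve queries.
    rds_secondary.create_db_instance(
        DBInstanceIdentifier=f"{secondary_cluster_id}-1",
        DBClusterIdentifier=secondary_cluster_id,
        DBInstanceClass=instance_class,
        Engine=engine,
    )


# Usage (needs boto3 and AWS credentials; every identifier below is a placeholder):
# import boto3
# enable_global_database(
#     boto3.client("rds", region_name="ap-northeast-1"),  # Tokyo
#     boto3.client("rds", region_name="ap-northeast-3"),  # Osaka
#     global_id="my-global-db",
#     source_cluster_arn="arn:aws:rds:ap-northeast-1:123456789012:cluster:tokyo-cluster",
#     secondary_cluster_id="osaka-cluster",
#     engine="aurora-mysql",
#     engine_version="5.7.mysql_aurora.2.11.2",
#     instance_class="db.r6g.large",
# )
```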

Slide 24

Slide 24 text

Demo 1: Enable Global Database

Slide 25

Slide 25 text

Congratulations 🎉 It’s global now. ...Is that all?? 🥳

Slide 26

Slide 26 text

No, it’s just the beginning... What you should do in an actual disaster matters!

Slide 27

Slide 27 text

Two options for failing over Global Database

Slide 28

Slide 28 text

What should you do in a disaster?
We have two options for failing over between regions in Global Database
Planned failover (managed)
• One click on the console
• Cannot be used in an emergency
Unplanned failover
• Manual operation with several steps
• Available even during a disaster

Slide 29

Slide 29 text

What should you do in a disaster?
We have two options for failing over between regions in Global Database
Planned failover (managed)
• One click on the console
• Cannot be used in an emergency
Unplanned failover
• Manual operation with several steps
• Available even during a disaster
👇 Use it first!

Slide 30

Slide 30 text

Operation steps in a disaster
When you run Global Database in Tokyo (primary) and Osaka (secondary)...
[Diagram: Aurora cluster (primary) in Tokyo region, marked as Primary, replicating to Aurora cluster (secondary) in Osaka region]

Slide 31

Slide 31 text

Operation steps in a disaster
Disaster occurred!
[Diagram: Aurora cluster (primary) in Tokyo region; Aurora cluster (secondary) in Osaka region]

Slide 32

Slide 32 text

Operation steps in a disaster
Then you should remove the secondary cluster from Global Database
[Diagram: Aurora cluster (primary) in Tokyo region; Aurora cluster in Osaka region removed from Global DB and now standalone]
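
The removal step can be sketched with the boto3 RDS call `remove_from_global_cluster`. The function name and identifiers are placeholders. It assumes the call is made with a client in the secondary (Osaka) region, since the primary region may be unreachable. Once detached, the cluster becomes a standalone cluster that accepts writes, and any changes that had not yet replicated are lost.

```python
def detach_secondary(rds_secondary, *, global_id, secondary_cluster_arn):
    """Detach the secondary cluster from the Global Database so it runs standalone."""
    return rds_secondary.remove_from_global_cluster(
        GlobalClusterIdentifier=global_id,
        DbClusterIdentifier=secondary_cluster_arn,  # this parameter takes the cluster ARN
    )


# Usage (needs boto3 and AWS credentials; identifiers are placeholders):
# import boto3
# detach_secondary(
#     boto3.client("rds", region_name="ap-northeast-3"),  # Osaka
#     global_id="my-global-db",
#     secondary_cluster_arn="arn:aws:rds:ap-northeast-3:123456789012:cluster:osaka-cluster",
# )
```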

Slide 33

Slide 33 text

Operation steps after a disaster
After the disaster has passed, you can rebuild Global Database from Osaka (the new primary)
[Diagram: Aurora cluster (old) in Tokyo region; Aurora cluster (primary) in Osaka region, marked as Primary; Global Database rebuilt with a new Aurora cluster (secondary) in Tokyo region]

Slide 34

Slide 34 text

Demo 2: Unplanned failover in an emergency

Slide 35

Slide 35 text

Operation steps after a disaster
After the disaster has passed, you can rebuild Global Database from Osaka (the new primary)
[Diagram: Aurora cluster (old) in Tokyo region; Aurora cluster (primary) in Osaka region, marked as Primary; Global Database rebuilt with a new Aurora cluster (secondary) in Tokyo region]

Slide 36

Slide 36 text

Operation steps after a disaster
On a peaceful day, you can switch the regions back with managed planned failover
[Diagram: Aurora cluster (old) in Tokyo region; managed failover swaps the roles, so the Tokyo cluster becomes the primary and the Osaka cluster becomes the secondary]
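
The switch-back can also be scripted. This sketch uses the boto3 RDS call `failover_global_cluster`, where the target is the cluster that should become the new primary. The function name and identifiers are placeholders, and calling it with a client in the current primary region is an assumption based on the AWS CLI examples.

```python
def managed_planned_failover(rds, *, global_id, target_cluster_arn):
    """Ask Aurora to swap roles so that target_cluster_arn becomes the primary."""
    return rds.failover_global_cluster(
        GlobalClusterIdentifier=global_id,
        TargetDbClusterIdentifier=target_cluster_arn,  # ARN of the secondary to promote
    )


# Usage (needs boto3 and AWS credentials; identifiers are placeholders):
# import boto3
# managed_planned_failover(
#     boto3.client("rds", region_name="ap-northeast-3"),  # Osaka, the current primary
#     global_id="my-global-db",
#     target_cluster_arn="arn:aws:rds:ap-northeast-1:123456789012:cluster:tokyo-cluster-new",
# )
```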

Slide 37

Slide 37 text

Demo 3: Planned failover (managed)

Slide 38

Slide 38 text

What I learned from using Global Database in production

Slide 39

Slide 39 text

Make a DR operation flowchart beforehand!
Disaster occurred!

Slide 40

Slide 40 text

Make a DR operation flowchart beforehand!
Disaster occurred! → Decide to activate DR

Slide 41

Slide 41 text

Make a DR operation flowchart beforehand!
Disaster occurred! → Decide to activate DR → Check health of Osaka cluster using SQL

Slide 42

Slide 42 text

Make a DR operation flowchart beforehand!
Disaster occurred! → Decide to activate DR → Check health of Osaka cluster using SQL → Shut requests out by enabling redirect

Slide 43

Slide 43 text

Make a DR operation flowchart beforehand!
Disaster occurred! → Decide to activate DR → Check health of Osaka cluster using SQL → Shut requests out by enabling redirect → Check replication lag
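
One way to check replication lag is the CloudWatch metric `AuroraGlobalDBReplicationLag` (reported in milliseconds in the `AWS/RDS` namespace). The talk does not say which method was used, so treat this as a sketch under assumptions: the function names, the 10-minute window, and reading the metric from a CloudWatch client in the secondary region are all illustrative choices.

```python
from datetime import datetime, timedelta, timezone


def latest_replication_lag_ms(cloudwatch, secondary_cluster_id, now=None):
    """Return the most recent replication lag datapoint in milliseconds, or None if there is none."""
    now = now or datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="AuroraGlobalDBReplicationLag",
        Dimensions=[{"Name": "DBClusterIdentifier", "Value": secondary_cluster_id}],
        StartTime=now - timedelta(minutes=10),
        EndTime=now,
        Period=60,
        Statistics=["Maximum"],
    )
    # Datapoints are not guaranteed to arrive in time order, so sort them first.
    datapoints = sorted(response["Datapoints"], key=lambda d: d["Timestamp"])
    return datapoints[-1]["Maximum"] if datapoints else None


def lag_within_rpo(lag_ms, rpo):
    """True only when a lag value is known and does not exceed the RPO."""
    return lag_ms is not None and timedelta(milliseconds=lag_ms) <= rpo


# Usage (needs boto3 and AWS credentials; the cluster name is a placeholder):
# import boto3
# lag = latest_replication_lag_ms(
#     boto3.client("cloudwatch", region_name="ap-northeast-3"), "osaka-cluster")
# print(lag_within_rpo(lag, rpo=timedelta(seconds=5)))
```

Treating a missing datapoint as "not within RPO" keeps the check conservative: no data is a reason to stop and look, not a reason to proceed.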

Slide 44

Slide 44 text

Make a DR operation flowchart beforehand!
Disaster occurred! → Decide to activate DR → Check health of Osaka cluster using SQL → Shut requests out by enabling redirect → Check replication lag → Modify DNS record to switch Aurora endpoints
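
The DNS step depends on which DNS service fronts the Aurora endpoints, which the talk does not specify. Assuming Amazon Route 53 and a CNAME record that the application uses as its database hostname, the switch could be sketched as below. The function names, record name, hosted zone ID, endpoint, and TTL are all placeholders.

```python
def build_cname_switch(record_name, new_endpoint, ttl=60):
    """Build a Route 53 change batch that points record_name at new_endpoint."""
    return {
        "Comment": "DR: switch Aurora endpoint",
        "Changes": [{
            "Action": "UPSERT",  # create the record, or overwrite it if it exists
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "TTL": ttl,
                "ResourceRecords": [{"Value": new_endpoint}],
            },
        }],
    }


def switch_endpoint(route53, hosted_zone_id, record_name, new_endpoint):
    """Apply the change batch with a boto3 Route 53 client."""
    return route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch=build_cname_switch(record_name, new_endpoint),
    )


# Usage (needs boto3 and AWS credentials; all values are placeholders):
# import boto3
# switch_endpoint(
#     boto3.client("route53"),
#     hosted_zone_id="Z0000000EXAMPLE",
#     record_name="db.internal.example.com.",
#     new_endpoint="osaka-cluster.cluster-example.ap-northeast-3.rds.amazonaws.com",
# )
```

A short TTL on this record matters: clients keep using the old endpoint until their cached answer expires, and that wait counts against the RTO.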

Slide 45

Slide 45 text

Make a DR operation flowchart beforehand!
Disaster occurred! → Decide to activate DR → Check health of Osaka cluster using SQL → Shut requests out by enabling redirect → Check replication lag → Modify DNS record to switch Aurora endpoints → Unplanned failover on Global Database

Slide 46

Slide 46 text

Make a DR operation flowchart beforehand!
Disaster occurred! → Decide to activate DR → Check health of Osaka cluster using SQL → Shut requests out by enabling redirect → Check replication lag → Modify DNS record to switch Aurora endpoints → Unplanned failover on Global Database → Check health of app in Osaka

Slide 47

Slide 47 text

Make a DR operation flowchart beforehand!
Disaster occurred! → Decide to activate DR → Check health of Osaka cluster using SQL → Shut requests out by enabling redirect → Check replication lag → Modify DNS record to switch Aurora endpoints → Unplanned failover on Global Database → Check health of app in Osaka (all within RTO)
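
The flowchart can be kept next to the code as an ordered runbook. The sketch below is plain Python, not the tooling used in the talk. It runs the technical steps in order, stops at the first failure, and reports whether the whole run stayed within the RTO. The two human steps (noticing the disaster and deciding to activate DR) are left out, and the actions are placeholders that always succeed.

```python
import time
from datetime import timedelta

# Technical steps from the flowchart, in order.
RUNBOOK = [
    "Check health of Osaka cluster using SQL",
    "Shut requests out by enabling redirect",
    "Check replication lag",
    "Modify DNS record to switch Aurora endpoints",
    "Unplanned failover on Global Database",
    "Check health of app in Osaka",
]


def run_runbook(steps, actions, rto, clock=time.monotonic):
    """Run steps in order, stopping at the first failure.

    actions maps each step name to a zero-argument callable that returns
    True on success. clock is injectable so drills can simulate elapsed time.
    """
    started = clock()
    completed = []
    for step in steps:
        if not actions[step]():
            break  # stop here so an operator can decide what to do next
        completed.append(step)
    elapsed = timedelta(seconds=clock() - started)
    return {
        "completed": completed,
        "finished": len(completed) == len(steps),
        "within_rto": elapsed <= rto,
    }


# Dry run with placeholder actions; the 1-hour RTO is illustrative.
result = run_runbook(
    RUNBOOK,
    {step: (lambda: True) for step in RUNBOOK},
    rto=timedelta(hours=1),
)
print(result["finished"], result["within_rto"])
```

Stopping at the first failed step, instead of continuing, reflects the order of the flowchart: each step is a precondition for the next one.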

Slide 48

Slide 48 text

You will make mistakes in an emergency. Automate it!
In my case...
• Automated all the operations with Exastro (Japanese software)
• Separated the operations into groups, making it easy to adapt to the situation
• Used Ansible for the operations on on-prem network components

Slide 49

Slide 49 text

And practice DR operations regularly!
If you can, plan regular training with all the stakeholders
• Operations team
• Infrastructure developers (including DBA)
• Application developers
• Management (who can decide to activate DR)

Slide 50

Slide 50 text

Thank you! Give me your feedback!
Minoru Onda
How to make a DR-ready system with Amazon Aurora Global Database