
2020-01-08 Thinking after that disaster in cloud

Sammy Lin
January 08, 2020


Transcript

  1. Availability = 100 x (Calendar Time - Downtime) / Calendar Time

    Reliability: MTBF (Mean Time Between Failures) = Operating time (hours) / Number of Failures
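The two formulas on this slide can be checked with a few lines of Python. This is only a sketch: the function names and the sample figures (a 30-day month, 3 hours of downtime, 2 failures) are hypothetical and do not come from the talk.

```python
def availability_percent(calendar_hours: float, downtime_hours: float) -> float:
    """Availability = 100 x (calendar time - downtime) / calendar time."""
    if calendar_hours <= 0:
        raise ValueError("calendar_hours must be positive")
    return 100.0 * (calendar_hours - downtime_hours) / calendar_hours


def mtbf_hours(operating_hours: float, failures: int) -> float:
    """MTBF = operating time (hours) / number of failures."""
    if failures <= 0:
        raise ValueError("MTBF is undefined without at least one failure")
    return operating_hours / failures


if __name__ == "__main__":
    # Hypothetical figures: a 30-day month with 3 h of downtime and 2 failures.
    calendar = 30 * 24          # 720 calendar hours
    downtime = 3.0
    operating = calendar - downtime
    print(f"availability: {availability_percent(calendar, downtime):.3f}%")
    print(f"MTBF: {mtbf_hours(operating, 2):.1f} h")
```

Availability measures how much of the calendar time the service was usable, while MTBF measures how often it fails; a service can score well on one and poorly on the other.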
  2. 17Media SRE Journey

    2015: 17Media founded, built on AWS • 2016/8: first DevOps Engineer joined • 2017/1: migrated from Beanstalk to ECS • 2018/5: migrated to GCP • 2018/6: migrated from Node.js to Golang • 2019/8: building GRE
  3. 17Media SRE Journey

    2015: 17Media founded, built on AWS • 2016/8: first DevOps Engineer joined • 2017/1: migrated from Beanstalk to ECS • 2018/5: migrated to GCP • 2018/6: migrated from Node.js to Golang • 2019/8: building GRE • 2019: started thinking about and planning for Disaster Recovery
  4. Assumptions

    1. Someone drops a database or workload through a fat-finger error.
    2. An entire region is down (global services are not considered).
  5. Timeline

    Status updates: Investigating • Identified • Still Resolving • Started DR
    Phases: PHASE 1 12:00 • PHASE 2 15:00 • PHASE 3 17:00 • PHASE 4 18:00
  6. Timeline

    Status updates: Investigating • Identified • Still Resolving • Still Resolving • Enabled throttle • Started DR
    Phases: PHASE 1 12:00 • PHASE 2 15:00 • PHASE 3 17:00 • PHASE 4 18:00 • PHASE 5 20:00
  7. Timeline

    Status updates: Investigating • Identified • Still Resolving • Still Resolving • Enabled throttle • Resolved • Started DR
    Phases: PHASE 1 12:00 • PHASE 2 15:00 • PHASE 3 17:00 • PHASE 4 18:00 • PHASE 5 20:00 • PHASE 6 22:30
  9. 1. Set up Emergency Response Team & SOP

    Roles: Owner, Communication Lead, Developer Lead, BU Lead (x2), Developer (x2)
  10. 2. Phased DR implementation

    1. Cold DR
    2. Hot DR
    3. HA

    Image from https://picsart.com/i/sticker-cute-rainbow-cloud-sprinkle-sparkle-glitter-kawaii-293803594026211
  11. 2-1. Cold DR implementation

    Diagram labels: VPC (x2), Private, VPN Gateway, Customer Gateway, VPN Connection; MySQL, Mongo, Redis; Amazon EKS, Kubernetes; Elastic Load Balancing (x2); Amazon Route 53 Weighted Records: 100%
  13. 2-1. Cold DR implementation

    Diagram labels: VPC (x2), Private, VPN Gateway, Customer Gateway, VPN Connection; MySQL, Mongo, Redis; Amazon EKS, Kubernetes; Elastic Load Balancing (x2); Amazon Route 53 Weighted Records: 60% / 40%
  14. 2-1. Cold DR implementation

    MTTR: approximately 30 minutes
    Pros: very cheap
    Cons: high latency; breaks down if the infrastructure changes
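The diagrams shift traffic between the two sites with Amazon Route 53 weighted records (100%, then 60% / 40%). The sketch below builds the `ChangeBatch` payload that boto3's `change_resource_record_sets` accepts and computes the resulting traffic split. The record name, the load balancer hostnames, the `primary`/`dr` identifiers, and the use of CNAME records are all assumptions made for illustration; the talk does not show its actual records.

```python
def weighted_change_batch(record_name, targets, ttl=60):
    """Build a Route 53 ChangeBatch that UPSERTs one weighted CNAME per target.

    `targets` maps a SetIdentifier (hypothetical names such as "primary" and
    "dr") to a (dns_name, weight) pair.
    """
    changes = []
    for set_id, (dns_name, weight) in targets.items():
        if not 0 <= weight <= 255:
            raise ValueError("Route 53 weights must be integers in 0..255")
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "SetIdentifier": set_id,
                "Weight": weight,
                "TTL": ttl,
                "ResourceRecords": [{"Value": dns_name}],
            },
        })
    return {"Comment": "shift traffic between sites", "Changes": changes}


def traffic_share(targets):
    """Fraction of DNS answers per target: weight / sum of all weights."""
    total = sum(weight for _, weight in targets.values())
    if total == 0:
        # With every weight at 0, Route 53 answers with equal probability.
        return {set_id: 1 / len(targets) for set_id in targets}
    return {set_id: weight / total for set_id, (_, weight) in targets.items()}


if __name__ == "__main__":
    # Hypothetical hostnames; 60/40 mirrors the split shown in the diagram.
    targets = {
        "primary": ("primary-lb.example.com", 60),
        "dr": ("dr-lb.example.com", 40),
    }
    batch = weighted_change_batch("api.example.com.", targets)
    print(traffic_share(targets))
    # To apply it, assuming boto3 and credentials are available:
    #   boto3.client("route53").change_resource_record_sets(
    #       HostedZoneId="<your zone id>", ChangeBatch=batch)
```

Because resolvers cache answers for the record's TTL, a short TTL is one of the factors that bounds how quickly a weight change takes effect during a failover.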
  15. 2-2. Hot DR implementation

    Diagram labels: VPC (x2), Private, VPN Gateway, Customer Gateway, VPN Connection; MySQL, Mongo, Redis; Amazon EKS, Kubernetes; Elastic Load Balancing (x2); Amazon Route 53 Weighted Records
  16. 2-2. Hot DR implementation

    Diagram labels: VPC (x2), Private, VPN Gateway, Customer Gateway, VPN Connection; MySQL, Mongo, Redis; Amazon EKS, Kubernetes; Elastic Load Balancing (x2); Amazon Route 53 Weighted Records: 90% / 10%
  18. 2-2. Hot DR implementation

    Diagram labels: VPC (x2), Private, VPN Gateway, Customer Gateway, VPN Connection; MySQL, Mongo, Redis; Amazon EKS, Kubernetes; Elastic Load Balancing (x2); Amazon Route 53 Weighted Records: 90% / 10%; replicas: Redis (slave), Mongo (secondary), MySQL (slave)
  19. 2-2. Hot DR implementation

    MTTR: approximately 5 minutes
    Pros: low read latency; immediate notice if the infrastructure changes
    Cons: high write latency; expensive
  20. 2-3. HA implementation of multi-region/cloud

    MTTR: no downtime
    Pros: low read/write latency; immediate notice if the infrastructure changes
    Cons: very expensive; very complicated
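To put the three MTTR figures (about 30 minutes, about 5 minutes, no downtime) side by side, the sketch below plugs them into the availability formula from the first slide. It assumes, purely for illustration, one region-level incident per 30-day month whose outage lasts exactly the quoted MTTR; the incident rate is not from the talk.

```python
# MTTR per phase as quoted in the slides, in minutes.
MTTR_MINUTES = {"Cold DR": 30.0, "Hot DR": 5.0, "HA": 0.0}


def monthly_availability(mttr_minutes, incidents_per_month=1, days=30):
    """Availability (%) when each incident costs `mttr_minutes` of downtime.

    `incidents_per_month` and `days` are hypothetical inputs.
    """
    calendar_minutes = days * 24 * 60
    downtime = mttr_minutes * incidents_per_month
    return 100.0 * (calendar_minutes - downtime) / calendar_minutes


if __name__ == "__main__":
    for phase, mttr in MTTR_MINUTES.items():
        print(f"{phase:8s} MTTR {mttr:4.0f} min -> {monthly_availability(mttr):.3f}%")
```

Under these assumptions the gap between the phases is a fraction of a percentage point, which is why the slides weigh the cost and complexity of each phase as heavily as its recovery time.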
  21. Thank you

    Check our job page! 17 Media: We're hiring
    Site Reliability Engineer - Infrastructure
    Site Reliability Engineer - Automation & Tools
    Backend Engineer