Architecting Multi Region Critical SaaS Solution on AWS

#OGinsights @ekelog Architecting Multi Region Critical SaaS Solution on AWS
Abdurrahim Eke Co-founder & CSRO @ OpsGenie @ekelog @OpsGenieEng

Outline
• Journey through modern building blocks
• Role of OpsGenie
• Motivation for high reliability
• In Action
  ◦ S3
  ◦ DynamoDB
  ◦ Disaster Monitoring
  ◦ Cross Region Traffic Routing Alternatives

* OpsGenie Engineering

DevOps - CI / CD - Evolution
* https://hackernoon.com/has-devops-changed-the-role-of-a-tester-b140dddc7824

DevOps Tools
* https://www.enterpriseirregulars.com/116202/race-pipeline-atlassian-aint-playin-introducing-devops-marketplace/

AI Tools
* https://www.topbots.com/essential-landscape-overview-enterprise-artificial-intelligence/

Amazon Web Services
* AWS Console
Decision factors
• Cost
• Scalability
• Features
• Disaster Cases
• Monitoring
• Engineering Effort
• Does it really suit?
• Does it support agility or not?
Not all services are as well built as others
AWS is also iterative

An Analysis of Modern Tools
Pros
• Focus on higher level benefits
• Faster business growth
• Fast response to incidents
• Higher levels of reliability, scalability, security
Cons
• Increased complexity
• Learning curve
• New tools required for monitoring & management
Inner Voice: You must be kidding. Do we really need more tools for managing more tools?

Monitoring - Observability are Hot Topics
All services generate
• Events
• Metrics
• Logs
• Errors
Impact mostly unknown
Alerts: identified problems, operators should respond
Incidents: major outage, multiple teams should respond
Inner Voice: How many tools can there be in the monitoring domain?

Role of OpsGenie in IT Operations
IT Systems, Factories, Scientific Labs, Food trucks, ...
Real World
Monitoring, Alerting, Alert Management, Notifications, Security tools, Custom software, User Actions
Operations
Notify via phone call, SMS, mobile app, emails
Notify right person, right time
Notify collaboration tools & IT systems
Help operator solve fast
...
Need to understand & solve fast
Needs clean & rich information from multiple systems
Ticketing, Collaboration, IT management, ...

OpsGenie Integrates With >150 Tools

Motivation For High Reliability - Chaos Engineering
• Always up & running, trustworthy
• Fast response to incidents
• Monitoring systems should be more available than the systems they monitor
• To have fun, to sleep well
* https://aws.amazon.com/message/41926/
Who watches the watcher?

In Action
Questions Welcome

Reliability Engineering Challenges
• Almost all systems deserve reliability attention
• Any new technology in the stack requires new effort
• The level of effort for reliability engineering depends on the reliability of the services
• Each service tries to improve reliability on its own
* OpsGenie Engineering

AWS Reliability Offering
• Fully managed services:
  ◦ Highly reliable & available
  ◦ Spanning multiple zones in a single region
• Half managed services
• A zone can fail completely
• A service in a region can fail completely
• A region can fail completely

Some AWS Outages
• c5.large EC2 instances in Ohio failing at a high rate
• The greatest API outage of OpsGenie (20 minutes) was caused by an AWS deployment mistake
  ◦ AWS Application Load Balancer bug
• us-west-2c network with a high failure rate
• On February 28th, 2017, AWS operators mistakenly disrupted the S3 service in Virginia; it took hours to recover. Other AWS services depending on S3 were also disrupted.
• Virginia Region is not as reliable as Oregon Region (past experience)
* https://aws.amazon.com/message/41926/
* https://status.aws.amazon.com/
* OpsGenie Engineering

Multi Region vs Multi Cloud
• Challenging topics:
  ◦ Replication & traffic routing
• Multi cloud approach additionally requires
  ◦ R&D for each provider
  ◦ Application support for different technologies
  ◦ Code: if AWS if Azure if Google
• Multi region is more like an operational burden
  ◦ Code: if Oregon if Ohio
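
The "if Oregon if Ohio" branching can often be reduced to a lookup table keyed by region. The following is a minimal sketch under that assumption; the bucket names and the `config_for` helper are hypothetical, and only the region codes (us-west-2 for Oregon, us-east-2 for Ohio) are real.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionConfig:
    region: str       # AWS region code
    bucket: str       # hypothetical bucket name, for illustration only
    peer_region: str  # region that receives replicated data


# One entry per region; application code stays identical across regions.
REGIONS = {
    "oregon": RegionConfig("us-west-2", "example-data-us-west-2", "us-east-2"),
    "ohio": RegionConfig("us-east-2", "example-data-us-east-2", "us-west-2"),
}


def config_for(name: str) -> RegionConfig:
    """Return the settings for a region name, case-insensitively."""
    try:
        return REGIONS[name.lower()]
    except KeyError:
        raise ValueError(f"unknown region name: {name!r}") from None
```

With multi cloud, the equivalent table would also have to select different client libraries and data models per provider, which is where the extra R&D effort comes from.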

Amazon S3
• Object/File storage service, fully managed
• Scalability:
  ◦ Size infinitely scalable
  ◦ Fixed read/write limits per second
• Availability: S3 Standard storage class is designed for 99.99%
• Durability: S3 Standard average annual expected loss of 0.000000001% of objects
• Redundantly stores your objects on multiple devices across multiple facilities in an Amazon S3 Region before returning SUCCESS
• Repairs corruptions automatically by using redundant data
* https://aws.amazon.com/s3/faqs/
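
To make the durability figure concrete, the arithmetic can be worked through as below. This is only an illustration of what an average annual loss of 0.000000001% implies; the function names are made up for this sketch.

```python
from fractions import Fraction

# 0.000000001 % expressed as an exact fraction of objects lost per year.
ANNUAL_LOSS_PERCENT = Fraction(1, 10**9)
ANNUAL_LOSS_FRACTION = ANNUAL_LOSS_PERCENT / 100  # 1e-11


def expected_annual_losses(object_count: int) -> Fraction:
    """Average number of objects expected to be lost per year."""
    return object_count * ANNUAL_LOSS_FRACTION


def years_per_lost_object(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(object_count)


# Storing 10 million objects: on average one lost object every 10,000 years.
print(years_per_lost_object(10_000_000))
```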

S3 Built-in Cross Region Replication
* Sample Illustration

S3 Built-in Cross Region Replication
• Bi-directional or uni-directional between two buckets
• Limited to two buckets, cannot be extended
• Recursive replication protection by using different actions for replication
  ◦ Replication actions are separate
    ▪ PutObject vs ReplicateObject
    ▪ Delete vs ReplicateDelete
• Async cross region replication
• Lacks monitoring & logging
  ◦ Additional effort needed to identify latencies & failures in async replication
  ◦ AWS has an extension for monitoring
• How to extend to more than two buckets
  ◦ Custom logic can be built by leveraging AWS S3 object triggers & AWS Lambda
  ◦ Should handle versioning & recursion challenges
* https://docs.aws.amazon.com/AmazonS3/latest/dev/crr-how-setup.html
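
The custom fan-out idea can be sketched as follows. This is not the logic used by OpsGenie or by AWS; it is a minimal illustration of recursion protection, assuming a hypothetical user-metadata key `replicated-from` that marks replicated copies. The `head` and `copy` callables are placeholders for whatever a Lambda function would use to read object metadata and copy an object; versioning and deletes are left out.

```python
from typing import Callable, Dict, List

Metadata = Dict[str, str]

# Hypothetical metadata key that marks an object as a replicated copy.
MARKER = "replicated-from"


def replicate(
    source_bucket: str,
    key: str,
    all_buckets: List[str],
    head: Callable[[str, str], Metadata],
    copy: Callable[[str, str, str, Metadata], None],
) -> List[str]:
    """Copy a newly written object to every other bucket, exactly once.

    Returns the list of buckets the object was copied to.
    """
    metadata = head(source_bucket, key)
    if MARKER in metadata:
        # The object arrived through replication; forwarding it again
        # would bounce it between buckets forever.
        return []
    targets = [b for b in all_buckets if b != source_bucket]
    for target in targets:
        copy(source_bucket, key, target, {**metadata, MARKER: source_bucket})
    return targets
```

A trigger on each bucket would call `replicate` for every object-created event; the marker check is what stops the copy written to bucket B from being sent back to bucket A.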

Amazon DynamoDB
• NoSQL database service, fully managed
• Scalability:
  ◦ Size infinitely scalable (* except LSI limits)
  ◦ Provisioned read/write limits per second
    ▪ DynamoDB will scale up & down depending on provisioned capacity changes
• Availability: Single Region 99.99%, Multi Region 99.999%
• Durability: Synchronously replicates data across three facilities within an AWS Region
• Automatic replication and failover within a region
• Global Table async cross region replication
  ◦ Metrics available for monitoring
* https://aws.amazon.com/dynamodb/faqs/
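
The difference between 99.99% and 99.999% is easier to judge as allowed downtime per year. The sketch below only converts the percentages; it assumes a 365-day year and says nothing about how AWS measures availability.

```python
from fractions import Fraction

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def max_downtime_minutes(availability_percent: str) -> Fraction:
    """Minutes of downtime per year allowed by an availability percentage.

    The percentage is passed as a string (e.g. "99.99") so that the
    arithmetic stays exact.
    """
    unavailable = 1 - Fraction(availability_percent) / 100
    return MINUTES_PER_YEAR * unavailable


print(float(max_downtime_minutes("99.99")))   # single region
print(float(max_downtime_minutes("99.999")))  # multi region
```

Going from four nines to five nines cuts the yearly budget from roughly 53 minutes to roughly 5 minutes.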

AutoScaling DynamoDB
* OpsGenie Engineering
• Pricing based on provisioned capacity
• Applications will receive errors after 5 minutes of burst

AutoScaling DynamoDB
AWS Built-in AutoScale
• DynamoDB released in Jan 2012
• AWS Autoscale released in June 2017
• Inspired by open source
• Works with 5 minute interval alarms
  ◦ Not efficient for responding to traffic spikes
* https://aws.amazon.com/dynamodb/faqs/
OpsGenie AutoScale
• OpsGenie released in July 2012
• Initial server-based version released in June 2014
• Leverages CloudWatch Alarms & Lambda
• Can work with 1 minute interval alarms
• Custom logic allows better handling of scale up & down
• ! Additionally, applications shall trigger scale-up upon capacity errors, spikes
• Partitions do not shrink
• More partitions > more cost, capacity split
• Partition addition is slow
* OpsGenie Engineering
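
The kind of custom scaling logic mentioned above can be sketched as a pure decision function. This is an illustration, not OpsGenie's implementation: the target utilization, the scale-down threshold, the capacity bounds, and the rule of doubling on throttling are all assumptions chosen for the example.

```python
import math


def next_capacity(
    provisioned: int,
    consumed: float,
    throttled: bool,
    *,
    target_utilization: float = 0.7,    # assumed target
    scale_down_threshold: float = 0.3,  # assumed threshold
    min_capacity: int = 5,              # assumed lower bound
    max_capacity: int = 10_000,         # assumed upper bound
) -> int:
    """Decide the new provisioned capacity from one alarm evaluation.

    `provisioned` must be positive. `throttled` is True when the
    application reported capacity errors in the interval.
    """
    utilization = consumed / provisioned
    if throttled or utilization > target_utilization:
        # Scale up so that current traffic lands at the target utilization.
        wanted = math.ceil(consumed / target_utilization)
        if throttled:
            # Throttling hides the real demand, so jump at least 2x.
            wanted = max(wanted, provisioned * 2)
    elif utilization < scale_down_threshold:
        # Scale down, still leaving headroom above current traffic.
        wanted = math.ceil(consumed / target_utilization)
    else:
        wanted = provisioned
    return max(min_capacity, min(max_capacity, wanted))
```

Evaluating this every minute instead of every five is what lets a custom scaler react to spikes sooner. Because partitions do not shrink, an aggressive scale-up has a lasting cost, which is one reason to keep an upper bound.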

DynamoDB Cross Region Replication
* OpsGenie Engineering

DynamoDB Cross Region Replication
AWS Built-in Global Table
• Availability 99.99% > 99.999%
• DynamoDB released in Jan 2012
• Global Table released in Nov 2017
• 1.5-2 years of R&D & internal testing
• Async
• Metrics available for latency
• Recursion protection by depending on new attributes
  ◦ aws:rep:updateregion
  ◦ aws:rep:updatetime
• Conflicts can happen
  ◦ Last write wins, full update
  ◦ No metrics for conflicts
* https://aws.amazon.com/dynamodb/faqs/
OpsGenie Cross Region Replication
• Born from the need for higher availability & disaster recovery
• Released in October 2016
• We were not aware of Global Table
  ◦ ! Importance of being in contact with AWS
• Leveraging DynamoDB Streams, only read cost
• Async replication: Lambdas attached to DynamoDB Streams
• Detailed logging & metrics available
• Recursion protection:
  ◦ Each application needs to add additional attributes
• Conflicts should be handled
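
Both approaches rely on the same two ideas: mark each write with its origin so it is not replicated back, and resolve conflicts with last write wins. The sketch below models them on plain dictionaries. The attribute names `updated_region` and `updated_at` are hypothetical stand-ins for the extra attributes an application would add, and the `id` key is an assumed primary key.

```python
from typing import Any, Dict

Item = Dict[str, Any]


def should_forward(new_image: Item, local_region: str) -> bool:
    """Recursion protection: forward only writes that originated locally.

    A record written by the replicator carries the remote region in
    `updated_region`, so the local stream consumer ignores it.
    """
    return new_image.get("updated_region") == local_region


def apply_remote(table: Dict[str, Item], incoming: Item) -> bool:
    """Last write wins, full update: replace the item if incoming is newer.

    Returns True when the table was changed. Ties keep the current item;
    a real system needs a deterministic tie-breaker on equal timestamps.
    """
    current = table.get(incoming["id"])
    if current is not None and current["updated_at"] >= incoming["updated_at"]:
        return False
    table[incoming["id"]] = dict(incoming)
    return True
```

Counting how often `apply_remote` returns False is one way to get the conflict metric that the slide notes is missing from Global Tables.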

DynamoDB Zero Downtime Migration
* OpsGenie Engineering

DynamoDB Zero Downtime Migration
• DevOps & modern tools allow faster delivery
  ◦ Changes lead to data migration
• Goal: Migrate all data while systems are running
• Requirements:
  ◦ Migrate new data by using DynamoDB Streams
  ◦ Migrate existing data by using DynamoDB parallel full scan
  ◦ Preserve compatibility within the application while switching from v1 to v2
• Challenges:
  ◦ Conflicts can happen between the multiple data processors above
  ◦ Conflict handling should be carefully engineered
  ◦ Result of migration should also be verified
    ▪ Data changes even while verification is running
  ◦ If migration is risky, backward compatibility should also be designed
* https://aws.amazon.com/dynamodb/faqs/
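
One way to handle the conflict between the stream processor and the full scan is to make the scan write only when the target has no copy yet, while the stream overwrites anything older. The sketch below models that on dictionaries. It assumes every item carries an `id` and a monotonically increasing `version`, and `transform` is a hypothetical stand-in for the v1 to v2 schema change. In DynamoDB the "only if absent" rule would be a conditional write such as `attribute_not_exists`. Deletes are not handled here.

```python
from typing import Any, Dict, List

Item = Dict[str, Any]


def transform(item: Item) -> Item:
    """Hypothetical v1 -> v2 conversion: tag the item with its schema."""
    return {**item, "schema": 2}


def copy_from_scan(v2: Dict[str, Item], item: Item) -> bool:
    """Backfill: write only if v2 has no copy, so a slow scan never
    overwrites fresher data already delivered by the stream."""
    if item["id"] in v2:
        return False
    v2[item["id"]] = transform(item)
    return True


def copy_from_stream(v2: Dict[str, Item], item: Item) -> bool:
    """Live changes: overwrite unless v2 holds the same or a newer version."""
    current = v2.get(item["id"])
    if current is not None and current["version"] >= item["version"]:
        return False
    v2[item["id"]] = transform(item)
    return True


def verify(v1: Dict[str, Item], v2: Dict[str, Item]) -> List[str]:
    """Return ids whose v2 copy does not match the converted v1 item."""
    return [key for key, item in v1.items() if v2.get(key) != transform(item)]
```

Because data keeps changing while verification runs, ids reported by `verify` would need to be rechecked after the stream catches up before they are treated as real mismatches.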

Who watches the watcher?
* OpsGenie Engineering

Disaster Monitoring of AWS Services
* OpsGenie Engineering

Traffic Routing
Questions Welcome

Traffic Routing
• Stickiness vs Round Robin
• Sync vs Async Replication
• Latency
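
The stickiness versus round robin trade-off can be illustrated with two tiny routers. This is a sketch only: the region list, the use of a customer id as the routing key, and the hashing scheme are assumptions, not a description of OpsGenie's routing.

```python
import hashlib
from itertools import cycle
from typing import Iterator, List

REGIONS = ["us-west-2", "us-east-2"]  # Oregon, Ohio


def sticky_region(customer_id: str, regions: List[str] = REGIONS) -> str:
    """Always send the same customer to the same region.

    A stable hash is used instead of Python's built-in hash(), which is
    randomized per process for strings.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return regions[int.from_bytes(digest[:8], "big") % len(regions)]


def round_robin(regions: List[str] = REGIONS) -> Iterator[str]:
    """Alternate regions request by request."""
    return cycle(regions)
```

With async replication, round robin can send a read to a region that has not yet received the previous write, whereas sticky routing keeps a customer's reads and writes together at the price of uneven load and a remapping step when a region fails.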

OpsGenie Reliability Effort
• DevOps, CI, CD, Microservices
• Chaos Engineering & Tools
• Cross Region Replication
• Custom Blue Green Deployment for EC2
• Custom EC2 - Kinesis - DynamoDB AutoScaling
• DynamoDB Real Time Backup to S3
• OpsGenie Resource Creation
• A hell of a lot of monitoring, alerting
• A bunch of Security Operations
• Deploy with MFA
• …
• ...

Thanks for watching
See you in the next event
Follow us:
Events on academy.opsgenie.com
Blogs on engineering.opsgenie.com
