Slide 1

Slide 1 text

#OGinsights @ekelog
Architecting Multi Region Critical SaaS Solution on AWS
Abdurrahim Eke
Co-founder & CSRO @ OpsGenie
@ekelog @OpsGenieEng

Slide 2

Slide 2 text

Outline
● Journey through modern building blocks
● Role of OpsGenie
● Motivation for high reliability
● In Action
  ○ S3
  ○ DynamoDB
  ○ Disaster Monitoring
  ○ Cross Region Traffic Routing Alternatives

Slide 3

Slide 3 text

* OpsGenie Engineering

Slide 4

Slide 4 text

DevOps - CI / CD - Evolution
* https://hackernoon.com/has-devops-changed-the-role-of-a-tester-b140dddc7824

Slide 5

Slide 5 text

DevOps Tools
* https://www.enterpriseirregulars.com/116202/race-pipeline-atlassian-aint-playin-introducing-devops-marketplace/

Slide 6

Slide 6 text

AI Tools
* https://www.topbots.com/essential-landscape-overview-enterprise-artificial-intelligence/

Slide 7

Slide 7 text

Amazon Web Services
* AWS Console
Decision factors
● Cost
● Scalability
● Features
● Disaster Cases
● Monitoring
● Engineering Effort
● Does it really suit ?
● Supports agility or not ?
Not all services are as well built as others
AWS is also iterative

Slide 8

Slide 8 text

An analysis of Modern Tools
Pros
● Focus on higher level benefits
● Faster business growth
● Fast response to incidents
● Higher levels of reliability, scalability, security
Cons
● Increased complexity
● Learning curve
● New tools required for monitoring & management
Inner Voice : You must be kidding, do we really need more tools for managing more tools ?

Slide 9

Slide 9 text

Monitoring - Observability are Hot Topics
All services generate
● Events
● Metrics
● Logs
● Errors
Alerts: impact mostly unknown
Incidents: identified problems, operators should respond
Major Outage: multiple teams should respond
Inner Voice : How many tools can there be in the monitoring domain ?

Slide 10

Slide 10 text

Role of OpsGenie in IT Operations
IT Systems, Factories, Scientific Labs, Food trucks, ...
Real World
Monitoring, Alerting, Alert Management, Notifications, Security tools, Custom software, User Actions
Operations
Notify via phone call, SMS, mobile app, emails
Notify right person, right time
Notify collaboration tools & IT systems
Help operator solve fast
...
Need to understand & solve fast
Needs clean & rich information from multiple systems
Ticketing, Collaboration, IT management, ...

Slide 11

Slide 11 text

OpsGenie Integrates With >150 Tools

Slide 12

Slide 12 text

Motivation For High Reliability - Chaos Engineering
● Always up & running, trustworthy
● Fast response for incidents
● Monitoring systems should be more available than the systems they monitor
● To have fun, to sleep well
* https://aws.amazon.com/message/41926/
Who watches the watcher ?

Slide 13

Slide 13 text

In Action
Questions Welcome

Slide 14

Slide 14 text

Reliability Engineering Challenges
● Almost all systems deserve reliability attention
● Any new tech player in the stack requires new effort
● The level of effort for reliability engineering depends on the reliability of the services used
● Each service is trying to improve reliability on its own
* OpsGenie Engineering

Slide 15

Slide 15 text

AWS Reliability Offering
● Fully managed services:
  ○ Highly reliable & available
  ○ Spanning multiple zones in a single region
● Half managed services
● A zone can fail completely
● A service in a region can fail completely
● A region can fail completely

Slide 16

Slide 16 text

Some AWS Outages
● c5.large EC2 instances in Ohio failing at a high rate
● The largest OpsGenie API outage (20 minutes) was caused by an AWS deployment mistake
  ○ AWS Application Load Balancer bug
● us-west-2c network with a high failure rate
● On February 28th, 2017, AWS operators mistakenly disrupted the S3 service in Virginia; it took hours to recover. Other AWS services depending on S3 were also disrupted.
● The Virginia Region is not as reliable as the Oregon Region ( past experience )
* https://aws.amazon.com/message/41926/
* https://status.aws.amazon.com/
* OpsGenie Engineering

Slide 17

Slide 17 text

Multi Region vs Multi Cloud
● Challenging topics :
  ○ Replication & traffic routing
● Multi cloud approach additionally requires
  ○ R&D for each provider
  ○ Application support for different technologies
  ○ Code: if AWS if Azure if Google
● Multi region is more like an operational burden
  ○ Code: if Oregon if Ohio

Slide 18

Slide 18 text

Amazon S3
● Object/File storage service, fully managed
● Scalability :
  ○ Size infinitely scalable
  ○ Fixed read/write limits per second
● Availability : S3 Standard storage class is designed for 99.99%
● Durability : S3 Standard is designed for an average annual expected loss of 0.000000001% of objects
● Redundantly stores your objects on multiple devices across multiple facilities in an Amazon S3 Region before returning SUCCESS
● Repairs corruptions automatically by using redundant data
* https://aws.amazon.com/s3/faqs/
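
The durability figure above is easier to read as a worked calculation. The following is a minimal sketch of that arithmetic only; the function names are illustrative and are not part of any AWS API.

```python
# 0.000000001% average annual expected loss, expressed as a fraction.
ANNUAL_LOSS_FRACTION = 0.000000001 / 100


def expected_annual_loss(object_count: int) -> float:
    """Average number of objects expected to be lost per year."""
    return object_count * ANNUAL_LOSS_FRACTION


def years_per_lost_object(object_count: int) -> float:
    """Average number of years between the loss of a single object."""
    return 1.0 / expected_annual_loss(object_count)
```

For 10,000,000 stored objects this works out to roughly one lost object every 10,000 years on average, which matches the example given in the S3 FAQ.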

Slide 19

Slide 19 text

S3 Built-in Cross Region Replication
* Sample Illustration

Slide 20

Slide 20 text

S3 Built-in Cross Region Replication
● Bi-directional or uni-directional between two buckets
● Limited to two buckets, cannot be extended
● Recursive replication protection by using different actions for replication
  ○ Replication actions are separate
    ■ PutObject vs ReplicateObject
    ■ Delete vs ReplicateDelete
● Async Cross Region Replication
● Lacks monitoring & logging
  ○ Additional effort needed to identify latencies & failures in async replication
  ○ AWS has an extension for monitoring
● How to extend to more than two buckets
  ○ Custom logic can be built by leveraging AWS S3 Object Triggers & AWS Lambda
  ○ Should handle versioning & recursion challenges
* https://docs.aws.amazon.com/AmazonS3/latest/dev/crr-how-setup.html
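
One way the custom logic for more than two buckets could handle recursion and versioning is sketched below. This is not OpsGenie's implementation and makes no AWS calls: `ObjectEvent`, `plan_replication`, and the `replicated-from` metadata key are hypothetical names used only to show the decision a trigger handler would have to make.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical metadata key written by the replicator on every copy it makes.
REPLICA_MARKER = "replicated-from"


@dataclass
class ObjectEvent:
    """Simplified view of an object-created notification (assumed shape)."""
    bucket: str
    key: str
    version_id: str
    metadata: Dict[str, str] = field(default_factory=dict)


def plan_replication(event: ObjectEvent, all_buckets: List[str]) -> List[dict]:
    """Return the copy operations a trigger handler would perform."""
    # An object carrying the marker was written by the replicator itself,
    # so copying it again would bounce between buckets forever.
    if REPLICA_MARKER in event.metadata:
        return []
    return [
        {
            "source_bucket": event.bucket,
            "target_bucket": target,
            "key": event.key,
            # Copy the exact version that fired the event, not "latest".
            "source_version_id": event.version_id,
            "metadata": {**event.metadata, REPLICA_MARKER: event.bucket},
        }
        for target in all_buckets
        if target != event.bucket
    ]
```

A user write to one bucket yields one copy per other bucket, while the events fired by those copies yield nothing, which is what breaks the loop.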

Slide 21

Slide 21 text

Amazon DynamoDB
● NoSQL Database Service, fully managed
● Scalability :
  ○ Size infinitely scalable, * except LSI limits
  ○ Provisioned read/write limits per second
    ■ DynamoDB will scale up & down depending on provisioned capacity changes
● Availability : Single Region 99.99%, Multi Region 99.999%
● Durability : Synchronously replicates data across three facilities within an AWS Region
● Automatic replication and failover within a region
● Global Table async Cross Region Replication
  ○ Metrics available for monitoring
* https://aws.amazon.com/dynamodb/faqs/
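
To make the two availability figures concrete, the sketch below converts an availability percentage into allowed downtime per year. It is plain arithmetic that ignores leap years, and the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Maximum downtime per year that still meets the availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * MINUTES_PER_YEAR


single_region = downtime_minutes_per_year(99.99)   # about 52.6 minutes
multi_region = downtime_minutes_per_year(99.999)   # about 5.3 minutes
```

Going from four nines to five nines cuts the yearly downtime budget by a factor of ten, from under an hour to a few minutes.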

Slide 22

Slide 22 text

AutoScaling DynamoDB
* OpsGenie Engineering
● Pricing based on provisioned capacity
● Applications will receive errors after 5 minutes of burst

Slide 23

Slide 23 text

AutoScaling DynamoDB
AWS Built-in AutoScale
● DynamoDB released in Jan 2012
● AWS AutoScale released in June 2017
● Inspired by open source
● Works with 5 minute interval alarms
  ○ Not efficient for responding to traffic spikes
* https://aws.amazon.com/dynamodb/faqs/
OpsGenie AutoScale
● OpsGenie released in July 2012
● Initial server-based version released in June 2014
● Leverages CloudWatch Alarms & Lambda
● Can work with 1 minute interval alarms
● Custom logic allows better handling of scale up & down
● ! Additionally, applications should trigger scale-up upon capacity errors, spikes
● Partitions do not shrink
● More partitions > more cost, capacity split
● Partition addition is slow
* OpsGenie Engineering
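
The kind of custom scale up & down logic mentioned above could look like the sketch below. It is not OpsGenie's actual algorithm: the 70% target utilization, doubling on capacity errors, the one-fifth limit on each scale-down step, and the min/max bounds are all assumptions chosen for illustration. The slow scale-down reflects the point that partitions added during a scale-up do not shrink again.

```python
import math


def next_capacity(provisioned: int, consumed: float, throttled: bool,
                  target_utilization: float = 0.7,
                  min_capacity: int = 5, max_capacity: int = 10_000) -> int:
    """Pick the next provisioned capacity from one minute of metrics."""
    # Capacity that would put current consumption at the target utilization.
    desired = math.ceil(consumed / target_utilization)
    if throttled:
        # Applications reported capacity errors: react at once, at least doubling.
        desired = max(desired, provisioned * 2)
    elif desired < provisioned:
        # Scale down gently, by at most one fifth of current capacity per step.
        desired = max(desired, provisioned - provisioned // 5)
    return max(min_capacity, min(max_capacity, desired))
```

In a setup like the one described, a function of this shape would be evaluated on each 1 minute alarm, with the `throttled` flag also set when an application reports capacity errors directly.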

Slide 24

Slide 24 text

DynamoDB Cross Region Replication
* OpsGenie Engineering

Slide 25

Slide 25 text

DynamoDB Cross Region Replication
AWS Built-in Global Table
● Availability 99.99% > 99.999%
● DynamoDB released in Jan 2012
● Global Table released in Nov 2017
● 1.5-2 years of R&D & internal testing
● Async
● Metrics available for latency
● Recursion protection by depending on new attributes
  ○ aws:rep:updateregion
  ○ aws:rep:updatetime
● Conflicts can happen
  ○ Last write wins, full update
  ○ No metrics for conflicts
* https://aws.amazon.com/dynamodb/faqs/
OpsGenie Cross Region Replication
● Born from the need for higher availability & disaster recovery
● Released in October 2016
● We were not aware of Global Table
  ○ ! Importance of being in contact with AWS
● Leveraging DynamoDB Streams, only read cost
● Async replication: Lambdas attached to DynamoDB Streams
● Detailed logging & metrics available
● Recursion protection :
  ○ Each application needs to add additional attributes
● Conflicts should be handled
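
Recursion protection and last-write-wins conflict handling can be sketched with two small functions. This is an illustration only: the `update_region` and `update_time` attribute names are hypothetical stand-ins modeled on the `aws:rep:updateregion` and `aws:rep:updatetime` attributes, and breaking timestamp ties by region name is an assumption, not documented behavior of either system.

```python
from typing import Optional


def should_replicate(record: dict, local_region: str) -> bool:
    """Forward only writes that originated in this region.

    A write applied by the replicator keeps the origin region's name,
    so it is not sent back and cannot loop between regions.
    """
    return record.get("update_region") == local_region


def resolve(existing: Optional[dict], incoming: dict) -> dict:
    """Last write wins: keep the whole item with the newer timestamp."""
    if existing is None:
        return incoming
    existing_key = (existing["update_time"], existing["update_region"])
    incoming_key = (incoming["update_time"], incoming["update_region"])
    # Equal timestamps fall back to comparing region names (assumed tie-break).
    return incoming if incoming_key > existing_key else existing
```

Because the winner replaces the item in full, a concurrent update to a different attribute of the same item is lost, which is why conflicts need explicit handling and, ideally, their own metrics.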

Slide 26

Slide 26 text

DynamoDB Zero Downtime Migration
* OpsGenie Engineering

Slide 27

Slide 27 text

DynamoDB Zero Downtime Migration
● DevOps & modern tools allow faster delivery
  ○ Changes lead to data migration
● Goal : Migrate all data while systems are running
● Requirements :
  ○ Migrate new data by using DynamoDB Streams
  ○ Migrate existing data by using DynamoDB parallel full scan
  ○ Preserve compatibility within the application while switching from v1 to v2
● Challenges :
  ○ Conflicts can happen between the multiple data processors above
  ○ Conflict handling should be carefully engineered
  ○ Result of migration should also be verified
    ■ Data changes even while verification is running
  ○ If migration is risky, backward compatibility should also be designed
* https://aws.amazon.com/dynamodb/faqs/
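
The conflict between the stream processor and the full scan can be illustrated with an in-memory model. This sketch assumes every item carries a monotonically increasing `version` attribute, which the slides do not state; `TargetTable`, `transform`, and the v1/v2 field names are hypothetical. In DynamoDB the same idea would be expressed as a conditional write.

```python
from typing import Any, Dict, List

Item = Dict[str, Any]


class TargetTable:
    """In-memory stand-in for the v2 table."""

    def __init__(self) -> None:
        self.items: Dict[str, Item] = {}

    def put_if_newer(self, key: str, item: Item) -> bool:
        """Conditional write: apply only if no newer or equal version is stored."""
        current = self.items.get(key)
        if current is not None and current["version"] >= item["version"]:
            return False
        self.items[key] = item
        return True


def transform(v1_item: Item) -> Item:
    """Hypothetical v1 to v2 schema change (renames one field)."""
    return {"id": v1_item["id"], "version": v1_item["version"],
            "full_name": v1_item["name"]}


def migrate(table: TargetTable, v1_item: Item) -> bool:
    """Used by both the stream processor and the full scan."""
    v2_item = transform(v1_item)
    return table.put_if_newer(v2_item["id"], v2_item)


def find_mismatches(source: Dict[str, Item], table: TargetTable) -> List[str]:
    """Verification pass: keys whose v2 item differs from the transformed v1 item."""
    return sorted(key for key, item in source.items()
                  if table.items.get(key) != transform(item))
```

Since both processors go through the same conditional write, the order in which the scan and the stream reach an item does not matter. Keys reported by the verification pass while traffic is still flowing may simply be in flight, so they would need a recheck before being treated as errors.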

Slide 28

Slide 28 text

Who watches the watcher ?
* OpsGenie Engineering

Slide 29

Slide 29 text

Disaster Monitoring of AWS Services
* OpsGenie Engineering

Slide 30

Slide 30 text

Traffic Routing
Questions Welcome

Slide 31

Slide 31 text

Traffic Routing
● Stickiness vs Round Robin
● Sync vs Async Replication
● Latency
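
The stickiness vs round robin choice can be sketched as two region selectors. This is an illustration under assumptions, not the routing OpsGenie uses: the region list, the `customer_id` routing key, and the hash-based assignment are all hypothetical.

```python
import hashlib
from itertools import cycle
from typing import Iterator, List

REGIONS: List[str] = ["us-west-2", "us-east-2"]  # Oregon, Ohio


def sticky_region(customer_id: str, regions: List[str] = REGIONS) -> str:
    """Always send the same customer to the same region."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return regions[int.from_bytes(digest[:8], "big") % len(regions)]


def round_robin(regions: List[str] = REGIONS) -> Iterator[str]:
    """Alternate regions request by request, regardless of customer."""
    return cycle(regions)
```

With async replication, round robin can send a read to a region that has not yet received the customer's previous write. Sticky routing avoids that at the cost of less even load, and it needs a reassignment step when the home region fails.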

Slide 32

Slide 32 text

OpsGenie Reliability Effort
● DevOps, CI, CD, Microservices
● Chaos Engineering & Tools
● Cross Region Replication
● Custom Blue Green Deployment for EC2
● Custom EC2 - Kinesis - DynamoDB AutoScaling
● DynamoDB Real Time Backup to S3
● OpsGenie Resource Creation
● Hell of monitoring, alerting
● Bunch of Security Operations
● Deploy with MFA
● …
● ...

Slide 33

Slide 33 text

Thanks for watching
See you in the next event
Follow us:
Events on academy.opsgenie.com
Blogs on engineering.opsgenie.com