
Architecting Multi Region Critical SaaS Solution on AWS

Modern software building blocks allow faster application development than ever and let developers focus on higher-level benefits: faster business growth and higher levels of reliability, scalability, and security. A typical SaaS solution makes use of many modern tools, which brings additional complexity and management overhead. This modern, complex world has a great demand for monitoring and observability, so that developers can manage many systems quickly and easily. Monitoring systems are expected to be more available than the systems they monitor. OpsGenie is a critical incident and alert management SaaS solution. We will demonstrate how OpsGenie uses the AWS services EC2, Lambda, DynamoDB, SNS, SQS, S3, and CloudFront to provide an always-available, multi-region, multi-zone architecture, dive into reliability challenges, and show how AWS addresses these challenges.

OpsGenie Engineering

June 23, 2018

Transcript

  1. #OGinsights @ekelog Architecting Multi Region Critical SaaS Solution on AWS

    Abdurrahim Eke Co-founder & CSRO @ OpsGenie @ekelog @OpsGenieEng
  2. Outline

    • Journey through modern building blocks
    • Role of OpsGenie
    • Motivation for high reliability
    • In Action
      ◦ S3
      ◦ DynamoDB
      ◦ Disaster Monitoring
      ◦ Cross Region Traffic Routing Alternatives
  3. DevOps - CI / CD - Evolution

    * https://hackernoon.com/has-devops-changed-the-role-of-a-tester-b140dddc7824
  4. Amazon Web Services

    * AWS Console

    Decision factors
    • Cost
    • Scalability
    • Features
    • Disaster Cases
    • Monitoring
    • Engineering Effort
    • Does it really suit?
    • Supports agility or not?

    Not all services are as well built as others. AWS is also iterative.
  5. An Analysis of Modern Tools

    Pros
    • Focus on higher level benefits
    • Faster business growth
    • Fast response to incidents
    • Higher levels of reliability, scalability, security

    Cons
    • Increased complexity
    • Learning curve
    • New tools required for monitoring & management

    Inner Voice: You must be kidding, do we really need more tools for managing more tools?
  6. Monitoring - Observability are Hot Topics

    All services generate
    • Events
    • Metrics
    • Logs
    • Errors

    Alerts
    Incidents
    Impact mostly unknown
    Identified problems
    Operators should respond
    Major outage
    Multiple teams should respond

    Inner Voice: How many tools can there be in the monitoring domain?
  7. Role of OpsGenie in IT Operations

    IT Systems Factories Scientific Labs Food trucks ... Real World Monitoring Alerting Alert Management Notifications Security tools Custom software User Actions Operations Notify via phone call, SMS, mobile app, emails Notify right person, right time Notify collaboration tools & IT systems Help operator solve fast ... Need to understand & solve fast Needs clean & rich information from multiple systems Ticketing, Collaboration IT management ...
  8. Motivation For High Reliability - Chaos Engineering

    • Always up & running, trustworthy
    • Fast response to incidents
    • Monitoring systems should be more available than the systems they monitor
    • To have fun, to sleep well

    Who watches the watcher?

    * https://aws.amazon.com/message/41926/
  9. Reliability Engineering Challenges

    • Almost all systems deserve reliability attention
    • Any new tech player in the stack requires new effort
    • Level of effort for reliability engineering depends on the reliability of the services
    • Each service is trying to improve reliability on its own

    * OpsGenie Engineering
  10. AWS Reliability Offering

    • Fully managed services:
      ◦ Highly reliable & available
      ◦ Spanning multiple zones in a single region
    • Half managed services
    • A zone can completely fail
    • A service in a region can completely fail
    • A region can completely fail
  11. Some AWS Outages

    • c5.large EC2 instances in Ohio failing at a high rate
    • The largest OpsGenie API outage (20 minutes) was caused by an AWS deployment mistake
      ◦ AWS Application Load Balancer bug
    • us-west-2c network: high failure rate
    • On February 28th, 2017, AWS operators mistakenly disrupted the S3 service in Virginia; it took hours to recover. Other AWS services depending on S3 were also disrupted.
    • Virginia Region is not as reliable as Oregon Region (past experience)

    * https://aws.amazon.com/message/41926/
    * https://status.aws.amazon.com/
    * OpsGenie Engineering
  12. Multi Region vs Multi Cloud

    • Challenging topics:
      ◦ Replication & traffic routing
    • Multi cloud approach additionally requires
      ◦ R&D for each provider
      ◦ Application support for different technologies
      ◦ Code: if AWS, if Azure, if Google
    • Multi region is more like an operational burden
      ◦ Code: if Oregon, if Ohio
  13. Amazon S3

    • Object/file storage service, fully managed
    • Scalability:
      ◦ Size is infinitely scalable
      ◦ Fixed read/write limits per second
    • Availability: S3 Standard storage class is designed for 99.99% availability
    • Durability: S3 Standard has an average annual expected loss of 0.000000001% of objects
    • Redundantly stores your objects on multiple devices across multiple facilities in an Amazon S3 Region before returning SUCCESS
    • Repairs corruption automatically by using redundant data

    * https://aws.amazon.com/s3/faqs/
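As a rough worked example of the durability figure on the slide above, the sketch below converts an average annual expected loss of 0.000000001% of objects into an expected number of lost objects per year. The object count of 10,000,000 is an assumed example size, not a number from the talk.

```python
# Average annual expected loss stated on the slide, as a percentage of objects.
ANNUAL_LOSS_PERCENT = 0.000000001


def expected_annual_loss(object_count: int) -> float:
    """Expected number of objects lost per year for a given object count."""
    return object_count * ANNUAL_LOSS_PERCENT / 100.0


def years_per_lost_object(object_count: int) -> float:
    """Average number of years between single-object losses."""
    return 1.0 / expected_annual_loss(object_count)


if __name__ == "__main__":
    objects = 10_000_000  # assumed example size
    print(expected_annual_loss(objects))
    print(years_per_lost_object(objects))
```

Under these assumptions, 10,000,000 stored objects correspond to about 0.0001 lost objects per year, or roughly one object every 10,000 years.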
  14. S3 Built-in Cross Region Replication

    • Bi-directional or uni-directional between two buckets
    • Limited to two buckets, cannot be extended
    • Protection against recursive replication by using different actions for replication
      ◦ Replication actions are separate
        ▪ PutObject vs ReplicateObject
        ▪ Delete vs ReplicateDelete
    • Async cross region replication
    • Lacks monitoring & logging
      ◦ Additional effort needed to identify latencies & failures in async replication
      ◦ AWS has an extension for monitoring
    • How to extend to more than two buckets
      ◦ Custom logic can be built by leveraging AWS S3 object triggers & AWS Lambda
      ◦ Should handle versioning & recursion challenges

    * https://docs.aws.amazon.com/AmazonS3/latest/dev/crr-how-setup.html
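The recursion challenge mentioned for custom replication across more than two buckets can be sketched as below. This is a minimal illustration and not OpsGenie's implementation: the bucket names and the `replicated-from` metadata key are hypothetical, and the actual S3 reads and copies that a Lambda handler would perform are left out so that only the decision logic is shown. Versioning is not handled here.

```python
from typing import Dict, List

# Hypothetical bucket names, one per region; not taken from the talk.
BUCKETS = ["data-oregon", "data-ohio", "data-frankfurt"]

# Hypothetical metadata key that the replicator sets on every copy it writes.
REPLICA_MARKER = "replicated-from"


def replication_targets(source_bucket: str, metadata: Dict[str, str]) -> List[str]:
    """Return the buckets that should receive a copy of a newly written object.

    Objects that already carry the marker were written by the replicator
    itself, so they are not replicated again (recursion protection).
    """
    if REPLICA_MARKER in metadata:
        return []
    return [bucket for bucket in BUCKETS if bucket != source_bucket]


def replica_metadata(source_bucket: str, metadata: Dict[str, str]) -> Dict[str, str]:
    """Metadata to attach to each copy so that triggers in other regions skip it."""
    tagged = dict(metadata)
    tagged[REPLICA_MARKER] = source_bucket
    return tagged
```

A write to `data-oregon` fans out to the other two buckets, while the object triggers fired by those copies see the marker and stop.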
  15. Amazon DynamoDB

    • NoSQL database service, fully managed
    • Scalability:
      ◦ Size is infinitely scalable (except LSI limits)
      ◦ Provisioned read/write limits per second
        ▪ DynamoDB will scale up & down depending on provisioned capacity changes
    • Availability: Single Region 99.99%, Multi Region 99.999%
    • Durability: Synchronously replicates data across three facilities within an AWS Region
    • Automatic replication and failover within a region
    • Global Table async cross region replication
      ◦ Metrics available for monitoring

    * https://aws.amazon.com/dynamodb/faqs/
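To put the single-region and multi-region availability figures above side by side, the following sketch converts an availability percentage into the downtime per year that it permits. It assumes a 365-day year and treats the percentages as if they applied uniformly over the year, which is a simplification.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def max_downtime_minutes(availability_percent: float) -> float:
    """Downtime per year that is still within the given availability target."""
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    for target in (99.99, 99.999):
        print(target, round(max_downtime_minutes(target), 2))
```

With these assumptions, 99.99% allows about 52.6 minutes of downtime per year, while 99.999% allows about 5.3 minutes.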
  16. AutoScaling DynamoDB

    • Pricing based on provisioned capacity
    • Applications will receive errors after 5 minutes of burst

    * OpsGenie Engineering
  17. AutoScaling DynamoDB

    AWS Built-in AutoScale
    • DynamoDB released in Jan 2012
    • AWS AutoScale released in June 2017
    • Inspired by open source
    • Works with 5 minute interval alarms
      ◦ Not efficient for responding to traffic spikes

    * https://aws.amazon.com/dynamodb/faqs/

    OpsGenie AutoScale
    • OpsGenie released in July 2012
    • Initial server-based version released in June 2014
    • Leverages CloudWatch Alarms & Lambda
    • Can work with 1 minute interval alarms
    • Custom logic allows better handling of scale up & down
    • ! Additionally, applications should trigger scale-up upon capacity errors and spikes
    • Partitions do not shrink
    • More partitions → more cost, capacity split
    • Partition addition is slow

    * OpsGenie Engineering
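The kind of custom scaling logic described above could look roughly like the sketch below. It is not OpsGenie's algorithm: the thresholds, the limits, and the doubling rule on throttling are all assumed values chosen for illustration. In a real setup the inputs would come from CloudWatch metrics or alarms and the result would be applied to the table; only the decision itself is shown here.

```python
import math

# All thresholds below are assumed values for illustration only.
TARGET_UTILIZATION = 0.70  # aim to keep consumed / provisioned near this ratio
SCALE_DOWN_BELOW = 0.40    # scale down only when utilization falls under this
MIN_CAPACITY = 5
MAX_CAPACITY = 10_000


def next_capacity(provisioned: int, consumed: float, throttled: bool = False) -> int:
    """Return the provisioned capacity to request for the next interval."""
    utilization = consumed / provisioned
    if throttled or utilization > TARGET_UTILIZATION:
        # Scale up so that current demand sits at the target utilization.
        wanted = math.ceil(consumed / TARGET_UTILIZATION)
        if throttled:
            # On capacity errors the real demand is unknown, so at least double.
            wanted = max(wanted, provisioned * 2)
    elif utilization < SCALE_DOWN_BELOW:
        wanted = math.ceil(consumed / TARGET_UTILIZATION)
    else:
        wanted = provisioned
    return max(MIN_CAPACITY, min(MAX_CAPACITY, wanted))
```

The `throttled` flag corresponds to the slide's point that applications themselves should be able to trigger a scale-up when they hit capacity errors, instead of waiting for the next alarm interval.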
  18. DynamoDB Cross Region Replication

    AWS Built-in Global Table
    • Availability 99.99% → 99.999%
    • DynamoDB released in Jan 2012
    • Global Table released in Nov 2017
    • 1.5-2 years of R&D & internal testing
    • Async
    • Metrics available for latency
    • Recursion protection by depending on new attributes
      ◦ aws:rep:updateregion
      ◦ aws:rep:updatetime
    • Conflicts can happen
      ◦ Last write wins, full update
      ◦ No metrics for conflicts

    * https://aws.amazon.com/dynamodb/faqs/

    OpsGenie Cross Region Replication
    • Born from the need for higher availability & disaster recovery
    • Released in October 2016
    • We were not aware of Global Table
      ◦ ! Importance of being in contact with AWS
    • Leverages DynamoDB Streams, only read cost
    • Async replication: Lambdas attached to DynamoDB Streams
    • Detailed logging & metrics available
    • Recursion protection:
      ◦ Each application needs to add additional attributes
    • Conflicts should be handled
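The two ideas on this slide, recursion protection through extra attributes and last-write-wins conflict handling, can be sketched as two small decisions. This is an assumption-laden illustration: the attribute names `updateRegion` and `updateTime` are hypothetical stand-ins for whatever each application adds, and the stream and table I/O that a Lambda would perform is omitted.

```python
from typing import Any, Dict, Optional

Item = Dict[str, Any]

# Hypothetical attribute names that every writer is assumed to set.
REGION_ATTR = "updateRegion"
TIME_ATTR = "updateTime"


def should_forward(record: Item, local_region: str) -> bool:
    """Forward only changes that originated in this region.

    Changes that arrived through replication carry another region's name,
    so they are not sent back out (recursion protection).
    """
    return record.get(REGION_ATTR) == local_region


def resolve(local: Optional[Item], incoming: Item) -> Item:
    """Last write wins: keep whichever full item has the newer update time.

    On equal timestamps the local item is kept.
    """
    if local is None or incoming[TIME_ATTR] > local[TIME_ATTR]:
        return incoming
    return local
```

Last write wins silently drops the older update, which is why the slide notes that conflicts need explicit handling and ideally their own metrics.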
  19. DynamoDB Zero Downtime Migration

    • DevOps & modern tools allow faster delivery
      ◦ Changes lead to data migration
    • Goal: Migrate all data while systems are running
    • Requirements:
      ◦ Migrate new data by using DynamoDB Streams
      ◦ Migrate existing data by using DynamoDB parallel full scan
      ◦ Preserve compatibility within the application while switching from v1 to v2
    • Challenges:
      ◦ Conflicts can happen between the multiple data processors above
      ◦ Conflict handling should be carefully engineered
      ◦ Result of migration should also be verified
        ▪ Data changes even while verification is running
      ◦ If migration is risky, backward compatibility should also be designed

    * https://aws.amazon.com/dynamodb/faqs/
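One way to picture the conflict between the stream processor and the full-scan processor described above is the in-memory sketch below. It is a simplified model under stated assumptions, not the migration OpsGenie ran: tables are plain dictionaries, the `to_v2` schema change is hypothetical, and deletes that race with the scan are not covered. Against DynamoDB, the "write only if absent" step would typically be expressed as a conditional write.

```python
from typing import Any, Dict, List, Optional

Item = Dict[str, Any]
Table = Dict[str, Item]


def to_v2(item: Item) -> Item:
    """Hypothetical schema change: v2 stores 'name' as 'display_name'."""
    return {"display_name": item["name"]}


def apply_stream_change(target: Table, key: str, new_image: Optional[Item]) -> None:
    """Live path: every change from the stream overwrites the v2 copy."""
    if new_image is None:
        target.pop(key, None)
    else:
        target[key] = to_v2(new_image)


def backfill(scanned: Table, target: Table) -> int:
    """Scan path: write only keys that are absent, so a stale scan page never
    overwrites newer data already written by the stream path."""
    written = 0
    for key, item in scanned.items():
        if key not in target:
            target[key] = to_v2(item)
            written += 1
    return written


def mismatches(source: Table, target: Table) -> List[str]:
    """Verification: keys whose v2 copy differs from the converted v1 item."""
    keys = set(source) | set(target)
    return sorted(
        key for key in keys
        if (to_v2(source[key]) if key in source else None) != target.get(key)
    )
```

Because data keeps changing while verification runs, a non-empty `mismatches` result on a live system would need a re-check after replication catches up before it is treated as a real difference.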
  20. OpsGenie Reliability Effort

    • DevOps, CI, CD, Microservices
    • Chaos Engineering & Tools
    • Cross Region Replication
    • Custom Blue Green Deployment for EC2
    • Custom EC2 - Kinesis - DynamoDB AutoScaling
    • DynamoDB Real Time Backup to S3
    • OpsGenie Resource Creation
    • Hell of monitoring, alerting
    • Bunch of Security Operations
    • Deploy with MFA
    • …
  21. Thanks for watching

    See you in the next event.

    Follow us:
    • Events on academy.opsgenie.com
    • Blogs on engineering.opsgenie.com