Surviving an Amazon Outage

Neil Armitage
September 15, 2022

Transcript

  1. ©Continuent 2012 Overview

    • Continuent’s external/internal infrastructure is built in AWS
    • Review carried out in the summer of 2012 after several AWS outages
    • Treated the review as a customer engagement
    • Further review in autumn 2012, leading to the multi-cloud deployment
    Sunday, 21 April 13
  2. What is AWS

    Amazon Web Services is a collection of remote computing services (also called web services) that together make up a cloud computing platform. The central services are EC2 (compute) and S3 (storage).
  3. AWS Regions

    • Ireland (3 AZ)
    • Sao Paulo (2 AZ)
    • Northern Virginia (5 AZ)
    • Oregon (3 AZ)
    • California (3 AZ)
    • Singapore (2 AZ)
    • Tokyo (3 AZ)
    • Sydney (2 AZ)
  4. AWS Availability Zones

    [Diagram labels: a Region containing three Availability Zones; a second Region containing two Availability Zones.]
  5. AWS Services

    • Compute - EC2
    • Network - Route 53 and Virtual Private Cloud (VPC)
    • Content Delivery - CloudFront
    • Storage - S3, Glacier, EBS
    • Database - DynamoDB, RDS, Redshift, SimpleDB
    • Deployment - CloudFormation, Beanstalk, OpsWorks
  6. AWS Size*

    • Between 100K and 500K physical servers
    • 1.5 million public IP addresses
    • S3 holds > 2 trillion objects and serves 1.1 million requests per second
    • 1/3 of daily users access a site running on AWS
    • 1% of internet traffic goes through Amazon infrastructure
    * Estimates based on various internet sources
  7. Continuent Systems

    • External facing website
    • Jira/Confluence internal systems
    • Subversion
    • Jenkins build system
  8. External Website

    [Diagram labels: Internet; Elastic IP; Web Server; DB Server; all inside one Availability Zone of one Region.]
  9. Jira/Confluence/Subversion

    [Diagram labels: Internet; Elastic IP; App Server (Jira, Confluence); SVN Server; MySQL; all inside one Availability Zone of one Region.]
  10. AWS Problems Summer 2012

    “Amazon Cloud Hit by Real Clouds, Downing Netflix, Instagram, Other Sites”
    Severe storms caused power outages at AWS US-East data centers, and generators failed, taking out 7% of EC2 instances.
    http://www.pcworld.com/article/258627/amazon_cloud_hit_by_real_clouds_knocking_out_popular_sites_like_netflix_instagram.html
  11. Migration Plan

    • Move to a clustered Continuent Tungsten environment
    • Ensure all components are replicated into at least one other AWS Region
    • Limited downtime on customer-facing systems
    • Minimal downtime on internal systems
  12. [Untitled diagram]

    [Diagram labels: Data Service: nyc; Master, Slave, Slave; Replicator (x3); Manager (x3); App Logic with Tungsten Connector (x2).]
  13. [Untitled diagram]

    [Diagram labels: Data Service: nyc; Master, Slave, Slave; Replicator (x3); Manager (x3); App Logic with Tungsten Connector (x2); Monitoring and control (x2).]
  14. [Untitled diagram]

    [Diagram: same labels as slide 13.]
  15. [Untitled diagram]

    [Diagram: same labels as slide 13.]
  16. Website Database Tier - Round 1

    [Diagram labels: Region US-EAST-1 with Availability Zones 1B and 1C; Region US-WEST-1 with Availability Zone 1C; S3 Backups (x2); Connectors.]
  17. DB Failures - Failure in US-EAST-1C

    [Diagram labels: Region US-EAST-1 with Availability Zones 1B and 1C; Region US-WEST-1 with Availability Zone 1C; S3 Backups (x2); Connectors.]
  18. DB Failures - Failure in US-EAST

    [Diagram labels: Region US-EAST-1 with Availability Zones 1B and 1C; Region US-WEST-1 with Availability Zone 1C; S3 Backups (x2); Connectors.]
  19. Website Web Tier - Round 1

    [Diagram labels: Region US-EAST-1 with Availability Zones 1B and 1C; Region US-WEST-1 with Availability Zone 1C; S3 Backups (x2); Internet; EIP.]
  20. Web Failures - Failure in US-EAST-1C

    [Diagram labels: Region US-EAST-1 with Availability Zones 1B and 1C; Region US-WEST-1 with Availability Zone 1C; S3 Backups (x2); Internet; EIP.]
  21. Web Failures - Failure in US-EAST

    [Diagram labels: Region US-EAST-1 with Availability Zones 1B and 1C; Region US-WEST-1 with Availability Zone 1C; S3 Backups (x2); Internet; EIP; DNS Update.]
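The two web-failure slides suggest a two-level decision: if one Availability Zone fails, keep serving from the same region; if the whole region fails, update DNS to point at the other region. The sketch below only illustrates that decision under this reading of the diagrams. The health map and function names are hypothetical, and nothing here calls AWS or a DNS provider.

```python
def pick_serving_location(health, region_order):
    """Return (region, zone) of the first healthy web server, or None.

    `health` maps region -> zone -> bool (True means the web server in that
    zone is healthy). Regions are tried in `region_order`, so the primary
    region is preferred while any of its zones is still up.
    """
    for region in region_order:
        for zone, healthy in sorted(health.get(region, {}).items()):
            if healthy:
                return region, zone
    return None


def needs_dns_update(current_region, health, region_order):
    """True only when traffic has to move to a different region."""
    location = pick_serving_location(health, region_order)
    return location is not None and location[0] != current_region


if __name__ == "__main__":
    order = ["us-east-1", "us-west-1"]
    # Hypothetical snapshot: only zone 1c in US-EAST-1 is down.
    zone_failure = {
        "us-east-1": {"1b": True, "1c": False},
        "us-west-1": {"1c": True},
    }
    print(pick_serving_location(zone_failure, order))
    print(needs_dns_update("us-east-1", zone_failure, order))
```

Keeping the "where to serve from" and "does DNS need to change" questions separate mirrors the slides: a zone failure is handled inside the region, and only a region failure involves DNS.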
  22. Jira/Confluence/SVN - Round 1

    [Diagram labels: Region US-EAST-1 with Availability Zone 1C; Region US-WEST-1 with Availability Zone 1C; S3 Backups (x2); Internet; EIP.]
  23. AWS Failures - Autumn 2012

    “Amazon Web Services outage takes out popular websites again”
    • EBS degraded performance
    • Problems allocating new volumes
    http://www.pcworld.com/article/2012852/amazon-web-services-outage-takes-out-popular-websites-again.html
  24. Website Database Tier - Round 2

    [Diagram labels: Region US-EAST-1 with Availability Zones 1B and 1C; Region US-WEST-1 with Availability Zone 1C; S3 Backups (x2); Rackspace.]
  25. Website Web Tier - Round 2

    [Diagram labels: Region US-EAST-1 with Availability Zones 1B and 1C; Region US-WEST-1 with Availability Zone 1C; S3 Backups (x2); Internet; EIP; Rackspace.]
  26. Jira/Confluence/SVN - Round 2

    [Diagram labels: Region US-EAST-1 with Availability Zone 1C; Region US-WEST-1 with Availability Zone 1C; S3 Backups (x2); Internet; EIP; Rackspace.]
  27. Best Practices

    • RAID EBS volumes (RAID1)
    • Backups
      • xtrabackup (backed up into S3)
      • EBS snapshot

    ec2-consistent-snapshot \
      --mysql --freeze-filesystem /vol \
      --region eu-west-1 \
      --description "$(hostname) RAID snapshot $(date +'%Y-%m-%d %H:%M:%S')" \
      vol-1f9a6446 vol-649a643d
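When the snapshot is driven from a scheduler or wrapper script, it can help to assemble the `ec2-consistent-snapshot` argument list in code rather than as a shell string. The sketch below is only an illustration: it uses just the flags that appear on the slide, the function name `snapshot_command` is hypothetical, and it builds the argument list without running anything.

```python
import shlex
import socket
from datetime import datetime


def snapshot_command(volume_ids, region, mount_point, hostname=None, now=None):
    """Build the argument list for the snapshot command shown on the slide.

    Only flags from the slide are used: --mysql, --freeze-filesystem,
    --region and --description. `hostname` and `now` can be injected so
    the result is reproducible in tests.
    """
    hostname = hostname or socket.gethostname()
    now = now or datetime.now()
    description = f"{hostname} RAID snapshot {now:%Y-%m-%d %H:%M:%S}"
    return [
        "ec2-consistent-snapshot",
        "--mysql",
        "--freeze-filesystem", mount_point,
        "--region", region,
        "--description", description,
        *volume_ids,
    ]


if __name__ == "__main__":
    argv = snapshot_command(
        ["vol-1f9a6446", "vol-649a643d"], region="eu-west-1", mount_point="/vol"
    )
    # Print a shell-quoted form for inspection; pass `argv` to
    # subprocess.run(argv, check=True) to actually execute it.
    print(shlex.join(argv))
```

Passing a list instead of a single string avoids quoting problems with the spaces in the description.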
  28. Best Practices

    • Monitoring
      • Nagios scripts converted to email alerts
      • New Relic
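One way to read "Nagios scripts converted to email alerts" is a small wrapper that turns a check's exit code and output into an email. The sketch below assumes the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); the function name, host and service names, and addresses are hypothetical, and the message is only built, not sent.

```python
from email.message import EmailMessage

# Standard Nagios plugin exit codes.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def build_alert(host, service, exit_code, output, sender, recipient):
    """Return an EmailMessage for a non-OK check result, or None for OK.

    Exit codes outside 0-3 are treated as UNKNOWN so that a misbehaving
    check still produces an alert.
    """
    state = NAGIOS_STATES.get(exit_code, "UNKNOWN")
    if state == "OK":
        return None
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = f"[{state}] {service} on {host}"
    message.set_content(output)
    return message


if __name__ == "__main__":
    alert = build_alert(
        host="db1",
        service="mysql-replication",
        exit_code=2,
        output="Slave lag 900s",
        sender="nagios@example.com",
        recipient="ops@example.com",
    )
    # Hand the message to smtplib.SMTP(...).send_message(alert) to deliver it.
    print(alert["Subject"])
```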
  29. Lessons Learnt

    • EC2 instances fail
    • One of anything is never enough
    • Don’t assume you can spin up more resources instantly
    • Think multi-cloud, public/private
  30. Further Plans

    • Realtime replication of web assets (GlusterFS?)
    • Introduce an Elastic Load Balancer in front of the US-EAST web servers to allow automatic web failover
    • Migrate into a VPC
  31. Contact

    Continuent Website: http://www.continuent.com
    Tungsten Replicator 2.0: http://code.google.com/p/tungsten-replicator
    Our Blogs:
    http://scale-out-blog.blogspot.com
    http://datacharmer.blogspot.com
    http://flyingclusters.blogspot.com
    560 S. Winchester Blvd., Suite 500, San Jose, CA 95128
    Tel +1 (866) 998-3642
    Fax +1 (408) 668-1009
    e-mail: [email protected]