Failover Early: When to Failover at Your CDN

© AKAMAI - EDGE 2017
Failover early
When to failover at your CDN

Global Consulting Services
Accelerate our customers' business through a partnership focused on expertise, innovation, and education.

Manuel Alvarez, Enterprise Architect
https://www.linkedin.com/in/manueldalvarez/
@MD_A13

"By failing to prepare, you are preparing to fail." - Benjamin Franklin

You Are At Risk
Risk is the probability of an adverse event occurring and the magnitude of its consequences.
Risk mitigation reduces the likelihood of the event or the consequences of its occurrence:
•Risk Avoidance
•Risk Limitation
•Risk Transference
•Risk Acceptance
Mitigation cost ≤ cost of failure
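The rule "mitigation cost ≤ cost of failure" can be turned into a quick check. The sketch below assumes that "cost of failure" is read as an expected loss (probability times impact); that reading, and the function names, are illustrative assumptions and not something the slide spells out.

```python
def expected_loss(probability: float, impact_cost: float) -> float:
    """Expected cost of a failure: how likely it is times how much it costs."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return probability * impact_cost


def worth_mitigating(mitigation_cost: float, probability: float, impact_cost: float) -> bool:
    """Mitigate only while the mitigation costs no more than the failure it prevents."""
    return mitigation_cost <= expected_loss(probability, impact_cost)


if __name__ == "__main__":
    # Hypothetical numbers: a 25% yearly chance of an outage costing 100,000.
    print(worth_mitigating(20_000, 0.25, 100_000))
    print(worth_mitigating(40_000, 0.25, 100_000))
```

When the check fails, the remaining options from the list above are to accept the risk or transfer it.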

Risk Mitigation is DevOps
You already monitor “adversities” in real time.
Automate error handling as you automate deployment!
Call it constant availability instead of disaster recovery.
99.9% availability = 8.76 hours of downtime per year
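The 8.76 hours figure follows from simple arithmetic on a 365-day year. A minimal sketch, with a helper name chosen here for illustration:

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a service may be down at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        hours = downtime_hours_per_year(target)
        print(f"{target}% availability allows about {hours:.2f} hours of downtime per year")
```

Each extra "nine" cuts the allowed downtime by a factor of ten, which is why the cost of mitigation rises quickly with the availability target.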

Akamai and Failover
Failover serves an alternate response when an error condition is met.
[Diagram: the Client requests content through Akamai; when the Origin returns connection errors, Akamai serves the alternate response {"task":"plan B"} from NetStorage.]

Good Idea: Network Problems
[Diagram: Client, Akamai, Origin (connection errors), Cloud Storage serving {"task":"plan B"}]
Akamai captures errors that do not show up in your logs:
•Connection timeouts
•Network (link) problems
The most expensive item is the one you cannot find.
•Failover is included in Ion, DSA, and most content delivery products
Akamai can transform errors into client-friendly messages.

Good Idea: IoT Friendly Errors
Low-capacity devices can hang if a well-formatted response is not received.
Serve an empty or “default” JSON/XML error to IoT devices and API clients.
Set Akamai read and connection timeouts to less than 30s.
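As an illustration of a "default" error body, the sketch below builds a small, always well-formed JSON or XML document that could be stored as the failover object. The field names (status, retry_after_seconds) are hypothetical, and the function models only the payload, not any Akamai configuration.

```python
import json
from typing import Tuple
from xml.etree import ElementTree as ET

# Hypothetical default payload; pick fields your devices already understand.
DEFAULT_ERROR = {"status": "unavailable", "retry_after_seconds": 60}


def fallback_body(accept: str) -> Tuple[str, str]:
    """Return (content_type, body) for a failover response the device can always parse."""
    if "xml" in accept.lower():
        root = ET.Element("error")
        for key, value in DEFAULT_ERROR.items():
            ET.SubElement(root, key).text = str(value)
        return "application/xml", ET.tostring(root, encoding="unicode")
    # JSON is the default for API clients that do not ask for XML.
    return "application/json", json.dumps(DEFAULT_ERROR)


if __name__ == "__main__":
    print(fallback_body("application/json"))
    print(fallback_body("text/xml"))
```

Keeping the body tiny and static means it can be served from storage even when the origin is completely unreachable.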

Case Study: Pilot Light Disaster Recovery
Disaster Recovery retains business continuity in the event of a disaster.
Pilot Light or Warm Environments are scaled-down replicas of your production environment waiting to receive traffic in the event of a disaster.
•In a Cloud or Infrastructure as Code environment, automated processes combined with auto-scaling increase the environment capacity
Challenges with Pilot Light or Warm Environments:
•Scaling to meet the Recovery Time Objective (RTO)
•Time it takes to shift traffic to the DR location
•Human errors

[Diagram, Case Study: Pilot Light Disaster Recovery. Labels: User; Edge Server; NetStorage; NGINX (decides origin, controls traffic); Origin: F5, Microservices, MySQL; Disaster Recovery: Route 53, Elastic Load Balancer, Auto-scaling group, EC2 Instances, RDS, S3]

[Diagram, Case Study: Pilot Light Disaster Recovery, failover in progress. Labels: User; Edge Server; NetStorage; NGINX (decides origin, controls traffic); Origin: F5, Microservices, MySQL; Disaster Recovery: Route 53, Elastic Load Balancer, Auto-scaling group, EC2 Instances, RDS, S3. Callouts: "Send 50%" and "Scale me up!"]

Case Study: Bots
There are valid reasons to serve alternate content to bots:
•Single Page Apps: prerendered pages
•Excluding them from A/B tests
•Malicious bots doing malicious things, e.g. crawling your prices or attacking
Do a Cost Benefit Analysis:
•What is the risk/impact of a false positive?
•Does Bot Manager fit your use case?
•What is the cost of replicating the logic on your side?
•Is there a conversion impact related to performance?

Origin Infrastructure Errors
The higher the stack, the higher the abstraction.
•Infrastructure, database, component communications, overflows, etc. must be monitored and fixed internally
•Akamai cannot see these errors!
•Leverage your cloud provider's monitoring tools

Requests that Trigger an Action
Financial transactions (buying stocks, purchasing goods)
•Can result in double billing
•Contain PII
PUT or POST requests on a RESTful API
Best practices:
•Validate transactions at the back end
•Keep the checkout process state at the client and let it decide when to retry
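One common way to validate transactions at the back end is to have the client attach a transaction identifier and let the server ignore repeats. The sketch below is an in-memory illustration under that assumption; the class and field names are hypothetical and the slide does not prescribe this particular mechanism.

```python
import uuid


class PaymentBackend:
    """Toy back end that charges each transaction id at most once."""

    def __init__(self) -> None:
        self._receipts = {}  # transaction_id -> receipt already issued
        self.charges = 0     # how many times money actually moved

    def charge(self, transaction_id: str, amount_cents: int) -> dict:
        # A replayed request (client retry or CDN failover) gets the stored receipt back.
        if transaction_id in self._receipts:
            return self._receipts[transaction_id]
        self.charges += 1
        receipt = {
            "transaction_id": transaction_id,
            "amount_cents": amount_cents,
            "status": "charged",
        }
        self._receipts[transaction_id] = receipt
        return receipt


if __name__ == "__main__":
    backend = PaymentBackend()
    tx_id = str(uuid.uuid4())       # the client keeps this id with its checkout state
    backend.charge(tx_id, 4999)
    backend.charge(tx_id, 4999)     # the client decided to retry
    print(backend.charges)          # the customer was billed once
```

Because the client owns the identifier, it can retry on its own schedule without relying on the CDN to replay the request.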

Infinite Failover
You can nest failover conditions to increase resiliency.
Every try/retry takes time; the client will abort before you respond.
Best practices:
•Retry three (3) or fewer times
•Set timeouts on each retry to reduce client wait time
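The two best practices above can be sketched as a bounded failover chain. This is a generic illustration, not Akamai configuration: each source is modeled as a hypothetical callable that takes a timeout in seconds and either returns a body or raises, and the default limits are assumptions.

```python
import time
from typing import Callable, Optional, Sequence


class FailoverExhausted(Exception):
    """Raised when no source answered within the allowed tries and time."""


def fetch_with_failover(
    sources: Sequence[Callable[[float], str]],
    per_try_timeout: float = 2.0,
    max_tries: int = 3,
    total_budget: float = 5.0,
) -> str:
    """Try at most max_tries sources, never exceeding the total time budget."""
    deadline = time.monotonic() + total_budget
    last_error: Optional[Exception] = None
    for source in list(sources)[:max_tries]:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # the client has probably given up already
        try:
            # Each try gets its own timeout, capped by what is left of the budget.
            return source(min(per_try_timeout, remaining))
        except Exception as exc:
            last_error = exc
    raise FailoverExhausted(f"no source answered in time: {last_error!r}")


def failing_origin(timeout: float) -> str:
    raise TimeoutError("origin did not respond")


def static_backup(timeout: float) -> str:
    return '{"task":"plan B"}'


if __name__ == "__main__":
    print(fetch_with_failover([failing_origin, static_backup]))
```

Capping both the number of tries and the total budget keeps the worst-case wait predictable, which matters more to the client than squeezing in one more attempt.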

Centralizing Independent Deployments
Microservices are developed by independent teams that release containers asynchronously.
Each container has its own deployment pipeline:
•The container owner might opt for deployment strategies such as canary deployment
•Canary deployments consist of rolling out releases to a subset of users or servers first
Challenge: No centralized source of information/governance
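"A subset of users" is often chosen deterministically so that the same user always lands on the same version. A minimal sketch of such a selection, assuming a string user identifier and a percentage knob (both hypothetical, and independent of where the routing decision is enforced):

```python
import hashlib


def in_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically place a user in the canary group for the given percentage."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < canary_percent


if __name__ == "__main__":
    users = [f"user-{n}" for n in range(10_000)]
    share = sum(in_canary(user, 10) for user in users) / len(users)
    print(f"share of users routed to the canary: {share:.1%}")
```

Hashing avoids storing any per-user state, and raising the percentage only ever adds users to the canary group.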

Centralizing Independent Deployments
Bad idea: centralizing the canary deployment of containers at the CDN
•The CDN is on a different layer and does not see your containers, so it would require:
  •Exposing containers to the Internet
  •Adding a unique cookie per container with a value that indicates the container version
Best practice: Control the release at the container level, e.g. using Envoy

Excessive Alerting
You will always see errors.
Good: dashboards that display real-time data and take action on spikes or abnormalities
Bad: automated error escalation for every single error
•Do not page on-duty staff or management unnecessarily
•Do not open tickets because you disagree with a design option
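Acting on spikes rather than on every error can be as simple as comparing the current error rate with a known baseline. The thresholds and function names in this sketch are illustrative assumptions, not recommended values.

```python
def error_rate(errors: int, requests: int) -> float:
    """Fraction of requests that failed in the observation window."""
    return errors / requests if requests else 0.0


def should_page(
    errors: int,
    requests: int,
    baseline_rate: float,
    spike_factor: float = 3.0,
    min_requests: int = 100,
) -> bool:
    """Page only when the window has enough traffic and the rate clearly exceeds the baseline."""
    if requests < min_requests:
        return False  # too little traffic to tell a spike from noise
    return error_rate(errors, requests) > baseline_rate * spike_factor


if __name__ == "__main__":
    # Hypothetical windows against a 1% baseline error rate.
    print(should_page(errors=5, requests=1000, baseline_rate=0.01))
    print(should_page(errors=80, requests=1000, baseline_rate=0.01))
```

Everything below the paging threshold still belongs on the dashboard; it just does not wake anyone up.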

One Size Fits All
Implementing a catch-all strategy for all your errors
Bad for you:
•No error information
•Forces you to analyze logs and “guess” the error
Bad for the client:
•No navigation
•No branding
•No call to action

Error Pages Best Practices
The failover page must:
•Be branded
•Contain the default navigation
•Include search capabilities
•Display error information
  •Body
  •Query String
  •JSON/XML
•Current
Do not develop a static page per error; create dynamic pages that display error information.
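A single dynamic page can cover every error by taking the error details as parameters. The sketch below uses only the standard library; the markup, site name, and parameter names (status, reason, reference) are placeholders chosen for illustration.

```python
from html import escape
from string import Template

# One template for all errors: branding, navigation, and search stay the same,
# only the error details change.
ERROR_PAGE = Template(
    "<html><body>"
    "<header>Example Store</header>"
    '<nav><a href="/">Home</a> <a href="/help">Help</a></nav>'
    '<form action="/search"><input name="q"><button>Search</button></form>'
    "<main><h1>Something went wrong</h1>"
    "<p>Error $status: $reason</p>"
    "<p>Reference: $reference</p></main>"
    "</body></html>"
)


def render_error_page(status: int, reason: str, reference: str) -> str:
    """Fill the shared template with escaped error details, e.g. taken from the query string."""
    return ERROR_PAGE.substitute(
        status=status,
        reason=escape(reason),
        reference=escape(reference),
    )


if __name__ == "__main__":
    print(render_error_page(503, "Origin connection timeout", "abc123"))
```

Escaping the values matters because they arrive from the request and would otherwise allow markup injection into the error page.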

Summary
Good use cases for failover:
1. Network related: errors and mapping
2. Nullipotent requests (GET)
3. Improved efficiency over origin: reduced latency, better capability, risk transference
NOT good use cases for failover:
1. Infrastructure errors and autonomous deployments
2. When the request triggers an action (POST, PUT, etc.)
3. When the cost or operational burden is higher than the consequences

Final Remarks
Talk about and mitigate risk.
Akamai is another tool in your toolbox.
Talk about constant availability instead of disaster recovery.
•Reuse artifacts: a Pilot Light environment is a minimum viable product (MVP) for Cloud Migrations
