Failover Early: When to Failover at Your CDN

As we all know, you can put off filing your taxes all you like, but they're going to happen no matter what. It's the same with data-center errors: nobody wants to talk about them, but the reality is that we all deal with timeouts, slow databases, dropped responses, and the like. Not talking about them doesn't solve the problem; you need a solid risk-mitigation strategy that layers the most cost-efficient tactics to reduce or eliminate adverse effects when things don't go as expected. CDNs are an extension of your infrastructure. They give you control over the transport network (the Internet) and provide cloud features that can be executed close to your clients to offload your infrastructure (whether cloud or on-premises). One of the features common to many CDNs is failover. Failover at this layer means serving an alternate response when a condition is met. Manuel Alvarez explores using the CDN as a failover tool, reviewing use cases and best practices and demonstrating how to decide whether to use a CDN by evaluating costs, benefits, operations, and time to mitigate.

Akamai Developer

October 11, 2017

Transcript

  1. © AKAMAI - EDGE 2017 Global Consulting Services

    Accelerate our customer's business through a partnership focused on expertise, innovation, and education. [email protected]
    Manuel Alvarez, Enterprise Architect. [email protected] https://www.linkedin.com/in/manueldalvarez/ @MD_A13
  3. By failing to prepare, you are preparing to fail

    - Benjamin Franklin
  4. You Are At Risk

    Risk is the probability of an adverse event occurring and the magnitude of its consequences.
    Risk mitigation reduces the likelihood or the consequences of its occurrence:
    • Risk Avoidance
    • Risk Limitation
    • Risk Transference
    • Risk Acceptance
    Mitigation cost ≤ cost of failure
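
The rule "mitigation cost ≤ cost of failure" can be made concrete if the cost of failure is read as an expected loss, that is, probability times magnitude. That reading is an assumption, as are the function names and figures below; this is a minimal sketch rather than a method from the talk.

```python
def expected_loss(probability: float, magnitude: float) -> float:
    """Risk as expected loss: likelihood of the event times its cost."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return probability * magnitude


def mitigation_is_worthwhile(mitigation_cost: float,
                             probability: float,
                             magnitude: float) -> bool:
    """Apply the rule: mitigation cost <= (expected) cost of failure."""
    return mitigation_cost <= expected_loss(probability, magnitude)


if __name__ == "__main__":
    # Hypothetical figures: a 10% yearly chance of an outage costing 50,000.
    print(mitigation_is_worthwhile(1_000, 0.10, 50_000))
    print(mitigation_is_worthwhile(10_000, 0.10, 50_000))
```

With these made-up numbers the cheaper mitigation passes the test and the more expensive one does not, which is the comparison the slide asks for.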
  5. Risk Mitigation is DevOps

    You already monitor “adversities” in real time.
    Automate error handling as you automate deployment!
    Call it constant availability instead of disaster recovery.
    99.9% availability = down 8.76 hours/year
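
The 99.9% figure above follows from simple arithmetic: a year has 8,760 hours, and 0.1% of that is 8.76 hours. A short sketch, with a hypothetical function name, that reproduces the number for other availability targets:

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours of downtime per year implied by an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% -> {downtime_hours_per_year(target):.2f} hours/year")
```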
  6. Akamai and Failover

    Failover serves an alternate response when an error condition is met.
    [Diagram labels: Client, Akamai, Origin]
  7. Akamai and Failover

    Failover serves an alternate response when an error condition is met.
    [Diagram labels: Client, Akamai, Origin, Connection Errors, {"task":"plan B"}, NetStorage]
  8. Good Idea: Network Problems

    [Diagram labels: Client, Akamai, Origin, Connection Errors, {"task":"plan B"}, Cloud Storage]
    Akamai captures errors that do not show up in your logs:
    • Connection timeouts
    • Network (link) problems
    The most expensive item is the one you cannot find.
    • Failover is included on Ion, DSA, and most content delivery products
    Akamai can transform errors into client-friendly messages.
  9. Good Idea: IoT Friendly Errors

    Low-capacity devices can hang if a well-formatted response is not received.
    Serve an empty or “default” JSON/XML error to IoT devices and API clients.
    Set Akamai read and connection timeouts to less than 30s.
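
The behavior described above can be sketched in plain Python: try the origin with a timeout well under 30 seconds, and if it fails or returns something that is not valid JSON, answer with a default document the device can always parse. This illustrates the idea only; it is not Akamai configuration, and the payload shape, the 503 status, and the `fetch_origin` callable are all assumptions.

```python
import json
from typing import Callable, Tuple

# Hypothetical default payload that even a low-capacity client can parse.
DEFAULT_ERROR = {"status": "unavailable", "data": []}


def respond(fetch_origin: Callable[[float], bytes],
            timeout_s: float = 10.0) -> Tuple[int, str]:
    """Return (status_code, body); the body is always well-formed JSON."""
    try:
        raw = fetch_origin(timeout_s)   # may raise on timeout or connection error
        json.loads(raw)                 # reject malformed payloads
        return 200, raw.decode("utf-8")
    except (OSError, ValueError):
        # OSError covers timeouts and connection errors;
        # ValueError covers invalid JSON and undecodable bytes.
        return 503, json.dumps(DEFAULT_ERROR)
```

Passing the fetch function in as a parameter keeps the sketch independent of any particular HTTP client.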
  10. Case Study: Pilot Light Disaster Recovery

    Disaster Recovery retains business continuity in the event of a disaster.
    Pilot Light or Warm Environments are scaled-down replicas of your production environment waiting to receive traffic in the event of a disaster.
    • In Cloud or Infrastructure as Code environments, automated processes combined with auto-scaling increase the environment capacity
    Challenges with Pilot Light or Warm Environments:
    • Scaling to meet the Recovery Time Objective (RTO)
    • Time it takes to shift traffic to the DR location
    • Human errors
  11. Case Study: Pilot Light Disaster Recovery

    [Diagram labels: User, Edge Server, Disaster Recovery, NGINX (decides origin, controls traffic), Route 53, Elastic Load Balancer, Auto-scaling group, EC2 Instances, RDS, S3, Origin, Microservices, MySQL, F5, NetStorage]
  12. Case Study: Pilot Light Disaster Recovery

    [Diagram labels: User, Edge Server, User, Disaster Recovery, NGINX (decides origin, controls traffic), Route 53, Elastic Load Balancer, Auto-scaling group, EC2 Instances, RDS, S3, Origin, Microservices, MySQL, F5, "Send 50%", "Scale me up!", NetStorage]
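
The "Send 50%" label suggests a weighted split between the primary origin and the disaster-recovery environment while the latter scales up. The talk's actual NGINX or Akamai configuration is not in the transcript, so the following is only a sketch of one common approach: hash a client identifier into a bucket so that each client consistently lands on the same side. The origin names and the function are hypothetical.

```python
import hashlib


def pick_origin(client_id: str, dr_percent: int) -> str:
    """Route roughly dr_percent of clients to disaster recovery, deterministically."""
    if not 0 <= dr_percent <= 100:
        raise ValueError("dr_percent must be between 0 and 100")
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return "disaster-recovery" if bucket < dr_percent else "primary"
```

Raising `dr_percent` in steps as the auto-scaling group grows moves more clients over without reshuffling the ones already routed to disaster recovery.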
  13. Case Study: Bots

    There are valid reasons to serve alternate content to bots:
    • Prerendered pages for Single Page Apps
    • Excluding them from A/B tests
    • Malicious bots doing malicious things, e.g. crawling your prices or attacking
    Do a cost-benefit analysis:
    • What is the risk/impact of a false positive?
    • Does Bot Manager fit your use case?
    • What is the cost of replicating the logic on your side?
    • Is there a conversion impact related to performance?
  14. Origin Infrastructure Errors

    The higher the stack, the higher the abstraction.
    • Infrastructure, database, component communications, overflows, etc. must be monitored and fixed internally
    • Akamai cannot see these errors!
    • Leverage your cloud provider's monitoring tools
  15. Requests that Trigger an Action

    Financial transactions (buy stocks, purchase goods)
    • Can result in double billing
    • Contain PII
    PUT or POST requests on a RESTful API
    Best practices:
    • Validate transactions at the back end
    • Keep checkout process state at the client and let it decide when to retry
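
One way to "validate transactions at the back end" is to have the client attach a unique transaction identifier and make the back end refuse to process the same identifier twice. The talk does not prescribe this mechanism, so the class and field names below are assumptions; the sketch shows why a replayed request then cannot cause double billing.

```python
from typing import Dict


class OrderBackend:
    """Hypothetical back end that never bills the same transaction twice."""

    def __init__(self) -> None:
        self._receipts: Dict[str, dict] = {}

    def purchase(self, transaction_id: str, amount_cents: int) -> dict:
        # A replayed request (client retry or CDN failover) returns the
        # original receipt instead of creating a second charge.
        if transaction_id in self._receipts:
            return self._receipts[transaction_id]
        receipt = {
            "transaction_id": transaction_id,
            "amount_cents": amount_cents,
            "charge_number": len(self._receipts) + 1,
        }
        self._receipts[transaction_id] = receipt
        return receipt

    @property
    def charges_made(self) -> int:
        return len(self._receipts)
```

The client keeps the identifier with its checkout state, so it can decide when to retry and reuse the same identifier when it does.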
  16. Infinite Failover

    You can nest failover conditions to increase resiliency.
    Every try/retry takes time; with too many, the client will abort before you respond.
    Best practices:
    • Retry three (3) or fewer times
    • Set timeouts on each retry to reduce client wait time
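
The two best practices above can be combined in a small loop: cap the number of tries at three and give each try its own timeout, bounded by an overall time budget. This is an illustrative sketch, not Akamai's failover logic; the function name, the default timeouts, and the idea of passing sources as callables are assumptions.

```python
import time
from typing import Callable, Optional, Sequence

MAX_TRIES = 3  # upper bound on tries, per the best practice above


def first_successful_response(
    sources: Sequence[Callable[[float], str]],
    per_try_timeout_s: float = 2.0,
    total_budget_s: float = 5.0,
) -> Optional[str]:
    """Try each source in order; stop after MAX_TRIES or when time runs out."""
    deadline = time.monotonic() + total_budget_s
    for source in sources[:MAX_TRIES]:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # the client has probably given up already
        try:
            # Each try gets its own timeout, capped by the remaining budget.
            return source(min(per_try_timeout_s, remaining))
        except OSError:
            continue  # timeout or connection error: move to the next source
    return None
```

A caller would typically list the origin first, then an alternate origin, then a static fallback, and treat `None` as the signal to return a generic error.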
  17. Centralizing Independent Deployments

    Micro-services are developed by independent teams that release containers synchronously.
    Each container has its own deployment pipeline.
    • The container owner might opt for deployment strategies such as canary deployment
    • Canary deployments consist of rolling out releases to a subset of users or servers first
    Challenge: No centralized source of information/governance
  18. Centralizing Independent Deployments

    Bad idea: centralize the canary deployment of containers at the CDN.
    The CDN is on a different layer and does not see your containers, so it would require:
    • Exposing containers to the Internet
    • Adding a unique cookie per container with a value that indicates the container version
    Best practice: Control the release at the container level, e.g. using Envoy
  19. Excessive Alerting

    You will always see errors.
    Good: dashboards displaying real-time data and taking action on spikes or abnormalities
    Bad: automating error escalation for every single error
    • Do not page on-duty staff or management unnecessarily
    • Do not open tickets because you disagree with a design option
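
Acting on spikes rather than on every error can be as simple as comparing the current error rate with a known baseline. The thresholds and names below are hypothetical; this is a sketch of the idea, not a monitoring product's API.

```python
def should_alert(errors: int,
                 requests: int,
                 baseline_error_rate: float,
                 spike_factor: float = 3.0,
                 min_requests: int = 100) -> bool:
    """Alert only when the error rate clearly exceeds the usual baseline."""
    if requests < min_requests:
        return False  # too little traffic to tell a spike from noise
    return errors / requests > baseline_error_rate * spike_factor
```

With a 1% baseline and the default factor of 3, a window with 50 errors in 1,000 requests triggers an alert, while a single error in 1,000 requests does not.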
  20. One Size Fits All

    Implementing a catch-all strategy for all your errors
    Bad for you:
    • No error information
    • Forces you to analyze logs and “guess” the error
  21. One Size Fits All

    Implementing a catch-all strategy for all your errors
    Bad for you:
    • No error information
    • Forces you to analyze logs and “guess” the error
    Bad for client:
    • No navigation
    • No branding
    • No call to action
  22. Error Pages Best Practices

    The failover page must:
    • Be branded
    • Contain the default navigation
    • Include search capabilities
    • Display error information
      • Body
      • Query String
      • JSON/XML
    • Be current
    Do not develop a static page per error; create dynamic pages that display error information.
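
A single dynamic page that reads the error details from the query string covers the "do not develop a static page per error" advice. The sketch below assumes hypothetical `code` and `detail` query parameters and a made-up template; it escapes both values because they come from the URL.

```python
import html
from urllib.parse import parse_qs, urlsplit

# Hypothetical template: one branded page with navigation, search, and details.
TEMPLATE = """<html><body>
<header>Example Brand | <a href="/">Home</a>
<form action="/search"><input name="q"></form></header>
<h1>Error {code}</h1>
<p>{detail}</p>
</body></html>"""


def render_error_page(url: str) -> str:
    """Build the error page from the query string of the failover URL."""
    params = parse_qs(urlsplit(url).query)
    code = params.get("code", ["unknown"])[0]
    detail = params.get("detail", ["Something went wrong."])[0]
    # Escape user-controlled values before placing them in HTML.
    return TEMPLATE.format(code=html.escape(code), detail=html.escape(detail))
```

For example, `render_error_page("/error?code=503&detail=Origin+timed+out")` yields the same branded page as any other error, with only the code and message changed.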
  23. Summary

    Failover good use cases:
    1. Network related: errors and mapping
    2. Nullipotent requests (GET)
    3. Improved efficiency over origin: reduced latency, better capability, risk transference
    Failover NOT good use cases:
    1. Infrastructure errors and autonomous deployments
    2. When the request triggers an action (POST, PUT, etc.)
    3. When cost or operational burden is higher than the consequences
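
The first two items of each list can be expressed as a small decision function: fail over at the CDN only for requests that do not change state and only for network-level errors. The error labels are hypothetical, and the third criterion (cost versus consequences) is a business judgment left out of this sketch.

```python
# Methods that do not change state on the server.
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}

# Hypothetical labels for errors the CDN can observe on the network path.
NETWORK_ERRORS = {"connect_timeout", "read_timeout", "connection_reset", "dns_failure"}


def should_failover_at_cdn(method: str, error_kind: str) -> bool:
    """True when the request is safe to replay and the error is network related."""
    return method.upper() in SAFE_METHODS and error_kind in NETWORK_ERRORS
```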
  24. Final Remarks

    Talk about and mitigate risk.
    Akamai is another tool in your toolbox.
    Talk about constant availability instead of disaster recovery.
    • Reuse artifacts: a pilot light environment is a minimum viable product (MVP) for cloud migrations