
Observability and resiliency patterns in the cloud - Serverless edition

Sara Gerion
February 06, 2020

Public talk presented at the ServerlessDays meetup in Amsterdam.

While observability empowers developers to be in full control of the services they own, resiliency relies on observability to automate failure handling.
At DAZN, using metrics to monitor the known and, even more importantly, the unknown is fundamental to building resilient applications. I will show some examples of the steps that can be taken to ensure that your own micro-services are healthy and resilient on the different layers of your (serverless) architecture.


Transcript

  1. 15 Logging gives us information related to the errors and operations of an application @Sarutule
  2. 24 Metrics give us information related to the state of the underlying infrastructure via discrete values collected during a specific amount of time
  3. 26 Metrics: useful strategies
    Capture missing data
    ▪ Missing logs
    ▪ Incorrect number of running resources
    HTTP requests
    ▪ Latency of inbound & outbound requests
    ▪ HTTP status
    ▪ Missing HTTP status (error on a TCP / connection layer)
    Don’t forget to study your data before getting started!
  4. 29 Monitors & alerts
    Missing data
    ▪ Notify if the service is unavailable
    Definition of different alert types
    ▪ Custom recipients and custom medium based on severity
    ▪ Better escalation & visibility for stakeholders
    Study your data (should this be a metric or the context of a metric?)
    ▪ Better metrics filtering
    ▪ Less data pollution
    Also monitor the stage environment
    ▪ Get notified of errors before they appear on prod
  5. 43 Observability patterns
    ◆ Logging
    ◆ Metrics
    ◆ Traceability
    ◆ Monitors
    ◆ Alerts
  6. 44 Resiliency patterns
    ◆ Redundancy of computing units
    ◆ Replication of data
    ◆ Load balancing
    ◆ DNS routing
    ◆ Delivery at the edge
  7. 45 Useful questions during design phase
    ◆ What can go wrong?
    ◆ How should my system react in case of that specific failure?
    ◆ Who should be notified, and how?
    ◆ How can we limit the impact for customers?
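
Slide 24 describes metrics as discrete values collected during a specific amount of time. As a minimal sketch of that idea (not code from the talk), the Python function below groups raw data points into fixed time windows and summarizes each one. The function name `aggregate`, the 60-second default window, and the summary fields are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def aggregate(points: Iterable[Tuple[float, float]],
              window_s: int = 60) -> Dict[int, Dict[str, float]]:
    """Group (timestamp_seconds, value) pairs into fixed windows.

    Returns a mapping from window start time to summary statistics.
    Windows that received no data points are simply absent, which is
    what a "missing data" monitor would look for.
    """
    buckets: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in points:
        # Round the timestamp down to the start of its window.
        window_start = int(timestamp // window_s) * window_s
        buckets[window_start].append(value)

    return {
        start: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        }
        for start, values in sorted(buckets.items())
    }
```

For example, `aggregate([(0, 100.0), (30, 300.0), (61, 50.0)])` yields one summary for the window starting at 0 (two points) and one for the window starting at 60 (one point).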
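
Slide 26 suggests tracking the latency and HTTP status of requests, including the case where no status arrives at all because the failure happened on the TCP / connection layer. The sketch below shows one way this could look for an outbound call using only the Python standard library. The `RequestMetric` shape, the `timed_request` helper, and its `opener` parameter are hypothetical names introduced here, not something shown in the talk.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RequestMetric:
    """One data point for an outbound HTTP request (hypothetical shape)."""
    url: str
    latency_ms: float
    status: Optional[int]  # None means no HTTP status was received


def timed_request(url: str,
                  opener: Callable = urllib.request.urlopen,
                  timeout: float = 2.0) -> RequestMetric:
    """Perform a request and record latency plus status, if any."""
    start = time.perf_counter()
    status: Optional[int] = None
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        status = err.code
    except OSError:
        # Connection refused, DNS failure, timeout, etc.:
        # the request failed before any HTTP status existed.
        status = None
    latency_ms = (time.perf_counter() - start) * 1000.0
    return RequestMetric(url=url, latency_ms=latency_ms, status=status)
```

Keeping `status` as `None` rather than inventing a placeholder code lets a monitor count connection-level failures separately from HTTP errors.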
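
Slide 29 combines several ideas: alert on missing data, choose recipients and medium by severity, and monitor the stage environment as well as prod. The following sketch puts those together in plain Python under stated assumptions. The routing table, the thresholds, the severity names, and the `evaluate` function are all invented for illustration and do not reflect any specific monitoring product or the setup described in the talk.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical routing table: severity -> (medium, recipients).
ROUTES = {
    "critical": ("pager", ["on-call"]),
    "warning": ("chat", ["team-channel"]),
    "info": ("email", ["stakeholders"]),
}


@dataclass
class Alert:
    severity: str
    medium: str
    recipients: List[str]
    message: str


def evaluate(service: str,
             environment: str,
             seconds_since_last_datapoint: float,
             error_rate: float,
             max_silence_s: float = 300.0,
             max_error_rate: float = 0.05) -> Optional[Alert]:
    """Return an Alert if the service looks unhealthy, else None."""
    if seconds_since_last_datapoint > max_silence_s:
        # Missing data: the service may be unavailable.
        severity = "critical" if environment == "prod" else "warning"
        message = (f"{service} ({environment}): no data for "
                   f"{seconds_since_last_datapoint:.0f}s")
    elif error_rate > max_error_rate:
        # Errors on stage are reported at lower severity so they are
        # seen before reaching prod without paging anyone.
        severity = "warning" if environment == "prod" else "info"
        message = (f"{service} ({environment}): error rate "
                   f"{error_rate:.1%} above threshold")
    else:
        return None

    medium, recipients = ROUTES[severity]
    return Alert(severity, medium, list(recipients), message)
```

With these assumed thresholds, `evaluate("checkout", "prod", 600, 0.0)` routes a critical alert to the pager, while the same silence on stage produces a chat warning.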