Reliability in a Serverless world

Driss Amri
February 24, 2023

Serverless @ Nederlandse Spoorwegen

Transcript

  1. ComIT
    Driss Amri
    (AWSome)
    February 2023
    Reliability in a Serverless world

  2. What is Reliability

  3. Availability           Downtime per year   Downtime per month
    99% (“Two Nines”)       3.65 days           7.2 hours
    99.9% (“Three Nines”)   8.76 hours          43.2 minutes
    99.99% (“Four Nines”)   52.6 minutes        4.32 minutes
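
The downtime figures on this slide follow from simple arithmetic: the allowed downtime is the period length multiplied by the unavailable fraction. A minimal sketch, assuming a 365-day year and a 30-day month (the month length that matches the slide's numbers); the function name `allowed_downtime` is made up for illustration:

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float, period: timedelta) -> timedelta:
    """Return the downtime budget for `period` at the given availability."""
    unavailable_fraction = 1 - availability_pct / 100
    return period * unavailable_fraction


YEAR = timedelta(days=365)   # assumption: non-leap year
MONTH = timedelta(days=30)   # assumption: 30-day month

if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99):
        per_year = allowed_downtime(pct, YEAR)
        per_month = allowed_downtime(pct, MONTH)
        print(
            f"{pct}%: {per_year.total_seconds() / 3600:.2f} hours per year, "
            f"{per_month.total_seconds() / 60:.2f} minutes per month"
        )
```

Each extra nine divides the budget by ten, which is why the step from three to four nines leaves only a few minutes per month for incidents and deployments combined.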

  6. 100% is the wrong reliability target for
    pretty much everything

  7. Reliability on AWS

  14. Serverless Reliability

  16. Multi-AZ out of the box

  17. Multi-Region

  18. Challenges
    ● Protecting downstream services that don’t scale as well
    ● Service limits and quotas
    ● More granular architectures
    ● Denial of Wallet
    ● Per Function and Service (mis)configuration
    ● Lots of services to choose from
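
The first challenge, protecting downstream services that do not scale as well, is often handled by limiting how fast callers may send requests. The deck does not say which mechanism the speaker used; the following is only a generic token-bucket sketch, and the class name `TokenBucket` and its parameters are assumptions for illustration:

```python
import time


class TokenBucket:
    """Allow at most `rate_per_s` calls per second, with bursts up to `capacity`."""

    def __init__(self, rate_per_s: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.clock = clock             # injectable so tests can control time
        self.last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False when the caller should back off."""
        now = self.clock()
        # Refill in proportion to the elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Note that state held in memory like this only limits a single execution environment. In a Lambda function, which scales out across many environments, a shared limit needs something external, such as a concurrency cap on the function or a queue in front of the downstream service.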

  19. Chaos Engineering

  20. Chaos Engineering is the discipline of
    experimenting on a system
    in order to build confidence in the system’s
    capability to withstand turbulent conditions in
    production.

  30. Common faults
    ● Network
      ○ Latency
      ○ Bandwidth
      ○ Failure to connect
      ○ 4XX/5XX HTTP response
    ● Resource exhaustion
      ○ CPU stress
      ○ Memory
      ○ Disk space
    ● Weaknesses
      ○ Error handling
      ○ Timeout values
      ○ Events
      ○ Fallbacks
      ○ Failovers
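
Two of the network faults listed above, latency and error responses, can be injected at the application level by wrapping a function handler. This is a minimal sketch and not the tooling shown in the talk's demo; `inject_faults`, `InjectedFault`, and the example `handler` are hypothetical names:

```python
import functools
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real failure while an experiment is running."""


def inject_faults(latency_s=0.0, error_rate=0.0, rng=random.random, sleep=time.sleep):
    """Decorator that delays every call and fails a fraction of calls on purpose."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)          # simulate a slow dependency
            if rng() < error_rate:        # rng() returns a float in [0.0, 1.0)
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Example: add 200 ms to every invocation and fail roughly 10% of them.
@inject_faults(latency_s=0.2, error_rate=0.1)
def handler(event, context):
    return {"statusCode": 200, "body": "ok"}
```

Keeping `rng` and `sleep` injectable makes the wrapper itself easy to test, and in a real experiment the fault settings would normally come from configuration so they can be switched off quickly.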

  31. Demo

  38. Static stability using Availability Zones
    https://aws.amazon.com/builders-library/static-stability-using-availability-zones/
    Beyond five 9s: Lessons from our highest available data planes
    https://www.youtube.com/watch?v=2L1S0zfnIzo
    Chaos testing for reliability (Chaos testen voor betrouwbaarheid)
    https://nsdigitaal.sharepoint.com/sites/TestenBijNS/SitePages/Chaos-testen-voor-betrouwbaarheid.aspx?source=https%3A%2F%2Fnsdigitaal.sharepoint.com%2Fsites%2FTestenBijNS
