Error Handling in Stateless Environments

Slide 1

Slide 1 text

devopsdays Tel Aviv
Error Handling in Stateless Environments
Nitzan Shapira, Epsagon

Slide 2

Slide 2 text

> whoami
Nitzan Shapira (@nitzanshapira)
- Software engineer for more than 12 years
- Co-Founder and CEO at Epsagon
- Tel Aviv

Slide 3

Slide 3 text

Things to discuss
- What is serverless? How is it different?
- What is stateless? Why is it relevant today?
- How to handle errors in such environments?
- How will it help my job?

Slide 4

Slide 4 text

What is serverless?
- Compute-as-a-Service: FaaS (Function-as-a-Service) and CaaS (Container-as-a-Service)
- Plus managed services (APIs)
- The result: don’t manage infrastructure, focus on business logic

Slide 5

Slide 5 text

Why serverless?
- Pay-per-use: reduces cloud compute cost by 90%
- Out-of-the-box auto-scaling
- DevOps → LowOps
- Higher developer velocity
- Focus on business logic – iterate faster
[Figure: server utilization]

Slide 6

Slide 6 text

The limitations of FaaS
- Limited memory
- Limited running time
- Cold starts
- Stateless
- Concurrency limit, plus some others…

Slide 7

Slide 7 text

The properties of serverless applications
- Serverless is micro-services
- Serverless applications are:
  - Highly distributed
  - Highly event-driven
- Utilizing managed services via APIs is key

Slide 8

Slide 8 text

A real example – HSBC
Source: re:Invent 2018

Slide 9

Slide 9 text

The challenge in serverless
[Figure labels: SIMPLE, COMPLEX. Credit: Yan Cui]

Slide 10

Slide 10 text

Troubleshoot and fix

Slide 11

Slide 11 text

What the community thinks
Source: 2018 Serverless Community Survey, serverless.com, July 2018
[Figure: survey chart; label: 2017 results]

Slide 12

Slide 12 text

Error handling in traditional environments

Slide 13

Slide 13 text

What do you do when something goes wrong?
- Take a look at the log – it will tell the story
- Connect to the host!
- Run a debugger!

Slide 14

Slide 14 text

Challenges of stateless environments
- Event-driven design
- No server to connect to – difficult to troubleshoot
- No current state – difficult to determine system health

Slide 15

Slide 15 text

Error handling in stateless environments

Slide 16

Slide 16 text

Types of failures in serverless
- Unhandled exception – from your own code
- Timeout – due to your code or an external service
- Out-of-memory – due to your code or misconfiguration
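A timeout typically ends the invocation before any cleanup code can run, so one common defense is to stop work while there is still time left. The sketch below assumes a context object that exposes `get_remaining_time_in_millis()`, as the AWS Lambda Python runtime does; `FakeContext`, `process_batch`, and the 500 ms safety margin are hypothetical names and values used only for illustration.

```python
class FakeContext:
    """Stand-in for the runtime's context object (hypothetical, for local runs)."""

    def __init__(self, remaining_ms):
        self._remaining_ms = remaining_ms

    def get_remaining_time_in_millis(self):
        return self._remaining_ms


def process_batch(items, context, safety_margin_ms=500):
    """Process items until the time budget is nearly exhausted.

    Returns (processed, leftover) so the caller can re-queue the leftover
    items instead of losing them to a hard timeout.
    """
    processed = []
    for item in items:
        if context.get_remaining_time_in_millis() < safety_margin_ms:
            break  # stop early; a hard timeout leaves no chance to clean up
        processed.append(item.upper())  # placeholder for the real work
    return processed, items[len(processed):]
```

Returning the leftover items makes the partial result explicit, which is easier to troubleshoot than an invocation that simply disappears at the time limit.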

Slide 17

Slide 17 text

Retries behavior
- Synchronous events
  - The invoking application is in charge of the error
- Asynchronous events
  - A retry mechanism is triggered for a certain period of time
- Stream-based events
  - A retry mechanism is triggered until the data is expired
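For synchronous events, where the invoking application owns the error, the caller usually has to implement its own retry loop. The following is a minimal sketch of one with exponential backoff; `call_with_retries` and its parameters are hypothetical, not part of any platform API.

```python
import time


def call_with_retries(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on failure wait base_delay * 2**(attempt - 1) and retry.

    Re-raises the last error once max_attempts is exhausted.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as error:  # a real caller would catch narrower errors
            last_error = error
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    raise last_error
```

The `sleep` argument is injectable so the backoff schedule can be checked without actually waiting.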

Slide 18

Slide 18 text

Retry behavior consequences
- Retries might change the logical flow of the application
- For retries to succeed, the code must be idempotent:
“Idempotence is the property of certain operations in mathematics and computer science that they can be applied multiple times without changing the result beyond the initial application” (Wikipedia).
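One way to make a handler safe under retries is to remember which events were already applied and return the stored result for repeats. The sketch below assumes every event carries a unique `id` field; `make_idempotent_handler` is a hypothetical helper, and the in-memory dictionary stands in for a durable external store, which a stateless function would need in practice.

```python
def make_idempotent_handler(apply_change, processed=None):
    """Wrap apply_change so each event ID takes effect at most once."""
    # Assumption: a dict stands in for a durable store shared across invocations.
    processed = {} if processed is None else processed

    def handler(event):
        event_id = event["id"]
        if event_id in processed:
            return processed[event_id]  # retry: skip the side effect
        result = apply_change(event)
        processed[event_id] = result
        return result

    return handler


balance = {"total": 0}


def credit(event):
    """Non-idempotent on its own: every call adds to the balance."""
    balance["total"] += event["amount"]
    return balance["total"]


handler = make_idempotent_handler(credit)
event = {"id": "evt-1", "amount": 10}
handler(event)
handler(event)  # simulated retry of the same event; balance is credited once
```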

Slide 19

Slide 19 text

Examples of idempotent operations
- Update the same DB entry to the same value multiple times
- Authenticate a user
- Check if a file exists, and if not, create an empty file with that name
Confusing to design, difficult to implement!
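The file example from the list can be sketched in a few lines. `ensure_empty_file` is a hypothetical helper name; it relies on `pathlib.Path.touch(exist_ok=True)`, which creates the file when it is missing and keeps the contents of an existing file.

```python
from pathlib import Path


def ensure_empty_file(path):
    """Create the file if it is missing; keep existing contents otherwise.

    Calling this once or many times leaves the same file in place, which is
    what makes the operation safe to retry.
    """
    target = Path(path)
    target.touch(exist_ok=True)
    return target
```

Using a single create-if-missing call avoids a separate "check, then create" step, which could race with a concurrent retry.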

Slide 20

Slide 20 text

Practical methods for retries
- Write idempotent code (tough!)
- Use a proper service, for example AWS Step Functions
Source: AWS
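With an orchestration service, the retry policy is declared as configuration rather than written inside the function. The following sketch builds a small Amazon States Language definition as a Python dictionary. The state names, the placeholder function ARN, and the retry values are hypothetical; `Retry`, `Catch`, `ErrorEquals`, `IntervalSeconds`, `MaxAttempts`, and `BackoffRate` are standard States Language fields.

```python
import json

# Hypothetical workflow: one task with declarative retries and a failure path.
state_machine = {
    "Comment": "Hypothetical workflow with retries",
    "StartAt": "ProcessOrder",
    "States": {
        "ProcessOrder": {
            "Type": "Task",
            # Placeholder ARN; replace with a real function ARN.
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:process-order",
            "Retry": [
                {
                    "ErrorEquals": ["States.Timeout", "States.TaskFailed"],
                    "IntervalSeconds": 2,   # wait before the first retry
                    "MaxAttempts": 3,       # give up after three retries
                    "BackoffRate": 2.0,     # double the wait each time
                }
            ],
            # Once retries are exhausted, route any error to a failure state.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
            "End": True,
        },
        "NotifyFailure": {
            "Type": "Fail",
            "Error": "OrderFailed",
            "Cause": "Retries exhausted",
        },
    },
}

definition_json = json.dumps(state_machine, indent=2)
```

Keeping the policy in the definition means the function code still needs to be idempotent, but the number of attempts and the backoff are visible in one place.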

Slide 21

Slide 21 text

AWS Step Functions – Serverless Orchestration

Slide 22

Slide 22 text

Troubleshooting with retries

Slide 23

Slide 23 text

SNS + AWS Step Functions

Slide 24

Slide 24 text

Deploying using Serverless Framework

Slide 25

Slide 25 text

Observability – why do we need it?
- Track system health
- Troubleshoot and fix
- Optimize performance and cost

Slide 26

Slide 26 text

Observability in serverless and stateless systems
Let’s go one by one

Slide 27

Slide 27 text

Track system health
System == Functions?

Slide 28

Slide 28 text

Functions are important
- Errors
- Timeout
- Out-of-memory
- Cold start

Slide 29

Slide 29 text

Track system health
System > Functions!
Serverless != Functions

Slide 30

Slide 30 text

Track system health
System > Functions!
- Functions
- APIs
- Transactions

Slide 31

Slide 31 text

Troubleshoot and fix
- Functions are not enough
- Need: track asynchronous events

Slide 32

Slide 32 text

Transactions

Slide 33

Slide 33 text

Tracing asynchronous invocations

Slide 34

Slide 34 text

Tracing asynchronous invocations

Slide 35

Slide 35 text

Tracing asynchronous invocations

Slide 36

Slide 36 text

Distributed tracing
…a trace tells the story of a transaction or workflow as it propagates through a (potentially distributed) system.
Distributed tracing is a method used to profile and monitor applications.

Slide 37

Slide 37 text

Distributed tracing
Jaeger

Slide 38

Slide 38 text

Implementing distributed tracing
- Manual tracing/instrumentation
  - Before/after calls
  - At the end of each micro-service
- High maintenance
- High potential of errors
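To show what manual instrumentation before and after calls looks like, here is a minimal sketch that records one span per wrapped call. `span`, `SPANS`, and `handle_order` are hypothetical names, and the in-memory list stands in for whatever tracing backend would receive the spans.

```python
import time
import uuid
from contextlib import contextmanager

SPANS = []  # hypothetical in-memory sink; a real tracer would export spans


@contextmanager
def span(name, trace_id):
    """Record the duration and outcome of the wrapped block as one span."""
    start = time.perf_counter()
    record = {"trace_id": trace_id, "name": name, "error": None}
    try:
        yield record
    except Exception as error:
        record["error"] = repr(error)
        raise
    finally:
        record["duration_s"] = time.perf_counter() - start
        SPANS.append(record)


def handle_order(event):
    # Reuse the caller's trace ID when present so spans from different
    # services can be stitched into a single trace.
    trace_id = event.get("trace_id") or uuid.uuid4().hex
    with span("load_customer", trace_id):
        customer = {"id": event["customer_id"]}  # placeholder for a DB call
    with span("charge_card", trace_id):
        receipt = {"customer": customer["id"], "amount": event["amount"]}  # placeholder for an API call
    return {"trace_id": trace_id, "receipt": receipt}
```

Every external call needs its own `with span(...)` wrapper and the trace ID has to be passed along by hand, which is where the maintenance cost and the potential for errors come from.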

Slide 39

Slide 39 text

Serverless apps are very distributed
- Complex systems have thousands of functions
- What about developer velocity?

Slide 40

Slide 40 text

Can it be done differently in serverless?

Slide 41

Slide 41 text

Automation can help to keep up with the development speed of serverless

Slide 42

Slide 42 text

Example

Slide 43

Slide 43 text

Example

Slide 44

Slide 44 text

Why you should care about external APIs
[Figure label: 702ms]

Slide 45

Slide 45 text

Business flows
- Subscribe
- Transfer
- Payment

Slide 46

Slide 46 text

What should I optimize first?

Slide 47

Slide 47 text

Remember…
- Adopt serverless
- Think event-driven
- Use proper orchestration services
- Automate observability
- Go to market – faster!

Slide 48

Slide 48 text

Thank you!
@nitzanshapira
www.epsagon.com