Distributed Tracing in Serverless Systems - KubeCon 2018.pdf

Slide 1

Slide 1 text

KubeCon + CloudNativeCon Seattle
Distributed Tracing in Serverless Systems
Nitzan Shapira, Epsagon

Slide 2

Slide 2 text

> whoami
Nitzan Shapira (@nitzanshapira)
- Software engineer for more than 12 years
- Co-Founder, CEO at Epsagon
- Tel Aviv

Slide 3

Slide 3 text

Things to discuss
- What is serverless? How is it different?
- What is observability for serverless?
- How can distributed tracing help?
- How will it help my job?

Slide 4

Slide 4 text

What is serverless?
- Compute-as-a-Service
  - FaaS: Function-as-a-Service
  - CaaS: Container-as-a-Service
- + Managed services (APIs)
- = Don’t manage infrastructure; focus on business logic

Slide 5

Slide 5 text

Why serverless?
- Pay-per-use: reduces cloud compute cost by 90%
- Out-of-the-box auto-scaling
- DevOps → LowOps
- ++Developer velocity
- Focus on business logic – iterate faster
[Figure: Server Utilization]

Slide 6

Slide 6 text

The limitations of FaaS
- Limited memory
- Limited running time
- Cold starts
- Stateless
- + concurrency limit + some others…

Slide 7

Slide 7 text

The properties of serverless applications
- Serverless is micro-services
- Serverless applications are
  - Highly distributed
  - Highly event-driven
- Utilizing managed services via APIs is key

Slide 8

Slide 8 text

A real example – HSBC
Source: re:Invent 2018

Slide 9

Slide 9 text

The challenge in serverless
[Figure labels: SIMPLE, COMPLEX. Source: Yan Cui]

Slide 10

Slide 10 text

What the community thinks
[Figure: 2018 Serverless Community Survey, serverless.com, July 2018, shown alongside 2017 results]

Slide 11

Slide 11 text

Observability – why do we need it?
- Track system health
- Troubleshoot and fix
- Optimize performance and cost

Slide 12

Slide 12 text

Observability in serverless
Let’s go one by one

Slide 13

Slide 13 text

Track system health
System == Functions?

Slide 14

Slide 14 text

Functions are important
- Errors
- Timeout
- Out-of-memory
- Cold start
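One common way to surface two of these signals (errors and cold starts) from inside a function is sketched below. It is a minimal illustration, not the speaker's implementation: `handler`, `process`, and the record fields are hypothetical names, and it assumes the platform reuses the process between warm invocations so that module-level state persists. Timeouts and out-of-memory kills terminate the process, so they usually have to be detected from outside the function rather than in code like this.

```python
import time

# Module-level state survives between invocations that reuse the same
# container, so the flag is True only for the first call in a fresh one.
_container_state = {"cold": True}


def process(event):
    """Placeholder business logic (hypothetical)."""
    if "fail" in event:
        raise ValueError("simulated failure")
    return {"ok": True}


def handler(event, context):
    """Hypothetical function handler that records basic health signals."""
    cold_start = _container_state["cold"]
    _container_state["cold"] = False

    started = time.monotonic()
    record = {"cold_start": cold_start, "error": None, "result": None}
    try:
        record["result"] = process(event)
    except Exception as exc:
        # Record the error type instead of letting it vanish silently.
        record["error"] = type(exc).__name__
    record["duration_ms"] = (time.monotonic() - started) * 1000.0
    return record
```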

Slide 15

Slide 15 text

Track system health
System > Functions!
Serverless != Functions

Slide 16

Slide 16 text

Track system health
System > Functions!
- Functions
- APIs
- Transactions

Slide 17

Slide 17 text

Troubleshoot and fix
Functions are not enough
Need: track asynchronous events

Slide 18

Slide 18 text

Transactions

Slide 19

Slide 19 text

Tracing asynchronous invocations

Slide 20

Slide 20 text

Tracing asynchronous invocations

Slide 21

Slide 21 text

Tracing asynchronous invocations
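A minimal sketch of how an asynchronous invocation can stay part of one trace: the producer embeds its trace context in the message, and the consumer starts a child span from it. The envelope layout and the names `new_span`, `publish`, and `consume` are assumptions made for illustration; real queues and tracers carry the context in their own headers or attributes.

```python
import json
import uuid


def new_span(name, parent=None):
    """Create a span; reuse the parent's trace_id so both sides share a trace."""
    return {
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent["span_id"] if parent else None,
        "name": name,
    }


def publish(payload, span):
    """Producer side: wrap the payload together with the trace context."""
    envelope = {
        "trace_context": {
            "trace_id": span["trace_id"],
            "span_id": span["span_id"],
        },
        "payload": payload,
    }
    return json.dumps(envelope)


def consume(message):
    """Consumer side: extract the context and start a linked child span."""
    envelope = json.loads(message)
    span = new_span("consumer", parent=envelope["trace_context"])
    return span, envelope["payload"]
```

Because the consumer's span carries the producer's `trace_id` and points at the producer's `span_id`, a backend can join the two halves even though they ran in different functions at different times.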

Slide 22

Slide 22 text

Distributed tracing
…a trace tells the story of a transaction or workflow as it propagates through a (potentially distributed) system.
Distributed tracing is a method used to profile and monitor applications.
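To make the "story of a transaction" concrete, the sketch below turns a flat list of spans into an indented outline by following parent links. The span fields (`span_id`, `parent_id`, `name`, `start_ms`, `duration_ms`) and the sample values are assumptions chosen for illustration, not the schema of any particular tracer.

```python
from collections import defaultdict


def render_trace(spans):
    """Return indented lines showing each span under its parent."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span)

    lines = []

    def walk(parent_id, depth):
        # Visit siblings in start-time order so the outline reads chronologically.
        for span in sorted(children[parent_id], key=lambda s: s["start_ms"]):
            indent = "  " * depth
            lines.append(f'{indent}{span["name"]} ({span["duration_ms"]} ms)')
            walk(span["span_id"], depth + 1)

    walk(None, 0)  # Root spans have no parent.
    return lines


# Made-up spans for a single request, purely illustrative.
example_spans = [
    {"span_id": "b", "parent_id": "a", "name": "function:checkout",
     "start_ms": 5, "duration_ms": 100},
    {"span_id": "a", "parent_id": None, "name": "api-request",
     "start_ms": 0, "duration_ms": 120},
    {"span_id": "c", "parent_id": "b", "name": "database:write",
     "start_ms": 20, "duration_ms": 30},
]

for line in render_trace(example_spans):
    print(line)
```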

Slide 23

Slide 23 text

Distributed tracing
Jaeger

Slide 24

Slide 24 text

Implementing distributed tracing
- Manual tracing/instrumentation
  - Before/after calls
  - At the end of each micro-service
- High maintenance
- High potential for errors
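The "before/after calls" pattern might look like the sketch below, where every outgoing call is wrapped by hand. The `traced` helper, the in-memory `recorded_spans` list, and `charge_customer` are hypothetical stand-ins, not a real tracer API. The maintenance cost shows up directly: any call a developer forgets to wrap is simply missing from the trace.

```python
import time
from contextlib import contextmanager

recorded_spans = []  # Stand-in for a tracer backend (hypothetical).


@contextmanager
def traced(name):
    """Record a span around the wrapped block, including failures."""
    start = time.monotonic()
    error = None
    try:
        yield
    except Exception as exc:
        error = type(exc).__name__
        raise
    finally:
        recorded_spans.append({
            "name": name,
            "duration_ms": (time.monotonic() - start) * 1000.0,
            "error": error,
        })


def charge_customer(order):
    """Hypothetical business function with each call wrapped manually."""
    with traced("db.load_customer"):
        customer = {"id": order["customer_id"]}  # Placeholder for a DB read.
    with traced("payments.charge"):
        receipt = {"customer": customer["id"], "amount": order["amount"]}
    return receipt
```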

Slide 25

Slide 25 text

Serverless apps are very distributed
Complex systems have thousands of functions
What about the developer velocity?

Slide 26

Slide 26 text

Can it be done differently in serverless?

Slide 27

Slide 27 text

Automation can help to keep up with the development speed of serverless
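One way such automation can work is to patch client libraries once at startup so that every call is traced without touching business code. The sketch below shows the idea under stated assumptions: `auto_instrument`, `recorded_spans`, and `QueueClient` are hypothetical names, and a production tool would patch real SDK classes and ship spans to a backend rather than append them to a list.

```python
import functools
import time

recorded_spans = []  # Stand-in for a tracer backend (hypothetical).


def auto_instrument(cls, method_names):
    """Wrap the named methods of cls in place so each call records a span."""

    def make_wrapper(original, method_name):
        @functools.wraps(original)
        def wrapper(self, *args, **kwargs):
            start = time.monotonic()
            try:
                return original(self, *args, **kwargs)
            finally:
                recorded_spans.append({
                    "name": f"{cls.__name__}.{method_name}",
                    "duration_ms": (time.monotonic() - start) * 1000.0,
                })
        return wrapper

    for method_name in method_names:
        original = getattr(cls, method_name)
        setattr(cls, method_name, make_wrapper(original, method_name))


class QueueClient:
    """Hypothetical SDK client standing in for a managed-service API."""

    def send(self, body):
        return {"sent": body}


# Applied once; application code keeps calling QueueClient().send(...) as before.
auto_instrument(QueueClient, ["send"])
```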

Slide 28

Slide 28 text

Example

Slide 29

Slide 29 text

Example

Slide 30

Slide 30 text

Monitoring serverless
- Limited memory
- Limited running time
- Cold starts
- Stateless

Slide 31

Slide 31 text

Time is $$$

Slide 32

Slide 32 text

Where do we spend the most time?
- Our own code
- API calls
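Given the spans of one invocation, the split between our own code and API calls can be estimated with simple arithmetic. The sketch below assumes the API calls run sequentially (overlapping calls would be double counted), and the field names and numbers are made up for illustration.

```python
def time_breakdown(function_duration_ms, api_calls):
    """Split a function's duration into API time and own-code time."""
    api_ms = sum(call["duration_ms"] for call in api_calls)
    own_ms = max(function_duration_ms - api_ms, 0)
    share = api_ms / function_duration_ms if function_duration_ms else 0.0
    return {"api_ms": api_ms, "own_code_ms": own_ms, "api_share": share}


# Made-up numbers: a 1000 ms invocation with two sequential API calls.
breakdown = time_breakdown(
    1000,
    [
        {"name": "storage.get", "duration_ms": 600},
        {"name": "queue.send", "duration_ms": 200},
    ],
)
print(breakdown)
```

With pay-per-use billing, the API share is time the function is billed for while it only waits, which is why it is worth measuring separately.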

Slide 33

Slide 33 text

Serverless cost crisis
A real-life example
$$$$$$$$$$$$

Slide 34

Slide 34 text

Scanning functions
- Scanning CloudWatch using AWS Lambda
- Every 5 minutes, save to RDS
- A new Lambda is spawned for every customer’s function
[Diagram labels: Poll, Spawn (async), CloudWatch]

Slide 35

Slide 35 text

As time flies…
- CloudWatch became highly throttled
- Requests took too much time
- 5K concurrent Lambdas, for 5 minutes, timing out, every 5 minutes!
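A back-of-the-envelope calculation shows why this pattern is expensive. The concurrency, run time, and period come from the slide; the memory size and the price per GB-second are assumptions chosen for illustration and should be replaced with the real configuration and current pricing.

```python
def billed_seconds_per_day(concurrent, run_seconds, period_seconds):
    """Total function-seconds billed per day for a periodic fan-out."""
    runs_per_day = 86_400 // period_seconds
    return concurrent * run_seconds * runs_per_day


def estimated_daily_cost(billed_seconds, memory_gb, price_per_gb_second):
    """Duration cost only; per-request charges are ignored."""
    return billed_seconds * memory_gb * price_per_gb_second


# From the slide: 5K Lambdas, each running 5 minutes, every 5 minutes.
seconds = billed_seconds_per_day(concurrent=5_000, run_seconds=300,
                                 period_seconds=300)

# Assumptions (not from the talk): 128 MB functions and a price of
# 0.0000166667 USD per GB-second.
cost = estimated_daily_cost(seconds, memory_gb=0.125,
                            price_per_gb_second=0.0000166667)
print(seconds, round(cost, 2))
```

Because each function runs for the full period, the fleet is effectively busy around the clock: 5,000 × 300 s × 288 runs is 432,000,000 billed seconds per day, and all of it is spent waiting on a throttled API.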

Slide 36

Slide 36 text

Why you should care about external APIs
[Figure annotation: 702ms]

Slide 37

Slide 37 text

Track service health

Slide 38

Slide 38 text

Business flows
- Subscribe
- Transfer
- Payment

Slide 39

Slide 39 text

What should I optimize first?

Slide 40

Slide 40 text

Remember…
Serverless + Distributed Tracing = Perfect marriage
(but only if you automate)

Slide 41

Slide 41 text

Thank you!
[email protected]
@nitzanshapira
www.epsagon.com