Cloud-Native Observability

Slide 1

Slide 1 text

@tyler_treat Cloud-Native Observability Tyler Treat / Cloud Native - Madison / June 6, 2019

Slide 2

Slide 2 text


Slide 3

Slide 3 text

Monitoring

Slide 4

Slide 4 text

APM Debugger Profiler SSH grep

Slide 5

Slide 5 text

APM Debugger Profiler SSH grep

Slide 6

Slide 6 text

APM Debugger Profiler SSH System Behavior grep

Slide 7

Slide 7 text

APM Debugger Profiler SSH System Behavior Actual Customer Impact grep

Slide 8

Slide 8 text

Monitoring

Slide 9

Slide 9 text

APM Debugger Profiler SSH grep

Slide 10

Slide 10 text

APM Debugger Profiler SSH Testing in Production at Scale, Amit Gud grep

Slide 11

Slide 11 text

APM Debugger Profiler SSH System Behavior Actual Customer Impact ??? grep

Slide 12

Slide 12 text

“Observability”

Slide 13

Slide 13 text

Post Hoc vs. Ad Hoc

Slide 14

Slide 14 text

Data Available / Understanding

Slide 15

Slide 15 text

Data Available / Understanding
Known Knowns
• Things we are aware of and understand
• “The system has a 1GB memory limit”

Slide 16

Slide 16 text

Data Available / Understanding
Known Knowns
• Things we are aware of and understand
• “The system has a 1GB memory limit”
Known Unknowns
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”

Slide 17

Slide 17 text

Slide 18

Slide 18 text

Data Available / Understanding
Unknown Knowns
• Things we understand but are not aware of
• “We implemented an orchestrator to ensure the system is always running”
Known Knowns
• Things we are aware of and understand
• “The system has a 1GB memory limit”
Unknown Unknowns
• Things we are neither aware of nor understand
• “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
Known Unknowns
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”

Slide 19

Slide 19 text

Slide 20

Slide 20 text

Slide 21

Slide 21 text

Slide 22

Slide 22 text

Data Available / Understanding
Unknown Knowns (ASSUMPTIONS)
• Things we understand but are not aware of
• “We implemented an orchestrator to ensure the system is always running”
Known Knowns (FACTS)
• Things we are aware of and understand
• “The system has a 1GB memory limit”
Unknown Unknowns (DISCOVERIES)
• Things we are neither aware of nor understand
• “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
Known Unknowns (HYPOTHESES)
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”

Slide 23

Slide 23 text

Slide 24

Slide 24 text

Slide 25

Slide 25 text

Observability Data: application logs, system logs, audit logs, application metrics, distributed traces, events

Slide 26

Slide 26 text

Some challenges…
Observability Data: application logs, system logs, audit logs, application metrics, distributed traces, events
- Locked up inside a single vendor’s solution
- Not readily available across the enterprise (or in some cases, too readily available)
- Many tools and products needed for different data and use cases
- Tool and data needs vary from team to team
- Ever-changing landscape of tools, products, and services
- Sheer volume of data can be overwhelming

Slide 27

Slide 27 text

System Splunk Universal Forwarder Datadog APM Agent Universal Analytics Client Amazon Glacier S3 Client … Datadog Metrics Agent

Slide 28

Slide 28 text

[Diagram: many System instances, each running its own Splunk Universal Forwarder, Datadog APM Agent, Universal Analytics Client, S3 Client, …, and Datadog Metrics Agent]

Slide 29

Slide 29 text

How big of a lift is it for your organization to change tools?

Slide 30

Slide 30 text

How easy is it to experiment with new ones?

Slide 31

Slide 31 text

Data Sources
• VMs
• Containers
• Load balancers
• Service meshes
• Audit logs
• VPC flow logs
• Firewall logs
• …
Data Sinks
• Centralized logging
• SIEM
• Monitoring
• APM
• Alerting
• Cold storage
• BI
• …
What data to send? Where to send it? How to send it?

Slide 32

Slide 32 text

A decoupled approach

Slide 33

Slide 33 text

What data to send? Where to send it? How to send it?
Data Sources
• VMs
• Containers
• Load balancers
• Service meshes
• Audit logs
• VPC flow logs
• Firewall logs
• …
Observability Pipeline
Data Sinks
• Centralized logging
• SIEM
• Monitoring
• APM
• Alerting
• Cold storage
• BI
• …

Slide 34

Slide 34 text

The Observability Pipeline

Slide 35

Slide 35 text

1. Data Specifications
Structure your damn data.

Slide 36

Slide 36 text

log.error("User '{}' login failed".format(user))

Slide 37

Slide 37 text

ERROR 2019-04-05 13:26.42 User 'tylertreat' login failed

Slide 38

Slide 38 text

log.error("User login failed",
          event=LOGIN_ERROR,
          user="tylertreat",
          email="[email protected]",
          error=error)

Slide 39

Slide 39 text

{
  "timestamp": "2019-04-05 13:26.42",
  "level": "ERROR",
  "event": "user_login_error",
  "user": "tylertreat",
  "email": "[email protected]",
  "error": "Invalid username or password",
  "message": "User login failed"
}
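A minimal sketch of how output shaped like the JSON above could be produced with only Python's standard `logging` module. The `JsonFormatter` class, the `log_event` helper, and the `fields` attribute are names made up for this illustration, not anything from the talk; the timestamp format also differs from the slide.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Structured fields arrive via extra={"fields": {...}} (our convention).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def log_event(logger, level, message, **fields):
    """Hypothetical helper: attach keyword arguments as structured fields."""
    logger.log(level, message, extra={"fields": fields})


handler = logging.StreamHandler()  # writes to stderr by default
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("auth")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

log_event(logger, logging.ERROR, "User login failed",
          event="user_login_error", user="tylertreat",
          error="Invalid username or password")
```

Because every field is a key rather than text interpolated into a message, downstream tools can filter on `event` or `user` without regular expressions.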

Slide 40

Slide 40 text

JSON is fine.

Slide 41

Slide 41 text

Pass a context object to everything.

Slide 42

Slide 42 text

def login(ctx, username, email, password):
    ctx.set(user=username, email=email)
    ...
    log.error("User login failed",
              event=LOGIN_ERROR,
              context=ctx,
              error=error)
    ...

Slide 43

Slide 43 text

{
  "timestamp": "2019-04-05 13:26.42",
  "level": "ERROR",
  "event": "user_login_error",
  "context": {
    "id": "accfbb8315c44a52ad893ca6772e1caf",
    "http_method": "POST",
    "http_path": "/login",
    "user": "tylertreat",
    "email": "[email protected]"
  },
  "error": "Invalid username or password",
  "message": "User login failed"
}
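One way the context object might look, as a sketch. The `Context` class, `build_event` function, and the placeholder email address are assumptions for illustration; the slides show only `ctx.set(...)` and the resulting JSON, so the rest is guesswork about a plausible shape.

```python
import uuid


class Context:
    """Hypothetical request-scoped bag of fields that travels with a request."""

    def __init__(self, **fields):
        # Every context gets a unique id so related entries can be correlated.
        self.fields = {"id": uuid.uuid4().hex, **fields}

    def set(self, **fields):
        self.fields.update(fields)

    def to_dict(self):
        return dict(self.fields)


def build_event(level, event, message, ctx, **extra):
    """Assemble a log entry that nests the context under a "context" key."""
    return {"level": level, "event": event, "context": ctx.to_dict(),
            "message": message, **extra}


def login(ctx, username, email, password):
    ctx.set(user=username, email=email)
    # Authentication is stubbed out: this sketch always reports a failure.
    return build_event("ERROR", "user_login_error", "User login failed", ctx,
                       error="Invalid username or password")


# The HTTP layer would create the context; the handler only adds to it.
ctx = Context(http_method="POST", http_path="/login")
entry = login(ctx, "tylertreat", "user@example.com", "not-a-real-password")
```

The point of the design is that `login` never needs to know about `http_method` or `http_path`: whatever earlier layers put in the context shows up in every entry logged further down the call chain.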

Slide 44

Slide 44 text

Slide 45

Slide 45 text

Create standard specs for each data type collected (logs, metrics, traces).

Slide 46

Slide 46 text

Specs can enforce required fields (e.g. user id, license, trace id) and data types.

Slide 47

Slide 47 text

{
  "timestamp": "2019-04-05 13:26.42",
  "level": "INFO",
  "event": "user_login",
  "context": {
    "id": "accfbb8315c44a52ad893ca6772e1caf",
    "http_method": "POST",
    "http_path": "/login",
    "user": "tylertreat",
    "user_id": "3bb12f6c63274abe87fd1ee4ee37f3d2",
    "license": "942e6543f0844be680e72003d5e060fd",
    "email": "[email protected]"
  }
}
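A sketch of what enforcing such a spec could look like. `LOG_SPEC`, `CONTEXT_SPEC`, `check`, and `validate_log` are hypothetical names, and the choice of required fields is an assumption based on the example entry above; a real spec would more likely live in a shared schema format than in code.

```python
# Hypothetical spec: required top-level and context fields with their types.
LOG_SPEC = {
    "timestamp": str,
    "level": str,
    "event": str,
    "context": dict,
}
CONTEXT_SPEC = {
    "id": str,
    "user_id": str,
    "license": str,
}


def check(entry, spec, prefix=""):
    """Return a list of violations; an empty list means the entry conforms."""
    problems = []
    for field, expected in spec.items():
        if field not in entry:
            problems.append(f"missing field: {prefix}{field}")
        elif not isinstance(entry[field], expected):
            problems.append(
                f"wrong type for {prefix}{field}: expected {expected.__name__}")
    return problems


def validate_log(entry):
    """Check the top-level fields, then the nested context if it is a dict."""
    problems = check(entry, LOG_SPEC)
    if isinstance(entry.get("context"), dict):
        problems += check(entry["context"], CONTEXT_SPEC, prefix="context.")
    return problems
```

Calling `validate_log` on an entry whose context lacks `license` returns a list containing `"missing field: context.license"`, which a library or collector could use to reject or flag the entry.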

Slide 48

Slide 48 text

2. Specification Libraries
Specs alone aren’t enough!

Slide 49

Slide 49 text

We need libraries.

Slide 50

Slide 50 text

There are many existing libraries for structured logging.
• Java: log4j
• Go: logrus
• Python: structlog
• Ruby: ruby-cabin
• .NET: serilog
• JS: structured-log
• etc.

Slide 51

Slide 51 text

For tracing and metrics, there are vendor-neutral APIs like OpenTracing and OpenCensus.

Slide 52

Slide 52 text

3. Data Collector
We need a lightweight agent that can collect data from hosts/containers.

Slide 53

Slide 53 text

Collect data, perform transformations/filters, and write it to the data pipeline.
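To make the collect / filter / transform / write steps concrete, here is a toy sketch. In practice this job belongs to an off-the-shelf agent rather than hand-written code; the `collect` function, the DEBUG filter, and the `host` tag are all assumptions chosen only to illustrate the flow.

```python
import json


def collect(lines, sink, host):
    """Parse JSON log lines, drop noise, enrich, and hand each entry to sink."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            entry = None
        if not isinstance(entry, dict):
            # Wrap unstructured output rather than losing it.
            entry = {"level": "UNKNOWN", "message": line}
        if entry.get("level") == "DEBUG":
            continue  # filter: keep debug chatter off the pipeline
        entry["host"] = host  # transformation: tag with the collecting host
        sink(entry)


# Usage: any iterable of lines works as input, and any callable as the sink.
pipeline = []
collect(['{"level": "ERROR", "message": "boom"}\n'], pipeline.append, "host-1")
```

Passing `sys.stdin` as `lines` would let the same function read an application's piped output, and the `sink` callable is where a producer for the data pipeline would plug in.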

Slide 54

Slide 54 text

Typically runs as an agent on the host (DaemonSet in Kubernetes).

Slide 55

Slide 55 text

Data is written to stdout/stderr or a Unix domain socket.

Slide 56

Slide 56 text

Just use Fluentd or Logstash (+Beats).

Slide 57

Slide 57 text

4. Data Pipeline
We need a scalable, fault-tolerant data stream to handle the firehose of observability data generated.

Slide 58

Slide 58 text

This also provides a buffer that decouples producers from consumers.
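The buffering idea can be shown in miniature with an in-process queue. This is only an analogy: a bounded `queue.Queue` has none of the durability, scale, or fault tolerance a real data stream provides, and `run_pipeline` is a name invented for this sketch.

```python
import queue
import threading


def run_pipeline(events, maxsize=100):
    """Push events through a bounded in-memory buffer to a consumer thread."""
    buffer = queue.Queue(maxsize=maxsize)  # stand-in for the data pipeline
    delivered = []
    done = object()  # sentinel telling the consumer to stop

    def consume():
        while True:
            item = buffer.get()
            if item is done:
                return
            delivered.append(item)

    consumer = threading.Thread(target=consume)
    consumer.start()
    for event in events:
        # Blocks only when the buffer is full (backpressure on the producer).
        buffer.put(event)
    buffer.put(done)
    consumer.join()
    return delivered
```

The producer loop and the `consume` function never call each other; they share only the buffer, so either side can be slowed down, replaced, or multiplied without the other changing.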

Slide 59

Slide 59 text

Lots of options…

Slide 60

Slide 60 text


Slide 61

Slide 61 text

5. Data Router
We need a component to consume data from the pipeline, perform filtering, and write it to the appropriate backends.

Slide 62

Slide 62 text

This is where the data spec comes into play.

Slide 63

Slide 63 text

The data shape determines how incoming data is handled.
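A sketch of shape-based routing, under assumptions: the field names used to tell logs, metrics, and traces apart (`trace_id`, `span_id`, `metric`, `value`) are hypothetical stand-ins for whatever the data specs define, and the backends are plain callables rather than real vendor APIs.

```python
def classify(record):
    """Guess the data type from the record's shape (hypothetical field names)."""
    if "trace_id" in record and "span_id" in record:
        return "trace"
    if "metric" in record and "value" in record:
        return "metric"
    if "message" in record or "event" in record:
        return "log"
    return "unknown"


class Router:
    """Stateless fan-out: each data type maps to zero or more backends."""

    def __init__(self):
        self.routes = {}

    def register(self, kind, backend):
        self.routes.setdefault(kind, []).append(backend)

    def route(self, record):
        kind = classify(record)
        backends = self.routes.get(kind, [])
        for backend in backends:
            backend(record)
        # Report what happened so callers can count unrouted records.
        return kind, len(backends)


# Usage: logs fan out to two sinks, metrics to one.
logs, archive, metrics = [], [], []
router = Router()
router.register("log", logs.append)
router.register("log", archive.append)
router.register("metric", metrics.append)
router.route({"event": "user_login", "message": "ok"})
```

Adding or swapping a backend is a one-line `register` call here, which is the property the decoupled approach is after: changing tools does not touch the systems producing the data.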

Slide 64

Slide 64 text

Data Pipeline Amazon Glacier Data Router logs traces metrics

Slide 65

Slide 65 text

Data Pipeline Amazon Glacier Data Router logs traces metrics

Slide 66

Slide 66 text

Data Pipeline Amazon Glacier Data Router logs traces metrics

Slide 67

Slide 67 text

This is primarily a stateless component writing to APIs.

Slide 68

Slide 68 text

Good fit for “serverless” solutions.

Slide 69

Slide 69 text

Piecing It All Together

Slide 70

Slide 70 text


Slide 71

Slide 71 text

You don’t need to build it out all in one go.

Slide 72

Slide 72 text

There are quick wins along the way!

Slide 73

Slide 73 text

Evolving to an Observability Pipeline
• Adopt structured logging
• Move log/data collection out of process
• Use a centralized logging system
• Introduce a streaming data solution
• Start adding data consumers