Slide 1

Slide 1 text

@tyler_treat Cloud-Native Observability Tyler Treat / Cloud Native - Madison / June 6, 2019

Slide 2

Slide 2 text

@tyler_treat

Slide 3

Slide 3 text

@tyler_treat Monitoring

Slide 4

Slide 4 text

@tyler_treat APM Debugger Profiler SSH grep

Slide 5

Slide 5 text

@tyler_treat APM Debugger Profiler SSH grep

Slide 6

Slide 6 text

@tyler_treat APM Debugger Profiler SSH System Behavior grep

Slide 7

Slide 7 text

@tyler_treat APM Debugger Profiler SSH System Behavior Actual Customer Impact grep

Slide 8

Slide 8 text

@tyler_treat Monitoring

Slide 9

Slide 9 text

@tyler_treat APM Debugger Profiler SSH grep

Slide 10

Slide 10 text

@tyler_treat APM Debugger Profiler SSH grep (Source: Testing in Production at Scale, Amit Gud)

Slide 11

Slide 11 text

@tyler_treat APM Debugger Profiler SSH System Behavior Actual Customer Impact ??? grep

Slide 12

Slide 12 text

@tyler_treat “Observability”

Slide 13

Slide 13 text

@tyler_treat Post Hoc vs. Ad Hoc

Slide 14

Slide 14 text

@tyler_treat Data Available Understanding

Slide 15

Slide 15 text

@tyler_treat
Axes: Data Available, Understanding
Known Knowns
• Things we are aware of and understand
• “The system has a 1GB memory limit”

Slide 16

Slide 16 text

@tyler_treat
Axes: Data Available, Understanding
Known Knowns
• Things we are aware of and understand
• “The system has a 1GB memory limit”
Known Unknowns
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”

Slide 17

Slide 17 text

@tyler_treat
Axes: Data Available, Understanding
Unknown Knowns
• Things we understand but are not aware of
• “We implemented an orchestrator to ensure the system is always running”
Known Knowns
• Things we are aware of and understand
• “The system has a 1GB memory limit”
Known Unknowns
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”

Slide 18

Slide 18 text

@tyler_treat
Axes: Data Available, Understanding
Unknown Knowns
• Things we understand but are not aware of
• “We implemented an orchestrator to ensure the system is always running”
Known Knowns
• Things we are aware of and understand
• “The system has a 1GB memory limit”
Unknown Unknowns
• Things we are neither aware of nor understand
• “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
Known Unknowns
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”

Slide 19

Slide 19 text

@tyler_treat
Axes: Data Available, Understanding
Unknown Knowns
• Things we understand but are not aware of
• “We implemented an orchestrator to ensure the system is always running”
Known Knowns
• Things we are aware of and understand
• “The system has a 1GB memory limit”
Unknown Unknowns
• Things we are neither aware of nor understand
• “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
Known Unknowns
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”
Quadrant labels: FACTS

Slide 20

Slide 20 text

@tyler_treat
Axes: Data Available, Understanding
Unknown Knowns
• Things we understand but are not aware of
• “We implemented an orchestrator to ensure the system is always running”
Known Knowns
• Things we are aware of and understand
• “The system has a 1GB memory limit”
Unknown Unknowns
• Things we are neither aware of nor understand
• “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
Known Unknowns
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”
Quadrant labels: FACTS, HYPOTHESES

Slide 21

Slide 21 text

@tyler_treat
Axes: Data Available, Understanding
Unknown Knowns
• Things we understand but are not aware of
• “We implemented an orchestrator to ensure the system is always running”
Known Knowns
• Things we are aware of and understand
• “The system has a 1GB memory limit”
Unknown Unknowns
• Things we are neither aware of nor understand
• “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
Known Unknowns
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”
Quadrant labels: ASSUMPTIONS, FACTS, HYPOTHESES

Slide 22

Slide 22 text

@tyler_treat
Unknown Unknowns
• Things we are neither aware of nor understand
• “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
DISCOVERIES
Axes: Data Available, Understanding
Unknown Knowns
• Things we understand but are not aware of
• “We implemented an orchestrator to ensure the system is always running”
Known Knowns
• Things we are aware of and understand
• “The system has a 1GB memory limit”
Known Unknowns
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”
Quadrant labels: ASSUMPTIONS, FACTS, HYPOTHESES

Slide 23

Slide 23 text

@tyler_treat
Unknown Unknowns
• Things we are neither aware of nor understand
• “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
DISCOVERIES
Axes: Data Available, Understanding
Known Unknowns
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”
HYPOTHESES
Monitoring, Observability

Slide 24

Slide 24 text

@tyler_treat
Unknown Unknowns
• Things we are neither aware of nor understand
• “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
DISCOVERIES
Axes: Data Available, Understanding
Known Unknowns
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”
HYPOTHESES
Testing, Exploring

Slide 25

Slide 25 text

@tyler_treat
Observability Data: application logs, system logs, audit logs, application metrics, distributed traces, events

Slide 26

Slide 26 text

@tyler_treat Some challenges…
Observability Data: application logs, system logs, audit logs, application metrics, distributed traces, events
- Locked up inside a single vendor’s solution
- Not readily available across the enterprise (or in some cases, too readily available)
- Many tools and products needed for different data and use cases
- Tool and data needs vary from team to team
- Ever-changing landscape of tools, products, and services
- Sheer volume of data can be overwhelming

Slide 27

Slide 27 text

@tyler_treat [Diagram: a System embedding Splunk Universal Forwarder, Datadog APM Agent, Datadog Metrics Agent, Universal Analytics Client, S3 Client, …; Amazon Glacier]

Slide 28

Slide 28 text

[Diagram: many Systems, each embedding its own Splunk Universal Forwarder, Datadog APM Agent, Datadog Metrics Agent, Universal Analytics Client, S3 Client, …]

Slide 29

Slide 29 text

@tyler_treat How big of a lift is it for your organization to change tools?

Slide 30

Slide 30 text

@tyler_treat How easy is it to experiment with new ones?

Slide 31

Slide 31 text

@tyler_treat
Data Sources
• VMs
• Containers
• Load balancers
• Service meshes
• Audit logs
• VPC flow logs
• Firewall logs
• …
Data Sinks
• Centralized logging
• SIEM
• Monitoring
• APM
• Alerting
• Cold storage
• BI
• …
What data to send? Where to send it? How to send it?

Slide 32

Slide 32 text

@tyler_treat A decoupled approach

Slide 33

Slide 33 text

@tyler_treat
What data to send? Where to send it? How to send it?
Data Sources
• VMs
• Containers
• Load balancers
• Service meshes
• Audit logs
• VPC flow logs
• Firewall logs
• …
Data Sinks
• Centralized logging
• SIEM
• Monitoring
• APM
• Alerting
• Cold storage
• BI
• …
Observability Pipeline

Slide 34

Slide 34 text

@tyler_treat The Observability Pipeline

Slide 35

Slide 35 text

@tyler_treat 1. Data Specifications
Structure your damn data.

Slide 36

Slide 36 text

@tyler_treat log.error("User '{}' login failed".format(user))

Slide 37

Slide 37 text

@tyler_treat ERROR 2019-04-05 13:26.42 User 'tylertreat' login failed

Slide 38

Slide 38 text

@tyler_treat
log.error("User login failed",
          event=LOGIN_ERROR,
          user="tylertreat",
          email="[email protected]",
          error=error)

Slide 39

Slide 39 text

@tyler_treat
{
  "timestamp": "2019-04-05 13:26.42",
  "level": "ERROR",
  "event": "user_login_error",
  "user": "tylertreat",
  "email": "[email protected]",
  "error": "Invalid username or password",
  "message": "User login failed"
}
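A minimal sketch of how a structured entry like the one above could be produced with only the Python standard library. The deck does not say which library the demo uses, so `JsonFormatter`, `make_logger`, and the `extra={"fields": ...}` convention are assumptions made for illustration.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Structured fields arrive via extra={"fields": {...}}.
        # This key name is an assumption of the sketch, not a logging convention.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def make_logger(name, stream):
    """Return a logger that writes one JSON object per line to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger
```

Usage would look like `log.error("User login failed", extra={"fields": {"event": "user_login_error", "user": "tylertreat"}})`. The message stays human-readable while every other detail becomes a queryable field.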

Slide 40

Slide 40 text

@tyler_treat JSON is fine.

Slide 41

Slide 41 text

@tyler_treat Pass a context object to everything.

Slide 42

Slide 42 text

@tyler_treat
def login(ctx, username, email, password):
    ctx.set(user=username, email=email)
    ...
    log.error("User login failed",
              event=LOGIN_ERROR,
              context=ctx,
              error=error)
    ...

Slide 43

Slide 43 text

@tyler_treat
{
  "timestamp": "2019-04-05 13:26.42",
  "level": "ERROR",
  "event": "user_login_error",
  "context": {
    "id": "accfbb8315c44a52ad893ca6772e1caf",
    "http_method": "POST",
    "http_path": "/login",
    "user": "tylertreat",
    "email": "[email protected]"
  },
  "error": "Invalid username or password",
  "message": "User login failed"
}

Slide 44

Slide 44 text

@tyler_treat
{
  "timestamp": "2019-04-05 13:26.42",
  "level": "ERROR",
  "event": "user_login_error",
  "context": {
    "id": "accfbb8315c44a52ad893ca6772e1caf",
    "http_method": "POST",
    "http_path": "/login",
    "user": "tylertreat",
    "email": "[email protected]"
  },
  "error": "Invalid username or password",
  "message": "User login failed"
}
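A rough sketch of the context-object idea, assuming a simple request-scoped class. `Context`, `log_entry`, and the password check are hypothetical stand-ins; the deck only shows `ctx.set(...)` and the resulting JSON.

```python
import uuid


class Context:
    """Request-scoped bag of fields that travels with every call."""

    def __init__(self, **fields):
        # A fresh id per request lets all entries for that request be correlated.
        self.fields = {"id": uuid.uuid4().hex, **fields}

    def set(self, **fields):
        self.fields.update(fields)


def log_entry(level, message, event, ctx, **extra):
    """Build a structured entry with the context nested under "context"."""
    return {
        "level": level,
        "event": event,
        "context": dict(ctx.fields),
        **extra,
        "message": message,
    }


def login(ctx, username, email, password):
    ctx.set(user=username, email=email)
    # Hypothetical check standing in for real authentication.
    if password != "correct-password":
        return log_entry("ERROR", "User login failed", "user_login_error",
                         ctx, error="Invalid username or password")
    return log_entry("INFO", "User logged in", "user_login", ctx)
```

Because the caller creates the context (for example with `http_method` and `http_path`), `login` only adds what it knows, and every entry built downstream carries the accumulated fields.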

Slide 45

Slide 45 text

@tyler_treat Create standard specs for each data type collected (logs, metrics, traces).

Slide 46

Slide 46 text

@tyler_treat Specs can enforce required fields (e.g. user id, license, trace id) and data types.

Slide 47

Slide 47 text

@tyler_treat
{
  "timestamp": "2019-04-05 13:26.42",
  "level": "INFO",
  "event": "user_login",
  "context": {
    "id": "accfbb8315c44a52ad893ca6772e1caf",
    "http_method": "POST",
    "http_path": "/login",
    "user": "tylertreat",
    "user_id": "3bb12f6c63274abe87fd1ee4ee37f3d2",
    "license": "942e6543f0844be680e72003d5e060fd",
    "email": "[email protected]"
  }
}
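One way a spec could enforce required fields and data types is a small validation step. This is a sketch under assumptions: the required-field tables below are illustrative choices based on the example entry, not the author's actual spec.

```python
# Hypothetical spec: field name -> required Python type.
REQUIRED_FIELDS = {"timestamp": str, "level": str, "event": str, "context": dict}
REQUIRED_CONTEXT_FIELDS = {"id": str, "user_id": str, "license": str}


def _check(obj, required, prefix=""):
    """Return a list of violations for one level of the entry."""
    problems = []
    for name, expected_type in required.items():
        if name not in obj:
            problems.append(f"missing field: {prefix}{name}")
        elif not isinstance(obj[name], expected_type):
            problems.append(
                f"wrong type for {prefix}{name}: expected {expected_type.__name__}"
            )
    return problems


def validate(entry):
    """Check a log entry against the spec; an empty list means it conforms."""
    problems = _check(entry, REQUIRED_FIELDS)
    context = entry.get("context")
    if isinstance(context, dict):
        problems += _check(context, REQUIRED_CONTEXT_FIELDS, prefix="context.")
    return problems
```

A check like this could run in a shared logging library or in unit tests, so that malformed entries are caught before they reach the pipeline.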

Slide 48

Slide 48 text

@tyler_treat 2. Specification Libraries
Specs alone aren’t enough!

Slide 49

Slide 49 text

@tyler_treat We need libraries.

Slide 50

Slide 50 text

@tyler_treat There are many existing libraries for structured logging.
• Java: log4j
• Go: logrus
• Python: structlog
• Ruby: ruby-cabin
• .NET: serilog
• JS: structured-log
• etc.

Slide 51

Slide 51 text

@tyler_treat For tracing and metrics, there are vendor-neutral APIs like OpenTracing and OpenCensus.

Slide 52

Slide 52 text

@tyler_treat 3. Data Collector
We need a lightweight agent that can collect data from hosts/containers.

Slide 53

Slide 53 text

@tyler_treat Collect data, perform transformations/filters, and write it to the data pipeline.

Slide 54

Slide 54 text

@tyler_treat Typically runs as an agent on the host (DaemonSet in Kubernetes).

Slide 55

Slide 55 text

@tyler_treat Data is written to stdout/stderr or a Unix domain socket.
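To make the collector's job concrete, here is a toy sketch of the collect, transform/filter, and write loop. It is not how Fluentd or Logstash work internally; `collect`, `publish`, and the level table are assumptions made for illustration.

```python
import json

# Assumed severity ranking used for filtering.
LEVELS = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40}


def collect(lines, publish, host, min_level="INFO"):
    """Read log lines, filter and enrich them, and hand them to `publish`.

    `lines` is any iterable of strings (for example a process's stdout),
    and `publish` is a callable that writes one record to the data pipeline.
    Returns the number of records forwarded.
    """
    forwarded = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            record = None
        if not isinstance(record, dict):
            # Transformation: wrap unstructured output instead of dropping it.
            record = {"level": "INFO", "message": line}
        # Filter: records below min_level (or with an unrecognized level,
        # which ranks lowest) are dropped.
        if LEVELS.get(str(record.get("level")), 0) < LEVELS[min_level]:
            continue
        # Enrichment: the application does not need to know where it runs.
        record["host"] = host
        publish(record)
        forwarded += 1
    return forwarded
```

In a real deployment `lines` might be `sys.stdin` or a socket, and `publish` a client for the streaming system. The point is that the application only writes to stdout and knows nothing about the destinations.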

Slide 56

Slide 56 text

@tyler_treat Just use Fluentd or Logstash (+Beats).

Slide 57

Slide 57 text

@tyler_treat 4. Data Pipeline
We need a scalable, fault-tolerant data stream to handle the firehose of observability data generated.

Slide 58

Slide 58 text

@tyler_treat This also provides a buffer that decouples producers from consumers.
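The decoupling can be illustrated with an in-process bounded queue. This is only an analogy: a real pipeline would be a durable, distributed stream, and `run_pipeline` and `handle` are hypothetical names for this sketch.

```python
import queue
import threading


def run_pipeline(records, handle, capacity=1000):
    """Push `records` through a bounded buffer to a consumer thread.

    The producer never calls `handle` directly. It only writes to the
    buffer, so a slow consumer delays the producer only once the buffer
    is full.
    """
    buffer = queue.Queue(maxsize=capacity)

    def consume():
        while True:
            item = buffer.get()
            if item is None:  # Sentinel: the producer is done.
                break
            handle(item)

    consumer = threading.Thread(target=consume)
    consumer.start()
    for record in records:
        buffer.put(record)  # Blocks only when the buffer is full.
    buffer.put(None)
    consumer.join()
```

Swapping the consumer, or adding a second one, requires no change to the producer, which is the property the pipeline is meant to provide.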

Slide 59

Slide 59 text

@tyler_treat Lots of options…

Slide 60

Slide 60 text

@tyler_treat

Slide 61

Slide 61 text

@tyler_treat 5. Data Router
We need a component to consume data from the pipeline, perform filtering, and write it to the appropriate backends.

Slide 62

Slide 62 text

@tyler_treat This is where the data spec comes into play.

Slide 63

Slide 63 text

@tyler_treat The data shape determines how incoming data is handled.
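A small sketch of shape-based routing. The field names used to tell logs, traces, and metrics apart (`span_id`, `metric`, `value`) are assumptions for this example; in practice they would come from the data specs.

```python
from collections import Counter


def classify(record):
    """Guess the data type of a record from the fields it carries."""
    if "trace_id" in record and "span_id" in record:
        return "traces"
    if "metric" in record and "value" in record:
        return "metrics"
    if "message" in record or "event" in record:
        return "logs"
    return "unknown"


def route(records, sinks):
    """Send each record to every sink registered for its data type.

    `sinks` maps a data type to a list of callables, each standing in
    for a backend API client. Returns how many records of each type
    were seen.
    """
    counts = Counter()
    for record in records:
        kind = classify(record)
        counts[kind] += 1
        for sink in sinks.get(kind, []):
            sink(record)
    return dict(counts)
```

The router keeps no state between records, so adding a backend is a matter of registering another sink for a data type.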

Slide 64

Slide 64 text

@tyler_treat Data Pipeline Amazon Glacier Data Router logs traces metrics

Slide 65

Slide 65 text

@tyler_treat Data Pipeline Amazon Glacier Data Router logs traces metrics

Slide 66

Slide 66 text

@tyler_treat Data Pipeline Amazon Glacier Data Router logs traces metrics

Slide 67

Slide 67 text

@tyler_treat This is primarily a stateless component writing to APIs.

Slide 68

Slide 68 text

@tyler_treat Good fit for “serverless” solutions.

Slide 69

Slide 69 text

@tyler_treat Piecing It All Together

Slide 70

Slide 70 text

@tyler_treat

Slide 71

Slide 71 text

@tyler_treat You don’t need to build it out all in one go.

Slide 72

Slide 72 text

@tyler_treat There are quick wins along the way!

Slide 73

Slide 73 text

@tyler_treat Evolving to an Observability Pipeline
• Adopt structured logging
• Move log/data collection out of process
• Use a centralized logging system
• Introduce a streaming data solution
• Start adding data consumers

Slide 74

Slide 74 text

@tyler_treat Dev/Ops/SRE Systems Production

Slide 75

Slide 75 text

@tyler_treat Dev/Ops/SRE Systems Production

Slide 76

Slide 76 text

@tyler_treat Dev/Ops/SRE Systems Production

Slide 77

Slide 77 text

@tyler_treat Dev/Ops/SRE Systems Production

Slide 78

Slide 78 text

@tyler_treat Dev/Ops/SRE Systems Production

Slide 79

Slide 79 text

@tyler_treat Dev/Ops/SRE Systems Production

Slide 80

Slide 80 text

@tyler_treat
CI/CD
Pre-Production (theorizing about known unknowns)
Post-Production (learning from unknown unknowns)
Observability

Slide 81

Slide 81 text

@tyler_treat Part 2: Demo

Slide 82

Slide 82 text

@tyler_treat Trip Service Flight Service Hotel Service Car Rental Service DynamoDB DynamoDB DynamoDB DynamoDB Book Trip

Slide 83

Slide 83 text

@tyler_treat Structured logging + context

Slide 84

Slide 84 text

@tyler_treat Kubernetes

Slide 85

Slide 85 text

@tyler_treat And now here’s some YAML…

Slide 86

Slide 86 text

@tyler_treat

Slide 87

Slide 87 text

@tyler_treat

Slide 88

Slide 88 text

@tyler_treat Kubernetes

Slide 89

Slide 89 text

@tyler_treat +

Slide 90

Slide 90 text

@tyler_treat Kubernetes Kinesis

Slide 91

Slide 91 text

@tyler_treat AWS Lambda

Slide 92

Slide 92 text

@tyler_treat Kubernetes Kinesis Lambda

Slide 93

Slide 93 text

@tyler_treat Kubernetes Kinesis Lambda CloudWatch Jaeger Stackdriver

Slide 94

Slide 94 text

@tyler_treat Code: https://github.com/RealKinetic/cloud-native-meetup-2019

Slide 95

Slide 95 text

@tyler_treat Thank You
realkinetic.com
bravenewgeek.com