Cloud-Native Observability

@tyler_treat Cloud-Native Observability Tyler Treat / Cloud Native - Madison
/ June 6, 2019

@tyler_treat Monitoring

@tyler_treat APM, Debugger, Profiler, SSH, grep | System Behavior | Actual Customer Impact

@tyler_treat Monitoring

@tyler_treat APM, Debugger, Profiler, SSH, grep (Testing in Production at Scale, Amit Gud)

@tyler_treat APM, Debugger, Profiler, SSH, grep | System Behavior | Actual Customer Impact | ???

@tyler_treat “Observability”

@tyler_treat Post Hoc vs. Ad Hoc

@tyler_treat Data Available vs. Understanding

@tyler_treat Known Knowns (FACTS)
• Things we are aware of and understand
• “The system has a 1GB memory limit”

@tyler_treat Known Unknowns (HYPOTHESES)
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”

@tyler_treat Unknown Knowns (ASSUMPTIONS)
• Things we understand but are not aware of
• “We implemented an orchestrator to ensure the system is always running”

@tyler_treat Unknown Unknowns (DISCOVERIES)
• Things we are neither aware of nor understand
• “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”

@tyler_treat Monitoring: Known Unknowns (HYPOTHESES). Observability: Unknown Unknowns (DISCOVERIES).

@tyler_treat Testing: Known Unknowns (HYPOTHESES). Exploring: Unknown Unknowns (DISCOVERIES).

@tyler_treat Observability Data: application logs, system logs, audit logs, application metrics, distributed traces, events

@tyler_treat Some challenges with observability data (application logs, system logs, audit logs, application metrics, distributed traces, events):
- Locked up inside a single vendor’s solution
- Not readily available across the enterprise (or in some cases, too readily available)
- Many tools and products needed for different data and use cases
- Tool and data needs vary from team to team
- Ever-changing landscape of tools, products, and services
- Sheer volume of data can be overwhelming

@tyler_treat System: Splunk Universal Forwarder, Datadog APM Agent, Universal Analytics Client, Amazon Glacier, S3 Client, …, Datadog Metrics Agent

@tyler_treat [Diagram: many Systems, each carrying its own copy of the same agents and clients: Splunk Universal Forwarder, Datadog APM Agent, Universal Analytics Client, S3 Client, …, Datadog Metrics Agent]

@tyler_treat How big of a lift is it for your
organization to change tools?

@tyler_treat How easy is it to experiment with new ones?

@tyler_treat What data to send? Where to send it? How to send it?
Data Sources: VMs, containers, load balancers, service meshes, audit logs, VPC flow logs, firewall logs, …
Data Sinks: centralized logging, SIEM, monitoring, APM, alerting, cold storage, BI, …

@tyler_treat A decoupled approach

@tyler_treat Observability Pipeline: sits between the data sources and the data sinks, and is where the three questions get answered: What data to send? Where to send it? How to send it?

@tyler_treat The Observability Pipeline

@tyler_treat 1. Data Specifications: Structure your damn data.

@tyler_treat log.error("User '{}' login failed".format(user))

@tyler_treat ERROR 2019-04-05 13:26.42 User 'tylertreat' login failed

@tyler_treat log.error("User login failed", event=LOGIN_ERROR, user="tylertreat", email="[email protected]", error=error)

@tyler_treat
{
  "timestamp": "2019-04-05 13:26.42",
  "level": "ERROR",
  "event": "user_login_error",
  "user": "tylertreat",
  "email": "[email protected]",
  "error": "Invalid username or password",
  "message": "User login failed"
}

@tyler_treat JSON is fine.
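As a rough illustration of structured logging, the sketch below emits one JSON object per event using only the Python standard library. The `log_event` helper and the `LOGIN_ERROR` value are assumptions made for this example, not the talk's actual code; a real service would more likely use one of the structured-logging libraries.

```python
import json
import sys
from datetime import datetime, timezone

LOGIN_ERROR = "user_login_error"  # hypothetical event-name constant


def log_event(level, message, stream=sys.stdout, **fields):
    """Serialize one log event as a single JSON line and return that line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
    }
    # Structured key/value pairs replace string interpolation in the message.
    record.update(fields)
    line = json.dumps(record)
    stream.write(line + "\n")
    return line


if __name__ == "__main__":
    log_event("ERROR", "User login failed", event=LOGIN_ERROR,
              user="tylertreat", error="Invalid username or password")
```

Because every field is a key rather than part of a formatted string, downstream tools can filter on `event` or `user` without regular expressions.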

@tyler_treat Pass a context object to everything.

@tyler_treat
def login(ctx, username, email, password):
    ctx.set(user=username, email=email)
    ...
    log.error("User login failed", event=LOGIN_ERROR, context=ctx, error=error)
    ...

@tyler_treat
{
  "timestamp": "2019-04-05 13:26.42",
  "level": "ERROR",
  "event": "user_login_error",
  "context": {
    "id": "accfbb8315c44a52ad893ca6772e1caf",
    "http_method": "POST",
    "http_path": "/login",
    "user": "tylertreat",
    "email": "[email protected]"
  },
  "error": "Invalid username or password",
  "message": "User login failed"
}
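One way to picture the context object is as a small bag of request-scoped fields that every function can add to and every log call can serialize. The `Context` class and `log_error` helper below are hypothetical stand-ins written for this sketch; the authentication step is stubbed out and always fails.

```python
import json
import uuid


class Context:
    """Hypothetical request-scoped context that accumulates fields."""

    def __init__(self, **fields):
        # A random id ties together every event logged for one request.
        self.fields = {"id": uuid.uuid4().hex}
        self.fields.update(fields)

    def set(self, **fields):
        self.fields.update(fields)
        return self


def log_error(message, event, context, error):
    """Render an event with the whole context nested under "context"."""
    return json.dumps({
        "level": "ERROR",
        "event": event,
        "context": context.fields,
        "error": str(error),
        "message": message,
    })


def login(ctx, username, email, password):
    ctx.set(user=username, email=email)
    # Authentication is stubbed out; this sketch always reports a failure.
    return log_error("User login failed", "user_login_error", ctx,
                     "Invalid username or password")
```

Fields set by an HTTP layer (method, path) and by `login` (user, email) end up in the same event without any function having to know about the others.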

@tyler_treat Create standard specs for each data type collected (logs,
metrics, traces).

@tyler_treat Specs can enforce required fields (e.g. user id, license, trace id) and data types.

@tyler_treat
{
  "timestamp": "2019-04-05 13:26.42",
  "level": "INFO",
  "event": "user_login",
  "context": {
    "id": "accfbb8315c44a52ad893ca6772e1caf",
    "http_method": "POST",
    "http_path": "/login",
    "user": "tylertreat",
    "user_id": "3bb12f6c63274abe87fd1ee4ee37f3d2",
    "license": "942e6543f0844be680e72003d5e060fd",
    "email": "[email protected]"
  }
}
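A spec that enforces required fields and types could be checked with something as small as the sketch below. The field lists here are assumptions picked to match the example event; a real spec would be agreed across teams and might be expressed in a schema language instead.

```python
# Hypothetical log spec: field name -> expected Python type.
REQUIRED_FIELDS = {
    "timestamp": str,
    "level": str,
    "event": str,
    "context": dict,
}
REQUIRED_CONTEXT_FIELDS = {"id": str, "user_id": str, "license": str}


def validate(record, spec=REQUIRED_FIELDS, context_spec=REQUIRED_CONTEXT_FIELDS):
    """Return a list of spec violations; an empty list means the record conforms."""
    errors = []
    for name, expected in spec.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            errors.append(f"wrong type for {name}: expected {expected.__name__}")
    context = record.get("context")
    if isinstance(context, dict):
        for name, expected in context_spec.items():
            if name not in context:
                errors.append(f"missing field: context.{name}")
            elif not isinstance(context[name], expected):
                errors.append(
                    f"wrong type for context.{name}: expected {expected.__name__}")
    return errors
```

Returning a list of violations, rather than raising on the first one, lets a caller decide whether to drop, repair, or quarantine a non-conforming event.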

@tyler_treat 2. Specification Libraries: Specs alone aren’t enough!

@tyler_treat We need libraries.

@tyler_treat There are many existing libraries for structured logging.
• Java: log4j
• Go: logrus
• Python: structlog
• Ruby: ruby-cabin
• .NET: serilog
• JS: structured-log
• etc.

@tyler_treat For tracing and metrics, there are vendor-neutral APIs like
OpenTracing and OpenCensus.

@tyler_treat 3. Data Collector: We need a lightweight agent that can collect data from hosts/containers.

@tyler_treat Collect data, perform transformations/filters, and write it to
the data pipeline.
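To make the collector's three jobs concrete (collect, transform/filter, write), here is a toy version in Python. It is only a sketch: the `collect` function, the `host` enrichment, and the `write` callback standing in for a pipeline producer are all assumptions for illustration, and an off-the-shelf agent would normally fill this role.

```python
import json


def collect(lines, write, host="host-1", drop_levels=("DEBUG",)):
    """Parse JSON log lines, filter and enrich them, and hand them to `write`.

    `write` stands in for a pipeline producer. Returns the number forwarded.
    """
    forwarded = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            record = None
        if not isinstance(record, dict):
            # Wrap unstructured output so nothing is silently lost.
            record = {"level": "UNKNOWN", "message": line}
        if record.get("level") in drop_levels:  # filter
            continue
        record["host"] = host  # transformation: enrich with collector metadata
        write(json.dumps(record))
        forwarded += 1
    return forwarded
```

For example, `collect(sys.stdin, print)` would turn a stream of application output into enriched JSON lines on stdout.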

@tyler_treat Typically runs as an agent on the host (DaemonSet
in Kubernetes).

@tyler_treat Applications write data to stdout/stderr or a Unix domain
socket, where the collector picks it up.

@tyler_treat Just use Fluentd or Logstash (+Beats).

@tyler_treat 4. Data Pipeline: We need a scalable, fault-tolerant data stream to handle
the firehose of observability data generated.

@tyler_treat This also provides a buffer that decouples producers from
consumers.
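The buffering idea can be shown in miniature with a bounded in-memory queue: the producer only ever talks to the buffer, and the consumer drains it at its own pace. This is an in-process stand-in written for illustration, with a hypothetical `run_pipeline` helper; it is neither durable nor fault-tolerant the way a real streaming system would be.

```python
import queue
import threading


def run_pipeline(events, handle, capacity=100):
    """Push events through a bounded in-memory buffer to a consumer thread."""
    buffer = queue.Queue(maxsize=capacity)
    done = object()  # sentinel marking the end of the stream

    def consume():
        while True:
            item = buffer.get()
            if item is done:
                break
            handle(item)

    consumer = threading.Thread(target=consume)
    consumer.start()
    for event in events:
        buffer.put(event)  # blocks only when the buffer is full
    buffer.put(done)
    consumer.join()
```

The producer never calls `handle` directly, so the consumer can be swapped or slowed down without changing producer code.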

@tyler_treat Lots of options…

@tyler_treat 5. Data Router: We need a component to consume data from the
pipeline, perform filtering, and write it to the appropriate backends.

@tyler_treat This is where the data spec comes into play.

@tyler_treat The data shape determines how incoming data is handled.
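Routing on data shape might look like the sketch below, where the type of a record is inferred from which fields it carries. The field names (`trace_id`, `span_id`, `metric`, `value`) and the `backends` mapping are hypothetical; in practice they would come from the data specs.

```python
def classify(record):
    """Guess the data type from the record's shape (hypothetical field names)."""
    if "trace_id" in record and "span_id" in record:
        return "trace"
    if "metric" in record and "value" in record:
        return "metric"
    if "level" in record and "message" in record:
        return "log"
    return "unknown"


def route(record, backends):
    """Send a record to every backend registered for its type; return the type."""
    kind = classify(record)
    for send in backends.get(kind, []):
        send(record)
    return kind
```

Adding a new backend for logs then means appending one callable to `backends["log"]`, with no change to the systems producing the data.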

@tyler_treat [Diagram: Data Pipeline → Data Router → backends for logs, traces, and metrics, including Amazon Glacier]

@tyler_treat This is primarily a stateless component writing to APIs.

@tyler_treat Good fit for “serverless” solutions.
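As a sketch of the serverless fit, a stateless router can be written as a function that receives a batch of stream records and returns. The handler below follows the event shape AWS Lambda passes for Kinesis streams (base64-encoded payloads under `Records[].kinesis.data`); `handle_record` is a hypothetical placeholder for the actual routing logic.

```python
import base64
import json


def handle_record(record):
    """Placeholder for routing logic; here it just returns the event name."""
    return record.get("event", "unknown")


def handler(event, context):
    """Entry point in the shape AWS Lambda uses for Kinesis stream events."""
    results = []
    for item in event.get("Records", []):
        # Kinesis record payloads arrive base64-encoded.
        payload = base64.b64decode(item["kinesis"]["data"])
        results.append(handle_record(json.loads(payload)))
    return {"processed": len(results), "events": results}
```

Since the function keeps no state between invocations, the platform can scale it with the volume of data on the stream.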

@tyler_treat Piecing It All Together

@tyler_treat You don’t need to build it out all in
one go.

@tyler_treat There are quick wins along the way!

@tyler_treat Evolving to an Observability Pipeline
• Adopt structured logging
• Move log/data collection out of process
• Use a centralized logging system
• Introduce a streaming data solution
• Start adding data consumers

@tyler_treat Dev/Ops/SRE Systems Production

@tyler_treat CI/CD: Pre-Production (theorizing about known unknowns). Observability: Post-Production (learning from unknown unknowns).

@tyler_treat Part 2: Demo

@tyler_treat Book Trip: Trip Service, Flight Service, Hotel Service, Car Rental Service (each backed by DynamoDB)

@tyler_treat Structured logging + context

@tyler_treat Kubernetes

@tyler_treat And now here’s some YAML…

@tyler_treat Kubernetes

@tyler_treat +

@tyler_treat Kubernetes Kinesis

@tyler_treat AWS Lambda

@tyler_treat Kubernetes Kinesis Lambda

@tyler_treat Kubernetes Kinesis Lambda CloudWatch Jaeger Stackdriver

@tyler_treat Code:  https://github.com/RealKinetic/cloud-native-meetup-2019

@tyler_treat Thank You realkinetic.com  bravenewgeek.com
