
Cloud-Native Observability

What is observability and how is it different from traditional monitoring? How do we effectively monitor and debug complex, elastic microservice architectures? In this interactive discussion, we’ll answer these questions. We’ll also introduce the idea of an “observability pipeline” as a way to empower teams following DevOps practices. Lastly, we’ll demo cloud-native observability tools that fit this “observability pipeline” model, including Fluentd, OpenTracing, and Jaeger.

Tyler Treat

June 06, 2019

Transcript

  1. @tyler_treat Data Available, Understanding
    • Known Knowns: things we are aware of and understand. “The system has a 1GB memory limit”
  2. @tyler_treat Data Available, Understanding
    • Known Knowns: things we are aware of and understand. “The system has a 1GB memory limit”
    • Known Unknowns: things we are aware of but don’t understand. “The system exceeded its memory limit and crashed, causing an outage”
  3. @tyler_treat Data Available, Understanding
    • Unknown Knowns: things we understand but are not aware of. “We implemented an orchestrator to ensure the system is always running”
    • Known Knowns: things we are aware of and understand. “The system has a 1GB memory limit”
    • Known Unknowns: things we are aware of but don’t understand. “The system exceeded its memory limit and crashed, causing an outage”
  4. @tyler_treat Data Available, Understanding
    • Unknown Knowns: things we understand but are not aware of. “We implemented an orchestrator to ensure the system is always running”
    • Known Knowns: things we are aware of and understand. “The system has a 1GB memory limit”
    • Unknown Unknowns: things we are neither aware of nor understand. “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
    • Known Unknowns: things we are aware of but don’t understand. “The system exceeded its memory limit and crashed, causing an outage”
  5. @tyler_treat Data Available, Understanding
    • Unknown Knowns: things we understand but are not aware of. “We implemented an orchestrator to ensure the system is always running”
    • Known Knowns (FACTS): things we are aware of and understand. “The system has a 1GB memory limit”
    • Unknown Unknowns: things we are neither aware of nor understand. “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
    • Known Unknowns: things we are aware of but don’t understand. “The system exceeded its memory limit and crashed, causing an outage”
  6. @tyler_treat Data Available, Understanding
    • Unknown Knowns: things we understand but are not aware of. “We implemented an orchestrator to ensure the system is always running”
    • Known Knowns (FACTS): things we are aware of and understand. “The system has a 1GB memory limit”
    • Unknown Unknowns: things we are neither aware of nor understand. “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
    • Known Unknowns (HYPOTHESES): things we are aware of but don’t understand. “The system exceeded its memory limit and crashed, causing an outage”
  7. @tyler_treat Data Available, Understanding
    • Unknown Knowns (ASSUMPTIONS): things we understand but are not aware of. “We implemented an orchestrator to ensure the system is always running”
    • Known Knowns (FACTS): things we are aware of and understand. “The system has a 1GB memory limit”
    • Unknown Unknowns: things we are neither aware of nor understand. “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
    • Known Unknowns (HYPOTHESES): things we are aware of but don’t understand. “The system exceeded its memory limit and crashed, causing an outage”
  8. @tyler_treat Data Available, Understanding
    • Unknown Unknowns (DISCOVERIES): things we are neither aware of nor understand. “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
    • Unknown Knowns (ASSUMPTIONS): things we understand but are not aware of. “We implemented an orchestrator to ensure the system is always running”
    • Known Knowns (FACTS): things we are aware of and understand. “The system has a 1GB memory limit”
    • Known Unknowns (HYPOTHESES): things we are aware of but don’t understand. “The system exceeded its memory limit and crashed, causing an outage”
  9. @tyler_treat Data Available, Understanding
    • Unknown Unknowns (DISCOVERIES): things we are neither aware of nor understand. “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
    • Known Unknowns (HYPOTHESES): things we are aware of but don’t understand. “The system exceeded its memory limit and crashed, causing an outage”
    • Monitoring
    • Observability
  10. @tyler_treat Data Available, Understanding
    • Unknown Unknowns (DISCOVERIES): things we are neither aware of nor understand. “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
    • Known Unknowns (HYPOTHESES): things we are aware of but don’t understand. “The system exceeded its memory limit and crashed, causing an outage”
    • Testing
    • Exploring
  11. @tyler_treat Some challenges…
    Observability Data: application logs, system logs, audit logs, application metrics, distributed traces, events
    - Locked up inside a single vendor’s solution
    - Not readily available across the enterprise (or in some cases, too readily available)
    - Many tools and products needed for different data and use cases
    - Tool and data needs vary from team to team
    - Ever-changing landscape of tools, products, and services
    - Sheer volume of data can be overwhelming
  12. @tyler_treat [Diagram: a System surrounded by per-tool agents and clients: Splunk Universal Forwarder, Datadog APM Agent, Datadog Metrics Agent, Universal Analytics Client, Amazon Glacier, S3 Client, …]
  13. [Diagram: the previous slide repeated across many Systems, each running its own Splunk Universal Forwarder, Datadog APM Agent, Datadog Metrics Agent, Universal Analytics Client, S3 Client, …]
  14. @tyler_treat How big of a lift is it for your organization to change tools?
  15. @tyler_treat What data to send? Where to send it? How to send it?
    Data Sources • VMs • Containers • Load balancers • Service meshes • Audit logs • VPC flow logs • Firewall logs • …
    Data Sinks • Centralized logging • SIEM • Monitoring • APM • Alerting • Cold storage • BI • …
  16. @tyler_treat What data to send? Where to send it? How to send it?
    Data Sources • VMs • Containers • Load balancers • Service meshes • Audit logs • VPC flow logs • Firewall logs • …
    Observability Pipeline
    Data Sinks • Centralized logging • SIEM • Monitoring • APM • Alerting • Cold storage • BI • …
  17. @tyler_treat
    {
      "timestamp": "2019-04-05 13:26.42",
      "level": "ERROR",
      "event": "user_login_error",
      "user": "tylertreat",
      "email": "[email protected]",
      "error": "Invalid username or password",
      "message": "User login failed"
    }
  18. @tyler_treat
    {
      "timestamp": "2019-04-05 13:26.42",
      "level": "ERROR",
      "event": "user_login_error",
      "context": {
        "id": "accfbb8315c44a52ad893ca6772e1caf",
        "http_method": "POST",
        "http_path": "/login",
        "user": "tylertreat",
        "email": "[email protected]"
      },
      "error": "Invalid username or password",
      "message": "User login failed"
    }
  19. @tyler_treat
    {
      "timestamp": "2019-04-05 13:26.42",
      "level": "ERROR",
      "event": "user_login_error",
      "context": {
        "id": "accfbb8315c44a52ad893ca6772e1caf",
        "http_method": "POST",
        "http_path": "/login",
        "user": "tylertreat",
        "email": "[email protected]"
      },
      "error": "Invalid username or password",
      "message": "User login failed"
    }
  20. @tyler_treat
    {
      "timestamp": "2019-04-05 13:26.42",
      "level": "INFO",
      "event": "user_login",
      "context": {
        "id": "accfbb8315c44a52ad893ca6772e1caf",
        "http_method": "POST",
        "http_path": "/login",
        "user": "tylertreat",
        "user_id": "3bb12f6c63274abe87fd1ee4ee37f3d2",
        "license": "942e6543f0844be680e72003d5e060fd",
        "email": "[email protected]"
      }
    }
  21. @tyler_treat There are many existing libraries for structured logging.
    • Java: log4j
    • Go: logrus
    • Python: structlog
    • Ruby: ruby-cabin
    • .NET: serilog
    • JS: structured-log
    • etc.
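
The structured events on the preceding slides can be approximated without any of the libraries listed. The following is a minimal sketch using only the Python standard library, not the speaker's code: the `log_event` helper and the `ctx` dictionary are hypothetical names introduced for illustration, and the field names simply mirror the slides.

```python
import json
from datetime import datetime, timezone


def log_event(level, event, context=None, **fields):
    """Render one structured log event as a single JSON line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "event": event,
    }
    if context:
        # Request-scoped fields (request id, HTTP method, user) travel together.
        record["context"] = dict(context)
    # Event-specific fields such as "error" or "message" sit at the top level.
    record.update(fields)
    return json.dumps(record)


# Hypothetical request context, mirroring the fields shown on the slides.
ctx = {
    "id": "accfbb8315c44a52ad893ca6772e1caf",
    "http_method": "POST",
    "http_path": "/login",
    "user": "tylertreat",
}
line = log_event(
    "ERROR",
    "user_login_error",
    context=ctx,
    error="Invalid username or password",
    message="User login failed",
)
print(line)
```

Because every event is a JSON object with stable keys, downstream tools can filter on `level` or `context.id` instead of parsing free-form message text.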
  22. @tyler_treat 3. Data Collector: We need a lightweight agent that can collect data from hosts/containers.
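
As a rough illustration of what such a collector does, the sketch below reads newline-delimited JSON events from a stream, tags each with the host it came from, and hands it to a forwarding callback. It is a toy under stated assumptions: `collect`, the `forward` callback, and the `example-host` label are hypothetical, and a production agent such as Fluentd also handles buffering, retries, and file rotation.

```python
import io
import json


def collect(stream, forward, host="example-host"):
    """Read newline-delimited JSON events from `stream` and pass each to `forward`.

    Returns a (forwarded, skipped) tuple of counts.
    """
    forwarded = skipped = 0
    for raw in stream:
        raw = raw.strip()
        if not raw:
            continue
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            # Unstructured or corrupt lines are counted, not forwarded.
            skipped += 1
            continue
        if not isinstance(event, dict):
            skipped += 1
            continue
        # Enrich out of process so applications need not know where they run.
        event["host"] = host
        forward(event)
        forwarded += 1
    return forwarded, skipped


# An in-memory stand-in for a log file written by an application.
log_file = io.StringIO(
    '{"level": "INFO", "event": "user_login"}\n'
    "not json\n"
    '{"level": "ERROR", "event": "user_login_error"}\n'
)
received = []
counts = collect(log_file, received.append)
print(counts, received)
```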
  23. @tyler_treat 4. Data Pipeline: We need a scalable, fault-tolerant data stream to handle the firehose of observability data generated.
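
The property that matters here is that producers append to a stream while each consumer reads at its own pace. The sketch below shows that decoupling with an in-memory append-only log and per-consumer offsets. It is only a stand-in: the `EventLog` class and the consumer names are hypothetical, and nothing in it is scalable or fault-tolerant the way a real streaming system would need to be.

```python
class EventLog:
    """In-memory append-only log; a toy stand-in for a durable data stream."""

    def __init__(self):
        self._events = []
        self._offsets = {}  # consumer name -> index of the next event to read

    def append(self, event):
        self._events.append(event)

    def poll(self, consumer, max_events=10):
        """Return the next batch for `consumer` and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._events[start:start + max_events]
        self._offsets[consumer] = start + len(batch)
        return batch


log = EventLog()
for i in range(3):
    log.append({"event": "user_login", "seq": i})

# Each consumer keeps its own position, so a slow one never blocks a fast one.
alerting_batch = log.poll("alerting", max_events=2)
archive_batch = log.poll("archive")
alerting_rest = log.poll("alerting")
```

Adding a new consumer means picking a new name and polling from offset zero; producers do not change.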
  24. @tyler_treat 5. Data Router: We need a component to consume data from the pipeline, perform filtering, and write it to the appropriate backends.
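
One way to picture the router is as a table of (predicate, sink) pairs: each event goes to every sink whose predicate accepts it. The sketch below assumes that shape; `route`, the predicates, and the two list-backed sinks are hypothetical stand-ins for real backends such as an alerting service or cold storage.

```python
def route(events, routes):
    """Send each event to every sink whose predicate accepts it.

    `routes` is a list of (predicate, sink) pairs; a sink is any callable.
    Returns the number of events that matched no route.
    """
    unrouted = 0
    for event in events:
        matched = False
        for predicate, sink in routes:
            if predicate(event):
                sink(event)
                matched = True
        if not matched:
            unrouted += 1
    return unrouted


# List-backed sinks stand in for real backends.
alerts, archive = [], []
routes = [
    (lambda e: e.get("level") == "ERROR", alerts.append),   # errors go to alerting
    (lambda e: e.get("level") != "DEBUG", archive.append),  # everything but debug is archived
]
events = [
    {"level": "ERROR", "event": "user_login_error"},
    {"level": "INFO", "event": "user_login"},
    {"level": "DEBUG", "event": "cache_miss"},
]
dropped = route(events, routes)
```

Changing tools then means editing the routing table rather than redeploying an agent on every system.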
  25. @tyler_treat Evolving to an Observability Pipeline
    • Adopt structured logging
    • Move log/data collection out of process
    • Use a centralized logging system
    • Introduce a streaming data solution
    • Start adding data consumers