Cloud-Native Observability

What is observability and how is it different from traditional monitoring? How do we effectively monitor and debug complex, elastic microservice architectures? In this interactive discussion, we’ll answer these questions. We’ll also introduce the idea of an “observability pipeline” as a way to empower teams following DevOps practices. Lastly, we’ll demo cloud-native observability tools that fit this “observability pipeline” model, including Fluentd, OpenTracing, and Jaeger.

Tyler Treat

June 06, 2019

Transcript

  1. @tyler_treat Cloud-Native Observability Tyler Treat / Cloud Native - Madison / June 6, 2019

  2. @tyler_treat

  3. @tyler_treat Monitoring

  4. @tyler_treat APM Debugger Profiler SSH grep

  5. @tyler_treat APM Debugger Profiler SSH grep

  6. @tyler_treat APM Debugger Profiler SSH System Behavior grep

  7. @tyler_treat APM Debugger Profiler SSH System Behavior Actual Customer Impact grep

  8. @tyler_treat Monitoring

  9. @tyler_treat APM Debugger Profiler SSH grep

  10. @tyler_treat APM Debugger Profiler SSH Testing in Production at Scale, Amit Gud grep

  11. @tyler_treat APM Debugger Profiler SSH System Behavior Actual Customer Impact ??? grep

  12. @tyler_treat “Observability”

  13. @tyler_treat Post Hoc vs. Ad Hoc

  14. @tyler_treat Data Available Understanding

  15. @tyler_treat Data Available Understanding
    Known Knowns • Things we are aware of and understand • “The system has a 1GB memory limit”
  16. @tyler_treat Data Available Understanding
    Known Knowns • Things we are aware of and understand • “The system has a 1GB memory limit”
    Known Unknowns • Things we are aware of but don’t understand • “The system exceeded its memory limit and crashed, causing an outage”
  17. @tyler_treat Data Available Understanding
    Unknown Knowns • Things we understand but are not aware of • “We implemented an orchestrator to ensure the system is always running”
    Known Knowns • Things we are aware of and understand • “The system has a 1GB memory limit”
    Known Unknowns • Things we are aware of but don’t understand • “The system exceeded its memory limit and crashed, causing an outage”
  18. @tyler_treat Data Available Understanding
    Unknown Knowns • Things we understand but are not aware of • “We implemented an orchestrator to ensure the system is always running”
    Known Knowns • Things we are aware of and understand • “The system has a 1GB memory limit”
    Unknown Unknowns • Things we are neither aware of nor understand • “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
    Known Unknowns • Things we are aware of but don’t understand • “The system exceeded its memory limit and crashed, causing an outage”
  19. @tyler_treat Data Available Understanding
    Unknown Knowns • Things we understand but are not aware of • “We implemented an orchestrator to ensure the system is always running”
    Known Knowns (FACTS) • Things we are aware of and understand • “The system has a 1GB memory limit”
    Unknown Unknowns • Things we are neither aware of nor understand • “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
    Known Unknowns • Things we are aware of but don’t understand • “The system exceeded its memory limit and crashed, causing an outage”
  20. @tyler_treat Data Available Understanding
    Unknown Knowns • Things we understand but are not aware of • “We implemented an orchestrator to ensure the system is always running”
    Known Knowns (FACTS) • Things we are aware of and understand • “The system has a 1GB memory limit”
    Unknown Unknowns • Things we are neither aware of nor understand • “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
    Known Unknowns (HYPOTHESES) • Things we are aware of but don’t understand • “The system exceeded its memory limit and crashed, causing an outage”
  21. @tyler_treat Data Available Understanding
    Unknown Knowns (ASSUMPTIONS) • Things we understand but are not aware of • “We implemented an orchestrator to ensure the system is always running”
    Known Knowns (FACTS) • Things we are aware of and understand • “The system has a 1GB memory limit”
    Unknown Unknowns • Things we are neither aware of nor understand • “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
    Known Unknowns (HYPOTHESES) • Things we are aware of but don’t understand • “The system exceeded its memory limit and crashed, causing an outage”
  22. @tyler_treat Data Available Understanding
    Unknown Knowns (ASSUMPTIONS) • Things we understand but are not aware of • “We implemented an orchestrator to ensure the system is always running”
    Known Knowns (FACTS) • Things we are aware of and understand • “The system has a 1GB memory limit”
    Unknown Unknowns (DISCOVERIES) • Things we are neither aware of nor understand • “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
    Known Unknowns (HYPOTHESES) • Things we are aware of but don’t understand • “The system exceeded its memory limit and crashed, causing an outage”
  23. @tyler_treat Data Available Understanding
    Known Unknowns (HYPOTHESES, Monitoring) • Things we are aware of but don’t understand • “The system exceeded its memory limit and crashed, causing an outage”
    Unknown Unknowns (DISCOVERIES, Observability) • Things we are neither aware of nor understand • “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
  24. @tyler_treat Data Available Understanding
    Known Unknowns (HYPOTHESES, Testing) • Things we are aware of but don’t understand • “The system exceeded its memory limit and crashed, causing an outage”
    Unknown Unknowns (DISCOVERIES, Exploring) • Things we are neither aware of nor understand • “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
  25. @tyler_treat Observability Data: application logs, system logs, audit logs, application metrics, distributed traces, events

  26. @tyler_treat Observability Data: application logs, system logs, audit logs, application metrics, distributed traces, events
    Some challenges…
    - Locked up inside a single vendor’s solution
    - Not readily available across the enterprise (or in some cases, too readily available)
    - Many tools and products needed for different data and use cases
    - Tool and data needs vary from team to team
    - Ever-changing landscape of tools, products, and services
    - Sheer volume of data can be overwhelming
  27. @tyler_treat [Diagram: a single System running Splunk Universal Forwarder, Datadog APM Agent, Universal Analytics Client, Amazon Glacier, S3 Client, …, Datadog Metrics Agent]

  28. [Diagram: the same set of agents (Splunk Universal Forwarder, Datadog APM Agent, Universal Analytics Client, S3 Client, …, Datadog Metrics Agent) repeated on every System]

  29. @tyler_treat How big of a lift is it for your organization to change tools?

  30. @tyler_treat How easy is it to experiment with new ones?

  31. @tyler_treat What data to send? Where to send it? How to send it?
    Data Sources • VMs • Containers • Load balancers • Service meshes • Audit logs • VPC flow logs • Firewall logs • …
    Data Sinks • Centralized logging • SIEM • Monitoring • APM • Alerting • Cold storage • BI • …
  32. @tyler_treat A decoupled approach

  33. @tyler_treat What data to send? Where to send it? How to send it?
    Data Sources • VMs • Containers • Load balancers • Service meshes • Audit logs • VPC flow logs • Firewall logs • …
    Observability Pipeline
    Data Sinks • Centralized logging • SIEM • Monitoring • APM • Alerting • Cold storage • BI • …
  34. @tyler_treat The Observability Pipeline

  35. @tyler_treat 1. Data Specifications: Structure your damn data.

  36. @tyler_treat log.error("User '{}' login failed".format(user))

  37. @tyler_treat ERROR 2019-04-05 13:26.42 User 'tylertreat' login failed

  38. @tyler_treat log.error("User login failed", event=LOGIN_ERROR, user="tylertreat", email="tyler.treat@realkinetic.com", error=error)

  39. @tyler_treat { "timestamp": "2019-04-05 13:26.42", "level": "ERROR", "event": "user_login_error", "user": "tylertreat", "email": "tyler.treat@realkinetic.com", "error": "Invalid username or password", "message": "User login failed" }

  40. @tyler_treat JSON is fine.
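
The move from slide 36 to slide 38 can be sketched with nothing but the Python standard library. This is a minimal illustration, not the logger used in the talk: `format_event` and `log_error` are hypothetical helper names, and the timestamp format is an assumption. Each value travels as its own key instead of being interpolated into the message string.

```python
import json
import sys
from datetime import datetime, timezone


def format_event(level, message, **fields):
    """Render one log event as a single line of JSON."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
    }
    # Keep each value as its own key instead of interpolating it into the message.
    record.update(fields)
    return json.dumps(record, default=str)


def log_error(message, **fields):
    """Write an ERROR event to stderr and return the emitted line."""
    line = format_event("ERROR", message, **fields)
    print(line, file=sys.stderr)
    return line


line = log_error(
    "User login failed",
    event="user_login_error",
    user="tylertreat",
    error="Invalid username or password",
)
```

Because every field is a key, a downstream consumer can filter on `event` or `user` without parsing free text.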

  41. @tyler_treat Pass a context object to everything.

  42. @tyler_treat
    def login(ctx, username, email, password):
        ctx.set(user=username, email=email)
        ...
        log.error("User login failed", event=LOGIN_ERROR, context=ctx, error=error)
        ...
  43. @tyler_treat { "timestamp": "2019-04-05 13:26.42", "level": "ERROR", "event": "user_login_error", "context": { "id": "accfbb8315c44a52ad893ca6772e1caf", "http_method": "POST", "http_path": "/login", "user": "tylertreat", "email": "tyler.treat@realkinetic.com" }, "error": "Invalid username or password", "message": "User login failed" }
  44. @tyler_treat { "timestamp": "2019-04-05 13:26.42", "level": "ERROR", "event": "user_login_error", "context": { "id": "accfbb8315c44a52ad893ca6772e1caf", "http_method": "POST", "http_path": "/login", "user": "tylertreat", "email": "tyler.treat@realkinetic.com" }, "error": "Invalid username or password", "message": "User login failed" }
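
One way the context object from slide 42 might look is sketched below. The `Context` class, its `to_dict` method, the `format_event` helper, and the `EXPECTED_PASSWORD` check are all assumptions made for illustration; only `ctx.set(...)` and the output shape come from the slides.

```python
import json
import uuid


class Context:
    """Request-scoped bag of fields that is passed to every function."""

    def __init__(self, **fields):
        # A per-request id lets all events from one request be correlated.
        self.fields = {"id": uuid.uuid4().hex}
        self.fields.update(fields)

    def set(self, **fields):
        self.fields.update(fields)
        return self

    def to_dict(self):
        return dict(self.fields)


def format_event(level, message, event, context, **fields):
    """Render an event with the whole context nested under one key."""
    record = {
        "level": level,
        "event": event,
        "context": context.to_dict(),
        "message": message,
    }
    record.update(fields)
    return json.dumps(record)


# Hypothetical stand-in for a real credential check.
EXPECTED_PASSWORD = "s3cret"


def login(ctx, username, email, password):
    ctx.set(user=username, email=email)
    if password != EXPECTED_PASSWORD:
        return format_event(
            "ERROR", "User login failed", "user_login_error", ctx,
            error="Invalid username or password",
        )
    return format_event("INFO", "User logged in", "user_login", ctx)


ctx = Context(http_method="POST", http_path="/login")
line = login(ctx, "tylertreat", "tyler.treat@realkinetic.com", "wrong")
```

The caller creates the context once per request, so fields such as `http_path` appear on every event without each function having to know about them.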
  45. @tyler_treat Create standard specs for each data type collected (logs, metrics, traces).
  46. @tyler_treat Specs can enforce required fields (e.g. user id, license, trace id) and data types.
  47. @tyler_treat { "timestamp": "2019-04-05 13:26.42", "level": "INFO", "event": "user_login", "context": { "id": "accfbb8315c44a52ad893ca6772e1caf", "http_method": "POST", "http_path": "/login", "user": "tylertreat", "user_id": "3bb12f6c63274abe87fd1ee4ee37f3d2", "license": "942e6543f0844be680e72003d5e060fd", "email": "tyler.treat@realkinetic.com" } }
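
A spec that enforces required fields and data types could be as small as a mapping from field name to expected type. The sketch below is an assumption about how such a check might be written; `LOG_SPEC`, `CONTEXT_SPEC`, `check`, and `validate_log` are hypothetical names, and a real deployment might use a schema language instead.

```python
# Required fields and their types, modeled on the event shown on slide 47.
LOG_SPEC = {"timestamp": str, "level": str, "event": str, "context": dict}
CONTEXT_SPEC = {"id": str, "user_id": str, "license": str}


def check(record, spec, prefix=""):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, expected_type in spec.items():
        if field not in record:
            problems.append(f"{prefix}{field}: missing")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{prefix}{field}: expected {expected_type.__name__}")
    return problems


def validate_log(record):
    problems = check(record, LOG_SPEC)
    context = record.get("context")
    if isinstance(context, dict):
        problems += check(context, CONTEXT_SPEC, prefix="context.")
    return problems


good = {
    "timestamp": "2019-04-05 13:26.42",
    "level": "INFO",
    "event": "user_login",
    "context": {
        "id": "accfbb8315c44a52ad893ca6772e1caf",
        "user_id": "3bb12f6c63274abe87fd1ee4ee37f3d2",
        "license": "942e6543f0844be680e72003d5e060fd",
    },
}
bad = {
    "timestamp": "2019-04-05 13:26.42",
    "level": "INFO",
    "event": "user_login",
    "context": {"id": "accfbb8315c44a52ad893ca6772e1caf", "user_id": 42},
}

good_problems = validate_log(good)
bad_problems = validate_log(bad)
```

Running the same check in a shared library and again in the router keeps producers and consumers agreeing on the data shape.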
  48. @tyler_treat 2. Specification Libraries: Specs alone aren’t enough!

  49. @tyler_treat We need libraries.

  50. @tyler_treat There are many existing libraries for structured logging. • Java: log4j • Go: logrus • Python: structlog • Ruby: ruby-cabin • .NET: serilog • JS: structured-log • etc.
  51. @tyler_treat For tracing and metrics, there are vendor-neutral APIs like OpenTracing and OpenCensus.
  52. @tyler_treat 3. Data Collector: We need a lightweight agent that can collect data from hosts/containers.
  53. @tyler_treat Collect data, perform transformations/filters, and write it to the data pipeline.
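
To make the collect, transform/filter, write loop concrete, here is a toy collector in plain Python. It is only an illustration of the idea and not a substitute for a real agent; `collect`, `LEVELS`, the `host` enrichment, and the list used as a sink are assumptions for the example.

```python
import io
import json

LEVELS = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40}


def collect(stream, sink, host, min_level="INFO"):
    """Read lines from stream, filter and enrich them, and hand them to sink."""
    forwarded = 0
    for raw in stream:
        raw = raw.strip()
        if not raw:
            continue
        try:
            record = json.loads(raw)
            if not isinstance(record, dict):
                raise ValueError("not a JSON object")
        except ValueError:
            # Wrap unstructured output so downstream consumers see one shape.
            record = {"level": "INFO", "message": raw, "unstructured": True}
        level = str(record.get("level", "INFO"))
        if LEVELS.get(level, LEVELS["INFO"]) < LEVELS[min_level]:
            continue  # filter: drop events below the threshold
        record["host"] = host  # transform: add where the event came from
        sink(record)
        forwarded += 1
    return forwarded


app_output = io.StringIO(
    '{"level": "DEBUG", "message": "cache miss"}\n'
    '{"level": "ERROR", "message": "User login failed"}\n'
    "plain text line\n"
)
pipeline = []
count = collect(app_output, pipeline.append, host="web-1")
```

Here the DEBUG event is dropped, and the two remaining events reach the sink tagged with the host name.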
  54. @tyler_treat Typically runs as an agent on the host (DaemonSet in Kubernetes).
  55. @tyler_treat Data is written to stdout/stderr or a Unix domain socket.
  56. @tyler_treat Just use Fluentd or Logstash (+Beats).

  57. @tyler_treat 4. Data Pipeline: We need a scalable, fault-tolerant data stream to handle the firehose of observability data generated.
  58. @tyler_treat This also provides a buffer that decouples producers from consumers.
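
The buffering idea can be illustrated in-process with a bounded queue. This is only an analogy: a real pipeline would be a durable, distributed stream (the demo later in the deck uses Kinesis), while `queue.Queue` here is an in-memory stand-in, and the `producer`/`consumer` names are hypothetical.

```python
import queue
import threading


def producer(buffer, events):
    """Emit events without knowing who will read them."""
    for event in events:
        buffer.put(event)  # blocks only when the buffer is full
    buffer.put(None)  # sentinel: no more events


def consumer(buffer, received):
    """Drain events at its own pace, independent of the producer."""
    while True:
        event = buffer.get()
        if event is None:
            break
        received.append(event)


buffer = queue.Queue(maxsize=100)
events = [{"seq": i} for i in range(1000)]
received = []

threads = [
    threading.Thread(target=producer, args=(buffer, events)),
    threading.Thread(target=consumer, args=(buffer, received)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Neither side calls the other directly, so a consumer can be swapped or added without touching the code that emits events.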
  59. @tyler_treat Lots of options…

  60. @tyler_treat

  61. @tyler_treat 5. Data Router: We need a component to consume data from the pipeline, perform filtering, and write it to the appropriate backends.
  62. @tyler_treat This is where the data spec comes into play.

  63. @tyler_treat The data shape determines how incoming data is handled.
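
Routing on data shape might look like the sketch below. The field names used to tell logs, traces, and metrics apart (`trace_id`, `span_id`, `metric`, `value`) are assumptions; in practice they would come from the data specs. The lists stand in for real backends.

```python
def classify(record):
    """Decide what kind of data a record is from the fields it carries."""
    if "trace_id" in record and "span_id" in record:
        return "traces"
    if "metric" in record and "value" in record:
        return "metrics"
    return "logs"


class Router:
    """Fan each record out to every backend registered for its kind."""

    def __init__(self, backends):
        self.backends = backends

    def route(self, record):
        kind = classify(record)
        for write in self.backends.get(kind, []):
            write(record)
        return kind


# Lists standing in for backends such as a log store or cold storage.
log_store, trace_store, metric_store, archive = [], [], [], []
router = Router({
    "logs": [log_store.append, archive.append],
    "traces": [trace_store.append],
    "metrics": [metric_store.append],
})

kinds = [
    router.route(record)
    for record in [
        {"level": "ERROR", "message": "User login failed"},
        {"trace_id": "abc", "span_id": "def", "operation": "book_trip"},
        {"metric": "login_failures", "value": 1},
    ]
]
```

Adding or replacing a backend is then a change to the routing table rather than to every service that produces data.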

  64. @tyler_treat Data Pipeline Amazon Glacier Data Router logs traces metrics

  65. @tyler_treat Data Pipeline Amazon Glacier Data Router logs traces metrics

  66. @tyler_treat Data Pipeline Amazon Glacier Data Router logs traces metrics

  67. @tyler_treat This is primarily a stateless component writing to APIs.

  68. @tyler_treat Good fit for “serverless” solutions.
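
As a rough sketch of why a stateless router suits a serverless function, the handler below decodes a batch of stream records and forwards each one. It assumes the event layout AWS Lambda uses for Kinesis triggers (base64-encoded payloads under `Records[].kinesis.data`) and JSON payloads; `make_handler` and the list-based writer are hypothetical and not taken from the talk's demo code.

```python
import base64
import json


def decode_kinesis_records(event):
    """Yield the JSON payload of each record in a Kinesis-triggered event."""
    for item in event.get("Records", []):
        payload = base64.b64decode(item["kinesis"]["data"])
        yield json.loads(payload)


def make_handler(write):
    """Build a handler that keeps no state between invocations."""

    def handler(event, context):
        count = 0
        for record in decode_kinesis_records(event):
            write(record)  # in a real router this would call a backend API
            count += 1
        return {"processed": count}

    return handler


# Local dry run with a hand-built event; no AWS services are involved.
delivered = []
handler = make_handler(delivered.append)
payload = json.dumps({"level": "ERROR", "message": "User login failed"})
sample_event = {
    "Records": [
        {"kinesis": {"data": base64.b64encode(payload.encode()).decode()}}
    ]
}
result = handler(sample_event, None)
```

Because the handler holds nothing between calls, the platform can run as many copies as the stream volume requires.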

  69. @tyler_treat Piecing It All Together

  70. @tyler_treat

  71. @tyler_treat You don’t need to build it out all in one go.
  72. @tyler_treat There are quick wins along the way!

  73. @tyler_treat Evolving to an Observability Pipeline • Adopt structured logging • Move log/data collection out of process • Use a centralized logging system • Introduce a streaming data solution • Start adding data consumers
  74. @tyler_treat Dev/Ops/SRE Systems Production

  75. @tyler_treat Dev/Ops/SRE Systems Production

  76. @tyler_treat Dev/Ops/SRE Systems Production

  77. @tyler_treat Dev/Ops/SRE Systems Production

  78. @tyler_treat Dev/Ops/SRE Systems Production

  79. @tyler_treat Dev/Ops/SRE Systems Production

  80. @tyler_treat
    CI/CD: Pre-Production (theorizing about known unknowns)
    Observability: Post-Production (learning from unknown unknowns)
  81. @tyler_treat Part 2: Demo

  82. @tyler_treat Trip Service Flight Service Hotel Service Car Rental Service

    DynamoDB DynamoDB DynamoDB DynamoDB Book Trip
  83. @tyler_treat Structured logging + context

  84. @tyler_treat Kubernetes

  85. @tyler_treat And now here’s some YAML…

  86. @tyler_treat

  87. @tyler_treat

  88. @tyler_treat Kubernetes

  89. @tyler_treat +

  90. @tyler_treat Kubernetes Kinesis

  91. @tyler_treat AWS Lambda

  92. @tyler_treat Kubernetes Kinesis Lambda

  93. @tyler_treat Kubernetes Kinesis Lambda CloudWatch Jaeger Stackdriver

  94. @tyler_treat Code: https://github.com/RealKinetic/cloud-native-meetup-2019

  95. @tyler_treat Thank You realkinetic.com bravenewgeek.com