The Observability Pipeline

The pervasiveness of cloud and containers has led to systems that are much more distributed and dynamic in nature. Highly elastic microservice and serverless architectures mean containers spin up on demand and scale to zero when that demand goes away. In this world, servers are very much cattle, not pets. This shift has exposed deficiencies in some of the tools and practices we used in the world of servers-as-pets. Specifically, there are questions around how we monitor and debug these types of systems at scale. And with the rise of DevOps and product mindset, making data-driven decisions is becoming increasingly important for agile development teams.

In this talk, we discuss a new approach to system monitoring and data collection: the observability pipeline. For organizations that are heavily siloed, this approach can help empower teams when it comes to operating their software. The observability pipeline provides a layer of abstraction that allows you to get operational data such as logs and metrics everywhere it needs to be without impacting developers and the core system. Unlocking this data can also be a huge win for the business with things like auditability, business analytics, and pricing. Lastly, it allows you to change backing data systems easily or test multiple in parallel. With the amount of data and the number of tools modern systems demand these days, we'll see how the observability pipeline becomes just as essential to the operations of a service as the CI/CD pipeline.

Tyler Treat

April 29, 2019

Transcript

  1. @tyler_treat The Observability Pipeline. Tyler Treat / deliver:Agile 2019 / April 29, 2019
  2. @tyler_treat The way we build systems has fundamentally changed.

  3. @tyler_treat Our systems are more complex than they’ve ever been.

  4. @tyler_treat Don’t believe me?

  5. @tyler_treat https://www.youtube.com/watch?v=xy3w2hGijhE

  6. @tyler_treat Pets vs. Cattle

  7. @tyler_treat This is our server. His name is Toby.

  8. @tyler_treat We take good care of Toby.

  9. @tyler_treat We release to him twice a year (quarterly if we’re feeling dangerous).

  10. @tyler_treat Toby is compatible with most versions of Internet Explorer.

  11. @tyler_treat Toby likes to go on long walks, so sometimes we’ll take him offline for a bit (usually just nights and weekends).
  12. @tyler_treat No one seems to mind.

  13. @tyler_treat Sometimes Toby crashes, but we always make sure to restart him.
  14. @tyler_treat We like Toby.

  15. @tyler_treat This is 74db150601cd.

  16. @tyler_treat It’s best not to get too attached because when he’s no longer needed, well…
  17. @tyler_treat

  18. @tyler_treat [Diagram: Transactional DB, App Server, Reporting DB]

  19. @tyler_treat [Diagram: Transactional DB, App Server, Reporting DB]

  20. @tyler_treat “We need to be highly available.”

  21. @tyler_treat [Diagram: Transactional DB, App Server, Reporting DB]

  22. @tyler_treat [Diagram: Database Cluster (Nodes 1–5), multiple App Servers, Reporting DB]

  23. @tyler_treat [Diagram: Database Cluster (Nodes 1–5), multiple App Servers, Reporting DB]
  24. @tyler_treat “We need to support every device.”

  25. @tyler_treat [Diagram: Database Cluster (Nodes 1–5), multiple App Servers, Reporting DB]

  26. @tyler_treat [Diagram: Database Cluster (Nodes 1–5), multiple App Servers, Reporting DB]
  27. @tyler_treat “We need faster response times.”

  28. @tyler_treat [Diagram: Database Cluster (Nodes 1–5), multiple App Servers, Reporting DB]

  29. @tyler_treat [Diagram: Database Cluster (Nodes 1–5), multiple App Servers, Reporting DB, Cache Cluster (Nodes 1–5)]
  30. @tyler_treat “We need real-time analytics, not batch.”

  31. @tyler_treat [Diagram: Database Cluster (Nodes 1–5), multiple App Servers, Reporting DB, Cache Cluster (Nodes 1–5)]

  32. @tyler_treat [Diagram: Database Cluster (Nodes 1–5), multiple App Servers, Cache Cluster (Nodes 1–5), BI Data Cluster (Nodes 1–5), BI Servers, Data Pipeline]
  33. @tyler_treat “We need to release multiple times a day.”

  34. @tyler_treat [Diagram: Database Cluster (Nodes 1–5), multiple App Servers, Cache Cluster (Nodes 1–5), BI Data Cluster (Nodes 1–5), BI Servers, Data Pipeline]

  35. @tyler_treat [Diagram: four Microservices, each with its own Database Cluster (1–5) and Cache Cluster (1–5); BI Data Cluster (Nodes 1–5), BI Servers, Data Pipeline]
  36. @tyler_treat “We need to support multiple geos.”

  37. @tyler_treat [Diagram: four Microservices, each with its own Database Cluster (1–5) and Cache Cluster (1–5); BI Data Cluster (Nodes 1–5), BI Servers, Data Pipeline]

  38. @tyler_treat [Diagram: two regions, North America and Asia Pacific, each with BI Servers and Microservices]

  39. @tyler_treat [Diagram: four regional deployments (each labeled North America), each with BI Servers and Microservices; CDN]

  40. @tyler_treat [Diagram: four regional deployments (each labeled North America), each with BI Servers and Microservices; CDN; Infrastructure: Load Balancers, Orchestrators, DNS, Configuration, …]

  41. @tyler_treat [Diagram: four regional deployments (each labeled North America), each with BI Servers and Microservices; CDN; CI/CD: Repos, Builders, Artifacts, Deployers; Infrastructure: Load Balancers, Orchestrators, DNS, Configuration, …]
  42. @tyler_treat “Oh, and one more thing…”

  43. @tyler_treat “…we need to do DevOps.”

  44. @tyler_treat [Diagram: four regional deployments (each labeled North America), each with BI Servers and Microservices; CDN; CI/CD: Repos, Builders, Artifacts, Deployers; Infrastructure: Load Balancers, Orchestrators, DNS, Configuration, …]

  45. @tyler_treat [Diagram: four regional deployments (each labeled North America), each with BI Servers and Microservices; CDN; CI/CD: Repos, Builders, Artifacts, Deployers; “DevOps”; Infrastructure: Load Balancers, Orchestrators, DNS, Configuration, …]
  46. @tyler_treat The way we build systems has fundamentally changed.

  47. @tyler_treat Because our constraints and expectations have fundamentally changed.

  48. @tyler_treat Cloud and containers have led to much more distributed and dynamic systems.

  49. @tyler_treat [Diagram: Transactional DB, App Server, Reporting DB]

  50. @tyler_treat [Diagram: four regional deployments (each labeled North America), each with BI Servers and Microservices; CDN; CI/CD: Repos, Builders, Artifacts, Deployers; Infrastructure: Load Balancers, Orchestrators, DNS, Configuration, …; “DevOps”]

  51. @tyler_treat This shift has exposed deficiencies in our tools and practices…

  52. @tyler_treat …and has led to new tools created to help us support our systems.
  53. @tyler_treat How do we make sense of it all?

  54. @tyler_treat In particular, how do we make this…

  55. @tyler_treat [Diagram: four regional deployments (each labeled North America), each with BI Servers and Microservices; CDN; CI/CD: Repos, Builders, Artifacts, Deployers; Infrastructure: Load Balancers, Orchestrators, DNS, Configuration, …; “DevOps”]

  56. @tyler_treat more like this…

  57. @tyler_treat [Diagram: four regional deployments (each labeled North America), each with BI Servers and Microservices; CDN; CI/CD: Repos, Builders, Artifacts, Deployers; Infrastructure: Load Balancers, Orchestrators, DNS, Configuration, …; “DevOps”]
  58. @tyler_treat “The Observability Pipeline”

  59. @tyler_treat A Brave New World

  60. @tyler_treat Operations for

  61. @tyler_treat APM Debugger Profiler SSH grep

  62. @tyler_treat APM Debugger Profiler SSH grep

  63. @tyler_treat APM Debugger Profiler SSH grep

  64. @tyler_treat APM Debugger Profiler SSH grep

  65. @tyler_treat APM Debugger Profiler SSH grep

  66. @tyler_treat APM Debugger Profiler SSH System Behavior grep

  67. @tyler_treat APM Debugger Profiler SSH grep System Behavior Actual Customer Impact
  68. @tyler_treat Operations for

  69. @tyler_treat APM Debugger Profiler SSH grep

  70. @tyler_treat APM Debugger Profiler SSH grep

  71. @tyler_treat APM Debugger Profiler SSH grep (Testing in Production at Scale, Amit Gud)

  72. @tyler_treat APM Debugger Profiler SSH grep System Behavior Actual Customer Impact ???

  73. @tyler_treat APM Debugger Profiler SSH grep System Behavior Actual Customer Impact ???
  74. @tyler_treat Also, culture.

  75. @tyler_treat Many companies rely on a separate operations team to monitor, triage, and even resolve issues.

  76. @tyler_treat This model doesn’t map to the world of microservices and containers.

  77. @tyler_treat And it leads to ineffective feedback loops.

  78. @tyler_treat In order for developers to take on this responsibility, they need to be enabled.

  79. @tyler_treat “DevOps” teams are really “Developer Enablement” teams.

  80. @tyler_treat This shift in how we build systems has caused an explosion of new tools and terminology.
  81. @tyler_treat “Observability”

  82. @tyler_treat Post Hoc vs. Ad Hoc

  83. @tyler_treat [Quadrant chart; axes: Data Available, Understanding]

  84. @tyler_treat [Chart adds] Known Knowns • Things we are aware of and understand • “The system has a 1GB memory limit”

  85. @tyler_treat [Chart adds] Known Unknowns • Things we are aware of but don’t understand • “The system exceeded its memory limit and crashed, causing an outage”

  86. @tyler_treat [Chart adds] Unknown Knowns • Things we understand but are not aware of • “We implemented an orchestrator to ensure the system is always running”

  87. @tyler_treat [Chart adds] Unknown Unknowns • Things we are neither aware of nor understand • “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”

  88. @tyler_treat [Chart adds label] FACTS

  89. @tyler_treat [Chart adds label] HYPOTHESES

  90. @tyler_treat [Chart adds label] ASSUMPTIONS

  91. @tyler_treat [Chart adds label] DISCOVERIES

  92. @tyler_treat Known Unknowns (HYPOTHESES): Monitoring. Unknown Unknowns (DISCOVERIES): Observability.

  93. @tyler_treat Known Unknowns (HYPOTHESES): Testing. Unknown Unknowns (DISCOVERIES): Exploring.
  94. @tyler_treat “The army is now fully prepared to fight the previous war.”

  95. @tyler_treat Observability Data: application logs, system logs, audit logs, application metrics, distributed traces, events

  96. @tyler_treat Some challenges… Observability Data: application logs, system logs, audit logs, application metrics, distributed traces, events
    - Locked up inside a single vendor’s solution
    - Not readily available across the enterprise (or in some cases, too readily available)
    - Many tools and products needed for different data and use cases
    - Tool and data needs vary from team to team
    - Ever-changing landscape of tools, products, and services
    - Sheer volume of data can be overwhelming
  97. @tyler_treat System

  98. @tyler_treat System: Splunk Universal Forwarder

  99. @tyler_treat System: Splunk Universal Forwarder, Datadog Metrics Agent, Datadog APM Agent

  100. @tyler_treat System: Splunk Universal Forwarder, Datadog Metrics Agent, Datadog APM Agent, Universal Analytics Client

  101. @tyler_treat System: Splunk Universal Forwarder, Datadog Metrics Agent, Datadog APM Agent, Universal Analytics Client, Amazon Glacier S3 Client

  102. @tyler_treat System: Splunk Universal Forwarder, Datadog Metrics Agent, Datadog APM Agent, Universal Analytics Client, Amazon Glacier S3 Client, …
  103. [Diagram: a fleet of Systems, each bundling its own agents: Splunk Universal Forwarder, Datadog Metrics Agent, Datadog APM Agent, Universal Analytics Client, S3 Client, …]

  104. @tyler_treat “Oh, actually we want to change how we parse our logs.”

  105. [Diagram: the same fleet of Systems and per-System agents as slide 103]

  106. @tyler_treat “Re-roll the agents.”

  107. @tyler_treat “Oh, actually we want to use Sumo Logic for logging.”

  108. [Diagram: the same fleet of Systems and per-System agents as slide 103]

  109. @tyler_treat “Re-roll the agents.”

  110. [Diagram: the same fleet, with Sumo Logic Collector in place of Splunk Universal Forwarder on every System; Datadog Metrics Agent, Datadog APM Agent, Universal Analytics Client, S3 Client, … unchanged]

  111. @tyler_treat “Oh, actually we want to use New Relic for APM.”

  112. [Diagram: the same fleet and agents as slide 110]

  113. @tyler_treat “Re-roll the agents.”

  114. [Diagram: the same fleet, each System now running Sumo Logic Collector, New Relic APM Agent, Universal Analytics Client, S3 Client, …]

  115. @tyler_treat “Oh, actually we want to evaluate Honeycomb for debugging.”

  116. [Diagram: the same fleet and agents as slide 114]

  117. @tyler_treat “Re-roll the agents.”

  118. [Diagram: the same fleet and agents as slide 114, with a Honeytail Agent added to Systems]
  119. @tyler_treat You get the idea.

  120. @tyler_treat How big of a lift is it for your organization to change tools?

  121. @tyler_treat How easy is it to experiment with new ones?

  122. @tyler_treat What data to send? Where to send it? How to send it?
    Data Sources: • VMs • Containers • Load balancers • Service meshes • Audit logs • VPC flow logs • Firewall logs • …
    Data Sinks: • Centralized logging • SIEM • Monitoring • APM • Alerting • Cold storage • BI • …

  123. @tyler_treat A decoupled approach

  124. @tyler_treat What data to send? Where to send it? How to send it?
    Data Sources: • VMs • Containers • Load balancers • Service meshes • Audit logs • VPC flow logs • Firewall logs • …
    Observability Pipeline
    Data Sinks: • Centralized logging • SIEM • Monitoring • APM • Alerting • Cold storage • BI • …
  125. @tyler_treat Anatomy of an Observability Pipeline

  126. @tyler_treat 1. Data Specifications: Structure your damn data.

  127. @tyler_treat log.error("User '{}' login failed".format(user))

  128. @tyler_treat ERROR 2019-04-05 13:26.42 User 'tylertreat' login failed

  129. @tyler_treat log.error("User login failed", event=LOGIN_ERROR, user="tylertreat", email="tyler.treat@realkinetic.com", error=error)

  130. @tyler_treat { "timestamp": "2019-04-05 13:26.42", "level": "ERROR", "event": "user_login_error", "user": "tylertreat", "email": "tyler.treat@realkinetic.com", "error": "Invalid username or password", "message": "User login failed" }
  131. @tyler_treat JSON is fine.
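
A minimal sketch of the structured call shown on slides 129 and 130, using only the Python standard library rather than any specific structured-logging package. The helper names `build_event` and `log_event`, and the exact field layout, are assumptions made for illustration; the talk does not prescribe an API.

```python
import json
import logging
import sys
from datetime import datetime, timezone

logger = logging.getLogger("app")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(sys.stdout))


def build_event(level, message, **fields):
    """Return one log event as a flat dict (hypothetical helper)."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
    }
    event.update(fields)  # caller-supplied key/value pairs, e.g. user=...
    return event


def log_event(level, message, **fields):
    """Serialize the event as a single JSON line, emit it, and return it."""
    line = json.dumps(build_event(level, message, **fields))
    logger.log(getattr(logging, level), line)
    return line


log_event("ERROR", "User login failed",
          event="user_login_error",
          user="tylertreat",
          error="Invalid username or password")
```

Because every field is a named key rather than text interpolated into a message, a downstream consumer can filter on `event` or `user` without parsing free-form strings.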

  132. @tyler_treat Pass a context object to everything.

  133. @tyler_treat
    def login(ctx, username, email, password):
        ctx.set(user=username, email=email)
        ...
        log.error("User login failed", event=LOGIN_ERROR, context=ctx, error=error)
        ...

  134. @tyler_treat { "timestamp": "2019-04-05 13:26.42", "level": "ERROR", "event": "user_login_error", "context": { "id": "accfbb8315c44a52ad893ca6772e1caf", "http_method": "POST", "http_path": "/login", "user": "tylertreat", "email": "tyler.treat@realkinetic.com" }, "error": "Invalid username or password", "message": "User login failed" }

  135. @tyler_treat { "timestamp": "2019-04-05 13:26.42", "level": "ERROR", "event": "user_login_error", "context": { "id": "accfbb8315c44a52ad893ca6772e1caf", "http_method": "POST", "http_path": "/login", "user": "tylertreat", "email": "tyler.treat@realkinetic.com" }, "error": "Invalid username or password", "message": "User login failed" }
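
One way the context object from the login example might look, as a rough sketch. The `Context` class, its `set` and `to_dict` methods, and the `log_error` helper are hypothetical names chosen here; the slides only show `ctx.set(...)` and `context=ctx`. Authentication is stubbed out so the example always takes the failure path.

```python
import json
import uuid


class Context:
    """Request-scoped bag of fields attached to every log event (hypothetical)."""

    def __init__(self, **fields):
        self.fields = {"id": uuid.uuid4().hex}
        self.fields.update(fields)

    def set(self, **fields):
        """Add or overwrite fields as more becomes known about the request."""
        self.fields.update(fields)

    def to_dict(self):
        return dict(self.fields)


def log_error(message, event, context, error):
    """Build the JSON line for an error event; print stands in for a logger."""
    record = {
        "level": "ERROR",
        "event": event,
        "context": context.to_dict(),
        "error": error,
        "message": message,
    }
    line = json.dumps(record)
    print(line)
    return line


def login(ctx, username, email, password):
    ctx.set(user=username, email=email)  # the password never goes on the context
    # Authentication is stubbed out: this sketch always takes the failure path.
    return log_error("User login failed", event="user_login_error",
                     context=ctx, error="Invalid username or password")


ctx = Context(http_method="POST", http_path="/login")
login(ctx, "tylertreat", "tyler.treat@realkinetic.com", "not-logged")
```

Fields set early in a request (method, path, request id) ride along on every later log call without each call site repeating them.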
  136. @tyler_treat What goes on the context?

  137. @tyler_treat What can you get for “free” and what do you need to pass along?

  138. @tyler_treat Create standard specs for each data type collected (logs, metrics, traces).

  139. @tyler_treat Specs can enforce required fields (e.g. user id, license, trace id) and data types.

  140. @tyler_treat { "timestamp": "2019-04-05 13:26.42", "level": "INFO", "event": "user_login", "context": { "id": "accfbb8315c44a52ad893ca6772e1caf", "http_method": "POST", "http_path": "/login", "user": "tylertreat", "user_id": "3bb12f6c63274abe87fd1ee4ee37f3d2", "license": "942e6543f0844be680e72003d5e060fd", "email": "tyler.treat@realkinetic.com" } }
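
A spec that enforces required fields and data types could be checked with something as small as the sketch below. The particular required fields (`REQUIRED_TOP_LEVEL`, `REQUIRED_CONTEXT`) are assumptions loosely based on the example event above, not a spec given in the talk; a real spec might instead be written in a schema language.

```python
# Assumed spec for log events: field name -> required Python type.
REQUIRED_TOP_LEVEL = {"timestamp": str, "level": str, "event": str, "context": dict}
REQUIRED_CONTEXT = {"id": str, "user_id": str, "license": str}


def check_fields(data, required, prefix=""):
    """Return violations for one mapping against one field table."""
    problems = []
    for name, expected in required.items():
        if name not in data:
            problems.append(f"missing field: {prefix}{name}")
        elif not isinstance(data[name], expected):
            problems.append(
                f"wrong type for {prefix}{name}: expected {expected.__name__}")
    return problems


def validate_log_event(event):
    """Return a list of spec violations; an empty list means the event conforms."""
    problems = check_fields(event, REQUIRED_TOP_LEVEL)
    context = event.get("context")
    if isinstance(context, dict):
        problems += check_fields(context, REQUIRED_CONTEXT, prefix="context.")
    return problems


good = {
    "timestamp": "2019-04-05 13:26.42",
    "level": "INFO",
    "event": "user_login",
    "context": {
        "id": "accfbb8315c44a52ad893ca6772e1caf",
        "user_id": "3bb12f6c63274abe87fd1ee4ee37f3d2",
        "license": "942e6543f0844be680e72003d5e060fd",
    },
}
print(validate_log_event(good))
```

Running the same check in the logging library and again at the router lets nonconforming events be rejected or quarantined before they reach a backend.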
  141. @tyler_treat Be mindful not to log sensitive data like passwords.
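
As one possible guard against logging sensitive data, an event can be scrubbed by key name before it is serialized. This is only a sketch: `SENSITIVE_KEYS` is an assumed list, and matching on key names will not catch a secret that has been interpolated into a free-text message.

```python
# Assumed set of key names to scrub; extend it to match your own spec.
SENSITIVE_KEYS = {"password", "token", "secret"}


def redact(value):
    """Return a copy of value with sensitive keys masked, recursing into containers."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


event = {
    "event": "user_login_error",
    "context": {"user": "tylertreat", "password": "not-for-logs"},
}
print(redact(event))
```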

  142. @tyler_treat 2. Specification Libraries: Specs alone aren’t enough!

  143. @tyler_treat Empowering developers requires providing tools that align the “easy” path with the “right” path.

  144. @tyler_treat We need libraries that implement the specs and make it easy for devs to instrument their systems.

  145. @tyler_treat There are many existing libraries for structured logging. • Java: log4j • Go: logrus • Python: structlog • Ruby: ruby-cabin • .NET: serilog • JS: structured-log • etc.

  146. @tyler_treat For tracing and metrics, there are vendor-neutral APIs like OpenTracing and OpenCensus.
  147. @tyler_treat 3. Data Collector: We need a lightweight agent that can collect data from hosts/containers.

  148. @tyler_treat Collect data, perform transformations/filters, and write it to the data pipeline.

  149. @tyler_treat Typically runs as an agent on the host (DaemonSet in Kubernetes).

  150. @tyler_treat Data is written to stdout/stderr or a Unix domain socket.
  151. @tyler_treat Just use Fluentd or Logstash (+Beats).

  152. @tyler_treat 4. Data Pipeline: We need a scalable, fault-tolerant data stream to handle the firehose of observability data generated.

  153. @tyler_treat This also provides a buffer that decouples producers from consumers.
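
The decoupling idea can be illustrated in miniature with a bounded in-memory queue. This is a toy stand-in, not a pipeline: a real deployment would use a durable, replicated stream, and the function names here (`producer`, `consumer`, `run`) are made up for the sketch.

```python
import queue
import threading

SENTINEL = None  # marks end of stream in this sketch


def producer(buffer, events):
    """Write events into the buffer without knowing who reads them."""
    for event in events:
        buffer.put(event)  # blocks only while the buffer is full
    buffer.put(SENTINEL)


def consumer(buffer, sink):
    """Drain the buffer at its own pace; sink stands in for a backend."""
    while True:
        event = buffer.get()
        if event is SENTINEL:
            break
        sink.append(event)


def run(events, capacity=100):
    """Run one producer and one consumer against a bounded buffer."""
    buffer = queue.Queue(maxsize=capacity)
    delivered = []
    worker = threading.Thread(target=consumer, args=(buffer, delivered))
    worker.start()
    producer(buffer, events)
    worker.join()
    return delivered


print(len(run([{"n": i} for i in range(1000)], capacity=10)))
```

The producer only ever talks to the buffer, so a slow or replaced consumer changes nothing on the producing side beyond how full the buffer gets.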
  154. @tyler_treat System: Splunk Universal Forwarder, Datadog Metrics Agent, Datadog APM Agent, Universal Analytics Client, Amazon Glacier S3 Client, …

  155. @tyler_treat System: Splunk Universal Forwarder, Datadog Metrics Agent, Datadog APM Agent, Universal Analytics Client, Amazon Glacier S3 Client, …
  156. @tyler_treat Lots of options…

  157. @tyler_treat

  158. @tyler_treat 5. Data Router: We need a component to consume data from the pipeline, perform filtering, and write it to the appropriate backends.

  159. @tyler_treat May perform transformations and processing of data, but heavy processing (e.g. alerting or aggregations) should be the responsibility of a backend system.
  160. @tyler_treat This is where the data spec comes into play.

  161. @tyler_treat The data type determines how incoming data is routed.
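
Routing by data type can be as simple as a lookup table, sketched below. The `type` field on each record, the `ROUTES` table, and the `dead_letter` fallback are all assumptions for illustration; the backend names are borrowed from the Data Sinks list earlier in the deck. A real router would call each backend's API where this sketch just groups records.

```python
from collections import defaultdict

# Hypothetical routing table: data type -> backends that should receive it.
ROUTES = {
    "log": ["centralized_logging", "cold_storage"],
    "metric": ["monitoring"],
    "trace": ["apm"],
}


def route(record, routes=ROUTES):
    """Return the backend names for a record based on its declared type."""
    return routes.get(record.get("type"), ["dead_letter"])


def dispatch(records, routes=ROUTES):
    """Group records by destination backend."""
    outbox = defaultdict(list)
    for record in records:
        for backend in route(record, routes):
            outbox[backend].append(record)
    return dict(outbox)


batch = [
    {"type": "log", "message": "User login failed"},
    {"type": "metric", "name": "login_failures", "value": 1},
    {"type": "unknown"},
]
print(sorted(dispatch(batch)))
```

Swapping or adding a backend then becomes a change to the routing table rather than a change on every host.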

  162. @tyler_treat Data Pipeline Amazon Glacier Data Router logs traces metrics

  163. @tyler_treat Data Pipeline Amazon Glacier Data Router logs traces metrics

  164. @tyler_treat Data Pipeline Amazon Glacier Data Router logs traces metrics

  165. @tyler_treat This is primarily a stateless component writing to APIs.

  166. @tyler_treat Good fit for “serverless” solutions.

  167. @tyler_treat Piecing It All Together

  168. @tyler_treat

  169. @tyler_treat You don’t need to build it out all in one go.

  170. @tyler_treat There are quick wins along the way!

  171. @tyler_treat Evolving to an Observability Pipeline
    • Adopt structured logging
    • Move log/data collection out of process
    • Use a centralized logging system
    • Introduce a streaming data solution
    • Start adding data consumers

  172. @tyler_treat Moving from host-centric to service-centric observability.

  173. @tyler_treat This maps to VMs and containers as well as it does to “serverless” models.
  174. @tyler_treat [Diagram: Production Systems; Ops; Product Development; Product Management; Security & Compliance; Support/Helpdesk]

  175. @tyler_treat [Diagram: Production Systems; Dev/Ops/SRE; Audit; Business Analytics; Pricing Decisions; Data-Driven Product Decisions; Threat Detection; Monitoring; Debugging & Operational Insights; ...]
  176. @tyler_treat Dev/Ops/SRE Systems Production

  177. @tyler_treat Dev/Ops/SRE Systems Production

  178. @tyler_treat Dev/Ops/SRE Systems Production

  179. @tyler_treat Dev/Ops/SRE Systems Production

  180. @tyler_treat Dev/Ops/SRE Systems Production

  181. @tyler_treat Dev/Ops/SRE Systems Production

  182. @tyler_treat Benefits
    • Pattern can be adopted incrementally, with quick wins along the way
    • Maps better to elastic and serverless architectures
    • Empowers teams in siloed organizations and unlocks data for other parts of the business
    • Enables teams to use the tools best suited to their needs
    • Decoupling makes it easier to change tools or evaluate them side-by-side
    • Minimizes impact on developers and the core system

  183. @tyler_treat But it’s not a silver bullet.

  184. @tyler_treat Downsides
    • Moving away from agent-based model means we have to handle data routing ourselves
    • A lot of the Data Router components might need to be custom-made using various vendor SDKs or client libraries (assuming they have APIs)
    • This also means we might lose some of the value-add features of certain agents
    • Unclear how well this maps to pull-based models (e.g. Prometheus)
  185. @tyler_treat CI/CD Pipeline + Observability Pipeline

  186. @tyler_treat CI/CD: Pre-Production (theorizing about known unknowns). Observability: Post-Production (learning from unknown unknowns).

  187. @tyler_treat Thank You. realkinetic.com, bravenewgeek.com