The Observability Pipeline

The pervasiveness of cloud and containers has led to systems that are much more distributed and dynamic in nature. Highly elastic microservice and serverless architectures mean containers spin up on demand and scale to zero when that demand goes away. In this world, servers are very much cattle, not pets. This shift has exposed deficiencies in some of the tools and practices we used in the world of servers-as-pets. Specifically, there are questions around how we monitor and debug these types of systems at scale. And with the rise of DevOps and a product mindset, making data-driven decisions is becoming increasingly important for agile development teams.

In this talk, we discuss a new approach to system monitoring and data collection: the observability pipeline. For organizations that are heavily siloed, this approach can help empower teams when it comes to operating their software. The observability pipeline provides a layer of abstraction that allows you to get operational data such as logs and metrics everywhere it needs to be without impacting developers and the core system. Unlocking this data can also be a huge win for the business with things like auditability, business analytics, and pricing. Lastly, it allows you to change backing data systems easily or test several of them in parallel. With the amount of data and the number of tools modern systems demand these days, we'll see how the observability pipeline becomes just as essential to the operations of a service as the CI/CD pipeline.
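
As a rough illustration of the "layer of abstraction" idea, here is a minimal Python sketch. It is not taken from the talk: the `ObservabilityPipeline` class, the event schema (a dict with a `"type"` key), and the in-memory lists standing in for logging and metrics backends are all hypothetical. The point is only that the application publishes to one place, and backends can be added or tried in parallel without touching the producer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Assumed event schema: a plain dict whose "type" key says what kind of data it is.
Event = Dict[str, object]
Sink = Callable[[Event], None]


@dataclass
class ObservabilityPipeline:
    """Hypothetical router: producers publish here, never to a backend directly."""
    routes: Dict[str, List[Sink]] = field(default_factory=dict)

    def subscribe(self, event_type: str, sink: Sink) -> None:
        # Register a sink (backend adapter) for one event type.
        self.routes.setdefault(event_type, []).append(sink)

    def publish(self, event: Event) -> int:
        # Deliver the event to every sink registered for its type.
        # Returns how many sinks received it.
        sinks = self.routes.get(str(event.get("type")), [])
        for sink in sinks:
            sink(event)
        return len(sinks)


# In-memory lists standing in for real logging and metrics backends.
log_backend_a: List[Event] = []
log_backend_b: List[Event] = []
metrics_backend: List[Event] = []

pipeline = ObservabilityPipeline()
pipeline.subscribe("log", log_backend_a.append)
pipeline.subscribe("metric", metrics_backend.append)

# The application only ever calls publish().
pipeline.publish({"type": "log", "message": "user signed in"})
pipeline.publish({"type": "metric", "name": "latency_ms", "value": 42})

# Trial a second log backend in parallel, with no change to the producer.
pipeline.subscribe("log", log_backend_b.append)
pipeline.publish({"type": "log", "message": "user signed out"})
```

In this sketch, evaluating a new backend is one extra `subscribe` call on the pipeline, with no change to the code that emits the events.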

Tyler Treat

April 29, 2019

Transcript

  1. @tyler_treat
    The Observability Pipeline
    Tyler Treat / deliver:Agile 2019 / April 29, 2019

  2. @tyler_treat
    The way we build systems has
    fundamentally changed.

  3. @tyler_treat
    Our systems are more complex
    than they’ve ever been.

  4. @tyler_treat
    Don’t believe me?

  5. @tyler_treat
    https://www.youtube.com/watch?v=xy3w2hGijhE

  6. @tyler_treat
    Pets vs. Cattle

  7. @tyler_treat
    This is our server.
    His name is Toby.

  8. @tyler_treat
    We take good care of Toby.

  9. @tyler_treat
    We release to him twice a year.
    (quarterly if we’re feeling dangerous)

  10. @tyler_treat
    Toby is compatible with most
    versions of Internet Explorer.

  11. @tyler_treat
    Toby likes to go on long walks,
    so sometimes we’ll take him
    offline for a bit.
    (usually just nights and weekends)

  12. @tyler_treat
    No one seems to mind.

  13. @tyler_treat
    Sometimes Toby crashes,
    but we always make sure
    to restart him.

  14. @tyler_treat
    We like Toby.

  15. @tyler_treat
    This is 74db150601cd.

  16. @tyler_treat
    It’s best not to get too
    attached because when he’s
    no longer needed, well…

  17. @tyler_treat

  18. @tyler_treat
    [Diagram: App Server, Transactional DB, Reporting DB]

  19. @tyler_treat
    [Diagram: App Server, Transactional DB, Reporting DB]

  20. @tyler_treat
    “We need to be
    highly available.”

  21. @tyler_treat
    [Diagram: App Server, Transactional DB, Reporting DB]

  22. @tyler_treat
    [Diagram: App Servers, Database Cluster (Nodes 1-5), Reporting DB]

  23. @tyler_treat
    [Diagram: App Servers, Database Cluster (Nodes 1-5), Reporting DB]

  24. @tyler_treat
    “We need to support
    every device.”

  25. @tyler_treat
    [Diagram: App Servers, Database Cluster (Nodes 1-5), Reporting DB]

  26. @tyler_treat
    [Diagram: App Servers, Database Cluster (Nodes 1-5), Reporting DB]

  27. @tyler_treat
    “We need faster
    response times.”

  28. @tyler_treat
    [Diagram: App Servers, Database Cluster (Nodes 1-5), Reporting DB]

  29. @tyler_treat
    [Diagram: App Servers, Database Cluster (Nodes 1-5), Cache Cluster (Nodes 1-5), Reporting DB]

  30. @tyler_treat
    “We need real-time
    analytics, not batch.”

  31. @tyler_treat
    [Diagram: App Servers, Database Cluster (Nodes 1-5), Cache Cluster (Nodes 1-5), Reporting DB]

  32. @tyler_treat
    [Diagram: App Servers, Database Cluster (Nodes 1-5), Cache Cluster (Nodes 1-5), Data Pipeline, BI Data Cluster (Nodes 1-5), BI Servers]

  33. @tyler_treat
    “We need to release
    multiple times a day.”

  34. @tyler_treat
    [Diagram: App Servers, Database Cluster (Nodes 1-5), Cache Cluster (Nodes 1-5), Data Pipeline, BI Data Cluster (Nodes 1-5), BI Servers]

  35. @tyler_treat
    [Diagram: four Microservices, each with its own Database Cluster and Cache Cluster (5 nodes each); Data Pipeline; BI Data Cluster (Nodes 1-5); BI Servers]

  36. @tyler_treat
    “We need to support
    multiple geos.”

  37. @tyler_treat
    [Diagram: four Microservices, each with its own Database Cluster and Cache Cluster (5 nodes each); Data Pipeline; BI Data Cluster (Nodes 1-5); BI Servers]

  38. @tyler_treat
    [Diagram: two regions, North America and Asia Pacific, each with BI Servers and Microservices]

  39. @tyler_treat
    [Diagram: four regional deployments (each labeled “North America” in the extracted text) with BI Servers and Microservices; CDN]

  40. @tyler_treat
    [Diagram: four regional deployments (each labeled “North America” in the extracted text) with BI Servers and Microservices; CDN; Infrastructure (Load Balancers, Orchestrators, DNS, Configuration, . . .)]

  41. @tyler_treat
    [Diagram: four regional deployments (each labeled “North America” in the extracted text) with BI Servers and Microservices; CDN; CI/CD (Repos, Builders, Artifacts, Deployers); Infrastructure (Load Balancers, Orchestrators, DNS, Configuration, . . .)]

  42. @tyler_treat
    “Oh, and one more
    thing…”

  43. @tyler_treat
    “…we need to do
    DevOps.”

  44. @tyler_treat
    [Diagram: four regional deployments (each labeled “North America” in the extracted text) with BI Servers and Microservices; CDN; CI/CD (Repos, Builders, Artifacts, Deployers); Infrastructure (Load Balancers, Orchestrators, DNS, Configuration, . . .)]

  45. @tyler_treat
    [Diagram: four regional deployments (each labeled “North America” in the extracted text) with BI Servers and Microservices; CDN; CI/CD (Repos, Builders, Artifacts, Deployers); Infrastructure (Load Balancers, Orchestrators, DNS, Configuration, . . .); “DevOps”]

  46. @tyler_treat
    The way we build systems has
    fundamentally changed.

  47. @tyler_treat
    Because our constraints and expectations
    have fundamentally changed.

  48. @tyler_treat
    Cloud and containers have led to much
    more distributed and dynamic systems.

  49. @tyler_treat
    [Diagram: App Server, Transactional DB, Reporting DB]

  50. @tyler_treat
    [Diagram: four regional deployments (each labeled “North America” in the extracted text) with BI Servers and Microservices; CDN; CI/CD (Repos, Builders, Artifacts, Deployers); Infrastructure (Load Balancers, Orchestrators, DNS, Configuration, . . .); “DevOps”]

  51. @tyler_treat
    This shift has exposed deficiencies
    in our tools and practices…

  52. @tyler_treat
    …and has led to new tools created
    to help us support our systems.

  53. @tyler_treat
    How do we make sense of it all?

  54. @tyler_treat
    In particular, how do we make
    this…

  55. @tyler_treat
    [Diagram: four regional deployments (each labeled “North America” in the extracted text) with BI Servers and Microservices; CDN; CI/CD (Repos, Builders, Artifacts, Deployers); Infrastructure (Load Balancers, Orchestrators, DNS, Configuration, . . .); “DevOps”]

  56. @tyler_treat
    more like this…

  57. @tyler_treat
    [Diagram: four regional deployments (each labeled “North America” in the extracted text) with BI Servers and Microservices; CDN; CI/CD (Repos, Builders, Artifacts, Deployers); Infrastructure (Load Balancers, Orchestrators, DNS, Configuration, . . .); “DevOps”]

  58. @tyler_treat
    “The Observability Pipeline”

  59. @tyler_treat
    A Brave New World

  60. @tyler_treat
    Operations for

  61. @tyler_treat
    APM
    Debugger
    Profiler
    SSH
    grep

  62. @tyler_treat
    APM
    Debugger
    Profiler
    SSH
    grep

  63. @tyler_treat
    APM
    Debugger
    Profiler
    SSH
    grep

  64. @tyler_treat
    APM
    Debugger
    Profiler
    SSH
    grep

  65. @tyler_treat
    APM
    Debugger
    Profiler
    SSH
    grep

  66. @tyler_treat
    APM
    Debugger
    Profiler
    SSH System Behavior
    grep

  67. @tyler_treat
    APM
    Debugger
    Profiler
    SSH System Behavior
    Actual Customer Impact
    grep

  68. @tyler_treat
    Operations for

  69. @tyler_treat
    APM
    Debugger
    Profiler
    SSH
    grep

  70. @tyler_treat
    APM
    Debugger
    Profiler
    SSH
    grep

  71. @tyler_treat
    APM
    Debugger
    Profiler
    SSH
    Testing in Production at Scale, Amit Gud
    grep

  72. @tyler_treat
    APM
    Debugger
    Profiler
    SSH System Behavior
    Actual Customer Impact
    ???
    grep

  73. @tyler_treat
    grep
    APM
    Debugger
    Profiler
    SSH System Behavior
    Actual Customer Impact
    ???

  74. @tyler_treat
    Also, culture.

  75. @tyler_treat
    Many companies rely on a separate
    operations team to monitor, triage, and
    even resolve issues.

  76. @tyler_treat
    This model doesn’t map to the world
    of microservices and containers.

  77. @tyler_treat
    And it leads to ineffective
    feedback loops.

  78. @tyler_treat
    In order for developers to take on this
    responsibility, they need to be enabled.

  79. @tyler_treat
    “DevOps” teams are really
    “Developer Enablement” teams.

  80. @tyler_treat
    This shift in how we build systems has
    caused an explosion of new tools and
    terminology.

  81. @tyler_treat
    “Observability”

  82. @tyler_treat
    Post Hoc vs. Ad Hoc

  83. @tyler_treat
    Data Available
    Understanding

  84. @tyler_treat
    Data Available
    Understanding
    Known Knowns
    • Things we are aware of and understand
    • “The system has a 1GB memory limit”

  85. @tyler_treat
    Data Available
    Understanding
    Known Knowns
    • Things we are aware of and understand
    • “The system has a 1GB memory limit”
    Known Unknowns
    • Things we are aware of but don’t understand
    • “The system exceeded its memory limit and crashed, causing an outage”

  86. @tyler_treat
    Data Available
    Understanding
    Unknown Knowns
    • Things we understand but are not aware of
    • “We implemented an orchestrator to ensure the system is always running”
    Known Knowns
    • Things we are aware of and understand
    • “The system has a 1GB memory limit”
    Known Unknowns
    • Things we are aware of but don’t understand
    • “The system exceeded its memory limit and crashed, causing an outage”

  87. @tyler_treat
    Data Available
    Understanding
    Unknown Knowns
    • Things we understand but are not aware of
    • “We implemented an orchestrator to ensure the system is always running”
    Known Knowns
    • Things we are aware of and understand
    • “The system has a 1GB memory limit”
    Unknown Unknowns
    • Things we are neither aware of nor understand
    • “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
    Known Unknowns
    • Things we are aware of but don’t understand
    • “The system exceeded its memory limit and crashed, causing an outage”

  88. @tyler_treat
    Data Available
    Understanding
    Unknown Knowns
    • Things we understand but are not aware of
    • “We implemented an orchestrator to ensure the system is always running”
    Known Knowns
    • Things we are aware of and understand
    • “The system has a 1GB memory limit”
    Unknown Unknowns
    • Things we are neither aware of nor understand
    • “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
    Known Unknowns
    • Things we are aware of but don’t understand
    • “The system exceeded its memory limit and crashed, causing an outage”
    FACTS

  89. @tyler_treat
    Data Available
    Understanding
    Unknown Knowns
    • Things we understand but are not aware of
    • “We implemented an orchestrator to ensure the system is always running”
    Known Knowns
    • Things we are aware of and understand
    • “The system has a 1GB memory limit”
    Unknown Unknowns
    • Things we are neither aware of nor understand
    • “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
    Known Unknowns
    • Things we are aware of but don’t understand
    • “The system exceeded its memory limit and crashed, causing an outage”
    FACTS
    HYPOTHESES

  90. @tyler_treat
    Data Available
    Understanding
    Unknown Knowns
    • Things we understand but are not aware of
    • “We implemented an orchestrator to ensure the system is always running”
    Known Knowns
    • Things we are aware of and understand
    • “The system has a 1GB memory limit”
    Unknown Unknowns
    • Things we are neither aware of nor understand
    • “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
    Known Unknowns
    • Things we are aware of but don’t understand
    • “The system exceeded its memory limit and crashed, causing an outage”
    ASSUMPTIONS FACTS
    HYPOTHESES

  91. @tyler_treat
    Unknown Unknowns
    • Things we are neither aware of nor understand
    • “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
    DISCOVERIES
    Data Available
    Understanding
    Unknown Knowns
    • Things we understand but are not aware of
    • “We implemented an orchestrator to ensure the system is always running”
    Known Knowns
    • Things we are aware of and understand
    • “The system has a 1GB memory limit”
    Known Unknowns
    • Things we are aware of but don’t understand
    • “The system exceeded its memory limit and crashed, causing an outage”
    ASSUMPTIONS FACTS
    HYPOTHESES

  92. @tyler_treat
    Unknown Unknowns
    • Things we are neither aware of nor understand
    • “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
    DISCOVERIES
    Data Available
    Understanding
    Known Unknowns
    • Things we are aware of but don’t understand
    • “The system exceeded its memory limit and crashed, causing an outage”
    HYPOTHESES
    Monitoring
    Observability

  93. @tyler_treat
    Unknown Unknowns
    • Things we are neither aware of nor understand
    • “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
    DISCOVERIES
    Data Available
    Understanding
    Known Unknowns
    • Things we are aware of but don’t understand
    • “The system exceeded its memory limit and crashed, causing an outage”
    HYPOTHESES
    Testing
    Exploring

  94. @tyler_treat
    “The army is now fully prepared
    to fight the previous war.”

  95. @tyler_treat

    Observability Data
    application logs
    system logs
    audit logs
    application metrics
    distributed traces
    events

  96. @tyler_treat
    Some challenges…
    Observability Data
    application logs
    system logs
    audit logs
    application metrics
    distributed traces
    events
    - Locked up inside a single vendor’s solution
    - Not readily available across the enterprise (or in some cases, too readily available)
    - Many tools and products needed for different data and use cases
    - Tool and data needs vary from team to team
    - Ever-changing landscape of tools, products, and services
    - Sheer volume of data can be overwhelming

  97. @tyler_treat
    System

  98. @tyler_treat
    System
    Splunk Universal Forwarder

  99. @tyler_treat
    System
    Splunk Universal Forwarder
    Datadog Metrics Agent
    Datadog APM Agent

  100. @tyler_treat
    System
    Splunk Universal Forwarder
    Datadog Metrics Agent
    Datadog APM Agent
    Universal Analytics Client

  101. @tyler_treat
    System
    Splunk Universal Forwarder
    Datadog Metrics Agent
    Datadog APM Agent
    Universal Analytics Client
    Amazon Glacier
    S3 Client

  102. @tyler_treat
    System
    Splunk Universal Forwarder
    Datadog APM Agent
    Universal Analytics Client
    Amazon Glacier
    S3 Client
    Datadog Metrics Agent

  103.
    [Diagram: many System instances, each bundled with its own Splunk Universal Forwarder, Datadog APM Agent, Datadog Metrics Agent, Universal Analytics Client, and S3 Client]

  104. @tyler_treat
    “Oh, actually we want to change
    how we parse our logs.”

  105.
    [Diagram: many System instances, each bundled with its own Splunk Universal Forwarder, Datadog APM Agent, Datadog Metrics Agent, Universal Analytics Client, and S3 Client]

  106. @tyler_treat
    “Re-roll the agents.”

  107. @tyler_treat
    “Oh, actually we want to use
    Sumo Logic for logging.”

  108.
    [Diagram: many System instances, each bundled with its own Splunk Universal Forwarder, Datadog APM Agent, Datadog Metrics Agent, Universal Analytics Client, and S3 Client]

  109. @tyler_treat
    “Re-roll the agents.”

  110.
    [Diagram: many System instances, each bundled with its own Sumo Logic Collector, Datadog APM Agent, Datadog Metrics Agent, Universal Analytics Client, and S3 Client]

  111. @tyler_treat
    “Oh, actually we want to use
    New Relic for APM.”

  112.
    [Diagram: many System instances, each bundled with its own Sumo Logic Collector, Datadog APM Agent, Datadog Metrics Agent, Universal Analytics Client, and S3 Client]

  113. @tyler_treat
    “Re-roll the agents.”

  114.
    [Diagram: many System instances, each bundled with its own Sumo Logic Collector, New Relic APM Agent, Universal Analytics Client, and S3 Client]

    View Slide

  115. @tyler_treat
    “Oh, actually we want to evaluate
    Honeycomb for debugging.”

    View Slide

  116. [Diagram: System hosts each running a Sumo Logic Collector, New Relic APM Agent, Universal Analytics Client, and S3 Client]

    View Slide

  117. @tyler_treat
    “Re-roll the agents.”

    View Slide

  118. [Diagram: System hosts each running a Sumo Logic Collector, New Relic APM Agent, Universal Analytics Client, and S3 Client, with a Honeytail Agent now added to every host]

    View Slide

  119. @tyler_treat
    You get the idea.

    View Slide

  120. @tyler_treat
    How big of a lift is it for your
    organization to change tools?

    View Slide

  121. @tyler_treat
    How easy is it to experiment
    with new ones?

    View Slide

  122. @tyler_treat
    Data Sources
    • VMs
    • Containers
    • Load balancers
    • Service meshes
    • Audit logs
    • VPC flow logs
    • Firewall logs
    • …
    Data Sinks
    • Centralized logging
    • SIEM
    • Monitoring
    • APM
    • Alerting
    • Cold storage
    • BI
    • …
    What data to send?
    Where to send it?
    How to send it?

    View Slide

  123. @tyler_treat
    A decoupled approach

    View Slide

  124. @tyler_treat
    What data to send?
    Where to send it?
    How to send it?
    Data Sources
    • VMs
    • Containers
    • Load balancers
    • Service meshes
    • Audit logs
    • VPC flow logs
    • Firewall logs
    • …
    Data Sinks
    • Centralized logging
    • SIEM
    • Monitoring
    • APM
    • Alerting
    • Cold storage
    • BI
    • …
    Observability Pipeline

    View Slide

  125. @tyler_treat
    Anatomy of an Observability Pipeline

    View Slide

  126. @tyler_treat
    1. Data Specifications
    Structure your damn data.

    View Slide

  127. @tyler_treat
    log.error("User '{}' login failed".format(user))

    View Slide

  128. @tyler_treat
    ERROR 2019-04-05 13:26.42 User 'tylertreat' login failed

    View Slide

  129. @tyler_treat
    log.error("User login failed",
              event=LOGIN_ERROR,
              user="tylertreat",
              email="[email protected]",
              error=error)

    View Slide

  130. @tyler_treat
    {
      "timestamp": "2019-04-05 13:26.42",
      "level": "ERROR",
      "event": "user_login_error",
      "user": "tylertreat",
      "email": "[email protected]",
      "error": "Invalid username or password",
      "message": "User login failed"
    }

    View Slide

  131. @tyler_treat
    JSON is fine.

    View Slide
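
A minimal sketch of how a structured event like the one on slide 130 could be emitted as JSON using only Python's standard `logging` module. The `JsonFormatter` and `build_logger` names, and the convention of passing structured fields under `extra={"fields": ...}`, are assumptions made for illustration and are not taken from the talk.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%d %H:%M:%S"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Structured fields arrive via extra={"fields": {...}} (hypothetical convention).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(name, stream):
    """Return a logger that writes one JSON object per line to `stream`."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger
```

Calling `build_logger("auth", sys.stdout).error("User login failed", extra={"fields": {"event": "user_login_error", "user": "tylertreat"}})` would write one JSON line carrying the message, level, and the extra fields.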

  132. @tyler_treat
    Pass a context object to
    everything.

    View Slide

  133. @tyler_treat
    def login(ctx, username, email, password):
        ctx.set(user=username, email=email)
        ...
        log.error("User login failed",
                  event=LOGIN_ERROR,
                  context=ctx,
                  error=error)
        ...

    View Slide

  134. @tyler_treat
    {
      "timestamp": "2019-04-05 13:26.42",
      "level": "ERROR",
      "event": "user_login_error",
      "context": {
        "id": "accfbb8315c44a52ad893ca6772e1caf",
        "http_method": "POST",
        "http_path": "/login",
        "user": "tylertreat",
        "email": "[email protected]"
      },
      "error": "Invalid username or password",
      "message": "User login failed"
    }

    View Slide
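
The context-passing idea can be sketched as follows. This is an illustrative guess at what a context object might look like, not the speaker's implementation: the `Context` class, the `log_event` helper, and the always-failing `login` stub are all hypothetical, and the timestamp is left out to keep the sketch short.

```python
import json
import uuid


class Context:
    """Request-scoped bag of fields attached to every event."""

    def __init__(self, **fields):
        # A unique id ties together all events emitted for one request.
        self.fields = {"id": uuid.uuid4().hex, **fields}

    def set(self, **fields):
        self.fields.update(fields)


def log_event(level, message, event, ctx, **extra):
    """Build one structured event as a JSON string."""
    record = {"level": level, "event": event, "context": dict(ctx.fields)}
    record.update(extra)
    record["message"] = message
    return json.dumps(record)


def login(ctx, username, email, password):
    ctx.set(user=username, email=email)
    # Authentication is stubbed out: this sketch always takes the failure path.
    # Note that the password never reaches the context or the event.
    return log_event("ERROR", "User login failed", "user_login_error", ctx,
                     error="Invalid username or password")
```

Because the HTTP layer creates the context (for example `Context(http_method="POST", http_path="/login")`) and each function only adds what it knows, the request fields show up on every event without each call site repeating them.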

  136. @tyler_treat
    What goes on the context?

    View Slide

  137. @tyler_treat
    What can you get for “free” and
    what do you need to pass along?

    View Slide

  138. @tyler_treat
    Create standard specs for each data
    type collected (logs, metrics, traces).

    View Slide

  139. @tyler_treat
    Specs can enforce required fields (e.g.
    user id, license, trace id) and data types.

    View Slide

  140. @tyler_treat
    {
      "timestamp": "2019-04-05 13:26.42",
      "level": "INFO",
      "event": "user_login",
      "context": {
        "id": "accfbb8315c44a52ad893ca6772e1caf",
        "http_method": "POST",
        "http_path": "/login",
        "user": "tylertreat",
        "user_id": "3bb12f6c63274abe87fd1ee4ee37f3d2",
        "license": "942e6543f0844be680e72003d5e060fd",
        "email": "[email protected]"
      }
    }

    View Slide
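
One way a spec could enforce required fields and data types is a small validation function. The sketch below is an assumption-laden illustration: the `LOG_SPEC` and `CONTEXT_SPEC` tables and the `validate` helpers are hypothetical, with field names borrowed from the example event above.

```python
# Hypothetical spec: field name -> required Python type.
LOG_SPEC = {
    "timestamp": str,
    "level": str,
    "event": str,
    "context": dict,
}
CONTEXT_SPEC = {
    "id": str,
    "user_id": str,
    "license": str,
}


def validate(record, spec):
    """Return a list of human-readable spec violations (empty if valid)."""
    problems = []
    for field, expected in spec.items():
        if field not in record:
            problems.append(f"missing required field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    return problems


def validate_log(record):
    """Check a log event and, if present, its nested context."""
    problems = validate(record, LOG_SPEC)
    if isinstance(record.get("context"), dict):
        problems += validate(record["context"], CONTEXT_SPEC)
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report every violation at once, for example in a test suite or a pipeline dead-letter log.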

  141. @tyler_treat
    Be mindful not to log sensitive
    data like passwords.

    View Slide
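
A simple guard against logging sensitive data is to mask known-sensitive keys before an event leaves the process. This is only a sketch: the `SENSITIVE_KEYS` deny-list and the `redact` function are hypothetical, and a deny-list will not catch secrets embedded in free-text messages.

```python
SENSITIVE_KEYS = {"password", "token", "secret"}  # hypothetical deny-list


def redact(value):
    """Return a copy of `value` with sensitive keys masked, recursing into nested data.

    Assumes dictionary keys are strings, as they are in JSON-style events.
    """
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value
```

The function builds new containers instead of mutating its input, so the caller's original data is left untouched.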

  142. @tyler_treat
    2. Specification Libraries
    Specs alone aren’t enough!

    View Slide

  143. @tyler_treat
    Empowering developers requires
    providing tools that align the “easy” path
    with the “right” path.

    View Slide

  144. @tyler_treat
    We need libraries that implement the
    specs and make it easy for devs to
    instrument their systems.

    View Slide
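
As a rough illustration of a library that makes the "easy" path the "right" path, the sketch below wraps event emission so that timestamps are filled in automatically and spec violations fail fast. The `Event` enum, the `EventLogger` class, and its default required context keys are all hypothetical.

```python
import json
from datetime import datetime, timezone
from enum import Enum


class Event(Enum):
    """Hypothetical catalogue of event names shared across services."""
    LOGIN = "user_login"
    LOGIN_ERROR = "user_login_error"


class EventLogger:
    """Emit spec-conforming events through a caller-supplied write function."""

    def __init__(self, write, required_context=("id", "user_id")):
        self._write = write
        self._required = required_context

    def emit(self, level, event, message, context):
        if not isinstance(event, Event):
            raise TypeError("event must be an Event member")
        missing = [key for key in self._required if key not in context]
        if missing:
            raise ValueError(f"context missing: {', '.join(missing)}")
        self._write(json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": level,
            "event": event.value,
            "context": dict(context),
            "message": message,
        }))
```

Developers call `emit` with an `Event` member and a context; the library handles formatting and rejects events that would break the spec before they reach the pipeline.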

  145. @tyler_treat
    There are many existing libraries for structured logging.
    • Java: log4j
    • Go: logrus
    • Python: structlog
    • Ruby: ruby-cabin
    • .NET: serilog
    • JS: structured-log
    • etc.

    View Slide

  146. @tyler_treat
    For tracing and
    metrics, there are
    vendor-neutral APIs
    like OpenTracing
    and OpenCensus.

    View Slide

  147. @tyler_treat
    3. Data Collector
    We need a lightweight agent that can
    collect data from hosts/containers.

    View Slide

  148. @tyler_treat
    Collect data, perform transformations/
    filters, and write it to the data pipeline.

    View Slide
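
The collect, transform/filter, and write steps can be sketched in a few lines. In practice an off-the-shelf collector would do this job; the `collect` function below is a hypothetical stand-in that only shows the shape of the loop, with `emit` standing in for whatever writes to the data pipeline.

```python
import json
import socket


def collect(lines, emit, host=None, drop_levels=("DEBUG",)):
    """Parse JSON log lines, filter and enrich them, and pass each to emit().

    Returns the number of events forwarded.
    """
    host = host or socket.gethostname()
    forwarded = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            event = None
        if not isinstance(event, dict):
            # Wrap unstructured output rather than dropping it.
            event = {"level": "UNKNOWN", "message": line}
        if event.get("level") in drop_levels:
            continue  # filter: skip noisy levels
        event["host"] = host  # transformation: enrich with collector metadata
        emit(event)
        forwarded += 1
    return forwarded
```

Passing `sys.stdin` as `lines` would let the function consume whatever a process writes to standard output through a shell pipe.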

  149. @tyler_treat
    Typically runs as an agent on the
    host (DaemonSet in Kubernetes).

    View Slide

  150. @tyler_treat
    Data is written to stdout/stderr
    or a Unix domain socket.

    View Slide

  151. @tyler_treat
    Just use
    Fluentd or
    Logstash
    (+Beats).

    View Slide

  152. @tyler_treat
    4. Data Pipeline
    We need a scalable, fault-tolerant data
    stream to handle the firehose of
    observability data generated.

    View Slide

  153. @tyler_treat
    This also provides a buffer that
    decouples producers from consumers.

    View Slide
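
The buffering idea can be illustrated with a bounded in-process queue. This is a toy stand-in, not a real data pipeline: it offers none of the durability or fault tolerance a production stream provides, and the `run_pipeline` name is hypothetical. It only shows how a buffer lets the producer and consumer proceed at their own pace.

```python
import queue
import threading


def run_pipeline(events, handle, maxsize=100):
    """Push events through a bounded queue to a consumer thread."""
    buffer = queue.Queue(maxsize=maxsize)
    done = object()  # sentinel marking the end of the stream

    def consume():
        while True:
            item = buffer.get()
            if item is done:
                break
            handle(item)

    consumer = threading.Thread(target=consume)
    consumer.start()
    for event in events:
        buffer.put(event)  # blocks only when the buffer is full
    buffer.put(done)
    consumer.join()
```

The producer never calls `handle` directly, so a slow or temporarily unavailable consumer affects it only once the buffer fills up.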

  154. @tyler_treat
    System
    Splunk Universal Forwarder
    Datadog APM Agent
    Universal Analytics Client
    Amazon Glacier
    S3 Client
    Datadog Metrics Agent

    View Slide

  156. @tyler_treat
    Lots of options…

    View Slide

  157. @tyler_treat

    View Slide

  158. @tyler_treat
    5. Data Router
    We need a component to consume data
    from the pipeline, perform filtering, and
    write it to the appropriate backends.

    View Slide

  159. @tyler_treat
    May perform transformations and processing of data,
    but heavy processing should be the responsibility of a
    backend system (e.g. alerting or aggregations).

    View Slide

  160. @tyler_treat
    This is where the data spec
    comes into play.

    View Slide

  161. @tyler_treat
    The data type determines how
    incoming data is routed.

    View Slide
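
Routing by data type can be sketched as a lookup table from type to a list of sinks. The `make_router` function and the `"type"` field on each event are assumptions for illustration; real sinks would wrap vendor SDKs or HTTP APIs rather than plain callables.

```python
def make_router(routes):
    """Build a routing function from {data_type: [sink, ...]}.

    Each sink is any callable that accepts one event.
    """
    def route(event):
        sinks = routes.get(event.get("type"), [])
        for sink in sinks:
            sink(event)
        # Report how many sinks received the event (0 means unrouted).
        return len(sinks)
    return route
```

With this shape, evaluating a new backend side by side with an existing one amounts to adding another sink to the list for a data type, with no change to the systems producing the data.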

  162. @tyler_treat
    Data Pipeline
    Amazon Glacier
    Data Router
    logs
    traces
    metrics

    View Slide

  165. @tyler_treat
    This is primarily a stateless
    component writing to APIs.

    View Slide

  166. @tyler_treat
    Good fit for
    “serverless”
    solutions.

    View Slide

  167. @tyler_treat
    Piecing It All Together

    View Slide

  168. @tyler_treat

    View Slide

  169. @tyler_treat
    You don’t need to build it out all
    in one go.

    View Slide

  170. @tyler_treat
    There are quick wins along the
    way!

    View Slide

  171. @tyler_treat
    Evolving to an Observability Pipeline
    • Adopt structured logging
    • Move log/data collection out of process
    • Use a centralized logging system
    • Introduce a streaming data solution
    • Start adding data consumers

    View Slide

  172. @tyler_treat
    Moving from host-centric to
    service-centric observability.

    View Slide

  173. @tyler_treat
    This maps to VMs and containers as
    well as it does to “serverless” models.

    View Slide

  174. @tyler_treat
    Ops
    Systems
    Production
    Product Development
    Product Management
    Security & Compliance
    Support/Helpdesk

    View Slide

  175. @tyler_treat
    Dev/Ops/SRE
    Systems
    Production
    Audit
    Business Analytics
    Pricing Decisions
    Data-Driven Product Decisions
    Threat Detection
    Monitoring
    Debugging & Operational Insights
    ...

    View Slide

  176. @tyler_treat
    Dev/Ops/SRE
    Systems
    Production

    View Slide

  182. @tyler_treat
    Benefits
    • Can be adopted incrementally, with quick wins along the way
    • Maps better to elastic and serverless architectures
    • Empowers teams in siloed organizations and unlocks data for other parts of the business
    • Enables teams to use the tools best suited to their needs
    • Decoupling makes it easier to change tools or evaluate them side by side
    • Minimizes impact on developers and the core system

    View Slide

  183. @tyler_treat
    But it’s not a silver bullet.

    View Slide

  184. @tyler_treat
    Downsides
    • Moving away from an agent-based model means we have to handle data routing ourselves
    • Many of the Data Router components might need to be custom-made using various vendor SDKs or client libraries (assuming they have APIs)
    • This also means we might lose some of the value-add features of certain agents
    • Unclear how well this maps to pull-based models (e.g. Prometheus)

    View Slide

  185. @tyler_treat
    CI/CD Pipeline + Observability Pipeline

    View Slide

  186. @tyler_treat
    CI/CD: Pre-Production (theorizing about known unknowns)
    Observability: Post-Production (learning from unknown unknowns)

    View Slide

  187. @tyler_treat
    Thank You
    realkinetic.com

    bravenewgeek.com

    View Slide