
Cloud-Native Observability


What is observability and how is it different from traditional monitoring? How do we effectively monitor and debug complex, elastic microservice architectures? In this interactive discussion, we’ll answer these questions. We’ll also introduce the idea of an “observability pipeline” as a way to empower teams following DevOps practices. Lastly, we’ll demo cloud-native observability tools that fit this “observability pipeline” model, including Fluentd, OpenTracing, and Jaeger.

Tyler Treat

June 06, 2019

Transcript

  1. @tyler_treat
    Cloud-Native Observability
    Tyler Treat / Cloud Native - Madison / June 6, 2019


  2. @tyler_treat


  3. @tyler_treat
    Monitoring


  4. @tyler_treat
    APM
    Debugger
    Profiler
    SSH
    grep


  5. @tyler_treat
    APM
    Debugger
    Profiler
    SSH
    grep


  6. @tyler_treat
    APM
    Debugger
    Profiler
    SSH System Behavior
    grep


  7. @tyler_treat
    APM
    Debugger
    Profiler
    SSH System Behavior
    Actual Customer Impact
    grep


  8. @tyler_treat
    Monitoring


  9. @tyler_treat
    APM
    Debugger
    Profiler
    SSH
    grep


  10. @tyler_treat
    APM
    Debugger
    Profiler
    SSH
    Testing in Production at Scale, Amit Gud
    grep


  11. @tyler_treat
    APM
    Debugger
    Profiler
    SSH System Behavior
    Actual Customer Impact
    ???
    grep


  12. @tyler_treat
    “Observability”


  13. @tyler_treat
    Post Hoc vs. Ad Hoc


  14. @tyler_treat
    Data Available
    Understanding


  15. @tyler_treat
    Data Available
    Understanding
    Known Knowns
    • Things we are aware of and understand
    • “The system has a 1GB memory limit”


  16. @tyler_treat
    Data Available
    Understanding
    Known Knowns
    • Things we are aware of and understand
    • “The system has a 1GB memory limit”
    Known Unknowns
    • Things we are aware of but don’t
    understand
    • “The system exceeded its memory limit
    and crashed, causing an outage”


  17. @tyler_treat
    Data Available
    Understanding
    Unknown Knowns
    • Things we understand but are not
    aware of
    • “We implemented an orchestrator to
    ensure the system is always running”
    Known Knowns
    • Things we are aware of and understand
    • “The system has a 1GB memory limit”
    Known Unknowns
    • Things we are aware of but don’t
    understand
    • “The system exceeded its memory limit
    and crashed, causing an outage”


  18. @tyler_treat
    Data Available
    Understanding
    Unknown Knowns
    • Things we understand but are not
    aware of
    • “We implemented an orchestrator to
    ensure the system is always running”
    Known Knowns
    • Things we are aware of and understand
    • “The system has a 1GB memory limit”
    Unknown Unknowns
    • Things we are neither aware of nor
    understand
    • “Instances churn because the
    orchestrator restarts the process when
    it approaches its memory limit, causing
    sporadic failures and slowdowns”
    Known Unknowns
    • Things we are aware of but don’t
    understand
    • “The system exceeded its memory limit
    and crashed, causing an outage”


  19. @tyler_treat
    Data Available
    Understanding
    Unknown Knowns
    • Things we understand but are not
    aware of
    • “We implemented an orchestrator to
    ensure the system is always running”
    Known Knowns
    • Things we are aware of and understand
    • “The system has a 1GB memory limit”
    Unknown Unknowns
    • Things we are neither aware of nor
    understand
    • “Instances churn because the
    orchestrator restarts the process when
    it approaches its memory limit, causing
    sporadic failures and slowdowns”
    Known Unknowns
    • Things we are aware of but don’t
    understand
    • “The system exceeded its memory limit
    and crashed, causing an outage”
    FACTS


  20. @tyler_treat
    Data Available
    Understanding
    Unknown Knowns
    • Things we understand but are not
    aware of
    • “We implemented an orchestrator to
    ensure the system is always running”
    Known Knowns
    • Things we are aware of and understand
    • “The system has a 1GB memory limit”
    Unknown Unknowns
    • Things we are neither aware of nor
    understand
    • “Instances churn because the
    orchestrator restarts the process when
    it approaches its memory limit, causing
    sporadic failures and slowdowns”
    Known Unknowns
    • Things we are aware of but don’t
    understand
    • “The system exceeded its memory limit
    and crashed, causing an outage”
    FACTS
    HYPOTHESES


  21. @tyler_treat
    Data Available
    Understanding
    Unknown Knowns
    • Things we understand but are not
    aware of
    • “We implemented an orchestrator to
    ensure the system is always running”
    Known Knowns
    • Things we are aware of and understand
    • “The system has a 1GB memory limit”
    Unknown Unknowns
    • Things we are neither aware of nor
    understand
    • “Instances churn because the
    orchestrator restarts the process when
    it approaches its memory limit, causing
    sporadic failures and slowdowns”
    Known Unknowns
    • Things we are aware of but don’t
    understand
    • “The system exceeded its memory limit
    and crashed, causing an outage”
    ASSUMPTIONS FACTS
    HYPOTHESES


  22. @tyler_treat
    Unknown Unknowns
    • Things we are neither aware of nor
    understand
    • “Instances churn because the
    orchestrator restarts the process when
    it approaches its memory limit, causing
    sporadic failures and slowdowns”
    DISCOVERIES
    Data Available
    Understanding
    Unknown Knowns
    • Things we understand but are not
    aware of
    • “We implemented an orchestrator to
    ensure the system is always running”
    Known Knowns
    • Things we are aware of and understand
    • “The system has a 1GB memory limit”
    Known Unknowns
    • Things we are aware of but don’t
    understand
    • “The system exceeded its memory limit
    and crashed, causing an outage”
    ASSUMPTIONS FACTS
    HYPOTHESES


  23. @tyler_treat
    Unknown Unknowns
    • Things we are neither aware of nor
    understand
    • “Instances churn because the
    orchestrator restarts the process when
    it approaches its memory limit, causing
    sporadic failures and slowdowns”
    DISCOVERIES
    Data Available
    Understanding
    Known Unknowns
    • Things we are aware of but don’t
    understand
    • “The system exceeded its memory limit
    and crashed, causing an outage”
    HYPOTHESES
    Monitoring
    Observability


  24. @tyler_treat
    Unknown Unknowns
    • Things we are neither aware of nor
    understand
    • “Instances churn because the
    orchestrator restarts the process when
    it approaches its memory limit, causing
    sporadic failures and slowdowns”
    DISCOVERIES
    Data Available
    Understanding
    Known Unknowns
    • Things we are aware of but don’t
    understand
    • “The system exceeded its memory limit
    and crashed, causing an outage”
    HYPOTHESES
    Testing
    Exploring


  25. @tyler_treat

    Observability Data
    application logs
    system logs
    audit logs
    application metrics
    distributed traces
    events


  26. @tyler_treat
    Observability Data
    application logs
    system logs
    audit logs
    application metrics
    distributed traces
    events
    Some challenges…
    - Locked up inside a single vendor’s solution
    - Not readily available across the enterprise (or in some cases, too readily available)
    - Many tools and products needed for different data and use cases
    - Tool and data needs vary from team to team
    - Ever-changing landscape of tools, products, and services
    - Sheer volume of data can be overwhelming


  27. @tyler_treat
    System
    Splunk Universal Forwarder
    Datadog APM Agent
    Universal Analytics Client
    Amazon Glacier
    S3 Client
    Datadog Metrics Agent


  28. [Diagram: many System instances, each running its own Splunk Universal Forwarder, Datadog APM Agent, Datadog Metrics Agent, Universal Analytics Client, and S3 Client]


  29. @tyler_treat
    How big of a lift is it for your
    organization to change tools?


  30. @tyler_treat
    How easy is it to experiment
    with new ones?


  31. @tyler_treat
    Data Sources
    • VMs
    • Containers
    • Load balancers
    • Service meshes
    • Audit logs
    • VPC flow logs
    • Firewall logs
    • …
    Data Sinks
    • Centralized logging
    • SIEM
    • Monitoring
    • APM
    • Alerting
    • Cold storage
    • BI
    • …
    What data to send?
    Where to send it?
    How to send it?


  32. @tyler_treat
    A decoupled approach


  33. @tyler_treat
    What data to send?
    Where to send it?
    How to send it?
    Data Sources
    • VMs
    • Containers
    • Load balancers
    • Service meshes
    • Audit logs
    • VPC flow logs
    • Firewall logs
    • …
    Data Sinks
    • Centralized logging
    • SIEM
    • Monitoring
    • APM
    • Alerting
    • Cold storage
    • BI
    • …
    Observability Pipeline


  34. @tyler_treat
    The Observability Pipeline


  35. @tyler_treat
    1. Data Specifications
    Structure your damn data.


  36. @tyler_treat
    log.error("User '{}' login failed".format(user))


  37. @tyler_treat
    ERROR 2019-04-05 13:26.42 User 'tylertreat' login failed


  38. @tyler_treat
    log.error("User login failed",
              event=LOGIN_ERROR,
              user="tylertreat",
              email="[email protected]",
              error=error)


  39. @tyler_treat
    {
      "timestamp": "2019-04-05 13:26.42",
      "level": "ERROR",
      "event": "user_login_error",
      "user": "tylertreat",
      "email": "[email protected]",
      "error": "Invalid username or password",
      "message": "User login failed"
    }
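
A minimal sketch of how output like the JSON above could be produced with only the Python standard library. The slides do not say which logger was used; the `JsonFormatter` class, the `make_logger` helper, and the `extra={"fields": ...}` convention are assumptions made for this illustration.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Structured fields arrive via extra={"fields": {...}}. The "fields"
        # key is a convention of this sketch, not part of the stdlib.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def make_logger(stream, name="auth"):
    """Build a logger that writes one JSON object per line to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


stream = io.StringIO()
log = make_logger(stream)
log.error(
    "User login failed",
    extra={"fields": {
        "event": "user_login_error",
        "user": "tylertreat",
        "error": "Invalid username or password",
    }},
)
entry = json.loads(stream.getvalue())
```

Because every field is a key rather than a substring of a message, a downstream consumer can filter on `event` or `user` without parsing free text.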


  40. @tyler_treat
    JSON is fine.


  41. @tyler_treat
    Pass a context object to
    everything.


  42. @tyler_treat
    def login(ctx, username, email, password):
        ctx.set(user=username, email=email)
        ...
        log.error("User login failed",
                  event=LOGIN_ERROR,
                  context=ctx,
                  error=error)
        ...


  43. @tyler_treat
    {
      "timestamp": "2019-04-05 13:26.42",
      "level": "ERROR",
      "event": "user_login_error",
      "context": {
        "id": "accfbb8315c44a52ad893ca6772e1caf",
        "http_method": "POST",
        "http_path": "/login",
        "user": "tylertreat",
        "email": "[email protected]"
      },
      "error": "Invalid username or password",
      "message": "User login failed"
    }


  44. @tyler_treat
    {
      "timestamp": "2019-04-05 13:26.42",
      "level": "ERROR",
      "event": "user_login_error",
      "context": {
        "id": "accfbb8315c44a52ad893ca6772e1caf",
        "http_method": "POST",
        "http_path": "/login",
        "user": "tylertreat",
        "email": "[email protected]"
      },
      "error": "Invalid username or password",
      "message": "User login failed"
    }
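
One possible shape for the context object, sketched under assumptions: the `Context` class, the `log_event` helper, the placeholder password check, and the `user@example.com` address are all hypothetical and only illustrate how request-scoped fields can travel with every log call.

```python
import json


class Context:
    """Request-scoped bag of fields that is passed down the call chain."""

    def __init__(self, **fields):
        self._fields = dict(fields)

    def set(self, **fields):
        self._fields.update(fields)

    def as_dict(self):
        return dict(self._fields)


def log_event(level, message, event, context, **fields):
    """Serialize one structured entry, nesting the context under its own key."""
    entry = {"level": level, "event": event, "context": context.as_dict()}
    entry.update(fields)
    entry["message"] = message
    return json.dumps(entry)


def login(ctx, username, email, password):
    ctx.set(user=username, email=email)
    if password != "expected-password":  # placeholder check for the sketch
        return log_event("ERROR", "User login failed", "user_login_error",
                         ctx, error="Invalid username or password")
    return log_event("INFO", "User logged in", "user_login", ctx)


# The HTTP layer would create the context once per request.
ctx = Context(id="accfbb8315c44a52ad893ca6772e1caf",
              http_method="POST", http_path="/login")
line = login(ctx, "tylertreat", "user@example.com", "wrong")
parsed = json.loads(line)
```

The function that logs the failure never mentions the request id or HTTP path, yet both appear in the entry because they were attached to the context upstream.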


  45. @tyler_treat
    Create standard specs for each data
    type collected (logs, metrics, traces).


  46. @tyler_treat
    Specs can enforce required fields (e.g.
    user id, license, trace id) and data types.


  47. @tyler_treat
    {
      "timestamp": "2019-04-05 13:26.42",
      "level": "INFO",
      "event": "user_login",
      "context": {
        "id": "accfbb8315c44a52ad893ca6772e1caf",
        "http_method": "POST",
        "http_path": "/login",
        "user": "tylertreat",
        "user_id": "3bb12f6c63274abe87fd1ee4ee37f3d2",
        "license": "942e6543f0844be680e72003d5e060fd",
        "email": "[email protected]"
      }
    }
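
A rough sketch of what enforcing required fields and data types might look like. The deck does not define a spec format, so `LOG_SPEC`, `CONTEXT_SPEC`, and the `check` and `validate_log` helpers are hypothetical; a real spec would likely live in a shared schema rather than in code.

```python
# Required fields and their expected Python types (assumed for illustration).
LOG_SPEC = {"timestamp": str, "level": str, "event": str, "context": dict}
CONTEXT_SPEC = {"id": str, "user_id": str, "license": str}


def check(entry, spec, prefix=""):
    """Return a list of problems found when comparing `entry` to `spec`."""
    problems = []
    for field, expected_type in spec.items():
        if field not in entry:
            problems.append(f"{prefix}{field}: missing")
        elif not isinstance(entry[field], expected_type):
            problems.append(f"{prefix}{field}: expected {expected_type.__name__}")
    return problems


def validate_log(entry):
    """Validate the top-level entry and, if present, its nested context."""
    problems = check(entry, LOG_SPEC)
    context = entry.get("context")
    if isinstance(context, dict):
        problems += check(context, CONTEXT_SPEC, prefix="context.")
    return problems


good = {
    "timestamp": "2019-04-05 13:26.42",
    "level": "INFO",
    "event": "user_login",
    "context": {
        "id": "accfbb8315c44a52ad893ca6772e1caf",
        "user_id": "3bb12f6c63274abe87fd1ee4ee37f3d2",
        "license": "942e6543f0844be680e72003d5e060fd",
    },
}
bad = {
    "timestamp": "2019-04-05 13:26.42",
    "level": "INFO",
    "event": "user_login",
    "context": {"id": "abc", "user_id": 42},
}
good_problems = validate_log(good)
bad_problems = validate_log(bad)
```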


  48. @tyler_treat
    2. Specification Libraries
    Specs alone aren’t enough!


  49. @tyler_treat
    We need libraries.


  50. @tyler_treat
    There are many existing libraries for structured logging.
    • Java: log4j
    • Go: logrus
    • Python: structlog
    • Ruby: ruby-cabin
    • .NET: serilog
    • JS: structured-log
    • etc.


  51. @tyler_treat
    For tracing and metrics, there are vendor-neutral APIs
    like OpenTracing and OpenCensus.


  52. @tyler_treat
    3. Data Collector
    We need a lightweight agent that can
    collect data from hosts/containers.


  53. @tyler_treat
    Collect data, perform transformations/
    filters, and write it to the data pipeline.


  54. @tyler_treat
    Typically runs as an agent on the
    host (DaemonSet in Kubernetes).


  55. @tyler_treat
    Data is written to stdout/stderr
    or a Unix domain socket.
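
To make the collector's job concrete, here is a toy sketch of the collect, filter, transform, and publish loop. It is not how Fluentd or Logstash work internally; the `collect` function, its `publish` callback, the `host` tag, and the level table are assumptions chosen for illustration.

```python
import json

# Numeric severities used only by this sketch.
LEVELS = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40}


def collect(lines, publish, host="host-1", min_level="INFO"):
    """Read raw lines (e.g. from stdout), filter and enrich them, then publish.

    Returns the number of entries forwarded to the pipeline.
    """
    forwarded = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            entry = None
        if not isinstance(entry, dict):
            # Wrap unstructured output so downstream consumers see one shape.
            entry = {"level": "INFO", "message": line, "unstructured": True}
        # Filter: drop anything below the configured severity.
        if LEVELS.get(entry.get("level"), LEVELS["INFO"]) < LEVELS[min_level]:
            continue
        # Transform: tag the entry with where it came from.
        entry["host"] = host
        publish(entry)
        forwarded += 1
    return forwarded


raw_lines = [
    '{"level": "DEBUG", "message": "noise"}',
    '{"level": "ERROR", "message": "boom"}',
    "plain text line",
    "",
]
published = []
count = collect(raw_lines, published.append, host="node-a")
```

In practice `publish` would write to the data pipeline; here it appends to a list so the behavior is easy to inspect.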


  56. @tyler_treat
    Just use Fluentd or Logstash (+Beats).


  57. @tyler_treat
    4. Data Pipeline
    We need a scalable, fault-tolerant data
    stream to handle the firehose of
    observability data generated.


  58. @tyler_treat
    This also provides a buffer that
    decouples producers from consumers.
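
The decoupling idea can be shown in miniature with an in-process bounded queue. This is only an analogy: a real pipeline would be a durable, distributed stream, and the `run_pipeline` function and its sentinel-based shutdown are assumptions of this sketch.

```python
import queue
import threading


def run_pipeline(events, maxsize=100):
    """Pass events from a producer thread to a consumer thread via a buffer."""
    buffer = queue.Queue(maxsize=maxsize)
    received = []
    done = object()  # sentinel telling the consumer to stop

    def producer():
        for event in events:
            # Blocks when the buffer is full, applying backpressure instead
            # of requiring the producer to know anything about the consumer.
            buffer.put(event)
        buffer.put(done)

    def consumer():
        while True:
            item = buffer.get()
            if item is done:
                break
            received.append(item)

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return received


result = run_pipeline(list(range(10)), maxsize=2)
```

Neither side calls the other directly, so a slow or temporarily unavailable consumer delays delivery rather than breaking the producer.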


  59. @tyler_treat
    Lots of options…


  60. @tyler_treat


  61. @tyler_treat
    5. Data Router
    We need a component to consume data
    from the pipeline, perform filtering, and
    write it to the appropriate backends.


  62. @tyler_treat
    This is where the data spec
    comes into play.


  63. @tyler_treat
    The data shape determines how
    incoming data is handled.
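
A small sketch of shape-based routing. The field names used to tell logs, traces, and metrics apart (`trace_id`, `span_id`, `metric`, `value`) are assumptions; in a real deployment they would come from the data specs, and the sinks would be clients for actual backends.

```python
def classify(record):
    """Infer the data type from which fields are present (hypothetical rules)."""
    if "trace_id" in record and "span_id" in record:
        return "traces"
    if "metric" in record and "value" in record:
        return "metrics"
    if "message" in record or "event" in record:
        return "logs"
    return "unknown"


class Router:
    """Fan each record out to every sink registered for its data type."""

    def __init__(self):
        self._sinks = {}

    def register(self, kind, sink):
        self._sinks.setdefault(kind, []).append(sink)

    def route(self, record):
        kind = classify(record)
        for sink in self._sinks.get(kind, []):
            sink(record)
        return kind


logs, archive, traces = [], [], []
router = Router()
router.register("logs", logs.append)      # e.g. centralized logging
router.register("logs", archive.append)   # e.g. cold storage
router.register("traces", traces.append)  # e.g. a tracing backend

kinds = [
    router.route({"event": "user_login", "message": "ok", "trace_id": "t1"}),
    router.route({"trace_id": "t1", "span_id": "s1", "operation": "login"}),
    router.route({"metric": "login_latency_ms", "value": 42}),
]
```

Adding or swapping a backend becomes a change to the router's registrations rather than to every system that emits data.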


  64. @tyler_treat
    Data Pipeline
    Amazon Glacier
    Data Router
    logs
    traces
    metrics


  65. @tyler_treat
    Data Pipeline
    Amazon Glacier
    Data Router
    logs
    traces
    metrics


  66. @tyler_treat
    Data Pipeline
    Amazon Glacier
    Data Router
    logs
    traces
    metrics


  67. @tyler_treat
    This is primarily a stateless
    component writing to APIs.


  68. @tyler_treat
    Good fit for “serverless” solutions.
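
As an illustration of why a stateless router suits a function-as-a-service platform, the sketch below is written in the style of an AWS Lambda handler fed by a Kinesis stream, where each record's payload arrives base64-encoded under `Records[].kinesis.data`. The destination names and the `pick_destination` rules are hypothetical, and the backend call is left as a comment.

```python
import base64
import json


def pick_destination(entry):
    """Choose a backend name from the record's shape (hypothetical rules)."""
    if "trace_id" in entry and "span_id" in entry:
        return "tracing-backend"
    if "metric" in entry:
        return "metrics-backend"
    return "log-backend"


def handler(event, context):
    """Decode each stream record and count what would go to each backend."""
    counts = {}
    for record in event.get("Records", []):
        payload = base64.b64decode(record["kinesis"]["data"])
        entry = json.loads(payload)
        destination = pick_destination(entry)
        # A real function would call the backend's API here; no state is
        # kept between invocations, so instances can scale independently.
        counts[destination] = counts.get(destination, 0) + 1
    return counts


def encode(entry):
    """Helper for building a test event in the expected encoding."""
    return base64.b64encode(json.dumps(entry).encode("utf-8")).decode("ascii")


sample_event = {"Records": [
    {"kinesis": {"data": encode({"event": "user_login", "message": "ok"})}},
    {"kinesis": {"data": encode({"trace_id": "t1", "span_id": "s1"})}},
    {"kinesis": {"data": encode({"metric": "latency_ms", "value": 12})}},
]}
summary = handler(sample_event, None)
```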


  69. @tyler_treat
    Piecing It All Together


  70. @tyler_treat


  71. @tyler_treat
    You don’t need to build it out all
    in one go.


  72. @tyler_treat
    There are quick wins along the
    way!


  73. @tyler_treat
    Evolving to an Observability Pipeline
    • Adopt structured logging
    • Move log/data collection out of process
    • Use a centralized logging system
    • Introduce a streaming data solution
    • Start adding data consumers


  74. @tyler_treat
    Dev/Ops/SRE
    Systems
    Production


  75. @tyler_treat
    Dev/Ops/SRE
    Systems
    Production


  76. @tyler_treat
    Dev/Ops/SRE
    Systems
    Production


  77. @tyler_treat
    Dev/Ops/SRE
    Systems
    Production


  78. @tyler_treat
    Dev/Ops/SRE
    Systems
    Production


  79. @tyler_treat
    Dev/Ops/SRE
    Systems
    Production


  80. @tyler_treat
    CI/CD
    Pre-Production (theorizing about known unknowns)
    Post-Production (learning from unknown unknowns)
    Observability


  81. @tyler_treat
    Part 2: Demo


  82. @tyler_treat
    Trip Service
    Flight Service
    Hotel Service
    Car Rental Service
    DynamoDB
    DynamoDB
    DynamoDB
    DynamoDB
    Book Trip


  83. @tyler_treat
    Structured logging + context


  84. @tyler_treat
    Kubernetes


  85. @tyler_treat
    And now here’s some YAML…


  86. @tyler_treat


  87. @tyler_treat


  88. @tyler_treat
    Kubernetes


  89. @tyler_treat
    +


  90. @tyler_treat
    Kubernetes
    Kinesis


  91. @tyler_treat
    AWS Lambda


  92. @tyler_treat
    Kubernetes
    Kinesis Lambda


  93. @tyler_treat
    Kubernetes
    Kinesis Lambda
    CloudWatch
    Jaeger
    Stackdriver


  94. @tyler_treat
    Code: https://github.com/RealKinetic/cloud-native-meetup-2019


  95. @tyler_treat
    Thank You
    realkinetic.com
    bravenewgeek.com
