
Monitoring and Observability @ Twitter


Talk about monitoring, August 2014.

Alex Yarmula

August 14, 2014


Transcript

  1. Measuring the
    infrastructure resiliency
    @twalex
    Alex Yarmula


  2. Monitoring and infrastructure
    • Monitoring is Tier 0: can’t be less available than the
    systems under monitoring
    • Therefore, can’t rely on systems you’re monitoring
    to be part of your infrastructure
    • Understand which things have to be in your control


  3. Getting the metrics out
    • Instrument the code with metrics directly
    • higher chance of capturing the important knowledge than after-the-fact metrics
    • when the logic changes, it’s reflected in metrics right away
    • Don’t make people explicitly add/register metrics
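One way to read "don't make people explicitly add/register metrics" is a registry where a metric comes into existence the first time the code touches it. The sketch below is illustrative only: `MetricsRegistry`, `incr`, `snapshot` and the metric names are hypothetical, not the library described in the talk.

```python
import threading
from collections import defaultdict


class MetricsRegistry:
    """Hypothetical in-process registry: counters appear on first use."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = defaultdict(int)

    def incr(self, name, delta=1):
        # No separate registration step: the first increment creates the counter.
        with self._lock:
            self._counters[name] += delta

    def snapshot(self):
        # Copy under the lock so an exporter sees a consistent view.
        with self._lock:
            return dict(self._counters)


def handle_request(registry, path):
    """Example request handler instrumented inline with the logic it measures."""
    registry.incr("requests")
    if path == "/missing":
        registry.incr("requests.not_found")
        return 404
    return 200
```

Because the counters sit next to the branch they describe, changing the branch changes the metric in the same edit.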


  4. Getting the metrics in
    • What’s the metric -> monitoring path?
    • Don’t onboard services and hosts onto your
    monitoring stack - make it automatic
    • Choose between pushing metrics vs polling metrics
    • Maximizing the control over sending while
    minimizing failure scenarios
    • Centralized collectors vs decentralized agents
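If the push model is chosen, "maximizing the control over sending while minimizing failure scenarios" could look roughly like the agent below: it buffers a bounded number of samples and never lets a collector outage propagate into the service. `PushAgent` and its `send` callable are assumptions made for illustration; the talk does not describe a concrete agent.

```python
from collections import deque


class PushAgent:
    """Hypothetical push-side agent with a bounded buffer."""

    def __init__(self, send, max_buffered=1000):
        self._send = send  # assumed transport callable; may raise on failure
        # deque(maxlen=...) silently discards the oldest sample when full,
        # so memory use stays bounded during a collector outage.
        self._buffer = deque(maxlen=max_buffered)

    def record(self, name, value):
        self._buffer.append((name, value))

    def flush(self):
        """Send buffered samples in order; stop at the first failure.

        Returns the number of samples sent. Unsent samples stay buffered.
        """
        sent = 0
        while self._buffer:
            sample = self._buffer[0]
            try:
                self._send(sample)
            except Exception:
                break  # collector unavailable: keep the rest for the next flush
            self._buffer.popleft()
            sent += 1
        return sent
```

A polling design would move the same trade-off to the collector, which then decides when and how often to fetch.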


  5. Discovering the data
    • Assign a name and address for each piece of data
    gathered
    “for all front end servers, sum 15-min load avg”
    • Ability to query the data without knowing where it
    resides
    • Ability to perform maintenance, move data
    sources


  6. Discovering the data (cont.)
    Decouple monitoring from the knowledge about
    instances:
    “look at all the backend Riak metrics”
    vs
    “see metrics coming from my-riak-01.db.startup.ca,
    my-riak-02.db.startup.ca, my-riak-03.db.startup.ca”
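A minimal sketch of that decoupling, assuming a directory that maps a logical group name to its current members. The group name, host names, metric name and sample values are all made up for the example.

```python
def sum_over_group(groups, samples, group, metric):
    """Sum one metric across whichever hosts currently belong to a group.

    groups:  dict mapping a logical group name to a list of host names
    samples: dict mapping (host, metric) to the latest value
    Hosts with no sample for the metric are skipped.
    """
    return sum(
        samples[(host, metric)]
        for host in groups[group]
        if (host, metric) in samples
    )


# Hypothetical data: "for all front end servers, sum 15-min load avg".
groups = {"frontend": ["fe-01", "fe-02", "fe-03"]}
samples = {
    ("fe-01", "load15"): 0.5,
    ("fe-02", "load15"): 1.25,
    ("fe-03", "load15"): 0.25,
}
total = sum_over_group(groups, samples, "frontend", "load15")
```

If `fe-03` is replaced by another host, only the `groups` entry changes; the query text stays the same.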


  7. Discovering events
    • Metrics are only the projection of reality into your
    measuring devices
    • Keep track of higher-level events to provide context
    for the metrics


  8. Going faster
    • Invest in high-frequency metrics (1s, 10s)
    • Doesn’t have to go through the main monitoring path
    • can stream through WebSockets
    • can have high-frequency collection
    • High frequency != low latency
    • consider store -> batch -> forward
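The store -> batch -> forward idea can be sketched as follows: samples are collected every second but shipped in ten-second batches, so resolution is high while delivery latency is up to one batch window. `BatchingForwarder` and its `forward` callable are hypothetical names, and the window size is an arbitrary choice.

```python
class BatchingForwarder:
    """Hypothetical store -> batch -> forward stage for high-frequency samples."""

    def __init__(self, forward, batch_seconds=10):
        self._forward = forward            # assumed callable taking a list of samples
        self._batch_seconds = batch_seconds
        self._pending = []                 # the "store" step
        self._batch_start = None

    def add(self, timestamp, value):
        if self._batch_start is None:
            self._batch_start = timestamp
        elif timestamp - self._batch_start >= self._batch_seconds:
            # Window elapsed: forward the stored batch, then start a new one.
            self._forward(self._pending)
            self._pending = []
            self._batch_start = timestamp
        # Each sample keeps its own timestamp, so 1s resolution survives batching.
        self._pending.append((timestamp, value))

    def pending_count(self):
        return len(self._pending)
```

Feeding one sample per second for 25 seconds produces two forwarded batches of ten samples each, with five samples still waiting.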


  9. Going faster


  10. Visualizations
    “Go to these pages and look at the graphs”
    vs
    Mathematica-style “enter your functions”


  11. Example queries
    expmovingavg(10, 0.9, ts(sum, myservice,
    myservice.hostgroupA, api_store))
    groupby(metric, sum(_), ts(sum, myservice,
    members(myservice.hostgroupA), api_store))
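The slide does not say what the two numeric arguments of `expmovingavg` mean, so the sketch below does not try to reproduce the query language. It only shows the textbook exponential moving average, s_t = alpha * x_t + (1 - alpha) * s_(t-1), with a hypothetical function name and a single smoothing parameter.

```python
def exp_moving_avg(values, alpha):
    """Textbook exponential moving average over a sequence of numbers.

    alpha in (0, 1]: weight of the newest point. The first output equals
    the first input. This is an illustration, not the talk's query engine.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    prev = None
    for x in values:
        prev = x if prev is None else alpha * x + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed
```

A larger `alpha` follows the raw series more closely; a smaller one smooths more and reacts more slowly.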


  12. Visualizations
    Side-by-side data representations:


  13. Visualizations
    • “Live” graphs: auto-update
    periodically as the data arrives
    • Allow overlaying time-series
    • Consider log scale to compare results
    of different magnitudes


  14. Problems at Twitter scale


  15. Quality of service
    • Your workload is write-many read-few
    • Protect against heavy reads: evaluate per-query
    costs, kill expensive queries, track hot requests
    • Protect against heavy writes: impose quotas,
    prioritize or drop
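As a rough illustration of "impose quotas, prioritize or drop" on the write path, the sketch below keeps a per-service write budget for each time window and counts what it drops. `WriteQuota`, the limit, and the service names are assumptions for the example; the talk does not describe how the quotas were implemented.

```python
class WriteQuota:
    """Hypothetical per-service write quota over a fixed window."""

    def __init__(self, limit_per_window):
        self.limit = limit_per_window
        self.used = {}     # service -> writes admitted in the current window
        self.dropped = {}  # service -> writes rejected since start

    def admit(self, service):
        """Return True if the write fits the service's quota, else drop it."""
        used = self.used.get(service, 0)
        if used >= self.limit:
            # Track drops so the noisy writer is visible, not just silently cut off.
            self.dropped[service] = self.dropped.get(service, 0) + 1
            return False
        self.used[service] = used + 1
        return True

    def reset_window(self):
        # Called by a timer at each window boundary (the timer is not shown).
        self.used.clear()
```

One service exhausting its budget does not affect another service's writes, which is the isolation property the slide asks for.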


  16. Quality of service (cont.)
    • Full disclosure: this is where we’re heading now
    • Some metrics are more equal than others. If you
    know which ones are more important, you can
    protect them better
    • At a certain size, you can’t put all of your metrics in
    one place. Have to isolate for reliability, then
    federate


  17. Quality of service (cont.)
    • Do all read queries have the same SLA? Can some
    be answered in minutes and not seconds?
    • Can some writes be aggregated offline? Can they
    be approximated and then improved (e.g. Lambda
    Architecture model)?


  18. Consistency
    • Do alerts and dashboards have to be separate?
    • In a metrics-driven organization, how does one
    discover metrics that matter the most?
    • In a service-oriented environment, who monitors
    your dependencies? Who gets paged?


  19. Configuration
    • In a service-oriented architecture, it’s tempting to
    configure each system separately
    • Filtering metrics, aggregating metrics on multiple
    levels adds to complexity
    • Attempt to consolidate the most important pieces of
    configuration early on


  20. Line charts and scale
    vs


  21. Summary
    • Invest in reliability of your monitoring stack to
    increase reliability of your company’s services
    • Visualize to reduce cognitive cost
    • Optimize for the shortest path to get to the data
    • Make an effort to get to high-frequency data


  22. Thanks!
    • My team is hiring, come talk to us later
    twitter.com/jobs
    • There are a few Observability team members here
    tonight
    • Share your monitoring experiences with us!
