Altitude 2018

The burden of a successful feature: Scaling our real time logging platform

Observability is a hot topic in the computing world: we’ve all dealt with systems that are difficult to reason about because we have no visibility into what they’re doing. Fastly’s real-time logging gives you immediate visibility into your application’s behavior at the edge. It streams millions of request logs per second and can ship data to customer-defined infrastructure, including third-party cloud services. In this talk we’ll give you a peek into the logging platform and share some challenges we found along the way and the lessons we learned from them. More importantly, we’ll talk about what we are doing to evolve this platform into the future.

https://github.com/Randommood/altitude2018

Ines Sombra

May 22, 2018

Transcript

  1. Ines Sombra, Director of Engineering, presents: The burden of a successful feature: Scaling our real time logging platform
  2. Fresh from Altitude NYC (https://vimeo.com/267641392): "Observability is an umbrella term. There are different techniques to achieve observability in a system." (Peter Bourgon, Altitude NYC)
  3. But Why? This pipeline is one of the oldest systems at Fastly. Born out of our dissatisfaction with the status quo. We wanted something that would send you logs extremely fast (stream them in near real-time) to anywhere you want (many endpoints).
  4. Logging pipeline is Stateless. We don’t batch your logs. We don’t store your logs. We stream your logs in near real-time to your defined endpoints. We really don’t want your logs on disk.
  5. Logging pipeline is Best Effort. We try our best to send logs to your defined endpoint. Your endpoint must be up & healthy in order for us to be able to send data to it. We have minimal buffering. Pipeline optimized for log streaming speed.
  6. Logging Endpoints. We don’t limit the number of endpoints or log lines per request. ~8.6K active endpoints. Ecosystem of endpoints in different stages of evolution. Diagram: aggregators deliver to endpoints (s3, syslog, gcs, sumologic, bigquery, ftp, papertrail, …).
  7. Logging Streams data. Two kinds of endpoints: file-based endpoints (time ranged) and streaming endpoints (protocol or http-requests). Endpoints shown: s3, gcs, ftp, sftp, syslog, sumologic, bigquery, logentries, papertrail, splunk, scalyr, honeycomb.
  8. Logging Endpoints. We send a lot of data continuously to our supported endpoints. Syslog continues to be our most popular endpoint, but S3 & GCS have the highest volume. The 70's are still alive with a very respectable 13 MBps to ftp and 74 kBps to sftp* (* for the non-millennials).
  9. Volume Challenges. No hard limits to what you can log; this can be challenging. System is multi-tenant. Noisy neighbors can affect delivery. Consider sampling for high volume logging.
  10. Burden of many endpoints. Classic integration challenges (each endpoint is a downstream dependency). Standard endpoint clients often don’t meet our needs. Having our own clients affords us extra optimizations.
  11. Endpoints & Health. Some endpoints have known limitations (infamous examples: S3, BigQuery, GCS). Difficult to infer if an endpoint is working or not (hard to test setup too). Structured logging (JSON via VCL) is challenging.
  12. Service Isolation. Prioritize delivery of content over log retention. An aggregator discards the oldest logs it has when it can’t deliver them fast enough. In a cache node we are our own customers, so senders do the same when they can’t reach aggregators fast enough.
  13. Expectation Mismatch. The burden of a system that works so well is that it makes you believe you have strong guarantees. Design constraints determine the SLA of the pipeline. General advice: understand the design choices of the systems you use, because they limit what is possible to guarantee.
  14. The team has been busy bees (timeline: H1, H2). Platform performance & addressing the challenges of individual endpoints. We are getting fancy!
  15. Platform Performance. Reducing lock contention & CPU usage. Smarter memory allocation & management. Overhauling all endpoints. Halving the time it takes for a log line to be processed (from sender read to aggregator line preparation).
  16. Want More? Want more endpoints? Want metrics? Want easier structured logging? Want VCL counters + secondly aggregation + a higher SLA?
  17. Want More? Want more endpoints? Want metrics? Want easier structured logging? Want VCL counters + secondly aggregation + a higher SLA? Dom Fee
  19. tl;dr: Logging. Fastly lets you extend the visibility of your system to the edge & gain meaningful insights in near real-time. It is a pipeline with very specific constraints & guarantees. Exciting things are coming!
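
The "Service Isolation" slide says an aggregator discards the oldest logs it holds when it can't deliver them fast enough. A minimal Python sketch of that drop-oldest idea follows. It is an illustration only, not Fastly's implementation: the class name `DropOldestBuffer`, its methods, and the capacity values are all hypothetical.

```python
from collections import deque


class DropOldestBuffer:
    """Hypothetical bounded buffer: when full, the oldest line is discarded."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # A deque with maxlen evicts from the left (oldest) on append when full.
        self._lines = deque(maxlen=capacity)
        self.dropped = 0  # count of lines discarded because delivery lagged

    def push(self, line):
        """Accept a new log line, evicting the oldest one if the buffer is full."""
        if len(self._lines) == self._lines.maxlen:
            self.dropped += 1
        self._lines.append(line)

    def drain(self, max_lines):
        """Remove and return up to max_lines lines, oldest first."""
        out = []
        while self._lines and len(out) < max_lines:
            out.append(self._lines.popleft())
        return out

    def __len__(self):
        return len(self._lines)


# Usage: five lines arrive while the endpoint is slow; only the newest three survive.
buf = DropOldestBuffer(capacity=3)
for i in range(5):
    buf.push(f"line {i}")
delivered = buf.drain(max_lines=10)
```

Here `delivered` holds the three newest lines and `buf.dropped` is 2. The trade-off matches the slide's stated priority: memory use stays bounded so content delivery is protected, at the cost of losing the oldest logs under back-pressure.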
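
The "Volume Challenges" slide suggests sampling for high volume logging. One way to sketch that idea in Python is deterministic sampling keyed on a request identifier, so the same request is always either logged or skipped. This is an assumption-laden illustration rather than anything from the talk: the function `should_log`, the `request_id` strings, and the percentage parameter are hypothetical.

```python
import zlib


def should_log(request_id: str, sample_percent: int) -> bool:
    """Return True if this request falls inside the sampled percentage.

    The request ID is hashed with CRC32 and mapped to a bucket in 0..99;
    the request is logged when its bucket is below sample_percent.
    """
    if not 0 <= sample_percent <= 100:
        raise ValueError("sample_percent must be between 0 and 100")
    bucket = zlib.crc32(request_id.encode("utf-8")) % 100
    return bucket < sample_percent


# Usage: keep roughly 10% of requests (the exact share depends on the IDs).
request_ids = [f"req-{i}" for i in range(1000)]
sampled = [rid for rid in request_ids if should_log(rid, 10)]
```

Because the decision depends only on the ID, raising the percentage never drops a request that was previously kept, which makes it easier to compare logs before and after changing the rate.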