Altitude 2018

The burden of a successful feature: Scaling our real time logging platform

Observability is a hot topic in the computing world: we’ve all dealt with systems that are difficult to reason about because we have no visibility into what they’re doing. Fastly’s real-time logging gives you immediate visibility into your application’s behavior at the edge. It streams millions of request logs per second and can ship data to customer-defined infrastructure, including third-party cloud services. In this talk we’ll give you a peek into the logging platform and share some challenges we found along the way and the lessons we learned from them. More importantly, we’ll talk about what we are doing to evolve this platform into the future.

https://github.com/Randommood/altitude2018

Ines Sombra

May 22, 2018

Transcript

  1. Ines Sombra, Director of Engineering, presents: The burden of a successful feature: Scaling our real time logging platform
  2. Today’s Agenda: A delightful demo & context. A deep dive into logging. Challenges & future.
  3. But first… A bit of context

  4. Observability tl;dr

  5. Fresh from Altitude NYC: "Observability is an umbrella term. There are different techniques to achieve observability in a system." (Peter Bourgon, Altitude NYC) https://vimeo.com/267641392
  6. Peter’s classification of Observability: TECHNIQUES, SYSTEMS (* Lovingly stolen from Peter Bourgon)
  7. Peter’s classification of Observability: TECHNIQUES, SYSTEMS, with a TODAY marker (* Lovingly stolen from Peter Bourgon)
  8. STOP Demo Time!

  12. But Why? This pipeline is one of the oldest systems at Fastly. Born out of our dissatisfaction with the status quo. We wanted something that would send you logs extremely fast (stream them in near real time) to anywhere you want (many endpoints).
  13. Log Streaming at Fastly

  14. Logging @ Fastly: Caches → Aggregators → Endpoints (s3, syslog, gcs, sumologic, bigquery, ftp, papertrail, …)
  22. Logging pipeline is Stateless: We don’t batch your logs. We don’t store your logs. We stream your logs in near real-time to your defined endpoints. We really don’t want your logs on disk.
  23. Logging @ Fastly: Caches + Senders (Varnish, Varnish, Varnish, Varnish), Aggregators
  27. Logging pipeline is Best Effort: We try our best to send logs to your defined endpoint. Your endpoint must be up & healthy in order for us to be able to send data to it. We have minimal buffering. Pipeline optimized for log streaming speed.
  28. Logging Endpoints: We don’t limit the number of endpoints or log lines per request. ~8.6K active endpoints. Ecosystem of endpoints in different stages of evolution. Aggregators → Endpoints (s3, syslog, gcs, sumologic, bigquery, ftp, papertrail, …)
  29. Logging Streams data. File-based endpoints (time ranged): s3, gcs, ftp, sftp. Streaming endpoints (protocol or HTTP requests): syslog, sumologic, bigquery, logentries, papertrail, splunk, scalyr, honeycomb.
  30. Logging Growth (2014-2015): ~430K LPS, ~1.2K endpoints, ~2 GBps

  32. Logging Growth (2017-2018): ~3M LPS, ~8.6K endpoints, ~4 GBps

  34. Logging Growth (8X!!): ~3M LPS, ~8.6K endpoints, ~4 GBps
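
The growth slides can be read as simple ratios. Below is a minimal sketch that uses only the rounded figures shown on the slides; the function name and dictionary keys are chosen here for illustration. With these rounded inputs the ratios come out near 7x for lines per second and for endpoints, and 2x for throughput, so the 8X headline presumably rests on more precise figures than the slides show.

```python
def growth_factor(before: float, after: float) -> float:
    """Ratio of the later figure to the earlier one."""
    return after / before


# Rounded figures as shown on the slides (2014-2015 vs 2017-2018).
growth = {
    "lines_per_second": growth_factor(430_000, 3_000_000),
    "endpoints": growth_factor(1_200, 8_600),
    "throughput": growth_factor(2, 4),  # both values in GBps
}

for name, factor in growth.items():
    print(f"{name}: {factor:.1f}x")
```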

  35. Logging Endpoints

  36. Logging Endpoints: We send a lot of data continuously to our supported endpoints. Syslog continues to be our most popular endpoint, but S3 & GCS have the highest volume. The 70's are still alive with a very respectable 13 MBps to ftp and 74 kBps to sftp* (* for the non-millennials).
  37. Challenges & Lessons learned

  38. Logging @ Fastly: Caches → Aggregators → Endpoints (s3, syslog, gcs, sumologic, bigquery, ftp, papertrail, …)
  39. Volume Challenges: No hard limits to what you can log; this can be challenging. System is multi-tenant. Noisy neighbors can affect delivery. Consider sampling for high-volume logging.
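
The sampling advice on slide 39 can be illustrated with a small sketch. This is a language-neutral idea shown in Python, not Fastly configuration: `should_log`, `request_id` and `sample_rate` are hypothetical names. It hashes a request identifier so that the keep-or-drop decision is stable for a given request, which is one common way to sample; random sampling would also work.

```python
import hashlib


def should_log(request_id: str, sample_rate: float) -> bool:
    """Return True for roughly `sample_rate` of all request IDs.

    Hypothetical helper: the decision depends only on the ID, so the
    same request always gets the same answer.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < sample_rate * 2**64


# Keep about 10% of log lines for a high-volume service.
kept = [rid for rid in (f"req-{i}" for i in range(1000)) if should_log(rid, 0.1)]
```

Whatever the sampling rate, recording it alongside the logs makes it possible to scale counts back up during analysis.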
  40. Burden of many endpoints: Classic integration challenges (each endpoint is a downstream dependency). Standard endpoint clients often don’t meet our needs. Having our own clients affords us extra optimizations.
  41. Endpoints & Health: Some endpoints have known limitations (infamous examples: S3, BigQuery, GCS). Difficult to infer if an endpoint is working or not (hard to test setup too). Structured logging (JSON via VCL) is challenging.
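
Slide 41 does not say why structured logging is challenging. One general difficulty with assembling JSON by string concatenation, which may or may not be the one meant here, is escaping: a value containing a quote or backslash silently produces an invalid log line. The sketch below shows the problem in Python rather than VCL; all function names are illustrative.

```python
import json


def naive_json(url: str, status: int) -> str:
    # Hand-built JSON: breaks as soon as `url` contains a quote or backslash.
    return '{"url": "' + url + '", "status": ' + str(status) + '}'


def safe_json(url: str, status: int) -> str:
    # A JSON encoder escapes special characters in values.
    return json.dumps({"url": url, "status": status})


def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


tricky = '/search?q="cats"'
print(is_valid_json(naive_json(tricky, 200)))  # False
print(is_valid_json(safe_json(tricky, 200)))   # True
```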
  42. Service Isolation: Prioritize delivery of content over log retention. An aggregator discards the oldest logs it has when it can’t deliver them fast enough. In a cache node we are our own customers, so senders do the same when they can’t reach aggregators fast enough.
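
The drop-oldest behavior described on slide 42 can be sketched as a bounded queue. This is an illustration under stated assumptions, not Fastly's implementation: `DropOldestBuffer`, its `capacity` and the `dropped` counter are hypothetical names, and the real pipeline's buffer sizes are not given in the talk.

```python
from collections import deque
from typing import List


class DropOldestBuffer:
    """Bounded buffer that evicts the oldest line when full (hypothetical)."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # A deque with maxlen discards items from the opposite end on append.
        self._lines = deque(maxlen=capacity)
        self.dropped = 0

    def push(self, line: str) -> None:
        if len(self._lines) == self._lines.maxlen:
            self.dropped += 1  # the append below evicts the oldest line
        self._lines.append(line)

    def drain(self, max_lines: int) -> List[str]:
        """Remove and return up to `max_lines` lines, oldest first."""
        out: List[str] = []
        while self._lines and len(out) < max_lines:
            out.append(self._lines.popleft())
        return out


# A slow endpoint: five lines arrive but only three fit in the buffer.
buf = DropOldestBuffer(capacity=3)
for i in range(5):
    buf.push(f"line {i}")
print(buf.dropped)    # 2
print(buf.drain(10))  # ['line 2', 'line 3', 'line 4']
```

The design choice this illustrates is the one the slide names: memory use stays bounded and newer data wins, at the cost of losing the oldest lines when the consumer falls behind.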
  43. Expectation Mismatch: The burden of a system that works so well is that it makes you believe you have strong guarantees. Design constraints determine the SLA of the pipeline. General advice: Understand the design choices of the systems you use, because they limit what is possible to guarantee.
  44. The Future of Logging

  45. The team have been busy bees (H1, H2): Platform performance & addressing the challenges of individual endpoints. We are getting fancy!
  46. Platform Performance: Reducing lock contention & CPU usage. Smarter memory allocation & management. Overhauling all endpoints. Halving the time it takes for a log line to be processed (from sender read to aggregator line preparation).
  47. Getting fancy: BigQuery improvements. New endpoints: Kafka. More integrations with cloud services. Make endpoints easier to debug.
  49. Want More?

  50. Want More? Want more endpoints? Want metrics? Want easier structured logging? Want VCL counters + secondly aggregation + a higher SLA?
  51. Want More? Want more endpoints? Want metrics? Want easier structured logging? Want VCL counters + secondly aggregation + a higher SLA? Dom Fee
  53. tl;dr LOGGING: Fastly lets you extend the visibility of your system to the edge & gain meaningful insights in near real-time. Is a pipeline with very specific constraints & guarantees. Exciting things are coming!
  54. (l,d)ogs of Fastly https://github.com/Randommood/Altitude2018