No-Frills Observability

Eben Freeman

May 18, 2018

Transcript

  1. No-Frills Observability Eben Freeman @_emfree_ | eben@honeycomb.io

  2. Hi, I’m Eben! currently: honeycomb.io follow along at speakerdeck.com/emfree/no-frills-observability

  3. What this talk is not about
    - Carefully defining the word “observability”
    - Problems at very large scale
    - Exceptionally challenging instrumentation problems (very high-perf / low-level)
  4. What this talk is not about
    "I HAVE NO TOOLS BECAUSE I’VE DESTROYED MY TOOLS WITH MY TOOLS. My only logging option is to hire monks to transcribe the subjective experience of watching my machines die as I weep tears of blood."
    -- James Mickens, The Night Watch
    https://www.usenix.org/system/files/1311_05-08_mickens.pdf
  5. What this talk is not about
    - Carefully defining the word “observability”
    - Problems at very large scale
    - Exceptionally challenging instrumentation problems
    What this talk is about
    - A high-leverage approach to improving observability for your organization.
  6. Assumptions
    - You work on a software product with users
    - You care about those users
    - You write services and run them in production
    - It’s valuable to understand what happens in production
  7. Assumptions
    - You work on a software product with users
    - You care about those users
    - You write services and run them in production
    - It’s valuable to understand what happens in production, in order to
      - Diagnose major issues
      - Understand what your users are doing (or not doing)
      - Measure the effect of changes
      - Identify general improvements
  8. An Observability Stack
    - Code instrumentation: emit data from the services you own
    - Transport: send it somewhere
    - Storage and retrieval: retain the data so you can ask questions
  9. An Observability Stack
    - Code instrumentation: log.Error(...), statsd.Increment(...), opentracing.StartSpan(...)
    - Transport: Kafka, rsyslog, Logstash, HTTP, ...
    - Storage and retrieval: Elasticsearch/Kibana, Graphite, Zipkin/Jaeger, ...
  10. But wait
    Raise your hand if you’ve said or heard something like this:
    - “When there’s an error, I don’t have enough context to figure out why it happened”
    - “We have all of these metrics, but I don’t know what any of them really mean”
    - “I want to try a new tool, but that requires changing everything in our code”
    - “I know I should instrument this new codepath, but I don’t know what’ll be useful”
  11. But wait
    Raise your hand if you’ve said or heard something like this:
    - “When there’s an error, I don’t have enough context to figure out why it happened”
    - “We have all of these metrics, but I don’t know what any of them really mean”
    - “I want to try a new tool, but that requires changing everything in our code”
    - “I know I should instrument this new codepath, but I don’t know what’ll be useful”
    These are essentially cultural problems, not intrinsically hard problems in computer science.
  12. Your observability stack: the missing layer
    Ad-hoc instrumentation is doomed to mediocrity!
    - metrics
    - logs
    - traces
  13. Your observability stack: the missing layer
    - Code instrumentation
    - Transport
    - Storage and retrieval
    - Instrumentation patterns
  14. Your observability stack: the missing layer
    Mature organizations tend to develop their own instrumentation library and API. This is a way to encourage effective practices and make it easier for everyone to write observable software.
    - Code instrumentation
    - Transport
    - Storage and retrieval
    - Instrumentation patterns
  15. Your observability stack: the missing layer
    This is high-leverage work. You:
    - Provide a blueprint for others to follow
    - Ensure that instrumentation captures the right context
    - Make it easier to evolve your observability over time.
  16. Hang on
    "Isn't this problem solved by OpenTracing, OpenCensus, Veneur, etc?"
    - To a point!
    - It's generally easier to adopt and evolve an internal API
    - The authors of these projects may not have perfectly anticipated your needs
    - Wrapping / adapting existing frameworks is arguably most effective!
    opentracing.io
    opencensus.io
    github.com/stripe/veneur
  17. Some elements of your system are necessarily bespoke
    You probably didn’t …
    - write your own database -- but you do have your own schema
    - write your own HTTP router -- but you do have your own handlers
    - write your own JavaScript framework -- but you do have your own UI components.
    What you build on top of foundations embodies what matters to you:
    - Your data model, your service, your visual style
    This work deserves thought and attention
  18. Getting Real

  19. Instrumentation Patterns
    Let’s sketch the outline of such a library:
    - Start with structure
    - Add context
    - Encourage explanation
    - Abstract transport
    - Embrace pragmatism
    Five examples of ways to help your team, < 10 lines of code each
  20. Start with structure
    "I can't even figure out where to start looking in the logs."
    Ditch strings:
    - Strive for self-describing data
    - Avoid format strings in logging code
    This is not self-describing data:
        127.0.0.1 - - [12/Oct/2017 17:36:36] "GET /login HTTP/1.1" 200 -
  21. Start with structure
    This is not self-describing data:
        127.0.0.1 - - [12/Oct/2017 17:36:36] "GET /login HTTP/1.1" 200 -
    This is:
        {
          "upstream_address": "127.0.0.1",
          "date": "2017-10-12T17:36:36",
          "request_method": "GET",
          "request_path": "/login",
          "status": 200
        }
  22. Start with structure
    Avoid direct string formatting:
        def login():
            try:
                # . . .
            except Exception as e:
                log.info("Error logging in: %s", e)
  23. Start with structure
    Make it easy to emit structured data:
        import baseline  # This is our instrumentation API

        def login():
            try:
                # …
            except Exception as e:
                baseline.log(error=e, endpoint="login")
  24. Start with structure
    Make it easy to emit structured data:
        import baseline

        def login():
            try:
                # …
            except Exception as e:
                baseline.log(error=e, endpoint="login")

        # baseline library
        import json

        def log(**data):
            # you can ONLY pass key-value arguments
            # default=str lets non-JSON values such as exceptions serialize
            print(json.dumps(data, default=str))
    Structure becomes the default
  25. Add context
    “When there’s an error, I don’t have enough context to figure out what happened.”
  26. Add context
    “When there’s an error, I don’t have enough context to figure out what happened”
    Attaching additional context to instrumentation is generally cheap. Not having that context is generally expensive.
    Examples:
    - Customer ID
    - User ID
    - Build ID or git SHA
    - Request ID
    - Feature flags
    - Call site (function name and line number)
    Automate when possible
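One way the call-site item above could be automated is sketched here, assuming CPython (where `inspect.currentframe()` is available). The field names `call_function` and `call_line` are made up for illustration, and `log` stands in for the talk's `baseline.log`.

```python
import inspect
import json


def log(**data):
    """Emit a structured event, tagging it with the caller's function and line."""
    # f_back is the frame of whoever called log(); CPython-specific.
    caller = inspect.currentframe().f_back
    data["call_function"] = caller.f_code.co_name  # hypothetical field name
    data["call_line"] = caller.f_lineno            # hypothetical field name
    line = json.dumps(data, default=str)
    print(line)
    return line


def login():
    # The caller never has to spell out where the event came from.
    return log(endpoint="login", status=200)
```

Because the library fills these fields in, nobody has to remember to add them at each call site.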
  27. Add context
    Context might be scoped to a process, a function, a request . . .
        # baseline library
        import json
        import os

        env = os.environ["ENV"]
        build_id = os.environ["BUILD_ID"]

        def log(**data):
            data["environment"] = env
            data["build_id"] = build_id
            print(json.dumps(data, default=str))
  28. Add context
    An example: migrating services to Kubernetes
    - Attach an infra_type field to all events
    - Very useful during the migration, not needed otherwise
    - One-line change
  29. Incremental context
        def login():
            event = baseline.Event()
            event.bind("endpoint", "login")
            user = authUser(request)
            event.bind("user_id", user.id)
            # . . .
            event.send()
    Make it easy to incrementally build up context -- deal in events
  30. Incremental context
        def login():
            event = baseline.Event()
            event.bind("endpoint", "login")
            user = authUser(request)
            event.bind("user_id", user.id)
            # . . .
            event.send()
    Make it easy to incrementally build up context -- deal in events
        # baseline
        import json

        class Event(object):
            def __init__(self):
                self.data = {}

            def bind(self, key, value):
                self.data[key] = value

            def send(self):
                print(json.dumps(self.data))
  31. Incremental context
        def login():
            event = baseline.Event()
            event.bind("endpoint", "login")
            user = authUser(request)
            event.bind("user_id", user.id)
            # . . .
            event.send()
    Make it easy to incrementally build up context -- deal in events
    Idea: automatically time this duration
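The timing idea could look roughly like this. It is a sketch, not the talk's implementation: the `duration_ms` and `error` field names and the context-manager interface are assumptions layered on the `Event` class shown earlier.

```python
import json
import time


class Event(object):
    def __init__(self):
        self.data = {}
        # Monotonic clock: unaffected by wall-clock adjustments.
        self._start = time.monotonic()

    def bind(self, key, value):
        self.data[key] = value

    def send(self):
        # Hypothetical field: elapsed time since the event was created.
        self.data["duration_ms"] = (time.monotonic() - self._start) * 1000.0
        print(json.dumps(self.data))

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc is not None:
            self.bind("error", repr(exc))
        self.send()
        return False  # never swallow the caller's exception
```

With this shape, `with Event() as event: ...` sends the event on exit, so the duration (and any exception) is recorded even if the handler forgets to call `send()`.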
  32. Encourage explanation
    “We have all of these metrics and fields, but I don’t know what any of them mean.”
        event.bind("memory_inuse_merge_max", max_alloc)
  33. Encourage explanation
    “We have all of these metrics, but I don’t know what any of them mean.”
        event.bind("memory_inuse_merge_max", max_alloc)
  34. Encourage explanation
    “We have all of these metrics, but I don’t know what any of them mean.”
        event.bind("memory_inuse_merge_max", max_alloc,
                   help="Maximum measured heap size during the merge phase of a query")
  35. Encourage explanation
    “We have all of these metrics, but I don’t know what any of them mean.”
        event.bind("memory_inuse_merge_max", max_alloc,
                   help="Maximum measured heap size during the merge phase of a query")
    - send-on-first-use
    - just treat as documentation-in-code
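A minimal sketch of the send-on-first-use option, assuming the help text travels with the first event that mentions a field. The `_field_help` key and the module-level `_described` set are illustrative choices, not part of the talk.

```python
import json

# Field names whose help text has already been emitted by this process.
_described = set()


class Event(object):
    def __init__(self):
        self.data = {}
        self.descriptions = {}

    def bind(self, key, value, help=None):
        self.data[key] = value
        # Only attach the description the first time a field is seen.
        if help is not None and key not in _described:
            _described.add(key)
            self.descriptions[key] = help

    def send(self):
        payload = dict(self.data)
        if self.descriptions:
            payload["_field_help"] = self.descriptions  # hypothetical key
        line = json.dumps(payload)
        print(line)
        return line
```

Even if the backend ignores the descriptions, a mandatory or encouraged `help=` argument still works as documentation in code.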
  36. Abstract away transport details
    You probably have multiple tools for
    - logs
    - metrics
    - exceptions
    - traces
    - etc.
    Abstracting their protocol details is generally worthwhile!
  37. Abstract away transport details
    "Reading JSON logs when I'm developing locally is hard"
    Responses:
    ✗ Shut up and eat your vegetables!
    ✔ Can we fix that?
  38. Abstract away transport details
    "Reading JSON logs when I'm developing locally is hard"
    - output JSON or protobufs or whatever in production, pretty-print in development:
        # baseline
        import sys

        running_in_terminal = sys.stdin.isatty()

        class Event(object):
            def send(self):
                if running_in_terminal:
                    pretty_print(self.data)
                else:
                    # ...
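To make the abstraction concrete, here is one possible shape for a pluggable transport. The class names `JSONLinesTransport` and `PrettyTransport` and the `default_transport` helper are hypothetical. This sketch checks `sys.stdout` because that is the stream being written to.

```python
import json
import pprint
import sys


class JSONLinesTransport(object):
    """One JSON object per line: suited to production log shippers."""

    def __init__(self, stream=None):
        self.stream = stream if stream is not None else sys.stdout

    def send(self, data):
        self.stream.write(json.dumps(data, sort_keys=True) + "\n")


class PrettyTransport(object):
    """Human-readable output for local development."""

    def __init__(self, stream=None):
        self.stream = stream if stream is not None else sys.stdout

    def send(self, data):
        self.stream.write(pprint.pformat(data) + "\n")


def default_transport():
    # Pretty-print when attached to a terminal, emit JSON otherwise.
    return PrettyTransport() if sys.stdout.isatty() else JSONLinesTransport()


class Event(object):
    def __init__(self, transport=None):
        self.data = {}
        self.transport = transport if transport is not None else default_transport()

    def bind(self, key, value):
        self.data[key] = value

    def send(self):
        # Application code never needs to know where events end up.
        self.transport.send(self.data)
```

Swapping in a different backend then means writing one new class with a `send(data)` method, with no changes at call sites.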
  39. Embrace pragmatism
    “I can't figure this out without debug logs, and we don't have those in prod”
  40. Embrace pragmatism
    “I can't figure this out without debug logs, and we don't have those in prod”
        class Event(object):
            def __init__(self, request_id, debug=False):
                # ...

            def send(self):
                if self.debug and self.request_id % 1000 != 0:
                    # drop debug events for 999 out of 1000 requests
                    return
                # otherwise, actually send the event
  41. Embrace pragmatism
    “I can't figure this out without debug logs, and we don't have those in prod”
    - record debug logs for 1 in 1000 requests
    - may seem like a hack
    - but does help debug any problem affecting more than 0.1% of requests
    - three lines of code
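The modulo check on the slide presumes integer request IDs. If yours are strings (UUIDs, say), a stable hash gives the same effect. This is a sketch under that assumption; `keep_debug` and `SAMPLE_RATE` are illustrative names. `zlib.crc32` is used rather than the built-in `hash()`, which is randomized per process for strings, so every service that sees the same request ID makes the same decision.

```python
import json
import zlib

SAMPLE_RATE = 1000  # keep debug events for roughly 1 in 1000 requests


def keep_debug(request_id, sample_rate=SAMPLE_RATE):
    """Deterministically decide whether to keep debug events for a request."""
    bucket = zlib.crc32(str(request_id).encode("utf-8")) % sample_rate
    return bucket == 0


class Event(object):
    def __init__(self, request_id, debug=False):
        self.request_id = request_id
        self.debug = debug
        self.data = {"request_id": request_id}

    def bind(self, key, value):
        self.data[key] = value

    def send(self):
        if self.debug and not keep_debug(self.request_id):
            return False  # dropped: a debug event for an unsampled request
        print(json.dumps(self.data, default=str))
        return True
```

Sampling by request rather than by event keeps all debug output for a sampled request together, which is what makes it usable for diagnosis.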
  43. More ideas
    Safeguards
    - make sure transport doesn't block the critical path
    - truncate ludicrously large events
    - rate-limit or drop events under pressure
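The three safeguards above can be sketched together with a bounded queue and a background thread. Everything here is an assumption for illustration: the `BackgroundTransport` name, the limits, and the `deliver` callback that stands in for whatever actually ships events.

```python
import json
import queue
import threading

MAX_VALUE_LEN = 1024      # assumed per-field limit, in characters
MAX_QUEUED_EVENTS = 1000  # assumed queue bound


def truncate(data, max_len=MAX_VALUE_LEN):
    """Cut oversized string values down to max_len characters."""
    return {
        k: (v[:max_len] + "...[truncated]") if isinstance(v, str) and len(v) > max_len else v
        for k, v in data.items()
    }


class BackgroundTransport(object):
    def __init__(self, deliver, max_queued=MAX_QUEUED_EVENTS):
        self.deliver = deliver   # callable that ships one serialized event
        self.dropped = 0         # approximate count; not synchronized
        self.queue = queue.Queue(maxsize=max_queued)
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, data):
        payload = json.dumps(truncate(data), default=str)
        try:
            self.queue.put_nowait(payload)  # never blocks the request path
        except queue.Full:
            self.dropped += 1               # shed load under pressure

    def _run(self):
        while True:
            payload = self.queue.get()
            try:
                self.deliver(payload)
            except Exception:
                pass  # instrumentation must not take down the service
            finally:
                self.queue.task_done()

    def flush(self):
        self.queue.join()  # wait until queued events have been handed off
```

Dropping on a full queue trades completeness for safety: a slow or dead backend costs you some events, not request latency.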
  44. More ideas
    Unit tests for instrumentation:
    - Culture of testing? Awesome!
    - make it easy to unit-test instrumentation
        def test_instrumentation_called():
            event_sink = baseline.MockTransport()
            with test_request() as req:
                login(req)
            assert len(event_sink.recorded_events()) > 0
            assert "user_id" in event_sink.recorded_events()[0]
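For reference, a mock transport along these lines can be very small. This sketch passes the transport in explicitly rather than installing it globally, which is an assumption; the slide does not show how `baseline.MockTransport` is wired up, and the `login` signature here is simplified for the example.

```python
class MockTransport(object):
    """Records events in memory instead of sending them anywhere."""

    def __init__(self):
        self._events = []

    def send(self, data):
        self._events.append(dict(data))  # copy so later mutation can't leak in

    def recorded_events(self):
        return list(self._events)


class Event(object):
    def __init__(self, transport):
        self.data = {}
        self.transport = transport

    def bind(self, key, value):
        self.data[key] = value

    def send(self):
        self.transport.send(self.data)


def login(user_id, transport):
    # Simplified stand-in for a real handler.
    event = Event(transport)
    event.bind("endpoint", "login")
    event.bind("user_id", user_id)
    event.send()


def test_instrumentation_called():
    sink = MockTransport()
    login(42, sink)
    events = sink.recorded_events()
    assert len(events) == 1
    assert events[0]["user_id"] == 42
```

Once such a test exists, removing or renaming a field that dashboards depend on fails in CI instead of in production.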
  45. Key points
    - Every one of these is an incremental improvement
    - Not a multi-month project
    - You choose the pain points to prioritize
    - Small lines-of-code changes, large team impact
  46. In conclusion
    Invest in instrumenting:
    - Start with structure
    - Add context
    - Encourage explanation
    - Abstract transport
    - Embrace pragmatism
    Improve incrementally
    - Focus on the problems that matter to you
    - Solicit team feedback
  47. Thank you!
    heckle at: @_emfree_ | eben@honeycomb.io
    These slides: speakerdeck.com/emfree/no-frills-observability
    P.S. Please come say hi, I have duct tape (for your infra)