"I HAVE NO TOOLS BECAUSE I'VE DESTROYED MY TOOLS WITH MY TOOLS. My only logging option is to hire monks to transcribe the subjective experience of watching my machines die as I weep tears of blood."
-- James Mickens, The Night Watch
https://www.usenix.org/system/files/1311_05-08_mickens.pdf
What this talk is not about:
- The word “observability”
- Problems at very large scale
- Exceptionally challenging instrumentation problems

What this talk is about:
- A high-leverage approach to improving observability for your organization
- You care about those users
- You write services and run them in production
- It’s valuable to understand what happens in production, in order to:
  - Diagnose major issues
  - Understand what your users are doing (or not doing)
  - Measure the effect of changes
  - Identify general improvements
Complaints sound something like this:
- “When there’s an error, I don’t have enough context to figure out why it happened”
- “We have all of these metrics, but I don’t know what any of them really mean”
- “I want to try a new tool, but that requires changing everything in our code”
- “I know I should instrument this new codepath, but I don’t know what’ll be useful”

These are essentially cultural problems, not intrinsically hard problems in computer science.
Proposal: a shared internal instrumentation API. This is a way to encourage effective practices and make it easier for everyone to write observable software.

Your observability stack: the missing layer
- Code instrumentation
- Transport
- Storage and retrieval

Instrumentation patterns
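As a sketch of what such an internal layer might look like (a hypothetical minimal version; all names here are invented, not Stripe's actual API): application code talks only to Event, so transports and storage backends can be swapped underneath without touching instrumented code.

```python
import json
import sys

class Transport:
    """Pluggable backend: swap this out without changing application code."""
    def emit(self, data):
        sys.stdout.write(json.dumps(data, sort_keys=True) + "\n")

class Event:
    """Minimal internal instrumentation API (illustrative names only)."""
    def __init__(self, name, transport=None):
        self.data = {"event": name}
        self.transport = transport or Transport()

    def add(self, **fields):
        # attach arbitrary named context to the event
        self.data.update(fields)
        return self

    def send(self):
        self.transport.emit(self.data)

# Usage: application code only ever sees Event.
Event("login_attempt").add(user_id="u_123", success=True).send()
```

The thinness is the point: the API surface is small enough to evolve internally, while every layer below it stays replaceable.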
To a point!
- It's generally easier to adopt and evolve an internal API
- The authors of these projects may not have perfectly anticipated your needs
- Wrapping / adapting existing frameworks is arguably most effective!

Hang on:
- opentracing.io
- opencensus.io
- github.com/stripe/veneur
... start looking in the logs."

Ditch strings:
- Strive for self-describing data
- Avoid format strings in logging code

This is not self-describing data:

127.0.0.1 - - [12/Oct/2017 17:36:36] "GET /login HTTP/1.1" 200 -
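For contrast, the same request expressed as self-describing data: every value carries a name, so no consumer ever has to parse a format string. (The field names here are illustrative, not a prescribed schema.)

```python
import json

# The Apache-style line above, restated so each field names itself.
event = {
    "client_ip": "127.0.0.1",
    "timestamp": "2017-10-12T17:36:36Z",
    "method": "GET",
    "path": "/login",
    "protocol": "HTTP/1.1",
    "status": 200,
}
print(json.dumps(event, sort_keys=True))
```

A downstream tool can now filter on `status` or group by `path` without knowing anything about how the line was produced.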
“When there’s an error, I don’t have enough context to figure out what happened”

Attaching additional context to instrumentation is generally cheap. Not having that context is generally expensive.

Examples:
- Customer ID
- User ID
- Build ID or git SHA
- Request ID
- Feature flags
- Call site (function name and line number)

Add context, and automate it when possible.
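Call-site context is a good candidate for automation. A hypothetical helper using Python's inspect module (with_call_site is an invented name, not part of any real API):

```python
import inspect

def with_call_site(data):
    """Stamp the caller's function name and line number onto an event dict."""
    caller = inspect.stack()[1]  # frame 0 is this helper, frame 1 is the caller
    data["function"] = caller.function
    data["line"] = caller.lineno
    return data

def handle_login():
    # the call site is recorded automatically; no hand-written context needed
    return with_call_site({"event": "login_error", "user_id": "u_123"})

event = handle_login()
```

Humans forget to add this context by hand; a helper like this makes it free on every event.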
“We have all of these metrics, but I don’t know what any of them really mean.”

event.bind("memory_inuse_merge_max", max_alloc,
           help="Maximum measured heap size during the merge phase of a query")

Options for the help text:
- send-on-first-use
- or just treat it as documentation-in-code
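One way to sketch send-on-first-use (all names here are invented): emit the help string alongside the first value for each metric name, then only values afterward, so the documentation travels with the data without bloating every event.

```python
# Names already described in this process; module-level for simplicity.
_help_sent = set()

def bind(event_data, name, value, help=None):
    """Attach a metric to an event; ship its help text only once per process."""
    if help is not None and name not in _help_sent:
        event_data[name + ".help"] = help
        _help_sent.add(name)
    event_data[name] = value
    return event_data

first = bind({}, "memory_inuse_merge_max", 1024,
             help="Maximum measured heap size during the merge phase of a query")
second = bind({}, "memory_inuse_merge_max", 2048,
              help="Maximum measured heap size during the merge phase of a query")
# "first" carries the .help field; "second" carries only the value
```

The storage layer can then index help strings by name, and every dashboard gets a description for free.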
"... locally is hard"

Output JSON (or protobufs, or whatever) in production; pretty-print in development:

# baseline
import sys

running_in_terminal = sys.stdin.isatty()

class Event(object):
    def send(self):
        if running_in_terminal:
            pretty_print(self.data)
        else:
            # ... send structured data to the production transport
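A runnable version of that sketch, with assumed stand-ins for pretty_print and the production branch (baseline is the talk's hypothetical internal API; these names are illustrative):

```python
import json
import sys

class Event(object):
    def __init__(self, data):
        self.data = data

    def render(self, in_terminal):
        if in_terminal:
            # development: human-friendly "key = value" lines
            return "\n".join(
                "%s = %r" % (k, self.data[k]) for k in sorted(self.data))
        # production: machine-readable JSON
        return json.dumps(self.data, sort_keys=True)

    def send(self):
        print(self.render(sys.stdin.isatty()))

Event({"event": "login", "status": 200}).send()
```

Splitting render() from send() keeps the terminal check at the edge, so the formatting logic itself is trivially unit-testable.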
“I can't figure this out without debug logs, and we don't have those in prod”

def send(self):
    if self.debug and self.request_id % 1000 != 0:
        # drop debug events for 999 out of 1000 requests
        return
    # otherwise, actually send the event
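The sampling rule above can be factored into a tiny, testable predicate (names here are illustrative). Keying on request_id rather than sampling randomly per event means every debug event from a sampled request survives together, so you always get complete traces:

```python
def should_send(debug, request_id, sample_per=1000):
    """Debug events survive for 1-in-sample_per request IDs; others always send."""
    if debug and request_id % sample_per != 0:
        # drop debug events for (sample_per - 1) out of sample_per requests
        return False
    return True

# non-debug events are never dropped
assert should_send(debug=False, request_id=7)

# across 10,000 request IDs, debug events survive for exactly 10 of them
kept = sum(1 for rid in range(10000) if should_send(debug=True, request_id=rid))
print(kept)  # 10
```

In a real system you would key on a hash of the request ID rather than the raw integer, so sampling stays uniform even when IDs are not sequential.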
... or drop events under pressure

Culture of testing? Awesome! Make it easy to unit-test instrumentation:

def test_instrumentation_called():
    event_sink = baseline.MockTransport()
    with test_request() as req:
        login(req)
    assert len(event_sink.recorded_events()) > 0
    assert "user_id" in event_sink.recorded_events()[0]
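A self-contained sketch of that test: baseline.MockTransport, test_request, and login are the talk's assumed names, so minimal stand-ins are defined here to let the assertion pattern run end to end.

```python
class MockTransport:
    """Test double that records events instead of shipping them."""
    def __init__(self):
        self._events = []

    def send(self, event):
        self._events.append(event)

    def recorded_events(self):
        return self._events

def login(req, sink):
    # code under test: must emit an event carrying the user's ID
    sink.send({"event": "login", "user_id": req["user_id"]})

def test_instrumentation_called():
    event_sink = MockTransport()
    login({"user_id": "u_123"}, event_sink)
    assert len(event_sink.recorded_events()) > 0
    assert "user_id" in event_sink.recorded_events()[0]

test_instrumentation_called()
```

Because the transport is injected, instrumentation becomes just another observable output of the function, testable like any return value.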