No-Frills Observability

No-Frills Observability Eben Freeman @_emfree_ | [email protected]

Hi, I’m Eben! currently: honeycomb.io follow along at speakerdeck.com/emfree/no-frills-observability

What this talk is not about
- Carefully defining the word “observability”
- Problems at very large scale
- Exceptionally challenging instrumentation problems (very high-perf / low-level)

What this talk is not about "I HAVE NO TOOLS
BECAUSE I’VE DESTROYED MY TOOLS WITH MY TOOLS. My only logging option is to hire monks to transcribe the subjective experience of watching my machines die as I weep tears of blood." -- James Mickens, The Night Watch https://www.usenix.org/system/files/1311_05-08_mickens.pdf

What this talk is about
- A high-leverage approach to improving observability for your organization.

Assumptions
- You work on a software product with users
- You care about those users
- You write services and run them in production
- It’s valuable to understand what happens in production, in order to:
  - Diagnose major issues
  - Understand what your users are doing (or not doing)
  - Measure the effect of changes
  - Identify general improvements

An Observability Stack
- Code instrumentation: emit data from the services you own
- Transport: send it somewhere
- Storage and retrieval: retain the data so you can ask questions

An Observability Stack
- Code instrumentation: log.Error(...), statsd.Increment(...), opentracing.StartSpan(...)
- Transport: Kafka, rsyslog, Logstash, HTTP, ...
- Storage and retrieval: Elasticsearch/Kibana, Graphite, Zipkin/Jaeger, ...

But wait
Raise your hand if you’ve said or heard something like this:
- “When there’s an error, I don’t have enough context to figure out why it happened”
- “We have all of these metrics, but I don’t know what any of them really mean”
- “I want to try a new tool, but that requires changing everything in our code”
- “I know I should instrument this new codepath, but I don’t know what’ll be useful”
These are essentially cultural problems, not intrinsically hard problems in computer science.

Your observability stack: the missing layer
Ad-hoc instrumentation is doomed to mediocrity!
(metrics, logs, traces)

Your observability stack: the missing layer
- Code instrumentation
- Transport
- Storage and retrieval
- Instrumentation patterns

Your observability stack: the missing layer
Mature organizations tend to develop their own instrumentation library and API. This is a way to encourage effective practices and make it easier for everyone to write observable software.

Your observability stack: the missing layer
This is high-leverage work. You:
- Provide a blueprint for others to follow
- Ensure that instrumentation captures the right context
- Make it easier to evolve your observability over time

"Isn't this problem solved by OpenTracing, OpenCensus, Veneur, etc?" -
To a point! - It's generally easier to adopt and evolve an internal API - The authors of these projects may not have perfectly anticipated your needs - Wrapping / adapting existing frameworks is arguably most effective! Hang on opentracing.io opencensus.io github.com/stripe/veneur

Some elements of your system are necessarily bespoke
You probably didn’t …
- write your own database -- but you do have your own schema
- write your own HTTP router -- but you do have your own handlers
- write your own JavaScript framework -- but you do have your own UI components.
What you build on top of foundations embodies what matters to you:
- Your data model, your service, your visual style
This work deserves thought and attention

Getting Real

Instrumentation Patterns
Let’s sketch the outline of such a library:
- Start with structure
- Add context
- Encourage explanation
- Abstract transport
- Embrace pragmatism
Five examples of ways to help your team, < 10 lines of code each

Start with structure
"I can't even figure out where to start looking in the logs."
Ditch strings:
- Strive for self-describing data
- Avoid format strings in logging code

Start with structure
This is not self-describing data:
    127.0.0.1 - - [12/Oct/2017 17:36:36] "GET /login HTTP/1.1" 200 -
This is:
    {
      "upstream_address": "127.0.0.1",
      "date": "2017-10-12T17:36:36",
      "request_method": "GET",
      "request_path": "/login",
      "status": 200
    }

Start with structure
Avoid direct string formatting:
    def login():
        try:
            # . . .
        except Exception as e:
            log.info("Error logging in: %s", e)

Start with structure
Make it easy to emit structured data:
    import baseline  # This is our instrumentation API
    def login():
        try:
            # …
        except Exception as e:
            baseline.log(error=e, endpoint="login")
    # baseline library
    import json
    def log(**data):
        # you can ONLY pass key-value arguments
        # default=str stringifies values json cannot encode, such as the exception
        print(json.dumps(data, default=str))
Structure becomes the default
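To make the "structure becomes the default" point concrete, here is a minimal runnable sketch of a keyword-only log function. The render helper, sort_keys, and default=str are assumptions added for illustration; they are not part of the baseline library shown on the slide.

```python
import json


def render(**data):
    """Serialize keyword arguments as one JSON object per event."""
    # Assumption: default=str turns values json cannot encode (exceptions,
    # datetimes, ...) into strings instead of raising TypeError.
    return json.dumps(data, default=str, sort_keys=True)


def log(**data):
    """Emit one structured event on stdout."""
    print(render(**data))


try:
    raise ValueError("bad password")
except Exception as e:
    log(error=e, endpoint="login")
```

Because log accepts only keyword arguments, a call such as log("something broke") fails with a TypeError, so a free-form string cannot slip in by accident.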

Add context
“When there’s an error, I don’t have enough context to figure out what happened”
Attaching additional context to instrumentation is generally cheap. Not having that context is generally expensive.
Examples:
- Customer ID
- User ID
- Build ID or git SHA
- Request ID
- Feature flags
- Call site (function name and line number)
Automate when possible
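One way the "call site" item could be automated is sketched below, using the standard inspect module. The field names call_function and call_line are hypothetical, and log returns the serialized line only so the result is easy to inspect.

```python
import inspect
import json


def log(**data):
    """Emit a structured event tagged with the caller's function and line."""
    # inspect.currentframe() may return None on interpreters without frame
    # support, so the call-site fields are best-effort.
    frame = inspect.currentframe()
    caller = frame.f_back if frame is not None else None
    if caller is not None:
        data["call_function"] = caller.f_code.co_name
        data["call_line"] = caller.f_lineno
    line = json.dumps(data, default=str)
    print(line)
    return line


def login():
    # The caller never mentions its own name or line number.
    return log(endpoint="login", user_id=42)


login()
```

The caller pays nothing for this context, which is the point of automating it in the library rather than at each call site.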

Add context
Context might be scoped to a process, a function, a request . . .
    # baseline library
    import json
    import os
    env = os.environ["ENV"]
    build_id = os.environ["BUILD_ID"]
    def log(**data):
        data["environment"] = env
        data["build_id"] = build_id
        print(json.dumps(data))

Add context
An example: migrating services to Kubernetes
- Attach an infra_type field to all events
- Very useful during the migration, not needed otherwise
- One-line change

Incremental context
Make it easy to incrementally build up context -- deal in events
    def login():
        event = baseline.Event()
        event.bind("endpoint", "login")
        user = authUser(request)
        event.bind("user_id", user.id)
        # . . .
        event.send()

Incremental context
    # baseline
    import json
    class Event(object):
        def __init__(self):
            self.data = {}
        def bind(self, key, value):
            self.data[key] = value
        def send(self):
            print(json.dumps(self.data))

Incremental context
Idea: automatically time this duration
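The timing idea could be sketched by making the event a context manager, so the duration is recorded without the caller doing anything. This is one possible shape, not the talk's implementation; the duration_ms and error field names are assumptions.

```python
import json
import time


class Event(object):
    def __init__(self):
        self.data = {}

    def bind(self, key, value):
        self.data[key] = value

    def send(self):
        print(json.dumps(self.data, default=str))

    def __enter__(self):
        # monotonic() is not affected by wall-clock adjustments.
        self._start = time.monotonic()
        return self

    def __exit__(self, exc_type, exc, tb):
        elapsed = time.monotonic() - self._start
        self.bind("duration_ms", elapsed * 1000.0)
        if exc is not None:
            self.bind("error", repr(exc))
        self.send()
        return False  # never swallow the caller's exception


with Event() as event:
    event.bind("endpoint", "login")
    time.sleep(0.01)  # stand-in for real work
```

The event is sent on exit even when the body raises, so failed requests still carry their duration and the error.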

Encourage explanation
“We have all of these metrics and fields, but I don’t know what any of them mean.”
    event.bind("memory_inuse_merge_max", max_alloc)

Encourage explanation
    event.bind("memory_inuse_merge_max", max_alloc,
               help="Maximum measured heap size during the merge phase of a query")
- send-on-first-use
- just treat as documentation-in-code
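Here is one reading of "send-on-first-use", sketched under assumptions: the help text travels with the first event in this process that binds the key, and later events omit it. The _help field and the payload method are hypothetical names.

```python
import json

_described = set()  # keys whose help text this process has already sent


class Event(object):
    def __init__(self):
        self.data = {}
        self.help = {}

    def bind(self, key, value, help=None):
        self.data[key] = value
        # Attach the description only the first time the key is seen, so
        # subsequent events stay small.
        if help is not None and key not in _described:
            _described.add(key)
            self.help[key] = help

    def payload(self):
        out = dict(self.data)
        if self.help:
            out["_help"] = dict(self.help)
        return out

    def send(self):
        print(json.dumps(self.payload(), default=str))


HELP = "Maximum measured heap size during the merge phase of a query"

first = Event()
first.bind("memory_inuse_merge_max", 1024, help=HELP)
first.send()   # carries the help text

second = Event()
second.bind("memory_inuse_merge_max", 2048, help=HELP)
second.send()  # same key, help text omitted
```

Even if the transport ignores the help text entirely, the argument still works as documentation sitting next to the instrumentation it describes.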

Abstract away transport details
You probably have multiple tools for
- logs
- metrics
- exceptions
- traces
- etc.
Abstracting their protocol details is generally worthwhile!

Abstract away transport details
"Reading JSON logs when I'm developing locally is hard"
Responses:
✗ Shut up and eat your vegetables!
✔ Can we fix that?

Abstract away transport details
"Reading JSON logs when I'm developing locally is hard"
- output JSON or protobufs or whatever in production, pretty-print in development:
    # baseline
    import sys
    running_in_terminal = sys.stdin.isatty()
    class Event(object):
        def send(self):
            if running_in_terminal:
                pretty_print(self.data)
            else:
                # ...
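A self-contained sketch of the same switch follows. Two things here are assumptions rather than the slide's code: it inspects the output stream instead of stdin (so piping output to a file also yields JSON), and format_event is a hypothetical stand-in for pretty_print.

```python
import io
import json
import sys


def format_event(data, pretty):
    """Render one event for a human (aligned columns) or a machine (JSON)."""
    if pretty:
        width = max((len(key) for key in data), default=0)
        return "\n".join(
            "%s  %s" % (key.ljust(width), data[key]) for key in sorted(data)
        )
    return json.dumps(data, default=str, sort_keys=True)


class Event(object):
    def __init__(self, stream=None):
        self.data = {}
        self.stream = stream if stream is not None else sys.stdout

    def bind(self, key, value):
        self.data[key] = value

    def send(self):
        # Pretty-print only when a person is likely watching a terminal.
        pretty = self.stream.isatty()
        self.stream.write(format_event(self.data, pretty) + "\n")


buf = io.StringIO()  # not a terminal, so this event is written as JSON
event = Event(stream=buf)
event.bind("endpoint", "login")
event.bind("status", 200)
event.send()
```

Callers bind fields the same way in both environments; only the library decides how the event leaves the process.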

Embrace pragmatism
“I can't figure this out without debug logs, and we don't have those in prod”
    class Event(object):
        def __init__(self, request_id, debug=False):
            # ...
        def send(self):
            if self.debug and self.request_id % 1000 != 0:
                # drop debug events for 999 out of 1000 requests
                return
            # otherwise, actually send the event

Embrace pragmatism
- record debug logs for 1 in 1000 requests
- may seem like a hack
- but does help debug any problem affecting more than 0.1% of requests
- three lines of code
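The slide's modulo check assumes integer request IDs. If IDs are strings such as UUIDs, a stable hash can pick the sampled requests instead. The sketch below uses zlib.crc32 for that; keep_debug and SAMPLE_RATE are hypothetical names, and the choice of hash is an assumption.

```python
import json
import zlib

SAMPLE_RATE = 1000  # keep debug events for roughly 1 in 1000 requests


def keep_debug(request_id, sample_rate=SAMPLE_RATE):
    """Decide, deterministically per request, whether debug events are kept."""
    # Every event of one request maps to the same bucket, so a sampled
    # request keeps its complete debug trail.
    bucket = zlib.crc32(str(request_id).encode("utf-8")) % sample_rate
    return bucket == 0


class Event(object):
    def __init__(self, request_id, debug=False):
        self.request_id = request_id
        self.debug = debug
        self.data = {"request_id": request_id, "debug": debug}

    def bind(self, key, value):
        self.data[key] = value

    def send(self):
        if self.debug and not keep_debug(self.request_id):
            return None  # dropped: debug event for an unsampled request
        line = json.dumps(self.data, default=str)
        print(line)
        return line
```

Non-debug events are always sent; only debug events are subject to sampling, which keeps the production volume bounded.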

More ideas
Safeguards:
- make sure transport doesn't block the critical path
- truncate ludicrously large events
- rate-limit or drop events under pressure
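The three safeguards could be combined roughly as follows: a bounded queue drained by a background thread, so callers never wait on delivery. This is a sketch under assumptions. BackgroundTransport, both limits, and the replace-with-a-stub form of truncation are all hypothetical choices.

```python
import json
import queue
import threading

MAX_EVENT_BYTES = 64 * 1024  # hypothetical size limit per event
MAX_QUEUED_EVENTS = 1000     # hypothetical backlog limit


class BackgroundTransport(object):
    def __init__(self, deliver, max_queued=MAX_QUEUED_EVENTS):
        self._deliver = deliver  # callable that ships one serialized event
        self._queue = queue.Queue(maxsize=max_queued)
        self.dropped = 0
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, data):
        line = json.dumps(data, default=str)
        size = len(line.encode("utf-8"))
        if size > MAX_EVENT_BYTES:
            # Replace an oversized event with a small stub recording its size.
            line = json.dumps({"truncated": True, "original_bytes": size})
        try:
            self._queue.put_nowait(line)  # never blocks the caller
        except queue.Full:
            self.dropped += 1  # shed load instead of slowing the request

    def _run(self):
        while True:
            line = self._queue.get()
            try:
                self._deliver(line)
            except Exception:
                pass  # a failing backend must not kill the worker
            finally:
                self._queue.task_done()

    def flush(self):
        """Block until every queued event has been handed to deliver."""
        self._queue.join()
```

Tracking the dropped count matters: if events are being shed, that is itself something worth reporting.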

More ideas
Unit tests for instrumentation:
- Culture of testing? Awesome!
- make it easy to unit-test instrumentation
    def test_instrumentation_called():
        event_sink = baseline.MockTransport()
        with test_request() as req:
            login(req)
        assert len(event_sink.recorded_events()) > 0
        assert "user_id" in event_sink.recorded_events()[0]
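A mock transport needs very little code. The sketch below passes the transport to the event explicitly, which is an assumption; the slide's baseline.MockTransport() appears to install itself globally. The login function here is a simplified stand-in for a real handler.

```python
class MockTransport(object):
    """Records events in memory instead of sending them anywhere."""

    def __init__(self):
        self._events = []

    def send(self, data):
        self._events.append(dict(data))  # copy, so later binds don't leak in

    def recorded_events(self):
        return list(self._events)


class Event(object):
    def __init__(self, transport):
        self.transport = transport
        self.data = {}

    def bind(self, key, value):
        self.data[key] = value

    def send(self):
        self.transport.send(self.data)


def login(user_id, transport):
    # Simplified handler: a real one would authenticate before binding.
    event = Event(transport)
    event.bind("endpoint", "login")
    event.bind("user_id", user_id)
    event.send()


def test_instrumentation_called():
    event_sink = MockTransport()
    login(42, event_sink)
    events = event_sink.recorded_events()
    assert len(events) == 1
    assert events[0]["user_id"] == 42
```

With this in place, removing the user_id bind from login fails a test, so instrumentation is protected from accidental deletion like any other behavior.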

Key points
- Every one of these is an incremental improvement
- Not a multi-month project
- You choose the pain points to prioritize
- Small lines-of-code changes, large team impact

In conclusion
Invest in instrumenting:
- Start with structure
- Add context
- Encourage explanation
- Abstract transport
- Embrace pragmatism
Improve incrementally:
- Focus on the problems that matter to you
- Solicit team feedback

Thank you!
heckle at: @_emfree_ | [email protected]
These slides: speakerdeck.com/emfree/no-frills-observability
P.S. Please come say hi, I have duct tape (for your infra)
