known unknowns
  complex/complicated context
  concepts we need to learn in order to progress
  for example: new APIs, system architectures
  but also: why does the system behave like this?
unknown knowns
  intuition/muscle memory
  years of practice and experience
  example: your mother tongue
  https://www.freecodecamp.org/news/how-to-discover-your-unknown-knowns/
"The holy grail of observability is the ability to be able to ask any question, understand any previously unseen state your system may get itself into; without having to ship new code to handle that state (bc that implies you knew enough to predict it)"
  -- Charity Majors (@mipsytipsy)
Logging
  controlled and used by devs (mostly)
  you wouldn't want to check each of 20 replicas by hand, would you?
  centralized logging
  the Twelve-Factor App showed us the way
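A minimal sketch of what "centralized logging" asks of the application, assuming the Twelve-Factor approach of writing logs as an event stream to stdout and letting the platform collect them. `JsonFormatter`, `configure_logging` and the `REPLICA_ID` environment variable are hypothetical names for illustration, not part of the talk.

```python
import json
import logging
import os
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line for a log aggregator."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # REPLICA_ID is a hypothetical variable set per replica at deploy time.
            "replica": os.environ.get("REPLICA_ID", "unknown"),
        })


def configure_logging(stream=sys.stdout):
    """Send all logs to one stream; the platform routes it to central storage."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers = [handler]
    root.setLevel(logging.INFO)


configure_logging()
logging.getLogger("billing").info("invoice %s sent", 42)
```

One JSON object per line keeps the output of all 20 replicas searchable in one place, and the `replica` field tells you which one spoke.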
logging? what logging? ¯\_(ツ)_/¯

    for process in processes_to_archive:
        with transaction():
            archived_process = archivist.create_from(process)
            process_manager.delete(process)
        self.dispatch("process_archived", archived_process)

  tip: always log destructive actions!
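One way the tip could look in practice, as a sketch rather than the speaker's own fix. The `transaction()` stand-in and the list-based `archive`/`live` stores are assumptions made so the example runs on its own.

```python
import logging
from contextlib import contextmanager

logger = logging.getLogger("archiver")


@contextmanager
def transaction():
    """Stand-in for a real transaction, which would commit or roll back."""
    yield


def archive_processes(processes, archive, live):
    """Move each process from `live` to `archive`, logging before and after."""
    for process in processes:
        logger.info("Archiving process %s", process)
        with transaction():
            archive.append(process)
            live.remove(process)  # the destructive step
        logger.info("Process %s archived and deleted", process)
```

Logging both before and after the delete means a crash in between still leaves a trace of which process was being touched.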
does this code answer why?

    def activate_process(self, process):
        if (
            process.name
            and self.is_valid_name(process.name)
            and process.is_approved
            and process.owner.is_supervisor()
        ):
            self.process_controller.set_active(process)

  but does this code answer why not?
does this code answer why not?

    def activate_process(self, process):
        if not process.name:
            logger.warning(f"Process {process} has empty name")
            return
        if not self.is_valid_name(process.name):
            logger.warning(f"Process name {process.name} isn't valid")
            return
        if not process.is_approved:
            logger.warning(f"Process {process} requires approval")
            return
        # ...
Targeted logging
  provide detailed, DEBUG level logs in production
  for specific services/users with issues
  without redeploying
  https://tersesystems.com/blog/2019/07/22/targeted-diagnostic-logging-in-production/
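A rough sketch of the idea using only the standard `logging` module: the logger always produces DEBUG records, and a filter decides at runtime which of them get through. `TargetedDebugFilter`, the `user_id` record attribute and the `debug_users` set are hypothetical; this is not the mechanism described in the linked post, and a real setup would refresh the set from a feature flag or config service.

```python
import logging


class TargetedDebugFilter(logging.Filter):
    """Pass INFO and above for everyone, DEBUG only for targeted users."""

    def __init__(self):
        super().__init__()
        self.debug_users = set()  # mutable at runtime, e.g. from a feature flag

    def filter(self, record):
        if record.levelno > logging.DEBUG:
            return True
        return getattr(record, "user_id", None) in self.debug_users


targeted = TargetedDebugFilter()
handler = logging.StreamHandler()
handler.addFilter(targeted)

logger = logging.getLogger("checkout")
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)
logger.propagate = False

logger.debug("cart: %s", ["book"], extra={"user_id": "alice"})  # dropped
targeted.debug_users.add("alice")  # flip the switch, no redeploy
logger.debug("cart: %s", ["book"], extra={"user_id": "alice"})  # emitted
```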
Metrics
  numeric data measured over time
  controlled by devs, used by everyone
  need smart aggregation/retention rules
  https://blog.digitalocean.com/observability-and-metrics/
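To make "aggregation/retention rules" concrete, here is a small sketch of a rollup: raw samples are collapsed into fixed-width time buckets so older data can be kept at lower resolution. The `rollup` function and the `(timestamp, value)` sample shape are assumptions for illustration, not any particular metrics system's API.

```python
from collections import defaultdict
from statistics import mean


def rollup(samples, bucket_seconds=60):
    """Aggregate (timestamp, value) samples into fixed-width time buckets."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Align each sample to the start of its bucket.
        buckets[timestamp - timestamp % bucket_seconds].append(value)
    return {
        start: {"avg": mean(values), "max": max(values), "count": len(values)}
        for start, values in sorted(buckets.items())
    }
```

Keeping `max` and `count` next to `avg` is the "smart" part: an average alone hides spikes, and without counts the buckets cannot be merged correctly later.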
Distributed tracing
  correlate the flow of events across a distributed system
  essential in the world of microservices
  Zipkin, Jaeger, Lightstep, OpenTracing, however...
  https://thenewstack.io/opentracing-opencensus-merge-into-a-single-new-project-opentelemetry/
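The core trick behind correlating events is small: every service reuses the trace ID it received and passes it on. A minimal sketch, assuming a made-up `X-Trace-Id` header; Zipkin, Jaeger and OpenTelemetry each define their own propagation formats, and the function names here are hypothetical.

```python
import contextvars
import uuid

# Hypothetical header name; real tracers define their own propagation format.
TRACE_HEADER = "X-Trace-Id"
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


def start_or_continue_trace(incoming_headers):
    """Reuse the caller's trace ID if present, otherwise start a new trace."""
    trace_id = incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex
    current_trace_id.set(trace_id)
    return trace_id


def outgoing_headers():
    """Headers to attach to downstream calls so they join the same trace."""
    return {TRACE_HEADER: current_trace_id.get()}
```

A service would call `start_or_continue_trace(request_headers)` on entry and merge `outgoing_headers()` into every downstream request, so logs from all hops share one ID.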
*three pillars is bullshit
  each pillar is flawed
  observability requires:
    high throughput
    high cardinality
    no sampling
    long retention
  all at once (and a pony)
  https://www.infoq.com/news/2019/02/rethinking-observability/