Rain
April 12, 2022

System Observability: We can improve only what we can observe

Transcript

  1. Definition on Wikipedia

    Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.
  2. While in Development

    What we do can be summed up as:
    - Handle incoming operations
      - Given a series of operations
    - Track system status
      - Check attributes at specific checkpoints
    - Understand the system and act accordingly
      - Fix bugs, ready to release if it works well, …
  3. While in Production

    What we do can be summed up as:
    - Handle incoming operations
      - Which are heavier and more uncertain
    - Track system status
      - With a more organized and intuitive presentation
    - Understand the system and act accordingly
      - Identify performance bottlenecks, investigate doubtful requests, …
  4. Metrics

    Usually a quantified value obtained by aggregating a series of data points, often associated with alert trigger rules.
    - System Level
      - CPU, RAM, disk, disk I/O, …
    - Programming Language Runtime
      - GC time average, #goroutines, …
    - Application Level
      - QPS/RPS, 5xx rate, response time, #connections, …
  5. Metrics

    Usually a quantified value obtained by aggregating a series of data points, often associated with event trigger rules.
    - External Services
      - AWS service liveness, payment API latency, …
    - Business Logic Level
      - Video playback success rate (used by @Netflix)
      - Story post success rate (maybe for social networking services)
      - Advertisement push success rate (maybe for martech services)
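
The idea of a metric as an aggregate over data points, paired with a trigger rule, can be sketched in a few lines of Python. This is only an illustration, not the deck's implementation: the `Request` record, the function names, and the 5% threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    """One observed request (hypothetical record shape)."""
    status: int          # HTTP status code
    duration_ms: float   # response time in milliseconds


def error_rate_5xx(requests):
    """Fraction of requests in the window whose status is 5xx."""
    if not requests:
        return 0.0
    errors = sum(1 for r in requests if 500 <= r.status <= 599)
    return errors / len(requests)


def avg_response_ms(requests):
    """Mean response time over the window; 0.0 for an empty window."""
    return mean(r.duration_ms for r in requests) if requests else 0.0


def should_alert(requests, max_5xx_rate=0.05):
    """Hypothetical trigger rule: fire when the 5xx rate exceeds the threshold."""
    return error_rate_5xx(requests) > max_5xx_rate


# A small window of raw data points to aggregate.
window = [
    Request(200, 12.0),
    Request(200, 18.0),
    Request(503, 250.0),
    Request(200, 20.0),
]
print(error_rate_5xx(window))   # 0.25
print(avg_response_ms(window))  # 75.0
print(should_alert(window))     # True
```

A business-level metric such as a playback or post success rate would follow the same shape: count successes over a window, divide by the total, and compare against a threshold.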
  6. Instrumentation Tools

    Take advantage of the encapsulated SDK / API and the uniform specifications.
    - OpenTelemetry
    - Prometheus
    - Grafana
    - ELK / EFK stack
    - …
  7. Outputs Design

    - What are the appropriate metrics?
    - What information should be logged?
    - What tasks should be traced?
  8. Start with the Questions

    - What are the appropriate metrics?
      - Is the app process killed by the OS due to running out of memory?
      - Do journey-type campaigns currently work well?
    - What information should be logged?
      - Which processing stage goes wrong when a campaign is oversized?
      - Does a particular campaign go wrong due to an AWS SQS send-message error?
    - What tasks should be traced?
      - Which database query is the critical performance bottleneck?
      - Is each dynamic content variable passed to the corresponding parser correctly?
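
The logging and tracing questions above can be made concrete with a minimal sketch using only the Python standard library. It assumes a hypothetical `send_campaign` task with two stages; the logger name, the field names, and the in-memory `spans` list are illustrative choices, not the deck's actual tooling (a real setup would more likely export to something like OpenTelemetry).

```python
import json
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("campaign")  # hypothetical logger name
spans = []  # (name, duration_seconds) pairs, kept in memory for this sketch


def log_event(message, **fields):
    """Emit one JSON log line and return it so callers can inspect it."""
    line = json.dumps({"message": message, **fields}, sort_keys=True)
    logger.info(line)
    return line


@contextmanager
def traced(name):
    """Record how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, time.perf_counter() - start))


def send_campaign(campaign_id, recipients):
    """Hypothetical task split into two traced stages."""
    with traced("load_recipients"):
        loaded = list(recipients)
    with traced("enqueue_messages"):
        # Logging the campaign size helps answer "which stage fails when oversized?"
        log_event("enqueue", campaign_id=campaign_id, size=len(loaded))
    return len(loaded)


send_campaign("c-1", ["a@example.com", "b@example.com"])
print([name for name, _ in spans])  # ['load_recipients', 'enqueue_messages']
```

Comparing the recorded durations per stage is one simple way to spot which step is the bottleneck, while the structured log fields make it possible to filter by campaign when something goes wrong.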
  9. It should be obvious

    [Diagram labels] system operates abnormally; impact on our customers; check monitors or dashboards; come up with solutions; trace source code; ask the author for help; …; find clues in the databases
  10. A big step forward, but it needs time

    [Diagram labels] system operates abnormally; impact on our customers; check monitors or dashboards; come up with solutions; trace source code; ask the author for help; …; find clues in the databases
  11. And Go Beyond

    [Diagram labels] system operates abnormally; impact on our customers; check monitors or dashboards; come up with solutions; trace source code; ask the author for help; …; find clues in the databases