System Observability: We can improve only what we can observe

System Observability 101
We can improve only what we can observe

Definition on Wikipedia
Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.

How we achieve observability in development

While Developing an Air Conditioner

And We Usually Do Something Like …

While in Development
What we do can be summed up as:
- Handle incoming operations
  - Given a series of operations
- Track system status
  - Check attributes at specific checkpoints
- Understand the system and act accordingly
  - Fix bugs, get ready to release if it works well, …
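The development loop above can be sketched in a few lines. This is only an illustration under assumptions: the `AirConditioner` class, its operations, and its default values are hypothetical and do not come from the deck.

```python
from dataclasses import dataclass, field


@dataclass
class AirConditioner:
    """Hypothetical device model used only for this illustration."""
    powered: bool = False
    target_temp: int = 26
    history: list = field(default_factory=list)  # one status snapshot per operation

    def handle(self, operation: str) -> None:
        """Handle one incoming operation, then record the resulting status."""
        if operation == "power":
            self.powered = not self.powered
        elif operation == "temp_up" and self.powered:
            self.target_temp += 1
        elif operation == "temp_down" and self.powered:
            self.target_temp -= 1
        # Track system status after every operation.
        self.history.append((operation, self.powered, self.target_temp))


def run_scenario() -> AirConditioner:
    ac = AirConditioner()
    # Given a series of operations.
    for op in ["temp_up", "power", "temp_down", "temp_down"]:
        ac.handle(op)
    # Check attributes at specific checkpoints.
    assert ac.history[0] == ("temp_up", False, 26)  # ignored while powered off
    assert ac.powered is True
    assert ac.target_temp == 24
    return ac


if __name__ == "__main__":
    for snapshot in run_scenario().history:
        print(snapshot)
```

In development the inputs are chosen by us and the checkpoints are plain assertions, which is exactly what stops being possible once real traffic arrives.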

While on Production
What we do can be summed up as:
- Handle incoming operations
  - Which are heavier and more uncertain
- Track system status
  - With a more organized and intuitive presentation
- Understand the system and act accordingly
  - Identify performance bottlenecks, investigate doubtful requests, …

How should we achieve observability on production

Provide External Outputs
- Metrics
- Logs
- Trace
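To make the three outputs concrete, here is a minimal sketch of one request handler emitting all of them. Everything named here is an assumption for illustration: `Telemetry` is a hypothetical in-memory sink standing in for real backends, and `handle_request` and the `/broken` path are invented.

```python
import json
import time
import uuid
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Telemetry:
    """Hypothetical in-memory sink standing in for real backends."""
    metrics: Counter = field(default_factory=Counter)  # aggregated counts
    logs: list = field(default_factory=list)           # structured log lines
    spans: list = field(default_factory=list)          # timed units of work


def handle_request(telemetry: Telemetry, path: str) -> int:
    trace_id = uuid.uuid4().hex  # ties the log line and the span together
    start = time.perf_counter()
    status = 500 if path == "/broken" else 200  # stand-in for real work
    elapsed = time.perf_counter() - start

    # Metric: a counter that can later be aggregated into a rate.
    telemetry.metrics[f"responses.{status}"] += 1
    # Log: one structured event describing what happened.
    telemetry.logs.append(
        json.dumps({"trace_id": trace_id, "path": path, "status": status})
    )
    # Trace: how long this unit of work took.
    telemetry.spans.append(
        {"trace_id": trace_id, "name": "handle_request", "seconds": elapsed}
    )
    return status
```

Sharing one `trace_id` between the log line and the span is a design choice in this sketch: it lets a single slow or failed request be followed across outputs.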

Metrics
Usually a quantified value obtained by aggregating a series of data points, often associated with alert trigger rules
- System Level
  - CPU, RAM, disk, disk I/O, …
- Programming Language Runtime
  - GC time average, #goroutines, …
- Application Level
  - QPS/RPS, 5xx rate, response time, #connections, …
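As a sketch of "aggregating a series of data points" and attaching alert rules, the following turns raw request samples into application-level metrics. The thresholds in `ALERT_RULES` and the function names are assumptions chosen for illustration, not values from the deck.

```python
import math
from statistics import mean

# Hypothetical alert thresholds: fire when a metric exceeds its limit.
ALERT_RULES = {"rate_5xx": 0.05, "p95_seconds": 1.0}


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty sequence, pct in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(samples):
    """Aggregate (status_code, response_seconds) data points into metrics."""
    durations = [seconds for _, seconds in samples]
    errors = sum(1 for code, _ in samples if 500 <= code <= 599)
    return {
        "count": len(samples),
        "avg_seconds": mean(durations),
        "p95_seconds": percentile(durations, 95),
        "rate_5xx": errors / len(samples),
    }


def firing_alerts(summary):
    """Return the names of the alert rules whose threshold is exceeded."""
    return [name for name, limit in ALERT_RULES.items() if summary[name] > limit]


if __name__ == "__main__":
    window = [(200, 0.1)] * 18 + [(500, 2.0)] * 2
    summary = summarize(window)
    print(summary)
    print(firing_alerts(summary))
```

The individual data points are discarded after aggregation; only the summary and the rules evaluated against it need to be kept.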

Metrics
Usually a quantified value obtained by aggregating a series of data points, often associated with event trigger rules
- External Services
  - AWS Service liveness, Payment API latency, …
- Business Logic Level
  - Video playback success rate (used by @Netflix)
  - Story post success rate (maybe for social networking services)
  - Advertisement push success rate (maybe for martech services)
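A business-level success rate can be tracked with very little machinery. The sketch below keeps a sliding window of recent outcomes; the `SuccessRate` class and the window size are hypothetical and only show the shape of such a metric.

```python
from collections import deque
from typing import Optional


class SuccessRate:
    """Success rate over the most recent `window` outcomes (illustrative only)."""

    def __init__(self, window: int = 100):
        # Older outcomes fall off automatically once the window is full.
        self.outcomes = deque(maxlen=window)

    def record(self, ok: bool) -> None:
        """Record one business event, e.g. one story post attempt."""
        self.outcomes.append(ok)

    def value(self) -> Optional[float]:
        """Fraction of successes in the window, or None if nothing was recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)


if __name__ == "__main__":
    story_posts = SuccessRate(window=4)
    for ok in [True, True, False, True]:
        story_posts.record(ok)
    print(story_posts.value())
```

A metric like this can drop while CPU, memory, and 5xx rate all look healthy, which is why it is worth tracking separately from system-level metrics.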

Trace
Grafana

Trace
djdt/flamegraph

Practice System Observability

Instrumentation Tools
Take advantage of the encapsulated SDK / API and the uniform specifications
- OpenTelemetry
- Prometheus
- Grafana
- ELK / EFK stack
- …

Outputs Design
- What are the appropriate metrics?
- What information should be logged?
- What tasks should be traced?

Objective
Be able to answer questions about the state of the system

Start with the Questions
- What are the appropriate metrics?
  - Is the app process killed by the OS due to running out of memory?
  - Do journey-type campaigns currently work well?
- What information should be logged?
  - Which processing stage goes wrong when a campaign is oversized?
  - Did a particular campaign go wrong due to an AWS SQS send-message error?
- What tasks should be traced?
  - Which database query is the critical performance bottleneck?
  - Is each dynamic content variable passed to the corresponding parser correctly?
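Taking the tracing question about the slowest database query as an example, a minimal sketch is to wrap each unit of work in a timed span and then ask which one took longest. The `Tracer` class and the span names are hypothetical; a real setup would more likely use a tracing SDK than this hand-rolled version.

```python
import time
from contextlib import contextmanager


class Tracer:
    """Collects (name, seconds) spans; a toy stand-in for a tracing SDK."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock  # injectable so durations can be controlled in tests
        self.spans = []

    @contextmanager
    def span(self, name: str):
        start = self.clock()
        try:
            yield
        finally:
            # Record the span even if the wrapped work raises.
            self.spans.append((name, self.clock() - start))

    def slowest(self):
        """Return the (name, seconds) span with the longest duration."""
        return max(self.spans, key=lambda item: item[1])


if __name__ == "__main__":
    tracer = Tracer()
    with tracer.span("load_campaign"):       # hypothetical query
        time.sleep(0.01)
    with tracer.span("load_recipients"):     # hypothetical query
        time.sleep(0.05)
    print(tracer.slowest()[0])
```

Starting from the question decides what gets wrapped in a span: here only the database calls, because that is what the question asks about.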

The Evolution of Observability

[Diagram repeated on three slides. Labels: system operates abnormally; impact on our customers; check monitors or dashboards; come up with solutions; trace source code; ask the author for help; …; find clues in the databases]
Slide captions, in order:
- It should be obvious
- A big step forward, but it needs time
- And Go Beyond


Rain
