Monitoring in the time of Cloud Native

Velocity 2017, New York, NY

@copyconstruct

Cloud Native

containers, Kubernetes, service meshes, microservices, immutable infrastructure, …

☹ ☹ ☹

an embarrassment of riches!

Decision Making in the time of Cloud Native

It’s tempting, especially when enamored by a new piece of technology that promises the moon, to retrofit our problem space with the solution space of said technology, however minimal or non-existent the intersection

Goal of Talk

A field guide for evaluation

o strengths and weaknesses of each category of tools
o problems they solve
o tradeoffs they make
o ease of adoption/integration into an existing infrastructure

What to “monitor” and how in a cloud native environment?

Monitoring in the time of Cloud Native

monitoring

As we adopt increasingly complex architectures, the number of “things that can go wrong” increases exponentially

era of embracing failure

era of complexity

how do we design monitoring for such systems?
how do we design these systems themselves?

The goal of “monitoring” hasn’t changed, even if the scope has shrunk. The challenge now lies in identifying and minimizing the bits of “monitoring” that still remain human centric.

infrastructure management is becoming more automated
application lifecycle management is becoming harder

Observability is about being able to understand how a system is behaving in production

Monitoring is being on the lookout for failures, which in turn requires us to be able to predict these failures proactively

interlude

blackbox monitoring

“it’s so nice being in an org that communicates quantitatively about systems”

whitebox monitoring

Data are simply facts or figures: bits of information, but not information itself. When data are processed, interpreted, organized, structured or presented so as to make them meaningful or useful, they are called information. Information provides context for data.

purpose driven, not origin driven

The Three Pillars of Observability

logs

both traces and metrics are an abstraction built on top of logs that pre-process and encode information along two orthogonal axes, one being request centric, the other being system centric
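To make the two axes concrete, here is a minimal sketch of deriving both views from the same log events. The event fields (`ts`, `request_id`, `service`, `duration_ms`) and the helper names are assumptions made for illustration and do not come from the talk.

```python
from collections import Counter, defaultdict

# Hypothetical structured log events; every field name here is an assumption.
LOG_EVENTS = [
    {"ts": 1, "request_id": "r1", "service": "frontend", "duration_ms": 12},
    {"ts": 2, "request_id": "r1", "service": "auth", "duration_ms": 5},
    {"ts": 3, "request_id": "r2", "service": "frontend", "duration_ms": 40},
    {"ts": 4, "request_id": "r1", "service": "db", "duration_ms": 30},
]


def to_metrics(events):
    """System-centric view: count events per service, discarding request identity."""
    return Counter(event["service"] for event in events)


def to_traces(events):
    """Request-centric view: the ordered list of services each request touched."""
    traces = defaultdict(list)
    for event in sorted(events, key=lambda e: e["ts"]):
        traces[event["request_id"]].append(event["service"])
    return dict(traces)


metrics = to_metrics(LOG_EVENTS)
traces = to_traces(LOG_EVENTS)
```

The metric view keeps an aggregate per service and loses which request produced it; the trace view keeps the path of each request and loses the aggregate.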

Traces

Instrument specific points in your application, proxy, framework, library, middleware and anything else that might lie in the path of execution of a request
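As a rough illustration of what instrumenting a point on the request path can look like, the sketch below uses a hand-rolled `span` context manager. It is a hypothetical helper, not the API of any tracing library, and the in-memory `FINISHED_SPANS` list stands in for whatever exporter a real tracer would use.

```python
import time
import uuid
from contextlib import contextmanager

# In-memory sink for finished spans; a real tracer would export these.
FINISHED_SPANS = []


@contextmanager
def span(name, trace_id=None, parent_id=None):
    """Record one timed unit of work on the path of a request (hypothetical helper)."""
    record = {
        "trace_id": trace_id or uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex,
        "parent_id": parent_id,
        "name": name,
    }
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_s"] = time.perf_counter() - start
        FINISHED_SPANS.append(record)


def handle_request():
    # Instrumentation points: the handler itself and the downstream call it makes.
    with span("http.handler") as root:
        with span("db.query", trace_id=root["trace_id"], parent_id=root["span_id"]):
            time.sleep(0.001)  # stand-in for real work
    return root["trace_id"]


trace_id = handle_request()
```

Any component that does not pass `trace_id` and `parent_id` along breaks the chain, which is why every hop on the request path needs instrumenting.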

metrics

“a set of numbers that give information about a particular process or activity”

“a list of numbers relating to a particular activity, which is recorded at regular periods of time and then studied. Time series are typically used to study, for example, sales, orders, income, etc.”

evaluation

logs

+1 easy to instrument and generate
+1 provides rich local context
-1 performance of logging libraries
-1 no guaranteed delivery
-1 application performance

“A fun thing I had seen while at [redacted] was that turning off most logging almost doubled performance on the instances we were running on because logs ate through AWS’ EC2 classic’s packet allocations like mad. It was interesting for us to discover that more than 50% of our performance would be lost to trying to control and monitor performance”

-1 no dynamic sampling
-1 buffering might be required
-1 quotas/rate limits
-1 “actionable data”
-1 ELK
-1 $$$$

metrics

+1 metrics transfer and storage have a constant overhead
+1 cheap
+1 statistical & probabilistic analysis
+1 alerting
-1 system scoped

traces

+1 captures the lifetime of requests as they flow through the various components of a distributed system
-1 hard to instrument

“We’ve been implementing a request tracing service for over a year and it’s not complete yet. The challenge with these types of tools is that we need to add code around each span to truly understand what’s happening during the lifetime of our requests. The frustrating part is that if the code is not instrumented or the header is not carrying the ID, that code becomes a risky blind spot for operations”

-1 depends on how causality is tracked
-1 request scoped

Best practices

Logs

o Quotas
o Dynamic Sampling
o Logging is a Stream Processing Problem
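One possible way to combine a quota with dynamic sampling is sketched below: rare log keys are always kept, while noisy keys are thinned once they exceed a per-window quota. The class name, the `quota` and `rate` knobs, and the example log lines are all assumptions for illustration; the talk does not prescribe a specific algorithm.

```python
from collections import defaultdict


class DynamicSampler:
    """Keep every event for rare keys, and 1 in `rate` once a key exceeds its quota.

    `quota` and `rate` are illustrative knobs, not values from the talk.
    """

    def __init__(self, quota=3, rate=10):
        self.quota = quota
        self.rate = rate
        self.seen = defaultdict(int)

    def should_keep(self, key):
        self.seen[key] += 1
        count = self.seen[key]
        if count <= self.quota:
            return True
        # Over quota: keep a deterministic 1-in-`rate` subset of the overflow.
        return (count - self.quota) % self.rate == 0

    def reset(self):
        """Call at the start of each time window so quotas apply per window."""
        self.seen.clear()


sampler = DynamicSampler(quota=3, rate=10)
events = ["GET /health"] * 50 + ["POST /checkout error"] * 2
kept = [e for e in events if sampler.should_keep(e)]
```

With these numbers the 50 health-check lines are reduced to 7, while both checkout errors survive, so the rare and interesting events are not drowned out by the frequent ones.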

Filter to outlier countries from where users viewed this article fewer than 100 times in total

Filter to outlier page loads that performed more than 100 database queries

Or, show me only page loads from Indonesia that took more than 10 seconds to load
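The three queries above could be expressed over raw events roughly as follows. This is a sketch only: the event shape (`country`, `db_queries`, `load_seconds`), the sample data, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical page-load events; the field names are assumptions.
PAGE_LOADS = [
    {"country": "ID", "db_queries": 12, "load_seconds": 11.5},
    {"country": "ID", "db_queries": 140, "load_seconds": 2.0},
    {"country": "US", "db_queries": 8, "load_seconds": 0.4},
    {"country": "US", "db_queries": 101, "load_seconds": 12.0},
    {"country": "IS", "db_queries": 9, "load_seconds": 0.9},
]


def outlier_countries(events, max_views=100):
    """Countries whose total view count is below `max_views`."""
    views = Counter(e["country"] for e in events)
    return sorted(c for c, n in views.items() if n < max_views)


def heavy_query_loads(events, min_queries=100):
    """Page loads that performed more than `min_queries` database queries."""
    return [e for e in events if e["db_queries"] > min_queries]


def slow_loads_from(events, country, min_seconds=10):
    """Page loads from one country that took longer than `min_seconds`."""
    return [
        e for e in events
        if e["country"] == country and e["load_seconds"] > min_seconds
    ]
```

Questions like these need the individual events with their dimensions intact; a pre-aggregated counter cannot answer them after the fact.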

Enriched events: business event + timer/counter/histogram
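A minimal sketch of an enriched event, assuming a JSON line as the wire format: one record carries the business event together with counter-style and timer-style fields. The `checkout` function, the event name, and the field names are hypothetical.

```python
import json
import time


def checkout(cart, emit):
    """Hypothetical business operation that emits one enriched event."""
    start = time.perf_counter()
    total = sum(item["price"] for item in cart)
    event = {
        "event": "checkout.completed",  # the business event
        "item_count": len(cart),        # counter-style field
        "order_total": total,
        "duration_ms": (time.perf_counter() - start) * 1000,  # timer-style field
    }
    emit(json.dumps(event))
    return total


emitted = []
checkout([{"price": 5}, {"price": 7}], emitted.append)
```

Here `emit` is any callable that accepts a string, so the same function can write to a list in a test or to a log shipper in production.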

A new hope for the future: OpenLogging/OpenEvent

“Prometheus is much more than just the server. I see Prometheus as a set of standards and projects, with the server being just one part of a much greater whole”

traces

conclusion

Choose your own Observability Adventure!

Thank you

More Decks by Cindy Sridharan
