containers
kubernetes
service meshes
microservices
immutable infrastructure
…
...
@copyconstruct
Slide 5
Slide 5 text
@copyconstruct
Slide 8
Slide 8 text
☹
☹
☹
@copyconstruct
Slide 10
Slide 10 text
an embarrassment
of riches!
@copyconstruct
Slide 11
Slide 11 text
Decision
Making
in the time of
Cloud Native
@copyconstruct
Slide 12
Slide 12 text
It’s tempting, especially when enamored of a new piece of technology that promises the moon, to retrofit our problem space with the solution space of said technology, however minimal or non-existent the intersection.
@copyconstruct
Slide 13
Slide 13 text
Goal of Talk
@copyconstruct
Slide 14
Slide 14 text
A field guide for
evaluation
@copyconstruct
Slide 15
Slide 15 text
o strengths and weaknesses of each category of tools
o problems they solve
o tradeoffs they make
o ease of adoption/integration into an existing infrastructure
@copyconstruct
Slide 16
Slide 16 text
What to “monitor”
and how
in a cloud native
environment?
@copyconstruct
Slide 17
Slide 17 text
Monitoring in
the time of
Cloud Native
@copyconstruct
Slide 20
Slide 20 text
monitoring
@copyconstruct
Slide 26
Slide 26 text
As we adopt increasingly complex
architectures, the number of
“things that can go wrong”
exponentially increases
@copyconstruct
Slide 27
Slide 27 text
era of embracing failure
@copyconstruct
Slide 28
Slide 28 text
era of complexity
@copyconstruct
Slide 29
Slide 29 text
how do we design monitoring for
such systems?
how do we design these
systems themselves?
@copyconstruct
Slide 30
Slide 30 text
The goal of “monitoring” hasn’t changed, even if the scope has shrunk.
The challenge now lies in identifying and minimizing the bits of “monitoring” that still remain human-centric.
@copyconstruct
Slide 31
Slide 31 text
infrastructure management is becoming more automated
application lifecycle management is becoming harder
@copyconstruct
Slide 32
Slide 32 text
Observability is
about being able to understand
how a system is behaving in production
@copyconstruct
Slide 33
Slide 33 text
Monitoring is being on the
lookout for failures,
which in turn requires us to be able to
predict these failures proactively
@copyconstruct
Slide 34
Slide 34 text
interlude
@copyconstruct
Slide 35
Slide 35 text
blackbox monitoring
@copyconstruct
Slide 37
Slide 37 text
“it’s so nice being
in an org that
communicates quantitatively
about systems”
@copyconstruct
Slide 38
Slide 38 text
whitebox monitoring
@copyconstruct
Slide 41
Slide 41 text
Data are simply facts or figures: bits of information, but not information itself.
When data are processed, interpreted, organized, structured or presented so as to make them meaningful or useful, they are called information. Information provides context for data.
@copyconstruct
Slide 43
Slide 43 text
purpose driven
not
origin driven
@copyconstruct
Slide 56
Slide 56 text
The Three Pillars of
Observability
@copyconstruct
Slide 58
Slide 58 text
logs
@copyconstruct
Slide 61
Slide 61 text
Both traces and metrics are abstractions built on top of logs that pre-process and encode information along two orthogonal axes: one request-centric, the other system-centric.
@copyconstruct
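To make the two axes concrete, here is a minimal Python sketch. It assumes a hypothetical list of structured log events with `service`, `request_id`, `ts` and `duration_ms` fields; none of these names come from the talk. Grouping the same events by service gives a system-centric metric, while grouping them by request gives a request-centric trace.

```python
from collections import defaultdict

# Hypothetical structured log events; field names are illustrative assumptions.
LOGS = [
    {"service": "api", "request_id": "r1", "ts": 0, "duration_ms": 12},
    {"service": "db", "request_id": "r1", "ts": 3, "duration_ms": 7},
    {"service": "api", "request_id": "r2", "ts": 5, "duration_ms": 30},
    {"service": "db", "request_id": "r2", "ts": 9, "duration_ms": 21},
]


def to_metrics(logs):
    """System-centric view: event count and total time per service."""
    totals = defaultdict(lambda: {"count": 0, "total_ms": 0})
    for event in logs:
        bucket = totals[event["service"]]
        bucket["count"] += 1
        bucket["total_ms"] += event["duration_ms"]
    return dict(totals)


def to_traces(logs):
    """Request-centric view: time-ordered hops for each request."""
    traces = defaultdict(list)
    for event in sorted(logs, key=lambda e: e["ts"]):
        traces[event["request_id"]].append((event["service"], event["duration_ms"]))
    return dict(traces)
```

Both functions read the same input; only the grouping key differs, which is the sense in which the two views are orthogonal.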
Slide 62
Slide 62 text
Traces
@copyconstruct
Slide 64
Slide 64 text
Instrument specific points in your application, proxy, framework, library, middleware and anything else that might lie in the path of execution of a request
@copyconstruct
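As a rough illustration of what instrumenting a point in the request path can look like, the sketch below wraps a block of code in a span that records its name and duration. The `span` helper, the `SPANS` list and `handle_request` are hypothetical names invented for this example, not the API of any tracing library.

```python
import time
from contextlib import contextmanager

SPANS = []  # collected spans; a real tracer would export these instead


@contextmanager
def span(name, trace_id):
    """Record how long the wrapped block took as one span of a trace."""
    start = time.perf_counter()
    try:
        yield
    finally:
        SPANS.append({
            "trace_id": trace_id,
            "name": name,
            "duration_s": time.perf_counter() - start,
        })


def handle_request(trace_id):
    # Each instrumented point in the request path gets its own span.
    with span("handle_request", trace_id):
        with span("db_query", trace_id):
            time.sleep(0.001)  # stand-in for real work
        return "ok"
```

Any code on the request path that is not wrapped this way simply does not appear in the trace, which is why coverage across libraries and middleware matters.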
Slide 68
Slide 68 text
metrics
@copyconstruct
Slide 69
Slide 69 text
“a set of numbers that give information
about a particular process or activity”
@copyconstruct
Slide 70
Slide 70 text
“a list of numbers relating to a particular
activity, which is recorded at regular periods
of time and then studied.
Time series are typically used to study, for
example, sales, orders, income, etc.”
@copyconstruct
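The "recorded at regular periods of time" part of that definition can be sketched as follows: raw event timestamps are bucketed into fixed-width periods, and every period gets a point even when nothing happened. The function name and the choice of counting events are assumptions made for illustration.

```python
from collections import Counter


def to_time_series(timestamps, period_s):
    """Count events per fixed-width period, keyed by period start time."""
    counts = Counter((int(ts) // period_s) * period_s for ts in timestamps)
    if not counts:
        return []
    start, end = min(counts), max(counts)
    # Emit every period, including empty ones, so points are regularly spaced.
    return [(t, counts.get(t, 0)) for t in range(start, end + period_s, period_s)]
```

For example, `to_time_series([1, 2, 11, 35], 10)` yields one point per ten-second period from 0 to 30, with a zero for the period that saw no events.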
Slide 74
Slide 74 text
evaluation
@copyconstruct
Slide 75
Slide 75 text
logs
@copyconstruct
Slide 80
Slide 80 text
+1 easy to instrument and generate
+1 provides rich local context
-1 performance of logging libraries
-1 no guaranteed delivery
-1 application performance
@copyconstruct
Slide 81
Slide 81 text
“A fun thing I had seen while at [redacted] was that
turning off most logging almost doubled performance
on the instances we were running on because logs ate
through AWS’ EC2 classic’s packet allocations like mad.
It was interesting for us to discover that more than 50%
of our performance would be lost to trying to control
and monitor performance”
@copyconstruct
Slide 82
Slide 82 text
+1 easy to instrument and generate
+1 provides rich local context
-1 performance of logging libraries
-1 no guaranteed delivery
-1 application performance
-1 no dynamic sampling
@copyconstruct
Slide 84
Slide 84 text
-1 buffering might be required
-1 quotas/rate limits
@copyconstruct
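One common way to enforce a log quota is a token bucket, sketched below. This is an assumption about how such a limit might be implemented, not something the talk prescribes; the class name and parameters are hypothetical, and the clock is passed in explicitly so the behaviour is easy to follow.

```python
class LogQuota:
    """Token bucket: allow `rate` log lines per second, with bursts up to `burst`."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.last = now
        self.dropped = 0

    def allow(self, now):
        # Refill tokens for the time elapsed since the last call.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        self.dropped += 1  # count what was shed so the loss is visible
        return False
```

Keeping a `dropped` count matters: once a quota is in place, the number of lines shed is itself a signal worth exposing.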
+1 metrics transfer and storage have a constant overhead
@copyconstruct
Slide 95
Slide 95 text
+1 metrics transfer and storage have a constant overhead
+1 cheap
+1 statistical & probabilistic analysis
+1 alerting
-1 system scoped
@copyconstruct
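The "constant overhead" point can be illustrated with a toy simulation, assuming a hypothetical service that writes one log line per request and also updates two counters. The names `simulate`, `requests_total` and `errors_total` are made up for this sketch.

```python
def simulate(requests):
    """Return (log_lines_emitted, metric_values_stored) for a number of requests."""
    log_lines = []
    metrics = {"requests_total": 0, "errors_total": 0}
    for i in range(requests):
        failed = i % 10 == 0  # pretend every tenth request fails
        log_lines.append(f"request {i} failed={failed}")  # one line per request
        metrics["requests_total"] += 1  # in-place update, no new storage
        if failed:
            metrics["errors_total"] += 1
    return len(log_lines), len(metrics)
```

Going from 10 to 10,000 requests multiplies the log volume by a thousand, while the metric state stays at two numbers. The flip side is the "system scoped" limitation: the counters say nothing about any individual request.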
Slide 97
Slide 97 text
traces
@copyconstruct
Slide 99
Slide 99 text
+1 captures the lifetime of requests as they
flow through the various components of a
distributed system
-1 hard to instrument
@copyconstruct
Slide 100
Slide 100 text
“We’ve been implementing a request tracing service for
over a year and it’s not complete yet. The challenge with
these type of tools is that, we need to add code around
each span to truly understand what’s happening during
the lifetime of our requests. The frustrating part is that if
the code is not instrumented or header is not carrying the
id, that code becomes a risky blind spot for operations”
@copyconstruct
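The blind spot described in that quote, where a header does not carry the id, can be sketched in a few lines. The header name `X-Trace-Id` and the helper functions are hypothetical, chosen only for illustration; real tracing systems define their own propagation formats.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name, chosen for illustration


def extract_or_start(headers):
    """Reuse the caller's trace id if present, else start a new trace."""
    return headers.get(TRACE_HEADER) or uuid.uuid4().hex


def inject(headers, trace_id):
    """Return a copy of outgoing headers that carries the trace id onward."""
    outgoing = dict(headers)
    outgoing[TRACE_HEADER] = trace_id
    return outgoing


def call_downstream(incoming_headers, propagate=True):
    """Simulate one hop; without propagation the downstream starts a fresh trace."""
    trace_id = extract_or_start(incoming_headers)
    outgoing = inject({}, trace_id) if propagate else {}
    downstream_trace_id = extract_or_start(outgoing)
    return trace_id, downstream_trace_id
```

With `propagate=False` the downstream hop ends up under a different trace id, so the two halves of the request can no longer be stitched together.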
Slide 102
Slide 102 text
+1 captures the lifetime of requests as they
flow through the various components of a
distributed system
-1 hard to instrument
-1 depends on how causality is tracked
-1 request scoped
@copyconstruct
Slide 103
Slide 103 text
Best practices
@copyconstruct
Slide 104
Slide 104 text
Logs
@copyconstruct
Slide 107
Slide 107 text
o Quotas
o Dynamic Sampling
o Logging is a Stream Processing Problem
@copyconstruct
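The talk does not specify a sampling algorithm, so the following is only one simple interpretation of dynamic sampling: keep every event for keys that are rare, and keep one in N once a key becomes frequent, recording the sample rate on each kept event so totals can still be estimated. `DynamicSampler` and its parameters are hypothetical.

```python
from collections import defaultdict


class DynamicSampler:
    """Keep every event for rare keys and 1-in-N for keys seen often."""

    def __init__(self, threshold=3, keep_one_in=10):
        self.threshold = threshold      # events per key that are always kept
        self.keep_one_in = keep_one_in  # sampling rate once a key is "hot"
        self.seen = defaultdict(int)

    def sample(self, key, event):
        """Return the event annotated with its sample rate, or None if dropped."""
        self.seen[key] += 1
        n = self.seen[key]
        if n <= self.threshold:
            return {**event, "sample_rate": 1}
        if (n - self.threshold) % self.keep_one_in == 0:
            return {**event, "sample_rate": self.keep_one_in}
        return None
```

Summing `sample_rate` over the kept events gives an estimate of how many events were actually seen, which is what makes sampled logs usable for counting.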
Slide 108
Slide 108 text
Filter to outlier countries from where users viewed
this article fewer than 100 times in total
@copyconstruct
Slide 109
Slide 109 text
Filter to outlier page loads that performed more
than 100 database queries
Or, show me only page loads from Indonesia that
took more than 10 seconds to load
@copyconstruct
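Queries like these are straightforward once each page load is a structured event. The sketch below assumes hypothetical events with `country`, `db_queries` and `load_time_s` fields and expresses the three filters as plain Python functions over an iterable of events.

```python
from collections import Counter


def rare_countries(events, max_views=100):
    """Countries from which the page was viewed fewer than `max_views` times."""
    views = Counter(e["country"] for e in events)
    return {country for country, n in views.items() if n < max_views}


def heavy_page_loads(events, max_db_queries=100):
    """Page loads that performed more than `max_db_queries` database queries."""
    return [e for e in events if e["db_queries"] > max_db_queries]


def slow_loads_from(events, country, min_seconds=10):
    """Page loads from one country that took more than `min_seconds` to load."""
    return [
        e for e in events
        if e["country"] == country and e["load_time_s"] > min_seconds
    ]
```

The thresholds default to the values on the slides; in a real pipeline the same predicates would run over a stream rather than an in-memory list.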
Slide 110
Slide 110 text
Enriched events
business event + timer/counter/histogram
@copyconstruct
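A minimal sketch of an enriched event, assuming JSON lines as the output format: one structured record carries the business fields together with a timer and any counters attached while the work runs. The `enriched_event` helper and the field names in the usage example are hypothetical.

```python
import json
import time
from contextlib import contextmanager


@contextmanager
def enriched_event(name, sink, **fields):
    """Emit one structured event that carries business fields plus a timer."""
    event = {"event": name, **fields}
    start = time.perf_counter()
    try:
        yield event  # the caller may attach counters while the work runs
    finally:
        event["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
        sink.append(json.dumps(event, sort_keys=True))
```

A caller might write `with enriched_event("order_placed", sink, order_id="o-1") as ev: ev["db_queries"] = 3`, producing a single record that can later be filtered, counted or turned into a histogram.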
Slide 113
Slide 113 text
A new hope for the future
OpenLogging/OpenEvent
@copyconstruct
Slide 114
Slide 114 text
metrics
@copyconstruct
Slide 115
Slide 115 text
“Prometheus is much more than just the
server. I see Prometheus as a set of
standards and projects, with the server
being just one part of a much greater whole”
@copyconstruct
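One of those standards is the text-based exposition format that a Prometheus server scrapes. The sketch below renders a single counter family in that format by hand, purely to show its shape; `render_counter` is a hypothetical helper, it does not escape special characters in label values, and real services would normally use an official client library instead.

```python
def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        if labels:
            # Labels are sorted only to make the output deterministic.
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

Because the format is plain text with a small, documented grammar, anything that can serve it over HTTP can be scraped, which is part of why the ecosystem extends beyond the server itself.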
Slide 119
Slide 119 text
traces
@copyconstruct
Slide 121
Slide 121 text
conclusion
@copyconstruct
Slide 132
Slide 132 text
Choose your own Observability Adventure!
@copyconstruct