Ops
Who here uses Nagios?
Monit + M/Monit? ... Serverspec?
Slide 4
Dev
Who writes unit tests? Integration tests, e.g. using browser-driving, full-stack tools like Selenium, Capybara, etc.?
Slide 5
What is Monitoring?
Slide 6
You're developing a product: an app or something.
Let's say we have a bunch of machines running that app.
Slide 7
Load Balancers
Slide 8
Load Balancers
App Servers
Slide 9
Load Balancers
App Servers
Data Store
Slide 10
We're monitoring. What do we do? Well, first we should probably make sure that the servers are actually up. Easy!
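A rough sketch of the "is it up?" check in Python, assuming "up" just means a TCP connection to a known port succeeds. This is roughly what Nagios's `check_tcp` plugin does for you; the function name is hypothetical, and you'd point it at your own load balancer, app server and database hosts.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and tries each address it gets back.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused connections, timeouts and DNS failures all surface as OSError.
        return False
```

Note how little this proves: something is listening on the port, and that's it.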
Slide 11
Well, what about more specific things?
Is PostgreSQL running on the database server? Can we see its PID?
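A minimal sketch of the PID check, assuming a POSIX host and that we can read PostgreSQL's `postmaster.pid` file (it lives in the data directory, and its first line is the postmaster's PID). The data directory path varies by install, so the path you pass in is up to you; the helper names are hypothetical.

```python
import os
from typing import Optional


def pid_from_file(path: str) -> Optional[int]:
    """Read the PID from the first line of a PID file, or None if unreadable."""
    try:
        with open(path) as f:
            return int(f.readline().strip())
    except (OSError, ValueError):
        return None


def process_alive(pid: int) -> bool:
    """Signal 0 checks existence/permissions without delivering a signal."""
    if pid <= 0:
        return False  # 0 and negatives address process groups, not a single process
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but is owned by another user (e.g. postgres)
    return True


def postgres_running(pid_file: str) -> bool:
    pid = pid_from_file(pid_file)
    return pid is not None and process_alive(pid)
```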
Slide 12
Is Postgres accepting connections?
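An open port doesn't mean Postgres is actually talking. The usual tool here is `pg_isready`; as a sketch of what "accepting connections" can mean at the protocol level, this sends the PostgreSQL frontend/backend protocol's 8-byte `SSLRequest` message and checks for the single-byte `S` or `N` reply a Postgres server sends back. The function name is hypothetical, and this still doesn't log in.

```python
import socket
import struct

# SSLRequest: Int32 message length (8), then Int32 request code 80877103.
SSL_REQUEST = struct.pack("!II", 8, 80877103)


def postgres_accepting(host: str, port: int = 5432, timeout: float = 3.0) -> bool:
    """True if host:port answers an SSLRequest the way a Postgres server does."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(SSL_REQUEST)
            reply = sock.recv(1)
    except OSError:
        return False
    # b"S": server is willing to do SSL; b"N": it isn't. Either way, it's Postgres.
    return reply in (b"S", b"N")
```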
Slide 13
Is it accepting connections with the right username and password for the app? Maybe we stuffed up a config rollout.
Slide 14
Okay, but does it have the PostgreSQL extensions the app needs, e.g. for UUID generation?
Slide 15
Is the app's database named correctly?
Slide 16
Can the app see the tables it needs in the database?
Slide 17
Can it write to those tables? Maybe we screwed up the permissions.
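To see how quickly this piles up, here's a sketch of the last few checks as SQL run through any DB-API connection (e.g. psycopg2), assuming the connection was already opened with the app's own credentials. The database, extension and table names are hypothetical; the SQL uses PostgreSQL built-ins: `current_database()`, the `pg_extension` catalog, `to_regclass()` (9.4+) and `has_table_privilege()`.

```python
from typing import Any, Callable, List, Sequence, Tuple

# A check is (description, SQL, parameters, predicate on the first result row).
Check = Tuple[str, str, Sequence[Any], Callable[[Any], bool]]


def build_checks(db_name: str, extensions: Sequence[str], tables: Sequence[str]) -> List[Check]:
    """Build the fine-grained Postgres checks from the slides (names are hypothetical)."""
    checks: List[Check] = [
        ("database is named correctly", "SELECT current_database()", (),
         lambda row: row is not None and row[0] == db_name),
    ]
    for ext in extensions:
        checks.append((f"extension {ext} installed",
                       "SELECT 1 FROM pg_extension WHERE extname = %s", (ext,),
                       lambda row: row is not None))
    for table in tables:
        # to_regclass() returns NULL instead of raising when the table isn't visible.
        checks.append((f"table {table} visible",
                       "SELECT to_regclass(%s) IS NOT NULL", (table,),
                       lambda row: bool(row and row[0])))
        # has_table_privilege() checks the *current* user, i.e. the app's login role.
        checks.append((f"table {table} writable",
                       "SELECT has_table_privilege(%s, 'INSERT')", (table,),
                       lambda row: bool(row and row[0])))
    return checks


def run_checks(conn: Any, checks: Sequence[Check]) -> List[str]:
    """Run each check on a DB-API connection; return the descriptions of failed checks."""
    failures: List[str] = []
    cur = conn.cursor()
    try:
        for description, sql, params, ok in checks:
            try:
                cur.execute(sql, params)
                passed = ok(cur.fetchone())
            except Exception:
                # Any database error counts as a failure; roll back so an aborted
                # transaction doesn't make every later check fail too.
                conn.rollback()
                passed = False
            if not passed:
                failures.append(description)
    finally:
        cur.close()
    return failures
```

Four checks for one extension and one table, and we've not even touched the app yet.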
Slide 18
THIS IS GETTING A BIT MUCH.
Slide 19
Do we have to do this for every service or node that we're running?
Where do we stop?
Slide 20
Run the App.
Well, maybe the best way of doing this is to run the app itself.
We could write a bash+curl script that, say, just tests logging in.
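The talk suggests bash+curl; here's the same idea sketched in Python with only the standard library. It assumes a hypothetical `/login` endpoint that takes `username` and `password` form fields and answers `200 OK` on success; your app's login flow will differ.

```python
import urllib.error
import urllib.parse
import urllib.request


def login_smoke_test(base_url: str, username: str, password: str, timeout: float = 10.0) -> bool:
    """POST credentials to a (hypothetical) /login endpoint; True if the app answers 200 OK."""
    body = urllib.parse.urlencode({"username": username, "password": password}).encode()
    request = urllib.request.Request(base_url.rstrip("/") + "/login", data=body, method="POST")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        err.close()  # 4xx/5xx: the app is up, but logging in is broken
        return False
    except OSError:
        # URLError is an OSError subclass: refused connections, DNS failures, timeouts.
        return False
```

Unlike the port and PID checks, a passing run here means the load balancer, app server, database, credentials and tables all cooperated for at least one request.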
Slide 21
Run the App's Tests.
But is that testing everything the app needs to use? Maybe it'll break on the next click.
Why not go the whole hog? Our app has (or should have) an integration test suite. We spent a lot of money on it!
Slide 22
Story Time
Let's say we have a multi-tenant, hosted Software-as-a-Service app that users buy instances/accounts for: VM hosting, chat, whatever.
Slide 23
Local Dev. Env
We'd have unit tests that you run on your local box.
Slide 24
Local Dev. Env
But also those big browser-driven tests. The test runner is still local, running against a local copy of your app.
Slide 25
Local Dev. Env
Production
Staging
We have staging and production environments too.
Slide 26
Local Dev. Env
Production
Staging
Why don't we:
* Spin up a new account on staging.
* Run the integration tests against that new account.
* Throw away the account afterwards.
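The three steps above can be sketched as a context manager, so the throwaway account is deleted even when the test run blows up. The `create`, `destroy` and `run_suite` callables are hypothetical stand-ins for your app's account-provisioning API and whatever actually launches the integration suite.

```python
import contextlib
import uuid
from typing import Callable, Iterator


@contextlib.contextmanager
def throwaway_account(create: Callable[[str], str],
                      destroy: Callable[[str], None]) -> Iterator[str]:
    """Create a uniquely named account, yield its ID, and always destroy it afterwards."""
    name = f"monitoring-{uuid.uuid4().hex[:12]}"
    account_id = create(name)
    try:
        yield account_id
    finally:
        # Runs even if the suite raises, so failed runs don't leak accounts.
        destroy(account_id)


def monitoring_run(create: Callable[[str], str],
                   destroy: Callable[[str], None],
                   run_suite: Callable[[str], bool]) -> bool:
    """Spin up an account, run the integration tests against it, then throw it away."""
    with throwaway_account(create, destroy) as account_id:
        return run_suite(account_id)
```

The unique, prefixed name is a design choice: if a cleanup ever does fail, leftover accounts are easy to spot and sweep up.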
Slide 27
Local Dev. Env
Production
Staging
It could be a custom app kicking off these test runs, but it could just as easily be Jenkins.
Slide 28
Local Dev. Env
Production
Staging
Do the same for production!
Have these tests run over and over again. It chews up some of your production capacity, but gives you greater certainty that your app works when placed into the staging and production environments you've configured and rolled out.
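In practice Jenkins's "Build periodically" trigger can do the "over and over again" part. As a sketch of what that loop amounts to if you roll your own, here's a hypothetical runner that calls a test run every `interval_s` seconds and alerts on failures; `run_once` and `alert` are placeholders for launching the suite and paging someone.

```python
import time
from typing import Callable, List, Optional


def run_repeatedly(run_once: Callable[[], bool],
                   alert: Callable[[str], None],
                   interval_s: float = 300.0,
                   max_runs: Optional[int] = None,
                   sleep: Callable[[float], None] = time.sleep) -> List[bool]:
    """Run `run_once` roughly every `interval_s` seconds; alert on failures.

    `max_runs=None` loops forever; `sleep` is injectable so it can be tested.
    """
    results: List[bool] = []
    while max_runs is None or len(results) < max_runs:
        started = time.monotonic()
        try:
            ok = run_once()
        except Exception as exc:
            # A crashing suite is a failed check, not a reason to stop monitoring.
            ok = False
            alert(f"test run crashed: {exc!r}")
        else:
            if not ok:
                alert("test run failed")
        results.append(ok)
        # Sleep for whatever's left of the interval so runs start roughly on schedule.
        sleep(max(0.0, interval_s - (time.monotonic() - started)))
    return results
```

The interval is where you trade production capacity for how quickly you notice breakage.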
Slide 29
Local Dev. Env
Production
Staging
We're testing the app+infrastructure interface.
We're testing that, say, the file upload feature on your chat app actually works with the infrastructure it relies on.
Slide 30
Local Dev. Env
Production
Staging
It's not super-easy or perfect. Testing interactions with external systems (particularly payment ones) is hard, and might just involve turning off parts of your test and instrumenting error detection instead.
Slide 31
Local Dev. Env
Production
Staging
And finally, to be clear: this isn't replacing your environment tests (e.g. available disk/RAM/CPU) or error-rate instrumentation. It's there to alleviate the need for a ton of individual, fine-grained service checks that would be better covered by your existing test suite hitting the app.
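For contrast, an environment check like free disk space stays small and coarse, which is why it's worth keeping. A minimal sketch using the standard library's `shutil.disk_usage`; the function names and the 10% threshold are hypothetical.

```python
import shutil


def disk_free_fraction(path: str = "/") -> float:
    """Fraction of the filesystem containing `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def disk_ok(path: str = "/", min_free_fraction: float = 0.10) -> bool:
    """True if at least `min_free_fraction` of the filesystem is free."""
    return disk_free_fraction(path) >= min_free_fraction
```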
Slide 32
Testing Monitoring
Back to the title.
Instead of treating Testing and Monitoring as separate, discrete things, I'd argue that…
Slide 33
Testing
Testing
+
Monitoring
… Testing is a part of Good Monitoring.
Slide 34
Fin.
Rob Howard
@damncabbage
https://speakerdeck.com/damncabbage/
Thanks! One final thing…
Slide 35
I work at OrionVM and we're hiring. We're building cloud hosting (physical) infrastructure, and we're after an Ops person (networking + routing, physical server wrangling, configuration management) and a Ruby/JS dev (UI) to help out.