The Testing – Monitoring Continuum (DevOps Sydney, 2015)

Slide 1

Slide 1 text

The Testing↔Monitoring Continuum

Slide 2

Slide 2 text

Question Time

Slide 3

Slide 3 text

Ops: Who here uses Nagios? Monit + M/Monit? ... Serverspec?

Slide 4

Slide 4 text

Dev: Who writes unit tests? Integration tests, e.g. using browser-driving full-stack tools like Selenium, Capybara, etc.?

Slide 5

Slide 5 text

What is Monitoring?

Slide 6

Slide 6 text

You're developing a product; an app or something. Let's say we have a bunch of machines running that app.

Slide 7

Slide 7 text

Load Balancers

Slide 8

Slide 8 text

Load Balancers, App Servers

Slide 9

Slide 9 text

Load Balancers, App Servers, Data Store

Slide 10

Slide 10 text

We're monitoring. What do we do? Well, first we should probably make sure that the servers are actually up. Easy!
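As a rough sketch of that first "are the servers up?" check: the snippet below treats "up" as "accepts a TCP connection on a given port", which is an assumption on my part (a real setup might ping instead, or rely on Nagios host checks). The function names and the default port are made up for illustration.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def down_hosts(hosts, port=22):
    """Return the hosts from `hosts` that do not accept connections on `port`."""
    return [host for host in hosts if not port_is_open(host, port)]
```

Calling `down_hosts` with the load balancer, app server and data store hostnames would give the list of machines to worry about; an empty list means everything answered.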

Slide 11

Slide 11 text

Well, what about more specific things? Is PostgreSQL running on the database server? Can we see its PID?

Slide 12

Slide 12 text

Is Postgres accepting connections?

Slide 13

Slide 13 text

Is it accepting connections with the right username + password for the app? Maybe we stuffed up a config rollout.

Slide 14

Slide 14 text

Okay, but does it have the PostgreSQL extensions the app needs, e.g. for UUID generation?

Slide 15

Slide 15 text

Is the app's database named correctly?

Slide 16

Slide 16 text

Can the app see the tables it needs in the database?

Slide 17

Slide 17 text

Can it write to those tables? Maybe we screwed up the permissions.
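To make that ladder of checks concrete, here is a minimal sketch of what it might look like as one script. Everything specific is a placeholder I've invented: the extension name `uuid-ossp`, the database `app_db`, the role `app_user` and the table `users` are not from the talk. The sketch assumes `run_query` already connects with the app's own credentials, so a successful first query also covers "is it running?" and "does the password work?".

```python
# Each entry: (description, SQL). The names 'uuid-ossp', 'app_db', 'app_user'
# and 'users' are hypothetical placeholders, not taken from the talk.
CHECKS = [
    ("accepting queries", "SELECT 1"),
    ("uuid extension installed",
     "SELECT 1 FROM pg_extension WHERE extname = 'uuid-ossp'"),
    ("database named correctly",
     "SELECT 1 WHERE current_database() = 'app_db'"),
    ("users table visible",
     "SELECT 1 WHERE to_regclass('public.users') IS NOT NULL"),
    ("app can write to users",
     "SELECT 1 WHERE has_table_privilege('app_user', 'public.users', 'INSERT')"),
]


def first_failure(run_query, checks=CHECKS):
    """Run checks in order; return a description of the first failure, or None.

    `run_query(sql)` must return the first result row, or None when there is none.
    """
    for description, sql in checks:
        try:
            row = run_query(sql)
        except Exception as exc:  # connection or permission errors count as failures too
            return f"{description}: {exc}"
        if row is None:
            return description
    return None
```

With a driver such as psycopg2, `run_query` could be a small wrapper that calls `cursor.execute(sql)` and returns `cursor.fetchone()`. Note how the list keeps growing with every new question we think of.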

Slide 18

Slide 18 text

THIS IS GETTING A BIT MUCH.

Slide 19

Slide 19 text

Do we have to do this for every service or node that we're running? Where do we stop?

Slide 20

Slide 20 text

Run the App: Well, maybe the best way of doing this is running the app itself. We could write a bash+curl script that tests, say, just logging in.
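The slide suggests bash+curl; the sketch below does the same kind of login smoke check in Python's standard library instead, purely for illustration. The `/login` path and the `username`/`password` form fields are assumptions; a real app will have its own endpoint and field names.

```python
import urllib.error
import urllib.parse
import urllib.request


def login_works(base_url: str, username: str, password: str, timeout: float = 10.0) -> bool:
    """POST credentials to a hypothetical /login endpoint; True on a 2xx response."""
    # Form-encode the credentials; the field names are assumptions.
    body = urllib.parse.urlencode({"username": username, "password": password}).encode()
    request = urllib.request.Request(base_url.rstrip("/") + "/login", data=body, method="POST")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # HTTPError (4xx/5xx) is a subclass of URLError; OSError covers timeouts.
        return False
```

One check like this exercises the load balancer, an app server and the data store in a single request, which is the appeal compared with the per-service checks above.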

Slide 21

Slide 21 text

Run the App's Tests: But is that testing everything the app needs to use? Maybe it'll break on the next click. Why not go the whole hog? Our app has an integration test suite (or should have). We spent a lot of money on it!

Slide 22

Slide 22 text

Story Time: Let's say we have a multi-tenant, hosted, Software-as-a-Service app that users buy instances/accounts for: VM hosting, chat, whatever.

Slide 23

Slide 23 text

(Diagram: Local Dev. Env) We'd have unit tests that you run on your local box.

Slide 24

Slide 24 text

But there are also those big browser-driven tests. The test runner is still local, running against a local copy of your app.

Slide 25

Slide 25 text

(Diagram: Local Dev. Env, Production, Staging) We have staging and production environments too.

Slide 26

Slide 26 text

Why don't we:

* Spin up a new account on staging.
* Run the integration tests against that new account.
* Throw away the account afterwards.
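Those three steps can be sketched as a small wrapper. The callables `create_account`, `delete_account` and `run_suite` are hypothetical stand-ins for whatever your app's provisioning API and test runner actually look like; the only real point is that the account gets thrown away even when the suite blows up.

```python
from contextlib import contextmanager


@contextmanager
def throwaway_account(create_account, delete_account):
    """Create an account, hand it to the caller, and always delete it afterwards."""
    account = create_account()
    try:
        yield account
    finally:
        # Runs even if the test suite raises, so accounts do not pile up.
        delete_account(account)


def run_against_fresh_account(create_account, delete_account, run_suite):
    """Return run_suite's result (True for a pass) against a newly created account."""
    with throwaway_account(create_account, delete_account) as account:
        return run_suite(account)
```

The cleanup sits in a `finally` block on purpose: a monitoring job that leaks one account per failed run would soon become its own incident.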

Slide 27

Slide 27 text

It could be a custom app kicking off these test runs, but it could just as easily be Jenkins.

Slide 28

Slide 28 text

Do the same for production! Have these tests run over and over again. It chews up some of your production capacity, but you get greater assurance that your app works in the staging and production environments you've configured and rolled out.
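One way to feed those repeated runs back into monitoring is to report the latest result the way a Nagios plugin would. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios plugin convention; the function itself, its parameters and the staleness rule are assumptions of mine, not something from the talk.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes


def suite_status(passed: int, failed: int, max_age_s: float, age_s: float):
    """Turn the latest test-run result into an (exit_code, message) pair."""
    if age_s > max_age_s:
        # A stale result suggests the test runner itself has stopped working.
        return UNKNOWN, f"UNKNOWN - last run finished {age_s:.0f}s ago"
    if passed + failed == 0:
        # A run that executed no tests tells us nothing.
        return UNKNOWN, "UNKNOWN - last run executed no tests"
    if failed:
        return CRITICAL, f"CRITICAL - {failed} of {passed + failed} tests failed"
    return OK, f"OK - {passed} tests passed"
```

A wrapper script would print the message and exit with the code, so the test suite shows up alongside the disk and CPU checks rather than in a separate system.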

Slide 29

Slide 29 text

We're testing the app+infrastructure interface: that, say, the file upload feature in your chat app actually works with the infrastructure it relies on.

Slide 30

Slide 30 text

It's not super-easy or perfect. Testing interactions with external systems (particularly payment ones) is hard, and might mean turning off parts of your test suite and instrumenting error detection instead.
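"Turning off parts of your test" could look something like the sketch below, using the standard `unittest` module. The `TARGET_ENV` variable and the test names are hypothetical, and the test bodies are stand-ins for real browser-driven tests.

```python
import os
import unittest


def targets_production() -> bool:
    # TARGET_ENV is a hypothetical variable set by whatever launches the suite.
    return os.environ.get("TARGET_ENV") == "production"


class UploadAndPaymentTests(unittest.TestCase):
    def test_file_upload(self):
        self.assertTrue(True)  # stand-in for a real browser-driven test

    def test_card_payment(self):
        if targets_production():
            # Rely on error-rate instrumentation here instead of a live charge.
            self.skipTest("would hit the real payment provider")
        self.assertTrue(True)  # stand-in for a sandbox payment test
```

Skipped tests still appear in the run report, which keeps the gap in coverage visible rather than silently dropping it.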

Slide 31

Slide 31 text

And finally, to be clear, this isn't replacing your environment tests (e.g. available disk/RAM/CPU) or error-rate instrumentation; it's meant to remove the need for a ton of individual fine-grained service checks that are better covered by hitting the app with your existing test suite.

Slide 32

Slide 32 text

(Diagram: Testing, Monitoring) Back to the title. Instead of treating Testing and Monitoring as separate, discrete things, I'd argue that…

Slide 33

Slide 33 text

(Diagram: Testing, Testing + Monitoring) … Testing is a part of Good Monitoring.

Slide 34

Slide 34 text

Fin. Rob Howard, @damncabbage, https://speakerdeck.com/damncabbage/ Thanks! One final thing…

Slide 35

Slide 35 text

I work at OrionVM and we're hiring; we're building cloud hosting (physical) infrastructure, and we're after an Ops person (networking+routing, physical server wrangling, configuration management) and a Ruby/JS dev (UI) to help out.
