Mitigating Sturgeon's Revelation

Mitigating Sturgeon’s Revelation Jon Topper

Chronology (timeline: 1995, 2000, 2005, 2010, 2015)

Sturgeon’s Revelation (chart: Crap vs Not Crap)

GitHub Repositories https://github.com/blog/1724-10-million-repositories

The Operations Battleground vs

Mitigation Strategy

Mitigation Strategy: Unit Testing
Behavioural bugs
Logic problems
Syntax issues
Performance problems
Security vulnerabilities
Resource consumption issues

Mitigation Strategy: Integration Testing
Interface compatibility
Protocol interoperability
Concurrency problems

Problems with Testing

Problems with Testing
Dev
Ops

Real World Problems
Software fault
Disc failure
Server / instance failure
Power / cooling interruption
Data centre / Availability Zone outage
Region outage

Failure Model
Sudden total outage
Intermittent availability
Degraded performance

Design for Failure
[Architecture diagram: three Virtual Private Clouds (euwest1-auth, euwest1-live, euwest1-test), drawn across Availability Zones eu-west-1a and eu-west-1b with VPC subnets public_a, public_b, private_a and private_b. Labelled components: mon, DNS, core (corea, coreb), ldap (ldap-vola, ldap-volb), jenkins, gateways (gwa, gwb), appserver Auto Scaling Groups, a mon Auto Scaling Group, db-app and db-mon (master and standby), log hosts (loga, logb) with elasticsearch volumes, Package Repo and Frontend. Environments: mgmt, staging, live.]

High Availability
n+1 everywhere
Failover or load balancing
Replication or centralisation of state
Needs designing in up-front

Code for Failure
for each user_request
    look up data in memcached
    if no data
        look up data in the database
        store result in memcached
    endif
    return data
end

Code for Failure
for each user_request
    look up data in memcached
    if no data
        look up data in the database
        store result in memcached
    endif
    return data
end
Failure points:
DNS failure
Service down
Request times out
“The” database?
Re-used dead pool connection
SQL query bad
Huge data set returned
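
One way to harden the lookup above is to treat the cache as optional, so that a cache outage slows requests down rather than failing them. The sketch below is illustrative only: `CacheUnavailable`, `get_user_data`, and the `cache` and `database` objects (with `get`, `set` and `fetch` methods) are hypothetical names, not part of any real client library.

```python
class CacheUnavailable(Exception):
    """Hypothetical error raised when the cache cannot be reached."""


def get_user_data(key, cache, database):
    """Cache-aside lookup that keeps working when the cache is down.

    Assumes cache.get(key) returns None on a miss, and that cache.get,
    cache.set and database.fetch are provided by the caller.
    """
    try:
        value = cache.get(key)
    except CacheUnavailable:
        # Cache is unreachable: skip it and serve from the database.
        return database.fetch(key)

    if value is not None:
        return value  # Cache hit.

    # Cache miss: read from the database, then try to populate the cache.
    value = database.fetch(key)
    try:
        cache.set(key, value)
    except CacheUnavailable:
        pass  # Failing to store the result should not fail the request.
    return value
```

Database errors are deliberately left to propagate here; whether to retry them or serve a fallback is a separate decision.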

Code for Failure
Check for error conditions
Not all errors are equal
Retry failed requests
…but not indefinitely
Back off exponentially
Degrade gracefully
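
The retry points above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: `TransientError` and `call_with_retries` are hypothetical names, and the delay values are arbitrary assumptions.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. a timeout)."""


def call_with_retries(operation, max_attempts=4, base_delay=0.1,
                      max_delay=5.0, sleep=time.sleep):
    """Call operation(), retrying transient failures with exponential backoff.

    Any other exception propagates immediately: not all errors are equal.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # Give up: retry, but not indefinitely.
            # Double the delay on each attempt, up to max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter spreads out retries from many clients.
            sleep(delay * random.uniform(0.5, 1.0))
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.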

Monitoring 200 OK! frack

Monitoring
Check individual cluster members
Monitor passively as well as actively
Tune your triggers
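
As a rough illustration of checking members individually rather than only through a load balancer, the sketch below probes each member and maps the result onto alert levels. `check_cluster`, the `probe` callable, the `min_healthy` threshold and the status names are all assumptions made for this example.

```python
def check_cluster(members, probe, min_healthy):
    """Probe every member individually and summarise cluster health.

    `probe(member)` is a hypothetical callable that returns True when the
    member is healthy; an exception from it counts as a failure.
    """
    failed = []
    for member in members:
        try:
            healthy = probe(member)
        except Exception:
            healthy = False
        if not healthy:
            failed.append(member)

    healthy_count = len(members) - len(failed)
    if healthy_count < min_healthy:
        status = "critical"  # Too few members left to serve traffic.
    elif failed:
        status = "warning"   # Degraded, but still above the threshold.
    else:
        status = "ok"
    return status, failed
```

Separating "warning" from "critical" is one way to tune a trigger: a single failed member in an n+1 cluster is reported without raising the highest-severity alert.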

Logging
Don’t remove debug logging when you’re finished
Don’t spew error-level noise when everything’s fine
Aggregate logs centrally for index and search
Use structured log entries where possible
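
A minimal sketch of structured log entries using Python's standard `logging` and `json` modules. `JsonFormatter`, `make_logger` and the `fields` key passed via `extra` are names chosen for this example, not an established convention.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge caller-supplied structured fields, if any were passed
        # as extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def make_logger(name, stream=None, level=logging.INFO):
    """Build a logger that writes JSON lines to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(level)
    logger.propagate = False
    return logger
```

With this in place, a call such as `log.debug("cache miss", extra={"fields": {"key": "user:42"}})` can stay in the code permanently: it is silent at the default INFO level and becomes searchable JSON once the level is lowered to DEBUG.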

Keep It Simple

Beware the Cutting Edge
New technologies aren’t battle-tested
Community knowledge takes time to accrue
Searching ServerFault is easier than using gdb

Don’t Use Cargo Cult Solutions
“There is a rich & vibrant oral tradition about how to write fast programs, and almost all of it is horseshit”
Carlos Bueno

“Reality is that which, when you stop believing in it, doesn’t go away.”
Philip K. Dick

http://www.scalefactory.com/ @jtopper jtopper / scalefactory
