
Mitigating Sturgeon's Revelation

Sturgeon's Revelation states that "ninety percent of everything is crap", and this is particularly true of computer systems. Given at the Hacker News London meetup in October 2014, this presentation covers building platforms for the real world.

You can view a video recording of this talk: https://vimeo.com/110478911

The Scale Factory

October 09, 2014

Transcript

  1. Mitigation Strategy: Unit Testing

    - Behavioural bugs
    - Logic problems
    - Syntax issues
    - Performance problems
    - Security vulnerabilities
    - Resource consumption issues
  2. Real World Problems

    - Software fault
    - Disc failure
    - Server / instance failure
    - Power / cooling interruption
    - Data centre / Availability Zone outage
    - Region outage
  3. Design for Failure

    [Architecture diagram: three Virtual Private Clouds (euwest1-auth, euwest1-live, euwest1-test), each with four VPC subnets (public_a, public_b, private_a, private_b) spread across Availability Zones eu-west-1a and eu-west-1b. Labels on the diagram include mgmt, staging, live, mon, DNS, core (corea, coreb), ldap (ldap-vola, ldap-volb), jenkins, gwa, gwb, appserver, Auto scaling Group (appserver), Auto scaling Group (mon), db-app (master, standby), db-mon (master, standby), loga, logb, elasticsearch-vola, elasticsearch-volb, Package Repo and Frontend.]
  4. High Availability

    - n+1 everywhere
    - Failover or load balancing
    - Replication or centralisation of state
    - Needs designing in up-front
  5. Code for Failure

        for each user_request
          look up data in memcached
          if no data
            look up data in the database
            store result in memcached
          endif
          return data
        end
  6. Code for Failure

        for each user_request
          look up data in memcached
          if no data
            look up data in the database
            store result in memcached
          endif
          return data
        end

    Annotations on the code:

    - DNS failure
    - service down
    - request times out
    - “the” database?
    - re-used dead pool connection
    - SQL query bad
    - huge data set returned
  7. Code for Failure

    - Check for error conditions
    - Not all errors are equal
    - Retry failed requests
    - …but not indefinitely
    - Back off exponentially
    - Degrade gracefully
  8. Logging

    - Don’t remove debug logging when you’re finished
    - Don’t spew error-level noise when everything’s fine
    - Aggregate logs centrally for index and search
    - Use structured log entries where possible
  9. Beware the Cutting Edge

    - New technologies aren’t battle-tested
    - Community knowledge takes time to accrue
    - Searching ServerFault is easier than using gdb
  10. Don’t Use Cargo Cult Solutions

    “There is a rich & vibrant oral tradition about how to write fast programs, and almost all of it is horseshit”
    Carlos Bueno
  11. “Reality is that which, when you stop believing in it, doesn't go away.”
    Philip K Dick
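
The "Code for Failure" slides (5 to 7) can be illustrated with a minimal Python sketch of the same cache-then-database lookup, written so that it checks for errors, retries a bounded number of times with exponential backoff, and degrades gracefully. This is not code from the talk. The names `TransientError`, `PermanentError`, `retry_with_backoff` and `fetch` are hypothetical, as are the `cache_get`, `cache_set` and `db_get` callables, which stand in for whatever memcached and database clients a real system uses. The retry counts, delays and jitter are illustrative assumptions.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical: an error worth retrying (timeout, dropped connection)."""


class PermanentError(Exception):
    """Hypothetical: an error retrying cannot fix (e.g. a bad SQL query)."""


def retry_with_backoff(operation, max_attempts=4, base_delay=0.1,
                       max_delay=2.0, sleep=time.sleep):
    """Call operation(), retrying transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retry failed requests, but not indefinitely
            # Delay doubles on each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Jitter (an assumption) spreads out retries from many clients.
            sleep(delay * random.uniform(0.5, 1.0))


def fetch(key, cache_get, cache_set, db_get, fallback=None, sleep=time.sleep):
    """Cache-then-database lookup that degrades gracefully on failure."""
    try:
        value = cache_get(key)
    except TransientError:
        value = None  # a cache outage is treated as a cache miss
    if value is not None:
        return value
    try:
        # PermanentError is deliberately not caught: not all errors are equal.
        value = retry_with_backoff(lambda: db_get(key), sleep=sleep)
    except TransientError:
        return fallback  # database still unavailable: serve a fallback
    try:
        cache_set(key, value)
    except TransientError:
        pass  # failing to populate the cache should not fail the request
    return value
```

A caller might use `fetch("user:42", cache.get, cache.set, load_user, fallback=DEFAULT_PROFILE)`, where every name after `fetch` is again hypothetical. Passing `sleep` as a parameter is a design choice that lets tests run without real delays.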