Learn to fail better

OpsCon
November 10, 2015

Christine will show how you can be better at failure by highlighting some of the most radical ideas from “The Practice of Cloud System Administration”. Topics will include: randomly powering off machines; doing risky processes often; why the most highly reliable systems are built on cheap hardware that breaks a lot; and why there is no root cause for an outage. And much more! - Christina J. Hogan #RoadToOpsCon #OpsConMilan

Transcript

  1. Christina Hogan the-cloud-book.com Learn to Fail Better Some Radical Ideas

    from The Practice of Cloud System Administration www.informit.com/TPOSA Discount code SYSADMIN37
  2. Who is Christina Hogan? Sysadmin since 1988 Formula 1 aerodynamicist

    for 6 years Senior network engineer at AT&T
  3. Updated and reorganized! THOMAS A. LIMONCELLI • STRATA R. CHALUP

    • CHRISTINA J. HOGAN, The Practice of System and Network Administration, Third Edition, Volume 1. 300+ pages of new material. Volume 1, Volume 2. Available by Nov ‘16
  4. cloud cloud cloud cloud cloud cloud cloud cloud cloud cloud

    cloud cloud cloud cloud cloud cloud … (the word “cloud” repeated to fill the slide)
  5. Distributed Computing • Divide work among many machines • Coordination

    can be central or decentralized • Examples: • Genomics or CFD: 100s of machines working on a dataset • Web Service: 10 machines each taking 1/10th of the web traffic for StackExchange.com • Storage: xx,000 machines holding all of Gmail’s messages
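
A minimal sketch of the “divide work among many machines” idea above: deterministically map each request to one of ten web backends so each takes roughly 1/10th of the traffic. The host names and the hashing scheme are illustrative assumptions, not StackExchange.com’s actual setup.

    # Sketch: spread requests across ten backends, each taking ~1/10th of the traffic.
    # Backend names are placeholders, not real hosts.
    import hashlib

    BACKENDS = [f"web{i:02d}.example.com" for i in range(10)]

    def pick_backend(request_key: str) -> str:
        """Deterministically map a request (e.g. a session id) to one backend."""
        digest = hashlib.sha256(request_key.encode()).hexdigest()
        return BACKENDS[int(digest, 16) % len(BACKENDS)]

    print(pick_backend("session-12345"))  # the same key always lands on the same backend
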
  6. Distributed computing can do more “work” than the largest single

    computer. More storage. More computing power. More memory. More throughput.
  7. Challenges of Scale Millions of users • Bigger risks •

    Failures more visible • Spiraling costs • operational • hardware In response, radical ideas on • Reducing risk • Increasing reliability • Automation • Efficient solutions • Reducing overhead
  8. Make peace with failure Parts are imperfect Networks are imperfect

    Systems are imperfect Code is imperfect People are imperfect
  9. You can buy the best, most reliable computer in the

    world. It is still going to fail. If it doesn’t, you’ll still need to take it down for maintenance.
  10. 3 ways to fail better 1. Use cheaper, less reliable

    hardware. 2. If a process/procedure is risky, do it a lot. 3. Don’t punish people for outages. Radical(?) Ideas
  11. (Diagram) Five high-end servers, each with RAID, dual power supplies, UPS,

    and gold maintenance, behind a load balancer; code changes are needed to coordinate and distribute the work
  12. (Diagram) The same five high-end servers (RAID, dual power supplies, UPS,

    gold maintenance) behind a load balancer, with code changes to coordinate and distribute work, now annotated with “$$ $$ $$” cost markers
  13. Reliability through software • Resiliency through software: • Costs to

    develop. Free to deploy. • Resiliency through hardware: • Costs every time you buy a new machine.
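
One concrete way software can supply the resiliency that the hardware no longer does is to retry a failed call against another replica. A minimal sketch, assuming a hypothetical fetch function and replica list:

    # Sketch: resiliency in software: try replicas until one answers.
    # `fetch` and the replica list are hypothetical stand-ins.
    import random

    REPLICAS = ["app1.example.com", "app2.example.com", "app3.example.com"]

    def fetch(host: str, path: str) -> str:
        """Stand-in for a real HTTP call; cheap servers fail now and then."""
        if random.random() < 0.2:                            # simulate an occasional dead server
            raise ConnectionError(f"{host} is down")
        return f"response for {path} from {host}"

    def resilient_fetch(path: str) -> str:
        for host in random.sample(REPLICAS, len(REPLICAS)):  # shuffled to spread load
            try:
                return fetch(host, path)
            except ConnectionError:
                continue                                     # that replica failed; try the next
        raise RuntimeError("all replicas failed")

    print(resilient_fetch("/questions/42"))

This logic is written once and then deployed everywhere for free, whereas hardware resiliency is paid for again with every new server bought.
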
  14. (Diagram) “$$” for the best hardware plus “$$” to write code so that the

    system is distributed adds up to “$$$$”: double-spending
  15. We run services, not servers. • A “server” is a

    server even if it is powered off. • A “service” is powered up, running, and useful. Services are important because people depend on them.
  16. These techniques work for large grids of machines… …and everyday

    systems too. (Diagram: load balancers in front of groups of machines labeled “Efficient”)
  17. Big resiliency is cheaper (Diagram) Left: a load balancer with two servers

    at 50% utilization each, 50% overhead. Right: a load balancer with ten servers at 90% utilization each, 10% overhead
  18. Big resiliency is cheaper (Diagram) One of the pair on the left fails: the

    survivor goes from 50% to 100% of the load (50% overhead); the ten-server pool on the right is unchanged at 90% each (10% overhead)
  19. Big resiliency is cheaper (Diagram) Now a server fails in the ten-server pool:

    the lone survivor on the left runs at 100%, and the nine survivors on the right each run at 100%; both setups still carry the full load
  20. Cheaper resilience leads to more resilience • Low overhead &

    cheaper servers => • More resilience is affordable • N+2, N+3, … not just N+1 • More reliable service
  21. N+3 Resilience (Diagram) Left: a load balancer with four servers at 25%

    utilization each, 4x cost. Right: a load balancer with a large pool of servers at 85% utilization each, 1.16x cost
  22. Less reliable servers may need more resilience (Diagram) Left: N+1, a load

    balancer with two servers at 50% utilization each, 2x cost. Right: N+3, a load balancer with a large pool of servers at 85% utilization each, 1.16x cost
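
The cost figures on the last two slides follow from simple arithmetic: for N+k redundancy you buy N+k servers to do N servers’ worth of work, so the cost multiplier is (N+k)/N and the steady-state utilization is N/(N+k). A rough check, assuming the right-hand pool needs about 18 servers’ worth of capacity (a number chosen here to reproduce the slides’ roughly 1.16x and 85% figures):

    # Cost and utilization of N+k redundancy: buy N+k servers for N servers' worth of work.
    def n_plus_k(n_needed: int, k_spare: int) -> tuple[float, float]:
        fleet = n_needed + k_spare
        cost_multiplier = fleet / n_needed      # relative to buying exactly the capacity needed
        utilization = n_needed / fleet          # steady-state load on each server
        return cost_multiplier, utilization

    print(n_plus_k(1, 1))    # N+1 pair of big servers:   2.0x cost, 50% utilization
    print(n_plus_k(1, 3))    # N+3 with big servers:      4.0x cost, 25% utilization
    print(n_plus_k(18, 3))   # N+3 with 21 small servers: ~1.17x cost, ~86% utilization
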
  23. The right amount of redundancy is good. Too much is

    a waste. Aim for an SLA target so you know when to stop.
  24. Efficiency comes from starting with an SLA and buying enough

    resiliency to meet it (not exceed it).
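
Starting from the SLA turns “enough resiliency” into a number you can check. For example, a 99.9% uptime target allows roughly 43 minutes of downtime in a 30-day month; a quick calculation:

    # Downtime allowed by an availability target over a 30-day month.
    def downtime_minutes(availability: float, days: int = 30) -> float:
        return days * 24 * 60 * (1 - availability)

    for target in (0.999, 0.9995, 0.9999):
        print(f"{target:.4%} uptime -> {downtime_minutes(target):.1f} minutes of downtime per month")
    # 99.9000% uptime -> 43.2 minutes of downtime per month
    # 99.9500% uptime -> 21.6 minutes of downtime per month
    # 99.9900% uptime -> 4.3 minutes of downtime per month

Redundancy beyond what the target requires is the waste the previous slide warns about.
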
  25. Risky Behaviors are inherently risky • Smoking • Shooting yourself

    in the foot • Blindfolded chainsaw juggling
  26. Risky Processes can be improved through practice • Software Upgrades

    • Database Failovers • Network Trunk Failovers • Hardware Hot Swaps
  27. • StackExchange.com has a “DR” site in Oregon. • StackExchange.com

    runs on SQL Server with “AlwaysOn” Availability Groups plus… Redis, HAProxy, ISC BIND, CloudFlare, IIS, and many home-grown applications StackExchange.com Failover from NY to Oregon
  28. Process was risky • Took 10+ hours • Required “hands

    on” by 3 teams. • Certain people were S.P.O.F. • Found 30+ “improvements needed”
  29. Why? • Don’t want to find the problems in a

    crisis • Each drill “surfaces” areas of improvement. • Each member of the team gains experience and builds confidence. • “Smaller Batches” are better
  30. Datacenter Powerdown • Controlled shutdown every 3 months • Well-documented,

    up-to-date process • A/C leak - water next to power • Flawless emergency shutdown
  31. Software Upgrades • Traditional • Months of planning • Incompatibility

    issues • Very expensive • Very visible mistakes • By the time we’re done, time to start over again. • Distributed Computing • High frequency (many times a day or week) • Fully automated • Easy to fix failures • Cheap… encourages experiments
  32. Small batches are better Fewer changes each batch: • If

    there are bugs, easier to identify the source Reduced lead time: • It is easier to debug code written recently. Environment has changed less: • Fewer “external changes” to break on Happier, more motivated employees: • Instant gratification for all involved
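
A minimal sketch of the “high frequency, fully automated, small batches” upgrade style from the last two slides: roll the new version out a few servers at a time and halt at the first failed health check so a bad change has a small blast radius. deploy_to and healthy are hypothetical placeholders for real tooling.

    # Sketch: rolling upgrade in small batches; stop at the first unhealthy batch.
    # `deploy_to` and `healthy` are hypothetical placeholders for real tooling.
    from typing import Iterable, List

    def deploy_to(host: str, version: str) -> None:
        print(f"deploying {version} to {host}")        # stand-in for a real deployment step

    def healthy(host: str) -> bool:
        return True                                    # stand-in for a real health check

    def rolling_deploy(hosts: Iterable[str], version: str, batch_size: int = 2) -> None:
        host_list: List[str] = list(hosts)
        for i in range(0, len(host_list), batch_size):
            batch = host_list[i:i + batch_size]
            for host in batch:
                deploy_to(host, version)
            if not all(healthy(h) for h in batch):
                raise RuntimeError(f"rollout halted at batch {batch}")  # small blast radius

    rolling_deploy([f"web{i:02d}" for i in range(10)], "v1.2.3")
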
  33. Risk is inversely proportional to how recently a process has

    been used (Diagram: a scale from most risky, least recently exercised, to least risky, most recently exercised) Most risky: backups that have never been restored, software upgrades every 3 years. Least risky: LB web servers that fail all the time, continuous software deployment
  34. Netflix “Chaos Monkey” • Randomly reboots machines. • Keeps Netflix “on its toes”.

    • Part of the Simian Army: • Chaos Monkey (hosts) • Chaos Kong (data centers) • Latency Monkey (adds random performance delays)
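
Netflix’s Simian Army is a much larger system than this; what follows is only a toy sketch of the underlying idea: pick a random host during working hours and reboot it, so failures happen while people are around to watch and learn. The host names and the ssh reboot command are placeholder assumptions.

    # Toy chaos-monkey-style sketch (not Netflix's implementation):
    # reboot one random host so the service has to prove it tolerates the loss.
    import random
    import subprocess
    from datetime import datetime

    HOSTS = [f"app{i:02d}.example.com" for i in range(20)]   # placeholder fleet

    def reboot_random_host() -> None:
        if not (9 <= datetime.now().hour < 17):              # only during working hours
            return
        victim = random.choice(HOSTS)
        print(f"rebooting {victim}")
        subprocess.run(["ssh", victim, "sudo", "reboot"])    # placeholder reboot command

    reboot_random_host()
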
  35. Out-dated attitudes about outages • Expect perfection: 100% uptime •

    Punish exceptions: • fire someone to “prove we’re serious” • Results: • People hide problems • People stop communicating • Discourages transparency • Small problems get ignored, turn into big problems
  36. New thinking on outages • Set uptime goals: 99.9% +/-

    0.05 • Anticipate outages: • Strategic resiliency techniques, oncall system • Drills to keep in practice, improve process • Results: • Encourages transparency, communication • Small problems addressed, fewer big problems • Overall uptime improved
  37. After the outage, publish a postmortem document • People involved

    write a “blameless postmortem” • Identifies what happened, how, what can be done to prevent similar problems in the future. • Published widely internally and externally. • Instead of blame, people take responsibility: • Responsibility for implementing long-term fixes. • Responsibility for educating other teams how to learn from this.
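
One possible shape for such a blameless postmortem, following the points above (the headings are a suggestion, not a prescribed template):

    Summary: date, duration, and user-visible impact of the outage
    Timeline: what happened, in order, with timestamps
    Contributing factors: how it happened (systems and conditions, not names)
    What went well / what went poorly during the response
    Action items: long-term fixes to prevent similar problems, each with an owner
    Lessons to share with other teams
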
  38. I dunno about anybody else, but I really like getting

    these post-mortem reports. Not only is it nice to know what happened, but it’s also great to see how you guys handled it in the moment and how you plan to prevent these events going forward. Really neato. Thanks for the great work :) —Anna
  39. Take-homes • “cloud computing” = “distributed computing” 1. Use cheaper,

    less reliable hardware • Create reliability through software (when possible) • Pay only for the reliability you need 2. If a process/procedure is risky, do it a lot • Practice makes perfect • “Small Batches” improves quality and morale 3. Don’t punish people for outages • Focus on accountability and take responsibility
  40. ?

  41. Christine Hogan Senior Network Engineer, AT&T the-cloud-book.com “Radical” (Very Reasonable) Ideas on

    how to Fail Better from The Practice of Cloud System Administration
  42. If you liked this talk… …there’s more like it in

    http://the-cloud-book.com Save 37% www.informit.com/TPOSA Discount code SYSADMIN37 THOMAS A. LIMONCELLI • STRATA R. CHALUP • CHRISTINA J. HOGAN, The Practice of Cloud System Administration: Designing and Operating Large Distributed Systems, Volume 2
  43. Q&A