Introduction to Cloud Operations

Cloud Genius
October 27, 2013

Transcript

  1. Cloud Operations Management

    The art of consistently creating and deploying reliable software to an unreliable platform that scales horizontally
  2. Navigating Cloud Complexity

    - MYTH: Clouds just run by themselves
    - FACT: Operations will keep you busy
      - Not racking/stacking physical machines
      - Instead clicking/dragging virtual machines
  3. Failures Will Happen

    “You don’t choose the moment, the moment chooses you. You only get to choose how prepared you are when it does.” - Fire Chief Mike Burtch
  4. User Expectation: Cloud Just Works

    - Would your cloud always be available?

    $$\text{Availability} = \frac{\text{uptime}}{\text{uptime} + \text{downtime}}$$
  5. Availability and Downtime

    Availability          Downtime
    One 9 (90%)           36.5 days in a year
    Two 9s (99%)          3.65 days in a year
    Three 9s (99.9%)      8.76 hours in a year
    Four 9s (99.99%)      52 minutes in a year
    Five 9s (99.999%)     5 minutes in a year
    Six 9s (99.9999%)     31 seconds in a year
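The downtime figures in the table follow directly from the availability percentage. As a minimal sketch, assuming a 365-day year (which is what the table's 36.5 days for 90% implies), the arithmetic can be reproduced as follows. The function name `downtime_seconds_per_year` is illustrative and not from the deck.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Return the downtime per year, in seconds, implied by an availability level."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * SECONDS_PER_YEAR


# Print the downtime budget for each level listed in the table.
for label, percent in [
    ("One 9", 90.0),
    ("Two 9s", 99.0),
    ("Three 9s", 99.9),
    ("Four 9s", 99.99),
    ("Five 9s", 99.999),
    ("Six 9s", 99.9999),
]:
    seconds = downtime_seconds_per_year(percent)
    print(f"{label:<9} {percent:>8}%  {seconds / 3600:10.4f} hours  ({seconds:12.1f} s)")
```

The table rounds the smaller values down to whole minutes or seconds, so the printed numbers are slightly larger than the table entries for four or more 9s.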
  6. What Drives Availability?

    [Figure: timeline along a Time axis] Correct Behavior (MTTF) -> First Failure -> Diagnose (MTTD) -> Begin Repair -> Repair (MTTR) -> End Repair -> Correct Behavior (MTTF) -> Second Failure. MTBF spans the interval between failures.

    - MTTF: Mean Time To Failure
    - MTTD: Mean Time To Diagnose
    - MTTR: Mean Time To Repair
    - MTBF: Mean Time Between Failures
  7. Availability

    $$\text{Availability} = \frac{\text{MTTF}}{\text{MTBF}}$$

    - MTTF: Mean Time To Failure
    - MTTD: Mean Time To Diagnose
    - MTTR: Mean Time To Repair
    - MTBF: Mean Time Between Failures
  8. Availability

    $$\text{Availability} = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTD} + \text{MTTR}}$$

    - MTTF: Mean Time To Failure
    - MTTD: Mean Time To Diagnose
    - MTTR: Mean Time To Repair
    - MTBF: Mean Time Between Failures
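To see how the three terms interact, the formula on this slide can be evaluated directly. The sketch below is illustrative only: the function name and the sample figures (720 hours between failures, 30 minutes to diagnose, 90 minutes to repair) are assumptions made up for the example, not numbers from the deck.

```python
def availability(mttf_hours: float, mttd_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / (MTTF + MTTD + MTTR), where the denominator is MTBF."""
    mtbf_hours = mttf_hours + mttd_hours + mttr_hours
    if mtbf_hours <= 0:
        raise ValueError("MTTF + MTTD + MTTR must be positive")
    return mttf_hours / mtbf_hours


# Hypothetical baseline: a failure about once a month, 2 hours to recover in total.
baseline = availability(mttf_hours=720.0, mttd_hours=0.5, mttr_hours=1.5)

# Same failure rate, but diagnosis is cut from 30 minutes to 6 minutes.
faster_diagnosis = availability(mttf_hours=720.0, mttd_hours=0.1, mttr_hours=1.5)

print(f"baseline:         {baseline:.5%}")
print(f"faster diagnosis: {faster_diagnosis:.5%}")
```

With MTTF held fixed, shrinking either MTTD or MTTR raises availability, which is why the following slides break those two terms down further.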
  9. Mean Time to Diagnose

    - Time to detect
    - Time to notify
    - Time to respond
    - Time to understand the problem
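The four components on this slide can be measured per incident if each phase boundary is timestamped. The following is a minimal sketch under that assumption; the `Incident` record and its field names are hypothetical, and a real incident tracker may record different events.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class Incident:
    """Hypothetical incident record with one timestamp per phase boundary."""
    failed_at: datetime
    detected_at: datetime
    notified_at: datetime
    responded_at: datetime
    understood_at: datetime

    def phases(self) -> Dict[str, timedelta]:
        """Duration of each diagnosis phase named on the slide."""
        return {
            "detect": self.detected_at - self.failed_at,
            "notify": self.notified_at - self.detected_at,
            "respond": self.responded_at - self.notified_at,
            "understand": self.understood_at - self.responded_at,
        }

    def time_to_diagnose(self) -> timedelta:
        """Total time from failure until the problem is understood."""
        return sum(self.phases().values(), timedelta())


def mean_time_to_diagnose(incidents: List[Incident]) -> timedelta:
    """Average time to diagnose across a non-empty list of incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.time_to_diagnose() for i in incidents), timedelta())
    return total / len(incidents)


def _incident(start: datetime, *minutes: int) -> Incident:
    """Build an incident from a start time and four phase lengths in minutes."""
    stamps = [start]
    for m in minutes:
        stamps.append(stamps[-1] + timedelta(minutes=m))
    return Incident(*stamps)


# Two made-up incidents: 20 minutes and 40 minutes to diagnose.
sample = [
    _incident(datetime(2013, 10, 27, 12, 0), 5, 1, 4, 10),
    _incident(datetime(2013, 10, 28, 3, 0), 15, 5, 10, 10),
]
print(mean_time_to_diagnose(sample))
```

Keeping the per-phase breakdown, rather than only the total, shows which phase dominates and is therefore the best place to invest (for example, alerting versus on-call response).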
  10. Mean Time to Repair

    - Time to find information
    - Time to restart/reset a service
    - Time to solve the problem
    - (Mean time for blame game) :)
  11. Mean Time to Failure

    - Detect component fatigue
    - Check trends in capacity & usage
    - Monitoring service abuses
    - Security checks
    - Auditing/verification checks
    - Testing changes
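As one illustration of "check trends in capacity & usage", a simple trend line over recent usage samples can estimate how long until a resource fills up. This is only a sketch: the deck does not prescribe a method, the straight-line (least-squares) fit is an assumption, and the function name and sample data are made up.

```python
from typing import Optional, Sequence


def days_until_full(
    daily_usage_percent: Sequence[float],
    capacity_percent: float = 100.0,
) -> Optional[float]:
    """Estimate days from the last sample until usage reaches capacity.

    Fits a least-squares line through one sample per day. Returns None when
    usage is flat or shrinking, since no exhaustion date can be projected.
    """
    n = len(daily_usage_percent)
    if n < 2:
        raise ValueError("need at least two daily samples")

    days = range(n)
    mean_day = sum(days) / n
    mean_usage = sum(daily_usage_percent) / n

    # Ordinary least-squares slope and intercept.
    covariance = sum(
        (d - mean_day) * (u - mean_usage)
        for d, u in zip(days, daily_usage_percent)
    )
    variance = sum((d - mean_day) ** 2 for d in days)
    slope = covariance / variance
    if slope <= 0:
        return None
    intercept = mean_usage - slope * mean_day

    # Day index at which the fitted line crosses capacity, relative to the last sample.
    day_full = (capacity_percent - intercept) / slope
    return max(0.0, day_full - (n - 1))


# Made-up disk usage samples, one per day, growing 2 points per day.
print(days_until_full([50.0, 52.0, 54.0, 56.0, 58.0]))
```

A projection like this is only as good as the assumption of steady growth, so it is best used to trigger a closer look rather than as a precise deadline.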
  12. Finger pointing

    - It's not my code, it's your machines

    [Figure: sequence] Problem -> Arrrgh -> freaking out -> not talking -> fault finding -> blaming -> CYA -> whining -> hiding -> hurt ego -> figuring it out -> fixing things -> fixed
  13. Being Productive

    - It's our problem

    [Figure: sequence] Problem -> Arrrgh -> figuring it out -> fixing things -> fixed -> feeling guilty -> moving on with life
  14. OODA Loop (USAF Colonel John Boyd)

    - Observe, Orient, Decide, and Act
      - Originally applied to the combat operations process
      - Now applied to understand operations and processes

    http://en.wikipedia.org/wiki/OODA_loop
  15. Summary

    - Proactive Prevention versus Diagnosis and Repair
    - Things will fail
    - Plan ahead to minimize downtime
    - Plan ahead to improve availability