Lessons Learned From A Data Center Meltdown

Al Crowley
December 05, 2025

No amount of planning can prevent routine hardware failures or the occasional site-wide hardware disaster. Many times your advance planning will be an exact match for the problems the world throws at you. Every once in a while, though, something unexpected happens, and our human-made plans often carry human fallibilities that do not handle the unexpected very well. This presentation will discuss some general practices you can use during a major disruption in your primary data center to make the best of a bad situation. We will also discuss what you can do right now to make the inevitable next occurrence less disruptive. The information presented is based on lessons learned from multiple real-world incidents, both big and small.

Transcript

  1. WHERE AM I COMING FROM?

    • Grant management, research data, budget data
    • Low user volume, high complexity systems
    • University and government provided hosting

    This presentation contains proprietary data that may not be disclosed or released to any individual or party without written consent from TCG, Inc.
  2. 3

  3. “Be prepared.” - Every Boy Scout Ever
  4. JUSTIFYING COSTS

    • Emergency repairs
    • Lost productivity
    • Lost goodwill
    • Data replacement

    Lack of investment or unwillingness to invest
  5. “No plan survives contact with the enemy.” - Helmuth von Moltke
  6. WHAT CAN YOU DO?

    Develop guidelines that make the best of a bad situation
    Try to make the inevitable less disruptive
  7. First step: standard maintenance message

    Second step: deploy a read-only copy
  8. LESSON LEARNED: TIME

    Time to get the right folks to the table
    Escalation procedures
    Response times in your SLA?
  9. LESSON LEARNED: TEST YOUR COOP

    Test periodically
    New issues frequently come up over time as your main site evolves
  10. POSTMORTEM

    Blameless
    Focus on improvements
  11. POSTMORTEM

    What did we do at NITRC?
    “Plan B” became our “Plan A”