How Not to Go Boom: Lessons for SREs from Oil Refineries

Bad software doesn’t explode. You can describe it as exploding when it throws an exception, corrupts some data, or makes your computer unusable, but it doesn’t explode. When code doesn’t work, the solution is to figure out where the logic is incorrect and fix it. While SREs may be called engineers, we rarely face the kind of consequences that engineers in other industries do.

In contrast, when a chemical engineer makes a mistake designing a refinery, the consequences are very different. We’ve all seen videos of the repercussions online: big, loud explosions reducing massive facilities to chunks of twisted metal. The reality is that working with unstable chemicals is a lot harder than keeping track of pointers in C.

Yet despite the differences, industrial process plants can be surprisingly similar to complex software systems. Where refineries use pressure relief valves, web services degrade gracefully. Whether you’re protecting against thermal runaway in a plant or a cascading failure in a data center, the fundamental ideas apply to both domains.

In this talk, I’ll explore the techniques and ideas used to build and operate refineries and how we can use them to make our software systems more resilient and reliable.

Emil Stolarsky

March 29, 2018

Transcript

  1. How Not to Go Boom: Lessons for SREs from Oil Refineries. Emil Stolarsky | @EmilStolarsky
  2. None
  3. Resiliency

  4. None
  5. “If you get there and the Waffle House is closed? That's really bad.” - Craig Fugate, Director of FEMA (2009–2017)
  6. Oil Refineries

  7. Design for failure

  8. Explosion Isolation Systems

  9. Pressure Relief Systems

  10. Safe and Rapid Isolation of Piping Systems

  11. “If you think safety is expensive, try having an accident.” - Trevor Kletz, Chemical Process Safety Expert
  12. Fault Tree Analysis

  13. None
  14–21. [Fault tree diagram with events A, B, C, D, E, built up one step per slide]

  22. A B C D E, with each of the eight basic events labeled 2%

  23. p(B) = 2% + 2%; p(C) = 2% · 2% · 2%; p(E) = 2% · 2%

  24. p(A) = p(B) + p(C) + p(D); p(B) = 2% + 2%; p(C) = 2% · 2% · 2%; p(D) = 2% + p(E); p(E) = 2% · 2%

  25. p(A) = p(B) + p(C) + p(D); p(B) = 4%; p(C) = 0.0008%; p(D) = 2% + p(E); p(E) = 0.04%

  26. p(A) = p(B) + p(C) + p(D); p(B) = 4%; p(C) = 0.0008%; p(D) = 2.04%; p(E) = 0.04%

  27. p(A) = 6.0408%; p(B) = 4%; p(C) = 0.0008%; p(D) = 2.04%; p(E) = 0.04%

  28. p(A) = ??%; p(B) = 4%; p(C) = 0.0008%; p(D) = ??%; p(E) = 0.04%

  29. p(A) = ??%; p(B) = 4%; p(C) = 0.0008%; p(D) = 2% · 0.04%; p(E) = 0.04%

  30. p(A) = ??%; p(B) = 4%; p(C) = 0.0008%; p(D) = 0.0008%; p(E) = 0.04%

  31. p(A) = 4.0016%; p(B) = 4%; p(C) = 0.0008%; p(D) = 0.0008%; p(E) = 0.04%

  32. p(A) = 4.0016%, compared with the earlier p(A) = 6.0408%; p(B) = 4%; p(C) = 0.0008%; p(D) = 0.0008%; p(E) = 0.04%
  33. Learning from Failure

  34. None
  35. Center for Chemical Process Safety

  36. U.S. Chemical Safety and Hazard Investigation Board

  37. None
  38. Steam Boilers

  39. Thank you.
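
The arithmetic on the fault tree slides (22 to 32) can be reproduced with a short script. This is only a sketch: the slides give the formulas but not the gate labels, so it assumes that a sum corresponds to an OR gate, a product to an AND gate, and that all basic events are independent. The function names (`and_gate`, `or_gate`, `or_gate_exact`, `top_event`) are hypothetical and do not come from the talk.

```python
from math import prod

BASIC = 0.02  # every basic event has a 2% failure probability, as on the slides


def and_gate(*probs):
    """AND gate: all inputs must fail, so multiply (assumes independence)."""
    return prod(probs)


def or_gate(*probs):
    """OR gate as the slides compute it: a plain sum (rare-event approximation)."""
    return sum(probs)


def or_gate_exact(*probs):
    """Exact OR for independent inputs: 1 - P(no input fails)."""
    return 1 - prod(1 - p for p in probs)


def top_event(d_gate, or_fn=or_gate):
    """p(A) for the tree on the slides; d_gate selects how D combines its inputs."""
    p_b = or_fn(BASIC, BASIC)             # p(B) = 2% + 2%
    p_c = and_gate(BASIC, BASIC, BASIC)   # p(C) = 2% * 2% * 2%
    p_e = and_gate(BASIC, BASIC)          # p(E) = 2% * 2%
    p_d = d_gate(BASIC, p_e)              # p(D) depends on the gate under test
    return or_fn(p_b, p_c, p_d)           # p(A) = p(B) + p(C) + p(D)


p_a_or = top_event(or_gate)     # D treated as an OR gate (slides 24 to 27)
p_a_and = top_event(and_gate)   # D treated as an AND gate (slides 28 to 31)

print(f"p(A) with D as OR:  {p_a_or:.4%}")   # 6.0408%
print(f"p(A) with D as AND: {p_a_and:.4%}")  # 4.0016%
```

Changing the single gate at D from OR to AND moves the top-event probability from 6.0408% to 4.0016%, which is the comparison slide 32 makes. Passing `or_gate_exact` as `or_fn` gives slightly lower values, because a plain sum overestimates the probability that at least one of several independent events occurs.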