
SRECon16


Shared notes on the SRECon16 sessions that caught my interest

tsuyoshi nakamura

August 31, 2016


Transcript

  1. Agenda
     1. Learn about SRE at other companies
        1. In case of Microsoft Azure SRE
        2. In case of New Relic
        3. In case of Pinterest
        4. In case of Netflix
     2. Closing summary
  2. In case of Microsoft Azure SRE
     Caskey L. Dickson and Jake Welch
     https://www.usenix.org/sites/default/files/conference/protected-files/srecon16_slides_welch.pdf
     https://www.usenix.org/conference/srecon16/program/presentation/dickson
  3. Why do it?
     •  Builds relationships and trust between the teams
     •  SRE learns about the service
     •  Dramatically speeds up the ‘newbie to expert’ process
     •  Grows the product at an accelerating pace
     •  Exposes details that otherwise would be difficult (or painful) to learn of
     •  Eliminates "secret sauce" knowledge that only a few people hold
     •  Creates a shared backlog of improvements
     •  Shares open issues
  4. Tone
     •  Not an attack on the service
     •  Not a judgment of past choices
     •  Focus on ‘How’ questions not ‘Why’ questions
     •  Why’s can be seen as judgmental
     •  Every participant must understand this
     •  Managing emotions is critical to a safe discussion environment
  6. History
     •  Now 100% hosted on AWS, but previously an on-premises environment
     •  This describes the time before cloud services became widespread
     •  1. Individual servers matter.
     •  2. Failure is expensive, so it must be prevented.
     •  3. Capacity planning can make or break you.
     •  4. Sometimes your destiny is still outside your control.
     Operational Materialism (roughly, "materialism in operations"?)
  7. Now
     •  1. Cloud servers can, and do, fail at any time, for any reason.
     •  2. Trying to prevent this server failure is an endless source of suffering for SREs and DBAs alike.
     •  Trying to prevent server failure leads only to suffering
     •  3. Accepting the impermanence of our servers, we should design systems that are failure-resilient, not failure-resistant.
     •  Cloud-based servers can fail at any time, for any reason.
     •  Automated replacement
     •  Configuration management tools
     •  4. We can break the cycle of suffering and create a better experience for end users, internal customers, and colleagues
     Operational Buddhism (keep watching over things with a calm, Buddha-like mind? lol)
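  8. Closing summary
     •  Of course, the details of the role differ from company to company
     •  As I also felt with DevOps, when you grow a service at speed, work inevitably falls between teams, like a bloop hit dropping between fielders
     •  I feel it all starts with how you pick up those dropped balls
     •  If you act with the team as your priority, I think you naturally end up doing SRE-like tasks
     •  Part of me thinks the SRE label was attached so that this kind of work gets properly recognized
     •  Mindset may matter more than technology?!
     •  I feel it also includes various PM-like elements
     •  “SRE should not be a Servant”
     •  Useful resources for learning
     •  https://github.com/dastergon/awesome-sre/blob/master/README.md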
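
Slide 7 contrasts failure-resistant with failure-resilient design and lists automated replacement as one ingredient. The Python below is a minimal sketch of that idea, not something taken from the talk. The names Server, Pool, reconcile, and make_provisioner are hypothetical, and "provisioning" here only creates an in-memory object where a real system would call a cloud API or a configuration management tool.

```python
from dataclasses import dataclass
from itertools import count
from typing import Callable, List


@dataclass
class Server:
    """Hypothetical stand-in for a cloud instance."""
    name: str
    healthy: bool = True


class Pool:
    """Keeps a fixed number of servers by replacing failed ones
    instead of trying to stop them from failing."""

    def __init__(self, size: int, provision: Callable[[], Server]):
        self.size = size
        self.provision = provision
        self.servers: List[Server] = [provision() for _ in range(size)]

    def reconcile(self) -> List[str]:
        """Drop unhealthy servers, provision replacements,
        and return the names of the servers that were replaced."""
        failed = [s.name for s in self.servers if not s.healthy]
        self.servers = [s for s in self.servers if s.healthy]
        while len(self.servers) < self.size:
            self.servers.append(self.provision())
        return failed


def make_provisioner(prefix: str = "db") -> Callable[[], Server]:
    """Return a function that creates sequentially named servers (assumed naming)."""
    ids = count(1)

    def provision() -> Server:
        return Server(name=f"{prefix}-{next(ids)}")

    return provision


# Simulate one server failing, then let the pool heal itself.
pool = Pool(size=3, provision=make_provisioner())
pool.servers[1].healthy = False
replaced = pool.reconcile()
print("replaced:", replaced)
print("pool now:", [s.name for s in pool.servers])
```

The failure itself is treated as routine: nothing tries to repair the broken server, and the pool only restores the desired count. In practice reconcile would run periodically, driven by real health checks.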