SRECont16

SREcon 2016-08-26 internal study session Tsuyoshi Nakamura

https://www.usenix.org/conference/srecon16

I first heard about SREcon at a study session, then worked my way through the videos and slides for each session.

Agenda
1.  Learn about SRE at other companies
    1.  In case of Microsoft Azure SRE
    2.  In case of New Relic
    3.  In case of Pinterest
    4.  In case of Netflix
2.  Closing summary

In case of Microsoft Azure SRE
Caskey L. Dickson and Jake Welch
https://www.usenix.org/sites/default/files/conference/protected-files/srecon16_slides_welch.pdf
https://www.usenix.org/conference/srecon16/program/presentation/dickson

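Service Roast
•  Purpose: understand and state clearly the product's issues, including shortcomings, design oversights, and problems everyone already knows about
•  Cover the whole service lifecycle, from development through disaster recovery
•  List the points that need improvement and keep improving them continuously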

Why do?
•  Builds relationships and trust between the teams
•  SRE learns about the service
•  Dramatically speeds up ‘newbie to expert’ process
•  Accelerates the growth of the product
•  Exposes details that otherwise would be difficult (or painful) to learn of
•  Eliminates "secret sauce" knowledge held by only a few people
•  Creates a shared backlog of improvements
•  Shares the known issues

Tone
•  Not an attack on the service
•  Not a judgment of past choices
•  Focus on ‘How’ questions not ‘Why’ questions
•  Why’s can be seen as judgmental
•  Every participant must understand this
•  Managing emotions is critical to a safe discussion environment

In case of New Relic
Alice Goldfuss
https://www.usenix.org/conference/srecon16/program/presentation/goldfuss
https://www.usenix.org/sites/default/files/conference/protected-files/srecon16_slides_goldfuss_0.pdf

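Summary
•  A team that adapted its process from government and military incident response
•  An application of the Incident Command System
•  Apparently quite well known in the US
•  Each role is clearly defined
•  The overall impact of an incident is given particular consideration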

In case of Pinterest
Ernie Souhrada
https://www.usenix.org/conference/srecon16/program/presentation/souhrada
https://www.usenix.org/sites/default/files/conference/protected-files/srecon16_slides_souhrada.pdf

History
•  Today Pinterest is 100% hosted on AWS, but it used to run in an on-premises environment
•  This describes the time before cloud services became widespread
•  1. Individual servers matter.
•  2. Failure is expensive, so it must be prevented.
•  3. Capacity planning can make or break you.
•  4. Sometimes your destiny is still outside your control.
Operational Materialism (roughly, a materialist view of operations?)

Now
•  1. Cloud servers can, and do, fail at any time, for any reason.
•  2. Trying to prevent this server failure is an endless source of suffering for SREs and DBAs alike.
•  Trying to prevent server failure leads only to suffering
•  3. Accepting the impermanence of our servers, we should design systems that are failure-resilient, not failure-resistant.
•  Cloud-based servers can fail at any time, for any reason.
•  Automated replacement
•  Configuration management tools
•  4. We can break the cycle of suffering and create a better experience for end users, internal customers, and colleagues
Operational Buddhism (keep watching over the servers with a calm, Buddha-like mind? lol)
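As a rough illustration of "failure-resilient, not failure-resistant", the sketch below shows automated replacement in miniature: a fleet keeps a desired number of healthy servers and replaces any that fail instead of trying to prevent the failure. Everything here is hypothetical and not taken from the talk: the `Server` and `Fleet` classes, the `db-N` names, and the `launch` step, which only stands in for real provisioning plus configuration management.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Server:
    """A hypothetical, disposable cloud instance."""
    name: str
    healthy: bool = True


@dataclass
class Fleet:
    """Keeps a fixed number of healthy servers by replacing failed ones."""
    desired: int
    servers: list = field(default_factory=list)
    _ids: count = field(default_factory=lambda: count(1))

    def launch(self) -> Server:
        # Stand-in for "provision an instance and apply configuration management".
        server = Server(name=f"db-{next(self._ids)}")
        self.servers.append(server)
        return server

    def reconcile(self) -> list:
        """Drop unhealthy servers, launch replacements, return the names replaced."""
        failed = [s.name for s in self.servers if not s.healthy]
        self.servers = [s for s in self.servers if s.healthy]
        while len(self.servers) < self.desired:
            self.launch()
        return failed


fleet = Fleet(desired=3)
fleet.reconcile()                 # initial launch of three servers
fleet.servers[1].healthy = False  # a server fails "at any time, for any reason"
replaced = fleet.reconcile()      # the failed server is replaced, not repaired
print(replaced, [s.name for s in fleet.servers])
```

The point of the design is that `reconcile` never inspects why a server failed. It only compares the current state with the desired state, which is what makes individual servers disposable.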

In case of Netflix
Jonah Horowitz
https://www.usenix.org/conference/srecon16/program/presentation/horowitz
https://www.usenix.org/sites/default/files/conference/protected-files/srecon16_slides_horowitz.pdf

topic
•  The service runs in 190 countries, yet there are only 5 SREs!?
•  SREs are expensive & hard to find
•  Freedom & Responsibility

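Closing summary
•  Naturally, what the role covers differs from company to company
•  As I also felt in the DevOps era, when you grow a service at speed, balls inevitably drop between fielders (work that nobody owns)
•  I get the sense that it all starts with how you pick up those dropped balls
•  If you act with the team as the priority, I think you naturally end up doing SRE-like tasks
•  I think the SRE label exists partly so that this work gets properly recognized
•  Is mindset actually more important than technology?!
•  It also seems to include a fair amount of PM-like work
•  “SRE should not be a Servant”
•  Useful reading: https://github.com/dastergon/awesome-sre/blob/master/README.md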
