Slide 1

Slide 1 text

SREcon
2016-08-26 internal study session
Tsuyoshi Nakamura

Slide 2

Slide 2 text

https://www.usenix.org/conference/srecon16

Slide 3

Slide 3 text

I first heard about SREcon at a study session, and then worked through the videos and slides of each session.

Slide 4

Slide 4 text

Agenda
1. Learn about SRE at other companies
   1. Microsoft Azure SRE
   2. New Relic
   3. Pinterest
   4. Netflix
2. Closing thoughts

Slide 5

Slide 5 text

In the case of Microsoft Azure SRE
Caskey L. Dickson and Jake Welch
https://www.usenix.org/sites/default/files/conference/protected-files/srecon16_slides_welch.pdf
https://www.usenix.org/conference/srecon16/program/presentation/dickson

Slide 6

Slide 6 text

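Service Roast
Goal:
•  Understand and clearly lay out the product issues that everyone already knows about, such as weaknesses and design oversights
•  Grasp the whole service lifecycle, from Dev through disaster recovery
•  List the points that should be improved, and keep working on those improvements continuously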

Slide 7

Slide 7 text

Why do it?
•  Builds relationships and trust between the teams
•  SRE learns about the service
•  Dramatically speeds up the ‘newbie to expert’ process
   •  Helps the product grow at an accelerating pace
•  Exposes details that would otherwise be difficult (or painful) to learn about
   •  Eliminates “secret sauce” knowledge held by only a few people
•  Creates a shared backlog of improvements
   •  Issues are shared across teams

Slide 8

Slide 8 text

Tone
•  Not an attack on the service
•  Not a judgment of past choices
•  Focus on ‘How’ questions, not ‘Why’ questions
   •  ‘Why’ questions can come across as judgmental
•  Every participant must understand this
•  Managing emotions is critical to a safe discussion environment

Slide 10

Slide 10 text

In the case of New Relic
Alice Goldfuss
https://www.usenix.org/conference/srecon16/program/presentation/goldfuss
https://www.usenix.org/sites/default/files/conference/protected-files/srecon16_slides_goldfuss_0.pdf

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

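Summary
•  A team that adapted its incident response process from government and military practice
•  An application of the Incident Command System
•  Apparently it is fairly well known in the US
•  Each role is clearly defined
•  Particular attention is paid to the overall, system-wide impact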

Slide 13

Slide 13 text

In the case of Pinterest
Ernie Souhrada
https://www.usenix.org/conference/srecon16/program/presentation/souhrada
https://www.usenix.org/sites/default/files/conference/protected-files/srecon16_slides_souhrada.pdf

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

History
•  Now 100% hosted on AWS, but previously ran on-premises
•  This is a story from before cloud services became widespread
1. Individual servers matter.
2. Failure is expensive, so it must be prevented.
3. Capacity planning can make or break you.
4. Sometimes your destiny is still outside your control.
Operational Materialism (“materialism” in operations?)

Slide 16

Slide 16 text

Now
1. Cloud servers can, and do, fail at any time, for any reason.
2. Trying to prevent this server failure is an endless source of suffering for SREs and DBAs alike.
   •  Trying to prevent server failure leads only to suffering
3. Accepting the impermanence of our servers, we should design systems that are failure-resilient, not failure-resistant.
   •  Cloud-based servers can fail at any time, for any reason.
   •  Automated replacement
   •  Configuration management tools
4. We can break the cycle of suffering and create a better experience for end users, internal customers, and colleagues.
Operational Buddhism (keep watching over the systems with a calm mind, like the Buddha? lol)
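The idea of failure-resilient design through automated replacement can be illustrated with a minimal reconcile-loop sketch. This is not Pinterest's implementation. The `Fleet` and `Server` classes, the `db-N` naming, and the in-memory `healthy` flag are hypothetical stand-ins for a real cloud API, health checks, and configuration management. Instead of trying to keep every server alive, the loop drops unhealthy servers and launches replacements until the desired capacity is restored.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Server:
    name: str
    healthy: bool = True  # hypothetical stand-in for a real health check


@dataclass
class Fleet:
    desired: int                                   # target number of servers
    servers: list = field(default_factory=list)
    _ids: count = field(default_factory=lambda: count(1))

    def launch(self) -> Server:
        # Hypothetical stand-in for "provision an instance, then apply
        # configuration management" in a real environment.
        server = Server(name=f"db-{next(self._ids)}")
        self.servers.append(server)
        return server

    def reconcile(self) -> list:
        """Discard unhealthy servers, then launch replacements until the
        fleet is back at the desired size. Returns the newly launched servers."""
        self.servers = [s for s in self.servers if s.healthy]
        launched = []
        while len(self.servers) < self.desired:
            launched.append(self.launch())
        return launched


fleet = Fleet(desired=3)
fleet.reconcile()                  # launches db-1, db-2, db-3
fleet.servers[1].healthy = False   # db-2 "fails at any time, for any reason"
fleet.reconcile()                  # replaces it with db-4
print([s.name for s in fleet.servers])  # ['db-1', 'db-3', 'db-4']
```

The loop does not try to repair the failed server. It simply replaces it, and that is the "failure-resilient, not failure-resistant" stance from the slide.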

Slide 17

Slide 17 text

In the case of Netflix
Jonah Horowitz
https://www.usenix.org/conference/srecon16/program/presentation/horowitz
https://www.usenix.org/sites/default/files/conference/protected-files/srecon16_slides_horowitz.pdf

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

Topic
•  The service runs in 190 countries, yet there are only 5 SREs?!
•  SREs are expensive & hard to find
•  Freedom & Responsibility

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

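Closing thoughts
•  Naturally, the scope of the role differs from company to company.
•  As I also felt in the DevOps era: when you grow a service at speed, some things inevitably fall through the cracks, like a bloop hit that drops between fielders.
•  SRE seems to have started with the question of how to catch those dropped balls.
•  If a team puts the team first, its members end up naturally handling SRE-style tasks,
•  but I think the “SRE” label was attached partly so that this work would be properly recognized.
•  The mindset may matter more than the technology?!
•  The role also seems to include quite a few PM-like elements.
•  “SRE should not be a Servant”
•  Useful resources for learning
•  https://github.com/dastergon/awesome-sre/blob/master/README.md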