Where Chaos Engineering comes from, and what's next

itkq
April 13, 2019


Transcript

 1. 2.

  whoami
  • @itkq [ˈɪtəkəʊ]
  • SRE @ Cookpad
  • SLI/SLO, availability, ...
  • I want to make Web systems autonomous and put myself out of a job
  Web System Architecture Study Group #4 (@itkq)
 2. 3.

  Agenda
  • Self-introduction
  • The rise of Chaos Engineering
  • Resilience Engineering and Safety-II
  • The relationship between SRE and Chaos Engineering
  • Can an antifragile Web system exist?
 3. 5.

  The rise of Chaos Engineering
  • The term "Chaos Engineering" first appeared in 2015 [1]
  • The "Chaos Engineering" paper was published in 2016 [2]
  • The "Chaos Engineering" book was published in 2017 [3]
  • Gremlin: Failure as a Service [4] (2017 onward)
  • Whatever the state of practice, the concept has spread widely
  [1] https://medium.com/netflix-techblog/chaos-engineering-upgraded-878d341f15fa
  [2] Chaos Engineering, Ali Basiri, Niosha Behnam, Ruud de Rooij, Lorin Hochstein, Luke Kosewski, Justin Reynolds, Casey Rosenthal
  [3] https://www.oreilly.com/library/view/chaos-engineering/9781491988459/
  [4] http://principlesofchaos.org
 4. 8.

  What was new about Chaos Engineering?
  "Companies such as Amazon, Google, Microsoft, and Facebook had been applying similar techniques to test the resilience of their own systems. We call the activities that form this discipline, which has emerged in our industry, 'Chaos Engineering'." (Netflix [2])
  [2] Chaos Engineering, Ali Basiri, Niosha Behnam, Ruud de Rooij, Lorin Hochstein, Luke Kosewski, Justin Reynolds, Casey Rosenthal
 5. 10.

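  Resilience Engineering [5]
  • A methodology for improving the resilience of socio-technical systems [6]
  • Resilience: a concept denoting a state of excellent elasticity, restorability, and recoverability
  [5] Resilience Engineering - Concepts and Precepts, E. Hollnagel, D. D. Woods, and N. Leveson, Eds., Ashgate Publishing Ltd., Aldershot, England, 2006
  [6] レジリエンスエンジニアリングが目指す安全 Safety-II とその実現法 (Safety-II, the safety that Resilience Engineering aims for, and how to realize it), Masaharu Kitamura, IEICE Fundamentals Review Vol.8 No.2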
 6. 13.

  Limits of the concept of "safety"
  • Excessive influence
    • Slogans such as "n days and counting with zero accidents"
  • Baseless overconfidence
    • "We protect safety this thoroughly, so a severe accident could never happen"
 7. 14.

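  From Safety-I to Safety-II
  • Safety-I
    • The conventional notion of "safety", defined negatively and statically
  • Safety-II [10]
    • "When a major disturbance or similar event prevents the system from maintaining its normal operating state, it can keep operating, even at reduced performance"
    • "Once conditions recover, it can quickly return to its original state or an equivalent one"
    • Being able to behave resiliently
    • This does not reject Safety-I outright; it is what lies beyond it
  [10] E. Hollnagel, Safety-I and Safety-II - The Past and Future of Safety Management, Ashgate Publishing Ltd., Surrey, England, 2014.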
 8. 15.

  The four fundamental capabilities
  • Respond
    • Includes improvised, situation-dependent responses
  • Monitor
    • Ideally with the ability to act proactively
  • Anticipate
    • Not necessarily data-driven
  • Learn
    • Continuously improving the capabilities above
 9. 16.

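  How investment in safety is perceived differently
  • Under Safety-I
    • The implicit assumption is that the ideal is for nothing to happen, so investment looks like non-refundable insurance and easily meets resistance
  • Under Safety-II
    • The goal is continued operation, so investment that raises the chance of continuing can be justified even if no major disturbance or environmental change ever occurs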
 10. 18.

  What about Web systems?
  • They change constantly
  • There is no silver bullet
  • Having operations graded by deducting points for failures is painful
  • ...
  • What if we replace "safety" with "reliability" and "accident" with "outage"?
 11. 19.

  Reliability of Web systems
  • An important metric for many Web systems
  • Targeting 100% reliability is a mistake [7]
  • A methodology for controlling the reliability of Web systems => SRE
  • SRE can be seen as one implementation of Resilience Engineering
  [7] Site Reliability Engineering, Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy
 12. 20.

  SLOs and error budgets
  • As long as the SLO (service level objective) is being met, releases can go out
    • => error budget remains
  • When no error budget remains
    • Raise the system's resilience
    • Loosen the SLO
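  The release rule on this slide can be sketched in a few lines of Python. This is only an illustration and is not taken from the talk: the function names, the 99.9% target, and the request counts are assumptions chosen for the example, and the budget is measured in failed requests over some window.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent).

    All names here are hypothetical. `slo_target` is the success-rate target,
    e.g. 0.999, and `total` / `failed` are request counts for the SLO window.
    """
    if not 0.0 < slo_target < 1.0:
        # A 100% target leaves no budget at all, which the deck calls a mistake.
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1.0 - slo_target) * total  # the budget, in requests
    return 1.0 - failed / allowed_failures


def can_release(slo_target: float, total: int, failed: int) -> bool:
    """Releases are allowed only while some error budget remains."""
    return error_budget_remaining(slo_target, total, failed) > 0.0


if __name__ == "__main__":
    # 99.9% over 1,000,000 requests allows about 1,000 failures.
    print(can_release(0.999, 1_000_000, 400))   # True: budget remains
    print(can_release(0.999, 1_000_000, 1500))  # False: budget is overspent
```

  When `can_release` returns False, the two options on the slide apply: invest in resilience so fewer requests fail, or loosen `slo_target`.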
 13. 21.

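  Evaluating resilience in the SRE context
  • An SLO is not a direct evaluation criterion for resilience
  • On the other hand, Web system outages are easier to reproduce than natural disasters
  • Rather than only waiting for an outage to occur, we can emulate outages and so evaluate the system's resilience as a process
    • This is Chaos Engineering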
 14. 22.

  Chaos Engineering reconsidered
  • Targeting 100% reliability for a Web system is a mistake
  • The goal of SRE is "to pursue maximum change velocity without falling below the service's SLO" [7]
  • To avoid exhausting the error budget, Safety-II has to be raised
  • Chaos Engineering, which emulates outages, is one method for evaluating resilience as a process
  [7] Site Reliability Engineering, Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy
 15. 23.

  What Chaos Engineering gives you
  • Verification of known-unknown vulnerabilities
    • Example: how does the SLI change when an instance is suddenly taken down?
  • Discovery of unknown-unknown problems
    • For example, correlated ones
    • Example: inconsistency caused by a fallback cache that was added to raise resilience [11]
  [11] https://medium.com/netflix-techblog/from-chaos-to-control-testing-the-resiliency-of-netflixs-content-discovery-platform-ce5566aef0a4
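  The known-unknown example above (take an instance down and watch the SLI) can be mimicked with a toy, fully in-memory simulation. This is a sketch under stated assumptions rather than anything from the talk: `Instance`, `availability_sli`, the four-instance pool, and the round-robin balancer with retries are all hypothetical, and a real experiment would inject the failure into a running system and read the SLI from monitoring.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Instance:
    """A hypothetical backend instance that either serves or fails a request."""
    name: str
    alive: bool = True

    def handle(self) -> bool:
        return self.alive


def availability_sli(pool, requests, retries=0):
    """Send requests round-robin and return the fraction that succeeded."""
    targets = cycle(pool)
    ok = 0
    for _ in range(requests):
        # One initial attempt plus up to `retries` more, each on the next
        # instance in the rotation; any() stops at the first success.
        if any(next(targets).handle() for _ in range(retries + 1)):
            ok += 1
    return ok / requests


if __name__ == "__main__":
    pool = [Instance(f"web-{i}") for i in range(4)]
    print(availability_sli(pool, 1000))             # steady state: 1.0

    pool[0].alive = False                           # inject the failure
    print(availability_sli(pool, 1000))             # no retries: 0.75
    print(availability_sli(pool, 1000, retries=1))  # one retry: 1.0
```

  In this toy setup, losing one of four instances costs a quarter of the requests unless the client retries on another instance, which is the kind of hypothesis such an experiment is meant to confirm or refute.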
 16. 24.

  Limits of Chaos Engineering
  • Unknown-unknown outages and vulnerabilities cannot be emulated
    • For example, contagious ones [8]
    • Example: the Linux leap second bug
  • "Diversity" is one way to stand up to them [8]
    • Diversity...?
  [8] https://www.gremlin.com/blog/adrian-cockroft-chaos-engineering-what-it-is-and-where-its-going-chaos-conf-2018/
 17. 26.

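  Antifragility
  • A concept regarded as the opposite of fragility [9]
  • Examples
    • Muscle: putting it under load makes it stronger than before
    • Information: efforts to destroy it nourish it more than efforts to spread it
  [9] 反脆弱性［上］――不確実な世界を生き延びる唯一の考え方 (Japanese edition of Antifragile, vol. 1), Nassim Nicholas Taleb (author), Mamoru Mochizuki (trans.), Toshio Chiba (trans.)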
 18. 27.

  Antifragility
  • Fragility: the property of getting weaker under shock
  • Robustness: the property of not changing under shock
  • Resilience: the property of adapting to shock
  • Antifragility: the property of getting stronger under shock
 19. 28.

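  The layered structure of antifragile systems
  • Behind antifragility lies a layered structure [9]
  • For Web systems as a whole to "evolve" antifragile, individual Web systems must be fragile and able to fail
  • For a Web system to be antifragile, its individual components must be fragile and able to fail
  • For a component to be antifragile, its individual hosts must be fragile and able to fail => diversity of hosts
  [9] 反脆弱性［上］――不確実な世界を生き延びる唯一の考え方 (Japanese edition of Antifragile, vol. 1), Nassim Nicholas Taleb (author), Mamoru Mochizuki (trans.), Toshio Chiba (trans.)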