Where Chaos Engineering comes from, and what's next

itkq

April 13, 2019

Transcript

  1. whoami • @itkq [ˈɪtəkəʊ] • SRE @ Cookpad • SLI/SLO, availability, ... • I want to make web systems autonomous and put myself out of a job

    Web System Architecture Study Group #4 (@itkq)
  2. Agenda • Self-introduction • The rise of Chaos Engineering • Resilience Engineering and Safety-II • The relationship between SRE and Chaos Engineering • Can an antifragile web system exist?
  3. The rise of Chaos Engineering • The term "Chaos Engineering" first appeared in 2015 [1] • The "Chaos Engineering" paper came out in 2016 [2] • The "Chaos Engineering" book came out in 2017 [3] • Gremlin: Failure as a Service [4] (2017 onward) • Practice aside, the concept itself has spread widely

    [1] https://medium.com/netflix-techblog/chaos-engineering-upgraded-878d341f15fa [2] Chaos Engineering, Ali Basiri, Niosha Behnam, Ruud de Rooij, Lorin Hochstein, Luke Kosewski, Justin Reynolds, Casey Rosenthal [3] https://www.oreilly.com/library/view/chaos-engineering/9781491988459/ [4] http://principlesofchaos.org
  4. What was new about Chaos Engineering? • Companies such as Amazon, Google, Microsoft, and Facebook were already applying similar techniques to test the resilience of their own systems. The activity that shapes this discipline, which has emerged in our industry, is what we call "Chaos Engineering" (Netflix [2])

    [2] Chaos Engineering, Ali Basiri, Niosha Behnam, Ruud de Rooij, Lorin Hochstein, Luke Kosewski, Justin Reynolds, Casey Rosenthal
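  5. Resilience Engineering [5] • A methodology for improving the resilience of socio-technical systems [6] • Resilience: a concept denoting a state of superior elasticity, restoring force, and recoverability

    [5] Resilience Engineering - Concepts and Precepts, E. Hollnagel, D. D. Woods, and N. Leveson, Eds., Ashgate Publishing Ltd., Aldershot, England, 2006 [6] Masaharu Kitamura, "レジリエンスエンジニアリングが目指す安全 Safety-II とその実現法" (Safety-II, the safety that Resilience Engineering aims for, and how to realize it), IEICE Fundamentals Review Vol.8 No.2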
  6. Limits of the concept of "safety" • Excessive influence • Slogans such as "n days without an accident" • Groundless overconfidence • "We protect safety this thoroughly, so a severe accident could never happen"
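  7. From Safety-I to Safety-II • Safety-I: the conventional notion of "safety", defined negatively and statically • Safety-II [10]: "When a major disturbance prevents the system from maintaining its normal operating state, it can keep operating, even at reduced performance" • "Once conditions recover, it can promptly return to its original state or an equivalent one" • In short, the ability to behave resiliently • Safety-II does not reject Safety-I outright; it is what lies beyond it

    [10] E. Hollnagel, Safety-I and Safety-II - The Past and Future of Safety Management, Ashgate Publishing Ltd., Surrey, England, 2014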
  8. The four fundamental capabilities • Respond: including improvised, case-by-case responses • Monitor: ideally with the ability to act proactively • Anticipate: not necessarily data-driven • Learn: continuously improve the capabilities above
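  9. How investment in safety is perceived • Safety-I based: because of the implicit assumption that "nothing happening" is the desired outcome, investment is seen as non-refundable insurance and tends to meet resistance • Safety-II based: because the goal is continued operation, investment that raises that likelihood can be justified even if no major disturbance or environmental change ever occurs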
  10. What about web systems? • They change constantly • There is no silver bullet • Having operations graded by deducting points is painful • ... • What if we replace "safety" with "reliability" and "accident" with "failure"?
  11. Reliability of web systems • A key metric for many web systems • Targeting 100% reliability is a mistake [7] • A methodology for controlling web system reliability => SRE • SRE can be seen as one implementation of Resilience Engineering

    [7] Site Reliability Engineering, Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy
  12. SLO and error budget • As long as the SLO (service level objective) is met, releases can go out • => error budget remains • When no error budget remains: raise the system's resilience, or relax the SLO
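The release rule on the SLO slide can be sketched in Python as follows. This is a minimal illustration that assumes a request-based availability SLI; the `ErrorBudget` class, its field names, and the sample numbers are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical request-based error budget for one SLO window."""

    slo_target: float      # assumed SLO, e.g. 0.999 = 99.9% of requests succeed
    total_requests: int    # requests observed in the window
    failed_requests: int   # requests that failed in the window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining(self) -> float:
        return self.allowed_failures - self.failed_requests

    def can_release(self) -> bool:
        # Releases may go out only while some budget remains.
        return self.remaining > 0


healthy = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
exhausted = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=1_200)

print(healthy.can_release())    # True: roughly 600 failures of budget remain
print(exhausted.can_release())  # False: raise resilience or relax the SLO
```

When `can_release()` returns `False`, the two options on the slide apply: invest in resilience so fewer requests fail, or lower `slo_target` so the budget grows.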
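  13. Evaluating resilience in the SRE context • An SLO is not a direct evaluation criterion • On the other hand, web system failures are easier to reproduce than natural disasters • Rather than only waiting for failures to happen, emulating them makes it possible to process-evaluate the system's resilience • This is Chaos Engineering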
  14. Chaos Engineering revisited • Targeting 100% reliability for a web system is a mistake • The goal of SRE is "pursuing maximum change velocity without violating a service's SLO" [7] • To avoid exhausting the error budget, Safety-II must be raised • Chaos Engineering, which emulates failures, is one method for process-evaluating resilience

    [7] Site Reliability Engineering, Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy
  15. What Chaos Engineering gives you • Verification of known-unknown weaknesses • Example: how the SLI changes when an instance is suddenly terminated • Discovery of unknown-unknown problems • For instance, correlated ones • Example: inconsistency caused by a fallback cache that was introduced to improve resilience [11]

    [11] https://medium.com/netflix-techblog/from-chaos-to-control-testing-the-resiliency-of-netflixs-content-discovery-platform-ce5566aef0a4
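The known-unknown example above (terminate an instance and watch the SLI) can be sketched as a toy simulation. This is not a real fault-injection tool: the `availability_sli` function, the round-robin balancer, the single-retry policy, and the 99.9% SLO are all assumptions made for illustration.

```python
from itertools import cycle


def availability_sli(instances, killed, requests, retry):
    """Return the success ratio of round-robin requests over the instance pool."""
    ring = cycle(instances)
    ok = 0
    for _ in range(requests):
        if next(ring) not in killed:
            ok += 1
        elif retry and next(ring) not in killed:
            # One retry against the next instance in the ring.
            ok += 1
    return ok / requests


pool = ["a", "b", "c", "d"]
slo = 0.999  # assumed availability SLO

baseline = availability_sli(pool, killed=set(), requests=1000, retry=False)
no_retry = availability_sli(pool, killed={"b"}, requests=1000, retry=False)
with_retry = availability_sli(pool, killed={"b"}, requests=1000, retry=True)

print(baseline, baseline >= slo)      # 1.0 True
print(no_retry, no_retry >= slo)      # 0.75 False
print(with_retry, with_retry >= slo)  # 1.0 True
```

The hypothesis under test is "losing one instance does not violate the SLO". The simulated experiment refutes it for the client without retries and confirms it for the client with one retry, which is the kind of evidence a real experiment would collect from production SLIs.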
  16. Limits of Chaos Engineering • Unknown-unknown failures and weaknesses cannot be emulated • For instance, contagious ones [8] • Example: the Linux leap second bug • "Diversity" is one way to confront them [8] • Diversity...?

    [8] https://www.gremlin.com/blog/adrian-cockcroft-chaos-engineering-what-it-is-and-where-its-going-chaos-conf-2018/
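  17. Antifragility • A concept regarded as the opposite of fragility [9] • Examples • Muscle: becomes stronger than before when put under load • Information: efforts to destroy it feed it more than efforts to spread it

    [9] 反脆弱性[上]: 不確実な世界を生き延びる唯一の考え方 (Japanese edition of Antifragile, vol. 1), Nassim Nicholas Taleb, translated by Mamoru Mochizuki and Toshio Chiba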
  18. Antifragility • Fragility: weakened by shocks • Robustness: unchanged by shocks • Resilience: adapts to shocks • Antifragility: strengthened by shocks
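  19. The layered structure of antifragile systems • Behind antifragility lies a layered structure [9] • For web systems as a whole to "evolve" in an antifragile way, individual web systems must be fragile and able to fail • For a web system to be antifragile, its individual components must be fragile and able to fail • For a component to be antifragile, its individual hosts must be fragile and able to fail => host diversity

    [9] 反脆弱性[上]: 不確実な世界を生き延びる唯一の考え方 (Japanese edition of Antifragile, vol. 1), Nassim Nicholas Taleb, translated by Mamoru Mochizuki and Toshio Chiba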