Research Paper Introduction #65 "Chaos engineering: Building confidence in system behavior through experiments"

cafenero_777

July 20, 2022

Transcript

  1. Research Paper Introduction #18
    “Chaos engineering: Building confidence in system behavior through experiments”
    Cumulative #65
    @cafenero_777
    2021/01/28

  2. Agenda
    • Target paper
    • Overview and why I chose to read it
    1. Part I. Introduction: Why do Chaos Engineering?
    2. Part II. The Principles of Chaos: What is Chaos Engineering?
    3. Part III. Chaos In Practice: How do you actually do Chaos Engineering?

  3. $ which
    • Chaos engineering: Building confidence in system behavior through experiments
      • Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones, and Ali Basiri
      • Netflix, Inc. 2017
    • Report? book?
      • https://learning.oreilly.com/library/view/chaos-engineering/9781491988459/
    • Blog
      • https://www.oreilly.com/content/chaos-engineering

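  4. Overview and why I chose to read it
    • Overview
      • Every system will "go down" at some point, but pinning down causes in a complex system by static analysis is hard
      • Run well-designed experiments in production (that is, do Chaos Engineering) to identify causes and improve the system
      • Shares rules of thumb from roughly 10 years of CE experience at Netflix
    • Why I chose to read it
      • I want to reduce incidents
      • Chaos Engineering looks promising: it seems interesting, and operational know-how looks reusable
      • What I read before was the applied side (framework-oriented); I want to learn the fundamentals
      • Previous talk: Automating chaos experiments in production https://arxiv.org/abs/1905.04648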

  5. PRINCIPLES OF CHAOS ENGINEERING
    https://principlesofchaos.org/


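  7. Netflix's CE history (as I understand it)
    • 2008: full migration to AWS begins (everything except the CDN)
      • Chaos Monkey: stops VMs
    • Christmas Eve 2012: major outage in a single region
      • Chaos Kong: stops a region
    • The misconception arose that CE means breaking things, which prompted a rethink
    • 2016: ChAP (Chaos Automation Platform), introduced for microservices
    • 2017: PRINCIPLES OF CHAOS ENGINEERING published
    • 2019: Automating chaos experiments in production
    Today's talk covers roughly this part.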

  8. Part I. Introduction
    • Why do Chaos Engineering?

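  9. Why Do Chaos Engineering? (1/2)
    • Distributed systems produce events that are hard to anticipate
      • Many components, and they interact
      • Disk/network failures and traffic surges degrade the service level
    • We want to identify many of the system's weaknesses "before" they bite
      • Chaos engineering is the systematic formulation of this approach
    • @Netflix
      • 2008 onward: started after the AWS migration; VM/region failures, the Christmas outage
      • Chaos Monkey/Chaos Kong, FIT (Failure Injection Testing), integration with Spinnaker, Principles of Chaos Engineering
      • Chaos Monkey: only one service-impacting incident in the past 5 years
      • Chaos Kong: triggered once a month; problems surfaced frequently at first, stable by the second year
    Examples of distributed systems: Clos networks, distributed LBs, distributed storage, IaaS/PaaS/CaaS on k8s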

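  10. Why Do Chaos Engineering? (2/2)
    • Applies the discipline of empirical investigation: experiment in order to understand the system
      • Learn about weaknesses that could lead to an outage
      • Done proactively in advance, not after the fact as a postmortem countermeasure
    • Much overlaps with existing testing: concerns, tools, and so on
      • Testing: does the system satisfy a known, specific condition? Yes or no (e.g., individual assertions)
      • CE: generates new information. E.g., network latency, randomly thrown exceptions, time travel (clock changes), high I/O or CPU load
    • Prerequisites
      • If there are weaknesses you already know about, fix them first. Then use CE to reveal the ones you had not noticed
      • Monitoring: enough visibility to judge whether the system is in steady state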

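  11. Managing Complexity
    • Performance, availability, and fault tolerance are required, plus development velocity
      • Adopt microservices so that each team develops and operates independently
      • Speed and flexibility are secured at the expense of human understanding
    • Problems are hard to find with up-front testing
      • Known assertions alone cannot cover everything
      • E.g., queues, autoscaling, local caches, scale-down. Is the average load alone sufficient??
    • Bullwhip effect
      • Each microservice behaves rationally on its own, yet the end result can be impossible to anticipate
    [Figure: request/response flow; labels: request, get meta-data (x2), map/reduce, response]
    Aside: is the same true of a certain recently popular technology? Answer: deep learning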
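The question "is the average load alone sufficient?" on the Managing Complexity slide can be illustrated with a small sketch. It is not from the paper: the latency values and the `summarize` helper are hypothetical, and a nearest-rank 99th percentile is just one possible tail measure.

```python
import statistics


def summarize(samples):
    """Return (mean, p99) of the samples, using the nearest-rank percentile."""
    ordered = sorted(samples)
    # Nearest-rank p99: the value at position ceil(0.99 * n), counted from 1.
    rank = max(1, -(-99 * len(ordered) // 100))
    return statistics.fmean(ordered), ordered[rank - 1]


# Hypothetical latencies in ms: 98 fast requests and 2 very slow ones.
latencies_ms = [10.0] * 98 + [2000.0] * 2
mean_ms, p99_ms = summarize(latencies_ms)
print(f"mean={mean_ms:.1f} ms, p99={p99_ms:.1f} ms")
```

The mean stays under 50 ms while the tail sits at 2 seconds, which is the kind of behavior an average-only view hides.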

  12. Part II. The Principles of Chaos
    • What is Chaos Engineering?

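  13. Part II. The Principles of Chaos
    • Chaos engineering
      • It is not random (unplanned), and it does not mean inducing chaos in your work
      • It means understanding how interactions between components can make a system chaotic
      • Inject just a little chaos into the system and see how the system functions
    • Hypothesize about steady state
    • Vary real-world events
    • Run experiments in prod
    • Automate experiments to run continuously
    • Minimize the blast radius (scope of impact)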

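  14. Hypothesize about Steady State (1/2)
    • What is steady state? = the expected values of the business logic, or the SLA
      • Various KPIs (ad impressions, add-to-cart events, volume of inquiries)
    • Metrics and alert monitoring
      • CPU/Memory/NW, I/O, RPS, duration
      • Throughput, error rate, 99th percentile
      • Rate at which the play button is pressed: SPS
    • Even better if it can be verified together with the client side
    • Some metrics need to be read with their periodic pattern in mind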

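  15. Hypothesize about Steady State (2/2)
    • A hypothesis is required
      • Without one, you do not even know which metrics to watch
      • The hypothesis has the form: "Even if we run this CE experiment, the system stays in steady state."
      • E.g., the personalized list fails -> the default list is shown instead, with no effect on SPS (starts per second)
      • E.g., SPS stays stable even when traffic is redirected to another region
    • How should thresholds be set?
      • Use a value that lets you tell whether the hypothesis holds (a range you consider reasonable; a subjective amount of deviation)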
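The steady-state hypothesis and its threshold described on this slide can be sketched as a small check. This is a minimal illustration under stated assumptions, not how Netflix implements it: the class name, the baseline of 1000 SPS, and the 5% tolerance are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SteadyStateHypothesis:
    """'The metric stays near its baseline while the experiment runs.'"""

    metric: str
    baseline: float         # expected value in steady state (assumed)
    tolerance_ratio: float  # subjective allowed deviation, e.g. 0.05 = 5%

    def holds(self, observed: float) -> bool:
        # The hypothesis holds while the deviation stays inside the band.
        allowed = self.baseline * self.tolerance_ratio
        return abs(observed - self.baseline) <= allowed


# Hypothetical numbers: SPS is expected to stay within 5% of 1000.
hypothesis = SteadyStateHypothesis("SPS", baseline=1000.0, tolerance_ratio=0.05)
print(hypothesis.holds(970.0))  # deviation of 30 is inside the band
print(hypothesis.holds(940.0))  # deviation of 60 is outside the band
```

Choosing `tolerance_ratio` is the subjective part the slide refers to: it only needs to be tight enough to distinguish "hypothesis holds" from "hypothesis refuted".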

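  16. Vary Real-World Events (1/2)
    • Estimate each event's frequency and impact, and the cost and complexity of handling it
      • VM termination: frequent, simple, low cost
      • Region outage: complex and high cost
    • Cultural factors also count as a kind of cost
      • In the data center: robustness and stability >>> | insurmountable wall | >>> agility
      • With cloud adoption (outsourcing responsibility for HW), on the other hand, HW failure has come to be taken for granted
    Example events:
    • Hardware failures
    • Functional bugs
    • Transmission errors caused by inconsistent state
    • Network latency and partitions
    • Large fluctuations in input, and retry storms
    • Resource exhaustion
    • Byzantine failures, race conditions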

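  17. Vary Real-World Events (2/2)
    • Events
      • High load, communication latency, network partitions, invalid certificates, data bloat
    • Deploying buggy code
      • A canary release keeps the scale small, but there is still impact
      • Emulating it by returning errors on service calls is safer (e.g., fail only the API calls from specific clients)
    • Run CE so that the failure domain (impact and scope) stays narrow
      • Even if the event turns out to have a different cause, it is easier to isolate
      • Note that resources (groups of services) are shared, so not everything can be cleanly delineated into failure domains
    • Only try events the system is expected to handle (and fix what you find). Breaking things pointlessly is pointless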
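The idea of emulating a failure by returning errors only for specific clients can be sketched as a wrapper around a request handler. Everything here is hypothetical (the `with_fault_injection` helper, the handler, and the client IDs); it only illustrates how narrow the failure domain can be made.

```python
class InjectedFault(Exception):
    """Raised in place of a real failure for targeted clients."""


def with_fault_injection(handler, failing_clients):
    """Wrap a handler so that only the listed clients see an injected error."""

    def wrapped(client_id, request):
        if client_id in failing_clients:
            raise InjectedFault(f"injected failure for client {client_id}")
        return handler(client_id, request)

    return wrapped


def get_recommendations(client_id, request):
    # Hypothetical service call standing in for a real API.
    return {"client": client_id, "items": ["a", "b"]}


# Only one test device is inside the failure domain; everyone else is untouched.
api = with_fault_injection(get_recommendations, failing_clients={"test-device-1"})
print(api("user-42", {}))
```

Because the fault is raised at the call boundary, no buggy code is actually deployed, and removing the client from `failing_clients` ends the experiment.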

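  18. Run Experiments in Production
    • Beyond the correctness of the code, we want to assure the correctness of the system as a whole
    • Emulate the same state as prod and try it in dev?
      • There are limits to emulating inputs and external systems
      • With CI, source code and config files change all the time
    • Common excuses
      • Systems where human lives are directly at stake are a fair exception, but those should be a tiny minority
      • "The system has no resilience" (or is the point to hide that fact?)
    • So what do we do?
      • Make sure the experiment can be stopped immediately
      • Keep the blast radius (scope of impact) minimal
      • (If prod really is not an option) experiment in an environment close to production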

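  19. Automate Experiments to Run Continuously
    • Automated test execution
      • Production changes through CI/CD -> automate CE as well and check continuously
      • Ideally, run on every code change (a chaos canary?)
      • For reference: the system, not a person, detects anomalies and stops the experiment. Impact is templated. Spinnaker launches the experiment nodes in production
    • Automated test generation
      • Humans identify events and failures that have not happened yet but plausibly could, and write the tests (so it comes down to intuition?!)
      • Known events recorded in the Incident Tracker ~= the minimum set of tests to run
    See also: LDFI: lineage-driven fault injection
    Automating Failure Testing Research at Internet Scale, https://dl.acm.org/doi/10.1145/2987550.2987555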
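"The system, not a person, detects anomalies and stops the experiment" can be sketched as a polling loop with a guaranteed rollback. This is an illustration only: `run_experiment` and its callback parameters are hypothetical names, and the canned SPS readings stand in for a real monitoring system.

```python
def run_experiment(inject, rollback, read_metric, is_healthy, max_checks):
    """Inject a fault, poll a metric, and stop as soon as it looks unhealthy.

    Returns ("aborted", n) if the n-th check (0-based) failed, otherwise
    ("completed", max_checks). Rollback always runs, even on exceptions.
    """
    inject()
    try:
        for check in range(max_checks):
            if not is_healthy(read_metric()):
                return ("aborted", check)
        return ("completed", max_checks)
    finally:
        rollback()


# Hypothetical wiring: the third reading falls below the abort threshold.
events = []
readings = iter([1000.0, 990.0, 700.0, 1000.0])
result = run_experiment(
    inject=lambda: events.append("inject"),
    rollback=lambda: events.append("rollback"),
    read_metric=lambda: next(readings),
    is_healthy=lambda sps: sps >= 950.0,
    max_checks=4,
)
print(result, events)
```

Putting the rollback in `finally` means the fault is removed whether the run completes, aborts, or the monitoring call itself fails.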

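  20. Minimize Blast Radius
    • The Chernobyl nuclear accident was caused by a "test" carried out during maintenance...
    • Example of widening the scope step by step:
      • Target a small number of users (client side)
      • Target a small number of users (beyond the client)
      • All users, but routed to a specific place (service or logic)
      • All users, all services
    • A functional fallback is mandatory, and so is an emergency stop (automatic termination is recommended)
    • Run experiments during the daytime (when help is available) so that you can respond immediately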
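The step-by-step widening of the blast radius listed above can be sketched as a loop that stops at the first stage where the steady-state hypothesis fails. The `escalate` helper and the simulated outcome are hypothetical; the stage names paraphrase the slide.

```python
def escalate(stages, run_stage):
    """Run stages from smallest to largest blast radius; stop at the first failure.

    Returns (stages that passed, stage that failed or None).
    """
    passed = []
    for stage in stages:
        if not run_stage(stage):
            return passed, stage
        passed.append(stage)
    return passed, None


STAGES = [
    "few users, client side",
    "few users, beyond the client",
    "all users, one service",
    "all users, all services",
]

# Hypothetical outcome: the third stage violates the steady-state hypothesis.
passed, stopped_at = escalate(STAGES, lambda stage: stage != "all users, one service")
print(passed, stopped_at)
```

The largest stage is never reached once a smaller one fails, which is the point of ordering the stages by impact.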

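  21. Part III. Chaos In Practice
    • How do you actually do Chaos Engineering?
    • For those who are unsure whether to do it
      • Which is better?
        • Cause a small failure at a time of your choosing, and fix it
        • Fix things after a large failure hits at a time you did not choose
      • Even banking and finance have started doing CE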

  22. Designing Experiments
    1. Form a hypothesis
    2. Decide the scope of the experiment
    3. Decide the metrics
    4. Notify people
    5. Run the experiment
    6. Analyze
    7. Widen the blast radius (BR)
    8. Automate

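  31. Designing Experiments
    1. Form a hypothesis
      • Does B still behave as intended when A breaks?
      • What are the mean time to detect, the duration of impact, and the time to recover? How do people respond?
      • Example hypotheses: what happens when Redis times out; what happens on an active/standby failover
    2. Decide the scope of the experiment
      • Minimize the BR and run in the prod environment
      • Do a dry run in the dev environment
    3. Decide the metrics
      • Define "normal" and the indicators of deviation from it (abnormal)
      • Decide the threshold for an emergency stop
    4. Notify people
      • That CE is in progress, why, and what it involves
      • Notification matters most the first few times
    5. Run the experiment
      • Keep an eye on the metrics while running
      • Pay attention to external systems as well
    6. Analyze
      • Was the hypothesis from step 1 correct?
      • Feed the results back to the related teams
    7. Widen the BR
      • Widen the scope once you are confident in the analysis
      • Watch for new kinds of impact appearing
    8. Automate
      • Move from manual runs to scheduled automatic runs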
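The first four design steps can be treated as a pre-flight checklist that must be complete before step 5 runs. The sketch below is one possible encoding, not something the paper prescribes: `ExperimentPlan`, its field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExperimentPlan:
    """Hypothetical record of design steps 1-4, filled in before running."""

    hypothesis: str = ""
    scope: str = ""
    metrics: List[str] = field(default_factory=list)
    abort_threshold: Optional[float] = None
    notified: List[str] = field(default_factory=list)

    def missing_steps(self) -> List[str]:
        # Each entry pairs a design step with whether it has been filled in.
        checks = [
            ("1. hypothesis", bool(self.hypothesis)),
            ("2. scope", bool(self.scope)),
            ("3. metrics and abort threshold",
             bool(self.metrics) and self.abort_threshold is not None),
            ("4. notification", bool(self.notified)),
        ]
        return [name for name, done in checks if not done]


plan = ExperimentPlan(
    hypothesis="SPS stays in steady state when Redis times out",
    scope="1% of prod traffic",
    metrics=["SPS"],
)
print(plan.missing_steps())
```

Here the plan names a metric but no emergency-stop threshold and no notified teams, so steps 3 and 4 are reported as incomplete.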

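  32. Designing Experiments
    1. Form a hypothesis
      • Does D still behave as intended when C breaks?
      • What are the mean time to detect, the duration of impact, and the time to recover? How do people respond?
    2. Decide the scope of the experiment
    3. Decide the metrics
    4. Notify people
    5. Run the experiment
    6. Analyze
    7. Widen the BR
    8. Automate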

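  33. Chaos Maturity Model (CMM)
    Sophistication (lowest to highest)
    • Run manually in dev and monitor system metrics
    • Production-like environment; automated execution and monitoring; business metrics monitored
    • Results are aggregated manually; experiments are statically defined
    • Automated execution in prod (monitoring and result aggregation)
    • Experiments are integrated with CD; controlled experiments are possible (comparing business metrics)
    • Automated execution in all environments
    • A/B tests and dynamic BR adjustment; profit-and-loss and capacity-planning forecasts
    Adoption (lowest to highest)
    • Hardly adopted; awareness of CE is low
    • Run officially; some teams are interested; CE is run occasionally
    • A team is dedicated to practicing CE; regular runs on selected targets; feedback through IR and game days
    • Frequent CE runs; CE is done as a matter of course and becomes a basic skill. CE as a default.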

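  34. Conclusion
    • We are asked to build and operate distributed systems && to sustain and increase development speed.
    • Chaos Engineering is effective for improving system resiliency.
    • CE itself is a very young field. Please help with tools, know-how, and advocacy.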

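  35. What is chaos in the mathematical sense? (a loose explanation)
    • The behavior is deterministic (governed by differential equations)
      • It is not random (no stochastic process is involved)
    • Characteristics
      • Sensitive to initial conditions
      • Nonlinear behavior
      • No periodicity (although it can look periodic)
      • Does not diverge
    • Examples: weather forecasting, the double pendulum (https://youtu.be/zdW6nTNWbkc)
    • Hard to predict, but reproducible (to a limited extent). That is what gets exploited.
    [Figure: logistic map, a=1.8]
    https://ja.wikipedia.org/wiki/%E3%83%AD%E3%82%B8%E3%82%B9%E3%83%86%E3%82%A3%E3%83%83%E3%82%AF%E5%86%99%E5%83%8F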
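The properties listed on this slide (deterministic, sensitive to initial conditions, bounded) can be checked numerically on the logistic map. This sketch assumes the common form x_{n+1} = r * x_n * (1 - x_n) with r = 3.9, a value usually treated as chaotic for that form; the slide's a=1.8 is not reused because the slide does not state which form of the map its figure plots.

```python
def logistic_orbit(x0, r, steps):
    """Iterate x_{n+1} = r * x_n * (1 - x_n) and return the whole orbit."""
    orbit = [x0]
    for _ in range(steps):
        orbit.append(r * orbit[-1] * (1.0 - orbit[-1]))
    return orbit


R = 3.9  # assumed parameter in the chaotic regime of this form of the map

a = logistic_orbit(0.2, R, 100)
b = logistic_orbit(0.2 + 1e-10, R, 100)  # almost the same starting point
again = logistic_orbit(0.2, R, 100)      # exactly the same starting point

max_gap = max(abs(p - q) for p, q in zip(a, b))
print(f"largest gap between the two nearby orbits: {max_gap:.3f}")
print("same start reproduces the same orbit:", a == again)
```

A difference of 1e-10 in the starting value grows until the two orbits are unrelated, while rerunning from the identical start reproduces the orbit exactly and every value stays between 0 and 1.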

  36. EoP
