Slide 1

Slide 1 text

Research Paper Introduction #9 “Automating chaos experiments in production” @cafenero_777 2020/04/14 (#36 overall)

Slide 2

Slide 2 text

• Congratulations! The reading group is one year old!

Slide 3

Slide 3 text

$ which
• Automating chaos experiments in production
• Ali Basiri, Lorin Hochstein, Nora Jones, Haley Tucker
  • Netflix
• ACM/IEEE ICSE-SEIP ’19
  • (International Conference on Software Engineering, Software Engineering in Practice)
• https://2019.icse-conferences.org/track/icse-2019-Software-Engineering-in-Practice?track=ICSE%20Software%20Engineering%20in%20Practice#program
• https://arxiv.org/abs/1905.04648

Slide 4

Slide 4 text

Agenda
• Overview and why I chose this paper
• Introduction
• CONTEXT: NETFLIX
• ChAP
• MONOCLE
• EXPERIMENT GENERATION
• RESULTS
• CHALLENGES AND LESSONS LEARNED
• CONCLUSION

Slide 5

Slide 5 text

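Overview and why I chose this paper
• Overview
  • An overview of a system for ensuring the health of distributed systems (Chaos Monkey) and how it is operated in production
• Why I chose this paper
  • Stable operation of distributed systems (what perspective should we take in the first place?)
  • Chaos Monkey in production use
  • Found via a podcast
    • https://misreading.chat/2019/07/09/episode-64-automating-chaos-experiments-in-production/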

Slide 6

Slide 6 text

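Introduction
• Internet services = distributed systems
  • From two machines to several thousand
• Strategies for recovering from unexpected behavior
  • Timeouts, retries, fallbacks
  • In general, it is unclear how well these resilience mechanisms actually work
• Chaos Engineering (an engineering approach)
  • Built a platform that automatically generates chaos experiments and operated it for three years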

Slide 7

Slide 7 text

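CONTEXT: NETFLIX
• Availability matters
  • A multi-device video streaming service
  • What matters most is whether streaming works normally (ex. 99.99%, 4 nines)
  • This is a business requirement, not the kind of availability expected of a telecom carrier
• Microservices
  • A group of services that communicate with each other over RPC
  • ex. search, metadata display (HD, 5.1), the +1 button
  • Visualized with Vizceral
  • Services can be deployed independently
  • Fault domains can be kept small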

Slide 8

Slide 8 text

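CONTEXT: NETFLIX (Cont.)
• Resilience through timeouts, retries, and fallbacks
  • The control plane runs on a public cloud
  • Hardware faults and network faults occur
  • Every RPC is configured with a timeout, retries, and a fallback
  • The Java Hystrix library is used as a command wrapper
  • Fallback example: if the suggest feature breaks, show default results
• Because the "failure" paths are rarely exercised, there is little confidence that they behave as expected
  • Hence: build a platform that can exercise them frequently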
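The timeout / retry / fallback pattern on this slide can be sketched as follows. This is only an illustration in Python, not Hystrix itself: the function name `call_with_resilience` and its parameters are hypothetical, and a thread pool stands in for whatever isolation mechanism the real library uses.

```python
import concurrent.futures


def call_with_resilience(fn, timeout_s, retries, fallback):
    """Call fn with a timeout, retry on error or timeout, then use the fallback.

    Hypothetical helper for illustration; not part of any real library.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=retries + 1)
    try:
        for _ in range(retries + 1):
            future = pool.submit(fn)
            try:
                # Wait at most timeout_s for this attempt.
                return future.result(timeout=timeout_s)
            except Exception:
                # The attempt timed out or raised: try again if attempts remain.
                continue
        # Every attempt failed: degrade gracefully instead of propagating.
        return fallback()
    finally:
        # Do not block on attempts that are still running.
        pool.shutdown(wait=False)
```

For the suggest example above, `fallback` would return the default results. The slide's point is that this last branch almost never runs in normal operation, so nobody knows whether it works.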

Slide 9

Slide 9 text

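ChAP: Chaos Automation Platform
• Overview
  • Evaluates whether the system as a whole stays healthy when a single service degrades
• Failure model
  • A service becomes slow (response time increases): hardware resource exhaustion
  • A service fails (returns errors): a bug gets pushed
• FIT (Failure Injection Testing) system
  • Fault injection is hooked into the common Java libraries used at Netflix, driven by embedded metadata
    • Failure: throw an exception without executing the call
    • Latency: deliberately delay before executing the call
  • REST, gRPC, Hystrix, EVCache, Cassandra client, etc.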
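The two injected behaviors (fail without executing, or delay before executing) can be sketched as a single injection point. This is a minimal Python illustration, not FIT's actual Java code: `FaultSpec`, `InjectedFailure`, and `injection_point` are hypothetical names.

```python
import time
from dataclasses import dataclass


@dataclass
class FaultSpec:
    """Hypothetical fault metadata carried along with a request."""
    target: str            # dependency to affect, e.g. "bookmark"
    kind: str              # "failure" or "latency"
    delay_s: float = 0.0   # used only when kind == "latency"


class InjectedFailure(Exception):
    """Raised in place of the real call when a failure is injected."""


def injection_point(name, call, fault=None, sleep=time.sleep):
    """Wrap a dependency call and apply the fault if it targets this dependency."""
    if fault is not None and fault.target == name:
        if fault.kind == "failure":
            # Failure: raise without ever executing the real call.
            raise InjectedFailure(name)
        if fault.kind == "latency":
            # Latency: delay on purpose, then execute the real call.
            sleep(fault.delay_s)
    return call()
```

Requests that carry no fault metadata, or metadata aimed at another dependency, pass through unchanged. That is what lets the same library run in production for all traffic.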

Slide 10

Slide 10 text

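ChAP: Chaos Automation Platform (Cont.)
• Example: make the bookmark service fail
  • bookmark: the service that tracks the seek position of a previously watched video, so that playback can resume partway through
  • Even if bookmark is broken, letting streaming proceed normally matters more
• Example experiment: use 1% of active traffic as the canary
  • Operated from the UI: make calls to the bookmark service fail and observe the API service
  • Separate VIPs are assigned for baseline and canary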

Slide 11

Slide 11 text

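ChAP: Chaos Automation Platform (Cont.)
• Metrics
  • Atlas: telemetry system
    • Final aggregation happens here; time lag of 5 minutes
  • Mantis: stream processing system
    • Processes events from each microservice
    • Counts video playbacks and downloads and sends them to ChAP every second
    • If an anomaly appears, the experiment is stopped immediately
• Request flow during a test
  • Zuul: Front (reverse proxy)
    • Filters the target traffic (1%) and changes which API cluster is called (API-baseline, API-canary)
    • Attaches fault injection metadata to the request (throw an exception when the bookmark service is called)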
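The routing step at the front proxy can be sketched as below. This is a guess at the general shape in Python, not Zuul's implementation. It assumes that customers are bucketed by a stable hash, that baseline and canary each receive the same share, and that the fault metadata travels in a header; the header name `x-fault-injection` and the cluster name `API` for untouched traffic are hypothetical.

```python
import hashlib


def bucket(customer_id, buckets=1000):
    """Map a customer to a stable bucket so routing does not flap between requests."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def route(customer_id, fault_metadata, per_mille=10):
    """Send per_mille/1000 of customers to the canary and the same share to the baseline."""
    b = bucket(customer_id)
    if b < per_mille:
        # Canary: only these requests carry the fault injection metadata.
        return {"cluster": "API-canary",
                "headers": {"x-fault-injection": fault_metadata}}
    if b < 2 * per_mille:
        # Baseline: same routing treatment, but no fault injected.
        return {"cluster": "API-baseline", "headers": {}}
    # Everyone else is served by the regular cluster.
    return {"cluster": "API", "headers": {}}
```

Giving the baseline its own cluster of the same size means the canary is compared against traffic that was routed the same way, so the injected fault is the only difference.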

Slide 12

Slide 12 text

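[Architecture diagram. Labels: Web UI; baseline and canary provisioning; telemetry system (time lag of about 5 minutes); stream processing system (time lag of 1 second); Front (reverse proxy); event monitor; dashboard management; dashboard; canary analysis system; CD; traffic splitting]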

Slide 13

Slide 13 text

ChAP: Chaos Automation Platform (Cont.)
• Lumen: dashboard management
  • Compares baseline and canary
  • Key performance KPI: number of successful stream starts (SPS: stream starts per second)
  • Health metrics: request rate, latency, error rate, CPU utilization

Slide 14

Slide 14 text

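ChAP: Chaos Automation Platform (Cont.)
• Safeguards
  • Business hours: experiments run only from 9:00 to 17:00, when engineers should be able to respond quickly
  • Automated stop: if the customer impact is large, the experiment is stopped automatically and early
  • Total traffic: experiments may cover at most 5% of total traffic
  • Failover: no experiments while a cross-region failover is in progress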
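The safeguards above amount to a pre-flight check plus an abort rule, roughly as sketched here. The 9:00 to 17:00 window and the 5% cap come from the slide. Everything else is an assumption: the weekday restriction, the 5% SPS drop threshold, and the function names.

```python
from datetime import datetime


def may_start(now, experiment_traffic_fraction, failover_in_progress):
    """Pre-flight check before an experiment is allowed to start."""
    # Weekday restriction is an assumption; the slide only gives 9:00-17:00.
    in_business_hours = now.weekday() < 5 and 9 <= now.hour < 17
    # The fraction counts all running experiments together (cap: 5%).
    within_traffic_cap = experiment_traffic_fraction <= 0.05
    return in_business_hours and within_traffic_cap and not failover_in_progress


def should_abort(baseline_sps, canary_sps, max_relative_drop=0.05):
    """Abort when the canary's stream starts per second fall too far below baseline.

    The 5% threshold is illustrative only, not a value from the paper.
    """
    if baseline_sps <= 0:
        return False
    return (baseline_sps - canary_sps) / baseline_sps > max_relative_drop
```

For example, `may_start(datetime(2020, 4, 14, 10, 0), 0.03, False)` allows the experiment, while the same call at 18:00 or during a failover refuses it.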

Slide 15

Slide 15 text

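MONOCLE
• MONOCLE: a set of tools that generates experiments automatically so that users (engineers) can run them easily
  • Before: engineers defined their own experiments in ChAP. After: the ChAP team generates experiments automatically and engineers use them.
• Service introspection
  • Collects dependency information from RPC clients and Hystrix commands
    • Does it look safe to fail? Is a fallback configured? And so on
  • Collects timeout values and related data from the telemetry system and other sources
    • Latency over the past two weeks (mean, 90th, 99th, 99.5th percentile), etc.
• Automatic experiment generation
  • Failure, or added latency (below the timeout, or above it)
  • Results are reviewed in the Web UI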
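Generation for one dependency can be sketched as below: one failure experiment, one latency experiment that stays under the timeout, and one that exceeds it. The slide does not say how far below or above the timeout the injected latency sits, so the `margin_ms` parameter, the dictionary layout, and the function name are assumptions.

```python
def generate_experiments(dependency, timeout_ms, margin_ms=50):
    """Return the three experiment types for one dependency.

    margin_ms is a made-up illustration value, not a number from the paper.
    """
    below_timeout = max(timeout_ms - margin_ms, 0)
    above_timeout = timeout_ms + margin_ms
    return [
        # The call fails outright.
        {"dependency": dependency, "type": "failure"},
        # The call is slow but should still complete within the timeout.
        {"dependency": dependency, "type": "latency",
         "added_latency_ms": below_timeout},
        # The call is slow enough that the timeout should fire.
        {"dependency": dependency, "type": "latency_causing_failure",
         "added_latency_ms": above_timeout},
    ]
```

Calling `generate_experiments("bookmark", 900)` yields experiments with 850 ms and 950 ms of added latency alongside the plain failure case.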

Slide 16

Slide 16 text

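EXPERIMENT GENERATION
• Criticality score: a score for how likely the experiment is to reveal a failure
  • Scored heuristically
  • Dependency priority (RPC client=1, Hystrix command=100)
  • Percentage executed from within the service over the past 7 days (<0.1%=0, <1%=10, <10%=100, else=1000)
  • Retry factor (1 + configured number of retries)
  • Number of interactions (how many times the dependency is called)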

Slide 17

Slide 17 text

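EXPERIMENT GENERATION (Cont.)
• Prioritization score: a score for whether each experiment should be run
  • The product of the following:
    • Criticality score: is it likely to fail?
    • Safety score (safe=1, unsafe=-1)
      • Unsafe: a dependency on a prohibited call, stale dependency data, no fallback on failure, etc.
    • Experiment weight (failure=3, latency=2, latency causing failure=1)
  • Experiments with a score above 0 are run in descending order of score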
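A worked sketch of the scoring, using the weights listed on the slides. The prioritization score being a product comes from the slide. Combining the four criticality factors by multiplication is an assumption made here, as is reading the second usage threshold as "<1%". All function and key names are hypothetical.

```python
DEPENDENCY_PRIORITY = {"rpc_client": 1, "hystrix_command": 100}
EXPERIMENT_WEIGHT = {"failure": 3, "latency": 2, "latency_causing_failure": 1}


def usage_weight(percent):
    """Weight for the share of executions over the past 7 days (in percent)."""
    if percent < 0.1:
        return 0
    if percent < 1:      # assumed reading of the slide's second threshold
        return 10
    if percent < 10:
        return 100
    return 1000


def criticality(kind, percent, retries, interactions):
    """ASSUMPTION: the factors are multiplied; the slide only lists them."""
    return (DEPENDENCY_PRIORITY[kind] * usage_weight(percent)
            * (1 + retries) * interactions)


def prioritization(criticality_score, safe, experiment_type):
    """Product of criticality, safety (+1 / -1), and experiment weight."""
    safety = 1 if safe else -1
    return criticality_score * safety * EXPERIMENT_WEIGHT[experiment_type]


def run_order(scored):
    """Keep experiments scoring above 0, highest score first."""
    return sorted((s for s in scored if s[1] > 0),
                  key=lambda s: s[1], reverse=True)
```

Under these assumptions, a Hystrix command used by 5% of executions, with one retry and two interactions, gets a criticality of 100 × 100 × 2 × 2 = 40000. Its failure experiment then scores 120000 if safe and -120000 if unsafe, so unsafe experiments are never run.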

Slide 18

Slide 18 text

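RESULTS
• Example of a misconfigured timeout
  • The timeout was set to 900 ms; thread pool usage grew until allocation failed, and calls went to the fallback (large impact)
  • The timeout was far higher than the 99th-percentile latency
  • Fix: lower the timeout in line with the percentile latency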
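The check behind this finding, a timeout that sits far above the observed tail latency, can be sketched as follows. The nearest-rank percentile and the `factor=3.0` threshold are assumptions chosen for illustration; the slide only says the timeout was far above the 99th percentile.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list of samples (p in 0..100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def timeout_is_suspicious(timeout_ms, latencies_ms, factor=3.0):
    """Flag a timeout that is more than `factor` times the p99 latency.

    factor=3.0 is an arbitrary illustration value.
    """
    return timeout_ms > factor * percentile(latencies_ms, 99)
```

With latencies spread evenly from 1 ms to 100 ms, the p99 is 99 ms, so a 900 ms timeout is flagged while a 200 ms timeout is not.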

Slide 19

Slide 19 text

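CHALLENGES AND LESSONS LEARNED
• The model is too simple
  • FIT injects only a single kind of fault; in practice several can occur together
• Limits of injecting inside the application
  • Rolling out the Java library takes time (months)
  • Use of languages other than Java (Node.js, etc.) is growing
  • Preparing injection support for each language is laborious
  • An Istio / service-mesh style approach?
• Lack of adoption
  • The self-service model did not catch on
  • Accuracy was improved (lower false-positive rate) so that the auto-generated experiments would be used
  • This took considerable effort
• Handling minor devices
  • Problems cannot be tracked by looking only at the overall success rate
• Error counts
  • Error rates are generally low, so a particular user (device) can contribute disproportionately
  • When there are many errors, the results can no longer be analyzed
• Side effect of visualization
  • MONOCLE's information gathering and UI generation alone were enough to surface misconfigurations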

Slide 20

Slide 20 text

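CONCLUSION
• Chaos experiments are run safely and automatically
• Found cases that would have turned into outages if left alone
• Looks applicable to load testing as well
• Some users have started using it for canary deployments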

Slide 21

Slide 21 text

EoP