

How to Fight Production Incidents?
 An Empirical Study on a Large-scale Cloud Service

SoCC '22 (Symposium on Cloud Computing, SIGMOD: Management of Data)
https://dl.acm.org/doi/10.1145/3542929.3563482
https://www.microsoft.com/en-us/research/publication/how-to-fight-production-incidents-an-empirical-study-on-a-large-scale-cloud-service/

cafenero_777

April 20, 2023

Transcript

  1. Research Paper Introduction #45 “How to Fight Production Incidents?
 An Empirical Study on a Large-scale Cloud Service” (No. 115 overall) @cafenero_777 2022/04/20
  2. Agenda
    •Target paper
    •Overview and why I decided to read it
    1. INTRODUCTION
    2. METHODOLOGY
    3. WHAT CAUSES INCIDENTS AND HOW WERE THEY MITIGATED?
    4. WHAT CAUSES DELAY IN RESPONSE?
    5. LESSONS LEARNT FOR RESILIENCY
    6. MULTI-DIMENSIONAL INCIDENT ANALYSIS
    7. RELATED WORK
    8. CONCLUSION
  3. Target paper
    •How to Fight Production Incidents? An Empirical Study on a Large-scale Cloud Service
    • Supriyo Ghosh, et al.
    • Microsoft
    • SoCC '22 (Symposium on Cloud Computing, SIGMOD: Management of Data)
    • https://dl.acm.org/doi/10.1145/3542929.3563482
    • https://www.microsoft.com/en-us/research/publication/how-to-fight-production-incidents-an-empirical-study-on-a-large-scale-cloud-service/
  4. 1. INTRODUCTION 1/2
    •Large-scale services cannot avoid production incidents
    • Understanding root causes, fixing problems, and shortening impact time are already practiced, but...
    •Microsoft Teams: 152 incidents from the most recent year were studied
    • A web-scale distributed service on Azure IaaS
    • Uses network, storage, and authentication spanning regions; reliability matters because it carries messaging, calls, meetings, and real-time communication
    • Detection, root cause, mitigation: about half were detected automatically (simple improvements would raise the detection rate)
    • Root causes: roughly half are not source-code bugs
    • Capacity problems, manual deployment errors, expired certificates
    • Mitigation: ~90% are resolved without a bug fix (rollback, configuration/settings changes, pod restarts, etc.)
    • Delayed mitigation: human involvement is the main cause (deployment approvals not granted, lack of documentation or understanding, manual work)
  5. 2. METHODOLOGY
    •Selected from the incident DB
    • 152 severe Teams incidents that occurred between 2021/05/15 and 2022/05/15
    • Incidents that affected (or potentially affected) multiple tenants or customers, were already resolved/mitigated, and whose resolution is fully understood: about 2% of all incidents
    •Factor categorization
    • Analyzed along six factors, reflecting the actual detection / root-cause-analysis / mitigation process
    •Mind Teams-specific characteristics
    • Microsoft-specific automated tooling mitigates some incidents
    • Incidents with unclear reproducibility were excluded
  6. 3. WHAT CAUSES INCIDENTS AND HOW WERE THEY MITIGATED? RCA classification
    •Seven categories: 40% are code/config bugs, but 60% are caused by deployment or infrastructure!
    • Bugs: bugs from feature-flag dependencies or features unsupported in specific cases 25%, flags and constants (including thresholds) 25%, code dependencies 20%, type/validation/exception 17%, compatibility 15%
    • Infrastructure: CPU capacity 33%, network capacity 40%, scaling problems (using only part of a cluster) 16%, cache deletion during maintenance 8%
    • Deployment: certificate management 55%, bad patches 25%, manual-operation errors 20%
    • Configuration: mistakes 47%, changes (requiring careful analysis) 42%, configuration sync 10%
    • Dependencies: version incompatibility 25%, health of dependent services 20%, external code changes 28%, feature dependencies 28%
    • DB/NW: latency 25%, availability/connectivity 31%, DB file operations 25%, unserviceable requests 19%
    • Authentication: insufficient permissions 42%, certificate rotation 28%, auth errors caused by policy changes 28%
  7. 3. WHAT CAUSES INCIDENTS AND HOW WERE THEY MITIGATED? Mitigations
    •Mitigations: 40% of incidents were code or configuration bugs, yet 80% got a first response without fixing them
    • Rollback: code 35%, configuration 24%, build 41%
    • Infrastructure change: move to another node 16%, to another cluster 9%, to another region 19%, scale up 31%, remove unneeded nodes 10%, restart 15%
    • External fix: rollback 29%, code or settings fix 17%, everything else 54%
    • Settings fix: fix settings 20%, revert settings 25%, 25% (disable new feature, revert feature, feature failover)
    • Code fix: debugging 42%, magic numbers 17%, exception handling 25%, module fixes 17%
    • Ad-hoc response: close to fully manual work, e.g. renewing certificates and keys
    • Transient: automatic recovery (of the network, etc.), updates, temporary application-side load, etc.
    40% are rollbacks, failovers, or restarts (!)
    Breakdown of external-system fixes: only 17% are code or settings fixes
  8. 4. WHAT CAUSES DELAY IN RESPONSE? Comparison by recovery time
    •Time from incident occurrence to resolution = Time To Detect + Time To Mitigate
    When grouped by cause: dependencies and bugs have long TTD and TTM, meaning they are hard to detect by monitoring. Auth and DB/NW have short TTD but longer TTM. Deploy: TTD > TTM.
    When grouped by mitigation method: Ad-hoc/Code fix has long TTM because it is manual; rollback has short TTM. Rollback is more rational than manual fixes!
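The decomposition above (incident lifetime = Time To Detect + Time To Mitigate) is plain timestamp arithmetic; a minimal sketch, with hypothetical timestamps that are not from the paper:

```python
from datetime import datetime

def incident_times(started: datetime, detected: datetime, mitigated: datetime):
    """Split an incident's lifetime into Time To Detect and Time To Mitigate."""
    ttd = detected - started      # Time To Detect
    ttm = mitigated - detected    # Time To Mitigate
    total = mitigated - started   # total lifetime = TTD + TTM
    return ttd, ttm, total

# Hypothetical incident: detected 30 min in, mitigated 90 min after detection.
ttd, ttm, total = incident_times(
    datetime(2022, 5, 15, 10, 0),
    datetime(2022, 5, 15, 10, 30),
    datetime(2022, 5, 15, 12, 0),
)
print(ttd, ttm, total)
```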
  9. 4. WHAT CAUSES DELAY IN RESPONSE? Why detection was delayed
    •55% were detected automatically by watchdogs (latency, CPU, or memory usage exceeding a threshold)
    •45% were reported externally: 30% by customers, 10% by others inside Microsoft, the rest by the team itself
    •Why watchdogs failed to detect:
    • Monitor Bug: threshold too high 25%, severity misjudged 25%, misconfiguration or undetectable 25%
    • Telemetry Coverage: missing cloud-environment-dependent data 31%, missing specific scenarios (HTTP codes, etc.) 31%
    • External Effect: insufficient monitoring of specific external services
    • No Monitor: monitoring gaps
    Incidents that were not monitored have long TTD (hard to notice); 17% were monitoring or data gaps
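The threshold-style watchdog described above can be sketched in a few lines; the metric names and threshold values here are illustrative, not Teams' actual ones:

```python
# Hypothetical threshold table mimicking the watchdog checks described in
# the paper: alert when latency, CPU, or memory usage exceeds its limit.
THRESHOLDS = {"latency_ms": 500, "cpu_pct": 85, "mem_pct": 90}

def watchdog(sample: dict) -> list[str]:
    """Return the names of metrics that breached their threshold."""
    return [m for m, limit in THRESHOLDS.items() if sample.get(m, 0) > limit]

breaches = watchdog({"latency_ms": 720, "cpu_pct": 60, "mem_pct": 95})
print(breaches)
```

Note that this only catches what it measures: the slide's "Telemetry Coverage" and "No Monitor" categories are exactly the cases where no such threshold exists for the failing signal.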
  10. 4. WHAT CAUSES DELAY IN RESPONSE? Why mitigation was delayed
    •Doc-Procedures: insufficient knowledge 20%, documentation quality 50%, missing notes for specific environments 30%
    •Deployment Delay: manual approval 25%, insufficient access permissions 25%, slow rollout 50%
    •Manual Effort/External Dependency: poor response coordination 14%, misdiagnosis 30%
    •Complex Root Cause: rarely occurring issues 18%, missing metrics 27%, debugging work 55%
    Even once the cause is known (after time-consuming debugging, etc.), the first response is delayed (by missing runbooks or waiting for deployments to roll out). This accounts for as much as 30%.
  11. 5. LESSONS LEARNT FOR RESILIENCY Proposed countermeasures via automation
    •Proposed countermeasures via automation
    • Manual/Config Test: perf tests, scenario tests, validation/integration/unit tests
    • Auto Alert/Triage: automated thresholds, automated escalation
    • Auto Deploy: automated failover and releases
    • Unclear/None: already automated, or cannot be handled automatically
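The "Auto Alert/Triage" item above (escalate automatically rather than waiting on humans) could look roughly like this; the severity tiers and responder names are invented for illustration and are not from the paper:

```python
# Hypothetical escalation policy: route an alert to a responder tier
# by severity, with no human in the loop.
ESCALATION = {3: "on-call engineer", 2: "team lead", 1: "incident commander"}

def escalate(severity: int) -> str:
    """Pick a responder; unknown severities go straight to the top tier."""
    return ESCALATION.get(severity, "incident commander")

print(escalate(3))
print(escalate(0))
```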
  12. 5. LESSONS LEARNT FOR RESILIENCY Lessons learned
    •Improve Monitoring/Testing: add telemetry; add compatibility tests
    •Behavioral Change: deploy carefully! Run scenario tests when toggling flags on/off! Verify permissions in advance!
    •Doc/Training: improve documentation, hold post-incident reviews, check and improve API documentation
    •Auto Mitigation: certificate renewal, auditing, failover, node scale-out
    •External Coordination: set up contact channels in advance, design escalation (communication) paths
    20% are indirect measures (documentation, training, various practices)
  13. 6. MULTI-DIMENSIONAL INCIDENT ANALYSIS Monitoring failures and root causes
    •Does a simple analysis show only part of the picture?
    • Example: a parameter compatibility bug; no alert fired, and it was hotfixed instead of rolled back
    • Try to understand incidents via correlation tests
    • No monitoring / low coverage: mostly code bugs
    • Testing "properly" is indispensable
    • Undetectable: failures of dependent external services
    Comparing against the not-failed cases is interesting. It is fair to assume that finding bugs through monitoring is nearly impossible.
  14. 6. MULTI-DIMENSIONAL INCIDENT ANALYSIS Root causes and mitigations
    •Rollback is the common mitigation
    • Even for Config Bugs, Rollback (47%) is faster than Fix (21%)
    • Possibly avoidable with configuration tests
    •Most Deployment Errors are certificate errors (handled by Config Fix)
    • Check expiry dates rigorously
    •DB/NW failures
    • External factors (in a good sense); push for redundancy there
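Since most deployment errors traced back to certificates, the "check expiry dates rigorously" point can be sketched with the standard library (`ssl.cert_time_to_seconds` parses the `notAfter` string that `ssl.getpeercert()` returns); the 30-day rotation threshold is my assumption, not the paper's:

```python
import ssl
from datetime import datetime, timezone

def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days remaining before a certificate's notAfter timestamp (UTC)."""
    expiry = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after),
                                    tz=timezone.utc)
    return (expiry - now).total_seconds() / 86400

def needs_rotation(not_after: str, now: datetime,
                   threshold_days: int = 30) -> bool:
    """Flag a certificate for renewal before it becomes an incident."""
    return days_until_expiry(not_after, now) < threshold_days

now = datetime(2023, 4, 1, tzinfo=timezone.utc)
print(needs_rotation("Apr 20 12:00:00 2023 GMT", now))  # expires in <30 days
```

Running such a check in CI or a cron job is one way to turn the paper's "ad-hoc certificate renewal" mitigation into the automated kind it recommends.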
  15. 6. MULTI-DIMENSIONAL INCIDENT ANALYSIS Mitigations and lessons learned
    •21% improved by documentation and response training
    • Quality (readability, TTM, up-to-date information)
    • Automation: build workflows
    •Countermeasures via runbooks: compensate for insufficient monitoring (coverage)
    •Manual work lengthens TTM
    • Save time with automation (certificate renewal, failover, auto-scaling)
    • Pair monitoring with workflows