Upgrade to Pro — share decks privately, control downloads, hide ads and more …

#49 “Gray Failure: The Achilles’ Heel of Cloud-Scale Systems”

#49 “Gray Failure: The Achilles’ Heel of Cloud-Scale Systems”

cafenero_777

January 25, 2024
Tweet

More Decks by cafenero_777

Other Decks in Technology

Transcript

  1. Research Paper Introduction #49 “Gray Failure: The Achilles’ Heel of

    Cloud-Scale Systems” ௨ࢉ#121 @cafenero_777 2023/09/21 1
  2. Agenda •ର৅࿦จ •֓ཁͱಡ΋͏ͱͨ͠ཧ༝ 1. INTRODUCTION 2. CLOUD ANOMALIES WITH GRAY

    FAILURE 3. MODELING AND DEFINING GRAY FAILURE 4. DISCUSSION 5. RELATED WORK 6. CONCLUSION 2
  3. ର৅࿦จ •Gray Failure: The Achilles’ Heel of Cloud-Scale Systems •

    Peng Huang, et al • Microsoft Research, Microsoft Azure, Johns Hopkins University • HotOS '17 (Hot Topics in Operating Systems) • https://www.sigops.org/s/conferences/hotos/2017/ • https://dl.acm.org/doi/10.1145/3102980.3103005 • https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/paper-1.pdf 3
  4. ֓ཁͱಡ΋͏ͱͨ͠ཧ༝ •֓ཁ • Ϋϥ΢υ؀ڥͰ͸ো֐ (failure)ͷݕग़Λେن໛ʹߦ͍ͬͯΔ • ൃݟ͞Εͳ͍ো֐͸์ஔ͞ΕɺͦΕ͕େن໛ো֐ʹͭͳ͕Δ • Gray failureͷߏ੒ཁૉΛߟ͑ͯɺো֐ݕग़ʹ׆͔͍ͨ͠

    •ಡ΋͏ͱͨ͠ཧ༝ • SlackͷAZରԠͷࡍʹ঺հ͍ͯͨ͠ͷͰɻ • Gray failureΛಡΈ͔ͨͬͨɻ 4 https://www.connectedpapers.com/main/50dc3bb8c5b8f7bed0c6d6231e042f56ad6af1d4/Gray-Failure%3A- The-Achilles'-Heel-of-Cloud%20Scale-Systems/graph
  5. 1. INTRODUCTION •Ϋϥ΢υ • ๛෋ͳ৑௕ίϯϙʔωϯτͰγεςϜܧଓΛҡ࣋ • ਝ଎͔࣮ͭ֬ʹো֐ݕ஌ͤ͞Δ •Gray failure •

    ໨ཱͨͳ͍ɺඍົͳো֐: ੑೳྼԽɺϥϯμϜύέϩεɺIOෆ҆ఆɺ༰ྔѹഭɺக໋తͰ͸ͳ͍ྫ֎ • كʹى͜Δ͔ʁ -> େن໛ͩͱසൟʹى͜Δ •৑௕Խͱো֐ݕ஌ɺ෮چͰରԠͰ͖Δ͔ʁ • ਖ਼ৗ͔ɺࢭ·͔ͬͨ (Fail stop)ɺΛલఏʹͯ͠͠·͏ͱɺѱԽ͢Δ৔߹΋͋Δ • Byzantine-fault-tolerant (BFT)εςʔτϚγϯ͸ʁ -> ෳࡶੑɾΦʔόʔϔουɾ௿଎ಈ࡞ͷऔΓೖΕࠔ೉ੑ •ࣄྫΛަ͑ͯGray Failureͷಛ௃ͱϦεΫΛઆ໌͠ɺdi ff erential observabilityʢࠩ෼؍ଌՄೳੑʣͰͷղܾࡦΛఏࣔ͢Δ 6
  6. 2. CLOUD ANOMALIES WITH GRAY FAILURE 1. Clos NWͰߴՄ༻ੑ޲্: ຊ౰ʁ

    • ϥϯμϜͳαΠϨϯτυϩοϓͰΞϓϦෆ۩߹΍஗Ԇ૿Ճ • Fan-out௨৴͕ଟ͍΄Ͳɺ1୆ͷো֐ͰඞͣӨڹΛड͚ͯ͠·͏ 2. ݕ஌Ͱ͖͍ͯΔ: ຊ౰ʁݕ஌ཻ౓ͷҧ͍ • ϩʔΧϧͷϋʔτϏʔτ͸݈શ͕ͩɺ֎෦Ϧʔν͕அ • controllerଆݕ஌ͱ࣮ঢ়گͱͷࠩ 3. ඞͣམͪΕ͹ྑ͍: ຊ౰ʁ • Data nodeׂ౰όάͰ͋Δnode͕༰ྔѹഭɾrebootΛ܁Γฦ͢ • system͕αʔϏεఀࢭ͠ɺଞͷ݈શͳdata clusterʹϨϓϦέʔγϣϯ։࢝ɺṧഭɺΛ܁Γฦͯ͠Χεέʔυো֐΁ɻɻ 4. ୲౰Λ෼཭ɾ໌֬Խ͢Ε͹ྑ͍: ຊ౰ʁ • Compute, Network/Storageʹ෼཭ • N/S͕ো֐ݕ஌Ͱ͖ͳ͍ͱɺC͕ʢṖͷݪҼͷʣো֐ͱͯ͠දग़ 7 r1 ͕ނো࣌ɺA->B௨৴͸r2-4 ʹᷖճͯ͠ແࣄͷϋζ͕ͩɻɻ P(n,m) = 1 − ( n − 1 n ) m N: # of Cores M: fan-out factor r1 ௨Βͳ͍֬཰
  7. 3. MODELING AND DIFINING GRAY FAILURE •Di ff erential observabilityͰͷఆٛ

    • AppଆͰ΋؍ଌ: query latency, remote I/OͳͲɻ • গͳ͘ͱ΋̍ͭͷApp͕unhealthyͷ৔߹Λgray failureͱఆٛ • 2͸యܕతͳ൒ࢮɻ3͸ϥοΩʔɻ4͸Ұൠతͳfail stop΍ΫϥογϡͳͲ •࣌ؒൃలͰ1->2->4ͷ৔߹΋͋Δ • యܕྫ͸ϝϞϦϦʔΫ •ϞσϧͰઆ໌ • Clow NWͷྫ: routing protocol (Reactor)͸Կ΋͠ͳ͍͕ɺApp (fan-out)͸஗ԆͳͲͷ໰୊Λݕ஌ • Storage clusterͷྫ: Storage Manager (Observer/Reactor)͸ؾ͔ͮͳ͍͕ɺApp(VM)͸I/OΤϥʔͳͲͷ໰୊Λݕ஌ •େن໛͔Ͳ͏͔͸ແؔ܎ɻখن໛Ͱ΋มͳঢ়ଶ͸ى͖͏Δ 8 DCN, Storage, PF (IaaS, PaaS, etc) Web server, user/operater S, Aͷঢ়ଶͷ৔߹෼͚
  8. 4. DISCUSSION •γεςϜͱΞϓϦͷ؍ଌ݁ՌͷΪϟοϓΛຒΊΔ • γεςϜଆͷOK/NG͚ͩʹཔΒͳ͍ɻAppଆͷ֤छΧ΢ϯλͷ׆༻͢ΔͳͲ •ΞϓϦ؍ଌΛۙࣅ͢Δ • ΞϓϦଆͷΧ΢ϯλΛશ෦࢖͏ͷ͸ݱ࣮తʹ͸ແཧ • pingmeshͷΑ͏ʹɺ࣮௨৴ΛΤϛϡϨʔτͯ͠౸ୡੑ΍஗ԆΛଌఆɻෛՙʹ஫ҙɻ

    •ن໛Λར༻͠ͳ͕Β޻෉͢Δ • Gray failure͕ຊ࣭తʹ෼ࢄݕ஌Ͱ͔͠ݕग़Ͱ͖ͳ͍ʢҰ෦͚ͩݟͯ΋ݕ஌Ͱ͖ͳ͍ʣ • Core system֎ʹ͋Δಠཱͨ͠plane, ͔ͭ, observer/reacterʹܨ͕͍ͬͯΔॴ͕΍Δͷ͕ྑͦ͞͏ • Appଆͷ୯ಠͰ͸ݟ͑ͮΒ͍gray failure΋ͷ͸ଟ͘ͷσʔλϙΠϯτΛ࢖ͬͯ౷ܭతʹਪ࿦͢Δ cf: pingmesh • cluster΍network, ToRʹϚοϐϯάͯ͠෼ੳ •࣌ؒతͳύλʔϯͷ׆༻ • Gray failure͕ݦࡏԽ͢Δ·ͰʹରԠ͢Δ • ޡݕ஌ʹ஫ҙ͢Δ 9
  9. 5. RELATED WORK •Pubic cloudͷো֐ใࠂͷ෼ੳ • Gray Failureͷఆ͕ٛͳ͍ •طଘͷख๏: Primary/Backup

    replication, RAID, Paxos, etc. • Fail stopΛલఏͱ͍ͯ͠Δɻ • Gray failureʹରͯ͠͸ෆे෼ͩͬͨΓඇޮ཰ͩͬͨΓɻ •Pigion: ෆ࣮֬Ͱ΋ো֐৘ใΛAppଆʹ௨஌͠ɺAppଆͰো֐ରԠ͢Δɻfail stopલఏɻ •Tree Augmented Naive Bayesian networks: low-level performance ͱ high-level SLOΛ૬ؔ෼ੳ • performanceىҼҎ֎͸૝ఆ֎ 10