Slide 1

Slide 1 text

Research Paper Introduction #49 “Gray Failure: The Achilles’ Heel of Cloud-Scale Systems” (#121 overall) @cafenero_777 2023/09/21

Slide 2

Slide 2 text

Agenda
• Target paper
• Overview and why I chose to read it
1. INTRODUCTION
2. CLOUD ANOMALIES WITH GRAY FAILURE
3. MODELING AND DEFINING GRAY FAILURE
4. DISCUSSION
5. RELATED WORK
6. CONCLUSION

Slide 3

Slide 3 text

Target paper
• Gray Failure: The Achilles’ Heel of Cloud-Scale Systems
  • Peng Huang, et al.
  • Microsoft Research, Microsoft Azure, Johns Hopkins University
  • HotOS '17 (Hot Topics in Operating Systems)
  • https://www.sigops.org/s/conferences/hotos/2017/
  • https://dl.acm.org/doi/10.1145/3102980.3103005
  • https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/paper-1.pdf

Slide 4

Slide 4 text

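Overview and why I chose to read it
• Overview
  • Cloud environments run failure detection at large scale
  • Failures that go undetected are left unattended, and that leads to large-scale outages
  • Goal: work out what gray failure is made of and apply that to failure detection
• Why I chose to read it
  • It was introduced in connection with Slack's AZ handling
  • I wanted to read about gray failure
https://www.connectedpapers.com/main/50dc3bb8c5b8f7bed0c6d6231e042f56ad6af1d4/Gray-Failure%3A-The-Achilles'-Heel-of-Cloud%20Scale-Systems/graph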

Slide 5

Slide 5 text

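Gray Failure?
• In Japanese terms: "half-dead"
• Not normal (white), but not the anticipated abnormal (black) behavior either
• Looks normal (white) from the infrastructure side, but abnormal (black) from the application side
• A common symptom: failover does not trigger
  • The impact tends to last longer
  • Handled by, for example, a human deciding to force a failover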

Slide 6

Slide 6 text

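1. INTRODUCTION
• Cloud
  • Keeps systems running by means of abundant redundant components
  • Depends on detecting failures quickly and reliably
• Gray failure
  • Subtle, inconspicuous failures: performance degradation, random packet loss, unstable I/O, capacity pressure, non-fatal exceptions
  • Do they occur only rarely? -> At large scale they occur frequently
• Can redundancy, failure detection, and recovery cope with them?
  • Assuming a component is either healthy or stopped (fail-stop) can make things worse
  • What about Byzantine-fault-tolerant (BFT) state machines? -> Hard to adopt because of complexity, overhead, and slow operation
• The paper uses case studies to explain the characteristics and risks of gray failure, and presents a solution direction based on differential observability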

Slide 7

Slide 7 text

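2. CLOUD ANOMALIES WITH GRAY FAILURE
1. A Clos network improves availability: really?
  • Random silent drops cause application glitches and increased latency
  • The more fan-out communication there is, the more surely a single faulty device affects it
2. Failures are being detected: really? Detection granularity differs
  • The local heartbeat is healthy, but external reachability is lost
  • Gap between what the controller detects and the actual situation
3. It is fine as long as things fail outright: really?
  • Because of a data node allocation bug, a node repeatedly runs short of capacity and reboots
  • The system stops service on it and starts replicating to other healthy data nodes, which then come under pressure; this repeats and becomes a cascading failure
4. Separating and clarifying responsibilities is enough: really?
  • Split into Compute (C) and Network/Storage (N/S)
  • If N/S cannot detect the failure, it surfaces on the C side as a failure of unknown cause
Figure notes: when r1 fails, A->B traffic should detour via r2-r4 and be unaffected, but...
$P(n, m) = 1 - \left(\frac{n-1}{n}\right)^{m}$
n: number of core switches; m: fan-out factor; $\left(\frac{n-1}{n}\right)^{m}$: probability that no request traverses r1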
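
To get a feel for the fan-out formula on this slide, here is a minimal Python sketch. The function name and the sample values of n and m are illustrative assumptions, not figures from the paper; n = 4 simply matches the r1 to r4 routers in the slide's figure. It assumes each request independently picks one of the n core switches uniformly at random.

```python
def p_hits_faulty_core(n: int, m: int) -> float:
    """Probability that at least one of m fan-out requests traverses the
    single faulty core switch (r1), assuming each request independently
    picks one of n core switches uniformly at random."""
    if n < 1 or m < 0:
        raise ValueError("n must be >= 1 and m must be >= 0")
    # ((n - 1) / n) ** m is the probability that no request goes through r1.
    return 1.0 - ((n - 1) / n) ** m


# Hypothetical fan-out factors, chosen only to show the trend.
for m in (1, 10, 100):
    print(f"n=4, m={m}: {p_hits_faulty_core(4, m):.3f}")
```

With n = 4, a single request hits r1 with probability 0.25, while a request that fans out to many servers is almost certain to touch it. That is why redundancy alone does not hide a silently dropping switch.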

Slide 8

Slide 8 text

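3. MODELING AND DEFINING GRAY FAILURE
• Definition via differential observability
  • Observe from the app side as well: query latency, remote I/O, etc.
  • Gray failure is defined as the case where at least one app observes the system as unhealthy while the system-side observer still sees it as healthy
  • Case 2 is the typical half-dead state. Case 3 is the lucky one. Case 4 is an ordinary fail-stop, crash, etc.
• A system can also evolve over time from 1 -> 2 -> 4
  • The typical example is a memory leak
• Explained with the model
  • Clos network example: the routing protocol (Reactor) does nothing, but the app (fan-out) detects problems such as latency
  • Storage cluster example: the Storage Manager (Observer/Reactor) does not notice, but the app (VM) detects problems such as I/O errors
• Scale is irrelevant: odd states can occur in small systems too
Figure labels: DCN, Storage, PF (IaaS, PaaS, etc.); web server, user/operator; case breakdown by the states of S and A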
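
The case breakdown on this slide can be sketched as a small classifier. This is only an illustration: the enum names, the function name, and the reading of the figure (observer view on one axis, app views on the other) are assumptions, not definitions taken from the paper.

```python
from enum import Enum


class Case(Enum):
    HEALTHY = "observer and all apps see a healthy system"
    GRAY_FAILURE = "at least one app sees a problem, the observer does not"
    MASKED_FAULT = "the observer sees a problem, no app does"
    FAIL_STOP = "the observer and at least one app both see a problem"


def classify(observer_sees_failure: bool, app_sees_failure: list[bool]) -> Case:
    """Classify one snapshot from the observer's view and each app's view."""
    any_app_unhealthy = any(app_sees_failure)
    if not observer_sees_failure:
        # The differential-observability case: apps and observer disagree.
        return Case.GRAY_FAILURE if any_app_unhealthy else Case.HEALTHY
    return Case.FAIL_STOP if any_app_unhealthy else Case.MASKED_FAULT


# Hypothetical snapshot: the observer reports healthy, one of three apps does not.
print(classify(False, [False, True, False]).name)
```

Under this reading, `GRAY_FAILURE` corresponds to the half-dead state, `MASKED_FAULT` to the lucky case, and `FAIL_STOP` to an ordinary crash.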

Slide 9

Slide 9 text

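4. DISCUSSION
• Close the gap between what the system observes and what the apps observe
  • Do not rely only on the system side's OK/NG verdict; make use of app-side counters, for example
• Approximate the application's view
  • Using every app-side counter is not realistic
  • As pingmesh does, emulate real traffic to measure reachability and latency. Watch the added load.
• Take advantage of scale
  • Gray failure can inherently be detected only by distributed detection (looking at one part alone will not reveal it)
  • A good place to do this seems to be an independent plane outside the core system that is also connected to the observer/reactor
  • Gray failures that are hard to see from a single app are inferred statistically from many data points, cf. pingmesh
  • Map the data to clusters, networks, and ToRs for analysis
• Use temporal patterns
  • Respond before the gray failure becomes apparent
  • Watch out for false positives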
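
The idea of inferring gray failure statistically from many data points, mapped to ToRs, could look roughly like the sketch below. It is not pingmesh's actual algorithm: the function name, the `(tor_id, success)` record shape, and the thresholds `min_samples`, `factor`, and `floor` are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def suspect_tors(probes, min_samples=20, factor=3.0, floor=0.01):
    """Return ToR ids whose probe failure rate stands out from the fleet.

    probes: iterable of (tor_id, success) pairs from emulated traffic.
    A ToR is flagged when its failure rate exceeds both `floor` and
    `factor` times the median failure rate across ToRs.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for tor, ok in probes:
        totals[tor] += 1
        if not ok:
            failures[tor] += 1
    # Ignore ToRs with too few samples to limit false positives.
    rates = {t: failures[t] / totals[t]
             for t in totals if totals[t] >= min_samples}
    if not rates:
        return []
    threshold = max(factor * median(rates.values()), floor)
    return sorted(t for t, r in rates.items() if r > threshold)


# Hypothetical probe results: only tor-b silently drops some probes.
probes = ([("tor-a", True)] * 100
          + [("tor-b", True)] * 95 + [("tor-b", False)] * 5
          + [("tor-c", True)] * 100)
print(suspect_tors(probes))  # ['tor-b']
```

No single probe result is alarming here. Only the aggregate per ToR makes the outlier visible, which is the point of the "take advantage of scale" bullet. The `min_samples` and `floor` guards are one simple way to address the slide's warning about false positives.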

Slide 10

Slide 10 text

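5. RELATED WORK
• Analyses of public cloud outage reports
  • They have no definition of gray failure
• Existing techniques: Primary/Backup replication, RAID, Paxos, etc.
  • They assume fail-stop
  • For gray failure they are insufficient or inefficient
• Pigeon: notifies the app side of failure information even when it is uncertain, and the app side handles the failure. Assumes fail-stop.
• Tree Augmented Naive Bayesian networks: correlate low-level performance with high-level SLOs
  • Causes other than performance are out of scope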

Slide 11

Slide 11 text

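6. CONCLUSION
• Gray failure tends to be overlooked, but dealing with it is essential for scaling a system
• The paper defines gray failure and discusses it through collected case studies
• Reducing differential observability is extremely important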

Slide 12

Slide 12 text

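Supplementary slide
• Where is the SPOF??
  • With an incomplete failure, the anomaly goes undetected. Failover fails (sometimes repeatedly)
• What is the root cause?
Figure notes: halting releases would remove 40% of the causes... Service dependencies also exist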

Slide 13

Slide 13 text

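Bonus: AIOps
https://www.microsoft.com/en-us/research/project/aiops/
• Detection and prediction are (apparently) already possible
• Auto healing is (apparently) not there yet
• Discussed in a CODT session
• Is clean data aggregation the first step?
  • So that analysis is easy. cf: Roomba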

Slide 14

Slide 14 text

Bonus 2: Skimming a cloud outage study
• Why Does the Cloud Stop Computing? Lessons from Hundreds of Service Outages

Slide 15

Slide 15 text

EoP