#14 “The Network is Reliable”
cafenero_777
June 14, 2023
ACM queue: July 23, 2014 Volume 12, issue 7
https://queue.acm.org/detail.cfm?id=2655736
Transcript
Research Paper Introduction #14 “The Network is Reliable ~An informal survey of real-world communications failures~” (#52 overall) @cafenero_777 2020/09/24
Agenda
• Target paper
• Overview and why I read it
• Introduction
• Large-scale infrastructure
• DC NW
• Cloud NW
• Hosting providers
• Wide-area NW
• Global Routing Error
• NIC/Driver
• Application
• Summary
$ which
• The Network is Reliable: An informal survey of real-world communications failures
• Peter Bailis, UC Berkeley, Kyle Kingsbury, Jepsen Networks
• ACM queue: July 23, 2014 Volume 12, issue 7
• https://queue.acm.org/detail.cfm?id=2655736
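Overview and why I read it
• Overview
  • Network failures (partitions) occur at all kinds of operators
  • The paper broadly surveys cases of distributed-system failures caused by them
  • Think hard at design time! (And look back after designing!)
• Why I read it
  • It discusses failures that actually happened, broadly and concretely
  • It was cited in some other paper (I forget which)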
The safety myth of distributed computing? http://nighthacks.com/jag/res/Fallacies.html http://www.rgoarchitects.com/Files/fallacies.pdf
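The safety myth of distributed computing?
• “Most people, when they first build a distributed system, make the following eight assumptions. In operation these inevitably break down, cause big trouble, and lead to painful failures.”
• The problem is not only network failures but also people's assumptions
• Need to look broadly, and sometimes deeply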
http://nighthacks.com/jag/res/Fallacies.html http://www.rgoarchitects.com/Files/fallacies.pdf Where are you?
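Opinions vary, but...
• Current state
  • Tightly coupled systems; knowledge is not shared across organizations
  • Handled within the application layer
  • Guesswork and rumor
• Collect real-world failure cases
  • Learn from them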
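Large-scale distributed infrastructure cases (1/2)
• MS Data Center by MS research
  • 5.2 device failures/day, 40.8 link failures/day
  • Median time to repair 5 minutes (maximum 1 week); median packet loss 59,000 packets
  • Network redundancy improves things by 43% -> not enough to eliminate the common causes of network failure
• HP enterprise managed networks: support-ticket analysis by HP Labs
  • Connectivity-related tickets were 11.4% of the total, 14% of which were highest priority
  • Median handling time 2.75 hours for highest-priority tickets; overall median 18
• Google Chubby (distributed lock system for small-volume distributed storage)
  • 61 outages over 700 days were studied; 9 lasted longer than 30 seconds: 4 were network-related and 2 were "apparently caused by network connectivity"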
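Large-scale distributed infrastructure cases (2/2)
• Google distributed systems, lessons & advice: Jeff Dean's survey of a new cluster (first year)
  • 5 racks go unstable (50% packet loss)
  • 8 network maintenance events (4 of which may cause about 30 minutes of random packet loss)
  • 3 router failures (lasting 1 hour)
  • Advice
    • Absorb conflicting updates to multiple versions in an abstraction layer that can resolve them
    • Reconcile replicas after a network failure (partition) is repaired
• Amazon Dynamo (KVS)
  • Notes that "traditional replicated RDBs cannot cope with network partitions"; chose availability even at the cost of consistency
• Yahoo! PNUTS/Sherpa (geographically distributed DB)
  • Supported timeline consistency, a strong consistency model (every record is updated in the same order on every replica)
  • -> Too painful under network failures (partitions) and server failures, so it moved to weaker consistency
https://research.cs.cornell.edu/ladis2009/talks/dean-keynote-ladis2009.pdf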
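DC NW failures
• Power failure
  • One ToR switch goes down, then for some reason the other ToR goes down too; racks cannot talk to each other and on-demand services stop
  • -> Redundancy does not necessarily prevent link failures (as the MS SIGCOMM paper suggests)
• BPDU flood
  • STP flapped during maintenance and a BPDU flood occurred (which, per the BPDU standard, should not happen). Service was down for 2 hours
• Bridge loop/Misconfiguration/Broken MAC cache (github)
  • Introduced aggregation switches (instead of daisy-chained switches) -> loops formed -> links were disabled -> for some reason bandwidth use stayed pegged at 100%
  • One link was taken down while a switch was misconfigured -> the fault-detection mechanism disabled all links -> 18 minutes of downtime
  • A firmware bug kept switches from updating their MAC address caches correctly, so they broadcast packets
• MLAG/STP/STONITH (github)
  • The vendor stopped a particular agent on an aggregation switch -> the link could not be shut down -> LAG/STP/L2 protocols were not processed correctly -> STP recomputation blocked traffic for 90 seconds
  • The file servers (Pacemaker/DRBD) each decided the other was down -> STONITH (forcibly reboot the peer) -> both nodes were down after the network recovered -> file access unavailable
  • Manual recovery took 5 hours (sync to the primary node; if both nodes are primary, investigate the logs)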
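Cloud network cases (1/2)
• Isolated MongoDB primary on EC2
  • Network failure in the EC2 West region -> 1 primary / 2 secondaries were split -> after recovery the old primary "overwrote" the data -> 2 hours of writes lost
  • The failure itself was an ordinary one
• Mnesia split-brain on EC2
  • Split brain occurred overnight; the operations team resolved it by restarting one side
• MongoDB/ElasticSearch on EC2
  • Network failure -> affects specific nodes -> can affect every service
  • Backend outages of a few seconds, a few times a month -> service outages of up to 45 minutes and corrupted ES indexes; outages escalated to 2-4 times a day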
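Cloud network cases (2/2)
• AWS EBS outage (2011/04/21)
  • EBS traffic was shifted away within a US East AZ -> routing policy mistake -> primary and secondary networks cut off at the same time -> EBS re-mirroring storm -> EC2 down for 12 hours, EBS down for 80 hours
  • RDS outage (2.5% failed because of an AZ failover bug); Heroku down for 16-60 hours
• Isolated Redis Primary on EC2
  • Network failure -> Twilio's billing primary Redis was isolated -> writes went to the primary with no secondary promoted -> resynchronization after recovery overloaded the primary -> the primary was restarted manually -> it started with a wrong config, in read-only mode -> Twilio API calls re-charged customers -> 1.1% of customers were overbilled within 40 minutes -> each SMS/call charged the credit card $500, and charges were refused once they exceeded $3,500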
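Hosting providers
• Cheap and (supposedly) highly reliable, but you have to be your own network/server administrator
• GlusterFS split brain (Freistil IT)
  • Router firmware bug caused 50%-100% packet loss -> GlusterFS went split brain -> even after recovery the inconsistency between the 2 data sets could not be resolved -> traffic surged after the repair and overloaded the web nodes
• An anonymous user of a major hosting provider (100-200 node scale)
  • 5 distinct periods of partitions within 90 days
  • Failures of the network connecting internal and external, and of the network connecting internal and the managed network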
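Wide-area NW
• WAN failures: few alternative routes, need for multi-DC disaster recovery, etc.
• CENIC (Corporation for Education Network Initiatives in California) study
  • Routers across California studied for 5 years (link failures, eBGP/traceroute data)
  • More than 500 network partitions found
  • SW: 6 minutes (median 2.7 minutes, 19.9 minutes @95%ile)
  • HW: 8.2 hours (median 32 minutes, 3.7 days @95%ile)
• PagerDuty on 2 EC2 region/Linode
  • An AWS peering point in northern California degraded -> connectivity of an EC2 node degraded and latency rose -> quorum was lost entirely -> message dispatch stopped
  • The design had accounted for this, but the result was still 18 minutes of unavailability, dropped API requests, and pages delayed until quorum was restored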
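Global Routing Error
• Cloudflare
  • 23 DCs making full use of redundant paths and Anycast
  • As a DDoS countermeasure, a rule dropping packets of a specific size was propagated to all edge routers via FlowSpec -> no packet matched it -> routers kept consuming RAM until they crashed -> some did not reboot automatically and management access was unavailable -> some sites recovered but were overloaded by the concentrated traffic -> they fell over again -> on-site manual reboots (recovery began after 30 minutes; 1 hour of unavailability)
• Level3 2011
  • Backbone outage caused by a Juniper firmware bug
  • Time Warner Cable, RIM BlackBerry, and UK ISPs went offline
• Global BGP outages
  • In 2008 Pakistan Telecom blocked YouTube -> it advertised that (blocked) route to other ISPs -> YouTube became unreachable (BGP hijack)
  • In 2010 Duke University produced a similar effect by testing an experimental BGP flag
• Presenter's note: BGP hijacks (unauthorized BGP advertisements from an illegitimate AS) happen fairly often
  • Example: the 2018 incident caused by Google (through a bug in an automated system?)
https://gigazine.net/news/20180711-shutting-down-bgp-hijack-factory/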
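NIC/Driver
• Broadcom BCM5709 and Friends
  • BCM5709
    • Firmware bug that drops inbound packets but not outbound packets
    • -> The primary cannot process (receive) requests, but the secondary believes the primary is alive -> no failover to the secondary -> 5 hours of downtime -> Sven Ulland reported it on Linux 2.6.32; it was not resolved until 2.6.38
  • BCM5709 (server side) emits PAUSE frames indiscriminately on crash or buffer overflow -> ToR switches (BCM56314/BCM56820) amplify them -> the whole network fails
  • BCM57711 latency degrades under high load with jumbo frames -> surfaced with ESX on iSCSI
• Intel 82574: Packet of Death
  • EEPROM was not flashed correctly -> certain inbound SIP packets disable the NIC (very hard to investigate) -> recovered by cold restart
• Driver-induced GlusterFS partition
  • Network failure on GlusterFS pairs after an upgrade -> recovered by disabling LAG -> recurred 12 hours later -> traced to the driver -> data inconsistency and corruption of VM file-system data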
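The BCM5709 case above is a one-way failure: the primary could still send but could not receive, so a secondary that only listened for heartbeats kept believing the primary was healthy. The following is a minimal sketch of the difference between a listen-only check and an echo-based check. The `Link` class and both function names are hypothetical, invented for illustration; this is not how Pacemaker or any specific cluster manager implements health checks.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """Hypothetical model of the primary's NIC, which may fail in one direction."""
    primary_can_send: bool = True
    primary_can_receive: bool = True


def naive_primary_alive(link: Link) -> bool:
    # Listen-only check: the secondary just waits for heartbeats the primary
    # sends, so it never notices that the primary cannot receive anything.
    return link.primary_can_send


def echo_primary_alive(link: Link) -> bool:
    # Echo check: the secondary sends a probe and requires a reply, so the
    # check passes only if the primary can both receive and send.
    probe_delivered = link.primary_can_receive
    reply_delivered = probe_delivered and link.primary_can_send
    return reply_delivered


# One-way failure as in the BCM5709 case: inbound dropped, outbound fine.
broken = Link(primary_can_send=True, primary_can_receive=False)
healthy = Link()
```

With `broken`, the listen-only check still reports the primary as alive while the echo check does not; with `healthy`, both agree.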
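Application-level failures (1/3)
• Not only physical network causes: program and OS scheduler delays, heavily loaded processes
• High CPU load and service contention
  • ElasticSearch restarted -> cluster split -> split brain -> cluster recovered -> indexes could not be added or deleted -> servers tried to recover unassigned indexes (and could not, so CPU was spinning) -> 20 minutes of unavailability, 6 hours of degraded service
• Long GC pauses and I/O
  • GC occurs in an ES cluster -> a secondary node declares the primary dead -> the primary is not actually dead, so split brain (dual master)
  • GC pauses caused by I/O -> IO_WAIT time increases -> split brain / write loss / index corruption
• MySQL overload & pacemaker segfault (github)
  • MySQL primary under heavy load -> secondary promoted -> the secondary's cold cache was slow -> a failover back to the primary was attempted but was stopped manually
  • The next day it was found that changes on the primary had not been reflected on the secondary -> segfault during manual recovery with the Replication Manager -> automatic and manual replication conflicted, and (because foreign keys were inconsistent) other users' private repos were displayed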
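Application-level failures (2/3)
• DRBD split brain
  • With 2 nodes (when the network is partitioned) neither can be certain that it is the primary -> both nodes accept writes in primary/online state and diverge at the file-system level
• VoltDB on EC2
  • Network failure -> split brain -> dual primary -> replica divergence -> serious data loss
• Mystery RabbitMQ Partition
  • Few retransmits, stable messaging, stable inter-node connections, and it still partitions
  • Setting the partition-detection timeout to 2 minutes reduces the frequency but does not prevent partitions completely. A mystery.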
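Application-level failures (3/3)
• ElasticSearch Discovery Failure on EC2
  • In a 2-node ES cluster, when exchanging discovery messages takes more than 3 seconds, there is a 1-in-10 chance of ending up with dual masters (demotion is manual only)
  • Resolved by raising the timeout to 15 seconds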
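The dual-master and dual-primary cases above (2-node ElasticSearch, DRBD) share one root: with two nodes, neither side of a partition can tell "my peer is dead" from "my peer is unreachable". A common mitigation, sketched here under the assumption of a simple majority rule, is to allow master election only when a node can see a strict majority of the configured cluster. The function names are hypothetical and this is not the ElasticSearch discovery API.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def may_elect_master(reachable_nodes: int, cluster_size: int) -> bool:
    # reachable_nodes counts the node itself plus every peer it can reach.
    # Two disjoint sides of a partition cannot both hold a strict majority,
    # so at most one side can ever elect a master.
    return reachable_nodes >= majority(cluster_size)


# Two-node cluster split 1/1: neither side has a majority, so no dual
# master, but also no master at all until the partition heals.
two_node_sides = [may_elect_master(1, 2), may_elect_master(1, 2)]

# Three-node cluster split 2/1: only the two-node side may elect a master.
three_node_sides = [may_elect_master(2, 3), may_elect_master(1, 3)]
```

The trade-off shows in the two-node case: the majority rule removes dual masters only by giving up availability during the partition, which is why an odd number of voters is the usual choice.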
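Conclusion: where do we go from here?
• A summary meant as a reference
  • Process, server, NIC, switch, local, global
• Network failures arrive "suddenly". ex: during routine updates, during maintenance
• On the other hand, some networks/systems never partition. ex: finance (cautious engineering + advances in network technology + money)
• Google/Amazon (because of enormous scale, the cost of each piece of HW matters) and startups (limited budgets)
• All sorts of failures happen in real distributed systems, including human error
• It is important to reconsider the risk (before a partition happens)
  • While tracing the behavior on a whiteboard
• Handling partitions pays off in many cases
  • The added latency of partition handling versus the benefit (less investigation time afterwards)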
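Addendum: what would you do today (2020)?
• Formal methods
  • Can guarantee that no unknown state exists (within the model's definitions)
• Chaos Engineering
  • Can find and fix implementation mistakes and inconsistencies that appear when systems are integrated
EOP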