#14 “The Network is Reliable”
cafenero_777
June 14, 2023
ACM queue: July 23, 2014 Volume 12, issue 7
https://queue.acm.org/detail.cfm?id=2655736
Transcript
Research Paper Introduction #14 “The Network is Reliable: An informal survey of real-world communications failures” (#52 overall) @cafenero_777 2020/09/24

Agenda
• Target paper
• Overview and why I read it
• Introduction
• Large-scale infrastructure
• DC NW
• Cloud NW
• Hosting providers
• Wide-area NW
• Global Routing Error
• NIC/Driver
• Application
• Summary

$ which
• The Network is Reliable: An informal survey of real-world communications failures
• Peter Bailis, UC Berkeley; Kyle Kingsbury, Jepsen Networks
• ACM queue: July 23, 2014 Volume 12, issue 7
• https://queue.acm.org/detail.cfm?id=2655736
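
Overview and why I read it
• Overview
  • Network failures (partitions) occur at all kinds of operators.
  • The paper broadly surveys distributed-system failures caused by them.
  • Think carefully at design time! (And look back after the design is done, too!)
• Why I read it
  • It discusses failures that actually happened, broadly and concretely.
  • It was cited in some paper (I forget which).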

Distributed computing: a safety myth?
http://nighthacks.com/jag/res/Fallacies.html
http://www.rgoarchitects.com/Files/fallacies.pdf
• “When most people first build a distributed system, they make the following eight assumptions. Every one of them is violated in production, causes big trouble, and leads to painful failures.”
• The problem is not only network failure; human assumptions are a problem too.
• We need to look broadly, and now and then deeply.
• Where are you?
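
There are many reasons, but...
• Current state
  • Tightly coupled; not shared across organizations
  • Dealt with at the application layer
  • Guesswork and rumor
• Collect real-world failure cases
• Learn from them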
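
Large-scale distributed infrastructure cases (1/2)
• Microsoft data centers, studied by Microsoft Research
  • 5.2 device failures/day, 40.8 link failures/day
  • Median time to repair 5 minutes (maximum 1 week); median packet loss 59,000 packets
  • Network redundancy improves things by 43% -> it does not eliminate the common causes of network failure
• HP enterprise managed networks: support-ticket analysis by HP Labs
  • Connectivity-related tickets were 11.4% of the total, and 14% of those were highest priority
  • Median handling time for highest-priority tickets was 2.75 hours; the median across all tickets is given as 18 [unit lost in extraction]
• Google Chubby (distributed lock service, for small-volume distributed storage)
  • 61 outages over 700 days were examined; 9 lasted 30 seconds or more: 4 were caused by the network and 2 were “suspected network connectivity problems”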
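
Large-scale distributed infrastructure cases (2/2)
• Lessons & advice on Google's distributed systems, by the great Jeff Dean: survey of a new cluster (first year)
  • 5 racks unstable (50% packet loss)
  • 8 maintenance events (4 of which could cause about 30 minutes of random packet loss)
  • 3 router failures (lasting 1 hour)
  • Advice
    • Absorb conflicts between multiple versions in an abstraction layer that can be updated
    • Reconcile replicas after recovering from a network failure (partition)
• Amazon Dynamo (KVS)
  • States that “traditional replicated relational databases cannot handle network partitions”; chose availability even at the cost of consistency
• Yahoo! PNUTS/Sherpa (geographically distributed DB)
  • Supported timeline consistency, a strong consistency model (updates to every record are applied in the same order on all replicas)
  • -> This was too painful under network failures (partitions) and server failures, so it was changed to weaker consistency...
https://research.cs.cornell.edu/ladis2009/talks/dean-keynote-ladis2009.pdf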
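
DC NW failures
• Power failure
  • One ToR switch goes down, then for some reason the other ToR goes down too; racks can no longer talk to each other and the On Demand services stop...
  • -> Redundancy does not necessarily prevent link failure (as the Microsoft SIGCOMM paper suggests)
• BPDU flood
  • An STP flap during maintenance (something that should not happen under the BPDU standard) produced a BPDU flood. Service was down for 2 hours.
• Bridge loop/Misconfiguration/Broken MAC cache (github)
  • Aggregation switches were introduced (instead of multi-tier switches) -> a loop formed -> links were disabled -> for some reason bandwidth utilization stayed pinned at 100%...
  • With a misconfiguration in place, one link was taken down -> the failure-detection mechanism cut everything off -> 18 minutes of downtime
  • A firmware bug: the switch could not update its MAC address cache correctly and fell back to broadcasting...
• MLAG/STP/STONITH (github)
  • The vendor stopped a particular agent on an aggregation switch -> the links could not be shut down -> LAG/STP/L2 protocols were not handled correctly -> traffic was blocked for 90 seconds during STP recalculation
  • The file servers (Pacemaker/DRBD) each decided the other was down -> STONITH (force-reboot the peer) -> both sides were down after the network recovered -> no file access
  • Manual recovery (resync to the primary node; where both sides were primary, dig through the logs...) took 5 hours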
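
Cloud network cases (1/2)
• Isolated MongoDB primary on EC2
  • Network failure in the EC2 West region -> the 1 primary and 2 secondaries were separated -> after recovery the old primary “overwrote” the data -> 2 hours of writes lost
  • The failure itself is an ordinary one...
• Amnesia split-brain on EC2
  • Split brain within a single night; the operations team resolved it by restarting one side
• MongoDB/ElasticSearch on EC2
  • Network failure -> specific nodes affected -> potentially every service affected
  • Backend outages of a few seconds, a few times a month -> service outages of up to 45 minutes and ES index corruption; outages escalated to 2-4 times a day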
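
Cloud network cases (2/2)
• AWS EBS outage (2011/04/21)
  • EBS traffic was shifted within a US East AZ -> routing policy mistake -> the primary and secondary networks were cut at the same time -> EBS mirroring storm -> EC2 down for 12 hours, EBS down for 80 hours
  • RDS outage (2.5% failed because of an AZ failover bug); Heroku down for 16-60 hours
• Isolated Redis Primary on EC2
  • Network failure -> Twilio's billing primary Redis was isolated -> no secondary was promoted and writes continued on the primary -> resynchronization after recovery overloaded the primary -> the primary was restarted manually -> it came up with the wrong config, in read-only mode -> every Twilio API call re-charged the customer -> 1.1% of customers were overbilled over 40 minutes -> each SMS/call charged $500 to the credit card, and the card refused charges once they exceeded $3,500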
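
Hosting providers
• Cheap and (supposedly) more reliable, but you have to be your own network/server administrator
• GlusterFS split brain (Freistil IT)
  • A router firmware bug caused 50%-100% packet loss -> GlusterFS went split-brain -> after recovery the inconsistency between the two data sets could not be resolved -> after the repair, traffic surged and overloaded the web nodes
• An anonymous major hosting provider (100-200 node scale)
  • 5 partitions, each lasting hours, occurred in 90 days
  • Failures of the network connecting the inside to the outside, and of the link between the internal and management networks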
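
Wide-area NW
• WAN failures: cases with few routes, the need for multi-DC disaster recovery, and so on
• CENIC (Corporation for Education Network Initiatives in California) study
  • Surveyed routers across California for 5 years (link failures, eBGP/traceroute data)
  • Found more than 500 network partitions
  • Software-related: 6 minutes (median 2.7 minutes, 19.9 minutes at the 95th percentile)
  • Hardware-related: 8.2 hours (median 32 minutes, 3.7 days at the 95th percentile)
• PagerDuty on 2 EC2 regions/Linode
  • AWS peering in Northern California degraded -> connectivity of the EC2 nodes degraded and latency rose -> quorum was lost entirely -> message dispatch stopped
  • The design had accounted for this, yet the result was 18 minutes of unavailability, dropped API requests, and pages delayed until quorum was restored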
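
Global Routing Error
• Cloudflare
  • 23 data centers making full use of redundant paths and anycast
  • As a DDoS countermeasure, a rule to drop packets of a specific size was pushed to all edge routers via FlowSpec -> it matched no packets -> the routers kept consuming RAM until they crashed -> some did not reboot automatically and management access was unavailable -> the few that recovered were overloaded by concentrated traffic -> they fell over again -> manual on-site reboots (recovery began after 30 minutes; 1 hour of unavailability)
• Level3 2011
  • A Juniper firmware bug took down the backbone
  • Time Warner Cable, RIM BlackBerry, and UK ISPs went offline
• Global BGP outages
  • In 2008 Pakistan Telecom blocked YouTube -> the (blocked) route was advertised to other ISPs -> YouTube became unreachable (BGP hijack)
  • In 2010 Duke University tested an experimental BGP flag and produced a similar effect
• Presenter's note: BGP hijacks (unauthorized BGP advertisements from an illegitimate AS) happen fairly often
  • Example: the 2018 incident Google caused (through a bug in an automated system?)
https://gigazine.net/news/20180711-shutting-down-bgp-hijack-factory/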
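
NIC/Driver
• Broadcom BCM5709 and Friends
  • BCM5709
    • Firmware bug: received packets are dropped, but transmitted packets are not
    • -> The primary cannot process (receive) anything, but the secondary believes the primary is alive -> no failover to the secondary -> 5-hour outage -> Sven Ulland reported it on Linux 2.6.32; it was not resolved until 2.6.38
  • BCM5709 (server) sends PAUSE frames indiscriminately on a crash or buffer overflow -> the ToR switches (BCM56314/BCM56820) amplify them -> the whole network fails
  • BCM57711: latency degrades under heavy load with jumbo frames -> surfaced with ESX on iSCSI
• Intel 82574: Packet of Death
  • EEPROM not flashed correctly -> a received SIP packet disables the NIC (very hard to investigate...) -> comes back after a cold restart...
• Driver-induced GlusterFS partition
  • After an upgrade, a network failure between a GlusterFS pair -> recovered by disabling LAG -> recurred 12 hours later -> traced to the driver and fixed -> data inconsistency / VM filesystem data corruption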
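
The BCM5709 receive-drop case above can be illustrated with a small sketch. It is a toy model, not the actual cluster software from the incident: the `Nic` class and both check functions are hypothetical names introduced here. It only shows why a one-way heartbeat keeps reporting a primary as alive when its receive path is dead, while a round-trip probe does not.

```python
from dataclasses import dataclass


@dataclass
class Nic:
    """Toy NIC model (hypothetical): send and receive paths fail independently."""
    can_send: bool = True
    can_receive: bool = True


def heartbeat_seen(primary: Nic, secondary: Nic) -> bool:
    """One-way check: does the primary's heartbeat reach the secondary?"""
    return primary.can_send and secondary.can_receive


def echo_ok(primary: Nic, secondary: Nic) -> bool:
    """Round-trip check: the secondary's probe must reach the primary,
    and the primary's reply must reach the secondary."""
    probe_delivered = secondary.can_send and primary.can_receive
    reply_delivered = primary.can_send and secondary.can_receive
    return probe_delivered and reply_delivered


# A primary that can still transmit but drops everything it receives.
faulty_primary = Nic(can_send=True, can_receive=False)
healthy_secondary = Nic()

print(heartbeat_seen(faulty_primary, healthy_secondary))  # True: looks alive
print(echo_ok(faulty_primary, healthy_secondary))         # False: cannot serve requests
```

A round-trip check costs one extra message per probe, which is usually a small price for catching asymmetric failures of this kind.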
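
Application-level failures (1/3)
• Not only caused by the physical network: program and OS scheduler delays, heavily loaded processes
• High CPU load and service unavailability
  • ElasticSearch restarted -> the cluster split -> split brain -> the cluster recovered -> indexes could not be added or removed -> servers tried to recover indexes that were not assigned to them (they could not, so the CPU spun) -> 20 minutes of unavailability, 6 hours of degraded service
• Long GC pauses and I/O
  • GC runs in an ES cluster -> a secondary node declares the primary dead -> the real primary is not dead, so split brain (dual master)
  • I/O causes GC to stall -> IO_WAIT time grows -> split brain / write loss / index corruption
• MySQL overload & pacemaker segfault (github)
  • The MySQL primary was overloaded -> the secondary was promoted -> the secondary was slow because of its cold cache -> a failover back to the primary was attempted and stopped manually
  • The next day, changes on the primary were found not to be reflected on the secondary -> segfault during manual recovery with the replication manager -> automatic and manual replication conflicted, and (because foreign keys were inconsistent) other users' private repos were displayed...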
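
Application-level failures (2/3)
• DRBD split brain
  • With 2 nodes (when the network partitions), neither node can say for certain that it is the primary -> both nodes accept writes in the primary/online state, and they diverge at the filesystem level
• VoltDB on EC2
  • Network failure -> split brain -> dual primary -> replicas diverge -> serious data loss
• Mystery RabbitMQ Partition
  • Few retransmissions, stable message flow, stable node-to-node connectivity, and yet it partitions...
  • Raising the partition-detection timeout to 2 minutes reduced the frequency, but partitions could not be prevented entirely. A mystery.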
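
The 2-node cases above come down to a counting argument, which can be sketched as follows. This is a minimal illustration of a strict-majority quorum rule, not the logic of DRBD or any other product named on the slide; `has_quorum` and `sides_allowed_to_write` are hypothetical helper names.

```python
def has_quorum(reachable: int, cluster_size: int) -> bool:
    """True if one side of a partition sees a strict majority of the cluster
    (counting itself)."""
    return reachable > cluster_size // 2


def sides_allowed_to_write(partition_sizes: list[int]) -> int:
    """Count how many sides of a partition may keep accepting writes
    under the strict-majority rule."""
    cluster_size = sum(partition_sizes)
    return sum(has_quorum(n, cluster_size) for n in partition_sizes)


# 2 nodes split 1/1: neither side has a majority, so no side may write.
print(sides_allowed_to_write([1, 1]))  # 0
# 3 nodes split 2/1: exactly one side keeps a majority.
print(sides_allowed_to_write([2, 1]))  # 1
```

With two nodes the rule buys safety only by giving up availability during a partition, which is why a third voter (or an external arbiter) is the usual way out of the dual-primary trap.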
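
Application-level failures (3/3)
• ElasticSearch Discovery Failure on EC2
  • In a 2-node ES cluster, if the discovery message exchange takes more than 3 seconds, there is a 1-in-10 chance of ending up dual-master... (demotion is manual only)
  • Resolved by setting the timeout to 15 seconds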
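
The effect of the timeout change can be sketched with a few lines. The latency samples below are invented for illustration (chosen so that one exchange in ten exceeds 3 seconds, matching the ratio on the slide), and `false_suspicions` is a hypothetical helper, not an ElasticSearch API.

```python
def false_suspicions(latencies_s: list[float], timeout_s: float) -> int:
    """Count exchanges that a timeout-based detector would wrongly treat as
    a dead peer, given that every exchange did eventually complete."""
    return sum(1 for latency in latencies_s if latency > timeout_s)


# Hypothetical discovery round-trip times in seconds (assumed data).
samples = [0.2, 0.4, 0.3, 3.5, 0.5, 0.2, 0.6, 0.3, 0.4, 0.2]

print(false_suspicions(samples, timeout_s=3.0))   # 1 of 10 exchanges misjudged
print(false_suspicions(samples, timeout_s=15.0))  # 0
```

The trade-off is that a longer timeout also delays detection of a peer that really has failed, so the value has to be chosen against the observed latency tail rather than the typical case.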
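
Conclusion: where do we go from here?
• A summary to use as a reference
  • Process, server, NIC, switch, local, global
• Network failures arrive “suddenly”, e.g. during routine updates or maintenance
  • On the other hand, some networks/systems do not partition, e.g. finance (careful engineering + advances in network technology + money)
  • Google/Amazon (at enormous scale, the cost of every single piece of hardware matters); startups (limited budgets)
• All kinds of failures occur: the problems of real-world distributed systems, human error included
• It is important to reconsider the risk (before a partition happens)
  • While tracing the behavior on a whiteboard
• Handling partitions pays off in most cases
  • The added latency of partition handling versus the benefit (less investigation time after the fact)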
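
Supplement: what would you do today (2020)?
• Formal methods
  • Can guarantee that unknown states do not exist (within the definition of the model)
• Chaos Engineering
  • Can find and fix implementation mistakes and inconsistencies that appear when systems are integrated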
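
As a rough illustration of the Chaos Engineering bullet, the sketch below injects network faults into a client and checks that its retry logic copes. It is a deliberately tiny, deterministic stand-in for real fault-injection tooling; `FlakyNetwork` and `send_with_retry` are hypothetical names, and real experiments would inject faults into running systems rather than into an in-memory object.

```python
class FlakyNetwork:
    """Hypothetical fault injector: fails the first `fail_first` sends,
    then delivers every later message."""

    def __init__(self, fail_first: int) -> None:
        self.fail_first = fail_first
        self.calls = 0
        self.delivered: list[str] = []

    def send(self, message: str) -> None:
        self.calls += 1
        if self.calls <= self.fail_first:
            raise ConnectionError("injected network fault")
        self.delivered.append(message)


def send_with_retry(network: FlakyNetwork, message: str, max_attempts: int) -> int:
    """Send with retries; return the number of attempts used, or re-raise
    the ConnectionError once the attempts are exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            network.send(message)
            return attempt
        except ConnectionError:
            if attempt == max_attempts:
                raise
    raise ValueError("max_attempts must be at least 1")


network = FlakyNetwork(fail_first=2)
print(send_with_retry(network, "hello", max_attempts=3))  # 3
print(network.delivered)                                   # ['hello']
```

Running the same client against `FlakyNetwork(fail_first=5)` with three attempts raises `ConnectionError`, which is the kind of gap such an experiment is meant to surface before a real partition does.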
EOP