#14 “The Network is Reliable”
cafenero_777
June 14, 2023
ACM queue: July 23, 2014 Volume 12, issue 7
https://queue.acm.org/detail.cfm?id=2655736
Transcript
Research Paper Introduction #14 “The Network is Reliable ~An informal survey of real-world communications failures~” (#52 overall) @cafenero_777 2020/09/24
Agenda
• Target paper
• Overview and why I read it
• Introduction
• Large-scale infrastructure
• DC network
• Cloud network
• Hosting providers
• Wide-area network
• Global Routing Error
• NIC/Driver
• Application
• Summary
$ which
• The Network is Reliable: An informal survey of real-world communications failures
• Peter Bailis, UC Berkeley, Kyle Kingsbury, Jepsen Networks
• ACM queue: July 23, 2014 Volume 12, issue 7
• https://queue.acm.org/detail.cfm?id=2655736
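Overview and why I read it
• Overview
  • Network failures (partitions) occur at a wide range of operators
  • The paper broadly surveys distributed-system failures caused by them
  • Think carefully at design time! (And look back after the design is done!)
• Why I read it
  • It discusses failures that actually happened, broadly and concretely
  • It was cited in some paper (I forget which)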
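The safety myth of distributed computing?
• “When many people first build a distributed system, they make the following eight assumptions. Every one of them breaks in production, causes big trouble, and leads to painful failures.”
• The problem is not network failure itself but people's assumptions
• We need to look broadly, and sometimes deeply
http://nighthacks.com/jag/res/Fallacies.html http://www.rgoarchitects.com/Files/fallacies.pdf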
Where are you?
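There are plenty of theories, but...
• Current state
  • Tight coupling; failure information is not shared across organizations
  • Problems are dealt with inside the application layer
  • Guesswork and rumor
• Compile real-world failure cases
  • Learn from them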
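Large-scale distributed infrastructure cases (1/2)
• MS Data Center, by MS Research
  • 5.2 device failures/day, 40.8 link failures/day
  • Median time to repair 5 minutes (maximum 1 week); median packet loss 59,000 packets
  • Network redundancy improves things by only 43% -> it does not eliminate the common causes of network failure
• HP enterprise managed networks: support-ticket analysis by HP Labs
  • Connectivity-related tickets were 11.4% of the total; 14% of those were highest priority
  • Median handling time 2.75 hours for highest-priority tickets, 4 hours 18 minutes for all tickets
• Google Chubby (distributed lock system for small-volume distributed storage)
  • 61 outages over 700 days were examined; 9 lasted more than 30 seconds: 4 were caused by the network, 2 were "suspected network connectivity problems"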
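Large-scale distributed infrastructure cases (2/2)
• Lessons & advice on Google's distributed systems, by Jeff Dean: a new cluster in its first year
  • 5 racks become unstable (50% packet loss)
  • 8 maintenance events (4 of which may cause about 30 minutes of random packet loss)
  • 3 router failures (lasting 1 hour)
  • Advice
    • Absorb conflicts between multiple versions in an abstraction layer that can reconcile updates
    • Reconcile replicas after a network failure (partition) heals
• Amazon Dynamo (KVS)
  • States that "traditional replicated relational databases cannot handle network partitions"; chose availability even at the cost of consistency
• Yahoo! PNUTS/Sherpa (geo-distributed DB)
  • Supported timeline consistency, a strong form of consistency (every replica applies the updates to a record in the same order)
  • -> Too painful under network failures (partitions) and server failures, so it moved to weaker consistency
https://research.cs.cornell.edu/ladis2009/talks/dean-keynote-ladis2009.pdf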
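The "same order on every replica" property of timeline consistency can be sketched as follows. This is a minimal illustration, not PNUTS's actual implementation: it assumes a single master per record hands out consecutive version numbers, and the `Replica` class and its methods are hypothetical names.

```python
class Replica:
    """Applies updates to each record strictly in version order."""

    def __init__(self):
        self.values = {}    # key -> latest applied value
        self.versions = {}  # key -> latest applied version (0 if none)
        self.pending = {}   # key -> {version: value} received but not yet applicable
        self.history = []   # (key, version, value) in the order actually applied

    def receive(self, key, version, value):
        applied = self.versions.get(key, 0)
        if version <= applied:
            return  # duplicate or stale delivery: ignore it
        self.pending.setdefault(key, {})[version] = value
        # Apply every update that is now contiguous with what was already applied.
        next_version = applied + 1
        while next_version in self.pending[key]:
            new_value = self.pending[key].pop(next_version)
            self.values[key] = new_value
            self.versions[key] = next_version
            self.history.append((key, next_version, new_value))
            next_version += 1


# Hypothetical updates, assumed to be numbered by the record's master.
updates = [("user:1", 1, "a"), ("user:1", 2, "b"), ("user:1", 3, "c")]

in_order = Replica()
for update in updates:
    in_order.receive(*update)

reordered = Replica()
for update in (updates[2], updates[0], updates[1]):  # network delivers out of order
    reordered.receive(*update)
```

Both replicas end with the same history even though delivery order differed. The cost is visible in `pending`: a replica cut off from the master cannot advance, which is the availability price the slide refers to.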
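DC network failures
• Power failure
  • One ToR switch goes down, then for some reason the other ToR goes down too; racks cannot talk to each other and the On Demand service stops
  • -> Redundancy does not necessarily prevent link failure (as the MS SIGCOMM paper suggests)
• BPDU flood
  • STP flapped during maintenance and a BPDU flood occurred (by the BPDU standard this should not happen). 2 hours of service outage
• Bridge loop / Misconfiguration / Broken MAC cache (github)
  • Aggregation switches introduced (replacing daisy-chained switches) -> loops form -> links disabled -> for some reason utilization stays pegged at 100%
  • With a misconfigured switch, one link is taken down -> the fault detector cuts all links -> 18 minutes of downtime
  • Firmware bug: the switches could not update their MAC address caches correctly, so they broadcast packets
• MLAG/STP/STONITH (github)
  • The vendor stops a particular agent on the aggregation switch -> the link cannot be shut down -> LAG/STP/L2 protocols are not handled correctly -> 90-second block during STP recalculation
  • The file servers (Pacemaker/DRBD) each decide the other is down -> STONITH (force-reboot the peer) -> both sides down once the network recovers -> files inaccessible
  • Manual recovery (align to the primary node; if both sides are primary, dig through the logs) took 5 hours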
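Cloud network cases (1/2)
• Isolated MongoDB primary on EC2
  • Network failure in the EC2 West region -> 1 primary / 2 secondaries are separated -> after recovery the old primary "overwrites" the new one -> 2 hours of writes lost
  • The failure itself is an ordinary one
• Amnesia split-brain on EC2
  • Split brain overnight; the operations team resolved it by restarting one side
• MongoDB/ElasticSearch on EC2
  • Network failure -> only certain nodes affected -> can still affect the entire service
  • Backend outages lasting a few seconds, several times a month -> service outages of up to 45 minutes and a corrupted ES index; outages escalated to 2-4 times a day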
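Cloud network cases (2/2)
• AWS EBS outage (2011/04/21)
  • EBS traffic shifted within a US East AZ -> routing policy mistake -> primary and secondary networks cut at the same time -> EBS mirroring storm -> EC2 down 12 hours, EBS down 80 hours
  • RDS outage (2.5% of failovers failed because of an AZ failover bug); Heroku down 16-60 hours
• Isolated Redis Primary on EC2
  • Network failure -> Twilio's billing primary Redis is isolated -> no secondary promoted, writes continue on the primary -> resynchronization after recovery overloads the primary -> primary restarted by hand -> it starts with the wrong config, in read-only mode -> Twilio API calls re-charge customers -> 1.1% of customers overbilled over 40 minutes -> each SMS/call charged $500 to the credit card; the card stopped accepting charges past $3,500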
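Hosting providers
• Cheap and (supposedly) reliable, but you have to be your own network/server administrator
• GlusterFS split brain (Freistil IT)
  • Router firmware bug causes 50%-100% packet loss -> GlusterFS goes split brain -> after recovery the inconsistency between the two data sets cannot be resolved -> traffic spikes after the repair and the web nodes are overloaded
• An anonymous major hosting provider (100-200 node scale)
  • 5 partitions occurred within 90 days
  • Failures of the network linking internal and external, and of the network linking the internal and management networks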
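Wide-area network
• WAN failures matter when there are few routes, when DR across multiple DCs is required, and so on
• CENIC (Corporation for Education Network Initiatives in California) study
  • Routers across all of California studied for 5 years (link failures, eBGP/traceroute data)
  • More than 500 network partitions found
  • Software-related: 6 minutes on average (median 2.7 minutes, 95th percentile 19.9 minutes)
  • Hardware-related: 8.2 hours on average (median 32 minutes, 95th percentile 3.7 days)
• PagerDuty on 2 EC2 regions / Linode
  • An AWS peering point in northern California degrades -> EC2 node connectivity degrades and latency rises -> quorum is lost entirely -> message dispatch stops
  • This had been considered in the design, yet the result was 18 minutes of unavailability, dropped API requests, and pages delayed until quorum was restored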
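Global Routing Error
• Cloudflare
  • 23 DCs making full use of redundant paths and Anycast
  • As a DDoS countermeasure, a rule to drop packets of a specific size is pushed to all edge routers via FlowSpec -> no packet should have matched it -> routers keep consuming RAM until they crash -> they do not reboot automatically and mgmt access is unavailable -> some sites recover but traffic concentrates on them and overloads them -> they fall over again -> manual on-site reboots (recovery began after 30 minutes; 1 hour of unavailability)
• Level3 2011
  • Backbone outage caused by a Juniper firmware bug
  • Time Warner Cable, RIM BlackBerry, and UK ISPs went offline
• Global BGP outages
  • In 2008 Pakistan Telecom blocked youtube -> advertised that (blocked) route to other ISPs -> site unreachable (BGP hijack)
  • In 2010 Duke University produced a similar effect by testing an experimental BGP flag
  • Presenter's note: BGP hijacks (BGP announcements made without authority from an illegitimate AS) happen quite often
    • Example: a 2018 incident involving Google (possibly caused by a bug in an automated system?)
https://gigazine.net/news/20180711-shutting-down-bgp-hijack-factory/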
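One common way a bogus BGP announcement takes effect is by being more specific than the legitimate one, because routers forward on the longest matching prefix. The sketch below assumes that mechanism; the slide itself only says that the blocked route was advertised to other ISPs. The prefixes and AS labels are made up for illustration, and `best_route` is a hypothetical helper, not a router API.

```python
import ipaddress


def best_route(routes, destination):
    """Return the (prefix, origin) pair with the longest matching prefix, or None."""
    address = ipaddress.ip_address(destination)
    matches = [(prefix, origin) for prefix, origin in routes if address in prefix]
    if not matches:
        return None
    # Longest-prefix match: the most specific covering prefix wins.
    return max(matches, key=lambda match: match[0].prefixlen)


# Hypothetical routing table: a legitimate /22 and a more specific /24 inside it.
routes = [
    (ipaddress.ip_network("10.10.0.0/22"), "AS-legitimate"),
    (ipaddress.ip_network("10.10.1.0/24"), "AS-hijacker"),
]

hijacked = best_route(routes, "10.10.1.5")    # covered by both prefixes
unaffected = best_route(routes, "10.10.2.5")  # covered only by the /22
```

Here `hijacked` resolves to the `AS-hijacker` entry while `unaffected` still resolves to `AS-legitimate`, which is why a single stray announcement can pull traffic for part of an address range without touching the rest.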
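NIC/Driver
• Broadcom BCM5709 and Friends
  • BCM5709
    • Firmware bug: drops inbound packets but not outbound ones
    • -> The primary cannot process (receive) anything, but the secondary believes the primary is alive -> no failover to the secondary -> 5 hours of downtime -> Sven Ulland reported it on Linux 2.6.32; it was not resolved until 2.6.38
    • The BCM5709 (server side) emits spurious PAUSE frames when it crashes or its buffer overflows -> the ToR (BCM56314/BCM56820) amplifies them -> the whole network fails
  • BCM57711: latency degrades under high load with jumbo frames -> shows up with ESX on iSCSI
• Intel 82574: Packet of Death
  • EEPROM not flashed correctly -> an inbound SIP packet disables the NIC (very hard to diagnose) -> only a cold restart brings it back
• Driver-induced GlusterFS partition
  • Network failures between GlusterFS pairs after an upgrade -> recovered by disabling LAG -> recurred 12 hours later -> traced to a driver issue -> data inconsistency and corrupted VM file system data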
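Application-level failures (1/3)
• Not only the physical network: program and OS scheduler delays, heavily loaded processes
• High CPU load and service contention
  • ElasticSearch restarted -> cluster splits -> split brain -> cluster recovers -> indices cannot be added or deleted -> servers keep trying to recover unassigned indices (they cannot, so CPU spins) -> 20 minutes of unavailability, 6 hours of degraded service
• Long GC pauses and I/O
  • GC runs in an ES cluster -> a secondary node declares the primary dead -> the primary is not actually dead, so split brain (dual master)
  • GC stalls caused by I/O -> IO_WAIT time grows -> split brain / write loss / index corruption
• MySQL overload & pacemaker segfault (github)
  • MySQL primary overloaded -> a secondary is promoted -> the secondary's cold cache makes it slow -> failover back to the primary is attempted, then halted by hand
  • Next day, changes on the primary are found not to be replicated to the secondary -> segfault during manual recovery with the Replication Manager -> automatic and manual replication conflict, and (because foreign keys were inconsistent) other users' private repos were shown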
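The GC-pause case above comes down to a timeout-based failure detector being unable to tell a paused node from a dead one. A minimal sketch under that reading follows; the `HeartbeatDetector` class, the node name, and the timestamps are hypothetical, and time is passed in explicitly so the behaviour is deterministic.

```python
class HeartbeatDetector:
    """Suspects a node once no heartbeat has arrived within timeout_s seconds."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # node -> time of its most recent heartbeat

    def heartbeat(self, node, now_s):
        self.last_seen[node] = now_s

    def suspected_dead(self, node, now_s):
        last = self.last_seen.get(node)
        return last is None or (now_s - last) > self.timeout_s


def primary_suspected_during_pause(timeout_s):
    """Primary heartbeats at t=0,1,2, then stalls (e.g. a long GC) until t=7."""
    detector = HeartbeatDetector(timeout_s)
    for t in (0, 1, 2):
        detector.heartbeat("primary", t)
    # At t=6 the primary is still alive, just paused; what does the detector say?
    return detector.suspected_dead("primary", now_s=6)


short_timeout_verdict = primary_suspected_during_pause(timeout_s=3)
long_timeout_verdict = primary_suspected_during_pause(timeout_s=15)
```

With the short timeout the live primary is declared dead, and a secondary acting on that verdict produces the dual-master situation. A longer timeout avoids the false positive here but delays detection of real failures, so it is a trade-off rather than a fix.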
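Application-level failures (2/3)
• DRBD split brain
  • With 2 nodes, a node can never be certain it is the primary once the network partitions -> both nodes stay primary/online and accept writes, and they diverge at the file system level
• VoltDB on EC2
  • Network failure -> split brain -> dual primary -> replicas diverge -> serious data loss
• Mystery RabbitMQ Partition
  • Few retransmits, stable messaging, stable connectivity between nodes, and yet it partitions
  • Raising the partition detection timeout to 2 minutes reduces the frequency but cannot prevent partitions entirely. A mystery.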
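The two-node problem can be made concrete with a strict-majority rule. This is a generic sketch, not how DRBD or any system named above decides; `majority` and `may_act_as_primary` are hypothetical helper names.

```python
def majority(cluster_size):
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def may_act_as_primary(reachable_including_self, cluster_size):
    """A node may act as primary only if it can reach a strict majority."""
    return reachable_including_self >= majority(cluster_size)


# Two-node cluster split 1/1: neither side reaches the majority of 2.
two_node_sides = [may_act_as_primary(1, 2), may_act_as_primary(1, 2)]

# Three-node cluster split 2/1: exactly one side may continue.
three_node_sides = [may_act_as_primary(2, 3), may_act_as_primary(1, 3)]
```

Under this rule a partitioned two-node cluster has no primary at all, which is safe but unavailable. Letting both sides continue keeps the service up and yields the dual-primary divergence described above. With three nodes, at most one side of any split can hold a majority.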
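Application-level failures (3/3)
• ElasticSearch Discovery Failure on EC2
  • In a 2-node ES cluster, when exchanging discovery messages takes more than 3 seconds, there is a 1-in-10 chance of ending up with dual masters (demotion is manual only)
  • Resolved by raising the timeout to 15 seconds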
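Conclusion: where do we go from here?
• A summary to use as a reference
  • Processes, servers, NICs, switches, local and global networks
  • Network failures arrive "suddenly", e.g. during a routine update or during maintenance
• On the other hand, some networks/systems never partition, e.g. finance (cautious engineering + advances in network technology + money)
  • Google/Amazon (at enormous scale, the cost of each piece of HW matters) and startups (limited budget)
• All kinds of failures happen: the problems of real-world distributed systems, human error included
• It is important to reconsider the risks (before a partition happens)
  • While tracing the behavior on a whiteboard
• Handling partitions pays off in most cases
  • The added latency of partition handling versus the benefit (less time spent investigating afterwards)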
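Supplement: what would we do today (2020)?
• Formal methods
  • Can guarantee that no unknown state exists (within the model's definitions)
• Chaos Engineering
  • Can find and fix implementation mistakes and defects that appear when systems are integrated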
EOP