Slide 1

Slide 1 text

Research Paper Introduction #14 “The Network is Reliable ~An informal survey of real-world communications failures~” (#52 overall) @cafenero_777 2020/09/24

Slide 2

Slide 2 text

Agenda
• Target paper
• Overview and why I chose to read it
• Introduction
• Large-scale infrastructure
• DC NW
• Cloud NW
• Hosting providers
• Wide-area NW
• Global Routing Error
• NIC/Driver
• Applications
• Summary

Slide 3

Slide 3 text

$ which
• The Network is Reliable: An informal survey of real-world communications failures
• Peter Bailis, UC Berkeley; Kyle Kingsbury, Jepsen Networks
• ACM Queue: July 23, 2014, Volume 12, issue 7
• https://queue.acm.org/detail.cfm?id=2655736

Slide 4

Slide 4 text

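Overview and why I chose to read it
• Overview
  • NW failures (partitions) occur at a wide range of operators
  • The paper broadly surveys cases of distributed-system failures caused by them
  • Think hard at design time! (And look back after the design is done!)
• Why I chose to read it
  • It discusses failures that actually happened, broadly and concretely
  • It was cited in some paper (I forget which)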

Slide 5

Slide 5 text

The safety myths of distributed computing? http://nighthacks.com/jag/res/Fallacies.html
 http://www.rgoarchitects.com/Files/fallacies.pdf

Slide 6

Slide 6 text

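The safety myths of distributed computing?
• “Most people, when they first build a distributed system, make the following 8 assumptions. They inevitably prove false in production, cause big trouble, and lead to painful failure experiences.”
• Not only NW failures: people's assumptions are a problem too...
• Need to look broadly and, occasionally, deeply
http://nighthacks.com/jag/res/Fallacies.html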
 http://www.rgoarchitects.com/Files/fallacies.pdf

Slide 7

Slide 7 text

http://nighthacks.com/jag/res/Fallacies.html
 http://www.rgoarchitects.com/Files/fallacies.pdf Where are you?

Slide 8

Slide 8 text

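There are plenty of theories, but...
• Current state
  • Tightly coupled; not shared within organizations
  • Handled entirely within the application layer
  • Guesswork and rumor
• Compile real-world failure cases
• Learn from them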

Slide 9

Slide 9 text

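Large-scale distributed infrastructure cases (1/2)
• MS Data Center, by MS Research
  • 5.2 device failures/day, 40.8 link failures/day
  • Median time to repair is about 5 minutes (maximum 1 week); median packet loss is 59,000 packets
  • NW redundancy yields a 43% improvement -> not enough to eliminate the common causes of NW failures
• HP enterprise managed NW: support-ticket analysis by HP Labs
  • Connectivity-related tickets are 11.4% of the total; 14% of those are highest priority
  • Median incident duration is 2.75 hours for highest-priority tickets and 4 hours 18 minutes for all tickets
• Google Chubby (distributed lock system for small-volume distributed storage)
  • Studied 61 outages over 700 days; 9 lasted 30 seconds or more: 4 were NW-related and 2 were "suspected NW connectivity problems"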

Slide 10

Slide 10 text

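Large-scale distributed infrastructure cases (2/2)
• Lessons & advice on Google's distributed systems: the great Jeff Dean's survey of a new cluster (its first year)
  • 5 racks become unstable (50% packet loss)
  • 8 maintenance events (4 of which may cause 30 minutes of random packet loss)
  • 3 router failures (lasting 1 hour)
  • Advice
    • Have an abstraction layer that can handle updates absorb conflicts between multiple versions
    • Reconcile replicas after an NW failure (partition) heals
• Amazon Dynamo (KVS)
  • Notes that "traditional replicated RDBs cannot cope with NW partitions". Chose availability even at the cost of consistency
• Yahoo! PNUTS/Sherpa (geographically distributed DB)
  • Supported timeline consistency, a strong form of consistency (updates to every record are applied on all replicas in the same order)
  • -> Too painful under NW failures (partitions) and server failures, so it moved to weaker consistency...
https://research.cs.cornell.edu/ladis2009/talks/dean-keynote-ladis2009.pdf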
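To make the "same order on every replica" idea concrete, here is a minimal sketch. It assumes that a single master per record stamps each update with a sequence number and that replicas buffer early arrivals; the `Replica` class and its fields are hypothetical names for illustration, not PNUTS code.

```python
from typing import Dict, List, Optional


class Replica:
    """Applies updates to one record strictly in sequence-number order."""

    def __init__(self) -> None:
        self.value: Optional[str] = None
        self.applied = 0                    # highest sequence number applied
        self.pending: Dict[int, str] = {}   # early arrivals waiting their turn
        self.history: List[str] = []        # order in which values were applied

    def receive(self, seq: int, value: str) -> None:
        """Buffer the update, then apply every update that is now in order."""
        self.pending[seq] = value
        while self.applied + 1 in self.pending:
            self.applied += 1
            self.value = self.pending.pop(self.applied)
            self.history.append(self.value)


if __name__ == "__main__":
    updates = [(1, "a"), (2, "b"), (3, "c")]   # sequence numbers from one master
    in_order, reordered = Replica(), Replica()
    for seq, value in updates:
        in_order.receive(seq, value)
    # The network delivers the same updates to the second replica out of order.
    for seq, value in [updates[2], updates[0], updates[1]]:
        reordered.receive(seq, value)
    print(in_order.history == reordered.history)
```

The cost shows up during a partition: a replica that cannot reach the record's master can neither accept writes nor advance, which is the kind of restriction that pushes systems toward weaker consistency.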

Slide 11

Slide 11 text

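DC NW failures
• Power failure
  • One ToR goes down, then for some reason the other ToR goes down too; inter-rack communication is lost and the on-demand service stops...
  • -> Redundancy does not necessarily prevent link failures (as the MS SIGCOMM paper suggests)
• BPDU flood
  • STP flapped during maintenance and a BPDU flood occurred (per the BPDU standard this should not happen). 2-hour service outage
• Bridge loop/Misconfiguration/Broken MAC cache (github)
  • Introduced aggregation switches (replacing cascaded switches) -> loops formed -> links disabled -> for some reason bandwidth usage stayed pegged at 100%...
  • One link taken down while misconfigured -> the fault-detection mechanism cut every link -> 18 minutes of downtime
  • A firmware bug: switches could not update their MAC address caches correctly, so they broadcast...
• MLAG/STP/STONITH (github)
  • The vendor stopped a particular agent on an aggregation switch -> links could not be shut down -> LAG/STP/L2 protocols could not be handled normally -> traffic blocked for 90 seconds during STP recalculation
  • File servers (Pacemaker/DRBD) each judged the other to be down -> STONITH (force-reboot the peer) -> both sides went down right after the NW recovered -> files inaccessible
  • Manual recovery (align to the primary node; if both sides were primary, dig through the logs...) took 5 hours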

Slide 12

Slide 12 text

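Cloud network cases (1/2)
• Isolated MongoDB primary on EC2
  • NW failure in the EC2 West region -> 1 primary and 2 secondaries were separated -> after recovery the old primary "overwrote" the data -> 2 hours of writes lost
  • The failure itself was an ordinary one...
• Mnesia split-brain on EC2
  • Split brain overnight; the operations team resolved it by restarting one side
• MongoDB/ElasticSearch on EC2
  • An NW failure hits specific nodes -> it can affect the entire service
  • Backend outages of a few seconds, a few times a month -> service outages of up to 45 minutes and a corrupted ES index; outages escalated to 2-4 times a day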

Slide 13

Slide 13 text

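Cloud network cases (2/2)
• AWS EBS outage (2011/04/21)
  • EBS traffic was shifted away within a US East AZ -> routing policy mistake -> primary and secondary NWs disconnected at the same time -> EBS mirroring storm -> congestion -> EC2 down for 12 hours, EBS down for 80 hours
  • RDS outage (2.5% failed because of an AZ failover bug); Heroku down for 16-60 hours
• Isolated Redis Primary on EC2
  • NW failure -> Twilio's billing primary Redis was isolated -> no secondary was promoted, so writes kept going to the primary -> resynchronization after recovery overloaded the primary -> the primary was restarted manually -> it came up with the wrong config, in read-only mode -> every Twilio API call recharged the customer -> 1.1% of customers overcharged within 40 minutes -> a $500 credit card charge per SMS/call, and charges were refused once the total exceeded $3,500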
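The Redis incident comes down to charging a card and then failing to record the resulting credit. The sketch below is a deliberately simplified, hypothetical model (the class, function, and parameter names are invented here and are not Twilio's code): a swallowed write error against a read-only store makes every later call look like a low balance again.

```python
from typing import List


class ReadOnlyStoreError(Exception):
    """Raised when a write is attempted against a read-only store."""


class BalanceStore:
    """Stand-in for the billing datastore (hypothetical, for illustration)."""

    def __init__(self, balance: int, read_only: bool = False) -> None:
        self.balance = balance
        self.read_only = read_only

    def credit(self, amount: int) -> None:
        if self.read_only:
            raise ReadOnlyStoreError("store is read-only")
        self.balance += amount


def handle_api_call(store: BalanceStore, card_charges: List[int],
                    recharge: int, verify_write: bool) -> None:
    """Top up an empty balance by charging the card, then record the credit."""
    if store.balance > 0:
        return  # enough credit, nothing to do
    if verify_write and store.read_only:
        return  # refuse to charge when the credit cannot be recorded
    card_charges.append(recharge)      # the card is charged here...
    try:
        store.credit(recharge)         # ...but the credit may never be recorded
    except ReadOnlyStoreError:
        pass  # swallowed error: the next call sees the same empty balance


if __name__ == "__main__":
    charges: List[int] = []
    store = BalanceStore(balance=0, read_only=True)
    for _ in range(3):
        handle_api_call(store, charges, recharge=500, verify_write=False)
    print(charges)  # one charge per call, because no credit was ever stored
```

Checking that the store is writable before charging (`verify_write=True`) is only one possible guard; making the charge idempotent per billing period would be another.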

Slide 14

Slide 14 text

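Hosting providers
• Cheap and (supposedly) highly reliable, but you have to be your own NW/server administrator
• GlusterFS split brain (Freistil IT)
  • A router firmware bug caused 50%-100% packet loss -> GlusterFS went split brain -> even after recovery the inconsistency between the 2 data sets could not be resolved -> after the repair, traffic surged and overloaded the Web nodes
• An anonymous major hosting provider (100-200 node scale)
  • 5 partition periods occurred in 90 days
  • Failures that partitioned the NW connecting internal and external networks, and failures that partitioned the internal and management NWs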

Slide 15

Slide 15 text

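Wide-area NW
• WAN failures: there are often few redundant routes, DR requires multiple DCs, and so on
• CENIC (Corporation for Education Network Initiatives in California) study
  • Studied routers across California for 5 years (link failures, eBGP/traceroute data, etc.)
  • Found more than 500 NW partitions
  • SW-related: average 6 minutes (median 2.7 minutes, 95th percentile 19.9 minutes)
  • HW-related: average 8.2 hours (median 32 minutes, 95th percentile 3.7 days)
• PagerDuty on 2 EC2 regions/Linode
  • An AWS peering point in northern California degraded -> EC2 node connectivity degraded and latency rose -> quorum was lost entirely -> message dispatch stopped
  • Partitions had been considered in the design, but the result was still 18 minutes of unavailability, dropped API requests, and pages delayed until quorum was restored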

Slide 16

Slide 16 text

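Global Routing Error
• Cloudflare
  • 23 DCs making heavy use of redundant paths/Anycast
  • As a DDoS countermeasure, a rule to drop packets of a specific size was pushed to all edge routers via FlowSpec -> no packet matched the rule -> the routers kept consuming RAM until they crashed -> they did not reboot automatically and mgmt was unreachable too -> the ones that partially recovered were overloaded by concentrated traffic -> they went down again -> manual on-site reboots (recovery began after 30 minutes; unavailable for 1 hour)
• Level3, 2011
  • Backbone outage caused by a Juniper firmware bug
  • Time Warner Cable, RIM BlackBerry, and UK ISPs went offline
• Global BGP outages
  • In 2008 Pakistan Telecom blocked YouTube -> advertised that (blocked) route to other ISPs -> the site became unreachable (BGP hijack)
  • In 2010 Duke University observed a similar effect while testing an experimental BGP flag
• Presenter's note: BGP hijacks (BGP advertisements announced by an unauthorized AS) happen fairly often
  • Example: the 2018 incident caused by Google (a bug in an automated system?)
https://gigazine.net/news/20180711-shutting-down-bgp-hijack-factory/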

Slide 17

Slide 17 text

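NIC/Driver
• Broadcom BCM5709 and Friends
  • BCM5709
    • A firmware bug that drops inbound packets but not outbound ones
    • -> The primary cannot process (receive) requests, but the secondary believes the primary is alive -> no failover to the secondary -> 5-hour outage -> Sven Ulland reported it on Linux 2.6.32; it could not be resolved until 2.6.38
  • BCM5709 (server) emits PAUSE frames indiscriminately when it crashes or its buffer overflows -> ToR switches (BCM56314/BCM56820) amplify them -> the whole NW fails
  • BCM57711 shows worse latency under high load with jumbo frames -> the problem surfaces with ESX on iSCSI
• Intel 82574: Packet of Death
  • The EEPROM was not flashed correctly -> a received SIP packet disables the NIC (very hard to investigate...) -> recovers after a cold restart...
• Driver-induced GlusterFS partition
  • NW failure on GlusterFS pairs after an upgrade -> recovered by disabling LAG -> recurred 12 hours later -> traced to the driver and rolled back -> data inconsistency and corrupted VM file system data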
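The BCM5709 case shows why a push-style heartbeat is a weak liveness signal: it only exercises the outbound path. The following is a minimal sketch under assumed names (`Nic`, `Server`, and `spare_takes_over` are hypothetical and do not model any real HA stack) comparing a one-way heartbeat with a probe that requires a round trip.

```python
from dataclasses import dataclass


@dataclass
class Nic:
    inbound_ok: bool = True
    outbound_ok: bool = True


@dataclass
class Server:
    nic: Nic

    def sends_heartbeat(self) -> bool:
        # A push-style heartbeat only needs the outbound path.
        return self.nic.outbound_ok

    def answers_probe(self) -> bool:
        # A probe must arrive (inbound) and its reply must leave (outbound).
        return self.nic.inbound_ok and self.nic.outbound_ok


def spare_takes_over(primary: Server, use_probe: bool) -> bool:
    """The spare promotes itself only if the primary looks dead."""
    alive = primary.answers_probe() if use_probe else primary.sends_heartbeat()
    return not alive


if __name__ == "__main__":
    faulty = Server(Nic(inbound_ok=False))   # drops inbound, still sends
    print(spare_takes_over(faulty, use_probe=False))
    print(spare_takes_over(faulty, use_probe=True))
```

With the one-way heartbeat the spare never takes over even though the primary cannot receive a single request; the round-trip probe notices the fault. A probe has its own failure modes (it can misfire when the fault is on the spare's side), so this is an illustration rather than a recommendation.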

Slide 18

Slide 18 text

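Application-level failures (1/3)
• Not only the physical NW is to blame: program and OS scheduler delays, heavily loaded processes, etc.
• High CPU load and services
  • ElasticSearch restart -> cluster splits -> split brain -> cluster recovers -> indices cannot be added or deleted -> servers keep trying to recover unassigned indices (they cannot, so the CPU spins) -> 20 minutes of unavailability, 6 hours of degraded service
• Long GC pauses and I/O
  • GC runs in an ES cluster -> a secondary node declares the primary dead -> the primary is not actually dead, so split brain (dual master)
  • GC stalls because of I/O -> IO_WAIT time increases -> split brain/write loss/index corruption
• MySQL overload & pacemaker segfault (github)
  • Load on the MySQL primary -> the secondary is promoted -> the secondary is slow because its cache is cold -> an attempted failover back to the primary was halted manually
  • The next day, changes on the primary were found not to have reached the secondary -> segfault during manual recovery with Replication Manager -> automatic and manual replication conflicted, and (because foreign keys were inconsistent) other users' private repos were shown...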
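A timeout-based failure detector cannot tell a paused process from a dead one, which is how a long GC pause turns into a split brain. The sketch below assumes a simple "no heartbeat for longer than `timeout`" rule; `suspected_dead_at` and the timestamps are hypothetical, not Elasticsearch's actual discovery logic.

```python
from typing import List


def suspected_dead_at(heartbeat_times: List[float], timeout: float,
                      now: float) -> List[float]:
    """Return the moments at which a timeout-based detector would declare
    the peer dead, given when its heartbeats actually arrived."""
    suspicions = []
    last = heartbeat_times[0]
    for t in heartbeat_times[1:] + [now]:
        if t - last > timeout:
            # The detector fires `timeout` seconds after the last heartbeat.
            suspicions.append(last + timeout)
        last = t
    return suspicions


if __name__ == "__main__":
    # Heartbeats every second, then a 10-second stop-the-world pause;
    # the primary is alive throughout and resumes at t=15.
    beats = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 15.0, 16.0]
    print(suspected_dead_at(beats, timeout=3.0, now=17.0))
    print(suspected_dead_at(beats, timeout=15.0, now=17.0))
```

A longer timeout avoids the false positive here, but it also delays detection of a primary that really has died; the timeout trades one failure mode for the other rather than removing either.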

Slide 19

Slide 19 text

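Application-level failures (2/3)
• DRBD split brain
  • With 2 nodes, a node can never be certain (once the NW partitions) that it is the primary -> both nodes stay primary/online and accept writes, and they diverge at the file system level
• VoltDB on EC2
  • NW failure -> split brain -> dual primary -> replicas diverge -> serious data loss
• Mystery RabbitMQ Partition
  • Few retransmissions, stable messaging, stable inter-node connections, and yet it partitions...
  • Raising the partition-detection timeout to 2 minutes reduces the frequency but cannot prevent partitions completely. A mystery.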
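The two-node problem can be shown with a few lines of arithmetic. The sketch below assumes a strict-majority rule for deciding which side of a partition may act as primary; this is a general technique, not a description of how DRBD or VoltDB behave, and the function names are hypothetical.

```python
from typing import List, Tuple


def has_majority(reachable: int, cluster_size: int) -> bool:
    """True if a node that can reach `reachable` members (itself included)
    holds a strict majority of a `cluster_size`-node cluster."""
    return reachable > cluster_size // 2


def sides_allowed_to_be_primary(cluster_size: int,
                                split: Tuple[int, int]) -> List[bool]:
    """For a partition that splits the cluster into two sides, report which
    side (if any) may safely keep or elect a primary."""
    assert sum(split) == cluster_size
    return [has_majority(side, cluster_size) for side in split]


if __name__ == "__main__":
    print(sides_allowed_to_be_primary(2, (1, 1)))  # [False, False]
    print(sides_allowed_to_be_primary(3, (2, 1)))  # [True, False]
```

In a two-node cluster neither side has a majority, so the choice is between refusing writes on both sides (losing availability) and accepting them on both (risking divergence). A third voter lets exactly one side continue.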

Slide 20

Slide 20 text

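Application-level failures (3/3)
• ElasticSearch Discovery Failure on EC2
  • In a 2-node ES cluster, if exchanging discovery messages takes more than 3 seconds, there is a 1-in-10 chance of ending up with dual masters... (demotion is manual only)
  • Resolved by raising the timeout to 15 seconds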

Slide 21

Slide 21 text

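Conclusion: where do we go from here?
• A summary to serve as a reference
  • Process, server, NIC, switch, local, global
• NW failures arrive "suddenly". ex: during routine updates, during maintenance
• On the other hand, some NWs/systems never see partitions. ex: finance (careful engineering + advances in NW technology + money)
• Google/Amazon (at enormous scale, each individual piece of HW is low-cost) and startups (limited budgets)
  • All sorts of failures occur. Real-world distributed-system problems, human error included, do happen
• It is important to rethink the risks (before a partition happens)
  • While tracing the behavior on a whiteboard
• Handling partitions pays off in most cases
  • The added latency of partition handling vs. the benefit (less time spent on reconciliation afterwards)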

Slide 22

Slide 22 text

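Supplement: what would we do today (2020)?
• Formal methods
  • Can guarantee that unknown states do not exist (by definition)
• Chaos Engineering
  • Can find and fix implementation mistakes and inconsistencies that appear when systems are integrated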

Slide 23

Slide 23 text

EOP