#14 “The Network is Reliable”

ACM queue: July 23, 2014 Volume 12, issue 7
https://queue.acm.org/detail.cfm?id=2655736

cafenero_777

June 14, 2023

Transcript

  1. Research Paper Introduction #14 “The Network is Reliable: An informal survey of real-world communications failures” (overall #52), @cafenero_777, 2020/09/24
  2. Agenda
    • Target paper
    • Overview and why I read it
    • Introduction
    • Large-scale infrastructure
    • DC NW
    • Cloud NW
    • Hosting providers
    • Wide-area NW
    • Global Routing Error
    • NIC/Driver
    • Applications
    • Summary
  3. $ which
    • The Network is Reliable: An informal survey of real-world communications failures
    • Peter Bailis, UC Berkeley; Kyle Kingsbury, Jepsen Networks
    • ACM queue: July 23, 2014 Volume 12, issue 7
    • https://queue.acm.org/detail.cfm?id=2655736
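  4. Large-scale distributed infrastructure: case studies (1/2)
    • Microsoft data centers (study by MS Research)
      • 5.2 device failures/day and 40.8 link failures/day
      • Median time to repair about 5 minutes (maximum 1 week); median packet loss 59,000 packets
      • Network redundancy improves things by only 43%, so it does not eliminate the common causes of network failure
    • HP enterprise managed networks (support-ticket analysis by HP Labs)
      • Connectivity-related tickets were 11.4% of all tickets; 14% of those were at the highest priority
      • Median handling time: 2 hours 45 minutes for highest-priority tickets, 4 hours 18 minutes for all tickets
    • Google Chubby (distributed lock system for low-volume distributed storage)
      • 61 outages over 700 days were examined; 9 lasted more than 30 seconds, of which 4 were caused by the network and 2 by "suspected network connectivity problems"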
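  5. Large-scale distributed infrastructure: case studies (2/2)
    • Lessons and advice on Google's distributed systems, from Jeff Dean's survey of a new cluster (its first year)
      • 5 racks became unstable (50% packet loss)
      • 8 network maintenance events (4 of which could cause about 30 minutes of random packet loss)
      • 3 router failures (lasting 1 hour)
      • Advice
        • Absorb conflicting updates to multiple versions in an abstraction layer that can resolve them
        • Reconcile replicas after recovering from a network failure (partition)
    • Amazon Dynamo (KVS)
      • Notes that traditional replicated relational databases cannot cope with network partitions; chose availability even at the cost of consistency
    • Yahoo! PNUTS/Sherpa (geo-distributed DB)
      • Supported timeline consistency, a strong form of consistency (updates to every record are applied in the same order on all replicas)
      • -> Network failures (partitions) and server failures made this painful, so it switched to weaker consistency
    • https://research.cs.cornell.edu/ladis2009/talks/dean-keynote-ladis2009.pdf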
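  6. DC NW failures
    • Power failure
      • One ToR switch went down, then for unknown reasons the other ToR went down too; racks could not talk to each other and the on-demand services stopped
      • -> Redundancy does not necessarily protect against link failure (as the MS SIGCOMM paper suggests)
    • BPDU flood
      • STP flapped during maintenance (which should not happen according to the BPDU specification) and a BPDU flood followed; service was down for 2 hours
    • Bridge loop/Misconfiguration/Broken MAC cache (github)
      • Aggregation switches were introduced (replacing daisy-chained switches) -> bridge loops formed -> links were disabled -> for some reason interfaces stayed pegged at 100% utilization
      • With a misconfigured switch, one link was taken down -> the fault-detection mechanism disabled all links -> 18 minutes of downtime
      • Traced to a firmware bug: the switches could not update their MAC address caches correctly and so broadcast packets
    • MLAG/STP/STONITH (github)
      • The vendor stopped a particular agent on an aggregation switch -> the link could not be shut down -> LAG/STP/L2 protocols were not handled properly -> traffic was blocked for 90 seconds while STP recalculated
      • The file servers (Pacemaker/DRBD) each decided the other was down -> STONITH (force-reboot the peer) -> both nodes went down right after the network recovered -> files were inaccessible
      • Manual recovery (align to the primary node; if both were primary, investigate the logs) took 5 hours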
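  7. Cloud network case studies (1/2)
    • Isolated MongoDB primary on EC2
      • Network failure in the EC2 West region -> the primary was separated from its 2 secondaries -> after recovery the old primary "overwrote" the new state -> 2 hours of writes lost
      • The failure itself was an ordinary one
    • Mnesia split-brain on EC2
      • Split brain developed overnight; the operations team resolved it by restarting one side
    • MongoDB/ElasticSearch on EC2
      • Network failure -> only certain nodes affected -> the whole service can still be affected
      • Backend outages of a few seconds, a few times a month -> service outages of up to 45 minutes and a corrupted ElasticSearch index; outages escalated to 2-4 times a day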
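  8. Cloud network case studies (2/2)
    • AWS EBS outage (2011/04/21)
      • EBS traffic was shifted away within a US East AZ -> routing policy mistake -> primary and secondary networks disconnected at the same time -> EBS re-mirroring storm -> congestion -> EC2 down for 12 hours, EBS down for 80 hours
      • RDS outage (2.5% of databases failed to fail over because of an AZ failover bug); Heroku down for 16-60 hours
    • Isolated Redis Primary on EC2
      • Network failure -> Twilio's billing primary Redis was isolated -> no secondary was promoted, so writes kept going to the primary -> resynchronization after recovery overloaded the primary -> the primary was restarted manually -> it came up with a wrong config, in read-only mode -> every Twilio API call recharged the customer -> 1.1% of customers were overcharged over 40 minutes
      • Example: each SMS or call charged $500 to the credit card, and the card stopped accepting charges after $3,500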
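  9. Hosting providers
    • Cheap and (supposedly) highly reliable, but you have to be your own network/server administrator
    • GlusterFS split brain (Freistil IT)
      • Router firmware bug caused 50%-100% packet loss -> GlusterFS went split brain -> even after recovery the inconsistency between the two data sets could not be resolved -> after the repair, a traffic surge put the web nodes under heavy load
    • An anonymous major hosting provider (100-200 node scale)
      • 5 periods of partition in 90 days
      • Some partitions separated the internal network from the outside; others separated the internal network from the managed network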
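  10. Wide-area NW
    • WAN failures: few redundant routes are available, multi-DC disaster recovery is needed, and so on
    • CENIC (Corporation for Education Network Initiatives in California) study
      • 5 years of data on routers across California (link failures, eBGP/traceroute data, etc.)
      • More than 500 network partitions found
      • Software-related: average 6 minutes (median 2.7 minutes, 95th percentile 19.9 minutes)
      • Hardware-related: average 8.2 hours (median 32 minutes, 95th percentile 3.7 days)
    • PagerDuty on 2 EC2 regions/Linode
      • An AWS peering point in northern California degraded -> connectivity of an EC2 node degraded and latency rose -> quorum was lost entirely -> message dispatch stopped
      • Partitions had been considered in the design, but the result was still 18 minutes of unavailability; API requests were dropped and pages were delayed until quorum was restored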
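  11. Global Routing Error
    • CloudFlare
      • 23 DCs making full use of redundant paths and anycast
      • As a DDoS countermeasure, a rule to drop packets of a specific size was propagated to all edge routers via FlowSpec -> no packet actually matched it -> the routers kept consuming RAM until they crashed -> they did not reboot automatically and management ports were unreachable -> the DCs that did recover were overloaded by concentrated traffic -> they fell over again -> routers were rebooted by hand on site (recovery began after 30 minutes; 1 hour of unavailability)
    • Level3, 2011
      • Backbone outage caused by a Juniper firmware bug
      • Time Warner Cable, RIM BlackBerry and UK ISPs went offline
    • Global BGP outages
      • In 2008 Pakistan Telecom blocked YouTube -> it advertised that (blocked) route to other ISPs -> YouTube became unreachable (BGP hijack)
      • In 2010 a Duke University group produced a similar effect by testing an experimental BGP flag
      • Presenter's note: BGP hijacks (BGP advertisements made without authority from an illegitimate AS) happen fairly often
        • Example: the 2018 incident caused by Google (a bug in an automated system?) https://gigazine.net/news/20180711-shutting-down-bgp-hijack-factory/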
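  12. NIC/Driver
    • Broadcom BCM5709 and Friends
      • BCM5709: firmware bug that drops inbound packets but not outbound ones
        • -> The primary cannot process (receive) requests, but the secondary believes the primary is alive -> no failover to the secondary -> 5-hour outage -> Sven Ulland reported the same on Linux 2.6.32; it was not resolved until 2.6.38
      • BCM5709 (server side) emits spurious PAUSE frames when it crashes or its buffer overflows -> ToR switches (BCM56314/BCM56820) magnify the problem -> the whole network fails
      • BCM57711: latency worsens under heavy load with jumbo frames -> the problem surfaces with ESX on iSCSI
    • Intel 82574: Packet of Death
      • EEPROM was not flashed correctly -> a particular inbound SIP packet disables the NIC (very hard to diagnose) -> only a cold restart brings it back
    • Driver-induced GlusterFS partition
      • After an upgrade, network failures between GlusterFS pairs -> LAG was disabled and service recovered -> the failures recurred 12 hours later -> a driver was identified as the cause and service was restored -> data inconsistency between pairs and corrupted VM file-system data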
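  13. Application-level failures (1/3)
    • The physical network is not the only cause: program and OS scheduler delays, heavily loaded processes, etc.
    • High CPU load and services
      • ElasticSearch restarted -> the cluster split -> split brain -> cluster recovered -> indices could not be created or deleted -> servers kept trying to recover unassigned indices (which cannot succeed, so CPU was burned for nothing) -> 20 minutes of unavailability and 6 hours of degraded service
    • Long GC pauses and I/O
      • GC runs in an ElasticSearch cluster -> a secondary node declares the primary dead -> the primary is not actually dead, so the result is split brain (dual master)
      • I/O stalls cause GC pauses -> IO_WAIT time rises -> split brain, write loss, index corruption
    • MySQL overload & pacemaker segfault (github)
      • Load on the MySQL primary -> a secondary was promoted -> the secondary was slow because of its cold cache -> the system tried to fail back to the primary, and automatic failover was halted manually
      • The next day, changes on the primary were found not to be reaching the secondary -> segfault during manual recovery with the replication manager -> automatic and manual replication conflicted, and (because foreign keys were inconsistent) other users' private repos were displayed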
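  14. Application-level failures (2/3)
    • DRBD split brain
      • With 2 nodes, a node cannot be sure it is the primary once the network is partitioned -> both nodes accept writes in primary/online state, and they diverge at the file-system level
    • VoltDB on EC2
      • Network failure -> split brain -> dual primary -> replica divergence -> significant data loss
    • Mystery RabbitMQ Partition
      • Few retransmissions, steady messages, stable inter-node connections, yet partitions still occur
      • Raising the partition-detection timeout to 2 minutes reduces the frequency but does not prevent partitions completely. A mystery.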
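  15. Application-level failures (3/3)
    • ElasticSearch Discovery Failure on EC2
      • In a 2-node ElasticSearch cluster, when exchanging discovery messages took more than 3 seconds, the cluster ended up with dual masters about 1 time in 10 (demotion is manual only)
      • Resolved by raising the timeout to 15 seconds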
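  16. Conclusion: where do we go from here?
    • A summary to use as a reference
      • Failures occur at every level: process, server, NIC, switch, local, global
      • Network failures arrive "suddenly", e.g. during routine updates or maintenance
    • On the other hand, some networks/systems do not experience partitions, e.g. finance (careful engineering + advances in network technology + money)
    • Google/Amazon (enormous scale, so each piece of hardware is low-cost) and startups (limited budgets)
      • All kinds of failures happen; these are real-world distributed-systems problems, human error included
    • It is important to reconsider the risks (before a partition happens)
      • Tracing the behavior on a whiteboard
    • Handling partitions pays off in most cases
      • Trade-off: the extra latency of partition handling versus the benefit (less reconciliation time afterwards)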
  17. EOP
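
Several of the incidents above (the DRBD pair, the 2-node ElasticSearch cluster, the VoltDB cluster) come down to both sides of a partition acting as primary. As a rough illustration only, the Python sketch below counts how many sides of a partition would keep a primary with and without a strict-majority rule. The names `has_quorum` and `primaries_after_partition` are made up for this note and are not taken from the paper or from any product mentioned; real systems add fencing, leases and timeouts on top of this.

```python
from typing import Sequence


def has_quorum(reachable: int, cluster_size: int) -> bool:
    """Return True if `reachable` nodes (including self) are a strict majority."""
    return reachable > cluster_size // 2


def primaries_after_partition(
    cluster_size: int, sides: Sequence[int], require_majority: bool
) -> int:
    """Count the partition sides that would run a primary.

    `sides` lists how many nodes end up on each side of the partition.
    Without a majority rule, every non-empty side elects its own primary
    (the split-brain case); with it, only a side holding a strict
    majority may do so.
    """
    if sum(sides) != cluster_size:
        raise ValueError("sides must add up to cluster_size")
    if require_majority:
        return sum(1 for s in sides if has_quorum(s, cluster_size))
    return sum(1 for s in sides if s > 0)


if __name__ == "__main__":
    # 2 nodes split 1/1, no majority rule: both become primary (split brain).
    print(primaries_after_partition(2, [1, 1], require_majority=False))  # 2
    # 2 nodes split 1/1 with a majority rule: neither side may be primary.
    print(primaries_after_partition(2, [1, 1], require_majority=True))   # 0
    # 3 nodes split 2/1 with a majority rule: exactly one primary survives.
    print(primaries_after_partition(3, [2, 1], require_majority=True))   # 1
```

Under these assumptions a 2-node cluster has to pick between split brain and losing its primary during a partition, while a 3-node cluster can keep exactly one primary on the majority side.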