ResilientFlow: Deployments of Distributed Control Channel Maintenance Modules to Recover SDN from Unexpected Failures
†Takuma Watanabe, †T. Omizo, *T. Akiyama, †K. Iida
DRCN 2015, Kansas City, MO, USA, 2015-03-27
†Tokyo Institute of Technology  *Kyoto Sangyo University
SDN lacks reliability [2]
• Especially against large-scale, unexpected link failures
• What we did:
  • Add resiliency to SDN against multiple link failures
  • By adding a self-healing module (CCMM) to SDN switches
  • Design, implementation, and experimental evaluation of CCMM

[1] B. Nunes, M. Mendonca, X. Nguyen, and K. Obraczka, "A Survey of Software-Defined Networking: Past, Present, and Future of Programmable Networks," IEEE Communications Surveys & Tutorials, vol. 16, no. 3, pp. 1617–1634, Third Quarter 2014.
[2] B.J. van Asten, et al., "Scalability and Resilience of Software-Defined Networking: An Overview," arXiv:1408.6760v1 [cs.NI], Technical Report, Cornell Univ., Aug. 2014.
Background: demands for manageability and flexibility
• Multifarious resource demands from applications
• Inflation of operational tasks on network administrators
• Wide adoption of SDN (Software-Defined Networking), which promises manageable, flexible networking
[Figure: applications dynamically controlling the networking infrastructure]
SDN separates the control plane from the data plane
• Centralization of the control plane into the controller
• Control plane: collects the topology, calculates routing entries
• Data plane: forwards packets
[Figure: conventional architecture (control and data plane in every switch) vs. SDN architecture (a controller connected to switches over the control channel)]
Every SDN switch relies on the controller
• A failure of the control channel and/or the controller can corrupt the entire network
• Research issue: bring reliability against link failures and controller failures [2]
[Figure: switches losing their control channels to the controller]
Related work: resiliency in the OpenFlow SDN specification [3]
• Fail secure / fail standalone mode (since 1.1)
• Fast failover groups (since 1.1; see the sketch below)
• Multiple controllers (since 1.2)
• Regarding controller failure:
  • State synchronization method for backup controllers [4]
• Regarding control channel failure:
  • Link-down failover using the central controller [6]
  • These target fast failover (within 50 ms)

[3] N. McKeown, et al., "OpenFlow: Enabling Innovation in Campus Networks," ACM SIGCOMM Computer Communication Review, Mar. 2008.
[4] P. Fonseca, et al., "A Replication Component for Resilient OpenFlow-based Networking," IEEE NOMS, 2012.
[6] S. Sharma, et al., "Fast Failure Recovery for In-band OpenFlow Networks," IEEE DRCN, 2013.
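As a side note, the fast failover groups mentioned above can be configured on Open vSwitch without controller involvement once installed. The following is a minimal sketch, not from the paper: the bridge name br0, the port numbers, and the Python wrapper are our assumptions.

```python
import subprocess

# Hypothetical sketch: configure an OpenFlow 1.3 fast-failover group
# on Open vSwitch. Bridge "br0" and ports 1/2/3 are assumed, not real.
def add_fast_failover(bridge="br0"):
    # Bucket for port 1 is used while port 1 is live; traffic falls
    # over to port 2 as soon as liveness of port 1 is lost.
    subprocess.run(
        ["ovs-ofctl", "-O", "OpenFlow13", "add-group", bridge,
         "group_id=1,type=ff,"
         "bucket=watch_port:1,output:1,"
         "bucket=watch_port:2,output:2"],
        check=True)
    # Steer traffic arriving on port 3 into the failover group.
    subprocess.run(
        ["ovs-ofctl", "-O", "OpenFlow13", "add-flow", bridge,
         "priority=100,in_port=3,actions=group:1"],
        check=True)

if __name__ == "__main__":
    add_fast_failover()
```

Note that this mechanism only protects against failures for which a backup port was pre-provisioned, which is exactly the limitation the next slide addresses.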
Issues in current research:
• Not targeting multiple, unexpected link failures
• Limited failover capability, relying on the centralized controller
→ Lack of capability against large-scale failures (e.g., earthquakes)
• Goal: bring SDN a self-healing capability for the control channel against large-scale, unexpected link failures
• Proposal: ResilientFlow, a self-healing mechanism for the control channel
• Independent of the centralized controller
• Targets only control channel failover
• Adds to each switch a module for distributed control channel self-healing: the CCMM (Control Channel Maintenance Module)
(Recap of issues in current research: not targeting multiple, unexpected link failures; limited failover capability relying on the centralized controller)
ResilientFlow self-heals the control channel
[Figure: before vs. after link failure. Legend: C = controller, S = switch, M = CCMM (proposed). The CCMM detects the link failure and sets up a new path for the control channel.]
Requirements
(1) Disconnection detection: switches should detect their control channel disconnection
(2) Path calculation: switches should calculate an alternative path to the controllers
(3) Establishing a new control channel: switches should be able to restore the control channel by modifying their own flow tables
[Figure: a switch with a CCMM establishing a new path to the controller after a link failure. Legend: C = controller, S = switch, M = CCMM (proposed).]
CCMM design
[Figure: each switch hosts a CCMM alongside its control channel endpoint. The CCMM (1) monitors link status via heartbeats, (2) exchanges the topology map with neighboring CCMMs, and (3) installs flow entries toward the controllers.]
(1) Disconnection detection: switches should detect their control channel disconnection
(2) Path calculation: switches should calculate an alternative path to the controllers
(3) Establishing a new control channel: switches should be able to restore the control channel by modifying their own flow tables
• (1) and (2): normal link-state routing already provides these; we use the Quagga OSPF daemon
• (3): we implement a Flow Installer in Python (see the sketch below)
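To make step (3) concrete, here is a minimal sketch of what a Flow Installer of this kind could look like. It is not the paper's actual code: the controller address, port arguments, and the use of ovs-ofctl are our assumptions.

```python
import subprocess

# Assumed controller address; OpenFlow control traffic commonly
# uses TCP port 6633.
CONTROLLER_IP = "10.0.0.100"

def install_control_channel_flows(bridge, next_hop_port):
    """Hypothetical sketch: after OSPF has converged on an alternative
    path, install bidirectional flow entries for the control channel,
    forwarding via the next-hop port taken from the routing table."""
    flows = [
        # Switch-to-controller direction: out via the new next hop.
        "priority=200,tcp,nw_dst=%s,tp_dst=6633,actions=output:%d"
        % (CONTROLLER_IP, next_hop_port),
        # Controller-to-switch direction: deliver to the local
        # control-plane endpoint of this switch.
        "priority=200,tcp,nw_src=%s,tp_src=6633,actions=LOCAL"
        % CONTROLLER_IP,
    ]
    for flow in flows:
        subprocess.run(["ovs-ofctl", "add-flow", bridge, flow],
                       check=True)

if __name__ == "__main__":
    install_control_channel_flows("br0", next_hop_port=2)
```

Every switch along the alternative path would run the same logic, which is why (as the Scenario 1 results later show) restoration completes only once all switches on the path are configured.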
Experimental evaluation: overview
• Feasibility (validity):
  • Works with any type of control channel (Scenario 1)
  • Works with multiple, unexpected link failures (Scenario 2)
• Performance:
  • Switch restoration time (Scenario 1)
  • Network restoration time (Scenario 2)
• Method: network emulation with a modified Mininet [11] (an illustrative sketch follows)

[11] B. Lantz, et al., "A Network in a Laptop: Rapid Prototyping for Software-Defined Networks," Proc. ACM SIGCOMM Workshop on Hot Topics in Networks (HotNets-IX), pp. 19:1–19:6, Oct. 2010.
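For readers unfamiliar with Mininet, the sketch below shows the kind of emulated setup such an experiment can use. The paper uses a modified Mininet; this topology, the controller address, and the link-failure injection are illustrative assumptions only.

```python
from mininet.net import Mininet
from mininet.node import RemoteController, OVSSwitch
from mininet.topo import Topo
from mininet.cli import CLI

class RingTopo(Topo):
    """Illustrative 4-switch ring: the redundant links give a CCMM an
    alternative path to the controller when one link fails."""
    def build(self, n=4):
        switches = [self.addSwitch('s%d' % i) for i in range(1, n + 1)]
        for i in range(n):
            self.addLink(switches[i], switches[(i + 1) % n])

if __name__ == '__main__':
    # Assumed: an external controller listening on 127.0.0.1:6633.
    net = Mininet(topo=RingTopo(), switch=OVSSwitch,
                  controller=lambda name: RemoteController(
                      name, ip='127.0.0.1', port=6633))
    net.start()
    # Inject a link failure between two switches, e.g.:
    # net.configLinkStatus('s1', 's2', 'down')
    CLI(net)
    net.stop()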
Experimental Evaluation › Scenario 1
[Figure: three failure cases in the single specified link failure experiments: (a) out-to-in, (b) in-to-in, (c) in-to-in-middle]

After the disconnection, each switch asynchronously collects the network topology map, calculates the path, and installs the flow entries; the switch can restore connectivity to the controller only after all switches along the path from the switch to the controller are configured.

Results (mean ± std. dev.):
Out-to-in:     topology generation 211.5 ± 0.3 ms,  control channel restoration 246.3 ± 7.0 ms
In-to-in:      topology generation 211.5 ± 0.6 ms,  control channel restoration 279.3 ± 23.9 ms
In-to-in-mid:  topology generation 217.9 ± 8.7 ms,  control channel restoration 292.7 ± 9.3 ms

• Restoration time is under 300 ms in all cases
• The latter two cases take longer to restore; this is considered to be because the in-to-in and in-to-in-middle cases take a longer path than the other case after the disconnection
• More nodes in the topology may impact OSPF stabilization time
Experimental Evaluation › Scenario 2
Feasibility (validity): works with multiple, unexpected link failures
• Real-world topology from the Internet Topology Zoo [13][14]
• 36 nodes, 76 links
• 1 controller (at the highest-degree node), 35 switches
• Measured against the link failure rate:
  • Number of controller-reachable switches
  • Network restoration time

[13] S. Knight, H.X. Nguyen, N. Falkner, R. Bowden, and M. Roughan, "The Internet Topology Zoo," IEEE Journal on Selected Areas in Communications, vol. 29, no. 9, pp. 1765–1775, Oct. 2011.
[14] The University of Adelaide, "The Internet Topology Zoo," http://www.topology-zoo.org/, last accessed 21 Oct. 2014.
Conclusion
• We added resiliency to SDN against multiple link failures by adding a self-healing module (CCMM) to SDN switches
• Design of CCMM:
  • (1) monitor link status, (2) exchange the topology map, and (3) establish a new control channel via an alternative path
• Implementation of CCMM:
  • A simple, practical application of OSPF combined with our Flow Entry Installer
• Experiments with CCMM:
  1. Validity: works on any type of control channel
  2. Performance: ≤ 300 ms in the small topology, 13 s in the large, real-world topology
• Future work:
  • Further analysis of the experiments (OSPF tuning, etc.)
  • Extension to split domains (where no path to the controller is available)
[Backup figure: conventional vs. SDN architecture, showing applications, the Northbound API, the controller, the Southbound API / control channel, and switches with their control and data planes]
Related work on fast recovery:
[6] S. Sharma, et al., "OpenFlow: Meeting Carrier-Grade Recovery Requirements," Computer Communications, 2012.
[7] N. L. M. van Adrichem, et al., "Fast Recovery in Software-Defined Networks," IEEE EWSDN, 2014.
[8] A. Sgambelluri, et al., "OpenFlow-based Segment Protection in Ethernet Networks," IEEE/OSA JOCN, 2013.
Situation with Open vSwitch
• Without OVS: the OSPF daemon sees each physical port as a separate interface
• With an Open vSwitch bridge: applications see only the bridge interface, so the OSPF daemon cannot handle each port separately
[Figure: physical ports hidden behind the Open vSwitch bridge from the OSPF daemon's point of view]
Situation with Open vSwitch (fixed)
• (1) Create the same number of internal ports as physical ports
• (2) Insert OSPF-passthrough flow entries between each physical/internal port pair (a sketch follows)
[Figure: OSPF daemon bound to per-port internal interfaces on the bridge]
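A minimal sketch of this fix, assuming the standard Open vSwitch CLI tools; the bridge name, port names, and OpenFlow port numbers are hypothetical. OSPF is IP protocol 89, so the passthrough entries can match on nw_proto=89.

```python
import subprocess

def run(*cmd):
    subprocess.run(cmd, check=True)

def mirror_port(bridge, phys_port, phys_ofport, int_ofport):
    """Hypothetical sketch: pair one internal port with one physical
    port so the OSPF daemon sees a separate interface per link."""
    # (1) Create an internal port mirroring the physical port.
    int_port = phys_port + '-int'
    run('ovs-vsctl', 'add-port', bridge, int_port,
        '--', 'set', 'interface', int_port, 'type=internal')
    # (2) OSPF-passthrough entries (OSPF = IP protocol 89) so that
    # hello/LSA packets flow between the paired ports in both ways.
    run('ovs-ofctl', 'add-flow', bridge,
        'priority=300,ip,nw_proto=89,in_port=%d,actions=output:%d'
        % (phys_ofport, int_ofport))
    run('ovs-ofctl', 'add-flow', bridge,
        'priority=300,ip,nw_proto=89,in_port=%d,actions=output:%d'
        % (int_ofport, phys_ofport))

if __name__ == '__main__':
    mirror_port('br0', 'eth1', phys_ofport=1, int_ofport=11)
```

With this in place, the unmodified Quagga OSPF daemon can run its adjacency per physical link even though the links terminate on an OVS bridge.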
[Backup figure: number of restored switches (0–35) and ratio of restored switches to total switches (0.0–1.0) vs. time from link down (0–13 s)]
[Backup figure: before vs. after links are disconnected. Legend: C = controller, S = switch. After the failures, the network splits into a reachable domain containing a controller and unreachable domains with no path to any controller.]
Implementation and standardization
• OpenFlow: the most widely adopted SDN standard
• SDN switch: Linux-based Open vSwitch
• CCMM functionality and its implementation:
  • (1) Disconnection detection through heartbeat exchange
  • (2) Topology exchange and alternative path calculation
  • (3) Establishing a new control channel by modifying flow entries
• (1) and (2): a link-state routing mechanism exhibits the same behaviour; we apply the Quagga OSPF daemon
• (3): we implement our Flow Installer in Python
Experimental Evaluation: Method
• Launch virtual containers with isolated network functions
• Bring interfaces up/down with external and internal ports linked
• Evaluation method:
  • Measure the restoration time after a control channel disconnection
  • Send Packet-in messages (traffic that traverses the control channel) at sufficiently short intervals
  • Define the restoration time as the time from the link disconnection until the controller resumes receiving Packet-in messages
  • Nping is used to trigger the Packet-in messages; the receiving controller is implemented with Ryu (a sketch follows)
2015-03-10, IEICE General Conference

[6] Mininet Team, "Mininet: An Instant Virtual Network on Your Laptop (or Other PC)," http://mininet.org/.
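A minimal sketch of what the Ryu receiving side described above could look like; the class name is ours and the paper's actual measurement code is not reproduced here. The gap between the last logged arrival before the link is cut and the first arrival afterwards gives the restoration time.

```python
import time

from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import MAIN_DISPATCHER, set_ev_cls

class PacketInLogger(app_manager.RyuApp):
    """Hypothetical sketch: log every Packet-in with a wall-clock
    timestamp so the restoration time can be read off the log."""

    @set_ev_cls(ofp_event.EventOFPPacketIn, MAIN_DISPATCHER)
    def _packet_in_handler(self, ev):
        # Each Nping probe that misses the flow table arrives here
        # as a Packet-in; a gap in timestamps brackets the outage.
        self.logger.info('packet-in dpid=%s t=%.6f',
                         ev.msg.datapath.id, time.time())
```

Run with ryu-manager (e.g. `ryu-manager packet_in_logger.py`) while Nping injects probe packets at short intervals on the data plane.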
CCMM functionality (backup)
• (1) Heartbeat exchange between neighboring CCMMs ("He's alive")
• (2) Topology map exchange
• (3) On disconnection, edit flow entries to establish the control channel via a new path
[Figure: the three CCMM functions; legend: C = controller, S = switch, M = CCMM]
2015-03-10, IEICE General Conference