
LegoSDN SOSR '16

Bala
March 14, 2016


Isolating and Tolerating SDN Application Failures with LegoSDN, SOSR 2016.

A redesign of the SDN controller architecture centering around a set of abstractions to eliminate the fate-sharing relationships between SDN applications & the controller, and between the SDN applications themselves.


Transcript

  1. Isolating and Tolerating SDN Application Failures with LegoSDN

    Balakrishnan Chandrasekaran, Brendan Tschaen, Theophilus Benson. Duke University. LegoSDN | SOSR ’16. Diagram labels: … … NetLog, Event Transformer, AppVisor.
  2. Enterprise SDN

    “The network should be a solid foundation that never goes wrong” (Colin Constable, Deutsche Bank Labs)
  3. Enterprise SDN

    “The network should be a solid foundation that never goes wrong” (Colin Constable, Deutsche Bank Labs). “We need as an enterprise to be able to control our network, predictably and efficiently to make it secure” (Bryan Larish, National Security Agency)
  4. Easing transition to SDN

    Rich choice of SDN controllers: Floodlight, OpenDaylight Platform, ONOS, … More importantly, a good market for SDN applications: HPE Network Optimizer, Big Tap, Packet Design SDN-TE, OpenDaylight OFM App. Why aren’t more organizations adopting yet?
  5. Bugs are endemic in software

    Fact: (SDN) applications are most likely to have bugs! Bug-free code? Let’s be serious… Crash of SDN applications? Meh!
  6. Cascading Failures

    Crash of one SDN application ➔ Crash of all SDN applications ➔ Crash of SDN controller ➔ Loss of network control
  7. Cascading Failures

    Crash of one SDN application ➔ Crash of all SDN applications ➔ Crash of SDN controller ➔ Loss of network control. Isolate SDN applications from one another and from the SDN controller.
  8. Network Inconsistency

    Diagram labels: App1, App2, Controller. A crashing application might also have changed the state of one or more network devices!
  9. Network Inconsistency

    Diagram labels: App1, App2, Controller. A crashing application might also have changed the state of one or more network devices! Undo changes to both control and data planes…
  10. Deterministic Bugs

    Deterministic bugs can cripple the SDN ecosystem! Reboot: not an option. Loss of SDN application(s) or controller state can be problematic! Paxos: solves orthogonal problems. Replay of crash-inducing inputs gets stuck in an infinite loop.
  11. Deterministic Bugs

    Deterministic bugs can cripple the SDN ecosystem! Reboot: not an option. Loss of SDN application(s) or controller state can be problematic! Paxos: solves orthogonal problems. Replay of crash-inducing inputs gets stuck in an infinite loop. ‘Transform’ the input … !?
  12. LegoSDN

    Diagram labels: … … NetLog, Event Transformer, AppVisor. Outline: 1. Isolation 2. Cross-Layer Transactions 3. Event Transformations 4. Prototype Implementation
  13. Isolation

    Architecture diagram labels: Operating System, Controller Process, Controller, Network, Event Transformer, Cross-Layer Transaction Manager, AppVisor, Sandbox (App1), Sandbox (App2), Sandbox (App3), RPC, OpenFlow Messages. Isolate SDN applications within sandboxes. No cascading of crashes!
  15. Isolation

    Architecture diagram labels: Operating System, Controller Process, Controller, Network, Event Transformer, Cross-Layer Transaction Manager, AppVisor, Sandbox (App1), Sandbox (App2), Sandbox (App3), RPC, OpenFlow Messages. Crash of App1 does not affect other healthy SDN applications. No more fate-sharing, either between SDN applications or between SDN applications and the SDN controller.
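A minimal sketch of the isolation idea on the slide above, assuming each application runs in its own OS process. A child Python process with stdin/stdout stands in for the sandbox and the RPC channel; `run_app_sandboxed`, `HEALTHY_APP`, and `BUGGY_APP` are hypothetical names for illustration, not part of LegoSDN.

```python
import json
import subprocess
import sys

# Hypothetical stand-ins for SDN applications: each reads one event (JSON)
# from stdin and writes one reply (JSON) to stdout.
HEALTHY_APP = (
    "import sys, json; ev = json.load(sys.stdin); "
    "print(json.dumps({'handled': ev['type']}))"
)
BUGGY_APP = (
    "import sys, json; ev = json.load(sys.stdin); "
    "raise RuntimeError('bug while handling ' + ev['type'])"
)


def run_app_sandboxed(app_source, event, timeout=10.0):
    """Deliver one event to an app running in its own process.

    Returns (ok, reply). A crash inside the app shows up as ok=False
    instead of as an exception in the caller (the "controller").
    """
    proc = subprocess.run(
        [sys.executable, "-c", app_source],
        input=json.dumps(event),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode != 0:
        return False, None
    return True, json.loads(proc.stdout)


if __name__ == "__main__":
    event = {"type": "PORT_DOWN", "switch": "S1", "port": "A"}
    print(run_app_sandboxed(BUGGY_APP, event))    # crash stays in the child
    print(run_app_sandboxed(HEALTHY_APP, event))  # other app is unaffected
```

Because the buggy application dies in its own process, the caller only observes a failed delivery and can keep serving the healthy application.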
  16. Cross-Layer Transactions

    Architecture diagram labels: Operating System, Controller Process, Controller, Network, Event Transformer, Cross-Layer Transaction Manager, AppVisor, Sandbox (App1), Sandbox (App2), Sandbox (App3), RPC, OpenFlow Messages. Transactions span both control and data planes. In case of failures, undo changes to both control and data planes.
  17. Cross-Layer Transactions

    Reverting control plane changes is easy: checkpoint the application, and restore to the last checkpoint on a crash.
  18. Cross-Layer Transactions

    Reverting control plane changes is easy: checkpoint the application, and restore to the last checkpoint on a crash. Reverting data plane changes is hard.
  19. Cross-Layer Transactions

    Reverting control plane changes is easy: checkpoint the application, and restore to the last checkpoint on a crash. Reverting data plane changes is hard: leverage the OpenFlow spec., since control messages are invertible! FlowMod (add) ➔ FlowMod (delete). Elegant change: retain output as such but change the ‘field’.
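The "control messages are invertible" point can be sketched as follows. This is an illustration under stated assumptions: `FlowMod` here is a simplified, hypothetical record rather than a real OpenFlow library type, and undoing a delete by re-adding a previously recorded rule is an assumption that does not appear on the slide.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class FlowMod:
    """Simplified, hypothetical model of an OpenFlow FlowMod message."""
    command: str                    # "add" or "delete"
    switch: str
    match: str                      # e.g. "ip_dst=10.0.0.2"
    out_port: Optional[int] = None


def invert(msg, previous=None):
    """Return the message that undoes `msg`.

    add    -> delete with the same fields
    delete -> add of the rule that was removed; that rule must have been
              recorded beforehand and passed in as `previous` (assumption)
    """
    if msg.command == "add":
        return replace(msg, command="delete")
    if msg.command == "delete":
        if previous is None:
            raise ValueError("cannot undo a delete without the removed rule")
        return replace(previous, command="add")
    raise ValueError(f"unknown command: {msg.command}")


if __name__ == "__main__":
    add = FlowMod("add", "S1", "ip_dst=10.0.0.2", 3)
    print(invert(add))
```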
  20. Cross-Layer Transactions

    [Figure: timeline of events in the context of an SDN-App, showing SDN-App snapshots, input events (In1, In2), and output messages (…, Out3, Out4) within a transaction.] Control plane changes: SDN application snapshots. Data plane changes: output messages.
  21. Cross-Layer Transactions

    [Figure: timeline of events in the context of an SDN-App, showing SDN-App snapshots, input events (In1, In2), and output messages (…, Out3, Out4) within a transaction.] Revert SDN application to last snapshot. Use CRIU to checkpoint SDN application state.
  23. Cross-Layer Transactions

    [Figure: timeline of events in the context of an SDN-App, showing SDN-App snapshots, input events (In1, In2), and output messages (…, Out3, Out4) within a transaction.] Network state changes as part of the current transaction. Revert network state changes in case the transaction fails. FlowMod (add) ➔ FlowMod (delete)
  24. Cross-Layer Transactions

    [Figure: timeline of events in the context of an SDN-App, showing SDN-App snapshots, input events (In1, In2), and output messages (…, Out3, Out4) within a transaction.] Input events processed since the last transaction. Until the transaction is committed, events are in the replay buffer. Replay e_1, e_2, …, e_{n-1}. Transform e_n.
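The transaction steps on the preceding slides (snapshot, undo output messages, replay buffered events) can be sketched roughly as below. Everything here is a hypothetical stand-in: `copy.deepcopy` plays the role of a CRIU checkpoint, a Python set models the rules installed in the data plane, and `CounterApp` and `Transaction` are illustrative names, not the LegoSDN implementation. Re-emitting output messages during replay is also an assumption of this sketch.

```python
import copy


class CounterApp:
    """Hypothetical app: installs one rule per event, crashes on PORT_DOWN."""

    def __init__(self):
        self.seen = 0

    def handle(self, event, send):
        if event == "PORT_DOWN":
            raise RuntimeError("deterministic bug")
        self.seen += 1
        send(f"rule-for-{event}")  # models a FlowMod (add)


class Transaction:
    def __init__(self, app, network):
        self.app = app
        self.network = network              # set of installed rules (data plane)
        self.snapshot = copy.deepcopy(app)  # stand-in for a CRIU checkpoint
        self.sent = []                      # rules added during this transaction
        self.replay_buffer = []             # events seen during this transaction

    def _send(self, rule):
        self.network.add(rule)
        self.sent.append(rule)

    def deliver(self, event):
        """Return None on success, or the offending event after a rollback."""
        self.replay_buffer.append(event)
        try:
            self.app.handle(event, self._send)
            return None
        except Exception:
            return self._rollback()

    def _rollback(self):
        # Data plane: undo each add with the matching delete, newest first.
        for rule in reversed(self.sent):
            self.network.discard(rule)
        self.sent.clear()
        # Control plane: restore the application from the snapshot.
        self.app = copy.deepcopy(self.snapshot)
        # Replay e_1 .. e_{n-1}; hand e_n back to the caller for transformation.
        *earlier, offending = self.replay_buffer
        self.replay_buffer = []
        for event in earlier:
            self.replay_buffer.append(event)
            self.app.handle(event, self._send)
        return offending


if __name__ == "__main__":
    net = set()
    tx = Transaction(CounterApp(), net)
    for ev in ["PACKET_IN_1", "PACKET_IN_2", "PORT_DOWN"]:
        print(ev, "->", tx.deliver(ev))
    print(tx.app.seen, sorted(net))
```

After the crash, the application and the modeled network are back in the state they had after the last successfully processed event, and the offending event is returned so that it can be transformed.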
  25. Event Transformations

    Architecture diagram labels: Operating System, Controller Process, Controller, Network, Event Transformer, Cross-Layer Transaction Manager, AppVisor, Sandbox (App1), Sandbox (App2), Sandbox (App3), RPC, OpenFlow Messages. Transform the crash-inducing message. Tolerate deterministic crashes!
  26. Event Transformations

    M: crash-inducing input. Transform M to … !? Leverage the OpenFlow spec. & domain knowledge.
  27. Event Transformations

    M: crash-inducing input. Transform M to … !? Leverage the OpenFlow spec. & domain knowledge. Transform M to T, where M ≣ T (both have the same intent). Port (A) Down ≣ {Port (A) Up, Port (A) Down}
  28. Event Transformations

    Transformations preserve the intent of the original event. Port (S1:A) Down ➔ Switch (S1) Down
  29. Event Transformations

    Transformations preserve the intent of the original event. Port (S1:A) Down ➔ Switch (S1) Down. Rules created by studying the OpenFlow specification and switch behavior.
  30. Event Transformations

    Transformations preserve the intent of the original event. Port (S1:A) Down ➔ Switch (S1) Down. Rules created by studying the OpenFlow specification and switch behavior. Rules also exploit the natural hierarchy amongst network elements. Network elements: Port, Switch. Hierarchy: Port ➔ Switch. Port (S1:A) Down ➔ Switch (S1) Down
  31. Event Transformations

    Toggle: Port (A) Down ➔ Port (A) Up, Port (A) Down
  32. Event Transformations

    Toggle: Port (A) Down ➔ Port (A) Up, Port (A) Down. Escalate: Port (A) Up ➔ Switch (S1) Up
  33. Event Transformations

    Toggle: Port (A) Down ➔ Port (A) Up, Port (A) Down. Escalate: Port (A) Up ➔ Switch (S1) Up. Reorder: change ordering of inputs across switches. No reordering of messages from the same switch.
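The three rules on this slide (toggle, escalate, reorder) could be expressed along these lines. This is a sketch, not the rule set from the prototype: the `Event` record and the function signatures are hypothetical, applying toggle to Up events is a generalization that the slide does not show, and `reorder` picks one simple reordering (swapping the first adjacent pair of events from different switches).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical record for a port or switch status event."""
    kind: str                   # "port" or "switch"
    switch: str                 # e.g. "S1"
    status: str                 # "up" or "down"
    port: Optional[str] = None  # e.g. "A"; None for switch events


def toggle(ev):
    """Port (A) Down -> Port (A) Up, Port (A) Down."""
    opposite = "up" if ev.status == "down" else "down"
    return [Event(ev.kind, ev.switch, opposite, ev.port), ev]


def escalate(ev):
    """Port (S1:A) Down -> Switch (S1) Down (one step up the hierarchy)."""
    return [Event("switch", ev.switch, ev.status)]


def reorder(events):
    """Swap the first adjacent pair of events that come from different
    switches. Events from the same switch keep their relative order."""
    out = list(events)
    for i in range(len(out) - 1):
        if out[i].switch != out[i + 1].switch:
            out[i], out[i + 1] = out[i + 1], out[i]
            break
    return out


if __name__ == "__main__":
    down = Event("port", "S1", "down", "A")
    print(toggle(down))
    print(escalate(down))
```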
  34. Event Transformations

    (1) Assume App1 crashed on input M1. M1 ➔ T1
  35. Event Transformations

    (1) Assume App1 crashed on input M1. M1 ➔ T1. (2) Can App1 process T1 without fail? Yes: done. No: M1 ➔ T2.
  36. Event Transformations

    (1) Assume App1 crashed on input M1. M1 ➔ T1. (2) Can App1 process T1 without fail? Yes: done. No: M1 ➔ T2. (3) Repeat (2) until App1 succeeds.
  37. Event Transformations

    (1) Assume App1 crashed on input M1. M1 ➔ T1. (2) Can App1 process T1 without fail? Yes: done. No: M1 ➔ T2. (3) Repeat (2) until App1 succeeds. Controller maintains a per-application network view, as required. App1: Port (S1:A) Down ➔ Switch (S1) Down. As far as App1 is concerned, S1 is down.
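Steps (1) to (3) amount to a retry loop over candidate transformations. The sketch below assumes events are plain tuples and that the application can be restored between attempts; `recover`, `handle`, `restore`, and the module-level `view` dictionary (standing in for App1's view of the network) are hypothetical names for illustration.

```python
def recover(handle, restore, event, transformations):
    """Try transformations of `event` in order until the app handles one.

    Returns the name of the transformation that worked, or None.
    `restore` is called before every attempt so that a failed attempt
    leaves no partial state behind.
    """
    for transform in transformations:
        restore()
        try:
            for ev in transform(event):
                handle(ev)
        except Exception:
            continue  # this transformation also crashes the app; try the next
        return transform.__name__
    return None


# Hypothetical App1: crashes on any Port-Down event, tracks switch status.
view = {}


def restore():
    view.clear()  # stand-in for restoring the last checkpoint


def handle(ev):
    if ev[0] == "port" and ev[-1] == "down":
        raise RuntimeError("cannot process Port-Down")
    if ev[0] == "switch":
        view[ev[1]] = ev[2]


def toggle(ev):
    kind, switch, port, _status = ev
    return [(kind, switch, port, "up"), ev]


def escalate(ev):
    return [("switch", ev[1], ev[3])]


if __name__ == "__main__":
    m1 = ("port", "S1", "A", "down")
    print(recover(handle, restore, m1, [toggle, escalate]), view)
```

In this example toggle still ends with a Port-Down event and fails, while escalate succeeds, so the application ends up believing that S1 is down.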
  38. LegoSDN Evaluations

    Diagram labels: … … NetLog, Event Transformer, AppVisor.
  39. Testbed

    Two Linux (Ubuntu 14.04 LTS) servers connected by a 1 Gbps link with 10 ms delay. Each machine had 12 cores and 16 GB of memory. Server1: SDN Controller and SDN Applications. Server2: Mininet, traffic generators.
  40. Testbed

    SDN applications: Hub, Learning Switch, Load Balancer, Route Manager, Stateful Firewall. Test applications were designed to run with Floodlight, and required no modifications to run on top of LegoSDN. Comparison with controller reboots and application reboots.
  41. Evaluations

    1. How fast is LegoSDN compared to controller and application reboots? 2. Can LegoSDN successfully recover from deterministic bugs? 3. Can LegoSDN recover stateful applications safely? 4. Can LegoSDN revert network changes on a crash to avoid policy violations?
  43. Recovery Time | Results

    [Figure: recovery time (in ms, log scale from 1 to 10000) per application (Hub, L.Switch, LoadBal., Rt.Flow) for Ctrlr. Reboot, App. Reboot, and LegoSDN.] LegoSDN is 3x faster than controller reboots.
  44. Evaluations

    1. How fast is LegoSDN compared to controller and application reboots? LegoSDN is 3x faster than controller reboots. 2. Can LegoSDN successfully recover from deterministic bugs? 3. Can LegoSDN recover stateful applications safely? 4. Can LegoSDN revert network changes on a crash to avoid policy violations?
  45. Deterministic Bugs | Setup

    [Figure: topology with switches S1 to S8; link labels: 1, 4 ms, 2 ms.] UDP flow, one packet every 10 ms. Primary: S2–S1–S7–S8. Secondary: S2–S3–S4–S5–S6–S8. Application: Route Manager. S1–S7 brought down at t = 5 s and brought back up at t = 25 s. Application cannot process Link-Down events.
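To make the experiment timeline concrete, here is a small sketch of which path the flow is expected to take over time once the failure is handled correctly. The path lists and failure times come from the slide; the selection logic (`choose_path`, `path_at`) is a hypothetical illustration and not the Route Manager's actual behavior.

```python
PRIMARY = ["S2", "S1", "S7", "S8"]
SECONDARY = ["S2", "S3", "S4", "S5", "S6", "S8"]


def links(path):
    """Undirected links along a path, e.g. {S2-S1, S1-S7, S7-S8}."""
    return {frozenset(pair) for pair in zip(path, path[1:])}


def choose_path(down_links):
    """Prefer the primary path; fall back to the secondary if a link is down."""
    down = {frozenset(link) for link in down_links}
    for path in (PRIMARY, SECONDARY):
        if not (links(path) & down):
            return path
    return None


def path_at(t):
    """Path in use at time t (seconds): S1-S7 is down for 5 <= t < 25."""
    down = [("S1", "S7")] if 5 <= t < 25 else []
    return choose_path(down)


if __name__ == "__main__":
    for t in (0, 10, 25):
        print(t, path_at(t))
```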
  46. Deterministic Bugs | Results

    [Figure: RTT (in ms, 0 to 6) over time (in s, 0 to 30) for Ctrlr. Reboot, App. Reboot, and LegoSDN; annotations: Crash, Primary, Secondary, Primary, lost packets.] LegoSDN recovers quickly (~250 ms). Event transformations help in recovering from deterministic faults (deterministic bug: inability of the application to process the Port-Down event).
  47. Evaluations

    1. How fast is LegoSDN compared to controller and application reboots? LegoSDN is 3x faster than controller reboots. 2. Can LegoSDN successfully recover from deterministic bugs? Event transformations help in recovering from deterministic bugs. 3. Can LegoSDN recover stateful applications safely? 4. Can LegoSDN revert network changes on a crash to avoid policy violations?
  48. Evaluations

    1. How fast is LegoSDN compared to controller and application reboots? LegoSDN is 3x faster than controller reboots. 2. Can LegoSDN successfully recover from deterministic bugs? Event transformations help in recovering from deterministic bugs. 3. Can LegoSDN recover stateful applications safely? LegoSDN uses checkpointing to avoid loss of application state. 4. Can LegoSDN revert network changes on a crash to avoid policy violations? LegoSDN reverts network changes quickly on a crash.
  49. Conclusion

    Treats failures as first-class citizens. Recovers applications from even deterministic crashes. Newer abstractions; transparent to SDN application developers.
  50. Conclusion

    Treats failures as first-class citizens. Recovers applications from even deterministic crashes. Newer abstractions; transparent to SDN application developers. … still, only the tip of the iceberg!
  51. Conclusion

    Treats failures as first-class citizens. Recovers applications from even deterministic crashes. Newer abstractions; transparent to SDN application developers. … still, only the tip of the iceberg! Peek into applications’ state? Learn from Beehive … Leverage symbolic execution to identify exact cause of crash … Handle Byzantine faults … Distributed scenarios introduce more complexity …
  52. LegoSDN

    http://legosdn.cs.duke.edu (coming soon). Building blocks for designing a robust SDN controller! Diagram labels: … … NetLog, Event Transformer, AppVisor.
  53. Related Works

    Troubleshooting & Verification: NICE, NSDI ’12; Minimal Causal Sequences, SIGCOMM ’14; Veriflow, NSDI ’13; SDNRacer, SOSR ’15; … Programming Abstractions: Transactional Networking, HotSDN ’13; OF.CPP, HotSDN ’13; … Controller Replication: Ravana, SOSR ’15; Onix, OSDI ’10; Beehive, HotNets ’14; … Fault-Tolerance: Microreboot, OSDI ’04; Failure Oblivious Computing, OSDI ’04; …