Slide 1

Slide 1 text

Tolerating Application Failures with LegoSDN
Balakrishnan Chandrasekaran, Theophilus Benson
Duke University

Slide 2

Slide 2 text

Quality of Code
“In C, I never learned to use the debugger, so I used to never make mistakes …”
“I went millions and millions of hours with no problems, probably tens of millions of hours with no problems.”
Arthur Whitney, creator of A, K and Q. ACM Queue, Feb 2009.
October 28, 2014 HotNets 2014 | LegoSDN 2

Slide 3

Slide 3 text

Bugs are endemic in software!
§ Bugs can be deterministic or non-deterministic
§ [STS] Pox Premature PacketIn
– l2_multi routing module failed unexpectedly with a KeyError.

Slide 4

Slide 4 text

Cascading Crashes
[Diagram: Controller with App1, App2, …; messages in and out]

Slide 7

Slide 7 text

LegoSDN
§ Availability is of utmost importance
– Second only to security

Slide 8

Slide 8 text

Fate-sharing
§ Fate-sharing relationships between
– the SDN controller and the SDN application(s) (also between SDN applications)
– the SDN application and the network
§ Failure in any one SDN application brings down the other applications and the SDN controller.

Slide 9

Slide 9 text

Three-pronged approach
1. Contain crash
[Diagram: Controller with App1, App2, …; messages in and out]

Slide 10

Slide 10 text

Three-pronged approach
2. Undo changes
[Diagram: Controller with App1, App2, …; messages in and out]

Slide 11

Slide 11 text

Three-pronged approach
3. Handle message
[Diagram: Controller with App1, App2, …; messages in and out]

Slide 12

Slide 12 text

Controller architecture must support two new abstractions

Slide 13

Slide 13 text

Current architecture
[Diagram: Controller with App1 and App2]

Slide 14

Slide 14 text

Isolate SDN-Apps from the controller
[Diagram: Sandbox with App1; Sandbox with App2; Controller]
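
The idea of containing a crash inside one sandbox can be sketched as follows. This is a minimal illustration, not LegoSDN's implementation: `SandboxedApp`, `dispatch`, and `buggy_router` are hypothetical names, and a caught Python exception stands in for a crash of a separately sandboxed application.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SandboxedApp:
    """Hypothetical stand-in for an SDN-App running in its own sandbox."""
    name: str
    handler: Callable[[dict], None]
    alive: bool = True
    crash_log: List[str] = field(default_factory=list)

    def deliver(self, message: dict) -> bool:
        """Pass one message to the app; contain any failure to this app."""
        if not self.alive:
            return False
        try:
            self.handler(message)
            return True
        except Exception as exc:  # assumption: an exception models an app crash
            self.alive = False
            self.crash_log.append(f"{type(exc).__name__}: {exc}")
            return False


def dispatch(apps: List[SandboxedApp], message: dict) -> Dict[str, bool]:
    """Deliver a message to every app; one crash does not stop the others."""
    return {app.name: app.deliver(message) for app in apps}


def buggy_router(message: dict) -> None:
    # Deterministic bug: assumes every switch is already known.
    known_switches = {"s1": "up"}
    _ = known_switches[message["switch"]]  # KeyError for unknown switches


seen: List[dict] = []
apps = [SandboxedApp("router", buggy_router), SandboxedApp("monitor", seen.append)]
result = dispatch(apps, {"type": "PacketIn", "switch": "s2"})
```

Here the router fails on the unknown switch, while the monitor still receives the message. A real sandbox would need a stronger boundary than `try`/`except`, since an in-process handler can still corrupt shared state.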

Slide 17

Slide 17 text

Isolate SDN-Apps from the network
[Diagram: Sandbox with App1; Controller]

Slide 19

Slide 19 text

LegoSDN
AppVisor Stub: Lightweight wrapper
AppVisor Proxy: Message dispatcher
NetLog: Transactional support
SDN-App is treated as a black box. Stub and proxy allow SDN-Apps to talk to the controller.
[Diagram: Sandbox, App1, Controller, AppVisor Stub, AppVisor Proxy, NetLog]
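
The transactional support attributed to NetLog can be illustrated with an undo log. The sketch below is an assumption about how such a log might look, not NetLog's actual design: `FlowTable` and `Transaction` are hypothetical names, and a flow table is reduced to a dictionary from match string to action string.

```python
from typing import Dict, List, Optional, Tuple


class FlowTable:
    """Toy model of one switch's flow table, keyed by match (assumption)."""

    def __init__(self) -> None:
        self.rules: Dict[str, str] = {}


class Transaction:
    """Records the previous value of each changed rule so changes can be undone."""

    def __init__(self, table: FlowTable) -> None:
        self.table = table
        self._undo: List[Tuple[str, Optional[str]]] = []

    def install(self, match: str, action: str) -> None:
        # Remember the previous action (None if the rule is new).
        self._undo.append((match, self.table.rules.get(match)))
        self.table.rules[match] = action

    def rollback(self) -> None:
        # Undo in reverse order so the oldest recorded values are restored last.
        while self._undo:
            match, previous = self._undo.pop()
            if previous is None:
                self.table.rules.pop(match, None)
            else:
                self.table.rules[match] = previous

    def commit(self) -> None:
        # Changes are kept; the undo information is discarded.
        self._undo.clear()


table = FlowTable()
table.rules["dst=10.0.0.1"] = "output:1"

txn = Transaction(table)
txn.install("dst=10.0.0.1", "output:2")  # overwrites an existing rule
txn.install("dst=10.0.0.2", "output:3")  # adds a new rule
txn.rollback()                           # e.g. the app crashed mid-update
```

After the rollback, the table holds only the rule that existed before the transaction started.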

Slide 20

Slide 20 text

LegoSDN
Built on top of FloodLight
Ported three applications bundled with FloodLight to LegoSDN
[Diagram: Sandbox, App1, Controller, AppVisor Stub, AppVisor Proxy, NetLog]

Slide 21

Slide 21 text

Three-pronged approach
3. Handle message
[Diagram: Controller with App1, App2, …; messages in and out]

Slide 22

Slide 22 text

How do you handle the crash-inducing message?

Slide 23

Slide 23 text

1. Crash and burn
§ Halt the application
– SDN-App cannot continue processing
– Other SDN-Apps can continue unaffected
§ No Compromise
– Think of security-related SDN-Apps
Correctness: SDN-App’s ability to implement its functionality without change, according to the specification.

Slide 24

Slide 24 text

2. Induce amnesia
§ Ignore or drop the crash-inducing message
– SDN-App will not see the message again
§ Complete Compromise

Slide 25

Slide 25 text

3. Apply transformations
§ Transform the offending message into another one that the application can handle
– Application will continue with a modified input
§ Equivalence Compromise

Slide 26

Slide 26 text

Course of action?
[Diagram: Operator; No Compromise, Apply Transformation(s), Complete Compromise]
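
The three courses of action can be sketched as an operator-selected policy. Everything named here is hypothetical (`Policy`, `recover`, `AppHalted`, and the switch-down to link-down transformation are illustrative assumptions, not LegoSDN's API); the sketch only shows how the three choices differ in what the application sees after a crash.

```python
from enum import Enum
from typing import Callable, List, Optional


class Policy(Enum):
    NO_COMPROMISE = "crash and burn"
    COMPLETE_COMPROMISE = "induce amnesia"
    EQUIVALENCE_COMPROMISE = "apply transformations"


class AppHalted(Exception):
    """Raised when the chosen policy is to keep the crashed app halted."""


def recover(
    message: dict,
    policy: Policy,
    handler: Callable[[dict], None],
    transform: Optional[Callable[[dict], dict]] = None,
) -> Optional[dict]:
    """Decide what a restarted app sees after `message` crashed it."""
    if policy is Policy.NO_COMPROMISE:
        raise AppHalted("application stays halted")
    if policy is Policy.COMPLETE_COMPROMISE:
        return None  # drop the message; the app never sees it again
    if transform is None:
        raise ValueError("EQUIVALENCE_COMPROMISE needs a transform")
    replacement = transform(message)
    handler(replacement)  # the app continues with a modified input
    return replacement


handled: List[dict] = []


def handler(message: dict) -> None:
    # Hypothetical app that cannot process SwitchDown messages.
    if message["type"] == "SwitchDown":
        raise RuntimeError("unhandled message type")
    handled.append(message)


def switch_down_to_link_down(message: dict) -> dict:
    # Illustrative transformation only; whether the two messages are
    # equivalent for a given app is exactly the open question.
    return {"type": "LinkDown", "switch": message["switch"]}


offending = {"type": "SwitchDown", "switch": "s1"}
dropped = recover(offending, Policy.COMPLETE_COMPROMISE, handler)
replaced = recover(
    offending, Policy.EQUIVALENCE_COMPROMISE, handler, switch_down_to_link_down
)
```

With `NO_COMPROMISE`, `recover` raises `AppHalted` instead of returning, which matches the case where correctness matters more than availability.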

Slide 27

Slide 27 text

Related work
§ Fault tolerance
– via reboots
– applying Paxos for leader selection
§ Debugging SDN-Apps or the controller

Slide 28

Slide 28 text

Message equivalence
§ How do you determine whether two messages are equivalent?

Slide 29

Slide 29 text

Rollbacks are non-trivial
§ Rolling back one or more installed rules changes the controller’s view of the network state
– Might induce crashes of other SDN applications that rely on a consistent view of network state

Slide 30

Slide 30 text

Error propagation
§ The last message received by the SDN-App prior to the crash need not be the culprit!
– How far back in history should we go to find the root cause of the crash?
– Recovery from an earlier checkpoint: how many checkpoints should we maintain?
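
The checkpoint question above can be made concrete with a bounded history. This is a sketch under stated assumptions, not a mechanism described in the talk: `CheckpointHistory` is a hypothetical class, application state is assumed to be deep-copyable, and `max_checkpoints` is the tunable the slide asks about.

```python
import copy
from collections import deque
from typing import Any, Deque, List, Tuple


class CheckpointHistory:
    """Keeps the last N checkpoints plus the messages seen since each one."""

    def __init__(self, max_checkpoints: int) -> None:
        # Each entry is (number of messages seen so far, copy of app state).
        self._checkpoints: Deque[Tuple[int, Any]] = deque(maxlen=max_checkpoints)
        self._messages: List[dict] = []

    def record(self, message: dict) -> None:
        self._messages.append(message)

    def checkpoint(self, state: Any) -> None:
        self._checkpoints.append((len(self._messages), copy.deepcopy(state)))

    def restore(self, steps_back: int = 1) -> Tuple[Any, List[dict]]:
        """Return (state, messages to replay) for the Nth newest checkpoint."""
        if not 1 <= steps_back <= len(self._checkpoints):
            raise IndexError("no checkpoint that far back")
        index, state = self._checkpoints[-steps_back]
        return copy.deepcopy(state), self._messages[index:]


history = CheckpointHistory(max_checkpoints=2)
state = {"hosts": []}
for host in ["h1", "h2", "h3"]:
    history.checkpoint(state)  # checkpoint before handling each message
    history.record({"type": "PacketIn", "host": host})
    state["hosts"].append(host)

latest_state, latest_replay = history.restore(1)
older_state, older_replay = history.restore(2)
```

Going one checkpoint back leaves only the last message to replay; going two back widens the set of suspect messages to two. The oldest checkpoint has already been evicted, so a root cause older than that cannot be reached, which is the trade-off behind choosing how many checkpoints to keep.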

Slide 31

Slide 31 text

Road ahead
§ Rethink controller architecture
– LegoSDN is only the tip of the iceberg.
§ Resilient controllers can catalyze adoption
§ Failures need to be a first-class citizen

Slide 32

Slide 32 text

[https://twitter.com/tech_faq/status/450276248854355968]