Leveraging SDN Layering to Systematically Troubleshoot Networks

Colin Scott
November 02, 2015

Talk I gave jointly with Brandon Heller at HotSDN 2013.

Paper: http://www.eecs.berkeley.edu/~rcs/research/vision-paper.pdf


Transcript

  1. Leveraging SDN Layering to Systematically Troubleshoot Networks

     Brandon Heller, Colin Scott, Nick McKeown, Scott Shenker, Andreas Wundsam, Hongyi Zeng, Sam Whitlock, Vimalkumar Jeyakumar, Nikhil Handigol
  2. Admin: skills + tools + knowledge

     Network: protocols, configuration, topology
     Policy:
     •  connect hosts A + B
     •  quarantine virus-infected hosts
     •  route guest traffic to an HTTP proxy
     •  prioritize SSH
     1: Configure (Ethane, overlays, consistency primitives, network programming languages, …)
     2: Troubleshoot (This Talk)
     3: Fix Stuff!
  3. Admin: skills + tools + knowledge

     Network: protocols, configuration, topology
     Policy:
     •  connect hosts A + B
     •  quarantine virus-infected hosts
     •  route guest traffic to an HTTP proxy
     •  prioritize SSH
     1: Configure (Ethane, overlays, consistency primitives, network programming languages, …)
     2: Troubleshoot (This Talk)
     3: Fix Stuff!
     #1 request from network admins: Automatic Troubleshooting
     Source: “Automatic Test Packet Generation”, CoNEXT ’12, Zeng et al.
  4. How to automate troubleshooting?

     Network Policy:
     •  isolate groups A + B
     •  route guest traffic to an HTTP proxy
     •  block a list of virus-infected hosts
     Challenging in traditional networks. Two requirements.
     (1) Know the intended policy (impractical to infer):
     •  confusing: different config format for each protocol
     •  distributed: configuration spread among all nodes
     •  hard: must understand all protocols & their interactions
     (2) Check behavior against policy (difficult to check):
     •  confusing: don’t know lowest-level forwarding behavior
     •  distributed: hard to get a meaningful snapshot
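  5. Control-Plane Layering in SDN

     State Layers (top to bottom): Policy, Logical View, Physical View, Device State, Hardware
     Code Layers (top to bottom): Apps, Network Hypervisor, Network OS, Firmware
     Policy → Apps → Logical View → Network Hypervisor → Physical View → Network OS → Device State → Firmware → Hardware (HW)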
  6. Systematically Troubleshooting an SDN

     Diagram: code layers (Apps, Network Hypervisor, Network OS, Firmware) and state layers (Policy, Logical View, Physical View, Device State, Hardware).
     Observation: Each state layer fully specifies network behavior.
     Insight: Bugs manifest as mistranslations between layers.
     Systematic Approach:
     (1) Binary search to isolate to a code layer.
     (2) Leverage state to isolate within the code layer.
  7. Phase 1: Localizing to a code layer

     Symptom: Hosts unable to communicate
     [Operator Intent] Policy → Apps → Logical View → NetHyperV → Physical View → NetOS → Device State → Firmware → Hardware [Actual Behavior]
     Equivalence checks (~?) between state layers, each answered Yes or No
     Tools shown: SOFT [CoNEXT ’12], Anteater [SIGCOMM ’11]
     Cause: Firmware Bug
  8. Phase 1: Localizing to a code layer

     Symptom: Tenant Isolation Breach
     [Operator Intent] Policy → Apps → Logical View → NetHyperV → Physical View → NetOS → Device State → Firmware → Hardware [Actual Behavior]
     Equivalence checks (~?) between state layers, each answered Yes or No
     Tools shown: HSA [NSDI ’12], OFRewind [ATC ’11], Correspondence Checking
     Cause: NetHypervisor Bug
  9. How to automate troubleshooting?

     Network Policy:
     •  isolate groups A + B
     •  route guest traffic to an HTTP proxy
     •  block a list of virus-infected hosts
     Possible in Software-Defined Networks. Two requirements.
     (1) Know the intended policy (directly provided):
     •  confusing: different config format for each protocol
     •  distributed: configuration spread among all nodes
     •  hard: must understand all protocols & their interactions
     (2) Check behavior against policy (directly accessible):
     •  confusing: don’t know lowest-level forwarding behavior
     •  distributed: hard to get a meaningful snapshot
     Slide annotations: app; fewer nodes
  10. Takeaways

     •  Control plane layering enables systematic troubleshooting
     •  Thinking about troubleshooting in terms of layers shows us where tools fit in
        – Reveals missing tools
        – Highlights choices between tools, with tradeoffs
     •  Plenty of opportunities left. Operationalize!
  11. Leverage the layers in SDN.

     Brandon Heller, Colin Scott, Nick McKeown, Scott Shenker, Andreas Wundsam, Hongyi Zeng, Sam Whitlock, Vimalkumar Jeyakumar, Nikhil Handigol
  12. How is this different than general distributed systems debugging?

     •  Simple answer: it’s not! SDN is an excellent opportunity to draw upon ideas from other distributed systems
     •  Subtlety: networks are solving a much more constrained problem than general distributed systems
  13. Limitations

     •  Correctness only, not performance
     •  Side effects not reflected in state
     •  No guarantee of finding single code layer
     •  No guarantee of individual layer correctness
     •  No guarantee of future correctness
     •  Layer visibility may be imperfect
  14. Plenty of Opportunities Remain

     •  Automatic Troubleshooting → Actionable Bug Reports
        – Filtering the signal from the noise
        – Creating consistent views of state
     •  Improving Invariant Checkers
        – Scale
        – Flexible Policy Input
     •  Hybrid Traditional + SDN Debugging
  15. Plenty of Opportunities Remain

     •  Automatic Troubleshooting → Actionable Bug Reports
        – Filtering the signal from the noise
        – Creating consistent views of state
     Packet History: Path + Headers + Forwarding State
     Diagram: forwarding state recorded at each of four hops
     [HotSDN 2012: Where is the Debugger for My Software-Defined Network?]
  16. Plenty of Opportunities Remain

     •  Automatic Troubleshooting → Actionable Bug Reports
        – Filtering the signal from the noise
     Diagram: Controllers A, B, C and Switches 1 through 9; Minimal Causal Sequence
     [Berkeley Tech Report: How Did We Get Into This Mess? Isolating Fault-Inducing Inputs to SDN Control Software]
  17. Isn’t this unnecessary with consistency primitives/languages/etc?

     •  No
     •  Catch/rule out bugs outside the framework
     •  Catch instances where the framework pushes config that breaks the policy
  18. What’s novel about this work?

     •  Simple answer: nothing!
  19. Control-Plane Layering in SDN

     State Layers (top to bottom): Policy, Logical View, Physical View, Device State, Hardware
     Code Layers, with Example Errors:
     •  Apps: Configuration Parsing Error
     •  Network Hypervisor: Tenant isolation breach (policy mistranslation)
     •  Network OS: Failover logic error, synchronization bug
     •  Firmware: Register misconfiguration, Router memory corruption
     Diagram labels: [Unintended Config], [External Connectivity Error]
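
The binary search from slide 6 ("Binary search to isolate to a code layer") can be sketched roughly as follows. This is an illustrative toy, not the authors' tooling: the `localize` and `make_checker` names are hypothetical, and each state layer is modeled, as an assumption, by the set of host pairs it lets communicate. A real check between layers would rely on tools like those named on slides 7 and 8.

```python
from typing import Callable, FrozenSet, List, Tuple

# State layers from the talk, ordered from operator intent to actual behavior.
STATE_LAYERS = ["Policy", "Logical View", "Physical View", "Device State", "Hardware"]
# CODE_LAYERS[i] translates STATE_LAYERS[i] into STATE_LAYERS[i + 1].
CODE_LAYERS = ["Apps", "Network Hypervisor", "Network OS", "Firmware"]

# Toy model (an assumption): a layer's behavior is the set of (src, dst)
# host pairs that it allows to communicate.
Behavior = FrozenSet[Tuple[str, str]]


def make_checker(layers: List[Behavior]) -> Callable[[int], bool]:
    """Hypothetical equivalence check: does layer i still match the policy?"""
    return lambda i: layers[i] == layers[0]


def localize(matches_policy: Callable[[int], bool]) -> str:
    """Binary-search for the code layer whose output first diverges.

    Assumes the policy matches itself, the bottom layer does not (a symptom
    was observed), and there is a single faulty translation.
    """
    lo, hi = 0, len(STATE_LAYERS) - 1  # invariant: lo matches, hi does not
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if matches_policy(mid):
            lo = mid  # divergence is introduced below mid
        else:
            hi = mid  # divergence is introduced at or above mid
    return CODE_LAYERS[lo]


# Tenant isolation breach: the physical view wrongly lets A1 reach B1.
intended = frozenset({("A1", "A2")})
leaky = frozenset({("A1", "A2"), ("A1", "B1")})
breach = [intended, intended, leaky, leaky, leaky]
print(localize(make_checker(breach)))

# Hosts unable to communicate: only the hardware behavior diverges.
broken = [intended, intended, intended, intended, frozenset()]
print(localize(make_checker(broken)))
```

With five state layers the search needs at most two checks. As slide 13 cautions, there is no guarantee of finding a single code layer: if several layers mistranslate, this sketch reports only one of them.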