
Challenges in Distributed SDN

Virtual software-defined networking (SDN) is becoming one of the most interesting and appealing topics in our industry. This talk will cover the scalability challenges that cloud-scale, distributed virtual SDN solutions face. Duarte and Guillermo will go over the gory details of hardening distributed ARP tables and of replicating the NAT state of distributed routers, all the while ensuring packets are processed at ludicrous speed. The talk will cover the problem space, the tradeoffs involved, and how these issues are solved in MidoNet, an open source network virtualization system.

Duarte Nunes

October 06, 2015

Transcript

1. Agenda
   • Network virtualization overlays
     ◦ MidoNet
   • Distributed devices
     ◦ Managing the virtual topology
     ◦ Replicating device state
   • Distributed flow state
     ◦ Use cases
     ◦ State replication
     ◦ SNAT block reservation
   • A low-level view
   • Scaling

2. NVOs transform this...
   [diagram: two bare metal servers, each hosting many VMs, connected by an IP fabric]

3. ...into this...
   [diagram: the same servers and VMs, now joined by virtual firewalls and load balancers and connected to the Internet/WAN]

4. Network virtualization overlays
   • Decouple the logical configuration and address space from the physical ones
   • Isolation between tenants
     ◦ Tenant traffic is encapsulated at its local edge and carried by a tunnel over an IP network to another edge, where the packet is decapsulated and delivered to the target
     ◦ Tenant address spaces can overlap
   • Allow virtual machine mobility
   • Optimal forwarding
     ◦ A single physical hop

5. MidoNet
   • Fully distributed architecture
   • All traffic is processed at the edges, i.e., where it ingresses the physical network
     ◦ Virtual devices become distributed
     ◦ A packet can traverse a particular virtual device at any host in the cloud
     ◦ Distributed virtual bridges, routers, NATs, FWs, LBs
   • No SPOF
   • No middle boxes
   • Horizontally scalable L2 and L3 gateways

6. MidoNet Hosts
   • Gateway
     ◦ On-ramp and off-ramp into the cloud
   • Compute
     ◦ Where virtual machines or containers are hosted
   • NSDB
     ◦ A distributed database containing the virtual topology

7. MidoNet Agent
   • Flow-programmable switch at the bottom
     ◦ Open vSwitch kernel module
   • Flow simulator at the top
   • Agents perform just-in-time flow computation
     ◦ Each flow is the result of simulating a packet's trip through the virtual network topology

8. Datapath - Flow programmable switch
   • Port bindings
   • Flow match
     ◦ Input port
     ◦ Tunnel header
     ◦ Ethernet header
     ◦ IP header
     ◦ Transport header
     ◦ …
   • Flow actions (see the sketch below)
     ◦ Output
     ◦ Encapsulate
     ◦ Transform
     ◦ ...
   [diagram: OVS datapath between a VM and the NIC with masked match tables; the example entry matches tunnel src 119.15.120.139, tunnel dst 119.15.168.162, tunnel key 12, plus the inner Ethernet/IP headers (10.0.3.1 -> 10.0.3.2), and rewrites the Ethernet source and destination addresses]

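To make the match/actions split concrete, here is a minimal data-structure sketch of a datapath flow entry. All type and field names are illustrative assumptions, not the OVS or MidoNet API.

```java
// Sketch of a flow entry: a match over packet headers plus an ordered
// list of actions. Names are illustrative, not a real datapath API.
import java.util.List;

record FlowMatch(int inputPort, long tunnelKey, String ethSrc, String ethDst,
                 String ipSrc, String ipDst, int srcPort, int dstPort) {}

sealed interface FlowAction permits Output, Encapsulate, Transform {}
record Output(int datapathPort) implements FlowAction {}                 // emit the packet
record Encapsulate(String tunnelDst, long tunnelKey) implements FlowAction {} // e.g. VXLAN
record Transform(String field, String value) implements FlowAction {}   // header rewrite

record FlowEntry(FlowMatch match, List<FlowAction> actions) {}
```
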
9. Tunneling
   [diagram: a packet from a VM on Node 1 is queued to userspace, a simulation runs, a flow is created, and the packet is executed; it then crosses the IP fabric to Node 2 encapsulated as VXLAN over UDP+IP+Ethernet (header layout sketched below)]

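Since the tunneled packet is VXLAN over UDP/IP/Ethernet, a quick sketch of the 8-byte VXLAN header (per RFC 7348) shows what the encapsulation adds per packet. The helper class itself is hypothetical.

```java
// Sketch: the 8-byte VXLAN header from RFC 7348, a flags word with the
// I (valid VNI) bit set, followed by the 24-bit VNI and a reserved byte.
import java.nio.ByteBuffer;

final class Vxlan {
    static byte[] header(int vni) {
        ByteBuffer buf = ByteBuffer.allocate(8);
        buf.putInt(0x08000000); // flags: I bit set, everything else reserved
        buf.putInt(vni << 8);   // 24-bit VNI in the upper bytes, low byte reserved
        return buf.array();
    }
}
```
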
10. Managing topology changes
    • ZooKeeper serves the virtual network topology
      ◦ Reliable subscription to topology changes
    • Agents fetch and "watch" virtual devices as they need them to process the packets they see
    • Every traversed device applies a tag to a computed flow
    • Agents react to topology changes by invalidating the corresponding flows by tag
      ◦ For example, the ID of a device that was removed (see the sketch below)

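A minimal sketch of that watch-and-invalidate pattern, assuming the plain Apache ZooKeeper client, an illustrative znode layout, and a hypothetical FlowTable indexed by tag (MidoNet's actual classes differ):

```java
// Sketch: watch a device's znode; on change, invalidate its flows by tag.
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooKeeper;

class DeviceWatcher {
    private final ZooKeeper zk;
    private final FlowTable flows; // hypothetical: installed flows indexed by tag

    DeviceWatcher(ZooKeeper zk, FlowTable flows) {
        this.zk = zk;
        this.flows = flows;
    }

    /** Fetch a device's config and leave a watch; changes invalidate its flows. */
    void watch(String deviceId) throws Exception {
        String path = "/devices/" + deviceId; // illustrative path layout
        zk.getData(path, (WatchedEvent event) -> {
            // Every flow that traversed this device was tagged with its ID,
            // so one tag lookup finds everything that must be torn down.
            flows.invalidateByTag(deviceId);
            if (event.getType() == EventType.NodeDataChanged) {
                try { watch(deviceId); } catch (Exception ignored) {} // re-arm the watch
            }
        }, null);
    }
}

interface FlowTable { void invalidateByTag(String tag); }
```
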
11. Replicating device state
    • Remember: we process each packet locally at the physical host where it ingresses the virtual network, so different packets traverse the same virtual device at different hosts
    • This affects virtual network devices:
      ◦ A virtual bridge learns a MAC-port mapping at one host and needs to read it at other hosts
      ◦ A virtual router emits an ARP request out of one host and receives the reply on another host

12. Replicating device state
    • Store device state tables (ARP, MAC-learning, routes) in ZooKeeper
    • Interested agents subscribe to tables to get updates
    • The owner of an entry manages its lifecycle:
      ◦ Refresh ARP entries
      ◦ Reference-count which flows use a MAC-learning entry
    • Use ZK ephemeral nodes so entries go away if a host dies (sketched below)

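A sketch of publishing one ARP entry as an ephemeral znode, again with the plain ZooKeeper client and an assumed path layout:

```java
// Sketch: an ARP entry tied to the owning agent's ZooKeeper session.
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

class ArpTableWriter {
    private final ZooKeeper zk;

    ArpTableWriter(ZooKeeper zk) { this.zk = zk; }

    /** Publish ip -> mac for a router; subscribed peers see it via their watches. */
    void put(String routerId, String ip, String mac) throws Exception {
        String path = "/routers/" + routerId + "/arp_table/" + ip; // illustrative layout
        // EPHEMERAL: the entry lives only as long as this agent's session,
        // so if the host dies, ZooKeeper removes the entry for everyone.
        zk.create(path, mac.getBytes(), Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }
}
```
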
13. ARP Table
    [diagram: an ARP reply is handled locally at one host and written to ZK; a ZK notification propagates the entry to the other hosts across the IP fabric]

14. Virtual Firewall
    [diagram: VMs behind distributed FW instances facing the Internet/WAN; ingress rule CIDR 0.0.0.0/0, port 80, with egress rules and the forward and return flows traversing FW instances at different hosts]

15. Virtual NAT
    [diagram: SNAT between VMs and the Internet/WAN; the forward flow from 10.0.0.2:6456 is translated to 180.0.1.100:80 and the return flow reverses the binding]

16. Flow state
    • Connection tracking
      ◦ Key: 5-tuple + ingress device UUID
      ◦ Value: N/A
      ◦ Forward state not needed
      ◦ One flow state entry per flow
    • NAT
      ◦ Key: 5-tuple + UUID of the device under which NAT was performed
      ◦ Value: (IP, port) binding
      ◦ Possibly multiple flow state entries per flow (key/value types sketched below)

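The two kinds of state, as a data-structure sketch; the types are illustrative, not MidoNet's:

```java
// Sketch of the flow-state keys and values described above.
import java.util.UUID;

// The 5-tuple both kinds of state are keyed on.
record FiveTuple(String srcIp, int srcPort, String dstIp, int dstPort, int protocol) {}

// Connection tracking: the key's presence is the state ("Value: N/A").
record ConntrackKey(FiveTuple match, UUID ingressDeviceId) {}

// NAT: keyed by the device that translated the flow; the value is the
// (IP, port) binding, so a peer can reverse it for the return flow.
record NatKey(FiveTuple match, UUID natDeviceId) {}
record NatBinding(String ip, int port) {}
```
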
17. Interested set
    • Egress host
      ◦ Which will (probably) simulate the return flow
    • Hinting via port groups (ingress and egress)
      ◦ Depending on upstream routes, any of the static or dynamic uplinks of the Provider Router may receive north-to-south flows (an L3 gateway)
      ◦ Similarly, a tenant may have a redundant L3 VPN from their office network to their Tenant Router, which may ingress MidoNet at more than one node (on-ramp)
      ◦ Also a VLAN L2 Gateway, which allows an 802.1Q virtual bridge with two physical links; traffic from the physical workloads ingresses either of the bridge's uplink ports, depending on STP
    • VXLAN L2 Gateway
      ◦ To support bare metal servers behind a VTEP sending traffic to MidoNet, which ingresses via a Flooding Proxy

18. Sharing state - Distributed database
    [diagram: Node 1 and Node 2 with a fault-tolerant flow-state DB]
    1. New flow arrives
    2. Check for existing flow state
    3. Push new flow state
    4. Tunnel the packet
    5. Deliver the packet

19. Sharing state - Distributed database
    [diagram: Node 1 and Node 2 (or another node, if the return path is asymmetric) with the fault-tolerant flow-state DB]
    1. Return flow arrives
    2. Look up the flow state
    3. Tunnel the packet
    4. Deliver the packet

20. Sharing state - Distributed database
    [diagram: Node 2 and Node 3 with the fault-tolerant flow-state DB]
    1. Existing flow arrives at a different node
    2. Look up the flow state
    3. Tunnel the packet
    4. Deliver the packet

21. Sharing state - Distributed database
    • How many replicas receive the flow state?
    • How many replicas must acknowledge before finishing the simulation?
    • How many replicas do we read the flow state from in order to simulate a return flow or re-simulate a forward flow?
    • Adds significant latency to simulation (see the quorum sketch below)

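One way to frame those three questions is as quorum parameters; a sketch with assumed names, not anything MidoNet defines:

```java
// Quorum-style knobs behind the questions above: n replicas are written,
// w must ack before the simulation finishes, r are consulted on a read.
record ReplicationPolicy(int n, int w, int r) {
    ReplicationPolicy {
        if (w > n || r > n) throw new IllegalArgumentException("w, r must be <= n");
    }
    // With w + r > n a reader is guaranteed to overlap the acked writes,
    // but every awaited ack adds a round trip to the data path.
    boolean readOverlapsWrite() { return w + r > n; }
}
```
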
22. Sharing state - Peer-to-peer handoff
    [diagram: Node 1, with Node 3 (possible asym. ret. path) and Node 4 (possible asym. fwd. path)]
    1. New flow arrives
    2. Check or create local state
    3. Replicate the flow state to the interested set
    4. Tunnel the packet
    5. Deliver the packet

23. Sharing state - Peer-to-peer handoff
    [diagram: Node 1 and Node 2, with Node 3 (possible asym. ret. path) and Node 4 (possible asym. fwd. path)]
    1. Return flow arrives
    2. Look up local state
    3. Tunnel the packet
    4. Deliver the packet

24. Sharing state - Peer-to-peer handoff
    [diagram: the same nodes]
    1. Existing flow arrives at a different node
    2. Look up local state
    3. Tunnel the packet
    4. Deliver the packet

25. Sharing state - Peer-to-peer handoff
    • No added latency
    • Fire-and-forget or reliable?
    • How often to retry?
    • Delay tunneling the packets until the flow state has propagated, or accept the risk of the return flow being computed without the flow state? (see the sketch below)

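A minimal fire-and-forget sketch: push the serialized flow state to every host in the interested set over UDP before tunneling the packet. The port number and encoding are assumptions; MidoNet's actual wire protocol is not shown here.

```java
// Sketch: unreliable peer-to-peer flow-state push over UDP.
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.util.List;

class FlowStateHandoff {
    private static final int FLOW_STATE_PORT = 2925; // assumed port number

    private final DatagramSocket socket;

    FlowStateHandoff() throws Exception { this.socket = new DatagramSocket(); }

    /** No acks are awaited, so the simulation pays no extra latency. */
    void replicate(byte[] encodedState, List<InetAddress> interestedSet) throws Exception {
        for (InetAddress peer : interestedSet) {
            socket.send(new DatagramPacket(encodedState, encodedState.length,
                                           peer, FLOW_STATE_PORT));
        }
        // The slide's trade-off lives here: tunnel the packet immediately and
        // risk the return flow racing these messages, or delay until the
        // state has (probably) propagated.
    }
}
```
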
26. Lifecycle management
    • Local reference counting on ingress hosts
    • Flows have a hard timeout, which triggers expiration of the associated flow state
      ◦ Flow state is expired independently at each host
      ◦ Tries to minimize coordination between hosts
    • Flow state is refreshed by new, simulated packets
      ◦ At any ingress host
      ◦ Means that connections die in the absence of forward packets (expiry sketch below)

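A sketch of that per-host expiry: a hard deadline refreshed by new simulated packets and swept locally, with no cross-host coordination. The names and the timeout value are assumptions.

```java
// Sketch: each host keeps and expires its own copy of the flow state.
import java.util.concurrent.ConcurrentHashMap;

class FlowStateExpirer {
    private static final long HARD_TIMEOUT_MILLIS = 60_000; // assumed value

    // key -> deadline, tracked independently at this host.
    private final ConcurrentHashMap<String, Long> deadlines = new ConcurrentHashMap<>();

    /** Called whenever a simulation touches this state, at any ingress host. */
    void refresh(String key) {
        deadlines.put(key, System.currentTimeMillis() + HARD_TIMEOUT_MILLIS);
    }

    /** Periodic sweep; without forward packets the entry (and connection) dies. */
    void expire() {
        long now = System.currentTimeMillis();
        deadlines.entrySet().removeIf(e -> e.getValue() <= now);
    }
}
```
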
27. State transfer
    • Directly between hosts
      ◦ Requires hinting from the integration layer
    • Through a data store
      ◦ Can be considered part of the interested set
      ◦ Requires state to be replicated to it (asynchronously, fire-and-forget)
      ◦ Also solves the restart problem

28. SNAT block reservation
    [diagram: VM 10.0.0.2 sends to 216.58.210.164:80; SNAT translates 10.0.0.2:6456 to 180.0.1.100:9043 on the way to the Internet/WAN]

29. SNAT block reservation
    [diagram: as before, now showing the NAT target: (start_ip..end_ip, start_port..end_port), e.g. 180.0.1.100..180.0.1.100, 5000..65535]

30. SNAT block reservation
    [diagram: a second VM, 10.0.0.1, sends to the same destination; 10.0.0.1:7182 is translated to 180.0.1.100:9044]

31. SNAT block reservation
    [diagram: both VMs are translated from the same IP, 180.0.1.100; which port does 10.0.0.1:7182 get when the flows ingress at different hosts? (180.0.1.100:?)]

32. SNAT block reservation
    • Performed through ZooKeeper
    • /nat/{device_id}/{ip}/{block_idx}
    • 64 ports per block, 1024 total blocks
    • LRU-based allocation
    • Blocks are referenced by flow state (reservation sketch below)

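A sketch of reserving a block under that path scheme with the plain ZooKeeper client; the candidate ordering (e.g. LRU-derived) is left to the caller, and the code assumes the parent znodes already exist:

```java
// Sketch: claim a 64-port SNAT block by creating an ephemeral znode.
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException.NodeExistsException;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

class SnatBlockAllocator {
    static final int PORTS_PER_BLOCK = 64;   // from the slide
    static final int TOTAL_BLOCKS = 1024;    // 1024 * 64 covers the port space

    private final ZooKeeper zk;

    SnatBlockAllocator(ZooKeeper zk) { this.zk = zk; }

    /** Try blocks in the given (e.g. LRU-derived) order; first create wins. */
    int reserve(String deviceId, String ip, int[] candidateOrder) throws Exception {
        for (int blockIdx : candidateOrder) {
            String path = "/nat/" + deviceId + "/" + ip + "/" + blockIdx;
            try {
                // Ephemeral: the reservation dies with the owning agent.
                zk.create(path, new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                return blockIdx; // ports [blockIdx * 64, blockIdx * 64 + 63] are ours
            } catch (NodeExistsException e) {
                // Another host holds this block; try the next candidate.
            }
        }
        throw new IllegalStateException("no free SNAT block for " + ip);
    }
}
```

This matches the numbers on the later slides: port 9043 falls in block #141 (9024..9087) and port 10345 in block #161 (10304..10367).
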
33. SNAT block reservation
    [diagram: 10.0.0.2:6456 -> 180.0.1.100:9043, allocated from block #141; 10.0.0.1:7182 -> 180.0.1.100:10345, allocated from block #161]

34.-38. SNAT block reservation and port migration
    [diagram sequence: VM 10.0.0.2, with SNAT binding 180.0.1.100:9043, migrates between hosts; across the frames its ports come from block #141 until a newly reserved block (#9) takes over on the destination host]

39. SNAT block overloading
    • Overload the source port for the same source IP, based on:
      ◦ Destination IP
      ◦ Destination port
    • Resiliency against port scanning of a given destination IP (key sketch below)

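A sketch of the overloaded binding key, with illustrative types: because the destination IP and port are part of the key, the same public (IP, source port) pair can be reused for flows to different destinations, so scanning many ports of one destination IP consumes only one public source port instead of exhausting the reserved blocks.

```java
// Sketch: SNAT binding keyed by destination as well as the public endpoint,
// allowing the same public (IP, port) to serve many distinct destinations.
record OverloadedSnatKey(String publicIp, int publicPort,
                         String dstIp, int dstPort) {}
```
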
40. Inside the Agent
    [diagram: per-CPU simulation threads, each with its own flow table, flow state, and ARP broker, connected by backchannels; upcalls arrive from the kernel datapath, simulations consult the virtual topology in userspace, and output goes back down to the datapath]

41. Performance
    • Sharding
      ◦ Share-nothing model
      ◦ Each simulation thread is responsible for a subset of the installed flows
      ◦ Each simulation thread is responsible for a subset of the flow state
      ◦ Each thread ARPs individually
      ◦ Communication by message passing through "backchannels"
    • Run-to-completion model
      ◦ When a piece of the virtual topology is needed, simulations are parked
    • Lock-free algorithms where sharding is not possible

42. Scaling
    • Horizontally scale gateways
    • Consider:
      ◦ 6 GWs doing 100 kfps, 100 computes doing 25 kfps
      ◦ Retention time: 30 min
      ◦ Global flow state dataset size: 177 GB (arithmetic below)
    • Can't hold everything in memory
      ◦ Need partitioning
      ◦ Static, by partitioning announced routes
      ◦ Or dynamic
    • Have to send a lot of flow state messages
      ◦ Can use IP multicast

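The 177 GB figure follows from those numbers; a back-of-the-envelope check, assuming roughly 32 bytes of state per flow (the per-entry size is inferred, not stated on the slide):

```java
// Sketch: sizing the global flow-state dataset from the slide's inputs.
class FlowStateSizing {
    public static void main(String[] args) {
        long flowsPerSec = 6L * 100_000 + 100L * 25_000; // 3.1M flows/s in total
        long retentionSec = 30 * 60;                     // 30 min retention
        long entries = flowsPerSec * retentionSec;       // ~5.58e9 live entries
        long bytesPerEntry = 32;                         // assumed entry size
        double gb = entries * bytesPerEntry / 1e9;
        System.out.printf("%,d entries -> %.0f GB%n", entries, gb);
        // Prints roughly 179 GB, in line with the ~177 GB on the slide.
    }
}
```
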
43. Dynamic partitioning
    [diagram: gateways GW1-GW5 in front of the Internet/WAN and a compute node C1; the forward flow enters at one GW, a flow record is kept at the owning GW, and the return flow is delivered via that owner]
    • Consistent hashing (ring sketch below)
    • One extra inter-DC hop
      ◦ For the first packet
    • Special-case TCP
      ◦ Only for north -> south
    • eBPF in the kernel

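A minimal consistent-hash ring, the mechanism named on the slide, for deciding which gateway owns a flow's state; this is a generic sketch, not MidoNet's partitioning code:

```java
// Sketch: map a flow hash to an owning gateway on a consistent-hash ring.
import java.util.SortedMap;
import java.util.TreeMap;

class GatewayRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void add(String gateway, int virtualNodes) {
        // Virtual nodes smooth the distribution when gateways join or leave.
        for (int i = 0; i < virtualNodes; i++)
            ring.put((gateway + "#" + i).hashCode(), gateway);
    }

    /** Hash of the flow's 5-tuple -> first gateway clockwise on the ring. */
    String ownerOf(int flowHash) {
        SortedMap<Integer, String> tail = ring.tailMap(flowHash);
        int key = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        // If the owner is not the ingress gateway, the first packet pays
        // the one extra hop from the slide; subsequent packets do not.
        return ring.get(key);
    }
}
```

When a gateway is added or removed, only the keys adjacent to its ring positions move, which is what makes the partitioning dynamic without reshuffling all flow state.
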