Slide 1

Challenges in Distributed SDN
Duarte Nunes, [email protected], @duarte_nunes
Guillermo Ontañón, [email protected], @gontanon

Slide 2

Agenda
● Network virtualization overlays
  ○ MidoNet
● Distributed devices
  ○ Managing the virtual topology
  ○ Replicating device state
● Distributed flow state
  ○ Use cases
  ○ State replication
  ○ SNAT block reservation
● A low-level view
● Scaling

Slide 3

NVOs transform this...
(Diagram: two bare metal servers, each hosting many VMs, attached to a plain IP fabric)

Slide 4

...into this...
(Diagram: the same servers and VMs, now interconnected by virtual devices: firewalls, load balancers, and an Internet/WAN uplink)

Slide 5

Network virtualization overlays
● Decouple the logical configuration and address space from the physical ones
● Isolation between tenants
  ○ Tenant traffic is encapsulated at its local edge and carried by a tunnel over an IP network to another edge, where the packet is decapsulated and delivered to the target
  ○ Tenant address spaces can overlap
● Allow virtual machine mobility
● Optimal forwarding
  ○ A single physical hop

Slide 6

MidoNet
● Fully distributed architecture
● All traffic is processed at the edges, i.e., where it ingresses the physical network
  ○ Virtual devices become distributed
  ○ A packet can traverse a particular virtual device at any host in the cloud
  ○ Distributed virtual bridges, routers, NATs, FWs, LBs
● No SPOF
● No middle boxes
● Horizontally scalable L2 and L3 gateways

Slide 7

(Image-only slide)

Slide 8

MidoNet Hosts
● Gateway
  ○ On-ramp and off-ramp into the cloud
● Compute
  ○ Where virtual machines or containers are hosted
● NSDB
  ○ A distributed database containing the virtual topology

Slide 9

MidoNet Agent
● Flow-programmable switch at the bottom
  ○ Open vSwitch kernel module
● Flow simulator at the top
● Agents perform just-in-time flow computation
  ○ Each flow is the result of simulating a packet's trip through the virtual network topology

Slide 10

Datapath: a flow-programmable switch
● Port bindings
● Flow match
  ○ Input port
  ○ Tunnel header
  ○ Ethernet header
  ○ IP header
  ○ Transport header
  ○ …
● Flow actions
  ○ Output
  ○ Encapsulate
  ○ Transform
  ○ ...
(Diagram: the OVS datapath connecting a VM and the NIC, with masked flow entries; an example entry matches Eth src f6:73:84:a4:26:54, Eth dst 26:eb:ac:ea:1e:d4, IP 10.0.3.1 -> 10.0.3.2 and applies actions setting tunnel src 119.15.120.139, tunnel dst 119.15.168.162, tunnel key 12, Eth src 9e:d9:b3:74:0f:c7, Eth dst 16:ed:98:8d:8b:1a)
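To make the match/actions split concrete, here is a sketch of such a flow entry as plain data; the dict layout and the "vxlan0" port name are illustrative stand-ins, not the actual OVS netlink structures:

```python
# Sketch of a datapath flow entry in the spirit of the slide's example.
flow = {
    "match": {
        "in_port": 1,
        "eth_src": "f6:73:84:a4:26:54",
        "eth_dst": "26:eb:ac:ea:1e:d4",
        "ip_src": "10.0.3.1",
        "ip_dst": "10.0.3.2",
    },
    "actions": [
        # Encapsulate: set the tunnel header, then output to the tunnel port.
        {"set_tunnel": {"src": "119.15.120.139",
                        "dst": "119.15.168.162",
                        "key": 12}},
        {"output": "vxlan0"},
    ],
}
```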

Slide 11

Just-in-time flow computation
(Diagram: the datapath queues an unmatched packet to userspace, the agent simulates it, then executes the packet and creates a flow)

Slide 12

Tunneling
(Diagram: on Node 1, the miss is simulated and a flow is created; the packet is encapsulated as Packet / VXLAN / UDP+IP+Ethernet and carried over the IP fabric to the datapath on Node 2)
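The VXLAN header itself is small and fixed; a minimal sketch of the encapsulation named on the slide, following the RFC 7348 layout (the helper function is hypothetical):

```python
import struct

def vxlan_encap(inner_frame: bytes, vni: int) -> bytes:
    # 8-byte VXLAN header: flags byte 0x08 (VNI valid), then a 24-bit VNI;
    # the remaining bits are reserved. The result rides in a UDP datagram
    # (IANA port 4789) over IP over Ethernet, as on the slide.
    header = struct.pack("!II", 0x08 << 24, vni << 8)
    return header + inner_frame
```

The slide's "tunnel key 12" is carried as the VNI.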

Slide 13

Virtual Devices

Slide 14

Managing topology changes
● ZooKeeper serves the virtual network topology
  ○ Reliable subscription to topology changes
● Agents fetch and “watch” virtual devices as they need them to process the packets they see
● Every traversed device applies a tag to the computed flow
● Agents react to topology changes by invalidating the corresponding flows by tag
  ○ For example, the ID of a device that was removed
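A sketch of the tag-and-invalidate scheme (class and field names are hypothetical):

```python
from collections import defaultdict

class FlowTable:
    def __init__(self):
        self.flows = {}                 # flow id -> computed flow
        self.by_tag = defaultdict(set)  # tag, e.g. a device UUID -> flow ids

    def add(self, flow_id, flow, tags):
        # Every device traversed during simulation contributes a tag.
        self.flows[flow_id] = flow
        for tag in tags:
            self.by_tag[tag].add(flow_id)

    def invalidate(self, tag):
        # Topology change: drop every flow that crossed the changed device.
        for flow_id in self.by_tag.pop(tag, set()):
            self.flows.pop(flow_id, None)
```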

Slide 15

Replicating device state
● Remember: we process each packet locally at the physical host where it ingresses the virtual network, so different packets traverse the same virtual device at different hosts
● This affects virtual network devices:
  ○ A virtual bridge learns a MAC-port mapping at one host and needs to read it at other hosts
  ○ A virtual router emits an ARP request out of one host and receives the reply on another host

Slide 16

Replicating device state
● Store device state tables (ARP, MAC learning, routes) in ZooKeeper
● Interested agents subscribe to tables to get updates
● The owner of an entry manages its lifecycle:
  ○ Refreshes ARP entries
  ○ Reference-counts which flows use a MAC-learning entry
● ZooKeeper ephemeral nodes make entries go away if a host dies
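A sketch of such a table using the kazoo ZooKeeper client; the path layout and serialization are made up for illustration:

```python
from kazoo.client import KazooClient

class ReplicatedArpTable:
    def __init__(self, zk: KazooClient, router_id: str):
        self.zk = zk
        self.base = f"/devices/{router_id}/arp"  # hypothetical path
        self.entries = {}                        # ip -> mac
        self.zk.ensure_path(self.base)
        # Subscribe: refresh the local view whenever the entry set changes.
        self.zk.ChildrenWatch(self.base, self._refresh)

    def _refresh(self, children):
        self.entries = {
            ip: self.zk.get(f"{self.base}/{ip}")[0].decode()
            for ip in children
        }

    def learn(self, ip: str, mac: str):
        # Ephemeral: the entry vanishes with this host's ZK session,
        # so a dead owner cannot leave stale state behind.
        self.zk.create(f"{self.base}/{ip}", mac.encode(), ephemeral=True)
```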

Slide 17

ARP Table
(Diagram: two hosts, each with VMs and a replica of the virtual router's ARP table, connected by the IP fabric)

Slide 18

ARP Table
(Diagram: as above; next step of the animation)

Slide 19

ARP Table
(Diagram: the ARP request is encapsulated and tunneled over the IP fabric to the host where the target VM lives)

Slide 20

ARP Table
(Diagram: the ARP reply is handled locally at the remote host and written to ZooKeeper; the first host learns the entry through a ZK notification)

Slide 21

ARP Table
(Diagram: with the ARP entry replicated, the original packet is encapsulated and tunneled straight to its destination)

Slide 22

Distributed Flow State

Slide 23

Virtual Firewall
(Diagram: VMs behind a distributed firewall facing the Internet/WAN; ingress rule: CIDR 0.0.0.0/0, port 80; the forward flow and the return flow traverse the firewall at different hosts)

Slide 24

Virtual NAT
(Diagram: SNAT between the private 10.0.0.2:6456 and the public 180.0.1.100:80; the forward flow is translated on its way out to the Internet/WAN and the return flow is reverse-translated on its way back)

Slide 25

Flow state
● Connection tracking
  ○ Key: 5-tuple + ingress device UUID
  ○ Value: N/A (presence alone marks the tracked connection)
  ○ Forward state not needed
  ○ One flow state entry per flow
● NAT
  ○ Key: 5-tuple + UUID of the device under which NAT was performed
  ○ Value: (IP, port) binding
  ○ Possibly multiple flow state entries per flow
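As plain data structures, the two kinds of state might look like this (field names are ours, not MidoNet's):

```python
from dataclasses import dataclass
from uuid import UUID

@dataclass(frozen=True)
class FiveTuple:
    src_ip: str
    dst_ip: str
    proto: int
    src_port: int
    dst_port: int

@dataclass(frozen=True)
class ConntrackKey:
    tuple5: FiveTuple
    ingress_device: UUID  # no value: presence of the key is the state

@dataclass(frozen=True)
class NatKey:
    tuple5: FiveTuple
    device: UUID          # device under which NAT was performed

@dataclass(frozen=True)
class NatBinding:         # the value stored for a NatKey
    ip: str
    port: int
```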

Slide 26

Asymmetric routing
(Diagram: VMs behind a load balancer, reachable from the Internet/WAN through three gateway NICs)

Slide 27

Asymmetric routing
(Diagram: the forward flow enters through one gateway NIC)

Slide 28

Asymmetric routing
(Diagram: the return flow leaves through a different gateway NIC)

Slide 29

Asymmetric routing
(Diagram: forward flow; next animation step)

Slide 30

Asymmetric routing
(Diagram: return flow; next animation step)

Slide 31

Asymmetric routing
(Diagram: forward flow; final animation step)

Slide 32

Interested set
● The egress host
  ○ Which will (probably) simulate the return flow
● Hinting via port groups (ingress and egress)
  ○ Depending on upstream routes, any of the static or dynamic uplinks of the Provider Router may receive North-to-South flows (an L3 gateway)
  ○ Similarly, a tenant may have a redundant L3 VPN from their office network to their Tenant Router, which may ingress MidoNet at more than one node (on-ramp)
  ○ Likewise a VLAN L2 Gateway, which gives an 802.1Q virtual bridge two physical links; traffic from the physical workloads ingresses at either of the bridge's uplink ports, depending on STP
● VXLAN L2 Gateway
  ○ To support bare metal servers behind a VTEP sending traffic to MidoNet, which ingresses at a Flooding Proxy

Slide 33

Sharing state: distributed database
(Diagram: Node 1, Node 2, and a fault-tolerant flow-state DB)
1. New flow arrives
2. Check for existing flow state
3. Push the new flow state
4. Tunnel the packet
5. Deliver the packet

Slide 34

Sharing state: distributed database
(Diagram: Node 2, or another node if the return path is asymmetric, plus Node 1 and the fault-tolerant flow-state DB)
1. Return flow arrives
2. Look up the flow state
3. Tunnel the packet
4. Deliver the packet

Slide 35

Sharing state: distributed database
(Diagram: Node 2, Node 3, and the fault-tolerant flow-state DB)
1. Existing flow arrives at a different node
2. Look up the flow state
3. Tunnel the packet
4. Deliver the packet

Slide 36

Sharing state: distributed database
● How many replicas receive the flow state?
● How many replicas must acknowledge before the simulation can finish?
● How many replicas must be read to simulate a return flow or re-simulate a forward flow?
● Adds significant latency to simulation
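These knobs are the classic replication quorum tradeoff; a tiny sketch of the consistency condition (nothing here is MidoNet-specific):

```python
def quorum_overlap(n: int, w: int, r: int) -> bool:
    # With n replicas, a write acked by w and a read of r replicas are
    # guaranteed to intersect iff r + w > n, so a return-flow simulation
    # is sure to observe the forward flow's state.
    return r + w > n

assert quorum_overlap(3, 2, 2)       # safe, but waiting for acks adds latency
assert not quorum_overlap(3, 1, 1)   # fast, but the read may miss the state
```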

Slide 37

Sharing state: peer-to-peer handoff
(Diagram: Node 1 and Node 2, with Node 3 as a possible asymmetric return path and Node 4 as a possible asymmetric forward path)
1. New flow arrives
2. Check or create local state
3. Replicate the flow state to the interested set
4. Tunnel the packet
5. Deliver the packet

Slide 38

Sharing state: peer-to-peer handoff
(Diagram: same nodes as above)
1. Return flow arrives
2. Look up local state
3. Tunnel the packet
4. Deliver the packet

Slide 39

Sharing state: peer-to-peer handoff
(Diagram: same nodes as above)
1. Existing flow arrives at a different node
2. Look up local state
3. Tunnel the packet
4. Deliver the packet

Slide 40

Sharing state: peer-to-peer handoff
● No added latency
● Fire-and-forget or reliable?
● How often to retry?
● Delay tunneling packets until the flow state has propagated, or accept the risk of the return flow being computed without it?
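A sketch of the fire-and-forget variant (the message format, port number, and helper names are hypothetical, not MidoNet's wire protocol):

```python
import pickle
import socket

STATE_PORT = 6677  # hypothetical

def replicate_flow_state(key, value, interested_set):
    """Push a flow-state entry to every host in the interested set.
    UDP, no acks: a lost message means the return flow may be simulated
    without the state, which is exactly the trade-off named above."""
    msg = pickle.dumps((key, value))
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for host in interested_set:
            sock.sendto(msg, (host, STATE_PORT))
```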

Slide 41

Lifecycle management
● Local reference counting on ingress hosts
● Flows have a hard timeout, which triggers expiration of the associated flow state
  ○ Flow state is expired independently at each host
  ○ Minimizes coordination between hosts
● Flow state is refreshed by new, simulated packets
  ○ At any ingress host
  ○ Means that connections die in the absence of forward packets
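A sketch of the per-host expiry logic (names and the timeout value are hypothetical):

```python
import time

HARD_TIMEOUT = 30 * 60  # seconds; assumed, matching the retention figure later

class FlowStateTable:
    def __init__(self):
        self.entries = {}  # key -> (value, deadline)

    def touch(self, key, value):
        """Called when a simulated packet creates or refreshes state,
        at whichever ingress host sees the packet."""
        self.entries[key] = (value, time.monotonic() + HARD_TIMEOUT)

    def expire(self):
        """Run periodically; each host expires independently,
        so no cross-host coordination is needed."""
        now = time.monotonic()
        for key in [k for k, (_, d) in self.entries.items() if d <= now]:
            del self.entries[key]
```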

Slide 42

Port migration
(Diagram: VMs on two hosts, with traffic to and from the Internet/WAN)

Slide 43

Port migration
(Diagram: next animation step)

Slide 44

Port migration
(Diagram: the VM's port has moved to a different host)

Slide 45

Port migration
(Diagram: next animation step)

Slide 46

Port migration
(Diagram: next animation step)

Slide 47

Port migration
(Diagram: final animation step)

Slide 48

State transfer
● Directly between hosts
  ○ Requires hinting from the integration layer
● Through a data store
  ○ Can be considered part of the interested set
  ○ Requires state to be replicated to it (asynchronously, fire-and-forget)
  ○ Also solves the restart problem

Slide 49

SNAT block reservation
(Diagram: a flow from 10.0.0.2:6456 to dst 216.58.210.164:80 is SNATed to 180.0.1.100:9043 on its way to the Internet/WAN)

Slide 50

SNAT block reservation
(Diagram: as above, plus the NAT target that constrains the translation)
NAT target: (start_ip..end_ip, start_port..end_port), e.g. 180.0.1.100..180.0.1.100, ports 5000..65535

Slide 51

SNAT block reservation
(Diagram: a second flow, from 10.0.0.1:7182 to the same destination, is SNATed to 180.0.1.100:9044 alongside 10.0.0.2:6456 -> 180.0.1.100:9043)

Slide 52

SNAT block reservation
(Diagram: the flow from 10.0.0.1:7182 ingresses at a different host, which must choose a translation 180.0.1.100:? without colliding with ports picked elsewhere)

Slide 53

SNAT block reservation
● Performed through ZooKeeper
● /nat/{device_id}/{ip}/{block_idx}
● 64 ports per block, 1024 blocks in total
● LRU-based allocation
● Blocks are referenced by flow state
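The block arithmetic, together with the ZooKeeper path from the slide (helper names are ours); note how it reproduces the block numbers on the surrounding slides:

```python
PORTS_PER_BLOCK = 64
NUM_BLOCKS = 1024  # 1024 * 64 = 65536, the full port space of one IP

def block_of(port: int) -> int:
    return port // PORTS_PER_BLOCK

def reservation_path(device_id: str, ip: str, block_idx: int) -> str:
    # The ZooKeeper node whose existence marks the reservation.
    return f"/nat/{device_id}/{ip}/{block_idx}"

assert block_of(9043) == 141    # the slides' block #141
assert block_of(10345) == 161   # and block #161
```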

Slide 54

SNAT block reservation
(Diagram: 10.0.0.2:6456 -> 180.0.1.100:9043 is served from block #141; 10.0.0.1:7182 -> 180.0.1.100:10345 from block #161, reserved by the other host)

Slide 55

SNAT block reservation and port migration
(Diagram: the VM at 10.0.0.2, whose flow uses 180.0.1.100:9043 out of block #141, is about to migrate; the destination host holds a different block, #X)

Slide 56

SNAT block reservation and port migration
(Diagram: next animation step)

Slide 57

SNAT block reservation and port migration
(Diagram: next animation step)

Slide 58

SNAT block reservation and port migration
(Diagram: next animation step)

Slide 59

SNAT block reservation and port migration
(Diagram: as above, now showing block #9 in place of block #141)

Slide 60

SNAT block overloading
● Overload the source port for the same source IP, based on
  ○ Destination IP
  ○ Destination port
● Resiliency against port scanning of a given destination IP
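A sketch of why overloading is safe: the NAT key includes the destination, so the same public (IP, port) pair can serve flows to different destinations (the helper name is hypothetical):

```python
def nat_key(public_ip, public_port, dst_ip, dst_port):
    # Return traffic is matched on all four fields, so two flows sharing
    # the same public IP and port remain unambiguous.
    return (public_ip, public_port, dst_ip, dst_port)

a = nat_key("180.0.1.100", 9043, "216.58.210.164", 80)
b = nat_key("180.0.1.100", 9043, "93.184.216.34", 443)
assert a != b  # distinct keys despite the shared source mapping
```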

Slide 61

Low-level

Slide 62

Inside the Agent
(Diagram: in userspace, per-CPU simulation threads, each with its own flow table, flow state, ARP broker, and backchannel; upcalls arrive from the kernel datapath, simulations consult the virtual topology, and output goes back down to the datapath)

Slide 63

Performance
● Sharding
  ○ Share-nothing model
  ○ Each simulation thread is responsible for a subset of the installed flows
  ○ Each simulation thread is responsible for a subset of the flow state
  ○ Each thread ARPs individually
  ○ Communication by message passing through “backchannels”
● Run-to-completion model
  ○ When a missing piece of the virtual topology is needed, simulations are parked
● Lock-free algorithms where sharding is not possible
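A sketch of the share-nothing split, with queues standing in for the real backchannels (all names hypothetical):

```python
import queue

NUM_WORKERS = 4
backchannels = [queue.Queue() for _ in range(NUM_WORKERS)]

def shard_of(flow_match) -> int:
    """Every packet of a flow hashes to the same simulation thread,
    which alone owns that flow's entry and its flow state."""
    return hash(flow_match) % NUM_WORKERS

def send_to_owner(flow_match, message):
    """Cross-shard work, e.g. an invalidation or a flow-state update:
    never touch another worker's tables, message it instead."""
    backchannels[shard_of(flow_match)].put(message)
```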

Slide 64

Scaling
● Horizontally scale gateways
● Consider:
  ○ 6 gateways doing 100 kfps, 100 computes doing 25 kfps
  ○ Retention time: 30 min
  ○ Global flow-state dataset size: 177 GB
● Can't hold everything in memory
  ○ Need partitioning
  ○ Statically, by partitioning announced routes
  ○ Or dynamically
● Have to send a lot of flow-state messages
  ○ Can use IP multicast
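Roughly how a figure like 177 GB falls out; the per-entry size is our assumption, back-derived from the slide's own numbers:

```python
# Back-of-the-envelope for the global flow-state dataset.
flows_per_sec = 6 * 100_000 + 100 * 25_000  # 3,100,000 new flows/s
retention_sec = 30 * 60                     # 30 minutes
entries = flows_per_sec * retention_sec     # ~5.58e9 live entries
bytes_per_entry = 32                        # assumed key + value + overhead
total = entries * bytes_per_entry           # ~1.79e11 bytes
print(f"{total / 1e9:.0f} GB")              # ~179 GB, close to the quoted 177 GB
```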

Slide 65

Dynamic partitioning
(Diagram: gateways GW1..GW5 and compute C1; the first packet of a forward flow takes an extra hop to the gateway that owns its flow record before delivery, and the return flow consults the same record)
● Consistent hashing
● One extra inter-DC hop
  ○ For the first packet
● Special-case TCP
  ○ Only for North -> South
● eBPF in the kernel
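A minimal consistent-hash ring for picking the gateway that owns a flow record (a sketch, not MidoNet's actual scheme):

```python
import bisect
import hashlib

def _h(key: str) -> int:
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, gateways, vnodes=64):
        # Virtual nodes smooth the distribution across gateways.
        self._ring = sorted((_h(f"{gw}-{i}"), gw)
                            for gw in gateways for i in range(vnodes))
        self._hashes = [h for h, _ in self._ring]

    def owner(self, flow_key: str) -> str:
        # First point on the ring at or after the flow's hash owns it.
        i = bisect.bisect(self._hashes, _h(flow_key)) % len(self._ring)
        return self._ring[i][1]

ring = Ring(["GW1", "GW2", "GW3", "GW4", "GW5"])
print(ring.owner("10.0.0.2:6456->216.58.210.164:80"))
# Adding or removing a gateway only remaps the hash slices it owned.
```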

Slide 66

Questions?