Slide 1

MidoNet and the Open vSwitch Datapath
Duarte Nunes
duarte@midokura.com
@duarte_nunes

Slide 2

Agenda
● MidoNet
  ○ Architecture
  ○ Agent
● Distributed state
  ○ Device state
  ○ Flow state
● Relationship with the datapath
  ○ Netlink library
  ○ Performance
  ○ Flow bookkeeping

Slide 3

MidoNet transforms this...

[Diagram: two bare-metal servers, each hosting many VMs, connected by an IP fabric]

Slide 4

...into this...

[Diagram: the same servers and VMs, now interconnected through distributed virtual FW and LB devices, with an uplink to the Internet/WAN]

Slide 5

Packet processing

[Diagram: the same virtual topology, highlighting packet processing through the FW and LB devices toward the Internet/WAN]

Slide 6

Physical view

[Diagram: bare-metal servers hosting VMs, three NSDB nodes (midonet nsdb 1-3), and three gateways (midonet gateway 1-3), all connected over the IP fabric, with the gateways uplinked to the Internet/WAN]

Slide 7

MidoNet
● Fully distributed architecture
● All traffic processed at the edges, i.e., where it ingresses the physical network
  ○ virtual devices become distributed
  ○ a packet can traverse a particular virtual device at any host in the cloud
  ○ distributed virtual bridges, routers, NATs, FWs, LBs, etc.
● No SPOF
● No middle boxes
● Horizontally scalable L2 and L3 gateways

Slide 8

MidoNet Hosts

[Diagram: Gateway 1 runs the MidoNet Agent (Java daemon), Quagga's bgpd, and the OVS kmod with a VXLAN tunnel port; eth0/eth1 face the Internet/WAN, and a veth pair (port3/veth0 to veth1) connects the datapath to bgpd. Compute 1 runs VMs attached through tap devices (e.g., port5, tap12345), plus its own MidoNet Agent and OVS kmod with a VXLAN tunnel port. The hosts are connected by the IP fabric.]

Slide 9

Flow computation and tunneling
● Flows are computed at the ingress host
  ○ by simulating a packet's path through the virtual topology
  ○ without fetching any information off-box (~99% of the time)
● Just-in-time flow computation
● If the egress port is on a different host, the packet is tunneled
  ○ the tunnel key encodes the egress port
  ○ no computation is needed at the egress
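As a rough illustration of the tunnel-key idea, here is a minimal Java sketch; TunnelKeyRegistry and its key layout are hypothetical, not MidoNet's actual code:

```java
// Hypothetical sketch: each virtual port is assigned a locally unique
// 24-bit key (the width of a VXLAN VNI), so the egress host can map
// tunnel key -> port with a single lookup and no simulation.
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

final class TunnelKeyRegistry {
    private final ConcurrentHashMap<UUID, Integer> keyByPort = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<Integer, UUID> portByKey = new ConcurrentHashMap<>();
    private int nextKey = 1;

    // Ingress side: find (or assign) the tunnel key for an egress port.
    synchronized int keyFor(UUID egressPort) {
        return keyByPort.computeIfAbsent(egressPort, port -> {
            int key = nextKey++ & 0xFFFFFF;  // a VXLAN VNI is 24 bits
            portByKey.put(key, port);
            return key;
        });
    }

    // Egress side: O(1) lookup, no computation required.
    UUID portFor(int tunnelKey) {
        return portByKey.get(tunnelKey);
    }
}
```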

Slide 10

Inside the Agent

[Diagram: packets queue from the kernel datapath to userspace via the upcall channel; they are distributed across per-CPU pipelines, each with its own flow table, flow state, ARP broker, and backchannel; simulations run against the shared virtual topology, and the output goes back to the datapath as packet executions and flow create and delete operations]

Slide 11

Device state
● ZooKeeper serves the virtual network topology
  ○ reliable subscription to topology changes
● Agents fetch, cache, and "watch" virtual devices on demand to process packets
● Packets naturally traverse the same virtual device at different hosts
● This affects device state:
  ○ a virtual bridge learns a MAC-port mapping at one host and needs to read it at other hosts
  ○ a virtual router emits an ARP request out of one host and receives the reply on another host
● Store device state tables (ARP, MAC learning, routes) in ZooKeeper (see the sketch below)
  ○ interested agents subscribe to tables to get updates
  ○ the owner of an entry manages its lifecycle
  ○ ZK ephemeral nodes ensure entries go away if a host fails
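A minimal sketch of the ephemeral-node pattern using the plain Apache ZooKeeper client; the /bridges/.../mac_table path layout is illustrative, not MidoNet's actual schema:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

final class MacLearningTable {
    private final ZooKeeper zk;
    private final String bridgeId;

    MacLearningTable(ZooKeeper zk, String bridgeId) {
        this.zk = zk;
        this.bridgeId = bridgeId;
    }

    // Owner side: publish a learned MAC-port mapping. EPHEMERAL ties the
    // entry's lifecycle to this agent's session, so it disappears
    // automatically if the host fails.
    void learn(String mac, String portId)
            throws KeeperException, InterruptedException {
        zk.create("/bridges/" + bridgeId + "/mac_table/" + mac,
                  portId.getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE,
                  CreateMode.EPHEMERAL);
    }

    // Subscriber side: read the entry and leave a watch to hear of changes.
    byte[] lookup(String mac, Watcher watcher)
            throws KeeperException, InterruptedException {
        return zk.getData("/bridges/" + bridgeId + "/mac_table/" + mac,
                          watcher, null);
    }
}
```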

Slide 12

ARP Table

[Diagram: two hosts, each with a local ARP table and VMs, connected by the IP fabric]

Slide 13

ARP Table

[Diagram: same setup as the previous slide (animation step)]

Slide 14

ARP Table

[Diagram: an encapsulated ARP request is tunneled across the IP fabric to a VM on the other host]

Slide 15

ARP Table

[Diagram: the ARP reply is handled locally at the receiving host and written to ZK; a ZK notification propagates the entry to the other host's ARP table]

Slide 16

ARP Table

[Diagram: with the ARP entry resolved, the encapsulated packet is tunneled across the IP fabric to the destination VM]

Slide 17

Flow state
● Per-flow L4 state, e.g., connection tracking or NAT
● Forward and return flows are typically handled by different hosts
  ○ so those hosts need to share state
● Tricky to leverage megaflows
  ○ the agent needs to generate this state and replicate it
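For illustration, a direction-independent key is one way to make the forward and return packets of a connection hash to the same tracking entry; this sketch is hypothetical and ignores NAT, where the return tuple differs:

```java
import java.util.Objects;

final class ConnTrackKey {
    final long ipA, ipB;      // normalized endpoint addresses
    final int portA, portB;   // normalized L4 ports
    final byte proto;

    // Order the endpoints so (src, dst) and (dst, src) build the same key,
    // letting forward and return packets find the same tracking entry.
    ConnTrackKey(long srcIp, int srcPort, long dstIp, int dstPort, byte proto) {
        boolean flip = srcIp > dstIp || (srcIp == dstIp && srcPort > dstPort);
        this.ipA = flip ? dstIp : srcIp;
        this.ipB = flip ? srcIp : dstIp;
        this.portA = flip ? dstPort : srcPort;
        this.portB = flip ? srcPort : dstPort;
        this.proto = proto;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof ConnTrackKey)) return false;
        ConnTrackKey k = (ConnTrackKey) o;
        return ipA == k.ipA && ipB == k.ipB
            && portA == k.portA && portB == k.portB && proto == k.proto;
    }

    @Override public int hashCode() {
        return Objects.hash(ipA, ipB, portA, portB, proto);
    }
}
```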

Slide 18

Sharing state - Peer-to-peer handoff

1. A new flow arrives at Node 1
2. Check or create local state
3. Replicate the flow state to the interested set: Node 3 (possible asym. ret. path) and Node 4 (possible asym. fwd. path)
4. Tunnel the packet to Node 2
5. Deliver the packet
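A minimal sketch of the ingress side of this handoff, with hypothetical types (StateChannel, FlowStateHandoff); the point is that replication to the interested set happens before the packet is tunneled:

```java
import java.util.Set;
import java.util.UUID;

final class FlowStateHandoff {
    // Hypothetical transport for pushing encoded state to peer agents.
    interface StateChannel {
        void push(UUID peerHost, byte[] encodedState);
    }

    private final StateChannel channel;

    FlowStateHandoff(StateChannel channel) { this.channel = channel; }

    // Step 3 on the slide: replicate the new flow's state to every host
    // that might see the return flow or an asymmetric path, so the state
    // is already in place before the packet itself is tunneled (step 4).
    void replicate(byte[] encodedFlowState, Set<UUID> interestedSet) {
        for (UUID peer : interestedSet)
            channel.push(peer, encodedFlowState);
    }
}
```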

Slide 19

Sharing state - Peer-to-peer handoff

1. The return flow arrives at Node 2
2. Look up local state
3. Tunnel the packet to Node 1
4. Deliver the packet

(Node 3: possible asym. ret. path; Node 4: possible asym. fwd. path)

Slide 20

Sharing state - Peer-to-peer handoff

1. An existing flow arrives at a different node (a possible asym. ret. path such as Node 3, or fwd. path such as Node 4)
2. Look up local state
3. Tunnel the packet
4. Deliver the packet

Slide 21

Netlink requests
● JVM netlink library implementing rtnetlink and ODP (the Open vSwitch datapath protocol)
● Replies and notifications are modeled as asynchronous, observable streams
● A simulation entails packet execution plus flow create and delete operations
● Flow create
  ○ optimistic: not ack'ed or echoed
  ○ errors are ignored
  ○ may result in duplicates
● Flow delete
  ○ echoed to get stats
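A sketch of this request pattern, with a minimal Observer standing in for the actual observable-stream types and a hypothetical NetlinkChannel wrapper around the socket:

```java
interface Observer<T> {
    void onNext(T reply);
    void onError(Throwable t);
    void onCompleted();
}

interface NetlinkChannel {
    // Writes a netlink message with the given header flags; replies, if
    // any, are delivered to the observer.
    void write(byte[] msg, int flags, Observer<byte[]> replies);
}

final class NetlinkOps {
    static final int NLM_F_ECHO = 0x08;  // from <linux/netlink.h>

    private final NetlinkChannel channel;

    NetlinkOps(NetlinkChannel channel) { this.channel = channel; }

    // Optimistic flow create: neither NLM_F_ACK nor NLM_F_ECHO is set, so
    // no reply is expected and errors (e.g., duplicates) are ignored.
    void createFlow(byte[] flowMsg) {
        channel.write(flowMsg, 0, null);
    }

    // Flow delete: request an echo of the removed flow so the agent can
    // collect its final packet/byte stats.
    void deleteFlow(byte[] flowMsg, Observer<byte[]> stats) {
        channel.write(flowMsg, NLM_F_ECHO, stats);
    }
}
```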

Slide 22

NetlinkRequestBroker

[Diagram: a Writer serializes requests into a pre-allocated buffer split into fixed-size chunks; a Reader dispatches replies from the NL socket to an array of Observers indexed by sequence number; a Publisher handles notifications]
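A structural sketch of such a broker, reusing the minimal Observer interface from the previous sketch; chunk and table sizes are illustrative:

```java
import java.nio.ByteBuffer;

final class NetlinkRequestBroker {
    private static final int CHUNK_SIZE = 1024;   // one request per chunk
    private static final int MAX_PENDING = 512;   // must be a power of two

    // Pre-allocated buffer split into fixed-size chunks; no allocation on
    // the request path.
    private final ByteBuffer buffer =
        ByteBuffer.allocateDirect(CHUNK_SIZE * MAX_PENDING);

    // Observers indexed by netlink sequence number (modulo table size).
    @SuppressWarnings("unchecked")
    private final Observer<ByteBuffer>[] observers = new Observer[MAX_PENDING];

    private int nextSeq = 0;  // advanced only by the single writer thread

    // Writer side: claim a sequence number and its chunk of the buffer,
    // registering the observer that will receive the reply.
    ByteBuffer reserve(Observer<ByteBuffer> observer) {
        int seq = nextSeq++ & (MAX_PENDING - 1);
        observers[seq] = observer;
        ByteBuffer chunk = buffer.duplicate();
        chunk.position(seq * CHUNK_SIZE);
        chunk.limit((seq + 1) * CHUNK_SIZE);
        return chunk.slice();  // the caller serializes the request here
    }

    // Reader side: route a reply from the socket to the owning observer.
    void dispatch(int seq, ByteBuffer reply, boolean last) {
        int idx = seq & (MAX_PENDING - 1);
        Observer<ByteBuffer> obs = observers[idx];
        if (obs == null) return;  // e.g., an optimistic, reply-less request
        obs.onNext(reply);
        if (last) {
            obs.onCompleted();
            observers[idx] = null;
        }
    }
}
```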

Slide 23

Performance
● Packet execution: 2.747 ± 0.241 us/op
● Flow creation: 5.476 ± 0.356 us/op
● Concurrent flow creation (2 threads): 24.960 ± 2.138 us/op (ouch)
● Flow creation + deletion: 11.873 ± 1.321 us/op (88k ops/s)
● Flow creation + deletion through the broker: 12.380 ± 1.449 us/op

Benchmark machine: Intel(R) Xeon(R) @ 2.40GHz, 2 sockets, 4 cores per socket, 2 threads per core (16 CPUs), 2 NUMA nodes, 128K L1 / 1MB L2 / 12MB L3 cache, 24GB system memory

Slide 24

Flow bookkeeping
● All flows have a hard time expiration
  ○ also important for the distributed flow state mechanism
● No idle expiration
  ○ flow gets (polling the kernel for flow stats) would be too costly
● Invalidations (see the sketch below)
  ○ all flows are indexed by the set of tags applied during their simulation
  ○ e.g., the ID of each traversed device is a tag
  ○ this allows flows to be removed upon virtual topology changes
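A minimal sketch of tag-indexed invalidation; TaggedFlowTable is hypothetical, not MidoNet's actual class:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Flows indexed by the tags recorded during their simulation, so a
// topology change can invalidate exactly the affected set.
final class TaggedFlowTable<FLOW> {
    private final Map<Object, Set<FLOW>> flowsByTag = new HashMap<>();

    // Called after a flow is installed, with the tags its simulation
    // touched (e.g., the ID of every traversed virtual device).
    void register(FLOW flow, Set<Object> tags) {
        for (Object tag : tags)
            flowsByTag.computeIfAbsent(tag, t -> new HashSet<>()).add(flow);
    }

    // Called on a topology change, e.g., invalidate(deviceId); the
    // returned flows are then removed from the datapath.
    Set<FLOW> invalidate(Object tag) {
        Set<FLOW> flows = flowsByTag.remove(tag);
        return flows != null ? flows : new HashSet<>();
    }
}
```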

Slide 25

Some tricks
● Megaflow bypass by setting a bit in the tunnel key
  ○ forces the packet into userspace for flow tracing
● Double encapsulation for overlay tunnels

[Diagram: OVS kmod with two VXLAN tunnel ports; the packet recirculates through a veth pair (port2, vethRecirc) and IP forwarding to pick up the second encapsulation]
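A sketch of the trace-bit idea; the bit position is an assumption (the top bit of a 24-bit key), not MidoNet's actual layout:

```java
// Hypothetical sketch: reserving one bit of the tunnel key as a trace flag.
final class TunnelKeys {
    // Assumption: the low 24 bits fit a VXLAN VNI and the top bit of that
    // range is free to use as a flag.
    static final int TRACE_BIT = 1 << 23;

    // Set by the ingress host on a traced flow's packets; the modified key
    // no longer matches the installed megaflow, forcing an upcall so the
    // packet is handled (and traced) in userspace.
    static int withTraceBit(int tunnelKey) { return tunnelKey | TRACE_BIT; }

    static boolean isTraced(int tunnelKey) { return (tunnelKey & TRACE_BIT) != 0; }

    // Strip the flag to recover the egress port's key.
    static int egressPortKey(int tunnelKey) { return tunnelKey & ~TRACE_BIT; }
}
```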

Slide 26

Conntrack?
● Synchronize conntrack state?
  ○ How? How often?
  ○ Will the state be available to the egress host when simulating the return flow?
● Confine flow state to the compute host?
  ○ the same host must process forward and return flows
  ○ this means doing a simulation at the gateway and re-doing it at the compute
  ○ more load on the computes
  ○ SPoF

Slide 27

Questions?

Slide 28

Thank you!