
ONOS Distributed Core and Performance - AT&T Talk (May 7th)


Madan Jampani, ON.Lab

ONOS Project

May 07, 2015


Transcript

  1. ONOS Distributed Core
    ONOS Performance
    Madan Jampani
    [email protected]
    May 7th, 2015
    AT&T Talk


  2. Control Plane: Design Principles
    ● Distributed
    ○ Divide and conquer the problem space
    ● Symmetric
    ○ Instances are identical with respect to form and function
    ● Fault-tolerant
    ○ Handle failures seamlessly through built-in redundancy
    ● Decentralized
    ○ System is self-organizing and lacks any centralized control
    ● Incrementally Scalable
    ○ Capacity can be introduced gradually
    ● Operational Simplicity
    ○ Easy to deploy, no special hardware, no synchronized clocks, etc.


  3. Key Challenges
    ● Preserve simple SDN abstractions without
    compromising performance at scale
    ● Match problems to solutions that can meet their
    consistency/availability/durability needs
    ● Provide strong consistency at scale
    ● Expose distributed state management and coordination
    primitives as key application building blocks


  4. System Model
    [Diagram: applications running on each instance of a multi-node ONOS cluster]
    ● Each controller manages a portion of the network
    ● Controllers communicate with each other via RPC
    ● All core services are accessible on any instance
    ● Applications are distribution-transparent


  5. Topology (Global Network View)
    ● Partitioned by Controller
    ○ no single controller has direct visibility over the entire network
    ● Reasonable size
    ○ can fit into main memory on a given controller
    ● Read-intensive
    ○ low latency access to GNV is critical
    ● Consistent with Environment
    ○ incorporate network updates as soon as possible


  6. Topology State Management
    [Diagram: per-device topology updates tagged with logical (term, seq) timestamps]
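
    Here the (term, seq) pair acts as a logical timestamp: the term advances when
    mastership for a device changes, and the sequence number orders updates within
    a term. A minimal sketch of such a timestamp follows; the class and method
    names are illustrative, not taken from the ONOS source.

    // Hypothetical sketch of a (term, seq) logical timestamp; names are
    // illustrative and not the actual ONOS implementation.
    public final class LogicalTimestamp implements Comparable<LogicalTimestamp> {
        private final long term; // mastership term of the device's master
        private final long seq;  // sequence number within that term

        public LogicalTimestamp(long term, long seq) {
            this.term = term;
            this.seq = seq;
        }

        @Override
        public int compareTo(LogicalTimestamp other) {
            // A higher term always wins; within the same term, the higher
            // sequence number is newer.
            int cmp = Long.compare(term, other.term);
            return cmp != 0 ? cmp : Long.compare(seq, other.seq);
        }
    }

    An update is applied to the network view only if its timestamp is newer than
    the one already stored, so replicas converge on the same topology regardless
    of the order in which events arrive.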


  7. Flow State
    ● Per-switch state
    ● Exhibits strong read locality
    ● Authoritative source of what the data plane should contain
    ● Backup data for quick recovery
    Intent State
    ● High-level application edicts
    ● Intents are immutable and durable, while intent states are
    eventually consistent
    ● Topology events can trigger intent rerouting en masse


  8. Flow State
    ● Primary/backup replication; the switch master is the primary
    ● The backup location is the node most likely to succeed the
    current master (see the sketch below)
    Intent State
    ● Fully replicated using an Eventually Consistent Map
    ● Partitioned logical queues enable synchronization-free execution
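
    The sketch below illustrates the flow-state backup placement idea: the backup
    copy goes to the first standby in the device's mastership candidate list,
    since that node takes over if the master fails. The API names
    (MastershipService.getNodesFor, RoleInfo.backups) follow the ONOS mastership
    subsystem of that era; treat the exact signatures as assumptions.

    // Hedged sketch: place the flow-state backup on the node most likely
    // to succeed the current master. Verify API names against the release
    // in use.
    import java.util.List;

    import org.onosproject.cluster.NodeId;
    import org.onosproject.mastership.MastershipService;
    import org.onosproject.net.DeviceId;

    public class BackupPlacement {
        public NodeId backupFor(MastershipService mastership, DeviceId deviceId) {
            List<NodeId> standbys = mastership.getNodesFor(deviceId).backups();
            // The first standby becomes master on fail-over, so keeping the
            // backup copy there makes recovery a local read.
            return standbys.isEmpty() ? null : standbys.get(0);
        }
    }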


  9. State Management Primitives
    ● EventuallyConsistentMap
    ● ConsistentMap
    ● LeadershipService
    ● AtomicCounter *
    ● DistributedSet *
    * Will be available in Cardinal release
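
    For illustration, an application obtains these primitives from ONOS's
    StorageService. The sketch below shows a ConsistentMap; builder options and
    signatures varied between releases, so treat it as a hedged approximation of
    the API shape rather than a verbatim example, and the map name is hypothetical.

    // Hedged sketch: building and using a strongly consistent map via the
    // ONOS StorageService. Exact builder options vary by release.
    import org.onosproject.store.serializers.KryoNamespaces;
    import org.onosproject.store.service.ConsistentMap;
    import org.onosproject.store.service.Serializer;
    import org.onosproject.store.service.StorageService;
    import org.onosproject.store.service.Versioned;

    public class PrimitiveExample {
        private final ConsistentMap<String, String> map;

        public PrimitiveExample(StorageService storage) {
            map = storage.<String, String>consistentMapBuilder()
                    .withName("demo-map") // hypothetical map name
                    .withSerializer(Serializer.using(KryoNamespaces.API))
                    .build();
        }

        public void demo() {
            // Reads and writes are linearizable across the cluster; values
            // carry a version stamp that orders updates.
            map.put("device:of:0000000000000001", "active");
            Versioned<String> value = map.get("device:of:0000000000000001");
            System.out.println(value.value() + " @ version " + value.version());
        }
    }

    An EventuallyConsistentMap is built the same way (via
    eventuallyConsistentMapBuilder()) and trades linearizability for fast local
    reads and writes that are reconciled in the background.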


  10. Performance Metrics
    ● Device & link sensing latency
    ○ measures how fast the controller can react to environment changes,
    such as a switch or port going down, rebuild the network graph, and notify apps
    ● Flow rule operations throughput
    ○ measures how many flow rule operations can be issued against the
    controller and characterizes how throughput varies with cluster size
    ● Intent operations throughput
    ○ measures how many intent operations can be issued against the controller
    cluster and characterizes how throughput varies with cluster size
    ● Intent operations latency
    ○ measures how fast the controller can react to environment changes and
    reprovision intents on the data plane, and characterizes scalability


  11. Topology Latency
    ● Verify:
    ○ the effect of distributed state management on latency
    ○ that ONOS reacts faster to negative events than to positive ones
    ● Results consist of multiple parts:
    ○ Switch up latency
    ○ Switch down latency
    ○ Link up/down latency
    ● Experimental setup:
    ○ Two OVS switches connected to each other
    ○ Events are generated from the switch
    ○ Elapsed time is measured from the switch event until ONOS triggers
    the corresponding topology event


  12. Device & Link Sensing Latency


  13. Switch Up Latency
    ● Most of the time is spent waiting for
    the switch to respond to a features
    request (~53ms)
    ● ONOS spends under 25ms, with most of
    its time spent electing a master for the
    device
    ○ mastership election is a strongly
    consistent operation


  14. Switch Down Latency
    ● Significantly faster because there is
    no negotiation with the switch
    ● A terminating TCP connection
    unequivocally indicates that the
    switch is gone


  15. Link Up/Down Latency
    ● The increase from single-instance to
    multi-instance is being investigated
    ● Since we use LLDP to discover links, it
    takes longer to discover a link coming up
    than going down
    ● A port down event triggers immediate
    teardown of the link


  16. Flow Throughput
    ● ONOS may have to provision flows for many devices
    ● Objective is to understand how flow installation scales:
    ○ with increased east/west communication within the cluster
    ○ with the number of devices connected to each instance
    ● Experimental setup:
    ○ Constant number of flows
    ○ Constant number of devices attached to the cluster
    ○ Mastership evenly distributed
    ○ Variable number of flow installers
    ○ Variable number of separate device masters traversed


  17. Flow Rule Operations Throughput


  18. Flow Throughput Results
    ● A single instance can install over 500K
    flows per second
    ● ONOS can handle 3M local and 2M
    non-local flow installations
    ● With 1-3 ONOS instances, the flow
    setup rate remains constant no
    matter how many neighbours are
    involved
    ● With more than 3 instances injecting
    load, flow performance drops off
    due to the extra coordination required


  19. Intent Framework Performance
    ● Intents are high-level, network-level policy definitions
    ○ e.g. provide connectivity between two hosts, or
    route all traffic that matches this prefix to this edge port
    ● All objects are distributed for high availability
    ○ Synchronously written to at least one other node
    ○ Work is divided among all instances in the cluster
    ● Intents must be compiled into device-specific rules
    ○ Paths are computed and selected, reservations made, etc.
    ● Device-specific rules are installed
    ○ Leveraging other asynchronous subsystems (e.g. Flow Rule Service)
    ● Intents react to network events ("reroute")
    ○ e.g. device or link failure, host movement, etc.
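
    As a concrete illustration of the first bullet, a host-to-host connectivity
    intent could be submitted as in the sketch below. The builder-style API shown
    follows later ONOS releases, and the host addresses and application ID are
    hypothetical.

    // Hedged sketch: submitting a host-to-host connectivity intent.
    // Host IDs and the application ID are hypothetical.
    import org.onlab.packet.MacAddress;
    import org.onlab.packet.VlanId;
    import org.onosproject.core.ApplicationId;
    import org.onosproject.net.HostId;
    import org.onosproject.net.intent.HostToHostIntent;
    import org.onosproject.net.intent.IntentService;

    public class ConnectivityApp {
        void connect(IntentService intents, ApplicationId appId) {
            HostId one = HostId.hostId(MacAddress.valueOf("00:00:00:00:00:01"), VlanId.NONE);
            HostId two = HostId.hostId(MacAddress.valueOf("00:00:00:00:00:02"), VlanId.NONE);
            // Build a high-level connectivity intent between the two hosts.
            HostToHostIntent intent = HostToHostIntent.builder()
                    .appId(appId)
                    .one(one)
                    .two(two)
                    .build();
            // submit() is asynchronous: it returns once the intent is stored;
            // compilation into device-specific rules happens in the background.
            intents.submit(intent);
        }
    }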


  20. Intent Framework Latency
    ● API calls are asynchronous and return after storing the
    intent
    ● After an intent has been submitted, the framework starts
    compiling and installing
    ● An event is generated after the framework confirms that
    the policy has been written to the devices
    ● Experiment shows how quickly an application's policy can
    be reflected in the network ("install" and "withdraw"), as
    well as how long it takes to react to a network event
    ("reroute")


  21. Intent Latency Experiment


  22. Intent Latency Results
    ● Less than 100ms to install or withdraw a batch of intents
    ● Less than 50ms to process and react to network events
    ○ Slightly faster because intent objects are already replicated


  23. Intent Framework Throughput
    ● Dynamic networks undergo policy changes (e.g.
    forwarding decisions) on an ongoing basis
    ● The framework needs to be able to cope with a stream of
    requests and catch up if it ever falls behind
    ● Capacity needs to scale with the growth of the cluster


  24. Intent Throughput Experiment


  25. Intent Throughput Results
    ● Processing clearly scales as cluster size increases

