
ONOS Distributed Core and Performance - AT&T Talk (May 7th)

Madan Jampani, ON.Lab

ONOS Project

May 07, 2015

Transcript

  1. ONOS Distributed Core / ONOS Performance
     Madan Jampani, madan@onlab.us
     AT&T Talk, May 7th, 2015
  2. Control Plane: Design Principles
     • Distributed
       ◦ Divide and conquer the problem space
     • Symmetric
       ◦ Instances are identical with respect to form and function
     • Fault-tolerant
       ◦ Handle failures seamlessly through built-in redundancy
     • Decentralized
       ◦ System is self-organizing and lacks any centralized control
     • Incrementally Scalable
       ◦ Capacity can be introduced gradually
     • Operational Simplicity
       ◦ Easy to deploy, no special hardware, no synchronized clocks, etc.
  3. Key Challenges
     • Preserve simple SDN abstractions without compromising performance at scale
     • Match problems to solutions that can meet their consistency/availability/durability needs
     • Provide strong consistency at scale
     • Expose distributed state management and coordination primitives as key application building blocks
  4. System Model (diagram: applications running on top of each controller instance)
     • Each controller manages a portion of the network
     • Controllers communicate with each other via RPC
     • All core services are accessible on any instance
     • Applications are distribution transparent
  5. Topology (Global Network View)
     • Partitioned by Controller
       ◦ no single controller has direct visibility over the entire network
     • Reasonable size
       ◦ can fit into main memory on a given controller
     • Read-intensive
       ◦ low latency access to the GNV is critical
     • Consistent with Environment
       ◦ incorporate network updates as soon as possible
  6. Topology State Management (diagram: topology state tagged with (term, seq) logical timestamps)
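     The (term, seq) label on this slide refers to a logical timestamp: the mastership term of the device's master plus a sequence number issued within that term. A minimal sketch of how such timestamps could be compared to decide whether a topology update supersedes the current state (class and field names are illustrative, not the actual ONOS types):

     ```java
     // Illustrative logical timestamp for topology events: a mastership term
     // plus a sequence number issued within that term. An update supersedes
     // the current state only if its timestamp is newer.
     public final class LogicalTimestamp implements Comparable<LogicalTimestamp> {
         private final long term;  // mastership term of the device's master
         private final long seq;   // sequence number within that term

         public LogicalTimestamp(long term, long seq) {
             this.term = term;
             this.seq = seq;
         }

         @Override
         public int compareTo(LogicalTimestamp other) {
             // A higher term always wins; within the same term, the higher
             // sequence number wins.
             int byTerm = Long.compare(term, other.term);
             return byTerm != 0 ? byTerm : Long.compare(seq, other.seq);
         }

         public boolean isNewerThan(LogicalTimestamp other) {
             return compareTo(other) > 0;
         }
     }
     ```

     Because the term increases whenever mastership changes, updates originating from a stale master can never override those applied under the new one.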

  7. Flow State
     • Per switch state
     • Exhibits strong read locality
     • Authoritative source of what the data plane should contain
     • Backup data for quick recovery
     Intent State
     • High level application edicts
     • Intents are immutable and durable, while Intent states are eventually consistent
     • Topology events can trigger intent rerouting en masse
  8. Flow State
     • Primary/backup replication: the switch master is the primary
     • The backup location is the node most likely to succeed the current master
     Intent State
     • Fully replicated using an Eventually Consistent Map
     • Partitioned Logical Queues enable synchronization-free execution (see the sketch below)
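     A minimal sketch of the partitioned-queue idea from this slide: work is routed to a partition by key, and each partition processes its work on a single thread, so operations on the same intent are ordered without explicit locking while different intents proceed in parallel. The class below is illustrative, not the actual ONOS implementation:

     ```java
     import java.util.concurrent.ExecutorService;
     import java.util.concurrent.Executors;

     // Illustrative partitioned work queue: operations on the same key always
     // land on the same single-threaded executor, so they run in order without
     // locks, while different keys are processed concurrently.
     public final class PartitionedQueue {
         private final ExecutorService[] partitions;

         public PartitionedQueue(int partitionCount) {
             partitions = new ExecutorService[partitionCount];
             for (int i = 0; i < partitionCount; i++) {
                 partitions[i] = Executors.newSingleThreadExecutor();
             }
         }

         // Pick a partition from the key's hash; same key -> same partition.
         public void submit(Object key, Runnable task) {
             int index = Math.floorMod(key.hashCode(), partitions.length);
             partitions[index].execute(task);
         }
     }
     ```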
  9. State Management Primitives
     • EventuallyConsistentMap<K, V>
     • ConsistentMap<K, V> (contrasted with the above in the sketch below)
     • LeadershipService
     • AtomicCounter *
     • DistributedSet *
     (* Will be available in the Cardinal release)
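     A rough illustration of the difference between the first two primitives. The interfaces below are simplified stand-ins for explanation only, not the actual ONOS signatures:

     ```java
     // Simplified stand-ins for the two map primitives, illustrating the
     // difference in guarantees rather than the real ONOS interfaces.
     interface EventuallyConsistentMap<K, V> {
         // Returns immediately after the local update; the change is spread to
         // peers in the background and conflicts are resolved with timestamps.
         void put(K key, V value);
         V get(K key);
     }

     interface ConsistentMap<K, V> {
         // Returns only after the update has been committed by the replicas
         // backing the map, so every subsequent read observes it.
         void put(K key, V value);
         V get(K key);
     }
     ```

     Flow state (slide 8) tolerates the weaker, eventually consistent guarantee, while operations such as mastership election (slide 13) go through the strongly consistent path.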
  10. Performance Metrics
     • Device & link sensing latency
       ◦ measure how fast the controller can react to environment changes, such as a switch or port going down, to rebuild the network graph and notify apps
     • Flow rule operations throughput
       ◦ measure how many flow rule operations can be issued against the controller, and characterize the relationship of throughput with cluster size
     • Intent operations throughput
       ◦ measure how many intent operations can be issued against the controller cluster, and characterize the relationship of throughput with cluster size
     • Intent operations latency
       ◦ measure how fast the controller can react to environment changes and reprovision intents on the data plane, and characterize scalability
  11. Topology Latency
     • Verify:
       ◦ observe the effect of distributed state management on latency
       ◦ react faster to negative events than positive ones
     • Results consist of multiple parts:
       ◦ Switch up latency
       ◦ Switch down latency
       ◦ Link up/down latency
     • Experimental setup:
       ◦ Two OVS switches connected to each other
       ◦ Events are generated from the switch
       ◦ Elapsed time is measured from the switch until ONOS triggers a corresponding topology event
  12. Device & Link Sensing Latency

  13. Switch Up Latency
     • Most of the time is spent waiting for the switch to respond to a features request (~53ms)
     • ONOS spends under 25ms, with most of its time spent electing a master for the device
       ◦ which is a strongly consistent operation (see the sketch below)
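     A minimal sketch of why mastership election is a strongly consistent operation: exactly one instance must win the claim for a device, which requires an atomic claim against a store all instances agree on. The class below is illustrative; a ConcurrentHashMap stands in for a consensus-backed map so the sketch runs on a single JVM:

     ```java
     import java.util.Map;
     import java.util.concurrent.ConcurrentHashMap;

     // Illustrative mastership election: the first instance to claim a device
     // becomes its master. In a real cluster the map would be backed by a
     // strongly consistent store rather than a local ConcurrentHashMap.
     public final class MastershipElector {
         private final Map<String, String> masters = new ConcurrentHashMap<>();

         // Returns true if this node won mastership for the device.
         public boolean claimMastership(String deviceId, String nodeId) {
             // putIfAbsent is atomic: exactly one contender sees a null return.
             return masters.putIfAbsent(deviceId, nodeId) == null;
         }

         public String masterOf(String deviceId) {
             return masters.get(deviceId);
         }
     }
     ```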
  14. Switch Down Latency
     • Significantly faster because there is no negotiation with the switch
     • A terminating TCP connection unequivocally indicates that the switch is gone
  15. Link Up/Down Latency
     • The increase from single to multi-instance is being investigated
     • Since we use LLDP to discover links, it takes longer to discover a link coming up than going down
     • Port down events trigger immediate teardown of the link
  16. Flow Throughput
     • ONOS may have to provision flows for many devices
     • The objective is to understand how flow installation scales:
       ◦ with increased east/west communication within the cluster
       ◦ with the number of devices connected to each instance
     • Experimental setup:
       ◦ Constant number of flows
       ◦ Constant number of devices attached to the cluster
       ◦ Mastership evenly distributed
       ◦ Variable number of flow installers
       ◦ Variable number of separate device masters traversed
  17. Flow Rule Operations Throughput

  18. Flow Throughput Results
     • A single instance can install over 500K flows per second
     • ONOS can handle 3M local and 2M non-local flow installations
     • With 1-3 ONOS instances, the flow setup rate remains constant no matter how many neighbours are involved
     • With more than 3 instances injecting load, flow performance drops off due to the extra coordination required
  19. Intent Framework Performance
     • Intents are high-level, network-level policy definitions
       ◦ e.g. provide connectivity between two hosts, or route all traffic that matches this prefix to this edge port
     • All objects are distributed for high availability
       ◦ Synchronously written to at least one other node
       ◦ Work is divided among all instances in the cluster
     • Intents must be compiled into device-specific rules (see the sketch below)
       ◦ Paths are computed and selected, reservations made, etc.
     • Device-specific rules are installed
       ◦ Leveraging other asynchronous subsystems (e.g. Flow Rule Service)
     • Intents react to network events ("reroute")
       ◦ e.g. device or link failure, host movement, etc.
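     A rough sketch of the compile step described on this slide: a high-level connectivity intent is expanded into one forwarding rule per device along a chosen path. All types and the path representation below are placeholders, not the actual ONOS intent framework classes:

     ```java
     import java.util.ArrayList;
     import java.util.List;

     // Illustrative compilation of a connectivity intent into device-specific
     // forwarding rules along a pre-computed path.
     final class ConnectivityIntent {
         final String srcHost;
         final String dstHost;

         ConnectivityIntent(String srcHost, String dstHost) {
             this.srcHost = srcHost;
             this.dstHost = dstHost;
         }
     }

     final class FlowRule {
         final String deviceId;
         final String match;   // e.g. "dst=" + destination host
         final String action;  // e.g. "output:" + port

         FlowRule(String deviceId, String match, String action) {
             this.deviceId = deviceId;
             this.match = match;
             this.action = action;
         }
     }

     final class IntentCompiler {
         // Each hop is a (deviceId, outputPort) pair chosen during path selection.
         List<FlowRule> compile(ConnectivityIntent intent, List<String[]> pathHops) {
             List<FlowRule> rules = new ArrayList<>();
             for (String[] hop : pathHops) {
                 rules.add(new FlowRule(hop[0], "dst=" + intent.dstHost, "output:" + hop[1]));
             }
             return rules;  // installed asynchronously via the flow rule subsystem
         }
     }
     ```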
  20. Intent Framework Latency
     • API calls are asynchronous and return after storing the intent
     • After an intent has been submitted, the framework starts compiling and installing
     • An event is generated after the framework confirms that the policy has been written to the devices
     • The experiment shows how quickly an application's policy can be reflected in the network ("install" and "withdrawn"), as well as how long it takes to react to a network event ("reroute"); a measurement sketch follows
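     A minimal sketch of how such an experiment can measure latency around an asynchronous API: record the submit time per intent and compute the elapsed time when the confirmation event arrives. The probe below is illustrative and not tied to the actual ONOS listener interfaces:

     ```java
     import java.util.Map;
     import java.util.concurrent.ConcurrentHashMap;

     // Illustrative latency probe for an asynchronous intent API: pair each
     // submit timestamp with the corresponding "installed" confirmation.
     public final class IntentLatencyProbe {
         private final Map<String, Long> submitTimes = new ConcurrentHashMap<>();

         // Called right before the asynchronous submit() returns.
         public void recordSubmit(String intentKey) {
             submitTimes.put(intentKey, System.nanoTime());
         }

         // Called from the event listener when the framework confirms installation.
         public void recordInstalled(String intentKey) {
             Long start = submitTimes.remove(intentKey);
             if (start != null) {
                 long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                 System.out.println(intentKey + " installed in " + elapsedMs + " ms");
             }
         }
     }
     ```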
  21. Intent Latency Experiment

  22. Intent Latency Results
     • Less than 100ms to install or withdraw a batch of intents
     • Less than 50ms to process and react to network events
       ◦ Slightly faster because intent objects are already replicated
  23. Intent Framework Throughput
     • Dynamic networks undergo changes of policies (e.g. forwarding decisions) on an ongoing basis
     • The framework needs to be able to cope with a stream of requests and catch up if it ever falls behind
     • Capacity needs to scale with the growth of the cluster
  24. Intent Throughput Experiment

  25. Intent Throughput Results
     • Processing clearly scales as cluster size increases