
SIG-NETWORK Deep Dive 2019 San Diego

Bowei Du
November 21, 2019


Networking is hard! This talk will start with some background on Kubernetes networking. Attendees who are not already comfortable with the "hows and whys" of basic networking in Kubernetes can get a bit of a primer before we dive deep on a few of the more recent developments and efforts in the networking space.

Bowei Du, Google
Rob Scott, Google
Dan Williams, Red Hat

https://kccncna19.sched.com/event/UakP/sig-network-intro-deep-dive-tim-hockin-google-vallery-lancey-lyft

Thursday November 21, 2019 10:55am - 12:25pm
Room 33ABC - San Diego Convention Center Upper Level
Maintainer Track Sessions
Experience Level Any
Session Slides Included Yes


Transcript

  1. Deep dive into what is new in the networking space:

    • EndpointSlice
    • Ingress V1, Service and Ingress evolution
    • Dual Stack
    • Service Topology
    • SCTP
  2. EndpointSlice: Inspiration

    • Kubernetes clusters keep getting bigger
    • Endpoints became a bottleneck
    • Every network endpoint for a Service needs to fit in a single Endpoints resource
      ◦ etcd's default size limit prevents them from storing more than ~5k network endpoints
    • Any change to an Endpoints resource needs to be transmitted to every Node
      ◦ In a 5k node cluster a single endpoint change could result in ~5GB data transferred*

    * Assuming 10k endpoints with each endpoint representing ~1KB worth of data.
  3. [Diagram: a single Endpoints object holding Pods 1-8 is split into multiple EndpointSlices, each holding a subset of the pods (Pods 1, 2, 5, 6 in one slice; Pods 3, 4, 7, 8 in another).]
  4. apiVersion: "v1"
     kind: "Endpoints"
     metadata:
       name: "example"
     subsets:
       - ports:
           - name: "http"
             protocol: "TCP"
             port: 80
         addresses:
           - ip: "192.0.2.42"
             hostname: "pod-1"
             nodeName: "node1"
         notReadyAddresses: []
  5. apiVersion: "discovery.k8s.io/v1beta1"
     kind: "EndpointSlice"
     metadata:
       name: "example-abc"
       labels:
         endpointslice.kubernetes.io/managed-by: "endpointslice-controller.k8s.io"
         kubernetes.io/service-name: "example"
     addressType: "IPv6"
     ports:
       - name: "http"
         protocol: "TCP"
         port: 80
     endpoints:
       - addresses:
           - "2001:db8::1234:5678"
         conditions:
           ready: true
         hostname: "pod-1"
         topology:
           kubernetes.io/hostname: "node-1"
           topology.kubernetes.io/zone: "us-west2-a"
  6. EndpointSlice Differences

    • Specify and validate address types (IPv4, IPv6, or FQDN)
    • Ports have an AppProtocol field
    • EndpointSlice endpoints contain conditions
      ◦ currently just a ready boolean
      ◦ more flexible than addresses and notReadyAddresses
    • EndpointSlice endpoints contain topology fields, currently node name, zone, and region
    • Not enabled by default; requires the EndpointSlice feature gate (see the sketch below)
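
    A rough sketch of what turning the feature gate on can look like for kube-proxy, via its configuration file; other components take the equivalent --feature-gates=EndpointSlice=true flag, and everything here beyond the gate name is placeholder boilerplate.

    # Sketch: enabling the EndpointSlice feature gate in a kube-proxy config file.
    apiVersion: kubeproxy.config.k8s.io/v1alpha1
    kind: KubeProxyConfiguration
    featureGates:
      EndpointSlice: true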
  7. EndpointSlice Performance

    • The EndpointSlice controller worked well at scale, a bit faster than the Endpoints controller
    • EndpointSlice integration with kube-proxy was not the same; it needed more testing:
      ◦ Spin up a cluster with kubetest
      ◦ Modify the kube-proxy manifest on at least one node to enable profiling
      ◦ Use kubectl port-forward and go tool pprof to profile kube-proxy
      ◦ Use kubescript to push custom builds of kube-proxy to specific nodes
      ◦ Final results gathered on a 150 node cluster over a 15 minute window, scaling from 0-10k TCP endpoints for a service
      ◦ 4 versions of kube-proxy (1.16, 1.16+slices, 1.17, 1.17+slices)
  8. EndpointSlice Performance

    • endpoint.IP() was used for sorting, taking 41% of total kube-proxy CPU time
    • endpointslicecache.EndpointsMap() was called on each EndpointSlice update, taking 45% of total kube-proxy CPU time
    • detectStaleConnections() was taking 82% of total kube-proxy CPU time but was only actually useful for UDP connections
  9. EndpointSlice Performance

    Implementation               CPU time    % of baseline
    Endpoints 1.16 (baseline)    116.7       100%
    Endpoints 1.17                22.1       18.9%
    EndpointSlice 1.16           312.5       260%
    EndpointSlice 1.17             6.4       5.4%
  10. EndpointSlice Timeline

    • Design was proposed at KubeCon EU 2019
    • KEP was formally created at the beginning of June
    • KEP was formally approved at the end of July
    • Alpha implementation made it into the 1.16 release at the end of August
    • Beta implementation made it into the 1.17 release in November
    • Want it to be enabled by default in 1.18
    • Targeting GA in 1.19
  11. EndpointSlice: Help Wanted

    • Does it work for you? Can you break it? Is it missing something?
    • We need more thorough test coverage.
    • To the users and implementers: migrate use of Endpoints to EndpointSlices:
      ◦ Windows kube-proxy
      ◦ Ingress controllers
      ◦ DNS
      ◦ More?
  12. Evolving Landscape for L4/L7

    • Cloud LBs: GCP, AWS, Azure, ...
    • Middle Proxies: nginx, envoy, haproxy, ...
    • Transparent “Proxies”: sidecars, kube-proxy, ...

    Personas: Infrastructure Provider, Cluster Operator / NetOps, Application Developer
    Trends: more proxy/LB providers and controllers; more complex roles and personas.
  13. Service/Ingress Evolution

    Ingress: very wide support among different implementations; “good enough” for a non-trivial number of users.
    Plan: clean up the spec and take the type to GA.

    [Diagram: the existing Ingress, IngressClass, and Service objects alongside the proposed GatewayClass, Gateway, and *Route objects.]
  14. V1 GA

    Clean up the object model:
    • IngressClass
    Tweaks/fixes to the specification (a rough v1 manifest sketch follows below):
    • backend to defaultBackend
    • Path-based prefixes/regex
    • Hostname wildcards
    Add flexibility that will be hard to change later:
    • Alternate backend types
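
    A minimal sketch of how these tweaks surface in a v1-style manifest, using the field names as they eventually shipped in networking.k8s.io/v1; the resource names, host, and ports here are placeholders.

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: example
    spec:
      ingressClassName: example-class    # references an IngressClass object
      defaultBackend:                    # was spec.backend
        service:
          name: fallback
          port:
            number: 80
      rules:
        - host: "*.example.com"          # hostname wildcard
          http:
            paths:
              - path: /my-app
                pathType: Prefix         # explicit path matching semantics
                backend:
                  service:
                    name: my-app
                    port:
                      number: 80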
  15. Service/Ingress Evolution

    [Diagram: the existing Ingress, IngressClass, and Service objects alongside the proposed GatewayClass, Gateway, and *Route objects.]

    • Better model the personas and roles involved with services and load-balancing.
    • Support modern load-balancing features while maintaining portability (or maybe “predictability”).
    • Have standard mechanisms for extension for API growth / implementation / vendor-specific behaviors.
  16. Gateway/Route schema

    [Diagram: GatewayClass, Gateway, *Route, and Service objects and their relationships (one-to-many and m:n cardinalities).]

    kind: GatewayClass
    name: internet-lb
    provider: acme.io/cloud
    parameters:
      apiGroup: acme.io/cloud
      kind: Parameters
      name: ...

    kind: GatewayClass
    name: private-lb
    provider: acme.io/cloud
    parameters:
      apiGroup: acme.io/cloud
      kind: GatewayParameters
      name: ...

    apiGroup: acme.io/cloud
    kind: Parameters
    public: true

    apiGroup: acme.io/cloud
    kind: Parameters
    public: false
  17. Gateway/Route schema

    [Diagram: GatewayClass, Gateway, *Route, and Service objects and their relationships (one-to-many and m:n cardinalities).]

    kind: GatewayClass
    name: internet-lb
    ...

    kind: Gateway
    namespace: net-ops
    name: the-gateway
    class: internet-lb
    listeners:
      - port: 80
        protocol: http
        routes:
          - kind: HTTPRoute
            name: my-app

    kind: HTTPRoute
    name: my-app
    rules:
      - path: /my-app
        ...
    gateways:
      - namespace: net-ops
        name: the-gateway

    kind: Service
    name: my-app
  18. How to get involved

    API sketch is here: link; talk today @ 3:20 pm: link
    Working group (coming soon, info will go out):
    - Bi-weekly meetings
    - SIG-NETWORK mailing list (for now)
    - Slack channel
    - github.com/kubernetes-sigs/service-apis
    Help wanted:
    - Feedback on the proposal (users AND implementers)
    - Experimental implementations
  19. Dual Stack: Goals

    • Support Pods with both IPv4 and IPv6 addresses:
      ◦ should be able to communicate IPv4 to IPv4 and IPv6 to IPv6
      ◦ should be able to access external servers on IPv4 and IPv6 addresses
      ◦ same pod can be targeted by IPv4 and IPv6 Services
    • NodePorts and External IPs should support IPv4 and IPv6 addresses
    • Services, Endpoints, and EndpointSlices can be either IPv4 or IPv6
    • Maintain backwards compatibility with IPv4-only and IPv6-only clusters
  20. Dual Stack: Complexity

    Dual stack implementation is a huge project; it affects:
    • Service, Ingress
    • Endpoints, EndpointSlice
    • Node
    • CRI, CNI, runtimes
    • Cluster configs
    • … and anything in the ecosystem that reads/writes/munges IPs (bash scripts, anyone?)
  21. Dual Stack: Alpha

    • First alpha release was 1.16, with updates in 1.17
    • All functionality hidden behind the IPv6DualStack feature gate (see the configuration sketch below)
    • Changes:
      ◦ Pods have a new PodIPs attribute that will have IPv4 and IPv6 addresses when the dual stack feature gate is enabled
      ◦ Services have a new IPFamily attribute that can be either IPv4 or IPv6
      ◦ Endpoints will only contain addresses matching the Service IPFamily
      ◦ EndpointSlices have IPv4 and IPv6 address types and will match the Service IPFamily
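
    A rough sketch of what enabling the alpha feature involves, assuming you control the control-plane manifests: the IPv6DualStack feature gate plus a second (IPv6) CIDR on the components that allocate pod addresses. The fragment below shows illustrative kube-controller-manager flags; the CIDR values are placeholders, and kube-apiserver, kubelet, and kube-proxy need the same feature gate.

    # Sketch: static Pod fragment for kube-controller-manager (placeholder values).
    containers:
      - name: kube-controller-manager
        command:
          - kube-controller-manager
          - --feature-gates=IPv6DualStack=true
          - --cluster-cidr=10.244.0.0/16,fd00:10:244::/56   # one IPv4 and one IPv6 pod CIDR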
  22. Dual Stack: In Practice

    Pod
    meta:
      name: foo
    spec: ...
    status:
      podIPs:
        - 10.1.2.3
        - 2001:db8::1234:5678

    Service
    meta:
      name: bar-v4
    spec:
      IPFamily: IPv4
      ...
      clusterIP: 10.2.3.4

    EndpointSlice
    meta:
      name: bar-v4-asdf
    spec:
      addressType: IPv4
      endpoints:
        - targetRef: {pod foo}
          addresses:
            - 10.1.2.3
  23. Dual Stack: In Practice

    Pod
    meta:
      name: foo
    spec: ...
    status:
      podIPs:
        - 10.1.2.3
        - 2001:db8::1234:5678

    Service
    meta:
      name: bar-v6
    spec:
      IPFamily: IPv6
      ...
      clusterIP: 2001:db8::2345:6789

    Service
    meta:
      name: bar-v4
    spec:
      IPFamily: IPv4
      ...
      clusterIP: 10.2.3.4

    EndpointSlice
    meta:
      name: bar-v4-asdf
    spec:
      addressType: IPv4
      endpoints:
        - targetRef: {pod foo}
          addresses:
            - 10.1.2.3

    EndpointSlice
    meta:
      name: bar-v6-asdf
    spec:
      addressType: IPv6
      endpoints:
        - targetRef: {pod foo}
          addresses:
            - 2001:db8::1234:5678
  24. Dual Stack: Kube Proxy

    • Kube Proxy integration is required to make all of this work
    • MetaProxier design:
      ◦ Run separate proxiers for each IP family
      ◦ Each proxier runs in IPv4 or IPv6 mode and only Endpoints matching that IP family are sent to that proxier
    • Currently this is limited to the IPVS proxier
    • Work underway for iptables
    • No support for Windows proxiers
  25. Dual Stack: Timeline

    • November 2014: First bits of IPv6 support (kube #2147)
    • 2017: IPv6 single-stack effort starts
    • June 2018: Initial dual-stack proposal (community #2254)
    • February 2019: Dual-stack updated to a KEP (#808)
    • May 2019: KEP #808 accepted, targeting Alpha for v1.16
    • August 2019: Single-stack IPv6 promoted to Beta, dual-stack Alpha for v1.16
    • November 2019: Dual-stack remains Alpha for v1.17
    • Early 2020: Dual-stack promoted to Beta for v1.18 (?)
  26. Dual Stack: Help Wanted

    • There’s a lot more to do, and lots of opportunities to help
    • Kube Proxy needs a dual stack implementation for Windows
    • For what has been released, does it work for you?
    • Does it work well with EndpointSlices enabled?
    • Have we missed something?
  27. Service Topology

    • "talk to the backend that's local to me"
    • What does "local" really mean?
      ◦ Same {node, rack, failure zone, region, cloud provider, …}
    • KEP #640 accepted December 2018
    • Initial feature-gated (Alpha) implementation merged last week (#72046)
      ◦ Adds an ordered TopologyKeys slice to Service.Spec
        ▪ topologyKeys: ["kubernetes.io/hostname", "topology.kubernetes.io/zone"]
      ◦ Proxy matches pod and node topology label values in TopologyKey order (see the sketch below)
    • Future: PodLocator resource added to address scalability issues
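
    A minimal sketch of a Service using the new field, assuming the ServiceTopology feature gate is enabled; the selector and port are placeholders, and the trailing "*" entry means "fall back to any endpoint".

    # Hypothetical Service with the alpha topologyKeys field.
    apiVersion: v1
    kind: Service
    metadata:
      name: my-app
    spec:
      selector:
        app: my-app
      ports:
        - port: 80
      topologyKeys:
        - "kubernetes.io/hostname"         # prefer endpoints on the same node
        - "topology.kubernetes.io/zone"    # then endpoints in the same zone
        - "*"                              # finally, any endpoint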
  28. SCTP

    • Stream Control Transmission Protocol (RFC 4960)
      ◦ Like a mashup of TCP and UDP with multi-homing and redundant paths
      ◦ Widely used in the telecommunications space
    • Implementation
      ◦ Sits alongside TCP and UDP as a core Kubernetes protocol (see the sketch below)
      ◦ Support added to NetworkPolicy, Service, Endpoints, HostPort
    • Schedule
      ◦ Promoted to Alpha in Kubernetes v1.12
      ◦ Promote to Beta in Kubernetes v1.18?
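
    A minimal sketch of a Service exposing an SCTP port, assuming the SCTPSupport feature gate is enabled; the names and port number are placeholders.

    # Hypothetical Service using SCTP as the port protocol (alpha feature).
    apiVersion: v1
    kind: Service
    metadata:
      name: sctp-app
    spec:
      selector:
        app: sctp-app
      ports:
        - name: sctp-port
          protocol: SCTP
          port: 3868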
  29. What does this all mean?

    • EndpointSlice: massive scale services and endpoints
    • Ingress V1, Service and Ingress evolution: better L4/L7 LB modeling of roles, a raised bar on feature support, extensibility
    • Dual Stack: you can use both IPv4 and IPv6 in your clusters
    • Service Topology: control over inter-zone/region traffic
    • SCTP: protocol support completeness