
SIG-NETWORK Deep Dive 2019 San Diego

Bowei Du
November 21, 2019


Networking is hard! This talk will start with some background on Kubernetes networking. Attendees who are not already comfortable with the "hows and whys" of basic networking in Kubernetes can get a bit of a primer before we dive deep on a few of the more recent developments and efforts in the networking space.

Bowei Du, Google
Rob Scott, Google
Dan Williams, Red Hat

https://kccncna19.sched.com/event/UakP/sig-network-intro-deep-dive-tim-hockin-google-vallery-lancey-lyft

Thursday November 21, 2019 10:55am - 12:25pm
Room 33ABC - San Diego Convention Center Upper Level
Maintainer Track Sessions
Experience Level Any
Session Slides Included Yes


Transcript

  1. Deep dive into what is new in the networking space:

    • EndpointSlice
    • Ingress V1, Service and Ingress evolution
    • Dual Stack
    • Service Topology
    • SCTP
  2. EndpointSlice: Inspiration

    • Kubernetes clusters keep getting bigger
    • Endpoints became a bottleneck
    • Every network endpoint for a Service needs to fit in a single Endpoints resource
      ◦ etcd's default size limit prevents them from storing more than ~5k network endpoints
    • Any change to an Endpoints resource needs to be transmitted to every Node
      ◦ In a 5k node cluster a single endpoint change could result in ~5GB data transferred*

    * Assuming 10k endpoints with each endpoint representing ~1KB worth of data.
  3. [Diagram: a single Endpoints object holding Pods 1-8 is split into multiple EndpointSlices, each holding a subset of the pods (Pods 1, 2, 5, 6 in one slice; Pods 3, 4, 7, 8 in another).]
  4. apiVersion: "v1"
     kind: "Endpoints"
     metadata:
       name: "example"
     subsets:
       - ports:
           - name: "http"
             protocol: "TCP"
             port: 80
         addresses:
           - ip: "192.0.2.42"
             hostname: "pod-1"
             nodeName: "node1"
         notReadyAddresses: []
  5. apiVersion: "discovery.k8s.io/v1beta1"
     kind: "EndpointSlice"
     metadata:
       name: "example-abc"
       labels:
         endpointslice.kubernetes.io/managed-by: "endpointslice-controller.k8s.io"
         kubernetes.io/service-name: "example"
     addressType: "IPv6"
     ports:
       - name: "http"
         protocol: "TCP"
         port: 80
     endpoints:
       - addresses:
           - "2001:db8::1234:5678"
         conditions:
           ready: true
         hostname: "pod-1"
         topology:
           kubernetes.io/hostname: "node-1"
           topology.kubernetes.io/zone: "us-west2-a"
  6. EndpointSlice Differences

    • Specify and validate address types (IPv4, IPv6, or FQDN)
    • Ports have an AppProtocol field
    • EndpointSlice endpoints contain conditions
      ◦ currently just a ready boolean
      ◦ more flexible than addresses and notReadyAddresses
    • EndpointSlice endpoints contain topology fields, currently node name, zone, and region
    • Not enabled by default; requires the EndpointSlice feature gate (see the sketch below)
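
    A rough sketch of what turning the feature gate on can look like for kube-proxy, via its configuration file; other components take the equivalent --feature-gates=EndpointSlice=true flag, and everything here beyond the gate name is placeholder boilerplate.

    # Sketch: enabling the EndpointSlice feature gate in a kube-proxy config file.
    apiVersion: kubeproxy.config.k8s.io/v1alpha1
    kind: KubeProxyConfiguration
    featureGates:
      EndpointSlice: true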
  7. EndpointSlice Performance

    • The EndpointSlice controller worked well at scale, a bit faster than the Endpoints controller
    • EndpointSlice integration with kube-proxy was not the same; it needed more testing:
      ◦ Spin up a cluster with kubetest
      ◦ Modify the kube-proxy manifest on at least one node to enable profiling
      ◦ Use kubectl port-forward and go tool pprof to profile kube-proxy
      ◦ Use kubescript to push custom builds of kube-proxy to specific nodes
      ◦ Final results gathered on a 150 node cluster over a 15 minute window, scaling from 0-10k TCP endpoints for a service
      ◦ 4 versions of kube-proxy (1.16, 1.16+slices, 1.17, 1.17+slices)
  8. EndpointSlice Performance

    • endpoint.IP() was used for sorting, taking 41% of total kube-proxy CPU time
    • endpointslicecache.EndpointsMap() was called on each EndpointSlice update, taking 45% of total kube-proxy CPU time
    • detectStaleConnections() was taking 82% of total kube-proxy CPU time but was only actually useful for UDP connections
  9. EndpointSlice Performance

    Implementation               CPU time    % of baseline
    Endpoints 1.16 (baseline)    116.7       100%
    Endpoints 1.17                22.1       18.9%
    EndpointSlice 1.16           312.5       260%
    EndpointSlice 1.17             6.4       5.4%
  10. EndpointSlice Timeline

    • Design was proposed at KubeCon EU 2019
    • KEP was formally created at the beginning of June
    • KEP was formally approved at the end of July
    • Alpha implementation made it into the 1.16 release at the end of August
    • Beta implementation made it into the 1.17 release in November
    • Want it to be enabled by default in 1.18
    • Targeting GA in 1.19
  11. EndpointSlice: Help Wanted

    • Does it work for you? Can you break it? Is it missing something?
    • We need more thorough test coverage.
    • To the users and implementers: migrate use of Endpoints to EndpointSlices:
      ◦ Windows kube-proxy
      ◦ Ingress controllers
      ◦ DNS
      ◦ More?
  12. Evolving Landscape for L4/L7

    • Cloud LBs: GCP, AWS, Azure, ...
    • Middle Proxies: nginx, envoy, haproxy, ...
    • Transparent “Proxies”: sidecars, kube-proxy, ...

    Personas: Infrastructure Provider, Cluster Operator / NetOps, Application Developer
    Trends: more proxy/LB providers and controllers; more complex roles and personas.
  13. Service/Ingress Evolution

    Ingress: very wide support among different implementations; “good enough” for a non-trivial number of users.
    Plan: clean up the spec and take the type to GA.

    [Diagram: the existing Ingress, IngressClass, and Service objects alongside the proposed GatewayClass, Gateway, and *Route objects.]
  14. V1 GA

    Clean up the object model:
    • IngressClass
    Tweaks/fixes to the specification (a rough v1 manifest sketch follows below):
    • backend to defaultBackend
    • Path-based prefixes/regex
    • Hostname wildcards
    Add flexibility that will be hard to change later:
    • Alternate backend types
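
    A minimal sketch of how these tweaks surface in a v1-style manifest, using the field names as they eventually shipped in networking.k8s.io/v1; the resource names, host, and ports here are placeholders.

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: example
    spec:
      ingressClassName: example-class    # references an IngressClass object
      defaultBackend:                    # was spec.backend
        service:
          name: fallback
          port:
            number: 80
      rules:
        - host: "*.example.com"          # hostname wildcard
          http:
            paths:
              - path: /my-app
                pathType: Prefix         # explicit path matching semantics
                backend:
                  service:
                    name: my-app
                    port:
                      number: 80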
  15. Service/Ingress Evolution

    [Diagram: the existing Ingress, IngressClass, and Service objects alongside the proposed GatewayClass, Gateway, and *Route objects.]

    • Better model the personas and roles involved with services and load-balancing.
    • Support modern load-balancing features while maintaining portability (or maybe “predictability”).
    • Have standard mechanisms for extension for API growth / implementation / vendor-specific behaviors.
  16. Gateway/Route schema

    [Diagram: GatewayClass, Gateway, *Route, and Service objects and their relationships (one-to-many and m:n cardinalities).]

    kind: GatewayClass
    name: internet-lb
    provider: acme.io/cloud
    parameters:
      apiGroup: acme.io/cloud
      kind: Parameters
      name: ...

    kind: GatewayClass
    name: private-lb
    provider: acme.io/cloud
    parameters:
      apiGroup: acme.io/cloud
      kind: GatewayParameters
      name: ...

    apiGroup: acme.io/cloud
    kind: Parameters
    public: true

    apiGroup: acme.io/cloud
    kind: Parameters
    public: false
  17. Gateway/Route schema

    [Diagram: GatewayClass, Gateway, *Route, and Service objects and their relationships (one-to-many and m:n cardinalities).]

    kind: GatewayClass
    name: internet-lb
    ...

    kind: Gateway
    namespace: net-ops
    name: the-gateway
    class: internet-lb
    listeners:
      - port: 80
        protocol: http
        routes:
          - kind: HTTPRoute
            name: my-app

    kind: HTTPRoute
    name: my-app
    rules:
      - path: /my-app
        ...
    gateways:
      - namespace: net-ops
        name: the-gateway

    kind: Service
    name: my-app
  18. How to get involved

    API sketch is here: link; talk today @ 3:20 pm: link
    Working group (coming soon, info will go out):
    - Bi-weekly meetings
    - SIG-NETWORK mailing list (for now)
    - Slack channel
    - github.com/kubernetes-sigs/service-apis
    Help wanted:
    - Feedback on the proposal (users AND implementers)
    - Experimental implementations
  19. Dual Stack: Goals

    • Support Pods with both IPv4 and IPv6 addresses:
      ◦ should be able to communicate IPv4 to IPv4 and IPv6 to IPv6
      ◦ should be able to access external servers on IPv4 and IPv6 addresses
      ◦ same pod can be targeted by IPv4 and IPv6 Services
    • NodePorts and External IPs should support IPv4 and IPv6 addresses
    • Services, Endpoints, and EndpointSlices can be either IPv4 or IPv6
    • Maintain backwards compatibility with IPv4-only and IPv6-only clusters
  20. Dual Stack: Complexity

    Dual stack implementation is a huge project; it affects:
    • Service, Ingress
    • Endpoints, EndpointSlice
    • Node
    • CRI, CNI, runtimes
    • Cluster configs
    • … and anything in the ecosystem that reads/writes/munges IPs (bash scripts, anyone?)
  21. Dual Stack: Alpha

    • First alpha release was 1.16, with updates in 1.17
    • All functionality hidden behind the IPv6DualStack feature gate (see the configuration sketch below)
    • Changes:
      ◦ Pods have a new PodIPs attribute that will have IPv4 and IPv6 addresses when the dual stack feature gate is enabled
      ◦ Services have a new IPFamily attribute that can be either IPv4 or IPv6
      ◦ Endpoints will only contain addresses matching the Service IPFamily
      ◦ EndpointSlices have IPv4 and IPv6 address types and will match the Service IPFamily
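
    A rough sketch of what enabling the alpha feature involves, assuming you control the control-plane manifests: the IPv6DualStack feature gate plus a second (IPv6) CIDR on the components that allocate pod addresses. The fragment below shows illustrative kube-controller-manager flags; the CIDR values are placeholders, and kube-apiserver, kubelet, and kube-proxy need the same feature gate.

    # Sketch: static Pod fragment for kube-controller-manager (placeholder values).
    containers:
      - name: kube-controller-manager
        command:
          - kube-controller-manager
          - --feature-gates=IPv6DualStack=true
          - --cluster-cidr=10.244.0.0/16,fd00:10:244::/56   # one IPv4 and one IPv6 pod CIDR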
  22. Dual Stack: In Practice

    Pod
    meta:
      name: foo
    spec: ...
    status:
      podIPs:
        - 10.1.2.3
        - 2001:db8::1234:5678

    Service
    meta:
      name: bar-v4
    spec:
      IPFamily: IPv4
      ...
      clusterIP: 10.2.3.4

    EndpointSlice
    meta:
      name: bar-v4-asdf
    spec:
      addressType: IPv4
      endpoints:
        - targetRef: {pod foo}
          addresses:
            - 10.1.2.3
  23. Dual Stack: In Practice

    Pod
    meta:
      name: foo
    spec: ...
    status:
      podIPs:
        - 10.1.2.3
        - 2001:db8::1234:5678

    Service
    meta:
      name: bar-v6
    spec:
      IPFamily: IPv6
      ...
      clusterIP: 2001:db8::2345:6789

    Service
    meta:
      name: bar-v4
    spec:
      IPFamily: IPv4
      ...
      clusterIP: 10.2.3.4

    EndpointSlice
    meta:
      name: bar-v4-asdf
    spec:
      addressType: IPv4
      endpoints:
        - targetRef: {pod foo}
          addresses:
            - 10.1.2.3

    EndpointSlice
    meta:
      name: bar-v6-asdf
    spec:
      addressType: IPv6
      endpoints:
        - targetRef: {pod foo}
          addresses:
            - 2001:db8::1234:5678
  24. Dual Stack: Kube Proxy

    • Kube Proxy integration is required to make all of this work
    • MetaProxier design:
      ◦ Run separate proxiers for each IP family
      ◦ Each proxier runs in IPv4 or IPv6 mode and only Endpoints matching that IP family are sent to that proxier
    • Currently this is limited to the IPVS proxier
    • Work underway for iptables
    • No support for Windows proxiers
  25. Dual Stack: Timeline

    • November 2014: First bits of IPv6 support (kube #2147)
    • 2017: IPv6 single-stack effort starts
    • June 2018: Initial dual-stack proposal (community #2254)
    • February 2019: Dual-stack updated to a KEP (#808)
    • May 2019: KEP #808 accepted, targeting Alpha for v1.16
    • August 2019: Single-stack IPv6 promoted to Beta, dual-stack Alpha for v1.16
    • November 2019: Dual-stack remains Alpha for v1.17
    • Early 2020: Dual-stack promoted to Beta for v1.18 (?)
  26. Dual Stack: Help Wanted

    • There’s a lot more to do, and lots of opportunities to help
    • Kube Proxy needs a dual stack implementation for Windows
    • For what has been released, does it work for you?
    • Does it work well with EndpointSlices enabled?
    • Have we missed something?
  27. Service Topology

    • "talk to the backend that's local to me"
    • What does "local" really mean?
      ◦ Same {node, rack, failure zone, region, cloud provider, …}
    • KEP #640 accepted December 2018
    • Initial feature-gated (Alpha) implementation merged last week (#72046)
      ◦ Adds an ordered TopologyKeys slice to Service.Spec
        ▪ topologyKeys: ["kubernetes.io/hostname", "topology.kubernetes.io/zone"]
      ◦ Proxy matches pod and node topology label values in TopologyKey order (see the sketch below)
    • Future: PodLocator resource added to address scalability issues
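
    A minimal sketch of a Service using the new field, assuming the ServiceTopology feature gate is enabled; the selector and port are placeholders, and the trailing "*" entry means "fall back to any endpoint".

    # Hypothetical Service with the alpha topologyKeys field.
    apiVersion: v1
    kind: Service
    metadata:
      name: my-app
    spec:
      selector:
        app: my-app
      ports:
        - port: 80
      topologyKeys:
        - "kubernetes.io/hostname"         # prefer endpoints on the same node
        - "topology.kubernetes.io/zone"    # then endpoints in the same zone
        - "*"                              # finally, any endpoint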
  28. SCTP

    • Stream Control Transmission Protocol (RFC 4960)
      ◦ Like a mashup of TCP and UDP with multi-homing and redundant paths
      ◦ Widely used in the telecommunications space
    • Implementation
      ◦ Sits alongside TCP and UDP as a core Kubernetes protocol (see the sketch below)
      ◦ Support added to NetworkPolicy, Service, Endpoints, HostPort
    • Schedule
      ◦ Promoted to Alpha in Kubernetes v1.12
      ◦ Promote to Beta in Kubernetes v1.18?
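
    A minimal sketch of a Service exposing an SCTP port, assuming the SCTPSupport feature gate is enabled; the names and port number are placeholders.

    # Hypothetical Service using SCTP as the port protocol (alpha feature).
    apiVersion: v1
    kind: Service
    metadata:
      name: sctp-app
    spec:
      selector:
        app: sctp-app
      ports:
        - name: sctp-port
          protocol: SCTP
          port: 3868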
  29. What does this all mean?

    • EndpointSlice: massive scale services and endpoints
    • Ingress V1, Service and Ingress evolution: better L4/L7 LB modeling of roles, a raised bar on feature support, extensibility
    • Dual Stack: you can use both IPv4 and IPv6 in your clusters
    • Service Topology: control over inter-zone/region traffic
    • SCTP: protocol support completeness