Slide 1

Slide 1 text

Bowei Du <@bowei> Tim Hockin <@thockin> Vallery Lancey <@vllry> SIG-Network Intro & Deep-dive 1

Slide 2

Slide 2 text

Agenda Part 1: Intro ● An overview of the SIG and the “basics” ● If you are new to Kubernetes or not very familiar with the things that our SIG deals with - this is for you! Part 2: Deep-dive ● A deeper look at some of the newest work that the SIG has been doing ● If you are already comfortable with Kubernetes networking concepts and want to see what’s next - this is for you! 2

Slide 3

Slide 3 text

Agenda Part 1: Intro 3

Slide 4

Slide 4 text

What, When, Where Responsible for the Kubernetes network components ● Pod networking within and between nodes ● Service abstractions ● Ingress and egress ● Network policies and access control Zoom meeting: Every other Thursday, at 21:00 UTC Slack: #sig-network (slack.k8s.io) https://git.k8s.io/community/sig-network (Don’t worry, we’ll show this again at the end) 4

Slide 5

Slide 5 text

APIs Service, Endpoints, EndpointSlice ● Service registration & discovery Ingress ● L7 HTTP routing Gateway ● Next-generation HTTP routing and service ingress NetworkPolicy ● Application “firewall” 5

Slide 6

Slide 6 text

Components Kubelet CNI implementation ● Low-level network drivers and how they are used Kube-proxy ● Implements Service API Controllers ● Endpoints and EndpointSlice ● Service load-balancers ● IPAM DNS ● Name-based discovery 6

Slide 7

Slide 7 text

Networking model All Pods can reach all other Pods, across Nodes Sounds simple, right? Many implementations ● Flat ● Overlays (e.g. VXLAN) ● Routing config (e.g. BGP) One of the more common things people struggle with 7

Slide 8

Slide 8 text

Services: problem Pod client Serving app Pod svr-1 Pod svr-2 Pod svr-3 Client connects to a server instance - Which one? 8

Slide 9

Slide 9 text

Services: problem Pod client Serving app Pod svr-1 Pod svr-2 Pod svr-3 Server instance goes down for some reason 9

Slide 10

Slide 10 text

Services: problem Pod client Serving app Pod svr-1 Pod svr-2 Pod svr-3 Client has to connect to a different server instance - Again, which one? 10

Slide 11

Slide 11 text

Services: abstraction Pod client Serving app Pod svr-1 Pod svr-2 Pod svr-3 Client connects to the abstract Service Service Service “hides” backend details 11

Slide 12

Slide 12 text

Services Pod IPs are ephemeral “I have a group of servers and I need clients to find them” Services “expose” a group of pods ● Durable VIP (or not, if you choose) ● Port and protocol ● Used to build service discovery ● Can include load balancing (but doesn’t have to) 12

Slide 13

Slide 13 text

Node Services: what really happens? Pod client Serving app Pod svr-1 Pod svr-2 Pod svr-3 Client connects to the abstract Service Proxy (iptables, ipvs, etc) Service “hides” backend details 13

Slide 14

Slide 14 text

Services: what really happens? Pod client Client does DNS query (DNS is a Service, too) Proxy DNS Pod svr 14

Slide 15

Slide 15 text

Services: what really happens? Pod client Proxy DNS Pod svr DNS returns service VIP 15

Slide 16

Slide 16 text

Services: what really happens? Pod client Proxy DNS Pod svr Client connects to VIP 16

Slide 17

Slide 17 text

Services: what really happens? Pod client Proxy DNS Pod svr Proxy translates VIP to pod IP 17

Slide 18

Slide 18 text

Services: what really happens? Pod client Proxy DNS Pod svr Service Endpoints Controller Async: controllers use service and endpoints APIs to populate DNS and proxies 18

Slide 19

Slide 19 text

Node Services: what really happens? Pod client Serving app Pod svr-1 Pod svr-2 Pod svr-3 Client connects to the service VIP Proxy Service “hides” backend details 19

Slide 20

Slide 20 text

Node Services: what really happens? Pod client Serving app Pod svr-1 Pod svr-2 Pod svr-3 Proxy Backend goes down 20

Slide 21

Slide 21 text

Node Services: what really happens? Pod client Serving app Pod svr-1 Pod svr-2 Pod svr-3 Client re-connects to the service VIP Proxy Service “hides” backend details 21

Slide 22

Slide 22 text

Services: what you specify

kind: Service
apiVersion: v1
metadata:
  name: my-service        # used for discovery (e.g. DNS)
  namespace: default
spec:
  selector:
    app: my-app           # which pods to use
  ports:
  - port: 80              # logical port (for clients)
    targetPort: 9376      # port on the backend pods

Slide 23

Slide 23 text

Services: what you get

kind: Service
apiVersion: v1
metadata:
  name: my-service
  namespace: default
spec:
  type: ClusterIP         # default
  clusterIP: 10.9.3.76    # allocated
  selector:
    app: my-app
  ports:
  - protocol: TCP         # default
    port: 80
    targetPort: 9376

Slide 24

Slide 24 text

Endpoints Represents the list of IPs “behind” a Service ● Usually Pods, but not always Recall that Service had port and targetPort fields ● Can “remap” ports Generally managed by the system ● But can be manually managed in some cases 24
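
To make this concrete, here is a minimal sketch of the Endpoints object that would correspond to the my-service example from the earlier slides; the pod IPs are the ones used on the following slides, and everything else simply mirrors the Service above:

kind: Endpoints
apiVersion: v1
metadata:
  name: my-service         # same name as the Service
  namespace: default
subsets:
- addresses:               # the pod IPs "behind" the Service
  - ip: 10.1.0.1
  - ip: 10.1.7.6
  ports:
  - port: 9376             # the Service's targetPort
    protocol: TCP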

Slide 25

Slide 25 text

Endpoints controller(s) Service { name: foo selector: app: foo ports: - port: 80 targetPort: 9376 } 25

Slide 26

Slide 26 text

Endpoints controller(s) Pod { labels: app: foo ip: 10.1.0.1 } Service { name: foo selector: app: foo ports: - port: 80 targetPort: 9376 } Pod { labels: app: bar ip: 10.1.0.2 } Pod { labels: app: foo ip: 10.1.9.3 } Pod { labels: app: qux ip: 10.1.1.8 } Pod { labels: app: foo ip: 10.1.7.6 } 26

Slide 27

Slide 27 text

Endpoints controller(s) Pod { labels: app: foo ip: 10.1.0.1 } Service { name: foo selector: app: foo ports: - port: 80 targetPort: 9376 } Pod { labels: app: bar ip: 10.1.0.2 } Pod { labels: app: foo ip: 10.1.9.3 } Pod { labels: app: qux ip: 10.1.1.8 } Pod { labels: app: foo ip: 10.1.7.6 } 27

Slide 28

Slide 28 text

Endpoints controller(s) Pod { labels: app: foo ip: 10.1.0.1 } Service { name: foo selector: app: foo ports: - port: 80 targetPort: 9376 } Pod { labels: app: bar ip: 10.1.0.2 } Pod { labels: app: foo ip: 10.1.9.3 } Pod { labels: app: qux ip: 10.1.1.8 } Pod { labels: app: foo ip: 10.1.7.6 } Endpoints { name: foo ports: - port: 9376 addresses: - 10.1.0.1 - 10.1.7.6 - 10.1.9.3 } 28

Slide 29

Slide 29 text

DNS Starts with a specification ● A, AAAA, SRV, PTR record formats Generally runs as pods in the cluster ● But doesn’t have to Generally exposed by a Service VIP ● But doesn’t have to be Containers are configured by kubelet to use kube-dns ● Search paths make using it even easier Default implementation is CoreDNS 29
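
As a concrete illustration of how kubelet configures containers (assuming typical defaults, not something shown on the slide): with the conventional cluster.local domain and a kube-dns ClusterIP of 10.0.0.10, the resolv.conf injected into a pod in the default namespace usually looks roughly like this; the search paths are what let clients use short names like my-service:

nameserver 10.0.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5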

Slide 30

Slide 30 text

Services: DNS my-service.default.svc.cluster.local ● my-service: the name of your service ● default: the namespace your service lives in ● svc: indicates a service name ● cluster.local: the cluster’s DNS zone 30

Slide 31

Slide 31 text

kube-proxy Default implementation of Services ● But can be replaced! Runs on every Node in the cluster Uses the node as a proxy for traffic from pods on that node ● iptables, IPVS, winkernel, or userspace options ● Linux: iptables & IPVS are best choice (in-kernel) Transparent to consumers 31

Slide 32

Slide 32 text

Kube-proxy: control path Watch Services and Endpoints Apply some filters ● E.g. ignore “headless” services Link Endpoints (backends) with Services (frontends) Accumulate changes to both Update node rules 32

Slide 33

Slide 33 text

Kube-proxy: data path Recognize service traffic ● E.g. Destination VIP and port Choose a backend ● Consider client affinity if requested Rewrite packets to new destination (DNAT) Un-DNAT on response 33
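
kube-proxy's real iptables ruleset uses its own chains (KUBE-SERVICES, KUBE-SVC-*, KUBE-SEP-*) and probability matches to spread traffic across backends; the single hand-written rule below is only a sketch of the DNAT step, reusing the VIP (10.9.3.76:80) and one pod (10.1.0.1:9376) from the earlier example slides:

# Hand-written sketch only: translate the Service VIP to one backend pod IP.
iptables -t nat -A PREROUTING -d 10.9.3.76/32 -p tcp --dport 80 \
  -j DNAT --to-destination 10.1.0.1:9376
# Replies are reverse-translated (un-DNATed) automatically by conntrack.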

Slide 34

Slide 34 text

Kube-proxy: FAQ Q: Why not just use DNS-RR? A: DNS clients are generally “broken” and don’t handle changes to DNS records well. This provides a stable IP while backends change Q: My clients are enlightened, can I opt-out? A: Yes! Headless Services get a DNS name but no VIP. 34
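
A headless Service is simply a Service with clusterIP set to None; a minimal sketch, reusing the my-service example from earlier:

kind: Service
apiVersion: v1
metadata:
  name: my-service
  namespace: default
spec:
  clusterIP: None          # "headless": no VIP, no proxying
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 9376

DNS then returns the ready pod IPs directly (A/AAAA records per endpoint) instead of a single VIP, so enlightened clients can do their own backend selection.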

Slide 35

Slide 35 text

Service LoadBalancers Services are also how you configure L4 load-balancers Different LBs work in different ways, too broad for this talk Integrations with most cloud providers 35
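
The L4 load-balancer is requested through the same Service resource, just with a different type; a minimal sketch reusing the earlier example (the cloud provider's controller then records the provisioned address in the Service's status):

kind: Service
apiVersion: v1
metadata:
  name: my-service
spec:
  type: LoadBalancer       # ask the cloud provider for an external L4 LB
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 9376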

Slide 36

Slide 36 text

Ingress Describes an HTTP proxy and routing rules ● Simple API - match hostnames and URL paths ● Too simple, more on this later Targets a Service for each rule Kubernetes defines the API, but implementations are 3rd party Integrations with most clouds and popular software LBs 36
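
A minimal sketch of the hostname and path matching shown in the diagram on the next slide, written against the networking.k8s.io/v1 Ingress API; the host, paths, and Service names come from that diagram, while the Ingress name, pathType, and port numbers are illustrative assumptions:

kind: Ingress
apiVersion: networking.k8s.io/v1
metadata:
  name: foo-ingress
spec:
  rules:
  - host: foo.com
    http:
      paths:
      - path: /foo
        pathType: Prefix
        backend:
          service:
            name: foo-svc
            port:
              number: 80
      - path: /bar
        pathType: Prefix
        backend:
          service:
            name: bar-svc
            port:
              number: 80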

Slide 37

Slide 37 text

Ingress Ingress { hostname: foo.com paths: - path: /foo service: foo-svc - path: /bar service: bar-svc } Service: { name: foo-svc selector: app: foo } Service: { name: bar-svc selector: app: bar } Pod { labels: app: foo ip: 10.1.0.1 } Pod { labels: app: bar ip: 10.1.0.2 } Pod { labels: app: bar ip: 10.1.9.3 } Pod { labels: app: foo ip: 10.1.7.6 } 37

Slide 38

Slide 38 text

Ingress FAQ Q: How is this different from Service LoadBalancer? A: Service LB API does not provide for HTTP - no hostnames, no paths, no TLS, etc. Q: Why isn’t there a controller “in the box”? A: We didn’t want to be “picking winners” among the software LBs. That may have been a mistake, honestly. 38

Slide 39

Slide 39 text

NetworkPolicy Describes the allowed call-graph for communications ● E.g. frontends can talk to backends, backends to DB, but never frontends to DB Like Ingress, implementations are 3rd-party ● Often highly coupled to low-level network drivers Very simple rules - focused on app-owners rather than cluster or network admins ● We may need a related-but-different API for the cluster operators 39
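
A minimal sketch of the "backends may talk to the DB, frontends may not" rule from the diagram on the following slides, assuming the tiers are labeled tier: db / tier: be / tier: fe and the database listens on TCP 5432 (the labels and port are illustrative assumptions, not from the slides):

kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: db-allow-backend-only
spec:
  podSelector:
    matchLabels:
      tier: db             # this policy applies to the DB pods
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          tier: be         # only backend pods may connect
    ports:
    - protocol: TCP
      port: 5432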

Slide 40

Slide 40 text

NetworkPolicy DB DB DB BE BE FE FE FE FE FE 40

Slide 41

Slide 41 text

NetworkPolicy DB DB DB BE BE FE FE FE FE FE 41

Slide 42

Slide 42 text

Agenda Part 2: Deep-dive 42

Slide 43

Slide 43 text

Deep-dive On-going work in the SIG: ● NodeLocal DNS ● EndpointSlice ● Services (Gateway API, MultiClusterService) ● IPv{4,6} Dual stack 43

Slide 44

Slide 44 text

NodeLocal DNS Kubernetes DNS resource cost is high: ● Expansion due to alias names (“my-service”, “my-service.ns”, ...) ● Application density (e.g. microservices) ● DNS-heavy application libraries (e.g. Node.js) ● CONNTRACK entries due to UDP Solution? NodeLocal DNS (GA v1.18) ● Run a cache on every node ● Careful: per-node overhead can easily dominate in large clusters As a system-critical service running in a DaemonSet, we need to be careful about high availability during upgrades and failures. 44

Slide 45

Slide 45 text

Node NodeLocal DNS kube-dns / CoreDNS kube-dns / CoreDNS kube-dns / CoreDNS App Pods App Pods My Pod DNS: 10.0.0.10 kube-dns kube-proxy 10.0.0.10 45

Slide 46

Slide 46 text

Node NodeLocal DNS kube-dns / CoreDNS kube-dns / CoreDNS kube-dns / CoreDNS NodeLocalDNS App Pods App Pods My Pod DNS: 10.0.0.10 kube-dns kube-proxy kube-dns-upstream dummy iface 10.0.0.10 10.0.x.x NOTRACK 10.0.0.10 169.x.x.x 46

Slide 47

Slide 47 text

Node NodeLocal DNS kube-dns / CoreDNS kube-dns / CoreDNS kube-dns / CoreDNS NodeLocalDNS App Pods App Pods My Pod DNS: 10.0.0.10 kube-dns kube-proxy kube-dns-upstream dummy iface 10.0.0.10 10.0.x.x NOTRACK 10.0.0.10 169.x.x.x 47

Slide 48

Slide 48 text

DNS We can do better though: ● Proposal: push alias expansion into the server as an API (enhancements/pull/967) ● Refactor the DNS naming scheme altogether? 48

Slide 49

Slide 49 text

EndpointSlice Larger clusters (think 15k nodes) and very large Services lead to API scalability issues: ● Size of a single object in etcd ● Amount of data sent to watchers ● etcd DB activity Source: Scale Kubernetes Service Endpoints 100X, (Tyczynski, Xia) 49

Slide 50

Slide 50 text

EndpointSlice (diagram: each change to a Service’s single, large Endpoints object re-sends the full list of backend IPs to every kube-proxy) 50

Slide 51

Slide 51 text

EndpointSlice (diagram: the same backends split across many small EndpointSlice objects; an update touches one slice, and only that slice is re-sent to the kube-proxies) 51

Slide 52

Slide 52 text

EndpointSlice controllers

kind: Service
metadata:
  name: foo
spec: …

kind: EndpointSlice
metadata:
  name: foo-xfz1
  labels:
    kubernetes.io/service-name: foo
endpoints:
- addresses:
  - ip: 10.1.0.7
…

kind: EndpointSlice
metadata:
  name: foo-fzew2
  labels:
    kubernetes.io/service-name: foo
endpoints:
- addresses:
  - ip: 10.1.0.1
…

kind: EndpointSlice
…

EndpointSlice controller: creates slices from the Service selector; slices are linked to the Service via the kubernetes.io/service-name label
EndpointSliceMirroring controller: creates slices from selectorless Services’ Endpoints
Other users can set endpointslice.kubernetes.io/managed-by
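
Filling in the fields the slide elides, a complete slice for foo might look roughly like this; it is a sketch written against the discovery.k8s.io/v1beta1 API that was current at the time, and the addressType, port, and readiness condition are illustrative additions (only the name, label, and address come from the slide):

kind: EndpointSlice
apiVersion: discovery.k8s.io/v1beta1
metadata:
  name: foo-xfz1
  labels:
    kubernetes.io/service-name: foo
addressType: IPv4
ports:
- protocol: TCP
  port: 9376
endpoints:
- addresses:
  - "10.1.0.7"
  conditions:
    ready: true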

Slide 53

Slide 53 text

EndpointSlice The update algorithm is an optimization problem: ● Keep the number of slices low ● Minimize changes to slices per update ● Keep the amount of data sent low Current algorithm: 1. Remove stale endpoints from existing slices 2. Fill new endpoints into free space 3. Create new slices only if there is no more room No active rebalancing -- the claim is that it would cause too much churn; this remains an open area 53

Slide 54

Slide 54 text

EndpointSlice timeline ● v1.17: Beta; EndpointSlice controller available ● v1.18: Beta; EndpointSlice controller enabled, no kube-proxy ● v1.19: Beta; EndpointSlice controller, EndpointSliceMirroring, Windows kube-proxy enabled ● v1.20: GA 54

Slide 55

Slide 55 text

Services across Clusters As Kubernetes installations get bigger, running multiple clusters is becoming the norm ● LOTS of reasons for this: HA, blast radius, geography, etc. Services have always been a cluster-centric abstraction Starting to work through how to export and extend Services across clusters 55

Slide 56

Slide 56 text

Cluster A Services across Clusters Namespace frontend Service: { name: fe-svc } Cluster B Namespace backend Service: { name: be-svc } Pod Pod Pod Pod 56

Slide 57

Slide 57 text

ServiceExport

Service {
  metadata:
    name: be-svc
  spec:
    type: ClusterIP
    clusterIP: 1.2.3.4
}

ServiceExport {
  metadata:
    name: be-svc
}

Slide 58

Slide 58 text

Group Cluster A Services across Clusters Namespace frontend Service: { name: fe-svc } Cluster B Namespace backend Service: { name: be-svc } Pod Pod Pod Pod 58

Slide 59

Slide 59 text

Group Cluster A Services across Clusters Cluster B Namespace frontend Service: { name: fe-svc } Namespace backend Service: { name: be-svc } Pod Pod Pod Pod 59

Slide 60

Slide 60 text

Group Cluster A Services across Clusters Cluster B Namespace frontend Service: { name: fe-svc } Namespace backend Service: { name: be-svc } ServiceImport: { name: be-svc } Pod Pod Pod Pod 60

Slide 61

Slide 61 text

ServiceImports: DNS be-svc.backend.supercluster.local ● be-svc: the name of your service ● backend: the namespace your service lives in ● supercluster.local: the multi-cluster DNS zone (TBD) 61

Slide 62

Slide 62 text

Group Cluster B Cluster A Namespace backend Service: { name: be-svc } Services across Clusters Namespace frontend Service: { name: fe-svc } Service: { name: be-svc } ServiceImport Pod Pod Pod Pod Pod Pod ServiceImport 62

Slide 63

Slide 63 text

Services across Clusters This is mostly KEP-ware right now Still hammering out API, names, etc Still working out some semantics (e.g. conflicts) 63

Slide 64

Slide 64 text

IPv{4,6} Dual Stack Some users need IPv4 and IPv6 at the same time ● Kubernetes only supports 1 Pod IP Some users need Services with both IP families ● Kubernetes only supports 1 Service IP This is a small, but important change to several APIs Wasn’t this work done already? Yes, but we found some problems, needed a major reboot 64

Slide 65

Slide 65 text

IPv{4,6} Dual Stack Pod { status: podIP: 1.2.3.4 } 65

Slide 66

Slide 66 text

IPv{4,6} Dual Stack Pod { status: podIP: 1.2.3.4 podIPs: - 1.2.3.4 - 1234:5678::0001 } same new 66

Slide 67

Slide 67 text

IPv{4,6} Dual Stack Node { spec: podCIDR: 10.9.8.0/24 } 67

Slide 68

Slide 68 text

IPv{4,6} Dual Stack Node { spec: podCIDR: 10.9.8.0/24 podCIDRs: - 10.9.8.0/24 - 1234:5678::/96 } same new 68

Slide 69

Slide 69 text

IPv{4,6} Dual Stack Service { spec: type: ClusterIP clusterIP: 1.2.3.4 } 69

Slide 70

Slide 70 text

IPv{4,6} Dual Stack Service { spec: type: ClusterIP ipFamilyPolicy: PreferDualStack ipFamilies: [ IPv4, IPv6 ] clusterIP: 1.2.3.4 clusterIPs: - 1.2.3.4 - 1234:5678::0001 } same new new 70

Slide 71

Slide 71 text

IPv{4,6} Dual Stack Can express various requirements: ● “I need single-stack” ● “I’d like dual-stack, if it is available” ● “I need dual-stack” Defaults to single-stack if the user doesn’t express a requirement Works for headless Services, NodePorts, and LBs (if the cloud provider supports it) Shooting for a second alpha in 1.20 71
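
These requirements map to the ipFamilyPolicy values SingleStack, PreferDualStack, and RequireDualStack. A minimal sketch of a Service that must be dual-stack; the selector and port are illustrative, the dual-stack fields are the ones shown on the previous slide:

kind: Service
apiVersion: v1
metadata:
  name: my-service
spec:
  ipFamilyPolicy: RequireDualStack   # "I need dual-stack"
  ipFamilies:
  - IPv4
  - IPv6
  selector:
    app: my-app
  ports:
  - port: 80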

Slide 72

Slide 72 text

Services V+1 The Service resource describes many things: ● Method of exposure (ClusterIP, NodePort, LoadBalancer) ● Grouping of Pods (e.g. selector) ● Attributes (ExternalTrafficPolicy, SessionAffinity, …) Evolving and extending the resource becomes harder and harder due to interactions between fields… Evolution of the L7 Ingress API: role-based resource modeling, extensibility Service hierarchy: (Headless) ClusterIP → NodePort → LoadBalancer 72

Slide 73

Slide 73 text

Services V+1 Idea: decouple along the role and concept axes. Roles: ● Infrastructure Provider ● Cluster Operator / NetOps ● Application Developer Concepts: ● Grouping, selection ● Routing, protocol-specific attributes ● Exposure and access Resources: GatewayClass, Gateway, *Route, Service 73

Slide 74

Slide 74 text

Services V+1: GatewayClass (Infrastructure Provider)

Defines a kind of Service access for the cluster (e.g. “internal-proxy”, “internet-lb”, …). Similar to StorageClass, it abstracts the implementation mechanism from the consumer.

kind: GatewayClass
metadata:
  name: cluster-gateway
spec:
  controller: "acme.io/gateway-controller"
  parametersRef:
    name: internet-gateway

Slide 75

Slide 75 text

Services V+1: Gateway (Cluster Operator / NetOps) How the Service(s) are accessed by the user (e.g. port, protocol, addresses). Keystone resource: 1-1 with configuration of the infrastructure: ● Spawn a software LB ● Add a configuration stanza to an LB ● Program the SDN May be “underspecified”: defaults based on the GatewayClass. 75

Slide 76

Slide 76 text

Services V+1: Gateway

kind: Gateway
metadata:
  name: my-gateway
spec:
  class: cluster-gateway
  # How Gateway is to be accessed (e.g. via Port 80)
  listeners:
  - port: 80
  # Which Routes are linked to this Gateway
  routes:
  - routeSelector:
      foo: bar

Slide 77

Slide 77 text

Services V+1: *Route (Application Developer)

Application routing and composition, e.g. “/search” → search-service, “/store” → store-service. A family of Resource types by protocol (TCPRoute, HTTPRoute, …) to solve the issue of a single, closed union type and extensibility.

kind: HTTPRoute
metadata:
  name: my-app
spec:
  rules:
  - match: {path: "/store"}
    action: {forwardTo: {targetRef: "store-service"}}

Slide 78

Slide 78 text

Services V+1 What about Service? ● Grouping, selection ● V1 functionality still works -- but hopefully we will not have to add significantly to the existing surface area. 78

Slide 79

Slide 79 text

Services V+1: how the resources link together

kind: GatewayClass
name: internet-lb
...

kind: Gateway
namespace: net-ops
name: the-gateway
class: internet-lb
listeners:
- port: 80
  protocol: http
routes:
- kind: HTTPRoute
  name: my-app

kind: HTTPRoute
name: my-app
rules:
- path: /my-app
  ...
gateways:
- namespace: net-ops
  name: the-gateway

kind: Service
name: my-app

Slide 80

Slide 80 text

80 Services V+1 Initial v1alpha1 cut soon: ● Basic applications, data types ● GatewayClass for interoperation between controllers. ● Gateway + Route ○ HTTP, TCP ○ HTTPS + server certificates+secrets ● Implementability: ○ Merging style (multiple Gateways hosted on single* proxy infra) ○ Provisioning/Cloud (Gateways mapped to externally managed resources)

Slide 81

Slide 81 text

Agenda Wrapping up 81

Slide 82

Slide 82 text

Issues https://issues.k8s.io File bugs, cleanup ideas, and feature requests Find issues to help with! ● Especially those labelled “good first issue” and “help wanted”. ● Triage issues (is this a real bug?) labelled “triage/unresolved”. 82

Slide 83

Slide 83 text

Enhancements https://git.k8s.io/enhancements/keps/sig-network “Enhancements” are user-visible changes (features + functional changes) ● Participate in enhancement dialogue and planning ○ More eyeballs are always welcome ● Submit enhancement proposals of your own! 83

Slide 84

Slide 84 text

Get involved! https://git.k8s.io/community/sig-network Zoom meeting: Every other Thursday, 21:00 UTC Slack: #sig-network (slack.k8s.io) Mailing List: https://groups.google.com/forum/#!forum/kubernetes-sig-network 84

Slide 85

Slide 85 text

85