Slide 1

Slide 1 text

A Few Things About the Kubernetes Scheduler (SDN x Cloud Native #5)

Slide 2

Slide 2 text

About Me 白凱仁 (Kyle Bai) • Interested in emerging technologies. • COSCUP, Kubernetes Day and OpenStack Day speaker. • OpenStack and Kubernetes projects contributor (100+ PRs). • Certified Kubernetes Administrator. @kairen

Slide 3

Slide 3 text

Kubernetes Scheduler • The Kubernetes scheduler is a policy-rich, topology-aware, workload-specific function that significantly impacts availability, performance, and capacity. • The Kubernetes scheduler is in charge of scheduling pods onto nodes. Basically it works like this: • You create a pod. • The scheduler notices that the new pod you created doesn’t have a node assigned to it. • The scheduler assigns a node to the pod. P.S. It basically just needs to make sure every pod has a node assigned to it.

Slide 4

Slide 4 text

How does the scheduler work? • The scheduler watches the Kubernetes API and performs iterative steps to converge the current cluster state toward the declarative cluster model. • The scheduler keeps its cache up to date by receiving events from the API server.
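The cache-update idea can be sketched in a few lines. This is a minimal illustration, not the scheduler's actual code: the event tuples and the `apply_event` helper are hypothetical, and only the event type names (ADDED, MODIFIED, DELETED) follow the Kubernetes watch API.

```python
from typing import Dict, Tuple

# Hypothetical event shape: (event type, object name, object).
Event = Tuple[str, str, dict]


def apply_event(cache: Dict[str, dict], event: Event) -> None:
    """Apply one watch event to a local cache keyed by object name."""
    kind, name, obj = event
    if kind in ("ADDED", "MODIFIED"):
        cache[name] = obj        # insert or overwrite with the latest version
    elif kind == "DELETED":
        cache.pop(name, None)    # forget objects that no longer exist


cache: Dict[str, dict] = {}
events = [
    ("ADDED", "node-a", {"free_cpu": 4}),
    ("ADDED", "node-b", {"free_cpu": 2}),
    ("MODIFIED", "node-a", {"free_cpu": 3}),
    ("DELETED", "node-b", {}),
]
for event in events:
    apply_event(cache, event)
print(cache)
```

Scheduling decisions can then be made against the cache instead of querying the API server for every pod.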

Slide 5

Slide 5 text

How does the scheduler place Pods? The user creates a Pod via the API Server, and the API Server writes it to etcd.

Slide 6

Slide 6 text

How does the scheduler place Pods? The scheduler notices an “unbound” Pod and decides which node to run that Pod on. It writes that binding back to the API Server.

Slide 7

Slide 7 text

How does the scheduler place Pods? The Kubelet notices a change in the set of Pods that are bound to its node. It, in turn, runs the container via the container runtime (e.g. Docker).

Slide 8

Slide 8 text

How does the scheduler place Pods? The Kubelet monitors the status of the Pod via the container runtime. As things change, the Kubelet reflects the current status back to the API Server.

Slide 9

Slide 9 text

Zooming in on the scheduler's job ❶ Watch for pods that: • Are in the Pending phase • Have no Pod.Spec.NodeName assigned • Explicitly request our scheduler (pods that request none go to the default scheduler)
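The three watch conditions can be written as a single predicate. This is a sketch under stated assumptions: the `Pod` dataclass and `needs_scheduling` helper are hypothetical stand-ins for the real API objects, and `default-scheduler` is assumed as the name used when a pod requests no scheduler.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_SCHEDULER = "default-scheduler"


@dataclass
class Pod:
    name: str
    phase: str = "Pending"
    node_name: Optional[str] = None
    scheduler_name: Optional[str] = None  # None means "use the default"


def needs_scheduling(pod: Pod, my_name: str = DEFAULT_SCHEDULER) -> bool:
    """True if this scheduler should pick a node for the pod."""
    requested = pod.scheduler_name or DEFAULT_SCHEDULER
    return pod.phase == "Pending" and not pod.node_name and requested == my_name


pods = [
    Pod("web"),                                       # pending, unbound, default
    Pod("db", phase="Running", node_name="node-a"),   # already bound
    Pod("batch", scheduler_name="my-scheduler"),      # wants another scheduler
]
queue = [p.name for p in pods if needs_scheduling(p)]
print(queue)
```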

Slide 10

Slide 10 text

Zooming in on the scheduler's job ❷ Node selection algorithm (filter and rank): • PodFitsHostPorts • … • LeastRequestedPriority • …

Slide 11

Slide 11 text

Zooming in on the scheduler's job ❸ Post the Pod <=> Node binding to the API Server

Slide 12

Slide 12 text

Zooming in on the scheduler's job ❹ Profit!!!

Slide 13

Slide 13 text

The basic behavior of the scheduler • Scheduling: filter, followed by ranking. • Filter => predicate functions. • Rank => priority functions. • For each pod: • Filter out the nodes that lack the required resources. • Assign the pod to the “best” node, i.e. the one with the highest priority score. • If multiple nodes share the highest score, choose one at random.
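The filter-then-rank flow, including the random tie-break, can be sketched as follows. Everything here is illustrative: `pick_node`, `enough_cpu`, `most_free_cpu` and the dictionary-based pod and node shapes are hypothetical, not the scheduler's own types.

```python
import random


def pick_node(pod, nodes, predicates, priorities, rng=random):
    """Filter nodes with predicates, rank survivors, break ties at random."""
    feasible = [n for n in nodes if all(pred(pod, n) for pred in predicates)]
    if not feasible:
        return None  # the pod stays pending
    scores = {
        n["name"]: sum(weight * func(pod, n) for func, weight in priorities)
        for n in feasible
    }
    best = max(scores.values())
    return rng.choice(sorted(name for name, s in scores.items() if s == best))


def enough_cpu(pod, node):
    """Predicate: the node has at least the CPU the pod asks for."""
    return node["free_cpu"] >= pod["cpu"]


def most_free_cpu(pod, node):
    """Priority: prefer nodes with more free CPU."""
    return node["free_cpu"]


nodes = [
    {"name": "host1", "free_cpu": 1},
    {"name": "host2", "free_cpu": 4},
    {"name": "host3", "free_cpu": 4},
]
chosen = pick_node({"cpu": 2}, nodes, [enough_cpu], [(most_free_cpu, 1)])
print(chosen)  # host2 or host3, since both tie on the highest score
```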

Slide 14

Slide 14 text

The basic behavior of the scheduler [Figure: six candidate nodes, Host 1 to Host 6]

Slide 15

Slide 15 text

The basic behavior of the scheduler [Figure: the Predicate step is applied to Host 1 to Host 6]

Slide 16

Slide 16 text

The basic behavior of the scheduler [Figure: Host 1 to Host 6 go through the Predicate and Priority steps; Host 2 to Host 5 remain]

Slide 17

Slide 17 text

The basic behavior of the scheduler [Figure: Host 1 to Host 6 go through the Predicate and Priority steps; Host 2 to Host 5 remain, and Host 3 is selected]

Slide 18

Slide 18 text

Scheduler Logic Diagram

Slide 19

Slide 19 text

Filtering the nodes (predicate functions) • The purpose of filtering is to rule out the nodes that do not meet certain requirements of the Pod. • Currently, several "predicates" implement different filtering policies, including: • NoDiskConflict • PodFitsResources • PodFitsHostPorts • PodFitsHost • PodSelectorMatches • CheckNodeDiskPressure • NoVolumeZoneConflict • MatchNodeSelector • MaxEBSVolumeCount • MaxGCEPDVolumeCount • CheckNodeMemoryPressure

Slide 20

Slide 20 text

Volume filters • Do the zones of the pod's requested volumes fit the node's zone? • Can the node attach the volumes? • Are there conflicts with volumes that are already mounted? • Are there additional volume topology constraints?

Slide 21

Slide 21 text

Resource filters • Do the pod's requested resources (CPU, RAM, GPU, etc.) fit the node’s available resources? • Can the pod's requested ports be opened on the node? • Is the node free of memory and disk pressure?
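Two of these resource checks can be sketched as plain predicates. The function names echo the PodFitsResources and PodFitsHostPorts predicates listed earlier, but the bodies and the dictionary shapes are simplified assumptions, not the real implementations.

```python
def pod_fits_resources(pod, node):
    """Each requested resource must fit into what the node still has free."""
    for resource, wanted in pod["requests"].items():
        free = node["allocatable"].get(resource, 0) - node["requested"].get(resource, 0)
        if wanted > free:
            return False
    return True


def pod_fits_host_ports(pod, node):
    """No host port the pod wants may already be in use on the node."""
    wanted = set(pod.get("host_ports", []))
    return not (wanted & set(node.get("used_ports", [])))


# Hypothetical node: CPU in millicores, memory in MiB.
node = {
    "allocatable": {"cpu": 4000, "memory": 8192},
    "requested": {"cpu": 3000, "memory": 2048},
    "used_ports": [80],
}
small = {"requests": {"cpu": 500, "memory": 1024}, "host_ports": [8080]}
big = {"requests": {"cpu": 2000, "memory": 1024}}
web = {"requests": {"cpu": 100}, "host_ports": [80]}

print(pod_fits_resources(small, node), pod_fits_host_ports(small, node))
print(pod_fits_resources(big, node))
print(pod_fits_host_ports(web, node))
```

A node passes the filtering stage only if every configured predicate returns true for it.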

Slide 22

Slide 22 text

Topology filters • Is the pod requested to run on this node? • Are there inter-pod affinity constraints? • Does the node match the pod’s node selector? • Can the pod tolerate the node’s taints?

Slide 23

Slide 23 text

Ranking the nodes (priority functions) • Kubernetes prioritizes the remaining nodes to find the "best" one for the Pod. The prioritization is performed by a set of priority functions. • For example, given two priority functions, priorityFunc1 and priorityFunc2, with weighting factors weight1 and weight2 respectively, the final score of some NodeA is: finalScoreNodeA = (weight1 * priorityFunc1) + (weight2 * priorityFunc2)
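The weighted sum above translates directly into code. The two priority functions and their return values below are made up purely to show the arithmetic.

```python
def final_score(node, priorities):
    """Sum of weight * priority_func(node) over all priority functions."""
    return sum(weight * func(node) for func, weight in priorities)


def priority_func1(node):
    return node["score1"]  # hypothetical per-node score


def priority_func2(node):
    return node["score2"]  # hypothetical per-node score


node_a = {"score1": 7, "score2": 3}
weight1, weight2 = 1, 2
final_score_node_a = final_score(
    node_a, [(priority_func1, weight1), (priority_func2, weight2)]
)
print(final_score_node_a)  # (1 * 7) + (2 * 3) = 13
```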

Slide 24

Slide 24 text

Ranking the nodes (priority functions) • Currently, the scheduler provides some practical priority functions, including: • Least Requested Priority • Balanced Resource Allocation • Selector Spread Priority • Calculate Anti-Affinity Priority • Image Locality Priority • Node Affinity Priority
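As an illustration of one entry in this list, here is a sketch of a least-requested style score. The formula, (capacity - requested) * 10 / capacity averaged over CPU and memory, is an assumption about how Least Requested Priority behaved in scheduler versions of this era; the function name and inputs are hypothetical.

```python
def least_requested_score(capacity, requested):
    """Average, over CPU and memory, of the unused fraction scaled to 0..10."""
    per_resource = []
    for resource in ("cpu", "memory"):
        cap = capacity[resource]
        used = requested[resource]
        # A node with no capacity or already over-requested scores zero.
        score = 0.0 if cap == 0 or used > cap else (cap - used) * 10 / cap
        per_resource.append(score)
    return sum(per_resource) / len(per_resource)


# CPU in millicores, memory in GiB (hypothetical node).
score = least_requested_score(
    capacity={"cpu": 4000, "memory": 8},
    requested={"cpu": 1000, "memory": 4},
)
print(score)  # (7.5 + 5.0) / 2 = 6.25
```

Under this assumption, emptier nodes score higher, which spreads load across the cluster.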

Slide 25

Slide 25 text

Scheduling Scenarios - Resources • CPU, RAM, other (GPU) • Reserved resources • Requests and limits • Guaranteed • Best-Effort • Burstable
Pod:
  spec:
    resources:
      request: 1
      limit: 1
[Figure: Node A and Node B, with GPU0 and GPU1]
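The three classes follow from how requests and limits are set. The sketch below is a simplified reading of those rules: `qos_class` is a hypothetical helper, quantities are compared as plain strings (so "1" and "1000m" would not compare equal here), and only CPU and memory are considered.

```python
def qos_class(containers):
    """Derive a QoS class from per-container requests and limits."""
    has_any = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            has_any = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A missing request defaults to the limit when a limit is set.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not has_any:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


full = [{"requests": {"cpu": "1", "memory": "1Gi"},
         "limits": {"cpu": "1", "memory": "1Gi"}}]
partial = [{"requests": {"cpu": "500m"}}]
empty = [{}]

print(qos_class(full), qos_class(partial), qos_class(empty))
```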

Slide 26

Slide 26 text

Scheduling Scenarios - Constraints
 • Specify Pod.spec.nodeName field value. • Labels and node selectors • Taints and tolerations
Pod:
  spec:
    nodeName: NodeA
[Figure: Node A and Node B, with GPU0 and GPU1]

Slide 27

Slide 27 text

Scheduling Scenarios - Constraints
 • Specify Pod.spec.nodeName field value. • Labels and node selectors • Taints and tolerations
Pod (DaemonSet):
  spec:
    nodeSelector:
      backup: pornhub-data
Node A, Node B and Node C (each):
  metadata:
    labels:
      backup: pornhub-data
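Node selector matching boils down to a subset check on labels: every key/value pair in the selector must be present on the node. A minimal sketch, with a hypothetical helper name and made-up labels:

```python
def matches_node_selector(node_selector, node_labels):
    """True if every selector key/value pair appears in the node's labels."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


selector = {"disktype": "ssd"}  # hypothetical label
nodes = {
    "node-a": {"disktype": "ssd", "tier": "backend"},
    "node-b": {"disktype": "hdd"},
    "node-c": {},
}
eligible = [name for name, labels in nodes.items()
            if matches_node_selector(selector, labels)]
print(eligible)
```

An empty selector matches every node, which is why pods without a nodeSelector are unconstrained by this check.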

Slide 28

Slide 28 text

Taints and Tolerations Taints and tolerations work together to ensure that pods are not scheduled onto inappropriate nodes. Node conditions: • Key: condition category. • Value: specific condition. • Operator: value wildcard • Equal or Exists • Effect • NoSchedule: filter at scheduling time. • PreferNoSchedule: prioritize at scheduling time. • NoExecute: filter at scheduling time, evict if executing. • TolerationSeconds: time to tolerate “NoExecute” taint.
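The matching rules can be sketched as two small functions. The dictionary shapes and helper names are hypothetical; the logic follows the rules listed above (key, value, operator, effect), with an empty effect or key on the toleration treated as a wildcard.

```python
def tolerates(toleration, taint):
    """True if a single toleration covers a single taint."""
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True  # Exists ignores the value
    return toleration.get("value", "") == taint.get("value", "")


def can_schedule(tolerations, taints):
    """Hard taints must all be tolerated; PreferNoSchedule only affects ranking."""
    hard = [t for t in taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in tolerations) for taint in hard)


toleration = {"key": "error", "operator": "Equal", "value": "disk", "effect": "NoSchedule"}
taint_error = {"key": "error", "value": "disk", "effect": "NoSchedule"}
taint_error2 = {"key": "error2", "value": "disk", "effect": "NoSchedule"}
taint_soft = {"key": "slow", "value": "true", "effect": "PreferNoSchedule"}

print(can_schedule([toleration], [taint_error]))
print(can_schedule([toleration], [taint_error2]))
print(can_schedule([], [taint_soft]))
```

Note that the effect has to match as well: a toleration that names NoExecute does not cover a NoSchedule taint with the same key and value.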

Slide 29

Slide 29 text

Scheduling Scenarios - Constraints
 • Specify Pod.spec.nodeName field value. • Labels and node selectors • Taints and tolerations
Pod (DaemonSet):
  spec:
    tolerations:
    - key: error
      value: disk
      operator: Equal
      effect: NoExecute
      tolerationSeconds: 60
Node A and Node C (each):
  spec:
    taints:
    - effect: NoSchedule
      key: error
      value: disk
      timeAdded: null
Node B:
  spec:
    taints:
    - effect: NoSchedule
      key: error2
      value: disk
      timeAdded: null

Slide 30

Slide 30 text

Affinity Kubernetes also has more nuanced ways of expressing placement preferences, called nodeAffinity and podAffinity. They use automatic or user-defined metadata (labels) to dictate where pods are scheduled.

Slide 31

Slide 31 text

Scheduling Scenarios - Affinity
 • Node Affinity/Anti-Affinity • Anti: operator is NotIn. • Pod Affinity/Anti-Affinity
Pod:
  spec:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: ""
              operator: In
              values: ["us-central1-a"]
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
            - key: tier
              operator: In
              values:
              - backend
Node A:
  metadata:
    labels:
      us-central1-a
      tier: backend
Node B:
  metadata:
    labels:
      us-central2-a
      tier: backend
Node C:
  metadata:
    labels:
      us-central1-a
      tier: frontend
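How the required and preferred parts of node affinity combine can be sketched as below. This is an illustration only: the helper names are hypothetical, just the In and NotIn operators are covered, and the label key `zone` is an assumed stand-in because the slide leaves the zone key empty.

```python
def expr_matches(expr, labels):
    """Evaluate one matchExpression (only In and NotIn are sketched here)."""
    value = labels.get(expr["key"])
    if expr["operator"] == "In":
        return value in expr["values"]
    if expr["operator"] == "NotIn":
        return value not in expr["values"]
    raise ValueError("operator not covered by this sketch: " + expr["operator"])


def term_matches(term, labels):
    """All expressions inside one term must match (logical AND)."""
    return all(expr_matches(e, labels) for e in term["matchExpressions"])


def is_feasible(required_terms, labels):
    """Required rule: the node must satisfy at least one term (logical OR)."""
    return any(term_matches(t, labels) for t in required_terms)


def preference_score(preferred, labels):
    """Preferred rule: add up the weights of the preferences the node matches."""
    return sum(p["weight"] for p in preferred if term_matches(p["preference"], labels))


required = [{"matchExpressions": [
    {"key": "zone", "operator": "In", "values": ["us-central1-a"]}]}]
preferred = [{"weight": 1, "preference": {"matchExpressions": [
    {"key": "tier", "operator": "In", "values": ["backend"]}]}}]
nodes = {
    "A": {"zone": "us-central1-a", "tier": "backend"},
    "B": {"zone": "us-central2-a", "tier": "backend"},
    "C": {"zone": "us-central1-a", "tier": "frontend"},
}
feasible = sorted(n for n, labels in nodes.items() if is_feasible(required, labels))
scores = {n: preference_score(preferred, nodes[n]) for n in feasible}
print(feasible, scores)
```

With these assumed labels, the required rule rules out node B, and the preferred rule then ranks node A above node C.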

Slide 33

Slide 33 text

Scheduling Scenarios - Affinity
 • Node Affinity/Anti-Affinity • Anti: operator is NotIn. • Pod Affinity/Anti-Affinity
Pod (Replicas: 2):
  spec:
    affinity:
      podAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S1
          topologyKey:
[Figure: existing pods on Node B (security: S2), Node C (security: S1) and Node A (security: S1); topology labels shown: us-central1-2, us-central1-a, us-central1-a]

Slide 34

Slide 34 text

Scheduling Scenarios - Affinity
 • Node Affinity/Anti-Affinity • Anti: operator is NotIn. • Pod Affinity/Anti-Affinity
Pod (Replicas: 2):
  spec:
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: security
                operator: In
                values:
                - S2
            topologyKey:
[Figure: existing pods on Node B (security: S2), Node C (security: S1) and Node A (security: S1); topology labels shown: NodeC, NodeB, NodeA]
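Pod affinity and anti-affinity both ask the same question: which topology domains already host a pod that matches the selector? A rough sketch follows, with several simplifications that are assumptions of this example: the selector is a plain key/value map instead of matchExpressions, anti-affinity is treated as a hard rule even though the slide uses the preferred (soft) form, and the `zone` and `hostname` label keys are hypothetical since the slides leave topologyKey empty.

```python
def selector_matches(selector, labels):
    """Simplified label selector: every key/value pair must be present."""
    return all(labels.get(k) == v for k, v in selector.items())


def matching_domains(selector, topology_key, pods, nodes):
    """Topology domains that already run at least one matching pod."""
    domains = set()
    for pod in pods:
        if selector_matches(selector, pod["labels"]):
            domain = nodes[pod["node"]].get(topology_key)
            if domain is not None:
                domains.add(domain)
    return domains


def allowed_nodes(selector, topology_key, pods, nodes, anti=False):
    """Affinity keeps nodes inside a matching domain; anti-affinity keeps the rest."""
    domains = matching_domains(selector, topology_key, pods, nodes)
    result = []
    for name, labels in nodes.items():
        in_domain = labels.get(topology_key) in domains
        if in_domain != anti:
            result.append(name)
    return sorted(result)


nodes = {
    "node-a": {"zone": "zone-1", "hostname": "node-a"},
    "node-b": {"zone": "zone-2", "hostname": "node-b"},
    "node-c": {"zone": "zone-1", "hostname": "node-c"},
}
pods = [
    {"node": "node-a", "labels": {"security": "S1"}},
    {"node": "node-b", "labels": {"security": "S2"}},
]
near_s1 = allowed_nodes({"security": "S1"}, "zone", pods, nodes)
away_from_s2 = allowed_nodes({"security": "S2"}, "hostname", pods, nodes, anti=True)
print(near_s1, away_from_s2)
```

The choice of topology key sets the granularity: a zone-level key groups several nodes together, while a per-node key makes the rule apply to individual hosts.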

Slide 35

Slide 35 text

Extending the scheduler with policies • You can change the default scheduler policy by passing --policy-config-file to kube-scheduler. • If you want a pod to use a custom scheduler instead of the default kube-scheduler, set spec.schedulerName on the pod.
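To make both options concrete, the sketch below builds a small policy document and a pod manifest and prints them as JSON. The policy layout (kind Policy with `predicates` and `priorities` lists) reflects the legacy policy file format as best understood for scheduler versions of this era, so treat it as an assumption; the scheduler name `my-scheduler`, the pod name and the image are hypothetical.

```python
import json

# Assumed legacy policy layout: named predicates plus weighted priorities.
policy = {
    "kind": "Policy",
    "apiVersion": "v1",
    "predicates": [
        {"name": "PodFitsHostPorts"},
        {"name": "PodFitsResources"},
    ],
    "priorities": [
        {"name": "LeastRequestedPriority", "weight": 1},
        {"name": "BalancedResourceAllocation", "weight": 1},
    ],
}
policy_json = json.dumps(policy, indent=2)

# A pod that asks for a custom scheduler instead of the default one.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "demo"},
    "spec": {
        "schedulerName": "my-scheduler",
        "containers": [{"name": "app", "image": "nginx"}],
    },
}

print(policy_json)
print(json.dumps(pod, indent=2))
```

The printed policy could be saved to a file and passed via the flag mentioned above; pods that omit spec.schedulerName keep using the default scheduler.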

Slide 36

Slide 36 text

Advanced Kubernetes Scheduling • Resource Quality of Service proposal. • Resource limits and Oversubscription. • Admission control limit range proposal. • Pod Priority and Preemption.

Slide 37

Slide 37 text

Experimental features - Pod Priority • Pod priority (alpha in 1.10+, present since 1.8+) • Preemption: evict less important pods (if needed) to fit important ones. • Scheduling priority (since 1.9) in the queue of Pending pods. • Out-of-resource eviction: if the node starts to run out of resources, it evicts less important pods first. • PriorityClassName: system-node-critical (DaemonSet, StatefulSet), system-cluster-critical (Deployment).
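The preemption idea can be illustrated with a greedy victim selection on a single node. This is a deliberately simplified sketch with hypothetical names and CPU as the only resource; real preemption logic weighs more factors than this.

```python
def pick_victims(pending, running, free_cpu):
    """Evict lowest-priority pods first until the pending pod fits.

    Returns the list of victim names, or None if preemption cannot help.
    """
    victims = []
    candidates = sorted(
        (p for p in running if p["priority"] < pending["priority"]),
        key=lambda p: p["priority"],
    )
    for pod in candidates:
        if free_cpu >= pending["cpu"]:
            break
        victims.append(pod["name"])
        free_cpu += pod["cpu"]
    return victims if free_cpu >= pending["cpu"] else None


running = [
    {"name": "low", "priority": 0, "cpu": 1},
    {"name": "mid", "priority": 100, "cpu": 1},
    {"name": "high", "priority": 2000, "cpu": 2},
]
important = {"name": "important", "priority": 1000, "cpu": 2}
victims = pick_victims(important, running, free_cpu=0)
print(victims)
```

Pods with a priority equal to or higher than the pending pod are never considered as victims in this sketch.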

Slide 38

Slide 38 text

Experimental features - TaintBasedEvictions • TaintBasedEvictions (alpha) • NoExecute: Representing node problems dynamically using taints. • tolerationSeconds: If your pod has “expensive” local state and there is a chance of recovery, you can tolerate the node failure for a while.
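The effect of tolerationSeconds on a NoExecute taint can be expressed as a small deadline calculation. The helper name and its arguments are hypothetical; the three cases follow the behavior described above.

```python
from datetime import datetime, timedelta
from typing import Optional


def eviction_deadline(
    taint_added: datetime,
    tolerated: bool,
    toleration_seconds: Optional[int] = None,
) -> Optional[datetime]:
    """When a pod on a node with a NoExecute taint would be evicted."""
    if not tolerated:
        return taint_added   # no matching toleration: evicted right away
    if toleration_seconds is None:
        return None          # tolerated without a time limit: never evicted
    return taint_added + timedelta(seconds=toleration_seconds)


added = datetime(2018, 1, 1, 12, 0, 0)
print(eviction_deadline(added, tolerated=False))
print(eviction_deadline(added, tolerated=True))
print(eviction_deadline(added, tolerated=True, toleration_seconds=60))
```

If the taint is removed before the deadline, for example because the node recovers, the pod simply keeps running.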

Slide 39

Slide 39 text

Scheduler Status

Slide 41

Slide 41 text

Thank you for your attention!! Q & A