
Kubernetes Scheduler 兩三事 (A Few Things About the Kubernetes Scheduler)

Kyle Bai

May 19, 2018

Transcript

  1. About Me 白凱仁 (Kyle Bai) • Interested in emerging technologies.
     • COSCUP, Kubernetes Day, and OpenStack Day speaker.
     • OpenStack and Kubernetes projects contributor (100+ PRs).
     • Certified Kubernetes Administrator. @kairen ([email protected]) https://kairen.github.io/
  2. Kubernetes Scheduler • The Kubernetes scheduler is a policy-rich, topology-aware, workload-specific
     function that significantly impacts availability, performance, and capacity.
     • The Kubernetes scheduler is in charge of scheduling pods onto nodes. Basically it works like this:
     • You create a pod. • The scheduler notices that the new pod you created doesn’t have a node assigned to it.
     • The scheduler assigns a node to the pod. P.S. It basically just needs to make sure every pod has a node assigned to it.
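     As a sketch of that flow: the Pod below is created with no spec.nodeName, so it sits in the Pending
     phase until the scheduler binds it to a node. The name and image are only illustrative.
       apiVersion: v1
       kind: Pod
       metadata:
         name: nginx              # hypothetical pod name
       spec:
         # spec.nodeName is deliberately absent; the scheduler fills it in
         containers:
         - name: nginx
           image: nginx:1.15
     Once scheduling has happened, kubectl get pod nginx -o jsonpath='{.spec.nodeName}' prints the chosen node.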
  3. How does the scheduler work? • The scheduler watches the Kubernetes API and performs iterative steps
     to converge the current cluster state toward the declared cluster model.
     • The scheduler keeps its cache updated by receiving events from the API server.
  4. How does the scheduler place Pods? The user creates a Pod via the API server, and the API server writes it to etcd.
  5. How does the scheduler place Pods? The scheduler notices an “unbound” Pod and decides which node to
     run that Pod on. It writes that binding back to the API server.
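     Concretely, that “binding” is a Binding object posted to the Pod’s binding subresource; a rough sketch
     with hypothetical pod and node names:
       # POSTed to /api/v1/namespaces/default/pods/nginx/binding
       apiVersion: v1
       kind: Binding
       metadata:
         name: nginx              # the Pod being bound
       target:
         apiVersion: v1
         kind: Node
         name: node-1             # the node the scheduler picked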
  6. How does the scheduler place Pods? The Kubelet notices a change in the set of Pods that are bound to
     its node. It, in turn, runs the container via the container runtime (e.g. Docker).
  7. How does the scheduler place Pods? The Kubelet monitors the status of the Pod via the container
     runtime. As things change, the Kubelet reflects the current status back to the API server.
  8. Zooming in on the scheduler job ❶ Watch for pods that: • Are in the Pending phase
     • Have no Pod.Spec.NodeName assigned • Are explicitly requesting our scheduler (the default scheduler otherwise) ❶
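     For the third condition, a Pod opts into a non-default scheduler via spec.schedulerName; a minimal
     sketch (the scheduler name is hypothetical):
       apiVersion: v1
       kind: Pod
       metadata:
         name: custom-scheduled-pod
       spec:
         schedulerName: my-scheduler   # pods that omit this field get "default-scheduler"
         containers:
         - name: app
           image: nginx:1.15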
  9. Zooming in on the scheduler job ❷ Node selection algorithm (filter and rank):
     • PodFitsHostPorts • … • LeastRequestedPriority • … ❷
  10. The basic behavior for Scheduler • Scheduling: filtering, followed by ranking.
     • Filter => Predicate func. • Rank => Priority func. • For each pod:
     • Filter nodes with at least the required resources.
     • Assign the pod to the “best” node, where “best” means the highest priority score.
     • If multiple nodes share the same highest priority, choose one at random.
  11. The basic behavior for Scheduler [Diagram: six candidate hosts (Host 1–Host 6) enter the Predicate step.]
  12. The basic behavior for Scheduler [Diagram: the Predicate step filters the six hosts down to Host 2–Host 5, which then enter the Priority step.]
  13. The basic behavior for Scheduler [Diagram: after the Priority step ranks Host 2–Host 5, Host 3 is selected.]
  14. Filtering (Predicate functions) the nodes • The purpose of filtering the nodes is to filter out the
     nodes that do not meet certain requirements of the Pod. • Currently, there are several “predicates”
     implementing different filtering policies, including: • NoDiskConflict • PodFitsResources
     • PodFitsHostPorts • PodFitsHost • PodSelectorMatches • CheckNodeDiskPressure • NoVolumeZoneConflict
     • MatchNodeSelector • MaxEBSVolumeCount • MaxGCEPDVolumeCount • CheckNodeMemoryPressure
  15. Volume filters • Do the pod’s requested volumes’ zones fit the node’s zone?
     • Can the node attach to the volumes? • Are there conflicts with already-mounted volumes?
     • Are there additional volume topology constraints?
  16. Resource filters • Do the pod’s requested resources (CPU, RAM, GPU, etc.) fit the node’s available resources?
     • Can the pod’s requested ports be opened on the node?
     • Is the node free of memory and disk pressure?
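     A sketch of the fields these resource filters inspect: resources.requests drives PodFitsResources and
     hostPort drives PodFitsHostPorts (all values below are illustrative):
       apiVersion: v1
       kind: Pod
       metadata:
         name: resource-demo
       spec:
         containers:
         - name: app
           image: nginx:1.15
           resources:
             requests:
               cpu: "500m"        # the node must have at least this much unreserved CPU
               memory: "256Mi"    # and at least this much memory
           ports:
           - containerPort: 80
             hostPort: 8080       # the node must have host port 8080 free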
  17. Topology filters • Is the pod requested to run on this node?
     • Are there inter-pod affinity constraints? • Does the node match the pod’s node selector?
     • Can the pod tolerate the node’s taints?
  18. Ranking (Priority functions) the nodes • Kubernetes prioritizes the remaining nodes to find the
     “best” one for the Pod. The prioritization is performed by a set of priority functions.
     • For example, suppose there are two priority functions, priorityFunc1 and priorityFunc2, with
     weighting factors weight1 and weight2 respectively; the final score of some NodeA is:
     finalScoreNodeA = (weight1 * priorityFunc1) + (weight2 * priorityFunc2)
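     For instance, with illustrative numbers weight1 = 1, weight2 = 2 and scores priorityFunc1 = 5,
     priorityFunc2 = 10 for NodeA: finalScoreNodeA = (1 * 5) + (2 * 10) = 25.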
  19. Ranking (Priority functions) the nodes • Currently, the Scheduler provides some practical priority
     functions, including: • Least Requested Priority • Balanced Resource Allocation
     • Selector Spread Priority • Calculate Anti-Affinity Priority • Image Locality Priority • Node Affinity Priority
  20. Scheduling Scenarios - Resources • CPU, RAM, other (GPU) • Reserved resources
     • Requests and limits: Guaranteed / Burstable / Best-Effort
     Pod spec:
       resources:
         requests:
           nvidia.com/gpu: 1
         limits:
           nvidia.com/gpu: 1
     [Diagram: Node A and Node B with GPUs GPU0 and GPU1.]
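     The requests/limits combinations map onto the QoS classes above: requests equal to limits for every
     resource of every container gives Guaranteed, requests set but lower than limits (or only partially
     set) gives Burstable, and no requests or limits at all gives Best-Effort. A hedged sketch with made-up values:
       # Guaranteed: requests == limits for every resource of every container
       apiVersion: v1
       kind: Pod
       metadata:
         name: qos-guaranteed
       spec:
         containers:
         - name: app
           image: nginx:1.15
           resources:
             requests:
               cpu: "500m"
               memory: "256Mi"
             limits:
               cpu: "500m"
               memory: "256Mi"
       # Burstable: some requests/limits set, but not meeting the Guaranteed rule
       # Best-Effort: no requests and no limits on any container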
  21. Scheduling Scenarios - Constraints • Specify the Pod.spec.nodeName field value.
     • Labels and node selectors • Taints and tolerations
     Pod spec:
       nodeName: NodeA
     [Diagram: Node A and Node B with GPUs GPU0 and GPU1.]
  22. Scheduling Scenarios - Constraints • Specify the Pod.spec.nodeName field value.
     • Labels and node selectors • Taints and tolerations
     Pod (DaemonSet) spec:
       nodeSelector:
         backup: pornhub-data
     Node A / Node B / Node C metadata:
       labels:
         backup: pornhub-data
  23. Taints and Tolerations Taints and tolerations work together to ensure that pods are not scheduled
     onto inappropriate nodes. Node conditions: • Key: condition category. • Value: specific condition.
     • Operator: value matching, Equal or Exists. • Effect: • NoSchedule: filter at scheduling time.
     • PreferNoSchedule: prioritize at scheduling time. • NoExecute: filter at scheduling time, evict if
     already executing. • TolerationSeconds: time to tolerate a “NoExecute” taint.
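     As a small illustration of those fields (node name, key, and value are hypothetical), a taint can be
     applied with kubectl taint nodes node-1 error=disk:NoSchedule and then tolerated in a Pod spec:
       apiVersion: v1
       kind: Pod
       metadata:
         name: toleration-demo
       spec:
         tolerations:
         - key: "error"
           operator: "Equal"
           value: "disk"
           effect: "NoSchedule"   # this Pod may land on nodes carrying the error=disk:NoSchedule taint
         containers:
         - name: app
           image: nginx:1.15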
  24. Scheduling Scenarios - Constraints • Specify the Pod.spec.nodeName field value.
     • Labels and node selectors • Taints and tolerations
     Pod (DaemonSet) spec:
       tolerations:
       - key: error
         value: disk
         operator: Equal
         effect: NoExecute
         tolerationSeconds: 60
     Node A / Node C spec:
       taints:
       - effect: NoSchedule
         key: error
         value: disk
         timeAdded: null
     Node B spec:
       taints:
       - effect: NoSchedule
         key: error2
         value: disk
         timeAdded: null
  25. Affinity Kubernetes also has a more nuanced way of setting affinity, called nodeAffinity and
     podAffinity. These take automatic or user-defined metadata to dictate where to schedule pods.
  26. Scheduling Scenarios - Affinity • Node Affinity/Anti-Affinity (Anti: operator is NotIn)
     • Pod Affinity/Anti-Affinity
     Pod spec:
       affinity:
         nodeAffinity:
           requiredDuringSchedulingIgnoredDuringExecution:
             nodeSelectorTerms:
             - matchExpressions:
               - key: "kubernetes.io/zone"
                 operator: In
                 values: ["us-central1-a"]
           preferredDuringSchedulingIgnoredDuringExecution:
           - weight: 1
             preference:
               matchExpressions:
               - key: tier
                 operator: In
                 values:
                 - backend
     Node A metadata: labels: kubernetes.io/zone: us-central1-a, tier: backend
     Node B metadata: labels: kubernetes.io/zone: us-central2-a, tier: backend
     Node C metadata: labels: kubernetes.io/zone: us-central1-a, tier: frontend
  27. Scheduling Scenarios - Affinity (identical content to slide 26)
  28. Scheduling Scenarios - Affinity • Node Affinity/Anti-Affinity (Anti: operator is NotIn)
     • Pod Affinity/Anti-Affinity
     Pod (Replicas: 2) spec:
       affinity:
         podAffinity:
           requiredDuringSchedulingIgnoredDuringExecution:
           - labelSelector:
               matchExpressions:
               - key: security
                 operator: In
                 values:
                 - S1
             topologyKey: kubernetes.io/zone
     Node A (kubernetes.io/zone: us-central1-a) runs a Pod with labels security: S1
     Node B (kubernetes.io/zone: us-central1-2) runs a Pod with labels security: S2
     Node C (kubernetes.io/zone: us-central1-a) runs a Pod with labels security: S1
  29. Scheduling Scenarios - Affinity • Node Affinity/Anti-Affinity (Anti: operator is NotIn)
     • Pod Affinity/Anti-Affinity
     Pod (Replicas: 2) spec:
       affinity:
         podAntiAffinity:
           preferredDuringSchedulingIgnoredDuringExecution:
           - weight: 100
             podAffinityTerm:
               labelSelector:
                 matchExpressions:
                 - key: security
                   operator: In
                   values:
                   - S2
               topologyKey: kubernetes.io/hostname
     Node A (kubernetes.io/hostname: NodeA) runs a Pod with labels security: S1
     Node B (kubernetes.io/hostname: NodeB) runs a Pod with labels security: S2
     Node C (kubernetes.io/hostname: NodeC) runs a Pod with labels security: S1
  30. Extending the Scheduler with policies • You can change the default scheduler policy by specifying
     --policy-config-file to the kube-scheduler.
     • If you want to use a custom scheduler for your pod instead of the default kube-scheduler, specify spec.schedulerName.
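     A rough sketch of the shape of such a policy file (the classic examples in the Kubernetes docs of this
     era are JSON; the predicate and priority names below are just an illustrative subset):
       kind: Policy
       apiVersion: v1
       predicates:
       - name: PodFitsHostPorts
       - name: PodFitsResources
       - name: MatchNodeSelector
       priorities:
       - name: LeastRequestedPriority
         weight: 1
       - name: BalancedResourceAllocation
         weight: 1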
  31. Advanced Kubernetes Scheduling • Resource Quality of Service proposal.
     • Resource limits and oversubscription. • Admission control limit range proposal.
     • Pod Priority and Preemption.
  32. Experimental features - Pod Priority • Pod priority (alpha in 1.10+, present since 1.8+)
     • Preemption: evict less important pods (if needed) to fit important ones.
     • Scheduling priority (since 1.9) in the queue of Pending pods.
     • Out-of-resource eviction: if the node starts to run out of resources, it will evict less important pods first.
     • PriorityClassName: system-node-critical (ds, sts), system-cluster-critical (dp).
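     A minimal sketch of defining and using a priority class (names and the value are arbitrary; the API
     group/version was scheduling.k8s.io/v1alpha1 in the 1.8–1.10 alpha period and is scheduling.k8s.io/v1 today):
       apiVersion: scheduling.k8s.io/v1alpha1   # scheduling.k8s.io/v1 on current clusters
       kind: PriorityClass
       metadata:
         name: high-priority
       value: 1000000                           # higher value = more important
       globalDefault: false
       description: "Pods that may preempt less important ones."
       ---
       apiVersion: v1
       kind: Pod
       metadata:
         name: important-app
       spec:
         priorityClassName: high-priority
         containers:
         - name: app
           image: nginx:1.15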
  33. Experimental features - TaintBasedEvictions • TaintBasedEvictions (alpha)
     • NoExecute: representing node problems dynamically using taints.
     • tolerationSeconds: if your pod has “expensive” local state and there is a chance of recovery,
     you can tolerate the node failure for a while.
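     A sketch of that tolerationSeconds pattern: tolerate the not-ready taint for a bounded time before
     eviction (the taint key shown is the current one; older alpha releases used differently prefixed keys):
       apiVersion: v1
       kind: Pod
       metadata:
         name: stateful-worker        # hypothetical pod with “expensive” local state
       spec:
         tolerations:
         - key: "node.kubernetes.io/not-ready"
           operator: "Exists"
           effect: "NoExecute"
           tolerationSeconds: 300     # stay bound for 5 minutes after the node goes not-ready
         containers:
         - name: app
           image: nginx:1.15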
  34. References
     • https://kccnceu18.sched.com/event/Drn7/sig-scheduling-deep-dive-bobby-salamat-jonathan-basseri-google-intermediate-skill-level
     • https://blog.heptio.com/core-kubernetes-jazz-improv-over-orchestration-a7903ea92ca
     • https://github.com/kubernetes/community/blob/master/contributors/devel/scheduler_algorithm.md
     • http://releases.k8s.io/HEAD/pkg/scheduler/algorithm/predicates/predicates.go
     • https://github.com/kubernetes/kubernetes/tree/HEAD/pkg/scheduler/algorithm/priorities/
     • https://thenewstack.io/implementing-advanced-scheduling-techniques-with-kubernetes/
     • https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/#taint-based-evictions
     • https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/
     • https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/#taint-nodes-by-condition