
Kubernetes Scheduler 兩三事 (A Few Things About the Kubernetes Scheduler)

Kyle Bai

May 19, 2018

Transcript

  1. About Me 白凱仁 (Kyle Bai) • Interested in emerging technologies.
     • COSCUP, Kubernetes Day, and OpenStack Day speaker.
     • OpenStack and Kubernetes projects contributor (100+ PRs).
     • Certified Kubernetes Administrator. @kairen ([email protected]) https://kairen.github.io/
  2. Kubernetes Scheduler • The Kubernetes scheduler is a policy-rich, topology-aware, workload-specific
     function that significantly impacts availability, performance, and capacity.
     • The Kubernetes scheduler is in charge of scheduling pods onto nodes. Basically it works like this:
     • You create a pod. • The scheduler notices that the new pod you created doesn’t have a node assigned to it.
     • The scheduler assigns a node to the pod. P.S. It basically just needs to make sure every pod has a node assigned to it.
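     As a sketch of that flow: the Pod below is created with no spec.nodeName, so it sits in the Pending
     phase until the scheduler binds it to a node. The name and image are only illustrative.
       apiVersion: v1
       kind: Pod
       metadata:
         name: nginx              # hypothetical pod name
       spec:
         # spec.nodeName is deliberately absent; the scheduler fills it in
         containers:
         - name: nginx
           image: nginx:1.15
     Once scheduling has happened, kubectl get pod nginx -o jsonpath='{.spec.nodeName}' prints the chosen node.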
  3. How does the scheduler work? • The scheduler watches the Kubernetes API and performs iterative steps
     to converge the current cluster state toward the declared cluster model.
     • The scheduler keeps its cache updated by receiving events from the API server.
  4. How does the scheduler place Pods? The user creates a Pod via the API server, and the API server writes it to etcd.
  5. How does the scheduler place Pods? The scheduler notices an “unbound” Pod and decides which node to
     run that Pod on. It writes that binding back to the API server.
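     Concretely, that “binding” is a Binding object posted to the Pod’s binding subresource; a rough sketch
     with hypothetical pod and node names:
       # POSTed to /api/v1/namespaces/default/pods/nginx/binding
       apiVersion: v1
       kind: Binding
       metadata:
         name: nginx              # the Pod being bound
       target:
         apiVersion: v1
         kind: Node
         name: node-1             # the node the scheduler picked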
  6. How does the scheduler place Pods? The Kubelet notices a change in the set of Pods that are bound to
     its node. It, in turn, runs the container via the container runtime (e.g. Docker).
  7. How does the scheduler place Pods? The Kubelet monitors the status of the Pod via the container
     runtime. As things change, the Kubelet reflects the current status back to the API server.
  8. Zooming in on the scheduler job ❶ Watch for pods that: • Are in the Pending phase
     • Have no Pod.Spec.NodeName assigned • Are explicitly requesting our scheduler (the default scheduler otherwise) ❶
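     For the third condition, a Pod opts into a non-default scheduler via spec.schedulerName; a minimal
     sketch (the scheduler name is hypothetical):
       apiVersion: v1
       kind: Pod
       metadata:
         name: custom-scheduled-pod
       spec:
         schedulerName: my-scheduler   # pods that omit this field get "default-scheduler"
         containers:
         - name: app
           image: nginx:1.15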
  9. Zooming in on the scheduler job ❷ Node selection algorithm (filter and rank):
     • PodFitsHostPorts • … • LeastRequestedPriority • … ❷
  10. The basic behavior for Scheduler • Scheduling: filtering, followed by ranking.
     • Filter => Predicate func. • Rank => Priority func. • For each pod:
     • Filter nodes with at least the required resources.
     • Assign the pod to the “best” node, where “best” means the highest priority score.
     • If multiple nodes share the same highest priority, choose one at random.
  11. The basic behavior for Scheduler [Diagram: six candidate hosts (Host 1–Host 6) enter the Predicate step.]
  12. The basic behavior for Scheduler [Diagram: the Predicate step filters the six hosts down to Host 2–Host 5, which then enter the Priority step.]
  13. The basic behavior for Scheduler [Diagram: after the Priority step ranks Host 2–Host 5, Host 3 is selected.]
  14. Filtering (Predicate functions) the nodes • The purpose of filtering the nodes is to filter out the
     nodes that do not meet certain requirements of the Pod. • Currently, there are several “predicates”
     implementing different filtering policies, including: • NoDiskConflict • PodFitsResources
     • PodFitsHostPorts • PodFitsHost • PodSelectorMatches • CheckNodeDiskPressure • NoVolumeZoneConflict
     • MatchNodeSelector • MaxEBSVolumeCount • MaxGCEPDVolumeCount • CheckNodeMemoryPressure
  15. Volume filters • Do the pod’s requested volumes’ zones fit the node’s zone?
     • Can the node attach to the volumes? • Are there conflicts with already-mounted volumes?
     • Are there additional volume topology constraints?
  16. Resource filters • Do the pod’s requested resources (CPU, RAM, GPU, etc.) fit the node’s available resources?
     • Can the pod’s requested ports be opened on the node?
     • Is the node free of memory and disk pressure?
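     A sketch of the fields these resource filters inspect: resources.requests drives PodFitsResources and
     hostPort drives PodFitsHostPorts (all values below are illustrative):
       apiVersion: v1
       kind: Pod
       metadata:
         name: resource-demo
       spec:
         containers:
         - name: app
           image: nginx:1.15
           resources:
             requests:
               cpu: "500m"        # the node must have at least this much unreserved CPU
               memory: "256Mi"    # and at least this much memory
           ports:
           - containerPort: 80
             hostPort: 8080       # the node must have host port 8080 free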
  17. Topology filters • Is the pod requested to run on this node?
     • Are there inter-pod affinity constraints? • Does the node match the pod’s node selector?
     • Can the pod tolerate the node’s taints?
  18. Ranking (Priority functions) the nodes • Kubernetes prioritizes the remaining nodes to find the
     “best” one for the Pod. The prioritization is performed by a set of priority functions.
     • For example, suppose there are two priority functions, priorityFunc1 and priorityFunc2, with
     weighting factors weight1 and weight2 respectively; the final score of some NodeA is:
     finalScoreNodeA = (weight1 * priorityFunc1) + (weight2 * priorityFunc2)
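     For instance, with illustrative numbers weight1 = 1, weight2 = 2 and scores priorityFunc1 = 5,
     priorityFunc2 = 10 for NodeA: finalScoreNodeA = (1 * 5) + (2 * 10) = 25.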
  19. Ranking (Priority functions) the nodes • Currently, the Scheduler provides some practical priority
     functions, including: • Least Requested Priority • Balanced Resource Allocation
     • Selector Spread Priority • Calculate Anti-Affinity Priority • Image Locality Priority • Node Affinity Priority
  20. Scheduling Scenarios - Resources • CPU, RAM, other (GPU) • Reserved resources
     • Requests and limits: Guaranteed / Burstable / Best-Effort
     Pod spec:
       resources:
         requests:
           nvidia.com/gpu: 1
         limits:
           nvidia.com/gpu: 1
     [Diagram: Node A and Node B with GPUs GPU0 and GPU1.]
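     The requests/limits combinations map onto the QoS classes above: requests equal to limits for every
     resource of every container gives Guaranteed, requests set but lower than limits (or only partially
     set) gives Burstable, and no requests or limits at all gives Best-Effort. A hedged sketch with made-up values:
       # Guaranteed: requests == limits for every resource of every container
       apiVersion: v1
       kind: Pod
       metadata:
         name: qos-guaranteed
       spec:
         containers:
         - name: app
           image: nginx:1.15
           resources:
             requests:
               cpu: "500m"
               memory: "256Mi"
             limits:
               cpu: "500m"
               memory: "256Mi"
       # Burstable: some requests/limits set, but not meeting the Guaranteed rule
       # Best-Effort: no requests and no limits on any container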
  21. Scheduling Scenarios - Constraints • Specify the Pod.spec.nodeName field value.
     • Labels and node selectors • Taints and tolerations
     Pod spec:
       nodeName: NodeA
     [Diagram: Node A and Node B with GPUs GPU0 and GPU1.]
  22. Scheduling Scenarios - Constraints • Specify the Pod.spec.nodeName field value.
     • Labels and node selectors • Taints and tolerations
     Pod (DaemonSet) spec:
       nodeSelector:
         backup: pornhub-data
     Node A / Node B / Node C metadata:
       labels:
         backup: pornhub-data
  23. Taints and Tolerations Taints and tolerations work together to ensure that pods are not scheduled
     onto inappropriate nodes. Node conditions: • Key: condition category. • Value: specific condition.
     • Operator: value matching, Equal or Exists. • Effect: • NoSchedule: filter at scheduling time.
     • PreferNoSchedule: prioritize at scheduling time. • NoExecute: filter at scheduling time, evict if
     already executing. • TolerationSeconds: time to tolerate a “NoExecute” taint.
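     As a small illustration of those fields (node name, key, and value are hypothetical), a taint can be
     applied with kubectl taint nodes node-1 error=disk:NoSchedule and then tolerated in a Pod spec:
       apiVersion: v1
       kind: Pod
       metadata:
         name: toleration-demo
       spec:
         tolerations:
         - key: "error"
           operator: "Equal"
           value: "disk"
           effect: "NoSchedule"   # this Pod may land on nodes carrying the error=disk:NoSchedule taint
         containers:
         - name: app
           image: nginx:1.15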
  24. Scheduling Scenarios - Constraints • Specify the Pod.spec.nodeName field value.
     • Labels and node selectors • Taints and tolerations
     Pod (DaemonSet) spec:
       tolerations:
       - key: error
         value: disk
         operator: Equal
         effect: NoExecute
         tolerationSeconds: 60
     Node A / Node C spec:
       taints:
       - effect: NoSchedule
         key: error
         value: disk
         timeAdded: null
     Node B spec:
       taints:
       - effect: NoSchedule
         key: error2
         value: disk
         timeAdded: null
  25. Affinity Kubernetes also has a more nuanced way of setting affinity, called nodeAffinity and
     podAffinity. These take automatic or user-defined metadata to dictate where to schedule pods.
  26. Scheduling Scenarios - Affinity • Node Affinity/Anti-Affinity (Anti: operator is NotIn)
     • Pod Affinity/Anti-Affinity
     Pod spec:
       affinity:
         nodeAffinity:
           requiredDuringSchedulingIgnoredDuringExecution:
             nodeSelectorTerms:
             - matchExpressions:
               - key: "kubernetes.io/zone"
                 operator: In
                 values: ["us-central1-a"]
           preferredDuringSchedulingIgnoredDuringExecution:
           - weight: 1
             preference:
               matchExpressions:
               - key: tier
                 operator: In
                 values:
                 - backend
     Node A metadata: labels: kubernetes.io/zone: us-central1-a, tier: backend
     Node B metadata: labels: kubernetes.io/zone: us-central2-a, tier: backend
     Node C metadata: labels: kubernetes.io/zone: us-central1-a, tier: frontend
  27. Scheduling Scenarios - Affinity (identical content to slide 26)
  28. Scheduling Scenarios - Affinity • Node Affinity/Anti-Affinity (Anti: operator is NotIn)
     • Pod Affinity/Anti-Affinity
     Pod (Replicas: 2) spec:
       affinity:
         podAffinity:
           requiredDuringSchedulingIgnoredDuringExecution:
           - labelSelector:
               matchExpressions:
               - key: security
                 operator: In
                 values:
                 - S1
             topologyKey: kubernetes.io/zone
     Node A (kubernetes.io/zone: us-central1-a) runs a Pod with labels security: S1
     Node B (kubernetes.io/zone: us-central1-2) runs a Pod with labels security: S2
     Node C (kubernetes.io/zone: us-central1-a) runs a Pod with labels security: S1
  29. Scheduling Scenarios - Affinity • Node Affinity/Anti-Affinity (Anti: operator is NotIn)
     • Pod Affinity/Anti-Affinity
     Pod (Replicas: 2) spec:
       affinity:
         podAntiAffinity:
           preferredDuringSchedulingIgnoredDuringExecution:
           - weight: 100
             podAffinityTerm:
               labelSelector:
                 matchExpressions:
                 - key: security
                   operator: In
                   values:
                   - S2
               topologyKey: kubernetes.io/hostname
     Node A (kubernetes.io/hostname: NodeA) runs a Pod with labels security: S1
     Node B (kubernetes.io/hostname: NodeB) runs a Pod with labels security: S2
     Node C (kubernetes.io/hostname: NodeC) runs a Pod with labels security: S1
  30. Extending the Scheduler with policies • You can change the default scheduler policy by specifying
     --policy-config-file to the kube-scheduler.
     • If you want to use a custom scheduler for your pod instead of the default kube-scheduler, specify spec.schedulerName.
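     A rough sketch of the shape of such a policy file (the classic examples in the Kubernetes docs of this
     era are JSON; the predicate and priority names below are just an illustrative subset):
       kind: Policy
       apiVersion: v1
       predicates:
       - name: PodFitsHostPorts
       - name: PodFitsResources
       - name: MatchNodeSelector
       priorities:
       - name: LeastRequestedPriority
         weight: 1
       - name: BalancedResourceAllocation
         weight: 1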
  31. Advanced Kubernetes Scheduling • Resource Quality of Service proposal.
     • Resource limits and oversubscription. • Admission control limit range proposal.
     • Pod Priority and Preemption.
  32. Experimental features - Pod Priority • Pod priority (alpha in 1.10+, present since 1.8+)
     • Preemption: evict less important pods (if needed) to fit important ones.
     • Scheduling priority (since 1.9) in the queue of Pending pods.
     • Out-of-resource eviction: if the node starts to run out of resources, it will evict less important pods first.
     • PriorityClassName: system-node-critical (ds, sts), system-cluster-critical (dp).
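     A minimal sketch of defining and using a priority class (names and the value are arbitrary; the API
     group/version was scheduling.k8s.io/v1alpha1 in the 1.8–1.10 alpha period and is scheduling.k8s.io/v1 today):
       apiVersion: scheduling.k8s.io/v1alpha1   # scheduling.k8s.io/v1 on current clusters
       kind: PriorityClass
       metadata:
         name: high-priority
       value: 1000000                           # higher value = more important
       globalDefault: false
       description: "Pods that may preempt less important ones."
       ---
       apiVersion: v1
       kind: Pod
       metadata:
         name: important-app
       spec:
         priorityClassName: high-priority
         containers:
         - name: app
           image: nginx:1.15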
  33. Experimental features - TaintBasedEvictions • TaintBasedEvictions (alpha)
     • NoExecute: representing node problems dynamically using taints.
     • tolerationSeconds: if your pod has “expensive” local state and there is a chance of recovery,
     you can tolerate the node failure for a while.
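     A sketch of that tolerationSeconds pattern: tolerate the not-ready taint for a bounded time before
     eviction (the taint key shown is the current one; older alpha releases used differently prefixed keys):
       apiVersion: v1
       kind: Pod
       metadata:
         name: stateful-worker        # hypothetical pod with “expensive” local state
       spec:
         tolerations:
         - key: "node.kubernetes.io/not-ready"
           operator: "Exists"
           effect: "NoExecute"
           tolerationSeconds: 300     # stay bound for 5 minutes after the node goes not-ready
         containers:
         - name: app
           image: nginx:1.15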
  34. References
     • https://kccnceu18.sched.com/event/Drn7/sig-scheduling-deep-dive-bobby-salamat-jonathan-basseri-google-intermediate-skill-level
     • https://blog.heptio.com/core-kubernetes-jazz-improv-over-orchestration-a7903ea92ca
     • https://github.com/kubernetes/community/blob/master/contributors/devel/scheduler_algorithm.md
     • http://releases.k8s.io/HEAD/pkg/scheduler/algorithm/predicates/predicates.go
     • https://github.com/kubernetes/kubernetes/tree/HEAD/pkg/scheduler/algorithm/priorities/
     • https://thenewstack.io/implementing-advanced-scheduling-techniques-with-kubernetes/
     • https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/#taint-based-evictions
     • https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/
     • https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/#taint-nodes-by-condition