Slide 1

Slide 1 text

Pod Preemption That Even a Cat Can Understand
チェシャ猫 (@y_taka_23)
Kubernetes Meetup #10 (2018/03/08) #k8sjp

Slide 2

Slide 2 text

Contents
● Motivation
○ What is Preemption, and what does it give us?
● Algorithm overview
○ How does it work internally?
● Limits of the preemption decision
○ What happens when multiple conditions are combined?

Slide 3

Slide 3 text

Node 1 (Capacity = 10/30): Pod A (Req. = 10)
Node 2 (Capacity = 5/30): Pod B (Req. = 5)
Pod Queue: Pod X (Req. = 15), Pod Y (Req. = 20), Pod Z (Req. = 10)
Scheduler’s main loop

Slide 4

Slide 4 text

Node 1 (Capacity = 25/30): Pod A (Req. = 10), Pod X (Req. = 15)
Node 2 (Capacity = 5/30): Pod B (Req. = 5)
Pod Queue: Pod Y (Req. = 20), Pod Z (Req. = 10)
Scheduler’s main loop

Slide 5

Slide 5 text

Node 1 (Capacity = 25/30): Pod A (Req. = 10), Pod X (Req. = 15)
Node 2 (Capacity = 25/30): Pod B (Req. = 5), Pod Y (Req. = 20)
Pod Queue: Pod Z (Req. = 10)
Scheduler’s main loop

Slide 6

Slide 6 text

Node 1 (Capacity = 25/30): Pod A (Req. = 10), Pod X (Req. = 15)
Node 2 (Capacity = 25/30): Pod B (Req. = 5), Pod Y (Req. = 20)
Pod Queue: Pod Z (Req. = 10)
Scheduler’s main loop
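The situation in the preceding slides can be reproduced with a minimal sketch. It assumes a simplified first-fit scheduler (the real scheduler scores Nodes rather than taking the first match), and the `Pod`, `Node`, and `schedule` names are hypothetical, chosen only for this illustration. The numbers are the ones from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    req: int


@dataclass
class Node:
    name: str
    capacity: int
    pods: list = field(default_factory=list)

    def used(self):
        return sum(p.req for p in self.pods)

    def fits(self, pod):
        return self.used() + pod.req <= self.capacity


def schedule(queue, nodes):
    """Place each queued Pod on the first Node with room; return the Pods left pending."""
    pending = []
    for pod in queue:
        target = next((n for n in nodes if n.fits(pod)), None)
        if target is None:
            pending.append(pod)  # no Node has room, and nothing can be evicted
        else:
            target.pods.append(pod)
    return pending


node1 = Node("Node 1", 30, [Pod("A", 10)])
node2 = Node("Node 2", 30, [Pod("B", 5)])
pending = schedule([Pod("X", 15), Pod("Y", 20), Pod("Z", 10)], [node1, node2])
print(node1.used(), node2.used(), [p.name for p in pending])  # 25 25 ['Z']
```

Pod Z stays pending because each Node has only 5 units free and, without priorities, the scheduler has no basis for evicting anything.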

Slide 7

Slide 7 text

Adding Priority

Slide 8

Slide 8 text

Node 1 (Capacity = 25/30): Pod A (Req. = 10, Pri. = 20), Pod X (Req. = 15, Pri. = 10)
Node 2 (Capacity = 25/30): Pod B (Req. = 5, Pri. = 30), Pod Y (Req. = 20, Pri. = 20)
Pod Queue: Pod Z (Req. = 10, Pri. = 20)
Scheduler’s main loop

Slide 9

Slide 9 text

Node 1 (Capacity = 25/30): Pod A (Req. = 10, Pri. = 20), Pod X (Req. = 15, Pri. = 10)
Node 2 (Capacity = 25/30): Pod B (Req. = 5, Pri. = 30), Pod Y (Req. = 20, Pri. = 20)
Pod Queue: Pod Z (Req. = 10, Pri. = 20)
Scheduler’s main loop

Slide 10

Slide 10 text

Node 1 (Capacity = 10/30): Pod A (Req. = 10, Pri. = 20)
Node 2 (Capacity = 25/30): Pod B (Req. = 5, Pri. = 30), Pod Y (Req. = 20, Pri. = 20)
Pod Queue: Pod Z (Req. = 10, Pri. = 20)
Scheduler’s main loop

Slide 11

Slide 11 text

Node 1 (Capacity = 20/30): Pod A (Req. = 10, Pri. = 20), Pod Z (Req. = 10, Pri. = 20)
Node 2 (Capacity = 25/30): Pod B (Req. = 5, Pri. = 30), Pod Y (Req. = 20, Pri. = 20)
Pod Queue: (empty)
Scheduler’s main loop
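The priority example above can be walked through in code. This is a deliberately naive sketch, not the scheduler's actual algorithm: it assumes Pods are plain `(name, req, pri)` tuples, takes the first Node where evicting lower-priority Pods makes room, and evicts lowest priority first. The function name `preempt` is hypothetical.

```python
def preempt(pod, nodes):
    """Evict lower-priority Pods from the first Node where that makes room.

    Pods are (name, req, pri) tuples; `nodes` maps a Node name to (capacity, pods).
    Returns (node name, evicted Pod names), or None if no Node can help.
    """
    _, req, pri = pod
    for node_name, (capacity, pods) in nodes.items():
        used = sum(r for _, r, _ in pods)
        victims = []
        # Walk the Pods from lowest priority upward, evicting only while
        # room is still missing and the Pod ranks below the incoming one.
        for victim in sorted(pods, key=lambda p: p[2]):
            if used + req <= capacity or victim[2] >= pri:
                break
            victims.append(victim)
            used -= victim[1]
        if used + req <= capacity:
            for v in victims:
                pods.remove(v)
            pods.append(pod)
            return node_name, [v[0] for v in victims]
    return None


nodes = {
    "Node 1": (30, [("X", 15, 10), ("A", 10, 20)]),
    "Node 2": (30, [("Y", 20, 20), ("B", 5, 30)]),
}
result = preempt(("Z", 10, 20), nodes)
print(result)  # ('Node 1', ['X'])
```

After the call, Node 1 holds Pod A and Pod Z for a total of 20/30, which matches the final state shown in the slides.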

Slide 12

Slide 12 text

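Significance of Priority / Preemption
● Important Pods are not blocked
○ Previously there was no way to actively evict Pods
○ Especially welcome in environments where Nodes cannot scale
○ Even where Nodes can scale, they are slow to start up
● Costs become predictable
○ Without Priority, unlimited resources would be needed
○ Sudden spikes can also be handled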

Slide 13

Slide 13 text

Node 1 (Capacity = 30/30): Pod A (Req. = 15, Pri. = 10), Pod B (Req. = 10, Pri. = 10), Pod C (Req. = 5, Pri. = 5)
Node 2 (Capacity = 30/30): Pod D (Req. = 10, Pri. = 30), Pod E (Req. = 15, Pri. = 20), Pod F (Req. = 5, Pri. = 10)
Node 3 (Capacity = 30/30): Pod G (Req. = 10, Pri. = 30), Pod H (Req. = 20, Pri. = 10)
Preemptor: Pod X (Req. = 10, Pri. = 20)

Slide 14

Slide 14 text

Node selection happens in two stages

Slide 15

Slide 15 text

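Step 1. Filtering Nodes
● Select the Nodes where preemption could work
○ Assume for the moment that every Pod with lower priority than the Pod to be placed has been evicted
○ Can the target Pod be placed in that state?
● More than resource amounts is considered
○ NodeSelector, Taints, and similar settings
○ Node Affinity / Inter-Pod Affinity / Anti-Affinity settings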
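The filtering step can be sketched as follows, using the three Nodes and Pod X from the slides. This is a minimal illustration under stated assumptions: it checks resource requests only (NodeSelector, Taints, and affinity are left out), and `Pod`, `CAPACITY`, and `is_candidate` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    name: str
    req: int
    pri: int


CAPACITY = 30  # every Node in the slides has a capacity of 30


def is_candidate(pods, preemptor, capacity=CAPACITY):
    """Pretend all lower-priority Pods are gone, then check whether the preemptor fits."""
    remaining = [p for p in pods if p.pri >= preemptor.pri]
    return sum(p.req for p in remaining) + preemptor.req <= capacity


nodes = {
    "Node 1": [Pod("A", 15, 10), Pod("B", 10, 10), Pod("C", 5, 5)],
    "Node 2": [Pod("D", 10, 30), Pod("E", 15, 20), Pod("F", 5, 10)],
    "Node 3": [Pod("G", 10, 30), Pod("H", 20, 10)],
}
pod_x = Pod("X", 10, 20)
candidates = [name for name, pods in nodes.items() if is_candidate(pods, pod_x)]
print(candidates)  # ['Node 1', 'Node 3']
```

Node 2 drops out because Pod D and Pod E have priority at least equal to Pod X and already occupy 25 of 30, leaving too little room for a request of 10.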

Slide 16

Slide 16 text

Node 1 (Capacity = 30/30): Pod A (Req. = 15, Pri. = 10), Pod B (Req. = 10, Pri. = 10), Pod C (Req. = 5, Pri. = 5)
Node 2 (Capacity = 30/30): Pod D (Req. = 10, Pri. = 30), Pod E (Req. = 15, Pri. = 20), Pod F (Req. = 5, Pri. = 10)
Node 3 (Capacity = 30/30): Pod G (Req. = 10, Pri. = 30), Pod H (Req. = 20, Pri. = 10)
Preemptor: Pod X (Req. = 10, Pri. = 20)

Slide 17

Slide 17 text

Node 1 (Capacity = 0/30)
Preemptor: Pod X (Req. = 10, Pri. = 20)

Slide 18

Slide 18 text

Node 2 (Capacity = 25/30): Pod D (Req. = 10, Pri. = 30), Pod E (Req. = 15, Pri. = 20)
Preemptor: Pod X (Req. = 10, Pri. = 20)

Slide 19

Slide 19 text

Node 3 (Capacity = 10/30): Pod G (Req. = 10, Pri. = 30)
Preemptor: Pod X (Req. = 10, Pri. = 20)

Slide 20

Slide 20 text

Node 1 (Capacity = 30/30): Pod A (Req. = 15, Pri. = 10), Pod B (Req. = 10, Pri. = 10), Pod C (Req. = 5, Pri. = 5)
Node 2 (Capacity = 30/30): Pod D (Req. = 10, Pri. = 30), Pod E (Req. = 15, Pri. = 20), Pod F (Req. = 5, Pri. = 10)
Node 3 (Capacity = 30/30): Pod G (Req. = 10, Pri. = 30), Pod H (Req. = 20, Pri. = 10)
Preemptor: Pod X (Req. = 10, Pri. = 20)

Slide 21

Slide 21 text

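Step 2. Ranking Nodes
● Find the minimum set of Pods that must be evicted
○ Try putting Pods back as far as possible
● Choose the Node with the smallest penalty
○ Number of Pod Disruption Budget (PDB) violations
○ Highest Priority among the Pods that would violate a PDB
○ Sum of the Priorities of the evicted Pods
○ Number of evicted Pods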
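The ranking step can be sketched as a "put back what still fits" pass followed by a lexicographic comparison of the four criteria listed on the slide. Several things here are assumptions made for illustration: the order in which Pods are put back (higher priority first), the `violates_pdb` flag standing in for a real PDB check, and the hypothetical "Node 4", which does not appear in the slides. Node 3 and Pod X use the slide values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    name: str
    req: int
    pri: int
    violates_pdb: bool = False  # hypothetical flag; real PDB checks are more involved


def minimal_victims(pods, preemptor, capacity):
    """Evict every lower-priority Pod, then put back as many as still fit."""
    kept = [p for p in pods if p.pri >= preemptor.pri]
    used = sum(p.req for p in kept) + preemptor.req
    victims = []
    lower = [p for p in pods if p.pri < preemptor.pri]
    # Assumed order: try to put back higher-priority Pods first.
    for pod in sorted(lower, key=lambda p: -p.pri):
        if used + pod.req <= capacity:
            used += pod.req
        else:
            victims.append(pod)
    return victims


def penalty(victims):
    """The four criteria from the slide, compared in order (smaller is better)."""
    pdb = [v for v in victims if v.violates_pdb]
    return (
        len(pdb),
        max((v.pri for v in pdb), default=0),
        sum(v.pri for v in victims),
        len(victims),
    )


pod_x = Pod("X", 10, 20)
nodes = {
    "Node 3": [Pod("G", 10, 30), Pod("H", 20, 10)],
    "Node 4": [Pod("P", 20, 15), Pod("Q", 10, 5, violates_pdb=True)],  # hypothetical
}
victims = {name: minimal_victims(pods, pod_x, 30) for name, pods in nodes.items()}
best = min(victims, key=lambda name: penalty(victims[name]))
print(best, [v.name for v in victims[best]])  # Node 3 ['H']
```

Node 4 would evict a Pod with a lower priority sum, but it loses because the PDB violation count is compared first.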

Slide 22

Slide 22 text

Node 1 (Capacity = 15/30): Pod A (Req. = 15, Pri. = 10), Pod B (Req. = 10, Pri. = 10), Pod C (Req. = 5, Pri. = 5)
Node 2 (Capacity = 30/30): Pod D (Req. = 10, Pri. = 30), Pod E (Req. = 15, Pri. = 20), Pod F (Req. = 5, Pri. = 10)
Node 3 (Capacity = 10/30): Pod G (Req. = 10, Pri. = 30), Pod H (Req. = 20, Pri. = 10)
Preemptor: Pod X (Req. = 10, Pri. = 20)

Slide 23

Slide 23 text

Node 1 (Capacity = 30/30): Pod A (Req. = 15, Pri. = 10), Pod B (Req. = 10, Pri. = 10), Pod C (Req. = 5, Pri. = 5)
Node 2 (Capacity = 30/30): Pod D (Req. = 10, Pri. = 30), Pod E (Req. = 15, Pri. = 20), Pod F (Req. = 5, Pri. = 10)
Node 3 (Capacity = 20/30): Pod G (Req. = 10, Pri. = 30), Pod X (Req. = 10, Pri. = 20, Preemptor)

Slide 24

Slide 24 text

Is this algorithm perfect?

Slide 25

Slide 25 text

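Affinity + Preemption
● Pod Affinity
○ Requires that a specified other Pod is present
○ Example: place Redis on the same Node as the web app
● Affinity from a high-priority Pod to a low-priority Pod
○ Being low priority, the target Pod falls under "evict everything for the moment"
○ The Node drops out of the candidates at the filtering stage
○ Low priority → high priority is fine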
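The affinity pitfall can be made concrete with a small sketch. It assumes a drastically simplified affinity model in which a Pod names one other Pod that must share its Node (the `needs` field is hypothetical), and the Pods, priorities, and the capacity of 30 are made-up example values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    name: str
    req: int
    pri: int
    needs: str = ""  # hypothetical: name of a Pod that must run on the same Node


def fits(pods, pod, capacity):
    """Check both free resources and the simplified affinity requirement."""
    has_room = sum(p.req for p in pods) + pod.req <= capacity
    affinity_ok = not pod.needs or any(p.name == pod.needs for p in pods)
    return has_room and affinity_ok


def passes_filter(pods, preemptor, capacity):
    """Filtering step: drop all lower-priority Pods, then check the preemptor."""
    remaining = [p for p in pods if p.pri >= preemptor.pri]
    return fits(remaining, preemptor, capacity)


node = [Pod("web", 10, 10), Pod("batch", 20, 5)]
redis = Pod("redis", 10, 20, needs="web")

# Evicting only "batch" would satisfy both resources and affinity...
print(fits([p for p in node if p.name != "batch"], redis, 30))  # True
# ...but the filter also removes "web", so the Node is rejected.
print(passes_filter(node, redis, 30))  # False
```

If "web" had a priority at or above that of "redis", it would survive the simulated eviction and the Node would pass the filter, which is the low → high case described as safe.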

Slide 26

Slide 26 text

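Anti-Affinity + Preemption
● Pod Anti-Affinity
○ Requires that a specified other Pod is absent
○ Example: spread a DB cluster across AZs
● Anti-Affinity that spans Nodes
○ "Evict everything for the moment" applies only to the Node under consideration
○ Pods on other Nodes in the same AZ get in the way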
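The cross-Node case can be sketched in the same style. This illustration assumes a zone-wide anti-affinity rule of the form "no other Pod of the same app in this AZ", ignores resource requests entirely, and uses made-up Node names, zone names, and priorities. The `passes_filter` function is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    name: str
    pri: int
    app: str


def passes_filter(cluster, node, preemptor, zone_of):
    """Zone-wide anti-affinity check with eviction simulated on `node` only."""
    zone = zone_of[node]
    for other, pods in cluster.items():
        if zone_of[other] != zone:
            continue
        if other == node:
            # Lower-priority Pods are assumed gone, but only on the candidate Node.
            pods = [p for p in pods if p.pri >= preemptor.pri]
        if any(p.app == preemptor.app for p in pods):
            return False
    return True


zone_of = {"node-1": "az-a", "node-2": "az-a"}
cluster = {
    "node-1": [Pod("db-old-1", 5, "db")],
    "node-2": [Pod("db-old-2", 5, "db")],
}
new_db = Pod("db-new", 20, "db")
print(passes_filter(cluster, "node-1", new_db, zone_of))  # False
```

Both existing Pods have lower priority than the new one, yet the Node is rejected because the Pod on node-2 is never part of the simulated eviction.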

Slide 27

Slide 27 text

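Summary
● Resources are secured for important Pods first
○ Costs stay controllable and do not grow without bound
● Node filtering + ranking
○ First try evicting everything, then examine the result more closely
● Take care when combining with Affinity
○ There are cases where preemption should be possible but does not trigger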

Slide 28

Slide 28 text

Preempt the Preemption! Presented by チェシャ猫 (@y_taka_23)